command/script for finding *non*duplicate files?
Karl Auer
kauer at biplane.com.au
Mon Feb 8 11:02:12 UTC 2016
On Mon, 2016-02-08 at 10:14 +0000, Adam Funk wrote:
> I'm looking for a
> way to find files in one directory (& its subdirectories) that are
> *not* duplicated in another one.
> [...]
> Any ideas? (I tried googling 'find non-duplicate files' & things like
> that, but it just keeps telling me how to find duplicate ones.)
Think about it this way: The non-duplicate ones are the ones that are
left when you've removed the duplicate ones.
So 1) make a list of all the files in both directories
2) remove all the duplicates
3) the remainder are the non-duplicates
ls -c1 /that/directory/path > t1.txt
ls -c1 /this/directory/path > t2.txt
cat t1.txt t2.txt | sort | uniq -u
sort puts identical lines together; uniq -u discards every line that
appears more than once, leaving only the unique ones.
To make the comparison case-insensitive, pass the -i option to uniq as
well.
In fact, read "man uniq" and "man sort" :-)
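As a toy illustration of that step (throwaway names, not the real
directory listings), sort-then-uniq behaves like this:

```shell
# Two throwaway lists standing in for the directory listings:
printf 'a\nb\nc\n' > t1.txt
printf 'a\nc\nd\n' > t2.txt
# sort groups identical lines; uniq -u drops any line that repeats:
cat t1.txt t2.txt | sort | uniq -u
# b and d each occur only once, so they survive; a and c occur twice.
```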
For example I just made these two directories:
kauer at karl:~/temp$ ls -la this
total 800
drwxrwxr-x 2 kauer kauer 4096 Feb 8 21:38 .
drwxr-xr-x 13 kauer kauer 4096 Feb 8 21:39 ..
-rw-rw-r-- 1 kauer kauer 401920 Nov 10 12:26 sw.doc
-rw-rw-r-- 1 kauer kauer 93980 Nov 11 15:04 sw.odt
-rw-rw-r-- 1 kauer kauer 154960 Nov 10 12:27 sw_raw.txt
-rw-rw-r-- 1 kauer kauer 154335 Nov 10 10:31 sw.txt
kauer at karl:~/temp$ ls -la that
total 708
drwxrwxr-x 2 kauer kauer 4096 Feb 8 21:38 .
drwxr-xr-x 13 kauer kauer 4096 Feb 8 21:39 ..
-rw-rw-r-- 1 kauer kauer 401920 Nov 10 12:26 sw.doc
-rw-rw-r-- 1 kauer kauer 154960 Nov 10 12:27 sw_raw.txt
-rw-rw-r-- 1 kauer kauer 154335 Nov 10 10:31 sw.txt
Note that sw.odt is only in the "this" directory.
The above three commands do this:
kauer at karl:~/temp$ ls -c1 that > t1.txt
kauer at karl:~/temp$ ls -c1 this > t2.txt
kauer at karl:~/temp$ cat t1.txt t2.txt | sort | uniq -u
sw.odt
The same principle will work however you obtain the two lists - just
make sure that the lists don't include any path information, or every
file will be "unique". You can even use ssh to get a remote list.
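Since plain ls does not descend into subdirectories (and the original
question mentioned them), here is one sketch of building path-free
lists recursively with find; the file names and layout are throwaway
examples recreated so it runs standalone:

```shell
# Recreate the example layout, this time with a subdirectory:
mkdir -p this/sub that
touch that/sw.doc that/sw.txt this/sw.doc this/sw.txt this/sub/sw.odt
# find lists files recursively; sed strips everything up to the last
# slash, so only bare filenames reach the lists:
find that -type f | sed 's|.*/||' | sort > t1.txt
find this -type f | sed 's|.*/||' | sort > t2.txt
cat t1.txt t2.txt | sort | uniq -u
```

A remote list works the same way: pipe something like
ssh somehost 'find /remote/path -type f' (hostname is a placeholder)
through the same sed before sorting.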
It's a bit more work to figure out which directory each non-duplicate
is in; the simplest approach is to take each line of the output and
check both directories:
kauer at karl:~/temp$ for i in `cat t1.txt t2.txt | sort | uniq -u` ; do
{
  if [ -e "that/$i" ] ; then
    echo "$i is uniquely in that"
  else
    echo "$i is uniquely in this"
  fi
}
done
sw.odt is uniquely in this
I removed the ">" symbols so it didn't get flagged as quoted text. If
you have zillions of non-duplicate files you may run into command-line
length limits, in which case redirect the pipeline's output to a file
and process that (e.g. with xargs).
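A sketch of that file-based variant (the example directories are
recreated here so it runs standalone): saving the pipeline's output to
a file and reading it line by line avoids any command-line length
limit, and also keeps filenames with spaces intact:

```shell
# Recreate the two directories from the example:
mkdir -p this that
touch that/sw.doc that/sw_raw.txt that/sw.txt \
      this/sw.doc this/sw_raw.txt this/sw.txt this/sw.odt
ls -c1 that > t1.txt
ls -c1 this > t2.txt
# Save the unique names to a file, then read it one line at a time:
cat t1.txt t2.txt | sort | uniq -u > uniques.txt
while IFS= read -r i ; do
  if [ -e "that/$i" ] ; then
    echo "$i is uniquely in that"
  else
    echo "$i is uniquely in this"
  fi
done < uniques.txt
```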
Or you could look in the original lists instead (especially if one or
both directories are remote):
kauer at karl:~/temp$ for i in `cat t1.txt t2.txt | sort | uniq -u` ; do
{
  if grep -qxF -- "$i" t1.txt ; then
    echo "$i is uniquely in that"
  else
    echo "$i is uniquely in this"
  fi
}
done
sw.odt is uniquely in this
Again, I removed the ">" symbols so it didn't get flagged as quoted
text.
Regards, K.
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer at biplane.com.au)
http://www.biplane.com.au/kauer
http://twitter.com/kauer389
GPG fingerprint: E00D 64ED 9C6A 8605 21E0 0ED0 EE64 2BEE CBCB C38B
Old fingerprint: 3C41 82BE A9E7 99A1 B931 5AE7 7638 0147 2C3C 2AC4