command/script for finding *non*duplicate files?

Karl Auer kauer at biplane.com.au
Mon Feb 8 11:02:12 UTC 2016


On Mon, 2016-02-08 at 10:14 +0000, Adam Funk wrote:
> I'm looking for a
> way to find files in one directory (& its subdirectories) that are
> *not* duplicated in another one.
> [...]
> Any ideas?  (I tried googling 'find non-duplicate files' & things like
> that, but it just keeps telling me how to find duplicate ones.)

Think about it this way: The non-duplicate ones are the ones that are
left when you've removed the duplicate ones.

So 1) make a list of all the files in both directories
   2) remove all the duplicates
   3) the remainder are the non-duplicates

ls -c1 /that/directory/path > t1.txt
ls -c1 /this/directory/path > t2.txt
cat t1.txt t2.txt | sort | uniq -u

sort puts identical lines together; uniq -u throws away all the
duplicated lines, leaving only the unique ones.

To make the comparison case-insensitive, use the -f option to sort and
the -i option to uniq. In fact, read "man uniq" and "man sort" :-)
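
A case-insensitive version of the pipeline would then be (a sketch:
sort -f folds case, so names differing only in case end up adjacent
for uniq -i to compare):

cat t1.txt t2.txt | sort -f | uniq -iu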

For example I just made these two directories:

kauer@karl:~/temp$ ls -la this
total 800
drwxrwxr-x  2 kauer kauer   4096 Feb  8 21:38 .
drwxr-xr-x 13 kauer kauer   4096 Feb  8 21:39 ..
-rw-rw-r--  1 kauer kauer 401920 Nov 10 12:26 sw.doc
-rw-rw-r--  1 kauer kauer  93980 Nov 11 15:04 sw.odt
-rw-rw-r--  1 kauer kauer 154960 Nov 10 12:27 sw_raw.txt
-rw-rw-r--  1 kauer kauer 154335 Nov 10 10:31 sw.txt
kauer@karl:~/temp$ ls -la that
total 708
drwxrwxr-x  2 kauer kauer   4096 Feb  8 21:38 .
drwxr-xr-x 13 kauer kauer   4096 Feb  8 21:39 ..
-rw-rw-r--  1 kauer kauer 401920 Nov 10 12:26 sw.doc
-rw-rw-r--  1 kauer kauer 154960 Nov 10 12:27 sw_raw.txt
-rw-rw-r--  1 kauer kauer 154335 Nov 10 10:31 sw.txt

Note that sw.odt is only in the "this" directory.

The above three commands do this:
kauer@karl:~/temp$ ls -c1 that > t1.txt
kauer@karl:~/temp$ ls -c1 this > t2.txt
kauer@karl:~/temp$ cat t1.txt t2.txt | sort | uniq -u
sw.odt

The same principle will work however you obtain the two lists - just
make sure that the lists don't include any path information, or every
file will be "unique". You can even use ssh to get a remote list.
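
If you need to include subdirectories (as in the original question),
find can produce the same kind of list. A sketch, assuming GNU find
(for -printf) and a made-up remote host name:

find /this/directory/path -type f -printf '%f\n' > t2.txt
ssh user@remotehost "find /that/directory/path -type f -printf '%f\n'" > t1.txt
cat t1.txt t2.txt | sort | uniq -u

The %f format prints just the basename, so no path information sneaks
in. One caveat: with recursion, two same-named files within the *same*
tree will also cancel each other out as "duplicates".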

It's a bit more work to figure out which directory the non-duplicate is
in; simplest to look at each line in the output and check both
directories:

kauer@karl:~/temp$ for i in `cat t1.txt t2.txt | sort | uniq -u` ; do
{
   if [ -e "that/$i" ] ; then
      echo "$i is uniquely in that"
   else
      echo "$i is uniquely in this"
   fi
}
done
sw.odt is uniquely in this

I removed the ">" symbols so it didn't get flagged as quoted text. If
you have zillions of non-duplicate files you may run into command-line
length limits, in which case redirect the pipeline's output to a file
and use xargs - or use a while read loop, as sketched below.
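
A sketch of the same loop written with while read instead of a
backtick expansion; it sidesteps command-line length limits and copes
with names containing spaces:

cat t1.txt t2.txt | sort | uniq -u | while IFS= read -r i ; do
   if [ -e "that/$i" ] ; then
      echo "$i is uniquely in that"
   else
      echo "$i is uniquely in this"
   fi
done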

Or you could look in the original lists instead (especially if one or
both directories are remote). Note grep's -F and -x options: -F treats
the name as a literal string rather than a regular expression, and -x
requires the whole line to match:

kauer@karl:~/temp$ for i in `cat t1.txt t2.txt | sort | uniq -u` ; do
{
   if grep -qxF "$i" t1.txt ; then
      echo "$i is uniquely in that"
   else
      echo "$i is uniquely in this"
   fi
}
done
sw.odt is uniquely in this

Again, I removed the ">" symbols so it didn't get flagged as quoted
text. 
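
Alternatively, comm(1) can do the whole job in one step on sorted
lists and tell you which side each unique name came from (this uses
bash's <(...) process substitution):

comm -3 <(sort t1.txt) <(sort t2.txt)

Names only in t1.txt land in column one, names only in t2.txt in
column two (indented with a tab); -3 suppresses the common lines.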

Regards, K.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer at biplane.com.au)
http://www.biplane.com.au/kauer
http://twitter.com/kauer389

GPG fingerprint: E00D 64ED 9C6A 8605 21E0 0ED0 EE64 2BEE CBCB C38B
Old fingerprint: 3C41 82BE A9E7 99A1 B931 5AE7 7638 0147 2C3C 2AC4





