finding duplicate files in backups

Amedee Van Gasse amedee-ubuntu at amedee.be
Mon Jun 6 23:39:29 UTC 2011


On Mon, June 6, 2011 20:32, Abhishek Dixit wrote:
> Hi,
> I have a 1 TB USB drive used as a backup drive. For various reasons the
> same files exist on different filesystems and have been backed up
> multiple times to this 1 TB hard disk.
> I want to keep only a single copy of those files. The problem is that
> these files are spread across different filesystems (multiple
> partitions) and are present here and there at various locations which
> I do not remember.
> I want to achieve the following:
>
> 1) Reduce n occurrences of the same file at different locations to
> 1 occurrence.
>
> 2) Since I do not know the names of the files which have multiple
> occurrences, how can I easily find them?
>
> 3) Is there a way to create an index of the files and directories
> present on my laptop? For example, a book has an index page which
> tells you on which page each topic is found; I want something
> similar for my files.
>
> What is an easy way to achieve the above?

I'm not sure if I understand your question, but if this is your situation:

* a file 'foo' can exist on various filesystems, for example /dev/sda1
(ext3), /dev/sda2 (ext4), and /dev/sda3 (xfs)
* your 1 TB backup drive has a single filesystem, preferably ext2/3/4
* you back up /dev/sda1, /dev/sda2, and /dev/sda3 to your 1 TB backup
drive /dev/sdb1

If this describes your situation, then you should take a look at
hardlink.py. If you don't know what a hard link is, I suggest you read
a bit about it; it really sounds like something that could be useful
for you.
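
For illustration, this is all a hard link is: two directory entries
pointing at the same inode (the inode number and file details below are
made up, but the commands are standard):

  $ echo hello > foo
  $ ln foo bar               # create a second name for the same data
  $ ls -li foo bar
  1234567 -rw-r--r-- 2 amedee amedee 6 Jun  6 23:00 bar
  1234567 -rw-r--r-- 2 amedee amedee 6 Jun  6 23:00 foo

Same inode, so the data exists on disk only once; hardlink.py does
essentially this for files it finds to be identical. Keep in mind that
hard links only work within one filesystem, which is why it matters
that the backup drive is a single filesystem.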

I suggest that you schedule hardlink.py to run after every backup.
Learn how it works and play with it for a while before you run it on
production data.
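
As a sketch of what that could look like (the rsync commands and paths
are placeholders for whatever your backup actually does, and I'm
assuming hardlink.py takes the tree to deduplicate as its argument;
check its --help first):

  #!/bin/sh
  # back up each source filesystem, then collapse duplicates
  rsync -a /mnt/sda1/ /media/backup/sda1/
  rsync -a /mnt/sda2/ /media/backup/sda2/
  rsync -a /mnt/sda3/ /media/backup/sda3/
  hardlink.py /media/backup

Put that script in cron (e.g. "0 3 * * * /usr/local/bin/backup.sh") and
it runs nightly. If you first want to see which duplicates you have
(your question 2), a checksum pass works regardless of filesystem:

  find /media/backup -type f -exec md5sum {} + | sort | uniq -w32 -D

That prints every file whose MD5 sum occurs more than once, since the
first 32 characters of each md5sum output line are the checksum itself.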
