Scripting question -- duplicate record problem

Patton Echols p.echols at comcast.net
Fri Dec 10 00:05:12 UTC 2010


I am working on a script to clean up the export of a proprietary db
for Gmail contacts import.  The only way to get info out of the db is to
export to a flat file.

I have been able to get the export to be only the fields I want, but 
there are problems with the data. 

I am extracting only the records with email addresses by using the 
following:
gawk -F, '{ if ( match($4, "@") ) print };' gmail-export.txt > gmail-export.csv

Some of the records are duplicates, though not identical ones.
Most of them arise where a duplicate was created with different
formatting of the phone numbers.  E.g. (123) 456-7890 is different from
1234567890.
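
To illustrate the mismatch: the two forms only compare equal once
everything except the digits is stripped.  A rough sketch, where
normalize_phone is a hypothetical helper name used only for
illustration:

```shell
# Hypothetical helper: strip every non-digit character from a phone
# number so that differently formatted copies compare equal.
normalize_phone() {
    # gsub() with two arguments operates on the whole input line ($0).
    printf '%s\n' "$1" | awk '{ gsub(/[^0-9]/, ""); print }'
}

normalize_phone "(123) 456-7890"
normalize_phone "1234567890"
```

Both calls should print the same digit string, which shows why a plain
whole-line comparison does not catch these duplicates.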

Can anyone suggest a way to have the script throw out the second record 
based only on two fields (First name and last name)?
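
One possible sketch of that kind of filter, under stated assumptions:
the export above only establishes that the email address is in field 4,
so the positions of first name ($1) and last name ($2) here are guesses,
and dedupe_by_name and the sample records are made up for illustration.
It also splits naively on commas, so it would misbehave if any field
contains a quoted comma.

```shell
# Keep a record only if field 4 contains "@" AND its first/last name
# pair has not been seen before.
# ASSUMPTION: $1 = first name, $2 = last name (not confirmed by the post).
# seen[] counts each name pair; the post-increment yields 0 the first
# time, so !seen[...]++ is true only for the first occurrence.
dedupe_by_name() {
    awk -F, '$4 ~ /@/ && !seen[$1 FS $2]++' "$@"
}

# Made-up sample records: two John Smith rows differ only in phone format,
# and Jane Doe has no email address.
printf '%s\n' \
    'John,Smith,(123) 456-7890,john@example.com' \
    'John,Smith,1234567890,john@example.com' \
    'Jane,Doe,555-0100,' \
    'Ann,Lee,5550111,ann@example.com' |
    dedupe_by_name
```

With a real file this would be run as
dedupe_by_name gmail-export.txt > gmail-export.csv, which folds the
email filter and the duplicate check into one pass.  The same awk
program should work unchanged under gawk.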

I confess I am stuck thinking of this as an awk problem and my 
creativity is shot!
Thanks for any thoughts.


--PE



