scripting / data problem.

Patton Echols p.echols at comcast.net
Sun Nov 8 01:31:25 UTC 2009


I have a data file problem that I hope I can get an assist on.  Not an 
"Ubuntu" question really, but this is about the smartest group I know to 
ask!

Note:  I'm not asking anyone to do my homework.  In fact, I'd be happy 
with the RTFM answer, as long as you could tell me which FM and why you 
say so.  Unfortunately, it takes a bit to explain the problem, so this 
is kind of long.

Here is the issue:

I have a number of different, partially overlapping contact lists, in 
the form of (2) MS Outlook contact databases, (1) MS Access database, 
(1) QuickBooks file and (1) SQL db.

Though we don't have access to the SQL db yet, we currently think that 
CSV files are the common denominator that everything can import and 
export.

The problem is that the various lists don't all have the same 
information, so I can't just cat them together and sort with a "unique" 
operator.  That's a vague statement.  Here is what I mean, by file and field:

file 1 - | email |
file 2 - | email | f-name | l-name |
file 3 - | f-name | l-name | company | phone |
file 4 - | company | address | city | state |

(By the way, each one has some duplicates with less than complete info.)

What we need is to be able to populate all of these files with all the 
available data for each contact.  The concept I have in mind is to 
compare the files in pairs.  So file 1 would go into file-L (for Last).  
Then file 2 would be compared with each line of file-L, and if there is 
a record with a matching email, any extra info is added, but without 
overwriting anything that is already in any field.  So the final file 
would look like this:

| file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company 
| address | city | state | phone |

The first four fields would have an "x" or something that could be 
subsequently used to parse them back to the originals.
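For what it's worth, that fill-but-never-overwrite merge can be sketched in a few lines of Python with the standard csv module.  This is just a sketch under assumptions: the field and flag names are hypothetical, each file is assumed to have a header row, and it keys on email (files without an email column, like files 3 and 4 above, would need a different key such as name or company):

```python
import csv
from io import StringIO

# Fields of the final merged file; the first four are source flags.
SOURCES = ["file-1", "file-2", "file-3", "file-4"]
FIELDS = SOURCES + ["email", "f-name", "l-name", "company",
                    "address", "city", "state", "phone"]

def merge_csvs(csv_texts, key="email"):
    """Merge CSV texts (each with a header row) into one record per key.

    Existing values are never overwritten; empty fields are filled in
    from later files, and each record is flagged with its sources.
    """
    merged = {}  # key value -> merged record (a dict)
    for source, text in zip(SOURCES, csv_texts):
        for row in csv.DictReader(StringIO(text)):
            k = row.get(key, "").strip().lower()
            if not k:
                continue  # skip rows that have nothing to match on
            rec = merged.setdefault(k, {f: "" for f in FIELDS})
            rec[source] = "x"  # flag, so it can be parsed back out later
            for field, value in row.items():
                if value and not rec.get(field):
                    rec[field] = value  # fill, but never overwrite
    return list(merged.values())
```

Feeding it file 1 and file 2 from above would yield one record per email address, flagged "x" under both file-1 and file-2 where the address appeared in both.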

I have thought about scripting something that would, "for each line in 
file 1, use the fields in the line as variables, feed them to awk, 
which tests file 2 and returns results to the script to be added to file-L".

I have also thought that awk could do the whole thing, but I'd hate to 
plow through the entire user guide only to be told it was hopeless. 

I also wondered whether I could create a MySQL database with this info, 
and whether there are commands that can merge records (rather than just 
eliminating what looks like a duplicate). 
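On the database idea: the "fill but don't overwrite" rule maps nicely onto SQL's COALESCE, which keeps an existing value and only fills in NULLs.  Here is a minimal sketch using Python's built-in sqlite3 rather than MySQL (the column names are hypothetical; MySQL's syntax differs slightly, e.g. it has INSERT ... ON DUPLICATE KEY UPDATE, but the COALESCE trick is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contacts (
    email   TEXT PRIMARY KEY,
    f_name  TEXT,
    l_name  TEXT,
    company TEXT,
    phone   TEXT)""")

def upsert(email, f_name=None, l_name=None, company=None, phone=None):
    """Insert a contact, or fill in the NULL fields of an existing one.

    COALESCE(old, new) returns the old value whenever it is already
    set, so nothing in the table is ever overwritten.
    """
    conn.execute("INSERT OR IGNORE INTO contacts (email) VALUES (?)",
                 (email,))
    conn.execute("""UPDATE contacts SET
            f_name  = COALESCE(f_name,  ?),
            l_name  = COALESCE(l_name,  ?),
            company = COALESCE(company, ?),
            phone   = COALESCE(phone,   ?)
        WHERE email = ?""", (f_name, l_name, company, phone, email))
```

Calling upsert() once per CSV row would merge the lists as they are imported, with records matching on email rather than being flagged as duplicates and dropped.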

Or perhaps there is another strategy???

Obviously I have a lot of learning to do, but I am wondering what to learn. 

Thanks very much for all of you who have read this far.  I'd appreciate 
any insight you may have into the best approach. 

--PE
