scripting / data problem.

Mon Nov 9 21:13:54 UTC 2009

On 11/09/2009 03:31 AM, Hal Burgiss wrote:
> On Sun, Nov 08, 2009 at 11:41:34PM -0800, Patton Echols wrote:
>   
>>>   
>>>       
>>>> file 1 - | email |
>>>> file 2 - | email | f-name | l-name |
>>>> file 3 - | f-name | l-name | company | phone |
>>>> file 4 - | company | address | city | state |
>>>>
>>>> (By the way, each one has some duplicates with less than complete info.)
>>>>
>>>> What we need is to be able to populate all of these files with all the 
>>>> available data for each contact.  The concept I have in mind is to 
>>>> compare the files in pairs.  So file 1 would go into file-L  (for Last)  
>>>> Then file 1 would be compared with each line of file-L and if there is a 
>>>> record with matching email, then any extra info is added , but not 
>>>> overwriting anything that is already in any field  So the final file 
>>>> would look like this:
>>>>         
>
> [...]
>
>   
>>> I'll vote for the mysql approach. Create a csv for each file, import that into
>>> mysql so you have 4 tables. Then create a query  that will dump out the data
>>> the way you want. You can dump that into another table or export it. For this
>>> to work, the fields will have to match in each table. So Comany in file3 is
>>> equivalent to Comany in file4 (not something like Acme, Inc vs Acme,
>>> Incorporated). 
>>>   
>>>       
>> Do you mean that all the field data must match?  Or just, for example, 
>> the email fields?    Because part of the problem is that I have non 
>> matching / missing data depending on file.
>>     
>
> In order for the relations to work across all 4 files, the fields that are
> being keyed on, *should* match exactly. You can get around the missing data by
> using a 'left join' syntax. 
>   

Ok, I'll read up on that syntax.  Since the point of the exercise is to 
merge records where there is overlap AND no conflicting data.

>  
>   
>>> SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a,
>>> file2 b, file3 c, file4 d WHERE a.email = b.email and
>>> concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company
>>> = d.company INTO OUTFILE '/tmp/output.txt'   (crude and not tested).
>>>
>>> Data manipulation is much more flexible once its in a database. At least when
>>> you have something this squirrelly.
>>>       
>
>   
>> Does the concat commands mean that there would no longer be a separate 
>> first name and last name?
>>     
>
> No, that's just part of the query that contstructs a relationship across all
> four tables. The output part is the 'b.firstname'. The concat thing is just
> one way to make sure the columns you are trying to match, actually do match.
> This will be problematic if one table has 'John Doe' and the other has
> 'Johnathon Q. Doe, III' type of stuff. 
>   

My current sense of the data is that the missing info is the basic 
problem, not conflicting info.  What may need to happen is that we merge 
the missing info and leave "duplicate" records to be dealt with by hand 
where "John" has his name spelled different ways.

>  
>   
>> Also, it looks like the "=" is a requirement.  Is that what you meant by 
>> the fields having to be the same?  Or could I arrange for it to have 
>> file 1 overwrite file 2 for certain fields.  Or only blank fields?
>>     
>
> An "equivalency" is a lot cleaner for the query part anyway (where you are
> matching data in one table against another). But you can do anything you want
> with the output part. So, instead of using b.firstname, you could use
> c.firstname instead. They would both be in their respective tables (a, b, c,
> and d). 
>
>
>   

Ok, so the output is the part between the SELECT and FROM statements? 

So much to learn, so little time! (kidding, this is really helpful)

Thanks.