Scripting Question

Patton Echols p.echols at comcast.net
Fri Feb 20 08:11:30 UTC 2009


On 02/14/2009 01:36 AM, Matthew Flaschen wrote:
> H.S. wrote:
>   
>> H.S. wrote:
>>     
>>> Patton Echols wrote:
>>>       
>>>> I have a fairly massive flat file, comma delimited, that I want to
>>>> extract info from.  Specifically, I want to extract the first and last
>>>> name and email addresses for those who have them to a new file with just
>>>> that info.  (The Windows database program that this comes from simply
>>>> will not do it.)  I can grep the file for the @ symbol to at least
>>>> exclude the lines without an email address (or the @ symbol in the notes
>>>> field).  But if I can figure this out, I can also adapt what I learn for
>>>> the next time.  Can anyone point me in the right direction for my "light
>>>> reading?"
>>>>
>>>> By the way, I used 'head' to get the first line, with the field names.
>>>> This is the first of about 2300 records, which is the reason not to do
>>>> it by hand.
>>>>
>>>> patton at laptop:~$ head -1 contacts.txt
>>>> "Business Title","First Name","Middle Name","Last Name","","Business 
>>>> Company Name","","Business Title","Business Street 1","Business Street 
>>>> 2","Business Street 3","Business City","Business State","Business 
>>>> Zip","Business Country","Home Street 1","Home Street 2","Home Street 
>>>> 3","Home City","Home State","Home Zip","Home Country","Other Street 
>>>> 1","Other Street 2","Other Street 3","Other City","Other State","Other 
>>>> Zip","Other Country","Assistant Phone","Business Fax Number","Business 
>>>> Phone","Business 2 Phone","","Car Phone","","Home Fax Number","Home 
>>>> Phone","Home 2 Phone","ISDN Phone","Mobile Phone","Other Fax 
>>>> Number","Other Phone","Pager 
>>>> Phone","","","","","","","","","","","","","Business Email","","Home 
>>>> Email","","Other 
>>>> Email","","","","","","","","","","","","Notes","","","","","","","","","","","","","","Business 
>>>> Web Page"
>>>>
>>> Here is one crude method. Assume that the above long single line is in a
>>> file called test.db. Then the following bash command will output the
>>> Business Email from that file (this is one long command):
>>> $> cat test.db  | sed -e 's/\(.*Business Email\"\),"\(.*\)/\2/g' | awk
>>> 'BEGIN { FS = "\"" } ; {print $1}'
>>>
>>> Similarly, the following gives the First name, Middle name and the Last
>>> name.
>>> $> cat test.db  | sed -e 's/\(^"Business Title\"\),"\(.*\)/\2/g' | awk
>>> 'BEGIN { FS = "," } ; {print $1, $2, $3}'  | tr -d '"'
>>>
>>> Now, you can run this command on each line of your actual database file
>>> (using the bash while and read commands) and you should get the business
>>> email address and the names. If there is no email address, the output
>>> will be blank.
>>>
>>> Here is an untested set of commands to read each line from a file
>>> (full.db) to generate names and email:
>>> $> cat full.db | while read line; do
>>>     echo "${line}" | sed -e 's/\(^"Business Title\"\),"\(.*\)/\2/g' |
>>> awk 'BEGIN { FS = "," } ; {print $1, $2, $3}'  | tr -d '"';
>>>     echo "${line}" |  sed -e 's/\(.*Business Email\"\),"\(.*\)/\2/g' |
>>> awk 'BEGIN { FS = "\"" } ; {print $1}'
>>> done
>>>
>>> But note that this is really a crude method. I am sure others can
>>> suggest more elegant ways to accomplish this. The above method will at
>>> least get you started.
>>>
>>> Warm regards.
>>>
>>>       
>> More concise (given the order of data fields is constant) and probably
>> more efficient and better (the following is one long line):
>>
>> #---------------------------------------------
>> $> cat full.db | while read line; do echo "${line}" |awk 'BEGIN { FS =
>> "," }; {print $2, $3, $4,  $58}' | tr -d '"'; done
>>     
>
> There are a few issues.  There's no need for cat, read line...done, tr,
> or echo (shell scripting is slow, especially when you fork multiple
> processes for every line).  This didn't handle all the emails and that's
> the wrong field number.  And it doesn't output in CSV format.  Finally,
> the above prints every line, not only those with emails.  So I get:
>
> gawk -F, '{ if ( match($57$59$61, "@") ) print
> $2","$4","$57","$59","$61};' contacts.txt>processed_contacts.txt
>
> That's all one line.
>
> Matt Flaschen
>
Thanks to everyone who responded to this.  I really didn't plan to ask
the question and then drop off the face of the earth for a week, but
life happened.  Matt's solution worked like a charm, so I responded to
this one, but I learned something from all of the discussion and I
appreciate it.

As an aside, I manually cleaned out the few duplicate lines in the
result.  I am going to read 'man gawk' to see if I can figure out how
to clean duplicates automatically.
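
A minimal sketch of one way to do that (assuming the goal is simply to
drop repeated lines from the gawk output above; "deduped_contacts.txt" is
only a placeholder name):

gawk '!seen[$0]++' processed_contacts.txt > deduped_contacts.txt

The pattern '!seen[$0]++' is true only the first time a given line ($0) is
seen, so gawk's default action prints each unique line once while keeping
the original order.  If order does not matter, 'sort -u
processed_contacts.txt' does the same job.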



