Scripting Question
Patton Echols
p.echols at comcast.net
Fri Feb 20 08:11:30 UTC 2009
On 02/14/2009 01:36 AM, Matthew Flaschen wrote:
> H.S. wrote:
>
>> H.S. wrote:
>>
>>> Patton Echols wrote:
>>>
>>>> I have a fairly massive flat file, comma delimited, that I want to
>>>> extract info from. Specifically, I want to extract the first and last
>>>> name and email addresses for those who have them to a new file with just
>>>> that info. (The windows database program that this comes from simply
>>>> will not do it) I can grep the file for the @ symbol to at least
>>>> exclude the lines without an email address (or the @ symbol in the notes
>>>> field) But if I can figure this out, I can also adapt what I learn for
>>>> the next time. Can anyone point me in the right direction for my "light
>>>> reading?"
>>>>
>>>> By the way, I used 'head' to get the first line, with the field names.
>>>> This is the first of about 2300 records, the reason not to do it by hand.
>>>>
>>>> patton at laptop:~$ head -1 contacts.txt
>>>> "Business Title","First Name","Middle Name","Last Name","","Business
>>>> Company Name","","Business Title","Business Street 1","Business Street
>>>> 2","Business Street 3","Business City","Business State","Business
>>>> Zip","Business Country","Home Street 1","Home Street 2","Home Street
>>>> 3","Home City","Home State","Home Zip","Home Country","Other Street
>>>> 1","Other Street 2","Other Street 3","Other City","Other State","Other
>>>> Zip","Other Country","Assistant Phone","Business Fax Number","Business
>>>> Phone","Business 2 Phone","","Car Phone","","Home Fax Number","Home
>>>> Phone","Home 2 Phone","ISDN Phone","Mobile Phone","Other Fax
>>>> Number","Other Phone","Pager
>>>> Phone","","","","","","","","","","","","","Business Email","","Home
>>>> Email","","Other
>>>> Email","","","","","","","","","","","","Notes","","","","","","","","","","","","","","Business
>>>> Web Page"
>>>>
>>>>
>>>>
>>> Here is one crude method. Assume that the above long single line is in a
>>> file called test.db. Then the following bash command will output the
>>> Business Email from that file (this is one long command):
>>> $> cat test.db | sed -e 's/\(.*Business Email\"\),"\(.*\)/\2/g' | awk
>>> 'BEGIN { FS = "\"" } ; {print $1}'
>>>
>>> Similarly, the following gives the First name, Middle name and the Last
>>> name.
>>> $> cat test.db | sed -e 's/\(^"Business Title\"\),"\(.*\)/\2/g' | awk
>>> 'BEGIN { FS = "," } ; {print $1, $2, $3}' | tr -d '"'
>>>
>>> Now, you can run this command on each line of your actual database file
>>> (using the bash while and read commands) and you should get the business
>>> email address and the names. If there is no email address, the output
>>> will be blank.
>>>
>>> Here is an untested set of commands to read each line from a file
>>> (full.db) to generate names and email:
>>> $> cat full.db | while read line; do
>>> echo "${line}" | sed -e 's/\(^"Business Title\"\),"\(.*\)/\2/g' |
>>> awk 'BEGIN { FS = "," } ; {print $1, $2, $3}' | tr -d '"';
>>> echo "${line}" | sed -e 's/\(.*Business Email\"\),"\(.*\)/\2/g' |
>>> awk 'BEGIN { FS = "\"" } ; {print $1}'
>>> done
>>>
>>> But note that this is really a crude method. I am sure others can
>>> suggest more elegant ways to accomplish this. The above method will at
>>> least get you started.
>>>
>>> Warm regards.
>>>
>>>
>> More concise (given the order of data fields is constant) and probably
>> more efficient and better (the following is one long line):
>>
>> #---------------------------------------------
>> $> cat full.db | while read line; do echo "${line}" |awk 'BEGIN { FS =
>> "," }; {print $2, $3, $4, $58}' | tr -d '"'; done
>>
>
> There are a few issues. There's no need for cat, read line...done, tr,
> or echo (shell scripting is slow, especially when you fork multiple
> processes for every line). This didn't handle all the emails and that's
> the wrong field number. And it doesn't output in CSV format. Finally,
> the above prints every line, not only those with emails. So I get:
>
> gawk -F, '{ if ( match($57$59$61, "@") ) print
> $2","$4","$57","$59","$61};' contacts.txt>processed_contacts.txt
>
> That's all one line.
>
> Matt Flaschen
>
>
>
>
Thanks to everyone who responded to this. I really didn't plan to ask
the question and then drop off the face of the earth for a week, but
life happened. Matt's solution worked like a charm, so I replied to
this message, but I learned something from all of the discussion and I
appreciate it.
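
One caveat for anyone adapting the quoted gawk one-liner: splitting on a bare comma miscounts fields whenever a value contains a comma (for example in the notes or company name). A minimal sketch of one workaround follows. It assumes every field in the export is wrapped in double quotes, as the header line suggests, and it reuses the field numbers from Matt's command (2, 4, 57, 59, 61). The function name extract_contacts is made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes every field is double-quoted and that the name and
# email columns are fields 2, 4, 57, 59 and 61, as in the quoted command.
# extract_contacts is a hypothetical helper name, not part of any tool.
extract_contacts() {
    # Split on the three-character sequence "," instead of a bare comma,
    # so commas inside a quoted value do not start a new field.
    # This only works for interior fields: the first field keeps its
    # leading quote and the last field keeps its trailing quote.
    awk -F'","' -v OFS='","' '
        # Keep only rows where at least one email field contains "@".
        ($57 $59 $61) ~ /@/ {
            # Re-add the outer quotes that the split removed.
            print "\"" $2, $4, $57, $59, $61 "\""
        }
    ' "$@"
}

# Usage (file names taken from the thread):
#   extract_contacts contacts.txt > processed_contacts.txt
```

This does not handle quoted values that themselves contain the sequence "," or that span several lines; a real CSV parser would be the safer choice for such data.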
As an aside, I manually cleaned out the few duplicate lines in the
result. I am going to read 'man gawk' to see if I can figure out how
to remove duplicates automatically.
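
For the duplicate cleanup mentioned above, a common awk idiom may be enough. The sketch below assumes duplicates are byte-for-byte identical lines; the function name dedupe_lines is made up for illustration.

```shell
#!/bin/sh
# Sketch only: removes exact duplicate lines, keeping the first occurrence
# and preserving the original order. dedupe_lines is a hypothetical name.
dedupe_lines() {
    # seen[$0] is 0 the first time a line appears, so !seen[$0]++ is true
    # only then; awk's default action for a true pattern is to print the line.
    awk '!seen[$0]++' "$@"
}

# Usage (input file name taken from the thread, output name is an example):
#   dedupe_lines processed_contacts.txt > deduped_contacts.txt
```

Unlike sort -u, this keeps the lines in their original order. Lines that differ only in spacing or letter case would still count as distinct, so those would need to be normalized first.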
More information about the ubuntu-users
mailing list