AWK experts - how would I code around this in awk...

Steve Flynn anothermindbomb at gmail.com
Tue Feb 23 09:58:21 UTC 2010


On Tue, Feb 23, 2010 at 12:31 AM, Karl Auer <kauer at biplane.com.au> wrote:

> Plus you might want to issue a warning if a partial record turns up,
> rather than silently discard it. With 3.8 terabytes to convert, I'm
> guessing you won't be watching the migration.

I'll probably see most of it, sadly. One of the perks of the contract.
Oh how we love 18 hour shifts.

> How are you sanity checking the input? Is that being left to the DB
> import stage? There might be things you can do that will save a lot of
> time, disk space and CPU cycles if the export was done wrongly. You
> might need to check for 7-bit clean or something. Anything unexpected,
> basically.

The data is pipe delimited, with each file (there are around 150 of them)
having a header record naming the fields I'm about to see.

Sanity checks thus far consist of "Count how many pipes in the header,
compare this with every following record and complain about any where
the numbers don't match". This has already caught the surprising
number of people who have pipes in their email addresses and house
addresses.
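
For anyone wanting to try the same check, here is a minimal sketch of it.
It assumes one record per line and the header on the first line of each
file. The function name check_pipes is made up for illustration and is
not the script actually used on this migration.

```shell
# check_pipes FILE...: hypothetical helper, not the real migration script.
# Prints one line per record whose pipe count differs from its file's
# header, and returns non-zero if any such record was found.
check_pipes() {
    awk -F'|' '
        # Pipes on a line = fields - 1; an empty line has no pipes.
        function pipes() { return (NF > 0 ? NF - 1 : 0) }

        # First line of each file is the header: remember its pipe count.
        FNR == 1 { want = pipes(); next }

        # Any later record with a different count gets a complaint.
        pipes() != want {
            printf "%s:%d: expected %d pipes, found %d\n",
                   FILENAME, FNR, want, pipes()
            bad++
        }

        END { exit (bad > 0) }
    ' "$@"
}
```

Run it as "check_pipes somefile" (any file names will do). A record
reporting fewer pipes than the header is a short record, and one
reporting more is a long record.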

This was actually how I found the "embedded CR/LF" issue: we found a
number of records which were short thanks to carriage returns in
some fields, and of course all the lovely people with pipes in their
email addresses show up as long records.


I'm now registering all new business for myself with
another,|`\mindbomb#{}[]!"£$%^&*()@"gmail.com.... I don't have the
heart to add a drop table sql injection attack to that...

-- 
Steve
When one person suffers from a delusion it is insanity. When many
people suffer from a delusion it is called religion.

09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0



