Text processing tools and/or languages
Chris Green
cl at isbd.net
Fri Jan 21 09:45:09 UTC 2022
On Thu, Jan 20, 2022 at 09:41:25PM +0000, Peter Flynn wrote:
> On 20/01/2022 15:42, Chris Green wrote:
> > I'm looking for tools (if there are any) for processing a text file
> > line by line sequentially.
>
> Pretty much all the standard Unix text tools do exactly this.
>
> > As it goes through the file it needs to make decisions based on the
> > contents of the line(s) of text and change its state as it goes.
> > The decisions it makes depend on the state it's in.
>
> awk is the obvious choice to me, but for others it would be one of the
> common scripting languages like Perl or Python.
>
> A lot may depend on the nature of the data and what you want to do with
> it. Picking the right tool for the job isn't always simple, although
> there is a tendency for people to stick to one tool they know well, and
> shoe-horn every task into the constraints of that tool :-)
>
> > Basically I'm processing some (fairly) fixed format messages from a
> > forum to remove some matched header and trailer lines, modify and
> > output a few other matched lines and simply output the body of the
> > message.
> >
> > The (most) difficult bit is removing blank lines before something.
>
> As many tools don't read ahead to the next line, you will need to set
> some kind of flag value to indicate what type of line the previous line
> was, and make decisions on that basis.
>
> There are some languages with built-in features for doing exactly this
> kind of processing, understanding the concept of "lines of importance
> separated by blank lines". Omnimark and Saxon are the two I have used most.
>
Yes, I think you understand the problem. Awk is ideal for the pattern
matching but doesn't (as I said) really lend itself to having a
concept of a 'position' in the file.
As I said elsewhere in this thread, I've produced a fairly short (18
lines) script in awk that does what I need (I expect I'll find some
edge cases that break it and need more code). It's OK, but for me it
doesn't really fit the awk paradigm very well.
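
For comparison, here is a rough sketch of the "hold back the blank lines
until you know what follows" idea in Python. It is not the awk script
mentioned above. The patterns, the "Topic:"/"Subject:" rewrite and the
name clean() are all made up for illustration, since the real forum
format isn't shown in this thread.

```python
import re

# Hypothetical patterns: placeholders for the real header/trailer lines.
DROP_RE = re.compile(r"^(Posted by|Report post|Reply)\b")
# Hypothetical line that gets modified rather than dropped.
TOPIC_RE = re.compile(r"^Topic: (.*)$")


def clean(lines):
    """Return the cleaned lines of one message as a list of strings."""
    out = []
    pending_blanks = 0  # state: blank lines seen but not yet written out
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "":
            # Can't decide yet: it depends on the next non-blank line.
            pending_blanks += 1
            continue
        if DROP_RE.match(line):
            # Drop this line and the blank lines that came before it.
            pending_blanks = 0
            continue
        # An ordinary line: the held-back blank lines are now safe to emit.
        out.extend([""] * pending_blanks)
        pending_blanks = 0
        m = TOPIC_RE.match(line)
        if m:
            out.append("Subject: " + m.group(1))
        else:
            out.append(line)
    # Blank lines still pending at end of input are discarded.
    return out
```

The only state carried from line to line is the count of held-back blank
lines, so no read-ahead is needed. Calling clean(open("message.txt"))
would work on a file, where message.txt is again just an example name.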
--
Chris Green
More information about the ubuntu-users
mailing list