Text processing tools and/or languages

Fri Jan 21 09:45:09 UTC 2022

On Thu, Jan 20, 2022 at 09:41:25PM +0000, Peter Flynn wrote:
> On 20/01/2022 15:42, Chris Green wrote:
> > I'm looking for tools (if there are any) for processing a text file
> > line by line sequentially.
> 
> Pretty much all the standard Unix text tools do exactly this.
> 
> > As it goes through the file it needs to make decisions based on the
> > contents of the line(s) of text and change its state as it goes.
> > The decisions it makes depend on the state it's in.
> 
> awk is the obvious choice to me, but for others it would be one of the
> common scripting languages like Perl or Python.
> 
> A lot may depend on the nature of the data and what you want to do with
> it. Picking the right tool for the job isn't always simple, although
> there is a tendency for people to stick to one tool they know well, and
> shoe-horn every task into the constraints of that tool :-)
> 
> > Basically I'm processing some (fairly) fixed format messages from a
> > forum to remove some matched header and trailer lines, modify and
> > output a few other matched lines and simply output the body of the
> > message.
> > 
> > The (most) difficult bit is removing blank lines before something.
> 
> As many tools don't read ahead to the next line, you will need to set
> some kind of flag value to indicate what type of line the previous line
> was, and make decisions on that basis.
> 
> There are some languages with built-in features for doing exactly this
> kind of processing, understanding the concept of "lines of importance
> separated by blank lines". Omnimark and Saxon are the two I have used most.
> 
Yes, I think you understand the problem.  Awk is ideal for the pattern
matching but doesn't (as I said) really lend itself to having a
concept of a 'position' in the file.
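
To illustrate the "blank lines before something" problem, here is a minimal sketch in Python. It holds blank lines back in a buffer until the next non-blank line shows whether they should be kept. The function name `drop_blanks_before` and the marker pattern `^-- ` are made up for illustration; the real forum format may need different patterns.

```python
import re


def drop_blanks_before(lines, marker):
    """Yield lines, dropping any run of blank lines that comes
    immediately before a line matching the regex `marker`.

    `marker` is a hypothetical pattern chosen for illustration.
    """
    pattern = re.compile(marker)
    pending = []  # blank lines seen but not yet emitted
    for line in lines:
        if line.strip() == "":
            # Cannot decide yet: hold the blank line back.
            pending.append(line)
            continue
        if not pattern.search(line):
            # Ordinary line: the held-back blanks are wanted after all.
            yield from pending
        # Either emitted or deliberately discarded; start afresh.
        pending.clear()
        yield line
    # Blank lines at the end of the input precede nothing, so keep them.
    yield from pending


sample = ["body one", "", "", "-- signature", "body two", "", "body three"]
result = list(drop_blanks_before(sample, r"^-- "))
```

With this sample, the two blank lines before `-- signature` are dropped while the blank line between `body two` and `body three` survives. The same held-back-lines idea can be written in awk with a counter of pending blank lines.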

As I said elsewhere in this thread, I've produced a fairly short (18
lines) script in awk that does what I need (I expect I'll find
some edge cases that break it and need more code).  It's OK, but for me
it doesn't really fit the awk paradigm very well.
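
For comparison, the state-driven approach can be sketched in Python roughly as follows. This is not the awk script itself, and the message layout is an assumption invented for the example: header lines up to the first blank line, a `Subject: ` line that is rewritten and kept, and a trailer that starts at a line of dashes. The function name `clean_message` is likewise hypothetical.

```python
def clean_message(lines):
    """Strip header and trailer lines from one message, keeping the body.

    Assumed (hypothetical) layout: header lines, a blank line, the body,
    then a line starting with "-----" that begins the trailer.
    """
    state = "header"
    out = []
    for line in lines:
        if state == "header":
            if line.startswith("Subject: "):
                # A matched header line that is modified and output.
                out.append("Title: " + line[len("Subject: "):])
            elif line == "":
                # The first blank line ends the header.
                state = "body"
            # Any other header line is dropped.
        elif state == "body":
            if line.startswith("-----"):
                state = "trailer"
            else:
                out.append(line)
        # In the "trailer" state every remaining line is dropped.
    return out


message = [
    "From: someone",
    "Subject: Hello",
    "Date: today",
    "",
    "First line",
    "Second line",
    "-----",
    "Reply | Report",
]
cleaned = clean_message(message)
```

The decision taken for each line depends on the current state, and some lines change the state, which is the behaviour described earlier in the thread.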

-- 
Chris Green