Text processing tools and/or languages

Mike Marchywka marchywka at hotmail.com
Thu Jan 20 19:34:57 UTC 2022


On Thu, Jan 20, 2022 at 03:42:00PM +0000, Chris Green wrote:
> I'm looking for tools (if there are any) for processing a text file
> line by line sequentially.
> 
This is how most of Linux works :)


> As it goes through the file it needs to make decisions based on the
> contents of the line(s) of text and change its state as it goes.
> The decisions it makes depend on the state it's in.
>

I know you got the usual suggestions of sed/awk/python/whatever,
but I just converted a bash script into c++ code, precisely
because of the data structures and the logic or state info it needed.

If you look for "marchywka" and "toobib" you can see the task,

https://tug.org/pipermail/texhax/2021-March/024918.html


This is just c++ code that now makes use of unix fifos
for streaming data.  You can use std::system

https://en.cppreference.com/w/cpp/utility/program/system

with i/o through fifos or pipes in /tmp.
This can also make debugging easy if you write
intermediate results to files.
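
As a rough sketch of the std::system plus fifo idea (not the toobib code
itself), something like the following works on a POSIX system. The function
name run_through_fifo and the fifo path are made up for illustration, and it
assumes the path contains no single quotes. The writer is put in the
background so that opening the fifo for reading does not deadlock.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>
#include <sys/stat.h>   // mkfifo (POSIX)
#include <unistd.h>     // unlink (POSIX)

// Hypothetical helper: run a shell command with its stdout redirected
// into a fifo and collect the lines it writes.
std::vector<std::string> run_through_fifo(const std::string& command,
                                          const std::string& fifo_path) {
    std::vector<std::string> lines;
    unlink(fifo_path.c_str());                      // drop any stale fifo
    if (mkfifo(fifo_path.c_str(), 0600) != 0) return lines;

    // The trailing '&' backgrounds the writer; the shell returns at once
    // and the writer blocks until we open the fifo for reading.
    const std::string shell_cmd =
        "( " + command + " ) > '" + fifo_path + "' &";
    if (std::system(shell_cmd.c_str()) == 0) {
        std::ifstream in(fifo_path);                // blocks until writer opens
        std::string line;
        while (std::getline(in, line)) lines.push_back(line);
    }
    unlink(fifo_path.c_str());                      // clean up
    return lines;
}
```

Swapping the fifo path for a regular file in /tmp gives you the intermediate
results on disk for debugging.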

Once you have the c++ skeleton this becomes very simple
and, if nothing else, it lets you prototype with external tools until
you integrate the code. For example, json parsing is
a big task in this. I can just invoke a json parsing
utility until I include the headers and compile it in.


I know python is popular but c++ is simple and versatile.


This particular task of finding bibtex entries let
me develop a bunch of text processing classes that
can split lines into ragged tables, which come close
to the parsing requirements of many apps. Usually
this returns a vector of strings that are more or less
words, which reduces most of what you want to do to easy
miscellaneous glue logic.
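
A minimal sketch of that kind of splitting, assuming plain whitespace
separation (the real classes handle more than this; the names Row,
split_words and split_table are invented here for illustration):

```cpp
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Split one line into whitespace-separated words.
Row split_words(const std::string& line) {
    std::istringstream iss(line);
    Row words;
    std::string w;
    while (iss >> w) words.push_back(w);
    return words;
}

// Split a block of text into a ragged table: one Row per input line,
// and rows are allowed to have different lengths (blank lines give
// empty rows).
std::vector<Row> split_table(const std::string& text) {
    std::istringstream iss(text);
    std::vector<Row> table;
    std::string line;
    while (std::getline(iss, line)) table.push_back(split_words(line));
    return table;
}
```

With rows in this form, a check such as "is the first word Category" is a
one-line comparison on row[0].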




 
> Basically I'm processing some (fairly) fixed format messages from a
> forum to remove some matched header and trailer lines, modify and
> output a few other matched lines and simply output the body of the
> message.
> 
> The (most) difficult bit is removing blank lines before something.
> 
> E.g. we have a message that starts:-
> 
>    A new topic has been created on the forum
> 
>    Message Subject : weed webinar 31 January
> 
>    Category : Waterways Continental Europe
> 
>    Posted by : Fred Bloggs
> 
> 
> I want to delete everything up to and including the blank line after 'Message Subject'
> then keep (i.e. output) the 'Category' line and the 'Posted by' lines without the blank
> lines in between.
> 
> I can't delete all blank lines because I want to retain spacing
> in the message body later.  So I need to be able to do things
> like deleting blank lines unless I am in the message body.
> 
> Are there specific tools for doing this sort of thing or should
> I just write a program (probably in Python) that reads lines,
> does actions as required and remembers its state as it goes?
> 
> I got some of the way using sed but it's very difficult to 'delete
> the line before XXXX' with sed.  It *might* be that awk would be
> better but I don't see it handling the state/sequential bit any
> better than sed.
> 
> Any/all advice would be very welcome.
> 
> -- 
> Chris Green
> 

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X



