Text processing tools and/or languages
Mike Marchywka
marchywka at hotmail.com
Thu Jan 20 19:34:57 UTC 2022
On Thu, Jan 20, 2022 at 03:42:00PM +0000, Chris Green wrote:
> I'm looking for tools (if there are any) for processing a text file
> line by line sequentially.
>
This is how most Linux tools work :)
> As it goes through the file it needs to make decisions based on the
> contents of the line(s) of text and change its state as it goes.
> The decisions it makes depend on the state it's in.
>
I know you got the usual suggestions (sed, awk, Python, whatever),
but I just converted a bash script into C++ code precisely
because of the data structures, logic, and state info it needed.
If you search for "marchywka" and "toobib" you can see the task:
https://tug.org/pipermail/texhax/2021-March/024918.html
This is just C++ code that now makes use of Unix FIFOs
for streaming data. You can use std::system
https://en.cppreference.com/w/cpp/utility/program/system
with I/O through FIFOs or pipes in /tmp.
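
A minimal sketch of the std::system-plus-FIFO idea, not the actual
toobib code. It assumes a POSIX system (mkdtemp, mkfifo) and a
/bin/sh that std::system can reach. The helper name
run_and_read_lines and the /tmp/fifo_demo_XXXXXX path are
hypothetical, chosen only for illustration.

```cpp
#include <cstdlib>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>
#include <sys/stat.h>  // mkfifo (POSIX)
#include <unistd.h>    // unlink, rmdir (POSIX)

// Hypothetical helper: run `command` through the shell and return the
// lines it writes to stdout, streamed through a FIFO under /tmp.
std::vector<std::string> run_and_read_lines(const std::string& command) {
    // Private directory so the FIFO name cannot collide with another user's.
    char dir_template[] = "/tmp/fifo_demo_XXXXXX";
    if (mkdtemp(dir_template) == nullptr)
        throw std::runtime_error("mkdtemp failed");
    const std::string dir = dir_template;
    const std::string fifo = dir + "/out";
    if (mkfifo(fifo.c_str(), 0600) != 0) {
        rmdir(dir.c_str());
        throw std::runtime_error("mkfifo failed");
    }

    // Start the writer in the background. Its open() of the FIFO blocks
    // until this process opens the other end for reading.
    const std::string shell_cmd = "( " + command + " ) > '" + fifo + "' &";
    const int status = std::system(shell_cmd.c_str());

    std::vector<std::string> lines;
    if (status == 0) {
        std::ifstream in(fifo);  // blocks until the writer has opened the FIFO
        for (std::string line; std::getline(in, line);)
            lines.push_back(line);
    }

    unlink(fifo.c_str());
    rmdir(dir.c_str());
    return lines;
}
```

Calling run_and_read_lines("ls /tmp") would return one string per
line of output. Swapping the FIFO for a regular file in /tmp gives
the same interface and leaves the data behind for inspection.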
This also makes debugging easy if you write
intermediate results to files.
Once you have the C++ skeleton this becomes very simple,
and if nothing else it lets you prototype with external tools until
you integrate the code. For example, JSON parsing is
a big task in this. I can just invoke a JSON parsing
utility until I include the headers and compile it in.
I know Python is popular, but C++ is simple and versatile.
This particular task of finding BibTeX entries let
me develop a bunch of text processing classes that
can split lines into ragged tables, which comes close
to the parsing requirements of many apps. Usually
this returns a vector of strings that are more or less
words, which reduces most of what you want to do to easy,
miscellaneous glue logic.
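
A minimal sketch of what "splitting lines into a ragged table" can
look like, assuming plain whitespace-separated words. The names Row,
split_words and read_ragged_table are hypothetical and are not the
classes from the BibTeX tool mentioned above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of the ragged table: the words found on one input line.
using Row = std::vector<std::string>;

// Split a single line on runs of whitespace.
Row split_words(const std::string& line) {
    std::istringstream in(line);
    Row words;
    for (std::string word; in >> word;)
        words.push_back(word);
    return words;
}

// Read a whole stream into a ragged table: one Row per line.
// Blank lines become empty rows, so line positions are preserved.
std::vector<Row> read_ragged_table(std::istream& in) {
    std::vector<Row> table;
    for (std::string line; std::getline(in, line);)
        table.push_back(split_words(line));
    return table;
}
```

With a table like this, a test such as "is the first word of this
row Category?" becomes a one-line comparison, which is the glue
logic referred to above.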
> Basically I'm processing some (fairly) fixed format messages from a
> forum to remove some matched header and trailer lines, modify and
> output a few other matched lines and simply output the body of the
> message.
>
> The (most) difficult bit is removing blank lines before something.
>
> E.g. we have a message that starts:-
>
> A new topic has been created on the forum
>
> Message Subject : weed webinar 31 January
>
> Category : Waterways Continental Europe
>
> Posted by : Fred Bloggs
>
>
> I want to delete everything up to and including the blank line after 'Message Subject'
> then keep (i.e. output) the 'Category' line and the 'Posted by' lines without the blank
> lines in between.
>
> I can't delete all blank lines because I want to retain spacing
> in the message body later. So I need to be able to do things
> like deleting blank lines unless I am in the message body.
>
> Are there specific tools for doing this sort of thing or should
> I just write a program (probably in Python) that reads lines,
> does actions as required and remembers its state as it goes?
>
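
A minimal sketch of that "read lines, act, remember the state" loop,
written in C++ rather than Python. The state names, the function
filter_message, and the rule that the body begins right after the
'Posted by' line are assumptions made for illustration; the real
forum format may need more states (for example, one for the trailer
lines).

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Assumed states: before the subject line, in the kept header lines,
// and in the message body.
enum class State { Preamble, Headers, Body };

bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

bool is_blank(const std::string& s) {
    return s.find_first_not_of(" \t\r") == std::string::npos;
}

// Copy `in` to `out`, dropping or keeping each line based on the state.
void filter_message(std::istream& in, std::ostream& out) {
    State state = State::Preamble;
    for (std::string line; std::getline(in, line);) {
        switch (state) {
        case State::Preamble:
            // Drop everything up to and including the subject line.
            if (starts_with(line, "Message Subject"))
                state = State::Headers;
            break;
        case State::Headers:
            // Drop blank lines here; keep Category, Posted by, etc.
            if (is_blank(line))
                break;
            out << line << '\n';
            if (starts_with(line, "Posted by"))
                state = State::Body;  // assumption: body follows this line
            break;
        case State::Body:
            // Keep the body verbatim, blank lines included.
            out << line << '\n';
            break;
        }
    }
}

// Convenience overload for strings, handy when experimenting.
std::string filter_message(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    filter_message(in, out);
    return out.str();
}
```

Used as a filter, filter_message(std::cin, std::cout) (with
<iostream> included) slots into a pipeline like any other tool.
Because blank lines are only dropped in the Headers state, spacing
in the body survives.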
> I got some of the way using sed but it's very difficult to 'delete
> the line before XXXX' with sed. It *might* be that awk would be
> better but I don't see it handling the state/sequential bit any
> better than sed.
>
> Any/all advice would be very welcome.
>
> --
> Chris Green
>
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X