Fun with oneliners

Kjeldgaard Morten mok at bioxray.dk
Wed Dec 5 15:31:55 GMT 2007


Hi all, this is a long-ish post, but I hope you will enjoy it.

The other night I was discussing with persia, apachelogger and  
norsetto on #ubuntu-motu, where they were using their awesome MOTU  
powers to demolish my poor little theseus package.

While apachelogger was disassembling and devastating the files in  
debian/ one by one, persia was vehemently attacking a little awk  
script that I used in the package. I'll make a long story short, and  
just say that I needed to extract the upstream version number from  
the debian/changelog file from inside debian/rules.

My approach when scripting is always to try to use a little hammer  
first. If that doesn't work, I'll use a bigger hammer, and if _that_  
doesn't work, I'll use a giant hammer.

The first line of the changelog file, which I am interested in, looks  
like this:

theseus (1.1.5-0ubuntu1) hardy; urgency=low

Due to the strict format of the changelog file, it will always look  
like that, but of course the version numbers etc. can vary. I am  
interested in extracting the string "1.1.5". So, first, I pulled out  
my little hammer, which consists of a pipeline of standard shell  
tools, such as head, tail, cut, sort, etc. The following little  
hammer solves the problem.

head -1 changelog |cut -f2  -d' '|cut -f1 -d'-'|cut -f2 -d'('
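
To see what each stage of that pipeline contributes, here is a small
sketch that runs the three cut commands one at a time. It assumes the
sample changelog line quoted above rather than reading a real
changelog file, and the variable names are only for illustration.

```shell
# Sample first line of debian/changelog, copied from the post.
line='theseus (1.1.5-0ubuntu1) hardy; urgency=low'

# Second space-separated field: the parenthesised full version.
step1=$(printf '%s\n' "$line" | cut -f2 -d' ')
# Everything before the first dash: drops the Debian revision.
step2=$(printf '%s\n' "$step1" | cut -f1 -d'-')
# Everything after the opening parenthesis: the upstream version.
step3=$(printf '%s\n' "$step2" | cut -f2 -d'(')

printf '%s\n' "$step1" "$step2" "$step3"
```

Each cut peels off one delimiter, which is why three of them are
needed.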

What's wrong with that? Well nothing, it's just, kinda ugly. We can  
do better. I pulled out the bigger hammer, sed, but within a minute  
or two it grew sour on me and I took out my favourite big hammer,  
awk. Awk is indeed awesome, it's an incredible tool, and if you don't  
know it, you're missing out. Awk is extremely powerful, and very easy  
to understand. An awk script is basically a series of patterns and  
actions, like so:

/pattern/ {action}

If the pattern - an ordinary regexp - matches a line, the action is  
performed on that line. The stuff inside the curly brackets is very  
reminiscent of C syntax, so if you're familiar with that, you're off  
to the races. In fact, awk is so powerful, that Henry Spencer has  
written an nroff formatter, called awf, in the language (sic!). Henry  
writes he can't believe he wrote it. Neither can anyone else :-)
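
As a minimal sketch of the pattern/action idea, the following runs a
tiny awk program with two patterns and an END block. The changelog
text is made up for the example (only its first line comes from this
post), and plain "awk" is assumed to be whichever awk is installed.

```shell
# Hypothetical changelog excerpt; only the first line is from the post.
sample='theseus (1.1.5-0ubuntu1) hardy; urgency=low

  * Initial release.
  * Fix the man page.'

report=$(printf '%s\n' "$sample" | awk '
    /urgency=/ { print "package: " $1 }   # action runs only on the header line
    /^  \*/    { count++ }                # action runs on each bullet line
    END        { print "entries: " count }
')
printf '%s\n' "$report"
```

Lines that match no pattern, such as the empty one, are simply
skipped.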

There are several flavours of awk. I like gawk, which is the GNU one.  
It contains several extensions to the original language. So, here is  
the gawk oneliner that extracts the version:

gawk '{match($0,/\((.*)-/,arr);print arr[1];exit}' < changelog

As you see, only the action is used here. We call the function match,  
which actually takes over the regexp matching job normally carried  
out by the pattern. Let's dissect the regexp.

First, it will try to match the initial left parenthesis, that is  
what the \( is for. The next part is (.*), here the parentheses are  
not escaped, so they have a special meaning, namely a grouping.  
Inside the grouping, we look for an arbitrary run of characters. This  
run ends when a dash is encountered. But now the grouping becomes  
important, because the match function will place the matched pattern  
in arr[0] - this is "(1.1.5-" in this case, and the groupings in the  
following array elements. So arr[1] contains the desired string "1.1.5".
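
A quick way to check what ends up in the array is to print arr[0] and
arr[1] side by side. This sketch assumes gawk is installed (the
three-argument match is a gawk extension) and skips itself otherwise;
the "|" separator is just for display.

```shell
line='theseus (1.1.5-0ubuntu1) hardy; urgency=low'

if command -v gawk >/dev/null 2>&1; then
    # arr[0] is the whole match, arr[1] is the first grouping.
    result=$(printf '%s\n' "$line" |
        gawk '{ match($0, /\((.*)-/, arr); print arr[0] "|" arr[1]; exit }')
    printf '%s\n' "$result"
else
    echo "gawk is not installed; skipping" >&2
fi
```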

Well, as mentioned, persia didn't like that too well. You have to  
Build-depend on gawk, he said. You can use mawk, said norsetto, it's  
part of the basic build environment. Granted, the gawk binary is  
293K, and mawk is only 93K. It's saving valuable resources!  
Unfortunately, the "match" function syntax was not accepted by mawk,  
so I got a syntax error. But, not to worry, of course it can be done  
with mawk!

StevenK said: Why don't you just do the following?

dpkg-parsechangelog | grep Version | cut -d\  -f2

Well, at this stage, we were into optimization, finding the very last  
CPU cycle and the very last bit of RAM. It was becoming a  
Dogme-film-like situation: we value the minimalist creative ideal.  
And dpkg-parsechangelog is a Perl script. Yeeechh.

The next oneliner worked with both gawk and mawk.

mawk '{match($0,/\(.*-/);print substr($0,RSTART+1,RLENGTH-2);exit}' < changelog

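In this awk dialect, the match function does not fill an array.  
Instead it sets two built-in variables: RSTART, the position where  
the match begins, and RLENGTH, the length of the match. It will not  
deal with groupings, so the '()' surrounding the .* are gone. Another  
function, substr, is used to extract the wanted version string from  
the input string ($0); the +1 skips the leading '(' and the -2 drops  
both that and the trailing '-' from the length. Mission accomplished.  
Success!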
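
To make the arithmetic visible, this sketch prints RSTART and RLENGTH
next to the extracted substring. It assumes the sample changelog line
from above and uses plain "awk", on the assumption that any
POSIX-style awk (mawk, gawk, or another) behaves the same way here.

```shell
line='theseus (1.1.5-0ubuntu1) hardy; urgency=low'

info=$(printf '%s\n' "$line" | awk '{
    match($0, /\(.*-/)      # sets RSTART and RLENGTH for "(1.1.5-"
    print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)
    exit
}')
printf '%s\n' "$info"
```

The '(' sits at position 9 and the match "(1.1.5-" is 7 characters
long, so substr starts at position 10 and takes 5 characters.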

But no, no, no. Persia was still not happy. "I'll accept gawk, but  
couldn't resist your last comment", he said, referring to a comment I  
had made about efficiency. Persia pushed me back to using sed. He said:
"Isn't it just something like sed /^theseus\s\([\d\.]*\)-.*/\1/p |  
head -1 "? And indeed, the size of /bin/sed is only 40K. A huge  
saving of resources compared to gawk!

I copy-pasted it, but it didn't quite work. Hmm. Back to the drawing  
board. Then I came up with another suggestion:

sed 's/.*(//; s/-.*//;q' < changelog

Let's examine the sed script. It consists of two "substitute"  
commands and a "quit" command, separated by semicolons. Sed runs the  
script on each input line in turn. The first command deletes  
everything up to, and including, the '(' (the .* is greedy, so with  
several parentheses it would be the last one). The next deletes from  
the first dash to end-of-line. The third command, q, prints the  
result and quits, so only the first line is ever processed.
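
Here is a sketch that applies the two substitutions separately, and
then the full script on two lines of input to show the effect of q.
The second input line is invented purely to demonstrate that it is
never reached.

```shell
line='theseus (1.1.5-0ubuntu1) hardy; urgency=low'

# Step 1: delete everything up to and including the '('.
after_first=$(printf '%s\n' "$line" | sed 's/.*(//')
# Step 2: delete from the first dash to the end of the line.
after_second=$(printf '%s\n' "$after_first" | sed 's/-.*//')

# Full script: q stops sed after the first line has been printed.
version=$(printf '%s\nsecond line (9.9-9)\n' "$line" |
    sed 's/.*(//; s/-.*//;q')

printf '%s\n' "$after_first" "$after_second" "$version"
```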

But persia was still not happy. He was using his MOTU powers, driving  
me forward, at every step, for perfection! I started to look at  
persia's oneliner again, and finally got it twisted so it worked for me:

sed 's/.*(\(.*\)-.*).*/\1/;q' < changelog

Let's analyse the regexp again. We are using "grouping" again, but  
unlike awk (unfortunately) sed by default has a reversed  
interpretation of parentheses. In sed, they have to be escaped to  
signify a grouping, while an unescaped '(' is a literal character.  
Inside the first pair of /'s is the regexp that matches the whole  
first line. There is a grouping around the characters between the '('  
and the '-' in that line, in other words, the version. The sed  
statement is thus a substitution, where the whole line is replaced by  
grouping 1, which is referenced as \1. Voila!
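
The difference in parenthesis handling can be seen by writing the same
substitution twice: once in sed's default syntax, and once with the -E
option, which switches to extended regexps where the escaping is the
same as in awk. This is only a sketch; it assumes a sed that accepts
-E (GNU and BSD sed both do) and uses the sample line from above.

```shell
line='theseus (1.1.5-0ubuntu1) hardy; urgency=low'

# Basic regexps (default): \( \) group, a bare ( or ) is literal.
bre=$(printf '%s\n' "$line" | sed 's/.*(\(.*\)-.*).*/\1/;q')
# Extended regexps (-E): ( ) group, \( and \) are literal.
ere=$(printf '%s\n' "$line" | sed -E 's/.*\((.*)-.*\).*/\1/;q')

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

Both forms should give the same version string; which one reads
better is a matter of taste.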

Finally, persia, that relentless seeker of perfection, was satisfied!  
The package was uploaded to REVU, quickly sponsored by apachelogger  
and norsetto, and is now already accepted for Universe.

So, what can all we MOTU-hopefuls learn from this story? Well, be  
patient when you work on your package, don't get frustrated! Have  
some fun on the irc channel, show the MOTUs what you can do, and  
learn from them! You may even teach them a trick or two ;-)

Cheers,
Morten

PS: The entire #ubuntu-motu session can be viewed at
http://irclogs.ubuntu.com/2007/12/03/%23ubuntu-motu.html






