Scripting / one liner help [solved]

Thu Aug 11 01:44:44 UTC 2011

On Wed, August 10, 2011 5:06 pm, Patton Echols wrote:
> On 08/10/2011 03:43 PM, Jordon Bedwell wrote:
>> On Wed, August 10, 2011 2:52 pm, Hal Burgiss wrote:
>>> Its attempting to capture the string in between:
>>>
>>> SRC="  and the next doublequote: ".  The [^"] stops the capture at the
>>> double quote. The capture should then include any character that is NOT
>>> a
>>> double quote. If not careful, the expression could get "greedy" and
>>> start
>>> matching other double quotes on the same line.  This should stop that
>>> effect. The \1 is a reference back to the capture that is in the
>>> parenthesis, in sed syntax, which essentially just preserves the
>>> captured
>>> characters, and ignores the rest. Does that make sense?
>> Because it should be:
>>
>> grep -iPo "<img[^>]+>" file.html | \
>> sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'
>>
>> [COPY AND PASTE BOTH LINES AT ONCE AND PRESS THE ENTER KEY]
>
> Thanks, that works great and solves the immediate problem.  For purposes
> of my CLE (continuing linux education) I hope you will indulge me in the
> same question you posed to Hal.  How's it work?  I get the -io grep
> tags.  The -P enables perl regex?  What part of the grep string is the
> perl part.

In BRE that would be: grep -io "<img[^>]\+>" index.html. I chose Perl
syntax by habit, not by need. So to answer your question, the only "Perl
part" is the "+", and for this pattern Perl and ERE spell it the same way
(in BRE, GNU grep writes it as "\+"). It won't be until later, when you
start writing more complex regexps, that you see the differences between
ERE, Perl, and the others.
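
If it helps to see that side by side, here is a minimal sketch, assuming
GNU grep (the -P form needs a build with PCRE support). The sample file
and its contents are made up for illustration; they are not from your
page.

```shell
# Hypothetical sample input; the contents are invented for this sketch.
tmp=$(mktemp)
printf '%s\n' '<p>text <IMG SRC="a.png" alt="x"> more</p>' '<img src="b.jpg">' > "$tmp"

# BRE: "one or more" is spelled \+ (a GNU extension to BRE).
bre=$(grep -io '<img[^>]\+>' "$tmp")
# ERE: a bare + means "one or more".
ere=$(grep -iEo '<img[^>]+>' "$tmp")
# Perl syntax: for this pattern it is spelled exactly like ERE.
pcre=$(grep -iPo '<img[^>]+>' "$tmp")

# Each variable should hold the same two <img ...> tags, one per line.
printf '%s\n' "$bre"
rm -f "$tmp"
```

The three invocations differ only in how the "+" is written; -i and -o
behave the same in all of them.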

> Then I also wonder how the sed statement works.  I am still trying to
> figure sed (and plain old regex) out.

'\'' is the shell idiom for putting a literal ' inside a single-quoted
string, so read each '\'' as a plain '. The expression is a BRE, so think
of \( as ( in ERE or Perl syntax. /g tells sed to replace every match on a
line, not only the first one it finds, and /I tells it to ignore case (a
GNU sed extension). \1 (or \n in general) is a backreference to the n-th
captured group, which should have been one of the first things you learnt
about regexps.
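
A small sketch of the quoting and the backreference, assuming a
Bourne-style shell and GNU sed (for the I flag). The variable names and
the two input lines are hypothetical, picked only to show one
double-quoted and one single-quoted src.

```shell
# '\'' closes the single-quoted string, adds an escaped ', then reopens it.
printf '%s\n' 'it'\''s'

# Store the sed script so the text sed really receives can be printed.
script='s/<img src=['\''"]\([^"'\'']*\).*/\1/I'
printf '%s\n' "$script"

# \( ... \) captures everything up to the next quote character;
# \1 in the replacement keeps only that captured text.
dq=$(printf '%s\n' '<IMG SRC="a.png" alt="x">' | sed "$script")
sq=$(printf '%s\n' "<img src='b.jpg'>" | sed "$script")
printf '%s\n%s\n' "$dq" "$sq"
```

Printing $script shows the bracket expressions as sed sees them, with
plain quote characters and no backslash-quote sequences left over.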

Now on to the rest of it:
sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/gI'
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

At this point, for you, these two are the same and the choice is a matter
of preference; I prefer the latter. They both do the same thing for your
usage, as long as every line coming out of grep matches the pattern. In
later applications, where some input lines do not match, you will start
to notice the difference. To elaborate on this:

*IF index.html was a FULL HTML page*
*THEN:* sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI' index.html > 1.txt
*IS:* image.jpg [Assuming <img /> is on its own line with no wrappers]
*AND:* sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/Ig' index.html > 1.txt
*IS:* the whole index.html page written to 1.txt, with only the matching
lines changed (index.html itself is not modified).

Since I'm horrible at teaching, in other words: the first, with -n and /p,
prints only the backreferences in that example, and the second rewrites
those lines in the output while leaving everything else intact. Run them
both on your file with > filename.txt and you will see what I mean
instantly.
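
If you would rather try it on something small first, here is a sketch,
assuming GNU sed. The three-line page is a made-up stand-in for a full
HTML file.

```shell
# Hypothetical three-line "page"; the contents are invented for this sketch.
page=$(mktemp)
printf '%s\n' '<html><body>' '<img src="image.jpg" alt="x" />' '</body></html>' > "$page"

# -n turns off the automatic print; the p flag prints only the lines
# where the substitution actually happened.
with_n=$(sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI' "$page")

# Without -n every line is printed: changed lines in their new form,
# all other lines untouched.
without_n=$(sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/gI' "$page")

printf 'with -n and p:\n%s\n\nwithout -n:\n%s\n' "$with_n" "$without_n"
rm -f "$page"
```

The first result holds only the image name; the second holds all three
lines, with the <img /> line reduced to the image name.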

Somebody else might be better at explaining. I am a doer and an outputter,
not really a teacher. I can show you how to do a lot, but when it comes to
explaining how I did it you're barking up the wrong tree, because to me it
comes out as plain English and to you it comes out as gibberish. To me it
comes out as "this is how it's done" and to you it comes out as "what the
hell did he just say? He pretty much just read the command aloud and gave
no explanation of what it does" <<< Plenty have said that one to me.