[CoLoCo] bash question

Sat Aug 20 09:29:57 UTC 2011

Jim,

I wrote a quick, easy Ruby script to do exactly what you want.
It was not written to be the most efficient, but rather the easiest
to read.  And in the "teach a man to fish" school of thought, I figured
I would walk you through the script, so that you get a working script
and any newbies out there can see how it is done.

First, like any Linux script, declare your language with a shebang line
---------------------------------------------------------------------------
#!/usr/bin/ruby
---------------------------------------------------------------------------

Next, if you want to skip the first line of each file, you need a flag
---------------------------------------------------------------------------
firstline = true
---------------------------------------------------------------------------

Now, you need to iterate through all your files.  In Ruby, you simply 
create a Dir object, and pass it a path.  Now that you have a directory, 
simply call the each method, and define a name to hold the filename... 
in this case 'file'
---------------------------------------------------------------------------
Dir.new('./data').each do |file|
---------------------------------------------------------------------------

This could also have been done by declaring a variable, say 'd', to hold
the Dir object, then iterating on that.  That would not buy us anything
in this case, so I did not do it.  But if I had, it would look like this
---------------------------------------------------------------------------
d = Dir.new('./data')
d.each do |file|
---------------------------------------------------------------------------

The next thing you want to do is skip any entries you don't want to
process.  In this case that means '.', '..', and any directories.  If
you want your data in sub-directories, that would be easy to add.  I
purposely skip directories so you can create a 'NoProcess' directory
to keep files you don't want processed
---------------------------------------------------------------------------
    next if file == '.'
    next if file == '..'
    next if File::Stat.new("./data/#{file}").directory?
---------------------------------------------------------------------------
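
A side note for anyone following along: the three 'next' lines can also
be folded into a single test.  What follows is only a sketch, not part
of the script above; regular_files is a name I made up, and the demo
runs in a throwaway temp directory instead of './data'.

```ruby
require 'tmpdir'

# Hypothetical helper (not in the original script): returns the names of the
# regular files directly inside dir, sorted so the order is predictable.
def regular_files(dir)
  # File.file? is false for '.', '..' and sub-directories, so one check covers all three.
  Dir.entries(dir).select { |name| File.file?(File.join(dir, name)) }.sort
end

# Demo on a temporary directory so nothing real is touched.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.txt'), "header\n")
  File.write(File.join(dir, 'b.txt'), "header\n")
  Dir.mkdir(File.join(dir, 'NoProcess'))
  puts regular_files(dir).inspect   # prints ["a.txt", "b.txt"]
end
```

File.join also saves you from building the "./data/..." path by hand
each time.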

Pretty simple so far?  I hope so.  Now that we have a filename that we
want to work with, let's open the file and assign the file handle to a
variable I will call 'fh'
---------------------------------------------------------------------------
    File.open("./data/#{file}") do |fh|
---------------------------------------------------------------------------

You will also notice that I never close this file.  By opening the
file with a "do |varname|" (called passing a block in Ruby), the file
is closed automatically when the block finishes... cool, huh?  No more
files left open by mistake.  I love these new language structures.
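
If you want to see the automatic close for yourself, here is a small
stand-alone check.  It is only a sketch: the file name and contents are
made up, and it works in a temp directory instead of './data'.

```ruby
require 'tmpdir'

handle = nil       # keeps a reference so we can inspect the handle afterwards
first_line = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.txt')
  File.write(path, "hello\nworld\n")

  File.open(path) do |fh|
    handle = fh
    first_line = fh.gets
    # fh is open in here; no explicit fh.close is needed
  end
end

puts first_line       # prints hello
puts handle.closed?   # prints true
```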

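Let's initialize your count, and reset the first-line flag so that the
header of every file gets skipped, not just the header of the first file
---------------------------------------------------------------------------
       count = 0
       firstline = true
---------------------------------------------------------------------------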

Then read lines out of the file until there are no more to read
---------------------------------------------------------------------------
       while line = fh.gets
---------------------------------------------------------------------------

Skip what you read if it is the first line, and set your flag to 
indicate that this has already been done
---------------------------------------------------------------------------
          if firstline
             firstline = false
             next
          end
---------------------------------------------------------------------------
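
As an aside, the flag is not the only way to drop a header line.  One
alternative, sketched below, is to call gets once before the loop and
throw the result away.  The helper name lines_after_header and the
sample data are my own inventions, and StringIO stands in for a real
file so the example runs anywhere.

```ruby
require 'stringio'

# Hypothetical helper: returns every line except the first from an IO-like object.
def lines_after_header(io)
  io.gets                      # read and discard the header line
  rest = []
  while (line = io.gets)       # gets returns nil at end of file, which ends the loop
    rest << line.chomp
  end
  rest
end

sample = StringIO.new("user title count\nalice Foo 3\nbob Bar 4\n")
puts lines_after_header(sample).inspect   # prints ["alice Foo 3", "bob Bar 4"]
```

The flag version reads a little more explicitly, which is why I used it
in the walkthrough.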

All right, you have a file, you have opened it, you have read a line
from it, and it is not the first line.  Let's parse it into the three
column values
---------------------------------------------------------------------------
          (rev_user_text,page_title,linecount) = line.split
---------------------------------------------------------------------------
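
To make the split step concrete, here is what it does to a sample line.
The data is made up, since I have not seen your real files.  The second
half assumes a tab-separated layout purely to show what happens when a
title contains spaces.

```ruby
# Made-up sample line; the real data files may look different.
line = "Alice Main_Page 12\n"

# With no argument, split breaks on runs of whitespace and drops the newline.
rev_user_text, page_title, linecount = line.split
puts linecount.inspect        # prints "12" (still a String)

# Assumption: if the columns were tab-separated and a title contained spaces,
# splitting on the tab character would keep each column in one piece.
tabbed = "Alice\tMain Page\t12\n"
tab_fields = tabbed.chomp.split("\t")
puts tab_fields.inspect       # prints ["Alice", "Main Page", "12"]

# The same line split on plain whitespace yields four fields, so the count shifts.
space_fields = tabbed.split
puts space_fields.length      # prints 4
```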

And add the third column (called linecount) to our total.  Since your
file is a text file, Ruby will treat all data as strings.  The to_i
simply forces the string -> int conversion
---------------------------------------------------------------------------
          count += linecount.to_i
---------------------------------------------------------------------------
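
One thing worth knowing about to_i is that it is forgiving: it never
raises, it just returns 0 when it cannot find a number.  The short
sketch below shows that behaviour next to Integer(), the strict
built-in alternative.  The sample strings are made up.

```ruby
samples   = ["12", "12\n", "12abc", "abc", ""]
converted = samples.map(&:to_i)
puts converted.inspect        # prints [12, 12, 12, 0, 0]

# A line with fewer than three columns leaves linecount as nil, and nil.to_i is 0 too.
puts nil.to_i                 # prints 0

# Integer() is the strict alternative: it raises instead of returning 0.
strict = begin
  Integer("abc")
rescue ArgumentError
  :not_a_number
end
puts strict.inspect           # prints :not_a_number
```

For this script the forgiving behaviour is handy, since a stray header
or blank line simply adds 0 to the total.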

OK, we are now done with our line, so let's end the while loop and go
on until we hit the end of the file
---------------------------------------------------------------------------
       end
---------------------------------------------------------------------------

Once End-Of-File is reached, we can print our results
---------------------------------------------------------------------------
       puts "#{file}: #{count}"
---------------------------------------------------------------------------

If you wanted, you could have opened a file at the beginning and
written the results to that file here.  But by sending this to stdout,
you can use redirection to create a file, or run other command-line
tools like awk, sed, wc, etc. on the output.
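
If you ever do want both options, one way is to pass the output
destination around as a parameter.  This is just a sketch: write_totals
is a name I made up, the numbers are invented, and a StringIO buffer
stands in for a real output file.

```ruby
require 'stringio'

# Hypothetical helper: writes one "name: total" line per entry to any IO-like object.
def write_totals(totals, out = $stdout)
  totals.each { |name, count| out.puts "#{name}: #{count}" }
end

totals = { 'a.txt' => 7, 'b.txt' => 3 }   # made-up numbers

write_totals(totals)                      # goes to stdout, ready for redirection

buffer = StringIO.new                     # stands in for an opened output file
write_totals(totals, buffer)
```

With a real file you would wrap the second call in a
File.open('totals.txt', 'w') block instead of using the buffer.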

We are now at the end of processing that file, and we can loop back and 
get any other files in that directory
---------------------------------------------------------------------------
    end
---------------------------------------------------------------------------

Once all the files are processed, we are done
---------------------------------------------------------------------------
end
---------------------------------------------------------------------------

Pretty simple.  You can put the program in a temp folder, place all the
files you want to process in a sub-folder called data, and run it.  Or,
it should now be pretty easy to alter it to pull files from wherever
you want.  So, here is the program in its entirety:
---------------------------------------------------------------------------
#!/usr/bin/ruby

firstline = true

Dir.new('./data').each do |file|
    next if file == '.'
    next if file == '..'
    next if File::Stat.new("./data/#{file}").directory?

    File.open("./data/#{file}") do |fh|
       count = 0
       firstline = true   # reset for every file so each header is skipped
       while line = fh.gets
          if firstline
             firstline = false
             next
          end

          (rev_user_text,page_title,linecount) = line.split
          count += linecount.to_i
       end
       puts "#{file}: #{count}"
    end
end
---------------------------------------------------------------------------
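
For the curious, here is a more compact variant of the same idea.
Treat it as a sketch under my own assumptions, not a replacement: the
method name column_totals and the sample data are made up, it returns
the totals in a hash instead of printing as it goes, and the demo runs
in a temp directory instead of './data'.

```ruby
require 'tmpdir'

# Hypothetical helper: maps each regular file in dir to the sum of its third
# column, ignoring the first (header) line of every file.
def column_totals(dir)
  totals = {}
  Dir.entries(dir).sort.each do |name|
    path = File.join(dir, name)
    next unless File.file?(path)        # skips '.', '..' and sub-directories

    total = 0
    File.foreach(path).with_index do |line, index|
      next if index.zero?               # header line of this file
      total += line.split[2].to_i       # a missing third column counts as 0
    end
    totals[name] = total
  end
  totals
end

# Demo with made-up data.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'one.txt'), "user title count\nalice Foo 3\nbob Bar 4\n")
  File.write(File.join(dir, 'two.txt'), "user title count\ncarol Baz 10\n")
  Dir.mkdir(File.join(dir, 'NoProcess'))
  column_totals(dir).each { |name, total| puts "#{name}: #{total}" }
  # prints:
  #   one.txt: 7
  #   two.txt: 10
end
```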

Enjoy
Kevin