[RFC] Compact origin information for knit data files

Wed Nov 22 12:43:48 GMT 2006

Problem/Proposal
----------------

Adjacent lines in annotated knit data files often carry the same origin 
information, so it would be useful to compact it in such cases. I propose to 
omit the origin information for every line except the first in a block of 
adjacent lines with the same origin. So instead of:

     origin1 linedata1
     origin1 linedata2
     origin2 linedata3
     origin2 linedata4

the content will be:

     origin1 linedata1
      linedata2
     origin2 linedata3
      linedata4

When the knit file parser reads a line without origin information, it takes 
the origin from the nearest preceding line in the same block of adjacent lines 
that does carry it.
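
To make the round trip concrete, here is a rough Python sketch of the idea. It 
is not the actual knit code: the function names compact_origins and 
expand_origins are made up for illustration, and it assumes that a line is 
"origin, one space, text" and that an origin is non-empty and contains no 
spaces.

```python
def compact_origins(lines):
    """Turn (origin, text) pairs into raw lines, omitting repeated origins.

    Assumption: origins are non-empty and contain no spaces.
    """
    raw_lines = []
    previous = None
    for origin, text in lines:
        if origin == previous:
            # Same origin as the line before: keep only the separator.
            raw_lines.append(" " + text)
        else:
            raw_lines.append(origin + " " + text)
            previous = origin
    return raw_lines


def expand_origins(raw_lines):
    """Turn raw lines back into (origin, text) pairs."""
    lines = []
    previous = None
    for raw in raw_lines:
        # Split on the first space only, so the text may contain spaces.
        origin, text = raw.split(" ", 1)
        if origin:
            previous = origin
        elif previous is None:
            raise ValueError("first line of a block has no origin")
        lines.append((previous, text))
    return lines
```

With the four-line example above, compact_origins produces the compacted 
content shown, and expand_origins restores the original pairs.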

Advantages
----------

I expect not only a smaller revision store but also some speedup (smaller 
data files are processed faster, and there is no need for utf-8 
encoding/decoding on every data line).

Open questions
--------------

Maybe instead of just omitting the origin information it would be better to 
place a one-character marker at the start of the line? This would allow 
different markers for different kinds of lines. For example, a '=' marker 
could be used when the origin is the same as the version id of the block of 
changes, and a '+' marker when the origin is the same as on the preceding 
adjacent line.
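
A possible shape for the marker variant, again only as a Python sketch. The 
names encode_with_markers and decode_with_markers are hypothetical, and the 
sketch makes two assumptions that are not settled above: no real origin is 
literally "=" or "+", and '=' takes priority when both markers would apply.

```python
def encode_with_markers(lines, version_id):
    """Turn (origin, text) pairs into raw lines using '=' and '+' markers.

    Assumptions: no origin is literally "=" or "+", origins contain no
    spaces, and '=' wins when both markers would apply.
    """
    raw_lines = []
    previous = None
    for origin, text in lines:
        if origin == version_id:
            prefix = "="      # origin is the version id of this block
        elif origin == previous:
            prefix = "+"      # origin is the same as on the previous line
        else:
            prefix = origin   # full origin written out
        raw_lines.append(prefix + " " + text)
        previous = origin
    return raw_lines


def decode_with_markers(raw_lines, version_id):
    """Turn marker-encoded raw lines back into (origin, text) pairs."""
    lines = []
    previous = None
    for raw in raw_lines:
        prefix, text = raw.split(" ", 1)
        if prefix == "=":
            origin = version_id
        elif prefix == "+":
            if previous is None:
                raise ValueError("'+' marker on the first line")
            origin = previous
        else:
            origin = prefix
        lines.append((origin, text))
        previous = origin
    return lines
```

An explicit marker costs one byte per line compared to the empty-origin form, 
but it keeps room for further line kinds later.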

It seems a new repository format version number would have to be introduced. 
How would an existing repository be converted to the new format (bzr upgrade?)?

Thoughts, comments?

-- 
Dmitry Vasiliev (dima at hlabs.spb.ru)
     http://hlabs.spb.ru