Rev 4177: Start writing up the index doc in a ReST discussion. in http://bzr.arbash-meinel.com/branches/bzr/jam-integration

John Arbash Meinel john at arbash-meinel.com
Fri Mar 20 21:18:11 GMT 2009


At http://bzr.arbash-meinel.com/branches/bzr/jam-integration

------------------------------------------------------------
revno: 4177
revision-id: john at arbash-meinel.com-20090320211501-pvg206m44mwckrec
parent: pqm at pqm.ubuntu.com-20090320162502-09zcxxom3xmdqhfa
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: jam-integration
timestamp: Fri 2009-03-20 16:15:01 -0500
message:
  Start writing up the index doc in a ReST discussion.
  Include even more analysis around the number of bytes we need.
  Some interesting results.
-------------- next part --------------
=== added file 'doc/developers/improved_chk_index.txt'
--- a/doc/developers/improved_chk_index.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/improved_chk_index.txt	2009-03-20 21:15:01 +0000
@@ -0,0 +1,262 @@
+===================
+CHK Optimized index
+===================
+
+Our current btree style index is nice as a general index, but it is not optimal
+for Content-Hash-Key based content. With CHK, the keys themselves are hashes,
+which means they are randomly distributed (similar keys do not refer to
+similar content), and they do not compress well. However, we can create an
+index which takes advantage of these properties, rather than suffering from
+them. Even further, there are specific advantages provided by
+``groupcompress``, because of how individual items are clustered together.
+
+Btree indexes also rely on zlib compression to get their compact size, and
+further have to try hard to fit things into a compressed 4k page. When the key
+is a sha1 hash, we would not expect to get better than 20 bytes per key, which
+is the same size as the binary representation of the hash. This means we could
+write an index format that gets approximately the same on-disk size, without
+having the overhead of ``zlib.decompress``. Some thought would still need to
+be put into how to efficiently access these records from a remote machine.
+
+
+Required information
+====================
+For a given groupcompress record, we need to know the offset and length of the
+compressed group in the .pack file, and the start and end of the content inside
+the uncompressed group. The absolute minimum is slightly less, but this is a
+good starting point.
+
+1. A sha1 hash in its binary (not hex) form is 20 bytes.
+
+2. Pack files can be > 4GB, so we could use an 8-byte (64-bit) pointer, or we
+   could store a 5-byte pointer for a cap at 1TB. 8 bytes still seems like
+   overkill, even if it is the natural next size up.
+
+3. An individual group would never be longer than 2^32 bytes, but groups will
+   often be bigger than 2^16. 3 bytes for length (16MB) would be the minimum
+   safe size, and may not be safe if we expand groups for large content (like
+   ISOs). So 4 bytes for group length is probably necessary.
+
+4. A given start offset has to fit in the group, so another 4-bytes.
+
+5. The uncompressed length of a record is based on the original content size,
+   so 4 bytes is expected as well.
+
+6. That leaves us with 20+8+4+4+4 = 40 bytes per record (see the packing
+   sketch after this list). At the moment, btree compression gives us closer
+   to 38.5 bytes per record. We don't have perfect compression, but we also
+   don't have >4GB pack files (and if we did, the first 4GB are all under the
+   2^32 barrier :).
+
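+A minimal sketch of packing such a 40-byte record with Python's ``struct``
+module (the field names are purely illustrative, not bzrlib API)::
+
+    import struct
+
+    # 20-byte raw sha1, 8-byte group start in the .pack file, then 4 bytes
+    # each for group length, start offset inside the uncompressed group, and
+    # uncompressed record length.
+    RECORD = struct.Struct('>20sQIII')
+    assert RECORD.size == 40
+
+    def pack_record(sha1_bin, group_start, group_len, start, length):
+        return RECORD.pack(sha1_bin, group_start, group_len, start, length)
+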
+If we wanted to go back to the *minimal* amount of data that we would need to
+store:
+
+1. `Partial hash`_. 10 bytes of a sha hash are generally going to be more
+   than enough to fully determine the entry. We could support some amount of
+   collision in an index record, in exchange for resolving it inside the
+   content. At least in theory, we don't *have* to record the whole 20-bytes
+   for the sha1 hash.
+
+2. We could record the start and length of each group in a separate location,
+   and then have each record reference the group by an 'offset'. This is because
+   we expect to have many records in the same group (something like 10k or so,
+   though we've fit >64k under some circumstances). At a minimum, we have one
+   record per group so we have to store at least one reference anyway. So the
+   maximum overhead is just the size and cost of the dereference (and it will
+   normally be much, much better than that).
+
+3. If a group reference is an 8-byte start, and a 4-byte length, and we have 1M
+   keys, but get at least 1k records per group, then we would have 1k groups.
+   So we would need 12kB to record all the group offsets, and then each
+   individual record would only need a 2-byte group number, rather than a
+   12-byte reference.  We could be safe with a 4-byte group number, but if
+   each group is ~1MB, 64k groups is 64GB. We can start with 2-byte, but leave
+   room in the header info to indicate if we have more than 64k group entries.
+   Also, current grouping creates groups of 4MB each, so it would take 256GB
+   of raw data to create 64k groups. And our current chk pages compress down
+   to less than 100 bytes each (the average is closer to 40 bytes), so 256GB
+   of raw data would amount to roughly 2.7 billion CHK records. (This will
+   change if we start to use CHK for text records, as they do not compress
+   down as small.)
+
+4. We want to know the start-and-length of a record in the decompressed
+   stream. This could actually be moved into a mini-index inside the group
+   itself. Initial testing showed that storing an expanded "key =>
+   start,offset" consumed a considerable amount of compressed space. (about
+   30% of final size was just these internal indices.) However, we could move
+   to a pure "record 1 is at location 10-20", and then our external index
+   would just have a single 'group entry number'.
+
+   There are other internal forces that would give a natural cap of 64k
+   entries per group. So without much loss of generality, we could probably
+   get away with a 2-byte 'group entry' number (which then generates an
+   8-byte offset + endpoint as a header in the group itself).
+
+5. So for 1M keys, an ideal chk+group index would be:
+    
+    a. 10-byte hash prefix
+
+    b. 2-byte group number
+
+    c. 2-byte entry in group number
+
+    d. a separate lookup of 12-byte group number to offset + length
+
+    e. a 256-way fan out top-level node consuming the first byte of the key
+       (as a further savings, this byte need not be repeated in the final
+       index record), with each fan-out entry referencing the location in the
+       index of the actual records via a 4-byte pointer. (It should be fine
+       to assume an index won't grow above 4GB, but they can already get
+       above 16MB.)
+
+    f. for 1M keys, we probably want to consume another byte in the mini-index,
+       costing 256*256=64k records at 4 bytes each. Having multiple
+       mini-indexes does not scale particularly well, because the index is
+       really just a 4-byte pointer for where to look for more information. So
+       instead, we would just have a single mini-index. This could then have a
+       variable number of bits-per-prefix that we are consuming. 8 bits (1
+       byte) would be a 256 entry index 1024 bytes long. 16 bits (2 bytes)
+       would be a 64k entry index of 256KiB (0.25MiB). If we wanted a simple
+       64KiB page, we would want 16k entries, which would be 14 bits. 12 bits
+       (1.5 bytes), at 4k entries and 16KiB, might be one of the better
+       tradeoffs.
+
+So the max size for the optimal groupcompress+chk index with 1M entries would be::
+
+  14 * 1M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 15 MiB
+
+That 15MiB breaks down as 14MiB for the actual entries, 0.75MiB for the group
+records, and 0.25MiB for the mini index.
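+
+A minimal sketch of those fixed-size layouts and the size arithmetic (field
+names are purely illustrative)::
+
+    import struct
+
+    # 10-byte sha1 prefix, 2-byte group number, 2-byte entry-in-group number
+    ENTRY = struct.Struct('>10sHH')   # 14 bytes
+    # 8-byte group start in the .pack file, 4-byte compressed group length
+    GROUP = struct.Struct('>QI')      # 12 bytes
+
+    entries = 1024 * 1024             # ~1M keys
+    groups = mini_slots = 64 * 1024   # worst case with 2-byte group numbers
+    total = (entries * ENTRY.size     # 14 MiB of entries
+             + groups * GROUP.size    # 0.75 MiB of group records
+             + mini_slots * 4)        # 0.25 MiB of mini-index pointers
+    assert total == 15 * 1024 * 1024  # 15 MiB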
+
+1. Looking up a key would involve:
+
+   a. Read ``XX`` bytes to get the header, and various config for the index.
+      Such as length of the group records, length of mini index, etc.
+
+   b. Find the offset in the mini index for the first YY bits of the key. Read
+      the 4-byte pointer stored at that location (which may already be in
+      memory if we pre-read a minimum amount of the index).
+
+   c. Jump to the location indicated, and read enough bytes to find the
+      correct 14-byte record. The mini-index only indicates the start of
+      records that start with the given prefix. A 64k-way index resolves 1M
+      records down to 16 possibilities, so at 14 bytes each, reading them all
+      would cost at most 224 bytes. If we change that to a 4k-way index, that
+      goes up to 3584B = 3.5KiB. 16k-way is 896 bytes.
+
+   d. Determine the offset for the group entry, which is the known ``start of
+      groups`` location + 12B * group number. Read its 12-byte record.
+
+   e. Switch to the .pack file, and read the group header to determine where in
+      the stream the given record exists. At this point, you have enough
+      information to read the entire group block. For local ops, you could
+      read only enough to get the header, and then only enough to decompress
+      just the content you want to get at.
+      
+      Using an offset, you also don't need to decode the entire group header.
+      If we assume that things are stored in fixed-size records, you can jump
+      to exactly the entry that you care about, and read its 8-byte
+      (start,length in uncompressed) info.  If we wanted more redundancy we
+      could store the 20-byte hash, but the content can verify itself.
+      
+   f. If the size of these mini headers becomes critical (8 bytes per record
+      is 8% overhead for 100 byte records), we could also compress this mini
+      header. Alternatively, we could set the 'width' of records, so instead
+      of always using a 4-byte wide offset + length, we could switch to 2-byte
+      wide (though groups are likely to always be >64KiB long uncompressed).
+      For minimum size without compression, we could store only the 4-byte
+      length of each node. Then to compute an offset, you have to sum all
+      previous nodes. This should still be very cheap in compiled code, and
+      would also have the advantage that it would be highly compressible
+      itself. (Most nodes are going to have a length that fits in 1-2 bytes.)
+      An alternative form would be base-128 encoding, sketched after this
+      list. (If the MSB is set, then the next byte needs to be added to the
+      current value shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but
+      stores lengths up to 127B in 1 byte, and up to 2MiB in 3 bytes. If we
+      only stored 64k entries in a group, we would expect this header to be
+      <128KiB long, which should be fast to process.
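+
+A minimal sketch of that base-128 (varint) encoding, for illustration only::
+
+    def encode_base128(value):
+        """Encode a non-negative integer, 7 bits per byte, MSB = continue."""
+        chunks = []
+        while True:
+            byte = value & 0x7F
+            value >>= 7
+            if value:
+                chunks.append(byte | 0x80)
+            else:
+                chunks.append(byte)
+                return bytes(chunks)
+
+    def decode_base128(data, offset=0):
+        """Return (value, new_offset)."""
+        value = shift = 0
+        while True:
+            byte = data[offset]
+            offset += 1
+            value |= (byte & 0x7F) << shift
+            if not byte & 0x80:
+                return value, offset
+            shift += 7
+
+    assert len(encode_base128(127)) == 1
+    assert len(encode_base128(2 * 1024 * 1024 - 1)) == 3
+    assert len(encode_base128(4 * 1024 * 1024 * 1024)) == 5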
+
+
+Partial Hash
+============
+The size of the index is dominated by the individual entries (the 1M records).
+Saving 1 byte there saves 1MB overall, which is the same as the group entries
+and mini index combined. If we can change the index so that it can handle
+collisions gracefully (have multiple records for a given collision), then we
+can shrink the number of bytes we need overall. Also, if we aren't going to
+put the full 20-bytes into the index, then some form of graceful handling of
+collisions is recommended anyway.
+
+The current structure does this just fine, in that the mini-index dereferences
+you to a "list" of records that start with that prefix. It is assumed that
+those would be sorted, but we could easily have multiple records. To resolve
+the exact record, you can read both records, and compute the sha1 to decide
+between them. This has performance implications, as you are now decoding 2x
+the records to get at one.
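+
+A minimal sketch of that resolution step, assuming a hypothetical
+``read_record(location)`` that returns a candidate's content::
+
+    import hashlib
+
+    def resolve(full_sha1, candidate_locations, read_record):
+        """Pick the record whose content actually hashes to full_sha1.
+
+        candidate_locations are the entries whose stored 10-byte prefix
+        matched full_sha1[:10]; usually one, more only on a prefix collision.
+        """
+        for location in candidate_locations:
+            content = read_record(location)
+            if hashlib.sha1(content).digest() == full_sha1:
+                return content
+        raise KeyError("no record matches the requested sha1")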
+
+The chance of any of ``n`` texts colliding, given a hash space of size ``H``,
+is generally given as::
+
+     1 - e^(-n^2 / (2H))
+
+Or if you use ``H = 2^h``, where ``h`` is the number of bits::
+
+     1 - e^(-n^2 / 2^(h+1))
+
+For 1M keys and 4-bytes (32-bit), the chance of collision is for all intents
+and purposes 100%.  Rewriting the equation to give the number of bits (``h``)
+needed versus the number of entries (``n``) and the desired collision rate
+(``epsilon``)::
+
+    h = log_2(-n^2 / ln(1-epsilon)) - 1
+
+The denominator ``ln(1-epsilon)`` is approximately ``-epsilon`` for small
+values (even at epsilon=0.1, ln(0.9) == -0.105, and we are assuming we want a
+much lower chance of collision than 10%). So we have::
+
+    h = log_2(n^2/epsilon) - 1 = 2 log_2(n) - log_2(epsilon) - 1
+
+Given that ``epsilon`` will often be very small and ``n`` very large, it can
+be more convenient to transform it into ``epsilon = 10^-E`` and ``n = 10^N``,
+which gives us::
+
+    h = 2 log_2(10^N) - log_2(10^-E) - 1
+    h = log_2(10) (2N + E) - 1
+    h ~ 3.3 (2N + E) - 1
+
+Or in terms of the number of bytes ``H``, where ``h = 8H``::
+
+    H ~ 0.4 (2N + E)
+
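+A small sketch of these formulas, checking the byte approximation against the
+exact expression::
+
+    import math
+
+    def collision_chance(n, h):
+        """Chance of any collision among n keys with h bits of hash."""
+        return 1 - math.exp(-n**2 / 2.0**(h + 1))
+
+    def bits_needed(n, epsilon):
+        """Bits of hash needed for n keys at collision chance epsilon."""
+        return math.log(n**2 / epsilon, 2) - 1
+
+    # For 1M keys and a 4-byte (32-bit) hash, collision is nearly certain.
+    assert collision_chance(10**6, 32) > 0.999999
+
+    # ~0.4 * (2N + E) bytes: 1M keys (N=6) at a 1-in-a-million chance (E=6).
+    exact = bits_needed(10**6, 10**-6) / 8     # ~7.3 bytes
+    approx = 0.4 * (2 * 6 + 6)                 # 7.2 bytes
+    assert abs(exact - approx) < 0.2
+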
+This has a nice intuitive interpretation. For every order of magnitude we want
+to increase the number of keys (at the same chance of collision), we need ~1
+byte (0.8), and for every two orders of magnitude we want to reduce the chance
+of collision we need the same extra bytes. So with 8 bytes, you have 20 orders
+of magnitude to work with: 10^10 keys with a near-guaranteed collision, or 10
+keys with roughly a 10^-18 chance of collision.
+
+Putting this in a different form, we could make ``epsilon == 1/n``. This gives
+us an interesting simplified form::
+
+    h = log_2(n^3) - 1 = 3 log_2(n) - 1
+
+Writing ``n`` as ``10^N``, and using bytes ``H`` where ``h = 8H``::
+
+    h = 3 N log_2(10) - 1 =~ 10 N - 1
+    H ~ 1.25 N
+
+So to have a one in a million chance of collision using 1 million keys, you
+need ~59 bits, or slightly more than 7 bytes. For 10 million keys and a one in
+10 million chance of any of them colliding, you can use 9 (8.6) bytes. With 10
+bytes, we have a one in a 100M chance of getting a collision in 100M keys
+(substituting back, the original equation says the chance of collision is 4e-9
+for 100M keys when using 10 bytes.)
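+
+Checking those numbers against the exact formulas::
+
+    import math
+
+    # 1M keys at a one-in-a-million collision chance: ~59 bits, >7 bytes
+    assert round(3 * math.log(10**6, 2) - 1) == 59
+
+    # 10M keys at a one-in-10-million chance: ~8.6 bytes
+    assert round((3 * math.log(10**7, 2) - 1) / 8, 1) == 8.6
+
+    # 100M keys with a 10-byte (80-bit) prefix: ~4e-9 chance of collision
+    chance = 1 - math.exp(-(10**8)**2 / 2.0**81)
+    assert 4.0e-9 < chance < 4.2e-9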
+
+Given that the only cost for a collision is reading a second page and ensuring
+the sha hash actually matches, we could use a fairly "high" collision rate. A
+chance of 1 in 1000 that you will collide somewhere in an index with 1M keys
+is certainly acceptable. (Note that this isn't a 1 in 1000 chance that any
+given key is a collision, but a 1 in 1000 chance that the whole index contains
+a *single* collision.) Using a collision chance of 10^-3, and number of keys
+10^6, means we need (12+3)*0.4 = 6 bytes.
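+
+A quick check of that last figure against the exact bit formula::
+
+    import math
+
+    # 10^6 keys at a 10^-3 chance of any collision in the index
+    bits = math.log(10**12 / 10**-3, 2) - 1   # ~48.8 bits
+    assert 5.9 < bits / 8 < 6.2               # roughly the 6 bytes above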
+
+.. 
+  vim: ft=rst tw=78 ai


