[RFC] Removing hash prefix in storage vastly improves performance
Matthew D. Fuller
fullermd at over-yonder.net
Fri Aug 18 16:10:11 BST 2006
On Fri, Aug 18, 2006 at 09:31:34AM -0500 I heard the voice of
John Arbash Meinel, and lo! it spake thus:
> Matthew D. Fuller wrote:
> > ports tree:
> > % find ports -type d -print | wc -l
> > 22969
> > % find ports -type f -print | wc -l
> > 84257
> > (obviously, this is the far side of the curve from "put everything in
> > one directory" ;)
>
> Well, there is also the fact that we create 2 files (index + knit)
> for every versioned file. So all these numbers double. And we create
> these files for directories as well, so you would end up with:
> (22969 + 84257) * 2 = 214,452 or >200k files in one directory. Which
> is a little bit much for any filesystem without an index.
If you had the history of the tree, it'd be even worse; the CVS repo
has 31,933 dirs (excluding Attic/) and 145,206 files, which at two
files apiece puts it over 350k ((31,933 + 145,206) * 2 ≈ 354k).  (Of
course, some of those are the result of repo surgery for moves and the
like, but we can imagine a tree really that big.)  At some point you
have to say "Look, if you've got a tree like that, it's going to be
zarking slow", to be sure.
> With a perfect distribution over the 256 hash prefixes, you only
> have 1K files per dir, which is probably a lot more reasonable.
And with the repo layout mirroring the working tree, there'd be more
like 4 files per dir, which is even FAT16 friendly ;)
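Just to make the bucketing concrete, here's a quick sketch of the
scheme (not bzr's actual code or on-disk names; the store/ path and
the file-id names are made up, and `md5` here is FreeBSD's
stdin-reading one; Linux would want `md5sum | cut -d' ' -f1` instead):

#!/bin/sh
# Spread 5000 dummy file-ids across 256 two-hex-digit buckets,
# the way a hash-prefixed store would.
for i in `jot 5000 1 5000`; do
    # first two hex digits of the hash pick one of 256 buckets
    p=`echo -n file-id-${i} | md5 | cut -c1-2`
    mkdir -p store/${p}
    touch store/${p}/file-id-${i}
done

Afterward `ls store | wc -l` should show all 256 buckets, and any one
bucket holds roughly 20 entries (5000/256) rather than 5000 in a
single flat dir; scale that up to the ~214k case above and you get
about 840 per bucket, which is where the ~1K-per-dir figure comes
from.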
Now, for some more useful info. On a memory-backed filesystem, on a
FreeBSD -CURRENT system, I can touch 5000 files into existence in an
empty dir in 31 seconds, and a basic `ls >> /dev/null` takes around
.05 seconds. rm -rf'ing it takes less than a second.
Doing it with 50,000 files takes 5:42 (while fork()ing like a monkey
on speed, so it's not really a pure filesystem test by any stretch),
which is roughly linear to the above. The ls takes around .44
seconds, so that's linear again. An ls -l (so it has to stat() 50k+2
times) burns a whopping ~3.5s. cat'ing a random file from there with
a full path takes almost unmeasurable time (.01, the limit of tcsh's
'time' precision).
% time nice sh -c 'for i in `jot 50000 1 50000`; do touch ${i}; done'
40.708u 277.312s 5:42.44 92.8% 12+134k 3+102292io 0pf+0w
% time nice sh -c '/bin/ls >> /dev/null'
0.328u 0.109s 0:00.44 95.4% 7+1647k 0+0io 0pf+0w
% time nice sh -c '/bin/ls -l >> /dev/null'
1.335u 2.136s 0:03.50 98.8% 5+2428k 0+0io 0pf+0w
% time nice sh -c 'cat /tmp/tdir/`jot -r 1 1 50000`'
0.000u 0.018s 0:00.01 100.0% 88+160k 0+0io 0pf+0w
The dir with 50k files in it shows a size of ~780k (`ls -ld .`). `rm
-rf`'ing it afterward takes a hair over 9 seconds. These timing
results are on a Mobile Duron 800MHz. /tmp is a swap-backed memory
filesystem (with enough RAM that it didn't have to actually store out
to swap during the test). I'd expect results to be no different on a
regular disk partition; the filesystem-level code is the same, the
only difference would probably be more time spent sync'ing stuff out
to a real disk.
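(If anyone wants to repeat this on a Linux box, where `jot` usually
isn't installed, a roughly equivalent run using GNU coreutils' `seq`
and `shuf` would be something like the below: creation, plain ls,
ls -l, a random cat, then the cleanup, in that order; the same
fork()-overhead caveats apply.)

% mkdir /tmp/tdir ; cd /tmp/tdir
% time nice sh -c 'for i in `seq 1 50000`; do touch ${i}; done'
% time nice sh -c '/bin/ls >> /dev/null'
% time nice sh -c '/bin/ls -l >> /dev/null'
% time nice sh -c 'cat /tmp/tdir/`shuf -i 1-50000 -n 1`'
% cd / ; time rm -rf /tmp/tdir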
So, you're fine with a 50k dir on FreeBSD, and the scaling should keep
you going just fine to 100k and beyond unless it hits a big knee.
Still, I'd be a little nervous about saying "Every filesystem works
fine with 100k files in a dir, so we'll go with it".
--
Matthew Fuller (MF4839) | fullermd at over-yonder.net
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.