readahead - from a tar file

Rob Ubuntu Linux rob.ubuntu.linux at
Tue Oct 30 00:31:37 GMT 2007

Apologies if the threading is broken; having just joined the list, I'm
replying from the web archive rather than directly to a received list
email.

This smells of premature optimisation, and trying to solve an issue in
the wrong place, by a one-off hack, with great code complexity, so I
don't really like it.

That said, there's something in both lazy evaluation and cache warming
schemes, although paradoxically they seem totally opposed.  So I've
ended up thinking far too much and too long on this.

On 10/22/07, Scott James Remnant <scott at> wrote:

> On Mon, 2007-10-22 at 13:40 -0700, Shawn Rutledge wrote:
> > Another idea I had over the weekend to speed up boot times is also
> > related to reducing disk seek time.

What is the real problem?

Is it seek time?  Or is it lots of synchronous reads and serialised
processes that spend their time waiting on disk reads and re-parsing
files in indirect ways, plus general neglect of "once only" startup
code by hard-pressed application developers, who are struggling to
cope with overwhelming complexity and with mindless, incompatible
environment modifications made by distro and OS developers?

How long will it be before genuine Random Access solid state memory
devices replace noisy, unreliable, slow, delicately coated iron with
fragile, precision-engineered moving parts?

You could probably even use some kind of 'snapshot' with a specialised
FS for flash memory devices.  Reversing the usual CacheFS concept, a
smallish read-mostly disk filesystem containing config files, like
/etc, works opposite to read-through and write-back caching:
modifications occurring on disk invalidate the cache entries, so
reads bypass the cache until it is updated later in the background,
on a quiescent, stable(ish) system.

But if disk seek time is genuinely the issue, then simply partitioning
the disk will improve things!  Yet distros all seem to have moved
towards huge "simple" file system layouts.  Moving /var, /usr and
/opt out of the root file system on my recent Gutsy install leaves:

root at elm:/# du -x --max-depth=1
4764    ./bin
0       ./dev
8944    ./etc
118457  ./lib
38      ./tmp
0       ./sys
0       ./var
0       ./usr
1       ./boot
0       ./home
0       ./proc
6953    ./sbin
5709    ./root
144867  .

In /lib, 105MB seems to be purely generic kernel-related files under
{modules,firmware,linux-restricted-modules}/2.6.22-14-generic, which I
assume have to share the '/' filesystem at present.

root at elm:/# df .
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda5               514028    175072    338956  35% /

So /etc, /bin, /sbin and /lib are confined to 1/2 GB, sandwiched
between /usr and /var, with /boot on the first 4 cylinders of the disk.

root at elm:/# sfdisk -l /dev/hda

Disk /dev/hda: 19846 cylinders, 16 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Warning: The partition table looks like it was made
  for C/H/S=*/255/63 (instead of 19846/16/63).
For this listing I'll assume that geometry.
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/hda1   *      0+      3       4-     32098+  83  Linux
/dev/hda2          4     395     392    3148740   83  Linux
/dev/hda3        396    1244     849    6819592+   f  W95 Ext'd (LBA)
/dev/hda4          0       -       0          0    0  Empty
/dev/hda5        396+    459      64-    514048+  83  Linux

Perhaps you can experiment by benchmarking with a new disk set up as
a huge / system, then repartitioning and rebuilding it into a
seek-reduced setup, to have a valid comparison.  I might just be able
to do it on an old dual Celeron BP6 box with 2x IBM disks, which I've
mirrored in the past; mirroring would be another possibility for
speeding up a seek-bound system.

> > I see that ubuntu already starts
> > a "readahead" process before much else, to preload the necessary files
> > into RAM.  This is an excellent idea, but those files are still
> > potentially located all over the disk, right?

Actually, as with dynamically linked shared libraries, once you've
pre-loaded them they're either in the traditional buffer cache, or
they've been mapped into RAM by the VM, which can later re-use clean
pages.  The process of preloading could in principle use asynchronous
reads, so the I/O scheduler can maximise the benefits of its elevator.
There are kernel hooks now for "Nice I/O" by background batch
processes, to minimise the impact on "interactive jobs".
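
As a rough illustration of the asynchronous preloading idea, here is a
minimal Python sketch.  It assumes a Linux-like system where
os.posix_fadvise() is available; the function name hint_readahead is my
own invention, not part of the existing readahead tool.  It only hints
that the files will be needed soon and does not wait for the data, and
it does not show the "Nice I/O" priority side at all.

```python
import os


def hint_readahead(paths):
    """Hint that each file will be needed soon, without waiting for the data.

    Returns the paths that could be opened. hint_readahead is a
    hypothetical helper name used only for this sketch.
    """
    opened = []
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # missing or unreadable: nothing to preload
        opened.append(path)
        try:
            if hasattr(os, "posix_fadvise"):
                # offset 0 with length 0 covers the whole file
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        except OSError:
            pass  # the hint is advisory, so a failure is not fatal
        finally:
            os.close(fd)
    return opened
```

Because the call returns before the data arrives, many hints can be
queued at once, leaving the ordering of the actual reads to the I/O
scheduler.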

The problem is getting the overall startup throughput up, but without
causing read starvation or waiting on a "heated cache", which hinders
the attempt at parallelisation.

Something like bringing up a DHCP network interface takes an aeon
compared to interpreting scripts or parsing files already loaded in
memory, so anticipatory I/O ought to have significant benefits in
keeping more processes runnable.

> > So seek time may be the
> > dominant factor in that process.  What if it instead read the same
> > files from a cache, in the form of an uncompressed tar file?  Then it
> > would be a completely sequential, contiguous read.  And each time a
> > file is completely loaded, an event could be fired.
> >
> This would require kernel changes, since it would have to know that the
> file being read is equivalent to another on the disk -- and you'd
> actually need to know things like the inode/block numbers of the files
> you're replacing!

The basic idea of unpacking a tgz is ugly and horrid in practice, for
the reasons suggested.

However, maybe an "initrd"-style idea, decompressing a known
environment into a temporary memory file system, could solve some
bootstrapping problems.

For instance, running certain infrastructure programs chroot-ed in a
dynamically linked mini-environment, before local and network disks
are mounted.  The state they generate would be saved in memory-mapped
files or dumped out into a memory file system (e.g. tmpfs), so Upstart
can have them reload it from the memory file when restarting them in
the "maxi-environment" in a later phase of the boot process.

> Since you can't guarantee they won't change while the computer is off,
> you need to double-check on boot.  So you lose the efficiency you were aiming for.
> The most interesting thing is to actually reorder the filesystem so that
> the blocks you need are always at the front and always sequential.

Theoretically you could actually use a pre-load library stub to
'trace' each start-up execution, with directories accessed and files
opened (or executed) saved for next time.  Then a process could
start on boot purely to warm up the disk and page cache.  As it
loads the actual data on disk, not a copy, config changes could only
slow the boot down, not cause failures: a new file would simply be
ignored, and changed file contents would be pre-read seamlessly.  It
is a far more robust approach.
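
The replay half of that could look something like the Python sketch
below.  The trace format (one path per line) and the name
warm_from_trace are assumptions of mine for illustration; the tracing
stub itself is not shown.  Files that have vanished since the trace was
taken are skipped, and files whose contents have changed are simply
read as they are now.

```python
def warm_from_trace(trace_lines, chunk_size=1 << 20):
    """Read every file named in a recorded trace to pull it into the page cache.

    trace_lines is an iterable of paths, one per line (an assumed format).
    Returns (warmed, skipped) lists of paths.
    """
    warmed, skipped = [], []
    for line in trace_lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines in the trace
        try:
            with open(path, "rb") as f:
                # The data is discarded; the read itself fills the cache.
                while f.read(chunk_size):
                    pass
            warmed.append(path)
        except OSError:
            # Removed, unreadable, or a directory: never a boot failure.
            skipped.append(path)
    return warmed, skipped
```

A stale trace then only costs time: the worst case is reading files
nobody ends up needing, or missing files that are new since the last
boot.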

One non-blocking I/O thread runs per disk, queuing up large batches
of requests anticipating the future demand of currently blocking (or
even yet to be executed) processes, coordinated by a Master
supervisory thread responsible for scheduling and thrash control.
This pushes the responsibility onto the block I/O scheduler to
optimise for throughput while avoiding read starvation of the real
processes.  Asynchronous I/O is discussed at but AIO
itself, I think, has suffered from "separate path through the kernel"
syndrome.  Jens Axboe has worked on a rework of the block I/O layer in
the kernel (interview at ) and there are
some interesting Zero Copy ideas in there too (splicing).

The drawback would be possible contention with other processes, should
they catch up rather than block on some slow system call.  Like
any pre-load scheme it relies on a memory-rich environment, so that
pages aren't thrown away before they're used, and working set size
issues are avoided.
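
To make the one-thread-per-disk layout a bit more concrete, here is a
simplified Python sketch.  It is only an approximation of what I
describe above: it groups files by st_dev, which identifies a
filesystem rather than a physical disk, it uses ordinary blocking reads
in each thread rather than non-blocking I/O, and the "Master" does
nothing cleverer than start the threads and wait, so there is no thrash
control.  All the function names are made up for the example.

```python
import os
import threading
from collections import defaultdict


def group_by_device(paths):
    """Group paths by st_dev (the filesystem's device ID, not the physical disk)."""
    groups = defaultdict(list)
    for path in paths:
        try:
            groups[os.stat(path).st_dev].append(path)
        except OSError:
            pass  # ignore paths that no longer exist
    return dict(groups)


def read_all(paths, done, lock):
    """Worker: read each file in full and record the ones that succeeded."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                while f.read(1 << 20):
                    pass  # data is discarded; the read warms the cache
        except OSError:
            continue
        with lock:
            done.append(path)


def warm_per_device(paths):
    """Start one reader thread per device, wait for all, return files read."""
    done, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=read_all, args=(group, done, lock))
        for group in group_by_device(paths).values()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

Keeping one reader per device means requests for different devices can
proceed in parallel, while each device sees a single stream of requests
that its scheduler is free to reorder.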

More information about the upstart-devel mailing list