Optimized kernel builds: the straight dope

Matt Zimmerman mdz at ubuntu.com
Tue Aug 15 01:33:09 BST 2006

== Background ==

You've all seen our kernel lineup before.  In the beginning, on i386 we had
linux-386, linux-686, linux-k7, linux-686-smp, linux-k7-smp.  Later, we
added specialized server kernels, and folded SMP support into linux-686 and
linux-k7.

Since Linux has always offered CPU-specific optimizations, it's been taken
for granted that this offered enough of a performance benefit to make all of
this maintenance effort worthwhile.  A lot has happened to the kernel since
the early days, though, and for some time, it has been capable of loading
these optimizations at runtime.  Even when you use the -386 kernel, you get
the benefit of many CPU-specific optimizations automatically.  This is great
news for integrators, like Ubuntu, because we want to provide everyone with
the best experience out of the box, and as you know, there is only room on
the CD for one kernel, not a stack of redundant ones.  Many users spend time and
bandwidth quotas downloading these optimized kernels in hopes of squeezing
the most performance out of their hardware.

This raised the question: do we still need these old-fashioned builds?
Experiments have shown that users who are told that their system will run
faster will say that they "feel" faster whether there is a measurable
difference or not.  For fun, try it with an unsuspecting test subject: tell
them that you'll "optimize" their system to make it a little bit faster, and
make some do-nothing changes to it, then see if they notice a difference.
The fact is, our observations of performance are highly subjective, which is
why we need to rely on hard data.

== Data ==

Enter Ben Collins, our kernel team lead, who has put together a performance
test to answer that question, covering both the i386 and amd64
architectures.  The results are attached in the form of an email from him.
Following that is a README which explains how to interpret the numerical
results.

No benchmark says it all.  They're all biased toward specific workloads, and
very few users run a homogeneous workload, especially not desktop users.
This particular benchmark attempts to measure system responsiveness, a key
factor in overall performance (real and perceived) for desktop workloads,
which are largely interactive.

== Conclusion ==

Having read over it, I think the numbers are fairly compelling.  The
difference in performance between -386 and -686 is insignificant; the
measurements are all within a reasonable error range, and within that range,
-686 was slower as often as it was faster.

My recommendation is that we phase out these additional kernel builds, which
I expect will save us a great deal of developer time, buildd time, archive
storage and bandwidth.

I'm interested to hear (objective, reasoned) feedback on this data and my
conclusions from the members of this list.

 - mdz
-------------- next part --------------
An embedded message was scrubbed...
From: Ben Collins <bcollins at ubuntu.com>
Subject: Benchmarks between generic and cpu specific kernel images
Date: Mon, 14 Aug 2006 19:58:22 -0400
Size: 8545
Url: https://lists.ubuntu.com/archives/ubuntu-devel/attachments/20060814/82176505/attachment-0001.eml 
-------------- next part --------------
 * Copyright 2003, Aggelos Economopoulos, all rights reserved.
 * Author:  Aggelos Economopoulos <aoiko at cc.ece.ntua.gr>
 * Based on original work by: Con Kolivas <contest at kolivas.org>
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Contest v0.61


This program is designed to test system responsiveness by running kernel
compilation under a number of different load conditions. It is designed to
compare different kernels, not different machines. It uses real workloads
you'd expect to find for short periods in every day machines but sustains
them for the duration of a kernel compile to increase the signal to noise
ratio.

Some of the load conditions are applications originally sourced from
B. Matthews' irman tool, also licensed under the GPL, and modified to suit.

Null load - No load
Cache Run - A no load run directly after a previous run, without memory
	flushing
Process load - Fork and exec N processes, connected in a
	unidirectional ring by pipes.  Insert M<<N chunks of data into
	the ring and pass them around (nice IPC/switch test)
Memory load - Repeatedly reference 110% of RAM in a pattern
	designed to cause cache misses
IO Load - copies /dev/zero continually to a file the size of
	the physical memory.
IO Other - same as IO load at a different location
Read Load - Reads a file the size of the physical memory
List Load - Lists the entire file system (ls -lRa /)
CTar Load - Repeatedly creates a tar of the kernel tree
XTar Load - Repeatedly extracts a tar of the kernel tree
Dbench Load - Runs dbench N (where N is 16*num_cpus) repeatedly
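
The "Process load" above can be illustrated with a small C sketch, assuming
the obvious implementation: N processes joined into a unidirectional ring of
pipes with M << N chunks circulating.  The names, constants, and the hop
counter are illustrative only (the counter just makes the demo terminate
deterministically); this is not contest's actual source, where the ring runs
for the duration of the kernel compile.

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define N 8          /* processes in the ring (illustrative) */
#define M 2          /* chunks in flight (M << N) */
#define CHUNK 64     /* bytes per chunk; the first sizeof(int) hold hops */
#define HOPS 1000    /* hops each chunk makes before retiring */

/* Read exactly n bytes; pipe writes below PIPE_BUF are atomic, but a
 * loop keeps the sketch robust. */
static void read_full(int fd, char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r <= 0)
            _exit(0);                    /* ring torn down */
        got += (size_t)r;
    }
}

/* Build the ring, inject M chunks, and return how many chunks
 * completed all their hops. */
int run_ring(void)
{
    int ring[N][2], done[2];
    pid_t kids[N];
    char buf[CHUNK] = {0};
    int hops = HOPS, retired = 0;

    for (int i = 0; i < N; i++)
        pipe(ring[i]);
    pipe(done);

    for (int i = 0; i < N; i++) {
        if ((kids[i] = fork()) == 0) {
            /* Child i: read from ring[i], write to ring[(i+1)%N]. */
            int in = ring[i][0], out = ring[(i + 1) % N][1];
            for (int j = 0; j < N; j++) {        /* close unused ends */
                if (j != i) close(ring[j][0]);
                if (j != (i + 1) % N) close(ring[j][1]);
            }
            close(done[0]);
            for (;;) {
                int h;
                read_full(in, buf, CHUNK);
                memcpy(&h, buf, sizeof h);
                if (--h == 0) {                  /* chunk retires */
                    write(done[1], buf, CHUNK);
                    continue;
                }
                memcpy(buf, &h, sizeof h);       /* pass it on */
                write(out, buf, CHUNK);
            }
        }
    }

    /* Parent: inject M chunks, collect M retirements, tear down. */
    close(done[1]);
    memcpy(buf, &hops, sizeof hops);
    for (int m = 0; m < M; m++)
        write(ring[0][1], buf, CHUNK);
    for (int m = 0; m < M; m++) {
        read_full(done[0], buf, CHUNK);
        retired++;
    }
    for (int i = 0; i < N; i++)
        kill(kids[i], SIGTERM);
    for (int i = 0; i < N; i++)
        waitpid(kids[i], NULL, 0);
    return retired;
}
```

Every hop costs a context switch plus a small pipe transfer, which is why
this kind of ring makes a good IPC/scheduler stress while the kernel compile
runs alongside it.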


Extract the tarball.
cd to the contest directory and type

make install

To override install dir, simply pass the INSTPATH variable

make INSTPATH=/path/to/install/dir install
(installs in /usr/bin by default)


Download a linux 2.4.19 kernel source tree, and extract it where you
wish to be running your benchmarks. Copy the contest.config file in
this tarball to that directory as .config. Download dbench and compile
it (optional). Copy dbench somewhere into your PATH and copy the .txt
files from the dbench source into your kernel test tree. Then:

make oldconfig
make dep

Reboot into the kernel you wish to test, disabling any apm in the bios.
At the lilo prompt, preferably boot into single user mode:

linux single vga=normal apm=off

Then cd to the kernel source directory and type (from the man page):

     contest [-cdr] [-k name] [-t file] -o file [-n nrruns] [load...]

     These are the commonly used options:

     -b      Print a progress bar if information from previous runs is avail-
             able. Do not use this for remote logins / testing as it will
             taint the results.

     -k NAME
             Use the kernel name NAME in the report generated by contest
             (default is the internal kernel name from uname.)

     load...
             The name of the load you wish to run - can specify multiple
             loads.  Available loads are: no_load, cacherun, process_load,
             ctar_load, xtar_load, io_load, read_load, list_load, mem_load
             and dbench_load (default all loads.)

     -n N    Number of times to run each load in contest to generate useful
             averages (default 3.)

     -r      Generate report on all the cumulative log files in the current
             directory.

     -t FILE
             Use the file FILE as the temporary file to use for io loads - can
             specify a full path such as /tmp/tmpfile (default ./dump.)
     -o FILE
             target file for io_other (required if io_other is specified).

     You shouldn't need to touch these.
     -c      Assume a cold cache. This bypasses the memory flushing routines
             that occur between different loads in contest that are normally
             used to minimise the effects of caching of data from previous
             compiles. It is of use only to test contest functionality and
             will invalidate any testing you do.

     -d      Print debugging information if contest was compiled with debug-
             ging enabled (will flood the console).

     -p      Don't clean up after each load is run - speeds up the running of
             contest but can use up massive amounts of disk space.

     * contest must be run in the top directory of a 2.4.19 linux kernel tree,
     and the contest.config file from the contest source should be copied to
     that directory as .config.

     * contest assumes the existence of the common utilities, cc and dbench in
     your PATH

     * The version of cc used must not vary between benchmarks.

     * dbench_load requires both the existence of dbench in your PATH and the
     presence of the .txt files from the dbench source in your testbed tree.

     * contest should be run by itself in single user mode or as the sole init
     process to exclude the effects of any other loads on the system.

     * contest requires large amounts of spare disk space to perform
     dbench_load, the tar loads (ctar_load and xtar_load), read_load, 
     io_load and io_other.

     * io_load requires as much spare disk space as the physical RAM the test
     machine has. 
     * io_other needs the -o parameter to tell it where to do the other
     io load. Running the io load on a separate hard disk tests different
     aspects of the kernel.

     * cacherun is only meaningful if run in combination with no_load.

     * must be run with sufficient permissions to execute swapon/swapoff.
     * The contest name is a play on words due to the Author's name being Con.
     * Never do make dep again after running contest as it will change the
     results of subsequent runs.
     * Changing the vga mode from framebuffer to non-fb and vice versa changes
     the results.

There are four numbers generated for each load: the time taken to compile
the kernel, the average cpu percentage used to do the compile, the number of
times the load performed its task, and the cpu% the load used while running.
The kernelname.log file contains more detailed information.

A report generated with -r gives the most useful information. It sorts the
kernels, lays out the information and creates a new result called
ratio which is simply the ratio of the time result compared to the noload
result for the kernel.
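
The ratio column can be reproduced by hand; a one-line sketch (the timings
used in the example below are made up for illustration):

```c
/* ratio = compile time under the given load / compile time under no_load,
 * for the same kernel.  A ratio near 1.0 means the load barely slowed
 * the compile down. */
double contest_ratio(double load_time, double noload_time)
{
    return load_time / noload_time;
}
```

For example, a no_load compile of 67 seconds and an io_load compile of 92
seconds would give a ratio of roughly 1.37.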
The lower the time (and ratio) the better, and the higher the cpu percentage the
better. The absolute number is not important, but the differences are.
When the difference between noload and the loads is small it shows that
on that kernel, the system is able to respond to requests for normal
tasks (i.e. responsiveness). When the cpu percentage is high it shows that
although the other loads are high, a cpu intensive application (kernel
compile) can still use as much as it needs. Ideally the load should be able
to perform some reasonable amount of work and the kernel compile take slightly
longer. If the balance shifts substantially in one direction or the other it
is indicating a large change in the fairness of scheduling (which is really
what contest is testing.)
