[steve.langasek at ubuntu.com: Re: [QA] Release Meeting 2012-04-06]

Andy Whitcroft apw at canonical.com
Tue Apr 10 14:43:51 UTC 2012


On Fri, Apr 06, 2012 at 12:01:34PM -0700, Steve Langasek wrote:
> [Resending to the right kernel team address]
> 
> Cc:ing the kernel team; folks, there seems to be some strange behavior in
> the kernel that causes our boot speed tests to sometimes turn out much
> slower than normal.
> 
> On Fri, Apr 06, 2012 at 11:54:49AM +0200, Jean-Baptiste Lallement wrote:
> > === Boot speed Testing ===
> > http://reports.qa.ubuntu.com/reports/boot-speed/
> 
> > 1 run yesterday on AMD64 took 12 seconds more than the average to boot
> > http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_14-14-20/bootchart.png
> 
> > The same occurred on i386 on Mar. 31 (54.66s) and Apr. 01 (49.83s)
> > http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-03-31_11-57-21/bootchart.png
> > http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-02/2012-03-31_14-25-29/bootchart.png
> > http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-01_19-37-05/bootchart.png
> > http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-02/2012-04-01_15-13-56/bootchart.png
> 
> This is very interesting.  Although most of the added time is attributed to
> the plumbing layer, when I look at dmesg for the two runs 
> (http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_14-14-20/dmesg
> vs.
> http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_09-55-24/dmesg)
> and vdiff them, it's evident that the slower boot is showing itself to be
> slower very early on, starting with:
> 
> [    0.004007] Calibrating delay loop (skipped), value calculated using timer frequency.. 3591.01 BogoMIPS (lpj=7182020)
> [    0.004018] pid_max: default: 32768 minimum: 301
> [    0.004079] Security Framework initialized
> [    0.004119] AppArmor: AppArmor initialized
> 
> vs.
> 
> [    0.008007] Calibrating delay loop (skipped), value calculated using timer frequency.. 3590.99 BogoMIPS (lpj=7181980)
> [    0.008018] pid_max: default: 32768 minimum: 301
> [    0.008079] Security Framework initialized
> [    0.008118] AppArmor: AppArmor initialized

Ok so we know we have made the timer calibration slower, as it now has
to loop twice as often.  For now I am going to assume that this accounts
for the 4ms difference in timing here.

> so we start out right off the bat with a 4ms difference in timing, which
> grows steadily to 13ms by the time we free the extra memory for the initrd;
> then we make up 8ms here...
> 
> [    0.506003] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> [    0.828286] Freeing initrd memory: 13836k freed
> 
> [    0.519059] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> [    0.833537] Freeing initrd memory: 13840k freed


0.519059-0.506003 = 13ms
0.833537-0.828286 = 5ms

So we are likely actually about the same offset.  I suspect anything
else at this point is within the accuracy of the timer, regardless of
the number of digits it prints.  Bearing in mind that we are not singly
threaded at this time, we are actually decompressing things in parallel.

> the first substantial difference between the two seems to be with the TSC
> calibration:
> 
> [    0.000000] Fast TSC calibration using PIT
> [    0.000000] Detected 1795.505 MHz processor.
> 
> [    0.000000] Fast TSC calibration failed
> [    0.000000] TSC: PIT calibration matches HPET. 1 loops
> [    0.000000] Detected 1795.495 MHz processor.

Ok, lets assume this is the trigger for the slight offset in times as we
have had to take corrective action.  This would likely account for 4ms
in base timings.

> Unfortunately on i386 the dmesg log doesn't go all the way to the beginning
> of boot, so I can't check for similar differences there.

We likely should get some done with the dmesg buffer expanded, this can
be done on the command line.

> Then things really start to drift apart as soon as we start looking at the
> disk:
> 
> [    1.097471] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [    1.143716]  sda: sda1 sda2 < sda5 >
> [    1.144710] sd 0:0:0:0: [sda] Attached SCSI disk
> 
> [    1.106372] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [    1.168798]  sda: sda1 sda2 < sda5 >
> [    1.169777] sd 0:0:0:0: [sda] Attached SCSI disk

1.106372-1.097471 = 8ms

So cirtainly at the start of the disc scan we are similarly offset.  The
bigger difference doess appear after the disks have been scanned which
may or may not be a clue:

1.168798-1.143716 = 25ms

> 
> and from there on, there's a general pattern of things being slower on the
> slow boot than on the fast ones - particularly where I/O is concerned (which
> is consistent with what the boot charts show), but even things like driver
> initialization are slowed down in comparison.
> 
> Do you kernel folks have any idea what could be causing this occasional
> slowdown?  It doesn't look like a CPU frequency issue, yet something is
> definitely slowing down the whole system.

If I look at a pair of charts taken the same day, I see a pair which
take, 48.72s (A) and 35.09 (B):

dmesg A: http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_14-14-20/dmesg
dmesg B: http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_09-55-24/dmesg

version A: 3.2.0-22.35-generic
version B: 3.2.0-22.35-generic 

Time as measureed to initramfs udevd starting:

A: [    1.277800] udevd[103]: starting version 175
B: [    1.252271] udevd[103]: starting version 175

Time to root filesystem checked:

A: [    1.884790] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
B: [    1.943905] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

Time to root udevd starting:

A: [   14.343249] udevd[356]: starting version 175
B: [   13.663680] udevd[393]: starting version 175

So we have some parts delayed, some catching up and exceeding.

I am wondering if the clocks are reliable here.  Do we have any outside
measure of the times by any chance, time to completion of Jenkins jobs
or anything which would validate the times here??

Overall when I look at the two bootcharts associated with these the most
distinctive difference is the shape of the activirty from when mountall
ends to when when X starts.  There is a distinct difference in urgency
there.  The oddest thing is the order of starting is completely reversed
in some areas.  Actually, in the slow one, dbud-daemon starts then seems
to exit and perhaps restart, I wonder if that is the trigger for some
of this slowness.  For example in the slow one the apparmor initiasation
and urandom seem to show up majorly longer.

A: http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_14-14-20/bootchart.svg
B: http://reports.qa.ubuntu.com/reports/boot-speed/acer-veriton-01/2012-04-05_09-55-24/bootchart.svg

Steve?

-apw



More information about the Ubuntu-release mailing list