lttng overhead on 32 bit ARM Nexus 4, some analysis

Colin Ian King colin.king at canonical.com
Fri May 17 13:31:34 UTC 2013


Hi there,

I've completed some basic background research on the overhead of lttng
on the ARM-based Nexus 4.  There were concerns that lttng would be
overly burdensome on ARM devices, especially since these devices
traditionally have much smaller instruction and data caches than the
beefier x86 processors, which may affect the behaviour of a heavily
traced ARM system.

I based my analysis on the methodology described in the excellent
research conducted by Romik Guha Anjoy and Soumya Kanti Chakraborty
(Mälardalen University, School of Innovation, Design and Engineering),
references immediately below:

Abstract:
http://mdh.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:325301

PDF:
http://mdh.diva-portal.org/smash/get/diva2:325301/FULLTEXT01

In my testing I used oprofile to instrument the lttng overhead when
capturing all available events using lttng enable-event -a -k.  I
captured all events to simulate the largest log-generating scenario,
and perhaps something that QA may want to use.  oprofile was used
because it (almost) works out-of-the-box and the academic research
indicated it was a good overall system profiling choice.
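
For reference, a minimal sketch of the capture sequence; the session
name and the exact oprofile invocation here are illustrative rather
than a record of the precise commands used:

  # trace all kernel events with lttng
  lttng create nexus4-test        # session name is illustrative
  lttng enable-event -a -k
  lttng start

  # profile with oprofile while the workload runs
  opcontrol --start
  ... run the test workload ...
  opcontrol --stop
  opcontrol --dump
  opreport -l > oprofile-report.txt

  # stop tracing and tear down the session
  lttng stop
  lttng destroy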

I had considerable difficulties with oprofile in some test scenarios:
oprofile combined with the phablet clock application causes random
reboots, so I could not instrument that test.  I originally used the
raring Nexus 4 kernel with lttng-modules-dkms but hit several issues,
including corrupt lttng log output that babeltrace could not parse.  I
eventually used the new Saucy kernel that includes lttng built in
(thanks Tim!).

I successfully tested the following scenarios on the Nexus 4:

* idle system
* stress test (stress --cpu 4 --io 4 --vm 4)
* dd stress test (dd if=/dev/urandom of=/dev/null) x 4, one per CPU
  (see the sketch after this list)
* calendar application running (fairly idle)
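
A minimal sketch of the dd stress test mentioned above, spawning one dd
instance per CPU (illustrative, not necessarily the exact invocation
used):

  # 4 CPUs on the Nexus 4, so run 4 dd readers in parallel
  for i in 1 2 3 4; do
      dd if=/dev/urandom of=/dev/null &
  done
  wait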

The analysed data and graphs are available here:

http://kernel.ubuntu.com/~cking/lttng/nexus4/lttng-overhead.ods

There are 5 pages, tabbed as "idle", "stress", "stress-dd", "calendar"
and "summary" covering the 4 test scenarios and the collated data in the
summary.

== Discussion of results in the "summary" page ==

1. lttng CPU overhead as measured using oprofile

In the first table I extracted all the lttng oprofile counter hits and
compared these to the overall oprofile counter hits for each test.  For
the low-demand "idle" and "calendar" tests, the lttng CPU overhead is
~0.29-0.37%, and for the busy stress tests it is ~0.59-0.69%.  The
research paper cited above used an Intel Core 2 Quad Q9550 (four 64-bit
SMP cores @ 2.83 GHz) and observed an overhead of 0.25% to 0.85%
depending on the load, so the Nexus 4 results are very similar.
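
A sketch of how the lttng hits can be pulled out of the oprofile symbol
report; the opreport options and the column layout assumed here are
illustrative and may differ from the exact extraction used:

  # sum all samples vs. samples attributed to lttng symbols/modules
  opreport -l > opreport.txt
  awk '{ total += $1 } END { print "total samples:", total }' opreport.txt
  grep -i lttng opreport.txt | \
      awk '{ hits += $1 } END { print "lttng samples:", hits }'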

2. lttng Log file size

One of the concerns is the amount of data captured in flight recorder
mode by lttng.  lttng reduces buffer copying by using memory-mapped
buffers; however, the data ultimately needs to be flushed out
periodically.

For the idle cases, we see ~105K/second of log data being generated,
whereas for the busy stress tests we see 560-850K/second.
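
These rates can be estimated by sampling the size of the trace output
directory over the run; a sketch, assuming the default lttng trace
location (adjust the path to the actual session directory):

  # sample the kernel trace directory size once a second
  TRACE_DIR=$HOME/lttng-traces
  while true; do
      du -sk "$TRACE_DIR"
      sleep 1
  done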

Observing the I/O activity on the logs shows that this data is written
at a consistent rate matching the average write rate, so lttng
contributes a light but noticeable I/O load.

3. comparing CPU usage, lttng vs non-lttng

In these tests, I used sar to gather CPU load information and compared
each test run with lttng enabled and without lttng.
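
The sar sampling was along these lines; the one-second interval and the
count below are illustrative:

  # %user, %system, %iowait and %idle, sampled every second for 5 minutes
  sar -u 1 300 > sar-cpu.txt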

User % CPU loads:

In the idle cases, it is hard to compare as the overall load was low.
However, in the stress cases, we see that lttng does increase the user
space load, most probably because of the user space lttng components
handling buffers and flushing them out to disk.  Typically we see an
extra 5-7% overhead on the busy stress tests.

System % CPU loads:

In the idle cases, one can see a marginal additional lttng overhead.
For the stress cases, the lttng runs show a reduction in system CPU
load; I am not sure why this is.

I/O Wait % CPU loads:

In the idle cases we see that lttng doubles the I/O wait time - I guess
this is due to the additional log data flushing.  For the stress cases
there are no I/O wait stats, possibly because I/O wait is negligible
when the CPUs are completely pegged during the stress tests.

Idle % CPU load:

In the idle case, lttng consumes a fraction more CPU, so the % time in
idle is lower than in the non-lttng case.  However, the difference is
marginal and within the margin of error, so it is hard to say whether
it is noteworthy.  For the stress cases, no free CPU idle time is
measured.

== Conclusion ==

* lttng has minimal impact on CPU load, less than 1% even on heavily
loaded, stressed systems according to oprofile.
* On a loaded system lttng can generate ~850K of log data per second
with full tracing enabled.  This can contribute to I/O loading and skew
I/O related tests / benchmarks.
* Log data is flushed out regularly rather than in bursts.

If full tracing is enabled, we need to ensure that tests do not run for
a prolonged time.  A 1 hour run can potentially generate ~3GB of trace
data on a very busy system (~850K/second x 3600 seconds ≈ 3GB).

Note that I have *NOT* measured the overhead of the tracepoints when
lttng tracing is not enabled.  The academic literature suggests this is
almost negligible, so I did not pursue measuring it.

Colin

Postscript: I've written some LTTng notes here:
http://smackerelofopinion.blogspot.co.uk/2013/05/kernel-tracing-using-lttng.html



