CEP-??: benchmarks, measurements, composite and derivative tests

Zygmunt Krynicki zygmunt.krynicki at canonical.com
Sun Jul 26 13:16:33 UTC 2015


Hey

BTW: This is something I discussed (as an idea) with Maciek. We've
decided that it's better to express this on the mailing list to let
everyone participate and throw their ideas around. None of the things
below are implemented.

This is something that has been on my mind for quite a bit of time
(almost five years if you include derivative tests). A tl;dr up front
as this might be a long email: I'd like to explore ideas on how we
could enhance checkbox with the ability to natively support
"benchmarks" (which can be abstracted as a test that measures something
rather than producing a boolean value). This expands into a discussion
of what many current benchmarks de facto do (composite tests) and into
some other interesting and low-hanging fruit (derivative tests).

If you look at the spectrum of software that is "similar" to what we
do, the biggest missing feature is the ability to see how much/fast/big
something is. Consumer computing magazines offer comprehensive
benchmarks of new platforms, storage devices, graphics cards and CPUs.
There are many more, and more niche, offerings in the enterprise
market. Recently [1], Juju has gained the ability to run identical
payloads on various cloud providers in order to allow downstream
consumers to provide reliable, up-to-date cost/benefit comparisons
across the public cloud ecosystem.

While traditionally not something many software stacks do, benchmarks
are slowly but steadily creeping in, to work alongside unit tests, to
ensure that important software projects don't deteriorate in
performance due to a seemingly trivial change. You will see this in
operating systems (the Linux kernel and the large collection of
associated testing and benchmarking projects), in web browsers, in
graphical toolkits and, more recently, even in programming languages
(Rust). My point here is that more benchmarks are being added now than
ever before and this is because benchmarks are, in many cases, just as
important as tests.

In my past, when designing LAVA, I included "measurements" from day
one. We divided tests into two groups: qualitative (pass vs fail) and
quantitative (how much). This has worked pretty well but there are
some interesting lessons learned. The one thing I need to add here,
though, is that LAVA and plainbox have some widely different
semantics. LAVA was (perhaps still is) more about results than about
tests. Plainbox has native test definitions. LAVA did not (or
definitely not to the same extent). What LAVA modeled was a program to
run and a co-program to parse the output. The key difference is that
the companion program could return multiple results. So, for example, a
simple memory benchmark test included a number of measurements for
different bucket sizes, different copy modes, and so on. This is the
first thing I'd like to consider introducing.

Composite and Composite-Measurement Units

Consider this simple example; it is based on the stream benchmark [2].
Stream is a very simple benchmark. You can find the example output
(from a raspberry pi) attached to this message (for reference). The
key fact is that it contains a _lot_ of data in an otherwise trivially
small output. To model all of that for plainbox, we need two basic
pieces of information.

We need something that describes that stream is a runnable composite
test. This includes the stream executable. In addition, we need the
"companion" program that will be able to process the stream output and
return something we can work with. It is a design choice to split those
in two. We could very well have run stream internally from
"parse-stream" and only place requirements on the expected output of
the parser. I wanted to avoid this so that anyone who is familiar with
a given benchmark can look at the raw log files and see them without
obfuscation introduced by the "plainbox layer".

We also need, though you may argue this is optional, something that
describes particular parts of the composite test. Units like this will
tell us about specific tests or measurements that stream makes and how
to interpret them. This is the first design decision that is a lesson
learned on top of LAVA. By requiring this up front we have clean,
managed meta-data about a test that would otherwise depend on (and
change with) the parser program. I think this is the same situation
that has led us to explore template units over local jobs.

  unit: composite
  id: stream
  command: stream
  companion: parse-stream
  companion-mode: parse-stdout
  _summary: Stream (synthetic memory benchmark)
  _description:
    The STREAM benchmark is a simple synthetic benchmark program
    that measures sustainable memory bandwidth (in MB/s) and
    the corresponding computation rate for simple vector kernels.

  unit: composite-measurement
  part-of: stream
  id: stream.best-rate.Copy
  units: megabytes per second
  units-abbrev: MB/s
  better: more
  _summary: Best memory bandwidth using the "Copy" function

  unit: composite-measurement
  part-of: stream
  id: stream.best-rate.Scale
  units: megabytes per second
  units-abbrev: MB/s
  better: more
  _summary: Best memory bandwidth using the "Scale" function

  unit: composite-measurement
  part-of: stream
  id: stream.best-rate.Add
  units: megabytes per second
  units-abbrev: MB/s
  better: more
  _summary: Best memory bandwidth using the "Add" function

  unit: composite-measurement
  part-of: stream
  id: stream.best-rate.Triad
  units: megabytes per second
  units-abbrev: MB/s
  better: more
  _summary: Best memory bandwidth using the "Triad" function

  unit: composite-test
  part-of: stream
  id: stream.valid
  _summary: Check if stream measurements validate (avg error less than 1e-13)

NOTE: If you compare this to the output of stream you will see there
are many bits missing. I specifically left out the rest to keep this
example short and to the point.
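To make the companion side concrete, here is a minimal sketch of what
"parse-stream" could look like. It assumes that companion-mode:
parse-stdout feeds the raw stream output to the companion on stdin and
that the companion reports each result as one JSON object per line
(with "id" plus either "measurement" or "outcome"); both conventions
are assumptions of mine, not anything that exists today.

  #!/usr/bin/env python3
  # parse-stream: hypothetical companion for the stream composite unit.
  # Reads raw stream output on stdin and writes one JSON object per
  # result on stdout. The wire format is an assumption, not a spec.
  import json
  import re
  import sys

  def main():
      text = sys.stdin.read()
      # Rows like: "Copy:    277.5   0.577147   0.576497   0.577596"
      for match in re.finditer(
              r'^(Copy|Scale|Add|Triad):\s+([0-9.]+)', text, re.MULTILINE):
          function, best_rate = match.group(1), float(match.group(2))
          print(json.dumps({
              'id': 'stream.best-rate.{}'.format(function),
              'measurement': best_rate,
          }))
      # The validation line drives the single qualitative result.
      outcome = 'pass' if 'Solution Validates' in text else 'fail'
      print(json.dumps({'id': 'stream.valid', 'outcome': outcome}))

  if __name__ == '__main__':
      main()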

The definitions above give us some interesting knowledge:
 - We can run the stream composite and processing it should return
   five "results".
 - One of those results is a "classic" pass/fail test.
 - Four of those results are measurements.
 - We have useful descriptions of each of the measurements.
 - We know what units of measurement are used (MB/s in this case).
 - We know if more is better or worse.
(All of this lets us immediately plot useful comparison charts or
trend charts without injecting any new information from side channels
or having to create "custom reports".)

Let's start with the very last result here: stream.valid. Stream
performs many runs and computes the average error between them. This
can be used to spot a nondeterministic system (e.g. a busy system or a
VM) where the rest of the measurements may be meaningless. This may
seem irrelevant, but many benchmarks that don't take this into account
can easily produce garbage results silently. If we go ahead and develop
our own benchmarks (beyond the USB/network copy performance) then this
is a small reminder to pay attention to the proper methodology.

So far, as you know, plainbox results store the "outcome" of a test.
The outcome ranges from PASS and FAIL to some other interesting values
that are mostly irrelevant to the topic.

Storing measurements is relatively straightforward. It would be a
result object with OUTCOME_MEASUREMENT ("measurement") and a new
"measurement" field holding a numeric value. Everything else is already
available from the composite-measurement and composite-test units.
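As a rough sketch (using a toy stand-in rather than the real plainbox
result classes, which differ in detail), the extension could look
something like this:

  # Toy stand-in for a plainbox result, only to show where the new
  # outcome constant and the "measurement" field would live.
  class JobResult:

      OUTCOME_PASS = 'pass'
      OUTCOME_FAIL = 'fail'
      # New outcome for quantitative results.
      OUTCOME_MEASUREMENT = 'measurement'

      def __init__(self, outcome, measurement=None):
          if outcome == self.OUTCOME_MEASUREMENT and measurement is None:
              raise ValueError("measurement results need a numeric value")
          self.outcome = outcome
          self.measurement = measurement

  # A qualitative result, exactly as today.
  valid = JobResult(JobResult.OUTCOME_PASS)
  # A quantitative result; the units, the "better" direction and the
  # summary all come from the composite-measurement unit, not from here.
  copy_rate = JobResult(JobResult.OUTCOME_MEASUREMENT, measurement=277.5)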

Having composite units would be, in my opinion, quite useful.
Many real-world tests are composite: we can run the unit tests of many
projects and collect many outputs. We can run fwts and produce many
results. We can run a GFX benchmark and obtain a whole bag of
different, significant numbers. Splitting them into separate
mini-programs to run would be meaningless in some cases and simply
annoying in most. On the other hand, the need to maintain a separate
post-processor program is an added burden and is not itself without
issues. It has to stay in sync with the output of the test, which is
often maintained by a third party. It is almost always impossible to
support i18n as everything will rely on parsing, etc.

The only task that is unsuitable for composite units as I've described
them above is having an easy-to-support unit test wrapper. Consider
this example:

  unit: composite
  id: plainbox
  command: plainbox self-test -u
  companion: parse-python-unittests
  companion-mode: parse-stdout
  _summary: Plainbox (unit tests)

Here we would have to list all of the test cases that plainbox has.
This is clearly not something anyone would want to do (once) or
maintain (forever). Still, we want to have _some_ information about
what may be produced. I bet that for popular testing frameworks we
could provide an off-the-shelf companion program that just works.

To let users express this kind of situation, we might need something
that kind of looks like a template but isn't tied to any resource
program. Instead it would get the data straight from the companion
program itself.

  unit: composite-test-template
  part-of: plainbox
  id: plainbox.{id}

Here we just say that anything else coming in is a qualitative test and
that the companion program will supply the rest. Why is this useful? Some
projects that we know of have adopted plainbox to be their top-level
test definition and runner system. This would be a natural extension
towards integrating many types of tests under one umbrella.
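To illustrate, here is a minimal sketch of what an off-the-shelf
"parse-python-unittests" companion could look like. It assumes the
runner emits the classic verbose unittest lines ("test_foo
(package.module.Class) ... ok") on the stream the companion reads, and
it reuses the one-JSON-object-per-line convention from the parse-stream
sketch above; the reported ids would then instantiate the plainbox.{id}
template. All of these conventions are assumptions on my part.

  #!/usr/bin/env python3
  # parse-python-unittests: hypothetical companion that turns verbose
  # unittest output into per-test-case results. The output convention
  # (one JSON object per line) is an assumption.
  import json
  import re
  import sys

  # Map unittest verdicts onto plainbox-style outcomes.
  OUTCOME_MAP = {'ok': 'pass', 'FAIL': 'fail', 'ERROR': 'fail',
                 'skipped': 'skip', 'expected failure': 'pass'}

  LINE_RE = re.compile(
      r'^(?P<name>\w+) \((?P<location>[\w.]+)\) \.\.\. (?P<verdict>.+)$')

  def main():
      for line in sys.stdin:
          match = LINE_RE.match(line.strip())
          if match is None:
              continue
          verdict = match.group('verdict').split("'")[0].strip()
          outcome = OUTCOME_MAP.get(verdict)
          if outcome is None:
              continue
          # e.g. plainbox.{id} -> plainbox.package.module.Class.test_name
          print(json.dumps({
              'id': '{}.{}'.format(match.group('location'),
                                   match.group('name')),
              'outcome': outcome,
          }))

  if __name__ == '__main__':
      main()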

Derivative Tests

I kind of wanted to leave this out after writing everything else but ...

Derivative tests let us compute new values out of existing results.
This could use the expression evaluation system we have for resource
programs.

We could, for example, include an expression in a test plan that
computes the overall "outcome" of all of the testing (and returns this
information as the exit code).
We could compute the average performance of many runs of a single
benchmark.

The core idea is that simple arithmetic and logic expressions could be
run on top of any existing data to produce more data. I think this has
a lot of potential but would really need a killer feature to show up
before it is something I would invest in implementing. For now,
consider a few examples:

# Ensure that all plainbox unit tests pass (skips are counted as okay)
unit: derivative-test
id: plainbox.overall
select: plainbox.*
expression: all(result.outcome == "pass" or result.outcome == "skip"
 for result in selected_results)

# Compute the average transfer rate of all stream tests
unit: derivative-measurement
id: stream.average-rate
select: stream.best-rate.*
expression: statistics.mean(result.measurement for result in selected_results)

# Ensure that USB3 storage is fast enough
unit: derivative-test
id: usb3/storage-transfer.acceptable
select: usb3/storage-transfer.rate
expression: all(result.measurement >= 80.0 for result in selected_results)
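None of the machinery behind this exists yet, but a rough sketch of how
such expressions could be evaluated might look as follows. The
glob-style "select" matching, the selected_results name and the small
builtin whitelist are all assumptions of mine; the real thing would
reuse the restricted expression evaluation we already have for resource
programs.

  # Hypothetical evaluator for derivative units: select results whose
  # id matches the "select" pattern and evaluate "expression" with the
  # selection bound to selected_results.
  import fnmatch
  import statistics

  def evaluate_derivative(unit, results):
      """unit: dict with 'select' and 'expression' keys;
      results: objects with .id, .outcome and .measurement attributes."""
      selected = [result for result in results
                  if fnmatch.fnmatch(result.id, unit['select'])]
      namespace = {
          'selected_results': selected,
          'statistics': statistics,
          # Small whitelist of builtins that the examples above need.
          'all': all, 'any': any, 'min': min, 'max': max, 'len': len,
      }
      # A bare eval() is only for the sketch; the real implementation
      # would go through the resource-program expression machinery.
      return eval(unit['expression'], {'__builtins__': {}}, namespace)

  # For example, stream.average-rate boils down to:
  #   evaluate_derivative(
  #       {'select': 'stream.best-rate.*',
  #        'expression': 'statistics.mean(result.measurement'
  #                      ' for result in selected_results)'},
  #       all_results)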

Best regards
ZK

[1] https://insights.ubuntu.com/2015/06/25/announcing-benchmarking-with-juju/
[2] https://www.cs.virginia.edu/stream/ref.html
-------------- next part --------------
zyga at pi ~ $ gcc stream.c -O -o stream
./stream
zyga at pi ~ $ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 720056 microseconds.
   (= 360028 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             277.5     0.577147     0.576497     0.577596
Scale:            237.5     0.674524     0.673680     0.675027
Add:              327.4     0.734032     0.733020     0.738897
Triad:            323.9     0.742631     0.740892     0.748557
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

