CEP-??: benchmarks, measurements, composite and derivative tests

Jeffrey Lane jeffrey.lane at canonical.com
Mon Aug 3 13:59:18 UTC 2015


This is pretty interesting, but how would this apply in the context of
certification...  would there be some further interpretation?  For example,
if I do some sort of I/O benchmarking, would a test actually fail for poor
performance, or would the benchmarking result be more informational?

And what is the definition of poor performance that would fail a
certification test?  Even what we have today is purely arbitrary, with
different limits set for different things... for example, server network
devices fail certification if they don't provide at least 80% of advertised
bandwidth, while client systems only have to provide 40% to pass.

Anyway, I was just curious... we haven't run this sort of thing via
"checkbox" in years, and even then it was a wrapper that interpreted the
benchmark suite results and provided the standard 0/1 exit code to
Checkbox.

But as I said, it is interesting... we're planning to add some sort of
benchmarking in conjunction with some tests already this cycle.  For
example, a CPU load monitor of some sort that runs alongside the network
test to determine whether network load is overloading the CPU, which would lead to
things like poor network performance, or poor application performance when
the system is under high network load.

Interested to see how this pans out.

On Sun, Jul 26, 2015 at 9:16 AM Zygmunt Krynicki <
zygmunt.krynicki at canonical.com> wrote:

> Hey
>
> BTW: This is something I discussed (as an idea) with Maciek. We've
> decided that it's better to express this on the mailing list to let
> everyone participate and throw their ideas around. None of the things
> below are implemented.
>
> This is something that has been on my mind for quite a bit of time
> (almost five years if you include derivative tests). A tl;dr up front as
> this might be a long email. I'd like to explore ideas on how we could
> enhance checkbox with the ability to natively support "benchmarks"
> (which can be abstracted as tests that measure something rather than
> producing a boolean value). This expands into a discussion of what
> many current benchmarks de facto do (composite tests) and into some
> other interesting, low-hanging fruit (derivative tests).
>
> If you look at the spectrum of software that is "similar" to what we
> do, the biggest missing feature is the ability to see how much/fast/big
> something is. Consumer computing magazines offer comprehensive
> benchmarks of new platforms, storage devices, graphics cards and CPUs.
> There are many more, and more niche, offerings in the enterprise
> market. Recently [1], Juju has gained the ability to run identical
> payloads on various cloud providers in order to allow downstream
> consumers to provide reliable, up-to-date cost/benefit comparisons
> across the public cloud ecosystem.
>
> While traditionally not something many software stacks do, benchmarks
> are slowly but steadily creeping in, to work alongside unit tests, to
> ensure that important software projects don't deteriorate in
> performance due to a seemingly trivial change. You will see this in
> operating systems (the Linux kernel and the large collection of
> associated testing and benchmarking projects), in web browsers, in
> graphical toolkits and, recently enough, even in programming languages
> (Rust). My point here is that more benchmarks are being added now than
> ever before, and this is because benchmarks are, in many cases, just as
> important as tests.
>
> In the past, when designing LAVA, I included "measurements" from day
> one. We divided tests into two groups: qualitative (pass vs fail) and
> quantitative (how much). This has worked pretty well but there are
> some interesting lessons learned. The one thing I need to add here,
> though, is that LAVA and plainbox have widely different semantics.
> LAVA was (perhaps still is) more about results than about tests.
> Plainbox has native test definitions; LAVA did not (or definitely not
> to the same extent). What LAVA modeled was a program to run and a
> co-program to parse the output. The key difference is that the
> companion program could return multiple results. So, for example, a
> simple memory benchmark test included a number of measurements, for
> different bucket sizes, different copy modes, etc. This is the first
> thing I'd like to consider introducing.
>
> Composite and Composite-Measurement Units
>
> Consider this simple example; it is based on the stream benchmark [2].
> Stream is a very simple benchmark. You can find the example output
> (from a Raspberry Pi) attached to this message (for reference). The
> key fact is that it contains a _lot_ of data in an otherwise trivially
> small output. To model all of that for plainbox, we need two basic
> pieces of information.
>
> We need something that describes that stream is a runnable composite
> test. This includes the stream executable. In addition, we need the
> "companion" program that will be able to process stream's output and
> return something we can work with. It is a design choice to split
> those in two. We could very well have run stream internally from
> "parse-stream" and only placed requirements on the expected output of
> the parser. I wanted to avoid this so that anyone who is familiar with
> a given benchmark can look at the raw log files and see them without
> obfuscation introduced by the "plainbox layer".
>
> We also need, though you may argue this is optional, something that
> describes particular parts of the composite test. Units like this will
> tell us about specific tests or measurements that stream makes and how
> to interpret them. This is the first design decision that is a lesson
> learned on top of LAVA. By requiring this up front we have clean,
> managed metadata about a test that would otherwise depend on (and
> change with) the parser program. I think this is the same situation
> that has led us to explore template units over local jobs.
>
>   unit: composite
>   id: stream
>   command: stream
>   companion: parse-stream
>   companion-mode: parse-stdout
>   _summary: Stream (synthetic memory benchmark)
>   _description:
>     The STREAM benchmark is a simple synthetic benchmark program
>     that measures sustainable memory bandwidth (in MB/s) and
>     the corresponding computation rate for simple vector kernels.
>
>   unit: composite-measurement
>   part-of: stream
>   id: stream.best-rate.Copy
>   units: megabytes per second
>   units-abbrev: MB/s
>   better: more
>   _summary: Best memory bandwidth using the "Copy" function
>
>   unit: composite-measurement
>   part-of: stream
>   id: stream.best-rate.Scale
>   units: megabytes per second
>   units-abbrev: MB/s
>   better: more
>   _summary: Best memory bandwidth using the "Scale" function
>
>   unit: composite-measurement
>   part-of: stream
>   id: stream.best-rate.Add
>   units: megabytes per second
>   units-abbrev: MB/s
>   better: more
>   _summary: Best memory bandwidth using the "Add" function
>
>   unit: composite-measurement
>   part-of: stream
>   id: stream.best-rate.Triad
>   units: megabytes per second
>   units-abbrev: MB/s
>   better: more
>   _summary: Best memory bandwidth using the "Triad" function
>
>   unit: composite-test
>   part-of: stream
>   id: stream.valid
>   _summary: Check if stream measurements validate (avg error less than 1e-13)
>
> NOTE: If you compare this to the output of stream you will see there
> are many bits missing. I specifically left out the rest to keep this
> example short and to the point.
>
> The definitions above give us some interesting knowledge:
>  - We can run the stream composite and processing its output should
> return five "results".
>  - One of those results is a "classic" pass/fail test.
>  - Four of those results are measurements.
>  - We have useful descriptions of each of the measurements.
>  - We know what units of measurement are used (MB/s in this case).
>  - We know whether more is better or worse.
> (All of this lets us immediately plot useful comparison charts or
> trend charts without injecting any new information from side channels
> or having to create "custom reports".)
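>
> To make that concrete, here is a minimal sketch of what the
> "parse-stream" companion could look like. It assumes the stock STREAM
> output format and a made-up "id outcome value" line protocol for what
> the companion hands back to plainbox; the real protocol is an open
> design question and none of this is implemented.
>
>   #!/usr/bin/env python3
>   # Hypothetical parse-stream companion: reads STREAM output on stdin
>   # and emits one line per result, matching the unit ids defined above.
>   import re
>   import sys
>
>   RATE_RE = re.compile(r'^(Copy|Scale|Add|Triad):\s+([0-9.]+)')
>
>   def main():
>       validated = False
>       for line in sys.stdin:
>           match = RATE_RE.match(line)
>           if match:
>               function, rate = match.group(1), float(match.group(2))
>               # One measurement per kernel: stream.best-rate.Copy, etc.
>               print("stream.best-rate.{} measurement {}".format(function, rate))
>           elif line.startswith("Solution Validates"):
>               validated = True
>       # The single qualitative result: stream.valid.
>       print("stream.valid {} n/a".format("pass" if validated else "fail"))
>
>   if __name__ == "__main__":
>       main()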
>
> Let's start with the very last result here: the stream.valid. Stream
> performs many runs and computes the average error between them. This
> can be used to spot nondeterministic system (e.g. a busy system or a
> VM) where the rest the measurements may be meaningless. This seems
> irrelevant but many benchmarks that don't take this into account can
> easily produce garbage results silently. If we go ahead and develop
> our own benchmarks (beyond the USB/network copy performance) then this
> is a small reminder to pay attention to the proper methodology.
>
> So far, as you know, plainbox results store the "outcome" of a test.
> The outcome ranges from PASS and FAIL to some other interesting values
> that are mostly irrelevant to this topic.
> Storing measurements is relatively straightforward. It would be a
> result object with OUTCOME_MEASUREMENT ("measurement") and a new
> "measurement" field with a numeric value. Everything else is already
> available from the composite-measurement and composite-test units.
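>
> As a minimal sketch (the names are illustrative, nothing like this
> exists yet), such a result could look roughly like:
>
>   # Hypothetical new outcome constant next to OUTCOME_PASS / OUTCOME_FAIL.
>   OUTCOME_MEASUREMENT = "measurement"
>
>   result = {
>       "id": "stream.best-rate.Copy",
>       "outcome": OUTCOME_MEASUREMENT,
>       # Only the numeric value is stored here; the units, summary and
>       # "better" direction all come from the composite-measurement unit.
>       "measurement": 1412.7,
>   }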
>
> Having composite units would be, in my opinion, quite useful.
> Many real-world tests are composite: we can run the unit tests of many
> projects and collect many outputs. We can run fwts and produce many
> results. We can run a GFX benchmark and obtain a whole bag of
> different, significant numbers. Splitting them into separate
> mini-programs to run would be meaningless in some cases and simply
> annoying in most. On the other hand, the need to maintain a separate
> post-processor program is an added burden and is not itself without
> issues. It has to stay in sync with the output of the test, which is
> often maintained by a third party. It is almost always impossible to
> support i18n as everything will rely on parsing, etc.
>
> The only task that is unsuitable for composite units as I've described
> them above is having an easy-to-support unit test wrapper. Consider
> this example:
>
>   unit: composite
>   id: plainbox
>   command: plainbox self-test -u
>   companion: parse-python-unittests
>   companion-mode: parse-stdout
>   _summary: Plainbox (unit tests)
>
> Here we would have to list all of the test cases that plainbox has.
> This is clearly not something anyone would want to do (once) or
> maintain (forever). Still, we want to have _some_ information about
> what may be produced. I bet that for popular testing frameworks we
> could provide an off-the-shelf companion program that just works.
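>
> As an illustration, here is a minimal sketch of such an off-the-shelf
> "parse-python-unittests" companion. It assumes the classic verbose
> unittest output ("test_foo (pkg.module.TestClass) ... ok") is fed to
> it on stdin and reuses the same made-up line protocol as the stream
> example (here just "id outcome"); a sketch, not a proposed
> implementation:
>
>   #!/usr/bin/env python3
>   # Hypothetical companion mapping unittest -v lines to per-test results.
>   import re
>   import sys
>
>   LINE_RE = re.compile(
>       r'^(?P<name>\w+) \((?P<cls>[\w.]+)\) \.\.\. (?P<status>ok|FAIL|ERROR|skipped)')
>   OUTCOME = {"ok": "pass", "FAIL": "fail", "ERROR": "fail", "skipped": "skip"}
>
>   for line in sys.stdin:
>       match = LINE_RE.match(line)
>       if match:
>           test_id = "{}.{}".format(match.group("cls"), match.group("name"))
>           # Feeds the composite-test-template shown below: plainbox.{id}
>           print("plainbox.{} {}".format(test_id, OUTCOME[match.group("status")]))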
>
> To let users express this kind of situation, we might need something
> that kind of looks like a template but isn't tied to any resource
> program. Instead it would get the data straight from the companion
> program itself.
>
>   unit: composite-test-template
>   part-of: plainbox
>   id: plainbox.{id}
>
> Here we just say that anything else coming in is a qualitative test and
> the companion program will supply the rest. Why is this useful? Some
> projects that we know of have adopted plainbox to be their top-level
> test definition and runner system. This would be a natural extension
> towards integrating many types of tests under one umbrella.
>
> Derivative Tests
>
> I kind of wanted to leave this out after writing everything else but ...
>
> Derivative tests let us compute new values out of existing results.
> This could use the expression evaluation system we have for resource
> programs.
>
> We could, for example, include an expression in a test plan that
> computes the overall "outcome" of all of the testing (and returns this
> information as the exit code).
> We could compute the average performance of many runs of a single benchmark.
>
> The core idea is that simple arithmetic and logic expressions could be
> run on top of any existing data to produce more data. I think this has
> a lot of potential but would really need a killer feature to show up
> before it is something I would invest in implementing. For now,
> consider a few examples:
>
> # Ensure that all plainbox unit tests pass (skips are counted as okay)
> unit: derivative-test
> id: plainbox.overall
> select: plainbox.*
> expression: all(result.outcome == "pass" or result.outcome == "skip"
> for result in selected_results)
>
> # Compute the average transfer rate of all stream tests
> unit: derivative-measurement
> id: stream.average-rate
> select: stream.best-rate.*
> expression: statistics.mean(result.measurement for result in
> selected_results)
>
> # Ensure that USB3 storage is fast enough
> unit: derivative-test
> id: usb3/storage-transfer.acceptable
> select: usb3/storage-transfer.rate
> expression: all(result.measurement >= 80.0 for result in selected_results)
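>
> For what it's worth, evaluating such units could be as small as the
> sketch below. It assumes results are simple objects with "id",
> "outcome" and "measurement" attributes and that "select" uses
> shell-style wildcards; a real implementation would reuse the
> restricted expression machinery we already have for resource programs
> rather than bare eval().
>
>   import statistics
>   from collections import namedtuple
>   from fnmatch import fnmatch
>
>   Result = namedtuple("Result", "id outcome measurement")
>
>   def evaluate_derivative(select, expression, results):
>       # Pick the results the unit selects, then evaluate the expression
>       # with them (and the statistics module) in scope.
>       selected = [r for r in results if fnmatch(r.id, select)]
>       namespace = {"selected_results": selected, "statistics": statistics}
>       return eval(expression, namespace)  # illustration only
>
>   results = [
>       Result("stream.best-rate.Copy", "measurement", 1400.0),
>       Result("stream.best-rate.Triad", "measurement", 1200.0),
>   ]
>   print(evaluate_derivative(
>       "stream.best-rate.*",
>       "statistics.mean(result.measurement for result in selected_results)",
>       results))  # -> 1300.0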
>
> Best regards
> ZK
>
> [1]
> https://insights.ubuntu.com/2015/06/25/announcing-benchmarking-with-juju/
> [2] https://www.cs.virginia.edu/stream/ref.html
-- 
"Entropy isn't what it used to be."

Jeff Lane - Server Certification Lead, OCP Certification Tools Engineering Lead,
                  Warrior Poet, Biker, Lover of Pie
Phone: 919-442-8649
Ubuntu Ham: W4KDH                          Freenode IRC: bladernr or bladernr_
gpg: 1024D/3A14B2DD 8C88 B076 0DD7 B404 1417  C466 4ABD 3635 3A14 B2DD