[PATCH 1/1] UBUNTU: SAUCE: ubuntu_performance_deep_learning: init deep learning framework performance test

Taihsiang Ho (tai271828) taihsiang.ho at canonical.com
Mon Jul 5 15:01:52 UTC 2021


Hi Sam,

Thank you for your comments. I have created another patch that includes
updates based on your feedback. The new patch supersedes this one and can
be found here:
https://lists.ubuntu.com/archives/kernel-team/2021-July/122058.html

Additionally, I have left some more comments inline below for additional
context and clarity.

On Mon, Jun 28, 2021 at 6:55 AM Po-Hsu Lin <po-hsu.lin at canonical.com> wrote:

> Hello Tai,
> Please see inline comment below.
>
> On Fri, Jun 25, 2021 at 1:47 AM Taihsiang Ho (tai271828)
> <taihsiang.ho at canonical.com> wrote:
> >
> > The purpose of this test is to generate performance data for deep
> > learning frameworks. Currently it supports TensorFlow testing only.
> >
> > If the target shell script completes successfully and the target data
> > file is generated, the job passes.
> >
> > The test environment is mostly prepared by MAAS via a customized curtin
> > preseed. Tasks like driver installation, installation of software
> > tightly coupled to the driver, and anything requiring a reboot are set
> > up by the preseed. The rest of the tasks are completed by the autotest
> > framework and defined in the corresponding testing job.
> >
> So it sounds like I can only test this manually on a DGX deployed with
> the maas in Taipei?
>

The test is expected to run on any machine whose GPUs are supported by the
corresponding drivers and deep learning frameworks. For example, a DGX-1,
DGX-2, or DGX-A100 could all serve as target machines.



>
> > Signed-off-by: Taihsiang Ho (tai271828) <taihsiang.ho at canonical.com>
> > ---
> >  ubuntu_performance_deep_learning/control      | 12 +++
> >  ubuntu_performance_deep_learning/helper.py    | 27 ++++++
> >  .../ubuntu_performance_deep_learning.py       | 85 +++++++++++++++++++
> >  .../ubuntu_performance_tensor_flow.sh         | 63 ++++++++++++++
> >  4 files changed, 187 insertions(+)
> >  create mode 100644 ubuntu_performance_deep_learning/control
> >  create mode 100644 ubuntu_performance_deep_learning/helper.py
> >  create mode 100644 ubuntu_performance_deep_learning/ubuntu_performance_deep_learning.py
> >  create mode 100755 ubuntu_performance_deep_learning/ubuntu_performance_tensor_flow.sh
> >
> > diff --git a/ubuntu_performance_deep_learning/control b/ubuntu_performance_deep_learning/control
> > new file mode 100644
> > index 00000000..68a0a626
> > --- /dev/null
> > +++ b/ubuntu_performance_deep_learning/control
> > @@ -0,0 +1,12 @@
> > +AUTHOR = 'Taihsiang Ho <taihsiang.ho at canonical.com>'
> > +TIME = 'MEDIUM'
> > +NAME = 'Basic TensorFlow Testing'
> > +TEST_TYPE = 'client'
> > +TEST_CLASS = 'General'
> > +TEST_CATEGORY = 'Benchmark'
> > +
> > +DOC = """
> > +Perform basic tensor flow testing
> > +"""
> > +
> > +job.run_test_detail('ubuntu_performance_deep_learning', test_name='tensor-flow-cnn-resnet', tag='tensor-flow-cnn-resnet', timeout=60*15)
> Any specific reason to use 'tensor-flow-cnn-resnet', in which the dash
> will later be replaced with underscores:
>     benchmark = benchmark.replace("-", "_")
> Why not just use underscores directly?
>

The namespace and format of `benchmark` are deliberately kept consistent
with the existing performance tests, e.g. ubuntu_performance_pts
https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/tree/ubuntu_performance_pts/ubuntu_performance_pts.py#n185
. This format is expected by the analysis tools that consume the data
generated by these performance tests.
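
For illustration, here is a minimal sketch of the naming transformation,
mirroring the quoted patch code (the TEST_CONFIG value is hypothetical):

    import os

    benchmark = "tensor-flow-cnn-resnet"
    # dashes become underscores to match the naming convention expected
    # by the analysis tools
    benchmark = benchmark.replace("-", "_")
    if "TEST_CONFIG" in os.environ:
        benchmark += "_" + os.environ["TEST_CONFIG"]
    # e.g. with TEST_CONFIG=fp16 this yields "tensor_flow_cnn_resnet_fp16"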


>
> > diff --git a/ubuntu_performance_deep_learning/helper.py b/ubuntu_performance_deep_learning/helper.py
> > new file mode 100644
> > index 00000000..526a7e15
> > --- /dev/null
> > +++ b/ubuntu_performance_deep_learning/helper.py
> > @@ -0,0 +1,27 @@
> > +import re
> > +
> > +
> > +def get_stats(stdout_results):
> > +    # search for the benchmark output line
> > +    # for example, search for "300 300.0  6776.8  0.000  0.960 0.00000" which has
> > +    #     1. 6 numbers, either integers (300) or floats in x.x format (6776.8)
> > +    #     2. the third number (6776.8) is what we want
> > +    #
> > +    # regular expression:
> > +    #     1. (\d+(\.\d+)?) for x.x or x
> > +    #         1.1. \d for numbers, equivalent to [0-9]
> > +    #         1.2. \d+ one or more numbers. + is short for {1, }
> > +    #         1.3. (\.\d+)? zero or one ".x". ? is short for {0, 1}
> > +    #     2. \s for space, short for [\f\n\r\t\v\u00A0\u2028\u2029]
> > +    #         2.1. \s+ for one or more spaces
> > +    #     3. (){n} for n repetitions of group
> > +    pattern = r"""(\d+(\.\d+)?)         # for x.x or x
> > +                  (\s+(\d+(\.\d+)?)){2} # 2 repetitions of _x.x or _x
> > +                  (\s+(\d+(\.\d+)?)){3} # 3 repetitions of _x.x or _x"""
> > +    rc = re.compile(pattern, re.VERBOSE)
> > +    matches = rc.findall(stdout_results, re.MULTILINE)
> > +
> > +    # get the key number
> > +    target_number = matches[1][3]
> It is possible to see IndexError here if matches didn't get the expected
> value.
> I don't have test output here so not sure if this will happen though.
>
The answer to the whole set of questions regarding `values` is similar to
the `benchmark` explanation above: they are kept on purpose to be
consistent with the existing logic/algorithm for generating performance
data. For example, please see
https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/tree/ubuntu_performance_pts/ubuntu_performance_pts.py#n203

The main idea is to fail the test job if the job did not run through and
generate data. If there is any exception, e.g. `matches` does not get the
expected value, the exception should fail the test job so that no data is
generated. That is, when `matches` does not get the expected value, the
job should fail or crash (by raising an exception).
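
To make the parsing concrete, here is a minimal sketch of what
helper.get_stats() does, using the sample line from the comment in
helper.py plus a hypothetical earlier output line:

    import re

    pattern = r"""(\d+(\.\d+)?)         # for x.x or x
                  (\s+(\d+(\.\d+)?)){2} # 2 repetitions of _x.x or _x
                  (\s+(\d+(\.\d+)?)){3} # 3 repetitions of _x.x or _x"""
    rc = re.compile(pattern, re.VERBOSE)

    # hypothetical benchmark output containing two matching lines
    stdout_results = ("150 150.0  6500.0  0.000  0.980 0.00000\n"
                      "300 300.0  6776.8  0.000  0.960 0.00000")

    # note: findall() on a compiled pattern takes (string, pos, endpos)
    matches = rc.findall(stdout_results)

    # a repeated group keeps only its last repetition, so index 3 of each
    # match tuple is the third number on the line
    print(matches[1][3])  # "6776.8" -- raises IndexError if parsing failed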


> From the code in ubuntu_performance_deep_learning.py:
>     values[i] = helper.get_stats(stdout_result)
>     if values[i]:
> It looks like you're expecting to see exception here?
>
>
Yes. The reason is the same as the elaboration for `matches` above.


> > +
> > +    return target_number
> > diff --git a/ubuntu_performance_deep_learning/ubuntu_performance_deep_learning.py b/ubuntu_performance_deep_learning/ubuntu_performance_deep_learning.py
> > new file mode 100644
> > index 00000000..00c6074f
> > --- /dev/null
> > +++ b/ubuntu_performance_deep_learning/ubuntu_performance_deep_learning.py
> > @@ -0,0 +1,85 @@
> > +import os
> > +import helper
> > +from autotest.client import test, utils
> > +from autotest.client.shared import error
> ^ unused import
>
>
Good catch. Thanks! This will be fixed in the new patch.



> > +
> > +
> > +TEST_ITERATION = 3
> > +
> > +
> > +class ubuntu_performance_deep_learning(test.test):
> > +    version = 1
> > +
> > +    def initialize(self):
> > +        pass
> > +
> > +    def install_required_pkgs(self):
> > +        p_dir = os.path.dirname(os.path.abspath(__file__))
> > +        uptf_cmd = os.path.join(p_dir, "ubuntu_performance_tensor_flow.sh")
> > +        cmd = "{} setup".format(uptf_cmd)
> > +        shell_exit_code = utils.system(cmd, ignore_status=True)
> > +
> > +        return shell_exit_code
> This return code is not being used in setup(), so the test will keep
> going if setup fails. If you want this to bail early when the setup
> task fails, this line:
>         shell_exit_code = utils.system(cmd, ignore_status=True)
> can be replaced with:
>         utils.system(cmd)
>
> ignore_status is default to true.
>

Thanks for the comment. This snippet will also be updated to follow your
suggestion. Besides, I think there is a typo in your comment: ignore_status
defaults to "false". Please refer to the source:
https://github.com/autotest/autotest/blob/master/client/shared/utils.py#L1217
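
For clarity, a minimal sketch of the suggested change, mirroring the
quoted patch code:

    import os
    from autotest.client import utils

    p_dir = os.path.dirname(os.path.abspath(__file__))
    uptf_cmd = os.path.join(p_dir, "ubuntu_performance_tensor_flow.sh")
    # with ignore_status left at its default (False), a nonzero exit
    # status raises an exception, so setup() bails out early on failure
    utils.system("{} setup".format(uptf_cmd))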



> > +
> > +    def setup(self):
> > +        self.install_required_pkgs()
> > +
> > +    def tensor_flow_cnn_resnet(self, benchmark):
> > +        """Test for running basic tensor flow features"""
> > +        unit = "images/sec"
> > +        max_error_threshold = 0.05
> > +        values = {}
> > +
> > +        # benchmark is the benchmark item of config.yaml
> What is the config.yaml mentioned here?
>

Please refer to the elaboration for `benchmark` above. config.yaml is the
configuration file defined by the tools that analyze the performance data;
it defines the naming conventions.



> > +        benchmark = benchmark.replace("-", "_")
> > +        if "TEST_CONFIG" in os.environ:
> > +            benchmark += "_" + os.environ["TEST_CONFIG"]
> > +
> > +        p_dir = os.path.dirname(os.path.abspath(__file__))
> > +        uptf_cmd = os.path.join(p_dir, "ubuntu_performance_tensor_flow.sh")
> > +        cmd = "{} test".format(uptf_cmd)
> > +
> > +        for i in range(TEST_ITERATION):
> > +            stdout_result = utils.system_output(cmd, retain_output=True)
> > +            values[i] = helper.get_stats(stdout_result)
> > +
> > +            if values[i]:

> Just like the question in ubuntu_performance_deep_learning/helper.py,
> if you're expecting something like a string here, it will fail the
> min/max/average computation below.
>
>
It will run normally: there is a type conversion (string to float) below.
Besides, please note this code snippet is implemented in the same way as
some other existing performance test jobs, e.g.
https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/tree/ubuntu_performance_pts/ubuntu_performance_pts.py#n201
to keep the code easier to read and compare.
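
For instance, a minimal sketch of the min/max/average computation from the
quoted patch, with hypothetical parsed values:

    values = {0: "6776.8", 1: "6800.1", 2: "6750.3"}  # strings from get_stats()
    v = [float(values[i]) for i in values]            # string -> float conversion
    maximum = max(v)
    minimum = min(v)
    average = sum(v) / float(len(v))
    max_err = (maximum - minimum) / average
    print("{:.2%}".format(max_err))  # ~0.73%, under the 5% threshold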


> > +                print("")
> > +                print("Test %d of %d:" % (i + 1, TEST_ITERATION))
> > +                print("{}[{}] {} {}".format(benchmark, i, values[i],
> unit))
> > +
> > +        #
> > +        #  Compute min/max/average:
> > +        #
> > +        if values[i]:
> > +            v = [float(values[i]) for i in values]

As mentioned above, anything that cannot be converted like None or a
> non-numeric string will cause TypeError or ValueError here.
>

These exceptions should fail the job. This behavior is expected. Please
refer to the elaboration of `matches` shown above.



>
> > +            maximum = max(v)
> > +            minimum = min(v)
> > +            average = sum(v) / float(len(v))
> > +            max_err = (maximum - minimum) / average
> > +
> > +            print("")
> > +            print(benchmark + "_minimum {:.2f} {}".format(minimum, unit))
> > +            print(benchmark + "_maximum {:.2f} {}".format(maximum, unit))
> > +            print(benchmark + "_average {:.2f} {}".format(average, unit))
> > +            print(benchmark + "_maximum_error {:.2%}".format(max_err))
> > +            print("")
> > +
> > +            if max_err > max_error_threshold:
> > +                print("FAIL: maximum error is greater than 5%")
> > +            else:
> > +                print("PASS: test passes specified performance
> thresholds")
> > +        else:
> > +            print("NOT-RUN or FAIL to PARSE DATA")
> > +
> > +    def run_once(self, test_name):
> > +        if test_name == "tensor-flow-cnn-resnet":
> > +            self.tensor_flow_cnn_resnet(test_name)
> > +
> > +            print("")
> > +            print("tensor_flow_cnn_resnet shell script has run.")
> > +
> > +        print("")
> > +
> > +    def postprocess_iteration(self):
> > +        pass
> > diff --git a/ubuntu_performance_deep_learning/ubuntu_performance_tensor_flow.sh b/ubuntu_performance_deep_learning/ubuntu_performance_tensor_flow.sh
> > new file mode 100755
> > index 00000000..cbf9ff0e
> > --- /dev/null
> > +++ b/ubuntu_performance_deep_learning/ubuntu_performance_tensor_flow.sh
>
> You have some mixed use of spaces and tabs in this script; we don't have
> a strict guideline to follow, but I think it's better not to use both at
> the same time.
>
>
Good catch. Thanks for the heads-up.



> > @@ -0,0 +1,63 @@
> > +#!/usr/bin/bash
> What is the target distribution for this test?
> This works on 20.04 / 21.04 but not 18.04 and earlier.
>
> #!/usr/bin/env bash will be more universal.
>

+1. I will use this shebang so that the script also works on 18.04 and above.


I appreciate your input. Thank you for your time and effort! Please note
that an updated patch has been submitted:
https://lists.ubuntu.com/archives/kernel-team/2021-July/122058.html
It supersedes this outdated patch, so please feel free to NACK the old one.

-tai



>
> > +#
> > +# perform TensorFlow performance testing and corresponding pre-setup.
> > +#
> > +
> > +set -eo pipefail
> > +
> > +CONTAINER_VER="20.12"
> > +
> > +install_nvidia_docker() {
> > +    local distribution
> > +    distribution="$(. /etc/os-release;echo $ID$VERSION_ID)"
> > +    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
> > +    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
> > +       sudo tee /etc/apt/sources.list.d/nvidia-docker.list > /dev/null
> > +    sudo apt update
> > +    sudo apt install nvidia-docker2 -y
> > +    sudo systemctl restart docker
> > +}
> > +
> > +get_num_gpus() {
> > +    # required to passthrough GPUs into containers
> > +    nvidia-smi -L | wc -l
> > +}
> > +
> > +setup() {
> > +    # pre-setup testing environment and necessary tools
> > +    install_nvidia_docker
> > +}
> > +
> > +run_test() {
> > +    sudo nvidia-docker run \
> > +        --shm-size=1g \
> > +        --ulimit memlock=-1 \
> > +        --ulimit stack=67108864 \
> > +        -ti --rm nvcr.io/nvidia/tensorflow:${CONTAINER_VER}-tf1-py3 -- \
> > +        mpiexec \
> > +        --bind-to socket \
> > +        --allow-run-as-root \
> > +        -np "$(get_num_gpus)" \
> > +        python -u /workspace/nvidia-examples/cnn/resnet.py \
> > +        --layers=50 \
> > +        --precision=fp16 \
> > +        --batch_size=256 \
> > +        --num_iter=300 \
> > +        --iter_unit=batch \
> > +        --display_every=300
> > +}
> > +
> > +case $1 in
> > +    setup)
> > +       setup
> > +       echo ""
> > +       echo "Setting up necessary test environment..."
> > +       echo ""
> > +       ;;
> > +    test)
> > +       run_test
> > +       echo ""
> > +       echo "Running test..."
> > +       echo ""
> > +       ;;
> > +esac
> > --
> > 2.32.0
> >
> >
> > --
> > kernel-team mailing list
> > kernel-team at lists.ubuntu.com
> > https://lists.ubuntu.com/mailman/listinfo/kernel-team
>