cgroup stanza: a proposal

James Hunt james.hunt at ubuntu.com
Fri Nov 29 17:29:14 UTC 2013


On 22/11/13 10:58, James Hunt wrote:
> Hi Stéphane,
> 
> On 20/11/13 19:23, Stéphane Graber wrote:
>> This morning at vUDS we discussed adding support for cgroups in Upstart.
>>
>> Before I go into details about the proposed stanza and overall
>> behaviour, I'd like to begin by saying that, contrary to some other init
>> systems, our intent is solely related to resource control, which is the
>> main goal of cgroups. Process grouping and tracking will remain
>> unaffected by the addition of cgroup support.
>>
>> Cgroup support will be implemented by adding a new "cgroup" stanza which
>> will control the application of cgroup-based restrictions to the job.
>> The limits will be applied to any of the scripts
>                                 ~~~
> s/any/all/
> 
>> (pre-start/post-start/job/pre-stop/post-stop) similar to what's done
>> with setuid/setgid/apparmor stanzas.
>>
>> My recommended format for the stanza, which I believe should be
>> flexible enough, is:
>>  cgroup <controller> <cgroup name|auto> [<key> <value>]
>>
>>
>> Detail on the fields:
>> == controller ==
>> Name of one of the cgroup controllers.
>>
>> The currently valid values are (though they won't be hard-coded into Upstart):
>>  - blkio
>>  - cpu
>>  - cpuacct
>>  - cpuset
>>  - devices
>>  - freezer
>>  - hugetlb
>>  - memory
>>  - perf_event
>>
>> == cgroup-name|$auto ==
>> Name of the cgroup to use (created if it doesn't already exist).
>>
>> The name may contain a / (e.g. "db/pgsql" or "db/$auto") indicating that
>> it's requesting a sub-cgroup.
> Since cgroups are represented by directories, we're either going to have to
> require that the name be quoted, or only support cgroups without spaces in them.
> I think quoting is preferable as it provides full flexibility, for example:
> 
>    cgroup memory "my memory cgroup 1" soft_limit_in_bytes 1024
> 
>>
>> "$auto" is the recommended name and will have upstart generate a name
>> based on the job instance name.
> I think this is confusing: "$auto" is too suggestive of an environment
> variable. However, we can change that to simply 'auto' if we require cgroup
> names to be quoted, as mentioned above, since the bare word auto can then be
> safely special-cased:
> 
>     cgroup memory auto soft_limit_in_bytes 1024
> 
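To make the quoting rule concrete, here is a minimal Python sketch of a tokenizer for this stanza, assuming the rules proposed just above: tokens are split on whitespace, double quotes group words, and only an unquoted auto is treated specially. The names CgroupStanza, tokenize and parse_cgroup_stanza are hypothetical; this is not Upstart's actual (C) parser.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CgroupStanza:
    controller: str
    name: Optional[str]   # None when the bare word auto was used
    auto: bool
    key: Optional[str]
    value: Optional[str]


def tokenize(line: str) -> List[Tuple[str, bool]]:
    """Split on whitespace, honouring double quotes.

    Each token carries a flag recording whether any part of it was quoted,
    so that "auto" (quoted, literal) can be told apart from bare auto.
    """
    tokens: List[Tuple[str, bool]] = []
    buf: List[str] = []
    quoted = in_quotes = have_token = False
    for ch in line:
        if in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = quoted = have_token = True
        elif ch.isspace():
            if have_token:
                tokens.append(("".join(buf), quoted))
                buf, quoted, have_token = [], False, False
        else:
            buf.append(ch)
            have_token = True
    if in_quotes:
        raise ValueError("unterminated double quote")
    if have_token:
        tokens.append(("".join(buf), quoted))
    return tokens


def parse_cgroup_stanza(line: str) -> CgroupStanza:
    tokens = tokenize(line)
    if not tokens or tokens[0] != ("cgroup", False):
        raise ValueError("not a cgroup stanza")
    args = tokens[1:]
    # cgroup <controller> <name|auto> [<key> <value>]
    if len(args) not in (2, 4):
        raise ValueError("expected: cgroup <controller> <name|auto> [<key> <value>]")
    controller = args[0][0]
    name, name_quoted = args[1]
    auto = name == "auto" and not name_quoted  # only the bare word is special
    key, value = (args[2][0], args[3][0]) if len(args) == 4 else (None, None)
    return CgroupStanza(controller, None if auto else name, auto, key, value)
```

For example, parse_cgroup_stanza('cgroup memory "auto" soft_limit_in_bytes 1024') yields a cgroup literally named auto, whereas the unquoted form sets auto=True. Recording whether a token was quoted, rather than using something like shlex.split, is what lets the two cases be told apart.
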
>>
>> The main use of this field is for cases where a set of jobs should share
>> limits; in such cases the main job should declare the various values and
>> the others should simply refer to the cgroup by name without defining
>> any values.
>>
>> Within a single job, the name may differ between controllers but may not
>> differ within the same controller. Example:
>> valid =>    cgroup memory group1 limit_in_bytes 52428800
>>             cgroup cpuset group2 cpus 0-1
>>
>> invalid =>  cgroup memory group1 limit_in_bytes 52428800
>>             cgroup memory group2 soft_limit_in_bytes 1024
>>
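The rule that a name may not differ within the same controller could be enforced with a simple controller-to-name map built while reading a job's stanzas. A sketch, assuming the stanzas have already been reduced to (controller, name) pairs; check_names_per_controller is a hypothetical helper. Repeating a controller with the same name is accepted, since each stanza carries at most one key/value pair and that is how several keys get set on one cgroup.

```python
from typing import Dict, Iterable, Tuple


def check_names_per_controller(stanzas: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return the controller -> cgroup name map for one job.

    Raises ValueError if a controller is given two different names.
    """
    names: Dict[str, str] = {}
    for controller, name in stanzas:
        previous = names.setdefault(controller, name)
        if previous != name:
            raise ValueError(
                f"controller {controller!r} uses both {previous!r} and {name!r}")
    return names
```
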
>> == key ==
>> The cgroup control file minus the controller-name prefix, so for example
>> memory.soft_limit_in_bytes will become soft_limit_in_bytes.
>>
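As an illustration of the key-to-file mapping, the sketch below puts the controller prefix back and builds a control-file path. It assumes a cgroup v1 layout with each controller mounted at /sys/fs/cgroup/<controller>; that mount root and the helper name control_file_path are assumptions. Under this proposal Upstart would pass the key and value to the cgroup manager rather than writing the file itself.

```python
import posixpath


def control_file_path(controller: str, cgroup: str, key: str,
                      mount_root: str = "/sys/fs/cgroup") -> str:
    """Map a stanza key back to the control file it names.

    The stanza key omits the "<controller>." prefix, so it is re-added here.
    mount_root and the per-controller mount layout are assumptions.
    """
    filename = f"{controller}.{key}"
    return posixpath.join(mount_root, controller, cgroup, filename)
```

For example, control_file_path("blkio", "db", "throttle.write_bps_device") gives /sys/fs/cgroup/blkio/db/blkio.throttle.write_bps_device.
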
>> == value ==
>> Any value valid for the given control file; Upstart itself won't perform
>> any validation.
>>
>> If the value contains spaces, it should be enclosed in double quotes, e.g.:
>> cgroup devices auto allow "c 1:2 rwm"
>>
>>
>> Upstart won't have any controller-aware logic in its code; instead,
>> it'll simply talk over D-Bus (using a private D-Bus socket) to the cgroup
>> manager, which will take care of applying the various limits.
>> That cgroup manager will be started very early in the boot sequence. Any
>> job containing a cgroup stanza will be held until the manager has
>> started.
>>
>> The cgroup will be destroyed when a job is stopped and the cgroup isn't
>> shared with another job (i.e. its task count is 0 and it has no child
>> cgroups).
>>
>> It'll be possible to disable cgroup support entirely by either building
>> upstart without it (needed for non-Linux systems) or by passing
>> --no-cgroup as a parameter to upstart. In that case, the cgroup stanza
>> will simply be ignored and the jobs will start without limitations.
>>
>>
>> All of the above is also meant to apply to user sessions. The cgroup
>> manager will allow unprivileged cgroup configuration: as long as the
>> user has write access to a sub-section of a controller, they'll be
>> allowed to write entries there. As with other restriction stanzas,
>> failure to apply a cgroup limit in a user session won't be fatal.
>>
>>
>> Now a few examples to illustrate the thinking behind this proposal:
>>
>> == Single job simple example ==
>> === Job ===
>> cgroup memory $auto limit_in_bytes 52428800
>>
>> === Result ===
>> The job will only start once the manager is up and running and will have a
>> 50MB memory limit. If the system has less than 50MB of RAM, the job will
>> fail to start.
>>
>> == Single job complex example ==
>> === Job ===
>> cgroup memory $auto limit_in_bytes 52428800
>> cgroup cpuset $auto cpus 0-1
>> cgroup blkio slowio throttle.write_bps_device "8:16 1048576"
>>
>> === Result ===
>> The job will only start once the manager is up and running and will have a
>> 50MB memory limit, be restricted to CPU ids 0 and 1 and have a 1MB/s
>> write limit to the block device 8:16.
>> The job will fail to start if the system has less than 50MB of RAM or
>> less than 2 CPUs.
>>
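The byte counts in these examples are binary megabytes (MiB): 52428800 = 50 * 1024 * 1024 and 1048576 = 1024 * 1024. A small Python sketch, with hypothetical helper names, that reproduces the values used here and in the multiple-jobs example:

```python
MIB = 1024 * 1024  # the examples' "MB" are binary megabytes (MiB)


def mib(n: int) -> int:
    """Bytes in n MiB, as written to memory.limit_in_bytes."""
    return n * MIB


def write_bps_value(device: str, mib_per_sec: int) -> str:
    """Format a blkio throttle.write_bps_device value: "<major>:<minor> <bytes/s>"."""
    return f"{device} {mib(mib_per_sec)}"


print(mib(50))                     # 52428800, the 50MB limit
print(mib(100))                    # 104857600, the 100MB limit
print(write_bps_value("8:16", 1))  # 8:16 1048576, i.e. 1MB/s on device 8:16
```
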
>>
>> == Multiple jobs complex example ==
>> === Job 1 ===
>> cgroup cpuset db cpus 0-1
>> cgroup memory db limit_in_bytes 104857600
>> cgroup blkio db throttle.write_bps_device "8:16 1048576"
>>
>> === Job 2 ===
>> cgroup cpuset db/$auto cpus 1
We've realised that using a bare auto is going to be problematic in the
sub-cgroup scenario: if we require the name to be quoted, we have:

cgroup cpuset "db/"auto cpus 1

However, that syntax is rather odd since it looks wrong: most folk would expect
a space immediately before the word auto. In addition, it would be too easy to
inadvertently put the auto within the quotes, which would change the behavior
completely:

cgroup cpuset "db/auto" cpus 1

That (probably) isn't going to do what was intended since rather than creating a
sub-cgroup named 'db/<job details>', the sub-cgroup would be named literally
'db/auto'.

Stéphane and I have discussed this and the feeling is that we should embrace the
fact that $auto looks like a variable and support variable expansion in the
cgroup name token (in fact Scott already suggested this in [1]). Further, by
supporting a $UPSTART_CGROUP (*) variable (which would hold the unique name
Upstart chooses for the job instance's sub-cgroup in question), we have:

cgroup cpuset "db/$UPSTART_CGROUP" cpus 1

... or to create a literal 'auto' sub-cgroup:

cgroup cpuset "db/auto" cpus 1

Note that $UPSTART_CGROUP would map any slashes to underscores (as is done, for
example, by the logger when logging instance job output in /var/log/upstart/).
We would need to decide how best to handle a job that specifies a variable in
the cgroup name string whose value does contain a slash (a hard error would be
safest, of course).

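A minimal Python sketch of the expansion being proposed, assuming a "<job>-<instance>" form for $UPSTART_CGROUP (the mail only says Upstart picks a unique name; that exact form and the helper names are hypothetical). It implements the "hard error" option for other variables whose value contains a slash, and takes $UPSTART_CGROUP from the job details rather than from the environment, in line with the footnote below.

```python
import re
from typing import Mapping

# Matches $NAME or ${NAME}.
_VAR = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def upstart_cgroup(job: str, instance: str = "") -> str:
    """Hypothetical per-instance name, slashes mapped to underscores."""
    name = f"{job}-{instance}" if instance else job
    return name.replace("/", "_")


def expand_cgroup_name(raw: str, env: Mapping[str, str],
                       job: str, instance: str = "") -> str:
    def substitute(m: "re.Match[str]") -> str:
        var = m.group(1) or m.group(2)
        if var == "UPSTART_CGROUP":
            return upstart_cgroup(job, instance)
        if var not in env:
            raise ValueError(f"undefined variable ${var} in cgroup name")
        value = env[var]
        if "/" in value:
            # Hard error: a slash would silently add another cgroup level.
            raise ValueError(f"${var} expands to {value!r}, which contains '/'")
        return value

    return _VAR.sub(substitute, raw)
```

With this, "db/$UPSTART_CGROUP" for job postgresql, instance main/1 expands to db/postgresql-main_1, while the quoted literal "db/auto" passes through unchanged.
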
Thoughts?

>> cgroup memory db/$auto limit_in_bytes 52428800
>> cgroup blkio db/$auto throttle.write_bps_device "8:17 1048576"
>>
>> === Job 3 ===
>> cgroup cpuset db
>> cgroup memory db
>>
>> === Job 4 ===
>> cgroup cpuset db/$auto cpus 2
>>
>> == Result ==
>> This is rather complex, so let's go job by job:
>>  - Job 1 will start bound to CPUs 0 and 1 with a 100MB memory limit and
>>    1MB/s write limit to the 8:16 block device. It'll fail to start if
>>    the system has less than 2 CPUs or less than 100MB of RAM.
>>
>>  - Job 2 will start bound to CPU 1 and with a 50MB memory limit. It'll
>>    inherit the 1MB/s write limit to 8:16 and, on top of that, also
>>    rate-limit writes to 8:17 at 1MB/s.
>>    The job will fail to start if the system has less than 50MB of RAM or
>>    less than 2 CPUs.
>>
>>  - Job 3 will start in the "db" cpuset and memory cgroups. If it starts
>>    before Job 1, no limit will be applied at startup time. As soon as Job 1
>>    starts however Job 3 will be limited to 2 CPUs and 100MB of memory.
>>    As it doesn't have a blkio statement, its I/O won't be rate-limited.
>>
>>  - If started after Job 1, Job 4 will fail to start as it's requesting a
>>    CPU that the parent cgroup doesn't have access to. If started before
>>    Job 1, however, no parent value will have been set yet, so it will
>>    inherit the default and will start as long as the system has at least
>>    3 CPUs.
>>
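The cpuset outcomes for Jobs 2 and 4 come down to a subset check on CPU lists. The sketch below models the behaviour as described above (an unset parent means only the number of CPUs on the system matters), not the kernel's exact cpuset rules; parse_cpulist and child_cpus_allowed are hypothetical names.

```python
from typing import Optional, Set


def parse_cpulist(spec: str) -> Set[int]:
    """Parse a cpuset list such as "0-1", "2" or "0-3,6" into CPU ids."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"bad range {part!r}")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def child_cpus_allowed(child: str, parent: Optional[str], online: int) -> bool:
    """True if the child's requested CPUs fit within what is available.

    parent is None when no parent value has been set yet (Job 4 started
    before Job 1); the child is then only checked against the CPUs present.
    """
    available = parse_cpulist(parent) if parent is not None else set(range(online))
    return parse_cpulist(child) <= available
```

For instance, child_cpus_allowed("2", "0-1", online=4) is False (Job 4 after Job 1), while child_cpus_allowed("2", None, online=3) is True (Job 4 first, on a 3-CPU system).
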
>>
>>
>> I think this pretty much covers everything I have in mind at this point,
>> and I believe the above is flexible enough to work with all existing
>> controllers.
>>
>> Questions, comments and suggestions are very welcome!
>>
>>
>>
> 
> Thanks for documenting this!
> 
> Kind regards,
> 
> James.
> --
> James Hunt
> ____________________________________
> #upstart on freenode
> http://upstart.ubuntu.com/cookbook
> https://lists.ubuntu.com/mailman/listinfo/upstart-devel
> 

Kind regards,

James.

[1] - https://lists.ubuntu.com/archives/upstart-devel/2012-May/001877.html
(*) - $UPSTART_CGROUP would *not* be exported into the job's environment since,
if it were not used in the cgroup stanza, it would not correctly represent the
cgroup for the job instance.
