cgroup stanza: a proposal

Thu Dec 5 17:31:53 UTC 2013

On 29/11/13 18:06, Stéphane Graber wrote:
> Hello everyone,
> 
> I have now published the Cgroup specification on the Upstart wiki:
> http://upstart.ubuntu.com/wiki/Cgroup
We've made a few updates to the spec so please let us know if you have any comments.

Upstart plans to leverage the 'cgmanager' cgroup manager currently being
developed [1]. This facility is going to be a host-global [2], generic, cgroup
management system which will handle all cgroup requests. cgmanager will do this
by providing a (hopefully) standardised API which Upstart will consume.

Since the design for the cgmanager is still being finalised, we'll need to
refresh the upstart spec occasionally until that point is reached.

= Potential Issues =

== cgroup stanza syntax ==

As mentioned on [3], the cgroup syntax may change. We need to be very aware of
this and ensure that a suitable abstraction for the cgroup stanza values is used
if appropriate. Since the cgmanager authors are already discussing this issue
with the cgroup kernel subsystem maintainer, we should however get this "for
free" once the cgmanager spec is finalised.

== Non-blocking calls ==

An important consideration from the Upstart side is that Upstart must not
block when requesting services from cgmanager. Ideally, the cgmanager
would offer a callback-type interface to allow upstart to handle cgroup
creation/deletion events (both requested and indirectly notified).
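
To make the callback idea concrete, here is a rough Python sketch (Upstart
itself is C, so this only shows the shape). FakeCgManager, request_cgroup and
the "memory:db" result string are made up for illustration and are not the
cgmanager API, which is not finalised.

```python
import asyncio


class FakeCgManager:
    """Hypothetical stand-in for cgmanager; not its real API."""

    async def create(self, controller, name):
        await asyncio.sleep(0.01)  # simulate the manager doing some work
        return f"{controller}:{name}"


def request_cgroup(loop, manager, controller, name, on_done):
    """Send the request and return immediately; on_done runs on completion."""
    task = loop.create_task(manager.create(controller, name))
    task.add_done_callback(lambda t: on_done(t.result()))
    return task


def demo():
    events = []

    async def main():
        loop = asyncio.get_running_loop()
        task = request_cgroup(
            loop, FakeCgManager(), "memory", "db",
            lambda path: events.append(("created", path)),
        )
        # The caller is not blocked here and can keep handling other events.
        events.append(("request-sent", None))
        await task              # only so the demo waits for completion
        await asyncio.sleep(0)  # give any pending callback a chance to run

    asyncio.run(main())
    return events
```

The point is only that the request returns before the cgroup exists, and the
result arrives later through the callback.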

== New initctl command ==

The spec shows that we plan to add a new initctl command to notify Upstart that
the cgroup manager is available. This follows the existing pattern used by
notify-disk-writeable and notify-dbus-address. Dmitrijs has suggested we
consider a more generic "initctl notify <name> <value>" facility which I think
is a good idea as it could simplify re-exec handling internally and is a more
elegant solution. However, implementing a generic notify command could be
handled separately to the cgroup implementation [4].
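
As a rough sketch of how the existing commands could be aliased onto a generic
notify, assuming the generic form takes a name and an optional value: the
names on the right-hand side of ALIASES and the normalise() helper are
hypothetical, nothing here is implemented in initctl.

```python
# Existing (or proposed) command -> generic notification name.
# The generic names on the right are illustrative only.
ALIASES = {
    "notify-disk-writeable": "disk-writeable",
    "notify-dbus-address": "dbus-address",
    "notify-cgroup-manager-address": "cgroup-manager-address",
}


def normalise(argv):
    """Turn either command form into a (name, value) pair."""
    cmd, *args = argv
    if cmd == "notify":
        if not args:
            raise ValueError("notify needs a name")
        name = args[0]
        value = args[1] if len(args) > 1 else None
    elif cmd in ALIASES:
        name = ALIASES[cmd]
        value = args[0] if args else None
    else:
        raise ValueError(f"unknown command: {cmd}")
    return name, value
```

With this shape, every notification reaches a single internal handler keyed by
name, which is where the re-exec simplification would come from.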

=== Inotify notification for cgmanager socket ===

It would be cleaner if we could use inotify to avoid the need for
notify-cgroup-manager-address. However, if the well-known socket the cgmanager
creates is abstract, it has no filesystem entry, so we can't watch for it.
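
A small Python sketch of that distinction, assuming the socket is described by
a D-Bus style address string. The watchable_path() helper and the example
socket locations are hypothetical; only the "unix:path=" and "unix:abstract="
address keys come from D-Bus itself.

```python
def watchable_path(dbus_address):
    """Return the filesystem path inotify could watch, or None."""
    transport, _, params = dbus_address.partition(":")
    if transport != "unix":
        return None
    keys = dict(p.split("=", 1) for p in params.split(",") if "=" in p)
    # An "abstract=" socket lives in the abstract namespace and never
    # appears in the filesystem, so there is nothing to put a watch on.
    return keys.get("path")
```

So watchable_path("unix:path=/run/cgmanager/sock") gives a path we could watch
the parent directory of, while watchable_path("unix:abstract=/cgmanager") gives
None and we would still need the notify command.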

Kind regards,

James.

[1] - https://blueprints.launchpad.net/ubuntu/+spec/core-1311-cgroup-manager
[2] - One daemon will run per host, servicing not only the host's own cgroup
requests, but also those from any containerised guests running on that host.
[3] - See
http://lists.linuxcontainers.org/pipermail/lxc-devel/2013-November/006283.html
[4] - Albeit at the cost of adding a 3rd notify command that would need to be
aliased should we grow a generic notify command in the future.

> 
> This is based on my original proposal with the changes suggested on the
> mailing list.

> 
> On Wed, Nov 20, 2013 at 02:23:59PM -0500, Stéphane Graber wrote:
>> This morning at vUDS we discussed adding support for cgroups in Upstart.
>>
>> Before I go into details about the proposed stanza and overall
>> behaviour, I'd begin by saying that contrary to some other init systems,
>> our intent is solely related to resource controls which is the main goal
>> of cgroups. Process grouping and tracking will remain unaffected by the
>> addition of cgroup support.
>>
>> Cgroup support will be implemented by adding a new "cgroup" stanza which
>> will control the application of cgroup based restrictions to the job.
>> The limits will be applied to any of the scripts
>> (pre-start/post-start/job/pre-stop/post-stop) similar to what's done
>> with setuid/setgid/apparmor stanzas.
>>
>> Now my recommended format for the stanza, which I believe should be
>> flexible enough is:
>>  cgroup <controller> <cgroup name|$auto> [<key> <value>]
>>
>>
>> Detail on the fields:
>> == controller ==
>> Name of one of the cgroup controllers
>>
>> Currently the valid values are (but won't be hardcoded into upstart):
>>  - blkio
>>  - cpu
>>  - cpuacct
>>  - cpuset
>>  - devices
>>  - freezer
>>  - hugetlb
>>  - memory
>>  - perf_event
>>
>> == cgroup-name|$auto ==
>> Name of the cgroup to use (and create if non-existing)
>>
>> The name may contain a / (e.g. "db/pgsql" or "db/$auto") indicating that
>> it's requesting a sub-cgroup.
>>
>> "$auto" is the recommended name and will have upstart generate a name
>> based on the job instance name.
>>
>> The main use of that field is for cases where a set of jobs should share
>> limits. In such a case the main job should declare the various values and
>> the others should just refer to the cgroup by name without defining values.
>>
>> The name may be different for the various controllers but may not differ
>> within the same controller. Example:
>> valid =>    cgroup memory group1 limit_in_bytes 52428800
>>             cgroup cpuset group2 cpus 0-1
>>
>> invalid =>  cgroup memory group1 limit_in_bytes 52428800
>>             cgroup memory group2 soft_limit_in_bytes 1024
>>
>> == key ==
>> The cgroup control file minus the controller name, so for example
>> memory.soft_limit_in_bytes will become soft_limit_in_bytes.
>>
>> == value ==
>> Any value valid for the given control file, upstart itself won't perform
>> any validation.
>>
>> If the value contains spaces, it should be put between double-quotes, e.g.:
>> cgroup devices $auto allow "c 1:2 rwm"
>>
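A minimal Python sketch of parsing the stanza as described above (controller,
name, then an optional key/value pair with double-quoted values). CgroupStanza,
parse_cgroup_stanza and control_file are hypothetical names for illustration,
and the real parser would live in Upstart's C code.

```python
import shlex
from typing import NamedTuple, Optional


class CgroupStanza(NamedTuple):
    controller: str
    name: str
    key: Optional[str] = None
    value: Optional[str] = None


def parse_cgroup_stanza(line):
    """Parse 'cgroup <controller> <name> [<key> <value>]'."""
    tokens = shlex.split(line)  # honours double-quoted values
    if len(tokens) not in (3, 5) or tokens[0] != "cgroup":
        raise ValueError(f"malformed cgroup stanza: {line!r}")
    # No validation of controller, key or value: that is left to the manager.
    return CgroupStanza(*tokens[1:])


def control_file(stanza):
    """Rebuild the control file name by prepending the controller name."""
    if stanza.key is None:
        return None
    return f"{stanza.controller}.{stanza.key}"
```

For the devices example, the quoted value comes back as the single string
"c 1:2 rwm", and a memory stanza with key limit_in_bytes maps back to the
memory.limit_in_bytes control file.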
>>
>> Upstart won't have any controller aware logic in its code, instead,
>> it'll simply talk over dbus (using a private dbus socket) to the cgroup
>> manager which will take care of applying the various limits.
>> That cgroup manager will be started very early in the boot sequence. Any
>> job containing a cgroup stanza will be held until the manager is
>> started.
>>
>> The cgroup will be destroyed when a job is stopped and the cgroup isn't
>> shared with another job (task count is 0 and it has no child cgroup).
>>
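The destroy condition can be sketched in Python as below. This assumes direct
access to a cgroup v1 style directory with a "tasks" file, which is only an
illustration: in the proposal Upstart would ask the cgroup manager rather than
read the filesystem, and can_destroy() is a hypothetical helper.

```python
import os


def can_destroy(cgroup_dir):
    """True if the cgroup holds no tasks and has no child cgroups."""
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        has_tasks = any(line.strip() for line in f)
    # Child cgroups show up as sub-directories of the cgroup directory.
    has_children = any(entry.is_dir() for entry in os.scandir(cgroup_dir))
    return not has_tasks and not has_children
```

A cgroup still used by another job fails the first check, and a parent such as
"db" with a "db/<job>" child fails the second.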
>> It'll be possible to disable cgroup support entirely by either building
>> upstart without it (needed for non-Linux systems) or by passing
>> --no-cgroup as a parameter to upstart. In that case, the cgroup stanza
>> will simply be ignored and the jobs will start without limitations.
>>
>>
>> All of the above is also meant to apply to user sessions. The cgroup
>> manager will allow unprivileged cgroup configuration, so as long as the
>> user has write access to a sub-section of a controller, it'll be allowed
>> to write entries there. Similarly to other restriction stanzas, failure
>> to apply a cgroup limit in a user session won't be fatal.
>>
>>
>> Now a few examples to try and illustrate the thoughts behind that proposal:
>>
>> == Single job simple example ==
>> === Job ===
>> cgroup memory $auto limit_in_bytes 52428800
>>
>> === Result ===
>> The job will only start once the manager is up and running and will have a
>> 50MB memory limit. If the system has less than 50MB, the job will fail
>> to start.
>>
>> == Single job complex example ==
>> === Job ===
>> cgroup memory $auto limit_in_bytes 52428800
>> cgroup cpuset $auto cpus 0-1
>> cgroup blkio slowio throttle.write_bps_device "8:16 1048576"
>>
>> === Result ===
>> The job will only start once the manager is up and running and will have a
>> 50MB memory limit, be restricted to CPU ids 0 and 1 and have a 1MB/s
>> write limit to the block device 8:16.
>> The job will fail to start if the system has less than 50MB of RAM or
>> less than 2 CPUs.
>>
>>
>> == Multiple jobs complex example ==
>> === Job 1 ===
>> cgroup cpuset db cpus 0-1
>> cgroup memory db limit_in_bytes 104857600
>> cgroup blkio db throttle.write_bps_device "8:16 1048576"
>>
>> === Job 2 ===
>> cgroup cpuset db/$auto cpus 1
>> cgroup memory db/$auto limit_in_bytes 52428800
>> cgroup blkio db/$auto throttle.write_bps_device "8:17 1048576"
>>
>> === Job 3 ===
>> cgroup cpuset db
>> cgroup memory db
>>
>> === Job 4 ===
>> cgroup cpuset db/$auto cpus 2
>>
>> === Result ===
>> This is rather complex, so let's go job by job:
>>  - Job 1 will start bound to CPU 0 and 1 with a 100MB memory limit and
>>    1MB/s write limit to the 8:16 block device. It'll fail to start if
>>    the system has less than 2 CPUs or less than 100MB of RAM.
>>
>>  - Job 2 will start bound to CPU 1 and with a 50MB memory limit. It'll
>>    inherit the 1MB/s write limit to 8:16 and on top of that rate limit
>>    writes to 8:17, also at 1MB/s.
>>    The job will fail to start if the system has less than 50MB of RAM or
>>    less than 2 CPUs.
>>
>>  - Job 3 will start in the "db" cpuset and memory cgroups. If it starts
>>    before Job 1, no limit will be applied at startup time. As soon as Job 1
>>    starts however Job 3 will be limited to 2 CPUs and 100MB of memory.
>>    As it doesn't have a blkio statement, it won't have rate limited I/Os.
>>
>>  - Job 4 if started after Job 1 will fail to start as it's requesting a
>>    CPU that the parent cgroup doesn't have access to. If started before
>>    Job 1 however, it won't have a parent value set so will inherit the
>>    default and so will start so long as the system has at least 3 CPUs.
>>
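The cpuset behaviour of Jobs 2 and 4 comes down to a subset check against the
parent cgroup, sketched here in Python. parse_cpus() and child_allowed() are
hypothetical helpers, and treating an unset parent as "all online CPUs" is my
reading of "inherit the default" above.

```python
def parse_cpus(spec):
    """Expand a cpuset list such as '0-1' or '0,2-3' into a set of ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def child_allowed(parent_spec, child_spec, online_cpus):
    """True if every CPU the child asks for is available from its parent."""
    if parent_spec:
        parent = parse_cpus(parent_spec)
    else:
        # Parent has no value set yet: fall back to all CPUs on the system.
        parent = set(range(online_cpus))
    return parse_cpus(child_spec) <= parent
```

With Job 1 started ("0-1"), Job 2 asking for "1" passes and Job 4 asking for
"2" does not. With no parent value, Job 4 passes only if the system has at
least 3 CPUs.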
>>
>>
>> I think this pretty much covers all I've got in mind at this point, I
>> think the above is flexible enough to work with all existing
>> controllers.
>>
>> Questions, comments and suggestions are very welcome!
>>
>> -- 
>> Stéphane Graber
>> Ubuntu developer
>> http://www.ubuntu.com

--
James Hunt
____________________________________
#upstart on freenode
http://upstart.ubuntu.com/cookbook
https://lists.ubuntu.com/mailman/listinfo/upstart-devel