CephFS Backend for Hadoop

Dmitrii Shcherbakov dmitrii.shcherbakov at canonical.com
Wed Jul 26 07:28:21 UTC 2017

Hi {James, Patrizio},

Be careful about using CephFS in production before Ceph Luminous, though (RC

Although CephFS was declared stable in Jewel,

"This is the first release in which CephFS is declared stable! Several
features are disabled by default, including snapshots and multiple active
MDS servers"

having multiple active MDS servers is considered experimental for anything
prior to Luminous (12.2.x), and running in one-active/multiple-standby mode
has certain issues (scalability, performance, availability):


"For the best chance of a happy healthy filesystem, use a single active MDS
and do not use snapshots. Both of these are the default. Note that creating
multiple MDS daemons is fine, as these will simply be used as standbys.
However, for best stability you should avoid adjusting max_mds upwards, as
this would cause multiple daemons to be active at once."

"Prior to the Luminous (12.2.x) release, running multiple active metadata
servers within a single filesystem was considered experimental. Creating
multiple active metadata servers is now permitted by default on new

"Multiple active MDS daemons is now considered stable. The number
of active MDS servers may be adjusted up or down on an active CephFS file

"Even with multiple active MDS daemons, a highly available system still
requires standby daemons to take over if any of the servers running an
active daemon fail."

As far as I can see, a Ceph filesystem's metadata will be sharded across
multiple MDS servers if so configured. A multi-MDS setup therefore does not
remove the need for standby servers and failover: it provides more
parallelism, but MDS high availability is still needed for individual

"Each CephFS filesystem has a number of ranks, one by default, which start
at zero. A rank may be thought of as a metadata shard. Controlling the
number of ranks in a filesystem is described in Configuring multiple active
MDS daemons."

"Each file system may specify a number of standby daemons to be considered
healthy. This number includes daemons in standby-replay waiting for a rank
to fail (remember that a standby-replay daemon will not be assigned to take
over a failure for another rank or a failure in another CephFS file

Also, if you need multiple CephFS file systems, it looks like you will need
this many MDS instances: <num_shards> * <num_standby_per_shard> *
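
A rough way to put numbers on this kind of sizing is sketched below. It is
only an illustration under stated assumptions: every rank gets one active
daemon plus a fixed number of dedicated standbys, and no standby is shared
between ranks or file systems. The function and parameter names are
hypothetical and do not come from Ceph or the charms.

```python
def mds_daemons_needed(num_filesystems: int, ranks_per_fs: int,
                       standbys_per_rank: int) -> int:
    """Estimate how many ceph-mds daemons a deployment would need.

    Assumption (not from the Ceph docs): each rank has one active daemon
    plus `standbys_per_rank` dedicated standbys, with no sharing.
    """
    if num_filesystems < 1 or ranks_per_fs < 1:
        raise ValueError("need at least one file system and one rank")
    if standbys_per_rank < 0:
        raise ValueError("standbys_per_rank cannot be negative")
    # A daemon holds only one rank at a time, so each rank needs its own
    # active daemon in addition to its standbys.
    daemons_per_rank = 1 + standbys_per_rank
    return num_filesystems * ranks_per_fs * daemons_per_rank


if __name__ == "__main__":
    # Two file systems, two ranks each, one standby per rank.
    print(mds_daemons_needed(2, 2, 1))
```

If standbys are pooled across ranks instead of dedicated, the real number
would be lower than this estimate.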

"Each CephFS ceph-mds process (a daemon) initially starts up without a
rank. It may be assigned one by the monitor cluster. A daemon may only hold
one rank at a time. Daemons only give up a rank when the ceph-mds process

It is interesting how rank assignment is performed by the monitor cluster -
I would very much like to avoid cases where you have multiple or all ranks
of a single file system stored on one machine with multiple active MDS
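
The placement concern above can be checked after the fact with something
like the following sketch. It assumes the current rank placement has
already been collected by some other means into (file system, rank, host)
tuples; that layout and the function name are hypothetical, not a Ceph or
Juju API.

```python
from collections import defaultdict


def hosts_with_multiple_ranks(active_ranks):
    """Find hosts that hold more than one active rank of one file system.

    `active_ranks` is an iterable of (fs_name, rank, host) tuples; this
    input format is an assumption made for the sketch.
    Returns {(fs_name, host): [ranks...]} for the offending hosts only.
    """
    ranks_by_fs_host = defaultdict(list)
    for fs_name, rank, host in active_ranks:
        ranks_by_fs_host[(fs_name, host)].append(rank)
    return {
        key: sorted(ranks)
        for key, ranks in ranks_by_fs_host.items()
        if len(ranks) > 1
    }


if __name__ == "__main__":
    placement = [
        ("cephfs", 0, "node-a"),
        ("cephfs", 1, "node-a"),  # second rank of the same fs on node-a
        ("cephfs", 2, "node-b"),
    ]
    print(hosts_with_multiple_ranks(placement))
```

An empty result means no machine carries more than one active rank of the
same file system.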


I think the scope of work in charm-cephfs would be to:

   - implement standby MDS configuration;
   - implement multi-active MDS configuration.
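
As a minimal sketch of what those two items might translate to, the helper
below builds the `ceph fs set` command lines for `max_mds` and
`standby_count_wanted`, assuming a Luminous cluster. The function name and
its parameters are hypothetical and are not existing charm-cephfs options;
older releases or older file systems may need additional steps before
multiple active MDS daemons are allowed.

```python
def mds_config_commands(fs_name, max_active_mds, standby_count_wanted):
    """Build ceph CLI invocations for multi-active and standby settings.

    The commands are returned as argument lists and are not executed here.
    Parameter names are illustrative, not real charm config keys.
    """
    if max_active_mds < 1:
        raise ValueError("max_active_mds must be at least 1")
    if standby_count_wanted < 0:
        raise ValueError("standby_count_wanted cannot be negative")
    return [
        # Number of ranks (active MDS daemons) for the file system.
        ["ceph", "fs", "set", fs_name, "max_mds", str(max_active_mds)],
        # Number of standby daemons the file system wants to be healthy.
        ["ceph", "fs", "set", fs_name, "standby_count_wanted",
         str(standby_count_wanted)],
    ]


if __name__ == "__main__":
    for cmd in mds_config_commands("cephfs", 2, 1):
        print(" ".join(cmd))
```

Returning argument lists keeps the sketch side-effect free; a charm hook
could pass each list to `subprocess.check_call` once the values have been
validated.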

Best Regards,
Dmitrii Shcherbakov

Field Software Engineer
IRC (freenode): Dmitrii-Sh

On Wed, Jul 26, 2017 at 9:14 AM, Patrizio Bassi <patrizio.bassi at gmail.com>
wrote:

> On Wed, Jul 26, 2017 at 06:28, James Beedy <jamesbeedy at gmail.com>
> wrote:
>> Hello all,
>> I will be evaluating CephFS as a backend for Hadoop over the next few
>> weeks, and will probably start investigating in the morning how this can
>> be delivered via the charms. If anyone has ventured into this realm, or
>> has an idea of what the best way to deliver this might be, I would love
>> to hear from you.
>> Thanks,
>> James
> I do!
> I probably won't be able to test before the end of the year, but I plan to
> host Hadoop clusters in OpenStack tenants, and I would like to share the
> same Ceph OSDs that provide infrastructure storage to OpenStack Nova/Cinder.
> Deploying Hadoop via Juju in an OpenStack tenant requires a separate model
> (as far as I could design it).
> So we may use the new Juju 2.2 cross-model relations to relate the Hadoop
> charms to the OpenStack Ceph units.
> Does that sound feasible?
> Regards,
> Patrizio