We maintain an HDFS distributed file system that handles large data sets on commodity hardware. Our Hadoop deployment is spread across compute nodes and JBODs; currently there are more than 240 nodes, with anywhere from 3 to 120 disks per node. Going forward, we continue to separate storage from compute nodes for higher availability (running storage on dedicated nodes has proven more stable than mixing compute and storage on the same machines). At the time of writing we maintain more than 8 PB of raw disk space, with replication set to 2. Future upgrades are planned to use erasure coding and increase the usable disk space.
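
To put the planned move to erasure coding in perspective, here is a back-of-the-envelope sketch of usable capacity for the roughly 8 PB of raw space mentioned above, comparing replication with the EC profiles discussed later in this post (EC(6,1) and EC(6,2)). The numbers are illustrative only and ignore real-world overheads such as free-space headroom:

```python
# Rough usable-capacity estimate: raw capacity divided by the storage
# overhead of each redundancy scheme. Illustrative numbers only.
RAW_PB = 8.0  # raw disk space quoted above

def usable(raw_pb, overhead):
    """Usable capacity given a storage overhead factor (raw / usable)."""
    return raw_pb / overhead

print(f"replication 2 : {usable(RAW_PB, 2.0):.1f} PB")          # two full copies
print(f"replication 3 : {usable(RAW_PB, 3.0):.1f} PB")          # three full copies
print(f"EC(6,2)       : {usable(RAW_PB, (6 + 2) / 6):.1f} PB")  # overhead 8/6
print(f"EC(6,1)       : {usable(RAW_PB, (6 + 1) / 6):.1f} PB")  # overhead 7/6
```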

We also run a small Ceph cluster as a storage system for user workloads and user home directories, benefitting from a stable, high-performance parallel file system. The current system, based on enterprise-grade hard disks, has peaked at a throughput of up to 2 GB/s under a highly parallel analysis workload; future use of solid-state drives is expected to further boost the science capability of the system. The Ceph cluster provides 288 TB overall. There are 6 Ceph monitors, 1 MDS, and 3 storage nodes, each with:

  • 2 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz with 512 GB memory
  • 12 x 8 TB SAS disks
  • 2 x 6.4 TB Intel SSD DC P4608 (SSDs are used for block devices)
  • 2 x Mellanox 40G NICs (in bond mode)

Ceph test cluster (used to test erasure coding, failure scenarios, and new releases):

  • 11 nodes (for monitors, metadata servers, managers, OSDs):
    • 32 cores, AMD Opteron(tm) Processor 6378
    • 94 GB RAM
    • 2 OSDs each (3.7 TB)
    • Network: 1 Gb/s (all connected to the same switch)
  • 1 node (for testing):
    • 32 cores, AMD Opteron(tm) Processor 6378
    • 94 GB RAM
    • 2 OSDs (3.7 TB)
    • Network: 10 Gb/s (connected to the same switch)

Main HDFS issues in our environment

  • Open source:
    • HDFS/OSG issue: Cloudera used to distribute source RPMs for CDH 5, which contains Hadoop 2.6.0. With CDH 6, which contains Hadoop 3.0.0, they only distribute binary RPMs and require Cloudera Manager for installation and license management. (We did try installing HDFS from the plain tarball, but that does not look like a production-ready installation.) Ceph, by contrast, is fully open source (clone, modify, propose, build, redeploy).
  • Reliable and scalable:
    • HDFS did not have rolling upgrades until the 3.0 release, and the NameNode metadata is a single point of failure (or you have to run an active/standby pair with shared storage over NFS).
  • Erasure coding: replication of 2 or 3 is expensive, and erasure coding would gain space on the same hardware (see the capacity sketch above). We could move to HDFS 3.x, but its erasure coding still does not work for small files.
  • POSIX compliance is needed for user home directories and analysis facilities.
  • One file system for many purposes: block devices (for containers, VMs) plus a single FS.
  • The latest C libhdfs does not support xattrs.

Why Ceph?

  • Open source: YES.
  • Reliable and scalable: YES (you can scale any component, and there is no single point of failure).
  • Erasure coding: YES (we can control EC based on directory layout, e.g. /store/user is EC(6,1) and /store/group is EC(6,2); see the xattr sketch after this list).
  • POSIX compliance for user home dirs and analysis facilities: YES (works perfectly for our needs). Some POSIX differences are described here: https://docs.ceph.com/docs/master/cephfs/posix/
  • One FS for many purposes: block devices (for containers, VMs), a single FS, and object storage for future analysis facilities.
  • xattrs can be used for CephFS control (chunk size, stripes, EC pool), checksums, and any other custom attributes.
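
To make the last two bullets concrete, below is a minimal sketch of per-directory layout control from a client that has the file system mounted. It assumes a CephFS mount at /mnt/cephfs and an erasure-coded data pool named cephfs_ec62_data that has already been created and added to the file system; the pool name, paths, and the checksum attribute are hypothetical examples, not our actual configuration.

```python
import os

# Assumption: CephFS is mounted at /mnt/cephfs and the EC(6,2) data pool
# "cephfs_ec62_data" (hypothetical name) was already added to the file system.
group_dir = "/mnt/cephfs/store/group"

# New files created under this directory will go to the EC data pool.
os.setxattr(group_dir, b"ceph.dir.layout.pool", b"cephfs_ec62_data")

# Striping parameters are controlled through the same layout xattrs.
os.setxattr(group_dir, b"ceph.dir.layout.stripe_unit", b"4194304")  # 4 MiB
os.setxattr(group_dir, b"ceph.dir.layout.stripe_count", b"2")

# Arbitrary custom attributes (e.g. a checksum) can be stored alongside;
# the file name here is a hypothetical example.
os.setxattr(group_dir + "/example.root", b"user.adler32", b"0x1a2b3c4d")

# Read the effective layout back.
print(os.getxattr(group_dir, b"ceph.dir.layout.pool").decode())
```

The layout only applies to files created after it is set; existing files keep the layout they were written with.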

Our plan for moving from HDFS to Ceph

We plan to migrate all of our storage and run multiple CephFS deployments. This requires some changes in our Puppet setup: how it is deployed, which OSD node belongs to which file system, and some caveats around the different purpose-specific requirements (I will cover in more detail in another blog post why we are not using Ceph's multiple-file-systems feature).
In production, the plan is to make sure we can mount each separate file system on all nodes (a rough mount sketch follows the list below). We will have separate Ceph clusters based on purpose:

  • User home directories (Analysis facility) – ~300 TB raw HDD + 12 TB SSD
  • VM management (between 9 machines) – ~50 TB raw HDD + 40 TB SSD
  • CMS storage (~240 mixed nodes) – 9 PB raw HDD + 200 TB SSD
  • SDN testbed (20 nodes) – ~200 TB SSD
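
As a rough illustration of what "mount each separate file system on all nodes" could look like with the kernel CephFS client (in practice this is driven by our Puppet setup), here is a sketch. The monitor addresses, client names, secret-file paths, and mount points are placeholders, not our real configuration:

```python
import subprocess

# Placeholder monitor lists, client names, secret files, and mount points;
# one entry per purpose-specific Ceph cluster. Must be run as root.
CLUSTERS = {
    "/ceph/home":  ("10.0.1.1:6789,10.0.1.2:6789,10.0.1.3:6789", "homefs", "/etc/ceph/homefs.secret"),
    "/ceph/vm":    ("10.0.2.1:6789,10.0.2.2:6789,10.0.2.3:6789", "vmfs",   "/etc/ceph/vmfs.secret"),
    "/ceph/store": ("10.0.3.1:6789,10.0.3.2:6789,10.0.3.3:6789", "cmsfs",  "/etc/ceph/cmsfs.secret"),
}

for mountpoint, (mons, client, secretfile) in CLUSTERS.items():
    cmd = [
        "mount", "-t", "ceph", f"{mons}:/", mountpoint,
        "-o", f"name={client},secretfile={secretfile}",
    ]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```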

The only CephFS we have in production right now is the Analysis facility; as we move forward and install Ceph for the other purposes, we will cover their installation and issues in more detail.

Written by jbalcas
