Sometimes we have to survive power loss, node loss, or a hard drive failure (be it HDD, SSD, or NVMe). In such cases, it is really important to ensure that we do not lose any important data.

It is easy to suggest using the default replication level of 3 (three copies, which tolerates losing 2 disks simultaneously) and letting the software deal with it. But replication 3 means 300% storage overhead, which drives our cost 3 times higher. We have been running HDFS and Ceph with replication 2 (200% overhead), keeping disks in service until they die, sometimes past 8 years of lifetime. This mode of operation showed that when we lose 2 disks at once, we lose data and have to identify the lost files. We also have EC (erasure-coded) based pools, and depending on the data path and data importance, different pools can sustain 2, 3, or 4 disk losses.

In the end, it all comes down to the cost in $/GB and how many simultaneous disk losses you can afford to sustain.
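
As a rough sketch of how these durability levels map to Ceph settings (the pool and profile names below are just examples, not taken from the cluster shown later): a replicated pool's tolerance is controlled by its "size", while an erasure-coded pool's tolerance is the "m" value of its EC profile.

# ceph osd pool set mypool size 3
# ceph osd erasure-code-profile set k10m3 k=10 m=3
# ceph osd pool create my-ec-pool 64 64 erasure k10m3

A replicated pool with size 3 survives 2 simultaneous disk losses at 300% overhead; a k=10, m=3 EC pool survives 3 losses at roughly 130% overhead.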

In Ceph, the best way to check the cluster status is the "ceph -s" command.

# ceph -s
  cluster:
    id:     229a17bd-8149-41f9-84cd-cec7bbe82853
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive, 3 pgs down

  services:
    mon: 6 daemons, quorum node05,node04,node01,node02,node03,node06 (age 19h)
    mgr: node01(active, since 46m), standbys: node03, node02
    mds: cephfs:1 {0=node11=up:active}
    osd: 22 osds: 18 up (since 28m), 18 in (since 18m)

  data:
    pools:   4 pools, 81 pgs
    objects: 20.73k objects, 67 GiB
    usage:   122 GiB used, 65 TiB / 65 TiB avail
    pgs:     3.704% pgs not active
             78 active+clean
             3  down
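
To see exactly which PGs are behind the HEALTH_WARN, "ceph health detail" expands each warning and lists the affected PGs by ID:

# ceph health detail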

We can clearly see that we have 3 PGs in an inactive/down state. You can also see that only 18 of the 22 OSDs are up and in, so 4 OSDs are down. First of all, try to identify all the lost OSDs and recover them if possible. To find out which OSDs are down, use the "ceph osd tree" command.

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME        STATUS  REWEIGHT  PRI-AFF
 -1         80.05112  root default
 -3          7.27737      host node01
  0    hdd   3.63869          osd.0      down         0  1.00000
  1    hdd   3.63869          osd.1      down         0  1.00000
 -5          7.27737      host node02
  2    hdd   3.63869          osd.2      down         0  1.00000
 21    hdd   3.63869          osd.21     down         0  1.00000
 -7          7.27737      host node03
  3    hdd   3.63869          osd.3        up   1.00000  1.00000
  4    hdd   3.63869          osd.4        up   1.00000  1.00000
 -9          7.27737      host node04
  5    hdd   3.63869          osd.5        up   1.00000  1.00000
  6    hdd   3.63869          osd.6        up   1.00000  1.00000
-11          7.27737      host node05
  7    hdd   3.63869          osd.7        up   1.00000  1.00000
  8    hdd   3.63869          osd.8        up   1.00000  1.00000
...

At this point, we can try to identify why the OSDs on node01 and node02 are down and attempt to recover them. You can pull a failed disk, attach it to another machine, try to recover the data, and copy it to another OSD, but this is time-consuming and there is no guarantee it will work. If the disks are completely lost and there is no way to recover the data from them, we need to identify the lost files and invalidate them (or ask the end-users to reproduce the data).
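
If the OSD daemons merely crashed, restarting them may be enough to bring the PGs back. The commands below are a sketch assuming a package-based (non-containerized) deployment with systemd units named ceph-osd@<id>; marking an OSD as lost is irreversible, so only do it once you are certain the disk is gone for good.

# systemctl status ceph-osd@0
# journalctl -u ceph-osd@0 --since "1 hour ago"
# systemctl restart ceph-osd@0
# ceph osd lost 0 --yes-i-really-mean-it
# ceph osd purge 0 --yes-i-really-mean-it

("purge" removes the dead OSD from the CRUSH map, its auth key, and the OSD map, so the disk can be replaced cleanly.)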

First of all, let's identify all PGs in the inactive state:

# ceph pg dump_stuck inactive
ok
PG_STAT  STATE  UP                                     UP_PRIMARY  ACTING                                 ACTING_PRIMARY
14.14     down   [7,12,15,16,14,18,5,17,4,6,11,13,19]           7   [7,12,15,16,14,18,5,17,4,6,11,13,19]               7
14.8      down  [10,9,16,3,17,14,12,18,15,7,20,19,13]          10  [10,9,16,3,17,14,12,18,15,7,20,19,13]              10
14.c      down    [16,18,7,20,4,15,17,19,8,9,11,3,14]          16    [16,18,7,20,4,15,17,19,8,9,11,3,14]              16
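
Before treating a PG as lost, you can query it to see why it is down; the output (not shown here) includes the peering state and which down OSDs the PG is still waiting for:

# ceph pg 14.14 query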

And for each PG, identify which files it was storing:

# cephfs-data-scan pg_files / <PG>
//eck10m3/8mb/file699.bin

Now that you know which file was affected, you can either remove it (if you have a copy elsewhere) or inform the end-user that the file is lost and there is no way to restore it.
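
You can also cross-check a single file in the other direction: CephFS names a file's objects after its inode number in hex, so the file's first object can be mapped back to a PG with "ceph osd map". This is only a sketch, assuming the filesystem is mounted at /mnt/cephfs and the data pool is named eck10m3 (both are assumptions, not taken from the output above); note that larger files have one object per 4 MiB stripe by default, so a file may span several PGs.

# INO_HEX=$(printf '%x' "$(stat -c %i /mnt/cephfs/eck10m3/8mb/file699.bin)")
# ceph osd map eck10m3 "${INO_HEX}.00000000"

If the PG reported by "ceph osd map" is one of the down PGs listed earlier, the file is indeed affected.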

To simplify, here is a small script to list all lost files.

# ALLPGS=""; for PG in $(ceph pg dump_stuck inactive | awk '{print $1}' | sed -e '1,1d'); do ALLPGS="$ALLPGS $PG"; done
# echo $ALLPGS
14.14 14.8 14.c
# cephfs-data-scan pg_files / $ALLPGS
//eck10m3/8mb/file699.bin
//eck10m3/8mb/file201.bin
//eck10m3/8mb/file2277.bin
//eck10m3/8mb/file477.bin
//eck10m3/8mb/file355.bin
//eck10m3/8mb/file478.bin
//eck10m3/8mb/file1642.bin

P. S. DO THIS AT YOUR OWN RISK!

Written by jbalcas
