Ceph can issue many health messages, and one of them is “daemons have recently crashed”. If this warning appears in your “ceph -s” output, it means that one or more recent crashes have not yet been archived (acknowledged) by the administrator. You can examine the crashes and report them to the Ceph community. Until the crashes are archived, the cluster will remain in the HEALTH_WARN state (see the config parameters below if you want to tune or disable this behavior). For example, the Ceph cluster below shows that 2 daemons have recently crashed:

# ceph -s
  cluster:
    id:     229a17bd-8149-41f9-84cd-cec7bbe82853
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive, 3 pgs down
            2 daemons have recently crashed

  services:
    mon: 6 daemons, quorum node05,node04,node01,node02,node03,node06 (age 19h)
    mgr: node01(active, since 42m), standbys: node03, node02
    mds: cephfs:1 {0=node11=up:active}
    osd: 22 osds: 18 up (since 24m), 18 in (since 14m)

  data:
    pools:   4 pools, 81 pgs
    objects: 20.73k objects, 67 GiB
    usage:   122 GiB used, 65 TiB / 65 TiB avail
    pgs:     3.704% pgs not active
             78 active+clean
             3  down

To list all crashes, use “ceph crash ls”; to list only the new (unarchived) crashes, use “ceph crash ls-new”:

# ceph crash ls
ID                                                                ENTITY      NEW
2021-01-06_21:21:02.775248Z_a38a3ac2-a8ae-4649-9884-bd7e36b16475  mon.node04
2021-01-08T01:09:01.282029Z_9b26da30-61d4-412d-a37c-53e22a9fa943  mon.node01   *
2021-01-08T01:09:01.342756Z_79c3fa2e-b43f-49c9-b6bd-1e72bd464072  mon.node01   *

To view the details of a particular crash, use “ceph crash info <id>”:

# ceph crash info 2021-01-08T01:09:01.342756Z_79c3fa2e-b43f-49c9-b6bd-1e72bd464072
{
    "assert_condition": "session_map.sessions.empty()",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el7/BUILD/ceph-15.2.8/src/mon/Monitor.cc",
    "assert_func": "virtual Monitor::~Monitor()",
    "assert_line": 262,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el7/BUILD/ceph-15.2.8/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7fb84bb2e340 time 2021-01-07T17:09:01.277083-0800\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el7/BUILD/ceph-15.2.8/src/mon/Monitor.cc: 262: FAILED ceph_assert(session_map.sessions.empty())\n",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "(()+0xf630) [0x7fb840b3b630]",
        "(gsignal()+0x37) [0x7fb83f91a387]",
        "(abort()+0x148) [0x7fb83f91ba78]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7fb842d5553e]",
        "(()+0x2696b7) [0x7fb842d556b7]",
        "(Monitor::~Monitor()+0x846) [0x5586610f6f76]",
        "(Monitor::~Monitor()+0x9) [0x5586610f6fc9]",
        "(main()+0x260a) [0x558661084b3a]",
        "(__libc_start_main()+0xf5) [0x7fb83f906555]",
        "(()+0x230590) [0x5586610b5590]"
    ],
    "ceph_version": "15.2.8",
    "crash_id": "2021-01-08T01:09:01.342756Z_79c3fa2e-b43f-49c9-b6bd-1e72bd464072",
    "entity_name": "mon.node01",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "7 (Core)",
    "os_version_id": "7",
    "process_name": "ceph-mon",
    "stack_sig": "ab152f150a1094de7164dc4d81b3e4907557de44a84f01c9e06b92d85166f1ef",
    "timestamp": "2021-01-08T01:09:01.282029Z",
    "utsname_hostname": "node01.tier2",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1127.13.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 23 15:46:38 UTC 2020"
}

To archive (acknowledge) a single crash, or all crashes at once, use one of the following:

# ceph crash archive <id>
# OR
# ceph crash archive-all
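
Archiving can also be scripted. The sketch below is an assumption, not part of the official tooling; it relies on jq and on the CLI’s “-f json” output flag, and archives only the new crashes that belong to mon daemons:

```shell
# Archive every unarchived crash that belongs to a mon daemon.
# Assumes jq is installed; "-f json" asks the Ceph CLI for JSON output.
ceph crash ls-new -f json \
  | jq -r '.[] | select(.entity_name | startswith("mon.")) | .crash_id' \
  | while read -r id; do
      ceph crash archive "$id"
    done
```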

These commands archive the crashes, which clears the health warning; archived crashes remain visible in “ceph crash ls” but no longer appear in “ceph crash ls-new”. Two config parameters control how long a crash counts as “recent” and how long crash records are kept before being purged from the system:

  • mgr/crash/warn_recent_interval (default: 2 weeks) – controls how long after a crash the RECENT_CRASH health warning is raised.
  • mgr/crash/retain_interval (default: 1 year) – controls how long to keep crash records before they are fully purged from the system.
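
Both parameters can be changed at runtime with “ceph config set”. The values below are purely illustrative (in seconds); this is a sketch, not a recommendation:

```shell
# Raise the RECENT_CRASH warning for only 3 days instead of 2 weeks,
# and keep crash records for 30 days before they are purged.
# Values are in seconds and chosen purely for illustration.
ceph config set mgr mgr/crash/warn_recent_interval 259200
ceph config set mgr mgr/crash/retain_interval 2592000
```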

Written by jbalcas
