Description:

A year ago I deployed a simple Ceph environment (one monitor, two object storage daemons) for a quick overview of its main features. Today I want to ask Ceph some unusual questions.

For this, I created some test environments:

  • 1 mon + 2 osd
  • 2 mon + 2 osd
  • 3 mon + 2 osd
  • 3 mon + 3 osd
  • 3 mon + 4 osd
  • 3 mon + 5 osd

All tests were performed with random data, which was verified with md5sum/sha1sum. A development version of Wireshark was used to analyze the traffic.


Q: What happens to the cluster if there are not enough monitors to hold an election (no majority exists)?
A: The cluster ends up in an inactive/non-working state. A quorum requires a majority of the monitors in the monmap (floor(n/2)+1, i.e. 2 out of 3 here). The remaining monitor will keep trying to hold elections, but it will not respond to requests from management tools/clients. It can still be queried through its admin socket:

[root@cmon3 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.cmon3.asok  mon_status
{ "name": "cmon3",
  "rank": 2,
  "state": "probing",
  "election_epoch": 36,
  "quorum": [],
  "outside_quorum": [
        "cmon3"],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 3,
      "fsid": "6f20cf97-7c73-42df-a0ca-4a5db48e5fde",
      "modified": "2015-09-04 13:34:53.863401",
      "created": "2015-09-04 10:46:28.612081",
      "mons": [
            { "rank": 0,
              "name": "cmon1",
              "addr": "10.0.100.101:6789\/0"},
            { "rank": 1,
              "name": "cmon2",
              "addr": "10.0.100.102:6789\/0"},
            { "rank": 2,
              "name": "cmon3",
              "addr": "10.0.100.103:6789\/0"}]}}


Q: The client is sending data to RADOS and all the monitors suddenly fail. What happens to the data transfer?
A: Nothing. The client talks directly to the primary OSD; the monitors are not in the data path.
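
The client computes the object-to-OSD mapping itself with CRUSH and only needs the monitors to fetch updates of the cluster maps. A small sketch of inspecting that mapping (pool and object names are placeholders):

# which placement group and which OSDs does this object map to?
ceph osd map testpool testobject

# the placement logic itself lives in the CRUSH map, which can be dumped and decompiled
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt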

Q: The client is sending data to RADOS and the primary OSD suddenly fails (osd pool default min size = 1). What happens to the data transfer?
A: The connection is switched to the next available OSD (which becomes the new primary) and the data is transferred completely.
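
For reference, min_size is the number of replicas that must be available for a PG to accept I/O. A sketch of setting it (pool name "testpool" is just an example):

# default for newly created pools, in ceph.conf [global]
osd pool default min size = 1

# or per existing pool, at runtime
ceph osd pool set testpool min_size 1
ceph osd pool get testpool min_size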

Q: What has to happen (examples of situations) for an OSD to change its status to "up out"?
A: It means the OSD daemon is running ("up") but is excluded from data placement ("out"). This situation arises, for example, when the process of removing the OSD from the cluster has been started.
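
A sketch of reproducing the "up out" state by hand (osd id 4 is just an example):

# exclude the OSD from data placement; the daemon keeps running, so it stays "up"
ceph osd out 4
ceph osd tree    # the OSD is still shown as up, but with reweight 0

# put it back into the cluster
ceph osd in 4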

Q: Can the client keep working during the recovery process?
A: Yes, as the output below shows: client I/O continues while recovery is in progress.

Every 1.0s: ceph -s                                                                                                                      Sun Sep  6 06:34:03 2015

    cluster 6f20cf97-7c73-42df-a0ca-4a5db48e5fde
     health HEALTH_WARN 34 pgs recovering; 2 pgs recovery_wait; 38 pgs stuck unclean; 1 requests are blocked > 32 sec; recovery 1110/5084 objects degraded (21.833%)
     monmap e3: 3 mons at {cmon1=10.0.100.101:6789/0,cmon2=10.0.100.102:6789/0,cmon3=10.0.100.103:6789/0}, election epoch 36, quorum 0,1,2 cmon1,cmon2,cmon3
     osdmap e273: 5 osds: 5 up, 5 in
      pgmap v1097: 256 pgs, 3 pools, 10104 MB data, 2542 objects
            38424 MB used, 199 GB / 237 GB avail
            1110/5084 objects degraded (21.833%)
                   2 active  
                   2 active+recovery_wait
                 217 active+clean
                  34 active+recovering
                   1 active+clean+scrubbing
recovery io 10322 kB/s, 2 objects/s
  client io 1518 kB/s wr, 0 op/s
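
Recovery still competes with client traffic (note the "1 requests are blocked > 32 sec" above). If that becomes a problem, recovery can be throttled; a sketch with example values:

# lower recovery/backfill concurrency on all OSDs at runtime
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# or permanently, in ceph.conf [osd]
osd max backfills = 1
osd recovery max active = 1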


Q: What happens if the cluster network fails on all OSDs (while the public network stays alive)?
A: There is no failover to the public network. All OSDs will crash:

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: /usr/bin/ceph-osd() [0x99ce66]
 2: (()+0xf130) [0x7f29f3f00130]
 3: (gsignal()+0x37) [0x7f29f2b225d7]
 4: (abort()+0x148) [0x7f29f2b23cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f29f34269b5]
 6: (()+0x5e926) [0x7f29f3424926]
 7: (()+0x5e953) [0x7f29f3424953]
 8: (()+0x5eb73) [0x7f29f3424b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa86e0f]
 10: (Mutex::Lock(bool)+0x111) [0xa3d3c1]
 11: (OSD::shutdown()+0x83) [0x613e53]
 12: (OSD::handle_osd_map(MOSDMap*)+0x1784) [0x650944]
 13: (OSD::_dispatch(Message*)+0x43b) [0x652b3b]
 14: (OSD::ms_dispatch(Message*)+0x28f) [0x65344f]
 15: (DispatchQueue::entry()+0x57a) [0xb2810a]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0xa68bed]
 17: (()+0x7df5) [0x7f29f3ef8df5]
 18: (clone()+0x6d) [0x7f29f2be31ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

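For context, the split between the two networks is configured in ceph.conf; OSDs use the cluster network for replication and recovery traffic between each other, and the public network for clients and monitors. A sketch (subnets are placeholders):

[global]
public network  = 10.0.100.0/24
cluster network = 10.0.200.0/24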

Q: What happens if a client requests an object, but all servers that hold a replica of that object have stopped working?
A: The read simply blocks: the client receives the rest of the data as soon as the servers start working again.
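
A sketch of how to see which placement groups (and therefore which objects) are affected in such a situation:

ceph health detail          # lists the problematic pgs and their state
ceph pg dump_stuck stale    # another view of the stuck pgs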

Q: Can I restrict network bandwidth, like in DRBD?
A: No, you can't; Ceph itself has no option for that.
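
A possible workaround is to shape the traffic at the OS level, outside of Ceph. A sketch with tc (interface name and rate are just examples):

# limit outgoing traffic on the cluster-network interface to ~100 Mbit/s
tc qdisc add dev eth1 root tbf rate 100mbit burst 32kbit latency 400ms

# remove the limit again
tc qdisc del dev eth1 root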

Q: When writing, data is sent to the replicas in parallel. What about reading?
A: "Normally we read from the primary replica. We can also read from the "closest" replica or from a random replica."

To be continued ...