Description:
A year ago I deployed a simple Ceph environment (one monitor, two OSDs) to get a quick overview of its main features. Today I want to ask Ceph some unusual questions.
For this, I created several test environments:
- 1 mon + 2 osd
- 2 mon + 2 osd
- 3 mon + 2 osd
- 3 mon + 3 osd
- 3 mon + 4 osd
- 3 mon + 5 osd
All tests were performed with random data, which was verified with md5sum/sha1sum. A development version of Wireshark was used to analyze the traffic.
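For reference, a typical data check looked roughly like this (a sketch; the pool name "test" and the object name are placeholders, and the commands can be run from any node with a client keyring):
[root@cmon1 ~]# ceph osd pool create test 128
[root@cmon1 ~]# dd if=/dev/urandom of=/tmp/obj1.bin bs=1M count=100
[root@cmon1 ~]# md5sum /tmp/obj1.bin
[root@cmon1 ~]# rados -p test put obj1 /tmp/obj1.bin
[root@cmon1 ~]# rados -p test get obj1 /tmp/obj1.check
[root@cmon1 ~]# md5sum /tmp/obj1.check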
Q: What happens to the cluster if there are not enough monitors to form a majority for an election?
A: The cluster ends up in an inactive, non-working state. The surviving monitor keeps trying to hold elections, but it does not respond to requests from management tools or clients (see the mon_status output and the reproduction sketch below).
[root@cmon3 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.cmon3.asok mon_status
{ "name": "cmon3",
"rank": 2,
"state": "probing",
"election_epoch": 36,
"quorum": [],
"outside_quorum": [
"cmon3"],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": { "epoch": 3,
"fsid": "6f20cf97-7c73-42df-a0ca-4a5db48e5fde",
"modified": "2015-09-04 13:34:53.863401",
"created": "2015-09-04 10:46:28.612081",
"mons": [
{ "rank": 0,
"name": "cmon1",
"addr": "10.0.100.101:6789\/0"},
{ "rank": 1,
"name": "cmon2",
"addr": "10.0.100.102:6789\/0"},
{ "rank": 2,
"name": "cmon3",
"addr": "10.0.100.103:6789\/0"}]}}
Q: The client is sending data to RADOS and all of the monitors suddenly fail. What happens to the data transfer?
A: Nothing happens to it. The client talks directly to the primary OSD, so the transfer completes.
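You can see which OSD is the primary for a given object, and therefore which OSD the client talks to, with ceph osd map; it prints the placement group and the acting set of OSDs, primary first (the pool and object names below are just examples):
[root@cmon1 ~]# ceph osd map test obj1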
Q: The client is sending data to RADOS and the primary OSD suddenly fails (osd pool default min size = 1). What happens to the data transfer?
A: The connection is switched to the next available OSD and the data is transferred completely.
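This behaviour depends on min_size: with min_size = 1 a placement group keeps serving I/O even when only a single replica is left. It can be checked and changed per pool (again, "test" is just an example pool):
[root@cmon1 ~]# ceph osd pool get test min_size
[root@cmon1 ~]# ceph osd pool set test min_size 1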
Q: What has to happen (example situations) for an OSD to change its status to "up out"?
A: It means the OSD daemon is running (up) but is not part of the cluster (out). This typically happens when the process of removing the OSD from the cluster has been started.
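The same state can be produced by hand: marking an OSD out (the first step of removing it) leaves the daemon running while CRUSH stops placing data on it. A sketch, with osd.4 as an example id:
[root@cmon1 ~]# ceph osd out 4
[root@cmon1 ~]# ceph osd tree    # osd.4 is still up, but its reweight is 0, i.e. out
[root@cmon1 ~]# ceph osd in 4    # bring it back in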
Q: Can the client work during the recovery process?
A: Yes; the status output below was captured during recovery while a client kept writing (see the client io line).
Every 1.0s: ceph -s Sun Sep 6 06:34:03 2015
cluster 6f20cf97-7c73-42df-a0ca-4a5db48e5fde
health HEALTH_WARN 34 pgs recovering; 2 pgs recovery_wait; 38 pgs stuck unclean; 1 requests are blocked > 32 sec; recovery 1110/5084 objects degraded (21.833%)
monmap e3: 3 mons at {cmon1=10.0.100.101:6789/0,cmon2=10.0.100.102:6789/0,cmon3=10.0.100.103:6789/0}, election epoch 36, quorum 0,1,2 cmon1,cmon2,cmon3
osdmap e273: 5 osds: 5 up, 5 in
pgmap v1097: 256 pgs, 3 pools, 10104 MB data, 2542 objects
38424 MB used, 199 GB / 237 GB avail
1110/5084 objects degraded (21.833%)
2 active
2 active+recovery_wait
217 active+clean
34 active+recovering
1 active+clean+scrubbing
recovery io 10322 kB/s, 2 objects/s
client io 1518 kB/s wr, 0 op/s
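The output above was captured with watch -n1 ceph -s. To generate similar client load yourself during recovery you can, for example, run rados bench against a test pool (the pool name is an assumption):
[root@cmon1 ~]# rados bench -p test 60 write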
Q: What happens if the cluster network fails on all OSDs while the public network stays alive?
A: There is no fallback to the public network. All OSDs crash (stack trace below).
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
1: /usr/bin/ceph-osd() [0x99ce66]
2: (()+0xf130) [0x7f29f3f00130]
3: (gsignal()+0x37) [0x7f29f2b225d7]
4: (abort()+0x148) [0x7f29f2b23cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f29f34269b5]
6: (()+0x5e926) [0x7f29f3424926]
7: (()+0x5e953) [0x7f29f3424953]
8: (()+0x5eb73) [0x7f29f3424b73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa86e0f]
10: (Mutex::Lock(bool)+0x111) [0xa3d3c1]
11: (OSD::shutdown()+0x83) [0x613e53]
12: (OSD::handle_osd_map(MOSDMap*)+0x1784) [0x650944]
13: (OSD::_dispatch(Message*)+0x43b) [0x652b3b]
14: (OSD::ms_dispatch(Message*)+0x28f) [0x65344f]
15: (DispatchQueue::entry()+0x57a) [0xb2810a]
16: (DispatchQueue::DispatchThread::entry()+0xd) [0xa68bed]
17: (()+0x7df5) [0x7f29f3ef8df5]
18: (clone()+0x6d) [0x7f29f2be31ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
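The separation between the two networks that makes this failure mode possible is configured in ceph.conf. In my lab it looked roughly like this (the public subnet matches the monitor addresses above; the cluster subnet 10.0.200.0/24 is an assumption):
[global]
public network = 10.0.100.0/24
cluster network = 10.0.200.0/24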
Q: What happens if a client requests an object, but all servers holding replicas of that object have stopped working?
A: The request blocks, and the client receives the rest of the data as soon as those servers come back online.
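While those servers are down, the affected placement groups are reported as stuck inactive. Which PGs are affected can be checked with (output omitted):
[root@cmon1 ~]# ceph health detail
[root@cmon1 ~]# ceph pg dump_stuck inactive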
Q: Can I restrict network bandwidth, the way DRBD can?
A: No, you can't.
Q: When writing, data is sent to all replicas in parallel. What about reading?
A: "Normally we read from the primary replica. We can also read from the "closest" replica or from a random replica."
To be continued ...