Replacing a Failed Disk from Ceph a Cluster
Do you have a ceph cluster , great , you are awesome ; so very soon you would face this .
- Check your cluster health
# ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 830, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e8871: 153 osds: 152 up, 152 in pgmap v1409465: 66256 pgs, 30 pools, 201 TB data, 51906 kobjects 90439 GB used, 316 TB / 413 TB avail 66250 active+clean 6 stale+peering
- Login to any ceph node and search for failed disk
#ceph osd tree | grep -i down 9 2.63 osd.9 up 1 17 2.73 osd.17 up 1 30 2.73 osd.30 up 1 53 2.73 osd.53 up 1 65 2.73 osd.65 up 1 78 2.73 osd.78 up 1 89 2.73 osd.89 up 1 99 2.73 osd.99 down 0 113 2.73 osd.113 up 1 128 2.73 osd.128 up 1 141 2.73 osd.141 up 1
- Now you have identified which node's OSD is failed and what's the OSD number. Login to that node and check if that OSD is mounted , IT would be NOT ( as its failed )
# df -h Filesystem Size Used Avail Use% Mounted on /dev/sda1 47G 6.0G 38G 14% / tmpfs 12G 0 12G 0% /dev/shm /dev/sdd1 2.8T 197G 2.5T 8% /var/lib/ceph/osd/ceph-30 /dev/sde1 2.8T 172G 2.6T 7% /var/lib/ceph/osd/ceph-53 /dev/sdc1 2.8T 264G 2.5T 10% /var/lib/ceph/osd/ceph-17 /dev/sdh1 2.8T 227G 2.5T 9% /var/lib/ceph/osd/ceph-89 /dev/sdf1 2.8T 169G 2.6T 7% /var/lib/ceph/osd/ceph-65 /dev/sdi1 2.8T 150G 2.6T 6% /var/lib/ceph/osd/ceph-113 /dev/sdb1 2.7T 1.3T 1.3T 51% /var/lib/ceph/osd/ceph-9 /dev/sdj1 2.8T 1.6T 1.2T 58% /var/lib/ceph/osd/ceph-128 /dev/sdg1 2.8T 237G 2.5T 9% /var/lib/ceph/osd/ceph-78 /dev/sdk1 2.8T 1.5T 1.3T 53% /var/lib/ceph/osd/ceph-141
- Go to your datacenter with a new physical drive and replace the drive physically , i assume depending on enterprise server that you are using it should be hot swappable , These days almost all servers support hot swapping of disks , but still you should check for your server model . In this example i was using HPDL380 server.
- Once you have replaced the drive physically , wait for some time so that new drive gets stable.
- Login to your node and make OSD out from cluster . Please remember , the OSD was already DOWN and OUT as soon as disk was failed . Ceph takes care of OSD and if its not available it marks it down and moves it out of cluster.
# ceph osd out osd.99 osd.99 is already out. # # service ceph stop osd.99 /etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78) # # service ceph status osd.99 /etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78) # # service ceph status === osd.9 === osd.9: running {"version":"0.72.1"} === osd.30 === osd.30: running {"version":"0.72.1"} === osd.17 === osd.17: running {"version":"0.72.1"} === osd.128 === osd.128: running {"version":"0.72.1"} === osd.65 === osd.65: running {"version":"0.72.1"} === osd.141 === osd.141: running {"version":"0.72.1"} === osd.89 === osd.89: running {"version":"0.72.1"} === osd.53 === osd.53: running {"version":"0.72.1"} === osd.113 === osd.113: running {"version":"0.72.1"} === osd.78 === osd.78: running {"version":"0.72.1"} #
- Now remove this failed OSD from Crush Map , as soon as its removed from crush map , ceph starts making PG copies that were located on this failed disk and it places these PG on other disks. So a recovery process will start.
# ceph osd crush remove osd.99 removed item id 99 name 'osd.99' from crush map # ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 43 pgs backfill; 56 pgs backfilling; 9 pgs peering; 82 pgs recovering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 192 pgs st uck unclean; 4 requests are blocked > 32 sec; recovery 373488/106903578 objects degraded (0.349%) monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 836, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e8946: 153 osds: 152 up, 152 in pgmap v1409604: 66256 pgs, 30 pools, 201 TB data, 51916 kobjects 1 1 GB used, 316 TB / 413 TB avail 1 1 /106903578 objects degraded (0.349%) 1 active 66060 active+clean 1 1 1 active+remapped+wait_backfill 3 peering 1 active+remapped 1 1 1 active+remapped+backfilling 6 stale+peering 4 active+clean+scrubbing+deep 1 1 1 active+recovering recovery io 159 MB/s, 39 objects/s
- (optional) Check disk statistics , it looks nice and after some times ( depending on data that your FAILED disk holds ) it completes
# dstat 10 ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 2 3 95 1 0 0|2223k 5938k| 0 0 |1090B 2425B|5853 11k 14 58 1 25 0 2| 130M 627M| 219M 57M|6554B 0 | 28k 111k 14 57 1 26 0 2| 106M 743M| 345M 32M| 0 4096B| 35k 73k 13 61 1 23 0 2| 138M 680M| 266M 67M| 83k 0 | 31k 82k 14 52 1 31 0 2| 99M 574M| 230M 32M| 48k 6963B| 27k 78k 14 51 2 31 0 2| 99M 609M| 291M 31M| 0 0 | 29k 83k 11 57 1 28 0 2| 118M 636M| 214M 57M|9830B 0 | 26k 92k 12 49 4 34 0 1| 97M 432M| 166M 48M| 35k 0 | 22k 100k 13 44 3 38 0 1| 95M 422M| 183M 46M| 0 0 | 22k 88k 13 52 3 30 0 2| 96M 510M| 207M 44M| 0 0 | 25k 109k 14 49 3 32 0 2| 96M 568M| 276M 37M| 16k 0 | 27k 72k 9 54 4 31 0 2| 109M 520M| 136M 45M| 0 0 | 20k 89k 14 44 5 35 0 1| 76M 444M| 192M 13M| 0 0 | 22k 54k 15 47 3 34 0 2| 101M 452M| 141M 20M|3277B 13k| 21k 79k 17 48 3 31 0 1| 108M 445M| 181M 16M| 0 200k| 23k 69k 17 48 3 30 0 1| 154M 406M| 138M 23M| 0 0 | 21k 75k 17 53 3 27 0 1| 169M 399M| 115M 23M| 0 396k| 21k 81k 13 45 4 36 0 1| 161M 330M| 131M 20M| 0 397k| 20k 90k 11 51 5 33 0 1| 116M 416M| 145M 1177k| 0 184k| 20k 69k 14 50 4 31 0 1| 144M 376M| 124M 8752k| 0 0 | 20k 72k 14 42 6 37 0 1| 142M 340M| 138M 19M| 0 0 | 19k 79k 15 47 6 32 0 1| 111M 427M| 129M 11M| 0 819B| 19k 66k 15 50 5 29 0 1| 163M 413M| 139M 5709k| 58k 0 | 20k 90k 14 49 4 32 0 1| 155M 395M| 91M 12M| 0 0 | 18k 93k 18 43 7 31 0 1| 166M 338M| 84M 6493k| 0 0 | 17k 81k 14 49 5 32 0 1| 179M 335M| 98M 3824k| 0 0 | 18k 91k 13 46 9 31 0 1| 157M 299M| 72M 14M| 0 0 | 17k 125k 17 42 9 30 0 1| 188M 269M| 82M 11M| 16k 0 | 16k 102k 22 35 15 27 0 1| 158M 167M|8932k 287k| 0 0 | 13k 88k 7 20 46 26 0 0| 118M 12M| 250k 392k| 0 82k|9333 61k 7 17 60 16 0 0| 124M 1638B| 236k 225k| 0 0 |7512 64k 7 16 63 14 0 0| 117M 1005k| 247k 238k| 0 0 |7429 60k 3 9 82 5 0 0| 41M 17M| 225k 225k| 0 0 |6049 27k 4 8 81 7 0 0| 56M 7782B| 227k 225k| 0 6144B|5933 33k 4 9 79 7 0 0| 60M 9011B| 248k 245k| 0 9011B|6457 36k 4 9 79 7 0 0| 58M 236k| 231k 230k| 0 14k|6210 35k # ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 836, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e9045: 153 osds: 152 up, 152 in pgmap v1409957: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects 90448 GB used, 316 TB / 413 TB avail 66250 active+clean 6 stale+peering
- Once data recovery is done , go ahead delete keyrings for that OSD and finally remove OSD
# ceph auth del osd.99 updated # # ceph osd rm osd.99 removed osd.99 # #ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 836, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e9046: 152 osds: 152 up, 152 in pgmap v1409971: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects 90445 GB used, 316 TB / 413 TB avail 66250 active+clean 6 stale+peering
- Remove entry of this OSD from ceph.conf ( if its present ) , make sure you keep all the nodes ceph.conf file updated . You can push the new configuration file to entire cluster using # ceph admin command.
- Time to create new OSD for the physical disk that we have inserted , You would see, ceph will create new OSD with the same OSD number , that was failed , as we have removed failed OSD cleanly , if you see a different OSD number , it means that you have not cleanly removed failed OSD.
# ceph osd create 99 # ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 836, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e9047: 153 osds: 152 up, 152 in pgmap v1409988: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects 90442 GB used, 316 TB / 413 TB avail 66250 active+clean 6 stale+peering #
- List the disk , zap it and deploy it again
# ceph-deploy disk list node14 [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk list node14 [node14][DEBUG ] connected to host: node14 [node14][DEBUG ] detect platform information from remote host [node14][DEBUG ] detect machine type [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final [ceph_deploy.osd][DEBUG ] Listing disks on node14... [node14][INFO ] Running command: ceph-disk list [node14][DEBUG ] /dev/sda : [node14][DEBUG ] /dev/sda1 other, ext4, mounted on / [node14][DEBUG ] /dev/sda2 swap, swap [node14][DEBUG ] /dev/sdb : [node14][DEBUG ] /dev/sdb1 ceph data, active, cluster ceph, osd.9, journal /dev/sdb2 [node14][DEBUG ] /dev/sdb2 ceph journal, for /dev/sdb1 [node14][DEBUG ] /dev/sdc : [node14][DEBUG ] /dev/sdc1 ceph data, active, cluster ceph, osd.17, journal /dev/sdc2 [node14][DEBUG ] /dev/sdc2 ceph journal, for /dev/sdc1 [node14][DEBUG ] /dev/sdd : [node14][DEBUG ] /dev/sdd1 ceph data, active, cluster ceph, osd.30, journal /dev/sdd2 [node14][DEBUG ] /dev/sdd2 ceph journal, for /dev/sdd1 [node14][DEBUG ] /dev/sde : [node14][DEBUG ] /dev/sde1 ceph data, active, cluster ceph, osd.53, journal /dev/sde2 [node14][DEBUG ] /dev/sde2 ceph journal, for /dev/sde1 [node14][DEBUG ] /dev/sdf : [node14][DEBUG ] /dev/sdf1 ceph data, active, cluster ceph, osd.65, journal /dev/sdf2 [node14][DEBUG ] /dev/sdf2 ceph journal, for /dev/sdf1 [node14][DEBUG ] /dev/sdg : [node14][DEBUG ] /dev/sdg1 ceph data, active, cluster ceph, osd.78, journal /dev/sdg2 [node14][DEBUG ] /dev/sdg2 ceph journal, for /dev/sdg1 [node14][DEBUG ] /dev/sdh : [node14][DEBUG ] /dev/sdh1 ceph data, active, cluster ceph, osd.89, journal /dev/sdh2 [node14][DEBUG ] /dev/sdh2 ceph journal, for /dev/sdh1 [node14][DEBUG ] /dev/sdi other, btrfs [node14][DEBUG ] /dev/sdj : [node14][DEBUG ] /dev/sdj1 ceph data, active, cluster ceph, osd.113, journal /dev/sdj2 [node14][DEBUG ] /dev/sdj2 ceph journal, for /dev/sdj1 [node14][DEBUG ] /dev/sdk : [node14][DEBUG ] /dev/sdk1 ceph data, active, cluster ceph, osd.128, journal /dev/sdk2 [node14][DEBUG ] /dev/sdk2 ceph journal, for /dev/sdk1 [node14][DEBUG ] /dev/sdl : [node14][DEBUG ] /dev/sdl1 ceph data, active, cluster ceph, osd.141, journal /dev/sdl2 [node14][DEBUG ] /dev/sdl2 ceph journal, for /dev/sdl1
# ceph-deploy disk zap node14:sdi [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk zap node14:sdi [ceph_deploy.osd][DEBUG ] zapping /dev/sdi on node14 [node14][DEBUG ] connected to host: node14 [node14][DEBUG ] detect platform information from remote host [node14][DEBUG ] detect machine type [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final [node14][DEBUG ] zeroing last few blocks of device [node14][INFO ] Running command: sgdisk --zap-all --clear --mbrtogpt -- /dev/sdi [node14][DEBUG ] Creating new GPT entries. [node14][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or [node14][DEBUG ] other utilities. [node14][DEBUG ] The operation has completed successfully. # # ceph-deploy --overwrite-conf osd prepare node14:sdi [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy --overwrite-conf osd prepare node14:sdi [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node14:/dev/sdi: [node14][DEBUG ] connected to host: node14 [node14][DEBUG ] detect platform information from remote host [node14][DEBUG ] detect machine type [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final [ceph_deploy.osd][DEBUG ] Deploying osd to node14 [node14][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf [node14][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add [ceph_deploy.osd][DEBUG ] Preparing host node14 disk /dev/sdi journal None activate False [node14][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdi [node14][ERROR ] INFO:ceph-disk:Will colocate journal with data on /dev/sdi [node14][DEBUG ] The operation has completed successfully. [node14][DEBUG ] The operation has completed successfully. [node14][DEBUG ] meta-data=/dev/sdi1 isize=2048 agcount=32, agsize=22884224 blks [node14][DEBUG ] = sectsz=512 attr=2, projid32bit=0 [node14][DEBUG ] data = bsize=4096 blocks=732295168, imaxpct=5 [node14][DEBUG ] = sunit=64 swidth=64 blks [node14][DEBUG ] naming =version 2 bsize=4096 ascii-ci=0 [node14][DEBUG ] log =internal log bsize=4096 blocks=357568, version=2 [node14][DEBUG ] = sectsz=512 sunit=64 blks, lazy-count=1 [node14][DEBUG ] realtime =none extsz=4096 blocks=0, rtextents=0 [node14][DEBUG ] The operation has completed successfully. [ceph_deploy.osd][DEBUG ] Host node14 is now ready for osd use. #
- Check the new OSD
# ceph osd tree 9 2.63 osd.9 up 1 17 2.73 osd.17 up 1 30 2.73 osd.30 up 1 53 2.73 osd.53 up 1 65 2.73 osd.65 up 1 78 2.73 osd.78 up 1 89 2.73 osd.89 up 1 113 2.73 osd.113 up 1 128 2.73 osd.128 up 1 141 2.73 osd.141 up 1 99 2.73 osd.99 up 1 # ceph status cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5 health HEALTH_WARN 186 pgs backfill; 12 pgs backfilling; 6 pgs peering; 57 pgs recovering; 887 pgs recovery_wait; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 283 pgs stuck unclean; 2 requests are blocked > 32 sec; recovery 784023/106982434 objects degraded (0.733%) monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2 836, quorum 0,1,2 node01-ib,node06-ib,node11-ib mdsmap e58: 1/1/1 up {0=node01-ib=up:active} osdmap e9190: 153 osds: 153 up, 153 in pgmap v1413041: 66256 pgs, 30 pools, 200 TB data, 51840 kobjects 90504 GB used, 319 TB / 416 TB avail 784023/106982434 objects degraded (0.733%) 65108 active+clean 186 active+remapped+wait_backfill 887 active+recovery_wait 12 active+remapped+backfilling 6 stale+peering 57 active+recovering recovery io 383 MB/s, 95 objects/s
- You would notice ceph will start putting PG ( data ) on this new OSD , so as to rebalance data and to make this new osd to participate cluster.
Yes at this stage you are done with replacement.
No comments:
Post a Comment