Tuesday, April 8, 2014

Erasure Coding in Ceph


Erasure Coding : All you need to know 






Where there is data, there will be failures, and there will also be administrators like us to recover that data. Erasure coding is our shield.

Storage systems have technologies for data protection and recovery in the event of a failure. There are three main methods of data protection :-

1) Replication
2) RAID 
Hey, wait a minute, here comes the LEGEN ( wait for it ... ) DARY number 3 :

3) Erasure Coding

First of all, erasure coding is not new; it was born before RAID and replication, but it was not in practical use. RAID is now in its concluding years. The RAID method of data protection was built decades ago, when it was difficult to find an HDD with a capacity of 1 GB. Over the last 30 years drive capacity has increased roughly 4000-fold, yet drive consistency and durability are still a problem. RAID was never supposed to give you data protection for an exabyte-level storage array; it simply was not designed for the future. We need a data protection method that is highly adaptive and ready for future needs. Back to basics: erasure codes are the answer.

Erasure coding (EC) is a method of data protection in which data is broken into fragments, encoded, and then stored in a distributed manner. Ceph, due to its distributed nature, makes use of EC beautifully.

Erasure coding makes use of a mathematical equation to achieve data protection. The entire concept revolves around the following equation.

n = k + m  where , 

k  =  The number of chunks the original data is divided into.

m =  The number of extra code chunks added to the original data chunks to provide data protection. For ease of understanding you can consider it as the reliability level.

n  =  The total number of chunks created after the erasure coding process.

In continuation of the erasure coding equation, there are a couple more terms to know :- 

Recovery : To perform a recovery operation we require any k chunks out of the n chunks, and thus we can tolerate the failure of any m chunks.

Reliability Level : We can tolerate the failure of up to any m chunks.

Encoding Rate (r) : r = k / n , where r < 1

Storage Required : 1 / r


Example 1 : (3,5) Erasure Code for any data file would look like

n = 5 , k = 3 and m = 2  ( m = n - k )

Erasure coding equation  5 = 3 + 2 

So 2 coded chunks will be added to the 3 data chunks to form 5 chunks in total, which will be stored on different OSDs across the Ceph cluster. In the event of a failure, we need any 3 of these 5 chunks to reconstruct the original file. Hence, in this example, we can tolerate the loss of any 2 chunks.

Encoding rate (r) = 3 / 5 = 0.6 < 1
Storage Required = 1 / 0.6 ≈ 1.67 times the original file.

If the original file size is 1 GB, then to store this file in a (3,5) erasure coded Ceph pool you would need about 1.67 GB of storage space.


Example 2 : (10,6) Erasure Code for any data file would look like

n = 16 , k = 10 , m = 6 

Encoding rate (r) = 10 / 16 = 0.625 < 1
Storage Required = 1 / 0.625 = 1.6 times the original file.

This erasure code can sustain the failure of any 6 chunks out of the 16; in the event of a failure we would need any 10 of the 16 chunks for recovery.
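
To make the arithmetic concrete, here is a minimal sketch in plain Python ( not Ceph code, just the equation above ) that reproduces the numbers of Example 1 and Example 2 for any (k, m) profile:

    def ec_overhead(k, m):
        """Return (n, encoding_rate, storage_multiplier) for a k+m erasure code."""
        n = k + m            # total chunks written to the cluster
        r = k / n            # encoding rate, always < 1
        storage = 1 / r      # raw space consumed per unit of user data
        return n, r, storage

    for k, m in [(3, 2), (10, 6)]:      # Example 1 and Example 2
        n, r, storage = ec_overhead(k, m)
        print(f"k={k} m={m} -> n={n}, rate r={r:.3f}, raw space = {storage:.2f}x the original data")

    # k=3 m=2 -> n=5, rate r=0.600, raw space = 1.67x the original data
    # k=10 m=6 -> n=16, rate r=0.625, raw space = 1.60x the original data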


Erasure Codes Step - by - Step


Step 1 : Understanding EC profile


When you look at an EC profile, you will come across a few options that define the EC pattern for that profile.

k   ===> As explained earlier, it is the number of chunks the data will be divided into; each chunk will be stored on a separate OSD. The equivalent command line parameter for this is 
erasure-code-k=<Number_of_data_chunks>
Default value = 2

m   ===> The number of extra coding chunks added to the data chunks to provide reliability. This is also the number of OSDs that can go down without losing data, also known as the reliability level. The equivalent command line parameter for this is 
erasure-code-m=<Number_of_coding_chunks>
Default value = 1

plugin   ===> The library that provides erasure coding in Ceph. The plugin is used to compute the coding chunks and recover missing chunks. The current implementation uses jerasure, but an upcoming version might use GF-Complete as it is about twice as fast as jerasure. The equivalent command line parameter for this is 
erasure-code-plugin=<plugin_name>
Default value = jerasure

directory  ===> The directory from which the EC plugin library will be loaded. In most cases this parameter is added automatically once you define the plugin name. The equivalent command line parameter for this is 
erasure-code-directory=<directory_path>
Default value = /usr/lib64/ceph/erasure-code

ruleset-failure-domain  ===> Defines the failure domain for chunk placement. In the examples below it is set to osd, which places each chunk on a separate OSD.
Default value = host



Step 2 : Generating an EC profile


# ceph osd erasure-code-profile set EC-temp-pool
# ceph osd erasure-code-profile ls
EC-temp-pool
profile1
#
# ceph osd erasure-code-profile get EC-temp-pool
directory=/usr/lib64/ceph/erasure-code
k=2
m=1
plugin=jerasure
technique=reed_sol_van
#

Step 3 : Customizing your EC profile


We can use the set subcommand to change an existing profile; however, we need to use --force to override it. This can be dangerous, so double check the values you pass.

In this example we will set k=4, m=2 and ruleset-failure-domain=osd.


    # ceph osd erasure-code-profile set EC-temp-pool ruleset-failure-domain=osd k=4 m=2
    Error EPERM: will not override erasure code profile EC-temp-pool
    # ceph osd erasure-code-profile set EC-temp-pool ruleset-failure-domain=osd k=4 m=2 --force
    # ceph osd erasure-code-profile get EC-temp-pool
    directory=/usr/lib64/ceph/erasure-code
    k=4
    m=2
    plugin=jerasure
    ruleset-failure-domain=osd
    technique=reed_sol_van
    #

    Similarly, you can change other parameters based on your needs.


    Step 4 : Creating an EC Ceph pool based on your profile


    ceph osd pool create <Pool_name> <pg_num> <pgp_num> erasure <EC_profile_name>


    # ceph osd pool create ECtemppool 128 128 erasure EC-temp-pool
    pool 'ECtemppool' created
    #
    # rados lspools
    data
    metadata
    rbd
    ECtemppool
    #
    # ceph osd dump | grep -i erasure
    pool 22 'ECtemppool' erasure size 6 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 2034 owner 0 flags hashpspool stripe_width 4096
    #


    Step 5 : Writing to your EC pool

    • Create a temporary file

    # echo "Erasure Coding is ******* Awesome" > testfile
    #
    # cat testfile
    Erasure Coding is ******* Awesome
    #
    • List the objects in your pool; since you have just created a new pool, there will be no objects in it. Then put the test file into your EC pool.
    # rados -p ECtemppool ls
    #
    # rados -p ECtemppool put object.1 testfile
    #
    # rados -p ECtemppool ls
    object.1
    #
    • Check the PG map for your pool; it tells you on how many OSDs this object.1 has been placed.
    • If you look carefully, you can see that object.1 has been placed on 6 different OSDs. This is due to our erasure code profile, where we set k=4 and m=2, so there are 6 chunks in total, each placed on a different OSD.
    # ceph osd map ECtemppool object.1
    osdmap e2067 pool 'ECtemppool' (22) object 'object.1' -> pg 22.f560f2ec (22.2ec) -> up ([144,39,55,15,123,65], p144) acting ([144,39,55,15,123,65], p144)
    #

    Now you have a reliable erasure coded pool that can sustain the failure of two OSDs simultaneously.


    Erasure Codes are 100% Authentic :-


    Up to this point we have an erasure coded (4,2) pool, meaning it can sustain the simultaneous failure of 2 OSDs.

    From the above output, these are the 6 OSDs across which the object created from "testfile" has been spread : [144,39,55,15,123,65]. The short sketch below shows why losing any two of them is survivable.
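
    Here is a small illustration in plain Python ( not Ceph code ) of what k=4, m=2 buys us for this particular object: whichever two of those six OSDs fail, at least k=4 chunks survive, which is enough to rebuild the object.

    from itertools import combinations

    osds = [144, 39, 55, 15, 123, 65]   # taken from the `ceph osd map` output above
    k = 4                               # data chunks needed for reconstruction

    failure_patterns = list(combinations(osds, 2))
    for failed in failure_patterns:
        surviving = [o for o in osds if o not in failed]
        assert len(surviving) >= k      # reconstruction is always possible
    print(f"all {len(failure_patterns)} two-OSD failure patterns are survivable")
    # all 15 two-OSD failure patterns are survivable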


    Test 1 : Intentionally Breaking OSD to test EC reliability

    • Breaking 1st OSD ( osd.65 )

    # service ceph status osd.65
    === osd.65 ===
    osd.65: running {"version":"0.79-125-gcf69bdb"}
    #
    #
    # service ceph stop osd.65
    === osd.65 ===
    Stopping Ceph osd.65 on storage0106-ib...kill 3505...done
    #
    # service ceph status osd.65
    === osd.65 ===
    osd.65: not running.
    #

    • Since osd.65 is down, it will be missing from the following output; in its place you will find the placeholder entry " 2147483647 " ( the largest 32-bit integer, which Ceph uses to mark a missing OSD in the mapping ).

    # ceph osd map ECtemppool object.1
    osdmap e2069 pool 'ECtemppool' (22) object 'object.1' -> pg 22.f560f2ec (22.2ec) -> up ([144,39,55,15,123,2147483647], p144) acting ([144,39,55,15,123,2147483647], p144)
    #
    • Breaking 2nd OSD ( osd.123 )
    #
    # service ceph status osd.123
    === osd.123 ===
    osd.123: running {"version":"0.79-125-gcf69bdb"}
    #
    # service ceph stop  osd.123
    === osd.123 ===
    Stopping Ceph osd.123 on storage0114-ib...kill 5327...done
    #
    # service ceph status  osd.123
    === osd.123 ===
    osd.123: not running.
    #
    • Now the second OSD, osd.123, is also down and missing from the following output, just like osd.65. But here comes the MAGIC.
    • Erasure coding is intelligent: it knows when you lose data or coding chunks of an object. As soon as chunks are lost, it immediately recreates exactly the same chunks on new OSDs. In this example osd.65 and osd.123 went down, and Ceph intelligently recovered the failed chunk onto osd.85 ( which is a new OSD ).
    # ceph osd map ECtemppool object.1
    osdmap e2104 pool 'ECtemppool' (22) object 'object.1' -> pg 22.f560f2ec (22.2ec) -> up ([144,39,55,15,2147483647,85], p144) acting ([144,39,55,15,2147483647,85], p144)
    #


    Summary :

    Erasure coding is a beautiful and reliable solution that works amazingly well. It intelligently manages failed chunks and tries to recover them wherever possible without any administrative intervention. This makes Ceph more and more reliable and amazingly cost effective. The EC feature was recently added to Ceph starting with the 0.78 release, and we can definitely expect a more stable and performance-centric version of EC in Ceph Firefly 0.80. Stay tuned.

    Saturday, March 8, 2014

    Admin Guide :: Replacing a Failed Disk in a Ceph Cluster


    Replacing a Failed Disk from a Ceph Cluster

    Do you have a Ceph cluster? Great, you are awesome; sooner or later you will face this. 
    • Check your cluster health
    # ceph status
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2830, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e8871: 153 osds: 152 up, 152 in
       pgmap v1409465: 66256 pgs, 30 pools, 201 TB data, 51906 kobjects
          90439 GB used, 316 TB / 413 TB avail
            66250 active+clean
              6 stale+peering
    
    
    • Login to any ceph node and search for failed disk
    # ceph osd tree | grep -i down
    9 2.63 osd.9 up 1
    17 2.73 osd.17 up 1
    30 2.73 osd.30 up 1
    53 2.73 osd.53 up 1
    65 2.73 osd.65 up 1
    78 2.73 osd.78 up 1
    89 2.73 osd.89 up 1
    99 2.73 osd.99 down 0
    113 2.73 osd.113 up 1
    128 2.73 osd.128 up 1
    141 2.73 osd.141 up 1
    
    • Now you have identified which node's OSD has failed and what the OSD number is. Login to that node and check whether that OSD is mounted. It will NOT be ( as it has failed ).
    # df -h
    Filesystem      Size Used Avail Use% Mounted on
    /dev/sda1       47G 6.0G 38G 14% /
    tmpfs           12G  0 12G 0% /dev/shm
    /dev/sdd1      2.8T 197G 2.5T 8% /var/lib/ceph/osd/ceph-30
    /dev/sde1      2.8T 172G 2.6T 7% /var/lib/ceph/osd/ceph-53
    /dev/sdc1      2.8T 264G 2.5T 10% /var/lib/ceph/osd/ceph-17
    /dev/sdh1      2.8T 227G 2.5T 9% /var/lib/ceph/osd/ceph-89
    /dev/sdf1      2.8T 169G 2.6T 7% /var/lib/ceph/osd/ceph-65
    /dev/sdi1      2.8T 150G 2.6T 6% /var/lib/ceph/osd/ceph-113
    /dev/sdb1      2.7T 1.3T 1.3T 51% /var/lib/ceph/osd/ceph-9
    /dev/sdj1      2.8T 1.6T 1.2T 58% /var/lib/ceph/osd/ceph-128
    /dev/sdg1      2.8T 237G 2.5T 9% /var/lib/ceph/osd/ceph-78
    /dev/sdk1      2.8T 1.5T 1.3T 53% /var/lib/ceph/osd/ceph-141
    
    
    • Go to your datacenter with a new physical drive and replace the failed drive physically. Depending on the enterprise server you are using, it should be hot swappable; these days almost all servers support hot swapping of disks, but you should still check for your server model. In this example I was using an HP DL380 server.
    • Once you have replaced the drive physically, wait for some time so that the new drive settles.
    • Login to your node and mark the OSD out of the cluster. Remember that the OSD was already DOWN and OUT as soon as the disk failed : Ceph keeps track of each OSD, and if it becomes unavailable Ceph marks it down and moves it out of the cluster.
    # ceph osd out osd.99
    osd.99 is already out.
    #
    # service ceph stop osd.99
    /etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
    #
    # service ceph status osd.99
    /etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
    #
    
    # service ceph status
    === osd.9 ===
    osd.9: running {"version":"0.72.1"}
    === osd.30 ===
    osd.30: running {"version":"0.72.1"}
    === osd.17 ===
    osd.17: running {"version":"0.72.1"}
    === osd.128 ===
    osd.128: running {"version":"0.72.1"}
    === osd.65 ===
    osd.65: running {"version":"0.72.1"}
    === osd.141 ===
    osd.141: running {"version":"0.72.1"}
    === osd.89 ===
    osd.89: running {"version":"0.72.1"}
    === osd.53 ===
    osd.53: running {"version":"0.72.1"}
    === osd.113 ===
    osd.113: running {"version":"0.72.1"}
    === osd.78 ===
    osd.78: running {"version":"0.72.1"}
    #
    
    
    • Now remove the failed OSD from the CRUSH map. As soon as it is removed from the CRUSH map, Ceph starts recreating the PG copies that were located on the failed disk and places them on other disks, so a recovery process will start.
    # ceph osd crush remove osd.99
    removed item id 99 name 'osd.99' from crush map
    # ceph status
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 43 pgs backfill; 56 pgs backfilling; 9 pgs peering; 82 pgs recovering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 192 pgs stuck unclean; 4 requests are blocked > 32 sec; recovery 373488/106903578 objects degraded (0.349%)
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e8946: 153 osds: 152 up, 152 in
       pgmap v1409604: 66256 pgs, 30 pools, 201 TB data, 51916 kobjects
          ... GB used, 316 TB / 413 TB avail
          373488/106903578 objects degraded (0.349%)
              1 active
          66060 active+clean
              ... active+remapped+wait_backfill
              3 peering
              1 active+remapped
              ... active+remapped+backfilling
              6 stale+peering
              4 active+clean+scrubbing+deep
              ... active+recovering
    recovery io 159 MB/s, 39 objects/s
    
    • ( optional ) Check disk statistics; it looks nice, and after some time ( depending on how much data the FAILED disk held ) the recovery completes.
    # dstat 10
    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai hiq siq| read writ| recv send| in out | int csw
     2 3 95 1 0 0|2223k 5938k| 0  0 |1090B 2425B|5853  11k
     14 58 1 25 0 2| 130M 627M| 219M 57M|6554B  0 | 28k 111k
     14 57 1 26 0 2| 106M 743M| 345M 32M| 0 4096B| 35k 73k
     13 61 1 23 0 2| 138M 680M| 266M 67M| 83k  0 | 31k 82k
     14 52 1 31 0 2| 99M 574M| 230M 32M| 48k 6963B| 27k 78k
     14 51 2 31 0 2| 99M 609M| 291M 31M| 0  0 | 29k 83k
     11 57 1 28 0 2| 118M 636M| 214M 57M|9830B  0 | 26k 92k
     12 49 4 34 0 1| 97M 432M| 166M 48M| 35k  0 | 22k 100k
     13 44 3 38 0 1| 95M 422M| 183M 46M| 0  0 | 22k 88k
     13 52 3 30 0 2| 96M 510M| 207M 44M| 0  0 | 25k 109k
     14 49 3 32 0 2| 96M 568M| 276M 37M| 16k  0 | 27k 72k
     9 54 4 31 0 2| 109M 520M| 136M 45M| 0  0 | 20k 89k
     14 44 5 35 0 1| 76M 444M| 192M 13M| 0  0 | 22k 54k
     15 47 3 34 0 2| 101M 452M| 141M 20M|3277B 13k| 21k 79k
     17 48 3 31 0 1| 108M 445M| 181M 16M| 0 200k| 23k 69k
     17 48 3 30 0 1| 154M 406M| 138M 23M| 0  0 | 21k 75k
     17 53 3 27 0 1| 169M 399M| 115M 23M| 0 396k| 21k 81k
     13 45 4 36 0 1| 161M 330M| 131M 20M| 0 397k| 20k 90k
     11 51 5 33 0 1| 116M 416M| 145M 1177k| 0 184k| 20k 69k
     14 50 4 31 0 1| 144M 376M| 124M 8752k| 0  0 | 20k 72k
     14 42 6 37 0 1| 142M 340M| 138M 19M| 0  0 | 19k 79k
     15 47 6 32 0 1| 111M 427M| 129M 11M| 0 819B| 19k 66k
     15 50 5 29 0 1| 163M 413M| 139M 5709k| 58k  0 | 20k 90k
     14 49 4 32 0 1| 155M 395M| 91M 12M| 0  0 | 18k 93k
     18 43 7 31 0 1| 166M 338M| 84M 6493k| 0  0 | 17k 81k
     14 49 5 32 0 1| 179M 335M| 98M 3824k| 0  0 | 18k 91k
     13 46 9 31 0 1| 157M 299M| 72M 14M| 0  0 | 17k 125k
     17 42 9 30 0 1| 188M 269M| 82M 11M| 16k  0 | 16k 102k
     22 35 15 27 0 1| 158M 167M|8932k 287k| 0  0 | 13k 88k
     7 20 46 26 0 0| 118M 12M| 250k 392k| 0  82k|9333  61k
     7 17 60 16 0 0| 124M 1638B| 236k 225k| 0  0 |7512  64k
     7 16 63 14 0 0| 117M 1005k| 247k 238k| 0  0 |7429  60k
     3 9 82 5 0 0| 41M 17M| 225k 225k| 0  0 |6049  27k
     4 8 81 7 0 0| 56M 7782B| 227k 225k| 0 6144B|5933  33k
     4 9 79 7 0 0| 60M 9011B| 248k 245k| 0 9011B|6457  36k
     4 9 79 7 0 0| 58M 236k| 231k 230k| 0  14k|6210  35k
    
    # ceph status
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e9045: 153 osds: 152 up, 152 in
       pgmap v1409957: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
          90448 GB used, 316 TB / 413 TB avail
            66250 active+clean
              6 stale+peering
    
    • Once data recovery is done, go ahead and delete the keyring for that OSD and finally remove the OSD itself.
    # ceph auth del osd.99
    updated
    #
    # ceph osd rm osd.99
    removed osd.99
    #
    # ceph status 
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e9046: 152 osds: 152 up, 152 in
       pgmap v1409971: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
          90445 GB used, 316 TB / 413 TB avail
            66250 active+clean
              6 stale+peering
    
    
    • Remove the entry for this OSD from ceph.conf ( if it is present ) and make sure the ceph.conf on all nodes is kept updated. You can push the new configuration file to the entire cluster ( for example with ceph-deploy ).
    • Time to create a new OSD for the physical disk we have inserted. You will see that Ceph creates the new OSD with the same number as the failed one, because we removed the failed OSD cleanly; if you see a different OSD number, it means the failed OSD was not removed cleanly.
    # ceph osd create
    99
    # ceph status
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e9047: 153 osds: 152 up, 152 in
       pgmap v1409988: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
          90442 GB used, 316 TB / 413 TB avail
            66250 active+clean
              6 stale+peering
    #
    
    
    • List the disks on the node, zap the new disk, and prepare it as an OSD again
    # ceph-deploy disk list node14
    [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk list node14
    [node14][DEBUG ] connected to host: node14
    [node14][DEBUG ] detect platform information from remote host
    [node14][DEBUG ] detect machine type
    [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
    [ceph_deploy.osd][DEBUG ] Listing disks on node14...
    [node14][INFO ] Running command: ceph-disk list
    [node14][DEBUG ] /dev/sda :
    [node14][DEBUG ] /dev/sda1 other, ext4, mounted on /
    [node14][DEBUG ] /dev/sda2 swap, swap
    [node14][DEBUG ] /dev/sdb :
    [node14][DEBUG ] /dev/sdb1 ceph data, active, cluster ceph, osd.9, journal /dev/sdb2
    [node14][DEBUG ] /dev/sdb2 ceph journal, for /dev/sdb1
    [node14][DEBUG ] /dev/sdc :
    [node14][DEBUG ] /dev/sdc1 ceph data, active, cluster ceph, osd.17, journal /dev/sdc2
    [node14][DEBUG ] /dev/sdc2 ceph journal, for /dev/sdc1
    [node14][DEBUG ] /dev/sdd :
    [node14][DEBUG ] /dev/sdd1 ceph data, active, cluster ceph, osd.30, journal /dev/sdd2
    [node14][DEBUG ] /dev/sdd2 ceph journal, for /dev/sdd1
    [node14][DEBUG ] /dev/sde :
    [node14][DEBUG ] /dev/sde1 ceph data, active, cluster ceph, osd.53, journal /dev/sde2
    [node14][DEBUG ] /dev/sde2 ceph journal, for /dev/sde1
    [node14][DEBUG ] /dev/sdf :
    [node14][DEBUG ] /dev/sdf1 ceph data, active, cluster ceph, osd.65, journal /dev/sdf2
    [node14][DEBUG ] /dev/sdf2 ceph journal, for /dev/sdf1
    [node14][DEBUG ] /dev/sdg :
    [node14][DEBUG ] /dev/sdg1 ceph data, active, cluster ceph, osd.78, journal /dev/sdg2
    [node14][DEBUG ] /dev/sdg2 ceph journal, for /dev/sdg1
    [node14][DEBUG ] /dev/sdh :
    [node14][DEBUG ] /dev/sdh1 ceph data, active, cluster ceph, osd.89, journal /dev/sdh2
    [node14][DEBUG ] /dev/sdh2 ceph journal, for /dev/sdh1
    [node14][DEBUG ] /dev/sdi other, btrfs
    [node14][DEBUG ] /dev/sdj :
    [node14][DEBUG ] /dev/sdj1 ceph data, active, cluster ceph, osd.113, journal /dev/sdj2
    [node14][DEBUG ] /dev/sdj2 ceph journal, for /dev/sdj1
    [node14][DEBUG ] /dev/sdk :
    [node14][DEBUG ] /dev/sdk1 ceph data, active, cluster ceph, osd.128, journal /dev/sdk2
    [node14][DEBUG ] /dev/sdk2 ceph journal, for /dev/sdk1
    [node14][DEBUG ] /dev/sdl :
    [node14][DEBUG ] /dev/sdl1 ceph data, active, cluster ceph, osd.141, journal /dev/sdl2
    [node14][DEBUG ] /dev/sdl2 ceph journal, for /dev/sdl1

    # ceph-deploy disk zap node14:sdi
    [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk zap node14:sdi
    [ceph_deploy.osd][DEBUG ] zapping /dev/sdi on node14
    [node14][DEBUG ] connected to host: node14
    [node14][DEBUG ] detect platform information from remote host
    [node14][DEBUG ] detect machine type
    [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
    [node14][DEBUG ] zeroing last few blocks of device
    [node14][INFO ] Running command: sgdisk --zap-all --clear --mbrtogpt -- /dev/sdi
    [node14][DEBUG ] Creating new GPT entries.
    [node14][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or
    [node14][DEBUG ] other utilities.
    [node14][DEBUG ] The operation has completed successfully.
    #
    
    # ceph-deploy --overwrite-conf osd prepare node14:sdi
    [ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy --overwrite-conf osd prepare node14:sdi
    [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node14:/dev/sdi:
    [node14][DEBUG ] connected to host: node14
    [node14][DEBUG ] detect platform information from remote host
    [node14][DEBUG ] detect machine type
    [ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
    [ceph_deploy.osd][DEBUG ] Deploying osd to node14
    [node14][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
    [node14][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
    [ceph_deploy.osd][DEBUG ] Preparing host node14 disk /dev/sdi journal None activate False
    [node14][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdi
    [node14][ERROR ] INFO:ceph-disk:Will colocate journal with data on /dev/sdi
    [node14][DEBUG ] The operation has completed successfully.
    [node14][DEBUG ] The operation has completed successfully.
    [node14][DEBUG ] meta-data=/dev/sdi1       isize=2048 agcount=32, agsize=22884224 blks
    [node14][DEBUG ]     =           sectsz=512 attr=2, projid32bit=0
    [node14][DEBUG ] data  =           bsize=4096 blocks=732295168, imaxpct=5
    [node14][DEBUG ]     =           sunit=64  swidth=64 blks
    [node14][DEBUG ] naming =version 2       bsize=4096 ascii-ci=0
    [node14][DEBUG ] log   =internal log     bsize=4096 blocks=357568, version=2
    [node14][DEBUG ]     =           sectsz=512 sunit=64 blks, lazy-count=1
    [node14][DEBUG ] realtime =none         extsz=4096 blocks=0, rtextents=0
    [node14][DEBUG ] The operation has completed successfully.
    [ceph_deploy.osd][DEBUG ] Host node14 is now ready for osd use.
    #
    
    • Check the new OSD
    # ceph osd tree 
    9 2.63 osd.9 up 1
    17 2.73 osd.17 up 1
    30 2.73 osd.30 up 1
    53 2.73 osd.53 up 1
    65 2.73 osd.65 up 1
    78 2.73 osd.78 up 1
    89 2.73 osd.89 up 1
    113 2.73 osd.113 up 1
    128 2.73 osd.128 up 1
    141 2.73 osd.141 up 1
    99 2.73 osd.99 up 1

    # ceph status 
      cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
       health HEALTH_WARN 186 pgs backfill; 12 pgs backfilling; 6 pgs peering; 57 pgs recovering; 887 pgs recovery_wait; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 283 pgs stuck unclean; 2 requests are blocked > 32 sec; recovery 784023/106982434 objects degraded (0.733%)
       monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
       mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
       osdmap e9190: 153 osds: 153 up, 153 in
       pgmap v1413041: 66256 pgs, 30 pools, 200 TB data, 51840 kobjects
          90504 GB used, 319 TB / 416 TB avail
          784023/106982434 objects degraded (0.733%)
            65108 active+clean
             186 active+remapped+wait_backfill
             887 active+recovery_wait
             12 active+remapped+backfilling
              6 stale+peering
             57 active+recovering
    recovery io 383 MB/s, 95 objects/s
    
    
    • You will notice that Ceph starts putting PGs ( data ) on this new OSD to rebalance the data and to make the new OSD participate in the cluster.
    At this stage you are done with the replacement.

    Friday, January 24, 2014

    How Data Is Stored In CEPH Cluster


    HOW :: Data is Stored Inside a Ceph Cluster 



    [ Diagram : how data is stored in Ceph storage ]



    This is something you would definitely be wondering about : how is data stored in a Ceph cluster? 

    Here is an easy to understand diagram of Ceph data storage.


    ## POOLS : A Ceph cluster has POOLS; pools are the logical groups in which objects are stored. These pools are made up of PGs ( Placement Groups ). At the time of pool creation we have to provide the number of placement groups that the pool is going to contain and the number of object replicas ( which usually takes the default value if not specified otherwise ).

    • Creating a pool ( pool-A ) with 128 placement groups
    # ceph osd pool create pool-A 128
    pool 'pool-A' created
    • Listing pools
    # ceph osd lspools
    0 data,1 metadata,2 rbd,36 pool-A,
    • Find out total number of placement groups being used by pool
    # ceph osd pool get pool-A pg_num
    pg_num: 128
    • Find out replication level being used by pool ( see rep size value for replication )
    # ceph osd dump | grep -i pool-A
    pool 36 'pool-A' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 4051 owner 0
    • Changing replication level for a pool ( compare from above step , rep size changed )
    # ceph osd pool set pool-A size 3
    set pool 36 size to 3
    #
    # ceph osd dump | grep -i pool-A
    pool 36 'pool-A' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 4054 owner 0

    This means all the objects of pool-A will be replicated 3 times onto 3 different OSDs.

    Now let's put some data in pool-A; as a thumb rule, the data will be stored in the form of objects  :-)

    # dd if=/dev/zero of=object-A bs=10M count=1
    1+0 records in
    1+0 records out
    10485760 bytes (10 MB) copied, 0.0222705 s, 471 MB/s
    #
    
    # dd if=/dev/zero of=object-B bs=10M count=1
    1+0 records in
    1+0 records out
    10485760 bytes (10 MB) copied, 0.0221176 s, 474 MB/s
    #
    • Putting some objects in pool-A
    # rados -p pool-A put object-A  object-A
    # rados -p pool-A put object-B  object-B
    • Checking which objects the pool contains
    # rados -p pool-A ls
    object-A
    object-B
    #

    ## PG ( Placement Group ) : A Ceph cluster maps objects --> PGs. These PGs, containing the objects, are spread across multiple OSDs, which improves reliability. 

    ## Object : An object is the smallest unit of data storage in a Ceph cluster; each and everything is stored in the form of objects, which is why a Ceph cluster is also known as an object storage cluster. Objects are mapped to PGs, and the objects and their copies are always spread across different OSDs. This is how Ceph is designed. 

    • Locating an object : which PG does it belong to, and where is it stored ?
    # ceph osd map pool-A object-A
    osdmap e4055 pool 'pool-A' (36) object 'object-A' -> pg 36.b301e3e8 (36.68) -> up [122,63,62] acting [122,63,62]
    #
    # ceph osd map pool-A object-B
    osdmap e4055 pool 'pool-A' (36) object 'object-B' -> pg 36.47f173fb (36.7b) -> up [153,110,118] acting [153,110,118]
    #
    Now, we have already created pool-A, changed its replication level to 3 and added objects ( object-A and object-B ) to it. Observe the above output; it gives a lot of information :

    1. OSD map version id is e4055
    2. pool name is pool-A
    3. pool id is 36
    4. the object names that were queried ( object-A and object-B )
    5. The Placement Group ids to which these objects belong are ( 36.68 ) and ( 36.7b ) ; the short sketch after this list shows how the PG id is derived from the object name's hash
    6. Our pool-A has its replication level set to 3 , so every object of this pool has 3 copies on different OSDs ; here our object's 3 copies reside on osd.122 , osd.63 and osd.62
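
    For the curious, here is a rough sketch ( plain Python, not a Ceph API ) of the object-to-PG step. Ceph hashes the object name with its rjenkins hash; the hex value printed by ceph osd map ( b301e3e8 for object-A ) is that hash, and the PG is the hash modulo pg_num, written as <pool_id>.<hex>. Mapping the PG to its OSDs is then done by CRUSH, which this sketch does not attempt to reproduce; the hash values below are taken straight from the output above rather than recomputed.

    pool_id = 36
    pg_num = 128            # pg_num of pool-A

    objects = {
        "object-A": 0xb301e3e8,     # hash shown in the `ceph osd map` output
        "object-B": 0x47f173fb,
    }

    for name, h in objects.items():
        pg = h % pg_num     # equals h & (pg_num - 1) when pg_num is a power of two
        print(f"{name} -> pg {pool_id}.{pg:x}")

    # object-A -> pg 36.68
    # object-B -> pg 36.7b
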
    • Login to ceph nodes containing OSD 122 , 63 and 62
    • You can see your OSD mounted
    # df -h /var/lib/ceph/osd/ceph-122
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdj1             2.8T  1.8T  975G  65% /var/lib/ceph/osd/ceph-122
    #
    • Browse to the directory where ACTUAL OBJECTS are stored
    # pwd
    /var/lib/ceph/osd/ceph-122/current
    #
    • Under this directory, if you run an ls command you will see the PG IDs; in our case the PG id for object-A is 36.68
    # ls -la | grep -i 36.68
    drwxr-xr-x 1 root root    54 Jan 24 16:45 36.68_head
    #
    • Browse to the PG head directory, run ls, and here you go : you have reached your OBJECT.
    # pwd
    /var/lib/ceph/osd/ceph-122/current/36.68_head
    #
    # ls -l
    total 10240
    -rw-r--r-- 1 root root 10485760 Jan 24 16:45 object-A__head_B301E3E8__24
    #

    Moral of the Story

    • A Ceph storage cluster can have more than one pool
    • Each pool SHOULD have multiple placement groups. The more PGs, the better your cluster performance and the more reliable your setup will be.
    • A PG contains multiple Objects.
    • A PG is spread over multiple OSDs, i.e. objects are spread across OSDs. The first OSD mapped to a PG is its primary OSD and the other OSDs of the same PG are its secondary OSDs.
    • An Object can be mapped to exactly one PG
    • Many PG's can be mapped to ONE OSD

    How many PGs do you need for a pool :


               (OSDs * 100)
    Total PGs = ------------
                  Replicas

    # ceph osd stat
         osdmap e4055: 154 osds: 154 up, 154 in
    #

    Applying formula gives me  = ( 154 * 100 ) / 3 = 5133.33

    Now round this value up to the next power of 2; this gives you the number of PGs you should have for a pool with a replication size of 3 and 154 OSDs in the entire cluster.

    Final Value = 8192 PG
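
    The same calculation as a tiny Python helper ( just the rule of thumb above, not a Ceph tool ):

    def recommended_pg_count(num_osds, replicas):
        raw = (num_osds * 100) / replicas
        pgs = 1
        while pgs < raw:        # round up to the next power of 2
            pgs *= 2
        return raw, pgs

    raw, pgs = recommended_pg_count(154, 3)
    print(f"raw value = {raw:.2f}, next power of 2 = {pgs}")
    # raw value = 5133.33, next power of 2 = 8192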


    Friday, January 10, 2014

    CephFS with a dedicated pool


    CephFS with a Dedicated Pool



    This blog is about configuring a dedicated ( user defined ) pool for CephFS. If you are looking to configure CephFS itself, please visit the 
    CephFS Step by Step blog


    • Create a new pool for cephfs ( obviously you can use an existing pool )
    # rados mkpool cephfs
    • Grab pool id
    # ceph osd dump | grep -i cephfs
    pool 34 'cephfs' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 860 owner 0
    # 
    
    • Assign the pool to MDS
    # ceph mds add_data_pool 34 
      • Mount your cephfs share
      # mount -t ceph 192.168.100.101:/ /cephfs -o name=cephfs,secretfile=/etc/ceph/client.cephfs
      
      • Check the current layout of cephfs; you will notice the default layout.data_pool is set to 0, which means cephfs will store data in pool 0, i.e. the data pool
      # cephfs /cephfs/ show_layout
      layout.data_pool:     0
      layout.object_size:   4194304
      layout.stripe_unit:   4194304
      layout.stripe_count:  1
      
      • Set a new layout for data_pool in cephfs, using the pool id of the pool that we created above.
      # cephfs /cephfs/ set_layout -p 34
      # cephfs /cephfs/ show_layout
      layout.data_pool:     34
      layout.object_size:   4194304
      layout.stripe_unit:   4194304
      layout.stripe_count:  1
      [root@na_csc_fedora19 ~]#
      
      • Remount your cephfs share
      # umount /cephfs
      # mount -t ceph 192.168.100.101:/ /cephfs -o name=cephfs,secretfile=/etc/ceph/client.cephfs
      
      • Check the objects present in the cephfs pool; there should be no objects, as this is a fresh pool that does not contain any data yet. But if you list the objects of any other pool ( for example metadata ), it will contain objects.
      # rados --pool=cephfs ls
      #
      # rados --pool=metadata ls
      1.00000000.inode
      100.00000000
      100.00000000.inode
      1.00000000
      2.00000000
      200.00000000
      200.00000001
      600.00000000
      601.00000000
      602.00000000
      603.00000000
      604.00000000
      605.00000000
      606.00000000
      607.00000000
      608.00000000
      609.00000000
      mds0_inotable
      mds0_sessionmap
      mds_anchortable
      mds_snaptable
      #
      
      • Go to your cephfs mount directory and create some files ( put some data in the file ).
      # cd /cephfs/
      # vi test
      
      • Recheck the objects in the cephfs pool; now it will show you objects ( a sketch of how these object names are built follows after the output ).
      # rados --pool=cephfs ls
      10000000005.00000000
      #
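
      As an aside, here is a rough sketch ( plain Python, based on the object name seen above rather than on an official Ceph API ) of how CephFS file data turns into RADOS object names with the layout we set earlier ( object_size = 4 MiB, stripe_count = 1 ): data objects are named <inode in hex>.<object index as 8 hex digits>, so the first 4 MiB of inode 0x10000000005 lands in object 10000000005.00000000.

      OBJECT_SIZE = 4 * 1024 * 1024       # layout.object_size from show_layout

      def data_object_name(inode, byte_offset):
          # with stripe_count=1 the file is simply chopped into object_size pieces
          index = byte_offset // OBJECT_SIZE
          return f"{inode:x}.{index:08x}"

      inode = 0x10000000005               # inode of the small file created above
      print(data_object_name(inode, 0))                   # 10000000005.00000000
      print(data_object_name(inode, 6 * 1024 * 1024))     # 10000000005.00000001
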
      Summary : we created a new pool named "cephfs", changed the layout of cephfs to store its data in the new pool "cephfs", and finally we saw cephfs data getting stored in the pool named cephfs ( i know, that is a lot of cephfs; read it again if you were sleeping and didn't follow ).



      Thursday, January 9, 2014

      Kraken :: The First Free Ceph Dashboard in Town




      Kraken :: The Free Ceph Dashboard is Finally Live

      Kraken is a Free ceph dashboard for monitoring and statistics. Special thanks to Donald Talton for this beautiful dashboard. 


      Installing Kraken

      •  Install Prerequisites 
      # yum install git
      # yum install django
      # yum install python-pip
      # pip install requests
      Requirement already satisfied (use --upgrade to upgrade): requests in /usr/lib/python2.7/site-packages
      Cleaning up...
      #
      # pip install django
      Requirement already satisfied (use --upgrade to upgrade): django in /usr/lib/python2.7/site-packages
      Cleaning up...
      #
      # yum install screen
      • Create a new user account 
      # useradd kraken
      • Clone kraken from github
      # cd /home/kraken
      # git clone https://github.com/krakendash/krakendash
      Cloning into 'krakendash'...
      remote: Counting objects: 511, done.
      remote: Compressing objects: 100% (276/276), done.
      remote: Total 511 (delta 240), reused 497 (delta 226)
      Receiving objects: 100% (511/511), 1.53 MiB | 343.00 KiB/s, done.
      Resolving deltas: 100% (240/240), done.
      #
      • Execute api.sh and django.sh one by one; these will be launched in screen sessions. Use Ctrl-A and D to detach from a screen.

      # ./api.sh
      [detached from 14662.api]
      # ./django.sh
      [detached from 14698.django]
      #
      # ps -ef | grep -i screen
      root     14662     1  0 07:29 ?        00:00:00 SCREEN -S api sudo ceph-rest-api -c /etc/ceph/ceph.conf --cluster ceph -i admin
      root     14698     1  0 07:30 ?        00:00:00 SCREEN -S django sudo python krakendash/kraken/manage.py runserver 0.0.0.0:8000
      root     14704 14472  0 07:30 pts/0    00:00:00 grep --color=auto -i screen
      #
      • Open your browser and navigate to http://localhost:8000/


      [ Screenshots : kraken dashboard and kraken pools view ]

      • Great you have a Ceph GUI dashboard running now :-)
      • Watch this space for new features of kraken



      Thursday, January 2, 2014

      Zero To Hero Guide : : For CEPH CLUSTER PLANNING


      What it is all about :

      If you think or talk about Ceph, the most common question that strikes your mind is "What hardware should I select for my Ceph storage cluster?" And if this question really did cross your mind, congratulations, you seem to be serious about Ceph technology. You should be, because CEPH IS THE FUTURE OF STORAGE.

      Ceph runs on commodity hardware, Ohh Yeah !! everyone knows that by now. It is designed to build a multi-petabyte storage cluster while providing enterprise-ready features: no single point of failure, scaling to exabytes, self-managing and self-healing ( which saves operational cost ), and running on commodity hardware ( no vendor lock-in, which saves capital investment ).

      Ceph Overview :-

      The soul of the Ceph storage cluster is RADOS ( Reliable Autonomic Distributed Object Store ). Ceph uses the powerful CRUSH ( Controlled Replication Under Scalable Hashing ) algorithm to optimize data placement and to self-manage and self-heal. The RESTful interface is provided by the Ceph Object Gateway (RGW), aka the RADOS Gateway, and virtual disks are provisioned by the Ceph Block Device (RBD). 



      Ceph Overview - Image Credit : Inktank


      Ceph Components :-

      # Ceph OSD ( Object Storage Daemon ) stores data in objects, manages data replication, recovery and rebalancing, and provides state information to the Ceph Monitors. It is recommended to use 1 OSD per physical disk.

      # Ceph MON ( Monitor ) maintains the overall health of the cluster by keeping the cluster map state, including the Monitor map, OSD map, Placement Group ( PG ) map and CRUSH map. Monitors receive state information from other components to maintain these maps and circulate them to the other Monitor and OSD nodes.

      # Ceph RGW ( Object Gateway / Rados Gateway ) RESTful API interface compatible with Amazon S3 , OpenStack Swift .

      # Ceph RBD ( RADOS Block Device ) provides block storage to VMs / bare metal as well as regular clients, and supports OpenStack and CloudStack. It includes enterprise features like snapshots, thin provisioning and compression.

      # CephFS ( File System ) distributed POSIX NAS storage.


      Few Thumb Rules :-

      • Run OSDs on dedicated storage nodes ( servers with multiple disks ); the actual data is stored in the form of objects.
      • Run Monitors on separate dedicated hardware or co-located with Ceph client nodes ( other than OSD nodes ), such as RGW or CephFS nodes. For production it is recommended to run Monitors on dedicated low cost servers, since Monitors are not resource hungry.


      Monitor Hardware Configuration :-

      The Monitors maintain the health of the entire cluster; they hold the PG logs and OSD logs. A minimum of three monitor nodes is recommended for cluster quorum. Ceph monitor nodes are not resource hungry; they can work well with fairly modest CPU and memory. A 1U server with a low cost processor such as an E5-2603, 16GB of RAM and a 1GbE network should be sufficient in most cases. If the PG, Monitor and OSD logs are stored on the local disk of the monitor node, make sure you have a sufficient amount of local storage so that it does not fill up.

      Unhealthy clusters require more storage for logs, which can reach several GB and even hundreds of GB if the cluster is left unhealthy for a very long time. If verbose output is set on the monitor nodes, they are bound to generate a huge amount of logging information. Refer to the ceph documentation for the monitor log settings.

      It is recommended to run monitors on separate physical nodes rather than all on one node, or on virtual machines hosted on physically separated machines, to prevent a single point of failure.


      The Planning Stage :-

      Deploying a ceph cluster in production requires a little bit of homework. You should gather the information below so that you can design a better, more reliable and scalable ceph cluster to fit your IT needs. These details are very specific to your needs and your IT environment, and they will help you to scope your storage requirements better.


      • Business Requirement
        • Budget ?
        • Do you need Ceph cluster for day to day operation or SPECIAL 
      • Technical Requirement
        • What applications will be running on your ceph cluster ?
        • What type of data will be stored on your ceph cluster ?
        • Should the ceph cluster be optimized for capacity and performance ?
        • What should be usable storage capacity ?
        • What is expected growth rate ?
        • How many IOPS should the cluster support ?
        • How much throughput should the cluster support ?
        • How much data replication ( reliability level ) you need ?


      Collect as much information as possible during the planning stage; it will give you all the answers required to construct a better ceph cluster.

      The Physical Node and clustering technique:-

      In addition to the information collected above, also take into account the rack density, the power budget and the data center space cost to size the optimal node configuration. Ceph replicates data across multiple nodes in a storage cluster to provide data redundancy and higher availability. It is important to consider the following :


      • Should the replicated node be on the same rack or multiple racks to avoid SPOF ?
      • Should the OSD traffic stay within the rack or span across rack in a dedicated or shared network ?
      • How many nodes failure can be tolerated ?
      • If the nodes are spread across multiple racks, network traffic increases, and the impact of latency and the number of network switch hops should be considered.
      Ceph will automatically recover by re-replicating data from the failed nodes using the secondary copies present on other nodes in the cluster. A node failure thus has several effects :

      • Total cluster capacity is reduced by some fractions.
      • Total cluster throughput is reduced by some fractions.
      • The cluster enters a write heavy recovery processes.

      A general rule of thumb for calculating the recovery time in a ceph cluster, given 1 disk per OSD node, is : 

      Recovery Time in seconds = disk capacity in Gigabits / ( network speed *(nodes-1) )
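
      A quick sketch of that rule of thumb in plain Python, with hypothetical numbers that are not taken from this post ( a 4 TB OSD, a 10 Gbit/s cluster network and 5 OSD nodes ):

      def recovery_time_seconds(disk_capacity_tb, network_gbit_per_s, nodes):
          disk_gigabits = disk_capacity_tb * 1000 * 8     # TB -> Gigabits
          return disk_gigabits / (network_gbit_per_s * (nodes - 1))

      t = recovery_time_seconds(disk_capacity_tb=4, network_gbit_per_s=10, nodes=5)
      print(f"~{t:.0f} seconds (~{t / 60:.0f} minutes)")
      # ~800 seconds (~13 minutes)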



      # POC Environment -- can have a minimum of 3 physical nodes with 10 OSDs each. This provides 66% cluster availability upon a physical node failure and 97% uptime upon an OSD failure. RGW and Monitor nodes can be put on OSD nodes, but this may impact performance and is not recommended for production.

      # Production Environment -- a minimum of 5 physically separated nodes and a minimum of 100 OSDs at 4TB per OSD. The cluster capacity is over 130TB, and it provides 80% uptime on a physical node failure and 99% uptime on an OSD failure. RGW and Monitors should be on separate nodes.
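
      A back-of-the-envelope check of those numbers in plain Python, assuming "availability / uptime" here simply means the fraction of nodes or OSDs still serving after a single failure, and 3x replication for the usable capacity:

      def surviving_fraction(total):
          return (total - 1) / total

      # POC : 3 nodes x 10 OSDs each
      print(f"POC  node failure: {surviving_fraction(3):.0%}, OSD failure: {surviving_fraction(30):.0%}")
      # Production : 5 nodes, 100 OSDs of 4 TB, replication 3
      print(f"Prod node failure: {surviving_fraction(5):.0%}, OSD failure: {surviving_fraction(100):.0%}")
      print(f"Prod usable capacity: {100 * 4 / 3:.0f} TB")

      # POC  node failure: 67%, OSD failure: 97%
      # Prod node failure: 80%, OSD failure: 99%
      # Prod usable capacity: 133 TB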

      Based on the outcome of the planning phase and of the physical node and clustering considerations above, have a look at the hardware available in the market that fits your budget.


      OSD CPU selection :-


      < Under Construction ... Stay Tuned >