Low power/good performance Ceph (jewel) cluster monitored with grafana/influxdb/telegraf on Centos 7.3

 

The hardware..

In my homelab article, "Highly resilient “datacenter-in-two-boxes” with CentOS 7 and Ceph jewel", I explained how to build a low power homelab.

With this hardware and a bunch of low power disks (2.5" 5400 rpm), you can build a low power virtualized storage system with Ceph, and store all your data with top-level NAS software.

The software :

CentOS 7.3 (1611) x86-64 “minimal”

Ceph “jewel” x86-64

Puppet (configuration management software)

Topology

Number of MONs

It’s recommended to install at least 3 MONs for resilience reasons.

For my needs, I will install 5 MONs on my 5 hosts.

In this example, all hosts are virtualized; in my case, I have 3 physical hosts (see other pages..). One of them cannot host an OSD (an Intel NUC): in my final cluster map, there is 1 MON on my NUC and 1 on each storage host. This ensures that when a host is down, the quorum is still satisfied and my Ceph cluster stays UP.

Installing the cluster

Preparing the hardware and the OS

Requirements :

This blog does not cover the OS installation procedure. Before you continue, be sure to configure your OS with these additional requirements:

  • Use a correct DNS configuration or manually configure the /etc/hosts file on each host.
  • You will need at least 3 nodes, plus an admin node (for cluster deployment, monitoring, ..)
  • You MUST install NTP on all nodes :

    root@n0:~# yum install -y ntp
    root@n0:~# systemctl enable ntpd

Then configure /etc/ntp.conf with your preferred NTP servers.

You can also use chronyd, the CentOS 7 default (see the chrony.conf config file).

It’s safer and more efficient to have a time source close to your cluster. Wi-Fi APs and DSL routers often provide such a service. My configuration uses my ADSL router, based on OpenWrt (you can set up ntpd on OpenWrt…).
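For example, a minimal /etc/ntp.conf along those lines (the 192.168.10.1 router address is an assumption; replace it with your own local time source, and keep a public pool as a fallback):

# local time source (assumed to be the ADSL/OpenWrt router)
server 192.168.10.1 iburst
# public fallback
server 0.centos.pool.ntp.org iburst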

Then run :

root@n0:~# timedatectl set-ntp true
  • Disable SELinux (see /etc/selinux/config)
  • Disable firewalld (systemctl disable firewalld.service); a dsh sketch covering both steps follows this list.
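Once dsh is configured (see further down), you can apply both changes to all nodes at once; a minimal sketch, assuming passwordless sudo is in place for cephadm:

dsh -aM "sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config"
dsh -aM "sudo setenforce 0"
dsh -aM "sudo systemctl disable firewalld.service"
dsh -aM "sudo systemctl stop firewalld.service"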

Finally, reboot your nodes and make sure everything is still OK..

Create the ceph admin user on each node :

On each node, create a ceph admin user (used for deployment tasks). It’s important to choose a user name different from “ceph” (which is used by the ceph installer..)

Note: you can omit the -s option of useradd; using bash is a personal choice.

root@n0:~# sudo useradd -d /home/cephadm -m cephadm -s /bin/bash
root@n0:~# sudo passwd cephadm
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

root@n0:~# echo "cephadm ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
root@n0:~# chmod 444 /etc/sudoers.d/ceph

…and so on for the admin host, and for nodes n1, n2 [and n3, …]

To automate this task, use dsh on the admin node, after having configured dsh for root (ssh-copy-id root@nodes):

echo "cephadm ALL = (root) NOPASSWD:ALL" | dsh -aM -i -c 'sudo tee /etc/sudoers.d/ceph'    
dsh -aM "chmod 444 /etc/sudoers.d/ceph"

Also, install redhat-lsb-core now; it will be useful later.

yum install redhat-lsb-core

Setup the ssh authentication with cryptographic keys

On the admin node :

Create the ssh key for the user cephadm

root@admin:~# su - cephadm
cephadm@admin:~$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/cephadm/.ssh/id_dsa):
Created directory '/home/cephadm/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cephadm/.ssh/id_dsa.
Your public key has been saved in /home/cephadm/.ssh/id_dsa.pub.
The key fingerprint is:
ec:16:ad:b4:76:e4:32:c6:7c:14:45:bc:c3:78:5a:cf cephadm@admin
The key's randomart image is:
+---[DSA 1024]----+
|           oo    |
|           ..    |
|          .o .   |
|       . ...*    |
|        S ++ +   |
|       = B.   E  |
|        % +      |
|       + =       |
|                 |
+-----------------+
cephadm@admin:~$

Then push it to the nodes of the cluster:
[cephadm@admin ~]$ ssh-copy-id cephadm@n0
[cephadm@admin ~]$ ssh-copy-id cephadm@n1
[cephadm@admin ~]$ ssh-copy-id cephadm@n2
[cephadm@admin ~]$ ssh-copy-id cephadm@n3
[cephadm@admin ~]$ ssh-copy-id cephadm@n4

Or better, automate it (if you do this a lot of times :) ):

#!/bin/sh
# sudo yum install moreutils sshpass openssh-clients
echo 'Enter password:';
read -s SSHPASS;
export SSHPASS;
for i in {0..4}; do sshpass -e ssh-copy-id -o StrictHostKeyChecking=no cephadm@n$i.int.intra -p 22 ; done
export SSHPASS=''

Install and configure dsh (distributed shell)

[root@admin ~]# yum install -y gcc
[root@admin ~]# yum install -y gcc-c++
[root@admin ~]# yum install -y wget
[root@admin ~]# wget https://www.netfort.gr.jp/~dancer/software/downloads/dsh-0.25.9.tar.gz
[root@admin ~]# wget https://www.netfort.gr.jp/~dancer/software/downloads/libdshconfig-0.20.9.tar.gz
[root@admin ~]# tar xvfz libdshconfig-0.20.9.tar.gz
[root@admin ~]# cd libdshconfig-0.20.9
[root@admin libdshconfig-0.20.9]# ./configure
[root@admin libdshconfig-0.20.9]# make
[root@admin libdshconfig-0.20.9]# make install
[root@admin ~]# tar xvfz dsh-0.25.9.tar.gz
[root@admin ~]# cd dsh-0.25.9
[root@admin dsh-0.25.9]# ./configure
[root@admin dsh-0.25.9]# make
[root@admin dsh-0.25.9]# make install
[root@admin ~]# echo /usr/local/lib > /etc/ld.so.conf.d/dsh.conf
[root@admin ~]# ldconfig

Done. Then configure it :
[root@admin ~]# vi /usr/local/etc/dsh.conf

insert these lines :

remoteshell = ssh
waitshell = 1  # whether to wait for execution

Create the default machine list file

[root@admin ~]# su - cephadm
cephadm@admin:~$ cd
cephadm@admin:~$ mkdir .dsh
cephadm@admin:~$ cd .dsh
cephadm@admin:~/.dsh$ for i in {0..4} ; do echo "n$i" >> machines.list ; done

Test…

[cephadm@admin ~]$ dsh -aM uptime
n0:  16:23:21 up 3 min,  0 users,  load average: 0.20, 0.39, 0.20
n1:  16:23:22 up 3 min,  0 users,  load average: 0.19, 0.40, 0.21
n2:  16:23:23 up 3 min,  0 users,  load average: 0.13, 0.38, 0.20
n3:  16:23:24 up 4 min,  0 users,  load average: 0.00, 0.02, 0.02
n4:  16:23:25 up 3 min,  0 users,  load average: 0.24, 0.38, 0.20

Another test :
[cephadm@admin ~]$ dsh -aM cat /proc/cpuinfo | grep model\ name
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz

Good.. !!

Now you’re ready to install your cluster with automated commands from your admin node. Note that several other tools are good enough too, like cssh (clustered ssh). Choose the best for your needs ;)

Well,

Now I’m assuming you have followed the installation procedure and the requirements above :).

Here’s my configuration :

n0 : 192.168.10.210/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n1 : 192.168.10.211/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n2 : 192.168.10.212/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on Crucial MX200 SSD (journal)
n3 : 192.168.10.213/24 1TB WD Red 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n4 : 192.168.10.214/24 1TB Hitachi 3.5” 7200rpm (data) + 20 GB on Crucial MX100 SSD (journal)
n5 : 192.168.10.215/24 1TB ZFS (on 2x WD Green 5TB) + 20 GB on Crucial MX100 SSD (journal)
admin : 192.168.10.177/24 (VM)

Finally, don’t forget to change your yum repositories if you installed the OS from local media. They should now point to a mirror for all updates (security and software).
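For illustration only, a base repo stanza pointing at a local mirror might look like this in /etc/yum.repos.d/CentOS-Base.repo (the http://nas4/centos URL is the local mirror I reuse in the Puppet section further down; replace it with yours, or keep the default CentOS mirrorlist):

[base]
name=CentOS-$releasever - Base (local mirror)
baseurl=http://nas4/centos/$releasever/os/$basearch/
gpgcheck=0
enabled=1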

Reboot your nodes if you want to be really sure you haven’t forgotten anything, and test them with dsh, for example for NTP:

[cephadm@admin ~]$ dsh -aM timedatectl status | grep NTP
n0:      NTP enabled: yes
n0: NTP synchronized: yes
n1:      NTP enabled: yes
n1: NTP synchronized: yes
n2:      NTP enabled: yes
n2: NTP synchronized: yes
n3:      NTP enabled: yes
n3: NTP synchronized: yes
n4:      NTP enabled: yes
n4: NTP synchronized: yes
...

Install your Ceph cluster

Get the software

Ensure you are up to date on each node, at the very beginning of this procedure.

Feel free to use dsh from the admin node for each task you would like to apply to the nodes ;)

[cephadm@admin ~]$ dsh -aM "sudo yum -y upgrade"

Install the repos

On the admin node only, configure the ceph repos.

You have the choice : either do it like this if you want to download Ceph packages from the internet :

[cephadm@admin ~]$ sudo yum install https://download.ceph.com/rpm-jewel/el7/noarch/ceph-release-1-1.el7.noarch.rpm

Or, if you want a local mirror, look at the section below that shows how to set up Puppet (for example) to do that. I prefer this option myself, because I have a local mirror (for testing purposes, it’s better to download locally).

Install Ceph-deploy

This tool is written in Python.

[cephadm@admin ~]$ sudo yum install ceph-deploy

Install Ceph

Still on the admin node, create a directory that will contain all the config for your cluster:

cephadm@admin:~$ mkdir cluster
cephadm@admin:~$ cd cluster

I have chosen to install 4 monitors (3 would be sufficient at home, but my needs aren’t necessarily yours).

cephadm@admin:~/cluster$ ceph-deploy new n{0,2,4,5}

(It generates a lot of stdout messages)

Now edit ceph.conf (in the “cluster” directory) to tell Ceph you want 3 replicas of the data, and add the cluster and public networks in the [global] section; for me: 10.1.1.0/24 and 192.168.10.0/24.

The file ceph.conf should contain the following lines now :

[global]
fsid = 74a80a50-b7f9-4588-baa4-bb242c3d4cf0
mon_initial_members = n0, n1, n3
mon_host = 192.168.10.210,192.168.10.211,192.168.10.213
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 3
cluster network = 10.1.1.0/24
public network = 192.168.10.0/24

[osd]
osd mkfs options type=btrfs

osd journal size = 20000

Please note that I will use btrfs to store the data. My kernel is recent enough for that (4.9), and I’ve occasionally experienced obvious filesystem corruptions when simply rebooting nodes that had kernel 3.10 and an XFS partition for the OSDs.
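Before committing to btrfs, you can quickly confirm which kernel every node actually runs (this only prints the kernel release):

[cephadm@admin ~]$ dsh -aM "uname -r"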

If you install from a local mirror :
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --repo-url {http mirror} --gpg-url {http gpg url} --release jewel n$i; done

For example, for me:

cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --repo-url http://mirror/ceph/rpm-jewel/el7/ --release jewel n$i; done
Otherwise:
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --release jewel n$i; done

This command generates a lot of logs (downloads, debug messages, warnings…) but should return without error. Otherwise, check the error and google it. You can often just restart ceph-deploy; depending on the error, it will work the second time ;) (I’ve experienced some problems accessing the ceph repository, for example…)

Create the mons:
cephadm@admin:~/cluster$ ceph-deploy mon create-initial

Idem, a lot of logs… but no error..

Create the OSDs (storage units)

You have to know which device will be used for the data on each node, and which device for the journal. If you are building a Ceph cluster for a production environment, you should use SSDs for the journal partition. For testing purposes, you can use only one device.

In my case, I took care to make the storage (OSD) disk /dev/vdb on all nodes, and the journal (SSD) /dev/vdc.
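You can check the device layout on every node from the admin host; a simple verification, assuming dsh is configured as described earlier (ROTA=1 means a rotational disk, 0 an SSD):

[cephadm@admin ~]$ dsh -aM "lsblk -d -o NAME,SIZE,ROTA"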

Important note: if you previously installed Ceph on a device, you MUST “zap” (wipe) it first. Use the command “ceph-deploy disk zap n3:sdb”, for example.

Execute this step if you don’t know anything about the past usage of your disks.

Zap the disks. If you have a separate device for the journals (/dev/vdc, here):
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy disk zap n$i:vdb; done
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy disk zap n$i:vdc; done
If you use only one device :
cephadm@admin:~/cluster$ for i in {0..4}; do ceph-deploy disk zap n$i:vdb; done

Create the OSDs. Note: use --fs-type btrfs on “osd create” if you want (like me) a filesystem other than xfs. I’ve had obvious problems with xfs (corruptions while rebooting..)

cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy osd create --fs-type btrfs n$i:vdb:vdc; done

Otherwise, use the defaults (xfs):

cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy osd create n$i:vdb:vdc; done

And remember, if you have only one device (vdb for ex), use this instead (defaults with xfs):

cephadm@admin:~/cluster$ for i in {0..4}; do ceph-deploy osd create n$i:vdb; done


Deploy the ceph configuration to all storage nodes
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy admin n$i; done

Then check the permissions. For some reason, they are not correct:

cephadm@admin:~/cluster$ dsh -aM "ls -l /etc/ceph/*key*"
n0: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n1: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n2: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n3: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring

To correct this, issue the following command :

cephadm@admin:~/cluster$ dsh -aM "sudo chmod +r /etc/ceph/ceph.client.admin.keyring"

and check :

cephadm@admin:~/cluster$ dsh -aM "ls -l /etc/ceph/*key*"
n0: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n1: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n2: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n3: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
Finally, install the metadata servers
cephadm@admin:~/cluster$ ceph-deploy mds create n0 n1 n5
and the rados gateway
cephadm@admin:~/cluster$ ceph-deploy rgw create n3 n5

(on n3 and n5 for me)

One more time ;) remember to check the NTP status of the nodes:

cephadm@admin:~/cluster$ dsh -aM "timedatectl|grep synchron"
n0: NTP synchronized: yes
n1: NTP synchronized: yes
n2: NTP synchronized: yes
n3: NTP synchronized: yes
n4: NTP synchronized: yes

Check the cluster; on one node, type:

[cephadm@n0 ~]$ ceph status
cluster 2a663a93-7150-43f5-a8d2-e40e2d9d175f
health HEALTH_OK
monmap e2: 5 mons at {n0=192.168.10.210:6789/0,n1=192.168.10.211:6789/0,n2=192.168.10.212:6789/0,n3=192.168.10.213:6789/0,n4=192.168.10.214:6789/0}
election epoch 8, quorum 0,1,2,3,4 n0,n1,n2,n3,n4
osdmap e32: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v97: 104 pgs, 6 pools, 1588 bytes data, 171 objects
173 MB used, 3668 GB / 3668 GB avail
104 active+clean
Done !

Test your brand new ceph cluster

You can create a pool to test your new cluster :

[cephadm@n0 ~]$ rados mkpool test
successfully created pool test
[cephadm@n0 ~]$ rados lspools
rbd
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
test

[cephadm@n0 ~]$ rados put -p test .bashrc .bashrc
[cephadm@n0 ~]$ ceph osd map test .bashrc
osdmap e34 pool 'test' (6) object '.bashrc' -> pg 6.3d13d849 (6.1) -> up ([2,4,1], p2) acting ([2,4,1], p2)

A quick look at the cluster network to ensure it’s used as it should be (tcpdump -i <cluster interface> on n0):

22:43:17.137802 IP 10.1.1.12.50248 > n0.int.intra.acnet: Flags [P.], seq 646:655, ack 656, win 1424, options [nop,nop,TS val 3831166 ecr 3830943], length 9
22:43:17.177297 IP n0.int.intra.acnet > 10.1.1.12.50248: Flags [.], ack 655, win 235, options [nop,nop,TS val 3831203 ecr 3831166], length 0
22:43:17.205945 IP 10.1.1.13.42810 > n0.int.intra.acnet: Flags [P.], seq 393:515, ack 394, win 1424, options [nop,nop,TS val 4392067 ecr 3829192], length 122
22:43:17.205999 IP n0.int.intra.acnet > 10.1.1.13.42810: Flags [.], ack 515, win 252, options [nop,nop,TS val 3831231 ecr 4392067], length 0
22:43:17.206814 IP n0.int.intra.acnet > 10.1.1.13.42810: Flags [P.], seq 394:525, ack 515, win 252, options [nop,nop,TS val 3831232 ecr 4392067], length 131
22:43:17.207547 IP 10.1.1.13.42810 > n0.int.intra.acnet: Flags [.], ack 525, win 1424, options [nop,nop,TS val 4392069 ecr 3831232], length 0

….
Good !!

Now, “really” test your new cluster

Cf http://docs.ceph.com/docs/giant/rbd/libvirt/ :

First deploy the admin part of ceph on the destination system that will test your cluster

On the admin node :

[cephadm@admin cluster]$ ceph-deploy --overwrite-conf admin hyp03

On a hypervisor with access to the Ceph network, of course:

First, make the keyring readable so that any process can reach your cluster:

chmod +r /etc/ceph/ceph.client.admin.keyring
[root@hyp03 ~]# ceph osd pool create libvirt-pool 128 128
pool 'libvirt-pool' created

[root@hyp03 ~]# ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool'
[client.libvirt]
key = AQDsdMYVYR0IdmlkKDLKMZYUifn+lvqMH3D7Q==

Create a 16G image on your new cluster

[root@hyp03 ~]# qemu-img create -f rbd rbd:libvirt-pool/new-libvirt-image 16G
Formatting 'rbd:libvirt-pool/new-libvirt-image', fmt=rbd size=17179869184 cluster_size=0

Important: Jewel enables RBD features that are not compatible with CentOS 7.3. Disable them, otherwise you won’t be able to use your RBD image (either with rbd map or through qemu-img):

rbd feature disable libvirt-pool/new-libvirt-image exclusive-lock object-map fast-diff deep-flatten

Create a secret

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
        <usage type='ceph'>
                <name>client.libvirt secret</name>
        </usage>
</secret>
EOF

Then issue :

[root@hyp03 ~]# sudo virsh secret-define --file secret.xml
[root@hyp03 ~]# ceph auth get-key client.libvirt | sudo tee client.libvirt.key
sudo virsh secret-set-value --secret 12390708-973c-4f6e-b0be-aba963608006 --base64 $(cat client.libvirt.key) && rm client.libvirt.key secret.xml

Replicate the secret on all hosts where you want the VM to be able to run (especially for live migration). Repeat the previous steps on each of these hosts, but with a modified secret.xml file that includes the secret UUID created during the first run on the first host:

<secret ephemeral='no' private='no'>
    <uuid>12390708-973c-4f6e-b0be-aba963608006</uuid>
    <usage type='ceph'>
        <name>client.libvirt secret</name>
    </usage>
</secret>

Follow the guide http://docs.ceph.com/docs/giant/rbd/libvirt/ for the VM configuration (a minimal disk XML sketch is shown below), and then:

sudo virsh secret-define --file secret.xml
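For reference, the domain XML of the VM (virsh edit dv03) needs a network disk pointing at the RBD image. A minimal sketch, reusing the secret UUID above; the monitor address, pool/image names and target device are assumptions to adapt to your setup:

<disk type='network' device='disk'>
    <source protocol='rbd' name='libvirt-pool/new-libvirt-image'>
        <host name='192.168.10.210' port='6789'/>
    </source>
    <auth username='libvirt'>
        <secret type='ceph' uuid='12390708-973c-4f6e-b0be-aba963608006'/>
    </auth>
    <target dev='vda' bus='virtio'/>
</disk>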

[root@hyp03 ~]# virsh start dv03
Domain dv03 started

You’re done !

Configure the cluster

Crush map

For my needs, I want my cluster to stay up even if one of my hosts is down.

In my home “datacenter”, I have two “racks”, two “physical hosts”, and 6 “ceph virtual hosts”, each of them running a 1TB OSD.

How do I ensure that data replication happens in such a way that no data lives only on one physical host? You do that by managing your Ceph CRUSH map with rules.

First, organize your ceph hosts in your “datacenter”.

Because my home is not really a datacenter, for this example I will call “hosts” the virtual machines hosting CentOS 7.3/Ceph, with one OSD per VM.

I will call “racks” the two physical hosts that run those “hosts” (VMs).

I will call “datacenter” the rack where my two physical hosts are installed.

Create the datacenter, racks, and move them into the right place

ceph osd crush add-bucket rack1 rack
ceph osd crush move n0 rack=rack1
ceph osd crush move n1 rack=rack1
ceph osd crush move n2 rack=rack1
ceph osd crush move n3 rack=rack1
ceph osd crush move rack1 root=default
ceph osd crush add-bucket rack2 rack
ceph osd crush move rack2 root=default
ceph osd crush move n4 rack=rack2
ceph osd crush move n5 rack=rack2
ceph osd crush add-bucket dc datacenter
ceph osd crush move dc root=default
ceph osd crush move rack1 datacenter=dc
ceph osd crush move rack2 datacenter=dc

Look at the results

[root@hyp03 ~]# ceph osd tree
ID  WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.54849 root default
-10 5.54849     datacenter dc
-8 3.63879         rack rack1
-2 0.90970             host n0
0 0.90970                 osd.0      up  1.00000          1.00000
-3 0.90970             host n1
1 0.90970                 osd.1      up  1.00000          1.00000
-4 0.90970             host n2
2 0.90970                 osd.2      up  1.00000          1.00000
-5 0.90970             host n3
3 0.90970                 osd.3      up  1.00000          1.00000
-9 1.90970         rack rack2
-6 0.90970             host n4
4 0.90970                 osd.4      up  1.00000          1.00000
-7 1.00000             host n5
5 1.00000                 osd.5      up  1.00000          1.00000

Final config in June 2017:

[cephadm@admin ~]$ ceph osd tree
ID  WEIGHT  TYPE NAME                     UP/DOWN REWEIGHT PRIMARY-AFFINITY 
 -1 7.36800 root default                                                    
-10 7.36800     datacenter dc                                               
-13 7.36800         room garage                                             
-14 4.54799             chassis chassis1                                    
 -8 4.54799                 rack rack1                                      
 -2 0.90999                     host n0                                     
  0 0.90999                         osd.0      up  1.00000          1.00000 
 -3 0.90999                     host n1                                     
  1 0.90999                         osd.1      up  1.00000          1.00000 
 -4 0.90999                     host n2                                     
  2 0.90999                         osd.2      up  1.00000          1.00000 
 -5 0.90999                     host n3                                     
  3 0.90999                         osd.3      up  1.00000          1.00000 
-11 0.90999                     host n6                                     
  6 0.90999                         osd.6      up  1.00000          1.00000 
-15 2.81898             chassis chassis2                                    
 -9 2.81898                 rack rack2                                      
 -6 0.90999                     host n4                                     
  4 0.90999                         osd.4      up  1.00000          1.00000 
 -7 1.00000                     host n5                                     
  5 1.00000                         osd.5      up  1.00000          1.00000 
-12 0.90999                     host n7                                     
  7 0.90999                         osd.7      up  1.00000          1.00000

Data replication: play with the crush map

In order to manage replicas, you create crush rules.

The crush map is shared by the nodes and provided to the clients; it ensures that data replication follows your policy.

Run the following commands with an authorized admin user of your cluster

First, extract the crush map of your cluster

ceph osd getcrushmap -o crush

It’s a binary file; you have to decompile it so that you can edit the rules:

crushtool -d crush -o crush.txt

Edit the crush.txt file and add the following rule at the end.
We will place 2 copies of the data on rack 1 (which has more OSDs) and the last one on rack 2.

rule 3_rep_2_racks {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 2 type rack 
    step chooseleaf firstn 2 type osd
    step emit
}

Then, recompile the rules

crushtool -c crush.txt -o crushnew

The new compiled crush map is in the “crushnew” file, which you then inject into your cluster nodes:

ceph osd setcrushmap -i crushnew

Then make sure you apply your rule to your pools (it is not selected automatically).
For example, I want my VM blocks to be replicated with this rule (my VMs are stored in the pool libvirt-pool):

ceph osd pool set libvirt-pool crush_ruleset 1

Execute the following command several times

ceph status

You will see the cluster rebalance your data dynamically until it’s OK (ceph health)
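If you prefer not to rerun the command by hand, you can also follow the recovery continuously (purely a convenience):

[cephadm@admin ~]$ ceph -w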

[cephadm@admin ~]$ ceph status
    cluster cd687e36-5670-48f5-b324-22a25082bede
     health HEALTH_WARN
            51 pgs backfill_wait
            3 pgs backfilling
            13 pgs degraded
            13 pgs recovery_wait
            67 pgs stuck unclean
            recovery 2188/32729 objects degraded (6.685%)
            recovery 9450/32729 objects misplaced (28.873%)
     monmap e5: 3 mons at {n0=192.168.10.210:6789/0,n4=192.168.10.214:6789/0,n8=192.168.10.218:6789/0}
            election epoch 98, quorum 0,1,2 n0,n4,n8
      fsmap e56: 1/1/1 up {0=n1=up:active}, 2 up:standby
     osdmap e915: 8 osds: 8 up, 8 in; 54 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v3123382: 304 pgs, 10 pools, 35945 MB data, 9318 objects
            107 GB used, 7429 GB / 7544 GB avail
            2188/32729 objects degraded (6.685%)
            9450/32729 objects misplaced (28.873%)
                 237 active+clean
                  51 active+remapped+wait_backfill
                  13 active+recovery_wait+degraded
                   3 active+remapped+backfilling
recovery io 116 MB/s, 29 objects/s

And finally :

[cephadm@admin ~]$ ceph status
    cluster cd687e36-5670-48f5-b324-22a25082bede
     health HEALTH_OK
     monmap e5: 3 mons at {n0=192.168.10.210:6789/0,n4=192.168.10.214:6789/0,n8=192.168.10.218:6789/0}
            election epoch 98, quorum 0,1,2 n0,n4,n8
      fsmap e56: 1/1/1 up {0=n1=up:active}, 2 up:standby
     osdmap e1010: 8 osds: 8 up, 8 in
            flags sortbitwise,require_jewel_osds
      pgmap v3123673: 304 pgs, 10 pools, 35945 MB data, 9318 objects
            106 GB used, 7430 GB / 7544 GB avail
                 304 active+clean
recovery io 199 MB/s, 49 objects/s

Repeat this for all pools you want to follow this new crush rule.
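If you want every existing pool to use the rule, a one-liner sketch (check the pool list first, and make sure the ruleset number matches the rule you compiled):

[cephadm@admin ~]$ for p in $(rados lspools); do ceph osd pool set $p crush_ruleset 1; done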

In my example, with the following osd tree, you can see there is more data per OSD on the rack that has fewer drives.

[cephadm@admin ~]$ ceph osd df tree
ID  WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS TYPE NAME                     
 -1 7.36800        - 7544G   106G 7430G 1.41 1.00   0 root default                  
-10 7.36800        - 7544G   106G 7430G 1.41 1.00   0     datacenter dc             
-13 7.36800        - 7544G   106G 7430G 1.41 1.00   0         room garage           
-14 4.54799        - 4657G 59074M 4595G 1.24 0.88   0             chassis chassis1  
 -8 4.54799        - 4657G 59074M 4595G 1.24 0.88   0                 rack rack1    
 -2 0.90999        -  931G 10986M  919G 1.15 0.82   0                     host n0   
  0 0.90999  1.00000  931G 10986M  919G 1.15 0.82  94                         osd.0 
 -3 0.90999        -  931G 11617M  919G 1.22 0.87   0                     host n1   
  1 0.90999  1.00000  931G 11617M  919G 1.22 0.87 105                         osd.1 
 -4 0.90999        -  931G 11640M  919G 1.22 0.87   0                     host n2   
  2 0.90999  1.00000  931G 11640M  919G 1.22 0.87 114                         osd.2 
 -5 0.90999        -  931G 12416M  918G 1.30 0.92   0                     host n3   
  3 0.90999  1.00000  931G 12416M  918G 1.30 0.92 111                         osd.3 
-11 0.90999        -  931G 12413M  918G 1.30 0.92   0                     host n6   
  6 0.90999  1.00000  931G 12413M  918G 1.30 0.92 103                         osd.6 
-15 2.81898        - 2887G 49665M 2835G 1.68 1.19   0             chassis chassis2  
 -9 2.81898        - 2887G 49665M 2835G 1.68 1.19   0                 rack rack2    
 -6 0.90999        -  931G 17310M  913G 1.81 1.29   0                     host n4   
  4 0.90999  1.00000  931G 17310M  913G 1.81 1.29 125                         osd.4 
 -7 1.00000        - 1023G 16977M 1006G 1.62 1.15   0                     host n5   
  5 1.00000  1.00000 1023G 16977M 1006G 1.62 1.15 133                         osd.5 
-12 0.90999        -  931G 15376M  915G 1.61 1.15   0                     host n7   
  7 0.90999  1.00000  931G 15376M  915G 1.61 1.15 127                         osd.7 
               TOTAL 7544G   106G 7430G 1.41                                        
MIN/MAX VAR: 0.82/1.29  STDDEV: 0.23

Useful command :

[cephadm@admin ~]$ ceph osd df 
ID WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS 
 0 0.90999  1.00000  931G 10985M  919G 1.15 0.82  87 
 1 0.90999  1.00000  931G 11615M  919G 1.22 0.87 106 
 2 0.90999  1.00000  931G 11639M  919G 1.22 0.87 106 
 3 0.90999  1.00000  931G 12414M  918G 1.30 0.93 114 
 6 0.90999  1.00000  931G 12412M  918G 1.30 0.93 105 
 4 0.90999  1.00000  931G 17044M  913G 1.79 1.27 134 
 5 1.00000  1.00000 1023G 16977M 1006G 1.62 1.15 130 
 7 0.90999  1.00000  931G 15374M  915G 1.61 1.15 130 
              TOTAL 7544G   105G 7430G 1.40          
MIN/MAX VAR: 0.82/1.27  STDDEV: 0.22

Configuration management: Puppet

Now we have to install a configuration management tool. It saves a lot of time..

Master installation

On the admin node, we will install the master :

[root@admin ~]# sudo rpm -ivh https://yum.puppetlabs.com/puppetlabs-release-pc1-el-7.noarch.rpm
[root@admin ~]# sudo yum -y install puppetserver
[root@admin ~]# systemctl enable puppetserver
[root@admin ~]# sudo systemctl start puppetserver

Agents installation :

Use dsh from the admin node :

[root@admin ~]# dsh -aM "sudo rpm -ivh https://yum.puppetlabs.com/puppetlabs-release-pc1-el-7.noarch.rpm"
[root@admin ~]# dsh -aM "sudo yum -y install puppet-agent"

Enable the agent

[root@admin ~]# dsh -aM "systemctl enable puppet"

Configure the agents: you need to set the server name if it’s not “puppet” (the default). Use an FQDN, it’s important.

[root@admin ~]# dsh -aM "sudo /opt/puppetlabs/bin/puppet config set server admin.int.intra"

Start the agent

[root@admin ~]# dsh -aM "systemctl start puppet"

Puppet configuration

On the admin node, check that all the agents have published their certificates to the server:

[root@admin ~]# sudo /opt/puppetlabs/bin/puppet cert list
"n0.int.intra" (SHA256) 95:6B:A3:07:DA:70:04:D7:9B:18:4D:64:30:39:A1:19:9E:68:B9:6B:9C:92:DC:AB:98:36:16:6D:F3:66:B3:56
"n1.int.intra" (SHA256) 07:E3:1B:1F:6F:80:33:6C:A9:A4:96:88:71:A0:74:19:B0:DE:3A:EA:B2:36:2A:38:43:B1:5D:3E:92:3C:D0:47
"n2.int.intra" (SHA256) 62:2E:7E:91:CE:75:53:0C:DA:16:28:C7:14:EA:05:33:CD:DA:8D:B8:A4:A3:59:1B:B0:78:3B:29:AE:A6:CB:C4
"n3.int.intra" (SHA256) 77:92:0F:75:2F:75:E2:8F:68:22:4A:43:4C:BB:79:C5:24:6D:BB:98:42:D0:87:A5:13:57:52:9C:3D:82:D8:74
"n4.int.intra" (SHA256) 55:F4:15:F3:83:3A:39:99:B6:15:EC:D6:09:24:6D:6D:D2:07:9B:54:F5:73:15:C5:C8:74:9F:8F:BB:A0:E2:43

Sign the certificates

[root@admin ~]#  for i in {0..4}; do /opt/puppetlabs/bin/puppet cert sign n$i.int.intra ; done

Finished! You can check that all the nodes have a valid certificate:

[root@admin ~]# sudo /opt/puppetlabs/bin/puppet cert list --all
+ "admin.int.intra" (SHA256) F5:13:EE:E9:C2:F1:A7:86:01:3C:95:EE:61:EE:53:21:E9:75:15:24:45:FB:67:B8:D9:60:60:FE:DE:93:59:F6 (alt names: "DNS:puppet", "DNS:admin.int.intra")
+ "n0.int.intra"    (SHA256) 9D:C0:3E:AB:FD:67:00:DB:B5:25:CD:23:71:A4:2F:C5:3F:A6:56:FE:55:CA:5D:27:95:C6:97:79:A9:B2:7F:CB
+ "n1.int.intra"    (SHA256) 4F:C6:C1:B9:CD:21:4C:3A:76:B5:CF:E4:56:0D:20:D2:1D:72:35:7B:D9:53:86:D9:CD:CB:8D:3C:E8:39:F4:C2
+ "n2.int.intra"    (SHA256) D7:6E:85:63:04:CC:C6:24:79:E3:C2:CE:F2:0F:5B:2E:FA:EE:D9:EF:9C:E3:46:6A:83:9F:AA:DA:5D:3F:F8:52
+ "n3.int.intra"    (SHA256) 1C:95:61:C8:F6:E2:AF:4F:A5:52:B3:E0:CE:87:CF:16:02:2B:39:2C:61:EC:20:21:D0:BD:33:70:42:7A:6E:D9
+ "n4.int.intra"    (SHA256) E7:B6:4B:1B:0A:22:F8:C4:F1:E5:A9:3B:EA:17:5F:54:41:97:68:AF:D0:EC:A6:DB:74:3E:F9:7E:BF:04:16:FF

You now have a working Puppet configuration management system..

Monitoring

Telegraf

Install Telegraf on the nodes, with a puppet manifest.

vi /etc/puppetlabs/code/environments/production/manifests/site.pp

include this text in the file site.pp :

node 'n0', 'n1', 'n2', 'n3', 'n4' {
    file {'/etc/yum.repos.d/influxdb.repo':
        ensure  => present,                                               # make sure it exists
        mode    => '0644',                                                # file permissions
        content => "[influxdb]\nname = InfluxDB Repository - RHEL \$releasever\nbaseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable\nenabled = 1\ngpgcheck = 1\ngpgkey = https://repos.influxdata.com/influxdb.key\n",
    }
}

Install it on all nodes (we could do that with puppet, too):

dsh -aM "sudo yum install telegraf"

Create a puppet module for telegraf

[root@admin modules]# cd /etc/puppetlabs/code/modules
[root@admin modules]# mkdir -p telegraf_client/{files,manifests,templates}

Create a template for telegraf.conf

[root@admin telegraf_client]# vi templates/telegraf.conf.template

put the following in that file (note the fqdn variable) :

[tags]

# Configuration for telegraf agent
[agent]
debug = false
flush_buffer_when_full = true
flush_interval = "15s"
flush_jitter = "0s"
hostname = "&lt;%= fqdn %&gt;"
interval = "15s"
round_interval = true

Create a template for the inputs :

[root@admin telegraf_client]# vi templates/inputs_system.conf.template

put the following (no variables, yet. customize for your needs..) :

# Read metrics about CPU usage
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = [ "usage*" ]

# Read metrics about disk usagee
[[inputs.disk]]
fielddrop = [ "inodes*" ]
mount_points=["/","/home"]

# Read metrics about diskio usage
[[inputs.diskio]]
devices = ["sda2","sda3"]
skip_serial_number = true

# Read metrics about network usage
[[inputs.net]]
interfaces = [ "eth0" ]
fielddrop = [ "icmp*", "ip*", "tcp*", "udp*" ]

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
# no configuration

# Read metrics about system load & uptime
[[inputs.system]]
# no configuration

Create a template for the outputs :

[root@admin telegraf_client]# vi templates/outputs.conf.template

and put the following text in the file

[[outputs.influxdb]]
database = "telegraf"
precision = "s"
urls = [ "http://admin:8086" ]
username = "telegraf"
password = "your_pass"

create the manifest for your module

[root@admin ~]# vi /etc/puppetlabs/code/modules/telegraf_client/manifests/init.pp

and add the following contents :

class telegraf_client {

    package { 'telegraf':
        ensure => installed,
    }

    file { "/etc/telegraf/telegraf.conf":
        ensure  => present,
        owner   => root,
        group   => root,
        mode    => "644",
        content => template("telegraf_client/telegraf.conf.template"),
    }

    file { "/etc/telegraf/telegraf.d/outputs.conf":
        ensure  => present,
        owner   => root,
        group   => root,
        mode    => "644",
        content => template("telegraf_client/outputs.conf.template"),
    }

    file { "/etc/telegraf/telegraf.d/inputs_system.conf":
        ensure  => present,
        owner   => root,
        group   => root,
        mode    => "644",
        content => template("telegraf_client/inputs_system.conf.template"),
    }

    service { 'telegraf':
        ensure => running,
        enable => true,
    }
}

And finally, include the module in the global puppet manifest file. Here is mine :

[root@admin ~]# vi /etc/puppetlabs/code/environments/production/manifests/site.pp

(its content is:)

node default {
    case $facts['os']['name'] {
        'Solaris':           { include solaris }
        'RedHat', 'CentOS':  { include centos  }
        /^(Debian|Ubuntu)$/: { include debian  }
        default:             { include generic }
    }
}

node 'n0','n1','n2','n3','n4' {
    include cephnode
}

class cephnode {
    include telegraf_client
}

class centos {
    yumrepo { "CentOS-OS-Local":
        baseurl  => "http://nas4/centos/\$releasever/os/\$basearch",
        descr    => "Centos int.intra mirror (os)",
        enabled  => 1,
        gpgcheck => 0,
        priority => 1
    }
    yumrepo { "CentOS-Updates-Local":
        baseurl  => "http://nas4/centos/\$releasever/updates/\$basearch",
        descr    => "Centos int.intra mirror (updates)",
        enabled  => 1,
        gpgcheck => 0,
        priority => 1
    }

    yumrepo { "InfluxDB":
        baseurl  => "https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable",
        descr    => "InfluxDB Repository - RHEL \$releasever",
        enabled  => 1,
        gpgcheck => 1,
        gpgkey   => "https://repos.influxdata.com/influxdb.key"
    }
}

Wait a few minutes for puppet to apply your work on the nodes, or run:

[root@admin ~]# dsh -aM "/opt/puppetlabs/bin/puppet agent --test"

Check that telegraf is up and running, and check the measurements in InfluxDB.
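A quick way to verify from the admin node (the database name matches the outputs.conf above; adjust if you changed it):

[root@admin ~]# dsh -aM "systemctl is-active telegraf"
[root@admin ~]# influx -database telegraf -execute "SHOW MEASUREMENTS"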

Result: (screenshot)