The hardware..
In my “Homelab: Highly resilient datacenter-in-two-boxes with CentOS 7 and Ceph Jewel” article, I explained how to build a low-power homelab.
With this hardware and a bunch of low-power disks (2.5″ 5400 rpm), you can build a low-power virtualized storage system with Ceph and store all your data with top-level NAS software.
The software :
CentOS 7.3 (1611) x86-64 “minimal”
Ceph “jewel” x86-64
Puppet (configuration management software)
Topology
Number of MONs
It’s recommended to install at least 3 MONs for resilience reasons.
For my needs, I will install 5 MONs on my 5 hosts.
In this example, all hosts are virtualized. In my case, I have 3 physical hosts (see other pages..). One of them (an Intel NUC) cannot host an OSD: in my final cluster map, there is 1 MON on the NUC and 1 on each storage host. This ensures that when a host is down, the quorum is still satisfied and my Ceph cluster stays UP.
Installing the cluster
Preparing the hardware and the OS
Requirements :
This blog does not cover the OS installation procedure. Before you continue, be sure to configure your OS with these additional requirements:
- install ntpd and configure /etc/ntp.conf with your preferred NTP servers
You can also use chronyd instead (look at the chrony.conf config file).
It’s safer and more efficient to have a time source close to your cluster; Wi-Fi APs and DSL routers often provide such a service. My configuration uses my ADSL router, based on OpenWrt (you can set up ntpd on OpenWrt…).
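For illustration, the relevant lines of /etc/ntp.conf could look like this (the router IP is only an example; point it at your own local time source):
# /etc/ntp.conf (excerpt), hypothetical local time source
server 192.168.10.1 iburst            # local router running ntpd (OpenWrt)
server 0.centos.pool.ntp.org iburst   # public fallback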
Then run :
root@n0:~# timedatectl set-ntp true
- disable SELinux (see /etc/selinux/config)
- disable firewalld (systemctl disable firewalld.service); see the example commands below
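A minimal sketch of these two steps, to run on each node (adapt to your own security policy; setenforce 0 only applies until the next reboot):
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
sudo setenforce 0
sudo systemctl disable firewalld.service
sudo systemctl stop firewalld.service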
Finally, ensure everything is still OK after rebooting your node.
Create the ceph admin user on each node :
On each node, create a Ceph admin user (used for deployment tasks). It’s important to choose a user name other than “ceph”, which is used by the Ceph installer.
Note: you can omit the -s option of useradd; using bash is a personal choice.
root@n0:~# sudo useradd -d /home/cephadm -m cephadm -s /bin/bash
root@n0:~# sudo passwd cephadm
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
root@n0:~# echo "cephadm ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
root@n0:~# chmod 444 /etc/sudoers.d/ceph
…and so on for the admin host and nodes n1, n2 [and n3, …]
To automate this task, use dsh on the admin node (its installation is covered a bit further down), after having configured SSH for root (ssh-copy-id root@nodes):
echo "cephadm ALL = (root) NOPASSWD:ALL" | dsh -aM -i -c 'sudo tee /etc/sudoers.d/ceph'
dsh -aM "chmod 444 /etc/sudoers.d/ceph"
Also, install lsb first, it will be useful later.
yum install redhat-lsb-core
Set up SSH authentication with cryptographic keys
On the admin node :
Create the ssh key for the user cephadm
root@admin:~# su - cephadm
cephadm@admin:~$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/cephadm/.ssh/id_dsa):
Created directory '/home/cephadm/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cephadm/.ssh/id_dsa.
Your public key has been saved in /home/cephadm/.ssh/id_dsa.pub.
The key fingerprint is:
ec:16:ad:b4:76:e4:32:c6:7c:14:45:bc:c3:78:5a:cf cephadm@admin
The key's randomart image is:
+---[DSA 1024]----+
| oo |
| .. |
| .o . |
| . ...* |
| S ++ + |
| = B. E |
| % + |
| + = |
| |
+-----------------+
cephadm@admin:~$
Then push it to the nodes of the cluster:
[cephadm@admin ~]$ ssh-copy-id cephadm@n0
[cephadm@admin ~]$ ssh-copy-id cephadm@n1
[cephadm@admin ~]$ ssh-copy-id cephadm@n2
[cephadm@admin ~]$ ssh-copy-id cephadm@n3
[cephadm@admin ~]$ ssh-copy-id cephadm@n4
Or better, automate it (if you do this often):
#!/bin/sh
# sudo yum install moreutils sshpass openssh-clients
echo 'Enter password:';
read -s SSHPASS;
export SSHPASS;
for i in {0..4}; do sshpass -e ssh-copy-id -o StrictHostKeyChecking=no cephadm@n$i.int.intra -p 22 ; done
export SSHPASS=''
Now install dsh on the admin node; I build it from source:
[root@admin ~]# yum install -y gcc
[root@admin ~]# yum install -y gcc-c++
[root@admin ~]# yum install -y wget
[root@admin ~]# wget https://www.netfort.gr.jp/~dancer/software/downloads/dsh-0.25.9.tar.gz
[root@admin ~]# wget https://www.netfort.gr.jp/~dancer/software/downloads/libdshconfig-0.20.9.tar.gz
[root@admin ~]# tar xvfz libdshconfig-0.20.9.tar.gz
[root@admin ~]# cd libdshconfig-0.20.9
[root@admin libdshconfig-0.20.9]# ./configure
[root@admin libdshconfig-0.20.9]# make
[root@admin libdshconfig-0.20.9]# make install
[root@admin ~]# tar xvfz dsh-0.25.9.tar.gz
[root@admin ~]# cd dsh-0.25.9
[root@admin dsh-0.25.9]# ./configure
[root@admin dsh-0.25.9]# make
[root@admin dsh-0.25.9]# make install
[root@admin ~]# echo /usr/local/lib > /etc/ld.so.conf.d/dsh.conf
[root@admin ~]# ldconfig
Done. Then configure it :
[root@admin ~]# vi /usr/local/etc/dsh.conf
insert these lines :
remoteshell =ssh
waitshell=1 # whether to wait for execution
Create the default machine list file
[root@admin ~]# su - cephadm
cephadm@admin:~$ cd
cephadm@admin:~$ mkdir .dsh
cephadm@admin:~$ cd .dsh
cephadm@admin:~/.dsh$ for i in {0..4} ; do echo "n$i" >> machines.list ; done
Test…
[cephadm@admin ~]$ dsh -aM uptime
n0: 16:23:21 up 3 min, 0 users, load average: 0.20, 0.39, 0.20
n1: 16:23:22 up 3 min, 0 users, load average: 0.19, 0.40, 0.21
n2: 16:23:23 up 3 min, 0 users, load average: 0.13, 0.38, 0.20
n3: 16:23:24 up 4 min, 0 users, load average: 0.00, 0.02, 0.02
n4: 16:23:25 up 3 min, 0 users, load average: 0.24, 0.38, 0.20
Another test :
[cephadm@admin ~]$ dsh -aM cat /proc/cpuinfo | grep model\ name
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n0: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n1: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n2: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n3: model name : Intel Core Processor (Broadwell)
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
n4: model name : Intel(R) Atom(TM) x5-Z8300 CPU @ 1.44GHz
Good.. !!
Now you’re ready to install your cluster with automated commands from your admin node. Note that several other solutions are good enough, like cssh (Cluster SSH). Choose the best one for your needs ;)
Well, I’m now assuming you have followed the installation procedure and met the requirements above :).
Here’s my configuration :
n0: 192.168.10.210/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n1: 192.168.10.211/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n2: 192.168.10.212/24 1TB HGST 2.5” 5400rpm (data) + 20 GB on Crucial MX200 SSD (journal)
n3: 192.168.10.213/24 1TB WD Red 2.5” 5400rpm (data) + 20 GB on SM951 NVMe SSD (journal)
n4: 192.168.10.214/24 1TB Hitachi 3.5” 7200rpm (data) + 20 GB on Crucial MX100 SSD (journal)
n5: 192.168.10.215/24 1TB ZFS (on 2x WD Green 5TB) + 20 GB on Crucial MX100 SSD (journal)
admin : 192.168.10.177/24 (VM)
Finally, don’t forget to change your yum repositories if you installed the OSes from local media. They should now point to a mirror for all updates (security and software).
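As an illustration only, a local-mirror repo file could look like this (nas4 is my local mirror; adapt the URL to yours):
# /etc/yum.repos.d/CentOS-Local.repo (example)
[base-local]
name=CentOS $releasever - os (local mirror)
baseurl=http://nas4/centos/$releasever/os/$basearch
gpgcheck=0
enabled=1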
Reboot your nodes if you want to be very sure you haven’t forgotten anything, and test them with dsh, for example for NTP:
[cephadm@admin ~]$ dsh -aM timedatectl status|grep NTP
n0: NTP enabled: yes
n0: NTP synchronized: yes
n1: NTP enabled: yes
n1: NTP synchronized: yes
n2: NTP enabled: yes
n2: NTP synchronized: yes
n3: NTP enabled: yes
n3: NTP synchronized: yes
n4: NTP enabled: yes
n4: NTP synchronized: yes
...
Install your Ceph cluster
Get the software
Ensure each node is up to date at the very beginning of this procedure.
Feel free to use dsh from the admin node for each task you would like to apply to the nodes ;)
[cephadm@admin ~]$ dsh -aM "sudo yum -y upgrade"
Install the repos
On the admin node only, configure the ceph repos.
You have a choice: do it like this if you want to download the Ceph packages from the internet:
[cephadm@admin ~]$ sudo yum install https://download.ceph.com/rpm-jewel/el7/noarch/ceph-release-1-1.el7.noarch.rpm
Or, if you want a local mirror, look at the section below on setting up Puppet (for example) to do that. I prefer this option myself, because I have a local mirror (for testing purposes, it’s better to download locally).
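For example, a hand-written repo file pointing at a local mirror could look roughly like this (the host “mirror” and the path are just my layout, and gpgcheck is disabled only because this is a lab):
# /etc/yum.repos.d/ceph.repo (hypothetical local mirror)
[ceph]
name=Ceph packages (local mirror)
baseurl=http://mirror/ceph/rpm-jewel/el7/x86_64/
gpgcheck=0
enabled=1

[ceph-noarch]
name=Ceph noarch packages (local mirror)
baseurl=http://mirror/ceph/rpm-jewel/el7/noarch/
gpgcheck=0
enabled=1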
Install Ceph-deploy
This tool is written in Python.
[cephadm@admin ~]$ sudo yum install ceph-deploy
Install Ceph
Still on the admin node, create a directory that will contain all the configuration for your cluster:
cephadm@admin:~$ mkdir cluster
cephadm@admin:~$ cd cluster
I have chosen to install 4 monitors (3 would be sufficient at home, but my needs aren’t your needs).
cephadm@admin:~/cluster$ ceph-deploy new n{0,2,4,5}
(It generates a lot of stdout messages)
Now edit ceph.conf (in the “cluster” directory), tell Ceph you want 3 replicas, and add the cluster and public networks to the [global] section; for me: 10.1.1.0/24 and 192.168.10.0/24.
The file ceph.conf should contain the following lines now :
[global]
fsid = 74a80a50-b7f9-4588-baa4-bb242c3d4cf0
mon_initial_members = n0, n1, n3
mon_host = 192.168.10.210,192.168.10.211,192.168.10.213
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 3
cluster network = 10.1.1.0/24
public network = 192.168.10.0/24
[osd]
osd mkfs type = btrfs
osd journal size = 20000
Please note that I will use Btrfs to store the data. My kernel is recent enough for that (4.9), and I experienced obvious filesystem corruptions when simply rebooting nodes that had kernel 3.10 and an XFS partition for the OSDs.
If you install from a local mirror :
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --repo-url {http mirror} --gpg-url {http gpg url} --release jewel n$i; done
For example, for me:
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --repo-url http://mirror/ceph/rpm-jewel/el7/ --release jewel n$i; done
Else :
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy install --release jewel n$i; done
This command generates a lot of logs (downloads, debug messages, warnings…) but should return without error. Otherwise, check the error and google it. You can also just restart ceph-deploy; depending on the error, it may work the second time ;) (I experienced some problems accessing the Ceph repository, for example…)
Create the mons:
cephadm@admin:~/cluster$ ceph-deploy mon create-initial
Again, a lot of logs… but no errors.
Create the OSDs (storage units)
You have to know which device will be used for the data on each node and which device for the journal. If you are building a Ceph cluster for a production environment, you should use SSDs for the journal partition. For testing purposes, you can use only one device.
In my case, I took care to make the OSD data disk /dev/vdb on all nodes, and the journal (SSD) /dev/vdc.
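If you want to double-check the device names on every node before going further, a quick lsblk through dsh does the job:
[cephadm@admin ~]$ dsh -aM "lsblk -d -o NAME,SIZE,ROTA,TYPE"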
Important note: if you previously installed Ceph on a device, you MUST “zap” (wipe) it first, for example with “ceph-deploy disk zap n3:sdb”.
Execute this step if you don’t know anything about the past usage of your disks.
Zap the disks. If you have a separate SSD device for the journals (/dev/vdc here):
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy disk zap n$i:vdb; done
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy disk zap n$i:vdc; done
If you use only one device :
cephadm@admin:~/cluster$ for i in {0..4}; do ceph-deploy disk zap n$i:vdb; done
Create the OSDs. Note: use --fs-type btrfs on “osd create” if you want (like me) a filesystem other than XFS. I’ve had obvious problems with XFS (corruptions while rebooting).
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy osd create --fs-type btrfs n$i:vdb:vdc; done
Else use the defaults (xfs)
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy osd create n$i:vdb:vdc; done
And remember, if you have only one device (vdb for ex), use this instead (defaults with xfs):
cephadm@admin:~/cluster$ for i in {0..4}; do ceph-deploy osd create n$i:vdb; done
Deploy the ceph configuration to all storage nodes
cephadm@admin:~/cluster$ for i in {0..5}; do ceph-deploy admin n$i; done
And check the permissions. For some reason, they are not correct:
cephadm@admin:~/cluster$ dsh -aM "ls -l /etc/ceph/*key*"
n0: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n1: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n2: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n3: -rw------- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
To correct this, issue the following command :
cephadm@admin:~/cluster$ dsh -aM "sudo chmod +r /etc/ceph/ceph.client.admin.keyring"
and check :
cephadm@admin:~/cluster$ dsh -aM "ls -l /etc/ceph/*key*"
n0: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n1: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n2: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
n3: -rw-r--r-- 1 root root 63 Oct 26 19:01 /etc/ceph/ceph.client.admin.keyring
Create the MDS (metadata servers):
cephadm@admin:~/cluster$ ceph-deploy mds create n0 n1 n5
And the RADOS gateway:
cephadm@admin:~/cluster$ ceph-deploy rgw create n3 n5
(on n3 and n5 for me)
One more time ;) remember to check the NTP status of the nodes:
cephadm@admin:~/cluster$ dsh -aM "timedatectl|grep synchron"
n0: NTP synchronized: yes
n1: NTP synchronized: yes
n2: NTP synchronized: yes
n3: NTP synchronized: yes
n4: NTP synchronized: yes
Check the cluster; on one node, type:
[cephadm@n0 ~]$ ceph status
cluster 2a663a93-7150-43f5-a8d2-e40e2d9d175f
health HEALTH_OK
monmap e2: 5 mons at {n0=192.168.10.210:6789/0,n1=192.168.10.211:6789/0,n2=192.168.10.212:6789/0,n3=192.168.10.213:6789/0,n4=192.168.10.214:6789/0}
election epoch 8, quorum 0,1,2,3,4 n0,n1,n2,n3,n4
osdmap e32: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v97: 104 pgs, 6 pools, 1588 bytes data, 171 objects
173 MB used, 3668 GB / 3668 GB avail
104 active+clean
Done !
Test your brand new ceph cluster
You can create a pool to test your new cluster :
[cephadm@n0 ~]$ rados mkpool test
successfully created pool test
[cephadm@n0 ~]$ rados lspools
rbd
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
test
[cephadm@n0 ~]$ rados put -p test .bashrc .bashrc
[cephadm@n0 ~]$ ceph osd map test .bashrc
osdmap e34 pool 'test' (6) object '.bashrc' -> pg 6.3d13d849 (6.1) -> up ([2,4,1], p2) acting ([2,4,1], p2)
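When you’re done playing, you can drop the test pool (Jewel-era syntax; the pool name has to be given twice):
[cephadm@n0 ~]$ rados rmpool test test --yes-i-really-really-mean-it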
Take a quick look at the cluster network to ensure it’s used as it should be (tcpdump -i on n0):
22:43:17.137802 IP 10.1.1.12.50248 > n0.int.intra.acnet: Flags [P.], seq 646:655, ack 656, win 1424, options [nop,nop,TS val 3831166 ecr 3830943], length 9
22:43:17.177297 IP n0.int.intra.acnet > 10.1.1.12.50248: Flags [.], ack 655, win 235, options [nop,nop,TS val 3831203 ecr 3831166], length 0
22:43:17.205945 IP 10.1.1.13.42810 > n0.int.intra.acnet: Flags [P.], seq 393:515, ack 394, win 1424, options [nop,nop,TS val 4392067 ecr 3829192], length 122
22:43:17.205999 IP n0.int.intra.acnet > 10.1.1.13.42810: Flags [.], ack 515, win 252, options [nop,nop,TS val 3831231 ecr 4392067], length 0
22:43:17.206814 IP n0.int.intra.acnet > 10.1.1.13.42810: Flags [P.], seq 394:525, ack 515, win 252, options [nop,nop,TS val 3831232 ecr 4392067], length 131
22:43:17.207547 IP 10.1.1.13.42810 > n0.int.intra.acnet: Flags [.], ack 525, win 1424, options [nop,nop,TS val 4392069 ecr 3831232], length 0
….
Good !!
Now, “really” test your new cluster
See http://docs.ceph.com/docs/giant/rbd/libvirt/ :
First, deploy the admin part of Ceph on the destination system that will test your cluster.
On the admin node :
[cephadm@admin cluster]$ ceph-deploy --overwrite-conf admin hyp03
On a hypervisor with access to the network, of course:
First, give read permission on the admin keyring so processes can reach your cluster:
chmod +r /etc/ceph/ceph.client.admin.keyring
[root@hyp03 ~]# ceph osd pool create libvirt-pool 128 128
pool 'libvirt-pool' created
[root@hyp03 ~]# ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool'
[client.libvirt]
key = AQDsdMYVYR0IdmlkKDLKMZYUifn+lvqMH3D7Q==
Create a 16 GB image on your new cluster:
[root@hyp03 ~]# qemu-img create -f rbd rbd:libvirt-pool/new-libvirt-image 16G
Formatting 'rbd:libvirt-pool/new-libvirt-image', fmt=rbd size=17179869184 cluster_size=0
Important: Jewel enables RBD features that are not compatible with CentOS 7.3. Disable them, otherwise you won’t be able to map your RBD image (either with rbd map or through qemu-img):
rbd feature disable libvirt-pool/new-libvirt-image exclusive-lock object-map fast-diff deep-flatten
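You can verify which features remain enabled on the image (layering should be the only one left):
[root@hyp03 ~]# rbd info libvirt-pool/new-libvirt-image | grep features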
Create a secret
cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
<usage type='ceph'>
<name>client.libvirt secret</name>
</usage>
</secret>
EOF
Then issue :
[root@hyp03 ~]# sudo virsh secret-define --file secret.xml
[root@hyp03 ~]# ceph auth get-key client.libvirt | sudo tee client.libvirt.key
sudo virsh secret-set-value --secret 12390708-973c-4f6e-b0be-aba963608006 --base64 $(cat client.libvirt.key) && rm client.libvirt.key secret.xml
Replicate the secret on all hosts where you want the VM to be able to run (especially for live migration). Repeat the previous steps on each of these hosts, but with a modified secret.xml file that includes the secret UUID created during the first run on the first host.
<secret ephemeral='no' private='no'>
<uuid>12390708-973c-4f6e-b0be-aba963608006</uuid>
<usage type='ceph'>
<name>client.libvirt secret</name>
</usage>
</secret>
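For reference, the disk stanza of the VM definition ends up looking roughly like this; the monitor address, target device and secret UUID here come from my setup, so adapt them and check the guide linked below for the authoritative syntax:
<disk type='network' device='disk'>
  <source protocol='rbd' name='libvirt-pool/new-libvirt-image'>
    <host name='192.168.10.210' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='12390708-973c-4f6e-b0be-aba963608006'/>
  </auth>
  <target dev='vda' bus='virtio'/>
</disk>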
Follow the guide http://docs.ceph.com/docs/giant/rbd/libvirt/ for the VM configuration, and then:
sudo virsh secret-define --file secret.xml
[root@hyp03 ~]# virsh start dv03
Domain dv03 started
You’re done !
Crush map
For my needs, I want my cluster to stay up even if one of my hosts is down.
In my home “datacenter”, I have two “racks”, two “physical hosts”, and 6 “ceph virtual hosts”, each of them running a 1TB OSD.
How do I ensure data replication happens in such a way that no piece of data lives only on one physical host? You do that by managing your Ceph CRUSH map with rules.
First, organize your Ceph hosts in your “datacenter”.
Because my home is not really a datacenter, for this example I will call “hosts” the virtual machines running CentOS 7.3/Ceph, with one OSD per VM.
I will call “racks” the two physical hosts that run the “hosts” (VMs).
I will call “datacenter” the rack where my two physical hosts are installed.
Create the datacenter, racks, and move them into the right place
ceph osd crush add-bucket rack1 rack
ceph osd crush move n0 rack=rack1
ceph osd crush move n1 rack=rack1
ceph osd crush move n2 rack=rack1
ceph osd crush move n3 rack=rack1
ceph osd crush move rack1 root=default
ceph osd crush add-bucket rack2 rack
ceph osd crush move rack2 root=default
ceph osd crush move n4 rack=rack2
ceph osd crush move n5 rack=rack2
ceph osd crush add-bucket dc datacenter
ceph osd crush move dc root=default
ceph osd crush move rack1 datacenter=dc
ceph osd crush move rack2 datacenter=dc
Look at the results
[root@hyp03 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.54849 root default
-10 5.54849 datacenter dc
-8 3.63879 rack rack1
-2 0.90970 host n0
0 0.90970 osd.0 up 1.00000 1.00000
-3 0.90970 host n1
1 0.90970 osd.1 up 1.00000 1.00000
-4 0.90970 host n2
2 0.90970 osd.2 up 1.00000 1.00000
-5 0.90970 host n3
3 0.90970 osd.3 up 1.00000 1.00000
-9 1.90970 rack rack2
-6 0.90970 host n4
4 0.90970 osd.4 up 1.00000 1.00000
-7 1.00000 host n5
5 1.00000 osd.5 up 1.00000 1.00000
Final config in June 2017:
[cephadm@admin ~]$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.36800 root default
-10 7.36800 datacenter dc
-13 7.36800 room garage
-14 4.54799 chassis chassis1
-8 4.54799 rack rack1
-2 0.90999 host n0
0 0.90999 osd.0 up 1.00000 1.00000
-3 0.90999 host n1
1 0.90999 osd.1 up 1.00000 1.00000
-4 0.90999 host n2
2 0.90999 osd.2 up 1.00000 1.00000
-5 0.90999 host n3
3 0.90999 osd.3 up 1.00000 1.00000
-11 0.90999 host n6
6 0.90999 osd.6 up 1.00000 1.00000
-15 2.81898 chassis chassis2
-9 2.81898 rack rack2
-6 0.90999 host n4
4 0.90999 osd.4 up 1.00000 1.00000
-7 1.00000 host n5
5 1.00000 osd.5 up 1.00000 1.00000
-12 0.90999 host n7
7 0.90999 osd.7 up 1.00000 1.00000
Data replication: play with the CRUSH map
In order to manage replicas, you create CRUSH rules.
The CRUSH map is shared by the nodes and provided to clients; it ensures data replication follows your policy.
Run the following commands with an authorized admin user of your cluster
First, extract the crush map of your cluster
ceph osd getcrushmap -o crush
It’s a binary file; you have to decompile it so you can edit the rules:
crushtool -d crush -o crush.txt
Edit the crush.txt file and add the following rule at the end.
We will place 2 copies of the data on rack 1 (which has more OSDs) and the last one on rack 2.
rule 3_rep_2_racks {
ruleset 1
type replicated
min_size 2
max_size 3
step take default
step choose firstn 2 type rack
step chooseleaf firstn 2 type osd
step emit
}
Then, recompile the rules
crushtool -c crush.txt -o crushnew
The newly compiled CRUSH map is in the “crushnew” file, which you then inject into your cluster:
ceph osd setcrushmap -i crushnew
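You can check that the new rule is now known to the cluster (3_rep_2_racks should appear in the list):
ceph osd crush rule ls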
Then make sure you apply your rule to your pools (it may not be selected by default, depending on your settings).
For example, I want my VM blocks to be replicated with this rule (my VMs are stored in the libvirt-pool pool):
ceph osd pool set libvirt-pool crush_ruleset 1
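And verify that the pool now uses ruleset 1:
ceph osd pool get libvirt-pool crush_ruleset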
Execute the following command several times
ceph status
You will see the cluster rebalance your data dynamically until it’s OK (ceph health)
[cephadm@admin ~]$ ceph status
cluster cd687e36-5670-48f5-b324-22a25082bede
health HEALTH_WARN
51 pgs backfill_wait
3 pgs backfilling
13 pgs degraded
13 pgs recovery_wait
67 pgs stuck unclean
recovery 2188/32729 objects degraded (6.685%)
recovery 9450/32729 objects misplaced (28.873%)
monmap e5: 3 mons at {n0=192.168.10.210:6789/0,n4=192.168.10.214:6789/0,n8=192.168.10.218:6789/0}
election epoch 98, quorum 0,1,2 n0,n4,n8
fsmap e56: 1/1/1 up {0=n1=up:active}, 2 up:standby
osdmap e915: 8 osds: 8 up, 8 in; 54 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v3123382: 304 pgs, 10 pools, 35945 MB data, 9318 objects
107 GB used, 7429 GB / 7544 GB avail
2188/32729 objects degraded (6.685%)
9450/32729 objects misplaced (28.873%)
237 active+clean
51 active+remapped+wait_backfill
13 active+recovery_wait+degraded
3 active+remapped+backfilling
recovery io 116 MB/s, 29 objects/s
And finally :
[cephadm@admin ~]$ ceph status
cluster cd687e36-5670-48f5-b324-22a25082bede
health HEALTH_OK
monmap e5: 3 mons at {n0=192.168.10.210:6789/0,n4=192.168.10.214:6789/0,n8=192.168.10.218:6789/0}
election epoch 98, quorum 0,1,2 n0,n4,n8
fsmap e56: 1/1/1 up {0=n1=up:active}, 2 up:standby
osdmap e1010: 8 osds: 8 up, 8 in
flags sortbitwise,require_jewel_osds
pgmap v3123673: 304 pgs, 10 pools, 35945 MB data, 9318 objects
106 GB used, 7430 GB / 7544 GB avail
304 active+clean
recovery io 199 MB/s, 49 objects/s
Repeat this for all pools you want to follow this new CRUSH rule.
In my example, with the following OSD tree, you can see there is more data on the rack that has fewer drives.
[cephadm@admin ~]$ ceph osd df tree
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 7.36800 - 7544G 106G 7430G 1.41 1.00 0 root default
-10 7.36800 - 7544G 106G 7430G 1.41 1.00 0 datacenter dc
-13 7.36800 - 7544G 106G 7430G 1.41 1.00 0 room garage
-14 4.54799 - 4657G 59074M 4595G 1.24 0.88 0 chassis chassis1
-8 4.54799 - 4657G 59074M 4595G 1.24 0.88 0 rack rack1
-2 0.90999 - 931G 10986M 919G 1.15 0.82 0 host n0
0 0.90999 1.00000 931G 10986M 919G 1.15 0.82 94 osd.0
-3 0.90999 - 931G 11617M 919G 1.22 0.87 0 host n1
1 0.90999 1.00000 931G 11617M 919G 1.22 0.87 105 osd.1
-4 0.90999 - 931G 11640M 919G 1.22 0.87 0 host n2
2 0.90999 1.00000 931G 11640M 919G 1.22 0.87 114 osd.2
-5 0.90999 - 931G 12416M 918G 1.30 0.92 0 host n3
3 0.90999 1.00000 931G 12416M 918G 1.30 0.92 111 osd.3
-11 0.90999 - 931G 12413M 918G 1.30 0.92 0 host n6
6 0.90999 1.00000 931G 12413M 918G 1.30 0.92 103 osd.6
-15 2.81898 - 2887G 49665M 2835G 1.68 1.19 0 chassis chassis2
-9 2.81898 - 2887G 49665M 2835G 1.68 1.19 0 rack rack2
-6 0.90999 - 931G 17310M 913G 1.81 1.29 0 host n4
4 0.90999 1.00000 931G 17310M 913G 1.81 1.29 125 osd.4
-7 1.00000 - 1023G 16977M 1006G 1.62 1.15 0 host n5
5 1.00000 1.00000 1023G 16977M 1006G 1.62 1.15 133 osd.5
-12 0.90999 - 931G 15376M 915G 1.61 1.15 0 host n7
7 0.90999 1.00000 931G 15376M 915G 1.61 1.15 127 osd.7
TOTAL 7544G 106G 7430G 1.41
MIN/MAX VAR: 0.82/1.29 STDDEV: 0.23
Useful command :
[cephadm@admin ~]$ ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 0.90999 1.00000 931G 10985M 919G 1.15 0.82 87
1 0.90999 1.00000 931G 11615M 919G 1.22 0.87 106
2 0.90999 1.00000 931G 11639M 919G 1.22 0.87 106
3 0.90999 1.00000 931G 12414M 918G 1.30 0.93 114
6 0.90999 1.00000 931G 12412M 918G 1.30 0.93 105
4 0.90999 1.00000 931G 17044M 913G 1.79 1.27 134
5 1.00000 1.00000 1023G 16977M 1006G 1.62 1.15 130
7 0.90999 1.00000 931G 15374M 915G 1.61 1.15 130
TOTAL 7544G 105G 7430G 1.40
MIN/MAX VAR: 0.82/1.27 STDDEV: 0.22
Configuration management: Puppet
Now we have to install a configuration management tool. It saves a lot of time..
Master installation
On the admin node, we will install the master :
[root@admin ~]# sudo rpm -ivh https://yum.puppetlabs.com/puppetlabs-release-pc1-el-7.noarch.rpm
[root@admin ~]# sudo yum -y install puppetserver
[root@admin ~]# systemctl enable puppetserver
[root@admin ~]# sudo systemctl start puppetserver
Agents installation :
Use dsh from the admin node :
[root@admin ~]# dsh -aM "sudo rpm -ivh https://yum.puppetlabs.com/puppetlabs-release-pc1-el-7.noarch.rpm"
[root@admin ~]# dsh -aM "sudo yum -y install puppet-agent"
Enable the agent
[root@admin ~]# dsh -aM "systemctl enable puppet"
Configure the agents: you need to set the server name if it’s not “puppet” (the default). Use an FQDN, it’s important.
[root@admin ~]# dsh -aM "sudo /opt/puppetlabs/bin/puppet config set server admin.int.intra"
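You can check the setting was applied on every node:
[root@admin ~]# dsh -aM "sudo /opt/puppetlabs/bin/puppet config print server"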
Start the agent
[root@admin ~]# dsh -aM "systemctl start puppet"
Puppet configuration
On the admin node, check all the agents have published their certificates to the server
[root@admin ~]# sudo /opt/puppetlabs/bin/puppet cert list
"n0.int.intra" (SHA256) 95:6B:A3:07:DA:70:04:D7:9B:18:4D:64:30:39:A1:19:9E:68:B9:6B:9C:92:DC:AB:98:36:16:6D:F3:66:B3:56
"n1.int.intra" (SHA256) 07:E3:1B:1F:6F:80:33:6C:A9:A4:96:88:71:A0:74:19:B0:DE:3A:EA:B2:36:2A:38:43:B1:5D:3E:92:3C:D0:47
"n2.int.intra" (SHA256) 62:2E:7E:91:CE:75:53:0C:DA:16:28:C7:14:EA:05:33:CD:DA:8D:B8:A4:A3:59:1B:B0:78:3B:29:AE:A6:CB:C4
"n3.int.intra" (SHA256) 77:92:0F:75:2F:75:E2:8F:68:22:4A:43:4C:BB:79:C5:24:6D:BB:98:42:D0:87:A5:13:57:52:9C:3D:82:D8:74
"n4.int.intra" (SHA256) 55:F4:15:F3:83:3A:39:99:B6:15:EC:D6:09:24:6D:6D:D2:07:9B:54:F5:73:15:C5:C8:74:9F:8F:BB:A0:E2:43
Sign the certificates
[root@admin ~]# for i in {0..4}; do /opt/puppetlabs/bin/puppet cert sign n$i.int.intra ; done
Finished! You can check that all the nodes now have a valid certificate:
[root@admin ~]# sudo /opt/puppetlabs/bin/puppet cert list --all
+ "admin.int.intra" (SHA256) F5:13:EE:E9:C2:F1:A7:86:01:3C:95:EE:61:EE:53:21:E9:75:15:24:45:FB:67:B8:D9:60:60:FE:DE:93:59:F6 (alt names: "DNS:puppet", "DNS:admin.int.intra")
+ "n0.int.intra" (SHA256) 9D:C0:3E:AB:FD:67:00:DB:B5:25:CD:23:71:A4:2F:C5:3F:A6:56:FE:55:CA:5D:27:95:C6:97:79:A9:B2:7F:CB
+ "n1.int.intra" (SHA256) 4F:C6:C1:B9:CD:21:4C:3A:76:B5:CF:E4:56:0D:20:D2:1D:72:35:7B:D9:53:86:D9:CD:CB:8D:3C:E8:39:F4:C2
+ "n2.int.intra" (SHA256) D7:6E:85:63:04:CC:C6:24:79:E3:C2:CE:F2:0F:5B:2E:FA:EE:D9:EF:9C:E3:46:6A:83:9F:AA:DA:5D:3F:F8:52
+ "n3.int.intra" (SHA256) 1C:95:61:C8:F6:E2:AF:4F:A5:52:B3:E0:CE:87:CF:16:02:2B:39:2C:61:EC:20:21:D0:BD:33:70:42:7A:6E:D9
+ "n4.int.intra" (SHA256) E7:B6:4B:1B:0A:22:F8:C4:F1:E5:A9:3B:EA:17:5F:54:41:97:68:AF:D0:EC:A6:DB:74:3E:F9:7E:BF:04:16:FF
You now have a working Puppet configuration management system.
Monitoring
Telegraf
Install Telegraf on the nodes, with a puppet manifest.
vi /etc/puppetlabs/code/environments/production/manifests/site.pp
Include this text in the site.pp file:
node 'n0', 'n1', 'n2', 'n3', 'n4' {
file {'/etc/yum.repos.d/influxdb.repo':
ensure => present, # make sure it exists
mode => '0644', # file permissions
content => "[influxdb]\nname = InfluxDB Repository - RHEL \$releasever\nbaseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable\nenabled = 1\ngpgcheck = 1\ngpgkey = https://repos.influxdata.com/influxdb.key\n",
}
}
Install it on all nodes (we could do that with puppet, too):
dsh -aM "sudo yum install telegraf"
Create a puppet module for telegraf
[root@admin modules]# cd /etc/puppetlabs/code/modules
[root@admin modules]# mkdir -p telegraf_client/{files,manifests,templates}
Create a template for telegraf.conf
[root@admin telegraf_client]# vi templates/telegraf.conf.template
Put the following in that file (note the fqdn variable):
[tags]
# Configuration for telegraf agent
[agent]
debug = false
flush_buffer_when_full = true
flush_interval = "15s"
flush_jitter = "0s"
hostname = "<%= fqdn %>"
interval = "15s"
round_interval = true
[root@admin telegraf_client]# vi templates/inputs_system.conf.template
Put the following (no variables yet; customize it for your needs):
# Read metrics about CPU usage
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = [ "usage*" ]
# Read metrics about disk usage
[[inputs.disk]]
fielddrop = [ "inodes*" ]
mount_points=["/","/home"]
# Read metrics about diskio usage
[[inputs.diskio]]
devices = ["sda2","sda3"]
skip_serial_number = true
# Read metrics about network usage
[[inputs.net]]
interfaces = [ "eth0" ]
fielddrop = [ "icmp*", "ip*", "tcp*", "udp*" ]
# Read metrics about memory usage
[[inputs.mem]]
# no configuration
# Read metrics about swap memory usage
[[inputs.swap]]
# no configuration
# Read metrics about system load & uptime
[[inputs.system]]
# no configuration
Create a template for the outputs :
[root@admin telegraf_client]# vi templates/outputs.conf.template
and put the following text in the file
[[outputs.influxdb]]
database = "telegraf"
precision = "s"
urls = [ "http://admin:8086" ]
username = "telegraf"
password = "your_pass"
Create the manifest for your module:
[root@admin ~]# vi /etc/puppetlabs/code/modules/telegraf_client/manifests/init.pp
and add the following contents :
class telegraf_client {
package { 'telegraf':
ensure => installed,
}
file { "/etc/telegraf/telegraf.conf":
ensure => present,
owner => root,
group => root,
mode => "644",
content => template("telegraf_client/telegraf.conf.template"),
}
file { "/etc/telegraf/telegraf.d/outputs.conf":
ensure => present,
owner => root,
group => root,
mode => "644",
content => template("telegraf_client/outputs.conf.template"),
}
file { "/etc/telegraf/telegraf.d/inputs_system.conf":
ensure => present,
owner => root,
group => root,
mode => "644",
content => template("telegraf_client/inputs_system.conf.template"),
}
service { 'telegraf':
ensure => running,
enable => true,
}
}
And finally, include the module in the global puppet manifest file. Here is mine :
[root@admin ~]# vi /etc/puppetlabs/code/environments/production/manifests/site.pp
(its content is:)
node default {
case $facts['os']['name'] {
'Solaris': { include solaris }
'RedHat', 'CentOS': { include centos }
/^(Debian|Ubuntu)$/: { include debian }
default: { include generic }
}
}
node 'n0','n1','n2','n3','n4' {
include cephnode
}
class cephnode {
include telegraf_client
}
class centos {
yumrepo { "CentOS-OS-Local":
baseurl => "http://nas4/centos/\$releasever/os/\$basearch",
descr => "Centos int.intra mirror (os)",
enabled => 1,
gpgcheck => 0,
priority => 1
}
yumrepo { "CentOS-Updates-Local":
baseurl => "http://nas4/centos/\$releasever/updates/\$basearch",
descr => "Centos int.intra mirror (updates)",
enabled => 1,
gpgcheck => 0,
priority => 1
}
yumrepo { "InfluxDB":
baseurl => "https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable",
descr => "InfluxDB Repository - RHEL \$releasever",
enabled => 1,
gpgcheck => 1,
gpgkey => "https://repos.influxdata.com/influxdb.key"
}
}
Wait a few minutes for Puppet to apply your work on the nodes, or run:
[root@admin ~]# dsh -aM "/opt/puppetlabs/bin/puppet agent --test"
Check that Telegraf is up and running, and check the measurements in InfluxDB.
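A quick way to check, assuming the influx CLI is available on the admin node (hostnames and the database name follow the configuration above):
[root@admin ~]# dsh -aM "systemctl is-active telegraf"
[root@admin ~]# influx -database telegraf -execute 'SHOW MEASUREMENTS'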
