Monitoring a Ceph cluster and other things with InfluxDB, Grafana, collectd & Telegraf

 

Now that we have a working Ceph cluster (cf. Ceph Cluster), you will certainly want to monitor it.

Here is another cool open source suite of software :)

I am using Debian 8.6 Jessie for this memo.

This post was written with the help of several web pages, including https://www.guillaume-leduc.fr/monitoring-de-votre-serveur-avec-telegraf-influxdb-et-grafana.html

Thanks Guillaume ;)

InfluxDB

I have a virtualized admin node in my home “datacenter”. This node will be used to collect and graph these stats.

InfluxDB is a great time series database. I will deploy it for my needs (monitoring all my systems, starting with my Ceph cluster).

First add the InfluxDB repository:

cephadm@admin:~$ curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -

root@admin:~# echo "deb https://repos.influxdata.com/debian jessie stable" > /etc/apt/sources.list.d/influxdb.list

cephadm@admin:~$ sudo apt-get update
cephadm@admin:~$ sudo apt-get install influxdb
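Installation done. If apt-get update rejects the https repository, install the apt-transport-https package and run it again.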

Now configure the databases. For my needs, I will create two databases that will become two data sources in Grafana (collectd and telegraf). collectd and Telegraf are two well-known agents that collect statistics from hosts. collectd is useful for hosts such as routers (I own an OpenWrt internet router, and it is very handy for monitoring internet bandwidth).

For Ceph and the nodes we will use Telegraf.

For other specialized things we will use collectd.

Enable and start InfluxDB, then open the influx shell:

root@admin:~# systemctl enable influxdb
root@admin:~# systemctl start influxdb

root@admin:~# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 1.0.2
InfluxDB shell version: 1.0.2
>
Create the databases :

> CREATE DATABASE telegraf
> CREATE DATABASE collectd_db

Check the databases :

> SHOW DATABASES;

name: databases

name
telegraf
_internal
collectd_db
Create a dedicated user for the monitoring activities (here named “telegraf”):

> CREATE USER telegraf WITH PASSWORD 'pass'
> GRANT ALL ON telegraf TO telegraf
> GRANT ALL ON collectd_db TO telegraf
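The same statements can also be run without opening the interactive shell, which is handy if the admin node ever has to be rebuilt. This is only a sketch: it assumes the influx client accepts the -execute option (as the 1.x CLI does), the statements function is a hypothetical helper name, and 'pass' is the placeholder password used above.

```shell
#!/bin/sh
# Hypothetical helper: print the InfluxQL statements used above, one per line.
statements() {
  cat <<'EOF'
CREATE DATABASE telegraf
CREATE DATABASE collectd_db
CREATE USER telegraf WITH PASSWORD 'pass'
GRANT ALL ON telegraf TO telegraf
GRANT ALL ON collectd_db TO telegraf
EOF
}

# Feed each statement to the influx CLI; skipped when the CLI is not installed.
if command -v influx >/dev/null 2>&1; then
  statements | while IFS= read -r stmt; do
    # stdin is redirected so the CLI cannot swallow the remaining statements.
    influx -execute "$stmt" </dev/null || echo "statement failed: $stmt" >&2
  done
fi
```

Keeping the statements in one function makes it easy to review them before anything is sent to the server.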

Look at the retention policies for InfluxDB if you want the database to purge old data automatically. They are not covered here.

Configure InfluxDB to receive the collectd data on UDP port 25826 (the default). In /etc/influxdb/influxdb.conf, insert this configuration:

[[collectd]]
enabled = true
bind-address = ":25826"
database = "collectd_db"
typesdb = "/usr/share/collectd/types.db"






Then install collectd on the admin node to get /usr/share/collectd/types.db (you could just copy the file from an agent, but you may want to monitor your admin node too ;) ).

Telegraf

On the admin node, use dsh to install Telegraf on the nodes. First install curl:

cephadm@admin:~$ dsh -aM sudo apt-get install -y curl

Then, on all nodes, add the InfluxData key and repository:

cephadm@admin:~$ dsh -aM "sudo curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -"

cephadm@admin:~$ dsh -aM "echo 'deb https://repos.influxdata.com/debian jessie stable' | sudo tee /etc/apt/sources.list.d/influxdb.list"
Then install telegraf, for example from the admin node using dsh :

cephadm@admin:~$ dsh -aM sudo apt-get update


cephadm@admin:~$ dsh -aM sudo apt-get install -y telegraf

Configure the nodes: in /etc/telegraf/telegraf.conf, keep only the following, for example for my node 1 (n1):

[tags]

# Configuration for telegraf agent
[agent]
debug = false
flush_buffer_when_full = true
flush_interval = "15s"
flush_jitter = "0s"
hostname = "n1"
interval = "15s"
round_interval = true

hostname: set it to the name of the node being configured, as it should appear in InfluxDB.
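Since only the hostname changes from one node to the next, the file can be generated instead of edited by hand. The sketch below is an illustration, not part of the original setup: render_agent_conf is a hypothetical helper, and it assumes that uname -n returns the short node name you want to see in InfluxDB.

```shell
#!/bin/sh
# Hypothetical helper: print the agent configuration shown above for one node.
render_agent_conf() {
  node_name="$1"
  cat <<EOF
[tags]

# Configuration for telegraf agent
[agent]
  debug = false
  flush_buffer_when_full = true
  flush_interval = "15s"
  flush_jitter = "0s"
  hostname = "${node_name}"
  interval = "15s"
  round_interval = true
EOF
}

# Preview the file for the local machine, using its network node name.
render_agent_conf "$(uname -n)"

# To install it on the node itself (not run here):
#   render_agent_conf "$(uname -n)" | sudo tee /etc/telegraf/telegraf.conf
```

Printing to standard output first lets you check the result before overwriting the real file.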

Then configure the outputs (to the central InfluxDB server). In /etc/telegraf/telegraf.d/outputs.conf:

[[outputs.influxdb]]
database = "telegraf"
precision = "s"
urls = [ "http://admin:8086" ]
username = "telegraf"
password = "pass"






And configure the inputs, in /etc/telegraf/telegraf.d/inputs_system.conf:

# Read metrics about CPU usage
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = [ "usage*" ]

# Read metrics about disk usage
[[inputs.disk]]
fielddrop = [ "inodes*" ]
mount_points = ["/", "/home"]

# Read metrics about diskio usage
[[inputs.diskio]]
devices = ["sda2", "sda3"]
skip_serial_number = true

# Read metrics about network usage
[[inputs.net]]
interfaces = [ "eth0" ]
fielddrop = [ "icmp*", "ip*", "tcp*", "udp*" ]

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
# no configuration

# Read metrics about system load & uptime
[[inputs.system]]
# no configuration
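Before starting the service, it can help to check that Telegraf parses these files and that the inputs return data. The sketch below assumes the -config, -config-directory and -test options of the telegraf binary; telegraf_dry_run is a hypothetical helper name, and the default paths are the ones used in this memo.

```shell
#!/bin/sh
# Hypothetical helper: gather every configured input once and print the
# metrics to stdout instead of sending them to InfluxDB.
telegraf_dry_run() {
  conf="${1:-/etc/telegraf/telegraf.conf}"
  conf_dir="${2:-/etc/telegraf/telegraf.d}"
  if ! command -v telegraf >/dev/null 2>&1; then
    echo "telegraf is not installed on this host" >&2
    return 127
  fi
  telegraf -config "$conf" -config-directory "$conf_dir" -test
}

# Example (run on a node, not here):
#   telegraf_dry_run
```

A syntax error in one of the files should show up here as an error message rather than as a service that silently fails to start.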






Enable and start the telegraf service on all hosts:

cephadm@admin:~$ dsh -aM sudo systemctl enable telegraf

cephadm@admin:~$ dsh -aM sudo systemctl start telegraf

Check the data in InfluxDB :
cephadm@admin:~$ influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 1.0.2
InfluxDB shell version: 1.0.2
> use telegraf
Using database telegraf
> show measurements;
name: measurements
------------------
name
cpu
disk
diskio
kernel
mem
net
processes
swap
system
It worked !
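To go one step further than listing measurements, you can read back a few values. The sketch below is an assumption-laden example: cpu_idle_query is a hypothetical helper, and it assumes the cpu measurement carries a usage_idle field and a host tag, as Telegraf's cpu input normally provides, and that the influx client accepts -database and -execute.

```shell
#!/bin/sh
# Hypothetical helper: build an InfluxQL query for the mean idle CPU
# percentage per host over the last N minutes (default 5).
cpu_idle_query() {
  minutes="${1:-5}"
  printf 'SELECT mean("usage_idle") FROM "cpu" WHERE time > now() - %sm GROUP BY "host"\n' "$minutes"
}

# Run the query against the telegraf database; skipped without the influx CLI.
if command -v influx >/dev/null 2>&1; then
  influx -database telegraf -execute "$(cpu_idle_query 5)" || echo "query failed" >&2
fi
```

If every node shows up with a value, the whole chain from agent to database is working.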

Collectd

If you have other hosts, such as an OpenWrt router, that you want to monitor, install collectd on them:

opkg update

opkg install collectd collectd-mod-network

Configure collectd to send its data to the admin node by inserting this text in /etc/collectd.conf:

## CollectD Servers
LoadPlugin network
<Plugin network>
Server "admin.int.intra" "25826"
</Plugin>

Also set the hostname, the polling interval and any other options you need, including a user and password if required.

Restart collectd with this configuration, then check the database:

cephadm@admin:~$ influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 1.0.2
InfluxDB shell version: 1.0.2
> use collectd_db
Using database collectd_db
> show measurements;
name: measurements
------------------
name
conntrack_value
cpu_value
df_free
df_used
disk_read
disk_write
interface_rx
interface_tx
iwinfo
iwinfo_value
load_longterm
load_midterm
load_shortterm
memory_value
netlink_rx
netlink_tx
netlink_value
processes_majflt
processes_minflt
processes_processes
processes_syst
processes_threads
processes_user
processes_value
tcpconns_value
wireless_value
It worked ;)

Grafana

To do later.

Here are the results: