September 15, 2015

In general, for monitoring standard MySQL replication, it is common practice to check the Seconds_Behind_Master variable. There are cases, however, where Seconds_Behind_Master reports a low number even though replication is in fact broken. For a more accurate check, Percona has written a program called pt-heartbeat, which continuously inserts timestamped data into a single row of a single table on the master and verifies that data on the slave. More information about the program itself can be found in the guide here. pt-heartbeat is compatible with MySQL Community Edition and its forks, including MariaDB and Percona Server.

In this post, I will explain how you can configure this using Puppet and Nagios on a CentOS 6 system (it should be noted that Zabbix can be used for this as well).

One critical prerequisite for setting up pt-heartbeat monitoring is that both the master and the slave(s) have their clocks properly synchronized with NTP; an out-of-sync system clock will skew the reported delay. If you're using Puppet, an easy way of managing this is the puppetlabs/ntp Forge module.

This guide also uses Nagios and NRPE with configurations exported through PuppetDB in the example code. This is not required in order to monitor pt-heartbeat, however; you would simply leave the "@@nagios_service" resources out of your Puppet manifests. Finally, this guide assumes that you are using the puppetlabs/mysql Forge module to manage MySQL on the server. It is possible to use this example with a manually configured MySQL server, but you will then need to add the user account and database by hand.

For this example, I have placed all Puppet files and manifests for pt-heartbeat under a "percona" module. Note: for this example, init.pp is not actually in use and is empty. Below is the directory tree:

```
percona
├── files
│   ├── heartbeat_master_cfg
│   ├── nrpe_pt_heartbeat_proc
│   └── pt-heartbeat_init
├── manifests
│   ├── heartbeat
│   │   ├── master.pp
│   │   └── slave.pp
│   ├── heartbeat.pp
│   ├── init.pp
│   ├── params.pp
│   └── repo.pp
└── templates
    ├── heartbeat_setup.sql.erb
    └── nrpe_check_mysql_repl.erb
```

Before setting up anything, you will need Percona's Yum repository installed on the master and on the slaves. In this guide, it is managed by the percona::repo manifest:

```
class percona::repo {
  yumrepo { 'percona':
    baseurl    => "http://repo.percona.com/centos/${::operatingsystemmajrelease}/os/x86_64/",
    mirrorlist => absent,
    descr      => 'Percona',
    gpgcheck   => 0,
  }
}
```

The sole purpose of the percona::heartbeat manifest is to install the required packages: in this case, the Percona Toolkit, which contains the pt-heartbeat program, and Percona's Nagios plugins:

```
class percona::heartbeat {
  include percona::repo

  package { [ 'percona-toolkit', 'percona-nagios-plugins' ]:
    ensure  => present,
    require => Yumrepo['percona'],
  }
}
```

Now create a percona::params class that reads the required parameters in from Hiera:

```
class percona::params {
  $master_server_id   = hiera('master_server_id', undef)
  $heartbeat_mysql_pw = hiera('heartbeat_mysql_pw', undef)
  $mysql_repl_pw      = hiera('mysql_repl_pw', undef)
  $server_id          = hiera('mysql_server_id', undef)
}
```
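For reference, a hypothetical hieradata snippet backing these lookups might look like the following. The key names match the hiera() calls above, while the file path and values are placeholders; in practice you would encrypt the passwords, as discussed below:

```
# hieradata/nodes/db1.example.com.yaml (hypothetical path and values)
master_server_id: 1
mysql_server_id: 1
heartbeat_mysql_pw: 'changeme-heartbeat'
mysql_repl_pw: 'changeme-repl'
```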
Before continuing, I will explain the purpose of these parameters. The $heartbeat_mysql_pw parameter is the password of the heartbeat account that will be used to insert rows into and query the heartbeat database and table.

The $mysql_repl_pw parameter is the password of the account that you use for replication between your master and slave. I chose this account because the pmp-check-mysql-replication-running Nagios plugin requires either the SUPER or REPLICATION CLIENT privilege, and this account has the latter. You can also use the root account for this if you prefer, but I personally try to minimize usage of the root account. As recommended in previous posts, you should encrypt the passwords you store in Hiera with hiera-eyaml.

The $server_id parameter is the server_id of the master running the pt-heartbeat daemon, while $master_server_id is the server_id of the master that the Nagios plugin will check against on the slave. Why are these not one and the same? Because there are a number of scenarios where a master may also be a slave whose replication lag you want to check. If you have just a single master in your topology, you can probably combine these two parameters.

Now that the params class has been created, we can proceed to the classes for the master and slave configurations. The class percona::heartbeat::master contains all of the resources required to configure pt-heartbeat on your master:

```
class percona::heartbeat::master (
  $heartbeat_mysql_pw = $percona::params::heartbeat_mysql_pw,
  $server_id          = $percona::params::server_id,
) inherits percona::params {

  include percona::heartbeat

  file { '/usr/local/etc/heartbeat_setup.sql':
    ensure  => file,
    content => template('percona/heartbeat_setup.sql.erb'),
  }

  ::mysql::db { 'heartbeat':
    ensure   => present,
    user     => 'heartbeat',
    password => $heartbeat_mysql_pw,
    host     => 'localhost',
    grant    => 'SELECT',
    sql      => '/usr/local/etc/heartbeat_setup.sql',
    require  => File['/usr/local/etc/heartbeat_setup.sql'],
  }

  file { '/etc/pt-heartbeat':
    ensure  => file,
    source  => 'puppet:///modules/percona/heartbeat_master_cfg',
    owner   => 'root',
    group   => 'root',
    mode    => '0600',
    require => Mysql::Db['heartbeat'],
    notify  => Service['pt-heartbeat'],
  }

  file { '/etc/init.d/pt-heartbeat':
    ensure  => file,
    mode    => '0755',
    owner   => 'root',
    group   => 'root',
    source  => 'puppet:///modules/percona/pt-heartbeat_init',
    require => Package['percona-toolkit'],
  }

  service { 'pt-heartbeat':
    ensure  => running,
    enable  => true,
    require => File['/etc/init.d/pt-heartbeat'],
  }

  file { '/etc/nrpe.d/check_pt_heartbeat_proc.cfg':
    ensure  => file,
    source  => 'puppet:///modules/percona/nrpe_pt_heartbeat_proc',
    owner   => 'root',
    group   => 'nrpe',
    mode    => '0640',
    require => File['/etc/nrpe.d'],
    notify  => Service['nrpe'],
  }

  @@nagios_service { "check_pt_heartbeat_proc_${::hostname}":
    check_command       => 'check_nrpe!check_pt_heartbeat_proc',
    use                 => 'generic-service',
    host_name           => $::fqdn,
    notification_period => '24x7',
    service_description => 'pt-heartbeat Process',
    max_check_attempts  => 3,
  }
}
```

First, the class reads in the necessary parameters. Second, it pushes out the SQL file that creates the database and table required for pt-heartbeat.
In this example, the template for this is located at templates/heartbeat_setup.sql.erb, with the below content:

```
DROP TABLE IF EXISTS `heartbeat`;

CREATE TABLE `heartbeat` (
  `ts`                    varchar(26) NOT NULL,
  `server_id`             int(10) unsigned NOT NULL,
  `file`                  varchar(255) DEFAULT NULL,
  `position`              bigint(20) unsigned DEFAULT NULL,
  `relay_master_log_file` varchar(255) DEFAULT NULL,
  `exec_master_log_pos`   bigint(20) unsigned DEFAULT NULL,
  PRIMARY KEY (`server_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

LOCK TABLES `heartbeat` WRITE;
INSERT INTO `heartbeat` (ts, server_id) VALUES (NOW(), <%= @server_id %>);
UNLOCK TABLES;
```

Once this SQL file is available on the master, the mysql::db provider can create the heartbeat database and user and execute the SQL. Following this, the manifest pushes out the configuration file used by the pt-heartbeat script itself. For my example, this will be located at /etc/pt-heartbeat on the master and will contain the following content:

```
update
socket=/var/lib/mysql/mysql.sock
database=heartbeat
table=heartbeat
pid=/var/run/pt-heartbeat.pid
```

You can also specify these settings as command-line options when running pt-heartbeat. I prefer, however, to keep them in a separate file when using an init script.

To start pt-heartbeat on boot, you could have it launch from /etc/rc.local or anacron. However, the proper way of doing this is to run it from an init script, as in this example. Below is my simple init script (parts of this script were borrowed from the article here):

```
#!/bin/bash
#
# chkconfig: 2345 99 01
# description: pt-heartbeat server init script
#

# Get functions from the functions library
. /etc/init.d/functions

# Start the pt-heartbeat service
start() {
    if [ ! -f '/etc/pt-heartbeat' ] ; then
        echo "Configuration file not found. Exiting ..."
        exit 1
    fi
    echo -n "Starting pt-heartbeat service: "
    pt-heartbeat --config /etc/pt-heartbeat --defaults-file=/root/.my.cnf --daemonize
    ### Create the lock file ###
    touch /var/lock/subsys/pt-heartbeat
    success $"pt-heartbeat service startup"
    echo
}

# Stop the pt-heartbeat service
stop() {
    echo -n "Stopping pt-heartbeat service: "
    kill `cat /var/run/pt-heartbeat.pid`
    ### Now, delete the lock and pid files ###
    rm -f /var/run/pt-heartbeat.pid
    rm -f /var/lock/subsys/pt-heartbeat
    success $"pt-heartbeat service shutdown"
    echo
}

### main logic ###
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        status -p /var/run/pt-heartbeat.pid -l pt-heartbeat pt-heartbeat
        ;;
  restart)
        stop
        start
        ;;
  *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac

exit 0
```

Note that this example requires the root credentials to be stored in /root/.my.cnf, which here is managed by the puppetlabs/mysql Forge module.

Optionally, you can have NRPE and Nagios periodically check for the pt-heartbeat process. Note, however, that the slave-side checks will also complain if the daemon isn't running. Below is the NRPE check, located in /etc/nrpe.d/check_pt_heartbeat_proc.cfg:

```
command[check_pt_heartbeat_proc]=/usr/lib64/nagios/plugins/check_procs -c 1:1 -a '/usr/bin/pt-heartbeat'
```
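If you want to confirm the NRPE wiring by hand before Nagios starts polling, you can run the same checks directly. This is only a quick sketch: the hostname below is a placeholder, and the check_nrpe path assumes the stock nagios-plugins-nrpe package location on the Nagios server:

```
# On the master: run the exact command NRPE will execute
/usr/lib64/nagios/plugins/check_procs -c 1:1 -a '/usr/bin/pt-heartbeat'

# From the Nagios server: exercise the full NRPE path end to end
/usr/lib64/nagios/plugins/check_nrpe -H db-master.example.com -c check_pt_heartbeat_proc
```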
Once you have created the above files, templates, and manifests, include the percona::heartbeat::master class in your master's catalog and trigger a Puppet run on it. If everything was created successfully and the pt-heartbeat service is running, you should be able to connect to MySQL and run the query below. The timestamp in the "ts" column should be constantly updating:

```
mysql> SELECT ts FROM heartbeat.heartbeat;
+----------------------------+
| ts                         |
+----------------------------+
| 2015-09-15T09:03:00.003470 |
+----------------------------+
1 row in set (0.01 sec)
```

Now we can proceed to setting up the checks on the slave. First, create the class containing all of the resources needed to configure the slave:

```
class percona::heartbeat::slave (
  $heartbeat_mysql_pw = $percona::params::heartbeat_mysql_pw,
  $mysql_repl_pw      = $percona::params::mysql_repl_pw,
  $master_server_id   = $percona::params::master_server_id,
) inherits percona::params {

  include percona::heartbeat

  file { '/etc/nrpe.d/check_mysql_repl.cfg':
    ensure  => file,
    content => template('percona/nrpe_check_mysql_repl.erb'),
    owner   => 'root',
    group   => 'nrpe',
    mode    => '0640',
    require => File['/etc/nrpe.d'],
    notify  => Service['nrpe'],
  }

  @@nagios_service { "check_mysql_repl_delay_${::hostname}":
    check_command       => 'check_nrpe!check_mysql_repl_delay',
    use                 => 'generic-service',
    host_name           => $::fqdn,
    notification_period => '24x7',
    service_description => 'MySQL Replication Delay',
    max_check_attempts  => 3,
  }

  @@nagios_service { "check_mysql_repl_running_${::hostname}":
    check_command       => 'check_nrpe!check_mysql_repl_running',
    use                 => 'generic-service',
    host_name           => $::fqdn,
    notification_period => '24x7',
    service_description => 'MySQL Replication Running',
    max_check_attempts  => 3,
  }
}
```

Unlike the class for the master, only a single file is needed to configure the slave, /etc/nrpe.d/check_mysql_repl.cfg. The template for this contains the below content:

```
command[check_mysql_repl_delay]=/usr/lib64/nagios/plugins/pmp-check-mysql-replication-delay -H localhost -l heartbeat -p <%= @heartbeat_mysql_pw %> -T heartbeat.heartbeat -s <%= @master_server_id %>
command[check_mysql_repl_running]=/usr/lib64/nagios/plugins/pmp-check-mysql-replication-running -H localhost -l repl -p <%= @mysql_repl_pw %>
```

The first NRPE command, check_mysql_repl_delay, executes Percona's replication delay plugin. It compares the timestamp in the heartbeat table to the system time and raises a warning or a critical alert depending on how many seconds behind the slave is. You can optionally specify the warning and critical thresholds with the -w and -c options, respectively; the defaults are 300 seconds for the warning threshold and 600 seconds for the critical threshold. The second command, check_mysql_repl_running, checks whether replication is running and, if not, alerts Nagios.

One additional note: if you are running MySQL 5.6 or later, pmp-check-mysql-replication-delay will complain if you specify the password in the NRPE command; in Nagios, it will display the message "Warning: Using a password on the command line interface can be insecure", even if the check itself is green.
The solution is to create a file located at /etc/nagios/my.cnf that contains the user name and password of the heartbeat account, and to point the plugin at it with --defaults-file:

```
# /etc/nagios/my.cnf file resource
file { '/etc/nagios/my.cnf':
  ensure  => file,
  content => template('percona/nagios_my_cnf.erb'),
  owner   => 'root',
  group   => 'nrpe',
  mode    => '0640',
  require => Package['nrpe'],
}
```

```
# templates/nagios_my_cnf.erb
[client]
user=heartbeat
host=localhost
password='<%= @heartbeat_mysql_pw %>'
socket=/var/lib/mysql/mysql.sock
```

```
# check_mysql_repl_delay command
command[check_mysql_repl_delay]=/usr/lib64/nagios/plugins/pmp-check-mysql-replication-delay --defaults-file /etc/nagios/my.cnf -T heartbeat.heartbeat -s <%= @master_server_id %>
```

Include the percona::heartbeat::slave class in your slave's catalog and trigger a Puppet run on it. If you're using exported resources to manage your Nagios configuration, the two Nagios checks should be configured automatically on your Nagios server. MySQL replication is now being monitored more accurately with pt-heartbeat and Nagios.
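As a closing aside, if the Nagios server itself is managed by Puppet, the exported nagios_service resources can be collected there. The following is only a minimal sketch of that collection; the target path and the 'nagios' service name are assumptions for a typical setup, not part of the module above:

```
# Collect every exported nagios_service and reload Nagios when the set changes
Nagios_service <<| |>> {
  target => '/etc/nagios/conf.d/puppet_services.cfg',  # assumed include directory
  notify => Service['nagios'],                         # assumed service name
}
```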