Introduction: Prometheus

July 10, 2017, by kex

What is it:

Prometheus is a monitoring and alerting system designed for large environments and built to scale. It is released under the Apache 2.0 license, which means it's free to use in most environments, and its entire source code is available on GitHub.

Prometheus was created by the team at SoundCloud. Its model is very similar to Google's internal project 'Borgmon', a pull-based monitoring solution designed with scale in mind. Borgmon has until recently been shrouded in secrecy, which is unsurprising considering it's only accessible internally at Google; Prometheus is the most similar solution that is publicly available.

How does it work:

Prometheus uses a client-server architecture where the server 'scrapes' data from its clients. Each client runs a tiny service that serves various stats over an HTTP endpoint; the server then 'scrapes' these endpoints at regular intervals and logs each scrape in its database. The results of these scrapes are stored so that analysis can be done on the relationship between collected values over time.

The server application 'Prometheus' is essentially a time-series database built from the results of these scrapes; the actual 'monitoring' is performed by running a query language (PromQL) against the data sets in that database. For example, you could write a query to check the average CPU usage of a node over the last 15 minutes, alerting you when it's hitting its limits, or create a graph of performance over time to analyse key times/areas of activity. I've drawn a simple diagram below that will hopefully provide some clarity (the blue lines represent HTTP requests):

Example Prometheus architecture
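
To make the example above concrete, here's a rough sketch of that kind of PromQL query. It assumes the node exporter's CPU counter (node_cpu in the exporter version used later in this guide) and is illustrative rather than something we've configured yet:

# Average CPU 'busy' percentage per instance over the last 15 minutes
100 - (avg by (instance) (rate(node_cpu{mode="idle"}[15m])) * 100)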

As you'll probably have noticed, in this diagram Prometheus feeds into Grafana. This is because, whilst graphing can be done by Prometheus, it's extremely limited and needs a tool such as Grafana to provide decent usability. Grafana also includes far better methods of controlling access; the Prometheus web UI has no built-in security at all. The two tools integrate naturally: Grafana lets you select Prometheus as a data source type with no extra configuration, and the Prometheus documentation itself advises using the pair together. Prometheus queries can be run directly within Grafana, so you can use it without ever having to go to the actual Prometheus server interface.

Installation/Configuration:

In this walkthrough I'll be installing Prometheus from a binary. Alternatively you can run it in a container or compile it from source, but the aim of this tutorial is to understand how Prometheus works and how to do some basic configuration of the service stack, so to maintain focus on that goal we will use the pre-compiled binary. The machines used are two virtual machines as follows:

Hostname: webserver
OS: CentOS 7.3.1611
IP: 192.168.1.181
SELinux: Permissive

Hostname: monitoring
OS: CentOS 7.3.1611
IP: 192.168.1.180
SELinux: Permissive

To start with we can install the Prometheus server on ‘monitoring’, so we need to download the tarball:

wget https://github.com/prometheus/prometheus/releases/download/v2.0.0-alpha.3/prometheus-2.0.0-alpha.3.linux-amd64.tar.gz

Untar it:

tar xvf prometheus-2.0.0-alpha.3.linux-amd64.tar.gz

Now we have a bunch of files; most importantly we have the 'prometheus' executable and the config file 'prometheus.yml'. All configuration is done within this prometheus.yml file, and it's relatively simple to get a basic setup working. Open it with your favourite text editor and configure it as follows:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webserver'
    static_configs:
      - targets: ['192.168.1.181:9100']
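
Before going any further it's worth knowing that the release tarball should also include a small promtool binary that can check this file for syntax errors; depending on the Prometheus version the subcommand is either check config or check-config:

./promtool check config prometheus.yml     # 2.x releases
./promtool check-config prometheus.yml     # older 1.x releases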

As you can probably tell, we created a new job called 'webserver' that targets our webserver on port 9100, and we set the scrape interval to 15 seconds, which means the server will scrape its targets every 15 seconds. We should test that the server will actually start (so if we run into problems later on we can at least verify it worked at some point). From the Prometheus directory run:

./prometheus

You should be greeted with some output; if it says 'Listening on <IP Address>' then we know the service is working. To test it further we can try running:

./prometheus &

This will make Prometheus run in the background. Open a web browser and browse to the monitoring server on port 9090, in our case http://192.168.1.180:9090. If you can't reach the page it might be because you have a firewall running; by default CentOS uses firewalld, which you can stop temporarily by running systemctl stop firewalld and trying again. You should be presented with the Prometheus dashboard like so:

The Prometheus server interface
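
As an aside to the firewall point above, rather than stopping firewalld entirely you can open just the ports used in this guide (9090 for the Prometheus UI and 3000 for Grafana on 'monitoring', 9100 for the node exporter on 'webserver'). A sketch, run on whichever machine hosts the service:

firewall-cmd --permanent --add-port=9090/tcp
firewall-cmd --permanent --add-port=3000/tcp
firewall-cmd --permanent --add-port=9100/tcp
firewall-cmd --reload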

This interface is the primary Prometheus server interface; it's where you can run queries and get back result sets, and the best place to test your queries as it updates very quickly. In the configuration file we set up earlier there is a job called 'prometheus' that scrapes metrics from the local machine, which means Prometheus is already feeding itself stats about itself. We can use this to test some basic queries.

Clicking the box labelled 'insert metric at cursor' will show you all the metrics that Prometheus is currently aware of; click on any one and it will be inserted into the text box. For the sake of this tutorial we will use the http_requests_total metric, so find it and click on it:

Selecting a metric

After selecting the metric hit ‘Execute’, you should then get back a bunch of results like so:

Example Prometheus data set

If you select 'Graph', you will see the results visualised as a graph. As you can see from the result set, Prometheus is recording the total number of requests, a counter that increments by one each time there is a new request, forever. This is useful, however we might also want to know the rate of requests over the last 5 minutes. To do that we can replace our http_requests_total query with:

irate(http_requests_total[5m])

Press 'Execute' and now we can see the rate of requests over the last 5 minutes:

Rate of requests over the last 5 minutes
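
A quick note on the function choice: irate() calculates a per-second rate from the last two samples inside the window, which makes it responsive to short spikes, whereas rate() averages over the whole window and increase() estimates the total count within it. All three take the same range selector:

rate(http_requests_total[5m])      # average per-second rate over the window
increase(http_requests_total[5m])  # approximate number of requests in the last 5 minutes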

This is useful if we need to see spikes in traffic at certain times; we can see the value rise as the number of requests climbs above normal. To do some more advanced querying, however, we should first set up the client for Prometheus to scrape, so let's ssh to the webserver machine. If you're like me and running this in a test lab you can just run the following to get a simple server up and running on the target machine:

yum install -y epel-release && yum install -y nginx && systemctl stop firewalld && systemctl start nginx

After which you’ll be greeted by our old friend the default nginx landing page:

The default nginx landing page

Now we should install the exporter. There are exporters available for individual pieces of software, as well as a generic one for OS-level stats. I would use the provided nginx exporter here, however it requires either VTS or Lua support, which over-complicates things a little, so I'll just be using the default node exporter, which still gives us access to a large number of metrics. The exporters themselves are written in Go and a few are available pre-compiled; they're pretty easy to write and it shouldn't take too much effort for someone familiar with Golang to create one. What happens under the hood is that the exporter runs a web service on a specific port that serves plain-text values, which it collects largely by reading files under /proc for system information. This is great as it means the load is placed almost entirely on the server; scraping the endpoints more often in the Prometheus server config has almost no bearing on the clients at all. In order to continue we must actually install one of the exporters; I'll be installing it using the pre-compiled binary.

First we need to download the binary:

wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz

Untar it:

tar xvf node_exporter-0.14.0.linux-amd64.tar.gz

We should now have a single executable and a LICENSE file. Run the executable to check it works; you should be presented with something similar to:

INFO[0000] Listening on :9100

You can pass arguments to these exporters to make them listen on a different port, as long as you update the port in the prometheus.yml config to match. For the sake of this tutorial we will use the default. We can check it's exporting correctly by running:

./node_exporter & curl localhost:9100/metrics

We should be presented with a page of plain-text metrics and no errors. Next check it in your browser by going to:

http://192.168.1.181:9100/metrics

And you should see something like this:

Example output of the /metrics url
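
What you're looking at is the plain-text Prometheus exposition format. A few illustrative lines (the metric names come from the node exporter, the values here are made up):

# HELP node_memory_MemFree Memory information field MemFree.
# TYPE node_memory_MemFree gauge
node_memory_MemFree 1.2906752e+08
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7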

Using the hashed-out lines (the # HELP and # TYPE comments) in this output we can get a better idea of what these metrics actually are. As the exporter is now up and running we can go and check our stats in Prometheus itself: open up the server interface and try running our http_requests_total query again, and you should now be presented with some stats for the webserver job alongside the prometheus job:

Metric collection results - now with added webserver!

If you want to test it further, open the webserver in another tab and refresh it a bunch of times. You should see the number rise each time you make a request and hit ‘Execute’. Let’s try making a graph using our previous irate query:

Prometheus graph example

It's clearly getting there: we can see the spikes in traffic that were reflected in the results table. However, we don't really care about the requests to the Prometheus server, only our webserver. We can hide the results from any job other than the webserver by adding the label matcher job="webserver" as shown here:

irate(http_requests_total{job="webserver"}[5m])

The graph should change to display only the traffic from the webserver job. You can use this same syntax for anything in the result set; just check the legend under the table and pick and choose the parts you want. Here's an example of how our graph could look:
Results from the updated query
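
For reference, the same matcher syntax works with any label you see in the legend, and matchers can be combined; the label values below are from this lab setup and may differ in yours:

irate(http_requests_total{job="webserver",code="200"}[5m])
irate(http_requests_total{instance="192.168.1.181:9100"}[5m])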

So now we have our simple graph of metrics in Prometheus; from here we will look at using Grafana to create a much more usable interface. To start with we need to install it. Grafana has its own repository, which is easier to use than downloading RPM files from the internet (automatic updates through the package manager, anyone?), so create a new file by running:

vi /etc/yum.repos.d/grafana.repo

Press 'i' to enter insert mode and add the contents shown here:

[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/7/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Now we can update yum with the new repository and install Grafana directly:

yum install -y grafana

Note that you will need to accept the GPG key when prompted. The Grafana service is installed as grafana-server, so you'll need to run the following to start it and enable it on boot:

systemctl start grafana-server && systemctl enable grafana-server

Time to check which port Grafana is listening on; to do so we can run:

netstat -tulpn
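
If netstat isn't available (on a minimal CentOS 7 install it lives in the net-tools package), ss shows the same information:

ss -tlnp | grep grafana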

The output should show that grafana-server is listening on port 3000. Open up your web browser and go to http://192.168.1.180:3000, where you should be greeted by this screen:

The Grafana login screen
The default username and password here are both admin. After you've logged in you should be presented with a rather empty screen that prompts you to add a data source. Select it, choose a name, add the Prometheus server details and change the access type from 'proxy' to 'direct', since Grafana is on the same machine and a proxy shouldn't be necessary. I've found that Grafana doesn't like using 'localhost' for the data source URL, so it's best to just put the server's IP address in the URL box. Click 'Save and Test' and you should see something like this:

Successfully adding the datasource
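
If you'd rather script this step, Grafana also exposes an HTTP API for managing data sources. A rough sketch using the default admin credentials; adjust the URLs and access type to suit your setup:

curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://192.168.1.180:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.1.180:9090","access":"direct"}'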

It’s worth noting that ANY changes to Grafana dashboards must be saved before you leave, or else the changes will be lost.

That's all there is to getting data into Grafana. Next we should go back to the Grafana home screen, where we're presented with the option to add a 'New Dashboard'; click on the icon. We're then presented with a rather barren looking screen; click the floppy disk 'Save' icon and give your first dashboard a name. Now we can start to add visualisations of our data to the dashboard, so let's start by making a new graph. Mouse over the three dots on the left hand side of the 'Empty Space' block and select 'Add Panel' like so:
Creating a new graph

From here select ‘Graph’, and a graph should fill up the empty space. Click the ‘Panel Title’ space and select ‘Edit’ like so:
Click the 'Panel Title' section to get this to appear
From here the panel settings will open at the bottom of the screen; ensure you are on the 'Metrics' tab and paste a query into the box as shown here:
Using a query in Grafana

As you can see the query is the same, however Grafana offers far more flexibility than the Prometheus graphing tool. There are many customisation options and ways to tweak how the data is displayed in Grafana, however it's an extensive subject and I will upload another guide that goes into more detail at some point. This is about as far as we need to go in terms of grasping the basic concepts of Prometheus and its interactions with other tools, however we still have some loose ends to tie up.

When we run the Grafana server it's a service within systemd, so we can enable it and start/stop it at will. When we installed Prometheus it didn't come with a unit file, which means systemd is unaware of its existence. We will make our own systemd unit file, but first we need to move the Prometheus files somewhere more appropriate than wherever we downloaded them. First we should kill the Prometheus server; get the PID by running:

pgrep prometheus

If nothing is returned then Prometheus isn't running, so we don't need to worry. If it returns a value then use that value as an argument to the kill command to stop it. Now that Prometheus is stopped we should move the configuration file and executable: use the mv command to move the entire directory to /etc/prometheus (you could move the executable to a folder in $PATH to have it runnable from anywhere, but I don't bother with this as the systemd unit file will be used to manage it).
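
As a sketch, assuming the tarball was extracted into your home directory with its default directory name, that move looks something like:

mv ~/prometheus-2.0.0-alpha.3.linux-amd64 /etc/prometheus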

At this point it's worth testing that it's still working by attempting to run the Prometheus executable from its new location. Now to make the unit file, create a dedicated user for the service and a new unit file with:

adduser prometheus && vi /usr/lib/systemd/system/prometheus.service

Press ‘i’ to enter insert mode and paste the following:

[Unit]
Description=Prometheus metric and monitoring server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/etc/prometheus/prometheus \
  -config.file=/etc/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target

Save and quit, then run the following to reload systemd and start/enable the Prometheus service:

systemctl daemon-reload && systemctl start prometheus && systemctl enable prometheus && systemctl status prometheus

If all went well you should have some green output and a working systemd Prometheus service. It's worth rebooting the machine and running systemctl status prometheus after boot to verify it starts up correctly with the machine before we mark this as done. We can do the same thing for the node_exporter service on client machines, just changing the location of the executable; a sketch is shown below.
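
For completeness, here is a minimal sketch of such a unit for the node exporter, assuming the binary has been moved to /usr/local/bin and that a suitable service user exists on the client; adjust the path and user to match your machine:

[Unit]
Description=Prometheus node exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target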

This article will be updated.