What is it:
Prometheus is a monitoring and alerting system designed for large deployments and built to scale. It is licensed under Apache 2.0, which means it’s free to use in most environments, and its entire source code is available on GitHub.
Prometheus was created by the team at SoundCloud. Its model is very similar to the Google project ‘Borgmon’, a pull-based monitoring solution designed with scale in mind. Until recently Borgmon was shrouded in secrecy, which is unsurprising considering it’s only accessible internally at Google; Prometheus is the most similar solution to Borgmon that is publicly available.
How does it work:
Prometheus uses a client-server architecture in which the server ‘scrapes’ data from the clients. Each client runs a tiny service that serves various stats over an HTTP endpoint; the server then ‘scrapes’ these endpoints at regular intervals and logs each scrape into its database. The results of these scrapes are stored, and analysis can be done according to the relationship between collected values over time.
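To make the scrape model concrete, here is a sketch of the kind of plain-text response a client endpoint serves. The metric names are typical node exporter ones, but the values (and the label) are made up for illustration:

```shell
# What a scrape sees: the client answers an HTTP GET with plain-text
# metrics, one "name value" pair per line (sample names/values are
# illustrative, not real output from this lab).
printf 'node_load1 0.21\n'
printf 'node_memory_MemFree 1569218560\n'
printf 'node_network_receive_bytes{device="eth0"} 98572960\n'
```

Because the format is just text over HTTP, you can inspect any exporter with a browser or curl, which we will do later in this walk through.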
The server application, ‘Prometheus’, is essentially a database built from the results of these scrapes; the actual ‘monitoring’ is performed by running a query language against the data sets in that database. For example, you could write a query to check the average CPU usage of a node over the last 15 minutes and alert when it’s hitting its limits, or create a graph of performance over time to analyse key times/areas of activity. I’ve drawn a simple diagram below that will hopefully provide some clarity (the blue lines represent HTTP requests):
As you’ll probably have noticed, in this diagram Prometheus is feeding into Grafana. This is because whilst graphing can be done by Prometheus, it’s extremely limited, and integrating a tool such as Grafana provides far better usability. Grafana also includes far better methods of controlling access; the Prometheus web UI has no built-in security at all. The two tools integrate naturally: Grafana allows you to select Prometheus as a data source with no extra configuration, and the Prometheus documentation itself advises using both together. Prometheus queries can be run directly within Grafana, so you can use it without ever having to visit the actual Prometheus server interface.
In this walk through I’ll be installing Prometheus from a binary. Alternatively you can run it in a container or compile it from source, however the aim of this tutorial is to understand how Prometheus works and how to do some basic configuration of the service stack, so to maintain focus on that goal we will use the pre-compiled binary. Two virtual machines are used, with the following specs:
Hostname: webserver OS: CentOS 7.3.1611 IP: 192.168.1.181 SELinux: Permissive
Hostname: monitoring OS: CentOS 7.3.1611 IP: 192.168.1.180 SELinux: Permissive
To start with we can install the Prometheus server on ‘monitoring’, so we need to download the tarball:
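The download command itself isn’t shown here; the URL below is an assumption based on the project’s GitHub releases page and the 2.0.0-alpha.3 build used in this walk through, so adjust the version if you are following along with a newer release:

```shell
# Fetch the pre-compiled Prometheus release tarball. The URL/version
# are assumptions matching the tarball extracted in the next step;
# check the GitHub releases page for the current build.
PROM_TARBALL="prometheus-2.0.0-alpha.3.linux-amd64.tar.gz"
wget "https://github.com/prometheus/prometheus/releases/download/v2.0.0-alpha.3/${PROM_TARBALL}" \
  || echo "download failed - check the releases page for the current URL"
```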
tar xvf prometheus-2.0.0-alpha.3.linux-amd64.tar.gz
Now we have a bunch of files; most importantly the ‘prometheus’ executable and the config file ‘prometheus.yml’. All configuration is done within this prometheus.yml file, and it’s relatively simple to get a basic setup working. Open it with your favourite text editor and configure as follows:
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webserver'
    static_configs:
      - targets: ['192.168.1.181:9100']
As you can probably tell, we created a new job called
webserver that is targeting our webserver on port 9100, and we set the scrape interval to 15 seconds, which means the server will scrape its targets every 15 seconds. We should test that the server will actually start (so if we run into problems later on at least we can verify it worked at some point). From the Prometheus directory run:
./prometheus -config.file=prometheus.yml
You should be greeted with some output; if it says
'Listening on <IP Address>' then we know the service is working. Stop it with Ctrl+C, then to test it further we can try running:
./prometheus -config.file=prometheus.yml &
This will make Prometheus run in the background. Open a web browser and browse to the monitoring server on port 9090, in our case this was
http://192.168.1.180:9090. If you get an access denied error it might be because you have a firewall running; by default CentOS uses firewalld, which you can stop temporarily by running
systemctl stop firewalld and trying again. You should be presented with the Prometheus dashboard like so:
This interface is the primary Prometheus server interface, where you can run queries and get back result sets. It’s the best place to test your queries as it updates very quickly. In the configuration file we set up earlier there is a line that specifies the job
prometheus to record metrics from the local machine, which means Prometheus is already feeding itself stats about itself. We can use this to test some basic queries.
Clicking the box – insert metric at cursor – will allow you to see all the metrics that Prometheus is currently aware of; click on any one and it will insert it into the text box. For the sake of this tutorial we will use the
http_requests_total metric, so find it and click on it:
After selecting the metric hit ‘Execute’, you should then get back a bunch of results like so:
If you select ‘Graph’, you will see the results visualised as a graph. As you can see from the result set, Prometheus takes the total number of requests and increments it by one each time there is a new request, forever. This is useful, however we might also want to know the rate of requests over the last 5 minutes. To do that we can wrap the metric in the irate function, replacing our query with:
irate(http_requests_total[5m])
Press ‘Execute’ and now we can see the rate of requests we have received over the last 5 minutes:
This is useful if we need to see spikes in traffic at certain times; we can see the value climb when the rate of requests is higher than normal. To do some more advanced querying, however, we should first set up the client for Prometheus to scrape, so let’s ssh to the webserver machine. If you’re like me and running this in a test lab you can just run the following script to get a simple server up and running on the target machine:
yum install -y epel-release && yum install -y nginx && systemctl stop firewalld && systemctl start nginx
After which you’ll be greeted by our old friend the default nginx landing page:
Now we should install the exporter. There are exporters available for individual pieces of software, plus a generic one for OS-level stats. I would use the provided nginx exporter here, however it requires either VTS or Lua, which over-complicates things a little, so I’ll just be using the default node exporter, which will still give us access to a large number of metrics. The exporters themselves are written in Go and a few are available pre-compiled; they’re pretty easy to write and it shouldn’t take too much effort for someone who is familiar with golang to create one. What’s happening under the hood is the exporter creates a web service on a specific port that displays plain-text values, which it collects by reading files in
/proc to get system information. This is great as it means the load is almost entirely placed on the server; scraping the endpoints more often in the Prometheus server config has almost no bearing on the clients at all. In order to continue we must actually install one of the exporters, and I’ll be installing it using the pre-compiled binary.
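As a rough sketch of what the node exporter is doing internally (this is an illustration, not the exporter’s actual Go code), you can produce a similar metric line yourself by reading /proc:

```shell
# Read the 1-minute load average from /proc and print it in the same
# plain-text format an exporter would serve over HTTP.
awk '{print "node_load1 " $1}' /proc/loadavg
```

The real exporter does this for hundreds of values and serves the result on an HTTP endpoint, which is why each scrape is so cheap for the client.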
First we need to download the binary:
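As with the server, the download command isn’t shown; the URL below is an assumption matching the 0.14.0 tarball extracted in the next step, so check the node_exporter releases page if you want a newer build:

```shell
# Fetch the pre-compiled node_exporter release (URL/version assumed to
# match the tarball extracted below).
NODE_TARBALL="node_exporter-0.14.0.linux-amd64.tar.gz"
wget "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/${NODE_TARBALL}" \
  || echo "download failed - check the releases page for the current URL"
```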
tar xvf node_exporter-0.14.0.linux-amd64.tar.gz
We should now have a single executable and a LICENSE file. Run the executable to check it works, you should be presented with something similar to:
INFO Listening on :9100
You can pass arguments to these exporters to tell them to listen on a different port, as long as the port matches what you’ve set in the prometheus.yml config. For the sake of this tutorial we will use the default. We can check it’s exporting correctly by running:
./node_exporter &
curl localhost:9100/metrics
We should be presented with a page of plain-text metrics and no errors. Next check it in your browser by going to http://192.168.1.181:9100/metrics
And you should see something like this:
Using the commented-out lines in this output we can get a better idea of what these metrics actually are. As the exporter is now up and running we can go and check our stats in Prometheus itself; open up the server interface and try running our
http_requests_total query again. You should now be presented with some stats for the
webserver job alongside the existing prometheus job.
If you want to test it further, open the webserver in another tab and refresh it a bunch of times. You should see the number rise each time you make a request and hit ‘Execute’. Let’s try making a graph using our previous irate query:
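If you would rather generate the test traffic from a terminal, a simple loop works too; the IP and port below are the lab values from this walk through, so adjust them for your own setup:

```shell
# Fire 50 requests at the lab Prometheus server so the request
# counters visibly move (IP/port are this walk through's lab values).
for i in $(seq 1 50); do
  curl -s -o /dev/null http://192.168.1.180:9090/metrics || true
done
```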
It’s clearly getting there; we can see the spikes in traffic that were reflected in the results table. However, we don’t really care about the requests to the Prometheus server, only our webserver. We can hide the results from any job other than the webserver by adding the label filter
job="webserver" inside the metric selector, giving irate(http_requests_total{job="webserver"}[5m]), as shown here:
The graph result should change to just have the traffic from the
webserver job displayed. You can use this same syntax for anything in the result set; just check the legend for the table and pick and choose the parts you want. Here’s an example of how our graph could look:
So now we have our simple graph of metrics in Prometheus; from here we will look at using Grafana to create a much more usable interface. To start with we need to install it. Grafana has its own repository, which is easier to use than downloading RPM files from the internet (automatic updating within the package manager, anyone?), so create a new repo file by running:
vi /etc/yum.repos.d/grafana.repo
Press ‘i’ to enter insert mode and add the contents shown here:
[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/7/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
Now we can update yum with the new repository and install Grafana directly:
yum install -y grafana
Note that you will need to accept the GPG key when prompted. The Grafana service is installed as
grafana-server so you’ll need to run the following to start it and enable start on boot:
systemctl start grafana-server && systemctl enable grafana-server
Time to check which port Grafana is listening on; to do so we can run:
ss -tlnp | grep grafana
Which should output that grafana-server is listening to port 3000, open up your web browser and go to
http://192.168.1.180:3000 where you should be greeted by this screen:
The default username and password here are both
admin. After you’ve logged in you should be presented with a rather empty screen that prompts you to select a data source. Select it, choose a name, add the Prometheus server details and set the access type to
direct, as it’s on the same machine so a proxy shouldn’t be necessary. I’ve found that Grafana doesn’t like using ‘localhost’ for the data source URL, so it’s best to just put the server IP address in the URL box. Click ‘Save and Test’ and you should see something like this:
It’s worth noting that ANY changes to Grafana dashboards must be saved before you leave, or else the changes will be lost.
That’s all there is to getting data into Grafana. Next we should go back to the Grafana home screen, where we’re presented with the option to add a ‘New Dashboard’; click on the icon. We’re then presented with a rather barren looking screen; click the floppy disk ‘Save’ icon and give your first dashboard a name. Now we can start to add visualisations of our data to the dashboard, starting with a new graph. Mouse over the three dots on the left hand side of the ‘Empty Space’ block and select ‘Add Panel’ like so:
From here select ‘Graph’, and a graph should fill up the empty space. Click the ‘Panel Title’ space and select ‘Edit’ like so:
From here the panel settings will open at the bottom of the screen, ensure you are on the ‘Metrics’ tab and paste a query into the box as shown here:
As you can see the query is the same, however Grafana offers far more flexibility than the Prometheus graphing tool. There are many customisation options and ways to tweak the data and how it’s displayed in Grafana; it’s an extensive subject and I will upload another guide that goes into more detail at some point. This is about as far as we need to go in terms of grasping the basic concepts of Prometheus and its interactions with other tools, however we still have some loose ends to tie up.
When we run the Grafana server it’s a service within systemd, so we can enable it and start/stop it at will. When we installed Prometheus it didn’t come with a unit file, which means systemd is unaware of its existence. We will make our own systemd unit file, but first we need to move the Prometheus files to somewhere more appropriate than wherever we downloaded them. First we should kill the Prometheus server; get the PID by running:
pgrep prometheus
If nothing is returned then Prometheus isn’t running, so we don’t need to worry. If it returns a value then use that value as an argument to the
kill command to stop it. Now that Prometheus is stopped we should move the configuration file and executable; use the mv command to move the entire directory to /etc/prometheus, for example:
mv prometheus-2.0.0-alpha.3.linux-amd64 /etc/prometheus
(You could instead move the executable to a folder in $PATH to have it runnable from anywhere, but I don’t bother with this as the systemd unit file will be used to manage it.)
At this point it’s worth testing that it’s still working by attempting to run the Prometheus executable. Now to make the unit file, create a new file with:
adduser prometheus && vi /usr/lib/systemd/system/prometheus.service
Press ‘i’ to enter insert mode and paste the following:
[Unit]
Description=Prometheus metric and monitoring server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/etc/prometheus/prometheus \
  -config.file=/etc/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target
Save and quit, run the following to refresh systemd and start/enable the Prometheus service:
systemctl daemon-reload && systemctl start prometheus && systemctl enable prometheus && systemctl status prometheus
If all went well you should have some green output and a working systemd Prometheus service. It’s worth rebooting the machine and running the
systemctl status prometheus command after boot to verify it’s starting up correctly with the machine before we mark this as done. We can do the same thing with the ‘node_exporter’ service on client machines; just change the location of the executable.
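For reference, a unit file for node_exporter would look much the same. The paths and user below are assumptions mirroring the layout we used for Prometheus, so adapt them to wherever you placed the binary:

```ini
[Unit]
Description=Prometheus node exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/etc/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
```

As before, run systemctl daemon-reload, then start and enable the service, and the exporter will come back up with the machine.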
This article will be updated.