Learning Prometheus
Prometheus is an open-source monitoring and alerting tool that collects and stores metrics as time series data. That means every collected data point has a timestamp and can be plotted on a graph. Additionally, all gathered metrics are labeled with key/value pairs, and these labels, combined with the metric name, uniquely identify a time series (see the example after this list). Anything that can be measured numerically over time can be monitored by Prometheus. Some examples include but are by no means limited to:
- CPU utilization
- Memory utilization
- Disk space
- Disk I/O
- HTTP request rates
- SQL query performance
- Uptime
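For instance, a single time series is identified by a metric name plus a unique set of labels; the metric and label names below are purely illustrative:

# One time series: metric name + label set; each scrape appends a timestamped sample value.
http_requests_total{job="webapp", instance="10.0.0.5:8080", method="GET", status="200"}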
Testing Environment
This GitHub repo sets up a testing environment on an LXC virtual machine.
Note, br0 was manually created on the Ubuntu server and bound to the home lab network. This allows any attached VM to be reachable on the same network and get DHCP. There are ways to spin up a VM connected to the lxdbr0 interface and proxy connectivity from a host port on the Ubuntu server to the LXC virtual machine where Prometheus is running. This is a more complicated setup, but it works. There are some commented-out targets for this configuration in the justfile, and a separate LXC profile is needed to bind the VM to the lxdbr0 interface. The rest of this post assumes the posture depicted in the above figure.
Requirements
The following requirements must be met to get this up and running:
- Host machine running Ubuntu 22.04
- LXC (tested with version 5.21.2)
- just command runner (tested with version 1.36.0)
Set Up the Environment
To set up the environment, clone the repo down to the Ubuntu machine running LXC:
git clone git@github.com:TheFutonEng/learning-prometheus.git && cd learning-prometheus
Then run the following:
just setup
This command will execute the following:
- Delete an environment if one is deployed.
- Recreate the LXC profile for the VM.
- Create fresh SSH keys.
- Deploy the LXC virtual machine.
- Install Prometheus on the LXC virtual machine.
- Configure Prometheus on the LXC virtual machine.
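Under the hood, these steps wrap standard SSH and LXD tooling. The commands below are only a rough sketch of what the recipes do; the actual recipe names and arguments live in the repo's justfile:

# Rough equivalents of the setup steps (illustrative, not the exact recipes)
ssh-keygen -t ed25519 -f .ssh/prometheus_ed25519 -N ""              # fresh SSH key pair
lxc profile create prometheus-net                                   # profile that holds the root and eth0 devices
lxc launch ubuntu:22.04 prometheus --vm --profile prometheus-net    # the Prometheus VM itself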
Below is example output from the just setup command:
$ just setup
Deleting VM prometheus.
Deleting profile prometheus-net.
Profile prometheus-net deleted
Setting up SSH keys for Prometheus project...
SSH keys already exist in .ssh/prometheus_ed25519
cloud-init/user-data created with new SSH key
Creating network profile on br0
Profile prometheus-net created
Device root added to prometheus-net
Device eth0 added to prometheus-net
Deploying Prometheus VM
Creating prometheus
Starting prometheus
Waiting for VM to get IP address...
VM IP address: 192.168.1.112
Waiting for SSH to be available on 192.168.1.112
SSH is now available on 192.168.1.112
Prometheus installed in /opt/prometheus
Creating systemd service.
Creating systemd service.
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.
Prometheus started and enabled
Prometheus setup complete!
Access the web interface at:
- http://localhost:9090
- http://192.168.1.112:9090 (direct VM access)
Waiting for Prometheus to become available...
✓ Prometheus is up and running
We can confirm the web interface for Prometheus is running at the link in the above output:
Finally, we can get to Prometheus. If you want to muck around in the virtual machine, run just ssh to access the VM.
Prometheus
This section will dive into a few basic exercises to get exposure to Prometheus. The just setup command installed Prometheus and set it up to monitor itself. To see all of the metrics Prometheus is gathering about itself, navigate to http://<IP-of-VM>:9090/metrics.
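The same endpoint can also be checked from the command line, which is handy when a browser is not available; this is just plain curl against the metrics endpoint mentioned above:

# Print the first few raw metric lines exposed by Prometheus itself
curl -s http://<IP-of-VM>:9090/metrics | head -n 20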
Understanding the Basics
For grins, let’s see what memory utilization has been like on the Prometheus VM. There are quite a few memory-related metrics that get collected. Let’s use go_memstats_alloc_bytes for this. Enter that string into the Expression input field and then click the Execute button.
Hmm… not that interesting. The 23206832 number on the far left indicates the current value for the go_memstats_alloc_bytes metric (23.2 MB). Click on the Graph tab next to Table to see what we get.
A little more interesting. By default, this view shows the time series data for the go_memstats_alloc_bytes metric over the past hour. Not sure what that spike in memory is about, but this is precisely why Prometheus is so helpful! This data, combined with logs, could be invaluable in resolving an issue or helping with root cause analysis.
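The same query can also be run outside the web UI through Prometheus' HTTP API, which is useful for scripting; the instant-query endpoint below is part of the standard API:

# Ask the HTTP API for the current value of the metric (returns JSON)
curl -s 'http://<IP-of-VM>:9090/api/v1/query?query=go_memstats_alloc_bytes'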
Adding an Exporter
Exporters are third-party libraries that expose metrics for Prometheus to scrape. There are dozens of exporters out there, but this section will mess around with the node_exporter (source here).
Note that the node_exporter exposes data about the VM, which differs from setting up Prometheus to monitor itself. The latter gets stats about the Prometheus process but not the server as a whole.
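As a concrete illustration of that difference, compare these two metrics, both standard names (the first comes from Prometheus' own process collector, the second from node_exporter on Linux):

process_resident_memory_bytes     # memory used by the Prometheus server process itself
node_memory_MemAvailable_bytes    # memory available on the VM as a whole, via node_exporter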
To quickly deploy node_exporter on the Prometheus VM, run the just configure-node-exporter command. This will install the node_exporter binary, update the Prometheus configuration with an additional endpoint for node_exporter (http://localhost:9100), and restart the Prometheus service. Navigating to http://<IP-of-VM>:9090/targets should now show two targets:
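For reference, the scrape configuration that produces two targets looks roughly like the following in prometheus.yml; the job names here are illustrative and may not match exactly what the repo generates:

# Sketch of a two-target scrape configuration
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]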
Clicking the show more button on the targets page will show additional data, including the metrics endpoint for each target. There are loads of metrics exposed by the node_exporter. We’ll use the node_cpu_seconds_total metric for this next section. Drop it into the Expression input field as before:
The above screenshot shows how granular an exporter's data can be. Notice that the top half of the metrics are for CPU0 and the bottom for CPU1. Each CPU has metrics captured for eight different “modes.” Note that cpu, instance, job, and mode in the above screenshot are all labels in Prometheus. Clicking on the Graph tab, as done previously, shows a graph that is not as interesting:
This shows all 16 time series for node_cpu_seconds_total on one graph, but the data all plotted together muddles the graph. How could we display the total CPU utilization for the VM on the graph?
PromQL Fundamentals
To do this, we have to modify the string that is put into the Expression input field. Prometheus has its own query language, called PromQL, which allows fine-grained control over how the collected data is queried and displayed. Drop the following string into the Expression input field:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
There’s a lot going on in this string, but let’s see what it returns:
Just a single metric. Now the graph:
We can interpret this better now. Let’s break down that string:
- node_cpu_seconds_total{mode="idle"}[1m]: Takes the raw counter of idle CPU seconds over a 1-minute time interval
- rate(): Calculates how fast the counter is increasing over that 1-minute interval and returns a per-second rate of change
- avg by (instance): Takes the average across CPU0 and CPU1
- * 100: Converts the rate from a decimal to a percentage
- 100 -: Takes the percentage of idle time (what the metric measures) and subtracts it from 100 to switch it to a utilization metric.
There is a lot to this query to show something as basic as CPU utilization. This complexity is also part of the strength of Prometheus. You can get as granular as needed when digging into a captured metric.
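For example, dropping the avg by (instance) aggregation from the query above yields one utilization series per core instead of a single average; this variant is shown only for illustration:

# Per-core CPU utilization (one series per cpu label value)
100 - (rate(node_cpu_seconds_total{mode="idle"}[1m]) * 100)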
Setting Up Alerting
Let’s now take this average CPU metric and generate an alert based on it. Run the just setup-alerts command. This will put in place the below rules.yml file and reload Prometheus:
# Example alert rule structure
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage on {{ $labels.instance }}
Note that essentially the same PromQL query from the previous section appears in the alert definition (here with a 2-minute window instead of 1 minute). We can confirm that the rule for the alert is in place by browsing to http://<IP-of-VM>:9090/alerts.
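Rule files can also be validated from the command line before Prometheus loads them, using promtool, which ships alongside the Prometheus binary (the path below assumes you are in the directory containing rules.yml):

# Check the rule file for syntax errors and report how many rules were found
promtool check rules rules.yml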
While this alert is in place, it is currently inactive, meaning there is no active trigger for the rule. Run the just stress-cpu command to trigger the alert. The alert will shift to a Pending state shortly after the command is run:
Pending means that the condition for the rule has been hit but the duration defined in the rule (for: 5m) has not yet elapsed. After 5 minutes, the alert transitions to a firing state, meaning all conditions have been met.
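For reference, the stress-cpu recipe presumably generates artificial load with something along these lines, run on the VM; this is only a guess at the mechanism, and the repo may use a different tool entirely:

# Hypothetical way to keep both vCPUs busy longer than the rule's 5-minute "for" duration
stress-ng --cpu 2 --timeout 360s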
Typically, alerts are not viewed natively in Prometheus like this; instead, they are forwarded to another platform that handles deduplication, grouping of similar alerts, silencing, and so on. Alertmanager is a standard tool used for this purpose.
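Pointing Prometheus at an Alertmanager instance is a small addition to prometheus.yml. The snippet below is only a sketch of that standard configuration and is not part of this post's setup; the target assumes Alertmanager's default port:

# Forward firing alerts to an Alertmanager listening on its default port
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]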
Wrap It Up
That’s it! I hope this post was informative.