Introduction
Prometheus is an open-source systems monitoring and alerting toolkit. It collects real-time metrics over an HTTP pull model and stores them in a time series database. It was named after Prometheus, the Titan god of forethought. That is a well-chosen name for such a tool, considering that forethought is defined as “careful consideration of what will be necessary or may happen in the future.” By collecting data and analyzing it over time, you can predict what “may happen in the future.” And that is what Prometheus does: it collects data over time. How you use that data determines how powerful Prometheus actually is.
What is Prometheus?
Prometheus really comes down to 4 services –
- A time series database that will store all our metrics data
- A data retrieval worker that is responsible for pulling/scraping metrics from external sources and pushing them into the database
- A web server that provides a simple web interface for viewing the configuration and the stored data
- An HTTP REST API for querying data, with instant queries served at /api/v1/query
Here’s an architecture diagram from Prometheus’s documentation page to help visualize it –
There’s a lot going on in this diagram, so let’s break it down –
- PromQL – This is a querying language specific to Prometheus. It is designed for building powerful yet simple queries for graphs, alerts, or derived time series (aka recording rules).
- Alertmanager – Handles alerts sent by client applications such as Prometheus. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or Opsgenie. It also takes care of silencing and inhibition of alerts.
- Pushgateway – If for any reason you can’t or don’t want to use the pull model that Prometheus comes with, you can push metrics to the Pushgateway, which Prometheus then scrapes. It is mainly intended for short-lived jobs that may not live long enough to be scraped.
- Service Discovery – Prometheus is designed to require very little configuration when first set up, and was designed from the ground up to run in dynamic environments such as Kubernetes. It therefore performs automatic discovery of running services to make a “best guess” at what it should be monitoring.
- Prometheus Targets – Outside of Service Discovery, you can define static targets for Prometheus to scrape.
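To make the Alertmanager entry above a little more concrete, here is a toy Python model of deduplication and grouping. It is only a sketch of the idea, not how Alertmanager is actually implemented: the function name dedupe_and_group and the sample labels are made up for illustration, and group_by simply mirrors the option of the same name in an Alertmanager route.

```python
from collections import defaultdict


def dedupe_and_group(alerts, group_by):
    """Drop alerts with identical label sets, then bucket by the group_by labels.

    alerts is a list of label dictionaries; group_by is a list of label names.
    """
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # Two alerts with exactly the same labels are treated as the same alert.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts that share the group_by label values end up in one notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

Grouping by cluster, for example, would bundle every distinct alert from prod-cluster into a single notification instead of paging once per alert.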
Why Prometheus?
Prometheus, Graphite, and InfluxDB aren’t new technologies; they’ve all been around for quite a few years now. What separates Prometheus from the other tools is the pull model. Instead of sending your data to Prometheus, you just have to tell Prometheus where your data is and it will collect it.
As you can see in the diagram above, InfluxDB requires an agent called Telegraf to collect metrics and send them to InfluxDB; for Kubernetes, there is a Telegraf input plugin that runs as a container.
Prometheus, on the other hand, does not require an agent or plugin and instead reaches out to its targets at a configurable scrape interval (10 seconds here; the default is one minute) to collect metrics. In some cases you may need extra configuration to expose those metrics, or you may need to create a /metrics or /health HTTP endpoint for Prometheus to consume. Most modern containerized apps will have those metrics exposed already. If you’re trying to collect metrics from a custom app, ask your developers if they have such an endpoint or if they can create one. The pull model is great for a number of reasons –
- Prometheus does not require us to install any custom software or containers to collect metrics. This also means we aren’t using extra CPU cycles on our apps to push metrics.
- Prometheus provides us with a centralized config and management console, allowing us to see when data was last scraped. This also lets Prometheus handle service failure gracefully: if an app goes down, Prometheus can record that it was unable to retrieve data. With the push method, we’re often unsure why we are no longer receiving data.
- If the pull model is not an option, you can still push metrics to Prometheus. So really you get the best of all worlds.
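If your developers need to create a /metrics endpoint, the sketch below shows roughly what Prometheus expects to find when it scrapes one. It uses only the Python standard library and hand-writes the text exposition format; the metric names, values, and port are placeholders made up for illustration, and a real app would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would update these as it works.
COUNTERS = {
    "sftp_uploads_total": ("Total number of completed uploads.", 42),
    "sftp_logins_failed_total": ("Total number of failed logins.", 3),
}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current counters."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (8000 is arbitrary)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Nothing here pushes anything: the app only answers when Prometheus comes asking, which is the whole point of the pull model.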
How Does Service Discovery Work?
Prometheus automatically searches Kubernetes clusters for specific labels, annotations, or CRDs (Custom Resource Definitions, provided by the Prometheus Operator). Here at ICFNext we use CRDs. Service monitors and pod monitors create targets in Prometheus, and Prometheus rules create rules and alerts. I really like CRDs because you can easily include them in an app deployment or Helm chart. Here are a few YAML examples to include in a deployment of an SFTP app to give Prometheus visibility and basic up/down alerting –
# Service Monitor example for an SFTP app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sftp
  namespace: sftp
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: sftp  # match labels to the service you want to monitor
  endpoints:
    - port: metrics  # Name of the port the service monitor should go to for metrics. In this case the sftp app comes with a metrics port already.
  namespaceSelector:
    matchNames:
      - sftpns  # namespace the service is in
# Prometheus rule for setting up an alert for sftp
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: sftp
    role: alert-rules
  name: prometheus-sftp-rules
  namespace: sftpns
spec:
  groups:
    - name: sftp
      rules:
        - alert: sftp-up
          expr: absent(up{job="sftp"})  # PromQL query that returns a result only when no "up" series exists for the sftp job
          for: 2m  # If the series stays missing for 2m, the alert fires
          labels:
            severity: critical
            priority: P1  # Label I added so it will match Opsgenie standards of alerting
            cluster: prod-cluster  # Label to identify which cluster the alert is coming from
          # Annotations are great for adding info to alert notifications. You can also include
          # some variables here like $labels and $value. I like to add anything here that would
          # help an SRE when they receive the alert.
          annotations:
            summary: sftp job missing on prod-cluster (instance {{ $labels.instance }})
            description: "An sftp job has disappeared on prod-cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
            impact: sftp is down for prod
            action: check sftp status with kubectl 'kubectl get all -n sftpns'
            priority: P1
            alertmanagerurl: "https://prometheus-prod.mycompany.com/alertmanager/"
            prometheusurl: "https://prometheus-prod.mycompany.com/alerts"
            k8login: "https://kubeapi-prod.mycompany.com"
For more info on the CRDs you can use with Prometheus, check out Red Hat’s API documentation here – https://docs.openshift.com/container-platform/4.9/rest_api/monitoring_apis/monitoring-apis-index.html
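The “for: 2m” setting in the rule is easy to misread, so here is a small Python simulation of how an alert moves from inactive to pending to firing. This is a simplified model rather than Prometheus’s own code: the function name alert_states and the 30-second evaluation interval in the usage note are assumptions chosen for illustration.

```python
def alert_states(condition_results, for_seconds, eval_interval):
    """Return the alert state after each rule evaluation.

    condition_results holds one boolean per evaluation: True when the alert
    expression returned a result (here, when the "up" series was missing).
    """
    states = []
    held_for = None  # seconds the condition has held without interruption
    for condition_met in condition_results:
        if not condition_met:
            held_for = None  # any gap resets the timer
            states.append("inactive")
            continue
        held_for = 0 if held_for is None else held_for + eval_interval
        states.append("firing" if held_for >= for_seconds else "pending")
    return states
```

With for_seconds=120 and eval_interval=30, the series has to be missing on five consecutive evaluations before the state becomes firing, and a single evaluation where the series is present resets the timer. That is what keeps a brief blip from paging anyone.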
So I have Metrics, What Now?
So far we have covered a lot of the basics of Prometheus. The next step in your Prometheus journey will be to familiarize yourself with PromQL, a very powerful querying language designed specifically for Prometheus. This part gets a little more advanced, and we’re not going to dig too deep into PromQL queries here. To learn more about querying against Prometheus, check out the Prometheus querying documentation – https://prometheus.io/docs/prometheus/latest/querying/basics/
The most basic use of PromQL is to set up monitoring and alerting with Prometheus rules, and then to use Alertmanager to send those notifications to whoever needs them. You can configure the notifications to go to the Slack, Microsoft Teams, Opsgenie, or VictorOps APIs, or just send emails. Alertmanager has a ton of built-in options for sending notifications. See below for an example of the “watchdog” alert in Alertmanager, which is just a generic testing alert that is always firing within Prometheus –
Once you start to get the hang of Prometheus, you can start building more predictive analytics using these metrics. In some cases you can send the notifications to your own APIs, which could then automatically restart a service or rebuild a pod in your Kubernetes cluster. Why call a person in the middle of the night when your system can fix itself?
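As a rough idea of what such a self-healing API could look like, the sketch below takes the JSON body that Alertmanager sends to a webhook receiver and works out which commands to run. The REMEDIATIONS table, the deployment name, and the function name plan_remediations are all hypothetical; the code only plans the commands and leaves actually executing them to you.

```python
import json

# Hypothetical mapping from alert name to a remediation command.
# "deployment/sftp" is an assumed resource name, not taken from a real cluster.
REMEDIATIONS = {
    "sftp-up": ["kubectl", "rollout", "restart", "deployment/sftp", "-n", "sftpns"],
}


def plan_remediations(payload_text):
    """Return the commands to run for the firing alerts in a webhook payload."""
    payload = json.loads(payload_text)
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no action
        name = alert.get("labels", {}).get("alertname")
        if name in REMEDIATIONS:
            commands.append(REMEDIATIONS[name])
    return commands
```

Keeping the planning step separate from the execution step makes it easy to log or dry-run what the automation would do before you trust it with production.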
Since Prometheus is a time series database, you can also store predictive analytics/metrics in it. Give your company’s data scientists access to Prometheus and they can start to predict failures and outcomes. This is where the “forethought” part of Prometheus really comes into play. At this point you’re starting to venture into machine learning and AI territory with your Prometheus data.
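Prediction does not have to start with machine learning. PromQL has a built-in predict_linear() function based on simple linear regression, and the Python sketch below shows the same basic idea on a handful of samples. It is an illustration only: the function name extrapolate_linear and the sample data are made up, and it is not a reimplementation of PromQL.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the value it predicts seconds_ahead after the last timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept
```

Fed with free-disk-space samples, for example, a prediction at or below zero a few hours ahead is a good reason to alert before the disk actually fills up.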
I have to say though, my personal favorite use of Prometheus is real-time testing. Writing IaC (infrastructure as code) has a lot of challenges, and the better your tests are, the better your deployments will be. Use Postman/Newman to run HTTP GET requests against Prometheus to test your code or deployment against real-time metrics. Newman validates the Prometheus results and then gives you an exit code. https://github.com/postmanlabs/newman
A lot of IaC has to do with waiting for dependencies and for systems to finish building before starting the next step. Start your deployment, and once the infrastructure is in place, Prometheus will pick up the metrics automatically. Then Newman tests against Prometheus and you’re good to go! You can check out the Prometheus API in your browser by going to http://yourprometheusurl.local:9090/api/v1/query. Here is an example of querying for the SFTPGO app: http://yourprometheusurl.local:9090/api/v1/query?query=up{job="sftpgo"}
Here are the logs of my Newman test pod, verifying that SFTPGO is up and running. If the test succeeds, the Newman pod gets a completed status. If the Newman test fails, you get a non-zero exit code and the pod ends in an errored state.
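If you want to see what such a test boils down to, here is the same kind of check sketched in plain Python instead of Postman/Newman. It builds the instant-query URL, reads the JSON that /api/v1/query returns, and turns the result into an exit code. The host name is the placeholder used above, and the function names are made up for this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def all_targets_up(response_text):
    """Return True when the query succeeded and every returned sample is 1."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        return False
    results = body["data"]["result"]
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return bool(results) and all(sample["value"][1] == "1" for sample in results)


def check(base_url, promql='up{job="sftpgo"}'):
    """Query Prometheus and return 0 when the job is up, 1 otherwise."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return 0 if all_targets_up(resp.read().decode("utf-8")) else 1
```

Passing the result of check("http://yourprometheusurl.local:9090") to sys.exit() in a test pod gives the same completed or errored behavior described above. Note that an empty result counts as a failure here, since a missing target is usually as bad as a down one.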
A Whole World Awaits
Prometheus can be an incredibly powerful tool. It can also just be a simple monitoring and alerting solution if that’s all you want it to be. This article just scratches the surface of Prometheus and how to use it. With Kubernetes and containerization becoming mainstream, having a powerful tool like Prometheus on your side is the quickest way to succeed!
Far too often I’ve seen monitoring, performance, and testing become an afterthought. I’m a firm believer in test-driven development, and there’s no reason it can’t be applied to IaC. If you start by building your metrics and tests, I think you would be surprised how the rest of the pieces just fall into place.
Collecting data is the key to the future, and the more data you have, the more options you have. Prometheus, the Titan god of forethought, is best known for defying the gods by stealing fire from them and giving it to humanity in the form of knowledge. In this case, it comes in the form of metrics.