Actionable Cloud Infrastructure Metrics

June 9, 2022 · 13 min read

Some Engineer

Understanding what's running in your cloud infrastructure is important for a number of reasons—for example, security, compliance, and cost.

But sometimes, the cloud feels more like a black box that you're feeding with cash, and in turn it performs the work that makes your business run.

Even those spinning up cloud resources might only be aware of their small slice of the pie. With hundreds of thousands of interconnected resources, it is really hard to know what's going on!

Cloud inventory has become a new type of technical debt, where organizations lose track of their infrastructure and how it relates to the business. Resoto helps to break open the aforementioned black box and eliminate inventory debt.

Resoto provides a searchable snapshot of the current state of your cloud infrastructure, and can automatically react to state changes. Resoto also allows you to aggregate and visualize this data.

Here's an example of a heatmap that allows you to immediately see outliers (like when an account suddenly starts using a large number of expensive, high-core-count instances):

Instance use heatmap

We can ingest this aggregated data into a time series database, such as Prometheus. This information can then be used to build diagrams illustrating cloud resources (e.g., compute instances and storage) over time.

Metrics Overview

This allows you to alert on trends—for example, if you are projected to exceed a quota or spend limit.

Another use case is to quickly identify anomalies using the 3σ rule. If cloud API credentials are leaked or an automated system goes haywire, you would immediately see the spike instead of receiving an unpleasant surprise on your next cloud bill. Best of all, it works across multiple clouds and accounts!
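To make the 3σ rule a bit more tangible, here's a minimal Python sketch. It isn't part of Resoto; the function name `is_anomaly` and the instance counts are made up for illustration. It flags a new value if it lies more than three standard deviations away from the mean of recent history.

```python
from statistics import mean, stdev


def is_anomaly(history, latest):
    """Return True if `latest` deviates from the mean of `history` by more than 3 sigma."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        # A perfectly flat history: any change at all is an outlier.
        return latest != mu
    return abs(latest - mu) > 3 * sigma


# Hypothetical hourly counts of running instances (illustrative numbers only).
history = [40, 42, 41, 39, 40, 43, 41, 40]

print(is_anomaly(history, 41))   # an ordinary value
print(is_anomaly(history, 120))  # a sudden spike
```

In practice you would likely express the same idea as a query against the time series database rather than in application code, but the arithmetic is the same.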

Resoto comes with a handy metrics component, Resoto Metrics, which takes aggregation results and exports them to Prometheus. This blog post describes how to define your own metrics, write some PromQL queries and build a simple metrics dashboard using Resoto Metrics, Prometheus, and Grafana.

Concepts and Terminology

If you are already familiar with graph and time series databases, metrics, samples, labels, Prometheus, and Grafana, please feel free to skip ahead. For those new to the cloud-native metrics ecosystem, let's get some concepts and terminology out of the way!

Collect

Resoto creates an inventory of your cloud infrastructure by storing the metadata of your cloud resources inside of a graph. This is what we call the collect step.

Each resource (e.g., compute instance, storage volume, security group, etc.) is represented by a graph node. Nodes are connected via edges.

Edges represent the relationship between two nodes, like so (please excuse my MS Paint skills):

Graph visualization

A node is essentially an indexed JSON document containing the metadata of a resource. The aws_ec2_instance from the graph picture above would look something like this:

{
  "reported": {
    "kind": "aws_ec2_instance",
    "id": "i-07c9d738469b966d0",
    "tags": {
      "owner": "lukas"
    },
    "name": "wes-scaletesting-bootstrap",
    "ctime": "2020-06-16T15:08:45Z",
    "instance_cores": 4,
    "instance_memory": 16,
    "instance_type": "t2.xlarge",
    "instance_status": "running",
    ...
  },
  ...
}
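If the node-and-edge model feels abstract, here's a small Python sketch of the idea. It is not how Resoto stores data internally. The node IDs, the simplified kinds (`cloud`, `account`, `region`), and the `ancestors` helper are assumptions for illustration, and it assumes each node has a single parent.

```python
# Nodes keyed by a made-up ID; each holds a "reported" section like the JSON above.
nodes = {
    "cloud-1": {"reported": {"kind": "cloud", "name": "aws"}},
    "account-1": {"reported": {"kind": "account", "name": "eng-production"}},
    "region-1": {"reported": {"kind": "region", "name": "us-west-2"}},
    "instance-1": {
        "reported": {
            "kind": "aws_ec2_instance",
            "id": "i-07c9d738469b966d0",
            "instance_cores": 4,
        }
    },
}

# Edges as (parent, child) pairs.
edges = [
    ("cloud-1", "account-1"),
    ("account-1", "region-1"),
    ("region-1", "instance-1"),
]


def ancestors(node_id):
    """Walk the edges upwards and collect the name of every ancestor, keyed by kind."""
    parent_of = {child: parent for parent, child in edges}
    found = {}
    current = parent_of.get(node_id)
    while current is not None:
        reported = nodes[current]["reported"]
        found[reported["kind"]] = reported["name"]
        current = parent_of.get(current)
    return found


print(ancestors("instance-1"))
# {'region': 'us-west-2', 'account': 'eng-production', 'cloud': 'aws'}
```

Walking up the edges like this is what lets a single instance be attributed to its region, account, and cloud.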

Search

Among other things, Resoto allows you to search this metadata. Here's an example:

> search is(aws_ec2_instance) and instance_cores > 4
kind=aws_ec2_instance, id=i-065af67d77cd5a272, name=16ca1.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-019f3f3a2a8d1990e, name=16ca2.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-0667dc8de49a4319e, name=16ca3.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-076b9763c755a9b51, name=16ca4.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-074fcfe526f95c9fd, name=16ca5.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-04e09d3c714048c4d, name=16ca6.prod1, instance_cores=16, age=3yr2mo, cloud=aws, account=eng-production, region=us-west-2
kind=aws_ec2_instance, id=i-0d2dfda13e02b2b20, name=16ca7.prod1, instance_cores=16, age=2yr9mo, cloud=aws, account=eng-production, region=us-west-2
...

The search returned a list of all EC2 instances with more than 4 cores. There are times when you may not be interested in the details of individual resources, but simply want to aggregate them. You may want to know the total number of resources, or the number of running resources of a particular kind. You may be interested in the distribution of compute instances by instance type (e.g., m5.large, m5.2xlarge, etc.), or the current cost of compute and storage grouped by team.

Aggregation

Aggregating and grouping search results creates the samples of a metric.

> search aggregate(/ancestors.cloud.reported.name as cloud, /ancestors.account.reported.name as account, /ancestors.region.reported.name as region, instance_type as type, instance_status as status: sum(1) as instances_total): is(instance)
group:
  cloud: aws
  account: eng-production
  region: us-west-2
  type: m5.xlarge
  status: running
instances_total: 13
---
group:
  cloud: aws
  account: eng-production
  region: us-west-2
  type: m5.4xlarge
  status: stopped
instances_total: 7
...

This is useful, but the ability to compare current values to those from an hour, day, month, year, etc. ago would be even more useful. This brings us to the next concept, time series.

Time Series

Time series databases such as Prometheus do not store details of individual resources, but aggregated data over time—allowing us to query aggregate data and create charts to visualize the results.

In the aggregated search above, each result is what Prometheus calls a sample. A sample is a single value at a specific point in time.

Looking again at the same example, cloud, account, region, type, and status in each group are labels. Labels are key: value pairs that allow us to group samples.
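Here's a rough Python sketch of how grouping turns individual resources into labeled samples. It only mimics what the aggregation does conceptually. The `instances` list, the `aggregate` and `render` helpers, and the metric name passed in are assumptions for illustration.

```python
from collections import Counter

# Hypothetical, already-flattened search results: one dict per instance.
instances = [
    {"cloud": "aws", "account": "eng-production", "region": "us-west-2",
     "type": "m5.xlarge", "status": "running"},
    {"cloud": "aws", "account": "eng-production", "region": "us-west-2",
     "type": "m5.xlarge", "status": "running"},
    {"cloud": "aws", "account": "eng-production", "region": "us-west-2",
     "type": "m5.4xlarge", "status": "stopped"},
]

LABELS = ("cloud", "account", "region", "type", "status")


def aggregate(resources):
    """Count resources per unique label combination (the equivalent of sum(1))."""
    return Counter(tuple(r[label] for label in LABELS) for r in resources)


def render(metric, samples):
    """Render each group as a `name{labels} value` line."""
    lines = []
    for values, count in sorted(samples.items()):
        labels = ", ".join(f'{key}="{value}"' for key, value in zip(LABELS, values))
        lines.append(f"{metric}{{{labels}}} {count}")
    return lines


for line in render("resoto_instances_total", aggregate(instances)):
    print(line)
```

Each unique combination of label values becomes its own sample, which is why adding a label with many distinct values multiplies the number of series.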

Prometheus has basic graphing capabilities, but Grafana allows you to build a dashboard visualizing data from different sources in a variety of chart styles, like this stacked line chart:

Instance cost over time

So here's the plan. First, we will configure Prometheus to fetch data from Resoto Metrics. Then, we will query that data inside Prometheus. After that, we will look at where Resoto gets its metrics configuration and how to define our own metrics. Finally, we will use Grafana to create a simple dashboard and visualize the data.

Getting Started

If you are new to Resoto, start the Resoto stack and configure it to collect your cloud accounts.

To check out the data Resoto Metrics generates, open https://localhost:9955/metrics in your browser (replacing localhost with the IP address or hostname of the machine where resotometrics is running). This data is updated whenever Resoto runs the collection workflow. You should see output similar to this:

List of metrics

That is the raw metrics data Prometheus will ingest. If you are using our Docker stack, you do not have to do anything, because Prometheus is already pre-configured. If you are using your own Prometheus installation, configure it to scrape this metrics endpoint. The config will look something like this:

prometheus.yml
scrape_configs:
  - job_name: "resotometrics"
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ["localhost:9955"]

Instead of skipping verification of the TLS certificate, you can also download the Resoto CA certificate and configure Prometheus to use it.
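If you're curious what a scraper has to do with the raw text from the metrics endpoint, here's a simplified Python sketch that parses a few lines of it. The excerpt in `RAW` is a hypothetical example, not real output, and `parse_metrics` is a made-up helper that handles only the common `name{labels} value` shape.

```python
import re

# metric name, optional {labels}, a value, and an optional trailing timestamp
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical excerpt of a metrics page (illustrative values only).
RAW = """\
# HELP resoto_instances_total Number of Instances
# TYPE resoto_instances_total gauge
resoto_instances_total{cloud="aws",account="eng-production",region="us-west-2",type="m5.xlarge",status="running"} 13.0
resoto_instances_total{cloud="aws",account="eng-production",region="us-west-2",type="m5.4xlarge",status="stopped"} 7.0
"""


def parse_metrics(text):
    """Return a list of (metric name, labels dict, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a shape this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(RAW):
    print(name, labels["type"], labels["status"], value)
```

Prometheus does all of this for you on every scrape. The sketch is only meant to show that the endpoint serves plain text with one sample per line.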

Querying a Metric

Open up your Prometheus installation (in our Docker stack it is running at http://localhost:9090) and you should see the following:

Empty Prometheus

Let's start with a very simple expression:

resoto_instances_total

That's it, that's the query. If you have any instances collected in Resoto, the output will look something like this:

Prometheus listing metrics

Here is one of those metrics from the list:

resoto_instances_total{cloud="aws", account="eng-production", region="us-west-2", status="running", type="m5.xlarge", instance="localhost:9955", job="resotometrics"} 17

The key="value" pairs inside the curly brackets are the labels mentioned previously. To filter by label, let us update the query to:

resoto_instances_total{status="running"}

Now we are only seeing compute instances that we are actually paying for at the moment. This information is a bit more interesting, but we could get the same from within the Resoto Shell. What would be really interesting is how the number of compute instances has changed over the last week or two.

Click on the Graph tab, choose a 2w period and click the Show stacked graph button.

Prometheus raw graph data

We are getting closer to what we'd like to see. But what are these speckles? Why aren't we seeing solid lines?

By default, Resoto collects data once per hour. Let's tell Prometheus to average the values over a one-hour window by changing the query to:

avg_over_time(resoto_instances_total{status="running"}[1h])

Prometheus summed metrics

Good, the data points are connected and averaged over time. However, the number of label combinations is a bit overwhelming. Right now we are seeing one stacked series per unique label combination. Let's reduce them by summing everything up.

sum(avg_over_time(resoto_instances_total{status="running"}[1h]))

Prometheus summed metrics

Nice, now we see how the total number of compute instances has changed over the last two weeks. However, we lost all labels: no more account, region, or instance type information. To get some information back, let's group the summed-up averages by account.

sum(avg_over_time(resoto_instances_total{status="running"}[1h])) by (account)

Prometheus summed metrics by account

Neat, we see how the number of compute instances has changed over time for each account.
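To see what the last query computes, here's a rough Python emulation of averaging each series over a time window and then summing the averages per account. It is a simplification of what Prometheus does (for example, in how window boundaries are handled). The `series` data and both helper functions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical series: (labels, [(unix_timestamp, value), ...])
series = [
    ({"account": "eng-production", "type": "m5.xlarge"}, [(0, 10), (1800, 12), (3600, 14)]),
    ({"account": "eng-production", "type": "m5.4xlarge"}, [(0, 4), (3600, 6)]),
    ({"account": "eng-staging", "type": "m5.xlarge"}, [(1800, 3), (3600, 5)]),
]


def avg_over_time(points, now, window):
    """Average all samples with now - window < timestamp <= now, or None if there are none."""
    values = [value for timestamp, value in points if now - window < timestamp <= now]
    return sum(values) / len(values) if values else None


def sum_by(all_series, label, now, window):
    """Sum the windowed averages of all series that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, points in all_series:
        average = avg_over_time(points, now, window)
        if average is not None:
            totals[labels[label]] += average
    return dict(totals)


print(sum_by(series, "account", now=3600, window=3600))
# {'eng-production': 19.0, 'eng-staging': 4.0}
```

The `type` label disappears from the result because it is not part of the grouping, which is the same effect `by (account)` has in the query.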

Want to see how storage has changed over time? Just change resoto_instances_total to resoto_volume_bytes. Want to see $$$ spent per hour? resoto_instances_hourly_cost_estimate is the metric you are looking for.

How Metrics Are Made

The Prometheus web UI provides syntax help and autocomplete for available metric names. However, you may be wondering how you are supposed to know which metrics exist in the first place, and where a metric such as resoto_instances_total is defined.

Metrics are defined in the resoto.metrics configuration. To edit metrics definitions, execute the following command in Resoto Shell:

> config edit resoto.metrics

resotometrics:
  metrics:
    instances_total:
      # Metric help text
      help: 'Number of Instances'
      # Aggregation search to run
      search: 'aggregate(/ancestors.cloud.reported.name as cloud, /ancestors.account.reported.name as account, /ancestors.region.reported.name as region, instance_type as type, instance_status as status: sum(1) as instances_total): is(instance)'
      # Type of metric (gauge or counter)
      type: 'gauge'
    ...
  ...

As described above, the aggregate expression in the search field is what creates the samples of a metric.

Metrics configuration can be updated at runtime. When the metrics workflow is run, Resoto Metrics will generate the new metric for Prometheus to consume.

> workflow run metrics
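The search string in a metric definition is long, so it can help to see its parts separately. This Python sketch assembles the same search from its pieces. `build_aggregate_search` is a hypothetical helper written for this post, not something Resoto ships.

```python
def build_aggregate_search(group_by, function, name, query):
    """Assemble 'aggregate(<path> as <label>, ...: <function> as <name>): <query>'."""
    groups = ", ".join(f"{path} as {label}" for path, label in group_by)
    return f"aggregate({groups}: {function} as {name}): {query}"


search = build_aggregate_search(
    # Each pair is (property path in the graph, label name in Prometheus).
    group_by=[
        ("/ancestors.cloud.reported.name", "cloud"),
        ("/ancestors.account.reported.name", "account"),
        ("/ancestors.region.reported.name", "region"),
        ("instance_type", "type"),
        ("instance_status", "status"),
    ],
    function="sum(1)",       # count one per matching resource
    name="instances_total",  # the name of the aggregated value
    query="is(instance)",    # which resources to aggregate
)

# The same three fields as in the configuration shown above.
metric_definition = {"help": "Number of Instances", "search": search, "type": "gauge"}
print(metric_definition["search"])
```

The grouping pairs become the labels, the function produces the sample value, and the trailing query decides which resources are counted.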

Creating a Metrics Dashboard

Now that we've learned how to get metrics from Resoto into Prometheus, query metrics, and define new metrics, we can create the dashboard.

Alright, fasten your seatbelts! This will go fast. 🏎️💨

Start the Grafana Docker container:

$ docker run -d -p 3000:3000 -v grafana-data:/var/lib/grafana -v grafana-etc:/etc/grafana grafana/grafana-oss

Open the Grafana web UI (e.g., http://localhost:3000).
Log in as admin with password admin and set a new password.
On the left, open Settings > Data Sources > Add Data Source > Prometheus.
In the URL field, enter the Prometheus URL (e.g., http://tsdb.docker.internal:9090).
Scroll down and click the Save & test button. Make sure that the result is "Data source is working".
Click the + button on the left, select Create Dashboard, and then click the Save button in the top menu bar.
Select Dashboard settings > Variables and click the Add variable button:
Enter cloud for Name, Cloud for Label, and label_values(cloud) for Query. Toggle Multi-value and Include All option to enable both selection options. Ensure that the Preview of values at the bottom displays the available clouds, then click the Update button.
Repeat the previous two steps (adding a variable and filling in its fields) with the following values:

Name	Label	Query	Multi-value	Include All option
`account`	`Account`	`label_values(account)`	✔️	✔️
`region`	`Region`	`label_values(region)`	✔️	✔️

Hit Esc on your keyboard to go back, then click Add new panel.

Copy the following into the text box to the right of Metrics browser > in the Query tab:

sum(avg_over_time(resoto_instances_total{cloud=~"$cloud", account=~"$account", region=~"$region", status="running"}[$__interval])) by (cloud, account)

"Time Series" should be selected in the dropdown at the top right. Configure the settings underneath as follows:

Setting	Value
Panel options > Title	`Instances Total - running`
Tooltip > Tooltip mode	All
Tooltip > Values sort order	Descending
Legend > Legend mode	Hidden
Graph styles > Line width	4
Graph styles > Fill opacity	40
Graph styles > Connect null values	Always
Graph styles > Stack series	Normal

Click the Save button, then the Apply button.

Setup of a new panel

You now have a functional dashboard!

Initial dashboard

Note: Don't forget to click the Save button any time you make changes to the dashboard!
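The `=~"$cloud"` matchers in the panel query deserve a quick explanation. As far as I understand it, Grafana joins the selected values of a multi-value variable into a regular expression alternation, and Prometheus anchors regex matchers at both ends. Here's a Python sketch of that behavior. The `interpolate` and `matches` helpers are made up, and Python's regex engine only approximates the one Prometheus uses.

```python
import re


def interpolate(selected):
    """Mimic how the selected values of a multi-value variable become a regex."""
    escaped = [re.escape(value) for value in selected]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def matches(label_value, pattern):
    """Emulate a =~ matcher: the whole label value has to match, not just a part of it."""
    return re.fullmatch(pattern, label_value) is not None


pattern = interpolate(["aws", "gcp"])
print(pattern)                        # (aws|gcp)
print(matches("aws", pattern))        # True
print(matches("aws-china", pattern))  # False, because the match is anchored
```

This is why the query uses `=~` rather than `=`: a plain equality matcher could not express "any of the selected clouds".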

Now, we'll add a second panel. Again, click Add new panel.

Copy the following into the text box to the right of Metrics browser > in the Query tab:
```
sum(avg_over_time(resoto_instances_total{cloud=~"$cloud", region=~"$region", account=~"$account", status="running"}[$__interval]))
```
Select Stat in the panel type dropdown at the top right. Then, configure the settings underneath as follows:

Setting	Value
Panel options > Title	`Instances $cloud - running`
Value options > Calculation	Last *
Stat styles > Color mode	None
Stat styles > Graph mode	None

Click the Save button, then the Apply button.

The dashboard now shows two panels: one showing the number of currently running instances, and another depicting the history of the number of instances.

The Final Product

If we repeat the above steps for all of the metrics present in the configuration, the result is a dashboard that looks like this:

Finished Dashboard

This is the actual production dashboard from a real Resoto user. 🥰

The dashboard shows the amount of compute and storage currently in use, as well as the associated cost. It also graphs volumes that are not in use and are pending cleanup by Resoto. This user also has dashboards for quota limits and network-related stats. Individual teams use these dashboards to monitor their own cloud usage: custom tags are exposed as Prometheus labels, which lets them filter by team or project.

This user even contributed their Grafana dashboard templates to our GitHub repository, so you don't have to create them yourself. But if you want to customize it, you now know how!

Install Resoto and build your own dashboard today! ✨
