From Spinnaker – Monitoring Spinnaker: SLA Metrics

By March 11, 2020Blog

Originally posted on the Spinnaker Community blog, by Rob Zienert, Sr Software Engineer @ Netflix

Long, long ago, in an internet that I barely remember, I wrote about monitoring Orca. I haven’t managed to take the time to write another post about a specific service — it’s a lot of work! Instead of going deep this time around, I want to paint with broader strokes: What are the key metrics we can track that help quickly answer the question, “Is Spinnaker healthy?”

Spinnaker is comprised of about a dozen open source services that may vary widely based on configuration, and as such, there’s no singular metric to rule them all. This makes the question, “Is Spinnaker healthy?” a particularly bothersome question since not all services are equally important. If Igor — the service that is responsible for monitoring CI/SCM systems — is unable to communicate with Jenkins, Spinnaker will be in a degraded state, but its core behavior is still healthy. Should Orca’s queue processing drop to zero, however, it’s time to have an elevated heart rate and quick remedy.

Service Metrics

The Service Level Indicators for our individual services can vary depending on configuration. For example, Clouddriver has cloud provider-specific metrics that should be tracked in addition to its core metrics. For the sake of this post’s length, I won’t be going into any cloud-specific metrics.

Universal Metrics

All Spinnaker services are RPC-based, and as such, the reliability of requests inbound and outbound are supremely important: If the services can’t talk to each other reliably, someone will be having a poor experience.

For each service, a controller.invocations metric is emitted, which is a PercentileTimer including the following tags:

  • status: The HTTP status code family, 2xx, 3xx, 4xx...
  • statusCode: The actual HTTP status code value, 204, 302, 429...
  • success: If the request is considered successful. There’s nuance here in the 4xx range, but 2xx and3xx are definitely all successful, whereas 5xx definitely are not
  • controller: The Spring Controller class that served this request
  • method: The Spring Controller method name, NOT the HTTP method

Similarly, each service also emits metrics for each RPC client that is configured via okhttp.requests. That is, Orca will have a variety of metrics for its Echo client, as well as its Clouddriver client. This metric has the following tags:

  • status: The HTTP status code family, 2xx, 3xx, 4xx...
  • statusCode: The actual HTTP status code value, 204, 302, 429...
  • success: If the request is considered successful
  • authenticated: Whether or not the request was authenticated or anonymous (if Fiat is disabled, this is always false)
  • requestHost: The DNS name of the client. Depending on your topology, some services may have more than one client to a particular service (like Igor to Jenkins, or Orca to Clouddriver shards).
Example of our 24/7 request fanout from Gate. One interesting tidbit: The sudden increase in traffic at 9am is the increased traffic to Clouddriver (bottom) from Chaos Monkey starting its daily light mayhem!

Having SLOs — and consequentially, alerts — around failure rate (determined via the succcess tag) and latency for both inbound and outbound RPC requests is, in my mind, mandatory across all Spinnaker services.

As a real world example, the alert Netflix uses for Orca to all of its client services is:

nf.cluster,orca-main.*,:re,
name,okhttp.requests,:eq,:and,
status,(,Unknown,5xx,),:in,:and,
statistic,count,:eq,:and,
:sum,
(,nf.cluster,),:by,
0.2,:gt,3,
:rolling-count,3,:ge

So, for people who can’t read Atlas expressions, if we have more than 0.2 failing/unknown RPS to a specific service over 3 minutes, we’ll get an alert.

Service-specific Metrics

Most of our services have an additional metric to judge operational health, but in/out RPC monitoring will go far if you’re just starting out.

  • Echo
    echo.triggers.count tracks the number of CRON-triggered pipeline executions fired. This value should be pretty steady, so any significant deviation is an indicator of something going awry (or the addition/retirement of a customer integration).
    echo.pubsub.messagesProcessed is important if you have any PubSub triggers. Your mileage may vary, but Netflix can alert if any subscriptions drop to zero for more than a few minutes.
  • Orca
    task.invocations.duration tracks how long individual queue tasks take to execute. While it is a Timer, for an SLA Metric, its count is what’s important. This metric’s value can vary widely, but if it drops to zero, it means Orca isn’t processing any new work, so Spinnaker is dead in the water from a core behavior perspective.
  • Clouddriver: Each cloud provider is going to emit its own metrics that can help determine health, but two universal ones I recommend tracking are related to its cache.
    cache.drift tracks cache freshness. You should group this by agent and region to be granular on exactly what cache collection is falling behind. How much lag is acceptable for your org is up to you, but don’t make it zero.
    executionCount tracks the number of caching agent executions and combined with status , we can track how many specific caching agents are failing at any given time.
Here, one collection for a specific AWS service in our largest region was getting stale. In this case, while AWS availability was fine for Clouddriver, Edda was having trouble refreshing.
It’s OK that there are failures in agents: As stable as we like to think our cloud providers are, it’s still another software system and software will fail. Unless you see sustained failure, there’s not much to worry about here. This is often an indicator of a downstream cloud provider issue.
  • Igor
    pollingMonitor.failed tracks the failure rate of CI/SCM monitor poll cycles. Any value above 0 is a bad place to be, but is often a result of downstream service availability issues such as Jenkins going offline for maintenance.
    pollingMonitor.itemsOverThreshold tracks a polling monitor circuit breaker. Any value over 0 is a bad time, because it means the breaker is open for a particular monitor and it requires manual intervention.

Product SLAs at Netflix

We also track specific metrics as they pertain to some of our close internal customers. Some customers care most about latency reading our cloud cache, others have strict requirements in latency and reliability of ad-hoc pipeline executions.

In addition to tracking our own internal metrics for each customer, we also subscribe to our customers’ alerts against Spinnaker. If internal metrics don’t alert us of a problem before our customers are aware something is wrong, we at least don’t want to wait for our customers to tell us.

Continued Observability Improvements

Since Spinnaker is such a large, varied system, blog posts such as these are fine, but really are meant to get the wheels turning on what could be possible. It also highlights a problem with Spinnaker today: A lack of easily discoverable operational insights and knobs. No one should have to rely on a core contributor to distill information like this into a blog post!

There’s already been a start to improving automated service configuration property documentation, but something similar needs to be started for metrics and matching admin APIs as well. A contribution that documents metrics, their tags, purpose and related alerts would be of huge impact to the project and something I’d be happy to mentor on and/or jumpstart.

Of course, if you want to get involved in improving Spinnaker’s operational characteristics, there’s a Special Interest Group for that. We’d love to see you there!