Spinnaker's cache system... and some of its pain points

Contributed by: Jason McIntosh | Originally posted on spinnaker.io

If you’ve not heard of Spinnaker, it’s a continuous delivery platform under the CD Foundation (which is under the Linux Foundation). More information on it is available at https://spinnaker.io. This deployment platform has a couple of key systems that it uses to do deployments as well as show cloud information. Let’s talk about two of them:

The orchestration layer “Orca”, which talks to Clouddriver. This is primarily a “read” from the CATS system in Clouddriver. Note that Kubernetes deploys DO NOT use the CATS system but make live calls on deploys. That said, Orca still calls Clouddriver to do deployments and invoke cloud operations.
The cache engine “CATS”, aka cache-all-the-stuff. This cache system loads data from all the known accounts, updates the cache, and listens for requests for updates.

This post is to discuss the “CATS” system a bit more in depth, why it was built, how it operates… and why it’s the MOST painful aspect of how Spinnaker operates today.

For a VERY simple overview, the Armory docs page still has some good information around the cache system. DISCLOSURE: I work for Harness, which acquired Armory’s assets including Spinnaker resources, teams and tech related to Armory!

And before we continue… if you want to see how to install spinnaker, deploy your app and ask some questions, please join the spinnaker workshop we’re having soon!

Cache all the things!

First, realize that the CATS system is a core piece that runs inside of Clouddriver and is the primary system by which Spinnaker knows about various server groups, tasks, images, and associated infrastructure. The way it operates is FAIRLY simple. Each account in Spinnaker adds a set of “agents”. Each agent is usually associated with a location and resource. E.g. account12345/us-west-2/LoadBalancingAgent might be an agent. Most agents are automatically run on a regular cycle. The agent’s instructions are roughly: “Read load balancers in us-west-2 for account 12345 and refresh the cache with this data. Then schedule to run and load data again 30 seconds after the last run completes”. This provides a continuous, near real-time view of all data for all accounts that Spinnaker knows about.
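To make that cycle concrete, here is a minimal sketch of one agent’s loop. This is NOT Clouddriver’s actual code: the names CachingAgent and run_agent, the dictionary cache and the injected sleep function are all hypothetical, chosen only to illustrate the “load, refresh the cache, wait 30 seconds after the run completes” pattern.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachingAgent:
    """Hypothetical stand-in for one CATS agent (account + region + resource)."""
    account: str
    region: str
    resource: str
    load: Callable[[], Dict[str, dict]]  # performs the cloud API reads
    interval_seconds: float = 30.0

    @property
    def name(self) -> str:
        return f"{self.account}/{self.region}/{self.resource}"


def run_agent(agent: CachingAgent, cache: Dict[str, Dict[str, dict]],
              cycles: int, sleep: Callable[[float], None] = time.sleep) -> None:
    """Refresh this agent's slice of the cache, then wait before the next run.

    The wait starts only after the load finishes, so a slow load delays
    the next cycle rather than overlapping with it.
    """
    for _ in range(cycles):
        cache[agent.name] = agent.load()  # replace the slice with fresh data
        sleep(agent.interval_seconds)


# Usage: a fake loader and a no-op sleep so the sketch runs instantly.
agent_cache: Dict[str, Dict[str, dict]] = {}
agent = CachingAgent("account12345", "us-west-2", "LoadBalancingAgent",
                     load=lambda: {"lb-1": {"state": "active"}})
run_agent(agent, agent_cache, cycles=2, sleep=lambda seconds: None)
print(agent_cache)
```

Every agent pays for its own API reads on every cycle whether or not anything changed, which is where the scaling trouble described next comes from.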

This is ALSO where Spinnaker struggles at scale. That’s a LOT of data, and a LOT of API calls. MOST cloud providers rate limit you heavily on how many API calls you can do for a given user. Spinnaker WILL spawn a new user for these calls through role assumption. But we still often run into rate limits, as each AWS endpoint is rate limited at slightly different levels and per call. Armory documented configuration to control these rate limits long ago because this was such an issue. This is one of the most challenging areas to control and manage, and since each agent runs in its own “thread” and has to be scheduled, the more accounts you have, the more Clouddriver grows AND the more resources it needs to run all of these agents.

At this point in time, there isn’t a workaround. If you have 1500 AWS accounts, you end up with ~15 agents per AWS account PER region, OR 22,500 agents needed to run at a time for just one region. IF you add ECS, that adds another 5–6 agents per account per region. Add Lambda, and that adds some more agents. This gets to the point where you can need 100,000 agents running at a time across 4–5 regions for each account. This requires LARGE amounts of resources for caching and storage as well as network throughput. VERY few orgs hit this scale, but it’s a major scaling issue in Spinnaker that large orgs hit. I’ve seen Spinnaker installs use 24xlarge instances and destroy them JUST for cache operations of AWS accounts.
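The arithmetic behind those numbers is just a product of three factors. The sketch below uses the approximate per-account figures from above (the function name agents_needed is made up for illustration): 1500 accounts at ~15 agents gives 22,500 for one region, 90,000 for four regions and 112,500 for five, and adding 5 ECS agents per account per region pushes five regions to 150,000.

```python
def agents_needed(accounts: int, regions: int, agents_per_account_region: int) -> int:
    """Total agents = accounts x regions x agents per account per region."""
    return accounts * regions * agents_per_account_region


AWS_AGENTS = 15  # approximate per-account, per-region figure from the text

one_region = agents_needed(1500, 1, AWS_AGENTS)
four_regions = agents_needed(1500, 4, AWS_AGENTS)
five_regions = agents_needed(1500, 5, AWS_AGENTS)
with_ecs = agents_needed(1500, 5, AWS_AGENTS + 5)  # ECS adds another 5-6

print(one_region, four_regions, five_regions, with_ecs)
```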

The additional challenge is that each Clouddriver pod defaults to running 100 agents at a time when using SQL. Universally this gets scaled to a higher number but you’re STILL going to hit limits, either database OR network limits, on how many you can concurrently run. This means it takes longer and longer to run all of these agents, and you can’t run them all at the same time in a viable manner. Further, the existing working schedulers have a problem: they select up to a “max concurrent agents” and randomize WHICH agents get run. This means the more agents you have, the slower those agents run, and the more likely it is to take a LONG time for changes to be detected in any given account.
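A rough way to see why random selection hurts is to compare it with a strict rotation. The sketch below is a deliberately simplified model, NOT the real scheduler: it assumes every agent is always eligible and that each cycle draws a uniform random batch of “max concurrent agents”. Under that assumption, with 22,500 agents and batches of 100, a strict rotation reaches every agent within 225 cycles, while a random draw leaves any given agent unpicked after 225 cycles roughly 37% of the time, and still about 5% of the time after three times that long.

```python
import math


def prob_never_picked(total_agents: int, max_concurrent: int, cycles: int) -> float:
    """Chance one given agent is skipped in every one of `cycles` random draws."""
    p_pick = min(1.0, max_concurrent / total_agents)
    return (1.0 - p_pick) ** cycles


def round_robin_worst_case(total_agents: int, max_concurrent: int) -> int:
    """Cycles a strict rotation needs before every agent has run once."""
    return math.ceil(total_agents / max_concurrent)


total, batch = 22_500, 100
full_pass = round_robin_worst_case(total, batch)
for multiple in (1, 3):
    cycles = full_pass * multiple
    print(cycles, round(prob_never_picked(total, batch, cycles), 3))
```

The long tail is the point: the average wait looks similar, but some accounts go unrefreshed for much longer than a full pass.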

What I hear from most orgs with AWS/ECS accounts is that they’ll “shard” Spinnaker long before they get to the point where a single database can’t process all of the available accounts. This is both complicated AND more work, and it doesn’t reduce the overall required resources. It can make things easier to manage, however.

This problem around agents, the cache operations and how they work is one of the CORE challenges to truly scaling Spinnaker and making it work. There ARE solutions in places, but let’s talk about some past design decisions and why this operates the way it does.

History and why CATS

First, remember that multi-account wasn’t a big deal in AWS for a LONG time. Most orgs would have a few production accounts that various teams would deploy into. This meant you really only had a few hundred agents to run, and they’d have a lot of data, but you’d work with AWS to allow these and tune the rate limits for each type of resource, region and config for those accounts. The users would hit these APIs TOO hard if they went directly to them instead of using a cache system. Netflix introduced Edda to act as a cache proxy around AWS APIs, but only their internal version was a viable product. The OSS version never seemed to be as usable by the rest of the community to help with these cache operations. That said, Spinnaker DOES still support Edda in the codebase, but I don’t know of anyone outside of Netflix who might use/support this integration.

Next, consider that there is NOT a good way to get data on all the accounts/regions/state without multiple API calls. Even when logging into the AWS console, finding this information often requires you to drill down to specific resources. Getting a holistic view of the state of an application is rather problematic. IF you want to load data for an application that runs in 5 regions and check the state of its health checks, that is a LOT of API calls to do on demand for all of those services from a control plane. That’s why the cache exists, and it DOES help concurrency on those APIs. FURTHER, and most importantly: if those calls take 5 minutes to complete, what do you show on the UI until those calls complete? Ignoring that issue (which in theory you can do with some clever UI “loading” wheels), Spinnaker was designed for a lot of users hitting the cache to reduce real cloud API calls because of the “single account” focus. It wasn’t as well designed for LOTS of accounts with FEW API calls to those accounts.

What are the solutions?

The short answer

There currently isn’t a good out-of-the-box solution.

The longer answer?

Various orgs have implemented workarounds for this system.

Kubernetes

Kubernetes doesn’t deploy USING CATS, but it does use the CATS system to show the UI and clusters view. Kubernetes control planes often have different API restrictions than cloud providers, allowing Spinnaker to query/load information without rate limit issues. Further, Spinnaker only has ONE agent per cluster vs. multiples for other accounts. The Kubernetes agent spawns internal threads for each namespace it loads, but it’s a very different system and doesn’t have QUITE the same issues.

This means for “CATS”, Kubernetes doesn’t hit the same issues the other agents hit. That said, there are STILL issues with “polling” for data. There’s still the issue of a CATS agent per account, so if you have lots of clusters that’s lots of agents. Further, the threads can get large to cache that data depending upon namespaces, configuration and number of objects. And Spinnaker CAN hit the Kubernetes cluster APIs HARD to load/fetch this information. It’s also on a cycle, so with large numbers of namespaces it’s slow to refresh information. LAST, it forks out to kubectl to get the data and is slow as a result. THIS one in particular is annoying and we hope to get it fixed sooner rather than later.

STILL, Kubernetes has the best solution thus far: bypassing CATS for deploys, but still using the CATS system for UI and visualization of resources.

Agent approach:

Armory made an “attempt” for Kubernetes to use a “watcher” to listen for changes and operate on a distributed operations queue. This used a remote agent to do all cache AND deployment operations. It used watchers to handle things instead of polling and offloaded all operations for that account from Clouddriver to these remote agents. To an extent, this worked, but at this point it is mostly a dead project, though I know it’s still tested/used internally.

A “real” caching implementation

Adding true on-demand caching (NOT to be confused with the “on-demand” internal system in CATS, for which I’ll write a separate blog post at a future date) would be a potentially cleaner alternative, though there are challenges here. Salesforce did a presentation a while back about how, in their fork of Spinnaker, they only load data for accounts when the APIs for those accounts are requested, which drastically reduces the load of the cache system as a whole. Salesforce has been great at contributing to Spinnaker and is one of the leading OSS contributors to the platform! We’re still waiting on this contribution, but there are a lot of challenges with various forks and integration of those forks into a project as complex as Spinnaker, so we don’t know if/when we’ll see this in the main project.
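For a feel of what “load only when requested” means, here is a minimal sketch. To be clear, this is NOT Salesforce’s implementation (which I haven’t seen the code for); the class OnDemandAccountCache, the loader callback and the 60 second TTL are all assumptions made for illustration. The idea is simply that an account nobody asks about never costs an API call.

```python
import time
from typing import Callable, Dict, List, Tuple


class OnDemandAccountCache:
    """Hypothetical cache that loads an account only when someone asks for it."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, account: str) -> dict:
        now = self._clock()
        entry = self._entries.get(account)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: no cloud API call
        data = self._loader(account)  # first request or stale entry
        self._entries[account] = (now, data)
        return data


# Usage: the fake loader records every "cloud API call" it is asked to make.
calls: List[str] = []


def fake_loader(account: str) -> dict:
    calls.append(account)
    return {"account": account, "serverGroups": []}


account_cache = OnDemandAccountCache(fake_loader, ttl_seconds=60.0)
account_cache.get("account12345")
account_cache.get("account12345")  # served from the cache, no second load
print(calls)
```

The trade-off is the one raised earlier: the first request for a cold account has to wait on real API calls, so the UI needs something to show in the meantime.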

An alternative is one that Home Depot wrote: a Go-based Clouddriver that PARTIALLY implements the Clouddriver APIs. Click here for some great details on this implementation and why this path. Unfortunately, this doesn’t handle everything. BUT it is a lot closer to the probable “ideal” long-term state of the way Spinnaker should interact with clouds. There’s no cache system that I’ve seen; it’s all real time calls to the various Kubernetes APIs when things are requested. The challenge is it’s JUST Kubernetes, and doesn’t support other things that Clouddriver provides, e.g. artifact handling/fetching like git cloning for baking manifests via helm/kustomize/etc.

Informer/Watch solutions

IF you’re not familiar, Kubernetes has solutions around detecting changes efficiently. This serves essentially the exact same purpose as the “CATS” system. However, Kubernetes does this via a more efficient solution than the “poll for state, generate a delta, and update the cache” system that Spinnaker uses. Kubernetes informers are designed to track changes based upon resource versions or bookmarks. Further, they’re integrated into the SDKs and are more efficient at detecting and operating on changes in Kubernetes because of the way the Kubernetes API supports these types of operations. They DO have disadvantages: because of the way they operate, they have LARGE memory/thread requirements, and require some understanding to avoid failures. There are major potential scale concerns, but we’ve ALSO heard of at least one company looking to contribute this kind of implementation at some point.
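The difference between the two styles can be sketched in a few lines. This is an illustration only: it doesn’t use the real Kubernetes client, and the flat event dictionaries are a simplified, made-up shape (real watch events nest the resource version inside the object’s metadata). The event type names ADDED, MODIFIED, DELETED and BOOKMARK are the ones the Kubernetes watch API uses. The poll style has to list everything and diff it against the cache, while the watch style applies one change at a time and remembers the version to resume from.

```python
from typing import Dict, List, Tuple


def poll_delta(cached: Dict[str, dict], fresh: Dict[str, dict]
               ) -> Tuple[Dict[str, dict], List[str]]:
    """Poll style: list everything, then diff the full listing against the cache."""
    upserts = {k: v for k, v in fresh.items() if cached.get(k) != v}
    removed = [k for k in cached if k not in fresh]
    return upserts, removed


def apply_watch_event(cache: Dict[str, dict], event: dict) -> str:
    """Watch style: apply one change and return the version to resume from."""
    kind = event["type"]
    if kind == "DELETED":
        cache.pop(event["name"], None)
    elif kind in ("ADDED", "MODIFIED"):
        cache[event["name"]] = event["object"]
    # A BOOKMARK carries no object, only a newer version to resume from.
    return event["resourceVersion"]


# Usage with made-up pods.
pod_cache = {"pod-a": {"phase": "Running"}, "pod-b": {"phase": "Pending"}}
listing = {"pod-a": {"phase": "Running"}, "pod-c": {"phase": "Running"}}
print(poll_delta(pod_cache, listing))

events = [
    {"type": "MODIFIED", "name": "pod-b", "object": {"phase": "Running"},
     "resourceVersion": "101"},
    {"type": "DELETED", "name": "pod-a", "resourceVersion": "102"},
    {"type": "BOOKMARK", "resourceVersion": "110"},
]
resume_from = "100"
for event in events:
    resume_from = apply_watch_event(pod_cache, event)
print(pod_cache, resume_from)
```

The cost of the poll style grows with the total number of objects, while the watch style grows with the number of changes, at the price of holding a long-lived connection and local state per watched resource.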

Here, you can see the watcher APIs. But there’s no current implementation available that uses this system.

What CAN be done?

First, there are several fixes for various operations being put into place. Let me discuss one KEY one first:

In the latest releases, YOU DO NOT need to cache Kubernetes at ALL to deploy anything. To repeat: CATS is NOT needed for deployments. That said, much of the view of the system DOES still require CATS today. If you want to get real time state of the various clusters your app might be deployed to, the CATS system is still POTENTIALLY useful. There is some debate on “is this really useful” but I personally still regularly use this feature.

Given the above, if caching is still useful, there are some things that will help.

Kubernetes: the move from the command line to an SDK will help, and we’re looking forward to some contributions on this. There are a couple of companies reporting some changes in how this works and we’ll hopefully see PRs around this over the next few months. Moving from kubectl command calls to SDKs for Kubernetes operations is reported to help performance greatly. This is on the radar for Kubernetes.

Kubernetes via an informer pattern would also help, but given the challenges in how these operate, it’s questionable how much advantage most orgs would get from this, since these informers essentially duplicate much of how the CATS system operates and may require more threads/CPU than the existing CATS system requires today.

ECS is getting a number of fixes right now due to bugs in how it queried for data. The first PR to fix one of these bugs is up here as an example, but ECS cache operations have some MAJOR issues when you have large numbers of accounts. They load ALL cache data for ALL accounts into memory THEN filter the data. I’ll be working on this shortly to help here, but it still won’t help that it polls for changes on a regular cycle.

AWS has similar problems to ECS with the NUMBER of agents by account/region that need addressing. It seems to NOT have some of the same performance issues in a few key areas though: it filters using lookup calls, though I think we could improve this system in several ways. BUT there isn’t any solution available TODAY that changes how AWS cache operations work to be more efficient. AWS does have an “event” system, but at last analysis, the events don’t have enough information to update the cache system and will still require cache operations to be run. E.g. this could change from a “30 seconds, go run” to “we detected a change, go run”. BUT making this change isn’t simple and STILL requires loading from the APIs. Further, in accounts with a LOT of changes going on this may not provide any significant advantages.
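As a sketch of what “we detected a change, go run” could look like, the snippet below decides when an agent is due. Nothing like this exists in Clouddriver today as far as I know: the RefreshTrigger class and its slow fallback poll (a safety net in case an event is missed) are assumptions for illustration. Note that the event only marks the agent as due; the agent still has to do its full load through the APIs, exactly as described above.

```python
import time
from typing import Callable, Optional


class RefreshTrigger:
    """Hypothetical trigger: run on a change event, with a slow poll as a safety net."""

    def __init__(self, fallback_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fallback = fallback_seconds
        self._clock = clock
        self._last_run: Optional[float] = None
        self._dirty = False

    def notify_change(self) -> None:
        """Called when a change event arrives for this account/region."""
        self._dirty = True

    def should_run(self) -> bool:
        if self._last_run is None or self._dirty:
            return True  # never ran, or something changed
        return self._clock() - self._last_run >= self._fallback

    def mark_ran(self) -> None:
        """Called after the agent finishes its full load from the APIs."""
        self._dirty = False
        self._last_run = self._clock()


# Usage: a quiet account is skipped until an event arrives.
trigger = RefreshTrigger(fallback_seconds=300.0)
trigger.mark_ran()
quiet = trigger.should_run()
trigger.notify_change()
after_event = trigger.should_run()
print(quiet, after_event)
```

In a busy account the dirty flag is set almost all the time, so this degrades to running on every cycle, which matches the caveat about accounts with a LOT of changes.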

Last, there’s work on a few alternative schedulers that operate on a more… consistent throughput basis. You still have the pure SCALE issue on number of accounts/agents, but there are some attempts to make the scheduling of those agents load data in a more consistent, deterministic manner.

What SHOULD be done?

First, we’d welcome PRs/thoughts around these issues! We’re seeing several companies contribute more regularly, but the more activity the better! The informer for Kubernetes, a possible change from the “30 second” poll to event based triggering, or similar may help a lot!

Next, it’d be great if cloud providers had better systems for these use cases. Kubernetes informers/watchers are a great example of a well done implementation. We need to have Spinnaker use these implementations if possible. OR possibly move away from the “poll all the time” approach to something similar to what Salesforce has: cache on demand, but do it in a more native way that scales. I’ve heard from other orgs where the informer/watcher pattern isn’t even scalable due to the sheer amount of resources that can change and cause these patterns to break. So we need to find new patterns/systems to meet the needs of these platforms.

Next, it’d be great if AWS/GCP provided listener/watcher endpoints equivalent to those in Kubernetes. At the least, better endpoints for pulling data to present the information in other ways. My ideal would be GraphQL OR perhaps a TRUE data feed system we could hook into. AWS Config in the past was NOT considered viable as it was also missing key pieces of data needed to update the CATS database. I doubt this will happen but we can ask! Lambda in particular in Spinnaker has to do ~7 different calls to find all the data to show and track Lambda state! Better APIs that consider these kinds of integrations would help Spinnaker and also help with MCP/AI work!

Third, change from the current systems to a queue approach, both for data in and for data out. Clouddriver already operates PARTIALLY this way… but the “scheduler” which runs the various agents needs some work. There are a few attempts at this, but none seem to have really fixed the issue (disclosure: I’m working on one now). Possibly moving to a distributed queue for operations work as well might help reduce the amount of work that Clouddriver does. This leads into the next area, which Netflix has apparently already done internally.

Last, and this is broader as a whole: is there an architectural change that should be made? Netflix seems to have moved to Temporal for a lot of cloud operations, and similarly, caching as a whole could move in this direction. BUT the project has to support both small AND large users efficiently while being simpler to operate and deploy. I hear rumblings that Temporal is an AMAZING tech, but orgs almost HAVE to pay to use it at scale due to the operational complexities of managing and scaling a Temporal installation. I could see a pub/sub approach to this that invokes via Temporal, or possibly remote agents as an alternative that allows a “selection” of backends for both small and larger orgs. There’s some contributed code that isn’t merged around Fargate bakes for Packer operations as an example. These kinds of changes would be great… but we need people to help contribute and review and bring these changes in! IF you’d like to contribute, please join Slack and hop over to the #dev channel. Always happy to talk about Spinnaker and ideas on how to improve it!
