From Jenkins – WebSocket

By Blog, Project

Originally posted on the Jenkins blog by Jesse Glick

I am happy to report that JEP-222 has landed in Jenkins weeklies, starting in 2.217. This improvement brings experimental WebSocket support to Jenkins, available when connecting inbound agents or when running the CLI. The WebSocket protocol allows bidirectional, streaming communication over an HTTP(S) port.

While many users of Jenkins could benefit, implementing this system was particularly important for CloudBees because of how CloudBees Core on modern cloud platforms (i.e., running on Kubernetes) configures networking. When an administrator wishes to connect an inbound (formerly known as “JNLP”) external agent to a Jenkins master, such as a Windows virtual machine running outside the cluster and using the agent service wrapper, until now the only option was to use a special TCP port. This port needed to be opened to external traffic using low-level network configuration. For example, users of the nginx ingress controller would need to proxy a separate external port for each Jenkins service in the cluster. The instructions to do this are complex and hard to troubleshoot.

Using WebSocket, inbound agents can now be connected much more simply when a reverse proxy is present: if the HTTP(S) port is already serving traffic, most proxies will allow WebSocket connections with no additional configuration. The WebSocket mode can be enabled in agent configuration, and support for pod-based agents in the Kubernetes plugin is coming soon. You will need an agent version 4.0 or later, which is bundled with Jenkins in the usual way (Docker images with this version are coming soon).

Another part of Jenkins that was troublesome for reverse proxy users was the CLI. Besides the SSH protocol on port 22, which again was a hassle to open from the outside, the CLI already had the ability to use HTTP(S) transport. Unfortunately the trick used to implement that confused some proxies and was not very portable. Jenkins 2.217 offers a new -webSocket CLI mode which should avoid these issues; again you will need to download a new version of jenkins-cli.jar to use this mode.
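As a rough illustration of the new mode, the sketch below assembles the command line for a CLI call over WebSocket. The `-s` and `-webSocket` options and the `who-am-i` command come from the Jenkins CLI; the helper name `jenkins_cli_command` and the server URL are hypothetical and only serve the example.

```python
from typing import List


def jenkins_cli_command(server_url: str, cli_args: List[str],
                        jar_path: str = "jenkins-cli.jar") -> List[str]:
    """Build an argv list that runs the Jenkins CLI in WebSocket mode.

    Hypothetical helper for illustration; it only assembles arguments
    and does not start a process.
    """
    # WebSocket mode reuses the regular HTTP(S) port, so an http(s) URL is required.
    if not server_url.startswith(("http://", "https://")):
        raise ValueError("server_url must be an http:// or https:// URL")
    return ["java", "-jar", jar_path, "-s", server_url, "-webSocket", *cli_args]


# Example: ask the (assumed) server which user the CLI is authenticated as.
command = jenkins_cli_command("https://jenkins.example.com/", ["who-am-i"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` once a `jenkins-cli.jar` downloaded from a 2.217 or later server is available.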

The WebSocket code has been tested against a sample of Kubernetes implementations (including OpenShift), but it is likely that some bugs and limitations remain, and scalability of agents under heavy build loads has not yet been tested. Treat this feature as beta quality for now and let us know how it works!

From Jenkins – Atlassian’s new Bitbucket Server integration for Jenkins

Originally posted on the Jenkins blog by Daniel Kjellin

We know that for many of our customers Jenkins is incredibly important and its integration with Bitbucket Server is a key part of their development workflow. Unfortunately, we also know that integrating Bitbucket Server with Jenkins wasn’t always easy – it may have required multiple plugins and considerable time. That’s why earlier this year we set out to change this. We began building our own integration, and we’re proud to announce that v1.0 is out.

The new Bitbucket Server integration for Jenkins plugin, which is built and supported by Atlassian, is the easiest way to link Jenkins with Bitbucket Server. It streamlines the entire set-up process, from creating a webhook to trigger builds in Jenkins, to posting build statuses back to Bitbucket Server. It also supports smart mirroring and lets Jenkins clone from mirrors to free up valuable resources on your primary server.

Our plugin is available to install through Jenkins now. Watch this video to find out how, or read the BitBucket Server solution page to learn more about it.

Once you’ve tried it out we’d love to hear any feedback you have. To share it with us, visit https://issues.jenkins-ci.org and create an issue using the component atlassian-bitbucket-server-integration-plugin.

Screwdriver: Introducing Queue Service

Pritam Paul, Software Engineer, Verizon Media

We have recently made changes to the underlying Screwdriver Architecture for build processing. Previously, the executor-queue was tightly-coupled to the SD API and worked by constantly polling for messages at specific intervals. Due to this design, the queue would block API requests. Furthermore, if the API crashed, scheduled jobs might not be added to the queue, causing cascading failures.

Hence, keeping the principles of isolation-of-concerns and abstraction in mind, we designed a more resilient REST-API-based queueing system: the Queue Service. This new service reads, writes and deletes messages from the queue after processing. It also encompasses the former capability of the queue-worker and acts as a scheduler.

Authentication

The SD API and Queue Service communicate bidirectionally using signed JWT tokens sent via auth headers of each request.
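The exchange can be pictured with a minimal sketch that signs and verifies a JWT using only the Python standard library. It assumes HMAC-SHA256 (HS256) with a shared secret and a `Bearer` authorization header purely to stay self-contained; the signing algorithm, claim names, and header scheme used by the actual services are not stated here and may differ.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Return an HS256-signed token (assumed algorithm, for illustration)."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check signature and expiry, then return the claims."""
    pieces = token.split(".")
    if len(pieces) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = pieces
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims


def auth_header(token: str) -> dict:
    # The Bearer scheme is an assumption for this sketch.
    return {"Authorization": f"Bearer {token}"}
```

Each side would call `sign_jwt` before sending a request and `verify_jwt` on every request it receives, which is what makes the communication bidirectional.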

Build Sequence

image

Design Document

For more details, check out our design spec.

Using Queue Service

As a cluster admin, to configure using the queue as an executor, you can deploy the queue-service as a REST API using a screwdriver.yaml and update configuration in SD API to point to the new service endpoint:

# config/default.yaml
ecosystem:
    # Externally routable URL for the User Interface
    ui: https://cd.screwdriver.cd

    # Externally routable URL for the Artifact Store
    store: https://store.screwdriver.cd

    # Badge service (needs to add a status and color)
    badges: https://img.shields.io/badge/build-{{status}}-{{color}}.svg

    # Internally routable FQDNS of the queue service
    queue: http://sdqueuesvc.screwdriver.svc.cluster.local

executor:
    plugin: queue
    queue: ''

For more configuration options, see the queue-service documentation.

Compatibility List

In order to use the new workflow features, you will need these minimum versions:

  • UI – v1.0.502
  • API – v0.5.887
  • Launcher – v6.0.56
  • Queue-Service – v1.0.11

Contributors

Thanks to the following contributors for making this feature possible:

Questions and Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Screwdriver: Recent Enhancements and Bug Fixes

Screwdriver Team from Verizon Media

UI

Previously, users could not start builds during a freeze window unless they made changes to the freeze window setting in the screwdriver.yaml configuration. Now, you can start a build by entering a reason in the confirmation modal. This can be useful for users needing to push out an urgent patch or hotfix during a freeze window.

image
image

Store

  • Feature: Build cache now supports local disk-based cache in addition to S3 cache.

Queue Worker

  • Bugfix: Periodic build timeout check
  • Enhancement: Prevent re-enqueue of builds from same event.

Compatibility List

In order to have these improvements, you will need these minimum versions:

  • UI – v1.0.479
  • API – v0.5.835
  • Store – v3.10.3
  • Launcher – v6.0.42
  • Queue-Worker – v2.9.0

Contributors

Thanks to the following contributors for making this feature possible:

Questions and Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Screwdriver: Improvements and Fixes

Part 2 from the Screwdriver Team at Verizon Media

UI
  • Enhancement: Upgrade to node.js v12.
  • Enhancement: Users can now link to custom test & coverage URL via metadata.
  • Enhancement: Reduce number of API calls to fetch active build logs.
  • Enhancement: Display proper title for Commands and Templates pages.
  • Bug fix: Hide “My Pipelines” from Add to collection dialogue.
  • Enhancement: Display usage stats for a template.

image

API

Store

Compatibility List

In order to have these improvements, you will need these minimum versions:

  • UI – v1.0.491
  • API – v0.5.851
  • Store – v3.10.5

Contributors

Thanks to the following contributors for making this feature possible:

Questions and Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Screwdriver: Build cache – Disk Strategy

Screwdriver can now cache and restore files and directories from your builds using either S3 or disk-based storage. All other cache functionality remains the same; only a new storage option has been added. Please DO NOT USE this cache feature to store any SENSITIVE data or information.

The graphs below compare build-cache performance on our internal Screwdriver instance for the disk-based strategy versus AWS S3.

Build cache – get cache – (disk strategy)

image

Build cache – get cache – (s3)

image

Build cache – set cache – (disk strategy)

image

Build cache – set cache – (s3)

image

Why disk-based strategy?

Our cache analysis showed two problems with S3: most of the time was spent pushing data from the build to S3, and the cache push sometimes failed when the cache was large (for example, over 1 GB). We therefore simplified storage by adopting a disk cache strategy that uses a filer or storage mount as the disk. Each cluster has its own filer or storage disk mount.

NOTE: If a cluster becomes unavailable and the requested cache does not exist in the new cluster, the cache is rebuilt once as part of the build.

Cache Size: 

Max size limit per cache is configurable by Cluster admins.

Retention policy:

Cluster admins are responsible for enforcing the retention policy.

Cluster Admins:

A Screwdriver cluster admin can specify the cache storage strategy along with other options such as compression, MD5 check, and maximum cache size in MB.
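To make those options concrete, here is a minimal sketch of a disk-based cache with optional compression, an MD5 check, and a size limit. It is not Screwdriver's implementation: the function names, file names (`cache.tar`, `cache.md5`), and parameters are hypothetical and simply mirror the options listed above.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def set_cache(source_dir: Path, cache_dir: Path, *, compress: bool = True,
              md5check: bool = True, max_size_mb: float = 1024) -> Path:
    """Archive source_dir into cache_dir (for example, a filer mount)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / ("cache.tar.gz" if compress else "cache.tar")
    with tarfile.open(archive, "w:gz" if compress else "w") as tar:
        for path in sorted(source_dir.rglob("*")):
            # Store paths relative to the cached directory.
            tar.add(path, arcname=str(path.relative_to(source_dir)), recursive=False)
    # Enforce the admin-configured size limit on the stored archive.
    if archive.stat().st_size > max_size_mb * 1024 * 1024:
        archive.unlink()
        raise ValueError(f"cache exceeds the {max_size_mb} MB limit")
    if md5check:
        (cache_dir / "cache.md5").write_text(md5_of(archive))
    return archive


def get_cache(cache_dir: Path, target_dir: Path, *, compress: bool = True,
              md5check: bool = True) -> bool:
    """Restore the cache into target_dir; return False on a miss or bad checksum."""
    archive = cache_dir / ("cache.tar.gz" if compress else "cache.tar")
    if not archive.exists():
        return False
    if md5check:
        checksum_file = cache_dir / "cache.md5"
        if not checksum_file.exists() or checksum_file.read_text() != md5_of(archive):
            return False
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz" if compress else "r") as tar:
        tar.extractall(target_dir)
    return True
```

A `False` return from `get_cache` corresponds to the case in the note above: the build proceeds without the cache and rebuilds it once.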

Reference: 

  1. https://github.com/screwdriver-cd/screwdriver/blob/master/config/default.yaml#L280
  2. https://github.com/screwdriver-cd/executor-k8s-vm/blob/master/index.js#L336
  3. Issue: https://github.com/screwdriver-cd/screwdriver/issues/1830

Compatibility List:

In order to use this feature, you will need these minimum versions:

Contributors:

Thanks to the following people for making this feature possible:

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Spinnaker: 1.18 Release Introduces Spinnaker Community Stats

Author: Spinnaker Steering Committee (Travis Tomsu, Software Engineer, Google)

The Spinnaker community has grown significantly since launching as an open source project in 2015. The project maintainers increasingly look for ways to help the community better understand how Spinnaker is used, and to help contributors prioritize future improvements.

Today, feature development is guided by industry experts, community discussions, Special Interest Groups (SIGs), and events like the recently held Spinnaker Summit. In August 2019, the community published an RFC, which proposed the tooling that will enable everyone to make data-driven decisions based on product usage across all platforms. We encourage Spinnaker users to continue providing feedback, and to review and comment on the RFC.

Following on from this RFC, the Spinnaker 1.18 release includes an initial implementation of statistics collection capabilities that are used to collect generic deployment and usage information from Spinnaker installations around the world. Before going into the details, here are some important facts to know:

  • No personally identifying information (PII) is collected or logged.
  • The implementation was reviewed and approved under the Linux Foundation’s Telemetry Data Collection and Usage Policy.
  • All stats collection code is open source and can be found in the Spinnaker stats, Echo, and Kork repos on GitHub.
  • Users can disable statistics collection at any time through a single Halyard command.
  • Community members that want to work with the underlying dataset and/or dashboard reports can request and receive full access.

This feature exists in the Spinnaker 1.18 release, but is disabled by default while we finalize testing of the backend and fine-tune report dashboards. The feature will be enabled by default in the Spinnaker 1.19 release (scheduled for March 2020).

All data will be stored in a Google BigQuery database, and report dashboards will be publicly available from the Community Stats page. Community members can request access to the collection data.

Data collected as part of this effort allows the entire community to better monitor the growth of Spinnaker, understand how Spinnaker is used “in the wild”, and prioritize feature development across a large community of Spinnaker contributors. Thank you for supporting Spinnaker and for your help in continuing to make Spinnaker better!

From Spinnaker – April’s Spinnaker Gardening #CommunityHack is Going Virtual!

Originally posted on the Armory blog, by Rosalind Benoit

Guess what?! Our Hackathon is going fully online! “Spinnaker Gardening Days #CommunityHack” happens in one month, and we’re gearing up for an international open-source work-from-home extravaganza! Via Zoom, Slack, and Github, we’ll empower you to move the needle on continuous delivery projects. Teams will hack, newcomers will train, and champions will share Spinnaker secrets. Click here to register and get your free tickets for the hackathon, training track, lunchtime learnings, or all three.

Join other Spinnaker users and companies to learn and let your skills shine at this collaborative event. We’ll address open-source feature requests, extend the ecosystem, and have lots of fun. Thanks to our generous sponsor Salesforce, all logged-in participants will score prizes, premium swag, and lunch on us! Hack through the workday, or check out our noontime lightning talks. Visit the Spinnaker Gardening repository for the schedule and details.

The Armory Tribe celebrates the support of Salesforce and, in particular, Edgar Magana, a Spinnaker champion and Cloud Operations Architect. We recently sat down to discuss the Ops SIG, modeling and standardizing Spinnaker, and his ideas for hackathon projects. Read the full article here.

A relative newcomer to the Spinnaker community, but a veteran in matters of cloud computing, networking, and OSS projects like OpenStack, Edgar recently founded the Operations SIG (Special Interest Group). Just as he recognized that “the community needed a place to discuss how to operate Spinnaker better,” he also urges us to jump-start the Spinnaker community. He’s recommended improvements to the contributor experience, and persuaded Salesforce to sponsor this first-ever Spinnaker hackathon.

Of course, we touched on his most pressing open-source Spinnaker initiatives in our chat. Next up? Gather a team! 

“We really want to come to the hackathon with goals, and to put extra motivation for folks to address them as a community,” Edgar says of the sponsorship.

From the Salesforce and Ops SIG perspective, Edgar has two feature stories to focus on at the hackathon:

  • “Run any OSS source code scanning software against Spinnaker microservices, and you’ll find a number of vulnerabilities in the libraries that Spinnaker leverages. We’d like to minimize and solve those as much as possible.” 
    • I’m pumped about this one because a) in many instances, this is a low-barrier-to-entry task that newer contributors can make a huge dent in, and b) every ops freak knows that fixing OSS dependencies is probably the most important security measure we touch. 
  • “Cloud driver scalability is another key initiative in progress. The dynamic account system works, but performance can be improved drastically for those using a large system with 800-1000 Kubernetes accounts. There was a bugfix in 1.17, but it still takes lots of time for clouddriver to cache new accounts, and this means a long startup time.”
    • Edgar would like to see new accounts dynamically appended to the cache instead of triggering another cache of all accounts, and has been collaborating with Armory engineers on a solution. Another excellent project goal for Community Gardening!

Here on Armory’s Community team, we second Edgar’s suggestion to make Spinnaker more “beginner-friendly” and welcoming to new contributors. Our top goals for the first half of 2020 revolve around improving the contributor experience, from promoting issue triage in SIGs, to creating and organizing documentation around Spinnaker development environment, release cycle, and contribution guidelines so that newcomers know where to find answers and how to get started. Expect to see a contributor experience project from us at the hackathon!

In the meantime, the Plugin Framework for Spinnaker that Armory and Netflix are building is maturing fast. This work will make Spinnaker more welcoming to contributors in another way: it provides clear extension points in the codebase, along with an easy way to load extensions to a running Spinnaker instance. With the Spinnaker Gardening Days, we want to encourage you to build extensions. Moreover, we know that many teams using Spinnaker in production have already built custom tooling around it; we’re encouraging those teams to leverage the plugin framework to quickly share their work with the OSS community (sounds like a stellar hackathon project!). We’re better together, and with a widely adopted project like Spinnaker, you can feel sure that paying it forward will reap big dividends for you and your organization. Check out the Plugin Creators Guide and Plugin Users Guide to learn more!

Calling Edgar and all other incredible Spinnaker developers: it’s time to add your fantastic Spinnaker Gardening ideas to the Project Ideas Wiki, create a slack channel for your project, and start prepping for the most exciting online event of 2020! Don’t forget to register here and reserve your ticket : )

Learn more in the spinnaker-hackathon/gardening README

From Spinnaker – Monitoring Spinnaker: SLA Metrics

Originally posted on the Spinnaker Community blog, by Rob Zienert, Sr Software Engineer @ Netflix

Long, long ago, in an internet that I barely remember, I wrote about monitoring Orca. I haven’t managed to take the time to write another post about a specific service — it’s a lot of work! Instead of going deep this time around, I want to paint with broader strokes: What are the key metrics we can track that help quickly answer the question, “Is Spinnaker healthy?”

Spinnaker is composed of about a dozen open source services that may vary widely based on configuration, and as such, there’s no singular metric to rule them all. This makes the question, “Is Spinnaker healthy?” a particularly bothersome question since not all services are equally important. If Igor, the service that is responsible for monitoring CI/SCM systems, is unable to communicate with Jenkins, Spinnaker will be in a degraded state, but its core behavior is still healthy. Should Orca’s queue processing drop to zero, however, it’s time to have an elevated heart rate and a quick remedy.

Service Metrics

The Service Level Indicators for our individual services can vary depending on configuration. For example, Clouddriver has cloud provider-specific metrics that should be tracked in addition to its core metrics. For the sake of this post’s length, I won’t be going into any cloud-specific metrics.

Universal Metrics

All Spinnaker services are RPC-based, and as such, the reliability of inbound and outbound requests is supremely important: If the services can’t talk to each other reliably, someone will be having a poor experience.

For each service, a controller.invocations metric is emitted, which is a PercentileTimer including the following tags:

  • status: The HTTP status code family, 2xx, 3xx, 4xx...
  • statusCode: The actual HTTP status code value, 204, 302, 429...
  • success: If the request is considered successful. There’s nuance here in the 4xx range, but 2xx and 3xx are definitely all successful, whereas 5xx definitely are not.
  • controller: The Spring Controller class that served this request
  • method: The Spring Controller method name, NOT the HTTP method

Similarly, each service also emits metrics for each RPC client that is configured via okhttp.requests. That is, Orca will have a variety of metrics for its Echo client, as well as its Clouddriver client. This metric has the following tags:

  • status: The HTTP status code family, 2xx, 3xx, 4xx...
  • statusCode: The actual HTTP status code value, 204, 302, 429...
  • success: If the request is considered successful
  • authenticated: Whether or not the request was authenticated or anonymous (if Fiat is disabled, this is always false)
  • requestHost: The DNS name of the client. Depending on your topology, some services may have more than one client to a particular service (like Igor to Jenkins, or Orca to Clouddriver shards).

Example of our 24/7 request fanout from Gate. One interesting tidbit: The sudden increase in traffic at 9am is the increased traffic to Clouddriver (bottom) from Chaos Monkey starting its daily light mayhem!

Having SLOs (and consequently, alerts) around failure rate (determined via the success tag) and latency for both inbound and outbound RPC requests is, in my mind, mandatory across all Spinnaker services.

As a real world example, the alert Netflix uses for Orca to all of its client services is:

nf.cluster,orca-main.*,:re,
name,okhttp.requests,:eq,:and,
status,(,Unknown,5xx,),:in,:and,
statistic,count,:eq,:and,
:sum,
(,nf.cluster,),:by,
0.2,:gt,3,
:rolling-count,3,:ge

So, for people who can’t read Atlas expressions, if we have more than 0.2 failing/unknown RPS to a specific service over 3 minutes, we’ll get an alert.
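For readers who prefer code to stack expressions, the tail of that alert (`0.2,:gt,3,:rolling-count,3,:ge`) can be approximated in a few lines of Python. This is a sketch of the logic only, not Atlas itself: it assumes one datapoint per minute for a single cluster, and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def rolling_count_alert(failing_rps: Sequence[float], threshold: float = 0.2,
                        window: int = 3, min_count: int = 3) -> List[bool]:
    """For each datapoint, report whether the alert would be firing.

    A datapoint breaches when it exceeds `threshold`; the alert fires when
    at least `min_count` of the last `window` datapoints breached.
    """
    breaches = [1 if value > threshold else 0 for value in failing_rps]
    alerts = []
    for index in range(len(breaches)):
        start = max(0, index - window + 1)
        alerts.append(sum(breaches[start:index + 1]) >= min_count)
    return alerts


# Failing or unknown requests per second, one value per minute (made-up data).
samples = [0.1, 0.3, 0.5, 0.4, 0.1, 0.6]
print(rolling_count_alert(samples))
# → [False, False, False, True, False, False]
```

With a window of 3 and a minimum count of 3, a single quiet minute resets the alert, which is why only the fourth sample fires.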

Service-specific Metrics

Most of our services have an additional metric to judge operational health, but in/out RPC monitoring will go far if you’re just starting out.

  • Echo
    • echo.triggers.count tracks the number of CRON-triggered pipeline executions fired. This value should be pretty steady, so any significant deviation is an indicator of something going awry (or the addition/retirement of a customer integration).
    • echo.pubsub.messagesProcessed is important if you have any PubSub triggers. Your mileage may vary, but Netflix can alert if any subscriptions drop to zero for more than a few minutes.
  • Orca
    • task.invocations.duration tracks how long individual queue tasks take to execute. While it is a Timer, for an SLA Metric, its count is what’s important. This metric’s value can vary widely, but if it drops to zero, it means Orca isn’t processing any new work, so Spinnaker is dead in the water from a core behavior perspective.
  • Clouddriver: Each cloud provider is going to emit its own metrics that can help determine health, but two universal ones I recommend tracking are related to its cache.
    • cache.drift tracks cache freshness. You should group this by agent and region to be granular on exactly what cache collection is falling behind. How much lag is acceptable for your org is up to you, but don’t make it zero.
    • executionCount tracks the number of caching agent executions and, combined with status, we can track how many specific caching agents are failing at any given time.
    Here, one collection for a specific AWS service in our largest region was getting stale. In this case, while AWS availability was fine for Clouddriver, Edda was having trouble refreshing.
    It’s OK that there are failures in agents: As stable as we like to think our cloud providers are, it’s still another software system and software will fail. Unless you see sustained failure, there’s not much to worry about here. This is often an indicator of a downstream cloud provider issue.
  • Igor
    • pollingMonitor.failed tracks the failure rate of CI/SCM monitor poll cycles. Any value above 0 is a bad place to be, but is often a result of downstream service availability issues such as Jenkins going offline for maintenance.
    • pollingMonitor.itemsOverThreshold tracks a polling monitor circuit breaker. Any value over 0 is a bad time, because it means the breaker is open for a particular monitor and it requires manual intervention.

Product SLAs at Netflix

We also track specific metrics as they pertain to some of our close internal customers. Some customers care most about latency reading our cloud cache, others have strict requirements in latency and reliability of ad-hoc pipeline executions.

In addition to tracking our own internal metrics for each customer, we also subscribe to our customers’ alerts against Spinnaker. If internal metrics don’t alert us of a problem before our customers are aware something is wrong, we at least don’t want to wait for our customers to tell us.

Continued Observability Improvements

Since Spinnaker is such a large, varied system, blog posts such as these are fine, but really are meant to get the wheels turning on what could be possible. It also highlights a problem with Spinnaker today: A lack of easily discoverable operational insights and knobs. No one should have to rely on a core contributor to distill information like this into a blog post!

There’s already been a start to improving automated service configuration property documentation, but something similar needs to be started for metrics and matching admin APIs as well. A contribution that documents metrics, their tags, purpose and related alerts would be of huge impact to the project and something I’d be happy to mentor on and/or jumpstart.

Of course, if you want to get involved in improving Spinnaker’s operational characteristics, there’s a Special Interest Group for that. We’d love to see you there!

From Spinnaker – Future of SRE: Robert Keng Builds a DeploymentBot #withSpinnaker

Originally posted on the Spinnaker Community blog, by Rosalind Benoit

Coming soon from Chime to OSS, a software delivery chatbot which uses Slack to deploy apps via Spinnaker

Last month I had the pleasure of chatting with Robert Keng, a Lead SRE at Chime, about a Slack-integrated ChatBot he recently built to facilitate lightweight, direct deployments for developers. Chime’s continuous delivery is based on Spinnaker, driven with signal-based GitOps. Via pipelines, merged release branches are auto-deployed from a continuous integration (CI) solution, through QA to production with no human interaction.

However, it hasn’t always been this way; Chime has roots in a legacy build environment, largely for Ruby-on-Rails development. It’s based on configuration management tools such as Salt, and thus not containerized, but pointed at long-lived infrastructure. So, containerization formed an important milestone in Chime’s continuous delivery adoption. Luckily, according to Robert, its high-trust, growth-minded culture and workflows have supported the evolution.

Chime’s culture also provides flexibility that highlights Spinnaker’s power to accelerate digital transformation. Robert explains that, in some instances, it makes sense for developers to deploy straight to a test environment, bypassing CI. When adding a small feature to a mobile app, for example, I might want to bypass CI wait time to deploy and experiment with behavior (raise your hand if you’ve built an app and never done that…didn’t think so!)

Meeting Chime devs where they’re @

“We’re cutting the straight-to-prod patch fix deployments down to zero,” Robert clarifies, and he’s done it by creating a flexible system with Spinnaker that models Chime’s culture of trust. At any time, if the devs he enables would rather execute commands in Slack to deploy branches to environments of their choosing, they can. Robert has created a tool that allows them that agency, while empowering them to address complex use cases, for example, adding logic into the Slack commands to deploy dynamic environments into different Kubernetes clusters. In production, “If we need to scale customers on the Z-axis, and build multiple app versions with different backends to target different service providers” as deployment targets, with Spinnaker, Chime can. Robert points out:

“Spinnaker offers a lot of agility in that respect. It would be hard to accommodate gitOps and chatOps in the same place without it.”

In a prime example of the opportunities to solve that Spinnaker provides as a platform, Robert has created a golden path which allows Chime’s teams to iterate in a safe environment. To create it, Robert analyzed workflows as they are and designed an alternative workflow that mapped what he observed in Spinnaker. This, combined with the auto-deploy strategy, tells the story, written in pipelines, of how Chime engineers deliver software. This way, as an SRE, he can rely on automated guardrails for safety regardless of the deployment path. As Kelsey Hightower says, it “serializes the culture in tools” in a way that’s seamless, painless, and purposefully abstracted.

Because at the end of the day, it’s not about the tools. It’s about your story, which in Chime’s case, is all about changing the way people feel about banking. What products and services do you delight your customers with? What’s your story? You can tell it #withSpinnaker

One DeploymentBot, Headed for OSS Spinnaker

The tool, in a multi-service design, has a component which handles the request/response communication with Slack, a frontend that leverages Okta user groups to control who can access Spinnaker, and a Python backend which processes the request data in batches. This architecture evolved from using webhooks to, at Armory’s suggestion, using client certs for faster authentication, and from a monolith version to microservices, because of constraints encountered in the bot’s development. The top constraint: the Slack Events API’s requirement that requests arising from message actions receive a response within 3 seconds.

This constraint presented challenges in actions like querying Vault for certificates to authenticate against Spinnaker, and even in token exchange with Slack. Breaking the chatbot into pieces allowed Robert to create a responsive, extensible service to deliver a full-featured experience for Chime devs. “It’s turned into a monster,” he grins. “I have tons of feature requests for additional functionality already” (because his devs love using it).
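One common way to live within a response deadline like this is to acknowledge the request immediately and do the slow work in the background. The sketch below shows that pattern with a worker thread and a queue. It is not Robert’s bot: the class name, the request shape, and the `slow_deploy` function are hypothetical stand-ins for steps such as fetching certificates from Vault and calling Spinnaker.

```python
import queue
import threading
import time
from typing import Callable, List


class DeferredHandler:
    """Acknowledge requests right away and process them on a worker thread."""

    def __init__(self, process: Callable[[dict], str]) -> None:
        self._process = process
        self._jobs: "queue.Queue[dict]" = queue.Queue()
        self.results: List[object] = []
        threading.Thread(target=self._run, daemon=True).start()

    def handle(self, request: dict) -> dict:
        # Only enqueue here, so the acknowledgement returns well within the deadline.
        self._jobs.put(request)
        return {"status": 200, "text": "Deployment request received"}

    def _run(self) -> None:
        while True:
            request = self._jobs.get()
            try:
                self.results.append(self._process(request))
            except Exception as error:  # keep the worker alive on failures
                self.results.append(error)
            finally:
                self._jobs.task_done()

    def wait(self) -> None:
        """Block until every queued request has been processed."""
        self._jobs.join()


def slow_deploy(request: dict) -> str:
    time.sleep(0.2)  # stands in for slow calls to external systems
    return f"deployed {request['branch']} to {request['env']}"


handler = DeferredHandler(slow_deploy)
ack = handler.handle({"branch": "feature-x", "env": "qa"})
handler.wait()
print(ack["text"], "|", handler.results[0])
```

In a real bot the worker would post its result back to the chat channel once it finishes; splitting the acknowledging component from the processing backend follows the same idea at the service level.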

Next steps for Robert’s Bot include developing it against the entire Spinnaker API to leverage all features available, and adding more dynamic capability. He wants to enable devs to use the bot to deal with existing pipelines and executions, and adjust parameters and other configuration via a scripted payload directly from Slack.

Another important next step? Open-sourcing the DeploymentBot! Robert’s very busy with projects right now (read more below), but I’ll hook him up with support from Armory engineers, if needed, to help get this invention to the masses.

The Future of Site Reliability, Platforms, and DevOps Engineering

As he describes his plans for the Bot, we start talking about the myth of NoOps. I have my own words about the opportunities and fallacies of Dev + Ops, but here, Robert’s voice speaks for itself:

“My team isn’t DevOps, it’s SRE (Site Reliability Engineering). DevOps is just part of what we do. As tech stacks mature, we’re seeing less dependency on direct hardware interaction, but that doesn’t mean the management complexity goes away; it actually gets worse. Here’s an easy example: We have this awesome thing called Kubernetes. Given config maps and secrets, where is the source of truth? Ask anyone in the community, and they’ll say, ‘Umm…build it yourself!’ I know Hashicorp released a sidecar method to inject values, but none of that is complete. This is why there’s a lot of custom work in the community, and companies are building their own mutating webhook controllers, for example, which is what we’re doing. You can’t buy this stuff, because it doesn’t exist.

We have our own way of injecting Vault secrets which 100% bypasses Kubernetes stuff, because we can’t version it, and we can’t manage it from any source of truth, as it’s scattered across 1000 namespaces. It’s impossible to manage in one place. So in our environment, we put everything in Vault, whether it’s configuration, or secrets. That gives us a common interface to code against. In V1, we’re using init containers, which is exactly what Hashicorp’s sidecar does. In V2, depending on the environment, we’ll grab values from different Vault clusters, since storing production and non-production values in the same place is just, suicide. You’ll get a huge ban hammer from your security team, and no-one wants that.

So we’re building, and we’re operating it at the same time. And are developers ever going to touch these [tools]? No! There are a lot of these instances in Kubernetes where things just don’t exist, so what do you do? Same thing for EC2, and ECS even. Then, moving into Knative, and Lambdas, and serverless computing and functions, it’s even worse. It’s a free-for-all. We’re designing our own framework.

The next thing we’re looking at is building plugins that will plug in our code, and use Spinnaker to deploy it [on that infra]. I heard Armory is working on something similar for deploying Lambdas, and I’m desperately waiting, because it’s going to make my life easier. Functions in general are kind of useless. The ecosystem around them is more important; you’ve got to think about API gateways, API management, queues, load balancers, etc. How do I wrap that into a sane framework where we can consistently build, integrate, test, and deploy? I don’t want to use 10 different ways to do the same thing. I’d rather just have everything work in Spinnaker.”

Then we start talking about making that happen. I tell Robert about the Community Gardening Days I’m planning for Spinnaker this Spring (keep your eyes peeled! Announcement forthcoming on Spinnaker.io and social), and he gets psyched about Chime’s involvement. Music to my ears!

Look out for more articles from me on the Spinnaker developer and contributor experience. I’ll shine a light on the way Open Source Heroes like Robert are getting into the ecosystem as they enable the delivery of software products and services. Hang on, the latest industrial revolution (where software truly changes the freaking world for the better!) is just taking off.

Please share this on Twitter, LinkedIn, and HackerNews and give Robert some glory : )