Contributed by Benedikt Koller
Original article posted on May 17, 2020
No way around it: I am what you call an “Ops guy”. In my career I’ve admin’ed more servers than I’ve written code. Over twelve years in the industry have left their permanent mark on me. For the last two of those I’ve been exposed to a new beast – Machine Learning. My hustle is bringing Ops-Knowledge to ML. These are my thoughts on that.
Deploying software into production
Hundreds of thousands of companies deploy software into production every day. Every deployment mechanism has someone who built it. Whoever it was (The Ops Guy™, SRE-Teams, “Devops Engineers”), all follow tried-and-true paradigms. After all, the goal is to ship code often, in repeatable and reliable ways. Let me give you a quick primer on two of those.
Infrastructure-as-code (IaC)
Infrastructure as code, or IaC, applies software engineering rules to infrastructure management. The goal is to avoid environment drift, and to ensure idempotent operations. In plain words, read the infrastructure configuration and you’ll know exactly what the resulting environment looks like. You can rerun the provisioning without side effects, and your infrastructure has a predictable state. IaC allows for version-controlled evolution of infrastructures and quick provisioning of extra resources. It does so through declarative configurations.
Famous tools for this paradigm are Terraform, and to a large degree Kubernetes itself.
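To make "declarative" and "idempotent" concrete, here is a minimal sketch in Python. It is not how Terraform or Kubernetes are implemented; the resource names, the `DESIRED` config and the `plan`/`apply` helpers are all hypothetical and only illustrate the idea of comparing a declared state with the current one.

```python
from typing import Dict, List, Tuple

# Hypothetical declarative config: resource name -> desired settings.
DESIRED = {
    "web": {"replicas": 3, "image": "web:1.4"},
    "worker": {"replicas": 2, "image": "worker:2.0"},
}


def plan(desired: Dict[str, dict], current: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to move current to desired."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            steps.append(("create", name))
        elif current[name] != spec:
            steps.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: Dict[str, dict], current: Dict[str, dict]) -> Dict[str, dict]:
    """Execute the plan on a copy of the current state and return the result."""
    new_state = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


# A drifted environment: one outdated resource, one leftover.
drifted = {"web": {"replicas": 1, "image": "web:1.3"}, "old-job": {}}
state = apply(DESIRED, drifted)

# Rerunning against the converged state yields an empty plan: no side effects.
print(plan(DESIRED, state))
```

Reading `DESIRED` tells you what the environment looks like after `apply`, regardless of where it started.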
Immutable infrastructure
In conjunction with IaC, immutable infrastructure ensures the provisioned state is maintained. Someone ssh’ed onto your server? It’s tainted – you have no guarantee that it still is in the identical shape to the rest of your stack. Interaction between a provisioned infrastructure and new code happens only through automation. Infrastructure, e.g. a Kubernetes cluster, is never modified after it’s provisioned. Updates, fixes and modifications are only possible through new deployments of your infrastructure.
Operational efficiency requires thorough automation and handling of ephemeral data. Immutable infrastructure mitigates config drift and snowflake server woes entirely.
ML development
Developing machine learning models works in different ways. In a worst-case scenario, new models begin their “life” in a Jupyter Notebook on someone’s laptop. Code is not checked into git, there is no requirements file, and cells can be executed in any arbitrary order. Data exploration and preprocessing are intermingled. Training happens on that one shared VM with the NVIDIA K80, but someone messed with the CUDA drivers. Ah, and does anyone remember where I put those matplotlib-screenshots that showed the AUROC and MSE?
Getting ML models into production reliably, repeatedly and fast remains a challenge, and large data sets become a multiplying factor. The solution? Learn from our Ops-brethren.
We can extract key learnings from the evolution of infrastructure management and software deployments:
- Automate processing and provisioning
- Version-control states and instructions
- Write declarative configs
How can we apply them to an ML training flow?
Fetching data
Automate fetching of data. Declaratively define the datasource, the subset of data to use and then persist the results. Repeated experiments on the same source and subset can use the cached results.
Thanks to automation, fetching data can be rerun at any time. The results are persisted, so data can be versioned. And by reading the input configuration everyone can clearly tell what went into the experiment.
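A minimal sketch of this idea, assuming a JSON-serializable config and a local cache directory. The `fetch` helper, the config keys and the `fake_loader` stand-in for a real datasource are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def config_key(config: dict) -> str:
    """Derive a stable cache key from the declarative config."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def fetch(config: dict, cache_dir: Path, loader: Callable[[dict], List[dict]]) -> List[dict]:
    """Return the rows for this config, loading from the source only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{config_key(config)}.json"
    if target.exists():
        return json.loads(target.read_text())
    rows = loader(config)
    target.write_text(json.dumps(rows))  # persisted result, addressable by config
    return rows


calls = []


def fake_loader(config: dict) -> List[dict]:
    """Stand-in for a real datasource query (hypothetical)."""
    calls.append(config["source"])
    return [{"id": i} for i in range(config["limit"])]


config = {"source": "warehouse.events", "limit": 3}
cache = Path(tempfile.mkdtemp())
first = fetch(config, cache, fake_loader)
second = fetch(config, cache, fake_loader)  # served from the persisted result
```

Because the cache key is derived from the config itself, the same source and subset always map to the same persisted file, which is what makes the result versionable.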
Splitting (and preprocessing) data
Splitting data can be standardized into functions.
- Splitting happens on a quota, e.g. 70% into train, 30% into eval. Data might be sorted on an index, data might be categorized.
- Splitting happens based on features/columns. Data might be categorized, data might be sorted on an index.
- Data might require preprocessing / feature engineering (e.g. filling, standardization).
- A wild mix of the above.
Given those, we can define an interface and invoke processing through parameters – and use a declarative config. Persist the results so future experiments can warm-start.
Implementation of interfaces makes automated processing possible. The resulting train/eval datasets are versionable, and my input config is the declarative authority on the resulting state of the input dataset.
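As an illustration, a quota-based split driven by a declarative config could look like the sketch below. The config keys (`train_percent`, `shuffle`, `seed`) and the `split` function are assumptions made for this example, not part of any particular tool.

```python
import random
from typing import Iterable, List, Tuple

# Hypothetical declarative config for a quota-based split.
SPLIT_CONFIG = {"train_percent": 70, "shuffle": True, "seed": 42}


def split(rows: Iterable, config: dict) -> Tuple[List, List]:
    """Split rows into (train, eval) according to the config."""
    items = list(rows)
    if config.get("shuffle", False):
        # A fixed seed keeps the split reproducible across reruns.
        random.Random(config["seed"]).shuffle(items)
    cut = len(items) * config["train_percent"] // 100
    return items[:cut], items[cut:]


train, evaluation = split(range(10), SPLIT_CONFIG)
print(len(train), len(evaluation))
```

The same function can serve every experiment; only the config changes, and the config alone determines the resulting datasets.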
Training
Standardizing models is hard. Higher-level abstractions like TensorFlow and Keras already provide comprehensive APIs, but complex architectures need custom code injection.
A declarative config will, at least, state which version-controlled code was used. Re-runs on the same input will deliver the same results, re-runs on different inputs can be compared. Automation of training will yield a version-controllable artefact – the model – of a declared and therefore anticipatable shape.
Evaluation
Surprisingly, this is the hardest to fully automate. The dataset and individual use case define the required evaluation metrics. However, we can stand on the shoulders of giants. Great tools like TensorBoard and the What-If Tool go a long way. Our automation just needs to account for enough flexibility that a.) custom metrics for evaluation can be injected, and b.) raw training results are exposed for custom evaluation means.
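One way to leave room for custom metrics is to pass them into the evaluation step as plain functions. The sketch below assumes metrics take true and predicted values and return a number; `evaluate`, `mse` and `max_abs_error` are illustrative names.

```python
from typing import Callable, Dict, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error, a built-in default metric."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def max_abs_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """A custom, use-case-specific metric injected by the user."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def evaluate(y_true: Sequence[float], y_pred: Sequence[float],
             metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Run every injected metric over the raw predictions."""
    return {name: fn(y_true, y_pred) for name, fn in metrics.items()}


results = evaluate(
    [1.0, 2.0, 3.0],
    [1.0, 2.5, 2.0],
    {"mse": mse, "max_abs_error": max_abs_error},
)
print(results)
```

The raw predictions stay available to the caller, so evaluation that does not fit the metric interface can still be done outside of it.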
Serving
Serving is caught between the worlds. It would be easy to claim that a trained model is a permanent artifact, like you might claim that a Docker container acts as an artifact of software development. We can borrow another learning from software developers – if you don’t understand where your code is run, you don’t understand your code.
Only by understanding how a model is served will an ML training flow ever be complete. For one, data is prone to change. A myriad of reasons might be the cause, but the result remains the same: Models need to be retrained to account for data drift. In short, continuous training is required. Through the declarative configuration of our ML flow so far we can reuse this configuration and inject new data – and iterate on those new results.
For another, preprocessing might need embedding with your model. Automation lets us apply the same preprocessing steps used in training to live data, guaranteeing identical shape of input data.
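A small sketch of what "embedding preprocessing with your model" can mean in practice: the parameters are fitted once at training time, stored next to the model, and reused unchanged at serving time. Standardization is used as the example step; the function names and the JSON artifact format are assumptions.

```python
import json
import statistics
from typing import Dict, List


def fit_standardizer(values: List[float]) -> Dict[str, float]:
    """Learn standardization parameters from the training data only."""
    return {"mean": statistics.mean(values), "std": statistics.pstdev(values)}


def standardize(values: List[float], params: Dict[str, float]) -> List[float]:
    """Apply previously fitted parameters; used in training and serving alike."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Training time: fit, transform, and persist the parameters with the model.
train_values = [2.0, 4.0, 6.0, 8.0]
params = fit_standardizer(train_values)
train_features = standardize(train_values, params)
artifact = json.dumps(params)  # hypothetical artifact stored next to the model

# Serving time: load the same parameters and apply them to live data.
serving_params = json.loads(artifact)
live_features = standardize([5.0], serving_params)
```

Live data is never used to re-fit the parameters, so the model always sees inputs in the shape it was trained on.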
Why?
Outside academia, performance of machine learning models is measured through impact – economically, or by increased efficiency. Only reliable and consistent results are true measures for the success of applied ML. We as a new and still growing part of software engineering have to make sure of this success. And the reproducibility of success hinges on the repeatability of the full ML development lifecycle.
Originally posted on Medium by community member, Andreas Grimmer
Continuous Delivery (CD) and Runbook Automation are standard means to deploy, operate and manage software artifacts across the software life cycle. Based on our analysis of many delivery pipeline implementations, we have seen that on average seven or more tools are included in these processes, e.g., version control, build management, issue tracking, testing, monitoring, deployment automation, artifact management, incident management, or team communication. Most often, these tools are “glued together” using custom, ad-hoc integrations in order to form a full end-to-end workflow. Unfortunately, these custom ad-hoc tool integrations also exist in Runbook Automation processes.
Problem: Point-to-Point Integrations are Hard to Scale and Maintain
Not only is this approach error-prone but maintenance and troubleshooting of these integrations in all their permutations is time-intensive too. There are several factors that prevent organizations from scaling this across multiple teams:
- Number of tools: Although the great availability of different tools always allows having the appropriate tool in place, the number of required integrations explodes.
- Tight coupling: The tool integrations are usually implemented within the pipeline, which results in a tight coupling between the pipeline and the tool.
- Copy-paste pipeline programming: A common approach we are frequently seeing is that a pipeline with a working tool integration is often used as a starting point for new pipelines. If now the API of a used tool changes, all pipelines have to catch up to stay compatible and to prevent vulnerabilities.
Let’s imagine an organization with hundreds of copy-paste pipelines, which all contain a hard-coded piece of code for triggering Hey load tests. Now this organization would like to switch from Hey to JMeter. Therefore, they would have to change all their pipelines. This is clearly not efficient!
Solution: Providing Standardized Interoperability Interfaces
In order to solve these challenges, we propose introducing interoperability interfaces, which allow abstracting the tooling in CD and Runbook Automation processes. These interfaces should trigger operations in a tool-agnostic way.
For example, a test interface could abstract different testing tools. This interface can then be used within a pipeline to trigger a test without knowing which tool is executing the actual test in the background.
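A minimal sketch of such a test interface in Python. The class and function names are hypothetical, and the adapters are placeholders: a real adapter would invoke Hey or JMeter where the comments indicate.

```python
from abc import ABC, abstractmethod


class LoadTestRunner(ABC):
    """Tool-agnostic interface the pipeline talks to."""

    @abstractmethod
    def run(self, target_url: str) -> dict:
        """Run a load test against target_url and return a result record."""


class HeyRunner(LoadTestRunner):
    def run(self, target_url: str) -> dict:
        # Placeholder: a real adapter would call Hey here.
        return {"tool": "hey", "target": target_url, "passed": True}


class JMeterRunner(LoadTestRunner):
    def run(self, target_url: str) -> dict:
        # Placeholder: a real adapter would call JMeter here.
        return {"tool": "jmeter", "target": target_url, "passed": True}


def run_test_stage(runner: LoadTestRunner, target_url: str) -> dict:
    """Pipeline step: knows the interface, not the tool behind it."""
    return runner.run(target_url)


url = "https://staging.example.com"  # hypothetical target
hey_result = run_test_stage(HeyRunner(), url)
jmeter_result = run_test_stage(JMeterRunner(), url)  # tool swapped, pipeline unchanged
```

Switching from Hey to JMeter then means changing which adapter is configured once, instead of editing every pipeline.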
These interoperability interfaces are important, as confirmed by the fact that the Continuous Delivery Foundation has established a dedicated working group on Interoperability, and by the open-source project Eiffel, which provides an event-based protocol enabling technology-agnostic communication, especially for Continuous Integration tasks.
Use Events as Interoperability Interfaces
By implementing these interoperability interfaces, we define a standardized set of events. These events are based on CloudEvents and allow us to describe event data in a common way.
The first goal of our standardization efforts is to define a common set of CD and runbook automation operations. We identified the following common operations (please let us know if we are missing important operations!):
- Operations in CD processes: deployment, test, evaluation, release, rollback
- Operations in Runbook Automation processes: problem analysis, execution of the remediation action, evaluation, and escalation/resolution notification
For each of these operations, an interface is required, which abstracts the tooling executing the operation. When using events, each interface can be modeled as a dedicated event type.
The second goal is to standardize the data within the event, which is needed by the tools in order to trigger the respective operation. For example, a deployment tool would need the information of the artifact to be deployed in the event. Therefore, the event can either contain the required resources (e.g. a Helm chart for k8s) or a URI to these resources.
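To illustrate, the sketch below builds a deployment event as a plain dictionary carrying the CloudEvents 1.0 required attributes (`specversion`, `id`, `source`, `type`) plus a data payload with a URI to the resources. The event type name, the source and the payload fields are made up for this example and are not taken from the Keptn spec.

```python
import json
import uuid
from datetime import datetime, timezone

DEPLOYMENT_TRIGGERED = "example.cd.deployment.triggered"  # hypothetical event type


def deployment_event(service: str, chart_uri: str) -> dict:
    """Build a CloudEvents-style event asking for a deployment."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "https://example.com/pipeline",  # hypothetical producer
        "type": DEPLOYMENT_TRIGGERED,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # The tool only needs to know where the artifact lives.
        "data": {"service": service, "chart_uri": chart_uri},
    }


def handle_deployment(event: dict) -> str:
    """Stand-in for any deployment tool subscribed to this event type."""
    data = event["data"]
    return f"deploying {data['service']} from {data['chart_uri']}"


HANDLERS = {DEPLOYMENT_TRIGGERED: handle_deployment}


def dispatch(event: dict) -> str:
    """Route the event by type; the sender never names a concrete tool."""
    return HANDLERS[event["type"]](event)


event = deployment_event("carts", "https://example.com/charts/carts.tgz")
print(dispatch(event))
```

Any tool that understands the event type and its data can take over the operation, which is the point of modeling each interface as a dedicated event type.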
We have already defined a first set of events (https://github.com/keptn/spec), which is specifically designed for Keptn, an open-source project implementing a control plane for continuous delivery and automated operations. We know that these events are currently too tailored for Keptn and single tools. So, please
Let us Work Together on Standardizing Interoperability Interfaces
In order to work on a standardized set of events, we would like to ask you to join us in Keptn Slack.
We can use the #keptn-spec channel in order to work on standardizing interoperability interfaces, which eventually are directly interpreted by tools and will make custom tool integrations obsolete!
From Dailymotion, a French video-sharing technology platform with over 300 million unique monthly users
At Dailymotion, we are hosting and delivering premium video content to users all around the world. We are building a large variety of software to power this service, from our player or website to our GraphQL API or ad-tech platform. Continuous Delivery is a central practice in our organization, allowing us to push new features quickly and in an iterative way.
We are early adopters of Kubernetes: we built our own hybrid platform, hosted both on-premise and on the cloud. And we heavily rely on Jenkins to power our “release platform”, which is responsible for building, testing, packaging and deploying all our software. Because we have hundreds of repositories, we are using Jenkins Shared Libraries to keep our pipeline files as small as possible. It is an important feature for us, ensuring both a low maintenance cost and a homogeneous experience for all developers – even though they are working on projects using different technology stacks. We even built Gazr, a convention for writing Makefiles with standard targets, which is the foundation for our Jenkins Pipelines.
In 2018, we migrated our ad-tech product to Kubernetes and took the opportunity to set up a Jenkins instance in our new cluster – or better yet move to a “cloud-native” alternative. Jenkins X was released just a few months before, and it seemed like a perfect match for us:
- It is built on top of and for Kubernetes.
- At that time – in 2018 – it was using Jenkins to run the pipelines, which was good news given our experience with Jenkins.
- It comes with features such as preview environments which are a real benefit for us.
- And it uses the GitOps practice, which we found very interesting because we love version control, peer review, and automation.
While adopting Jenkins X we discovered that it is first a set of good practices derived from the best performing teams, and then a set of tools to implement these practices. If you try to adopt the tools without understanding the practices, you risk fighting against the tool because it won’t fit your practices. So you should start with the practices. Jenkins X is built on top of the practices described in the Accelerate book, such as micro-services and loosely-coupled architecture, trunk-based development, feature flags, backward compatibility, continuous integration, frequent and automated releases, continuous delivery, GitOps, … Understanding these practices and their benefits is the first step. After that, you will see the limitations of your current workflow and tools. This is when you can introduce Jenkins X, its workflow and set of tools.
We’ve been using Jenkins X since the beginning of 2019 to handle all the build and delivery of our ad-tech platform, with great benefits. The main one being the improved velocity: we used to release and deploy every two weeks, at the end of each sprint. Following the adoption of Jenkins X and its set of practices, we’re now releasing between 10 and 15 times per day and deploying to production between 5 and 10 times per day. According to the State of DevOps Report for 2019, our ad-tech team jumped from the medium performers’ group to somewhere between the high and elite performers’ groups.
But these benefits did not come for free. Adopting Jenkins X early meant that we had to customize it to bypass its initial limitations, such as the ability to deploy to multiple clusters. We’ve detailed our work in a recent blog post, and we received the “Most Innovative Jenkins X Implementation” Jenkins Community Award in 2019 for it. It’s important to note that most of the issues we found have been fixed or are being fixed. The Jenkins X team has been listening to the community feedback and is really focused on improving their product. The new Jenkins X Labs is a good example.
As our usage of Jenkins X grows, we’re increasingly hitting the limits of the single Jenkins instance deployed as part of Jenkins X. In a platform where every component has been developed with a cloud-native mindset, Jenkins is the only one that has been forced into an environment for which it was not built. It is still a single point of failure, with a much higher maintenance cost than the other components – mainly due to the various plugins.
In 2019, the Jenkins X team started to replace Jenkins with a combination of Prow and Tekton. Prow (or Lighthouse) is the component which handles the incoming webhook events from GitHub, and what Jenkins X calls the “ChatOps”: all the interactions between GitHub and the CI/CD platform. Tekton is a pipeline execution engine. It is a cloud-native project built on top of Kubernetes, fully leveraging the API and capabilities of this platform. No single point of failure, no plugin compatibility nightmare – yet.
Since the beginning of 2020, we’ve started an internal project to upgrade our Jenkins X setup – by introducing Prow and Tekton. We saw immediate benefits:
- Faster scheduling of pipeline “runner” pods – because all components are now Kubernetes-native components.
- Simpler pipelines – thanks to both the Jenkins X Pipelines YAML syntax and the ability to easily decouple a complex pipeline into multiple small ones that are run concurrently.
- Lower maintenance cost.
While replacing the pipeline engine of Jenkins X might seem like an implementation detail, in fact, it has a big impact on the developers. Everybody is used to seeing the Jenkins UI as the CI/CD UI – the main entry point, the way to manually restart pipeline executions, to access logs and test results. Sure, there is a new UI and a real API with an awesome CLI, but the new UI is not finished yet, and some people still prefer to use web browsers and terminals. Leaving the Jenkins Plugins ecosystem is also a difficult decision because some projects heavily rely on a few plugins. And finally, with the introduction of Prow (Lighthouse) the GitHub workflow is a bit different, with Pull Request merges being done automatically, instead of people manually merging when all the reviews and automated checks are green.
So if 2019 was the year of Jenkins X at Dailymotion, 2020 will definitely be the year of Tekton: our main release platform – used by almost all our projects except the ad-tech ones – is still powered by Jenkins, and we increasingly feel its limitations in a Kubernetes world. This is why we plan to replace all our Jenkins instances with Tekton, which was truly built for Kubernetes and will enable us to scale our Continuous Delivery practices.
By Tracy Ragan, CEO of DeployHub, CD Foundation Board Member
Microservice pipelines are different than traditional pipelines. As the saying goes…
“The more things change; the more things stay the same.”
As with every step in the software development evolutionary process, our basic software practices are changing with Kubernetes and microservices. But the basic requirements of moving software from design to release remain the same. Their look may change, but all the steps are still there. In order to adapt to a new microservices architecture, DevOps Teams simply need to understand how our underlying pipeline practices need to shift and change shape.
Understanding Why Microservice Pipelines are Different
The key to understanding microservices is to think ‘functions.’ With a microservice environment the concept of an ‘application’ goes away. It is replaced by a grouping of loosely coupled services connected via APIs at runtime, running inside of containers, nodes and pods. The microservices are reused across teams increasing the need for improved organization (Domain Driven Design), collaboration, communication and visibility.
The biggest change in a microservice pipeline is having a single microservice used by multiple application teams independently moving through the life cycle. Again, one must stop thinking ‘application’ and instead think ‘functions’ to fully appreciate the oncoming shift. And remember, multiple versions of a microservice could be running in your environments at the same time.
Microservices are immutable. You don’t ‘copy over’ the old one, you deploy a new version. When you deploy a microservice, you create a Kubernetes deployment YAML file that defines the Label and the version of the image.
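A minimal sketch of such a deployment definition, built as a Python dictionary and printed as JSON (which Kubernetes accepts alongside YAML). The label value dh-ms-general comes from this article; the label key `app`, the registry path, the version number and the `deployment_manifest` helper are assumptions made for illustration.

```python
import json


def deployment_manifest(label: str, image: str, version: str, replicas: int = 1) -> dict:
    """Build an apps/v1 Deployment that ties a Label to one image version."""
    labels = {"app": label}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": label, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": label, "image": f"{image}:{version}"}
                    ]
                },
            },
        },
    }


manifest = deployment_manifest(
    "dh-ms-general",
    "registry.example.com/dh-ms-general",  # hypothetical image location
    "1.2.0",  # hypothetical version
)
print(json.dumps(manifest, indent=2))
```

Deploying a new version means generating the same definition with a new image tag rather than changing anything in place.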
In the above example, our Label is dh-ms-general. When a microservice Label is reused for a new container image, Kubernetes stops using the old image. But in some cases, a second Label may be used allowing both services to be running at the same time. This is controlled by the configuration of your ingresses. Our new pipeline process must incorporate these new features of our modern architecture.
Comparing Monolithic to Microservice Pipelines
What does your life cycle pipeline look like when we manage small functions vs. a monolithic application running in a modern architecture? Below is a comparison for each category and their potential shift for supporting a microservice pipeline.
Change Request
Monolithic:
Logging a user problem ticket, enhancement request or anomaly based on an application.
Microservices:
This process will remain relatively un-changed in a microservice pipeline. Users will continue to open tickets for bugs and enhancements. The difference will be sorting out which microservice needs the update, and which version of the microservice the ticket was opened against. Because a microservice can be used by multiple applications, dependency management and impact analysis will become more critical for helping to determine where the issue lies.
Version Control
Monolithic:
Tracking changes in source code content. Branching and merging updates allowing multiple developers to work on a single file.
Microservices:
While versioning your microservice source code will still be done, your source code will be smaller, 100-300 lines of code versus 1,000 – 3,000 lines of code. This impacts the need for branching and merging. The concept of merging ‘back to the trunk’ is more of a monolithic concept, not a microservice concept. And how often will you branch code that is a few hundred lines long?
Artifact Repository
Monolithic:
Originally built around Maven, an artifact repository provides a central location for publishing jar files, Node.js packages, JavaScript packages, Docker images and Python modules. At the point where you run your build, your package manager (Maven, npm, pip) performs the dependency management for tracking transitive dependencies.
Microservices:
Again, these tools supported monolithic builds and solved dependency management to resolve compile/link steps. We move away from monolithic builds, but we still need to build our container and resolve our dependencies. These tools will help us build containers by determining the transitive dependencies needed for the container to run.
Builds
Monolithic:
Executes a serial process for calling compilers and linkers to translate source code into binaries (Jar, War, Ear, .Exe, .dlls, docker images). Common languages that support the build logic include Make, Ant, Maven, Meister, NPM, PIP, and Docker Build. The build calls on artifact repositories to perform dependency management based on what versions of libraries have been specified by the build script.
Microservices:
For the most part, builds will look very different in a microservice pipeline. A build of a microservice will involve creating a container image and resolving the dependencies needed for the container to run. You can think of a container image as our new binary. This will be a relatively simple step and not involve a monolithic compile/link of an entire application. It will only involve a single microservice. Linking is done at runtime with the RESTful API call coded into the microservice itself.
Software Configuration Management (SCM)
Monolithic:
The build process is the central tool for performing configuration management. Developers set up their build scripts (POM files) to define what versions of external libraries they want to include in the compile/link process. The build performs configuration management by pulling code from version control based on a ‘trunk’ or ‘branch.’ A Software Bill of Materials can be created to show all artifacts that were used to create the application.
Microservices:
Much of what we used to do for configuring our application occurred at the software ‘build.’ But ‘builds’ as we know them go away in a microservice pipeline. This is where we made very careful decisions about what versions of source code and libraries we would use to build a version of our monolithic application. For the most part, the version and build configuration shifts to runtime with microservices. While the container image has a configuration, the broader picture of the configuration happens at run-time in the cluster via the APIs.
In addition, our SCM will begin to bring in the concept of Domain Driven Design where you are managing an architecture based on the microservice ‘problem space.’ New tooling will enter the market to help with managing your Domains, your logical view of your application and to track versions of applications to versions of services. In general, SCM will become more challenging as we move away from resolving all dependencies at the compile/link step and must track more of it across the pipeline.
Continuous Integration (CI)
Monolithic:
CI is the triggered process of pulling code and libraries from version control and executing a Build based on a defined ‘quiet time.’ This process improved development by ensuring that code changes were integrated as frequently as possible to prevent broken builds, thus the term continuous integration.
Microservices:
Continuous Integration was originally adopted to keep us re-compiling and linking our code as frequently as possible in order to prevent the build from breaking. The goal was to get to a clean ‘10-minute build’ or less. With microservices, you are only building a single ‘function.’ This means that an integration build is no longer needed. CI will eventually go away, but the process of managing a continuous delivery pipeline will remain important with a step that creates the container.
Code Scanning
Monolithic:
Code scanners have evolved from looking at coding techniques for memory issues and bugs to scanning for open source library usage, licenses and security problems.
Microservices:
Code scanners will continue to be important in a microservice pipeline but will shift to scanning the container image more than the source. Some will be used during the container build focusing on scanning for open source libraries and licensing while others will focus more on security issues with scanning done at runtime.
Continuous Testing
Monolithic:
Continuous testing was born out of test automation tooling. These tools allow you to perform automated tests on your entire application including timings for database transactions. The goal of these tools is to improve both the quality and speed of the testing efforts driven by your CD workflow.
Microservices:
Testing will always be an important part of the life cycle process. The difference with microservices will be understanding impact and risk levels. Testers will need to know what applications depend on a version of a microservice and what level of testing should be done across applications. Test automation tools will need to understand microservice relationships and impact. Testing will grow beyond testing a single application and instead will shift to testing service configurations in a cluster.
Security
Monolithic:
Security solutions allow you to define or follow a specific set of standards. They include code scanning, container scanning and monitoring. This field has grown into the DevSecOps movement where more of the security activities are being driven by Continuous Delivery.
Microservices:
Security solutions will shift further ‘left’ adding more scanning around the creation of containers. As containers are deployed, security tools will begin to focus on vulnerabilities in the Kubernetes infrastructure as they relate to the content of the containers.
Continuous Delivery Orchestration (CD)
Monolithic:
Continuous Delivery is the evolution of continuous integration triggering ‘build jobs’ or ‘workflows’ based on a software application. It auto executes workflow processes between development, testing and production orchestrating external tools to get the job done. Continuous Delivery calls on all players in the lifecycle process to execute in the correct order and centralizes their logs.
Microservices:
Let’s start with the first and most obvious difference between a microservice pipeline and a monolithic pipeline. Because microservices are independently deployed, most organizations moving to a microservice architecture tell us they use a single pipeline workflow for each microservice. Also, most companies tell us that they start with 6-10 microservices and grow to 20-30 microservices per traditional application. This means you are going to have hundreds if not thousands of workflows. CD tools will need to include the ability to template workflows allowing a fix in a shared template to be applied to all child workflows. Managing hundreds of individual workflows is not practical. In addition, plug-ins need to be containerized and decoupled from a version of the CD tool. And finally, look for actions to be event driven, with the ability for the CD engine to listen to multiple events, run events in parallel and process thousands of microservices through the pipeline.
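The templating idea can be sketched as follows. The workflow fields, step names and service names are hypothetical; the point is only that each microservice workflow is rendered from one shared template, so a fix to the template reaches every child workflow.

```python
import copy
from typing import Dict, Optional

# Hypothetical shared workflow template.
TEMPLATE = {"steps": ["build-image", "scan-image", "deploy"], "timeout_minutes": 10}

# Hypothetical microservices, each with optional per-service overrides.
SERVICES: Dict[str, Optional[dict]] = {
    "cart": None,
    "checkout": {"timeout_minutes": 20},
    "login": None,
}


def render(template: dict, service: str, overrides: Optional[dict] = None) -> dict:
    """Create one child workflow from the template plus per-service overrides."""
    workflow = copy.deepcopy(template)
    workflow.update(overrides or {})
    workflow["service"] = service
    return workflow


def render_all(template: dict, services: Dict[str, Optional[dict]]) -> Dict[str, dict]:
    """Render the child workflow of every microservice."""
    return {name: render(template, name, ov) for name, ov in services.items()}


before = render_all(TEMPLATE, SERVICES)

# One fix in the shared template: add a test step before deploying.
fixed_template = dict(TEMPLATE, steps=["build-image", "scan-image", "test", "deploy"])
after = render_all(fixed_template, SERVICES)
```

With hundreds of microservices, `SERVICES` grows, but the number of places where a workflow fix has to be made stays at one.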
Continuous Deployments
Monolithic:
This is the process of moving artifacts (binaries, containers, scripts, etc.) to the physical runtime environments on a high frequency basis. In addition, deployment tools track where an artifact was deployed along with audit information (who, where, what) providing core data for value stream management. Continuous deployment is also referred to as Application Release Automation.
Microservices:
The concept of deploying an entire application will simply go away. Instead, deployments will be a mix of tracking the Kubernetes deployment YAML file with the ability to manage the application’s configuration each time a new microservice is introduced to the cluster. What will become important is the ability to track the ‘logical’ view of an application by associating which versions of the microservices make up an application. This is a big shift. Deployment tools will begin generating the Kubernetes YAML file removing it from the developer’s to-do list. Deployment tools will automate the tracking of versions of the microservice source to the container image to the cluster and associated applications to provide the required value stream reporting and management.
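A minimal sketch of tracking this ‘logical’ view: each application version is recorded as the set of microservice versions it consumes, which also answers the impact question of who is affected by a given microservice version. All application names, service names and versions are invented for the example.

```python
from typing import Dict, List, Tuple

AppVersion = Tuple[str, str]

# Hypothetical logical view: application version -> microservice versions it uses.
APPLICATIONS: Dict[AppVersion, Dict[str, str]] = {
    ("store-front", "2.1"): {"cart": "1.4", "login": "3.0"},
    ("admin-portal", "1.7"): {"login": "3.0", "reports": "0.9"},
    ("store-front", "2.0"): {"cart": "1.3", "login": "2.9"},
}


def consumers(service: str, version: str,
              applications: Dict[AppVersion, Dict[str, str]]) -> List[AppVersion]:
    """List the application versions that depend on this microservice version."""
    return sorted(
        app for app, services in applications.items()
        if services.get(service) == version
    )


# Which applications are affected by a problem in login 3.0?
affected = consumers("login", "3.0", APPLICATIONS)
print(affected)
```

The same mapping supports the ticketing and testing questions raised earlier: which version of a microservice a ticket was opened against, and which applications need retesting when it changes.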
Conclusion
As we shift from managing monolithic applications to microservices, we will need to create a new microservice pipeline. From the need to manage hundreds of workflows in our CD pipeline, to the need for versioning microservices and their consuming application versions, much will be different. While there are changes, the core competencies we have defined in traditional CD will remain important even if it is just a simple function that we are now pushing independently across the pipeline.
About the Author
Tracy Ragan is CEO of DeployHub and serves on the Continuous Delivery Foundation Board. She is a microservice evangelist with expertise in software configuration management, builds and release. Tracy was a consultant to Wall Street firms on build and release management for 7 years prior to co-founding OpenMake Software in 1995. She was a founding member of the Eclipse organization and served on the board for 5 years. She is a recognized leader and has been published in multiple industry publications as well as presenting to audiences at industry conferences. Tracy co-founded DeployHub in 2018 to serve the microservice development community.
By Tracy Ragan, CEO DeployHub, CD Foundation Board Member
The New Mexico CI/CD CDF Meetup, hosted by DeployHub, enjoyed an amazing presentation by Louis Vernon, Site Reliability Engineer at Descartes Labs. Louis showed in detail how Descartes Labs improved service levels to customers, dumped a waterfall release approach and simplified their GKE releases using Spinnaker, Istio and StackDriver.
Louis’ presentation covers how the team at Descartes Labs implemented Spinnaker to push continuous deployments including the integration of Istio to route updates between Dev, Beta, Pre-Release, and Release, all running in the same cluster.
The use of both Istio and Spinnaker at Descartes Labs is a mature example of what can be done to build out a modern Kubernetes Pipeline.
While Descartes Labs still implements different ‘states’ of the pipeline, the release process uses a single cluster with Istio properly routing using virtual service names.
Louis explains how the team at Descartes Labs got to the point where they understood that a shift of this magnitude was essential for creating a stable environment for all users.
In my humble opinion, moving away from separate Dev, Test and Prod clusters is the direction in which we will all be moving.
The full recorded demo is here:
Louis also presented at the Spinnaker Summit 2019. You can download his presentation at:
A not so serious look at a very serious problem
Originally published on Medium by Dan Lorenc (dlorenc@google.com, twitter.com/lorenc_dan)
A Blast From the Past
2019 was a crazy time to be writing software. It’s hard to believe how careless we were as an industry. Everyone was just having fun slinging code. Companies were using whatever code they found lying around on NPM, Pip, or Maven Central. No one even looked at the code these package managers were downloading for them. We had no idea where these binaries came from or even who wrote most of this stuff.
And don’t even get me started on containers! There was no way to know what was inside most of them or what they did. Yet there we were, pulling them from Dockerhub, slapping some YAML on them, and running them as root in our Kubernetes clusters. Whoops, I just dated myself. Kubernetes was a primitive system written mostly in YAML and Bash that people used to interact with before Serverless came and saved us all.
Looking back, it’s shocking that the industry is still around! How we didn’t have to cough up every Bitcoin in the world to stop our databases from getting leaked or our servers from being blown up is beyond me. Thankfully, we realized how silly this all was, and we stopped using whatever code had the most Github stars and started using protection.
We’re Under Attack
No, really. Every time you pip install, go get, or mvn fetch something, you’re doing the equivalent of plugging a thumb drive you found on the sidewalk into your production server.
You’re taking code from someone you’ve never met and then running it with access to your most sensitive data. Hopefully, you at least know their email address or Github account from the commit, but there’s no way to know if this is accurate unless you’re checking PGP signatures. And let’s be honest, you’re probably not doing that.
This might sound like I’m just fear-mongering, but I promise I’m not. This is a real problem that everyone needs to be aware of. Attacks like this are called supply-chain attacks, and they are nothing new. Just last month, an open source package on PyPI was found to contain an active RCE vulnerability that was being used to steal SSH and GPG credentials.
There are lots of variations on this same play that make use of different social-engineering techniques in interesting ways. One attacker used a targeted version of this to steal cryptocurrency from a few specific websites. Another group performed a “long-con” where they actually produced and maintained a whole set of useful open source images on Dockerhub for years before slowly adding, you guessed it, crypto-mining.
The possibilities are endless, terrifying, and morbidly fascinating. And they’re happening more and more often. If reading about attacks like these is your kind of thing, the CNCF has started cataloging known instances of them. Snyk also just published a post detailing how easy it is to inject code like this in most major languages — Github even hides these diffs in code review by default! Russ Cox has also been writing about this problem for a while.
Vision
OK, there’s a bit of hyperbole up there (Kubernetes doesn’t have that much bash in it), but open source is under attack, and it’s not OK. Some progress is being made in this area: GitHub and others are scanning repositories, binaries, and containers, but these tools only work on known vulnerabilities. They have no mechanism to catch intentional, malicious code before it is discovered, and that code is at least as dangerous.
The brutal fact is that there is no way to be confident about the code you find on most artifact repositories today. The service might be compromised and serve you a different package from the one the author uploaded. The maintainer’s credentials might have been compromised, allowing an attacker to upload something malicious. The compiler itself might have been hacked, or even the compiler that compiler used (PDF warning)! Or, the maintainer could have just snuck something in on purpose.
For any given open source package, we need to be able to confidently assert what code it’s comprised of, what toolchains and steps were used to produce the package, and who was responsible for each piece. This information needs to be made available publicly. A reliable, secure view of the supply-chain of every open source package will help make these attacks easier to prevent and easier to detect when they do happen. And the ability to tie each line of code and action back to a real individual will allow us to hold attackers accountable.
How Do We Get There?
We need to work as an industry to start securing open source software, piece by piece.
Artifact repositories need to support basic authentication best practices like 2FA, artifact signing, and strong password requirements. DockerHub, PyPI, and NPM support 2FA, but there’s no way to see if a maintainer of a package is using it. Most container registries don’t support signatures yet, though work is ongoing.
Software build systems need to make reproducible, hermetic builds possible and easy. Debian has started doing some great work here, but they’re basically alone. Every docker build gives you a new container digest. Tar and gzip throw timestamps everywhere. It’s possible to get reproducible builds in Go, Java, and most other major languages, but it’s not necessarily easy. See the recently published whitepaper on how Google handles much of this internally for more information.
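To make the timestamp problem concrete, here is a minimal sketch of packing files into a .tar.gz whose bytes depend only on the file names and contents. It uses Python's standard tarfile and gzip modules. The function name build_reproducible_targz is made up for this example, and the sketch assumes the same Python and zlib versions on every machine that runs it, since the compressed bytes can differ between compressor versions.

```python
import gzip
import io
import tarfile


def build_reproducible_targz(files: dict[str, bytes]) -> bytes:
    """Pack files into a .tar.gz whose bytes depend only on names and contents."""
    tar_buffer = io.BytesIO()
    # Pin the tar format so the output does not depend on the library default.
    with tarfile.open(fileobj=tar_buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):  # fixed ordering, independent of insertion order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0  # no build timestamp in the archive entry
            info.mode = 0o644
            info.uid = info.gid = 0  # no trace of the building user
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))

    gz_buffer = io.BytesIO()
    # mtime=0 keeps gzip from writing the current time into its header;
    # filename="" keeps a file name out of the header as well.
    with gzip.GzipFile(filename="", fileobj=gz_buffer, mode="wb", mtime=0) as gz:
        gz.write(tar_buffer.getvalue())
    return gz_buffer.getvalue()
```

Calling the function twice with the same inputs, even minutes apart or with the files listed in a different order, should yield byte-identical archives, which is what lets a third party rebuild the artifact and compare digests.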
SCM providers need strong identity mechanisms so we can associate code back to authors confidently. Git commit logs can be easily forged, and signed commits are not in common use. Even with them, you still have no idea who is on the other end of a PR, only that the signature matches. This isn’t just an issue for security. It can also be a licensing nightmare if you don’t know the real author or license of code you’re accepting.
There is value in allowing developers to work anonymously, but there is also a cost. We need to balance this with systems that apply a higher level of scrutiny to anonymous code. We also need to allow other individuals to “vouch for” patches that they’ve examined, maybe similar to how Wikipedia handles anonymous edits.
And finally, all of this needs to be tied together in secure CI/CD systems and platforms that implement binary transparency for public packages. Putting the packaging steps in the hands and laptops of developers leaves way too large an attack surface. The ability to push a package that will run in prod is the same as having root in prod. By moving the build and upload steps into secure CI/CD systems, we can reduce the need to trust individuals.
OK, but What Can I Do Now?
First, start by securing your code as much as possible. Make sure you have copies of every dependency you’re using stored somewhere. Make sure you review all code you’re using, including OSS. Set up and mandate the use of 2FA across your organization. Publish, and actually check, the signatures and digests of the software you’re using.
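Checking a digest can be as small as the sketch below, which compares the SHA-256 of downloaded bytes against a value you pinned earlier. The function names are hypothetical, and where the pinned digest comes from (a lock file, a signed release note) is left as an assumption. A digest check only proves the bytes match what you pinned; it says nothing about whether the pinned version was trustworthy in the first place.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_digest(data: bytes, expected_hex: str) -> bool:
    """True only if the data hashes to the pinned digest."""
    actual = sha256_digest(data)
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

In a build script, a failed check should stop the build rather than log a warning, for example: if not verify_digest(downloaded, pinned): raise SystemExit("digest mismatch").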
Log enough information in your build system so you can trace every artifact back to its sources, and every deployment back to its artifacts. Once you’ve done all of this, you’ll be pretty far ahead of everyone else. You’re not completely safe, though.
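As a rough illustration of that kind of tracing, the sketch below records which source commit produced each artifact and looks up the build behind a deployment by artifact digest. The record types, field names, and functions are invented for this example and do not follow any specific provenance format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildRecord:
    """What the build system logs for every artifact it produces."""
    artifact_sha256: str
    source_repo: str
    source_commit: str
    builder: str  # which CI worker ran the build


@dataclass(frozen=True)
class DeploymentRecord:
    """What the deploy step logs: which artifact went to which environment."""
    environment: str
    artifact_sha256: str


def record_build(artifact: bytes, source_repo: str, source_commit: str, builder: str) -> BuildRecord:
    """Create a log entry tying an artifact's digest to its sources."""
    digest = hashlib.sha256(artifact).hexdigest()
    return BuildRecord(digest, source_repo, source_commit, builder)


def trace_deployment(deployment: DeploymentRecord, build_log: list[BuildRecord]) -> Optional[BuildRecord]:
    """Find the build that produced a deployed artifact, or None if unknown."""
    for build in build_log:
        if build.artifact_sha256 == deployment.artifact_sha256:
            return build
    return None
```

A deployment whose artifact has no matching build record is the interesting case: it means something reached the environment without passing through the logged build path.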
That’s where we need to work together. If you’re interested in helping out, there are many ways to get involved, and I’m sure there are a lot of efforts going on. We’re just getting started on several initiatives inside the Continuous Delivery Foundation, like our new Security SIG. We’re also hoping to make it easier to build and use secure delivery pipelines inside the TektonCD open source project.
We would love your help, no matter your expertise! For example, I’m far from a security expert, but I’ve spent a lot of time working on developer tools and CI/CD systems. Feel free to reach out to me directly if you have any questions or want to get involved. I’m on Twitter and Github.
Written by Tracy Miranda, CloudBees director of open source community and member of the CDF governing board
The CD Foundation (CDF) recently shared its 9 Strategic Goals. The second on the list is “Cultivate Growth of Projects.” This goal naturally leads us to ask ourselves the question: how do we measure the growth of our projects to know we are being successful?
There are many dimensions to open source projects. In order to sustain a project, much more than code is required. The CDF helps with multiple essential services for project growth and sustenance. One of my favourite services is the CDF devstats site, which provides a wealth of data around the projects. CDF devstats, which is based on the CNCF devstats, gives indicators on community health and contributor statistics.
Example from Tekton, one of many dashboards and data sets available:
Sometimes we have distorted views of how well projects are doing – this can be down to a few things such as hype or public sentiment around a project. Sometimes newer projects are viewed as doing better than older projects. It is important to have a sense of how well your project is doing. While there are lots of different ways to do it, one method I really like is looking at the number of individual developers contributing to a project.
With CDF devstats I am able to take a snapshot of that data, and then see how the CDF projects stacked up against CNCF graduated projects.
The chart here shows a visualization of average developer contributions to each project based on data from the past year. There are many caveats with the data; for example, the repos included for each project may not be strictly equivalent. But what I like about it is that, at a high level, it gives an indication of the size of project contributions and how projects compare relative to one another.
I also like that it is all open – so you can verify the data and process for yourself, plus do your own analysis.
Kubernetes, as you might expect, is a powerhouse of a project with thousands of contributors. But actually Jenkins stacks up nicely in comparison with a healthy number of contributions, which is all the more significant considering it is a 15-year-old project. Sustaining and growing community contributions year-on-year for 15 years is an incredible achievement. The other CDF projects, Spinnaker, Jenkins X and Tekton, are much newer but also coming along quite nicely. See this repo for links to data.
For me this is a nice snapshot to say we’re off to a good start here at CDF. Individual project growth will come down to each project’s community – but CDF will be working to provide key services and some of the less fun grunt stuff so project leaders can better focus on the important efforts of community building.