DevOps for Data Workshop | August 28

August 18, 2025 (updated August 19, 2025) | Blog, Staff

Plan to attend the next CD Workshop on Thursday, August 28, with Danyal Khan, a contributor to our DataOps Initiative.

DevOps for Data: Delivering and Orchestrating Apache Spark on Containers

Data teams ship critical workloads, but Spark jobs often live outside the DevOps/CI/CD guardrails. This session shows how to bring Continuous Delivery discipline to Apache Spark on container-orchestrated platforms (with Kubernetes as a concrete example). We’ll cover how to package Spark apps as immutable artifacts, add automated quality gates (code, dependency, and data tests), and promote jobs through environments using pipeline-as-code.
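As a rough illustration of the "immutable artifact plus testable code" idea, here is a minimal sketch of a PySpark job whose transformation logic lives in a plain function so CI can unit-test it before the image or wheel is built. The module name, column names, and configuration keys are illustrative assumptions, not workshop material.

```python
# etl_job.py -- hypothetical module; names, columns, and config keys are illustrative only.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def clean_orders(orders: DataFrame) -> DataFrame:
    """Pure transformation: easy to unit-test with a local SparkSession in CI."""
    return (
        orders
        .filter(F.col("amount") > 0)                      # drop refunds/invalid rows
        .withColumn("order_date", F.to_date("order_ts"))  # derive partition column
    )


def main() -> None:
    # Entry point baked into the immutable image/wheel; input and output paths come
    # from Spark conf so the same artifact can be promoted across dev/stage/prod.
    spark = SparkSession.builder.appName("orders-etl").getOrCreate()
    raw = spark.read.parquet(spark.conf.get("spark.etl.input"))
    clean_orders(raw).write.mode("overwrite").parquet(spark.conf.get("spark.etl.output"))
    spark.stop()


if __name__ == "__main__":
    main()
```

Keeping the transform separate from I/O is what makes the "automated quality gates" stage cheap: CI can exercise it against a handful of in-memory rows without touching real storage.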

We will discuss the end-to-end flow: commit → CI build → artifact + test → CD submit to a local container orchestrator → run/observe/roll back. We’ll close with a production checklist for platform teams (multi-tenant quotas, secrets, cost controls, and supply-chain security) and share a template repo you can adapt. If you’re a DevOps, platform, or data engineer looking to make Spark delivery as robust as the delivery of your apps and services, this is your fast on-ramp.
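To make the "CD submit to a local container orchestrator" step concrete, the sketch below uses the Kubernetes Python client to launch the packaged job as a Kubernetes Job. The image digest, namespace, and spark-submit arguments are assumptions; a real pipeline might instead apply a SparkApplication manifest via the Spark operator.

```python
# submit_job.py -- illustrative CD step; image, namespace, and paths are assumptions.
from kubernetes import client, config


def submit_spark_job(image: str, namespace: str = "data-jobs") -> None:
    config.load_kube_config()  # a pipeline running in-cluster would use load_incluster_config()

    container = client.V1Container(
        name="orders-etl",
        image=image,  # pinned by digest so the promoted artifact is immutable
        command=["/opt/spark/bin/spark-submit"],
        args=["--master", "local[*]", "local:///opt/app/etl_job.py"],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="orders-etl-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
            backoff_limit=2,  # bounded retries; rollback = resubmit the previous digest
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    submit_spark_job("registry.example.com/orders-etl@sha256:...")  # hypothetical reference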

Key Takeaways:

  • CD blueprint for Spark: From commit → artifact (image/jar/wheel) → automated checks → environment promotion → safe rollouts (Jobs/CronJobs/SparkApplication) with rollback strategies.
  • Quality gates for code and data: Unit + integration tests, schema/contract checks, lightweight data validations (see the sketch after this list); include dependency scanning, SBOM, and signatures.
  • CDEvents for orchestration: Event-driven pipelines and notifications across CI, registry, and runtime; traceability from build to execution.
  • Platform guardrails: Namespaces & quotas, secrets management, cost controls, multi-tenancy patterns, and operational SLOs for batch jobs.
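
As a rough sketch of the schema/contract and lightweight data-validation gates mentioned above, the functions below fail the pipeline before a bad dataset is promoted. The expected schema, key column, and thresholds are hypothetical placeholders.

```python
# data_checks.py -- hypothetical pre-publish gate; schema and thresholds are illustrative.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

EXPECTED_SCHEMA = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
])


def check_contract(df: DataFrame) -> None:
    """Schema/contract check: columns and types must match the published contract."""
    actual = {f.name: f.dataType for f in df.schema.fields}
    expected = {f.name: f.dataType for f in EXPECTED_SCHEMA.fields}
    missing = set(expected) - set(actual)
    if missing:
        raise ValueError(f"contract violation: missing columns {sorted(missing)}")
    mismatched = {c for c in expected if c in actual and actual[c] != expected[c]}
    if mismatched:
        raise ValueError(f"contract violation: wrong types for {sorted(mismatched)}")


def check_quality(df: DataFrame, max_null_ratio: float = 0.01) -> None:
    """Lightweight data validation: null-rate and duplicate-key checks."""
    total = df.count()
    nulls = df.filter(F.col("order_id").isNull()).count()
    if total and nulls / total > max_null_ratio:
        raise ValueError(f"null order_id ratio {nulls / total:.2%} exceeds threshold")
    if df.groupBy("order_id").count().filter(F.col("count") > 1).limit(1).count() > 0:
        raise ValueError("duplicate order_id values found")
```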