# 🔄 CI/CD and DevOps

## CI/CD Pipeline

```
Code Commit → Build → Test → Package → Deploy to Staging → Integration Tests → Deploy to Production
```

### Detailed Pipeline Stages

```
1. Checkout ──→ 2. Lint ──→ 3. Test ──→ 4. Build ──→ 5. Scan ──→ 6. Publish ──→ 7. Deploy
                    │           │                        │
                 ESLint/    Unit/Integ/              SAST/SCA/
                 Prettier   e2e tests             Container scan
```

| Stage | Tools | What Happens |
|-------|-------|--------------|
| **Checkout** | git clone, fetch | Retrieve code from repository, including submodules |
| **Lint** | ESLint, Prettier, RuboCop, golangci-lint | Static code analysis, formatting |
| **Test (unit)** | Jest, pytest, JUnit | Fast tests (milliseconds to seconds), no external dependencies |
| **Test (integration)** | Testcontainers, Docker Compose | Tests with DB, message queue, external services |
| **Test (e2e)** | Playwright, Cypress, Selenium | Full-stack tests in the browser |
| **Build** | Docker build, go build, npm build, Maven | Compilation, artifact assembly |
| **Scan (SAST)** | Semgrep, SonarQube, CodeQL | Static security analysis |
| **Scan (DAST)** | OWASP ZAP, Burp Suite | Dynamic analysis (running application) |
| **Scan (SCA)** | Dependabot, Snyk, Trivy | Dependency and CVE analysis |
| **Publish** | Docker push, npm publish, Maven deploy | Upload artifact to registry |
| **Deploy** | ArgoCD, Terraform, Helm, kubectl | Deploy to target environment |

### Continuous Integration (CI)

- Automatic build and tests on every commit
- Fast feedback loop (< 10 min)
- Linting, type checking, unit tests, security scan (SAST)

### Continuous Delivery (CD)

- Automatic deployment to staging / test environments
- Manual approval for production (optional)
- Smoke tests after deployment

### Continuous Deployment

- Fully automatic deployment to production
- Requires high confidence in tests and monitoring
- Feature flags for risk management

## GitHub Actions Detail

### Workflow Syntax

```yaml
name: CI Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: "22"
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        node-version: [22, 24]
    steps:
      - uses: actions/checkout@v4
      # Install the Node.js version selected by the matrix
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - name: Run tests
        run: npm test
```

### Matrix Builds

- Run the same jobs with different parameters (OS, language version, architecture)
- `strategy.matrix` — parameter combinations (Cartesian product)
- `strategy.fail-fast` — cancel the remaining in-progress and queued matrix jobs as soon as one fails (default: `true`)

### Reusable Workflows

```yaml
# .github/workflows/deploy.yml (called)
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
    secrets:
      cloud_role:
        required: true

# Call in caller workflow
jobs:
  deploy:
    uses: ./.github/workflows/deploy.yml
    with:
      environment: staging
    secrets:
      cloud_role: ${{ secrets.STAGING_ROLE }}
```

### Composite Actions

- Custom actions defined in an `action.yml`; they can live in the same repository as the workflows that use them
- Combine `run` steps (each with an explicit `shell`) and `uses` steps
- Use case: standardize lint/test/build across repositories

### Self-hosted Runners

- Own infrastructure for running GitHub Actions
- Use case: private network, GPU, specific HW, compliance
- Scaling: actions-runner-controller (Kubernetes), auto-scaling groups
- Security: job isolation, ephemeral runners

## GitLab CI Detail

```yaml
stages:
  - lint
  - test
  - build
  - deploy

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

lint:
  stage: lint
  image: node:22
  script:
    - npm ci
    - npm run lint

test:
  stage: test
  image: node:22
  needs: ["lint"]
  script:
    # Each job starts in a fresh container, so install dependencies again
    - npm ci
    - npm test
  artifacts:
    paths:
      - coverage/
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml

# The `build` job (stage: build) is omitted here for brevity
deploy-staging:
  stage: deploy
  needs: ["build"]
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - kubectl set image deployment/app app=$DOCKER_IMAGE
```

**Concepts**:

- **Stages** — sequential phases (each stage can have multiple parallel jobs)
- **Rules** — execution conditions (branch, tag, changes, variables) — replaces `only/except`
- **Needs** — DAG dependencies (a job doesn't have to wait for the entire previous stage)
- **Artifacts** — file sharing between jobs (binaries, reports); dependency caches use the separate `cache` keyword
- **Environments** — deployment tracking (rollback, history, approvals)

### DAG Pipelines (Needs)

```
lint ──→ test ──→ build ──→ deploy-staging ──→ deploy-prod
  ↓
build-arm ──→ test-arm
```

- Defines dependencies between jobs (not necessarily stages)
- Enables parallelization of independent jobs
- Reduces overall pipeline time

## Infrastructure as Code (IaC)

| Tool | Type | Language |
|------|------|----------|
| Terraform | Declarative | HCL |
| OpenTofu | Declarative | HCL (Terraform fork) |
| Pulumi | Declarative | TypeScript, Python, Go, C# |
| AWS CDK | Declarative | TypeScript, Python, Java, C# |
| CloudFormation | Declarative | YAML/JSON (AWS) |
| Azure ARM/Bicep | Declarative | Bicep, JSON |
| Ansible | Imperative/Config | YAML |
| Chef/Puppet | Config mgmt | Ruby DSL |

### Infrastructure as Code (2nd Edition) — Kief Morris

A key reference for designing and operating dynamic cloud infrastructure with IaC. The book is tool-agnostic: it focuses on patterns and practices, not specific tools.
#### Three Fundamental Practices

| Practice | Description |
|----------|-------------|
| **Define everything as code** | All infrastructure defined in code, version control, repeatability |
| **Continuously test and deliver** | Every change goes through a pipeline with automated tests |
| **Small, independent pieces** | Small, loosely coupled components — easier change and testing |

#### Principles of Cloud Infrastructure

- **Systems reproducible** — infrastructure can be recreated from code at any time
- **Systems disposable** — instances can be destroyed and recreated
- **Systems consistent** — all environments identical (no snowflake servers)
- **Processes repeatable** — automation instead of manual procedures
- **Design always changing** — infrastructure is constantly evolving (not build-and-forget)

#### Anti-patterns (Pitfalls)

| Anti-pattern | Description |
|--------------|-------------|
| **Snowflake server** | Each server different, cannot reproduce |
| **Configuration drift** | Manual changes → deviations from defined state |
| **Server sprawl** | Too many servers without management |
| **Fragile infrastructure** | Changes often break the system |
| **Automation fear** | Fear of automation → manual interventions |

#### Book Structure (4 Parts)

1. **Foundations** — framework of tools and technologies for cloud platforms
2. **Working with infrastructure stacks** — defining, provisioning, testing and CD of infrastructure changes
3. **Working with servers and application runtime platforms** — provisioning and configuring servers and clusters
4. **Working with large systems and teams** — workflow, governance, architectural patterns for multiple teams

#### IaC Code Organization

| Pattern | Description |
|---------|-------------|
| **Monorepo** | One repository for everything — build-time integration, suitable for small teams |
| **Microrepo** | Separate repository for each project — isolation, suitable for large teams |
| **Domain organization** | Organizing code by domain concepts (not by technology) |

**Recommendations:**

- Infrastructure and applications can be in the same or separate repository depending on organizational structure (Team Topologies)
- Per-environment configuration files (test, staging, production) stored within the project
- Tests belong to the project, integration tests can be in a separate project
- Infrastructure code should not directly deploy applications — use OS packaging (RPM, deb)

#### Expand-Contract Pattern for Infrastructure Changes

Same principle as database migrations:

1. **Expand** — add new resource (old version still running)
2. **Migrate** — move traffic / dependencies to the new resource
3. **Contract** — remove old resource

Prevents outages when refactoring infrastructure.
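
The three expand-contract steps can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Environment` class, the `expand` / `migrate` / `contract` helpers, and the resource names `db-v1` / `db-v2` are hypothetical and stand in for whatever an IaC tool would actually provision. The point it shows is the invariant: at no step does a consumer point at a resource that does not exist.

```python
"""Minimal, in-memory sketch of an expand-contract infrastructure change.

All names (Environment, expand, migrate, contract, "db-v1", "db-v2") are
hypothetical; a real change would be carried out with IaC tooling.
"""
from dataclasses import dataclass, field


@dataclass
class Environment:
    resources: set[str] = field(default_factory=set)
    # Maps each consumer (e.g. a service) to the resource it depends on.
    consumers: dict[str, str] = field(default_factory=dict)

    def check_invariant(self) -> None:
        """Every consumer must always point at an existing resource."""
        for consumer, target in self.consumers.items():
            if target not in self.resources:
                raise RuntimeError(f"{consumer} points at missing {target}")


def expand(env: Environment, new: str) -> None:
    """Step 1: add the new resource while the old one keeps serving."""
    env.resources.add(new)
    env.check_invariant()


def migrate(env: Environment, old: str, new: str) -> None:
    """Step 2: repoint consumers one at a time from old to new."""
    for consumer, target in list(env.consumers.items()):
        if target == old:
            env.consumers[consumer] = new
            env.check_invariant()


def contract(env: Environment, old: str) -> None:
    """Step 3: remove the old resource, but only once nothing uses it."""
    if old in env.consumers.values():
        raise RuntimeError(f"{old} still has consumers; migrate first")
    env.resources.discard(old)
    env.check_invariant()


env = Environment(resources={"db-v1"}, consumers={"api": "db-v1", "worker": "db-v1"})
expand(env, "db-v2")
migrate(env, "db-v1", "db-v2")
contract(env, "db-v1")
print(sorted(env.resources), env.consumers)
```

Calling `contract` before `migrate` raises an error in this model, which mirrors why the old resource is removed in a separate, later change rather than in the same deploy that introduces the new one.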
## Terraform Detail

#### State Locking Mechanism

| Backend | Locking Mechanism | Note |
|---------|-------------------|------|
| **S3 + DynamoDB** | DynamoDB (conditional write) | Most common, cheap, simple; Terraform 1.10+ can also lock natively in S3 (`use_lockfile`) |
| **Terraform Cloud** | Built-in (API) | SaaS, audit logs, VCS integration |
| **Azure Storage** | Azure Blob Lease | Similar to S3 model |
| **GCS** | Lock object (`.tflock`) in the bucket | Built into the `gcs` backend |
| **Consul** | Consul KV session lock | High-availability |
| **PostgreSQL** | pg_advisory_lock | Built-in `pg` backend |

#### State Backends Comparison

| Property | S3 + DynamoDB | Terraform Cloud | Consul |
|----------|---------------|----------------|--------|
| Cost | $ (S3 + DynamoDB) | $$ (free tier limited) | $$ (infra) |
| Team workflow | GitHub Actions + OIDC | Native RBAC, runs | Custom |
| Locking | DynamoDB | Built-in | Consul session |
| History | S3 versioning | Full history, diff | None |
| Remote ops | No (state only) | Yes (remote runs) | No |
| Encryption | SSE-S3/KMS | At rest + in transit | TLS |

#### Workspaces vs Terragrunt

| Aspect | Terraform Workspaces | Terragrunt |
|--------|---------------------|------------|
| **State separation** | One backend, state key prefixed with `env:/<workspace>/` | Separate backend per env |
| **Code reuse** | Same code, different variables | DRY configuration, modules |
| **Risk** | Accidentally `apply` to wrong workspace | Isolated backends |
| **When to use** | Simple projects, <5 envs | Microservices, multi-env, multi-team |
| **Extra features** | — | Dependency, include, before_hook |

#### Provider Versioning

```hcl
terraform {
  required_version = ">= 1.5, < 2.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.23, < 3.0"
    }
  }
}
```

- `~> 5.0` — any 5.x release (equivalent to `>= 5.0, < 6.0`); only the rightmost version component may increase, so `~> 5.0.0` would allow patch releases only
- `>= 2.23, < 3.0` — any 2.x from 2.23
- `~>` constraints guard against the breaking changes of a new major version while still accepting minor and patch updates

### Terraform Workflow

```
terraform init      → Download providers and modules, initialize the backend
terraform plan      → Show changes
terraform apply     → Apply changes
terraform destroy   → Destroy infrastructure
terraform validate  → Syntax validation
terraform fmt       → Format HCL
```

### State Management

- Remote state (S3, Terraform Cloud, Azure Storage)
- State locking (DynamoDB, Consul)
- Workspaces for environment separation

### Terraform: Up and Running (3rd ed.) — Yevgeniy Brikman

Practical guide to Terraform from the co-founder of Gruntwork. The 3rd edition (2022) adds over 100 pages of new content, updates from Terraform 0.12 to 1.2, and two new chapters.

#### What's New in the 3rd Edition

| New Feature | Description |
|-------------|-------------|
| **Chapter: Secrets management** | Managing secrets with Terraform — Vault, AWS Secrets Manager, KMS, OIDC, `sensitive` variables |
| **Chapter: Multiple providers** | Working with multiple regions, accounts, clouds including Kubernetes (AWS EKS) |
| **Terraform 1.0+** | Backward compatibility promise, stability, HashiCorp IPO |
| **Provider versioning** | `required_providers` block + `.terraform.lock.hcl` (lock file) |
| **Module iteration** | `count` and `for_each` on modules (since Terraform 0.13) |
| **Variable validation** | `validation {}` blocks, `precondition` / `postcondition` |
| **Refactoring** | `moved` blocks — safe refactoring without manual state manipulation |
| **CI/CD security** | OIDC authentication, isolated workers for `terraform apply` |

#### Secrets Management with Terraform

```hcl
# Variable marked as sensitive: redacted in plan/apply output,
# but still stored in plaintext in the state file
variable "db_password" {
  type      = string
  sensitive = true
}

# Reading secrets from AWS Secrets Manager
data "aws_secretsmanager_secret" "db" {
  name = "production/db/master"
}

data "aws_secretsmanager_secret_version" "db" {
  secret_id = data.aws_secretsmanager_secret.db.id
}
```

**Recommended Security Hierarchy:**

1. **OIDC** — most secure, no creds on CI server (GitHub Actions → IAM role)
2. **IAM role** — instance profile (EC2, ECS, EKS)
3. **Environment variables** — limited, risk of log leakage
4. **Isolated workers** — a separate worker holds the admin permissions and exposes an API limited to `plan`/`apply`

#### Testing Terraform Code

| Layer | Tools | Description |
|-------|-------|-------------|
| **Static analysis** | `terraform validate`, `tflint`, `tfsec`, `checkov` | Code analysis without execution |
| **Plan testing** | `conftest` + OPA (Rego), `terraform plan` parse | Plan validation against policy |
| **Unit tests** | Terratest (Go), `terraform fmt`, `terraform validate` | Testing modules in isolation |
| **Integration tests** | Terratest (Go) | Actual provisioning + assert |
| **End-to-end tests** | Terratest | Full stack, smoke tests |

#### Policy Enforcement

```rego
# OPA / conftest — deny public S3 bucket
# (Rego v0 syntax; OPA 1.0+ expects `deny contains msg if { ... }`)
package main

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  resource.change.after.acl == "public-read"
  msg = sprintf("%s must not be public", [resource.address])
}
```

#### Production-grade Checklist by Brikman

1. **Small modules** — one module = one thing (single responsibility)
2. **Composable modules** — modules can be composed into larger units
3. **Testable modules** — each module has tests (Terratest)
4. **Releasable modules** — versioning (Git tags, Terraform Registry)
5. **Version control** — everything in git, including `.terraform.lock.hcl`
6. **Remote state** — S3 + DynamoDB or Terraform Cloud
7. **CI/CD pipeline** — `plan` on MR, `apply` after merge to main
8. **Secrets management** — no secrets in plaintext in code
9. **Policy as code** — OPA / Sentinel for compliance
10. **Sandbox environment** — each developer has their own isolated environment

#### Golden Rule of Terraform

> **Master branch state must always be in sync with the production environment.**
> Never run `terraform apply` manually locally on production — always via CI/CD.
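
As an illustration of plan testing in a CI pipeline, the public-bucket rule from the Policy Enforcement section can also be expressed in a few lines of Python. This is a sketch, not a replacement for OPA: it assumes the plan has been exported with `terraform show -json <planfile>`, the function name `find_public_buckets` and the sample plan are hypothetical, and it additionally treats `public-read-write` as public, which the Rego rule does not.

```python
"""Sketch: flag public S3 buckets in a Terraform plan exported as JSON.

Assumes the JSON layout produced by `terraform show -json <planfile>`
(`resource_changes[].type`, `.address`, `.change.after`). The function name
and the sample plan below are illustrative only.
"""
import json

# Assumption: both canned ACLs count as "public" for this check.
PUBLIC_ACLS = {"public-read", "public-read-write"}


def find_public_buckets(plan: dict) -> list[str]:
    """Return one violation message per S3 bucket planned with a public ACL."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_s3_bucket":
            continue
        # `after` is null (None) when the resource is being destroyed.
        after = (rc.get("change") or {}).get("after") or {}
        if after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{rc['address']} must not be public")
    return violations


sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
     "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"actions": ["create"], "after": {"acl": "private"}}},
    {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
     "change": {"actions": ["delete"], "after": null}}
  ]
}
""")

for message in find_public_buckets(sample_plan):
    print(message)
```

In a pipeline, a wrapper script would typically exit with a non-zero status when the returned list is non-empty, so that the `plan` job on the merge request fails before anything is applied.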
## Dockerfile Best Practices

```dockerfile
# Multi-stage build
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Full install: the build step needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies before node_modules is copied to the runtime image
RUN npm prune --omit=dev

# Runtime stage — distroless
FROM gcr.io/distroless/nodejs22-debian12
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER nonroot:nonroot
EXPOSE 3000
CMD ["dist/server.js"]
```

**Rules**:

- **Multi-stage build** — separate build tools from runtime
- **Distroless images** — minimal attack surface (no shell, package manager)
- **Non-root user** — USER nonroot (security best practice)
- **Layer caching** — copy less-frequently changing files first (package.json → npm ci → code)
- **Small base image** — Alpine (5 MB), distroless (minimal), scratch (Go static binary)
- **Healthcheck** — `HEALTHCHECK` instruction (used by Docker and Swarm; Kubernetes ignores it and relies on its own probes)
- **Labels** — LABEL maintainer, version, git commit
- **.dockerignore** — minimize build context

## Artifact Management

### Docker Registries

| Registry | Public/Private | Cost | Integration |
|----------|---------------|------|-------------|
| **Docker Hub** | Both | Public free, private $5/month | GitHub Actions, GitLab |
| **ECR (AWS)** | Private | $0.10/GB/month + data transfer | IAM, ECS, EKS |
| **GHCR (GitHub)** | Both | Public free, private 500 MB free | GitHub Actions, npm |
| **GCR / Artifact Registry** | Private | $0.10/GB/month | GKE, Cloud Build |
| **ACR (Azure)** | Private | $0.11/GB/month | AKS, Azure DevOps |
| **Harbor** | Private (self-hosted) | Free (open source) | Custom, CNCF |

### Helm Charts

- **Repository** — index.yaml + chart .tgz on HTTP server (S3, GitHub Pages, ChartMuseum)
- **OCI registry** — Helm 3.8+ supports storing charts in OCI registries (ECR, GHCR, Harbor)
- **Versioning** — chart version (package) + app version (application)

### SBOM (Software Bill of Materials)

- **SPDX** / **CycloneDX** — standard SBOM formats
- Generation: Trivy, Syft (Grype consumes the resulting SBOM for vulnerability scanning)
- Use case: supply chain
  security, compliance (EO 14028, EU CRA)

## Configuration and Secrets

| Tool | Description |
|------|-------------|
| Vault (HashiCorp) | Dynamic secrets, encryption-as-a-service |
| AWS Secrets Manager | Managed, auto-rotation |
| Azure Key Vault | Managed, HSM support |
| GCP Secret Manager | Managed |
| SOPS | Encryption in git repos |
| Sealed Secrets | Encrypted secrets for Kubernetes |

### Secret Management Workflows

**Vault Agent Injector** (Kubernetes)

- Sidecar container (vault-agent) injects secrets into the pod
- Secrets mounted as tmpfs volume (not into environment variables)
- Auto-rotation: vault-agent periodically refreshes secrets

**External Secrets Operator** (Kubernetes)

- CRD: `ExternalSecret` → creates `Secret` in K8s
- Backend: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, Vault
- Pull-based refresh: the operator polls the external store every `refreshInterval` and propagates changes to the K8s `Secret`

**Sealed Secrets**

- `kubeseal` encrypts the Secret on the client with the controller's public key (the private key stays in the cluster)
- Encrypted manifest (SealedSecret) can be safely stored in git
- Controller decrypts on deploy

## GitOps

- **Principle**: Git is the single source of truth
- **Tools**: ArgoCD, Flux, Rancher Fleet
- Pull-based deploy — agent in the cluster watches repo and applies changes
- Auto-sync + drift detection

## Environment Promotion (dev → staging → prod)

```
Code → Dev (auto-deploy) → Staging (auto + smoke tests) → Prod (manual approval + gating)
```

**Quality Gates**:

1. **Unit tests** — pass rate 100 %, code coverage ≥ 80 %
2. **Integration tests** — all critical paths pass
3. **SAST scan** — no critical/high vulnerabilities
4. **SCA scan** — no known critical CVEs
5. **Container scan** — all fixable vulns addressed
6. **Smoke tests** — after staging deploy (health endpoint, basic flow)
7. **Manual approval** — for production (skipped with continuous deployment)

## Deployment Strategies

| Strategy | Description | Risk |
|----------|-------------|------|
| **Rolling update** | Gradual instance replacement | Low |
| **Blue/Green** | Two identical environments, traffic switch | Medium |
| **Canary** | % traffic to new version, gradual increase | Low |
| **Feature flag** | Toggle feature on/off without deploy | Very low |
| **A/B testing** | Different versions for different users | Low |

## Git Branching Strategies

| Strategy | Description | Suitable For |
|----------|-------------|--------------|
| **Trunk-based** | Single main branch, short feature branches (< 1 day) | CD, microservices, mature teams |
| **GitHub Flow** | Main + feature branches, PRs, simple | Startups, web apps |
| **GitLab Flow** | Main + environment branches (staging, prod) + feature branches | Enterprise, regulated |
| **GitFlow** | Develop + main + feature/release/hotfix branches | Release-based, enterprise legacy |
| **One Flow** | Simplified GitFlow (no develop branch) | Medium teams |

## Rollback Strategies

| Strategy | Description | Speed | Risk |
|----------|-------------|-------|------|
| **Forward fix** | New deploy with hotfix | Slow (build + deploy) | Low |
| **Rollback (revert commit)** | Git revert, new deploy | Medium | Low |
| **Blue/Green switchback** | Switch back to old version | Instant | DB incompatibility |
| **Database rollback** | Revert DB migration (migrate down) | Slow | Data loss risk |

### Database Rollback Challenges

- **Breaking changes** — once a column or table has been removed, a rollback cannot restore it without a backup (the data is lost)
- **Best practice**: Expand → Migrate → Contract (never remove in a single deploy)
- **Tooling**: Flyway undo (limited), Liquibase rollback, pgroll (Postgres)
- **Feature flags** as prevention — new code is behind a flag, rollback = disable flag

## CI/CD Design Patterns

Modern CI/CD pipelines solve recurring problems using design patterns:

| Pattern | Description |
|---------|-------------|
| **Pipeline as Code** | Pipeline defined in YAML/Kotlin DSL (`.gitlab-ci.yml`, `.github/workflows/`) |
| **Immutable Pipeline** | Each build is an artifact, never changed |
| **Quality Gate** | Branch protection, required checks, code coverage threshold |
| **Deployment Strategy** | Blue/Green, Canary, Rolling (see Deployment Strategies above) |
| **GitOps** | Pull-based deploy with auto-sync and drift detection |
| **Shift-Left Security** | SAST/DAST/SCA part of the pipeline |
| **Dependency Caching** | Cache layer between pipeline runs |

## Shift Left Security

### SCA (Software Composition Analysis)

| Tool | Type | Integration |
|------|------|-------------|
| **Dependabot** | GitHub native | GitHub, auto-PR for fix |
| **Renovate** | Multi-platform | GitHub, GitLab, Bitbucket |
| **Snyk** | SaaS + CLI | All platforms, Docker, IaC |
| **Trivy** | CLI, OSS | CI/CD pipeline (GitHub Actions, GitLab) |

### SAST (Static Application Security Testing)

| Tool | Languages | Characteristics |
|------|-----------|----------------|
| **Semgrep** | 30+ (Python, Java, Go, JS/TS) | Fast, custom rules, CI-native |
| **SonarQube** | 30+ | Comprehensive, quality gates, tech debt |
| **CodeQL** | 12 (C++, C#, Go, Java, JS/TS, Python) | GitHub native, query-based |
| **Checkmarx** | 30+ | Enterprise, CxSAST, CxFlow |
| **Fortify** | 30+ | Enterprise, SAST + DAST |

### Container Scanning

| Tool | Description |
|------|-------------|
| **Trivy** | OSS, scans OS packages + language-specific + IaC |
| **Grype** | OSS, from Anchore, fast, Syft for SBOM |
| **Clair** | Red Hat, OSS, OCI-compatible |
| **Docker Scout** | Docker Desktop / CLI, integration with Docker Hub |

## AI-Native Software Delivery (2025–2026)

AI is driving the shift toward "DevOps 2.0":

- **AI-assisted CI/CD** — automatic pipeline failure diagnosis, resource allocation optimization
- **Agent Control Protocol (ACP)** / **Model Context Protocol (MCP)** — standards for AI agent interaction with tooling
- **AI-driven cost management** — FinOps cloud optimization
- **Intelligent test selection** — ML determines which tests to run based on code changes
- **Self-healing pipelines** — AI auto-detects and fixes common issues

New tools: Harness (AI-native CD), GitLab 19.0 (agentic MR workflows, secrets manager), Octopus Deploy.

## Pipeline Tools

- **GitHub Actions** — integrated with GitHub, large marketplace
- **GitLab CI** — native in GitLab, Auto DevOps
- **Jenkins** — longest-established, extensible, self-hosted
- **CircleCI** — SaaS, fast
- **Argo Workflows** — Kubernetes native
- **Buildkite** — hybrid (own agents, SaaS orchestrator)

## Best Practices

- **Idempotent pipeline** — repeated runs give the same result
- **Immutable infrastructure** — never modify a running server, always redeploy
- **Shift left** — tests and security as early as possible in the pipeline
- **Artifact management** — all builds versioned in registry (Docker Hub, ECR, GHCR)
- **Dependency caching** — speed up pipeline (npm cache, pip cache, Docker layer caching)
- **Fail fast** — pipeline fails as early as possible on error
- **Quality gates** — automated checks before every promotion to the next environment
- **Pipeline visibility** — dashboard with current status of all pipelines (GitHub, GitLab, ArgoCD)

## Resources

Links, books and standards: [sources/cicd/sources.en.md](sources/cicd/sources.en.md)

### Recommended Reading

| Book | Authors | ISBN | Key Contribution |
|------|---------|------|-----------------|
| The DevOps Handbook | Kim, Humble, Debois, Willis | 978-1942788003 | CALMS principles (Culture, Automation, Lean, Measurement, Sharing), flow map, deployment pipeline |
| Continuous Delivery | Humble, Farley | 978-0321601912 | Deployment pipeline, commit stage, acceptance tests, capacity testing, zero-downtime release |
| CI/CD Design Patterns | Bajpai, Schildmeijer, Piwosz, Mishra | 978-1-83588-965-7 | 30+ design patterns for CI/CD — pipeline patterns, GitOps, security, testing, deployment strategies |
| DevOps Frameworks, Techniques, and Tools | Vijayakumaran, Kofler, Öggl, Springer | 978-1-4932-2670-2 | Framework for DevOps adoption, tool comparison (Jenkins vs GitLab vs GitHub Actions), techniques for monitoring and observability |

## OpenStack CI/CD

The OpenStack ecosystem uses its own CI/CD tools:

### Zuul

- CI/CD system developed by the OpenStack community (now standalone, used outside OpenStack)
- **Gating** — changes are tested before merge (not after merge), which prevents breaking the main branch
- **Ansible-based** — jobs are Ansible playbooks
- **Nodepool** — dynamic test VM allocation in the cloud (OpenStack, AWS)
- **Pipeline** — check, gate, post, periodic, tag, release

### OpenStack Infra (OpenDev)

- Public CI infrastructure for OpenStack projects
- Tools: Gerrit (code review), Zuul (CI), Nodepool (test nodes), StoryBoard (issue tracking)
- Base jobs: tempest (integration tests), grenade (upgrade tests), devstack-gate (gate tests)

### Integration with External Tools

- **Terraform** — OpenStack provider for provisioning (terraform-provider-openstack)
- **Ansible** — openstack.cloud collection for managing OpenStack resources
- **Packer** — build OpenStack images (openstack builder)
- **Jenkins** — older CI, still used in some distributions

*Last revised: 2026-06-03*