
Keeping Your Databricks Asset Bundles Modular and Clean with Jinja2

Tags: Databricks, DevOps, Jinja2, Data Engineering

Databricks Asset Bundles (DABs) are a great way to manage your data pipelines as code — version-controlled YAML that defines jobs, clusters, schedules, and permissions. But as your project grows beyond a handful of jobs, your bundle configuration gets unwieldy fast.

DABs support include: directives to pull in multiple YAML files, but that’s about as far as native modularity goes. There’s no inheritance, no conditionals, no environment-aware logic, and no reusable fragments. You end up copy-pasting cluster definitions, permission blocks, and git source configs across every job.

This post walks through how we solved this by introducing Jinja2 templating as a pre-render step — keeping our DABs DRY, readable, and truly environment-aware.


The Problem: YAML Repetition at Scale

Consider a project with multiple jobs. Each one needs:

  • A cluster definition (same policy, same Spark version, same env vars)
  • A git source block (branch in dev/staging, tag in production)
  • Permissions (needed in dev, not in production)
  • A schedule (only in production)
  • Tags (same across all jobs)

In vanilla DABs, every job file repeats all of this. A 5-job project means 5 copies of the cluster block, 5 copies of the git source, 5 copies of the tags. Change the Spark version? Touch 5 files.

DABs include: only works at the top level — it merges separate YAML files into the bundle. But it doesn’t give you reusable fragments within a job definition. There’s no !include for nested blocks, no YAML anchors across files, and no way to say “use this cluster block, but only add a schedule in production.”


The Solution: Jinja2 as a Pre-Render Layer

The idea is simple:

  1. Write your bundle resources as Jinja2 templates (.yml.j2)
  2. Extract shared fragments into include files
  3. Run a Python script that renders the templates into plain YAML
  4. Point databricks.yml at the rendered output

The DABs tooling never sees Jinja; it only sees the final, valid YAML. Jinja runs before the bundle is validated or deployed.

Project Structure

.
├── databricks.yml                  # Root bundle config, includes rendered output
├── bundle/
│   ├── includes/                   # Reusable Jinja fragments
│   │   ├── base-job.yml.j2
│   │   ├── clusters.yml.j2
│   │   └── schedule.yml.j2
│   └── workflows/                  # One template per job
│       ├── my_etl_pipeline.yml.j2
│       └── my_ml_pipeline.yml.j2
├── dist-bundle/                    # Rendered output (gitignored)
│   ├── my_etl_pipeline.yml
│   └── my_ml_pipeline.yml
└── scripts/
    └── render_bundle.py            # The render script

The key insight: bundle/ contains templates, dist-bundle/ contains rendered YAML. Only the rendered files are included by databricks.yml.


Building the Include Fragments

Base Job

The base job fragment handles git source, permissions, and tags — the stuff every job needs:

# bundle/includes/base-job.yml.j2
base-job: &base-job
  {% if environment != 'production' %}
  permissions:
    - group_name: "my_team_group"
      level: "CAN_MANAGE"
  {% endif %}

  git_source:
    git_url: https://github.com/my-org/my-repo.git
    git_provider: gitHub
    {% if environment == 'production' %}
    git_tag: ${var.version}
    {% else %}
    git_branch: ${bundle.git.branch}
    {% endif %}

  tags:
    "team": DataEngineering
    "environment": {{ environment }}

Notice how this single fragment handles three concerns:

  • Permissions: Only added outside production (developers need CAN_MANAGE to iterate; production is locked down)
  • Git source: Production pins to a release tag; other environments track the branch
  • Tags: Environment name is injected dynamically
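
For instance, rendered with environment set to production, the fragment above collapses to plain YAML: the permissions block is dropped and the release tag is pinned.

```yaml
# rendered output (production)
base-job: &base-job
  git_source:
    git_url: https://github.com/my-org/my-repo.git
    git_provider: gitHub
    git_tag: ${var.version}

  tags:
    "team": DataEngineering
    "environment": production
```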

Clusters

# bundle/includes/clusters.yml.j2
my_clusters: &my_clusters
  job_clusters:
    - job_cluster_key: main_cluster
      new_cluster:
        policy_id: 000EE9580030D241
        spark_version: 16.4.x-scala2.12
        spark_env_vars:
          ENVIRONMENT: {{ environment }}
        autoscale:
          min_workers: 2
          max_workers: 8

One place to update the Spark version, cluster policy, or scaling config.

Schedule

# bundle/includes/schedule.yml.j2
everyday_5am: &everyday_5am
  quartz_cron_expression: "0 0 5 * * ?"
  timezone_id: "Europe/Rome"
  pause_status: "PAUSED"

Composing a Workflow

With the fragments in place, a workflow template is remarkably concise:

# bundle/workflows/my_etl_pipeline.yml.j2
{% include "includes/base-job.yml.j2" %}
{% include "includes/clusters.yml.j2" %}
{% include "includes/schedule.yml.j2" %}

resources:
  jobs:
    my_etl_pipeline:
      name: "{{ job_prefix }} my_etl_pipeline"
      <<: [*base-job, *my_clusters]
      {% if environment == 'production' %}
      schedule: *everyday_5am
      {% endif %}

      tasks:
        - task_key: ingest_raw_data
          notebook_task:
            notebook_path: src/pipelines/ingest_raw_data
          job_cluster_key: main_cluster

        - task_key: transform_silver
          depends_on:
            - task_key: ingest_raw_data
          notebook_task:
            notebook_path: src/pipelines/transform_silver
          job_cluster_key: main_cluster

The {% include %} directives pull in the YAML anchors (&base-job, &my_clusters, &everyday_5am), and the <<: merge key applies them. Adding a new job is a matter of copying this template and changing the job name and tasks — the infra boilerplate is inherited.

How YAML Anchors + Jinja Include Work Together

This is the trick that makes it all click. Jinja’s {% include %} inserts the fragment’s text verbatim into the template before YAML parsing. So the rendered output is a single YAML document where the anchors (&base-job) and aliases (*base-job) are all in scope.

The <<: merge key is standard YAML — it merges all fields from the referenced anchor into the current mapping. By listing multiple anchors (<<: [*base-job, *my_clusters]), you compose multiple fragments into one job definition.
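
You can verify the merge semantics with PyYAML alone. Here is a minimal, self-contained sketch; the fragment names mirror the bundle but the values are illustrative:

```python
import yaml

# A single document, as if Jinja had already pasted the fragments inline.
doc = """
base-job: &base-job
  tags: {team: DataEngineering}
my_clusters: &my_clusters
  job_clusters:
    - job_cluster_key: main_cluster
jobs:
  etl:
    name: etl
    <<: [*base-job, *my_clusters]
"""

job = yaml.safe_load(doc)["jobs"]["etl"]
# The merge key folded both anchors' fields into the job mapping.
print(job["tags"], job["job_clusters"][0]["job_cluster_key"])
# → {'team': 'DataEngineering'} main_cluster
```

yaml.safe_load resolves the merge key during parsing, so anchors and aliases never reach DABs at all.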


The Render Script

The render script is intentionally simple — just Jinja2 with a FileSystemLoader:

# scripts/render_bundle.py
import os
import shutil
from enum import StrEnum
from pathlib import Path
from typing import Any

from jinja2 import Environment, FileSystemLoader

RESOURCES_DIR = Path(__file__).parent.parent / "bundle"
OUTPUT_DIR = Path(__file__).parent.parent / "dist-bundle"


class BundleEnvironment(StrEnum):  # StrEnum requires Python 3.11+
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION = "production"


def cleanup(directory: Path) -> None:
    """Recreate the output directory from scratch."""
    shutil.rmtree(directory, ignore_errors=True)
    directory.mkdir(parents=True, exist_ok=True)


def build_context() -> dict[str, Any]:
    """Build template context from environment variables."""
    target = os.getenv("BUNDLE_TARGET")
    if target not in {e.value for e in BundleEnvironment}:
        raise ValueError(
            f"BUNDLE_TARGET must be one of {[e.value for e in BundleEnvironment]}, "
            f"got: {target!r}"
        )

    job_prefix_map = {
        "development": "[${var.version}]",
        "staging": "[${bundle.target}] [master]",
        "production": "[${bundle.target}] [${var.version}]",
    }

    return {
        "environment": target,
        "job_prefix": job_prefix_map[target],
    }


def render(resources: Path, output: Path) -> None:
    """Render workflow templates into the output directory."""
    env = Environment(loader=FileSystemLoader(resources))
    ctx = build_context()

    for filename in resources.rglob("workflows/*.yml.j2"):
        print(f"Rendering {filename.stem}...")
        template_path = filename.relative_to(resources)
        template = env.get_template(template_path.as_posix()).render(ctx)
        (output / template_path.stem).write_text(template)

    print("Done!")


if __name__ == "__main__":
    cleanup(OUTPUT_DIR)
    render(RESOURCES_DIR, OUTPUT_DIR)

Key design choices:

  • BUNDLE_TARGET is the only input — set as an environment variable by the Makefile or CI
  • job_prefix_map controls how jobs are named per environment, mixing DABs variables (${bundle.target}, ${var.version}) with static text
  • cleanup() always starts fresh — no stale rendered files from a previous target
  • Templates are discovered automatically via rglob("workflows/*.yml.j2"), so adding a new job doesn’t require touching the script

Wiring It Into the Makefile

The render step slots into the Makefile chain naturally:

BUNDLE_TARGET ?= development

.PHONY: bundle-render
bundle-render: install-dependencies
	BUNDLE_TARGET=$(BUNDLE_TARGET) uv run python scripts/render_bundle.py

.PHONY: bundle-validate
bundle-validate: bundle-render
	databricks bundle validate --target $(BUNDLE_TARGET)

.PHONY: bundle-deploy
bundle-deploy: bundle-validate
	databricks bundle deploy --target $(BUNDLE_TARGET)

The dependency chain is: bundle-deploy → bundle-validate → bundle-render → install-dependencies. Templates are always rendered fresh before any bundle operation.


Wiring It Into databricks.yml

The root bundle config simply includes the rendered output:

# databricks.yml
bundle:
  name: my_project

include:
  - "dist-bundle/*.yml"

variables:
  version:
    description: >
      Release version extracted from git (branch prefix or tag).

targets:
  development:
    mode: development
    default: true
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com

  staging:
    mode: production
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000

  production:
    mode: production
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000

The include: ["dist-bundle/*.yml"] picks up whatever the render script produced. databricks.yml itself stays clean — just bundle metadata and target definitions.


Testing the Templates

Since the render script is pure Python, it’s straightforward to test:

# tests/test_render_bundle.py
import os
from pathlib import Path
from unittest.mock import patch

import pytest
import yaml

from scripts.render_bundle import build_context, render

RESOURCES = Path(__file__).parent.parent / "bundle"


class TestBuildContext:
    def test_development_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
            ctx = build_context()
        assert ctx["environment"] == "development"
        assert ctx["job_prefix"] == "[${var.version}]"

    def test_staging_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "staging"}):
            ctx = build_context()
        assert ctx["environment"] == "staging"
        assert "[master]" in ctx["job_prefix"]

    def test_production_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
            ctx = build_context()
        assert ctx["environment"] == "production"
        assert "${var.version}" in ctx["job_prefix"]

    def test_invalid_target_raises(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "invalid"}):
            with pytest.raises(ValueError):
                build_context()


class TestRenderIntegration:
    def test_production_renders_valid_yaml_with_schedule(self, tmp_path):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
            render(RESOURCES, tmp_path)

        for yml_file in tmp_path.glob("*.yml"):
            content = yaml.safe_load(yml_file.read_text())
            job = list(content["resources"]["jobs"].values())[0]
            assert "schedule" in job
            assert "git_tag" in yaml.dump(content)

    def test_development_renders_without_schedule(self, tmp_path):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
            render(RESOURCES, tmp_path)

        for yml_file in tmp_path.glob("*.yml"):
            content = yaml.safe_load(yml_file.read_text())
            job = list(content["resources"]["jobs"].values())[0]
            assert "schedule" not in job

You’re testing real template rendering, not mocked YAML — so you catch Jinja syntax errors, broken anchors, and missing variables before they hit databricks bundle validate.


Why Not Just Use DABs Variables?

DABs does have ${var.*} variables, and they’re great for simple string substitution. But they can’t:

  • Conditionally include/exclude blocks (e.g., permissions only in non-prod, schedule only in prod)
  • Compose reusable fragments across jobs (clusters, git source, tags)
  • Generate job names with complex per-environment logic

Jinja handles all of these. And since it runs before DABs, you can still use ${var.*} and ${bundle.*} in the rendered output — they’re just opaque strings to Jinja.
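
Since ${var.version} contains none of Jinja's delimiters, it passes through rendering untouched. A quick sketch of this behavior, assuming jinja2 is installed:

```python
from jinja2 import Template

# Jinja resolves {{ environment }}, but ${var.version} is just text to it.
rendered = Template(
    "git_tag: ${var.version}\nenvironment: {{ environment }}"
).render(environment="production")
print(rendered)
# → git_tag: ${var.version}
#   environment: production
```

DABs then substitutes ${var.version} in its own resolution pass, after the rendered YAML is loaded.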


Why Not Native include: in the Templates?

DABs include: merges top-level YAML files into the bundle. It doesn’t support:

  • Including a fragment inside a job definition
  • YAML anchors defined in one file and referenced in another
  • Any conditional logic

Jinja’s {% include %} is textual inclusion — it pastes the fragment inline before YAML parsing, so anchors and aliases work across fragments. It’s a fundamentally different (and more powerful) mechanism.


Recap

| Concern           | Without Jinja                   | With Jinja                        |
| ----------------- | ------------------------------- | --------------------------------- |
| Cluster config    | Copied in every job file        | Defined once in clusters.yml.j2   |
| Git source        | Copied, manually toggled        | Conditional in base-job.yml.j2    |
| Permissions       | Copied or forgotten             | Conditional in base-job.yml.j2    |
| Schedule          | Copied or forgotten             | Conditional per workflow          |
| New job           | Copy-paste full YAML            | 3 includes + job-specific tasks   |
| Environment logic | Manual file edits               | {{ environment }} and {% if %}    |
| Testing           | databricks bundle validate only | Python unit tests + validate      |

The Jinja layer adds one small Python script and a bundle/ directory to your project. In return, you get composable, testable, environment-aware bundle configurations that scale cleanly from 1 job to 50.