Keeping Your Databricks Asset Bundles Modular and Clean with Jinja2
Databricks Asset Bundles (DABs) are a great way to manage your data pipelines as code — version-controlled YAML that defines jobs, clusters, schedules, and permissions. But as your project grows beyond a handful of jobs, your bundle configuration gets unwieldy fast.
DABs supports `include:` directives to pull in multiple YAML files, but that’s about as far as native modularity goes. There’s no inheritance, no conditionals, no environment-aware logic, and no reusable fragments. You end up copy-pasting cluster definitions, permission blocks, and git source configs across every job.
This post walks through how we solved this by introducing Jinja2 templating as a pre-render step — keeping our DABs DRY, readable, and truly environment-aware.
The Problem: YAML Repetition at Scale
Consider a project with multiple jobs. Each one needs:
- A cluster definition (same policy, same Spark version, same env vars)
- A git source block (branch in dev/staging, tag in production)
- Permissions (needed in dev, not in production)
- A schedule (only in production)
- Tags (same across all jobs)
In vanilla DABs, every job file repeats all of this. A 5-job project means 5 copies of the cluster block, 5 copies of the git source, 5 copies of the tags. Change the Spark version? Touch 5 files.
DABs’ `include:` only works at the top level — it merges separate YAML files into the bundle. But it doesn’t give you reusable fragments within a job definition. There’s no `!include` for nested blocks, no YAML anchors across files, and no way to say “use this cluster block, but only add a schedule in production.”
The Solution: Jinja2 as a Pre-Render Layer
The idea is simple:
- Write your bundle resources as Jinja2 templates (`.yml.j2`)
- Extract shared fragments into include files
- Run a Python script that renders the templates into plain YAML
- Point `databricks.yml` at the rendered output
DABs never sees any Jinja — it only sees the final, valid YAML. Jinja runs before the bundle is ever validated or deployed.
Project Structure
```
.
├── databricks.yml          # Root bundle config, includes rendered output
├── bundle/
│   ├── includes/           # Reusable Jinja fragments
│   │   ├── base-job.yml.j2
│   │   ├── clusters.yml.j2
│   │   └── schedule.yml.j2
│   └── workflows/          # One template per job
│       ├── my_etl_pipeline.yml.j2
│       └── my_ml_pipeline.yml.j2
├── dist-bundle/            # Rendered output (gitignored)
│   ├── my_etl_pipeline.yml
│   └── my_ml_pipeline.yml
└── scripts/
    └── render_bundle.py    # The render script
```
The key insight: `bundle/` contains templates, `dist-bundle/` contains rendered YAML. Only the rendered files are included by `databricks.yml`.
Building the Include Fragments
Base Job
The base job fragment handles git source, permissions, and tags — the stuff every job needs:
```yaml
# bundle/includes/base-job.yml.j2
base-job: &base-job
  {% if environment != 'production' %}
  permissions:
    - group_name: "my_team_group"
      level: "CAN_MANAGE"
  {% endif %}
  git_source:
    git_url: https://github.com/my-org/my-repo.git
    git_provider: gitHub
    {% if environment == 'production' %}
    git_tag: ${var.version}
    {% else %}
    git_branch: ${bundle.git.branch}
    {% endif %}
  tags:
    "team": DataEngineering
    "environment": {{ environment }}
```
Notice how this single fragment handles three concerns:

- Permissions: only added outside production (developers need `CAN_MANAGE` to iterate; production is locked down)
- Git source: production pins to a release tag; other environments track the branch
- Tags: the environment name is injected dynamically
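To see the conditional behaviour concretely, here is a minimal sketch (assuming Jinja2 is installed) that renders a stripped-down version of the fragment for two targets:

```python
from jinja2 import Template

# Stripped-down stand-in for base-job.yml.j2, for illustration only
fragment = """\
{% if environment != 'production' %}
permissions:
  - group_name: "my_team_group"
    level: "CAN_MANAGE"
{% endif %}
git_source:
{% if environment == 'production' %}
  git_tag: ${var.version}
{% else %}
  git_branch: ${bundle.git.branch}
{% endif %}
"""

dev = Template(fragment).render(environment="development")
prod = Template(fragment).render(environment="production")

# dev keeps the permissions block and tracks the branch;
# prod drops permissions and pins the release tag
```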
Clusters
```yaml
# bundle/includes/clusters.yml.j2
my_clusters: &my_clusters
  job_clusters:
    - job_cluster_key: main_cluster
      new_cluster:
        policy_id: 000EE9580030D241
        spark_version: 16.4.x-scala2.12
        spark_env_vars:
          ENVIRONMENT: {{ environment }}
        autoscale:
          min_workers: 2
          max_workers: 8
```
One place to update the Spark version, cluster policy, or scaling config.
Schedule
```yaml
# bundle/includes/schedule.yml.j2
everyday_5am: &everyday_5am
  quartz_cron_expression: "0 0 5 * * ?"
  timezone_id: "Europe/Rome"
  pause_status: "PAUSED"
```
Composing a Workflow
With the fragments in place, a workflow template is remarkably concise:
```yaml
# bundle/workflows/my_etl_pipeline.yml.j2
{% include "includes/base-job.yml.j2" %}
{% include "includes/clusters.yml.j2" %}
{% include "includes/schedule.yml.j2" %}

resources:
  jobs:
    my_etl_pipeline:
      name: "{{ job_prefix }} my_etl_pipeline"
      <<: [*base-job, *my_clusters]
      {% if environment == 'production' %}
      schedule: *everyday_5am
      {% endif %}
      tasks:
        - task_key: ingest_raw_data
          notebook_task:
            notebook_path: src/pipelines/ingest_raw_data
          job_cluster_key: main_cluster
        - task_key: transform_silver
          depends_on:
            - task_key: ingest_raw_data
          notebook_task:
            notebook_path: src/pipelines/transform_silver
          job_cluster_key: main_cluster
```
The `{% include %}` directives pull in the YAML anchors (`&base-job`, `&my_clusters`, `&everyday_5am`), and the `<<:` merge key applies them. Adding a new job is a matter of copying this template and changing the job name and tasks — the infra boilerplate is inherited.
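After rendering with the production target, the output DABs consumes looks roughly like this (abridged: the cluster fragment is omitted and the schedule alias expanded so the snippet stands on its own):

```yaml
# dist-bundle/my_etl_pipeline.yml (abridged, production target)
base-job: &base-job
  git_source:
    git_url: https://github.com/my-org/my-repo.git
    git_provider: gitHub
    git_tag: ${var.version}
  tags:
    "team": DataEngineering
    "environment": production

resources:
  jobs:
    my_etl_pipeline:
      name: "[${bundle.target}] [${var.version}] my_etl_pipeline"
      <<: *base-job
      schedule:
        quartz_cron_expression: "0 0 5 * * ?"
        timezone_id: "Europe/Rome"
        pause_status: "PAUSED"
```

No Jinja remains; the `${...}` expressions are left for DABs to resolve at deploy time.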
How YAML Anchors + Jinja Include Work Together
This is the trick that makes it all click. Jinja’s `{% include %}` inserts the fragment’s text verbatim into the template before YAML parsing. So the rendered output is a single YAML document where the anchors (`&base-job`) and aliases (`*base-job`) are all in scope.

The `<<:` merge key is standard YAML — it merges all fields from the referenced anchor into the current mapping. By listing multiple anchors (`<<: [*base-job, *my_clusters]`), you compose multiple fragments into one job definition.
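You can verify the merge-key mechanics with PyYAML alone, since this part is plain YAML with no Jinja involved. A minimal sketch with toy fragment contents:

```python
import yaml

# A single document, as it would look after Jinja pastes the fragments inline
rendered = """
base-job: &base-job
  tags:
    team: DataEngineering
my_clusters: &my_clusters
  job_clusters:
    - job_cluster_key: main_cluster
job:
  <<: [*base-job, *my_clusters]
  name: my_etl_pipeline
"""

doc = yaml.safe_load(rendered)
job = doc["job"]
# The merge key folded both anchors' fields into the job mapping,
# alongside the job's own keys
```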
The Render Script
The render script is intentionally simple — just Jinja2 with a FileSystemLoader:
```python
# scripts/render_bundle.py
import os
import shutil
from enum import StrEnum
from pathlib import Path
from typing import Any

from jinja2 import Environment, FileSystemLoader

RESOURCES_DIR = Path(__file__).parent.parent / "bundle"
OUTPUT_DIR = Path(__file__).parent.parent / "dist-bundle"


class BundleEnvironment(StrEnum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION = "production"


def cleanup(directory: Path) -> None:
    """Recreate the output directory from scratch."""
    shutil.rmtree(directory, ignore_errors=True)
    directory.mkdir(parents=True, exist_ok=True)


def build_context() -> dict[str, Any]:
    """Build template context from environment variables."""
    target = os.getenv("BUNDLE_TARGET")
    if target not in {e.value for e in BundleEnvironment}:
        raise ValueError(
            f"BUNDLE_TARGET must be one of {[e.value for e in BundleEnvironment]}, "
            f"got: {target!r}"
        )
    job_prefix_map = {
        "development": "[${var.version}]",
        "staging": "[${bundle.target}] [master]",
        "production": "[${bundle.target}] [${var.version}]",
    }
    return {
        "environment": target,
        "job_prefix": job_prefix_map[target],
    }


def render(resources: Path, output: Path) -> None:
    """Render workflow templates into the output directory."""
    env = Environment(loader=FileSystemLoader(resources))
    ctx = build_context()
    for filename in resources.rglob("workflows/*.yml.j2"):
        print(f"Rendering {filename.stem}...")
        template_path = filename.relative_to(resources)
        template = env.get_template(template_path.as_posix()).render(ctx)
        (output / template_path.stem).write_text(template)
    print("Done!")


if __name__ == "__main__":
    cleanup(OUTPUT_DIR)
    render(RESOURCES_DIR, OUTPUT_DIR)
```
Key design choices:

- `BUNDLE_TARGET` is the only input — set as an environment variable by the Makefile or CI
- `job_prefix_map` controls how jobs are named per environment, mixing DABs variables (`${bundle.target}`, `${var.version}`) with static text
- `cleanup()` always starts fresh — no stale rendered files from a previous target
- Templates are discovered automatically via `rglob("workflows/*.yml.j2")`, so adding a new job doesn’t require touching the script
Wiring It Into the Makefile
The render step slots into the Makefile chain naturally:
```makefile
BUNDLE_TARGET ?= development

.PHONY: bundle-render
bundle-render: install-dependencies
	BUNDLE_TARGET=$(BUNDLE_TARGET) uv run python scripts/render_bundle.py

.PHONY: bundle-validate
bundle-validate: bundle-render
	databricks bundle validate --target $(BUNDLE_TARGET)

.PHONY: bundle-deploy
bundle-deploy: bundle-validate
	databricks bundle deploy --target $(BUNDLE_TARGET)
```
The dependency chain is: `bundle-deploy` → `bundle-validate` → `bundle-render` → `install-dependencies`. Templates are always rendered fresh before any bundle operation.
Wiring It Into databricks.yml
The root bundle config simply includes the rendered output:
```yaml
# databricks.yml
bundle:
  name: my_project

include:
  - "dist-bundle/*.yml"

variables:
  version:
    description: >
      Release version extracted from git (branch prefix or tag).

targets:
  development:
    mode: development
    default: true
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com
  staging:
    mode: production
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000
  production:
    mode: production
    workspace:
      host: https://my-databricks-instance.cloud.databricks.com
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000
```
The `include: ["dist-bundle/*.yml"]` picks up whatever the render script produced. `databricks.yml` itself stays clean — just bundle metadata and target definitions.
Testing the Templates
Since the render script is pure Python, it’s straightforward to test:
```python
# tests/test_render_bundle.py
import os
from pathlib import Path
from unittest.mock import patch

import pytest
import yaml

from scripts.render_bundle import build_context, render

RESOURCES = Path(__file__).parent.parent / "bundle"


class TestBuildContext:
    def test_development_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
            ctx = build_context()
        assert ctx["environment"] == "development"
        assert ctx["job_prefix"] == "[${var.version}]"

    def test_staging_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "staging"}):
            ctx = build_context()
        assert ctx["environment"] == "staging"
        assert "[master]" in ctx["job_prefix"]

    def test_production_context(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
            ctx = build_context()
        assert ctx["environment"] == "production"
        assert "${var.version}" in ctx["job_prefix"]

    def test_invalid_target_raises(self):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "invalid"}):
            with pytest.raises(ValueError):
                build_context()


class TestRenderIntegration:
    def test_production_renders_valid_yaml_with_schedule(self, tmp_path):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "production"}):
            render(RESOURCES, tmp_path)
        for yml_file in tmp_path.glob("*.yml"):
            content = yaml.safe_load(yml_file.read_text())
            job = list(content["resources"]["jobs"].values())[0]
            assert "schedule" in job
            assert "git_tag" in yaml.dump(content)

    def test_development_renders_without_schedule(self, tmp_path):
        with patch.dict(os.environ, {"BUNDLE_TARGET": "development"}):
            render(RESOURCES, tmp_path)
        for yml_file in tmp_path.glob("*.yml"):
            content = yaml.safe_load(yml_file.read_text())
            job = list(content["resources"]["jobs"].values())[0]
            assert "schedule" not in job
```
You’re testing real template rendering, not mocked YAML — so you catch Jinja syntax errors, broken anchors, and missing variables before they hit `databricks bundle validate`.
Why Not Just Use DABs Variables?
DABs does have `${var.*}` variables, and they’re great for simple string substitution. But they can’t:
- Conditionally include/exclude blocks (e.g., permissions only in non-prod, schedule only in prod)
- Compose reusable fragments across jobs (clusters, git source, tags)
- Generate job names with complex per-environment logic
Jinja handles all of these. And since it runs before DABs, you can still use `${var.*}` and `${bundle.*}` in the rendered output — they’re just opaque strings to Jinja.
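The pass-through is easy to confirm: Jinja only reacts to its own `{{ }}` and `{% %}` delimiters, so DABs’ `${...}` expressions survive rendering as literal text. A minimal sketch:

```python
from jinja2 import Template

# One line from a workflow template: a Jinja variable next to two DABs variables
line = Template('name: "{{ job_prefix }} my_etl_pipeline"  # tag: ${var.version}')
out = line.render(job_prefix="[${bundle.target}]")

# Jinja substituted job_prefix; both ${...} expressions are untouched,
# left for DABs to resolve later
```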
Why Not Native include: in the Templates?
DABs’ `include:` merges top-level YAML files into the bundle. It doesn’t support:
- Including a fragment inside a job definition
- YAML anchors defined in one file and referenced in another
- Any conditional logic
Jinja’s `{% include %}` is textual inclusion — it pastes the fragment inline before YAML parsing, so anchors and aliases work across fragments. It’s a fundamentally different (and more powerful) mechanism.
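A compact way to see cross-fragment anchors in action, using `DictLoader` to stand in for the real `FileSystemLoader` (toy fragment contents, for illustration only):

```python
import yaml
from jinja2 import DictLoader, Environment

templates = {
    "includes/base.yml.j2": "base: &base\n  tags:\n    team: DataEngineering",
    "job.yml.j2": (
        '{% include "includes/base.yml.j2" %}\n'
        "job:\n"
        "  <<: *base\n"
        "  name: my_etl_pipeline\n"
    ),
}

env = Environment(loader=DictLoader(templates))
rendered = env.get_template("job.yml.j2").render()

# The include was textual, so &base (defined in one fragment) and *base
# (used in another) end up in the same YAML document
doc = yaml.safe_load(rendered)
```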
Recap
| Concern | Without Jinja | With Jinja |
|---|---|---|
| Cluster config | Copied in every job file | Defined once in `clusters.yml.j2` |
| Git source | Copied, manually toggled | Conditional in `base-job.yml.j2` |
| Permissions | Copied or forgotten | Conditional in `base-job.yml.j2` |
| Schedule | Copied or forgotten | Conditional per workflow |
| New job | Copy-paste full YAML | 3 includes + job-specific tasks |
| Environment logic | Manual file edits | `{{ environment }}` and `{% if %}` |
| Testing | `databricks bundle validate` only | Python unit tests + validate |
The Jinja layer adds one small Python script and a bundle/ directory to your project. In return, you get composable, testable, environment-aware bundle configurations that scale cleanly from 1 job to 50.