Building a Privacy-Aware DBT Sandbox
Why I built this#
I’ve been thinking a lot lately about what might be next in my career. I’ve always been a huge fan of building things and improving systems, and privacy has long been a topic of interest to me. Big tech has taken surveillance capitalism too far, and we need to work to take back control of our data. Privacy engineering looks like a way to bring those interests together, so I’ve started exploring it.
The Architecture in One Diagram#
flowchart LR
    pg[("Postgres
    (OLTP)")] -.->|"sync_warehouse
    (extract-load only)"| raw
    subgraph duck["DuckDB warehouse"]
        raw["raw_*
        contains PII"]
        stg["staging
        PII hashed via macro"]
        intm["intermediate"]
        safe["marts/safe
        no PII"]
        rest["marts/restricted
        PII re-joined deliberately"]
        raw -->|DBT run| stg --> intm
        intm --> safe
        intm --> rest
    end
Source Declarations and PII Tagging#
The first step in the transformations is to tell DBT about the shape of the raw tables synced from PostgreSQL.
version: 2
sources:
  - name: raw
    description: "Raw tables synced from the django postgres OLTP database."
    schema: main
    tables:
      - name: raw_user
        columns:
          - name: id
            tests: [unique, not_null]
          - name: username
            meta: {pii: true, pii_type: username, sensitivity: medium}
            tags: [pii]
          - name: email
            meta: {pii: true, pii_type: email, sensitivity: medium}
            tags: [pii]
          - name: first_name
            meta: {pii: true, pii_type: name, sensitivity: medium}
            tags: [pii]
          - name: last_name
            meta: {pii: true, pii_type: name, sensitivity: medium}
            tags: [pii]
          - name: organization_id
          - name: date_joined
I’ve included only one table here for brevity, but every PII column in the project carries the same style of metadata and tags. The email column, for example, is declared like this; I use this information later to identify sensitive data in the pipeline.
- name: email
  meta: {pii: true, pii_type: email, sensitivity: medium}
  tags: [pii]
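Because these flags live in dbt’s manifest, they are easy to consume programmatically. Below is a minimal Python sketch of pulling pii-tagged columns out of a parsed target/manifest.json; the unique_id values and the manifest fragment are invented for illustration and only loosely mirror dbt’s real manifest layout.

```python
def pii_columns(manifest):
    """Map node unique_id -> list of column names tagged 'pii'.

    Scans both sources and models in a parsed dbt manifest dict.
    """
    found = {}
    for section in ("sources", "nodes"):
        for unique_id, node in manifest.get(section, {}).items():
            cols = [
                name
                for name, col in node.get("columns", {}).items()
                if "pii" in col.get("tags", [])
            ]
            if cols:
                found[unique_id] = cols
    return found

# Toy manifest fragment mirroring the schema.yml above.
manifest = {
    "sources": {
        "source.test_app.raw.raw_user": {
            "columns": {
                "id": {"tags": []},
                "email": {"tags": ["pii"], "meta": {"pii_type": "email"}},
                "last_name": {"tags": ["pii"], "meta": {"pii_type": "name"}},
            }
        }
    },
    "nodes": {},
}

print(pii_columns(manifest))
# {'source.test_app.raw.raw_user': ['email', 'last_name']}
```

A script like this could feed an audit report or a pre-merge check without ever touching the warehouse itself.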
Masking PII in Staging#
A DBT macro masks data as it moves from the raw tables into staging. The following macro takes a column and outputs a hex-encoded, salted SHA-256 hash of its value.
{# Returns a deterministic salted SHA-256 hex string for a column.
   The salt comes from var('pii_salt'), set in dbt_project.yml.
   Hex output makes the hash safe to store and human-comparable. #}
{% macro hash_pii(column_name) %}
    lower(to_hex(sha256(cast({{ column_name }} as varchar) || '{{ var("pii_salt") }}')))
{% endmacro %}
This isn’t the final solution for production data: hashing doesn’t stop brute-forcing if the salt leaks, and it can’t handle right-to-be-forgotten workflows that require deleting specific records. For analytics, though, it works well because the hash is deterministic: sha256(email + salt) always produces the same string for the same input, so downstream models can still JOIN or GROUP BY on it without ever seeing the raw value.
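To make the determinism concrete, here is the same salted-hash scheme sketched in Python with hashlib. This hash_pii is my stand-in for the DBT macro (which concatenates the column value and the salt before hashing); the salt string is a made-up placeholder, not the project’s real var('pii_salt').

```python
import hashlib

def hash_pii(value, salt):
    # Mirrors the macro: lowercase hex of sha256(value || salt).
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()

salt = "local-dev-salt"  # placeholder for var('pii_salt')
a = hash_pii("ada@example.com", salt)
b = hash_pii("ada@example.com", salt)

assert a == b                                   # deterministic: safe to JOIN / GROUP BY on
assert a != hash_pii("ada@example.com", "xyz")  # a different salt changes every hash
print(a[:16], "...")
```

The two assertions are the whole point: equal inputs stay joinable across tables, while the salt keeps a leaked table of hashes from matching a public rainbow table.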
One thing that surprised me: this isn’t a database function. DBT macros expand at compile time, so DBT replaces the {{ hash_pii('email') }} call with the full SHA-256 expression before any SQL is sent to DuckDB. The database never sees a function call.
Staging and Intermediate Tables#
With the raw tables declared, I defined the staging models in plain SQL. Here’s an example.
-- Users table
select
    id as user_id,
    organization_id,
    {{ hash_pii('username') }} as username_hash,
    {{ hash_pii('email') }} as email_hash,
    {{ hash_pii('phone') }} as phone_hash,
    date_joined as joined_at
from {{ source('raw', 'raw_user') }}
This gives me a users table with the sensitive columns masked by the hash_pii macro. I also defined tests for these models in _schema.yml, which check that the data matches the expected shape. Here’s an example.
version: 2
models:
  - name: stg_users
    columns:
      - name: user_id
        tests: [unique, not_null]
      - name: organization_id
I also created intermediate tables during this step. Intermediate models hold the more complex logic, such as joins and multi-step transformations, which keeps the top-level mart SQL simple and the project manageable.
-- Events Enriched
select
    e.event_id,
    e.event_at,
    cast(e.event_at as date) as event_date,
    e.event_type,
    e.user_id,
    e.organization_id,
    o.organization_name,
    e.metadata
from {{ ref('stg_events') }} e
left join {{ ref('stg_organizations') }} o
    on e.organization_id = o.organization_id
Access Tiers via Schema Separation#
With the data landed in staging, I can begin building the data marts. I chose two separate marts: one without PII and one with it. Database roles can then restrict access to each.
The split is configured in dbt_project.yml: each subfolder in marts/ maps to a DuckDB schema, so the boundary lives in the warehouse layer rather than in the SQL of individual models.
models:
  test_app:
    marts:
      safe: {+schema: marts, +materialized: table}
      restricted: {+schema: marts_restricted, +materialized: table}
Anything in marts/safe/ lands in the marts schema; anything in marts/restricted/ lands in marts_restricted. In a real warehouse, you’d give the analyst role read access to marts and reserve marts_restricted for a PII-cleared role, so the tier is genuinely enforced at the warehouse layer.
With that config in place, the SQL for each mart is standard SELECT statements. I created three tables in the safe schema.
-- Appointment Funnel
select
    organization_id,
    date_trunc('week', scheduled_at)::date as week_start,
    count(*) as scheduled,
    count(*) filter (where status = 'completed') as completed,
    count(*) filter (where status = 'canceled') as canceled,
    count(*) filter (where status = 'no_show') as no_show,
    round(count(*) filter (where status = 'completed')::double / nullif(count(*), 0), 3) as completion_rate
from {{ ref('int_appointment_lifecycle') }}
group by 1, 2
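The completion_rate expression relies on nullif to avoid dividing by zero: when a group has no rows in the denominator, nullif(count(*), 0) turns it into NULL and the whole ratio becomes NULL instead of raising an error. The same guard, sketched in Python:

```python
def completion_rate(completed, scheduled):
    # Mirrors round(completed / nullif(scheduled, 0), 3):
    # a zero denominator yields None (SQL NULL) rather than an error.
    if scheduled == 0:
        return None
    return round(completed / scheduled, 3)

print(completion_rate(7, 10))  # 0.7
print(completion_rate(0, 0))   # None
```

NULL-on-empty is usually the right analytics behavior: a week with no scheduled appointments has an undefined completion rate, not a zero one.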
-- Event Type Volume
select
    event_type,
    event_date,
    count(*) as count,
    count(distinct user_id) as user_count
from {{ ref('int_events_enriched') }}
group by 1, 2
-- Organization Daily Activity
select
    organization_id,
    organization_name,
    event_date,
    count(*) as event_count,
    count(*) filter (where event_type = 'login') as login_count,
    count(*) filter (where event_type = 'appointment_created') as appointment_created_count,
    count(distinct user_id) as active_user_count
from {{ ref('int_events_enriched') }}
group by 1, 2, 3
The restricted mart deliberately rehydrates PII. The model below pulls the original names by reaching back to the raw_user source and joining them onto aggregated activity.
-- User Engagement (With PII)
with user_pii as (
    select
        id as user_id,
        first_name,
        last_name
    from {{ source('raw', 'raw_user') }}
),

user_events as (
    select
        user_id,
        count(*) as total_events,
        max(event_at) as last_active_at
    from {{ ref('stg_events') }}
    group by 1
),

user_appointments as (
    select
        provider_id as user_id,
        count(*) as appointments_managed
    from {{ ref('stg_appointments') }}
    group by 1
)

select
    up.user_id,
    up.first_name,
    up.last_name,
    coalesce(ue.total_events, 0) as total_events,
    coalesce(ua.appointments_managed, 0) as appointments_managed,
    ue.last_active_at
from user_pii up
left join user_events ue on up.user_id = ue.user_id
left join user_appointments ua on up.user_id = ua.user_id
A Test That Enforces the Privacy Contract#
So far, the no-PII-in-safe-marts contract has been enforced only by convention: safe marts read from staging, where PII is already hashed, and only the restricted mart deliberately reaches back to raw. Nothing checks that the convention is actually honored. The singular test below enforces it automatically. Claude wrote most of it; I’ll come back to this test in a future post. Stay tuned!
The test scans DBT’s manifest for every column tagged pii on a source or model. For each model in marts/safe/, it checks whether the model’s raw SQL references any of those PII-tagged columns from its upstream dependencies. Each hit returns a row, and the test fails when the row count is non-zero.
{# Singular test: fails (returns rows) if any model in marts/safe directly
   references a column tagged 'pii' from a source or upstream model.
   How: walks graph.nodes for models whose path starts with marts/safe/,
   gathers their depends_on nodes, then checks each referenced node's
   columns for tag:pii.
   Limitations: detects direct column references in the model's raw SQL
   via a word-boundary regex match against PII column names from
   upstream nodes. Good enough for a learning sandbox; a production
   version would parse SQL via sqlglot or use column-level lineage
   from dbt's manifest. #}

{% set pii_columns_by_node = {} %}

{# Collect PII columns from all sources #}
{% for node in graph.sources.values() %}
    {% set pii_cols = [] %}
    {% for col_name, col in node.columns.items() %}
        {% if 'pii' in (col.tags or []) %}
            {% do pii_cols.append(col_name) %}
        {% endif %}
    {% endfor %}
    {% if pii_cols %}
        {% do pii_columns_by_node.update({node.unique_id: pii_cols}) %}
    {% endif %}
{% endfor %}

{# Collect PII columns from all models (e.g. restricted marts) #}
{% for node in graph.nodes.values() %}
    {% if node.resource_type == 'model' %}
        {% set pii_cols = [] %}
        {% for col_name, col in node.columns.items() %}
            {% if 'pii' in (col.tags or []) %}
                {% do pii_cols.append(col_name) %}
            {% endif %}
        {% endfor %}
        {% if pii_cols %}
            {% do pii_columns_by_node.update({node.unique_id: pii_cols}) %}
        {% endif %}
    {% endif %}
{% endfor %}

{# For each safe mart, check whether its raw SQL references a PII column from an upstream node #}
with violations as (
    {% for node in graph.nodes.values() %}
        {% if node.resource_type == 'model' and 'marts/safe' in node.path %}
            {% set model_sql = node.raw_code or '' %}
            {% for upstream_id, pii_cols in pii_columns_by_node.items() %}
                {% if upstream_id in node.depends_on.nodes %}
                    {% for pii_col in pii_cols %}
                        {# word-boundary match so 'email' does not hit 'email_hash' #}
                        {% if model_sql and modules.re.search('\\b' ~ pii_col ~ '\\b', model_sql) %}
    select
        '{{ node.unique_id }}' as safe_model,
        '{{ upstream_id }}' as upstream,
        '{{ pii_col }}' as pii_column_referenced
    union all
                        {% endif %}
                    {% endfor %}
                {% endif %}
            {% endfor %}
        {% endif %}
    {% endfor %}
    select null as safe_model, null as upstream, null as pii_column_referenced where 1 = 0
)
select * from violations
where safe_model is not null
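The same check can also run outside dbt, for example as a plain CI script over target/manifest.json. Below is a Python sketch of the core logic; the node dictionaries are invented toy data that only loosely mirror dbt’s manifest shape, and find_violations is a hypothetical helper, not part of any dbt API.

```python
import re

def find_violations(nodes, pii_by_node):
    """Return (safe_model, upstream, column) triples where a safe mart's
    raw SQL mentions a PII-tagged column from one of its upstream nodes."""
    violations = []
    for unique_id, node in nodes.items():
        if "marts/safe" not in node.get("path", ""):
            continue
        sql = node.get("raw_code", "")
        for upstream_id in node.get("depends_on", {}).get("nodes", []):
            for col in pii_by_node.get(upstream_id, []):
                # Word-boundary match: 'email' must not match 'email_hash'.
                if re.search(rf"\b{re.escape(col)}\b", sql):
                    violations.append((unique_id, upstream_id, col))
    return violations

# Toy nodes: one clean safe mart, one that leaks a raw column.
pii = {"source.app.raw.raw_user": ["email", "first_name"]}
nodes = {
    "model.app.mart_ok": {
        "path": "marts/safe/mart_ok.sql",
        "raw_code": "select email_hash from stg_users",
        "depends_on": {"nodes": ["source.app.raw.raw_user"]},
    },
    "model.app.mart_leaky": {
        "path": "marts/safe/mart_leaky.sql",
        "raw_code": "select email from raw_user",
        "depends_on": {"nodes": ["source.app.raw.raw_user"]},
    },
}
print(find_violations(nodes, pii))
# [('model.app.mart_leaky', 'source.app.raw.raw_user', 'email')]
```

Running this in CI gives the same fail-on-violation behavior as the singular test, without needing a warehouse connection.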
Lineage as a Privacy Artifact#
DBT automatically builds a dependency graph from the project: every {{ ref(...) }} and {{ source(...) }} call becomes an edge. Running dbt docs generate builds the metadata, and dbt docs serve opens a browser UI where I can click through the entire DAG, see what feeds what, and trace any column back to its source.
For the privacy angle, the useful trick is the tag filter. Because the source declarations tag the PII columns, filtering the lineage graph by tag:pii lights up every model that handles sensitive data. An auditor who wants to confirm that mart_event_type_volume is PII-free doesn’t need to take my word for it; they can filter the graph and see that no PII-tagged node connects to it.
What makes this load-bearing is that it’s regenerated from the code on every run: there’s no separate diagram or architecture document to fall out of date. If I create a new path that brings PII into a safe mart, the lineage shows it right away. This complements the earlier test: the test actively enforces the rule by failing CI on a violation, while the lineage provides the passive evidence that lets a human verify by looking.
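The auditor’s question ("does any PII-tagged node connect to this mart?") is a reachability check over the DAG. Here is a small Python sketch of that upstream walk over a toy parent map; a real version would build the map from the manifest’s dependency graph, and note that node-level reachability is coarser than true column-level lineage.

```python
from collections import deque

def pii_tainted_ancestors(model, parents, pii_nodes):
    """BFS upstream from a model; return any ancestors carrying PII tags."""
    seen, hits = set(), set()
    queue = deque(parents.get(model, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in pii_nodes:
            hits.add(node)
        queue.extend(parents.get(node, []))
    return hits

# Toy DAG loosely mirroring the project: raw -> staging -> intermediate -> marts.
parents = {
    "mart_event_type_volume": ["int_events_enriched"],
    "mart_user_engagement_pii": ["raw_user", "stg_events"],
    "int_events_enriched": ["stg_events", "stg_organizations"],
    "stg_events": ["raw_event"],
    "stg_organizations": ["raw_organization"],
}
pii_nodes = {"raw_user"}  # nodes whose columns carry the pii tag

print(pii_tainted_ancestors("mart_event_type_volume", parents, pii_nodes))   # set()
print(pii_tainted_ancestors("mart_user_engagement_pii", parents, pii_nodes)) # {'raw_user'}
```

An empty result for a safe mart is exactly the evidence the lineage UI shows visually.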
What’s next#
This was a great learning experience: I went from never having used DBT to having a functional example to explore. Next up is a pure data project, where I’ll analyze a dataset and set up a DBT pipeline to build the data mart(s) it needs, continuing to grow my DBT skills along the way.
