feat(showcase-ecommerce-databricks): add standalone Databricks datapack #200

Open
askumar27 wants to merge 2 commits into main from feat/showcase-ecommerce-add-databricks

askumar27 commented Jun 3, 2026

Summary

  • Adds a new showcase-ecommerce-databricks datapack that mirrors the existing Snowflake showcase-ecommerce datapack with Databricks Unity Catalog as the warehouse platform
  • Built on top of alexsku's feat/showcase-ecommerce-add-queries branch, so it inherits the 312 Snowflake query entities from #198 (feat(showcase-ecommerce): add 312 query entities for text-to-sql / anchor demo)
  • Can be loaded independently or alongside the Snowflake datapack — no URN conflicts

What's in the datapack

| File | MCPs | Content |
|---|---|---|
| 01-definitions.json | 10 | Structured properties (shared with Snowflake pack) |
| 02-shared.json | 3,056 | All cross-platform entities (dbt, Looker, PowerBI, Tableau, Spark, Postgres, S3, users, glossary, domains, tags) with Snowflake order_entry_db refs remapped to Databricks |
| 03-data.json | 777 | 14 Databricks datasets + 4 containers + 490 schema fields with full governance (tags, glossary, ownership, domains, lineage, siblings, structured properties, editable descriptions, test results, usage/storage features) |
| 04-queries.json | 1,248 | 312 queries translated from Snowflake SQL to Databricks SQL |
| 05-context.json | 54 | Knowledge documents (shared with Snowflake pack) |

SQL dialect translation (312 queries)

| Snowflake | Databricks | Count |
|---|---|---|
| `expr::TIMESTAMP` / `::DATE` | `CAST(expr AS TIMESTAMP/DATE)` | 143 |
| `DATEADD('day', ...)` | `DATEADD(day, ...)` | 45 |
| `DATEDIFF('day', ...)` | `DATEDIFF(day, ...)` | 34 |
| `TO_CHAR(..., 'YYYY-MM')` | `DATE_FORMAT(..., 'yyyy-MM')` | 1 |
| `LISTAGG(DISTINCT col, sep)` | `ARRAY_JOIN(COLLECT_SET(col), sep)` | 1 |
| `ROW_COUNT` (INFORMATION_SCHEMA) | Removed (Databricks-incompatible) | 1 |

All 312 queries validated against a live Databricks serverless warehouse.
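
As an illustration of the rewrites in the table above, here is a minimal Python sketch of regex-based translation. It is not the tooling used in this PR: the `RULES` list and `translate` function are hypothetical, and the patterns only cover simple, non-nested expressions. A real translation would more likely rely on a SQL parser.

```python
import re

# Hypothetical rewrite rules mirroring the table above.
# Each regex handles only simple, non-nested expressions.
RULES = [
    # expr::TIMESTAMP / expr::DATE -> CAST(expr AS TIMESTAMP/DATE)
    (re.compile(r"([\w.]+)::(TIMESTAMP|DATE)\b", re.I), r"CAST(\1 AS \2)"),
    # DATEADD('day', ...) / DATEDIFF('day', ...) -> unquoted unit
    (re.compile(r"\b(DATEADD|DATEDIFF)\(\s*'(\w+)'", re.I), r"\1(\2"),
    # TO_CHAR(x, 'YYYY-MM') -> DATE_FORMAT(x, 'yyyy-MM')
    (re.compile(r"\bTO_CHAR\(([^,()]+),\s*'YYYY-MM'\)", re.I),
     r"DATE_FORMAT(\1, 'yyyy-MM')"),
    # LISTAGG(DISTINCT col, sep) -> ARRAY_JOIN(COLLECT_SET(col), sep)
    (re.compile(r"\bLISTAGG\(\s*DISTINCT\s+([\w.]+)\s*,\s*('[^']*')\s*\)", re.I),
     r"ARRAY_JOIN(COLLECT_SET(\1), \2)"),
]


def translate(sql: str) -> str:
    """Apply each rewrite rule in order and return the rewritten SQL."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql
```

For example, `translate("SELECT DATEADD('day', 7, order_date) FROM orders")` returns the same statement with the unit unquoted. Rewritten output should still be executed against a warehouse, as was done for the 312 queries here, since regexes cannot see nesting or string literals.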

Parity verification

Every entity type matches or exceeds the Snowflake datapack (5,143 vs 5,119 MCPs). The only intentional difference is +23 container MCPs (Databricks catalog/schema containers from ingestion).
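
A parity check of this kind could be sketched as follows. This is an assumption-laden illustration, not the script used for the PR: it assumes each MCP is a dict with an `entityUrn` key of the form `urn:li:<entityType>:<key>`, and the function names are hypothetical.

```python
from collections import Counter


def entity_type(urn: str) -> str:
    """Extract the entity type from a URN shaped like urn:li:<entityType>:<key>."""
    return urn.split(":")[2]


def count_by_entity_type(mcps: list) -> Counter:
    """Count MCPs per entity type (assumes each MCP dict has an 'entityUrn' key)."""
    return Counter(entity_type(m["entityUrn"]) for m in mcps)


def parity_gaps(reference: list, candidate: list) -> dict:
    """Return entity types where the candidate pack has fewer MCPs than the reference.

    Maps entity type -> (reference_count, candidate_count); an empty dict
    means the candidate matches or exceeds the reference for every type.
    """
    ref = count_by_entity_type(reference)
    cand = count_by_entity_type(candidate)
    return {t: (n, cand[t]) for t, n in ref.items() if cand[t] < n}
```

An empty result from `parity_gaps(snowflake_mcps, databricks_mcps)` would correspond to the "matches or exceeds" claim; extra container MCPs on the Databricks side would not show up as gaps.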

Test plan

  • All 312 translated queries executed successfully on Databricks serverless warehouse
  • Datapack loaded into local DataHub quickstart (datahub datapack load)
  • All 5 files loaded with zero failures (4,905 events total)
  • Entity-by-entity parity check against Snowflake datapack
  • No Snowflake URN leaks in Databricks entities or aspect values
  • Lineage verified — analytics.order_details shows 11 Databricks upstream tables
  • Siblings verified — Databricks datasets linked to dbt counterparts
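
The "no Snowflake URN leaks" check could look roughly like the sketch below. It assumes a datapack file parses to nested JSON and that a leak appears as the substring `urn:li:dataPlatform:snowflake` in any string value; `find_snowflake_refs` is a hypothetical helper, not part of the datahub CLI.

```python
# Platform URN prefix to look for in Databricks entities and aspect values.
MARKER = "urn:li:dataPlatform:snowflake"


def find_snowflake_refs(node, path="$"):
    """Recursively collect JSON paths of string values containing MARKER."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_snowflake_refs(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_snowflake_refs(value, f"{path}[{index}]")
    elif isinstance(node, str) and MARKER in node:
        hits.append(path)
    return hits
```

Running it over the result of `json.load` on a file such as 03-data.json and asserting that the returned list is empty would express the check. Because it matches substrings, it also catches URNs embedded in serialized aspect payloads.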

vercel Bot commented Jun 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| datahub-docs-archive | Ignored | Ignored | Jun 3, 2026 7:52pm |

Add a new `showcase-ecommerce-databricks` datapack that mirrors the existing
Snowflake showcase-ecommerce datapack with Databricks as the warehouse platform.

- 14 Databricks datasets (order_entry_db.order_entry + analytics schemas)
- Full governance parity: tags, glossary terms, ownership, domains, structured
  properties, editable descriptions, lineage, siblings, schema fields
- 312 queries translated from Snowflake SQL to Databricks SQL dialect
  (CAST syntax, DATEADD/DATEDIFF units, LISTAGG -> ARRAY_JOIN, TO_CHAR -> DATE_FORMAT)
- All cross-platform entities included (dbt, Looker, PowerBI, Tableau, Spark, etc.)
  with Snowflake order_entry_db references remapped to Databricks
- Can be loaded independently or alongside the Snowflake datapack (no URN conflicts)
askumar27 force-pushed the feat/showcase-ecommerce-add-databricks branch from 7838188 to 22c9e0c on June 3, 2026 00:12
…r ingestion

Add queryUsageFeatures aspect to all 312 query entities so semantic-anchor
ingestion can discover and group them. Without this aspect, queries default
to exec_count=0/users=[] and get filtered by the min_distinct_users threshold.

- Cluster queries (160, 50 intents): exec_count 15-45, 3-4 users each
  Users derived from SQL author comments + intent-group sharing
- Noise queries (152): exec_count 1-5, 1-2 users (correctly below threshold)
- Uses the 8 author personas from Alex's query corpus README
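
To illustrate why the missing aspect matters, here is a rough sketch of a distinct-user filter. The `QueryUsage` class, the `passes_threshold` function, and the threshold value of 3 are all assumptions made for illustration; the actual semantic-anchor ingestion code and its configured `min_distinct_users` value are not shown in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class QueryUsage:
    """Hypothetical stand-in for the usage data read from queryUsageFeatures.

    The defaults model a query with no aspect: exec_count=0 and users=[].
    """
    exec_count: int = 0
    users: list = field(default_factory=list)


def passes_threshold(usage: QueryUsage, min_distinct_users: int) -> bool:
    """Keep a query only if enough distinct users have run it."""
    return len(set(usage.users)) >= min_distinct_users


# Assumed threshold, chosen so that 3-4 users pass and 1-2 users do not.
MIN_DISTINCT_USERS = 3

no_aspect = QueryUsage()
cluster = QueryUsage(exec_count=30, users=["u1", "u2", "u3"])
noise = QueryUsage(exec_count=2, users=["u1", "u2"])
kept = [q for q in (no_aspect, cluster, noise)
        if passes_threshold(q, MIN_DISTINCT_USERS)]
```

Under these assumptions only the cluster query survives the filter, which matches the intent described above: cluster queries are discoverable, while noise queries and queries without the aspect are dropped.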
