📚 Glossary
Medallion Architecture
Bronze→Silver→Gold tiered data quality pattern for lake/warehouse pipelines
Bronze Layer
Raw ingestion zone — data stored exactly as received from sources
Silver Layer
Cleansed, deduplicated, validated data; business-ready but not aggregated
Gold Layer
Aggregated, business-metric-ready tables serving dashboards and ML models
PySpark
Python API for Apache Spark enabling distributed big-data transformations
Dataproc
Google Cloud's managed Spark/Hadoop cluster service
BigQuery
Google's serverless, petabyte-scale cloud data warehouse
Cloud Composer
Managed Apache Airflow service on GCP for workflow orchestration
DAG (Directed Acyclic Graph)
Airflow pipeline definition; nodes are tasks, edges are dependencies
Gemini Code Assist
Google's AI coding assistant integrated in Cloud Workbench and IDEs
GitHub Actions
CI/CD platform running automated workflows triggered by git events
Harness CD
Continuous Delivery platform with AI-driven rollback and deploy verification
Schema Validation
Automated check ensuring data fields match expected types/constraints
Data Lineage
Record of data origin, transformations, and consumption path
OpenSpec
Spec-driven AI workflow tool defining canonical field contracts across repos