This case study describes an incremental cross‑cloud ingestion pipeline that moves datasets from Google BigQuery into an AWS data lake, with Apache Iceberg providing the table format, snapshot history, and schema evolution. An AWS Glue (Spark) job reads from BigQuery using the spark‑bigquery connector, writes partitioned Parquet files to S3, and commits each batch as an Iceberg snapshot so tables stay consistent and query‑ready.
```mermaid
graph LR
subgraph GCP[Google Cloud]
BQ[BigQuery]
end
subgraph AWS
Glue[AWS Glue - Spark]
S3[S3 Bucket]
Iceberg[Apache Iceberg Table]
end
BQ-->|BigQuery Spark Connector|Glue
Glue --> S3
S3 --> Iceberg
classDef aws fill:#F8D57E,stroke:#B8860B,stroke-width:2px,color:#111
classDef gcp fill:#7ED3F8,stroke:#0B7AB8,stroke-width:2px,color:#111
class Glue,S3,Iceberg aws
class BQ gcp
```
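The flow above can be sketched as a Glue (PySpark) job. The catalog name `glue_catalog`, the table identifiers, the S3 warehouse path, and the `partition_predicate` helper are all placeholders invented for this sketch; the `bigquery` read options follow the open‑source spark‑bigquery connector and the write uses Spark's DataFrameWriterV2 API, but both should be checked against the versions actually deployed.

```python
# Sketch of the Glue job: read one BigQuery partition, land it as a
# partitioned Iceberg table on S3. All names and paths are illustrative.

def partition_predicate(run_date: str) -> str:
    """Build the pushdown filter for one daily partition (hypothetical helper)."""
    return f"event_date = DATE '{run_date}'"

def main(run_date: str) -> None:
    # pyspark is imported lazily so the module stays importable without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Iceberg catalog backed by the AWS Glue Data Catalog (assumed config).
        .config("spark.sql.catalog.glue_catalog",
                "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl",
                "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse",
                "s3://example-lake/warehouse")  # placeholder bucket
        .getOrCreate()
    )

    df = (
        spark.read.format("bigquery")
        .option("table", "my_project.my_dataset.events")  # placeholder table
        .option("filter", partition_predicate(run_date))  # pushed down to BigQuery
        .load()
    )

    # Overwrite only the partition being (re)loaded, so a rerun of the same
    # date replaces its own output instead of duplicating rows.
    df.writeTo("glue_catalog.analytics.events").overwritePartitions()

if __name__ == "__main__":
    main("2024-01-01")
```

Pushing the partition filter into the BigQuery read keeps scanned bytes (and egress) proportional to the increment rather than the full table.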
Notes
- Apache Iceberg provides snapshot‑based incremental reads, atomic commits, and schema evolution.
- AWS Glue (Spark) jobs are partition‑aware and retry‑safe; reruns are idempotent.
- Observability covers lag, freshness, error rates, and data egress; alerts surface late partitions and schema changes.
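Downstream consumers can use the snapshot‑based incremental reads from the first note by recording the last snapshot ID they processed and reading only the delta on the next run. The sketch below assumes that bookkeeping exists somewhere (a checkpoint table, parameter store, etc.); the `start-snapshot-id`/`end-snapshot-id` options are Iceberg's standard Spark read options for incremental append scans, while `snapshot_window` is a hypothetical helper.

```python
# Sketch of snapshot-based incremental consumption from an Iceberg table.
# Where last_processed is stored (the checkpoint) is left as a placeholder.

from typing import Optional, Tuple

def snapshot_window(last_processed: Optional[int],
                    current: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) snapshot IDs for the next incremental read,
    or None when there is nothing new to do."""
    if last_processed is None:
        return None  # first run: caller should fall back to a full-table load
    if last_processed == current:
        return None  # no new snapshots; rerunning is a no-op (idempotent)
    return (last_processed, current)

def read_increment(spark, table: str, window: Tuple[int, int]):
    """Read only the rows appended between two snapshots."""
    start, end = window
    return (
        spark.read.format("iceberg")
        .option("start-snapshot-id", str(start))  # exclusive lower bound
        .option("end-snapshot-id", str(end))      # inclusive upper bound
        .load(table)
    )
```

Because the window is derived from committed snapshot IDs rather than wall‑clock time, a retried run recomputes the same delta, which is what makes reruns safe.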
