data engineering · cloud · data lake

AWS Data Lake with Apache Iceberg

Incremental cross‑cloud ingestion from GCP BigQuery into an AWS data lake backed by Apache Iceberg.

Overview

This case study describes an incremental cross‑cloud ingestion pipeline that moves datasets from GCP BigQuery into an AWS data lake, with Apache Iceberg providing table format, snapshots, and schema evolution. AWS Glue (Spark) reads from BigQuery using the official connector, writes partitioned Parquet files to S3, and commits transactions to Iceberg for consistent, query‑ready tables.
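The BigQuery read side can be sketched as building the connector options with an incremental pushdown filter, so only rows newer than the last committed watermark leave GCP. This is an illustrative sketch, not the production job: the `updated_at` column name and the helper functions are assumptions, while the `table`, `filter`, and `viewsEnabled` keys are documented options of the BigQuery Spark Connector.

```python
from datetime import datetime


def bigquery_incremental_filter(watermark: datetime, column: str = "updated_at") -> str:
    """Build a pushdown predicate for the connector's `filter` option
    so BigQuery only scans rows newer than the last committed watermark
    (this is what keeps egress incremental rather than a full extract)."""
    return f"{column} > TIMESTAMP '{watermark.strftime('%Y-%m-%d %H:%M:%S')} UTC'"


def bigquery_read_options(project: str, dataset: str, table: str, watermark: datetime) -> dict:
    """Assemble the option map passed to spark.read.format('bigquery')."""
    return {
        "table": f"{project}.{dataset}.{table}",
        "filter": bigquery_incremental_filter(watermark),
        "viewsEnabled": "true",  # allow reading logical views, not just tables
    }
```

Inside the Glue job these options would be applied roughly as `spark.read.format("bigquery").options(**opts).load()`, with the resulting DataFrame written to the Iceberg table on S3.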

graph LR
  subgraph GCP[Google Cloud]
    BQ[BigQuery]
  end
  subgraph AWS
    Glue[AWS Glue - Spark]
    S3[S3 Bucket]
    Iceberg[Apache Iceberg Table]
  end

  BQ-->|BigQuery Spark Connector|Glue
  Glue --> S3
  S3 --> Iceberg

  classDef aws fill:#F8D57E,stroke:#B8860B,stroke-width:2px,color:#111
  classDef gcp fill:#7ED3F8,stroke:#0B7AB8,stroke-width:2px,color:#111
  class Glue,S3,Iceberg aws
  class BQ gcp

Notes

  • Apache Iceberg provides snapshot‑based incremental reads, atomic commits, and schema evolution.
  • AWS Glue (Spark) jobs are partition‑aware and retry‑safe; reruns are idempotent.
  • Observability covers lag, freshness, error rates, and data egress; alerts surface late partitions and schema changes.
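The snapshot-based incremental reads mentioned above boil down to selecting the Iceberg snapshots created after the last one a consumer has processed. A simplified model, assuming snapshot metadata is available as `(snapshot_id, timestamp_ms)` pairs ordered oldest to newest:

```python
def snapshots_to_process(snapshots, last_committed_id):
    """Return the snapshots a downstream consumer still has to read.

    snapshots:         list of (snapshot_id, timestamp_ms), oldest first,
                       e.g. as exposed by an Iceberg table's snapshot metadata.
    last_committed_id: the snapshot id the consumer last processed,
                       or None on the first run (process everything).
    """
    if last_committed_id is None:
        return list(snapshots)
    ids = [snapshot_id for snapshot_id, _ in snapshots]
    idx = ids.index(last_committed_id)  # raises if the id was expired/unknown
    return list(snapshots[idx + 1:])
```

In practice the same idea is exposed by Iceberg's incremental read options (reading between a start and end snapshot), but the selection logic is exactly this.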

Problem

Analysts needed AWS‑hosted, query‑ready datasets sourced from BigQuery without nightly full extracts, schema breakages, or downtime.

Solution

Built a Spark‑based ingestion pipeline on AWS Glue that reads directly from BigQuery via the BigQuery Spark Connector, writes partitioned Parquet to S3, and commits Iceberg snapshots. Added schema evolution controls, partition awareness, and idempotent reruns.
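Idempotent reruns come from committing each batch with a MERGE INTO against the Iceberg table, so re-ingested rows update in place instead of duplicating. A minimal sketch that generates such a statement; the table and column names are placeholders, while `MERGE INTO` itself is standard Iceberg Spark SQL:

```python
def merge_upsert_sql(target: str, staging_view: str, key_cols, all_cols) -> str:
    """Build an Iceberg MERGE INTO statement that upserts a staged batch:
    rows matching on the key columns are updated, new rows are inserted.
    Running the same batch twice therefore leaves the table unchanged."""
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in all_cols if c not in key_cols)
    cols = ", ".join(all_cols)
    vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t USING {staging_view} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

In the Glue job the incoming batch would be registered as a temporary view and the generated statement executed via `spark.sql(...)`, with the commit landing as a single Iceberg snapshot.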

Outcome

Reliable, incremental ingestion from BigQuery to AWS with near‑real‑time freshness, lower egress/compute costs, and faster onboarding of new tables.

Technologies

  • AWS S3
  • AWS Glue
  • Apache Iceberg
  • GCP BigQuery
  • BigQuery Spark Connector
  • Apache Spark