Part 9: Data Platforms: Dataproc, Dataflow, Pub/Sub, and BigQuery
Author: Emmanuel Secretaria
Published Aug 16, 2025
Combine batch, streaming, and warehouse views to build a coherent map of GCP data services.
Scope inspiration: gcp_info_bigdata.sh, bigquery_list_datasets.sh, bigquery_list_tables.sh.
This series follows the repo’s GCP inventory flow so every step builds a repeatable, audit-friendly picture of your environment. Part 9 ties together batch, streaming, and warehouse services so data pipelines have a single operational map.
What this script does (walkthrough)
The big data inventory sweeps through region-aware services, then adds BigQuery enumeration as the warehouse counterpart.
- List Dataproc clusters and jobs in all regions (or a configured region list).
- List Dataflow jobs across all regions and statuses.
- Enumerate Pub/Sub topics to capture streaming ingress points.
- List IoT registries per region (if Cloud IoT is enabled).
- Use BigQuery helpers to enumerate datasets and tables for warehouse inventory.
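The region-aware part of that sweep could be sketched roughly as follows. This is not the repo's code: the helper names resolve_regions and sweep_dataproc and the us-central1 fallback are assumptions made for illustration. Only the GCE_REGIONS variable and the gcloud dataproc list subcommands come from the article and the gcloud CLI.

```shell
# Sketch only; helper names and the fallback region are hypothetical.

# Print one region per line from a comma-separated list.
# $1 = override list (may be empty), $2 = fallback list.
resolve_regions() {
  printf '%s\n' "${1:-$2}" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -v '^$'
}

# List Dataproc clusters and jobs region by region.
# Requires an authenticated gcloud; "us-central1" is an assumed default.
sweep_dataproc() {
  resolve_regions "${GCE_REGIONS:-}" "us-central1" | while IFS= read -r region; do
    echo "== Dataproc: $region =="
    # </dev/null keeps gcloud from consuming the region list on stdin.
    gcloud dataproc clusters list --region="$region" </dev/null
    gcloud dataproc jobs list --region="$region" </dev/null
  done
}
```

Calling sweep_dataproc with GCE_REGIONS="us-central1,us-east1" would visit both regions in order. The same loop shape should carry over to other per-region listings.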
Operational caveats and gotchas
- Dataproc and Cloud IoT don’t support --region=all, so the script iterates regions and allows you to override them with GCE_REGIONS or IOT_REGIONS.
- Dataflow listing still succeeds when the API is disabled but returns an empty list, so an empty result does not prove that no jobs exist.
- BigQuery helpers rely on bq and jq, so ensure both are installed before using the dataset/table scripts.
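A small preflight check can guard against two of these gotchas: missing tools and a disabled API that looks like an empty inventory. The sketch below is an assumption about how one might do it, not something taken from the repo. The function names require_tools and api_state are hypothetical, and the gcloud services list filter is the commonly used form but worth verifying against your gcloud version.

```shell
# Sketch only; function names are hypothetical.

# Report each missing command on stderr; return non-zero if any is absent.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Print "enabled", "disabled", or "unknown" for an API such as
# dataflow.googleapis.com. "unknown" means the gcloud call itself failed.
api_state() {
  if ! state_out="$(gcloud services list --enabled \
        --filter="config.name=$1" \
        --format="value(config.name)" 2>/dev/null)"; then
    echo unknown
  elif [ -n "$state_out" ]; then
    echo enabled
  else
    echo disabled
  fi
}
```

For example, running require_tools bq jq before the BigQuery helpers, and api_state dataflow.googleapis.com before trusting an empty Dataflow job list, makes both failure modes visible instead of silent.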
Example command usage
# Full big data inventory (Dataproc, Dataflow, Pub/Sub, IoT)
gcp/gcp_info_bigdata.sh

# Narrow Dataproc/IoT to specific regions for faster scans
GCE_REGIONS="us-central1,us-east1" IOT_REGIONS="us-central1" \
  gcp/gcp_info_bigdata.sh

# BigQuery datasets and tables
# gcp/bigquery_list_datasets.sh prints one dataset per line
gcp/bigquery_list_datasets.sh
# gcp/bigquery_list_tables.sh expects a dataset name
gcp/bigquery_list_tables.sh my_dataset
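Because the dataset helper prints one dataset per line and the table helper takes a dataset name, the two can be chained into a full warehouse listing. The following is a minimal sketch under that assumption. The inventory_tables function and the LIST_DATASETS and LIST_TABLES variables are hypothetical names introduced here, with defaults pointing at the script paths used in the examples above.

```shell
# Sketch only; variable and function names are hypothetical.
LIST_DATASETS="${LIST_DATASETS:-gcp/bigquery_list_datasets.sh}"
LIST_TABLES="${LIST_TABLES:-gcp/bigquery_list_tables.sh}"

# Feed every dataset name into the table helper, with a header per dataset.
inventory_tables() {
  "$LIST_DATASETS" | while IFS= read -r dataset; do
    # Skip blank lines in the dataset listing.
    [ -n "$dataset" ] || continue
    echo "== $dataset =="
    # </dev/null stops the table helper from consuming the dataset list.
    "$LIST_TABLES" "$dataset" </dev/null
  done
}
```

Running inventory_tables > bigquery_inventory.txt would then capture datasets and their tables in one file, which fits the audit-friendly goal of the series.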