Part 9: Data Platforms: Dataproc, Dataflow, Pub/Sub, and BigQuery

Author: Emmanuel Secretaria

Published Aug 16, 2025

Combine batch, streaming, and warehouse views to build a coherent map of GCP data services.

Scope inspiration: gcp_info_bigdata.sh, bigquery_list_datasets.sh, bigquery_list_tables.sh.

This series follows the repo’s GCP inventory flow so every step builds a repeatable, audit-friendly picture of your environment. Part 9 ties together batch, streaming, and warehouse services so data pipelines have a single operational map.


What this script does (walkthrough)

The big data inventory sweeps through region-aware services, then adds BigQuery enumeration as the warehouse counterpart.

  1. List Dataproc clusters and jobs in all regions (or a configured region list).
  2. List Dataflow jobs across all regions and statuses.
  3. Enumerate Pub/Sub topics to capture streaming ingress points.
  4. List IoT registries per region (if Cloud IoT is enabled; note that Google retired Cloud IoT Core in August 2023).
  5. Use BigQuery helpers to enumerate datasets and tables for warehouse inventory.
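The per-region sweep in steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: for_each_region is a hypothetical helper, and the us-central1 fallback is an assumption made only so the usage lines are concrete. The gcloud commands in the comments are standard ones, left commented out so the snippet can be read and sourced without credentials.

```shell
# Hypothetical helper (not from the repo): run a command once per region.
# $1 is a comma-separated region list; the remaining arguments are the command.
for_each_region() {
  regions_csv=$1
  shift
  old_ifs=$IFS
  IFS=','
  for region in $regions_csv; do
    IFS=$old_ifs
    # Skip empty entries such as the gap in "us-central1,,us-east1".
    [ -n "$region" ] || continue
    "$@" --region="$region"
  done
  IFS=$old_ifs
}

# Usage sketch (assumed fallback region; the real script may default to all regions):
#   regions=${GCE_REGIONS:-us-central1}
#   for_each_region "$regions" gcloud dataproc clusters list
#   for_each_region "$regions" gcloud dataproc jobs list
#   gcloud dataflow jobs list --status=all
#   gcloud pubsub topics list
```

Passing the command as arguments keeps the region loop in one place, so the same helper can serve any service that needs an explicit --region flag.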

Operational caveats and gotchas

  • Dataproc and Cloud IoT don’t support --region=all, so the script iterates over regions and lets you override the list with GCE_REGIONS or IOT_REGIONS.
  • The Dataflow listing succeeds even when the API is disabled, but it returns an empty list. An empty result therefore does not prove that no jobs exist; confirm the API is enabled before drawing conclusions.
  • BigQuery helpers rely on bq and jq, so ensure both are installed before using the dataset/table scripts.
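Two of these caveats lend themselves to a quick preflight check. The sketch below is an assumption about how you might guard a run, not something the repo scripts are known to do: missing_cmds and api_in_list are hypothetical helper names, and the gcloud call in the usage comments is the standard way to print enabled service names.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Hypothetical helper: succeed if the API named in $1 appears on stdin,
# where stdin holds one enabled service name per line.
api_in_list() {
  grep -Fqx "$1"
}

# Usage sketch:
#   missing=$(missing_cmds bq jq)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
#
#   gcloud services list --enabled --format='value(config.name)' \
#     | api_in_list dataflow.googleapis.com \
#     || echo "Dataflow API is not enabled; an empty job list proves nothing" >&2
```

Reading the enabled-services list from stdin keeps the check independent of gcloud itself, which makes it easy to exercise against a saved list.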

Example command usage

# Full big data inventory (Dataproc, Dataflow, Pub/Sub, IoT)
gcp/gcp_info_bigdata.sh
# Narrow Dataproc and IoT to specific regions for faster scans
GCE_REGIONS="us-central1,us-east1" IOT_REGIONS="us-central1" \
  gcp/gcp_info_bigdata.sh
# BigQuery datasets and tables
# gcp/bigquery_list_datasets.sh prints one dataset per line
gcp/bigquery_list_datasets.sh
# gcp/bigquery_list_tables.sh expects a dataset name
gcp/bigquery_list_tables.sh my_dataset
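
Because the dataset script prints one dataset per line and the table script takes a dataset name, the two can be chained for a full warehouse sweep. The following is a minimal sketch under that assumption; for_each_dataset is a hypothetical helper, not part of the repo, and it assumes the printed names are accepted as-is by the table script.

```shell
# Hypothetical helper: read dataset names from stdin (one per line) and run
# the given command once per dataset, appending the name as the last argument.
for_each_dataset() {
  while IFS= read -r dataset || [ -n "$dataset" ]; do
    # Ignore blank lines in the input.
    [ -n "$dataset" ] || continue
    # Redirect stdin so the per-dataset command cannot consume the list.
    "$@" "$dataset" </dev/null
  done
}

# Usage sketch:
#   gcp/bigquery_list_datasets.sh | for_each_dataset gcp/bigquery_list_tables.sh
```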