Part 9: Data Platforms: Dataproc, Dataflow, Pub/Sub, and BigQuery

Author: Emmanuel Secretaria

Published Aug 16, 2025

Combine batch, streaming, and warehouse views to build a coherent map of GCP data services.

Scope inspiration: gcp_info_bigdata.sh, bigquery_list_datasets.sh, bigquery_list_tables.sh.

This series follows the repo’s GCP inventory flow so every step builds a repeatable, audit-friendly picture of your environment. Part 9 ties together batch, streaming, and warehouse services so data pipelines have a single operational map.

What this script does (walkthrough)

The big data inventory sweeps through region-aware services, then adds BigQuery enumeration as the warehouse counterpart.

List Dataproc clusters and jobs in all regions (or a configured region list).
List Dataflow jobs across all regions and statuses.
Enumerate Pub/Sub topics to capture streaming ingress points.
List IoT registries per region (if Cloud IoT is enabled).
Use BigQuery helpers to enumerate datasets and tables for warehouse inventory.
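
The sweep above can be sketched as a loop over regions. This is a minimal sketch, not the repo's actual script: the function name `sweep_bigdata` and its region arguments are hypothetical, and only the standard `gcloud` listing commands are used. It assumes `gcloud dataflow jobs list` queries all regions when no `--region` is given, and it leaves the Cloud IoT step out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a region-aware sweep; not the repo's actual script.

# List Dataproc clusters and jobs for each region passed as an argument,
# then list Dataflow jobs and Pub/Sub topics once for the whole project.
sweep_bigdata() {
  local region
  for region in "$@"; do
    echo "== Dataproc clusters (${region}) =="
    gcloud dataproc clusters list --region="${region}"
    echo "== Dataproc jobs (${region}) =="
    gcloud dataproc jobs list --region="${region}"
  done

  # Assumption: without --region, gcloud attempts to query every region.
  echo "== Dataflow jobs (all statuses) =="
  gcloud dataflow jobs list --status=all

  # Pub/Sub topics are listed per project, so no region loop is needed.
  echo "== Pub/Sub topics =="
  gcloud pubsub topics list
}
```

Calling `sweep_bigdata us-central1 us-east1` would run the Dataproc listings twice (once per region) and the Dataflow and Pub/Sub listings once each.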

Operational caveats and gotchas

Dataproc and Cloud IoT don’t support `--region=all`, so the script iterates over regions and lets you override the list with `GCE_REGIONS` or `IOT_REGIONS`.
Dataflow listing succeeds even when the API is disabled but returns an empty list, so an empty result does not prove that no jobs exist. Confirm the API is enabled before treating it as a clean inventory.
BigQuery helpers rely on `bq` and `jq`, so ensure both are installed before using the dataset/table scripts.
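
The three caveats above lend themselves to small preflight checks. The helpers below are a sketch under stated assumptions: the function names are hypothetical and not taken from the repo, the region fallback assumes `gcloud compute regions list` is an acceptable source of region names, and the API check assumes the Dataflow service is named `dataflow.googleapis.com`.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; names are illustrative, not from the repo.

# Print one region per line: split the comma-separated override if it is set,
# otherwise fall back to asking gcloud for every Compute Engine region.
resolve_regions() {
  local override="$1"
  if [ -n "${override}" ]; then
    printf '%s\n' "${override}" | tr ',' '\n'
  else
    gcloud compute regions list --format='value(name)'
  fi
}

# Print the name of each required tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "${tool}" >/dev/null 2>&1 || printf '%s\n' "${tool}"
  done
}

# Succeed only if the Dataflow API is enabled in the current project, so an
# empty job list can be told apart from a disabled API.
dataflow_api_enabled() {
  [ -n "$(gcloud services list --enabled \
      --filter='config.name=dataflow.googleapis.com' \
      --format='value(config.name)')" ]
}
```

A caller might run `resolve_regions "${GCE_REGIONS:-}"` to get the Dataproc region list, `missing_tools bq jq` before the BigQuery helpers, and `dataflow_api_enabled || echo "Dataflow API disabled; empty list is not meaningful" >&2` before interpreting Dataflow results.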

Example command usage

# Full big data inventory (Dataproc, Dataflow, Pub/Sub, IoT)
gcp/gcp_info_bigdata.sh

# Narrow Dataproc/IoT to specific regions for faster scans
GCE_REGIONS="us-central1,us-east1" IOT_REGIONS="us-central1" \
  gcp/gcp_info_bigdata.sh

# BigQuery datasets and tables
# gcp/bigquery_list_datasets.sh prints one dataset per line
gcp/bigquery_list_datasets.sh
# gcp/bigquery_list_tables.sh expects a dataset name
gcp/bigquery_list_tables.sh my_dataset
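
For readers curious how `bq` and `jq` might combine to produce one name per line, here is a rough sketch. It is not the repo's implementation: the function names are hypothetical, and it assumes `bq ls --format=json` emits a JSON array whose entries carry `datasetReference.datasetId` (for datasets) or `tableReference.tableId` (for tables). The `--max_results` value is an arbitrary choice to get past the small default page size.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the repo's helpers may be implemented differently.

# Extract dataset IDs from `bq ls --format=json` output read on stdin.
dataset_ids() {
  jq -r '.[].datasetReference.datasetId'
}

# Extract table IDs from `bq ls --format=json DATASET` output read on stdin.
table_ids() {
  jq -r '.[].tableReference.tableId'
}

# Print one dataset ID per line for the active project.
list_datasets() {
  bq ls --format=json --max_results=1000 | dataset_ids
}

# Print one table ID per line for the dataset named in the first argument.
list_tables() {
  bq ls --format=json --max_results=1000 "$1" | table_ids
}
```

Keeping the `jq` filters in their own functions means they can be checked against saved JSON without touching a live project.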