Execution policy
This emulator ships as a single local process with one C++
Engine interface (backend/engine/engine.h) and one persistent
storage implementation (DuckDBStorage). Behind that single
Engine, query execution is multi-strategy: a local execution
router classifies each resolved GoogleSQL AST shape and dispatches it
to the strategy that fits, all inside the same process.
TL;DR: there is no second engine binary, no runtime backend selector, and no fallback to a cloud BigQuery service. There is also no claim that the DuckDB transpiler is the universal lowering target. Inside emulator_main, a route classifier picks one of seven local dispositions per query (DuckDB fast path, DuckDB rewrite, DuckDB UDF/polyfill, local semantic executor, control-op handler, local-stub, or the deliberate unsupported envelope). Shapes that have no planned local route surface a BigQuery-shaped UNIMPLEMENTED whose message names the offending family and links back here. See "History" below for the runtimes this layout replaces.
The runtime
| Component | Source | Notes |
|---|---|---|
| Engine interface | backend/engine/engine.h | Single public interface seen by the gRPC layer. Implementation behind it is the local execution coordinator. |
| Local execution coordinator | backend/engine/coordinator/ (LocalCoordinatorEngine) | Owns the route classifier and dispatches each query to one of the routes documented below. The classifier consumes the node_dispositions.yaml / functions.yaml registries (built into backend/engine/duckdb/transpiler/) at build time and returns a RouteDecision{disposition; reason; offending_node} per resolved statement. |
| Route classifier | backend/engine/coordinator/route_classifier.{h,cc} | Walks the resolved AST, consults the disposition registries, and picks one of kDuckdbNative / kDuckdbRewrite / kDuckdbUdf / kSemanticExecutor / kControlOp / kLocalStub / kUnsupported. Compositional fallback at planning time: any leaf with a higher-priority disposition promotes the whole query; planned rows do NOT promote (the fast path handles them via the transpiler's empty-string contract). Priority order is unsupported > local_stub > semantic_executor > control_op > duckdb_udf > duckdb_rewrite > duckdb_native, so a SELECT mixing a local-stub function (e.g. KEYS.NEW_KEYSET) and an unsupported shape (e.g. GRAPH_TABLE) surfaces UNIMPLEMENTED rather than a stubbed answer. |
| DuckDB fast path | backend/engine/duckdb/ (DuckDbExecutor) | The transpiler + DuckDB execution surface. Produces DuckDB SQL via a googlesql::ResolvedASTVisitor and runs it through DuckDB's C++ client. Implements the coordinator::Executor interface (ExecuteQuery / ExecuteDml / ExecuteDdl); the coordinator hands it a pre-analyzed ResolvedStatement. |
| DuckDB rewrites + UDFs | backend/engine/duckdb/ | DuckDB SQL expressed via rewrites or DuckDB UDFs/macros to make a BigQuery function correct locally. |
| Local semantic executor | backend/engine/semantic/ | A local row/array/value interpreter for shapes that demand exact BigQuery semantics. |
| Control-op executor | backend/engine/control/ | DDL / metadata / catalog ops routed straight through the storage layer instead of through query execution. |
| Storage interface | backend/storage/storage.h | Single public interface. |
| DuckDB storage | backend/storage/duckdb/ | Sole storage implementation. Catalog + table rows persist to a catalog.duckdb file under --data_dir. |
emulator_main does not accept --engine, --storage, or
--on_unknown_fn; those flags were removed when the ReferenceImpl
engine and in-memory storage were deleted, and the new multi-strategy
coordinator intentionally keeps the same single-knob posture (route
selection happens internally at AST-classification time, not via a
runtime flag). The only knobs are --host_port (default
localhost:9060) and --data_dir (default $HOME/.bigquery-emulator,
with ./.bigquery-emulator as the fallback when HOME is unset).
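The default --data_dir rule above can be sketched in a few lines. This is only an illustration in Python: the real logic lives in the C++ emulator_main binary, the function name and the `env` parameter are hypothetical, and treating an empty HOME the same as an unset one is an assumption.

```python
import os


def default_data_dir(env=None):
    """Return the documented default for --data_dir.

    Hypothetical helper: the name and the `env` parameter exist only for
    this sketch so the rule can be exercised without touching os.environ.
    """
    env = os.environ if env is None else env
    home = env.get("HOME")
    if home:
        # HOME is set: the store lives under the user's home directory.
        return f"{home}/.bigquery-emulator"
    # HOME is unset (assumption: empty counts as unset too), so fall back
    # to a directory relative to the current working directory.
    return "./.bigquery-emulator"
```

Passing an explicit mapping keeps the sketch deterministic; calling it with no argument reads the real environment.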
Routes
The route classifier maps every ResolvedAST shape (and every
built-in function) to exactly one route. The same vocabulary appears
in backend/engine/duckdb/transpiler/SHAPE_TRACKER.md
on a per-node basis.
| Route | Meaning | What it does at runtime |
|---|---|---|
| duckdb_native | Lowers cleanly to DuckDB SQL with semantics that already match BigQuery's exactly. | Transpiles to DuckDB SQL and runs it through DuckDB. Today's primary path. |
| duckdb_rewrite | Lowers to DuckDB SQL with a deliberate structural rewrite (e.g. struct/array shape rewrites, JSON operator mapping). | Same execution path as duckdb_native; the rewrite lives in the transpiler. |
| duckdb_udf | Lowers to DuckDB SQL that calls a DuckDB UDF / macro we register at engine startup. | Same DuckDB connection runs the UDF locally; the UDF body owns the BigQuery-specific semantics. |
| semantic_executor | Runs on a local row/value interpreter that owns exact BigQuery semantics. | Uses DuckDB only as a row source; expression evaluation, type coercion, and error surfaces are local. |
| control_op | DDL / metadata / catalog op. | Bypasses query execution; the storage layer applies the change and emits the BigQuery-shaped response. |
| local_stub | Deliberate BigQuery-shaped placeholder for a specialized family the emulator does not model end-to-end. | Function-level stubs (e.g. KEYS.NEW_KEYSET, ML.PREDICT) dispatch through the semantic executor's stub table (backend/engine/semantic/stubs/) and return a fixed-shape sentinel or schema-correct placeholder of the documented BigQuery return type. Statement-level stubs (e.g. CREATE MODEL) are pre-dispatched from the coordinator's ExecuteDdl to backend/engine/control/stubs/ and return OK without persisting anything. ML inference TVFs (ML.PREDICT, ML.FORECAST, ML.EVALUATE) are also local_stub: they return typed NULL placeholders so the query does not fail; the values are explicitly not predictions. |
| unsupported | Deliberately out of scope locally. | Returns a BigQuery-shaped UNIMPLEMENTED whose message names the offending family (e.g. family: function:array_transform) and links to this document. |
A few rules the router obeys:
- One route per shape. A ResolvedAST node kind picks exactly one route. We do not catch DuckDB errors at runtime and retry on a semantic executor; drift between strategies is hidden by silent fallback, and we removed the old FallbackEngine precisely because of that.
- Compositional fallback at lowering time, not runtime. When the DuckDB fast path can lower most of a query but a leaf node is semantic_executor, the router promotes the surrounding shape to the semantic executor at planning time instead of mixing strategies mid-execution.
- Route labels are observable in tests. Conformance fixtures may assert which route served a query (see conformance/README.md), so a passing fixture cannot silently drift from duckdb_native to semantic_executor.
- unsupported is intentional. Promoting a row out of unsupported requires a planned route, a landing implementation, and conformance coverage. We never silently approximate.
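The planning-time promotion rule can be sketched as "take the highest-priority disposition among the leaves". The Python below is a minimal illustration, not the C++ classifier in route_classifier.{h,cc}: the `classify` function, its input shape (a list of node-name/disposition pairs), and the `reason` text are assumptions made for the sketch. Only the priority order and the three RouteDecision field names come from this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class Disposition(IntEnum):
    # Higher value = higher priority, mirroring the documented order:
    # unsupported > local_stub > semantic_executor > control_op
    #   > duckdb_udf > duckdb_rewrite > duckdb_native
    DUCKDB_NATIVE = 0
    DUCKDB_REWRITE = 1
    DUCKDB_UDF = 2
    CONTROL_OP = 3
    SEMANTIC_EXECUTOR = 4
    LOCAL_STUB = 5
    UNSUPPORTED = 6


@dataclass(frozen=True)
class RouteDecision:
    disposition: Disposition
    reason: str
    offending_node: str


def classify(leaves):
    """Pick one route for a whole statement.

    `leaves` is a hypothetical flattened AST walk: (node_name, Disposition)
    pairs. The leaf with the highest-priority disposition promotes the
    whole query, so strategies are never mixed mid-execution.
    """
    if not leaves:
        raise ValueError("a resolved statement has at least one node")
    node, disposition = max(leaves, key=lambda leaf: leaf[1])
    return RouteDecision(disposition, f"promoted by {node}", node)
```

For example, a statement whose leaves are a plain projection, KEYS.NEW_KEYSET (local_stub), and GRAPH_TABLE (unsupported) classifies as unsupported with GRAPH_TABLE as the offending node, so the caller sees UNIMPLEMENTED rather than a stubbed answer.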
Specialized features
A handful of BigQuery feature families are intentionally NOT modeled
end-to-end by this emulator. The table below is the user-facing
summary the unsupported error envelope points at. Per-family posture
(local_impl vs local_stub vs unsupported) is recorded in
functions.yaml / node_dispositions.yaml and mirrored in
SHAPE_TRACKER.md.
| Family | Posture | What happens |
|---|---|---|
| BigQuery ML (ML.PREDICT, ML.FORECAST, ML.EVALUATE) | local_stub | Accepted at parse / analyze time; TVFs evaluate on backend/engine/semantic/stubs/ml.cc and return schema-correct placeholder rows (input pass-through + NULL predicted/metric/forecast columns). ML is not useful locally (no Vertex AI / training / serving); values are explicitly NOT predictions. See ROADMAP §BigQuery ML. |
| BigQuery ML CREATE MODEL | local_stub | Accepted as metadata-only; returns OK without registering a model. Stays a stub: no model is trained or stored. |
| Geography / GIS (ST_*) | local_impl (partial) | GIS MVP on the semantic executor (backend/engine/semantic/functions/geog_funcs.cc): ST_GEOGPOINT, ST_GEOGFROMTEXT, ST_GEOGFROMWKB (2D WKB POINT), ST_ASTEXT, ST_DISTANCE, ST_WITHIN, ST_CONTAINS, ST_INTERSECTS. GEOGRAPHY persists as VARCHAR in DuckDB storage; extended casts evaluate on eval_expr_cast_extended.cc. Remaining ST_* (aggregates, buffer/simplify family, ...) surface UNIMPLEMENTED with family: function:st_<name>. |
| Differential privacy / anonymized aggregation (AnonymizedAggregate*, DifferentialPrivacyAggregate*, ...) | local_stub | Accepted at parse / analyze time; privacy-preserving aggregate scans route to the semantic executor's existing aggregate eval with modifiers stripped (backend/engine/semantic/scan_eval_aggregate.cc). Returns the plain underlying aggregate: no noise, no group suppression. NOT differential privacy; see ROADMAP §Privacy-preserving aggregates. |
| Approximate aggregation (APPROX_QUANTILES, APPROX_COUNT_DISTINCT, APPROX_TOP_COUNT, APPROX_TOP_SUM) | local_impl | Routes to the semantic executor (backend/engine/semantic/functions/aggregate_specialized.cc). Results are computed exactly (not sketch-approximated); not differential-privacy aggregation (see the DP row above). |
| Networking (NET.*) | local_impl | Routes to the semantic executor; implemented in net_funcs.cc (IP parse/mask/trunc, HOST, PUBLIC_SUFFIX, REG_DOMAIN). Pinned by conformance/fixtures/specialized/net_host_reg_domain.yaml. |
| Key management (KEYS.NEW_KEYSET, KEYS.KEYSET_LENGTH) | local_stub | Deterministic BigQuery-shaped placeholders (KEYS.NEW_KEYSET -> fixed BYTES envelope; KEYS.KEYSET_LENGTH -> 1; in stubs/keys.cc). NOT real Tink. (BigQuery has no KEYS.ENCRYPT / KEYS.DECRYPT_BYTES; encryption is the AEAD.* family, which the emulator does not model.) |
| SESSION_USER (session_user) | local_stub | Returns the fixed placeholder principal bigquery-emulator@local so row/column-policy + audit queries do not fail. NOT an authenticated identity. Pinned by conformance/fixtures/specialized/session_user_stub.yaml. |
| EXPLAIN (ResolvedExplainStmt) | unsupported | Not a BigQuery statement (bq dry-run: "Statement not supported: ExplainStatement"); BigQuery exposes query plans via job statistics / dry run instead. Surfaces UNIMPLEMENTED. |
| HLL (HLL_COUNT.*) | local_impl | Routes to the semantic executor; hll_funcs.cc implements INIT/MERGE/MERGE_PARTIAL/EXTRACT with a local sketch wire format (not byte-compatible with cloud BigQuery). Pinned by conformance/fixtures/specialized/hll_count_round_trip.yaml. |
| KLL (KLL_QUANTILES.*) | local_impl | Routes to the semantic executor; kll_funcs.cc implements INIT/MERGE/MERGE_PARTIAL/MERGE_POINT/EXTRACT/EXTRACT_POINT for INT64 and FLOAT64 using Apache DataSketches KLL with a local sketch wire format (not byte-compatible with cloud BigQuery). Pinned by conformance/fixtures/specialized/kll_quantiles_round_trip.yaml. |
| Protobuf shapes (ResolvedMakeProto, ResolvedGetProtoField, ReplaceField, FilterField) | unsupported | Not reachable in BigQuery PRODUCT_EXTERNAL: no user PROTO types (bq: "Type not found" for NEW <proto>), REPLACE_FIELDS() → "is not supported", FILTER_FIELDS → "Function not found". The construct/read/mutate handlers were removed. Row field access (ResolvedGetRowField, value-table t.f) IS a real BigQuery shape and stays semantic_executor (eval_expr_proto.cc). |
| MEASURE / measure functions (AGGREGATE(<measure>)) | local_impl | MEASURE-typed columns register through measure_catalog.{h,cc} (bqemu_measure:<expr>:<keys> description marker on table schema); the analyzer's measure rewrite expands AGG(<measure>) into multi-level aggregates on the semantic executor (route_classifier_visitor.cc multi-level promotion, agg in functions.yaml). Pinned by conformance/fixtures/specialized/measure_agg_group_by.yaml. CAST to/from MEASURE stays unsupported on eval_expr_cast_extended.cc. |
| Graph (GRAPH_TABLE, GQL subqueries, ResolvedGraph*Scan, ResolvedCatalogColumnRef graph-property refs) | unsupported (not planned) | The GQL surface is effectively a whole second query language (its own analyzer, data model, and pattern grammar) and is not worth modeling in a local emulator. Stays unsupported; not on the roadmap. The graph use of ResolvedCatalogColumnRef (name set, column null) is out of scope for the same reason. Catalog DDL refs (column set) have no reachable PRODUCT_EXTERNAL SQL today. |
| Sequences (ResolvedSequence, NEXT VALUE FOR) | unsupported (not reachable) | BigQuery PRODUCT_EXTERNAL does not ship general SQL sequences; no local query SQL resolves to ResolvedSequence / NEXT VALUE FOR today. The analyzer node exists for other product modes only; it stays unsupported with a sharper envelope rather than a local sequence object. |
| Expression columns (ResolvedExpressionColumn) | local_impl | AnalyzeExpression column bindings evaluate on the semantic executor via EvalContext::columns_by_name / script variables. Script SET RHS (ExecuteProcedureSet, EvalScalarSegment) registers bindings with AddExpressionColumn. Pinned by eval_expr_expression_column_test and conformance/fixtures/scripting/expression_column_set_increment.yaml. |
| JavaScript UDFs (CREATE FUNCTION ... LANGUAGE js) | local_impl | CREATE FUNCTION ... LANGUAGE js registers through js_udf_registry.cc, persists the DDL in DuckDBStorage, and evaluates scalar bodies at call time on the semantic executor via embedded Duktape (js_udf_runtime.cc) so routines.get round-trips and call-time execution match. Pinned by conformance/fixtures/udf/js_scalar_add.yaml. Table-valued / aggregate JS UDFs remain unsupported. |
| Python UDFs (CREATE FUNCTION ... LANGUAGE python) | local_impl | CREATE FUNCTION ... LANGUAGE python registers through python_udf_registry.cc, persists the DDL in DuckDBStorage, and evaluates scalar bodies at call time on the semantic executor via a sandboxed python3 subprocess (python_udf_runtime.cc) so routines.get round-trips and call-time execution match. Pinned by conformance/fixtures/udf/python_scalar_add.yaml; cw_xml_extract promoted in bqutils passing/. CREATE AGGREGATE FUNCTION ... LANGUAGE python and CREATE TABLE FUNCTION ... LANGUAGE python are not a production BigQuery surface (bq dry-run rejects both); the emulator surfaces the same analyzer-time envelopes, pinned by conformance/fixtures/udf/python_udaf_rejected.yaml and python_tvf_rejected.yaml. Declared packages resolve against BIGQUERY_EMULATOR_PYTHON, then $data_dir/python-udf-env/, then host python3; missing packages fail with a structured error (not a traceback). Operator provisioning: task python-udf:provision with BIGQUERY_EMULATOR_PYTHON_ALLOW_PIP=1; see docs/guides/python-udfs.md. Pinned by conformance/fixtures/udf/python_packages_lxml.yaml and python_packages_missing.yaml. |
| SQL scalar UDFs (CREATE FUNCTION ... AS (...)), including ANY TYPE templated parameters | semantic_executor | CREATE FUNCTION registers through the per-project UDF registry (backend/catalog/udf_registry.cc), writes through to DuckDBStorage (__bqemu_routines), and replays into each query's GoogleSqlCatalog (including after engine restart via RehydrateRoutinesFromStorage), shadowing a built-in when the names collide. Templated bodies evaluate on the semantic executor via EvalSqlUdfBody; SQL UDAFs evaluate via EvalSqlUdafBody. SQL TVFs and procedures follow the same persistence path through tvf_registry / procedure_registry. DROP FUNCTION deletes registry + storage rows. Conformance fixtures under conformance/fixtures/udf/. |
| Scripting (DECLARE, SET, CALL, BEGIN…END multi-stmt blocks) | semantic_executor | Gateway sends DECLARE/SET/CALL scripts in one engine ExecuteQuery round-trip (script_runner_engine.go); control-flow scripts register a single child job for the final SELECT result. Simple statements (CREATE CONSTANT/SET/CALL/ASSERT/EXECUTE IMMEDIATE literals) run via ExecuteScriptViaAnalyzeNext; scripts with structured control flow (IF/WHILE/LOOP/FOR/EXECUTE IMMEDIATE/EXCEPTION/RAISE in blocks) delegate to googlesql::ScriptExecutor through EmulatorStatementEvaluator when BIGQUERY_EMULATOR_HAS_GOOGLESQL_SCRIPTING=1 (always set once the prebuilt artifact ships //googlesql/scripting:script_executor). @@error.* reads resolve through EvalContext::script_system_variables (populated by the ScriptExecutor during EXCEPTION handlers). Conformance fixtures under conformance/fixtures/scripting/. |
| LOAD DATA LOCAL <local-uri> / LOAD DATA ... FROM FILES (uris = ['file://...']) | control_op | Local CSV/JSON/Parquet readers via DuckDB (RunLoadData). Pinned by conformance/fixtures/ddl/export_load_round_trip.yaml. |
| LOAD DATA <gs://...> (cloud storage) | control_op | Resolves gs:// via $data_dir/external/gcs-cache/ snapshots or STORAGE_EMULATOR_HOST (fake-gcs). |
| EXPORT DATA (local file:// URI) | control_op | DuckDB COPY (SELECT ...) TO for CSV/JSON/Parquet. Pinned by conformance/fixtures/ddl/export_load_round_trip.yaml. |
| EXPORT DATA (gs:// URI) | control_op | Writes locally then uploads via fake-gcs when STORAGE_EMULATOR_HOST is set. |
| CREATE MATERIALIZED VIEW | control_op | Full-refresh materialization at creation only (no incremental refresh). Stored as a regular table; reads use the existing table-scan path. Pinned by conformance/fixtures/ddl/materialized_view_query.yaml. |
| UNDROP SCHEMA | control_op (landed) | BigQuery's only undrop form (UNDROP SCHEMA ...). UNDROP TABLE is not a BigQuery statement. Dataset tombstones + datasets.undelete share RestoreDataset. See ROADMAP §DML / DDL. |
| INFORMATION_SCHEMA.* reflection views | duckdb_native | <dataset>.INFORMATION_SCHEMA.<VIEW> and region-qualified `region-<r>`.INFORMATION_SCHEMA.<VIEW> resolve at analyze time through backend/catalog/info_schema_table.{h,cc} (a VirtualCatalogTable) and materialize from Storage + the routine/view registries: TABLES, COLUMNS, SCHEMATA, VIEWS (view registry), ROUTINES (__bqemu_routines), COLUMN_FIELD_PATHS (recursed STRUCT/ARRAY paths), PARTITIONS / TABLE_STORAGE (Storage::CountRows, single unpartitioned partition; byte columns NULL per the persistence non-goal), and TABLE_OPTIONS / KEY_COLUMN_USAGE (empty; options/constraints are not modeled). JOBS / JOBS_BY_PROJECT are not engine-resolved: the gateway rewrites those queries to a snapshot table in `_bqemu_jobs.JOBS` materialized from the in-process job registry (gateway/query/info_schema_jobs.go, gateway/jobs/info_schema.go) so tooling can introspect job metadata without faking engine storage. Conformance fixtures under conformance/fixtures/info_schema/. |
The local_stub posture has two flavors:
- Probe placeholder (no downstream consumption) -- client-library startup probes (e.g. a connector that issues SELECT KEYS.NEW_KEYSET(...) to confirm the dialect understands the namespace) succeed without forcing the user to disable their probe.
- No-fail placeholder (deliberate, for families that are not useful locally) -- families the emulator will never model meaningfully (BigQuery ML inference, differential-privacy aggregates, SESSION_USER) return a deterministic, schema-correct placeholder so a query that references them does not fail. The product decision here is no-fail, not loud-fail: these are explicitly placeholders (documented as such per family above), never real answers. This supersedes the emulator's earlier "consume the stub and fail loudly" stance for these families.
A shape that is genuinely unsupported (e.g. graph / GQL) still surfaces
an UNIMPLEMENTED envelope naming the offending family (e.g.
family: ResolvedGraphScan) so a user landing on this document can find
the matching row in the table above without running the route classifier
in their head.
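As an illustration of the no-fail placeholder flavor, the ML.PREDICT row above describes "input pass-through + NULL predicted columns". A minimal sketch of that output shape follows, in Python for brevity. The real stub is C++ (backend/engine/semantic/stubs/ml.cc); the function name, the dict-per-row representation, and the column name `predicted_label` are assumptions made for this example.

```python
def ml_predict_stub(input_rows, predicted_columns):
    """Return placeholder rows shaped like an ML.PREDICT result.

    Every input row passes through unchanged, and each predicted column is
    appended as NULL (None). The values are placeholders, never predictions.
    Hypothetical helper: rows are modeled as plain dicts for illustration.
    """
    return [
        {**row, **{name: None for name in predicted_columns}}
        for row in input_rows
    ]


rows = ml_predict_stub([{"x": 1}, {"x": 2}], ["predicted_label"])
```

Here `rows` keeps both input rows and adds a `predicted_label` key set to None on each, which is what lets a query that references the TVF complete without pretending to score anything.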
What this means in practice
- New feature work has a route, not "a transpiler row." Identify the right local strategy (DuckDB native, DuckDB rewrite, DuckDB UDF, semantic executor, control op), land the implementation, and update the shape tracker row in the same commit.
- DML / DDL routing is closed for in-scope GoogleSQL shapes. Remaining gaps are tracked in ROADMAP §Planned work (real implementations for the proto surface, Python UDFs, gs:// external data, MEASURE, ...; deterministic no-fail stubs for ML.*, differential-privacy aggregates, SESSION_USER). Graph / GQL is the one family that stays unsupported and is not planned. Landed on the local DML executor (backend/engine/semantic/dml/): INSERT VALUES, INSERT ... SELECT, scalar and deep-STRUCT UPDATE, proto UpdateConstructor SET expressions (eval_expr_update_constructor.cc), nested (DELETE arr WITH OFFSET ...) in UPDATE SET, UPDATE ... FROM, DELETE, THEN RETURN on INSERT/UPDATE/DELETE (GoogleSQL does not define MERGE THEN RETURN), pipe INSERT (ResolvedPipeInsertScan), ASSERT_ROWS_MODIFIED, and the harder MERGE matrix (WHEN NOT MATCHED BY SOURCE, multi-action sequences via dml_merge.cc). All MERGE statements route through the semantic executor (dml_merge.cc); the DuckDB verbatim-SQL MERGE path is retired. TRUNCATE TABLE clears rows via RunTruncateTable (CountRows + OverwriteRows empty) on the control-op route; the gateway posts TRUNCATE on the DML path, so the coordinator ExecuteDml bridge calls RunTruncateTable and returns dmlStats.deletedRowCount. CREATE TABLE, CREATE TABLE AS SELECT, DROP TABLE, ALTER TABLE, CREATE MATERIALIZED VIEW (full refresh), EXPORT DATA, and LOAD DATA (local files) are implemented today on the control-op route.
- Storage follows the same single-implementation rule. The in-memory storage backend is gone; every persistent state path goes through DuckDBStorage. Tests that previously used a volatile in-memory store now allocate a temp --data_dir and rely on DuckDB instead.
- There is no cloud passthrough. This emulator never forwards query work to the real BigQuery service. Local coverage is the responsibility of the multi-strategy coordinator inside this process, full stop.
Implication for conformance fixtures
The conformance harness (conformance/cmd/runner,
conformance/fixtures/*.yaml) runs a single profile today:
| Profile | Engine | Storage |
|---|---|---|
| local | local execution coordinator | duckdb |
The legacy profile name duckdb refers to the same coordinator
binary and is accepted by the runner for backwards compatibility
during the rename. New fixtures should use local.
Per the policy above:
- Leave profiles: unset in new fixtures unless you are intentionally targeting a future second profile. Today the default profile set is [local] and the harness runs every fixture against it.
- Seed with sql: INSERT VALUES or rows: setup steps. INSERT executes through the local DML executor, and rows: steps (which call tabledata.insertAll) remain available for fixtures that want to bypass DML entirely.
- Document unsupported gaps loudly. If a fixture is blocked on an unsupported shape, leave it out of the suite rather than t.Skip()-ing it; the conformance harness's purpose is to pin what works.
- Route labels are optional. Fixtures may assert which route served a query alongside the row output (expected.route), and task conformance:routing-matrix reports the observed route for every fixture.
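The profile defaulting and optional route assertion described above can be sketched as follows. This is an illustration in Python only: the real runner lives under conformance/cmd/runner, fixtures are YAML files, and modeling a fixture as a dict with `profiles` and `expected.route` keys, along with both function names, is an assumption for this example.

```python
DEFAULT_PROFILES = ["local"]
# The legacy name is accepted for backwards compatibility during the rename.
LEGACY_ALIASES = {"duckdb": "local"}


def resolve_profiles(fixture):
    """Return the profiles a fixture runs against, canonicalized.

    An unset (or empty) `profiles` key falls back to the default set, and
    the legacy `duckdb` name maps onto `local` without duplicating it.
    """
    names = fixture.get("profiles") or DEFAULT_PROFILES
    resolved = []
    for name in names:
        canonical = LEGACY_ALIASES.get(name, name)
        if canonical not in resolved:
            resolved.append(canonical)
    return resolved


def route_matches(fixture, observed_route):
    """Route labels are optional: only compare when the fixture pins one."""
    expected = fixture.get("expected", {}).get("route")
    return expected is None or expected == observed_route
```

A fixture that pins `expected.route` to duckdb_native would fail this check if the query were served by semantic_executor, which is how a passing fixture is kept from drifting between routes unnoticed.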
History
A previous iteration of the emulator carried two engines
(ReferenceImpl + DuckDB) bridged by a FallbackEngine wrapper that
routed DuckDB-uncovered constructs to ReferenceImpl at runtime,
plus an in-memory storage backend for hermetic tests. That layout was
removed because:
- ReferenceImpl coverage was incomplete (no DDL, partial DML, missing analytics functions) and not actively maintained.
- The runtime fallback bridge made it ambiguous which engine produced any given result; production users reported subtle divergences when fixtures ran on duckdb but tests ran on memory.
- Maintaining two storage backends doubled the test-fixture cost for no production benefit, since docker compose up always used the persistent DuckDB store anyway.
The removal commit landed with a BREAKING CHANGE: footer noting
that --engine=reference_impl, --storage=memory, and
--on_unknown_fn=fallback are gone.
After that removal, the project briefly described itself as a
"DuckDB-only" emulator and tried to extend DuckDB coverage by
promoting every remaining shape to a transpiler row. That framing
underestimated the BigQuery vs. DuckDB semantic gap and would have
forced the transpiler to absorb work it is the wrong tool for. The
current policy — local multi-strategy execution behind a single
Engine — is the deliberate replacement: DuckDB stays as the fast
analytical path, and the BigQuery-specific work moves to whichever
local strategy actually owns it. Route selection happens at
AST-classification time rather than at runtime error-catch time, so
it never reintroduces the silent-drift hazard the old
FallbackEngine had.
BigQuery Storage gRPC (engine bigquery_emulator.v1.*)
The C++ engine on :9060 implements an internal Storage Read/Write
contract (proto/storage_read.proto, proto/storage_write.proto).
Official client libraries dial the public service names
(google.cloud.bigquery.storage.v1.BigQueryRead /
BigQueryWrite) and wire formats (Arrow/Avro read pages, proto-descriptor
row encoding for JsonStreamWriter). The Go gateway shim in
gateway/handlers/bqstorage/ registers those public services on
:9060 and adapts to the engine's internal contract. Java
WriteBufferedStreamIT and StorageArrowSampleIT pass against the
local emulator; Connection + DataTransfer ITs remain allowlisted.
| Surface | Posture |
|---|---|
| Storage Write COMMITTED / _default | Engine AppendRows commit path; shim decodes public proto rows |
| Storage Write BUFFERED | Engine buffered hold + FlushRows / FinalizeWriteStream; shim caches proto descriptors and reuses the emulator BigQueryWriteClient in samples |
| Storage Write PENDING | Engine buffered hold + FinalizeWriteStream + BatchCommitWriteStreams through DuckDBStorage::AppendRows; gateway shim forwards public RPCs |
| Storage Read sessions | CreateReadSession with projection + analyzer-transpiled row_restriction, multi-stream max_stream_count, and SplitReadStream; public-data tables readable with a caller-scoped parent project |
| Storage Read Arrow | Arrow schema + IPC record batches via gateway shim |
| Storage Read Avro | Arrow read from engine, Avro OCF encoding in gateway shim (gateway/handlers/bqstorage/avro.go) |
Set BIGQUERY_STORAGE_GRPC_ENDPOINT (default localhost:9060 in
task thirdparty:*) to reach the gateway gRPC listener. ManagedWriter /
Storage Read subtests skip when it is unset.
The same :9060 listener multiplexes additional public BigQuery gRPC
services the Go sample suites exercise:
| Public service | Posture |
|---|---|
| google.cloud.bigquery.storage.v1.BigQueryRead / BigQueryWrite | Gateway shim → engine internal Storage Read/Write |
| google.cloud.bigquery.connection.v1.ConnectionService | CRUD + persistence under $data_dir/external/connections/_registry/ (gateway/handlers/bqconnection/); EXTERNAL_QUERY is fixture-backed only |
| google.cloud.bigquery.reservation.v1.ReservationService | Empty list stubs (gateway/handlers/bqreservation/) |
| google.cloud.bigquery.analyticshub.v1.AnalyticsHubService | In-memory exchange/listing CRUD (gateway/handlers/bqanalyticshub/) |
| google.cloud.bigquery.v2.* (Dataset/Table/Job/Project/Routine) | Thin gRPC adapters over existing REST/catalog handlers (gateway/handlers/bqv2grpc/) |
Set BIGQUERY_ANALYTICSHUB_GRPC_ENDPOINT to the same host:port when
samples dial Analytics Hub separately. BigQuery v2 preview clients reuse
BIGQUERY_STORAGE_GRPC_ENDPOINT via bqopts.BigQueryV2GRPCClientOptions().
External query and federated sources
EXTERNAL_QUERY(connection, query) is fixture-backed only — the emulator
never opens live federation sockets to Cloud SQL, Spanner, or AlloyDB. Snapshot
layout, manifest format, and operator workflow are documented in
docs/guides/external-query.md.
| Surface | Posture | Conformance |
|---|---|---|
| EXTERNAL_QUERY TVF | semantic_executor; schema + rows from $data_dir/external/connections/<id>/ | conformance/fixtures/external/external_query_fixture.yaml, external_query_missing_fixture.yaml |
| Connection API metadata | Persisted JSON registry; property blocks round-trip | gateway/handlers/bqconnection/grpc_test.go |
| BigLake tables (biglakeConfiguration) | 501 notImplemented with ENGINE_POLICY link | n/a |
| Object tables (objectTableOptions / OBJECT_TABLE) | 501 notImplemented with ENGINE_POLICY link | n/a |
| External datasets (externalDatasetReference) | 501 notImplemented with ENGINE_POLICY link | n/a |
Google Sheets external tables
Google Sheets external tables (GOOGLE_SHEETS / googleSheetsOptions) are
materialized at insert / tableDefinitions time through
gateway/external/sheets.go. Default mode reads a committed CSV snapshot
under --data_dir / gateway/external/fixtures/ (the public sample sheet
docId 1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms, Class Data tab).
Opt-in live mode (BIGQUERY_EMULATOR_LIVE_SHEETS=1 or per-source entry in
$data_dir/external_sources.yaml) fetches the link-public CSV export URL
without credentials. Private sheets still require Sheets API credentials.
Pinned by conformance/fixtures/external/google_sheets_class_data.yaml.
Python thirdparty samples test_query_external_sheets_* target a different
public sheet (1i_QCL-…) and remain skipped until that snapshot is added.
GCS-backed external tables remain in scope via fake-gcs.
Relational UNNEST / lateral / correlated subqueries
These shapes share the semantic executor's per-outer-row evaluation
frame (backend/engine/semantic/outer_row_eval.{h,cc}). The
classifier promotes divergent subsets to semantic_executor; the
fast path keeps the standalone cases on duckdb_native.
| Family | Representative SQL | Route | Conformance |
|---|---|---|---|
| Standalone UNNEST | SELECT n FROM UNNEST([1,2,3]) AS n | duckdb_native | fastpath/scan_array_unnest.yaml |
| Standalone UNNEST cross product | FROM UNNEST(a) CROSS JOIN UNNEST(b) (nested ArrayScan chain) | duckdb_native | fastpath/scan_array_unnest_cross_join.yaml, scan_array_unnest_cross_join_three.yaml, ctas_unnest_cross_join.yaml |
| UNNEST WITH OFFSET | ... UNNEST(arr) AS n WITH OFFSET AS idx | semantic_executor | array_struct/unnest_with_offset.yaml |
| Multi-array zip | FROM UNNEST(a, b) (array_zip_mode) | semantic_executor | array_struct/multi_array_unnest_pad.yaml |
| Cross-join UNNEST (table column) | FROM t, UNNEST(t.arr) AS n | duckdb_native | array_struct/cross_join_unnest.yaml |
| LEFT JOIN UNNEST | LEFT JOIN UNNEST(t.arr) AS n ON TRUE | semantic_executor | array_struct/left_join_unnest.yaml |
| INNER / LEFT / RIGHT / FULL / CROSS join | FROM a JOIN b ON a.k = b.k | duckdb_native (subset) | fastpath/scan_join_* |
| FULL OUTER JOIN (duplicate column names) | SELECT a.k, b.k FROM a FULL JOIN b ON a.k = b.k | duckdb_native | core_usage/everyday_sql/full_join.yaml, fastpath/scan_join_full.yaml |
| JOIN USING | INNER JOIN t2 USING (id) | duckdb_native | fastpath/join_using_inner.yaml |
| Lateral join scan | ... JOIN LATERAL (...) (is_lateral) | semantic_executor | unit tests (MaterializeJoinScan) |
| Correlated subquery | EXISTS (SELECT 1 FROM r WHERE r.k = o.k) | semantic_executor | cte_subquery/subquery_expr_correlated_exists.yaml |
When both join inputs expose the same column name, the transpiler emits
per-side __bq_j_<column_id> aliases at join time and keeps them visible
through passthrough ProjectScan / OrderByScan wrappers so unmatched
FULL OUTER JOIN rows null-pad the missing side instead of self-matching
the surviving column.
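To make the expected result concrete, here is a reference sketch of what SELECT a.k, b.k FROM a FULL JOIN b ON a.k = b.k should return once each side keeps its own column. It is plain Python, not the transpiler's DuckDB SQL, and the function name and dict-per-row inputs are assumptions for the example.

```python
def full_outer_join(left, right, key):
    """Return (left_value, right_value) pairs for a FULL OUTER JOIN on `key`.

    Each side keeps its own output column, so an unmatched row is padded
    with None on the missing side instead of reusing the surviving value.
    """
    out = []
    matched_right = set()
    for left_row in left:
        hit = False
        for index, right_row in enumerate(right):
            # SQL equality never matches NULL keys, so skip None explicitly.
            if left_row[key] is not None and left_row[key] == right_row[key]:
                out.append((left_row[key], right_row[key]))
                matched_right.add(index)
                hit = True
        if not hit:
            out.append((left_row[key], None))
    for index, right_row in enumerate(right):
        if index not in matched_right:
            out.append((None, right_row[key]))
    return out


joined = full_outer_join([{"k": 1}, {"k": 2}], [{"k": 2}, {"k": 3}], "k")
```

With these inputs `joined` is `[(1, None), (2, 2), (None, 3)]`: the rows for keys 1 and 3 are null-padded on the side that has no match, which is the behavior the per-side aliases are there to preserve.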
DML statement routing
| Statement | Route | Handler / notes | Conformance |
|---|---|---|---|
| INSERT (VALUES / SELECT) | semantic_executor | dml_insert.cc; inner SELECT may promote to duckdb_native | core_usage/dml_readback/insert_* |
| UPDATE / DELETE | semantic_executor | dml_mutate.cc; nested array DELETE in UPDATE SET | core_usage/dml_readback/update_readback.yaml, delete_readback.yaml |
| MERGE (all branches) | semantic_executor | dml_merge.cc; DuckDB verbatim-SQL MERGE path retired | dml/merge_*, core_usage/dml_readback/merge_readback.yaml |
| TRUNCATE TABLE | control_op (+ DML bridge) | RunTruncateTable; coordinator ExecuteDml returns deletedRowCount | core_usage/dml_readback/truncate_readback.yaml |
| Pipe INSERT (\|> INSERT) | semantic_executor | ResolvedPipeInsertScan via ExecuteDml | ResolvedPipeInsertScan unit coverage |
Cast extensions, COLLATE, value tables, set-op CORRESPONDING¶
These expression / projection edges promote divergent subsets off the DuckDB fast path; the semantic executor owns exact BigQuery semantics where noted.
| Family | Representative SQL | Route | Conformance |
|---|---|---|---|
| CAST FORMAT | CAST(DATE '2018-01-30' AS STRING FORMAT 'YYYY') | semantic_executor | scalar/cast_format_date_to_string.yaml, cast_format_parse_date.yaml |
| CAST AT TIME ZONE | CAST(TIMESTAMP ... AS STRING AT TIME ZONE 'UTC') | semantic_executor | scalar/cast_timestamp_at_timezone.yaml |
| CAST extended | CAST(ST_GEOGPOINT(...) AS STRING) | semantic_executor | scalar/cast_extended_geography_string.yaml |
| COLLATE in ORDER BY | ORDER BY name COLLATE 'und:ci' | semantic_executor | scalar/order_by_collate_und_ci.yaml |
| SELECT AS VALUE | SELECT AS VALUE 42 AS n | semantic_executor | scalar/select_as_value_scalar.yaml |
| Set-op CORRESPONDING | UNION ALL CORRESPONDING | duckdb_native (subset) | setops/set_op_corresponding_union_all.yaml |
Extended-cast shapes that remain out of scope (proto / enum / range-of-proto /
graph / measure / tokenlist targets) still surface UNIMPLEMENTED on the
semantic path. Type-modifier casts (STRING(n), NUMERIC(p,s), …) are evaluated
in eval_expr_cast.cc (cast_type_modifiers.yaml).
Exact-decimal (NUMERIC / BIGNUMERIC)¶
BigQuery's NUMERIC (equivalent to DECIMAL(38,9): precision 38, scale 9) and
BIGNUMERIC (precision 76.76 digits, where the 77th digit is partial, and
scale 38) are exact decimals. DuckDB can store and compare NUMERIC natively,
but it widens some operations to DOUBLE. It cannot store BIGNUMERIC as a
DECIMAL at all, because BIGNUMERIC's precision exceeds DuckDB's maximum
DECIMAL precision of 38, so BIGNUMERIC is persisted as VARCHAR. Following
the no-silent-approximation rule, shapes that DuckDB would widen or reject
are rerouted to the semantic executor's decimal path
(googlesql::NumericValue / BigNumericValue) rather than approximated in
DOUBLE.
| Family | Representative SQL | Route | Conformance |
|---|---|---|---|
| SUM/MIN/MAX/COUNT over NUMERIC | SUM(amount) | duckdb_native | aggregate/aggregate_numeric_sum.yaml |
| AVG over NUMERIC/BIGNUMERIC | AVG(amount) | semantic_executor | aggregate/aggregate_numeric_avg.yaml |
| NUMERIC/BIGNUMERIC division | a / b | semantic_executor | scalar/numeric_division.yaml |
| +/-/* over BIGNUMERIC | a + b | semantic_executor | scalar/bignumeric_arithmetic.yaml |
Results encode as exact decimal strings on the wire (no float
artifacts); arrow_to_bq renders HUGEINT-backed and VARCHAR-backed
decimals exactly (backend/engine/duckdb/arrow_to_bq_*.cc).
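To make the no-silent-approximation rule concrete, here is a minimal sketch of scale-9 fixed-point arithmetic that stays exact where DOUBLE would not. It does not use googlesql::NumericValue; `Fixed9`, `Add`, and `ToWire` are hypothetical names, the 64-bit integer is an assumption that keeps the example short (a real NUMERIC needs a wider integer to hold 38 digits), and trimming trailing zeros in the rendered string is likewise an assumption about the wire shape.

```cpp
#include <cstdint>
#include <string>

// Hypothetical fixed-point value with NUMERIC's scale of 9: `scaled` holds
// value * 10^9.
constexpr int64_t kScale = 1000000000;  // 10^9

struct Fixed9 {
  int64_t scaled;
};

// Integer addition of scaled values is exact (overflow is ignored here).
Fixed9 Add(Fixed9 a, Fixed9 b) { return Fixed9{a.scaled + b.scaled}; }

// Renders the value as a decimal string without going through floating point.
std::string ToWire(Fixed9 v) {
  const bool negative = v.scaled < 0;
  const uint64_t scale = static_cast<uint64_t>(kScale);
  const uint64_t magnitude = negative ? 0 - static_cast<uint64_t>(v.scaled)
                                      : static_cast<uint64_t>(v.scaled);
  // Left-pad the fractional digits to 9 places, then trim trailing zeros.
  const std::string digits = std::to_string(magnitude % scale);
  std::string frac = std::string(9 - digits.size(), '0') + digits;
  while (!frac.empty() && frac.back() == '0') frac.pop_back();

  std::string out = negative ? "-" : "";
  out += std::to_string(magnitude / scale);
  if (!frac.empty()) out += "." + frac;
  return out;
}
```

With this representation, adding the scaled forms of 0.1 and 0.2 renders as `0.3`, whereas the same sum in IEEE 754 double precision is not equal to 0.3.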
Row-access and column-level security (MVP)¶
Governance is enforced at query time for the synthetic principal
emulator@bigquery.local (see gateway/middleware/auth.go). The
emulator does not implement multi-principal IAM; policies are stored
via REST (rowAccessPolicies.*, tables.* with policy tags / maskKind)
and applied inside DuckDbExecutor::ExecuteQuery.
Row-access policies: predicates for policies whose grantee list is
empty or includes the synthetic principal are OR-composed and injected
into table materialization (WHERE on load). Policies targeting only
other principals do not grant the emulator access.
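The OR-composition of row-access predicates can be sketched as follows. This is an illustration under stated assumptions, not the emulator's implementation: `RowAccessPolicy` and `ComposeRowFilter` are hypothetical names, grantees are compared as plain strings, and returning `FALSE` when policies exist but none applies is an assumption about how "no access" is expressed.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical policy record: an empty grantee list applies to everyone.
struct RowAccessPolicy {
  std::vector<std::string> grantees;
  std::string predicate;  // SQL boolean expression
};

constexpr char kSyntheticPrincipal[] = "emulator@bigquery.local";

// Returns the filter to inject on load: "" when the table has no policies,
// "FALSE" (assumed) when policies exist but none applies to the principal.
std::string ComposeRowFilter(const std::vector<RowAccessPolicy>& policies) {
  if (policies.empty()) return "";
  std::string filter;
  for (const RowAccessPolicy& p : policies) {
    const bool applies =
        p.grantees.empty() ||
        std::find(p.grantees.begin(), p.grantees.end(),
                  kSyntheticPrincipal) != p.grantees.end();
    if (!applies) continue;  // policy targets only other principals
    if (!filter.empty()) filter += " OR ";
    filter += "(" + p.predicate + ")";
  }
  return filter.empty() ? "FALSE" : filter;
}
```

Parenthesizing each predicate keeps operator precedence inside one policy from leaking into the composed filter.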
Column masking: policy tags on STRING/BYTES columns default to
SHA256 masking; explicit maskKind supports NULLIFY, DEFAULT_VALUE,
and DENIED. DENIED columns surface accessDenied (HTTP 403) when
selected. Masking runs in the DuckDB row source after the query
executes (duckdb_executor_security.*).
Conformance fixtures live under conformance/fixtures/security/.
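The masking kinds listed above can be sketched as a small dispatch over a STRING cell. The names `MaskKind` and `MaskString` are hypothetical, the hash function is injected so the sketch stays dependency-free, the empty string as the DEFAULT_VALUE for STRING is an assumption, and the exception merely stands in for the accessDenied (HTTP 403) error.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical enum mirroring the mask kinds named in this section.
enum class MaskKind { kSha256, kNullify, kDefaultValue, kDenied };

// Applies one mask to a STRING cell. std::nullopt represents SQL NULL.
// `sha256_hex` is supplied by the caller (assumption: hex-encoded digest).
std::optional<std::string> MaskString(
    const std::string& value, MaskKind kind,
    const std::function<std::string(const std::string&)>& sha256_hex) {
  switch (kind) {
    case MaskKind::kSha256:
      return sha256_hex(value);
    case MaskKind::kNullify:
      return std::nullopt;
    case MaskKind::kDefaultValue:
      return std::string();  // assumed default for STRING
    case MaskKind::kDenied:
      // Stand-in for surfacing accessDenied to the client.
      throw std::runtime_error("accessDenied");
  }
  return std::nullopt;  // unreachable; keeps compilers quiet
}
```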
Timestamp wire format¶
TIMESTAMP values cross several producer/parser boundaries inside one process. Every producer string must be accepted by every consumer parser so storage → SELECT → Storage Read stays symmetric.
| Producer | Format | Location |
|---|---|---|
| Semantic executor / storage attach | YYYY-MM-DD HH:MM:SS[.ffffff]+00 (short UTC offset) | FormatTimestampUtc in backend/engine/semantic/value.cc via ToStorageValue |
| DuckDB Arrow read path | Microsecond-precision naive text (no TZ suffix) | FormatTimestampMicros in backend/engine/duckdb/arrow_to_bq_types.cc |
| Insert micros-as-string | Decimal epoch micros | TryFormatMicrosTimestampString in backend/storage/duckdb/duckdb_storage_literals.cc |
Parsing rules:
- C++ consumers normalize short +HH/-HH suffixes to +HH:00 via NormalizeTimestampOffsetSuffix before parse (backend/engine/semantic/value.h).
- ParseTimestampWireString is the shared entry point for wire-shaped TIMESTAMP strings in the semantic executor and catalog typed reads.
- Go gateway REST encoding uses bqtypes.TimestampStringToMicros, which also accepts whole-second +00, fractional +00, Z, and +00:00 forms (gateway/bqtypes/wire.go).
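The short-offset normalization in the first rule can be sketched as below. This is not the body of NormalizeTimestampOffsetSuffix; `NormalizeShortOffset` is a hypothetical stand-in, and the check for an earlier `:` (so that a date-only string ending in `-DD` is left alone) is an assumption about how a time part is detected.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: appends ":00" when the string ends in a two-digit
// "+HH" / "-HH" offset that follows a time-of-day part.
std::string NormalizeShortOffset(const std::string& ts) {
  const std::size_t n = ts.size();
  if (n < 3) return ts;
  const char sign = ts[n - 3];
  const bool two_digits =
      std::isdigit(static_cast<unsigned char>(ts[n - 2])) != 0 &&
      std::isdigit(static_cast<unsigned char>(ts[n - 1])) != 0;
  // Require a ':' before the sign so "2024-01-15" is not read as offset -15.
  const std::size_t colon = ts.find(':');
  const bool has_time = colon != std::string::npos && colon < n - 3;
  if ((sign == '+' || sign == '-') && two_digits && has_time) {
    return ts + ":00";
  }
  return ts;  // already "+HH:MM", "Z", naive, or not a timestamp
}
```

Under these assumptions, `2024-01-15 10:00:00+00` becomes `2024-01-15 10:00:00+00:00`, while strings that already end in `+00:00` or `Z` pass through unchanged.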
Asymmetry: REST query responses may pass unparseable TIMESTAMP strings
through as text, whereas Storage Read hard-errors when timestampCellToMicros
cannot parse a cell (gateway/handlers/bqstorage/arrow.go). This behavior is
pinned by gateway/e2e/storage_read_test.go (TestStorageReadTimestampRoundTrip)
and backend/engine/semantic/value_test.cc (TimestampWireRoundTrip).
Cross-references¶
- backend/engine/duckdb/transpiler/SHAPE_TRACKER.md: per-node route dispositions (duckdb_native, duckdb_rewrite, duckdb_udf, semantic_executor, control_op, local_stub, unsupported).
- ROADMAP.md: work tracking and high-level milestone status.
- conformance/README.md: fixture authoring guide; references this document from its "Contributing a new fixture" section.
- DEVELOPMENT.md "Runtime configuration": the user-facing version of the flag surface this document governs.