Experiments

This page documents controlled experiments investigating sources of nondeterminism in agentic systems. Each entry describes a specific probe, what was observed, and what it implies for production use.

Experiments: Spelunk CLI — Structured Investigation for Complex Data Systems

The problem

Large Airflow DAG ecosystems become archaeology sites. A single incident spans Airflow task logs, Docker images running on Kubernetes, BigQuery SQL transformations, Dockerized services producing to Kafka, Kafka consumers sinking data back to BigQuery, and Alembic-managed tables whose schemas live in migration files, not documentation.

When DAGs depend on other DAGs, with over 100 dependencies scattered throughout the codebase, table and schema lineage becomes incomplete or undocumented. Debugging is no longer local. You cannot isolate a single component; every investigation requires reconstructing context from multiple systems, each with its own logging, state management, and failure modes.

Why investigations break down

Investigations can fail twice: once in production, and once in reasoning. The production failure is the incident. The reasoning failure happens when you lose track of what you tested, which hypotheses you ruled out, and why you checked a specific table or config.

Without structure, investigations become unrepeatable: you cannot replay them, you cannot hand them off mid-stream, and you cannot verify whether a hypothesis was actually tested or merely assumed.

The experiment

Spelunk CLI is an investigation workflow tool designed to be used alongside Cursor (or any LLM-assisted editor). It does not automate investigations. It structures the workspace around the things that investigations usually lose: code context, dependency boundaries, and a durable written trail.

Cursor is used to navigate the codebase, summarize unfamiliar components, and draft candidate queries or hypotheses. Spelunk enforces a critical constraint: it does not execute queries. Humans execute queries. The investigation captures what was run and what was learned.

How Spelunk approaches investigations

Spelunk treats an “integration” as the unit of debugging: not a single DAG, but a connected surface area of DAGs, shared libraries, containers, and downstream consumers.

spelunk init <integration> initializes a local workspace with three pillars, matching the three things investigations usually lose: code context (a .repo_farm that pulls the relevant repositories into one place), dependency boundaries (shared_dependencies, kept separate from DAG code), and a durable written trail (.investigations/).

When an incident starts, spelunk investigate <integration> creates a timestamped investigation directory under .investigations/ and opens a templated README.

The investigation template is intentionally rigid, but it is not manually filled line by line. It is populated through a structured back-and-forth between the engineer and the LLM.

Each investigation unfolds in explicit phases, appended to the README in a fixed format.

This creates a logged conversation between the engineer and the agent, where reasoning is captured incrementally and evidence is explicitly attached to each hypothesis. The format makes it impossible to “mentally skip” steps or assume something was checked when it was not.

The workflow is deliberately human-led. Cursor helps you search the .repo_farm, trace call paths, and reason across unfamiliar code. Spelunk complements this with templated .prmpt files that standardize how questions are asked during an investigation.

These prompt templates encode investigation intent — for example, how to ask about schema usage, dependency boundaries, or DAG-to-DAG coupling — while constraining the LLM to the local repository and cached schema context. They are designed to reduce hallucination, not eliminate judgment.

Concretely, documentation is generated by the engineer using prompt templates in prompts/. Each template is a .prmpt file: you paste it into Cursor with the integration’s local context (DAG repo, shared_dependencies, and cached schemas/). Cursor then walks the DAG code, its dependency graph, and referenced tables to produce a detailed technical write-up, which Spelunk stores under .documents/ (partitioned by integration and DAG) for fast lookup during incidents.

Execution stays manual. You run the SQL / kubectl / log queries yourself, review the results, and paste them back into the investigation README (or linked files) as evidence. The .prmpt files guide reasoning; humans remain responsible for decisions.

What Spelunk enforces (by design)

Spelunk never executes queries; humans do. Reasoning must be written into the investigation README as it happens, and investigation directories under .investigations/ are treated as disposable once their conclusions are extracted.

Safety note: Disposable does not mean "accidentally delete the only copy." Conclusions must migrate to durable storage before the investigation directory is removed.

Why this matters

In systems with connected DAGs, Docker layers, Kafka pipelines, and Alembic-managed schemas, incidents are distributed by default. Debugging requires reconstructing state across code, data, and execution history. Doing this mentally does not scale.

Without structure, engineers spend most of their time rediscovering context (where the code lives, how DAGs connect, and what the schemas actually look like) rather than testing hypotheses. That is where time is lost.

Spelunk externalizes that reconstruction. By pulling all relevant repositories into a single workspace, separating DAG code from shared dependencies, and forcing explicit written reasoning, it collapses the search space early.

The result is not just cleaner investigations — it is faster ones.

Investigations that previously took days of context-gathering converge in minutes because the relevant repositories are already assembled in one workspace, per-DAG documentation and schemas are cached locally, and the written trail records which hypotheses have already been ruled out.

Spelunk does not speed up debugging by automation. It speeds it up by eliminating repeated discovery and cognitive thrash.

You still run the queries. You just stop re-learning the system from scratch every time.

Gotchas and how they are handled

Spelunk is intentionally constrained, and those constraints surface predictable failure modes. This section documents the most common ones observed so far, along with the practical resolutions.

1. Multi-DAG issues and partial context

In complex platforms, failures often span multiple connected DAGs. Spelunk generates detailed documentation per DAG, including that DAG’s dependencies, schemas, and inferred lineage. As a result, the accuracy and usefulness of an investigation can vary depending on how much of the true upstream context is visible.

When the failure is caused upstream, a single-DAG investigation may surface symptoms rather than root cause. This is not a tooling bug — it is a context boundary.

Resolution: Start investigations from the most downstream DAG where the failure is observed, then move upward through upstream DAGs incrementally. Each investigation refines context and narrows the search space. This mirrors how failures actually propagate through production systems.

2. Poorly generated SQL despite guardrails

Even with schema caching and constrained prompts, LLM-generated SQL can be incorrect. This is an expected consequence of model nondeterminism, not a surprise.

The most common failure modes are:

Hallucinated columns: Reintroduce the relevant schema documentation into the Cursor context and provide the exact error message returned by BigQuery. In most cases, the LLM corrects the query on the next iteration.

Missing or incorrect partitions: Apply human judgment. Understanding which partition is relevant depends on the debugging intent (backfill vs. incremental run vs. replay). This is a core reason execution remains human-led.

Incorrect SQL syntax: Provide a reference to the BigQuery SQL syntax or correct the query manually. Syntax errors are mechanical and faster to fix directly than to debate with the model.

These failure modes reinforce a core design principle: Spelunk accelerates investigations by structuring reasoning and context, not by delegating authority. The human remains responsible for correctness.

SQLGlot, TextBlob, and Regex — Free and Deterministic Alternatives to LLM-as-a-Judge

The goal

The goal of these experiments was to make the argument in this article concrete: if a constraint can be expressed deterministically, it should be enforced deterministically.

In all three experiments, the enforcement tools were deliberately chosen for one property above all others: determinism. Given the same input and the same rules, they always produced the same result. There was no reinterpretation, no drift, and no probabilistic variance.

Instead of asking a second LLM to “review” or “judge” outputs, I tested a replacement architecture: LLMs generate, deterministic tooling enforces, and violations return structured errors that feed a revision loop.
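This replacement architecture can be sketched as a small loop skeleton. Here generate and validate are hypothetical stand-ins for any LLM call and any deterministic checker; the names are mine, not a library API:

```python
def enforce_loop(generate, validate, prompt, max_attempts=3):
    """Run generate -> validate until the checker reports no violations.

    `generate(prompt, feedback)` is any LLM call; `validate(candidate)` is any
    deterministic checker returning a list of structured violations.
    """
    feedback = []
    for _ in range(max_attempts):
        candidate = generate(prompt, feedback)
        violations = validate(candidate)
        if not violations:
            return candidate
        feedback = violations  # structured errors feed the next attempt
    raise RuntimeError(f"rejected after {max_attempts} attempts: {feedback}")
```

The key property is that validate is the same function on every run; only the generator is probabilistic.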

Why this matters

“LLM-as-a-Judge” feels appealing because it avoids writing rules. But at trust boundaries, the problem is not stylistic. It is structural.

These experiments focus on three domains where teams often default to probabilistic judgment even though fully deterministic enforcement is available.

Experiment 1: SQL enforcement with SQLGlot (AST + schema validation)

Problem: Teams want to execute LLM-generated SQL while preventing mechanical failures and expensive mistakes: hallucinated columns, wrong tables, missing partition filters, and accidental full scans.

Common “judge” approach: ask another LLM whether the SQL “looks right.”

Deterministic alternative: parse SQL into an AST, extract referenced identifiers, and validate against a frozen schema snapshot.

What was enforced

Every query was validated for table and column references against the frozen schema snapshot, checked for required partition filters, and rejected on patterns that cause accidental full scans.

Determinism guarantee: Given the same SQL text, schema snapshot, and enforcement rules, SQLGlot produced identical validation results on every run. There was no variance across executions.

Outcome: Enforcement decisions were stable, replayable, and debuggable. Failures were machine-readable and consistently correctable by the generator in a revision loop.

Experiment 2: “Professional tone” as enforceable policy (TextBlob + explicit rules)

Problem: Teams often claim tone is subjective and rely on an LLM to judge whether outbound text is “professional.”

In practice, production definitions of “professional” usually reduce to explicit policy: no profanity, no harassment, no threats, no discriminatory language, and no sexual content.

Deterministic enforcement approach

Each clause of the policy was expressed as an explicit, checkable rule: deny lists and regex patterns for prohibited content, plus a lexicon-based sentiment score (TextBlob's pattern analyzer) as a rule-driven signal for hostile tone.
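A dependency-free sketch of the rule half of such a policy. The POLICY deny lists and the violation format are invented for illustration; a lexicon-based sentiment score could be layered on as one more deterministic rule:

```python
import re

# Illustrative policy: deny lists per policy clause (real lists would be larger
# and maintained as reviewed configuration, not hardcoded).
POLICY = {
    "profanity": {"damn", "hell"},
    "threats": {"hurt you", "or else"},
}

WORD_RE = re.compile(r"[a-z']+")

def check_tone(text: str) -> list:
    """Return one structured violation per policy clause the text breaks."""
    lowered = text.lower()
    words = set(WORD_RE.findall(lowered))
    violations = []
    for clause, banned in sorted(POLICY.items()):
        # Exact word match for single terms; substring match handles multi-word
        # phrases (a production checker would tokenize more carefully).
        hits = sorted(w for w in banned if w in words or (" " in w and w in lowered))
        if hits:
            violations.append(f"{clause}: {', '.join(hits)}")
    return violations
```

When text fails this check, it fails with the same clause and the same matched terms every time, which is exactly what the revision loop needs.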

Determinism guarantee: Given the same text and the same policy rules, enforcement results were identical across runs. The system did not reinterpret tone or “change its mind.”

Outcome: The system enforced written policy, not vibes. When text failed, it failed for the same reason every time, producing concrete remediation targets for the generator.

Experiment 3: Privacy leakage detection with regex + checksum validation

Problem: Some teams ask an LLM judge whether outbound text leaks private data. This is the riskiest usage pattern because privacy enforcement must be conservative and auditable.

Deterministic enforcement approach

Outbound text was scanned with regex patterns for known private-data formats, and candidate matches were confirmed with checksum validation to rule out false positives before flagging.
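A sketch of the pattern-plus-checksum idea, using card numbers and the Luhn checksum as the example. The regex and the findings format are illustrative; a production DLP ruleset would cover many more identifier types:

```python
import re

# Candidate pattern: 13-16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return normalized card-number candidates that pass checksum validation."""
    findings = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            findings.append(digits)
    return findings
```

The checksum step is what makes the check conservative in a useful way: arbitrary digit runs (order IDs, timestamps) that merely look like card numbers are rejected, while real card numbers are always flagged, on every run.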

Determinism guarantee: The same text always produced the same privacy findings under the same ruleset. There was no probabilistic approval or missed violation due to reinterpretation.

Outcome: Enforcement behaved like a real DLP system: replayable, explainable, and biased toward failing closed. These properties are not optional at privacy boundaries.

What these experiments demonstrate

Across all three domains, the same pattern emerged: the LLM proposed candidates, deterministic tooling enforced the rules, and structured violations drove a revision loop that converged without a judge model in the path.

The architectural takeaway

These experiments reinforce the core architecture of this article:

LLMs propose. Deterministic systems decide.

Determinism is not about rigidity. It is about making trust boundaries replayable, inspectable, and accountable.