Experiments
This page documents controlled experiments investigating sources of nondeterminism in agentic systems. Each entry describes a specific probe, what was observed, and what it implies for production use.
Experiments: Spelunk CLI — Structured Investigation for Complex Data Systems
The problem
Large Airflow DAG ecosystems become archaeology sites. A single incident spans Airflow task logs, Docker images running on Kubernetes, BigQuery SQL transformations, Dockerized services producing to Kafka, Kafka consumers sinking data back to BigQuery, and Alembic-managed tables whose schemas live in migration files, not documentation.
When DAGs depend on other DAGs, and those dependencies number over 100 and are scattered throughout the codebase, table and schema lineage becomes incomplete or undocumented. Debugging is no longer local: you cannot isolate a single component, and every investigation requires reconstructing context from multiple systems, each with its own logging, state management, and failure modes.
Why investigations break down
Investigations can fail twice: once in production, and once in reasoning. The production failure is the incident. The reasoning failure happens when you lose track of what you tested, which hypotheses you ruled out, and why you checked a specific table or config.
Without structure, investigations become:
- Ad-hoc terminal commands with no record of what was run or why
- Slack threads where conclusions are mixed with speculation
- Jupyter notebooks that capture queries but not the reasoning
- Context that evaporates after the incident closes
You cannot replay the investigation. You cannot hand it off mid-stream. You cannot verify whether a hypothesis was actually tested or just assumed.
The experiment
Spelunk CLI is an investigation workflow tool designed to be used alongside Cursor (or any LLM-assisted editor). It does not automate investigations. It structures the workspace around the things that investigations usually lose: code context, dependency boundaries, and a durable written trail.
Cursor is used to navigate the codebase, summarize unfamiliar components, and draft candidate queries or hypotheses. Spelunk enforces a critical constraint: it does not execute queries. Humans execute queries. The investigation captures what was run and what was learned.
How Spelunk approaches investigations
Spelunk treats an “integration” as the unit of debugging: not a single DAG, but a connected surface area of DAGs, shared libraries, containers, and downstream consumers.
spelunk init <integration> initializes a local workspace with three pillars:
- .repo_farm/: a local checkout of all repositories referenced by the integration config, separated into:
  - DAG repos (the orchestration layer)
  - shared_dependencies (libraries and shared code used across DAGs/services)
- .documents/: LLM-generated documentation, partitioned by integration and then by DAG ID.
- schemas/: cached JSON schema definitions for all tables in the integration for a specific project. These are used to validate table schemas and to ground the LLM during investigation, serving as a guardrail against hallucinated schemas and against queries that skip partitions or other optimizations.
When an incident starts, spelunk investigate <integration> creates a timestamped investigation
directory under .investigations/ and opens a templated README.
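As a sketch of what this step might do under the hood (the function name, template fields, and timestamp format here are illustrative assumptions, not Spelunk's actual implementation):

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical README skeleton; the real Spelunk template is more detailed.
TEMPLATE = """# Investigation: {integration}
Started: {started}

## Phase 1: (suspected issue)
- Hypothesis:
- Query 1:
- Results:
"""

def start_investigation(integration: str, root: Path = Path(".investigations")) -> Path:
    """Create a timestamped investigation directory with a templated README."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    inv_dir = root / f"{stamp}_{integration}"
    inv_dir.mkdir(parents=True)
    (inv_dir / "README.md").write_text(
        TEMPLATE.format(integration=integration, started=stamp)
    )
    return inv_dir
```

The timestamp in the directory name is what gives investigations their temporal ordering later.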
The investigation template is intentionally rigid, but it is not manually filled line by line. It is populated through a structured back-and-forth between the engineer and the LLM.
Each investigation unfolds in explicit phases, appended to the README in a fixed format:
- Phase N: Description of the suspected issue or failure mode
- Hypothesis: A concrete, testable claim generated or refined by the LLM
- Query N: A candidate query or inspection step proposed by the LLM
- Results: Output pasted in by the human after manual execution
This creates a logged conversation between the engineer and the agent, where reasoning is captured incrementally and evidence is explicitly attached to each hypothesis. The format makes it impossible to “mentally skip” steps or assume something was checked when it was not.
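The fixed phase format above can be sketched as a simple append-only helper (hypothetical; the actual template fields may differ):

```python
from pathlib import Path

def append_phase(readme: Path, n: int, description: str,
                 hypothesis: str, query: str) -> None:
    """Append one phase block in the fixed format; results are pasted in later."""
    block = (
        f"\n## Phase {n}: {description}\n"
        f"- Hypothesis: {hypothesis}\n"
        f"- Query {n}: {query}\n"
        f"- Results: (paste output of manual execution here)\n"
    )
    with readme.open("a") as f:  # append-only: earlier phases are never rewritten
        f.write(block)
```

Because phases are only ever appended, the README doubles as a chronological log of what was believed and checked at each point.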
The workflow is deliberately human-led. Cursor helps you search the .repo_farm, trace call paths,
and reason across unfamiliar code. Spelunk complements this with templated .prmpt files that
standardize how questions are asked during an investigation.
These prompt templates encode investigation intent — for example, how to ask about schema usage, dependency boundaries, or DAG-to-DAG coupling — while constraining the LLM to the local repository and cached schema context. They are designed to reduce hallucination, not eliminate judgment.
Concretely, documentation is generated by the engineer using prompt templates in prompts/.
Each template is a .prmpt file: you paste it into Cursor with the integration’s local context
(DAG repo, shared_dependencies, and cached schemas/). Cursor then walks the DAG code,
its dependency graph, and referenced tables to produce a detailed technical write-up, which Spelunk stores under
.documents/ (partitioned by integration and DAG) for fast lookup during incidents.
Execution stays manual. You run the SQL / kubectl / log queries yourself, review the results, and paste
them back into the investigation README (or linked files) as evidence. The .prmpt files guide
reasoning; humans remain responsible for decisions.
What Spelunk enforces (by design)
- No autonomous execution. Queries are reviewed before they run.
- Manual logging. If you don't log it, it didn't happen. This forces conscious capture.
- Temporal ordering. Files are timestamped. You see the investigation as it unfolded.
- Disposability. Investigations are temporary. Once conclusions move to Confluence or a postmortem doc, the Spelunk directory is archived. It served its purpose.
Safety note: Disposable does not mean "accidentally delete the only copy." Conclusions must migrate to durable storage before the investigation directory is removed.
Why this matters
In systems with connected DAGs, Docker layers, Kafka pipelines, and Alembic-managed schemas, incidents are distributed by default. Debugging requires reconstructing state across code, data, and execution history. Doing this mentally does not scale.
Without structure, engineers spend most of their time:
- re-discovering where code lives
- re-building partial mental models of dependencies
- re-checking assumptions that were already tested elsewhere
That is where time is lost.
Spelunk externalizes that reconstruction. By pulling all relevant repositories into a single workspace, separating DAG code from shared dependencies, and forcing explicit written reasoning, it collapses the search space early.
The result is not just cleaner investigations — it is faster ones.
Investigations that previously took days of context-gathering converge in minutes because:
- the relevant code is already local
- dependency boundaries are explicit
- schema history and undocumented lineage are surfaced early
- hypotheses and evidence are written down instead of re-inferred
Spelunk does not speed up debugging by automation. It speeds it up by eliminating repeated discovery and cognitive thrash.
You still run the queries. You just stop re-learning the system from scratch every time.
Gotchas and how they are handled
Spelunk is intentionally constrained, and those constraints surface predictable failure modes. This section documents the most common ones observed so far, along with the practical resolutions.
1. Multi-DAG issues and partial context
In complex platforms, failures often span multiple connected DAGs. Spelunk generates detailed documentation per DAG, including that DAG’s dependencies, schemas, and inferred lineage. As a result, the accuracy and usefulness of an investigation can vary depending on how much of the true upstream context is visible.
When the failure is caused upstream, a single-DAG investigation may surface symptoms rather than root cause. This is not a tooling bug — it is a context boundary.
Resolution: Start investigations from the most downstream DAG where the failure is observed, then move upward through upstream DAGs incrementally. Each investigation refines context and narrows the search space. This mirrors how failures actually propagate through production systems.
2. Poorly generated SQL despite guardrails
Even with schema caching and constrained prompts, LLM-generated SQL can be incorrect. This is an expected consequence of model nondeterminism, not a surprise.
The most common failure modes are:
- Missing or incorrect partition filters
- Hallucinated column names
- Invalid SQL syntax, particularly with BigQuery-specific constructs
Hallucinated columns: Reintroduce the relevant schema documentation into the Cursor context and provide the exact error message returned by BigQuery. In most cases, the LLM corrects the query on the next iteration.
Missing or incorrect partitions: Apply human judgment. Understanding which partition is relevant depends on the debugging intent (backfill vs. incremental run vs. replay). This is a core reason execution remains human-led.
Incorrect SQL syntax: Provide a reference to the BigQuery SQL syntax or correct the query manually. Syntax errors are mechanical and faster to fix directly than to debate with the model.
These failure modes reinforce a core design principle: Spelunk accelerates investigations by structuring reasoning and context, not by delegating authority. The human remains responsible for correctness.
SQLGlot, TextBlob, and Regex — Free and Deterministic Alternatives to LLM-as-a-Judge
The goal
The goal of these experiments was to make the argument in this article concrete: if a constraint can be expressed deterministically, it should be enforced deterministically.
In all three experiments, the enforcement tools were deliberately chosen for one property above all others: determinism. Given the same input and the same rules, they always produced the same result. There was no reinterpretation, no drift, and no probabilistic variance.
Instead of asking a second LLM to “review” or “judge” outputs, I tested a replacement architecture: LLMs generate, deterministic tooling enforces, and violations return structured errors that feed a revision loop.
Why this matters
“LLM-as-a-Judge” feels appealing because it avoids writing rules. But at trust boundaries, the problem is not stylistic. It is structural.
- Rules fail loudly. They produce explicit, repeatable violations.
- LLM judges fail silently. The same input can yield different approvals over time.
These experiments focus on three domains where teams often default to probabilistic judgment even though fully deterministic enforcement is available.
Experiment 1: SQL enforcement with SQLGlot (AST + schema validation)
Problem: Teams want to execute LLM-generated SQL while preventing mechanical failures and expensive mistakes: hallucinated columns, wrong tables, missing partition filters, and accidental full scans.
Common “judge” approach: ask another LLM whether the SQL “looks right.”
Deterministic alternative: parse SQL into an AST, extract referenced identifiers, and validate against a frozen schema snapshot.
What was enforced
- SQL validity: parseability and syntactic correctness
- Schema correctness: referenced tables and columns must exist in a frozen snapshot
- Cost constraints: required partition predicates must be present
- Fail-closed behavior: unknown identifiers are rejected with explicit reasons
Determinism guarantee: Given the same SQL text, schema snapshot, and enforcement rules, SQLGlot produced identical validation results on every run. There was no variance across executions.
Outcome: Enforcement decisions were stable, replayable, and debuggable. Failures were machine-readable and consistently correctable by the generator in a revision loop.
Experiment 2: “Professional tone” as enforceable policy (TextBlob + explicit rules)
Problem: Teams often claim tone is subjective and rely on an LLM to judge whether outbound text is “professional.”
In practice, production definitions of “professional” usually reduce to explicit policy: no profanity, no harassment, no threats, no discriminatory language, and no sexual content.
Deterministic enforcement approach
- Policy rules: lexicon-based profanity and slur detection
- Pattern checks: deterministic harassment and threat patterns
- Sentiment signals: TextBlob polarity used only as a thresholded signal, not as authority
- Structured violations: named rule failures with explicit evidence
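A sketch of the rule layer, assuming a deliberately tiny illustrative lexicon and threat patterns (real policy lists are much longer; TextBlob's polarity score would slot in as one more thresholded, named rule):

```python
import re

# Illustrative policy lexicon and patterns -- stand-ins, not a real policy.
PROFANITY = {"damn", "hell"}
THREAT_PATTERNS = [
    re.compile(r"\bor else\b", re.I),
    re.compile(r"\byou will regret\b", re.I),
]

def check_tone(text: str) -> list[dict]:
    """Return named rule violations with evidence; empty means the text passed."""
    violations = []
    words = set(re.findall(r"[a-z']+", text.lower()))
    for w in sorted(words & PROFANITY):
        violations.append({"rule": "profanity", "evidence": w})
    for pat in THREAT_PATTERNS:
        m = pat.search(text)
        if m:
            violations.append({"rule": "threat_pattern", "evidence": m.group(0)})
    # A thresholded sentiment signal (e.g. TextBlob polarity below some cutoff)
    # could be appended here as one more named rule -- a signal, not an authority.
    return violations
```

Every violation names the rule that fired and the evidence that triggered it, which is exactly what the generator needs for a targeted revision.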
Determinism guarantee: Given the same text and the same policy rules, enforcement results were identical across runs. The system did not reinterpret tone or “change its mind.”
Outcome: The system enforced written policy, not vibes. When text failed, it failed for the same reason every time, producing concrete remediation targets for the generator.
Experiment 3: Privacy leakage detection with regex + checksum validation
Problem: Some teams ask an LLM judge whether outbound text leaks private data. This is the riskiest usage pattern because privacy enforcement must be conservative and auditable.
Deterministic enforcement approach
- PII patterns: regex detection for emails, phone numbers, and IDs
- Payment data: checksum validation for credit card numbers
- Secrets: deterministic secret scanners for API keys and tokens
- Trust boundaries: allowlists and denylists for internal domains
- Structured reporting: explicit violation categories with evidence
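A sketch of the pattern-plus-checksum layer (the regexes are simplified for illustration; production DLP patterns are more exhaustive). The Luhn check is what separates a random 16-digit number from a plausible card number:

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # simplified candidate pattern

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(text: str) -> list[dict]:
    """Return explicit violation categories with evidence; empty means clean."""
    findings = []
    for m in EMAIL.finditer(text):
        findings.append({"category": "email", "evidence": m.group(0)})
    for m in CARD.finditer(text):
        if luhn_valid(m.group(0)):  # checksum gate cuts digit-run false positives
            findings.append({"category": "credit_card", "evidence": m.group(0)})
    return findings
```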
Determinism guarantee: The same text always produced the same privacy findings under the same ruleset. There was no probabilistic approval or missed violation due to reinterpretation.
Outcome: Enforcement behaved like a real DLP system: replayable, explainable, and biased toward failing closed. These properties are not optional at privacy boundaries.
What these experiments demonstrate
Across all three domains, the same pattern emerged:
- Deterministic inputs produced deterministic outcomes.
- Violations were explicit, stable, and inspectable.
- Enforcement logic was testable with fixed fixtures.
- LLMs were most effective inside the revision loop, not at the gate.
The architectural takeaway
These experiments reinforce the core architecture of this article:
- LLMs generate candidate outputs.
- Deterministic enforcers validate.
- Failures return structured errors.
- LLMs revise based on explicit constraints.
- Repeat until valid.
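The loop above can be sketched generically; `generate` and `validate` are caller-supplied callables (an LLM call and one of the deterministic enforcers above), and the names are illustrative:

```python
def revision_loop(generate, validate, max_rounds=3):
    """Run generate -> enforce -> revise until the validator passes or we give up."""
    feedback = []
    for _ in range(max_rounds):
        candidate = generate(feedback)    # LLM proposes (stubbed in tests)
        violations = validate(candidate)  # deterministic enforcer decides
        if not violations:
            return candidate
        feedback = violations             # structured errors drive the revision
    raise RuntimeError(f"no valid candidate after {max_rounds} rounds: {feedback}")
```

The enforcer never loosens; only the candidate changes between rounds, so every approval is replayable against the same rules.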
LLMs propose. Deterministic systems decide.
Determinism is not about rigidity. It is about making trust boundaries replayable, inspectable, and accountable.