Articles

A collection of writings on nondeterminism, reproducibility, and building reliable agentic systems.

GGUF Wins on Determinism: Why Managed LLM Services Can’t Offer Reproducible Inference

Most discussions comparing local LLMs to managed LLM services focus on cost, privacy, or latency. Those are valid considerations—but they are secondary. The decisive difference is determinism. If an LLM-powered system cannot be reliably replayed—across time, environments, and failures—you do not truly control it. This article argues that managed LLM services are structurally incompatible with strong determinism, and that GGUF wins because it restores ownership of the model boundary and execution surface. This is not a critique of model quality or provider competence. It is a systems argument.

What Determinism Actually Means (Precisely)

In LLM systems, determinism does not mean “temperature = 0” or “usually the same output.” It means: Given the same inputs and environment, the system produces the same outputs, and failures can be replayed, inspected, and explained. That definition implies four concrete layers.

1. Token-Level Determinism

Given fixed weights, a fixed prompt, a fixed seed, and fixed sampling parameters, the generated token sequence should be identical across runs.

2. Execution Determinism

The execution path itself must be stable: the same kernels, the same batching behavior, the same numerical results on every run.

3. Environment Determinism

The hardware, inference engine, libraries, and configuration must be pinned and versioned. This mirrors how reproducibility is achieved in traditional software systems: lock the toolchain, lock the dependencies, lock the build.

4. Observability and Replay

Every run must be recordable, inspectable, and replayable after the fact. If any of these layers is opaque, determinism collapses.

Why Managed LLM Services Cannot Be Deterministic

Managed LLM services are optimized for capability, scale, and convenience. Determinism is not their goal—and structurally, it cannot be.

You Do Not Own the Model Boundary

When you call a managed LLM API, you are not invoking a fixed model artifact. You are invoking a service abstraction.

That abstraction may include:

- load balancing across heterogeneous hardware and model replicas
- dynamic batching with other customers' requests
- silent updates to serving stacks, safety filters, or system prompts
- pre- and post-processing you cannot observe

Even when providers expose parameters like temperature, top_p, or seed, these are best-effort controls, not guarantees of identical execution across time.

You are calling a policy-driven service, not a frozen binary.

“Temperature = 0” Is Not Determinism

Setting temperature to zero disables sampling randomness. It does not guarantee deterministic execution.

Reasons include:

- floating-point arithmetic is non-associative, so parallel reduction order changes results
- dynamic batching means your request's computation depends on concurrent traffic
- serving stacks, kernels, and hardware change underneath you
- ties in the logits can be broken differently across runs

Temperature controls sampling. Determinism requires control over the entire execution surface.

There Is No Stable Replay Surface

Some managed services offer model version pinning (for example, gpt-4-0613 or claude-3-opus-20240229). This helps—but it only freezes the model weights, not the execution surface.

Even with pinned versions, you generally cannot:

- pin the serving stack, kernels, or hardware behind the endpoint
- verify that the artifact behind the version label never changes
- replay a request months later with any guarantee of identical execution
- keep calling the version after the provider deprecates it

Short-term repeatability may exist. Long-term reproducibility does not.

That alone disqualifies managed LLMs from CI-grade validation, auditing, and safety-critical automation.

Why GGUF Wins

GGUF does not win because it is cheaper, faster, or more convenient. It wins because it restores ownership of the execution boundary.

GGUF Freezes the Model Artifact

A GGUF file bundles:

- the model weights in a defined quantization format
- the tokenizer and vocabulary
- architecture metadata and hyperparameters

into a single self-describing artifact.

When paired with a pinned inference engine (for example, a specific llama.cpp build) and fixed runtime parameters (seed, sampling settings), GGUF turns an LLM into a versioned software artifact.

This is the critical shift: the model becomes an immutable binary, and determinism becomes a function of controlling the runtime environment.
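As a concrete illustration, freezing the boundary can be as simple as hashing the artifact and recording the execution parameters next to it. This is a minimal sketch, not part of any GGUF tooling; the file path, engine tag, and parameter names are hypothetical:

```python
import hashlib
import tempfile

def artifact_manifest(model_path, engine_commit, params):
    """Pin the model artifact (by content hash) together with its execution surface."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "model_sha256": digest.hexdigest(),  # the frozen model artifact
        "engine_commit": engine_commit,      # e.g. a specific llama.cpp build tag
        "params": params,                    # seed, sampling settings, thread count
    }

# Stand-in file for demonstration; in practice this is your real .gguf model.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"demo-model-bytes")
    model_path = f.name

manifest = artifact_manifest(
    model_path,
    engine_commit="b1234",  # hypothetical pinned build identifier
    params={"seed": 42, "temperature": 0.0, "n_threads": 8},
)
print(manifest["model_sha256"][:16])
```

Two runs are comparable only if their manifests are byte-identical; anything that changes the hash or the parameters is, by definition, a different system.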

Determinism Becomes an Engineering Choice Again

With GGUF, you can:

- pin the model artifact by content hash
- pin the inference engine to an exact build
- fix the seed, sampling parameters, and thread count
- package the entire stack in a container and version it

These are the same tools used to reason about correctness everywhere else in software engineering.

Replay, Diff, and Audit Are Possible

Because the model and runtime are local and inspectable:

- any run can be replayed against the same artifact and configuration
- outputs can be diffed across model, prompt, or engine changes
- every decision can be traced back to a specific, archived artifact

This is fundamentally impossible when the execution surface is hidden behind a managed service.

Trade-offs (And Why They’re Worth It)

GGUF comes with real costs:

- you provision and operate the inference hardware yourself
- local models typically trail frontier hosted models in raw capability
- quantization trades some quality for size and speed
- upgrades and maintenance become your responsibility

But these are engineering trade-offs, not epistemic uncertainty.

Managed LLM services trade away:

- determinism of the token stream
- stability of the execution surface
- long-term replayability and auditability

In short: you give up determinism in exchange for convenience.

When Managed LLMs Are the Wrong Tool

Despite these costs, GGUF is the only viable choice when determinism is non-negotiable. Managed LLM services are poorly suited for:

CI or Regression Testing

Tests must be stable across runs. A test that passes or fails nondeterministically is worse than no test—it trains engineers to ignore failures. Without deterministic outputs, you cannot distinguish between a legitimate regression and random sampling variance.

Diff-Based Validation

When you change a prompt or system instruction, you need to know whether the change improved output quality. This requires comparing outputs for identical inputs. If the baseline itself is unstable, validation becomes impossible.

Safety or Compliance-Critical Systems

Regulatory frameworks (medical, financial, legal) often require audit trails showing exactly how a decision was reached. "The model said so, but we can't reproduce it" is not an acceptable answer when lives, money, or legal liability are at stake.

Post-Mortem Debugging of Production Failures

When an agent makes a bad decision in production, the first step is reproducing the failure. If you cannot replay the exact execution that led to the failure, you cannot verify that your fix actually works. You are left guessing.

If you need to understand why a decision was made, you need to be able to replay that decision. Managed LLMs fail this requirement by design.

Conclusion

Determinism is not a configuration option.

It is a property of ownership.

GGUF wins because it freezes the model artifact and makes the execution surface inspectable and replayable. Managed LLM services cannot offer the same guarantees, not because of poor engineering, but because their abstraction model prioritizes convenience over reproducibility.

If you cannot replay it, you do not understand it. And if you do not understand it, you cannot trust it in production.

That is why GGUF wins on determinism.

.prmpt: A Structured Contract for Working With LLMs in Production

Large Language Models are powerful—but they are fundamentally unreliable. They hallucinate, ignore instructions, conflate roles, and behave differently across runs. And yet we keep embedding them inside production systems as if they were deterministic software components. That mismatch is the root problem .prmpt is designed to solve.

.prmpt is not a prompt library. It’s not an SDK wrapper. It’s not prompt-engineering flair. It is a specification: a structured, machine-readable contract for defining how an LLM-backed component should behave, how context should be constructed, what outputs are acceptable, and what to do when the model deviates.

The Real Problem With Prompts

In most codebases, prompts are treated like strings: free-form text copied between files, glued to code paths, and modified without review or validation. This works in demos. In real systems it fails silently.

If an LLM response affects reliability, safety, money, or user experience, prompts stop being “text” and become:

- behavioral contracts that other components depend on
- security boundaries between user input and system capability
- versioned artifacts that must be reviewed, tested, and audited

.prmpt exists because the “string in code” approach has no guardrails: no structured boundaries, no enforcement, no reproducible execution surface, and no auditable drift control.

What .prmpt Is

.prmpt is a declarative format for defining:

- the role and intent of an LLM-backed component
- input schemas and how context is constructed
- what outputs are acceptable, and how they are constrained
- what to do when the model deviates

Think of it as OpenAPI for LLM behavior—not in the sense that it makes LLMs deterministic, but in the sense that it makes your expectations explicit, testable, and enforceable.

Why Structure Matters for LLMs

LLMs are not deterministic functions. They are probabilistic token generators conditioned on context. In practice, this means:

- the same input can produce different outputs across runs
- instructions can be ignored, reinterpreted, or partially followed
- behavior drifts as models, prompts, and context change

.prmpt forces you to stop relying on vibes and start doing what engineers do: define contracts, constrain inputs, validate outputs, and make failure modes observable.

Core Design Principles

1. Contracts Over Cleverness

If behavior matters, it should be specified—not implied. A contract is something you can review, diff, test, and enforce. A clever prompt is a fragile artifact that decays under pressure.

2. Determinism Around the Model

You won’t make the model deterministic. But you can make everything around it deterministic: input schemas, context assembly, tool wiring, retries, timeouts, validation gates, and deployment versioning.
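To make this concrete, here is a minimal sketch of "determinism around the model" in plain Python. Everything except the model call is a pure, deterministic function; the function and field names are illustrative, not .prmpt syntax:

```python
import json

def validate_input(payload):
    """Contract gate on the way in: reject malformed input before the model sees it."""
    if not isinstance(payload.get("question"), str) or not payload["question"].strip():
        raise ValueError("input violates contract: 'question' must be a non-empty string")
    return payload

def assemble_context(payload, system_rules):
    """Deterministic context assembly: the same payload always yields the same prompt."""
    return json.dumps({"system": system_rules, "user": payload["question"]}, sort_keys=True)

def validate_output(raw):
    """Contract gate on the way out: malformed model output fails loudly."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output violates contract: not valid JSON ({exc})")
    if "answer" not in parsed:
        raise ValueError("output violates contract: missing 'answer' field")
    return parsed

def fake_model(prompt):
    """Stand-in for the one nondeterministic step: a real LLM call."""
    return json.dumps({"answer": "42"})

payload = validate_input({"question": "What is 6 * 7?"})
prompt = assemble_context(payload, system_rules="Answer as JSON.")
result = validate_output(fake_model(prompt))
print(result["answer"])  # prints: 42
```

The model in the middle stays probabilistic; the boundaries around it do not.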

3. Explicit Failure Modes

Invalid inputs, invalid outputs, and policy violations should fail loudly. Silent degradation is how LLM systems become un-debuggable.

4. Separation of Concerns

System intent, user input, and tools should never be blended into one “prompt blob.” Boundary loss is one of the fastest ways to create instruction conflicts and accidental capability exposure.

5. Replayability as a First-Class Requirement

If a model decision matters, you should be able to reconstruct what happened. Without replay, you don’t have debugging—you have storytelling.

What a .prmpt File Defines

At a high level, a .prmpt file represents a contract for an LLM-backed component:

- what the component is for, and what it must never do
- what inputs it accepts, and how context is built from them
- what outputs are valid, and how they are checked
- what happens on violation: reject, retry, or escalate

This mirrors how reliable systems are built everywhere else: define the contract, control the boundaries, validate the outputs, and make failures observable. LLMs shouldn’t be exempt from basic engineering discipline.

What .prmpt Is Not

.prmpt does not magically make LLMs safe or eliminate hallucinations. It does not replace judgment. It does not guarantee perfect outputs.

What it does is make failures visible and actionable:

- invalid inputs are rejected at the boundary
- invalid outputs are caught by validation gates before they reach users
- deviations produce structured errors you can log, alert on, and fix

Why a Spec (Not Just a Library)

Libraries come and go. Specs outlive implementations. .prmpt is intentionally spec-first so that:

- contracts stay portable across languages, runtimes, and providers
- tooling such as validators, linters, and test harnesses can be built independently
- teams can adopt it incrementally without committing to one SDK

This is how the industry standardized everything that mattered: HTTP, SQL, OpenAPI, YAML. If LLMs are becoming infrastructure, they need infrastructure-grade contracts.

Where This Fits in Production

.prmpt is designed to compose with real systems:

- contracts live in version control and are reviewed like code
- validation gates run in CI and at runtime
- violations surface as structured errors in your existing observability stack

The point is not to make LLMs “smart.” The point is to make LLM behavior boring, inspectable, and defensible. That’s what scales.

Conclusion

LLMs are unreliable collaborators. Pretending they are deterministic components is how systems fail.

.prmpt is a structured contract for working with that reality: explicit boundaries, structured inputs, validation gates, and enforcement paths. Not magic—engineering.

If a system can’t be reasoned about, it can’t be trusted. .prmpt is how we make LLM systems reason-able again.

Rethinking “LLM-as-a-Judge” in Production Systems

There’s a growing pattern in modern LLM-powered systems: one model generates an output, and a second model is asked to review, score, or approve it.

This is often framed as LLMs judging LLMs. It sounds elegant. It sounds scalable.

It’s also a systems mistake.

This article is not anti-LLM. I use LLMs extensively. But I am deeply skeptical of using nondeterministic systems as final judges, especially when deterministic enforcement is available.

What a Judge Is Supposed to Do

A judge—human or machine—has a specific role in a system. It must provide:

- consistency: the same input yields the same verdict
- explainability: every verdict comes with a traceable reason
- replayability: any past verdict can be reproduced and audited

If your judge fails these properties, it is not enforcing rules. It is guessing.

LLMs, by default, fail all three. They are probabilistic, drift across versions, reinterpret intent, and cannot be meaningfully replayed.

This is acceptable for generation. It is unacceptable for enforcement.

Example 1: LLM-Generated SQL

This is the clearest case—and the most common misuse.

Teams want to ensure LLM-generated SQL:

- is syntactically valid against a known grammar
- touches only allowed tables and columns
- respects partition filters and cost limits
- contains no destructive statements

The common solution today:
“Let another LLM review the SQL.”

This is a category error.

SQL is a formal language with a defined grammar, executed against a known schema with explicit metadata. There is no ambiguity here.

Deterministic Enforcement

The correct approach is static analysis:

- parse the SQL into an AST
- validate every table and column against schema metadata
- enforce partition, cost, and statement-type rules
- reject anything that fails, with a structured reason

This yields identical behavior every time, clear failure reasons, and predictable cost and correctness.

An LLM judge cannot outperform a deterministic parser on a deterministic language.
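A production enforcer would use a full SQL parser and live schema metadata; this simplified, regex-based sketch only illustrates the shape of deterministic enforcement, with explicit rules in and a stable list of violations out. The table names and rules are hypothetical:

```python
import re

ALLOWED_TABLES = {"orders", "customers"}  # would come from your schema metadata
FORBIDDEN = re.compile(r"\b(drop|delete|update|insert|alter|grant|truncate)\b", re.I)

def check_sql(sql: str) -> list[str]:
    """Return a deterministic list of violations; an empty list means the query passes."""
    violations = []
    if not re.match(r"\s*select\b", sql, re.I):
        violations.append("only SELECT statements are allowed")
    if FORBIDDEN.search(sql):
        violations.append("destructive or DDL keyword present")
    tables = set(re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", sql, re.I))
    unknown = {t.lower() for t in tables} - ALLOWED_TABLES
    if unknown:
        violations.append(f"unknown tables referenced: {sorted(unknown)}")
    return violations

print(check_sql("SELECT * FROM orders JOIN customers ON ..."))  # → []
print(check_sql("DROP TABLE orders"))
```

The same query yields the same verdict every time, and every rejection names the rule that was broken.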

Example 2: “Polite” or “Professional” Text

This is where confusion often sets in.

Teams say:
“Politeness is subjective—we need an LLM to judge it.”

But “polite” is not magic. It is a policy.

Once defined, it becomes enforceable.

Deterministic Enforcement

- write the policy down as explicit rules: banned phrases, required salutations, length and formatting limits
- encode each rule as a lexical or structural check
- return every violation as a structured, reviewable error

LLMs are well-suited for constraint satisfaction via generation, but ill-suited for authoritative constraint evaluation.
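For example, a hypothetical "professional tone" policy might compile down to checks like these. The specific rules are illustrative stand-ins, not a recommended policy:

```python
import re

# A hypothetical "professional tone" policy, written down as explicit rules.
BANNED_PHRASES = ["asap", "no offense", "whatever"]
REQUIRED_CLOSING = re.compile(r"(best regards|sincerely|thank you)\s*,?\s*$",
                              re.I | re.M)

def check_tone(text: str) -> list[str]:
    """Deterministic tone check: explicit rules in, structured violations out."""
    violations = []
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            violations.append(f"banned phrase: {phrase!r}")
    if not REQUIRED_CLOSING.search(text.strip()):
        violations.append("missing a professional closing")
    if text.isupper():
        violations.append("all-caps text reads as shouting")
    return violations

print(check_tone("Send the report ASAP."))
print(check_tone("Please send the report by Friday.\n\nBest regards,"))  # → []
```

Each violation names the rule it broke, which is exactly the structured feedback a revision loop needs.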

Example 3: Preventing Private Data Leaks

Privacy is where LLM judges quietly become dangerous.

Teams often say:
“We’ll ask another LLM if the email leaks private data.”

Privacy violations are not matters of opinion. They are detectable patterns.

An LLM judge may miss real leaks, may hallucinate violations, and can guarantee neither recall nor auditability.

Deterministic Privacy Enforcement

- pattern detectors for emails, phone numbers, national IDs, and card numbers
- checksum validation (for example, Luhn for card numbers) to reduce false positives
- allow/deny lists tied to data classification
- conservative failure modes: when in doubt, block and escalate

This is how real DLP systems work. An LLM judge is not a DLP system.
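A minimal sketch of pattern-based detection, in the spirit of DLP tooling. The patterns and the key format are illustrative assumptions; real systems use far richer detectors:

```python
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{16,}\b"),  # hypothetical key format
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: filters out card-shaped numbers that cannot be real card numbers."""
    ds = [int(c) for c in digits if c.isdigit()][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(ds))
    return total % 10 == 0

def find_leaks(text: str) -> list[tuple[str, str]]:
    """Deterministic scan: same text in, same list of (kind, match) findings out."""
    leaks = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            if kind == "card" and not luhn_valid(match):
                continue  # card-shaped but fails the checksum: likely not a PAN
            leaks.append((kind, match))
    return leaks

print(find_leaks("Contact alice@example.com, card 4111 1111 1111 1111"))
```

Recall is bounded by the pattern set, which is a known, auditable limitation, unlike a judge whose misses cannot even be enumerated.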

Why LLM-as-a-Judge Feels Attractive

LLM-as-a-judge feels appealing because it reduces upfront thinking and avoids the work of defining hard rules.

But flexibility hides risk.

Rules fail loudly. LLM judges fail silently.

The Real Problem: Collapsing Trust Boundaries

The moment an LLM is allowed to approve content that crosses a trust boundary—SQL execution, outbound email, policy enforcement—you’ve inverted responsibility.

You’ve allowed a nondeterministic system to act as a gatekeeper.

Trust boundaries demand determinism, traceability, and conservative failure modes.

The Correct Architecture

  1. LLM generates output
  2. Deterministic enforcers validate
  3. Violations return structured errors
  4. LLM revises based on explicit feedback
  5. Repeat until valid

LLMs propose. Deterministic systems decide.
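The loop above can be sketched in a few lines. The `toy_llm` and `no_drop` stand-ins are hypothetical; the point is the control flow: the model proposes, deterministic validators decide, and structured errors drive revision:

```python
def generate_with_enforcement(llm, validators, prompt, max_attempts=3):
    """LLMs propose; deterministic systems decide. Loop until valid or give up."""
    feedback = ""
    for attempt in range(max_attempts):
        output = llm(prompt + feedback)  # the only probabilistic step
        errors = [err for check in validators for err in check(output)]
        if not errors:
            return output  # deterministic gate passed
        # Structured errors go back to the model as explicit constraints.
        feedback = "\n\nFix these violations:\n" + "\n".join(f"- {e}" for e in errors)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {errors}")

# Hypothetical stand-ins: a "model" that complies once told about its violations.
def toy_llm(prompt):
    return "SELECT 1" if "Fix these violations" in prompt else "DROP TABLE users"

def no_drop(sql):
    return ["destructive statement"] if "DROP" in sql.upper() else []

print(generate_with_enforcement(toy_llm, [no_drop], "Write a query"))  # prints: SELECT 1
```

Note that the gate, not the model, decides when the loop ends; failure after the attempt budget is loud and structured, not silent.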

The Principle

If a constraint can be expressed deterministically, it should never be enforced probabilistically.

LLM-as-a-judge should be the last resort, not the default. If something protects money, users, privacy, or trust, it must be deterministic.

Conclusion

This isn’t about distrusting LLMs. It’s about respecting system boundaries.

LLMs are powerful generators. They are not judges.

Engineering systems that pretend otherwise are outsourcing accountability to probability. That never ends well.

Architecture: Deterministic Enforcement Around LLMs

      ┌─────────────────────┐
      │   User / System     │
      │   Request           │
      └─────────┬───────────┘
                │
                ▼
      ┌─────────────────────┐
      │   LLM Generator     │
      │   (probabilistic)   │
      └─────────┬───────────┘
                │
                ▼
      ┌──────────────────────────────────────┐
      │   Deterministic Enforcers (Gate)     │
      │                                      │
      │   • SQL AST + schema validation      │
      │   • Partition / cost enforcement     │
      │   • Policy & tone rules              │
      │   • Privacy / PII / secret detection │
      │                                      │
      │   → Pass / Fail + structured errors  │
      └─────────┬───────────┬────────────────┘
                │           │
                │ pass      │ fail
                ▼           ▼
      ┌─────────────────┐   ┌──────────────────────────┐
      │   Execute /     │   │  LLM Revision Loop       │
      │   Send / Store  │   │  (explicit constraints)  │
      └─────────────────┘   └──────────┬───────────────┘
                                      │
                                      └─── back to LLM
      

Forkline: Treating Agent Behavior Like Code

There is a quiet problem in every team shipping LLM-powered software: nobody can say with confidence whether the agent running today behaves the same as the one that worked yesterday.

This is not a hypothetical. It is the default state of agentic systems today. And the standard response — logging, dashboards, vibes — does not solve it.

Forkline is a Python library I built to fix this. It makes agent runs reproducible, inspectable, and diffable. It treats nondeterminism as something to control, not just observe.

The Problem: Nondeterminism Without Accountability

LLMs are probabilistic. That is fine for generation. But when an LLM is embedded in a system — calling tools, writing SQL, making decisions — you need to know when its behavior changes.

Today, most teams cannot answer a simple question:

"Did this agent do the same thing it did yesterday?"

Not approximately. Not "the metrics look similar." Exactly. Step by step. Input by input. Output by output.

Without that answer, every deployment is a guess. Every model upgrade is a prayer. Every prompt change is untested in the only way that matters: behavioral identity.

What Forkline Does

Forkline is a local-first, replay-first tracing library. It records agent runs as structured artifacts — every step, every event, every tool call — and provides deterministic tools to compare them.

The core loop is four operations:

  1. Record — capture a run as a versioned, normalized artifact
  2. Replay — re-execute and compare against a known baseline
  3. Diff — find the first point where behavior diverged
  4. Gate — fail a CI build if agent behavior changed

That last one matters most. Forkline lets you commit an agent's behavioral baseline to version control and gate merges on it, the same way you gate merges on passing tests.

Recording: Structured, Not Scattered

Forkline records runs using an explicit, append-only model. No decorators. No magic. You instrument what matters.

from forkline import Tracer

query = "SELECT ..."

with Tracer() as tracer:
    with tracer.step("fetch_data"):
        tracer.record_event("input", {"query": query})
        result = execute_query(query)
        tracer.record_event("output", {"rows": len(result)})

    with tracer.step("generate_summary"):
        tracer.record_event("input", {"rows": result})
        summary = llm.generate(result)
        tracer.record_event("output", {"summary": summary})

Every run produces a Run object with typed Steps and Events. Events are classified as input, output, tool_call, or system. All payloads are JSON-serializable. All artifacts are versioned with a schema that guarantees forward and backward compatibility.

Diffing: First Divergence, Not Noise

When two runs differ, you do not want a wall of diffs. You want the first point where behavior diverged — and why.

Forkline's diffing engine classifies divergences into seven distinct types.

Each divergence comes with JSON patch diffs, surrounding context, and a human-readable explanation. This is not "something changed." This is "step 3 produced a different output given the same input, and here is exactly what changed."

$ forkline diff a1b2c3 d4e5f6

First divergence at step 3: "generate_summary"
  Type: OUTPUT_DIVERGENCE

  Input (identical):
    {"rows": [{"id": 1, "name": "Alice"}, ...]}

  Output diff:
    $.summary: "Alice has 3 orders" → "Alice placed 3 orders recently"

  Context:
    step 2: fetch_data     — matched
    step 3: generate_summary — DIVERGED ← you are here
    step 4: send_email     — not compared
      

CI Integration: Behavioral Gating

This is where Forkline becomes a build system primitive.

The forkline ci command suite lets you record a baseline artifact, commit it to version control, and gate merges on behavioral identity. If the agent does something different, the build fails.

# Record a baseline (local dev)
$ forkline ci record \
    --entrypoint examples/my_flow.py \
    --out tests/testdata/my_flow.run.json

# Commit it
$ git add tests/testdata/my_flow.run.json

# In CI: gate on behavioral identity
$ forkline ci check \
    --entrypoint examples/my_flow.py \
    --expected tests/testdata/my_flow.run.json \
    --offline
# Exit 0 = identical behavior
# Exit 1 = behavior changed → fail the build
      

The --offline flag is critical. It monkeypatches socket.connect at the Python level so that any network call — requests, httpx, urllib3, anything built on socket — raises immediately. No hangs. No timeouts. Deterministic failure.
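The underlying technique is straightforward to illustrate. This is a sketch of the general pattern, not Forkline's internal implementation:

```python
import socket

class OfflineGuard:
    """Make any outbound connection attempt fail fast instead of hanging."""

    def __enter__(self):
        self._original = socket.socket.connect
        def blocked(sock, address):
            raise RuntimeError(f"offline mode: blocked connection to {address}")
        socket.socket.connect = blocked
        return self

    def __exit__(self, *exc):
        socket.socket.connect = self._original  # always restore on exit
        return False

with OfflineGuard():
    s = socket.socket()
    try:
        s.connect(("127.0.0.1", 9))  # raises immediately: no hang, no timeout
    except RuntimeError as e:
        print(e)
    finally:
        s.close()
```

Because every HTTP client ultimately goes through `socket.connect`, one patch point covers them all.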

CI artifacts are normalized JSON files. Timestamps are stripped. Platform metadata is removed. Events are sorted. Two recordings of the same behavior on different machines, at different times, produce identical artifacts.
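The idea behind normalization can be sketched in a few lines. The field names here are illustrative, not Forkline's actual schema:

```python
import json

VOLATILE = {"timestamp", "duration_ms", "hostname", "platform"}

def normalize(run: dict) -> str:
    """Strip volatile fields and canonicalize, so identical behavior → identical bytes."""
    def clean(node):
        if isinstance(node, dict):
            return {k: clean(v) for k, v in node.items() if k not in VOLATILE}
        if isinstance(node, list):
            return [clean(v) for v in node]
        return node
    # sort_keys + fixed separators yield one canonical byte representation
    return json.dumps(clean(run), sort_keys=True, separators=(",", ":"))

run_a = {"steps": [{"name": "fetch", "timestamp": 1700000000, "output": {"rows": 3}}],
         "hostname": "laptop-1"}
run_b = {"hostname": "ci-runner-7",
         "steps": [{"output": {"rows": 3}, "name": "fetch", "timestamp": 1700009999}]}

print(normalize(run_a) == normalize(run_b))  # → True
```

Once artifacts are canonical bytes, "did behavior change?" reduces to a string comparison.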

Exit Code Contract

CI pipelines need machine-readable outcomes. Forkline defines a strict, stable exit code contract: exit 0 means behavior matched the baseline; exit 1 means behavior diverged.

These values are documented, tested individually, and will not change across releases.

Testing: One-Line Behavioral Assertions

For teams using pytest or unittest, Forkline provides a snapshot-style test helper:

from forkline.testing import assert_no_diff

def test_my_agent_flow():
    assert_no_diff(
        entrypoint="examples/my_flow.py",
        expected_artifact="tests/testdata/my_flow.run.json",
        offline=True,
    )
      

On failure, it raises ArtifactDiffError with the first divergent event, expected vs actual payloads, a structured diff, and a suggested re-record command. This is not a flaky test. This is a test that tells you exactly what changed and how to fix it.

Design Principles

Forkline is opinionated about how agent infrastructure should work:

- local-first: artifacts live on disk and in version control, not a hosted dashboard
- replay-first: recording exists to enable replay and comparison, not just viewing
- explicit over magical: you instrument what matters; no decorators, no import hooks
- redaction by default: sensitive data is stripped before anything touches disk

Security: Redaction by Default

Agent runs contain sensitive data — API keys, user inputs, PII. Forkline enforces redaction at capture time, before anything touches disk.

The RedactionPolicy supports three strategies for transforming matched values before they are written to disk.

Matching is done by key name, dot-separated path, or regex pattern. The default SAFE mode redacts LLM prompts and responses, tool I/O, and anything that looks like a secret. Secrets never reach disk. This is not configurable — it is the default.
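To illustrate the general pattern (not Forkline's actual API: the mask/hash/drop strategies below are my own stand-ins):

```python
import hashlib
import re

SECRET_KEYS = re.compile(r"(api[_-]?key|token|password|secret)", re.I)

def redact(payload, strategy="mask"):
    """Redact sensitive values at capture time, before anything is persisted.
    mask replaces the value, hash keeps comparability across runs, drop removes it."""
    if isinstance(payload, dict):
        out = {}
        for key, value in payload.items():
            if SECRET_KEYS.search(key):
                if strategy == "drop":
                    continue  # the field never reaches the artifact at all
                out[key] = ("[REDACTED]" if strategy == "mask"
                            else hashlib.sha256(str(value).encode()).hexdigest()[:12])
            else:
                out[key] = redact(value, strategy)
        return out
    if isinstance(payload, list):
        return [redact(v, strategy) for v in payload]
    return payload

event = {"tool": "send_email", "api_key": "sk-live-abc123", "body": "hello"}
print(redact(event))                   # api_key → [REDACTED]
print(redact(event, strategy="hash"))  # api_key → stable short digest
```

Hashing is worth noting: it hides the secret while still letting a diff confirm that the same secret was used in two runs.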

Why Not Observability Tools?

Existing observability tools — LangSmith, Weights & Biases, OpenTelemetry — are built for monitoring. They answer: "What happened?"

Forkline answers a different question: "Did the same thing happen?"

Monitoring is about aggregation. Forkline is about identity. Monitoring shows trends. Forkline shows diffs. Monitoring runs in production. Forkline runs in CI.

These are complementary, not competing. But if you are shipping agents without behavioral gating, you are testing less than you think.

The Principle

If you would not ship code without tests, you should not ship agents without behavioral baselines.

An agent that "seems to work" is not tested. An agent whose behavior is recorded, diffed, and gated — that is tested.

Conclusion

Forkline exists because "it changed" is not a useful debugging answer.

LLMs are nondeterministic. That is their nature. But the systems we build around them do not have to be. We can record what agents do. We can replay it. We can diff it. We can fail builds when behavior changes unexpectedly.

We can treat agent behavior like code — versioned, diffed, and gated.

That is what Forkline does.

Architecture: Record → Replay → Diff → Gate

      ┌──────────────────────┐
      │   Agent Script       │
      │   (nondeterministic) │
      └─────────┬────────────┘
                │
                ▼
      ┌──────────────────────┐
      │   Forkline Tracer    │
      │                      │
      │   • Steps + Events   │
      │   • Redaction        │
      │   • Canonicalization │
      └─────────┬────────────┘
                │
                ▼
      ┌──────────────────────────────────────┐
      │   Normalized Artifact (.run.json)    │
      │                                      │
      │   • Timestamps stripped              │
      │   • Platform metadata removed        │
      │   • Events sorted                    │
      │   • Schema-versioned                 │
      └─────────┬───────────┬────────────────┘
                │           │
         commit to git    compare
                │           │
                ▼           ▼
      ┌─────────────────┐   ┌──────────────────────────┐
      │   Baseline      │   │  Diff Engine             │
      │   (expected)    │──▶│  First-divergence search │
      └─────────────────┘   │  7 divergence types      │
                            │  JSON patch output       │
                            └──────────┬───────────────┘
                                       │
                            ┌──────────┴───────────────┐
                            │                          │
                       pass │                     fail │
                   (exit 0) │                 (exit 1) │
                            ▼                          ▼
                      CI passes             Build fails with
                                            structured diff
                                            + fix hint