Experiments
This page documents controlled experiments investigating sources of nondeterminism in agentic systems. Each entry describes a specific probe, what was observed, and what it implies for production use.
Experiments: Spelunk CLI — Structured Investigation for Complex Data Systems
The problem
Large Airflow DAG ecosystems become archaeology sites. A single incident spans Airflow task logs, Docker images running on Kubernetes, BigQuery SQL transformations, Dockerized services producing to Kafka, Kafka consumers sinking data back to BigQuery, and Alembic-managed tables whose schemas live in migration files, not documentation.
When DAGs depend on other DAGs, and those dependencies number over 100 and are scattered throughout the codebase, table and schema lineage becomes incomplete or undocumented. Debugging is no longer local: you cannot isolate a single component, and every investigation requires reconstructing context from multiple systems, each with its own logging, state management, and failure modes.
Why investigations break down
Investigations can fail twice: once in production, and once in reasoning. The production failure is the incident. The reasoning failure happens when you lose track of what you tested, which hypotheses you ruled out, and why you checked a specific table or config.
Without structure, investigations become:
- Ad-hoc terminal commands with no record of what was run or why
- Slack threads where conclusions are mixed with speculation
- Jupyter notebooks that capture queries but not the reasoning
- Context that evaporates after the incident closes
You cannot replay the investigation. You cannot hand it off mid-stream. You cannot verify whether a hypothesis was actually tested or just assumed.
The experiment
Spelunk CLI is an investigation workflow tool designed to be used alongside Cursor (or any LLM-assisted editor). It does not automate investigations. It structures the workspace around the things that investigations usually lose: code context, dependency boundaries, and a durable written trail.
Cursor is used to navigate the codebase, summarize unfamiliar components, and draft candidate queries or hypotheses. Spelunk enforces a critical constraint: it does not execute queries. Humans execute queries. The investigation captures what was run and what was learned.
How Spelunk approaches investigations
Spelunk treats an “integration” as the unit of debugging: not a single DAG, but a connected surface area of DAGs, shared libraries, containers, and downstream consumers.
spelunk init <integration> initializes a local workspace with three pillars:
- .repo_farm/: a local checkout of all repositories referenced by the integration config, separated into:
  - DAG repos (the orchestration layer)
  - shared_dependencies (libraries and shared code used across DAGs/services)
- .documents/: LLM-generated documentation, partitioned by integration and then by DAG ID.
- schemas/: cached JSON schema definitions for all tables in the integration for a specific project. These are used to validate table schemas and to ground the LLM during investigation, serving as a guardrail against hallucinated schemas and against queries that skip partitions or other optimizations.
When an incident starts, spelunk investigate <integration> creates a timestamped investigation
directory under .investigations/ and opens a templated README.
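As a sketch of what this step might do under the hood (the function name, template fields, and timestamp format here are illustrative assumptions, not Spelunk's actual implementation):

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical README skeleton; the real Spelunk template is more detailed.
TEMPLATE = """# Investigation: {integration}
Started: {started}

## Phase 1: (suspected issue)
- Hypothesis:
- Query 1:
- Results:
"""

def start_investigation(integration: str, root: Path = Path(".investigations")) -> Path:
    """Create a timestamped investigation directory with a templated README."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    inv_dir = root / f"{stamp}_{integration}"
    inv_dir.mkdir(parents=True)
    (inv_dir / "README.md").write_text(
        TEMPLATE.format(integration=integration, started=stamp)
    )
    return inv_dir
```

The timestamp in the directory name is what gives investigations their temporal ordering later.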
The investigation template is intentionally rigid, but it is not manually filled line by line. It is populated through a structured back-and-forth between the engineer and the LLM.
Each investigation unfolds in explicit phases, appended to the README in a fixed format:
- Phase N: Description of the suspected issue or failure mode
- Hypothesis: A concrete, testable claim generated or refined by the LLM
- Query N: A candidate query or inspection step proposed by the LLM
- Results: Output pasted in by the human after manual execution
This creates a logged conversation between the engineer and the agent, where reasoning is captured incrementally and evidence is explicitly attached to each hypothesis. The format makes it impossible to “mentally skip” steps or assume something was checked when it was not.
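The fixed phase format above can be sketched as a simple append-only helper (hypothetical; the actual template fields may differ):

```python
from pathlib import Path

def append_phase(readme: Path, n: int, description: str,
                 hypothesis: str, query: str) -> None:
    """Append one phase block in the fixed format; results are pasted in later."""
    block = (
        f"\n## Phase {n}: {description}\n"
        f"- Hypothesis: {hypothesis}\n"
        f"- Query {n}: {query}\n"
        f"- Results: (paste output of manual execution here)\n"
    )
    with readme.open("a") as f:  # append-only: earlier phases are never rewritten
        f.write(block)
```

Because phases are only ever appended, the README doubles as a chronological log of what was believed and checked at each point.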
The workflow is deliberately human-led. Cursor helps you search the .repo_farm, trace call paths,
and reason across unfamiliar code. Spelunk complements this with templated .prmpt files that
standardize how questions are asked during an investigation.
These prompt templates encode investigation intent — for example, how to ask about schema usage, dependency boundaries, or DAG-to-DAG coupling — while constraining the LLM to the local repository and cached schema context. They are designed to reduce hallucination, not eliminate judgment.
Concretely, documentation is generated by the engineer using prompt templates in prompts/.
Each template is a .prmpt file: you paste it into Cursor with the integration’s local context
(DAG repo, shared_dependencies, and cached schemas/). Cursor then walks the DAG code,
its dependency graph, and referenced tables to produce a detailed technical write-up, which Spelunk stores under
.documents/ (partitioned by integration and DAG) for fast lookup during incidents.
Execution stays manual. You run the SQL / kubectl / log queries yourself, review the results, and paste
them back into the investigation README (or linked files) as evidence. The .prmpt files guide
reasoning; humans remain responsible for decisions.
What Spelunk enforces (by design)
- No autonomous execution. Queries are reviewed before they run.
- Manual logging. If you don't log it, it didn't happen. This forces conscious capture.
- Temporal ordering. Files are timestamped. You see the investigation as it unfolded.
- Disposability. Investigations are temporary. Once conclusions move to Confluence or a postmortem doc, the Spelunk directory is archived. It served its purpose.
Safety note: Disposable does not mean "accidentally delete the only copy." Conclusions must migrate to durable storage before the investigation directory is removed.
Why this matters
In systems with connected DAGs, Docker layers, Kafka pipelines, and Alembic-managed schemas, incidents are distributed by default. Debugging requires reconstructing state across code, data, and execution history. Doing this mentally does not scale.
Without structure, engineers spend most of their time:
- re-discovering where code lives
- re-building partial mental models of dependencies
- re-checking assumptions that were already tested elsewhere
That is where time is lost.
Spelunk externalizes that reconstruction. By pulling all relevant repositories into a single workspace, separating DAG code from shared dependencies, and forcing explicit written reasoning, it collapses the search space early.
The result is not just cleaner investigations — it is faster ones.
Investigations that previously took days of context-gathering converge in minutes because:
- the relevant code is already local
- dependency boundaries are explicit
- schema history and undocumented lineage are surfaced early
- hypotheses and evidence are written down instead of re-inferred
Spelunk does not speed up debugging by automation. It speeds it up by eliminating repeated discovery and cognitive thrash.
You still run the queries. You just stop re-learning the system from scratch every time.
Gotchas and how they are handled
Spelunk is intentionally constrained, and those constraints surface predictable failure modes. This section documents the most common ones observed so far, along with the practical resolutions.
1. Multi-DAG issues and partial context
In complex platforms, failures often span multiple connected DAGs. Spelunk generates detailed documentation per DAG, including that DAG’s dependencies, schemas, and inferred lineage. As a result, the accuracy and usefulness of an investigation can vary depending on how much of the true upstream context is visible.
When the failure is caused upstream, a single-DAG investigation may surface symptoms rather than root cause. This is not a tooling bug — it is a context boundary.
Resolution: Start investigations from the most downstream DAG where the failure is observed, then move upward through upstream DAGs incrementally. Each investigation refines context and narrows the search space. This mirrors how failures actually propagate through production systems.
2. Poorly generated SQL despite guardrails
Even with schema caching and constrained prompts, LLM-generated SQL can be incorrect. This is an expected consequence of model nondeterminism, not a surprise.
The most common failure modes are:
- Missing or incorrect partition filters
- Hallucinated column names
- Invalid SQL syntax, particularly with BigQuery-specific constructs
Hallucinated columns: Reintroduce the relevant schema documentation into the Cursor context and provide the exact error message returned by BigQuery. In most cases, the LLM corrects the query on the next iteration.
Missing or incorrect partitions: Apply human judgment. Understanding which partition is relevant depends on the debugging intent (backfill vs. incremental run vs. replay). This is a core reason execution remains human-led.
Incorrect SQL syntax: Provide a reference to the BigQuery SQL syntax or correct the query manually. Syntax errors are mechanical and faster to fix directly than to debate with the model.
These failure modes reinforce a core design principle: Spelunk accelerates investigations by structuring reasoning and context, not by delegating authority. The human remains responsible for correctness.
SQLGlot, TextBlob, and Regex — Free and Deterministic Alternatives to LLM-as-a-Judge
The goal
The goal of these experiments was to make the argument in this article concrete: if a constraint can be expressed deterministically, it should be enforced deterministically.
In all three experiments, the enforcement tools were deliberately chosen for one property above all others: determinism. Given the same input and the same rules, they always produced the same result. There was no reinterpretation, no drift, and no probabilistic variance.
Instead of asking a second LLM to “review” or “judge” outputs, I tested a replacement architecture: LLMs generate, deterministic tooling enforces, and violations return structured errors that feed a revision loop.
Why this matters
“LLM-as-a-Judge” feels appealing because it avoids writing rules. But at trust boundaries, the problem is not stylistic. It is structural.
- Rules fail loudly. They produce explicit, repeatable violations.
- LLM judges fail silently. The same input can yield different approvals over time.
These experiments focus on three domains where teams often default to probabilistic judgment even though fully deterministic enforcement is available.
Experiment 1: SQL enforcement with SQLGlot (AST + schema validation)
Problem: Teams want to execute LLM-generated SQL while preventing mechanical failures and expensive mistakes: hallucinated columns, wrong tables, missing partition filters, and accidental full scans.
Common “judge” approach: ask another LLM whether the SQL “looks right.”
Deterministic alternative: parse SQL into an AST, extract referenced identifiers, and validate against a frozen schema snapshot.
What was enforced
- SQL validity: parseability and syntactic correctness
- Schema correctness: referenced tables and columns must exist in a frozen snapshot
- Cost constraints: required partition predicates must be present
- Fail-closed behavior: unknown identifiers are rejected with explicit reasons
Determinism guarantee: Given the same SQL text, schema snapshot, and enforcement rules, SQLGlot produced identical validation results on every run. There was no variance across executions.
Outcome: Enforcement decisions were stable, replayable, and debuggable. Failures were machine-readable and consistently correctable by the generator in a revision loop.
Experiment 2: “Professional tone” as enforceable policy (TextBlob + explicit rules)
Problem: Teams often claim tone is subjective and rely on an LLM to judge whether outbound text is “professional.”
In practice, production definitions of “professional” usually reduce to explicit policy: no profanity, no harassment, no threats, no discriminatory language, and no sexual content.
Deterministic enforcement approach
- Policy rules: lexicon-based profanity and slur detection
- Pattern checks: deterministic harassment and threat patterns
- Sentiment signals: TextBlob polarity used only as a thresholded signal, not as authority
- Structured violations: named rule failures with explicit evidence
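A sketch of the rule layer, assuming a deliberately tiny illustrative lexicon and threat patterns (real policy lists are much longer; TextBlob's polarity score would slot in as one more thresholded, named rule):

```python
import re

# Illustrative policy lexicon and patterns -- stand-ins, not a real policy.
PROFANITY = {"damn", "hell"}
THREAT_PATTERNS = [
    re.compile(r"\bor else\b", re.I),
    re.compile(r"\byou will regret\b", re.I),
]

def check_tone(text: str) -> list[dict]:
    """Return named rule violations with evidence; empty means the text passed."""
    violations = []
    words = set(re.findall(r"[a-z']+", text.lower()))
    for w in sorted(words & PROFANITY):
        violations.append({"rule": "profanity", "evidence": w})
    for pat in THREAT_PATTERNS:
        m = pat.search(text)
        if m:
            violations.append({"rule": "threat_pattern", "evidence": m.group(0)})
    # A thresholded sentiment signal (e.g. TextBlob polarity below some cutoff)
    # could be appended here as one more named rule -- a signal, not an authority.
    return violations
```

Every violation names the rule that fired and the evidence that triggered it, which is exactly what the generator needs for a targeted revision.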
Determinism guarantee: Given the same text and the same policy rules, enforcement results were identical across runs. The system did not reinterpret tone or “change its mind.”
Outcome: The system enforced written policy, not vibes. When text failed, it failed for the same reason every time, producing concrete remediation targets for the generator.
Experiment 3: Privacy leakage detection with regex + checksum validation
Problem: Some teams ask an LLM judge whether outbound text leaks private data. This is the riskiest usage pattern because privacy enforcement must be conservative and auditable.
Deterministic enforcement approach
- PII patterns: regex detection for emails, phone numbers, and IDs
- Payment data: checksum validation for credit card numbers
- Secrets: deterministic secret scanners for API keys and tokens
- Trust boundaries: allowlists and denylists for internal domains
- Structured reporting: explicit violation categories with evidence
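A sketch of the pattern-plus-checksum layer (the regexes are simplified for illustration; production DLP patterns are more exhaustive). The Luhn check is what separates a random 16-digit number from a plausible card number:

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # simplified candidate pattern

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate card number."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(text: str) -> list[dict]:
    """Return explicit violation categories with evidence; empty means clean."""
    findings = []
    for m in EMAIL.finditer(text):
        findings.append({"category": "email", "evidence": m.group(0)})
    for m in CARD.finditer(text):
        if luhn_valid(m.group(0)):  # checksum gate cuts digit-run false positives
            findings.append({"category": "credit_card", "evidence": m.group(0)})
    return findings
```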
Determinism guarantee: The same text always produced the same privacy findings under the same ruleset. There was no probabilistic approval or missed violation due to reinterpretation.
Outcome: Enforcement behaved like a real DLP system: replayable, explainable, and biased toward failing closed. These properties are not optional at privacy boundaries.
What these experiments demonstrate
Across all three domains, the same pattern emerged:
- Deterministic inputs produced deterministic outcomes.
- Violations were explicit, stable, and inspectable.
- Enforcement logic was testable with fixed fixtures.
- LLMs were most effective inside the revision loop, not at the gate.
The architectural takeaway
These experiments reinforce the core architecture of this article:
- LLMs generate candidate outputs.
- Deterministic enforcers validate.
- Failures return structured errors.
- LLMs revise based on explicit constraints.
- Repeat until valid.
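The loop above can be sketched generically; `generate` and `validate` are caller-supplied callables (an LLM call and one of the deterministic enforcers above), and the names are illustrative:

```python
def revision_loop(generate, validate, max_rounds=3):
    """Run generate -> enforce -> revise until the validator passes or we give up."""
    feedback = []
    for _ in range(max_rounds):
        candidate = generate(feedback)    # LLM proposes (stubbed in tests)
        violations = validate(candidate)  # deterministic enforcer decides
        if not violations:
            return candidate
        feedback = violations             # structured errors drive the revision
    raise RuntimeError(f"no valid candidate after {max_rounds} rounds: {feedback}")
```

The enforcer never loosens; only the candidate changes between rounds, so every approval is replayable against the same rules.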
LLMs propose. Deterministic systems decide.
Determinism is not about rigidity. It is about making trust boundaries replayable, inspectable, and accountable.