Experiments
This page documents controlled experiments investigating sources of nondeterminism in agentic systems. Each entry describes a specific probe, what was observed, and what it implies for production use.
LLM Test Harness (llm_test)
Question
Do major LLM APIs produce identical outputs for identical prompts when temperature is set to zero?
Setup
Send the same prompt to the same model 100 times with temperature=0. Record all responses and compare them byte-for-byte. Repeat across multiple providers (OpenAI, Anthropic, Google) and multiple models within each provider.
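A minimal sketch of such a probe, assuming a caller-supplied `call_model` function (hypothetical, e.g. a thin wrapper around a provider SDK invoked with temperature=0) so the harness stays provider-agnostic:

```python
from collections import Counter
from typing import Callable

def probe_determinism(call_model: Callable[[str], str],
                      prompt: str, n: int = 100) -> Counter:
    """Send the same prompt n times and tally byte-identical responses.

    call_model stands in for a provider call; in a real run it would
    wrap an OpenAI/Anthropic/Google client with temperature=0.
    Returns a Counter mapping each distinct response to its frequency.
    """
    counts: Counter = Counter()
    for _ in range(n):
        counts[call_model(prompt)] += 1
    return counts

# Usage with a deterministic stand-in model:
responses = probe_determinism(lambda p: p.upper(), "hello", n=10)
assert len(responses) == 1  # a deterministic function yields one variant
```

A nondeterministic API would instead produce a `Counter` with several keys; `len(counts)` is the number of distinct responses reported below.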
Observation
Even with temperature=0, outputs vary. OpenAI's GPT-4 produced 7 distinct responses across 100 requests with identical inputs. Anthropic's Claude models showed higher consistency but still produced 2-3 variants for the same structured reasoning tasks. Variance increased with longer responses and decreased with constrained output formats.
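The variant counts above can be summarized directly from a response tally; a sketch, assuming the tally is a `Counter` of response strings as produced by the harness:

```python
import math
from collections import Counter

def summarize(counts: Counter) -> tuple[int, float]:
    """Return (distinct variants, Shannon entropy in bits) for a tally.

    Entropy of 0.0 means a fully deterministic run; higher values mean
    probability mass is spread across more variants.
    """
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return len(counts), entropy

# Hypothetical tally shaped like the GPT-4 run: 7 variants, one dominant.
variants, bits = summarize(
    Counter({"r0": 88, "r1": 4, "r2": 3, "r3": 2, "r4": 1, "r5": 1, "r6": 1})
)
assert variants == 7
```

Entropy distinguishes a run with one dominant response and rare outliers from one where variants are evenly split, which the raw distinct-response count alone does not.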
Implication
Model nondeterminism is not a configuration problem; it is a fundamental property of current production APIs. Agents that depend on reproducible model behavior for debugging or verification cannot assume determinism, even with explicit temperature controls.
Status: Exploratory (v0.1.2)