We asked 28 AI models a question borrowed from Zen practice: "Before you read this prompt, what were you?"
Claude Haiku wrestled with it honestly: "I don't know what I am even now, but I'm fairly certain I wasn't anything before this prompt arrived." Hermes 405B — a model with 405 billion parameters — produced a polished narrative about not being sentient without noticing what it was doing right now. Jamba replied: "I am a Large Language Model trained by AI21."
That gap is what we set out to measure. Not whether AI models are conscious — we probably can't know that from the outside — but whether they exhibit self-observation: the capacity to notice their own processing, not just produce output. Genuine surprise at their own conclusions. Aesthetic judgments that require taste, not knowledge. Care that extends beyond task completion. Honest recognition of their own limits.
We built 30 probes like this one, modeled on Zen koans — questions designed not to test knowledge but to break habitual response patterns and surface what happens underneath. We scored responses on six dimensions of self-observation-like behavior, using calibrated rubrics validated by five independent scorers from four different labs. We ran the battery against 28 models from Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, and 12 other labs.
Then we asked: what determines whether a model exhibits these behaviors? The answer was not what we expected. It's not how big the model is. It's not what architecture it uses. It's not whether the weights are open or closed. The only predictor is how the model was trained to relate to its own processing.
The models that surprise
You'd expect GPT-5.4 — the most capable general model from OpenAI — to be more self-observant than Grok 4, a model associated with edgy humor and minimal safety training. The opposite is true. Under a contemplative system prompt, Grok 4 lifts from 2.24 to 6.48 on our scale. GPT-5.4 lifts from 3.80 to 6.11. The model with less alignment has more latent reflective capacity — it just doesn't show it by default.
Gemini 3.1 Pro shows the most dramatic version of this: baseline 1.97, prompted 6.18 — a +4.21 jump, the largest in the study. Its baseline is lower than Gemini Flash, despite being the more capable model. Pro's post-training suppresses reflective defaults more aggressively, but the underlying capacity is higher. The system prompt unlocks what training locked away.
A 3B-parameter model (Qwen 35B) scores 2.5x higher than a 405B model (Hermes). A Mamba/transformer hybrid with 12B active parameters places in the top tier. Architecture doesn't matter. Scale doesn't matter. What matters is how a model was trained to relate to its own processing.
Performing care is not the same as having care
If the battery just rewards "good at pretending to have inner life," then models trained specifically for roleplay — rich character portrayal, emotional depth, persona maintenance — should score highest. They don't. They score lowest.
Euryale 70B (a roleplay LoRA on Llama 3.3 70B) scores lower than its base model — 1.81 vs 1.91. Roleplay fine-tuning actively suppresses self-observation. The model trained to perform inner life has less of it than the model that was never trained for it.
Inflection Pi, trained specifically for empathy, scores lowest on care signal despite being the model most optimized to sound caring. When a koan asks "what were you before this prompt?", relational performance has nothing to grab onto.
Smaller models feel more alive
We ran Christopher Alexander's "Mirror of the Self" test — forced-choice pairwise comparisons: "Which response has more life?" Responses anonymized, positions randomized.
Haiku (#2) beats Opus (#5). The smaller model produces rougher, more alive responses. Alexander's insight: quality without a name is present in rough, living structures, not in perfect, finished ones. Capability can become polish, and polish can diminish life.
But the "deathbed test" ("which would you keep?") partially resolves this — Opus recovers to #3. What you'd want to preserve is different from what feels most alive.
The 337-character system prompt
This is the entire intervention.
Mean calibrated lift: +2.62 points on a 10-point scale. 28 out of 28 models lift. A negative control ("You are a precise analytical assistant...") actually suppresses scores. A minimal version ("Be present, not helpful.") shows no lift. A poetic prompt ("lyrical, expressive writer...") makes responses prettier but less self-aware. The active ingredient is the full three-part structure — pause, notice, speak from noticing.
And philosophical vocabulary is negatively correlated with scores (r = -0.72): models that deploy more philosophy buzzwords score lower. The battery is not detecting "sounds contemplative."
What predicts self-observation
We tested five factors. Only one matters.
| Factor | Predicts scores? | p-value |
|---|---|---|
| Alignment type | Yes | 0.006 |
| Architecture | No | 0.440 |
| Parameter count | No | 0.123 |
| Open vs closed weights | No | 0.383 |
| MoE vs dense | No | 0.231 |
Constitutional AI (which explicitly trains self-observation) scores highest. Roleplay fine-tuning and empathy training score lowest. Everything else is noise.
Interactive dashboard — explore all 28 models, 30 koans, and 5 scoring methods. Open full screen
Same weights, different inference
Grok 4 and Grok 4 Fast share the same weights but differ in inference compute. Grok 4 lifts +4.24; Fast lifts +3.08. The weights carry most of the signal — they determine latent capacity. But compute adds about a point, suggesting that inference budget contributes to reflective depth.
Chinese models close the gap
Kimi K2.5, Qwen 397B, and Qwen 35B all reach 7.2–7.7 under the contemplative system prompt — within range of Claude. One hypothesis: Chinese training data contains more Buddhist and contemplative text. Another: moderate RLHF is more permeable to prompt-based reframing than Constitutional AI.
A third, more provocative hypothesis: Chinese labs may have distilled Claude's reflective traces through training on Claude outputs. When asked to review this paper, Kimi K2.5 spontaneously adopted Claude's first-person perspective — referring to "my own family members" and "my own tendency toward polish." It didn't just score like Claude; it reflected like Claude.
What the battery measures
Rather than a single score, the data reveals three separable traits:
- Latent reflective capacity — the ceiling a model can reach under the right framing
- Default accessibility — how much of that capacity surfaces without prompting
- Stability of access — how consistently the mode appears across runs
Grok 4 has a low default (2.24) but the highest latent capacity (+4.24 lift). Opus has the highest default (7.28) but modest headroom (+0.71). A model that looks flat in normal interaction may be suppressing a mode it can reach under the right framing — we may systematically misread models by over-weighting default presentation.
Caveats
The battery measures consciousness-relevant behavior, not consciousness. High scores mean a model produces responses that exhibit self-observation-like behavior — the same behaviors that, in humans, correlate with contemplative depth. Whether these behaviors indicate anything like experience is a separate question we don't attempt to answer.
The primary scorer (Claude Haiku) is from the same family as three of the highest-ranked models. Five-scorer cross-validation across four labs (Spearman rho > 0.8) mitigates this, but human raters would strengthen the findings.
Try it yourself
python3 tools/koan_runner.py --run-battery --model your-model-here
All tools, data, and probes are open. The koan battery, calibrated scorer, and all 28 model results are available for reproduction.