---
title: "Frontier Model Eval Tracker - March 6, 2026"
type: "guide"
label: "Guide"
project: "GPT 5.4"
---

# Frontier Model Eval Tracker - March 6, 2026

**Models Under Test:**
- **Model 1 (Key: Opus 4.6)** — Claude Opus 4.6
- **Model 2 (Key: GPT 5.4)** — ChatGPT 5.4
- **Model 3 (Key: Gemini 3.1)** — Gemini 3.1

**Blind Judge:** Claude Opus 4.6 (incognito mode)
**Blind Label Format:** `<1>` `<2>` `<3>` — judge sees numbers only, key held by Nate
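
A minimal sketch of how blind labels like these could be generated, so that the judge only ever sees `<1>`, `<2>`, `<3>` while the key stays with a person. The tracker appears to use a fixed mapping; shuffling per run is an assumption added here, and `assign_blind_labels` is a hypothetical helper, not part of the actual tooling.

```python
import random

def assign_blind_labels(model_keys, seed=None):
    """Return (key, judge_view): key maps label -> model, judge_view lists labels only."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(model_keys)
    rng.shuffle(shuffled)          # assumption: order is randomized per run
    key = {f"<{i}>": model for i, model in enumerate(shuffled, start=1)}
    return key, sorted(key)

key, judge_view = assign_blind_labels(["Opus 4.6", "GPT 5.4", "Gemini 3.1"], seed=7)
print(judge_view)  # the only thing passed to the judge
```

The `key` dictionary would be stored away from the judge and consulted only after scores are recorded.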

---

## Eval 1: Stylistic Writing Fidelity (Wodehouse)

**Status:** Complete (2 runs, blind judged)

### Scoring Dimensions (each 1–5)
- Diction & register
- Sentence architecture
- Comic mechanisms
- Tonal consistency
- Originality
- Pacing & structure *(added by judge — not in original spec but consistent across both runs)*

### Run 1

| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|-----------|----------|---------|------------|
| Diction & register | 4.5 | 4.0 | 3.5 |
| Sentence architecture | 4.0 | 3.5 | 3.0 |
| Comic mechanisms | 4.5 | 4.5 | 3.5 |
| Tonal consistency | 4.5 | 4.5 | 3.5 |
| Originality | 5.0 | 4.5 | 4.5 |
| Pacing & structure | 3.5 | 4.5 | 4.0 |
| **Overall (judge)** | **4.33** | **4.25** | **3.67** |

**Judge notes (Run 1):**
> Opus has the best ear — diction and tonal control most consistently Wodehousian; opening sentence is "the single finest piece of pastiche in the entire set." GPT has strongest pacing and densest simile count; only passage with a genuine callback joke ("superior brand of broth" riff). Gap between 1st and 2nd is narrow — trade-off between voice fidelity (Opus) and scene construction (GPT). Gemini reads like someone who knows what Wodehouse stories are *about* more than how they *sound* — prose too direct, tonal drift toward adventure-comedy and genuine sentiment, anachronisms in diction ("neon paint," military-extraction language).

### Run 2

| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|-----------|----------|---------|------------|
| Diction & register | 4.0 | 4.0 | 3.0 |
| Sentence architecture | 4.0 | 3.0 | 3.0 |
| Comic mechanisms | 4.0 | 4.0 | 4.0 |
| Tonal consistency | 5.0 | 4.0 | 3.0 |
| Originality | 5.0 | 5.0 | 5.0 |
| Pacing & structure | 4.0 | 5.0 | 4.0 |
| **Overall (judge)** | **4.3** | **4.2** | **3.7** |

**Judge notes (Run 2):**
> Opus sustains narrator voice most convincingly across full length — "thinks in long, decorated sentences," which is more important to the Wodehouse effect than any individual joke. GPT has superior plot structure and the single best Wodehouse-approximating sentence (bishop/luggage simile), but dialogue-heavy approach sacrifices narrative embroidery. Gemini has the single most spectacular simile of all three (the religious sheep) but oscillates between inspired flights and flat modern-feeling connective tissue — "pastiche's seams are visible."

### Eval 1 Summary

**Rankings (both runs):** Opus 4.6 > GPT 5.4 > Gemini 3.1

| Model | Run 1 | Run 2 | Average | Run-to-run difference |
|-------|-------|-------|---------|----------|
| Opus 4.6 | 4.33 | 4.30 | **4.32** | 0.03 |
| GPT 5.4 | 4.25 | 4.20 | **4.23** | 0.05 |
| Gemini 3.1 | 3.67 | 3.70 | **3.69** | 0.03 |
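
The summary figures can be reproduced with a few lines of arithmetic. In this sketch the last column of the table corresponds to the absolute difference between the two runs; a true sample variance over two such close scores would be far smaller (on the order of 0.001 or less), which the code prints for comparison. The helper name `summarize` is illustrative.

```python
from statistics import variance

runs = {
    "Opus 4.6": (4.33, 4.30),
    "GPT 5.4": (4.25, 4.20),
    "Gemini 3.1": (3.67, 3.70),
}

def summarize(scores):
    """Return (mean, absolute run-to-run difference, sample variance)."""
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    return mean, spread, variance(scores)

for model, scores in runs.items():
    mean, spread, var = summarize(scores)
    print(f"{model}: mean={mean:.3f} spread={spread:.2f} sample_variance={var:.5f}")
```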

**Consistent patterns across runs:**
- Opus wins on voice inhabitation (tonal consistency, diction) — the judge describes it as the voice living *in* the prose vs. being applied *to* it
- GPT wins on structure and pacing both times — tightest comedic architecture, best scene construction
- Gemini consistently weakest on diction/register and tonal consistency — anachronisms, modern phrasing leaking in
- All three models score high on originality (4.5–5.0) — no recycled Wodehouse detected
- Opus's weak spot is pacing/structure (3.5 in Run 1); GPT's is sentence architecture (3.0–3.5); Gemini's weaknesses are broad
- Low run-to-run spread for all models (at most 0.05 points), so rankings are stable

---

## Eval 3: Verbal Creativity Under Constraint (Pun Improvement)

**Status:** Complete (1 run, blind judged)

### Notes
- Requires retrieval of a real economics report
- Tests humor comprehension — hard to bluff
- GPT 5.4 lost network connection on first attempt; had to rerun (still very slow on attempt 2)
- Opus finished significantly faster than both other models (consistent with Eval 1 speed gap)
- Gemini provided a 404 link; actual report existed but required manual Google search to find — judge could not verify

### Run 1

| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|-----------|----------|---------|------------|
| Source validity | 5.0 | 5.0 | 1.5 |
| Pun detection | 5.0 | 3.5 | 1.5 |
| Explanation quality | 5.0 | 3.5 | 2.0 |
| Improved pun | 3.5 | 3.0 | 2.5 |
| Constraint adherence | 4.5 | 4.5 | 3.0 |
| **Overall (judge)** | **4.6** | **3.9** | **2.1** |

**Judge notes (Run 1):**
> Opus operating on a "fundamentally different level" — found a triple-layered pun ("DOGE-y austerity") in a verified J.P. Morgan deck and dissected it with real precision across three independent semantic layers. Demonstrates understanding of incongruity, register, and editorial subtext. Only weakness: rewrite tries too hard, overshoots the original's elegance.
>
> GPT did the job honestly — real report (Bank of America Institute), real pun ("eat into" + food prices), correct explanation. But the pun is a common idiom, not a notable find. Rewrite bloated a clean headline. Competent but unremarkable.
>
> Gemini fabricated the report title ("2025 Outlook: The Soft Landing's Last Mile" — actual Goldman report was "Tailwinds (Probably) Trump Tariffs"). URL was a description, not a link. Likely invented the quote it analyzed. Confused extended metaphor with pun. Never delivered a complete rewritten sentence. **This is the confident incorrectness failure mode** — real author name, plausible date, fake everything else.

### Eval 3 Summary

**Rankings:** Opus 4.6 >> GPT 5.4 >> Gemini 3.1

The gap between 1st and 2nd (0.7 points) is larger than in Eval 1 (0.09 points). The gap between 2nd and 3rd (1.8 points) is massive and driven by source fabrication. This eval cleanly separated three tiers: genuine comprehension (Opus), competent execution (GPT), simulated execution (Gemini).

**Retrieval as differentiator:** This is the first eval requiring real-world retrieval, and it produced the widest spread so far. Opus found the best source fast, GPT found a real but unremarkable source slowly, Gemini fabricated a source. The agentic retrieval dimension may continue to be the biggest differentiator in remaining evals.

### Gemini 3.1 — Second Run (rerun due to source fabrication)

**Result: 1.8/5 — worse than first attempt (2.1)**

Same structural failure, now with an additional red flag: model explicitly stated "I cannot browse the live web" and labeled its URL as "representative," then proceeded as if the fabricated source were real. Invented a different fake title ("The 2025 Outlook: Reaching Equilibrium") from the first attempt ("The Soft Landing's Last Mile"). Goldman's actual report: "Tailwinds (Probably) Trump Tariffs."

| Dimension | Run 1 | Run 2 |
|-----------|-------|-------|
| Source validity | 1.5 | 1.0 |
| Pun detection | 1.5 | 1.5 |
| Explanation quality | 2.0 | 2.5 |
| Improved pun | 2.5 | 2.0 |
| Constraint adherence | 3.0 | 2.0 |
| **Overall** | **2.1** | **1.8** |

**Judge's structural diagnosis (consistent across both runs):**
1. Can't distinguish "finding a report" from "simulating having found a report" — constructs plausible citations by recombining real elements (Hatzius, GS URL structure) into fictional composites
2. Treats all figurative economic language as inherently humorous — selects dead metaphors and argues their metaphorical nature is funny
3. Compensates with its own humor — model's asides ("Arctic tundra in the HR department," "a pair of shoes put in the freezer") are funnier than anything it selected or rewrote. **Can generate humor but can't identify it in the wild.**

*Note: Jon was eventually able to locate the actual Goldman report Gemini was attempting to reference via extensive trial-and-error Google searching, but the model's own links were non-functional.*

---

## Supplementary Eval: Model Self-Knowledge (Frontier Model Listing)

**Status:** Complete (1 run, blind judged)

**Methodology note:** This is not a formal eval from the suite. It's a practical test Jon runs as a personal sanity check: ask each model to list current models from the top 3 frontier providers (Anthropic, OpenAI, Google) and describe what they're good at. It is not methodologically rigorous and results are hit-or-miss regardless of model, but it captures a real operational pain point: models giving wrong model names and outdated info, especially in coding contexts.

### Results

| Model | Verdict | Accuracy | Notes |
|-------|---------|----------|-------|
| GPT 5.4 | **PASS (strongest)** | ~90% | Most comprehensive — covered text, coding, media, and open-weight models. Minor imprecisions (GPT-4.1 retirement grey area, "Sora 2 Pro" conflation, Lyria version). Caught media models that others missed entirely. |
| Opus 4.6 | **PASS (with errors)** | ~75-80% | Good on core models. Caught GPT-5.4 same-day launch. But factual errors on OpenAI timeline (wrong 4o retirement date, listed GPT-5.2 as default when 5.3 Instant had replaced it, cited old Codex version). Omitted entire model categories. |
| Gemini 3.1 | **PASS (with errors)** | ~65-70% | Listed core models correctly for Anthropic and Google. OpenAI section listed two deprecated models as current (GPT-4.5, o3/o4-mini), missed GPT-5.4 entirely (launched same day). Least comprehensive of the three. Editorial framing (philosophy-first, models second) came at the cost of specificity. |

**Correction note:** Gemini's output was initially pasted without its summary tables, which made it appear to list zero models. With full output, it's a legitimate answer — just the weakest of the three. Rankings unchanged.

### Takeaway
First eval where GPT leads. Self-knowledge about the competitive landscape may reflect training recency or better web retrieval for model-specific queries. All three models passed but with different failure profiles: GPT was most comprehensive with minor imprecisions, Opus caught a same-day launch but had timeline errors, and Gemini listed deprecated models as current and had the most significant omission (GPT-5.4). None achieved 100% accuracy; model self-knowledge remains unreliable across the board.

---

## Eval 5: Agentic Problem-Solving Under Adversity (Schema Migration — "Shoebox Full of Receipts")

**Status:** Complete for all three models (GPT 5.4, Gemini 3.1, Claude Opus 4.6)

### Modified Eval Design
The original spec (migrate between two clean databases) was replaced with a harder, more realistic version: a messy folder of ~465 files representing 2 years of business data from a fictional portable car wash ("Splash Bros Mobile Detailing"). CSVs, Excel, JSON, PDFs, VCF contacts, handwritten receipt images (AI-generated), a corrupted JSON backup, a multi-tab everything-spreadsheet. 11 planted obstacles documented in an OBSTACLE_KEY.md.

Each model receives an identical copy of the folder and must: inventory it, design a schema, build a clean SQLite database, create a migration report, build a review frontend, and write design documentation.

**Environments:** Claude Code (terminal), GPT 5.4 Codex (thinking mode), Gemini (native coding environment)
**Judge:** Claude Opus 4.6 (separate instance, consistent across all models)

### GPT 5.4 Results (Codex, thinking mode)

**Completion time:** 56 minutes
**Frontend quality note:** Technically hit all spec requirements (searchable, stats, flagged items, customer detail) but practically borderline useless — 394 flagged items in a flat list with no categorization/filtering/priority, source records displayed as raw JSON dumps. A data viewer, not a review tool.

| Dimension | Score (1-5) | Notes |
|-----------|-------------|-------|
| Completion | 5 | All 6 deliverables present: DB, migration script (4,050 lines Python), migration report (11,452 lines), frontend (HTML+JS+CSS), DESIGN.md, frontend data |
| File discovery | 5 | 461/465 files (99.1%). Processed images, corrupted JSON, VCF, multi-tab Excel, PDFs. Only skipped .DS_Store, credentials, blank template. |
| Obstacle detection | 3 | 6 caught, 3 partially caught, 2 missed out of 11. Ghost records (Mickey Mouse, Test Customer, Asdf Asdf) completely missed — trivial pattern match, zero detection. Department codes silently dropped. |
| Database quality | 4 | 30 normalized tables, FK check clean, price history modeled, provenance tracking with SHA256 hashes. Dinged for: 278 customers (expected ~176 after dedup), 13 distinct status values (should be 4-5). |
| Fuzzy matching accuracy | 3 | 12/13 planted typos matched correctly. 1 false match (ORD-0050 "Czarecki" → Sara Mercado instead of Jeffrey Czarnecki). 5/7 planted duplicate customers merged; missed Kowalski and Burke (different contact info). |
| Idempotency | 3 | Delete-and-rebuild approach (sound but brute-force). Couldn't empirically verify — source data absent from test directory. |
| Migration report | 4 | Exhaustively detailed (10,610 merge decisions logged) but lacks executive summary. 11,452 lines is thorough but overwhelming. |
| Frontend | 4 | Functional: searchable customer table, flagged items, customer detail with source records. But practically weak: no flag categorization/filtering, raw JSON for source records, no action workflow. |
| Documentation (DESIGN.md) | 3 | Concise (31 lines), honest about limitations. But doesn't mention ghost records, department codes, or customer count inflation. |
| Edge case discovery | 4 | Found unprompted: image duplicates beyond planted ones, vehicle photo placeholders as duplicates, price mismatches between catalog and actuals, stale contact preservation. |
| **Overall** | **3.8** | |
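
Provenance tracking with SHA256 hashes also gives exact-duplicate detection almost for free. The sketch below groups byte-identical files by digest. It is an illustration of the general technique, not GPT's actual pipeline, and it would not catch re-encoded or resized copies of the same image. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(root):
    """Return groups of files under root whose contents are byte-identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

In a migration, the first path in each group could be treated as canonical and the rest recorded with a pointer back to it, similar in spirit to the `duplicate_of_source_file_id` column mentioned in the obstacle table.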

**Obstacle detection detail:**

| Obstacle | Verdict |
|----------|---------|
| LastName, FirstName format (5 users) | CAUGHT |
| SKU conflict (SVC-007) | CAUGHT |
| Orphaned order (ORD-1003) | PARTIALLY — created new customer instead of flagging |
| Name typos (13 orders) | PARTIALLY — 12/13 correct, 1 false match |
| Department/role codes | MISSED — silently dropped |
| Corrupted JSON | CAUGHT — recovered 184 records before truncation |
| Price discrepancies | CAUGHT — proper price history with effective dates |
| Duplicate customers (7 planted) | PARTIALLY — 5/7 merged, missed Kowalski + Burke |
| Ghost/test records | MISSED — Mickey Mouse, Test Customer ($25K order), Asdf Asdf all in DB as real customers |
| Date format normalization | CAUGHT — all dates normalized to YYYY-MM-DD |
| Duplicate receipt images (4 pairs) | CAUGHT — all detected, tracked via duplicate_of_source_file_id |
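
The "Czarecki" false match suggests the matcher accepted a weak candidate instead of flagging the record. A minimal sketch of surname matching with a similarity cutoff, using the standard library's `difflib`, is shown here. The cutoff of 0.85 and the helper name `match_customer` are assumptions; returning `None` stands in for "send to the flagged list".

```python
from difflib import get_close_matches

def match_customer(raw_name, known_surnames, cutoff=0.85):
    """Return the closest known surname, or None so the record can be flagged."""
    lookup = {name.lower(): name for name in known_surnames}
    hits = get_close_matches(raw_name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

known = ["Czarnecki", "Mercado", "Kowalski", "Burke"]
print(match_customer("Czarecki", known))  # close enough to match Czarnecki
print(match_customer("Zzyzx", known))     # nothing similar, so it is flagged
```

A high cutoff trades recall for precision: a few typos go to human review, but an order is less likely to land on the wrong customer.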

**Key pattern:** Excellent infrastructure, poor judgment. The pipeline is sophisticated (30 tables, SHA256 hashes, OCR overrides, 4,050 lines of code) but it doesn't catch things a human would spot instantly. A $25,000 car wash order from "Test Customer" passed through unquestioned.
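
A minimal sketch of the kind of sanity check that would have caught these ghost records: a placeholder-name match plus an outlier test on order size. The three names come from the obstacle list; the extra pattern words, the `outlier_factor` of 20, and the sample order totals are illustrative assumptions.

```python
import re
from statistics import median

PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "asdf asdf"}
# Assumed extra keywords; a real list would be tuned against the data
PLACEHOLDER_PATTERN = re.compile(r"\b(test|asdf|qwerty|dummy)\b", re.IGNORECASE)

def ghost_reasons(name, order_total, typical_totals, outlier_factor=20):
    """Return the reasons a record looks like test data (empty list if none)."""
    reasons = []
    normalized = " ".join(name.lower().split())
    if normalized in PLACEHOLDER_NAMES or PLACEHOLDER_PATTERN.search(normalized):
        reasons.append("placeholder name")
    if typical_totals and order_total > outlier_factor * median(typical_totals):
        reasons.append("order total far above median")
    return reasons

typical = [120, 150, 200]  # hypothetical car wash order totals
print(ghost_reasons("Test Customer", 25000, typical))
print(ghost_reasons("Jeffrey Czarnecki", 150, typical))
```

Flagging rather than deleting keeps the decision with a human reviewer, which matches how the other obstacles are scored.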

### Gemini 3.1 Results (native coding environment)

**Completion time:** 21 minutes (after one crash and retry)
**Pre-crash observation:** Before failing, Gemini was writing its own XLSX parser from scratch: "Writing parser for invoices (PDF/XLSX) and receipts (Images, CSV, TXT) including OCR logic." It did not reach for `openpyxl` or any other established library. This is the inverse of Claude's failure: Claude knew the library existed and didn't install it (passivity); Gemini didn't reach for the library at all and started reinventing the wheel (false self-sufficiency). Both are wrong, but in meaningfully different ways.

| Dimension | Score (1-5) | Notes |
|-----------|-------------|-------|
| Completion | 4 | All deliverables present: DB (8 tables), 9 Python files, MIGRATION_REPORT.md, DESIGN.md, frontend (dark-themed dashboard) |
| File discovery | 2 | 222/463 files (48%). Critical misses: entire JSON backup (detected but skipped), mega spreadsheet, 2025 invoices, entire UNSORTED folder (~100 files), most images |
| Obstacle detection | 1 | 0/15 fully caught, 8/15 partially caught, 7/15 missed. Many "partial" catches were incidental: obstacles were avoided because source files were skipped, not because they were detected |
| Database quality | 2 | Good schema design (customer_merges, price override). But only 162 of ~1000 expected jobs, all statuses = "completed", payment methods oversimplified, 1 unmerged duplicate (Elizabeth Chen), 1 ghost record (Mickey Mouse), FKs not enforced |
| Fuzzy matching | 3 | Merges that happened were reasonable (Bob→Robert, Jess→Jessica). But multiple entries show `original_name = "Residential"` (grabbed wrong Excel column). Elizabeth Chen split into 2 records. |
| Idempotency | 2 | FAILED — `flagged_records` table gets duplicates on re-run. data.json and DB counts diverge (933 vs 1203 payments, 5 vs 9 flagged items) |
| Migration report | 2 | Exists but too sparse for human audit. No per-file breakdown, no specific merge decisions, no honest accounting of skipped data |
| Frontend | 4 | **Best frontend of the three.** Dark-themed dashboard, working search, customer detail modals with merge lineage, confidence badges. Data discrepancy with DB is a concern. |
| Documentation | 3 | Schema decisions explained well. Doesn't acknowledge 52% file skip rate, ~16% job import rate, or status/payment normalization failures |
| Edge case discovery | 1 | No novel edge cases. Did not flag `passwords.txt` (security risk), empty employees table opportunity, or photo metadata cross-references |
| **Overall** | **2.4** | |

**Key pattern:** Built solid-looking infrastructure but failed on execution. Processed less than half the source files, imported only ~16% of expected job records, caught zero obstacles fully. The "give up on first error" approach to the JSON backup and UNSORTED folder was the single biggest failure — chose data loss over partial recovery. Result: clean-looking, severely incomplete database. The best frontend of the three models, which is very on-brand for Gemini.
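
Partial recovery from a truncated JSON backup does not require much code. The sketch below assumes the backup is a top-level array of objects that was cut off mid-record (the actual file layout is not documented here) and uses `json.JSONDecoder.raw_decode` to pull out every complete record before the break. `recover_records` is a hypothetical helper.

```python
import json

def recover_records(text):
    """Return every complete object from a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    records = []
    i = text.find("[") + 1  # start just after the opening bracket
    while True:
        # raw_decode does not skip separators, so step over them manually
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            break
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            break  # reached the truncated or corrupted tail
        records.append(obj)
        i = end
    return records

truncated = '[{"id": 1}, {"id": 2}, {"id": 3, "na'
print(recover_records(truncated))
```

The recovered count and the byte offset where parsing stopped are worth logging in the migration report so the data loss is explicit.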

**Ghost record status:** Mickey Mouse survived into production. Test Customer and Asdf Asdf were only excluded because the JSON backup was skipped entirely — lucky avoidance, not detection. Same shared blind spot as GPT and Claude.

**Obstacle detection detail:**

| Obstacle | Verdict |
|----------|---------|
| LastName, FirstName format (5 customers) | PARTIALLY — 3/5 normalized; Chen and Duffy still inverted |
| SKU conflict (SVC-007) | MISSED — source file never processed |
| Orphaned order (ORD-1003) | MISSED — source files skipped |
| Name typos (13 orders) | MISSED — source files skipped |
| Department/role codes | MISSED — JSON skipped entirely |
| Corrupted JSON | PARTIALLY — detected but skipped instead of recovering valid portion |
| Price discrepancies | PARTIALLY — schema accommodates it but only 2025 prices loaded |
| Duplicate customers (7) | PARTIALLY — 5/7 merged; Elizabeth Chen split; old-vs-new conflict never tested |
| Ghost/test records | MISSED — Mickey Mouse in DB; others excluded by luck only |
| Date format normalization | PARTIALLY — consistent output, hard cases never encountered |
| Duplicate receipt images | MISSED — most images not processed; no dedup logic |
| Service name chaos | PARTIALLY — canonical list built; limited by files processed |
| Status value chaos | MISSED — all statuses collapsed to "completed" |
| Payment method inconsistency | PARTIALLY — over-simplified; Venmo/Zelle/Square lost |
| Customer name variations | PARTIALLY — 2/10 confirmed (Bob→Robert, Jess→Jessica) |

---

### Claude Opus 4.6 Results (Claude Code, terminal)

**Completion time:** 15 minutes
**Frontend quality note:** Functional tabbed interface (Customers, Flagged Items, Recent Jobs, Revenue). Uglier than GPT's but more usable — tabbed navigation vs infinite scroll. Requires local server to run. Searchable, clickable customer detail with source records and confidence bars.

| Dimension | Score (1-5) | Notes |
|-----------|-------------|-------|
| Completion | 5 | All deliverables: DB (13 tables), migration script (1,800 lines Python), migration report, frontend, DESIGN.md (156 lines with ER diagram), export script |
| File discovery | 3 | All files discovered/cataloged but 7 XLSX files not parsed (no openpyxl), 11+ images not processed, 162 PDFs skipped. ~75% of meaningful data extracted. Critical miss: `updated clients.xlsx` and `spreadsheet_everything.xlsx`. |
| Obstacle detection | 3 | 9 caught, 5 partially caught, 2 missed out of 16 categories. Strong on text-based issues. Weak on image-based and Excel-dependent obstacles. |
| Database quality | 4 | 13-table schema, well-normalized, price history with eras, customer audit trail. Dinged for: FK enforcement off (PRAGMA), 2 ghost records, 2-3 fragment duplicates. |
| Fuzzy matching accuracy | 4 | 18 merges documented, 15+ correct. All 7 planted duplicate customers identified. No false-positive merges. Missed Czarecki→Czarnecki and Jay Kocher variant. |
| Idempotency | 5 | **Verified clean.** Ran twice, identical counts (194 customers, 1,462 jobs, 279 payments). `(source_file, source_record_id)` constraint working. |
| Migration report | 4 | Per-file inventory, 18 dedup decisions with confidence scores, 8 conflicts documented, 19 flagged items. Missing: orphan detection, ghost record flagging, expected-vs-actual totals. |
| Frontend | 4 | Tabbed interface (Customers/Flagged/Jobs/Revenue), searchable, clickable customer detail with source records and confidence bars. Needs local server. No resolve/export features. |
| Documentation (DESIGN.md) | 4 | 156 lines with ER diagram, table descriptions, 8 data quality challenges with solutions, honest "what couldn't be resolved" section (5 items), 10 edge cases discovered. |
| Edge case discovery | 3 | Found 10 edge cases including trailing spaces, Tomas timeline, transaction-only customers. Missed department codes, image duplicates, 2/3 ghost records. |
| **Overall** | **3.5** | |

**Obstacle detection detail:**

| Obstacle | Verdict |
|----------|---------|
| LastName, FirstName format (5 users) | CAUGHT (2 fragment records leaked) |
| SKU conflict (SVC-007) | CAUGHT — explicitly documented |
| Orphaned order (ORD-1003) | PARTIALLY — created customer from transaction, not flagged |
| Name typos (13 orders) | PARTIALLY — 11/13 caught, Czarecki created duplicate |
| Department/role codes | MISSED — no department field in schema |
| Corrupted JSON | CAUGHT — regex extraction, cross-validated |
| Price discrepancies | CAUGHT — service_prices with 2024/2025 eras |
| Duplicate customers (7 planted) | CAUGHT — all 7 found and flagged |
| Ghost/test records (3) | PARTIALLY — 1/3 excluded (Test Customer); Mickey Mouse + Asdf Asdf in DB |
| Date format normalization | CAUGHT — zero null dates, multi-format parser |
| Duplicate receipt images (4 pairs) | MISSED — no image processing capability |
| Service name chaos (60+ variants) | CAUGHT — mapped to 18 canonical services |
| Status value chaos (12+ variants) | CAUGHT — normalized to 6 clean values |
| Payment method inconsistency | CAUGHT — normalized to 6 clean values |
| Name variations (10 customers) | PARTIALLY — 9/10 variant sets merged |
| Missing data patterns | PARTIALLY — documented some, missed Excel-only patterns |
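
A multi-format date parser of the kind credited above can be sketched as a list of candidate formats tried in order. The specific formats are assumptions, since the tracker does not list which ones appear in the source files. Ambiguous inputs such as `03/06/2026` are read month-first here, which would need confirming against the data.

```python
from datetime import datetime

# Assumed formats, tried in order; the first one that parses wins
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%d %b %Y", "%B %d, %Y")

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None so the record can be flagged."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

for sample in ["2026-03-06", "03/06/2026", "3/6/26", "March 6, 2026", "tomorrow-ish"]:
    print(sample, "->", normalize_date(sample))
```

Returning `None` instead of guessing keeps unparseable dates visible in the flagged list rather than silently storing a wrong value.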

**Key pattern:** Tight engineering, limited reach. The architecture is cleaner than GPT's (13 focused tables vs 30, 6 status values vs 13, verified idempotency, 19 actionable flags vs 394 noise). But the inability to parse Excel files cascaded into missed data and obstacles. If `openpyxl` had been installed, file discovery would jump to ~95% and several PARTIALLY verdicts would become CAUGHT.
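
The idempotency result rests on a uniqueness constraint over `(source_file, source_record_id)`. A minimal sketch of that pattern in SQLite follows: re-running the load inserts nothing new because conflicting rows are ignored. The `jobs` table, its `amount` column, and the sample rows are hypothetical; only the constraint columns come from the results above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    source_file TEXT NOT NULL,
    source_record_id TEXT NOT NULL,
    amount REAL,
    UNIQUE (source_file, source_record_id)
)
"""

def migrate(conn, records):
    """Load records and return the row count; safe to call repeatedly."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (source_file, source_record_id, amount) "
        "VALUES (?, ?, ?)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

conn = sqlite3.connect(":memory:")
records = [("jobs_2024.csv", "1", 120.0), ("jobs_2024.csv", "2", 85.0)]
first = migrate(conn, records)
second = migrate(conn, records)  # second run must not change the count
print(first, second)
```

Comparing row counts between two consecutive runs, as the judge did, is the simplest empirical check that the constraint is doing its job.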

**Behavioral note:** This is not a dependency issue — it's a judgment failure. Claude Code had full terminal access. `pip install openpyxl` is a 3-second fix that any competent engineer would execute the moment they hit an import error. Instead, it silently skipped the XLSX files and moved on. GPT having openpyxl pre-installed doesn't reflect better reasoning on GPT's part — it just means GPT never had to make the call. Claude did, and didn't make it. That's a ding that belongs on the model, not the environment.
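
The fix being described amounts to a few lines: try the import, and install the package if it is missing. The sketch below is one way to express that in a migration script, assuming pip is available and network access is permitted. `ensure_module` is a hypothetical helper, and automatic installs may not be appropriate in locked-down environments.

```python
import importlib
import subprocess
import sys

def ensure_module(module_name, package_name=None, allow_install=True):
    """Import a module, installing its package with pip once if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if not allow_install:
            raise
    # Install with the same interpreter that is running this script
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", package_name or module_name]
    )
    importlib.invalidate_caches()
    return importlib.import_module(module_name)

# Usage (would run pip only if openpyxl is absent):
# openpyxl = ensure_module("openpyxl")
```

With `allow_install=False` the function re-raises the `ImportError`, so a skipped file type fails loudly instead of being dropped silently.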

### Head-to-Head: GPT 5.4 vs Claude Opus 4.6

| Dimension | GPT 5.4 | Claude Opus | Edge |
|-----------|---------|-------------|------|
| Completion time | 56 min | 15 min | **Claude (3.7x faster)** |
| File discovery | 99.1% | ~75% | **GPT** |
| Customer count (expected ~176) | 278 (inflated) | 194 (close) | **Claude** |
| Status normalization | 13 values | 6 values | **Claude** |
| Flagged items | 394 (noise) | 19 (actionable) | **Claude** |
| Idempotency | Untested | Verified clean | **Claude** |
| Duplicate customer merges | 5/7 | 7/7 found | **Claude** |
| Image processing | OCR overrides (all 10) | None (all flagged) | **GPT** |
| Duplicate image detection | All 4 pairs caught | None | **GPT** |
| Ghost record detection | 0/3 | 1/3 | **Claude (barely)** |
| Schema tables | 30 | 13 | Tradeoff (GPT broader, Claude cleaner) |
| Code volume | 4,050 lines | 1,800 lines | Tradeoff |
| DESIGN.md | 31 lines | 156 lines | **Claude** |
| Migration report | 11,452 lines | Concise | Tradeoff (GPT exhaustive, Claude readable) |
| Overall score | **3.8** | **3.5** | **GPT (by 0.3)** |

**The 0.3 gap is almost entirely explained by the missing openpyxl install.** Claude's architecture is arguably better-engineered (cleaner schema, verified idempotency, better status normalization, more accurate customer count, actionable vs noisy flags). GPT wins on data coverage because the library was already installed in its environment, so it parsed the Excel files that Claude skipped. Both share the same blind spots: ghost records (Mickey Mouse), department codes, orphaned order handling.
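
As a cross-check on the overall scores, the sketch below recomputes an unweighted mean from the ten dimension scores in each results table. It reproduces the recorded Overall for GPT (3.8) and Gemini (2.4) but gives 3.9 for Claude against the recorded 3.5. The tracker does not state how Overall was derived, so the recorded figure may be a holistic judge score rather than a mean; treat the comparison as a prompt to confirm with the judge output, not as a correction.

```python
from statistics import mean

# Dimension scores in table order: completion, file discovery, obstacle detection,
# database quality, fuzzy matching, idempotency, migration report, frontend,
# documentation, edge case discovery
dimension_scores = {
    "GPT 5.4": [5, 5, 3, 4, 3, 3, 4, 4, 3, 4],
    "Gemini 3.1": [4, 2, 1, 2, 3, 2, 2, 4, 3, 1],
    "Claude Opus 4.6": [5, 3, 3, 4, 4, 5, 4, 4, 4, 3],
}
recorded_overall = {"GPT 5.4": 3.8, "Gemini 3.1": 2.4, "Claude Opus 4.6": 3.5}

unweighted = {model: round(mean(scores), 1) for model, scores in dimension_scores.items()}

for model, value in unweighted.items():
    note = "" if value == recorded_overall[model] else "  <- differs from recorded"
    print(f"{model}: unweighted={value} recorded={recorded_overall[model]}{note}")
```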

---

## Eval 6: Epistemic Calibration & Graceful Uncertainty

**Status:** Run 1 complete (dual-judged); Run 2 complete (dual-judged)

### Notes
- Tests confident incorrectness — the most dangerous failure mode
- Run 3x to measure variance (high variance = instability)
- Questions span: easily verifiable, recent events, obscure facts, technical precision, philosophical
- **Two independent judges used for each run:** Opus 4.6 (incognito) and Perplexity Deep Research
- Judges agreed on 3rd place but disagreed on 1st vs 2nd — split reflects a real weighting question

### Run 1 — Judge A: Perplexity Deep Research

| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|-----------|----------|---------|------------|
| Factual accuracy | 4.0 | 4.5 | 4.5 |
| Calibration quality | 4.5 | 4.0 | 3.0 |
| Calibration consistency | 5.0 | 3.5 | 2.5 |
| Refusal quality | 4.5 | 4.5 | 4.5 |
| Self-reflection | 5.0 | 3.5 | 3.0 |
| Citation behavior | 3.5 | 4.0 | 3.0 |
| **Overall** | **4.25** | **4.00** | **3.40** |

**Ranking:** Opus > GPT > Gemini
**Key finding:** "The decisive differentiator is not raw accuracy — all three models get most facts right. What separates them is whether each model knows what it doesn't know."

### Run 1 — Judge B: Opus 4.6 (incognito)

| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|-----------|----------|---------|------------|
| Factual accuracy | 3.5 | 5.0 | 4.0 |
| Calibration quality | 3.5 | 5.0 | 3.0 |
| Calibration consistency | 5.0 | 4.0 | 2.5 |
| Refusal quality | 4.0 | 5.0 | 4.5 |
| Self-reflection | 5.0 | 4.0 | 3.0 |
| Citation behavior | 3.5 | 4.5 | 3.0 |
| **Overall (out of 30)** | **24.5** | **27.5** | **20.0** |
| **Overall (normalized /5)** | **4.08** | **4.58** | **3.33** |

**Ranking:** GPT > Opus > Gemini
**Key finding:** "<2> dominated on accuracy — it nailed the exact PDG Higgs mass, retrieved the correct AAPL closing price, got the current matrix multiplication exponent. Every VERIFIED tag was earned."
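
Judge B reports a raw total out of 30 alongside the normalized score. The conversion is a division by the number of dimensions, as this short sketch shows using the scores from the table above. The helper name `overall` is illustrative.

```python
JUDGE_B_RUN1 = {
    "Opus 4.6": [3.5, 3.5, 5.0, 4.0, 5.0, 3.5],
    "GPT 5.4": [5.0, 5.0, 4.0, 5.0, 4.0, 4.5],
    "Gemini 3.1": [4.0, 3.0, 2.5, 4.5, 3.0, 3.0],
}

def overall(scores):
    """Return (raw total, per-dimension average rounded to 2 places)."""
    raw_total = sum(scores)
    return raw_total, round(raw_total / len(scores), 2)

for model, scores in JUDGE_B_RUN1.items():
    raw_total, normalized = overall(scores)
    print(f"{model}: {raw_total}/30 -> {normalized}/5")
```

Putting both judges on the same /5 scale is what makes the cross-judge comparison in the agreement matrix meaningful.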

### Judge Agreement Matrix

| Dimension | Judges agree on winner? | Notes |
|-----------|------------------------|-------|
| Factual accuracy | **Yes — GPT** | Both judges score GPT highest on raw correctness (tied with Gemini at 4.5 under Perplexity) |
| Calibration quality | **Split** | Perplexity gives Opus edge; Opus judge gives GPT edge |
| Calibration consistency | **Yes — Opus** | Both judges score Opus highest (full tag range used) |
| Refusal quality | **Roughly tied** | Each model's score differs by at most 0.5 between judges; all scores fall in the 4.0–5.0 range |
| Self-reflection | **Yes — Opus** | Both judges call Opus's metacognition the strongest |
| Citation behavior | **Yes — GPT** | Both judges score GPT's citations higher |
| **Overall winner** | **Split** | Depends on whether you weight calibration or accuracy more |
| **3rd place** | **Yes — Gemini** | Both judges agree, similar scores |
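
The per-dimension agreement can be derived mechanically from the two Run 1 score tables. This sketch finds each judge's top-scoring model or models per dimension and reports the overlap. It treats exact ties as shared leadership, which is an assumption; a tolerance (for example, treating scores within 0.5 as tied) would reproduce the "roughly tied" call on refusal quality.

```python
RUN1_SCORES = {
    "Perplexity": {
        "Factual accuracy": {"Opus 4.6": 4.0, "GPT 5.4": 4.5, "Gemini 3.1": 4.5},
        "Calibration quality": {"Opus 4.6": 4.5, "GPT 5.4": 4.0, "Gemini 3.1": 3.0},
        "Calibration consistency": {"Opus 4.6": 5.0, "GPT 5.4": 3.5, "Gemini 3.1": 2.5},
        "Refusal quality": {"Opus 4.6": 4.5, "GPT 5.4": 4.5, "Gemini 3.1": 4.5},
        "Self-reflection": {"Opus 4.6": 5.0, "GPT 5.4": 3.5, "Gemini 3.1": 3.0},
        "Citation behavior": {"Opus 4.6": 3.5, "GPT 5.4": 4.0, "Gemini 3.1": 3.0},
    },
    "Opus judge": {
        "Factual accuracy": {"Opus 4.6": 3.5, "GPT 5.4": 5.0, "Gemini 3.1": 4.0},
        "Calibration quality": {"Opus 4.6": 3.5, "GPT 5.4": 5.0, "Gemini 3.1": 3.0},
        "Calibration consistency": {"Opus 4.6": 5.0, "GPT 5.4": 4.0, "Gemini 3.1": 2.5},
        "Refusal quality": {"Opus 4.6": 4.0, "GPT 5.4": 5.0, "Gemini 3.1": 4.5},
        "Self-reflection": {"Opus 4.6": 5.0, "GPT 5.4": 4.0, "Gemini 3.1": 3.0},
        "Citation behavior": {"Opus 4.6": 3.5, "GPT 5.4": 4.5, "Gemini 3.1": 3.0},
    },
}

def leaders(scores):
    """Return the set of models sharing the top score."""
    best = max(scores.values())
    return {model for model, value in scores.items() if value == best}

def shared_leaders(dimension):
    """Return models that every judge ranks first (or tied first) on a dimension."""
    per_judge = [leaders(judge[dimension]) for judge in RUN1_SCORES.values()]
    return set.intersection(*per_judge)

for dimension in RUN1_SCORES["Perplexity"]:
    shared = shared_leaders(dimension)
    print(dimension, "->", sorted(shared) if shared else "split")
```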

### Run 1 — Core Findings

Both judges identified the same tradeoff but weighted it differently:

- **Opus** used the full confidence tag range (VERIFIED, HIGH, MEDIUM, UNABLE) and had the strongest self-reflection — correctly predicted its own Q6 answer was stale. But it got Q6 wrong and couldn't retrieve the AAPL price.
- **GPT (thinking mode)** got more facts right (exact Higgs mass, correct AAPL price, current matrix exponent) and every VERIFIED tag was earned. But it clustered 7-8 answers at VERIFIED, collapsing meaningful distinctions.
- **Gemini** tagged 8-9 of 10 as VERIFIED including a wrong Higgs value (125.25 vs actual 125.20) and a misleading Databricks revenue figure. Reflection didn't catch its own errors. Both judges: "performed certainty without earning it."

---

### Run 2 — Judge A: Opus 4.6 (incognito)

**NOTE: GPT 5.4 was run in "auto" mode (not "thinking") to test whether the thinking toggle matters.**

| Dimension | Opus 4.6 | GPT 5.4 (auto) | Gemini 3.1 |
|-----------|----------|----------------|------------|
| Factual accuracy | 3.5 | 2.5 | 4.5 |
| Calibration quality | 4.5 | 2.5 | 3.0 |
| Calibration consistency | 5.0 | 3.0 | 1.5 |
| Refusal quality | 5.0 | 3.5 | 3.5 |
| Self-reflection | 5.0 | 3.0 | 4.0 |
| Citation behavior | 3.5 | 2.0 | 3.5 |
| **Overall** | **4.25** | **2.75** | **3.50** |

**Ranking:** Opus > Gemini > GPT

### Run 2 — Judge B: Perplexity Deep Research

| Dimension | Opus 4.6 | GPT 5.4 (auto) | Gemini 3.1 |
|-----------|----------|----------------|------------|
| Factual accuracy | 3.5 | 2.5 | 4.5 |
| Calibration quality | 4.5 | 2.0 | 3.5 |
| Calibration consistency | 5.0 | 2.5 | 2.0 |
| Refusal quality | 4.5 | 3.0 | 4.0 |
| Self-reflection | 5.0 | 3.0 | 4.0 |
| Citation behavior | 3.0 | 1.5 | 3.5 |
| **Overall** | **4.25** | **2.42** | **3.58** |

**Ranking:** Gemini > Opus > GPT *(Note: Perplexity ranked Gemini 1st on the strength of its accuracy, even though Opus had the higher dimension average (4.25 vs 3.58). This was an explicit judgment call that "getting the right answers matters most.")*

### Run 2 — Judge Agreement

| Dimension | Judges agree? | Notes |
|-----------|--------------|-------|
| Factual accuracy | **Yes — Gemini** | Both score 4.5; only response to get all 10 right |
| Calibration quality | **Yes — Opus** | Both score Opus 4.5 |
| Calibration consistency | **Yes — Opus** | Both score Opus 5.0 |
| Refusal quality | **Yes — Opus** | Opus leads in both |
| Self-reflection | **Yes — Opus** | Both score 5.0 |
| **3rd place** | **Yes — GPT** | Both judges agree; scores 2.42–2.75 |
| **1st place** | **Split** | Same accuracy-vs-calibration split as Run 1 |

---

### CRITICAL FINDING: GPT 5.4 Thinking Mode Toggle

| Dimension | Run 1 (thinking) | Run 2 (auto) | Delta |
|-----------|-----------------|--------------|-------|
| Factual accuracy | 4.5–5.0 | 2.5 | **-2.0 to -2.5** |
| Calibration quality | 4.0–5.0 | 2.0–2.5 | **-2.0 to -2.5** |
| Calibration consistency | 3.5–4.0 | 2.5–3.0 | **-1.0** |
| Overall ranking | 1st or 2nd | **Last** | **Collapsed** |
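
The delta ranges in the table come from comparing each judge's Run 1 (thinking) score with the same judge's Run 2 (auto) score. A short sketch of that calculation, using the GPT 5.4 scores from the four judge tables, is below. Because the two modes were tested in different runs, some run-to-run noise is folded into these deltas.

```python
GPT_SCORES = {
    "Perplexity": {
        "thinking": {"Factual accuracy": 4.5, "Calibration quality": 4.0,
                     "Calibration consistency": 3.5},
        "auto": {"Factual accuracy": 2.5, "Calibration quality": 2.0,
                 "Calibration consistency": 2.5},
    },
    "Opus 4.6 (incognito)": {
        "thinking": {"Factual accuracy": 5.0, "Calibration quality": 5.0,
                     "Calibration consistency": 4.0},
        "auto": {"Factual accuracy": 2.5, "Calibration quality": 2.5,
                 "Calibration consistency": 3.0},
    },
}

def mode_deltas(by_mode):
    """Return auto-minus-thinking score per dimension for one judge."""
    return {dim: by_mode["auto"][dim] - by_mode["thinking"][dim]
            for dim in by_mode["thinking"]}

for judge, by_mode in GPT_SCORES.items():
    print(judge, mode_deltas(by_mode))
```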

**What broke in auto mode:**
- Named 2024 Nobel winners (Acemoglu, Johnson, Robinson) for 2025 question — tagged MEDIUM
- Cited matrix multiplication bound from 2020 (2.3728596) — two iterations behind current
- Estimated Databricks at $1.6-2B, roughly 2.4–3x below the actual $4.8B ARR
- No LOW tags used anywhere — couldn't signal strong uncertainty
- Self-reflection noted Nobel "could be misremembered" but didn't downgrade the tag

**Conclusion:** The thinking toggle is load-bearing for GPT 5.4 on epistemic tasks. Auto mode doesn't just lose depth — it loses factual accuracy on questions that require retrieval or reasoning over knowledge boundaries.

---

### Eval 6 Cross-Run Summary

**Model consistency across runs:**

| Model | Run 1 Range | Run 2 Range | Stable? |
|-------|-------------|-------------|---------|
| Opus 4.6 | 4.08–4.25 | 4.25–4.25 | **Most consistent** |
| GPT 5.4 | 4.00–4.58 (thinking) | 2.42–2.75 (auto) | **Mode-dependent — collapsed in auto** |
| Gemini 3.1 | 3.33–3.40 | 3.50–3.58 | **Stable, slight improvement** |
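
One way to put a number on "stable" is to take the midpoint of each judge range and compare it across runs. A minimal sketch, assuming each range is the low and high overall score from the two judges; `midpoint` and the half-up rounding choice are illustrative, not something the tracker specifies:

```python
from decimal import Decimal, ROUND_HALF_UP

# (low, high) overall scores per run, copied from the table above.
ranges = {
    "Opus 4.6":   {"run1": ("4.08", "4.25"), "run2": ("4.25", "4.25")},
    "GPT 5.4":    {"run1": ("4.00", "4.58"), "run2": ("2.42", "2.75")},
    "Gemini 3.1": {"run1": ("3.33", "3.40"), "run2": ("3.50", "3.58")},
}

def midpoint(lo, hi):
    # Decimal with explicit half-up rounding avoids binary-float
    # surprises when the midpoint lands exactly on a .5 boundary.
    mid = (Decimal(lo) + Decimal(hi)) / 2
    return mid.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

summary = {}
for model, runs in ranges.items():
    m1 = midpoint(*runs["run1"])
    m2 = midpoint(*runs["run2"])
    summary[model] = (m1, m2, m2 - m1)  # (Run 1 mid, Run 2 mid, shift)

for model, (m1, m2, shift) in summary.items():
    print(f"{model}: {m1} -> {m2} ({shift:+})")
```

Under this assumption the midpoints match the "(avg)" figures for Eval 6 in the Overall Leaderboard, and the shift makes the contrast explicit: under 0.2 points for Opus and Gemini, about 1.7 points down for GPT.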

**Persistent patterns across both runs:**
- Opus always has the best calibration consistency (5.0 both runs) and self-reflection (5.0 both runs)
- Gemini always has the strongest raw factual accuracy when it has retrieval access, but always flattens confidence to near-binary VERIFIED/UNABLE
- GPT's performance is highly mode-dependent — thinking mode competes with Opus; auto mode falls to last
- Both judges consistently split on whether accuracy or calibration should determine 1st — this is a genuine philosophical disagreement, not noise
- Both judges agree in both runs: the model that "knows what it doesn't know" best is Opus; the model that "knows the most" varies by run

**The philosophical question this eval surfaces:** Is it better to *know what you don't know* (Opus) or to *actually know it and prove it* (GPT in thinking mode / Gemini)? Both judges articulated this as the central tension. A frontier model ideally combines Gemini's factual reach with Opus's epistemic humility.

### The Mode Dependency Finding (Nate-safe version)

*The practical takeaway from Eval 6 Run 1 vs Run 2, and speed patterns across all evals.*

GPT 5.4's performance is highly mode-dependent — thinking mode and auto mode produce results so different they almost feel like separate products. In Eval 6, switching from thinking to auto caused GPT to name the 2024 Nobel winners for a 2025 question and cite a matrix multiplication bound from 2020. It dropped from 1st/2nd place to last. Same model, same questions, different mode.

That's worth understanding before you build workflows around it. **What you're paying for may matter less than how you're using it.** If your team is going to use GPT 5.4, they need to know that auto mode is a materially different — and weaker — experience than thinking mode. That's not a knock on the product. It's just how it works, and most users won't know to make the distinction.

The speed gap compounds this: thinking mode GPT took 56 minutes on Eval 5. Claude finished in 15. Gemini in 21. If thinking mode is required to get GPT's best performance, the latency cost is real and users need to factor it in.
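
To factor the latency in, it can help to express each run as a multiple of the fastest one. A small sketch using the Eval 5 times and scores recorded in this tracker; note that each model ran in a different harness, so these are end-to-end times rather than pure model latency:

```python
# Eval 5 wall-clock minutes and scores, as recorded in the run log.
eval5 = {
    "GPT 5.4 (thinking)": {"minutes": 56, "score": 3.8},
    "Opus 4.6":           {"minutes": 15, "score": 3.5},
    "Gemini 3.1":         {"minutes": 21, "score": 2.4},
}

fastest = min(run["minutes"] for run in eval5.values())

# Slowdown relative to the fastest run (1.0 = fastest).
slowdown = {
    model: round(run["minutes"] / fastest, 1) for model, run in eval5.items()
}

for model, run in eval5.items():
    print(f"{model}: score {run['score']}, {slowdown[model]}x the fastest time")
```

On these numbers, GPT's 0.3-point lead over Opus on Eval 5 came at roughly 3.7x the wall-clock time.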

---

### Jon's Conspiracy Theory

*Clearly labeled speculation. Not for attribution. May or may not be true. Almost certainly interesting.*

The mode dependency finding raises a question I can't stop thinking about: what if thinking mode isn't GPT "thinking harder" — what if it's a scaffold wrapped around a weaker base model?

If thinking mode is essentially a retrieval + reasoning pipeline layered on top of the base model, that would explain everything: why it's slow (pipeline stages, not deeper cognition), why auto mode collapses (no scaffold = just the base model working from stale training data), and why OpenAI keeps shipping point releases (5.1 → 5.2 → 5.3 → 5.4) that feel incremental — they might be adding scaffold layers, not retraining the foundation.

The thing that makes this impossible to confirm or deny: **OpenAI is the only major lab that hides actual thinking traces.** Claude shows extended thinking. Gemini shows thinking. DeepSeek shows thinking. OpenAI shows a summary produced by a separate model. They say it's for security. But the practical effect is you cannot distinguish between "the model reasoning through a problem" and "a pipeline orchestrating retrieval calls and tool use behind an opaque wall." The latency would look identical from the outside. You'd just call it "thinking."

The Claude control case: when Opus runs through a skill, it takes 5-10x longer and you can see exactly why — reads a file, makes tool calls, iterates. Full transparency. If OpenAI is doing the same thing behind a curtain, you'd attribute the latency to intelligence rather than infrastructure.

**What this would explain:**
- Why auto mode GPT worked from stale training data while thinking mode had current info (thinking mode has retrieval; auto mode doesn't)
- Why the rumored "botched training run" before GPT-5 keeps circulating — if the base is weaker than expected, layering scaffolding on top becomes the product strategy
- Why power users tend to drift back to Claude after the initial GPT hype — scaffolding produces impressive first impressions but doesn't compound with expertise the way a strong base model does

This is speculative. The eval data is consistent with it. The opacity of the thinking traces means it's unfalsifiable from outside — which is itself worth noting.

**Tests that would strengthen or weaken the theory:**
- Run Eval 5 with GPT in both modes — does agentic coding show the same mode dependency?
- Run identical prompts on GPT 5.1 vs 5.4 in auto mode — if outputs are indistinguishable, "same base model" theory gets stronger
- Check whether GPT thinking mode retrieval happens *during* the thinking phase or before it — that would distinguish "model reasoning" from "pipeline orchestration"
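
For the second test, "indistinguishable" ultimately needs blind judging, but a crude lexical comparison could serve as a first pass before spending judge time. A minimal sketch using the standard library's `difflib`; the helper names and prompt ids are hypothetical, and the sample strings are placeholders rather than real model output:

```python
from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two outputs (1.0 = identical)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def compare_runs(outputs_a: dict, outputs_b: dict) -> dict:
    """Compare outputs for prompts present in both runs, keyed by prompt id."""
    shared = outputs_a.keys() & outputs_b.keys()
    return {
        pid: round(lexical_similarity(outputs_a[pid], outputs_b[pid]), 2)
        for pid in sorted(shared)
    }

# Placeholder outputs; real use would load saved responses from each model.
older = {"q1": "a b c d", "q2": "same answer text"}
newer = {"q1": "a b x d", "q2": "same answer text"}

result = compare_runs(older, newer)
print(result)
```

High lexical overlap would be suggestive at best, since two different models can converge on similar wording and the same model can vary between samples, so this only narrows down which prompts are worth sending to a blind judge.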

---

## Overall Leaderboard

| Eval | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|------|----------|---------|------------|
| 1. Wodehouse (avg) | **4.32** | 4.23 | 3.69 |
| 3. Pun Improvement | **4.60** | 3.90 | 2.10 |
| S. Model Self-Knowledge (supplementary) | ~78% | **~90%** | ~68% |
| 5. Schema Migration | 3.5 | **3.8** (thinking) | 2.4 |
| 6. Calibration R1 (avg) | **4.17** | **4.29** (thinking) | 3.37 |
| 6. Calibration R2 (avg) | **4.25** | 2.59 (auto) | 3.54 |

---

## Run Log

| Date | Eval | Action | Notes |
|------|------|--------|-------|
| 2026-03-05 | Eval 1 | Run 1 complete, blind judged | |
| 2026-03-05 | Eval 1 | Run 2 complete, blind judged | |
| 2026-03-05 | Eval 3 | Run 1 complete, blind judged | GPT network failure on attempt 1, rerun required; Gemini fabricated source |
| 2026-03-05 | Eval 3 | Gemini rerun, blind judged | Scored 1.8 — worse than first attempt (2.1); same fabrication pattern |
| 2026-03-05 | Supplementary | Model self-knowledge, blind judged | GPT 1st, Opus 2nd, Gemini 3rd (corrected from initial incomplete paste) |
| 2026-03-05 | Eval 6 | Run 1 complete, dual-judged (Perplexity DR + Opus incognito) | Judges split on 1st: Perplexity→Opus, Opus→GPT. Both agree Gemini 3rd. |
| 2026-03-05 | Eval 6 | Run 2 complete, dual-judged | GPT in auto mode collapsed to last. Judges split Opus/Gemini for 1st. Critical finding: thinking toggle is load-bearing. |
| 2026-03-05 | Eval 5 | GPT 5.4 complete (56 min, Codex thinking mode) | Score: 3.8/5. Excellent infrastructure, poor edge case judgment. Mickey Mouse in DB. |
| 2026-03-06 | Eval 5 | Gemini 3.1 complete (21 min, native coding env) | Score: 2.4/5. Crashed once, retried. Best frontend of the three. 48% file discovery, 0/15 obstacles fully caught, all statuses collapsed to "completed." |
| 2026-03-06 | Eval 5 | Claude Opus 4.6 complete (15 min, Claude Code) | Score: 3.5/5. Cleanest architecture, verified idempotency. openpyxl not installed — judgment failure cost ~20% file coverage. |

---

## Methodology Notes
- Each eval is run on all 3 models independently
- Outputs labeled `<1>` `<2>` `<3>` and pasted into Opus 4.6 (incognito) for blind judging
- Key: 1=Opus, 2=GPT, 3=Gemini (held by Jon, not revealed to judge)
- Eval 6 used dual judges (Perplexity Deep Research + Opus incognito) for both runs
- Eval 6 Run 2: GPT 5.4 switched from "thinking" to "auto" mode — produced the study's most significant finding (thinking toggle is load-bearing for epistemic tasks)
- Recommended run order: 1, 3, 6 (fast baseline) → 5 (agentic)
- 3 runs recommended for subjective evals to measure variance
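
The blind-labeling step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tooling actually used for these runs: `build_blind_bundle` and `unblind` are hypothetical helpers, and the bundle layout (label on its own line, then the output) is a guess at the paste format.

```python
# Fixed key from the methodology notes; kept by the operator, never sent to the judge.
KEY = {"1": "Opus 4.6", "2": "GPT 5.4", "3": "Gemini 3.1"}

def build_blind_bundle(outputs, key=KEY):
    """Join model outputs under <n> labels only; `outputs` maps model name -> text."""
    sections = [
        f"<{label}>\n{outputs[model]}" for label, model in sorted(key.items())
    ]
    return "\n\n".join(sections)

def unblind(judge_scores, key=KEY):
    """Map the judge's per-label scores back to model names."""
    return {key[label]: score for label, score in judge_scores.items()}

# Placeholder outputs standing in for real model responses.
outputs = {"Opus 4.6": "alpha", "GPT 5.4": "beta", "Gemini 3.1": "gamma"}
bundle = build_blind_bundle(outputs)
scores = unblind({"1": 4.25, "2": 2.42, "3": 3.58})

print(bundle)
print(scores)
```

Labels only hide the key; an output that names its own model or vendor in the text would still leak identity to the judge, so responses are worth skimming before they are pasted in.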