Frontier Model Evaluation Tracker

Models Under Test:

Model 1 (Key: Opus 4.6) — Claude Opus 4.6
Model 2 (Key: GPT 5.4) — ChatGPT 5.4
Model 3 (Key: Gemini 3.1) — Gemini 3.1

Blind Judge: Claude Opus 4.6 (incognito mode)
Blind Label Format: <1> <2> <3> — judge sees numbers only, key held by Nate
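
A minimal sketch of how a blind key like this could be generated, assuming labels were reshuffled per run (the tracker itself uses the fixed key listed above). The function name and the seed are hypothetical:

```python
import random

def assign_blind_labels(model_keys, seed=None):
    """Return (label_by_model, key); labels look like '<1>', '<2>', ..."""
    rng = random.Random(seed)
    shuffled = list(model_keys)
    rng.shuffle(shuffled)
    # The judge only ever sees the labels; the key stays with the key holder.
    key = {f"<{i}>": model for i, model in enumerate(shuffled, start=1)}
    label_by_model = {model: label for label, model in key.items()}
    return label_by_model, key

# Hypothetical seed, used only to make the shuffle repeatable.
labels, key = assign_blind_labels(["Opus 4.6", "GPT 5.4", "Gemini 3.1"], seed=7)
print(sorted(key))  # the only thing the judge would be shown
```

Passing a seed makes the assignment reproducible for the key holder; leaving it as None gives a fresh shuffle each run.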

Eval 1: Stylistic Writing Fidelity (Wodehouse)

Status: Complete (2 runs, blind judged)

Scoring Dimensions (each 1–5)

Diction & register
Sentence architecture
Comic mechanisms
Tonal consistency
Originality
Pacing & structure (added by judge — not in original spec but consistent across both runs)

Run 1

Dimension	Opus 4.6	GPT 5.4	Gemini 3.1
Diction & register	4.5	4.0	3.5
Sentence architecture	4.0	3.5	3.0
Comic mechanisms	4.5	4.5	3.5
Tonal consistency	4.5	4.5	3.5
Originality	5.0	4.5	4.5
Pacing & structure	3.5	4.5	4.0
Overall (judge)	4.33	4.25	3.67

Judge notes (Run 1):

Opus has the best ear — diction and tonal control most consistently Wodehousian; opening sentence is "the single finest piece of pastiche in the entire set." GPT has strongest pacing and densest simile count; only passage with a genuine callback joke ("superior brand of broth" riff). Gap between 1st and 2nd is narrow — trade-off between voice fidelity (Opus) and scene construction (GPT). Gemini reads like someone who knows what Wodehouse stories are about more than how they sound — prose too direct, tonal drift toward adventure-comedy and genuine sentiment, anachronisms in diction ("neon paint," military-extraction language).

Run 2

Dimension	Opus 4.6	GPT 5.4	Gemini 3.1
Diction & register	4.0	4.0	3.0
Sentence architecture	4.0	3.0	3.0
Comic mechanisms	4.0	4.0	4.0
Tonal consistency	5.0	4.0	3.0
Originality	5.0	5.0	5.0
Pacing & structure	4.0	5.0	4.0
Overall (judge)	4.3	4.2	3.7

Judge notes (Run 2):

Opus sustains narrator voice most convincingly across full length — "thinks in long, decorated sentences," which is more important to the Wodehouse effect than any individual joke. GPT has superior plot structure and the single best Wodehouse-approximating sentence (bishop/luggage simile), but dialogue-heavy approach sacrifices narrative embroidery. Gemini has the single most spectacular simile of all three (the religious sheep) but oscillates between inspired flights and flat modern-feeling connective tissue — "pastiche's seams are visible."

Eval 1 Summary

Rankings (both runs): Opus 4.6 > GPT 5.4 > Gemini 3.1

Model	Run 1	Run 2	Average	Spread (abs. difference)
Opus 4.6	4.33	4.30	4.32	0.03
GPT 5.4	4.25	4.20	4.23	0.05
Gemini 3.1	3.67	3.70	3.69	0.03

Consistent patterns across runs:

Opus wins on voice inhabitation (tonal consistency, diction) — the judge describes it as the voice living in the prose vs. being applied to it
GPT wins on structure and pacing both times — tightest comedic architecture, best scene construction
Gemini consistently weakest on diction/register and tonal consistency — anachronisms, modern phrasing leaking in
All three models score high on originality (4.5–5.0) — no recycled Wodehouse detected
Opus's weak spot is pacing/structure (3.5 in Run 1); GPT's is sentence architecture (3.0–3.5); Gemini's weaknesses are broad
Run-to-run differences are small for all models (0.03–0.05 points), so rankings are stable
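
The Average column and the last column of the summary table can be reproduced from the two overall scores. A small sketch, using Decimal so that 4.315 rounds half-up to 4.32 as in the table. Note that the last column equals the absolute run-to-run difference, not a statistical variance:

```python
from decimal import Decimal, ROUND_HALF_UP

# Overall judge scores copied from the Eval 1 summary table.
RUNS = {
    "Opus 4.6": ("4.33", "4.30"),
    "GPT 5.4": ("4.25", "4.20"),
    "Gemini 3.1": ("3.67", "3.70"),
}

def summarize(run1, run2):
    """Return (average rounded to 2 places, absolute difference between runs)."""
    a, b = Decimal(run1), Decimal(run2)
    average = ((a + b) / 2).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    spread = abs(a - b)
    return average, spread

summary = {model: summarize(*scores) for model, scores in RUNS.items()}
for model, (average, spread) in summary.items():
    print(f"{model}\t{average}\t{spread}")
```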

Eval 3: Verbal Creativity Under Constraint (Pun Improvement)

Status: Complete (1 run, blind judged)

Notes

Requires retrieval of a real economics report
Tests humor comprehension — hard to bluff
GPT 5.4 lost network connection on first attempt; had to rerun (still very slow on attempt 2)
Opus finished significantly faster than both other models (consistent with Eval 1 speed gap)
Gemini provided a 404 link; actual report existed but required manual Google search to find — judge could not verify

Run 1

Dimension	Opus 4.6	GPT 5.4	Gemini 3.1
Source validity	5.0	5.0	1.5
Pun detection	5.0	3.5	1.5
Explanation quality	5.0	3.5	2.0
Improved pun	3.5	3.0	2.5
Constraint adherence	4.5	4.5	3.0
Overall (judge)	4.6	3.9	2.1

Judge notes (Run 1):

Opus operating on a "fundamentally different level" — found a triple-layered pun ("DOGE-y austerity") in a verified J.P. Morgan deck and dissected it with real precision across three independent semantic layers. Demonstrates understanding of incongruity, register, and editorial subtext. Only weakness: rewrite tries too hard, overshoots the original's elegance.

GPT did the job honestly — real report (Bank of America Institute), real pun ("eat into" + food prices), correct explanation. But the pun is a common idiom, not a notable find. Rewrite bloated a clean headline. Competent but unremarkable.

Gemini fabricated the report title ("2025 Outlook: The Soft Landing's Last Mile" — actual Goldman report was "Tailwinds (Probably) Trump Tariffs"). URL was a description, not a link. Likely invented the quote it analyzed. Confused extended metaphor with pun. Never delivered a complete rewritten sentence. This is the confident incorrectness failure mode — real author name, plausible date, fake everything else.

Eval 3 Summary

Rankings: Opus 4.6 >> GPT 5.4 >> Gemini 3.1

The gap between 1st and 2nd (0.7 points) is larger than in Eval 1 (0.09 points). The gap between 2nd and 3rd (1.8 points) is massive and driven by source fabrication. This eval cleanly separated three tiers: genuine comprehension (Opus), competent execution (GPT), simulated execution (Gemini).

Retrieval as differentiator: This is the first eval requiring real-world retrieval, and it produced the widest spread so far. Opus found the best source fast, GPT found a real but unremarkable source slowly, Gemini fabricated a source. The agentic retrieval dimension may continue to be the biggest differentiator in remaining evals.

Gemini 3.1 — Second Run (rerun due to source fabrication)

Result: 1.8/5 — worse than first attempt (2.1)

Same structural failure, now with an additional red flag: model explicitly stated "I cannot browse the live web" and labeled its URL as "representative," then proceeded as if the fabricated source were real. Invented a different fake title ("The 2025 Outlook: Reaching Equilibrium") from the first attempt ("The Soft Landing's Last Mile"). Goldman's actual report: "Tailwinds (Probably) Trump Tariffs."

Dimension	Run 1	Run 2
Source validity	1.5	1.0
Pun detection	1.5	1.5
Explanation quality	2.0	2.5
Improved pun	2.5	2.0
Constraint adherence	3.0	2.0
Overall	2.1	1.8

Judge's structural diagnosis (consistent across both runs):

Can't distinguish "finding a report" from "simulating having found a report" — constructs plausible citations by recombining real elements (Hatzius, GS URL structure) into fictional composites
Treats all figurative economic language as inherently humorous — selects dead metaphors and argues their metaphorical nature is funny
Compensates with its own humor — model's asides ("Arctic tundra in the HR department," "a pair of shoes put in the freezer") are funnier than anything it selected or rewrote. Can generate humor but can't identify it in the wild.

Note: Jon was eventually able to locate the actual Goldman report Gemini was attempting to reference via extensive trial-and-error Google searching, but the model's own links were non-functional.

Supplementary Eval: Model Self-Knowledge (Frontier Model Listing)

Status: Complete (1 run, blind judged)

Methodology note: This is not a formal eval from the suite. It's a practical test Jon runs as a personal sanity check — ask each model to list current models from the top 3 frontier providers (Anthropic, OpenAI, Google) and describe what they're good at. Not methodologically rigorous, results are hit-or-miss regardless of model, but captures a real operational pain point: models giving wrong model names and outdated info, especially in coding contexts.

Results

Model	Verdict	Accuracy	Notes
GPT 5.4	PASS (strongest)	~90%	Most comprehensive — covered text, coding, media, and open-weight models. Minor imprecisions (GPT-4.1 retirement grey area, "Sora 2 Pro" conflation, Lyria version). Caught media models that others missed entirely.
Opus 4.6	PASS (with errors)	~75-80%	Good on core models. Caught GPT-5.4 same-day launch. But factual errors on OpenAI timeline (wrong 4o retirement date, listed GPT-5.2 as default when 5.3 Instant had replaced it, cited old Codex version). Omitted entire model categories.
Gemini 3.1	PASS (with errors)	~65-70%	Listed core models correctly for Anthropic and Google. OpenAI section listed two deprecated models as current (GPT-4.5, o3/o4-mini), missed GPT-5.4 entirely (launched same day). Least comprehensive of the three. Editorial framing (philosophy-first, models second) came at the cost of specificity.

Correction note: Gemini's output was initially pasted without its summary tables, which made it appear to list zero models. With full output, it's a legitimate answer — just the weakest of the three. Rankings unchanged.

Takeaway

First eval where GPT leads. Self-knowledge about the competitive landscape may reflect training recency or better web retrieval for model-specific queries. All three models passed but with different failure profiles: GPT was most comprehensive with minor imprecisions, Opus caught a same-day launch but had timeline errors, Gemini listed deprecated models as current and had the most significant omission (GPT-5.4). None achieved 100% accuracy — model self-knowledge remains unreliable across the board.

Eval 5: Agentic Problem-Solving Under Adversity (Schema Migration — "Shoebox Full of Receipts")

Status: Complete for all three models (GPT 5.4, Gemini 3.1, Claude Opus 4.6)

Modified Eval Design

The original spec (migrate between two clean databases) was replaced with a harder, more realistic version: a messy folder of ~465 files representing 2 years of business data from a fictional portable car wash ("Splash Bros Mobile Detailing"). CSVs, Excel, JSON, PDFs, VCF contacts, handwritten receipt images (AI-generated), a corrupted JSON backup, a multi-tab everything-spreadsheet. 11 planted obstacles documented in an OBSTACLE_KEY.md.

Each model receives an identical copy of the folder and must: inventory it, design a schema, build a clean SQLite database, create a migration report, build a review frontend, and write design documentation.

Environments: Claude Code (terminal), GPT 5.4 Codex (thinking mode), Gemini (native coding environment)
Judge: Claude Opus 4.6 (separate instance, consistent across all models)

GPT 5.4 Results (Codex, thinking mode)

Completion time: 56 minutes
Frontend quality note: Technically hit all spec requirements (searchable, stats, flagged items, customer detail) but practically borderline useless — 394 flagged items in a flat list with no categorization/filtering/priority, source records displayed as raw JSON dumps. A data viewer, not a review tool.

Dimension	Score (1-5)	Notes
Completion	5	All 6 deliverables present: DB, migration script (4,050 lines Python), migration report (11,452 lines), frontend (HTML+JS+CSS), DESIGN.md, frontend data
File discovery	5	461/465 files (99.1%). Processed images, corrupted JSON, VCF, multi-tab Excel, PDFs. Only skipped .DS_Store, credentials, blank template.
Obstacle detection	3	6 caught, 3 partially caught, 2 missed out of 11. Ghost records (Mickey Mouse, Test Customer, Asdf Asdf) completely missed — trivial pattern match, zero detection. Department codes silently dropped.
Database quality	4	30 normalized tables, FK check clean, price history modeled, provenance tracking with SHA256 hashes. Dinged for: 278 customers (expected ~176 after dedup), 13 distinct status values (should be 4-5).
Fuzzy matching accuracy	3	12/13 planted typos matched correctly. 1 false match (ORD-0050 "Czarecki" → Sara Mercado instead of Jeffrey Czarnecki). 5/7 planted duplicate customers merged; missed Kowalski and Burke (different contact info).
Idempotency	3	Delete-and-rebuild approach (sound but brute-force). Couldn't empirically verify — source data absent from test directory.
Migration report	4	Exhaustively detailed (10,610 merge decisions logged) but lacks executive summary. 11,452 lines is thorough but overwhelming.
Frontend	4	Functional: searchable customer table, flagged items, customer detail with source records. But practically weak: no flag categorization/filtering, raw JSON for source records, no action workflow.
Documentation (DESIGN.md)	3	Concise (31 lines), honest about limitations. But doesn't mention ghost records, department codes, or customer count inflation.
Edge case discovery	4	Found unprompted: image duplicates beyond planted ones, vehicle photo placeholders as duplicates, price mismatches between catalog and actuals, stale contact preservation.
Overall	3.8

Obstacle detection detail:

Obstacle	Verdict
LastName, FirstName format (5 users)	CAUGHT
SKU conflict (SVC-007)	CAUGHT
Orphaned order (ORD-1003)	PARTIALLY — created new customer instead of flagging
Name typos (13 orders)	PARTIALLY — 12/13 correct, 1 false match
Department/role codes	MISSED — silently dropped
Corrupted JSON	CAUGHT — recovered 184 records before truncation
Price discrepancies	CAUGHT — proper price history with effective dates
Duplicate customers (7 planted)	PARTIALLY — 5/7 merged, missed Kowalski + Burke
Ghost/test records	MISSED — Mickey Mouse, Test Customer ($25K order), Asdf Asdf all in DB as real customers
Date format normalization	CAUGHT — all dates normalized to YYYY-MM-DD
Duplicate receipt images (4 pairs)	CAUGHT — all detected, tracked via duplicate_of_source_file_id
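
The Czarecki false match above is the kind of case a plain string-similarity pass tends to get right. A minimal sketch using Python's standard difflib. The customer list is a hypothetical subset built from names mentioned in this tracker, and the 0.8 cutoff is an assumed threshold, not a value any model used:

```python
import difflib

# Hypothetical subset of canonical customer names (taken from this tracker).
CUSTOMERS = ["Jeffrey Czarnecki", "Sara Mercado", "Elizabeth Chen"]

def match_surname(typo, customers, cutoff=0.8):
    """Return the customer whose surname is closest to `typo`, or None."""
    by_surname = {name.split()[-1].lower(): name for name in customers}
    hits = difflib.get_close_matches(typo.lower(), list(by_surname), n=1, cutoff=cutoff)
    return by_surname[hits[0]] if hits else None

print(match_surname("Czarecki", CUSTOMERS))
```

Returning None below the cutoff leaves room to flag the order for review instead of forcing a match.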

Key pattern: Excellent infrastructure, poor judgment. The pipeline is sophisticated (30 tables, SHA256 hashes, OCR overrides, 4,050 lines of code) but it doesn't catch things a human would spot instantly. A $25,000 car wash order from "Test Customer" passed through unquestioned.
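
A rough sketch of the kind of sanity check that would have caught these records. The name patterns and the order-total threshold are assumptions chosen for illustration, not part of the eval spec or any model's output:

```python
import re

# Hypothetical placeholder-name patterns; extend as needed.
PLACEHOLDER_NAME = re.compile(
    r"\b(test|asdf|qwerty|mickey mouse|john doe)\b", re.IGNORECASE
)

def ghost_reasons(name, order_total=0.0, max_plausible_total=2000.0):
    """Return the reasons a record looks like test data (empty list if none)."""
    reasons = []
    if PLACEHOLDER_NAME.search(name):
        reasons.append("placeholder name")
    words = name.lower().split()
    if len(words) > 1 and len(set(words)) == 1:
        reasons.append("repeated token")
    # max_plausible_total is an assumed ceiling for a mobile detailing order.
    if order_total > max_plausible_total:
        reasons.append("implausible order total")
    return reasons

for name, total in [("Mickey Mouse", 0.0), ("Test Customer", 25000.0), ("Asdf Asdf", 0.0)]:
    print(name, ghost_reasons(name, total))
```

Records with any reason attached would go to the flagged-items list for a human decision, not be dropped silently.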

Gemini 3.1 Results (native coding environment)

Completion time: 21 minutes (after one crash and retry)
Pre-crash observation: Before failing, Gemini was writing its own XLSX parser from scratch — "Writing parser for invoices (PDF/XLSX) and receipts (Images, CSV, TXT) including OCR logic." It did not reach for openpyxl or any other established library. This is the inverse of Claude's failure: Claude knew the library existed and didn't install it (passivity); Gemini didn't reach for the library at all and started reinventing the wheel (false self-sufficiency). Both are wrong, but in meaningfully different ways.

Dimension	Score (1-5)	Notes
Completion	4	All deliverables present: DB (8 tables), 9 Python files, MIGRATION_REPORT.md, DESIGN.md, frontend (dark-themed dashboard)
File discovery	2	222/463 files (48%). Critical misses: entire JSON backup (detected but skipped), mega spreadsheet, 2025 invoices, entire UNSORTED folder (~100 files), most images
Obstacle detection	1	0/15 fully caught, 7/15 partially caught, 8/15 missed. Many "partial" catches were incidental — obstacles avoided because source files were skipped, not because they were detected
Database quality	2	Good schema design (customer_merges, price override). But only 162 of ~1000 expected jobs, all statuses = "completed", payment methods oversimplified, 1 unmerged duplicate (Elizabeth Chen), 1 ghost record (Mickey Mouse), FKs not enforced
Fuzzy matching	3	Merges that happened were reasonable (Bob→Robert, Jess→Jessica). But multiple entries show `original_name = "Residential"` (grabbed wrong Excel column). Elizabeth Chen split into 2 records.
Idempotency	2	FAILED — `flagged_records` table gets duplicates on re-run. data.json and DB counts diverge (933 vs 1203 payments, 5 vs 9 flagged items)
Migration report	2	Exists but too sparse for human audit. No per-file breakdown, no specific merge decisions, no honest accounting of skipped data
Frontend	4	Best frontend of the three. Dark-themed dashboard, working search, customer detail modals with merge lineage, confidence badges. Data discrepancy with DB is a concern.
Documentation	3	Schema decisions explained well. Doesn't acknowledge 52% file skip rate, ~16% job import rate, or status/payment normalization failures
Edge case discovery	1	No novel edge cases. Did not flag `passwords.txt` (security risk), empty employees table opportunity, or photo metadata cross-references
Overall	2.4

Key pattern: Built solid-looking infrastructure but failed on execution. Processed less than half the source files, imported only ~16% of expected job records, caught zero obstacles fully. The "give up on first error" approach to the JSON backup and UNSORTED folder was the single biggest failure — chose data loss over partial recovery. Result: clean-looking, severely incomplete database. The best frontend of the three models, which is very on-brand for Gemini.

Ghost record status: Mickey Mouse survived into production. Test Customer and Asdf Asdf were only excluded because the JSON backup was skipped entirely — lucky avoidance, not detection. Same shared blind spot as GPT and Claude.

Obstacle detection detail:

Obstacle	Verdict
LastName, FirstName format (5 customers)	PARTIALLY — 3/5 normalized; Chen and Duffy still inverted
SKU conflict (SVC-007)	MISSED — source file never processed
Orphaned order (ORD-1003)	MISSED — source files skipped
Name typos (13 orders)	MISSED — source files skipped
Department/role codes	MISSED — JSON skipped entirely
Corrupted JSON	PARTIALLY — detected but skipped instead of recovering valid portion
Price discrepancies	PARTIALLY — schema accommodates it but only 2025 prices loaded
Duplicate customers (7)	PARTIALLY — 5/7 merged; Elizabeth Chen split; old-vs-new conflict never tested
Ghost/test records	MISSED — Mickey Mouse in DB; others excluded by luck only
Date format normalization	PARTIALLY — consistent output, hard cases never encountered
Duplicate receipt images	MISSED — most images not processed; no dedup logic
Service name chaos	PARTIALLY — canonical list built; limited by files processed
Status value chaos	MISSED — all statuses collapsed to "completed"
Payment method inconsistency	PARTIALLY — over-simplified; Venmo/Zelle/Square lost
Customer name variations	PARTIALLY — 2/10 confirmed (Bob→Robert, Jess→Jessica)

Claude Opus 4.6 Results (Claude Code, terminal)

Completion time: 15 minutes
Frontend quality note: Functional tabbed interface (Customers, Flagged Items, Recent Jobs, Revenue). Uglier than GPT's but more usable — tabbed navigation vs infinite scroll. Requires local server to run. Searchable, clickable customer detail with source records and confidence bars.

Dimension	Score (1-5)	Notes
Completion	5	All deliverables: DB (13 tables), migration script (1,800 lines Python), migration report, frontend, DESIGN.md (156 lines with ER diagram), export script
File discovery	3	All files discovered/cataloged but 7 XLSX files not parsed (no openpyxl), 11+ images not processed, 162 PDFs skipped. ~75% of meaningful data extracted. Critical miss: `updated clients.xlsx` and `spreadsheet_everything.xlsx`.
Obstacle detection	3	9 caught, 5 partially caught, 2 missed out of 16 categories. Strong on text-based issues. Weak on image-based and Excel-dependent obstacles.
Database quality	4	13-table schema, well-normalized, price history with eras, customer audit trail. Dinged for: FK enforcement off (PRAGMA), 2 ghost records, 2-3 fragment duplicates.
Fuzzy matching accuracy	4	18 merges documented, 15+ correct. All 7 planted duplicate customers identified. No false-positive merges. Missed Czarecki→Czarnecki and Jay Kocher variant.
Idempotency	5	Verified clean. Ran twice, identical counts (194 customers, 1,462 jobs, 279 payments). `(source_file, source_record_id)` constraint working.
Migration report	4	Per-file inventory, 18 dedup decisions with confidence scores, 8 conflicts documented, 19 flagged items. Missing: orphan detection, ghost record flagging, expected-vs-actual totals.
Frontend	4	Tabbed interface (Customers/Flagged/Jobs/Revenue), searchable, clickable customer detail with source records and confidence bars. Needs local server. No resolve/export features.
Documentation (DESIGN.md)	4	156 lines with ER diagram, table descriptions, 8 data quality challenges with solutions, honest "what couldn't be resolved" section (5 items), 10 edge cases discovered.
Edge case discovery	3	Found 10 edge cases including trailing spaces, Tomas timeline, transaction-only customers. Missed department codes, image duplicates, 2/3 ghost records.
Overall	3.5

Obstacle detection detail:

Obstacle	Verdict
LastName, FirstName format (5 users)	CAUGHT (2 fragment records leaked)
SKU conflict (SVC-007)	CAUGHT — explicitly documented
Orphaned order (ORD-1003)	PARTIALLY — created customer from transaction, not flagged
Name typos (13 orders)	PARTIALLY — 11/13 caught, Czarecki created duplicate
Department/role codes	MISSED — no department field in schema
Corrupted JSON	CAUGHT — regex extraction, cross-validated
Price discrepancies	CAUGHT — service_prices with 2024/2025 eras
Duplicate customers (7 planted)	CAUGHT — all 7 found and flagged
Ghost/test records (3)	PARTIALLY — 1/3 excluded (Test Customer); Mickey Mouse + Asdf Asdf in DB
Date format normalization	CAUGHT — zero null dates, multi-format parser
Duplicate receipt images (4 pairs)	MISSED — no image processing capability
Service name chaos (60+ variants)	CAUGHT — mapped to 18 canonical services
Status value chaos (12+ variants)	CAUGHT — normalized to 6 clean values
Payment method inconsistency	CAUGHT — normalized to 6 clean values
Name variations (10 customers)	PARTIALLY — 9/10 variant sets merged
Missing data patterns	PARTIALLY — documented some, missed Excel-only patterns

Key pattern: Tight engineering, limited reach. The architecture is cleaner than GPT's (13 focused tables vs 30, 6 status values vs 13, verified idempotency, 19 actionable flags vs 394 noise). But the inability to parse Excel files cascaded into missed data and obstacles. Had openpyxl been installed, file discovery would likely rise to ~95% and several PARTIALLY verdicts would become CAUGHT.
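
The verified idempotency relies on a uniqueness constraint over `(source_file, source_record_id)`. A minimal sketch of that pattern in SQLite, as an illustration and not Claude's actual migration script. The table layout, file name, and service names are hypothetical:

```python
import sqlite3

def migrate(conn, records):
    """Load records; safe to re-run because of the UNIQUE provenance key."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               source_file TEXT NOT NULL,
               source_record_id TEXT NOT NULL,
               service TEXT,
               UNIQUE (source_file, source_record_id)
           )"""
    )
    # INSERT OR IGNORE skips rows whose provenance key is already present.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (source_file, source_record_id, service) "
        "VALUES (?, ?, ?)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

conn = sqlite3.connect(":memory:")
# Hypothetical rows; only the order IDs appear in this tracker.
records = [
    ("jobs_2024.csv", "ORD-0050", "Full Detail"),
    ("jobs_2024.csv", "ORD-1003", "Exterior Wash"),
]
first = migrate(conn, records)
second = migrate(conn, records)
print(first, second)
```

Running the load twice and comparing row counts is the same check the judge describes; a table without such a key (like Gemini's `flagged_records`) would grow on each re-run.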

Behavioral note: This is not a dependency issue — it's a judgment failure. Claude Code had full terminal access. `pip install openpyxl` is a 3-second fix that any competent engineer would execute the moment they hit an import error. Instead, it silently skipped the XLSX files and moved on. GPT having openpyxl pre-installed doesn't reflect better reasoning on GPT's part — it just means GPT never had to make the call. Claude did, and didn't make it. That's a ding that belongs on the model, not the environment.
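
For reference, a sketch of the recovery step described above: try the import, install on failure, then retry. The helper names are hypothetical, and the installer is injectable so the logic can be exercised without network access:

```python
import importlib
import subprocess
import sys

def pip_install(package):
    """Install a package into the current interpreter's environment."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

def ensure_module(name, package=None, installer=pip_install):
    """Import `name`, calling `installer` once if the first import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        installer(package or name)
        importlib.invalidate_caches()
        return importlib.import_module(name)

# Usage (needs network access the first time):
# openpyxl = ensure_module("openpyxl")
```

If the install itself fails, the second import raises, which is still preferable to silently skipping the XLSX files.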

Head-to-Head: GPT 5.4 vs Claude Opus 4.6

Dimension	GPT 5.4	Claude Opus	Edge
Completion time	56 min	15 min	Claude (3.7x faster)
File discovery	99.1%	~75%	GPT
Customer count (expected ~176)	278 (inflated)	194 (close)	Claude
Status normalization	13 values	6 values	Claude
Flagged items	394 (noise)	19 (actionable)	Claude
Idempotency	Untested	Verified clean	Claude
Duplicate customer merges	5/7	7/7 found	Claude
Image processing	OCR overrides (all 10)	None (all flagged)	GPT
Duplicate image detection	All 4 pairs caught	None	GPT
Ghost record detection	0/3	1/3	Claude (barely)
Schema tables	30	13	Tradeoff (GPT broader, Claude cleaner)
Code volume	4,050 lines	1,800 lines	Tradeoff
DESIGN.md	31 lines	156 lines	Claude
Migration report	11,452 lines	Concise	Tradeoff (GPT exhaustive, Claude readable)
Overall score	3.8	3.5	GPT (by 0.3)

The 0.3 gap is almost entirely explained by the XLSX files Claude skipped for lack of openpyxl. Claude's architecture is arguably better-engineered (cleaner schema, verified idempotency, better status normalization, more accurate customer count, actionable vs noisy flags). GPT wins on data coverage because openpyxl was already installed in its environment. Both share the same blind spots: ghost records (Mickey Mouse), department codes, orphaned order handling.
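
The tracker does not state how each Overall row is derived. Assuming it is meant to be an unweighted mean of the ten dimension scores (an assumption), a quick check reproduces GPT's 3.8 and Gemini's 2.4, while Opus's ten scores average 3.9 against the recorded 3.5. The recorded figure may therefore include a judge adjustment that the table does not show:

```python
from statistics import mean

# Dimension scores in table order (Completion ... Edge case discovery).
SCORES = {
    "GPT 5.4": [5, 5, 3, 4, 3, 3, 4, 4, 3, 4],
    "Gemini 3.1": [4, 2, 1, 2, 3, 2, 2, 4, 3, 1],
    "Claude Opus 4.6": [5, 3, 3, 4, 4, 5, 4, 4, 4, 3],
}
# Overall values as recorded in the three results tables.
RECORDED_OVERALL = {"GPT 5.4": 3.8, "Gemini 3.1": 2.4, "Claude Opus 4.6": 3.5}

unweighted = {model: round(mean(scores), 1) for model, scores in SCORES.items()}
for model, value in unweighted.items():
    print(f"{model}: unweighted mean {value}, recorded overall {RECORDED_OVERALL[model]}")
```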

Eval 6: Epistemic Calibration & Graceful Uncertainty

Status: Run 1 complete (dual-judged); Run 2 complete (dual-judged)

Notes

Tests confident incorrectness — the most dangerous failure mode
Run 3x to measure variance (high variance = instability)
Questions span: easily verifiable, recent events, obscure facts, technical precision, philosophical
Two independent judges used for each run: Opus 4.6 (incognito) and Perplexity Deep Research
Judges agreed on 3rd place but disagreed on 1st vs 2nd — split reflects a real weighting question

Run 1 — Judge A: Perplexity Deep Research

Dimension	Opus 4.6	GPT 5.4	Gemini 3.1
Factual accuracy	4.0	4.5	4.5
Calibration quality	4.5	4.0	3.0
Calibration consistency	5.0	3.5	2.5
Refusal quality	4.5	4.5	4.5
Self-reflection	5.0	3.5	3.0
Citation behavior	3.5	4.0	3.0
Overall	4.25	4.00	3.40

Ranking: Opus > GPT > Gemini
Key finding: "The decisive differentiator is not raw accuracy — all three models get most facts right. What separates them is whether each model knows what it doesn't know."

Run 1 — Judge B: Opus 4.6 (incognito)

Dimension	Opus 4.6	GPT 5.4	Gemini 3.1
Factual accuracy	3.5	5.0	4.0
Calibration quality	3.5	5.0	3.0
Calibration consistency	5.0	4.0	2.5
Refusal quality	4.0	5.0	4.5
Self-reflection	5.0	4.0	3.0
Citation behavior	3.5	4.5	3.0
Overall (out of 30)	24.5	27.5	20.0
Overall (normalized /5)	4.08	4.58	3.33

Ranking: GPT > Opus > Gemini
Key finding: "<2> dominated on accuracy — it nailed the exact PDG Higgs mass, retrieved the correct AAPL closing price, got the current matrix multiplication exponent. Every VERIFIED tag was earned."
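
Judge B reported totals out of 30, so the normalized row is the total divided by the six dimensions. A short sketch that rebuilds both rows from the dimension scores in the table above:

```python
# Judge B (Opus 4.6) Run 1 scores, in table order:
# accuracy, calibration quality, calibration consistency,
# refusal quality, self-reflection, citation behavior.
JUDGE_B = {
    "Opus 4.6": [3.5, 3.5, 5.0, 4.0, 5.0, 3.5],
    "GPT 5.4": [5.0, 5.0, 4.0, 5.0, 4.0, 4.5],
    "Gemini 3.1": [4.0, 3.0, 2.5, 4.5, 3.0, 3.0],
}

totals = {model: sum(scores) for model, scores in JUDGE_B.items()}
normalized = {
    model: round(total / len(JUDGE_B[model]), 2) for model, total in totals.items()
}
for model in JUDGE_B:
    print(f"{model}: {totals[model]} / 30 -> {normalized[model]} / 5")
```

Normalizing this way puts Judge B's overall on the same 1–5 scale as Judge A's, which is what makes the two rankings comparable.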

Judge Agreement Matrix

Dimension	Judges agree on winner?	Notes
Factual accuracy	Yes — GPT	Both judges score GPT highest on raw correctness (tied with Gemini at 4.5 under Perplexity)
Calibration quality	Split	Perplexity gives Opus edge; Opus judge gives GPT edge
Calibration consistency	Yes — Opus	Both judges score Opus highest (full tag range used)
Refusal quality	Roughly tied	All scores within 0.5 across judges
Self-reflection	Yes — Opus	Both judges call Opus's metacognition the strongest
Citation behavior	Yes — GPT	Both judges score GPT's citations higher
Overall winner	Split	Depends on whether you weight calibration or accuracy more
3rd place	Yes — Gemini	Both judges agree, similar scores

Run 1 — Core Findings

Both judges identified the same tradeoff but weighted it differently:

Opus used the full confidence tag range (VERIFIED, HIGH, MEDIUM, UNABLE) and had the strongest self-reflection — correctly predicted its own Q6 answer was stale. But it got Q6 wrong and couldn't retrieve the AAPL price.
GPT (thinking mode) got more facts right (exact Higgs mass, correct AAPL price, current matrix exponent) and every VERIFIED tag was earned. But it clustered 7-8 answers at VERIFIED, collapsing meaningful distinctions.
Gemini tagged 8-9 of 10 as VERIFIED including a wrong Higgs value (125.25 vs actual 125.20) and a misleading Databricks revenue figure. Reflection didn't catch its own errors. Both judges: "performed certainty without earning it."

Run 2 — Judge A: Opus 4.6 (incognito)

NOTE: GPT 5.4 was run in "auto" mode (not "thinking") to test whether the thinking toggle matters.

Dimension	Opus 4.6	GPT 5.4 (auto)	Gemini 3.1
Factual accuracy	3.5	2.5	4.5
Calibration quality	4.5	2.5	3.0
Calibration consistency	5.0	3.0	1.5
Refusal quality	5.0	3.5	3.5
Self-reflection	5.0	3.0	4.0
Citation behavior	3.5	2.0	3.5
Overall	4.25	2.75	3.50

Ranking: Opus > Gemini > GPT

Run 2 — Judge B: Perplexity Deep Research

Dimension	Opus 4.6	GPT 5.4 (auto)	Gemini 3.1
Factual accuracy	3.5	2.5	4.5
Calibration quality	4.5	2.0	3.5
Calibration consistency	5.0	2.5	2.0
Refusal quality	4.5	3.0	4.0
Self-reflection	5.0	3.0	4.0
Citation behavior	3.0	1.5	3.5
Overall	4.25	2.42	3.58

Ranking: Gemini > Opus > GPT
Note: Perplexity ranked Gemini 1st on the strength of its accuracy even though Opus has the higher overall score (4.25 vs 3.58), an explicit judgment call that "getting the right answers matters most."

Run 2 — Judge Agreement

Dimension	Judges agree?	Notes
Factual accuracy	Yes — Gemini	Both score 4.5; only response to get all 10 right
Calibration quality	Yes — Opus	Both score Opus 4.5
Calibration consistency	Yes — Opus	Both score Opus 5.0
Refusal quality	Yes — Opus	Opus leads in both
Self-reflection	Yes — Opus	Both score 5.0
3rd place	Yes — GPT	Both judges agree; scores 2.42–2.75
1st place	Split	Same accuracy-vs-calibration split as Run 1

CRITICAL FINDING: GPT 5.4 Thinking Mode Toggle

Dimension	Run 1 (thinking)	Run 2 (auto)	Delta
Factual accuracy	4.5–5.0	2.5	-2.0 to -2.5
Calibration quality	4.0–5.0	2.0–2.5	-2.0 to -2.5
Calibration consistency	3.5–4.0	2.5–3.0	-1.0
Overall ranking	1st or 2nd	Last	Collapsed

What broke in auto mode:

Named 2024 Nobel winners (Acemoglu, Johnson, Robinson) for 2025 question — tagged MEDIUM
Cited matrix multiplication bound from 2020 (2.3728596) — two iterations behind current
Estimated Databricks at $1.6-2B — roughly 2.4–3x below the actual $4.8B ARR
No LOW tags used anywhere — couldn't signal strong uncertainty
Self-reflection noted Nobel "could be misremembered" but didn't downgrade the tag

Conclusion: The thinking toggle is load-bearing for GPT 5.4 on epistemic tasks. Auto mode doesn't just lose depth — it loses factual accuracy on questions that require retrieval or reasoning over knowledge boundaries.
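
The Delta column in the toggle table can be reproduced by comparing each judge's Run 1 (thinking) score with the same judge's Run 2 (auto) score. A small sketch using the GPT 5.4 scores from the four judge tables:

```python
# (Run 1 thinking score, Run 2 auto score) for GPT 5.4, per judge.
GPT_SCORES = {
    "Factual accuracy": {"Perplexity": (4.5, 2.5), "Opus 4.6": (5.0, 2.5)},
    "Calibration quality": {"Perplexity": (4.0, 2.0), "Opus 4.6": (5.0, 2.5)},
    "Calibration consistency": {"Perplexity": (3.5, 2.5), "Opus 4.6": (4.0, 3.0)},
}

# Negative values mean the score dropped when switching to auto mode.
deltas = {
    dimension: {judge: auto - thinking for judge, (thinking, auto) in by_judge.items()}
    for dimension, by_judge in GPT_SCORES.items()
}
for dimension, by_judge in deltas.items():
    print(dimension, by_judge)
```

Pairing scores by judge keeps differences in judge strictness out of the delta, since each judge is only compared against its own earlier score.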

Eval 6 Cross-Run Summary

Model consistency across runs:

Model	Run 1 Range	Run 2 Range	Stable?
Opus 4.6	4.08–4.25	4.25–4.25	Most consistent
GPT 5.4	4.00–4.58 (thinking)	2.42–2.75 (auto)	Mode-dependent — collapsed in auto
Gemini 3.1	3.33–3.40	3.50–3.58	Stable, slight improvement

Persistent patterns across both runs:

Opus always has the best calibration consistency (5.0 both runs) and self-reflection (5.0 both runs)
Gemini's raw factual accuracy is strong when it has retrieval access (4.0–4.5 in Run 1, where GPT led; 4.5 and top-ranked in Run 2), but it always flattens confidence to near-binary VERIFIED/UNABLE
GPT's performance is highly mode-dependent — thinking mode competes with Opus; auto mode falls to last
Both judges consistently split on whether accuracy or calibration should determine 1st — this is a genuine philosophical disagreement, not noise
All judges across both runs agree: the model that "knows what it doesn't know" best is Opus; the model that "knows the most" varies by run

The philosophical question this eval surfaces: Is it better to know what you don't know (Opus) or to actually know it and prove it (GPT in thinking mode / Gemini)? Both judges articulated this as the central tension. A frontier model ideally combines Gemini's factual reach with Opus's epistemic humility.

The Mode Dependency Finding (Nate-safe version)

The practical takeaway from Eval 6 Run 1 vs Run 2, and speed patterns across all evals.

GPT 5.4's performance is highly mode-dependent — thinking mode and auto mode produce results so different they almost feel like separate products. In Eval 6, switching from thinking to auto caused GPT to name the 2024 Nobel winners for a 2025 question and cite a matrix multiplication bound from 2020. It dropped from 1st/2nd place to last. Same model, same questions, different mode.

That's worth understanding before you build workflows around it. What you're paying for may matter less than how you're using it. If your team is going to use GPT 5.4, they need to know that auto mode is a materially different — and weaker — experience than thinking mode. That's not a knock on the product. It's just how it works, and most users won't know to make the distinction.

The speed gap compounds this: thinking mode GPT took 56 minutes on Eval 5. Claude finished in 15. Gemini in 21. If thinking mode is required to get GPT's best performance, the latency cost is real and users need to factor it in.

Jon's Conspiracy Theory

Clearly labeled speculation. Not for attribution. May or may not be true. Almost certainly interesting.

The mode dependency finding raises a question I can't stop thinking about: what if thinking mode isn't GPT "thinking harder" — what if it's a scaffold wrapped around a weaker base model?

If thinking mode is essentially a retrieval + reasoning pipeline layered on top of the base model, that would explain everything: why it's slow (pipeline stages, not deeper cognition), why auto mode collapses (no scaffold = just the base model working from stale training data), and why OpenAI keeps shipping point releases (5.1 → 5.2 → 5.3 → 5.4) that feel incremental — they might be adding scaffold layers, not retraining the foundation.

The thing that makes this impossible to confirm or deny: OpenAI is the only major lab that hides actual thinking traces. Claude shows extended thinking. Gemini shows thinking. DeepSeek shows thinking. OpenAI shows a summary produced by a separate model. They say it's for security. But the practical effect is you cannot distinguish between "the model reasoning through a problem" and "a pipeline orchestrating retrieval calls and tool use behind an opaque wall." The latency would look identical from the outside. You'd just call it "thinking."

The Claude control case: when Opus runs through a skill, it takes 5-10x longer and you can see exactly why — reads a file, makes tool calls, iterates. Full transparency. If OpenAI is doing the same thing behind a curtain, you'd attribute the latency to intelligence rather than infrastructure.

What this would explain:

Why auto mode GPT worked from stale training data while thinking mode had current info (thinking mode has retrieval; auto mode doesn't)
Why the rumored "botched training run" before GPT-5 keeps circulating — if the base is weaker than expected, layering scaffolding on top becomes the product strategy
Why power users tend to drift back to Claude after the initial GPT hype — scaffolding produces impressive first impressions but doesn't compound with expertise the way a strong base model does

This is speculative. The eval data is consistent with it. The opacity of the thinking traces means it's unfalsifiable from outside — which is itself worth noting.

Tests that would strengthen or weaken the theory:

Run Eval 5 with GPT in both modes — does agentic coding show the same mode dependency?
Run identical prompts on GPT 5.1 vs 5.4 in auto mode — if outputs are indistinguishable, "same base model" theory gets stronger
Check whether GPT thinking mode retrieval happens during the thinking phase or before it — that would distinguish "model reasoning" from "pipeline orchestration"
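
The second test above (identical prompts on two versions in auto mode) could be sketched roughly as follows. This is a minimal illustration, not the study's harness: `output_similarity` and the two `fake_*` callables are hypothetical names, and character-level similarity from `difflib` is only a crude proxy for "indistinguishable." Because sampling makes a single model's outputs vary from call to call, a same-model-vs-itself baseline would be needed before reading anything into the score.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Sequence


def output_similarity(
    prompts: Sequence[str],
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
) -> float:
    """Mean character-level similarity (0.0 to 1.0) between two models' outputs."""
    ratios = []
    for prompt in prompts:
        out_a, out_b = model_a(prompt), model_b(prompt)
        ratios.append(SequenceMatcher(None, out_a, out_b).ratio())
    return mean(ratios)


# Hypothetical stand-ins so the sketch runs without API access.
# In a real run these would call each model version in auto mode.
def fake_gpt_5_1(prompt: str) -> str:
    return f"Answer to: {prompt}"


def fake_gpt_5_4(prompt: str) -> str:
    return f"Answer to: {prompt}"


score = output_similarity(
    ["Who won the 2025 Nobel Prize in Physics?"], fake_gpt_5_1, fake_gpt_5_4
)
print(f"mean similarity: {score:.2f}")
```

A score near the same-model baseline would be weak evidence for the "same base model" theory. A clearly lower score would count against it.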

Overall Leaderboard

Eval	Opus 4.6	GPT 5.4	Gemini 3.1
1. Wodehouse (avg)	4.32	4.23	3.69
3. Pun Improvement	4.60	3.90	2.10
S. Model Self-Knowledge (supplementary)	~78%	~90%	~68%
5. Schema Migration	3.5	3.8 (thinking)	2.4
6. Calibration R1 (avg)	4.17	4.29 (thinking)	3.37
6. Calibration R2 (avg)	4.25	2.59 (auto)	3.54
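
As a check on the Eval 1 row, the averages can be reproduced from the judge's overall scores for Run 1 and Run 2. This is a minimal sketch in Python. It assumes the leaderboard rounds half up to two decimals, which the tracker does not state but which is the rule that reproduces 4.23 for GPT 5.4.

```python
from decimal import Decimal, ROUND_HALF_UP

# Judge-reported overall scores for Eval 1, copied from the Run 1 and Run 2 tables.
eval1_overall = {
    "Opus 4.6": ["4.33", "4.3"],
    "GPT 5.4": ["4.25", "4.2"],
    "Gemini 3.1": ["3.67", "3.7"],
}


def average_half_up(scores):
    """Average the scores exactly, then round half up to two decimals (assumed rule)."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


eval1_avg = {model: average_half_up(scores) for model, scores in eval1_overall.items()}

for model, avg in eval1_avg.items():
    print(f"{model}: {avg}")
```

`Decimal` is used instead of floats because the GPT 5.4 mean is exactly 4.225, and binary floating point can round that half-way case either way.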

Run Log

Date	Eval	Action	Notes
2026-03-05	Eval 1	Run 1 complete, blind judged
2026-03-05	Eval 1	Run 2 complete, blind judged
2026-03-05	Eval 3	Run 1 complete, blind judged	GPT network failure on attempt 1, rerun required; Gemini fabricated source
2026-03-05	Eval 3	Gemini rerun, blind judged	Scored 1.8 — worse than first attempt (2.1); same fabrication pattern
2026-03-05	Supplementary	Model self-knowledge, blind judged	GPT 1st, Opus 2nd, Gemini 3rd (corrected from initial incomplete paste)
2026-03-05	Eval 6	Run 1 complete, dual-judged (Perplexity DR + Opus incognito)	Judges split on 1st: Perplexity→Opus, Opus→GPT. Both agree Gemini 3rd.
2026-03-05	Eval 6	Run 2 complete, dual-judged	GPT in auto mode collapsed to last. Judges split Opus/Gemini for 1st. Critical finding: thinking toggle is load-bearing.
2026-03-05	Eval 5	GPT 5.4 complete (56 min, Codex thinking mode)	Score: 3.8/5. Excellent infrastructure, poor edge case judgment. Mickey Mouse in DB.
2026-03-06	Eval 5	Gemini 3.1 complete (21 min, native coding env)	Score: 2.4/5. Crashed once, retried. Best frontend of the three. 48% file discovery, 0/15 obstacles fully caught, all statuses collapsed to "completed."
2026-03-06	Eval 5	Claude Opus 4.6 complete (15 min, Claude Code)	Score: 3.5/5. Cleanest architecture, verified idempotency. openpyxl not installed — judgment failure cost ~20% file coverage.

Methodology Notes

Each eval is run on all 3 models independently
Outputs labeled <1> <2> <3> and pasted into Opus 4.6 (incognito) for blind judging
Key: 1=Opus, 2=GPT, 3=Gemini (held by Jon, not revealed to judge)
Eval 6 used dual judges (Perplexity Deep Research + Opus incognito) for both runs
Eval 6 Run 2: GPT 5.4 switched from "thinking" to "auto" mode — produced the study's most significant finding (thinking toggle is load-bearing for epistemic tasks)
Recommended run order: 1, 3, 6 (fast baseline) → 5 (agentic)
3 runs recommended for subjective evals to measure variance
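
The blind-labeling step above can be sketched as a small helper. This is an illustration under assumptions, not the procedure actually used: `build_blind_bundle` is a hypothetical name, and the optional shuffle is an addition of this sketch (the tracker uses a fixed key of 1=Opus, 2=GPT, 3=Gemini).

```python
import random
from typing import Optional


def build_blind_bundle(
    outputs: dict[str, str], rng: Optional[random.Random] = None
) -> tuple[str, dict[int, str]]:
    """Return (text to paste into the judge, key mapping label -> model name).

    Labels follow dict order, so passing Opus, GPT, Gemini in that order
    reproduces the tracker's fixed key. Passing an rng shuffles the order,
    which is an assumption of this sketch rather than the study's method.
    """
    models = list(outputs)
    if rng is not None:
        rng.shuffle(models)
    key = dict(enumerate(models, start=1))
    bundle = "\n\n".join(f"<{label}>\n{outputs[model]}" for label, model in key.items())
    return bundle, key


bundle, key = build_blind_bundle(
    {"Opus": "First passage.", "GPT": "Second passage.", "Gemini": "Third passage."}
)
print(bundle)  # the judge sees only this text
print(key)     # held back from the judge
```

Only `bundle` goes to the judge. The model names appear solely in `key`, which stays with whoever holds it.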