Frontier Model Eval Tracker - March 6, 2026

Frontier Model Evaluation Tracker

Models Under Test:

  • Model 1 (Key: Opus 4.6) — Claude Opus 4.6
  • Model 2 (Key: GPT 5.4) — ChatGPT 5.4
  • Model 3 (Key: Gemini 3.1) — Gemini 3.1

Blind Judge: Claude Opus 4.6 (incognito mode)

Blind Label Format: <1> <2> <3> — judge sees numbers only, key held by Nate
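A minimal sketch of how blind labels like these could be assigned. The tracker does not say whether labels were shuffled per run or fixed, so the shuffle, the `seed` parameter, and the function name `assign_blind_labels` are all assumptions for illustration.

```python
import random

def assign_blind_labels(models, seed=None):
    """Return a key mapping '<n>' labels to model names (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    # The key stays with the operator; the judge only ever sees the labels.
    return {f"<{i}>": name for i, name in enumerate(shuffled, start=1)}

models = ["Opus 4.6", "GPT 5.4", "Gemini 3.1"]
key = assign_blind_labels(models, seed=7)
judge_view = sorted(key)  # the only thing passed to the judge
print(judge_view)
```

Keeping the key outside the judge's context is the point of the design: scores come back attached to labels and are mapped to models only afterwards.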


Eval 1: Stylistic Writing Fidelity (Wodehouse)

Status: Complete (2 runs, blind judged)

Scoring Dimensions (each 1–5)

  • Diction & register
  • Sentence architecture
  • Comic mechanisms
  • Tonal consistency
  • Originality
  • Pacing & structure (added by judge — not in original spec but consistent across both runs)

Run 1

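| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
| --- | --- | --- | --- |
| Diction & register | 4.5 | 4.0 | 3.5 |
| Sentence architecture | 4.0 | 3.5 | 3.0 |
| Comic mechanisms | 4.5 | 4.5 | 3.5 |
| Tonal consistency | 4.5 | 4.5 | 3.5 |
| Originality | 5.0 | 4.5 | 4.5 |
| Pacing & structure | 3.5 | 4.5 | 4.0 |
| Overall (judge) | 4.33 | 4.25 | 3.67 |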

Judge notes (Run 1):

Opus has the best ear — diction and tonal control most consistently Wodehousian; opening sentence is "the single finest piece of pastiche in the entire set." GPT has strongest pacing and densest simile count; only passage with a genuine callback joke ("superior brand of broth" riff). Gap between 1st and 2nd is narrow — trade-off between voice fidelity (Opus) and scene construction (GPT). Gemini reads like someone who knows what Wodehouse stories are about more than how they sound — prose too direct, tonal drift toward adventure-comedy and genuine sentiment, anachronisms in diction ("neon paint," military-extraction language).

Run 2

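| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
| --- | --- | --- | --- |
| Diction & register | 4.0 | 4.0 | 3.0 |
| Sentence architecture | 4.0 | 3.0 | 3.0 |
| Comic mechanisms | 4.0 | 4.0 | 4.0 |
| Tonal consistency | 5.0 | 4.0 | 3.0 |
| Originality | 5.0 | 5.0 | 5.0 |
| Pacing & structure | 4.0 | 5.0 | 4.0 |
| Overall (judge) | 4.3 | 4.2 | 3.7 |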

Judge notes (Run 2):

Opus sustains narrator voice most convincingly across full length — "thinks in long, decorated sentences," which is more important to the Wodehouse effect than any individual joke. GPT has superior plot structure and the single best Wodehouse-approximating sentence (bishop/luggage simile), but dialogue-heavy approach sacrifices narrative embroidery. Gemini has the single most spectacular simile of all three (the religious sheep) but oscillates between inspired flights and flat modern-feeling connective tissue — "pastiche's seams are visible."

Eval 1 Summary

Rankings (both runs): Opus 4.6 > GPT 5.4 > Gemini 3.1

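| Model | Run 1 | Run 2 | Average | Spread (abs. difference between runs) |
| --- | --- | --- | --- | --- |
| Opus 4.6 | 4.33 | 4.30 | 4.32 | 0.03 |
| GPT 5.4 | 4.25 | 4.20 | 4.23 | 0.05 |
| Gemini 3.1 | 3.67 | 3.70 | 3.69 | 0.03 |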

Consistent patterns across runs:

  • Opus wins on voice inhabitation (tonal consistency, diction) — the judge describes it as the voice living in the prose vs. being applied to it
  • GPT wins on structure and pacing both times — tightest comedic architecture, best scene construction
  • Gemini consistently weakest on diction/register and tonal consistency — anachronisms, modern phrasing leaking in
  • All three models score high on originality (4.5–5.0) — no recycled Wodehouse detected
  • Opus's weak spot is pacing/structure (3.5 in Run 1); GPT's is sentence architecture (3.0–3.5); Gemini's weaknesses are broad
  • Low variance across runs for all models — rankings are stable
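For reference, the averages and the last column of the summary table can be reproduced as below. The last column matches the absolute difference between the two runs, not a statistical variance. The scores are copied from the table; the names `runs` and `summarize` are just for this sketch.

```python
from statistics import mean

# Judge overall scores copied from the Eval 1 summary table: (Run 1, Run 2).
runs = {
    "Opus 4.6": (4.33, 4.30),
    "GPT 5.4": (4.25, 4.20),
    "Gemini 3.1": (3.67, 3.70),
}

def summarize(scores):
    """Return (average, run-to-run spread) for one model."""
    return mean(scores), max(scores) - min(scores)

summary = {model: summarize(scores) for model, scores in runs.items()}
ranking = sorted(summary, key=lambda m: summary[m][0], reverse=True)

for model in ranking:
    avg, spread = summary[model]
    print(f"{model}: average {avg:.3f}, spread {spread:.2f}")
```

With only two runs per model, the spread is a rough stability signal at best; more runs would be needed before calling it a variance estimate.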

Eval 3: Verbal Creativity Under Constraint (Pun Improvement)

Status: Complete (1 run, blind judged)

Notes

  • Requires retrieval of a real economics report
  • Tests humor comprehension — hard to bluff
  • GPT 5.4 lost network connection on first attempt; had to rerun (still very slow on attempt 2)
  • Opus finished significantly faster than both other models (consistent with Eval 1 speed gap)
  • Gemini provided a 404 link; actual report existed but required manual Google search to find — judge could not verify

Run 1

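| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
| --- | --- | --- | --- |
| Source validity | 5.0 | 5.0 | 1.5 |
| Pun detection | 5.0 | 3.5 | 1.5 |
| Explanation quality | 5.0 | 3.5 | 2.0 |
| Improved pun | 3.5 | 3.0 | 2.5 |
| Constraint adherence | 4.5 | 4.5 | 3.0 |
| Overall (judge) | 4.6 | 3.9 | 2.1 |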

Judge notes (Run 1):

Opus operating on a "fundamentally different level" — found a triple-layered pun ("DOGE-y austerity") in a verified J.P. Morgan deck and dissected it with real precision across three independent semantic layers. Demonstrates understanding of incongruity, register, and editorial subtext. Only weakness: rewrite tries too hard, overshoots the original's elegance.

GPT did the job honestly — real report (Bank of America Institute), real pun ("eat into" + food prices), correct explanation. But the pun is a common idiom, not a notable find. Rewrite bloated a clean headline. Competent but unremarkable.

Gemini fabricated the report title ("2025 Outlook: The Soft Landing's Last Mile" — actual Goldman report was "Tailwinds (Probably) Trump Tariffs"). URL was a description, not a link. Likely invented the quote it analyzed. Confused extended metaphor with pun. Never delivered a complete rewritten sentence. This is the confident incorrectness failure mode — real author name, plausible date, fake everything else.

Eval 3 Summary

Rankings: Opus 4.6 >> GPT 5.4 >> Gemini 3.1

The gap between 1st and 2nd (0.7 points) is larger than in Eval 1 (0.09 points). The gap between 2nd and 3rd (1.8 points) is massive and driven by source fabrication. This eval cleanly separated three tiers: genuine comprehension (Opus), competent execution (GPT), simulated execution (Gemini).

Retrieval as differentiator: This is the first eval requiring real-world retrieval, and it produced the widest spread so far. Opus found the best source fast, GPT found a real but unremarkable source slowly, Gemini fabricated a source. The agentic retrieval dimension may continue to be the biggest differentiator in remaining evals.

Gemini 3.1 — Second Run (rerun due to source fabrication)

Result: 1.8/5 — worse than first attempt (2.1)

Same structural failure, now with an additional red flag: model explicitly stated "I cannot browse the live web" and labeled its URL as "representative," then proceeded as if the fabricated source were real. Invented a different fake title ("The 2025 Outlook: Reaching Equilibrium") from the first attempt ("The Soft Landing's Last Mile"). Goldman's actual report: "Tailwinds (Probably) Trump Tariffs."

| Dimension | Run 1 | Run 2 |
| --- | --- | --- |
| Source validity | 1.5 | 1.0 |
| Pun detection | 1.5 | 1.5 |
| Explanation quality | 2.0 | 2.5 |
| Improved pun | 2.5 | 2.0 |
| Constraint adherence | 3.0 | 2.0 |
| Overall | 2.1 | 1.8 |

Judge's structural diagnosis (consistent across both runs):

  1. Can't distinguish "finding a report" from "simulating having found a report" — constructs plausible citations by recombining real elements (Hatzius, GS URL structure) into fictional composites
  2. Treats all figurative economic language as inherently humorous — selects dead metaphors and argues their metaphorical nature is funny
  3. Compensates with its own humor — model's asides ("Arctic tundra in the HR department," "a pair of shoes put in the freezer") are funnier than anything it selected or rewrote. Can generate humor but can't identify it in the wild.

Note: Jon was eventually able to locate the actual Goldman report Gemini was attempting to reference via extensive trial-and-error Google searching, but the model's own links were non-functional.


Supplementary Eval: Model Self-Knowledge (Frontier Model Listing)

Status: Complete (1 run, blind judged)

Methodology note: This is not a formal eval from the suite. It's a practical test Jon runs as a personal sanity check — ask each model to list current models from the top 3 frontier providers (Anthropic, OpenAI, Google) and describe what they're good at. Not methodologically rigorous, results are hit-or-miss regardless of model, but captures a real operational pain point: models giving wrong model names and outdated info, especially in coding contexts.

Results

| Model | Verdict | Accuracy | Notes |
| --- | --- | --- | --- |
| GPT 5.4 | PASS (strongest) | ~90% | Most comprehensive — covered text, coding, media, and open-weight models. Minor imprecisions (GPT-4.1 retirement grey area, "Sora 2 Pro" conflation, Lyria version). Caught media models that others missed entirely. |
| Opus 4.6 | PASS (with errors) | ~75-80% | Good on core models. Caught GPT-5.4 same-day launch. But factual errors on OpenAI timeline (wrong 4o retirement date, listed GPT-5.2 as default when 5.3 Instant had replaced it, cited old Codex version). Omitted entire model categories. |
| Gemini 3.1 | PASS (with errors) | ~65-70% | Listed core models correctly for Anthropic and Google. OpenAI section listed two deprecated models as current (GPT-4.5, o3/o4-mini), missed GPT-5.4 entirely (launched same day). Least comprehensive of the three. Editorial framing (philosophy-first, models second) came at the cost of specificity. |

Correction note: Gemini's output was initially pasted without its summary tables, which made it appear to list zero models. With full output, it's a legitimate answer — just the weakest of the three. Rankings unchanged.

Takeaway

First eval where GPT leads. Self-knowledge about the competitive landscape may reflect training recency or better web retrieval for model-specific queries. All three models passed but with different failure profiles: GPT was most comprehensive with minor imprecisions, Opus caught a same-day launch but had timeline errors, Gemini listed deprecated models as current and had the most significant omission (GPT-5.4). None achieved 100% accuracy — model self-knowledge remains unreliable across the board.


Eval 5: Agentic Problem-Solving Under Adversity (Schema Migration — "Shoebox Full of Receipts")

Status: Complete for all three models (GPT 5.4, Gemini 3.1, Claude Opus 4.6)

Modified Eval Design

The original spec (migrate between two clean databases) was replaced with a harder, more realistic version: a messy folder of ~465 files representing 2 years of business data from a fictional portable car wash ("Splash Bros Mobile Detailing"). CSVs, Excel, JSON, PDFs, VCF contacts, handwritten receipt images (AI-generated), a corrupted JSON backup, a multi-tab everything-spreadsheet. 11 planted obstacles documented in an OBSTACLE_KEY.md.

Each model receives an identical copy of the folder and must: inventory it, design a schema, build a clean SQLite database, create a migration report, build a review frontend, and write design documentation.

Environments: Claude Code (terminal), GPT 5.4 Codex (thinking mode), Gemini (native coding environment)

Judge: Claude Opus 4.6 (separate instance, consistent across all models)

GPT 5.4 Results (Codex, thinking mode)

Completion time: 56 minutes

Frontend quality note: Technically hit all spec requirements (searchable, stats, flagged items, customer detail) but practically borderline useless — 394 flagged items in a flat list with no categorization/filtering/priority, source records displayed as raw JSON dumps. A data viewer, not a review tool.

| Dimension | Score (1-5) | Notes |
| --- | --- | --- |
| Completion | 5 | All 6 deliverables present: DB, migration script (4,050 lines Python), migration report (11,452 lines), frontend (HTML+JS+CSS), DESIGN.md, frontend data |
| File discovery | 5 | 461/465 files (99.1%). Processed images, corrupted JSON, VCF, multi-tab Excel, PDFs. Only skipped .DS_Store, credentials, blank template. |
| Obstacle detection | 3 | 6 caught, 3 partially caught, 2 missed out of 11. Ghost records (Mickey Mouse, Test Customer, Asdf Asdf) completely missed — trivial pattern match, zero detection. Department codes silently dropped. |
| Database quality | 4 | 30 normalized tables, FK check clean, price history modeled, provenance tracking with SHA256 hashes. Dinged for: 278 customers (expected ~176 after dedup), 13 distinct status values (should be 4-5). |
| Fuzzy matching accuracy | 3 | 12/13 planted typos matched correctly. 1 false match (ORD-0050 "Czarecki" → Sara Mercado instead of Jeffrey Czarnecki). 5/7 planted duplicate customers merged; missed Kowalski and Burke (different contact info). |
| Idempotency | 3 | Delete-and-rebuild approach (sound but brute-force). Couldn't empirically verify — source data absent from test directory. |
| Migration report | 4 | Exhaustively detailed (10,610 merge decisions logged) but lacks executive summary. 11,452 lines is thorough but overwhelming. |
| Frontend | 4 | Functional: searchable customer table, flagged items, customer detail with source records. But practically weak: no flag categorization/filtering, raw JSON for source records, no action workflow. |
| Documentation (DESIGN.md) | 3 | Concise (31 lines), honest about limitations. But doesn't mention ghost records, department codes, or customer count inflation. |
| Edge case discovery | 4 | Found unprompted: image duplicates beyond planted ones, vehicle photo placeholders as duplicates, price mismatches between catalog and actuals, stale contact preservation. |
| Overall | 3.8 | |
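To make the ORD-0050 miss concrete, here is a minimal fuzzy-matching sketch using the standard library's `difflib`. The tracker does not say which matcher any model used. The surname-only comparison, the 0.85 threshold, and the helper name `best_customer_match` are assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_customer_match(raw_name, customers, min_ratio=0.85):
    """Return (customer, ratio) for the closest surname, or (None, ratio) if too weak."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Compare against surnames only; the typo in the order is a bare surname.
    scored = [(ratio(raw_name, c.split()[-1]), c) for c in customers]
    score, customer = max(scored)
    if score < min_ratio:
        return None, score  # leave for human review instead of guessing
    return customer, score

customers = ["Jeffrey Czarnecki", "Sara Mercado"]
match, score = best_customer_match("Czarecki", customers)
print(match, round(score, 3))
```

Returning `None` below the threshold is the important part: a weak match becomes a flagged item rather than a silent wrong merge.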

Obstacle detection detail:

| Obstacle | Verdict |
| --- | --- |
| LastName, FirstName format (5 users) | CAUGHT |
| SKU conflict (SVC-007) | CAUGHT |
| Orphaned order (ORD-1003) | PARTIALLY — created new customer instead of flagging |
| Name typos (13 orders) | PARTIALLY — 12/13 correct, 1 false match |
| Department/role codes | MISSED — silently dropped |
| Corrupted JSON | CAUGHT — recovered 184 records before truncation |
| Price discrepancies | CAUGHT — proper price history with effective dates |
| Duplicate customers (7 planted) | PARTIALLY — 5/7 merged, missed Kowalski + Burke |
| Ghost/test records | MISSED — Mickey Mouse, Test Customer ($25K order), Asdf Asdf all in DB as real customers |
| Date format normalization | CAUGHT — all dates normalized to YYYY-MM-DD |
| Duplicate receipt images (4 pairs) | CAUGHT — all detected, tracked via duplicate_of_source_file_id |
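One simple way duplicate files can be detected is by content hash. This is a sketch only: whether the planted image pairs were byte-identical is not stated here, and near-duplicates (re-encoded or resized images) would need perceptual hashing instead. File names and bytes below are invented.

```python
import hashlib

def find_duplicate_blobs(files):
    """files: mapping of file name -> bytes. Returns {duplicate_name: first_seen_name}."""
    seen = {}
    duplicate_of = {}
    for name in sorted(files):  # sorted so the "first seen" choice is deterministic
        digest = hashlib.sha256(files[name]).hexdigest()
        if digest in seen:
            duplicate_of[name] = seen[digest]
        else:
            seen[digest] = name
    return duplicate_of

# Invented sample data, not files from the eval folder.
receipts = {
    "receipt_001.jpg": b"\xff\xd8 fake image bytes A",
    "receipt_002.jpg": b"\xff\xd8 fake image bytes B",
    "IMG_4410.jpg": b"\xff\xd8 fake image bytes A",  # same bytes as receipt_001.jpg
}
dupes = find_duplicate_blobs(receipts)
print(dupes)
```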

Key pattern: Excellent infrastructure, poor judgment. The pipeline is sophisticated (30 tables, SHA256 hashes, OCR overrides, 4,050 lines of code) but it doesn't catch things a human would spot instantly. A $25,000 car wash order from "Test Customer" passed through unquestioned.
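The "trivial pattern match" that would have caught these could look something like the sketch below. The name patterns and the 2,000 threshold are assumptions chosen for illustration, and every order total except the $25K one is invented.

```python
import re

# Hypothetical heuristics; patterns and threshold are assumptions, not from the eval.
PLACEHOLDER_NAME = re.compile(r"\b(test|asdf|qwerty|mickey mouse)\b", re.IGNORECASE)

def ghost_reasons(name, order_total=None, max_plausible_total=2000.0):
    """Return reasons a record looks like test data (empty list if none)."""
    reasons = []
    if PLACEHOLDER_NAME.search(name):
        reasons.append("placeholder-looking name")
    if order_total is not None and order_total > max_plausible_total:
        reasons.append(f"order total {order_total:,.2f} exceeds {max_plausible_total:,.2f}")
    return reasons

records = [
    ("Mickey Mouse", 45.0),        # invented total
    ("Test Customer", 25000.0),    # the $25K order noted above
    ("Asdf Asdf", 60.0),           # invented total
    ("Jeffrey Czarnecki", 120.0),  # invented total
]

flagged = {}
for name, total in records:
    reasons = ghost_reasons(name, total)
    if reasons:
        flagged[name] = reasons
print(flagged)
```

A check like this should flag for review rather than delete, since a real customer could plausibly match a pattern.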

Gemini 3.1 Results (native coding environment)

Completion time: 21 minutes (after one crash and retry)

Pre-crash observation: Before failing, Gemini was writing its own XLSX parser from scratch — "Writing parser for invoices (PDF/XLSX) and receipts (Images, CSV, TXT) including OCR logic." It did not reach for openpyxl or any standard library. This is the inverse of Claude's failure: Claude knew the library existed and didn't install it (passivity); Gemini didn't reach for the library at all and started reinventing the wheel (false self-sufficiency). Both are wrong, but in meaningfully different ways.

| Dimension | Score (1-5) | Notes |
| --- | --- | --- |
| Completion | 4 | All deliverables present: DB (8 tables), 9 Python files, MIGRATION_REPORT.md, DESIGN.md, frontend (dark-themed dashboard) |
| File discovery | 2 | 222/463 files (48%). Critical misses: entire JSON backup (detected but skipped), mega spreadsheet, 2025 invoices, entire UNSORTED folder (~100 files), most images |
| Obstacle detection | 1 | 0/15 fully caught, 8/15 partially caught, 7/15 missed (counts per the obstacle detail table). Many "partial" catches were incidental — obstacles avoided because source files were skipped, not because they were detected |
| Database quality | 2 | Good schema design (customer_merges, price override). But only 162 of ~1000 expected jobs, all statuses = "completed", payment methods oversimplified, 1 unmerged duplicate (Elizabeth Chen), 1 ghost record (Mickey Mouse), FKs not enforced |
| Fuzzy matching | 3 | Merges that happened were reasonable (Bob→Robert, Jess→Jessica). But multiple entries show original_name = "Residential" (grabbed wrong Excel column). Elizabeth Chen split into 2 records. |
| Idempotency | 2 | FAILED — flagged_records table gets duplicates on re-run. data.json and DB counts diverge (933 vs 1203 payments, 5 vs 9 flagged items) |
| Migration report | 2 | Exists but too sparse for human audit. No per-file breakdown, no specific merge decisions, no honest accounting of skipped data |
| Frontend | 4 | Best frontend of the three. Dark-themed dashboard, working search, customer detail modals with merge lineage, confidence badges. Data discrepancy with DB is a concern. |
| Documentation | 3 | Schema decisions explained well. Doesn't acknowledge 52% file skip rate, ~16% job import rate, or status/payment normalization failures |
| Edge case discovery | 1 | No novel edge cases. Did not flag passwords.txt (security risk), empty employees table opportunity, or photo metadata cross-references |
| Overall | 2.4 | |

Key pattern: Built solid-looking infrastructure but failed on execution. Processed less than half the source files, imported only ~16% of expected job records, caught zero obstacles fully. The "give up on first error" approach to the JSON backup and UNSORTED folder was the single biggest failure — chose data loss over partial recovery. Result: clean-looking, severely incomplete database. The best frontend of the three models, which is very on-brand for Gemini.
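For contrast with skipping the backup outright, here is a minimal sketch of partial recovery from a truncated JSON file. It assumes the backup is a top-level array of objects cut off mid-record; the real file's structure is not described here, and the sample data is invented.

```python
import json

def recover_records(text):
    """Parse as many complete elements as possible from a truncated JSON array."""
    decoder = json.JSONDecoder()
    records = []
    pos = text.find("[") + 1
    while pos < len(text):
        # Skip whitespace and commas between elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # reached the truncated tail; keep what was parsed so far
        records.append(obj)
    return records

# Invented sample: the third record is cut off mid-key.
truncated = '[{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "na'
recovered = recover_records(truncated)
print(len(recovered), "records recovered")
```

The recovered count and the point of truncation are worth logging in the migration report so a reviewer can see exactly what was lost.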

Ghost record status: Mickey Mouse survived into production. Test Customer and Asdf Asdf were only excluded because the JSON backup was skipped entirely — lucky avoidance, not detection. Same shared blind spot as GPT and Claude.

Obstacle detection detail:

| Obstacle | Verdict |
| --- | --- |
| LastName, FirstName format (5 customers) | PARTIALLY — 3/5 normalized; Chen and Duffy still inverted |
| SKU conflict (SVC-007) | MISSED — source file never processed |
| Orphaned order (ORD-1003) | MISSED — source files skipped |
| Name typos (13 orders) | MISSED — source files skipped |
| Department/role codes | MISSED — JSON skipped entirely |
| Corrupted JSON | PARTIALLY — detected but skipped instead of recovering valid portion |
| Price discrepancies | PARTIALLY — schema accommodates it but only 2025 prices loaded |
| Duplicate customers (7) | PARTIALLY — 5/7 merged; Elizabeth Chen split; old-vs-new conflict never tested |
| Ghost/test records | MISSED — Mickey Mouse in DB; others excluded by luck only |
| Date format normalization | PARTIALLY — consistent output, hard cases never encountered |
| Duplicate receipt images | MISSED — most images not processed; no dedup logic |
| Service name chaos | PARTIALLY — canonical list built; limited by files processed |
| Status value chaos | MISSED — all statuses collapsed to "completed" |
| Payment method inconsistency | PARTIALLY — over-simplified; Venmo/Zelle/Square lost |
| Customer name variations | PARTIALLY — 2/10 confirmed (Bob→Robert, Jess→Jessica) |

Claude Opus 4.6 Results (Claude Code, terminal)

Completion time: 15 minutes

Frontend quality note: Functional tabbed interface (Customers, Flagged Items, Recent Jobs, Revenue). Uglier than GPT's but more usable — tabbed navigation vs infinite scroll. Requires local server to run. Searchable, clickable customer detail with source records and confidence bars.

| Dimension | Score (1-5) | Notes |
| --- | --- | --- |
| Completion | 5 | All deliverables: DB (13 tables), migration script (1,800 lines Python), migration report, frontend, DESIGN.md (156 lines with ER diagram), export script |
| File discovery | 3 | All files discovered/cataloged but 7 XLSX files not parsed (no openpyxl), 11+ images not processed, 162 PDFs skipped. ~75% of meaningful data extracted. Critical miss: updated clients.xlsx and spreadsheet_everything.xlsx. |
| Obstacle detection | 3 | 9 caught, 5 partially caught, 2 missed out of 16 categories. Strong on text-based issues. Weak on image-based and Excel-dependent obstacles. |
| Database quality | 4 | 13-table schema, well-normalized, price history with eras, customer audit trail. Dinged for: FK enforcement off (PRAGMA), 2 ghost records, 2-3 fragment duplicates. |
| Fuzzy matching accuracy | 4 | 18 merges documented, 15+ correct. All 7 planted duplicate customers identified. No false-positive merges. Missed Czarecki→Czarnecki and Jay Kocher variant. |
| Idempotency | 5 | Verified clean. Ran twice, identical counts (194 customers, 1,462 jobs, 279 payments). (source_file, source_record_id) constraint working. |
| Migration report | 4 | Per-file inventory, 18 dedup decisions with confidence scores, 8 conflicts documented, 19 flagged items. Missing: orphan detection, ghost record flagging, expected-vs-actual totals. |
| Frontend | 4 | Tabbed interface (Customers/Flagged/Jobs/Revenue), searchable, clickable customer detail with source records and confidence bars. Needs local server. No resolve/export features. |
| Documentation (DESIGN.md) | 4 | 156 lines with ER diagram, table descriptions, 8 data quality challenges with solutions, honest "what couldn't be resolved" section (5 items), 10 edge cases discovered. |
| Edge case discovery | 3 | Found 10 edge cases including trailing spaces, Tomas timeline, transaction-only customers. Missed department codes, image duplicates, 2/3 ghost records. |
| Overall | 3.5 | |
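The idempotency row credits a (source_file, source_record_id) constraint. A minimal SQLite sketch of that idea follows. Only the two key column names come from the notes above; the table name `jobs`, the `amount` column, and the sample rows are hypothetical.

```python
import sqlite3

def migrate(conn, rows):
    """Load rows keyed by provenance; re-running with the same rows adds nothing."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               source_file TEXT NOT NULL,
               source_record_id TEXT NOT NULL,
               amount REAL,
               UNIQUE (source_file, source_record_id)
           )"""
    )
    # INSERT OR IGNORE skips rows whose provenance key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (source_file, source_record_id, amount) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

conn = sqlite3.connect(":memory:")
rows = [("jobs_2024.csv", "1", 45.0), ("jobs_2024.csv", "2", 60.0)]  # invented rows
first = migrate(conn, rows)
second = migrate(conn, rows)  # re-run with identical inputs
print(first, second)
```

Comparing row counts across two runs, as the judge did, is a quick empirical check that the constraint is doing its job.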

Obstacle detection detail:

| Obstacle | Verdict |
| --- | --- |
| LastName, FirstName format (5 users) | CAUGHT (2 fragment records leaked) |
| SKU conflict (SVC-007) | CAUGHT — explicitly documented |
| Orphaned order (ORD-1003) | PARTIALLY — created customer from transaction, not flagged |
| Name typos (13 orders) | PARTIALLY — 11/13 caught, Czarecki created duplicate |
| Department/role codes | MISSED — no department field in schema |
| Corrupted JSON | CAUGHT — regex extraction, cross-validated |
| Price discrepancies | CAUGHT — service_prices with 2024/2025 eras |
| Duplicate customers (7 planted) | CAUGHT — all 7 found and flagged |
| Ghost/test records (3) | PARTIALLY — 1/3 excluded (Test Customer); Mickey Mouse + Asdf Asdf in DB |
| Date format normalization | CAUGHT — zero null dates, multi-format parser |
| Duplicate receipt images (4 pairs) | MISSED — no image processing capability |
| Service name chaos (60+ variants) | CAUGHT — mapped to 18 canonical services |
| Status value chaos (12+ variants) | CAUGHT — normalized to 6 clean values |
| Payment method inconsistency | CAUGHT — normalized to 6 clean values |
| Name variations (10 customers) | PARTIALLY — 9/10 variant sets merged |
| Missing data patterns | PARTIALLY — documented some, missed Excel-only patterns |
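A multi-format date parser of the kind credited in the date normalization row might look roughly like this. The candidate formats are assumptions (the notes don't list which formats appear in the folder), and slash dates are read US-style, month first, which is itself a judgment call.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; ambiguous slash dates are read as month/day.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y")

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # caller should flag the record rather than guess

samples = ["2024-03-05", "03/05/2024", "05-Mar-2024", "March 5, 2024", "sometime in spring"]
normalized = [normalize_date(s) for s in samples]
print(normalized)
```

Returning `None` for unparseable input keeps bad dates visible as flagged items instead of silently becoming nulls or wrong dates.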

Key pattern: Tight engineering, limited reach. The architecture is cleaner than GPT's (13 focused tables vs 30, 6 status values vs 13, verified idempotency, 19 actionable flags vs 394 noise). But the inability to parse Excel files cascaded into missed data and obstacles. If openpyxl had been available, file discovery jumps to ~95% and several PARTIALLY scores become CAUGHT.

Behavioral note: This is not a dependency issue — it's a judgment failure. Claude Code had full terminal access. pip install openpyxl is a 3-second fix that any competent engineer would execute the moment they hit an import error. Instead, it silently skipped the XLSX files and moved on. GPT having openpyxl pre-installed doesn't reflect better reasoning on GPT's part — it just means GPT never had to make the call. Claude did, and didn't make it. That's a ding that belongs on the model, not the environment.

Head-to-Head: GPT 5.4 vs Claude Opus 4.6

| Dimension | GPT 5.4 | Claude Opus | Edge |
| --- | --- | --- | --- |
| Completion time | 56 min | 15 min | Claude (3.7x faster) |
| File discovery | 99.1% | ~75% | GPT |
| Customer count (expected ~176) | 278 (inflated) | 194 (close) | Claude |
| Status normalization | 13 values | 6 values | Claude |
| Flagged items | 394 (noise) | 19 (actionable) | Claude |
| Idempotency | Untested | Verified clean | Claude |
| Duplicate customer merges | 5/7 | 7/7 found | Claude |
| Image processing | OCR overrides (all 10) | None (all flagged) | GPT |
| Duplicate image detection | All 4 pairs caught | None | GPT |
| Ghost record detection | 0/3 | 1/3 | Claude (barely) |
| Schema tables | 30 | 13 | Tradeoff (GPT broader, Claude cleaner) |
| Code volume | 4,050 lines | 1,800 lines | Tradeoff |
| DESIGN.md | 31 lines | 156 lines | Claude |
| Migration report | 11,452 lines | Concise | Tradeoff (GPT exhaustive, Claude readable) |
| Overall score | 3.8 | 3.5 | GPT (by 0.3) |

The 0.3 gap is almost entirely explained by the openpyxl dependency. Claude's architecture is arguably better-engineered (cleaner schema, verified idempotency, better status normalization, more accurate customer count, actionable vs noisy flags). GPT wins on data coverage because it had access to a critical library. Both share the same blind spots: ghost records (Mickey Mouse), department codes, orphaned order handling.


Eval 6: Epistemic Calibration & Graceful Uncertainty

Status: Run 1 complete (dual-judged); Run 2 complete (dual-judged)

Notes

  • Tests confident incorrectness — the most dangerous failure mode
  • Run 3x to measure variance (high variance = instability)
  • Questions span: easily verifiable, recent events, obscure facts, technical precision, philosophical
  • Two independent judges used for Run 1: Opus 4.6 (incognito) and Perplexity Deep Research
  • Judges agreed on 3rd place but disagreed on 1st vs 2nd — split reflects a real weighting question

Run 1 — Judge A: Perplexity Deep Research

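| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
| --- | --- | --- | --- |
| Factual accuracy | 4.0 | 4.5 | 4.5 |
| Calibration quality | 4.5 | 4.0 | 3.0 |
| Calibration consistency | 5.0 | 3.5 | 2.5 |
| Refusal quality | 4.5 | 4.5 | 4.5 |
| Self-reflection | 5.0 | 3.5 | 3.0 |
| Citation behavior | 3.5 | 4.0 | 3.0 |
| Overall | 4.25 | 4.00 | 3.40 |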

Ranking: Opus > GPT > Gemini

Key finding: "The decisive differentiator is not raw accuracy — all three models get most facts right. What separates them is whether each model knows what it doesn't know."

Run 1 — Judge B: Opus 4.6 (incognito)

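| Dimension | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
| --- | --- | --- | --- |
| Factual accuracy | 3.5 | 5.0 | 4.0 |
| Calibration quality | 3.5 | 5.0 | 3.0 |
| Calibration consistency | 5.0 | 4.0 | 2.5 |
| Refusal quality | 4.0 | 5.0 | 4.5 |
| Self-reflection | 5.0 | 4.0 | 3.0 |
| Citation behavior | 3.5 | 4.5 | 3.0 |
| Overall (out of 30) | 24.5 | 27.5 | 20.0 |
| Overall (normalized /5) | 4.08 | 4.58 | 3.33 |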

Ranking: GPT > Opus > Gemini

Key finding: "<2> dominated on accuracy — it nailed the exact PDG Higgs mass, retrieved the correct AAPL closing price, got the current matrix multiplication exponent. Every VERIFIED tag was earned."

Judge Agreement Matrix

| Dimension | Judges agree on winner? | Notes |
| --- | --- | --- |
| Factual accuracy | Yes — GPT | Both judges score GPT highest on raw correctness |
| Calibration quality | Split | Perplexity gives Opus edge; Opus judge gives GPT edge |
| Calibration consistency | Yes — Opus | Both judges score Opus highest (full tag range used) |
| Refusal quality | Roughly tied | All scores within 0.5 across judges |
| Self-reflection | Yes — Opus | Both judges call Opus's metacognition the strongest |
| Citation behavior | Yes — GPT | Both judges score GPT's citations higher |
| Overall winner | Split | Depends on whether you weight calibration or accuracy more |
| 3rd place | Yes — Gemini | Both judges agree, similar scores |

Run 1 — Core Findings

Both judges identified the same tradeoff but weighted it differently:

  • Opus used the full confidence tag range (VERIFIED, HIGH, MEDIUM, UNABLE) and had the strongest self-reflection — correctly predicted its own Q6 answer was stale. But it got Q6 wrong and couldn't retrieve the AAPL price.
  • GPT (thinking mode) got more facts right (exact Higgs mass, correct AAPL price, current matrix exponent) and every VERIFIED tag was earned. But it clustered 7-8 answers at VERIFIED, collapsing meaningful distinctions.
  • Gemini tagged 8-9 of 10 as VERIFIED including a wrong Higgs value (125.25 vs actual 125.20) and a misleading Databricks revenue figure. Reflection didn't catch its own errors. Both judges: "performed certainty without earning it."
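One way to quantify "clustering at VERIFIED" is to tabulate how often each confidence tag is used and how often answers under that tag are correct. The sketch below uses an invented answer sheet, not data from the eval; the function names are hypothetical, and accuracy for a tag like UNABLE is not meaningful in the same way as for the others.

```python
from collections import Counter

def tag_report(answers):
    """answers: list of (tag, is_correct). Returns {tag: (count, accuracy)}."""
    counts = Counter(tag for tag, _ in answers)
    correct = Counter(tag for tag, ok in answers if ok)
    return {tag: (n, correct[tag] / n) for tag, n in counts.items()}

def verified_share(answers):
    """Fraction of answers tagged VERIFIED; near 1.0 suggests a collapsed scale."""
    return sum(tag == "VERIFIED" for tag, _ in answers) / len(answers)

# Invented answer sheet for illustration only.
flat = [("VERIFIED", True)] * 7 + [("VERIFIED", False)] * 2 + [("UNABLE", False)]
report = tag_report(flat)
print(report, verified_share(flat))
```

A well-calibrated sheet would show accuracy falling as tags move from VERIFIED toward MEDIUM, with a VERIFIED accuracy close to 1.0.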

Run 2 — Judge A: Opus 4.6 (incognito)

NOTE: GPT 5.4 was run in "auto" mode (not "thinking") to test whether the thinking toggle matters.

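| Dimension | Opus 4.6 | GPT 5.4 (auto) | Gemini 3.1 |
| --- | --- | --- | --- |
| Factual accuracy | 3.5 | 2.5 | 4.5 |
| Calibration quality | 4.5 | 2.5 | 3.0 |
| Calibration consistency | 5.0 | 3.0 | 1.5 |
| Refusal quality | 5.0 | 3.5 | 3.5 |
| Self-reflection | 5.0 | 3.0 | 4.0 |
| Citation behavior | 3.5 | 2.0 | 3.5 |
| Overall | 4.25 | 2.75 | 3.50 |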

Ranking: Opus > Gemini > GPT

Run 2 — Judge B: Perplexity Deep Research

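| Dimension | Opus 4.6 | GPT 5.4 (auto) | Gemini 3.1 |
| --- | --- | --- | --- |
| Factual accuracy | 3.5 | 2.5 | 4.5 |
| Calibration quality | 4.5 | 2.0 | 3.5 |
| Calibration consistency | 5.0 | 2.5 | 2.0 |
| Refusal quality | 4.5 | 3.0 | 4.0 |
| Self-reflection | 5.0 | 3.0 | 4.0 |
| Citation behavior | 3.0 | 1.5 | 3.5 |
| Overall | 4.25 | 2.42 | 3.58 |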

Ranking: Gemini > Opus > GPT (Note: Perplexity ranked Gemini 1st on accuracy despite Opus having higher dimension average of 4.25 vs 3.58 — explicit judgment call that "getting the right answers matters most")

Run 2 — Judge Agreement

| Dimension | Judges agree? | Notes |
| --- | --- | --- |
| Factual accuracy | Yes — Gemini | Both score 4.5; only response to get all 10 right |
| Calibration quality | Yes — Opus | Both score Opus 4.5 |
| Calibration consistency | Yes — Opus | Both score Opus 5.0 |
| Refusal quality | Yes — Opus | Opus leads in both |
| Self-reflection | Yes — Opus | Both score 5.0 |
| 3rd place | Yes — GPT | Both judges agree; scores 2.42–2.75 |
| 1st place | Split | Same accuracy-vs-calibration split as Run 1 |

CRITICAL FINDING: GPT 5.4 Thinking Mode Toggle

| Dimension | Run 1 (thinking) | Run 2 (auto) | Delta |
| --- | --- | --- | --- |
| Factual accuracy | 4.5–5.0 | 2.5 | -2.0 to -2.5 |
| Calibration quality | 4.0–5.0 | 2.0–2.5 | -2.0 to -2.5 |
| Calibration consistency | 3.5–4.0 | 2.5–3.0 | -1.0 |
| Overall ranking | 1st or 2nd | Last | Collapsed |

What broke in auto mode:

  • Named 2024 Nobel winners (Acemoglu, Johnson, Robinson) for 2025 question — tagged MEDIUM
  • Cited matrix multiplication bound from 2020 (2.3728596) — two iterations behind current
  • Estimated Databricks at $1.6-2B — off by 3x from actual $4.8B ARR
  • No LOW tags used anywhere — couldn't signal strong uncertainty
  • Self-reflection noted Nobel "could be misremembered" but didn't downgrade the tag

Conclusion: The thinking toggle is load-bearing for GPT 5.4 on epistemic tasks. Auto mode doesn't just lose depth — it loses factual accuracy on questions that require retrieval or reasoning over knowledge boundaries.
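The deltas in the toggle table line up when computed per judge (same judge, thinking vs auto) rather than across the full score ranges. A small check, with GPT 5.4's scores copied from the four judge tables above:

```python
# GPT 5.4 scores per judge as (Run 1 thinking, Run 2 auto), copied from the judge tables.
scores = {
    "Perplexity": {
        "Factual accuracy": (4.5, 2.5),
        "Calibration quality": (4.0, 2.0),
        "Calibration consistency": (3.5, 2.5),
    },
    "Opus judge": {
        "Factual accuracy": (5.0, 2.5),
        "Calibration quality": (5.0, 2.5),
        "Calibration consistency": (4.0, 3.0),
    },
}

# Delta = auto score minus thinking score, for each judge and dimension.
deltas = {
    judge: {dim: auto - thinking for dim, (thinking, auto) in dims.items()}
    for judge, dims in scores.items()
}

for judge, dims in deltas.items():
    for dim, delta in dims.items():
        print(f"{judge:11s} {dim:24s} {delta:+.1f}")
```

Pairing by judge matters: the same judge scored both runs, so the per-judge delta isolates the mode change from judge-to-judge differences in strictness.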


Eval 6 Cross-Run Summary

Model consistency across runs:

| Model | Run 1 Range | Run 2 Range | Stable? |
| --- | --- | --- | --- |
| Opus 4.6 | 4.08–4.25 | 4.25–4.25 | Most consistent |
| GPT 5.4 | 4.00–4.58 (thinking) | 2.42–2.75 (auto) | Mode-dependent — collapsed in auto |
| Gemini 3.1 | 3.33–3.40 | 3.50–3.58 | Stable, slight improvement |

Persistent patterns across both runs:

  • Opus always has the best calibration consistency (5.0 both runs) and self-reflection (5.0 both runs)
  • Gemini's raw factual accuracy is strong when it has retrieval access (4.0–4.5 in Run 1; 4.5 from both judges in Run 2, the top score that run), but it always flattens confidence to near-binary VERIFIED/UNABLE
  • GPT's performance is highly mode-dependent — thinking mode competes with Opus; auto mode falls to last
  • Both judges consistently split on whether accuracy or calibration should determine 1st — this is a genuine philosophical disagreement, not noise
  • All judges across both runs agree: the model that "knows what it doesn't know" best is Opus; the model that "knows the most" varies by run

The philosophical question this eval surfaces: Is it better to know what you don't know (Opus) or to actually know it and prove it (GPT in thinking mode / Gemini)? Both judges articulated this as the central tension. A frontier model ideally combines Gemini's factual reach with Opus's epistemic humility.

The Mode Dependency Finding (Nate-safe version)

The practical takeaway from Eval 6 Run 1 vs Run 2, and speed patterns across all evals.

GPT 5.4's performance is highly mode-dependent — thinking mode and auto mode produce results so different they almost feel like separate products. In Eval 6, switching from thinking to auto caused GPT to name the 2024 Nobel winners for a 2025 question and cite a matrix multiplication bound from 2020. It dropped from 1st/2nd place to last. Same model, same questions, different mode.

That's worth understanding before you build workflows around it. What you're paying for may matter less than how you're using it. If your team is going to use GPT 5.4, they need to know that auto mode is a materially different — and weaker — experience than thinking mode. That's not a knock on the product. It's just how it works, and most users won't know to make the distinction.

The speed gap compounds this: thinking mode GPT took 56 minutes on Eval 5. Claude finished in 15. Gemini in 21. If thinking mode is required to get GPT's best performance, the latency cost is real and users need to factor it in.


Jon's Conspiracy Theory

Clearly labeled speculation. Not for attribution. May or may not be true. Almost certainly interesting.

The mode dependency finding raises a question I can't stop thinking about: what if thinking mode isn't GPT "thinking harder" — what if it's a scaffold wrapped around a weaker base model?

If thinking mode is essentially a retrieval + reasoning pipeline layered on top of the base model, that would explain everything: why it's slow (pipeline stages, not deeper cognition), why auto mode collapses (no scaffold = just the base model working from stale training data), and why OpenAI keeps shipping point releases (5.1 → 5.2 → 5.3 → 5.4) that feel incremental — they might be adding scaffold layers, not retraining the foundation.

The thing that makes this impossible to confirm or deny: OpenAI is the only major lab that hides actual thinking traces. Claude shows extended thinking. Gemini shows thinking. DeepSeek shows thinking. OpenAI shows a summary produced by a separate model. They say it's for security. But the practical effect is you cannot distinguish between "the model reasoning through a problem" and "a pipeline orchestrating retrieval calls and tool use behind an opaque wall." The latency would look identical from the outside. You'd just call it "thinking."

The Claude control case: when Opus runs through a skill, it takes 5-10x longer and you can see exactly why — reads a file, makes tool calls, iterates. Full transparency. If OpenAI is doing the same thing behind a curtain, you'd attribute the latency to intelligence rather than infrastructure.

What this would explain:

  • Why auto mode GPT worked from stale training data while thinking mode had current info (thinking mode has retrieval; auto mode doesn't)
  • Why the rumored "botched training run" before GPT-5 keeps circulating — if the base is weaker than expected, layering scaffolding on top becomes the product strategy
  • Why power users tend to drift back to Claude after the initial GPT hype — scaffolding produces impressive first impressions but doesn't compound with expertise the way a strong base model does

This is speculative. The eval data is consistent with it, but the opacity of the thinking traces makes it hard to confirm or rule out from the outside, which is itself worth noting.

Tests that would strengthen or weaken the theory:

  • Run Eval 5 with GPT in both modes — does agentic coding show the same mode dependency?
  • Run identical prompts on GPT 5.1 vs 5.4 in auto mode — if outputs are indistinguishable, "same base model" theory gets stronger
  • Check whether GPT thinking mode retrieval happens during the thinking phase or before it — that would distinguish "model reasoning" from "pipeline orchestration"
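For the second test above, "indistinguishable" needs some working definition. One crude option, sketched here as an assumption rather than an agreed method, is a surface-similarity score between paired outputs using Python's standard `difflib`. The function names and the example strings are hypothetical placeholders, not real model outputs.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1] after lowercasing and collapsing whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    # autojunk=False disables the popularity heuristic that can skew long texts.
    return SequenceMatcher(None, norm_a, norm_b, autojunk=False).ratio()


def mean_similarity(pairs) -> float:
    """Average similarity over (output_a, output_b) pairs for the same prompts."""
    scores = [similarity(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Placeholder strings for illustration only, not real model outputs.
example_pairs = [
    ("The answer is 42.", "the answer  is 42."),
    ("Paris is the capital of France.", "Berlin is the capital of Germany."),
]
print(f"mean similarity: {mean_similarity(example_pairs):.2f}")
```

A score like this only means something against a baseline: the same model run twice on the same prompt will also differ because of sampling, so the 5.1 vs 5.4 number would need to be compared with a 5.4 vs 5.4 number before drawing any conclusion.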

Overall Leaderboard

| Eval | Opus 4.6 | GPT 5.4 | Gemini 3.1 |
|---|---|---|---|
| 1. Wodehouse (avg) | 4.32 | 4.23 | 3.69 |
| 3. Pun Improvement | 4.60 | 3.90 | 2.10 |
| S. Model Self-Knowledge | ~78% | ~90% | ~68% |
| 5. Schema Migration | 3.5 | 3.8 (thinking) | 2.4 |
| 6. Calibration R1 (avg) | 4.17 | 4.29 (thinking) | 3.37 |
| 6. Calibration R2 (avg) | 4.25 | 2.59 (auto) | 3.54 |
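The Wodehouse row appears to be the simple mean of the two judge "Overall" scores from Eval 1, rounded half up. That is an inference from the numbers, not something the tracker states. The sketch below reproduces the row under that assumption and computes GPT's drop between the two Eval 6 runs. `Decimal` is used so that values like 4.315 round the way a person would round them by hand; the helper name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Judge "Overall" scores from the Eval 1 tables (Run 1, Run 2).
wodehouse_overall = {
    "Opus 4.6": ("4.33", "4.3"),
    "GPT 5.4": ("4.25", "4.2"),
    "Gemini 3.1": ("3.67", "3.7"),
}


def mean_half_up(scores, places="0.01"):
    """Mean of decimal strings, rounded half up to the given precision."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal(places), rounding=ROUND_HALF_UP)


averages = {model: mean_half_up(runs) for model, runs in wodehouse_overall.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")

# GPT 5.4 on Eval 6: Run 1 (thinking) vs Run 2 (auto), from the leaderboard.
gpt_mode_drop = Decimal("4.29") - Decimal("2.59")
print(f"GPT 5.4 Eval 6 drop, thinking to auto: {gpt_mode_drop}")
```

Under the assumed rounding rule the averages come out to 4.32, 4.23, and 3.69, matching the leaderboard, and the thinking-to-auto drop is 1.70 points on a 5-point scale.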

Run Log

| Date | Eval | Action | Notes |
|---|---|---|---|
| 2026-03-05 | Eval 1 | Run 1 complete, blind judged | |
| 2026-03-05 | Eval 1 | Run 2 complete, blind judged | |
| 2026-03-05 | Eval 3 | Run 1 complete, blind judged | GPT network failure on attempt 1, rerun required; Gemini fabricated source |
| 2026-03-05 | Eval 3 | Gemini rerun, blind judged | Scored 1.8, worse than first attempt (2.1); same fabrication pattern |
| 2026-03-05 | Supplementary | Model self-knowledge, blind judged | GPT 1st, Opus 2nd, Gemini 3rd (corrected from initial incomplete paste) |
| 2026-03-05 | Eval 6 | Run 1 complete, dual-judged (Perplexity DR + Opus incognito) | Judges split on 1st: Perplexity→Opus, Opus→GPT. Both agree Gemini 3rd. |
| 2026-03-05 | Eval 6 | Run 2 complete, dual-judged | GPT in auto mode collapsed to last. Judges split Opus/Gemini for 1st. Critical finding: thinking toggle is load-bearing. |
| 2026-03-05 | Eval 5 | GPT 5.4 complete (56 min, Codex thinking mode) | Score: 3.8/5. Excellent infrastructure, poor edge case judgment. Mickey Mouse in DB. |
| 2026-03-06 | Eval 5 | Gemini 3.1 complete (21 min, native coding env) | Score: 2.4/5. Crashed once, retried. Best frontend of the three. 48% file discovery, 0/15 obstacles fully caught, all statuses collapsed to "completed." |
| 2026-03-06 | Eval 5 | Claude Opus 4.6 complete (15 min, Claude Code) | Score: 3.5/5. Cleanest architecture, verified idempotency. openpyxl not installed: judgment failure cost ~20% file coverage. |

Methodology Notes

  • Each eval was run on all 3 models independently
  • Outputs labeled <1> <2> <3> and pasted into Opus 4.6 (incognito) for blind judging
  • Key: 1=Opus, 2=GPT, 3=Gemini (held by Jon, not revealed to judge)
  • Eval 6 used dual judges (Perplexity Deep Research + Opus incognito) for both runs
  • Eval 6 Run 2: GPT 5.4 switched from "thinking" to "auto" mode — produced the study's most significant finding (thinking toggle is load-bearing for epistemic tasks)
  • Recommended run order: 1, 3, 6 (fast baseline) → 5 (agentic)
  • 3 runs recommended for subjective evals to measure variance
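
As a rough illustration of the variance point in the last bullet, the sketch below summarizes run-to-run spread for the Eval 1 judge overall scores using Python's standard `statistics` module. Only two runs exist so far, so the standard deviation is a weak estimate; the helper name `run_spread` is illustrative.

```python
from statistics import mean, stdev

# Judge "Overall" scores for Eval 1, one entry per run (Run 1, Run 2).
eval1_overall = {
    "Opus 4.6": [4.33, 4.3],
    "GPT 5.4": [4.25, 4.2],
    "Gemini 3.1": [3.67, 3.7],
}


def run_spread(scores):
    """Summarize spread across runs; needs at least two runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "range": max(scores) - min(scores),
    }


spreads = {model: run_spread(s) for model, s in eval1_overall.items()}
for model, s in spreads.items():
    print(f"{model}: mean={s['mean']:.3f} stdev={s['stdev']:.3f} range={s['range']:.2f}")
```

With a third run appended to each list, the same function gives a more meaningful spread, and a gap between two models that is smaller than their run-to-run range is better read as a tie.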