Most AI-capability papers don't say which model they tested, or when.

VERSIO-AI is the audit tool that does. Paste a DOI; the pipeline reads the open-access full text when one resolves (otherwise the abstract), runs a single Claude Opus 4.7 extraction, and renders the result as one editorial card: model variant tested, evaluation date, elicitation completeness (reasoning, tools, prompting, the strongest within-family variant), and whether the paper's conclusion stayed scoped to those tested models or escaped into class-level claims about AI or LLMs.

Paywalled, no preprint? Paste the abstract, full text, or PDF directly →

Try an exemplar

What the card returns

Model version

Item 1

Was the tested model named precisely enough that a domain reader can place it on the capability timeline?

Pass: Variant or pinned snapshot named: “GPT-4o”, “claude-3-5-sonnet-20240620”.
Warn: Vendor only: “ChatGPT”, “Claude”, “Gemini” without variant.
Fail: Generic: “an LLM”, “31 large language models”, “the AI”.
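The three tiers above can be sketched as a small grading function. This is only an illustration: `grade_model_name`, the `VENDOR_ONLY` set, and the `GENERIC_PATTERN` regex are hypothetical names seeded from the examples on this card, not the tool's extraction prompt or its model registry. Treating every other non-empty name as a pass is also an assumption.

```python
import re

# Hypothetical lists built from the card's own examples; the real registry is not shown here.
VENDOR_ONLY = {"chatgpt", "claude", "gemini"}
GENERIC_PATTERN = re.compile(
    r"\b(an? llm|llms|large language models?|the ai)\b", re.IGNORECASE
)


def grade_model_name(name: str) -> str:
    """Return 'pass', 'warn', or 'fail' for a model name as written in a paper."""
    cleaned = name.strip().lower()
    if not cleaned:
        return "fail"  # nothing named at all
    if cleaned in VENDOR_ONLY:
        return "warn"  # vendor or product name with no variant
    if GENERIC_PATTERN.search(cleaned):
        return "fail"  # generic reference to a class of systems
    return "pass"  # assumption: anything else names a variant or snapshot
```

For example, `grade_model_name("claude-3-5-sonnet-20240620")` returns `"pass"`, while `grade_model_name("Claude")` returns `"warn"`.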

Elicitation completeness

Item 9

Was reasoning enabled, were tools and search allowed, were prompting and context disclosed, and was the strongest variant of the chosen model family and generation actually used?

Pass: ≥75% of the configuration dimensions disclosed; strongest in-family variant used.
Warn: Partial disclosure, or a weaker variant chosen when a stronger one was available.
Fail: Black-box invocation: reader cannot reproduce the elicitation.
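A minimal sketch of how the thresholds above might combine, assuming each configuration dimension is recorded as a simple disclosed / not disclosed flag. The function name `grade_elicitation`, the dimension keys, and the reading of "black-box" as "no dimension disclosed at all" are assumptions made for illustration.

```python
def grade_elicitation(disclosed: dict[str, bool], strongest_variant_used: bool) -> str:
    """Return 'pass', 'warn', or 'fail' for elicitation completeness.

    `disclosed` maps a configuration dimension (hypothetical keys such as
    'reasoning', 'tools', 'prompting', 'context') to whether the paper reports it.
    """
    if not disclosed or not any(disclosed.values()):
        return "fail"  # assumption: black-box means no dimension is disclosed
    fraction = sum(disclosed.values()) / len(disclosed)
    if fraction >= 0.75 and strongest_variant_used:
        return "pass"
    return "warn"  # partial disclosure, or a weaker variant was chosen
```

With four dimensions, three disclosed flags give a fraction of exactly 0.75, so the paper passes only if it also used the strongest in-family variant.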

Capability frame

Item 5

Did the conclusion sentence keep its subject scoped to the tested model, or did it generalise to AI / LLMs as a class?

Pass: Subject is the named model, an enumeration, or an anaphoric collective.
Warn: Generic-tier subject mitigated by broad cross-vendor breadth.
Fail: Bare “LLMs” / “AI” as the subject of a capability verb.
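The frame check can be approximated with a rough heuristic over the grammatical subject of the conclusion sentence. The sketch below assumes the subject has already been extracted as a string. `grade_frame`, `GENERIC_SUBJECTS`, and `ANAPHORIC_PREFIXES` are hypothetical, and returning `None` for an unrecognised subject (to flag it for manual review) is a design choice of this example, not something the card specifies.

```python
from typing import Optional, Sequence

# Hypothetical word lists; a real classifier would need far broader coverage.
GENERIC_SUBJECTS = {"llms", "ai", "large language models", "artificial intelligence"}
ANAPHORIC_PREFIXES = ("these ", "the tested ", "the evaluated ")


def grade_frame(
    subject: str,
    tested_models: Sequence[str],
    broad_cross_vendor: bool = False,
) -> Optional[str]:
    """Grade the subject of a conclusion sentence; None means 'not recognised'."""
    s = subject.strip().lower()
    names_tested_model = any(m and m.lower() in s for m in tested_models)
    if names_tested_model or s.startswith(ANAPHORIC_PREFIXES):
        return "pass"  # named model, enumeration, or anaphoric collective
    if s in GENERIC_SUBJECTS:
        # A class-level subject is only mitigated by broad cross-vendor testing.
        return "warn" if broad_cross_vendor else "fail"
    return None
```

Under these assumptions, `grade_frame("LLMs", ["GPT-4o"])` returns `"fail"`, and the same subject with `broad_cross_vendor=True` returns `"warn"`.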

Frontier-gap at evaluation

Item 12

How far behind the contemporaneous Arena-elicited frontier was the tested model when the evaluation actually ran?

0 mo: At the frontier: tested within ~6 months of the contemporaneous top.
1–6 mo: Modest lag; tested model has been visibly surpassed.
> 6 mo: Material lag; the finding cannot speak to current state-of-the-art.
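Read by their labels, the three buckets amount to a threshold function over the lag in whole months. The sketch below assumes the lag has already been computed; how the tool derives it from the Arena trajectory is not described here. The function name and the returned keys are hypothetical.

```python
def frontier_gap_bucket(lag_months: int) -> str:
    """Map a lag in whole months to one of the card's three buckets."""
    if lag_months < 0:
        raise ValueError("lag_months must be non-negative")
    if lag_months == 0:
        return "at_frontier"  # the '0 mo' bucket
    if lag_months <= 6:
        return "modest_lag"  # the '1-6 mo' bucket
    return "material_lag"  # the '> 6 mo' bucket
```

A model evaluated seven months behind the contemporaneous top therefore lands in `"material_lag"`.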

Why this exists

Frontier Lag (the bibliometric audit this tool is the public face of) measures three structural disclosure failures in academic AI-capability research: tested models that are routinely months or years behind the elicitable frontier at the moment the evaluation runs (lag), comparator sets that span only a paper's preferred tier rather than the contemporaneous top (comparator inadequacy), and conclusion sentences that generalise findings from one model to AI / LLMs as a class without the cross-model breadth such claims require (frame asymmetry).

VERSIO-AI is what falls out when the same Opus 4.7 extraction prompt that scores the audit corpus (medicine, law, coding, education, scientific reasoning; preprint forthcoming) is re-pointed at a single DOI. Every verdict carries the verbatim substring that drove it; every frontier comparison anchors against the Arena trajectory and a registry of ~160 named models. Nothing on the card is a guess.
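The "verbatim substring" guarantee lends itself to a mechanical check: a verdict counts as grounded only if its quoted evidence appears character for character in the text that was read. The `Verdict` record and `is_grounded` helper below are hypothetical illustrations of that idea, not the tool's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    item: str      # rubric item label, e.g. "Model version"
    grade: str     # "pass", "warn", or "fail"
    evidence: str  # the quote that is claimed to have driven the grade


def is_grounded(verdict: Verdict, source_text: str) -> bool:
    """True only if the evidence is a non-empty verbatim substring of the source."""
    return bool(verdict.evidence) and verdict.evidence in source_text
```

A verdict whose quote was paraphrased, or that has no quote at all, fails this check and could be rejected before anything is rendered on the card.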

Read the methodology in full →