Most AI-capability papers don't say which model they tested, or when.
VERSIO-AI is the audit tool that does. Paste a DOI; the pipeline reads the open-access full text when one resolves (otherwise the abstract), runs a single Claude Opus 4.7 extraction, and renders the result as one editorial card: model variant tested, evaluation date, elicitation completeness (reasoning, tools, prompting, the strongest within-family variant), and whether the paper's conclusion stayed scoped to those tested models or escaped into class-level claims about AI or LLMs.
Was the tested model named precisely enough that a domain reader can place it on the capability timeline?
Pass
Variant or pinned snapshot named: “GPT-4o”, “claude-3-5-sonnet-20240620”.
Warn
Vendor only: “ChatGPT”, “Claude”, “Gemini” without variant.
Fail
Generic: “an LLM”, “31 large language models”, “the AI”.
Elicitation completeness
Item 9
Was reasoning enabled, were tools and search allowed, was prompting and context disclosed, and was the strongest variant of the chosen family generation actually used?
Pass
≥75 % of the configuration dimensions disclosed; strongest in-family variant used.
Warn
Partial disclosure, or a weaker variant chosen when a stronger one was available.
Fail
Black-box invocation: reader cannot reproduce the elicitation.
Capability frame
Item 5
Did the conclusion sentence keep its subject scoped to the tested model, or did it generalise to AI / LLMs as a class?
Pass
Subject is the named model, an enumeration, or an anaphoric collective.
Warn
Generic-tier subject mitigated by broad cross-vendor breadth.
Fail
Bare “LLMs” / “AI” as the subject of a capability verb.
Frontier-gap at evaluation
Item 12
How far behind the contemporaneous Arena-elicited frontier was the tested model when the evaluation actually ran?
0 mo
At the frontier: tested within ~6 months of the contemporaneous top.
1–6 mo
Modest lag; tested model has been visibly surpassed.
> 6 mo
Material lag; the finding cannot speak to current state-of-the-art.
Why this exists
Frontier Lag(the bibliometric audit this tool is the public face of) measures three structural disclosure failures in academic AI-capability research: tested models routinely months or years behind the elicitable frontier at the moment the evaluation runs (lag), comparator sets that span only a paper's preferred tier rather than the contemporaneous top (comparator inadequacy), and conclusion sentences that generalise findings from one model to AI / LLMs as a class without the cross-model breadth such claims require (frame asymmetry).
VERSIO-AI is what falls out when the same Opus 4.7 extraction prompt that scores the audit corpus (medicine, law, coding, education, scientific reasoning; preprint forthcoming) is re-pointed at a single DOI. Every verdict carries the verbatim substring that drove it; every frontier comparison anchors against the Arena trajectory and a registry of ~160 named models. Nothing on the card is a guess.