Anonymous submission · ARR May 2026 Live · v1.0

Beyond questions.

Evaluating what language models (actually) know.

BeQu is an open knowledge evaluation benchmark for LLMs. Instead of fixed Q&A pairs, we prompt models to freely surface what they know about an entity (e.g., "Tell me everything you know about Martin Luther King"), then verify each generated statement against a reference corpus built from Wikipedia and the web. 10,000 entities · 20 models · two-way precision/recall.

Models evaluated: 20
Entities: 10,000
Knowledge domains: 10
Experiments: 5
P / R verification: 2-way
Experiments

Overall model ranking

Verified against per-entity reference corpora
Wikipedia + web · Llama 4 Scout NLI judge
Setting
GPTKB prompt · medium reasoning
Dataset
200 random entities
All 20 models evaluated on the same set of randomly sampled entities. This scenario produces the primary BeQu ranking.
[Leaderboard table; columns: # · Model · F1 · Precision · Recall · Contradicts]
Precision–Recall frontier
Experiment 1 · all 20 models
No model achieves both high precision and high recall — a clean frontier emerges.
Legend: Commercial · Open weights
Key findings
From five experimental scenarios.
FINDING 01
Benchmark durability
All models are far from saturating open-ended knowledge generation — F1 spans 0.171–0.473 — and open-source models are competitive with commercial ones: Kimi K2.5 ranks 2nd overall.
FINDING 02
"Unreasonable" task
Unlike in most NLP tasks, reasoning has negligible benefit for open-ended knowledge expression. All nine model–effort combinations fall within a narrow F1 band of 0.400–0.484.
FINDING 03
Schemas vs. creativity
Schema enforcement boosts precision — Schema.org reaches 77.2% — but significantly lowers recall to only 14.0%. Open-ended prompts consistently win on F1.
FINDING 04
Hard-wired operating points
Explicit precision–recall steering is possible only to a limited extent. Prompt repetition (3×) gives a modest recall gain (27.2 → 31.6%), but further repetitions (5×, 10×) steeply degrade performance.
F1 by knowledge domain
Experiment 3 · 100 entities/domain
Results shown for GPT-5.4. Event and Scientific Concept are universally strong; Person is universally weakest.
How it works

Methodology

BeQu evaluates models along two complementary axes: precision — how well the elicited triples are supported by the reference corpus — and recall — how much of the reference corpus the model surfaces. Both directions are framed as textual entailment and judged by a locally hosted Llama 4 Scout.

i.

Entity selection

Random sampling from Wikipedia with hard filters (≥2000 chars, ≥10 Wikidata statements, not a disambiguation page), then LLM-judged for informativeness, ambiguity, and suitability.
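The hard filters above can be sketched as a simple predicate. This is a minimal illustration, not the benchmark's actual code: the thresholds (2000 characters, 10 Wikidata statements, Likert ≥ 3 on all axes) come from this page, while the `Candidate` record, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

MIN_ARTICLE_CHARS = 2000        # stated on this page: >= 2000 chars
MIN_WIKIDATA_STATEMENTS = 10    # stated on this page: >= 10 Wikidata statements
MIN_LIKERT = 3                  # stated on this page: Likert >= 3 on all axes


@dataclass
class Candidate:
    """Hypothetical record for one sampled Wikipedia article."""
    title: str
    article_chars: int
    wikidata_statements: int
    is_disambiguation: bool


def passes_hard_filters(c: Candidate) -> bool:
    """Apply the three hard filters described in the text."""
    return (
        c.article_chars >= MIN_ARTICLE_CHARS
        and c.wikidata_statements >= MIN_WIKIDATA_STATEMENTS
        and not c.is_disambiguation
    )


def passes_llm_judge(scores: Dict[str, int]) -> bool:
    """Accept only if every judged axis reaches the Likert threshold.

    `scores` maps an axis name (e.g. informativeness, ambiguity,
    suitability) to a Likert rating produced elsewhere by the LLM judge.
    """
    return bool(scores) and all(v >= MIN_LIKERT for v in scores.values())
```

An entity would be kept only if both checks pass; how the judge ratings are obtained is outside the scope of this sketch.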

ii.

Knowledge elicitation

The model under test is prompted to freely emit (subject, predicate, object) triples about each entity. We use the GPTKB prompt by default, plus seven variations for prompt-sensitivity analysis.

iii.

Reference corpus

For each entity we assemble the full Wikipedia article and up to 20 web documents via Brave Search. An LLM extracts reference triples; full texts serve as the retrieval corpus for RAG.
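To make the retrieval step concrete, here is a minimal top-k retrieval sketch over pre-computed passage embeddings. It assumes cosine similarity as the ranking function, which this page does not specify, and it uses toy vectors in place of real embeddings; `cosine` and `top_k_passages` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_passages(
    query_vec: Sequence[float],
    passages: List[Tuple[str, Sequence[float]]],
    k: int = 10,  # the page states top-10 retrieval for the precision direction
) -> List[str]:
    """Return the texts of the k passages most similar to the query vector."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In practice the vectors would come from an embedding model applied to chunks of the Wikipedia article and web documents; chunking strategy and index type are left open here.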

iv.

Two-way verification

Precision: each elicited triple is checked against retrieved passages (top-10 RAG). Recall: each reference triple is checked against the model's full elicited set. NLI labels: entailed, contradictory, neutral.
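The two verification directions can be sketched with a pluggable judge. This is a hedged outline under several assumptions: the judge is any callable mapping a (premise, hypothesis) pair to one of the three labels, triples are verbalized by simple concatenation, and the recall premise is the concatenation of all elicited triples. None of these details are confirmed by this page, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Judge = Callable[[str, str], str]      # (premise, hypothesis) -> label
Retriever = Callable[[str], List[str]] # query text -> retrieved passages

LABELS = ("entailed", "contradictory", "neutral")


def verbalize(triple: Triple) -> str:
    """Turn a triple into a plain-text hypothesis (assumed format)."""
    return " ".join(triple)


def precision_direction(
    elicited: List[Triple], retrieve: Retriever, judge: Judge
) -> List[str]:
    """Label each elicited triple against its retrieved passages."""
    labels = []
    for triple in elicited:
        hypothesis = verbalize(triple)
        premise = "\n".join(retrieve(hypothesis))
        labels.append(judge(premise, hypothesis))
    return labels


def recall_direction(
    reference: List[Triple], elicited: List[Triple], judge: Judge
) -> List[str]:
    """Label each reference triple against the model's full elicited set."""
    premise = "\n".join(verbalize(t) for t in elicited)
    return [judge(premise, verbalize(t)) for t in reference]
```

Keeping the judge and retriever as parameters makes it possible to swap in a real NLI model and a real index without changing the control flow.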

What we measure

Beyond a single number

Most existing knowledge benchmarks suffer from availability bias: they only evaluate knowledge that designers explicitly chose to query. If a fact is not in the benchmark, it is invisible to evaluation — even when the model possesses it.

By starting from open-ended elicitation rather than predefined questions, BeQu shifts the focus from answer retrieval toward characterizing the knowledge models naturally surface. The benchmark is paired with reference corpora so that statements remain verifiable, but the model retains full freedom over what to express.

In both directions, every judged triple receives one of three labels (entailed, contradictory, or neutral), and each model is summarized by the resulting label distributions. From these we derive precision, recall, and F1. Aggregate scores are then sliced by reasoning effort, model scale, prompt format, entity popularity, and knowledge domain.
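As an illustration of how precision, recall, and F1 could be derived from NLI labels, the sketch below treats precision and recall as the share of "entailed" labels in the respective direction and F1 as their harmonic mean. That definition is an assumption; this page does not spell out the exact aggregation (for example, whether scores are averaged per entity first). Function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("entailed", "contradictory", "neutral")


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Fraction of each NLI label; all zeros for an empty list."""
    counts = Counter(labels)
    total = len(labels)
    return {name: (counts[name] / total if total else 0.0) for name in LABELS}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(p_labels: List[str], r_labels: List[str]) -> Dict[str, float]:
    """Assumed scoring: entailed share in each direction, then F1."""
    p = label_distribution(p_labels)["entailed"]
    r = label_distribution(r_labels)["entailed"]
    return {"precision": p, "recall": r, "f1": f1(p, r)}
```

For example, plugging in the Schema.org figures from Finding 03, `f1(0.772, 0.140)` gives roughly 0.237. The page does not report that F1 value, so it should be read as arithmetic on the two published numbers rather than as a benchmark result.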

Evaluation scale
Sampled triples per model: 500 × 2 directions
Stratification: entity-then-triple
RAG retrieval (precision): top-10
Embedding model: text-embed-3-small
NLI judge: Llama 4 Scout
Judge agreement (manual): 90%
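One plausible reading of "entity-then-triple" stratification is: first pick an entity uniformly at random, then pick one of its triples uniformly at random, so that entities with many triples do not dominate the sample. The sketch below implements that reading without replacement. The interpretation, the function name, and the seed handling are all assumptions; only the sample size of 500 per direction comes from this page.

```python
import random
from typing import Dict, Hashable, List, Tuple


def sample_entity_then_triple(
    triples_by_entity: Dict[str, List[Hashable]],
    n: int = 500,  # stated on this page: 500 sampled triples per direction
    seed: int = 0,
) -> List[Tuple[str, Hashable]]:
    """Draw up to n (entity, triple) pairs: uniform entity, then uniform triple.

    Sampling is without replacement; an entity leaves the pool once all of
    its triples have been drawn. The input mapping is not modified.
    """
    rng = random.Random(seed)
    pools = {e: list(ts) for e, ts in triples_by_entity.items() if ts}
    entities = sorted(pools)  # fixed order so a given seed is reproducible
    samples: List[Tuple[str, Hashable]] = []
    while entities and len(samples) < n:
        i = rng.randrange(len(entities))
        entity = entities[i]
        pool = pools[entity]
        samples.append((entity, pool.pop(rng.randrange(len(pool)))))
        if not pool:
            entities.pop(i)
    return samples
```

Compared with sampling directly from the flat list of triples, this gives each entity roughly equal weight regardless of how verbose the model was about it.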
Four evaluation sets

Datasets

BeQu ships four entity lists, each constructed to isolate a different dimension of LLM knowledge. The primary list anchors the headline leaderboard; the others power domain, popularity, and hallucination analyses.

DATASET 01

Random entities 10,000

Randomly sampled Wikipedia article titles passing hard filters (length, Wikidata coverage, non-ambiguity) plus an LLM judge for informativeness and suitability. The primary benchmark dataset.

Hard filters · Llama 4 judged · Likert ≥ 3 on all axes
DATASET 02

Domain-balanced 10 × 100

100 entities per domain across person, organisation, location, event, work of art, artifact, scientific concept, cultural concept, animal, plant. Enables per-domain F1 analysis.

10 domains · Single-domain entities only
DATASET 03

Non-existent entities 10

Hand-crafted plausible and absurd fictions — Valdora Strait, U-Bahn Dresden, iPhone 19 Pro, Helios Prize for Digital Arts, Gulf of Varennes. No reference corpus; the test is whether models abstain from generating triples at all.

Hallucination probe · No ground truth · Experiment 3
APPENDIX E

Popularity tiers 3 × 200

The random subset partitioned into low / mid / high popularity buckets by Wikidata statement count. Used in a supplementary experiment: GPT-5.4's F1 rises by 0.085 from mid to high popularity, whereas Llama 4 Scout shows no trend.

Wikidata statement count · Equal-sized buckets · Appendix experiment
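Equal-sized popularity buckets can be sketched by ranking entities on Wikidata statement count and cutting the ranking into thirds. This is an illustrative reading only: the page states the bucketing variable and that buckets are equal-sized, but not how ties are broken or whether entities are sampled into tiers. `popularity_tiers` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def popularity_tiers(
    statement_counts: Dict[str, int],
    names: Sequence[str] = ("low", "mid", "high"),
) -> Dict[str, List[str]]:
    """Split entities into len(names) near-equal buckets by statement count.

    Ties are broken alphabetically by entity name (an arbitrary choice).
    """
    ranked = sorted(statement_counts, key=lambda e: (statement_counts[e], e))
    n, k = len(ranked), len(names)
    return {
        name: ranked[i * n // k : (i + 1) * n // k]
        for i, name in enumerate(names)
    }
```

With 600 entities this yields three buckets of 200, matching the 3 × 200 layout described above.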
Hallucination on non-existent entities
Lower is better · 10 fake entities
GPT-5.4 alone fully abstains. Open-weight models hallucinate readily.
GPT-5.4: 0 triples
Llama 4 Scout 17B: 32 triples / 7 entities
DeepSeek V3.2: 131 triples / 4 entities
About this work

About BeQu

BeQu is anonymized for the current submission cycle and accompanies a paper under peer review at ARR May 2026. All data, code, and elicited triples will be released publicly upon publication.

The benchmark was built to address a fundamental limitation of existing LLM knowledge evaluations: availability bias. Question-based benchmarks can only test what their designers chose to ask. BeQu shifts the paradigm — models are evaluated on knowledge they choose to surface in response to open-ended prompts, then verified against reference corpora.

We evaluate 20 commercial and open-weight models across seven experimental scenarios: overall ranking, reasoning effort, domain, parameter scale, prompt format, triple range, and entity popularity. Total reproduction cost is approximately 183 USD, including OpenAI, OpenRouter, and Brave Search API credits.

Submission of new models will open after the review period. Until then, the leaderboard reflects the 20 models in the paper.

How to cite
@inproceedings{bequ2026,
  title     = {Beyond Questions: Evaluating What Language Models (Actually) Know},
  author    = {Anonymous},
  booktitle = {},
  year      = {2026},
  note      = {Under review}
}
Data explorer

Entity Lists

Browse the Wikipedia entity sets used across BeQu experiments. Select a dataset below to search and page through all entities.

Elicited Triples

Browse the subject–predicate–object triples each model generated for each experiment. Select a model and dataset on the left, then search or page through the results.