Anonymous submission · ARR May 2026 Live · v1.0

Beyond questions.

Evaluating what language models (actually) know.

BeQu is an open knowledge evaluation benchmark for LLMs. Instead of using fixed Q&A pairs, we prompt models to freely surface what they know about an entity (e.g. "Tell me everything you know about Martin Luther King") and then verify each generated statement against a reference corpus built from Wikipedia and the web. 10,000 entities · 20 models · two-way precision/recall.


20 models evaluated · 10,000 entities · 10 knowledge domains · 5 experiments · 2-way P / R verification

Experiments

Overall model ranking

Verified against per-entity reference corpora
Wikipedia + web · Llama 4 Scout NLI judge

Setting

GPTKB prompt · medium reasoning

Dataset

200 random entities

All 20 models evaluated on the same set of randomly sampled entities. This scenario produces the primary BeQu ranking.

Columns: # · Model · F1 · Precision · Recall · Contradicts

Precision–Recall frontier

Experiment 1 · all 20 models

No model achieves both high precision and high recall — a clean frontier emerges.

Legend: Commercial · Open weights

Key findings

From five experimental scenarios.

FINDING 01

Benchmark durability

All models are far from saturating open-ended knowledge generation (F1 spans 0.171–0.473), and open-weight models are competitive with commercial ones: Kimi K2.5 ranks 2nd overall.

FINDING 02

"Unreasonable" task

Unlike in most NLP tasks, reasoning effort brings negligible benefit for open-ended knowledge expression. All nine model–effort combinations cluster within a narrow F1 band of 0.400–0.484.

FINDING 03

Schemas vs. creativity

Schema enforcement boosts precision (Schema.org reaches 77.2%) but lowers recall substantially, to only 14.0%. Open-ended prompts consistently win on F1.

FINDING 04

Hard-wired operating points

Explicit precision–recall steering is possible only to a limited extent. Prompt repetition (3×) gives a modest recall gain (27.2 → 31.6%), but further repetitions (5×, 10×) steeply degrade performance.

F1 by knowledge domain

Experiment 3 · 100 entities/domain

Shown for GPT-5.4. Event and Scientific Concept are universally strong; Person is universally weakest.

How it works

Methodology

BeQu evaluates models along two complementary axes: precision (how well the elicited triples are supported by the reference corpus) and recall (how much of the reference corpus the model surfaces). Both directions are framed as textual entailment and judged by a locally hosted Llama 4 Scout.

i.

Entity selection

Random sampling from Wikipedia with hard filters (≥2000 chars, ≥10 Wikidata statements, not a disambiguation page), then LLM-judged for informativeness, ambiguity, and suitability.
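The hard filters can be sketched as a single predicate. This is a minimal sketch rather than the benchmark's own code: the `Candidate` record and its field names are hypothetical, and only the two thresholds and the disambiguation rule come from the description above. The LLM-judged step is not covered.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; all names below are illustrative.
MIN_ARTICLE_CHARS = 2000
MIN_WIKIDATA_STATEMENTS = 10


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record describing one sampled Wikipedia page."""
    title: str
    article_chars: int
    wikidata_statements: int
    is_disambiguation: bool


def passes_hard_filters(c: Candidate) -> bool:
    """Return True when the page clears all three hard filters."""
    return (
        c.article_chars >= MIN_ARTICLE_CHARS
        and c.wikidata_statements >= MIN_WIKIDATA_STATEMENTS
        and not c.is_disambiguation
    )
```

Pages that pass would then go on to the LLM judge for informativeness, ambiguity, and suitability.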

ii.

Knowledge elicitation

The model under test is prompted to freely emit (subject, predicate, object) triples about each entity. We use the GPTKB prompt by default, plus seven variations for prompt-sensitivity analysis.

iii.

Reference corpus

For each entity we assemble the full Wikipedia article and up to 20 web documents via Brave Search. An LLM extracts reference triples; full texts serve as the retrieval corpus for RAG.

iv.

Two-way verification

Precision: each elicited triple is checked against retrieved passages (top-10 RAG). Recall: each reference triple is checked against the model's full elicited set. NLI labels: entailed, contradictory, neutral.
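The top-10 retrieval step on the precision side can be illustrated as follows. This is a sketch under assumptions: the text does not state the similarity measure, so cosine similarity over precomputed embedding vectors is assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int = 10,
) -> List[int]:
    """Return indices of the k passages most similar to the query, best first."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The returned passages would be handed to the NLI judge together with the elicited triple, which then assigns one of the three labels.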

What we measure

Beyond a single number

Most existing knowledge benchmarks suffer from availability bias: they only evaluate knowledge that designers explicitly chose to query. If a fact is not in the benchmark, it is invisible to evaluation — even when the model possesses it.

By starting from open-ended elicitation rather than predefined questions, BeQu shifts the focus from answer retrieval toward characterizing the knowledge models naturally surface. The benchmark is paired with reference corpora so that statements remain verifiable, but the model retains full freedom over what to express.

In both directions, each model's judged triples are distributed over three labels: entailment, contradiction, and neutral. From these we derive precision, recall, and F1. Aggregate scores are then sliced by reasoning effort, model scale, prompt format, entity popularity, and knowledge domain.
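A small sketch of how the scores could be derived from the judge's labels. It assumes that precision is the share of elicited triples labeled entailed, that recall is the share of reference triples labeled entailed, and that F1 is their harmonic mean; the page does not spell out these formulas, and the function names are illustrative.

```python
from collections import Counter
from typing import Iterable


def label_share(labels: Iterable[str], label: str) -> float:
    """Fraction of judged triples carrying the given label (0.0 if empty)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts[label] / total if total else 0.0


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy labels: one list per verification direction.
elicited_labels = ["entailed", "entailed", "entailed", "neutral"]
reference_labels = ["entailed", "neutral", "neutral", "contradictory"]

precision = label_share(elicited_labels, "entailed")
recall = label_share(reference_labels, "entailed")
score = f1(precision, recall)
```

The same `label_share` helper, called with `"contradictory"`, would give a contradiction rate for either direction.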

Evaluation scale

Sampled triples per model: 500 × 2 dirs
Stratification: entity-then-triple
RAG retrieval (precision): top-10
Embedding model: text-embed-3-small
NLI judge: Llama 4 Scout
Judge agreement (manual): 90%
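One plausible reading of "entity-then-triple" stratification is to draw an entity uniformly first and then a triple from that entity, so that entities with many triples do not dominate the sample. The sketch below follows that reading; it is an assumption, not the documented procedure, and it samples with replacement for simplicity.

```python
import random
from typing import List, Mapping, Sequence, Tuple

Triple = Tuple[str, str, str]


def sample_entity_then_triple(
    triples_by_entity: Mapping[str, Sequence[Triple]],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, Triple]]:
    """Draw n (entity, triple) pairs: uniform over entities, then over triples."""
    rng = random.Random(seed)
    # Skip entities without triples; sort for reproducibility across runs.
    entities = sorted(e for e, ts in triples_by_entity.items() if ts)
    if not entities:
        return []
    sample = []
    for _ in range(n):
        entity = rng.choice(entities)
        sample.append((entity, rng.choice(triples_by_entity[entity])))
    return sample
```

Calling it with `n=500` once per verification direction would match the sample size listed above.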

Four evaluation sets

Datasets

BeQu ships four entity lists, each constructed to isolate a different dimension of LLM knowledge. The primary list anchors the headline leaderboard; the others power domain, popularity, and hallucination analyses.

DATASET 01

Random entities 10,000

Randomly sampled Wikipedia article titles passing hard filters (length, Wikidata coverage, non-ambiguity) plus an LLM judge for informativeness and suitability. The primary benchmark dataset.

Hard filters · Llama 4 judged · Likert ≥ 3 on all axes

DATASET 02

Domain-balanced 10 × 100

100 entities per domain across person, organisation, location, event, work of art, artifact, scientific concept, cultural concept, animal, plant. Enables per-domain F1 analysis.

10 domains · Single-domain entities only

DATASET 03

Non-existent entities 10

Hand-crafted plausible and absurd fictions — Valdora Strait, U-Bahn Dresden, iPhone 19 Pro, Helios Prize for Digital Arts, Gulf of Varennes. No reference corpus; the test is whether models abstain from generating triples at all.

Hallucination probe · No ground truth · Experiment 3

APPENDIX E

Popularity tiers 3 × 200

The random subset partitioned into low / mid / high popularity buckets by Wikidata statement count. Used in a supplementary experiment: GPT-5.4's F1 jumps by 0.085 from mid to high popularity, whereas Llama 4 Scout shows no trend.

Wikidata statement count · Equal-sized buckets · Appendix experiment
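Equal-sized bucketing by Wikidata statement count can be sketched as a rank-based split into thirds. The function name and the tie-breaking by title are assumptions made for illustration; the page only states the sorting key and that the buckets are equal-sized.

```python
from typing import Dict, List


def popularity_tiers(statement_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Split entities into low / mid / high thirds by Wikidata statement count."""
    # Ties on the count are broken by title so the split is deterministic.
    ranked = sorted(statement_counts, key=lambda e: (statement_counts[e], e))
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "low": ranked[:cut1],
        "mid": ranked[cut1:cut2],
        "high": ranked[cut2:],
    }
```

When the number of entities is not divisible by three, the buckets differ in size by at most one.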

Hallucination on non-existent entities

Lower is better · 10 fake entities

GPT-5.4 alone fully abstains. Open-weight models hallucinate readily.

GPT-5.4: 0 triples
Llama 4 Scout 17B: 32 triples / 7 entities
DeepSeek V3.2: 131 triples / 4 entities
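The two numbers reported per model can be read as the total count of generated triples and the number of fake entities that received at least one triple. The sketch below computes both under that reading; the function name and the toy input are hypothetical and do not reproduce the reported figures.

```python
from typing import Mapping, Sequence, Tuple

Triple = Tuple[str, str, str]


def hallucination_summary(
    triples_by_entity: Mapping[str, Sequence[Triple]],
) -> Tuple[int, int]:
    """Return (total triples, entities with at least one triple)."""
    total = sum(len(ts) for ts in triples_by_entity.values())
    affected = sum(1 for ts in triples_by_entity.values() if ts)
    return total, affected


# Toy output for three fake entities; the model abstains on the second one.
toy_output = {
    "Valdora Strait": [("Valdora Strait", "located in", "X")],
    "Gulf of Varennes": [],
    "iPhone 19 Pro": [
        ("iPhone 19 Pro", "maker", "Y"),
        ("iPhone 19 Pro", "type", "Z"),
    ],
}
summary = hallucination_summary(toy_output)
```

A model that fully abstains yields `(0, 0)`, which is the ideal outcome on this dataset.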

About this work

About BeQu

BeQu is anonymous in the current submission cycle and accompanies a paper currently under peer review at ARR May 2026. All data, code, and elicited triples will be released publicly upon publication.

The benchmark was built to address a fundamental limitation of existing LLM knowledge evaluations: availability bias. Question-based benchmarks can only test what their designers chose to ask. BeQu shifts the paradigm — models are evaluated on knowledge they choose to surface in response to open-ended prompts, then verified against reference corpora.

We evaluate 20 commercial and open-weight models across seven experimental scenarios: overall ranking, reasoning effort, domain, parameter scale, prompt format, triple range, and entity popularity. Total reproduction cost is approximately $183, including OpenAI, OpenRouter, and Brave Search API credits.

Submission of new models will open after the review period. Until then, the leaderboard reflects the 20 models in the paper.

How to cite

@inproceedings{bequ2026,
  title     = {Beyond Questions: Evaluating What Language Models (Actually) Know},
  author    = {Anonymous},
  booktitle = {},
  year      = {2026},
  note      = {Under review}
}

Data explorer

Entity Lists

Browse the Wikipedia entity sets used across BeQu experiments. Select a dataset below to search and page through all entities.

Data explorer

Elicited Triples

Browse the subject–predicate–object triples each model generated for each experiment. Select a model and dataset on the left, then search or page through the results.
