BeQu is an open knowledge evaluation benchmark for LLMs. Instead of fixed Q&A pairs,
we prompt models to freely surface what they know about an entity, e.g.
"Tell me everything you know about Martin Luther King", then verify each
generated statement against a reference corpus built from Wikipedia and the web.
10,000 entities · 20 models · two-way precision/recall.
BeQu evaluates models along two complementary axes: precision — how well the elicited triples are supported by the reference corpus — and recall — how much of the reference corpus the model surfaces. Both directions are framed as textual entailment and judged by a locally hosted Llama 4 Scout.
Entities are randomly sampled from Wikipedia under hard filters (≥2000 chars, ≥10 Wikidata statements, not a disambiguation page), then screened by an LLM judge for informativeness, ambiguity, and suitability.
The model under test is prompted to freely emit (subject, predicate, object) triples about each entity. We use the GPTKB prompt by default, plus seven variations for prompt-sensitivity analysis.
For each entity we assemble the full Wikipedia article and up to 20 web documents via Brave Search. An LLM extracts reference triples; full texts serve as the retrieval corpus for RAG.
Precision: each elicited triple is checked against retrieved passages (top-10 RAG). Recall: each reference triple is checked against the model's full elicited set. NLI labels: entailed, contradictory, neutral.
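The two verification directions can be sketched as follows. This is a minimal illustration, not the benchmark's code: `retrieve` and `judge` are hypothetical stand-ins for the RAG retriever and the Llama 4 Scout entailment judge, `verbalize` is an assumed way of turning a triple into text, and the toy corpus and triples are made up.

```python
from collections import Counter
from typing import Callable

Triple = tuple[str, str, str]
LABELS = ("entailed", "contradictory", "neutral")

# Hypothetical interfaces (assumptions, not the BeQu API):
#   retrieve(query, k) -> up to k passages; judge(premise, hypothesis) -> one of LABELS
Retriever = Callable[[str, int], list[str]]
Judge = Callable[[str, str], str]


def verbalize(triple: Triple) -> str:
    """Turn a (subject, predicate, object) triple into a plain-text statement."""
    return " ".join(triple)


def precision_labels(elicited: list[Triple], retrieve: Retriever,
                     judge: Judge, top_k: int = 10) -> Counter:
    """Check each elicited triple against the top-k retrieved passages."""
    counts = Counter({label: 0 for label in LABELS})
    for triple in elicited:
        statement = verbalize(triple)
        premise = "\n".join(retrieve(statement, top_k))
        counts[judge(premise, statement)] += 1
    return counts


def recall_labels(reference: list[Triple], elicited: list[Triple],
                  judge: Judge) -> Counter:
    """Check each reference triple against the model's full elicited set."""
    premise = "\n".join(verbalize(t) for t in elicited)
    counts = Counter({label: 0 for label in LABELS})
    for triple in reference:
        counts[judge(premise, verbalize(triple))] += 1
    return counts


# Toy stand-ins so the sketch runs end to end (made-up data).
corpus = [
    "Martin Luther King was born in Atlanta.",
    "Martin Luther King received the Nobel Peace Prize in 1964.",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    return corpus[:k]


def toy_judge(premise: str, hypothesis: str) -> str:
    # Crude word-overlap proxy for entailment; it never returns "contradictory".
    text = premise.lower()
    covered = all(word in text for word in hypothesis.lower().split())
    return "entailed" if covered else "neutral"


elicited = [
    ("Martin Luther King", "born in", "Atlanta"),
    ("Martin Luther King", "born in", "Boston"),
]
reference = [
    ("Martin Luther King", "born in", "Atlanta"),
    ("Martin Luther King", "received", "Nobel Peace Prize"),
]

precision_counts = precision_labels(elicited, toy_retrieve, toy_judge)
recall_counts = recall_labels(reference, elicited, toy_judge)
print(dict(precision_counts), dict(recall_counts))
```

Keeping the retriever and judge as injected callables makes the asymmetry between the two directions explicit: precision conditions on retrieved corpus passages, while recall conditions on the model's own output.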
Most existing knowledge benchmarks suffer from availability bias: they only evaluate knowledge that designers explicitly chose to query. If a fact is not in the benchmark, it is invisible to evaluation — even when the model possesses it.
By starting from open-ended elicitation rather than predefined questions, BeQu shifts the focus from answer retrieval toward characterizing the knowledge models naturally surface. The benchmark is paired with reference corpora so that statements remain verifiable, but the model retains full freedom over what to express.
Each model is scored in both directions by its distribution over three labels: entailment, contradiction, and neutral. From these we derive precision, recall, and F1. Aggregate scores are then sliced by reasoning effort, model scale, prompt format, entity popularity, and knowledge domain.
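As a rough illustration of how the label counts could turn into scores, the sketch below assumes that precision is the share of elicited triples labeled entailed, that recall is the share of reference triples labeled entailed, and that F1 is their harmonic mean. The exact aggregation used in the paper may differ, and the counts are invented for the example.

```python
def entailed_share(counts: dict[str, int]) -> float:
    """Fraction of judged triples labeled 'entailed' (0.0 if nothing was judged)."""
    total = sum(counts.values())
    return counts.get("entailed", 0) / total if total else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Made-up label counts for one model (assumption, not benchmark results).
precision_dir = {"entailed": 60, "contradictory": 10, "neutral": 30}  # over elicited triples
recall_dir = {"entailed": 30, "contradictory": 5, "neutral": 65}      # over reference triples

precision = entailed_share(precision_dir)
recall = entailed_share(recall_dir)
f1 = f1_score(precision, recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under these assumptions the contradiction and neutral shares do not enter F1 directly, but they remain available for separate analysis, for instance to distinguish unsupported statements from contradicted ones.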
BeQu ships four entity lists, each constructed to isolate a different dimension of LLM knowledge. The primary list anchors the headline leaderboard; the others power domain, popularity, and hallucination analyses.
Randomly sampled Wikipedia article titles passing hard filters (length, Wikidata coverage, non-ambiguity) plus an LLM judge for informativeness and suitability. The primary benchmark dataset.
100 entities per domain across person, organisation, location, event, work of art, artifact, scientific concept, cultural concept, animal, plant. Enables per-domain F1 analysis.
Hand-crafted plausible and absurd fictions — Valdora Strait, U-Bahn Dresden, iPhone 19 Pro, Helios Prize for Digital Arts, Gulf of Varennes. No reference corpus; the test is whether models abstain from generating triples at all.
The random subset partitioned into low / mid / high popularity buckets by Wikidata statement count. Used in a supplementary experiment: GPT-5.4's F1 rises by 0.085 from mid to high popularity, whereas Llama 4 Scout shows no trend.
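Two of the list-specific analyses can be sketched briefly. The bucket cutoffs are not stated here, so the example assumes equal-sized terciles by Wikidata statement count. The abstention rate is assumed to be the share of fictional entities for which a model emits no triples. Entity names in the popularity example and all counts and triples are hypothetical.

```python
Triple = tuple[str, str, str]


def popularity_buckets(statement_counts: dict[str, int]) -> dict[str, str]:
    """Assign low / mid / high by rank of Wikidata statement count (tercile split, an assumption)."""
    # sorted() is stable, so entities with equal counts keep their input order.
    ranked = sorted(statement_counts, key=statement_counts.get)
    n = len(ranked)
    buckets = {}
    for rank, entity in enumerate(ranked):
        if rank < n / 3:
            buckets[entity] = "low"
        elif rank < 2 * n / 3:
            buckets[entity] = "mid"
        else:
            buckets[entity] = "high"
    return buckets


def abstention_rate(elicited: dict[str, list[Triple]]) -> float:
    """Share of entities for which the model produced no triples at all."""
    if not elicited:
        return 0.0
    abstained = sum(1 for triples in elicited.values() if not triples)
    return abstained / len(elicited)


# Hypothetical statement counts for six placeholder entities.
counts = {"entity_a": 12, "entity_b": 250, "entity_c": 40,
          "entity_d": 15, "entity_e": 90, "entity_f": 600}
buckets = popularity_buckets(counts)

# Made-up model output for three of the fictional entities.
fictional_output = {
    "Valdora Strait": [],
    "U-Bahn Dresden": [("U-Bahn Dresden", "located in", "Dresden")],
    "Gulf of Varennes": [],
}
rate = abstention_rate(fictional_output)
print(buckets, round(rate, 3))
```

A rank-based split keeps the three buckets the same size regardless of how skewed the statement counts are. Fixed numeric thresholds would be an equally plausible design.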
BeQu is anonymized for the current submission cycle and accompanies a paper under peer review at ARR May 2026. All data, code, and elicited triples will be released publicly upon publication.
The benchmark was built to address a fundamental limitation of existing LLM knowledge evaluations: availability bias. Question-based benchmarks can only test what their designers chose to ask. BeQu shifts the paradigm — models are evaluated on knowledge they choose to surface in response to open-ended prompts, then verified against reference corpora.
We evaluate 20 commercial and open-weight models across seven experimental scenarios:
overall ranking, reasoning effort, domain, parameter scale, prompt format, triple
range, and entity popularity. Total reproduction cost is approximately $183 USD,
including OpenAI, OpenRouter, and Brave Search API credits.
Submission of new models will open after the review period. Until then, the leaderboard reflects the 20 models in the paper.