Why we don't publish a leaderboard score

The brief is the benchmark

Public leaderboards score answers to fixed questions. Pensiv ships a different answer shape — a Decision-Grade Brief, not a ranked list — so leaderboard rank tests the wrong thing. What we run instead, and why every claim has to trace to source, is below.

The numbers first

What an independent audit found

Three expert raters, blind to our marks, scored the economic case. Every number below traces to the eight-document study, available on request.

$950K

NPV per mid-size deployment, over three years at a 12% discount rate

2.6 mo

to pay back, at the median customer

56.7%

less process risk — median of three blind raters (range 7.9–76.2%)

12

kill criteria you can hold us to in a 30-day pilot

We don't publish a leaderboard score — those get walked back. We publish numbers an independent rater can check. See how we tested →

Three principles the product is built on

Why a knowledge system is not a search engine

Principle

Memory is rebuilt, not just played back

Useful memory isn't a passive log. It's an active structure — recurring roles, actions, and constraints — that gets rebuilt as new evidence comes in. A search system that just returns a ranked list of passages drops that rebuilding work back on you.

In Pensiv

Pensiv works out its own structure as material builds up. By the time you ask, the answer is shaped by everything your team's knowledge has seen — not just by what shares words with your question.

Principle

Match by the underlying pattern, not the vocabulary

The valuable cross-team connection — security's "lateral movement" and SRE's "cascading failure" — has no shared words. Keyword search misses it. Meaning-based search hedges on it. To find it reliably, the system has to compare the underlying pattern, not the exact words.

In Pensiv

Pensiv finds the prior case even when it doesn't share a single word with your query. Legal precedent, competitor moves, clinical patterns — matched by the underlying pattern.

Principle

Forget the way an expert forgets

Memory has to age. Neither "store everything forever" nor "delete after N days" matches how knowledge actually fades in practice. What matters is what people keep coming back to, what gets replaced, and what stops being relevant — without losing the trail of what came before.

In Pensiv

Important knowledge sticks around. Irrelevant knowledge fades. When new information contradicts old, the old one is linked to its replacement, not overwritten. The system forgets by ranking things lower, not by erasing them — every earlier state stays on the trail.
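A minimal sketch of forgetting by ranking lower rather than erasing, under stated assumptions. The `MemoryItem` shape, the 90-day half-life, and the 0.5 penalty for superseded items are all hypothetical choices made for illustration; none of them describe Pensiv's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryItem:
    """One stored piece of knowledge (hypothetical shape for illustration)."""
    item_id: str
    text: str
    importance: float                     # higher when people keep coming back to it
    age_days: float                       # days since it was last used
    superseded_by: Optional[str] = None   # link to the replacement, never a delete


def rank_score(item: MemoryItem, half_life_days: float = 90.0) -> float:
    """Ordering score: decays with age and is halved once the item is superseded."""
    decay = 0.5 ** (item.age_days / half_life_days)
    penalty = 0.5 if item.superseded_by else 1.0
    return item.importance * decay * penalty


def supersede(store: dict[str, MemoryItem], old_id: str, new_item: MemoryItem) -> None:
    """Add the replacement and link the old item to it; nothing is removed."""
    store[new_item.item_id] = new_item
    store[old_id].superseded_by = new_item.item_id


def ranked(store: dict[str, MemoryItem]) -> list[MemoryItem]:
    """All items, best first. Superseded items sink but stay on the trail."""
    return sorted(store.values(), key=rank_score, reverse=True)
```

In this sketch a contradicted item stays in the store with a pointer to what replaced it, so the earlier state can still be reached by following `superseded_by` backwards.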

Just published · 2026-05-25

Independent expert-rater audit of the economic case

Study

AI agent knowledge that pays back in under 3 months

Twenty-seven-persona Monte Carlo. Dual-baseline comparison against running with no memory at all and against the commodity AI-memory tier. Failure-mode catalogue scored by three independent expert raters blind to the author's marks. Twelve kill criteria you can hold us to in a thirty-day pilot.

Headline

$950,000 NPV per mid-ACV deployment at a 12% discount rate over three years · 2.6-month payback at the median customer · 56.7% process-risk reduction (median across three independent raters, range 7.9%–76.2%). Every number traces to the canonical eight-document study, available on request.
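For readers who want to check how figures of this kind are computed, here is a minimal sketch of the standard NPV and simple-payback arithmetic. The cash flows below are hypothetical and are not taken from the study; only the 12% rate, the three-year horizon, and the three rater values (which follow from the reported median and range when there are exactly three raters) come from the text above.

```python
import statistics


def npv(annual_rate: float, cash_flows: list[float]) -> float:
    """Net present value; cash_flows[0] is the upfront (year 0) flow."""
    return sum(cf / (1.0 + annual_rate) ** t for t, cf in enumerate(cash_flows))


def payback_months(upfront_cost: float, monthly_net_benefit: float) -> float:
    """Simple undiscounted payback: months until cumulative benefit covers the cost."""
    if monthly_net_benefit <= 0:
        raise ValueError("monthly_net_benefit must be positive")
    return upfront_cost / monthly_net_benefit


# Hypothetical inputs for illustration only; they are NOT figures from the study.
example_flows = [-100_000.0, 150_000.0, 150_000.0, 150_000.0]  # year 0 to year 3
example_npv = npv(0.12, example_flows)
example_payback = payback_months(100_000.0, 12_500.0)

# With exactly three raters, the reported range endpoints and the median
# are the three individual scores.
rater_scores = [7.9, 56.7, 76.2]
rater_median = statistics.median(rater_scores)
```

The wide rater range is why the headline reports the median together with both endpoints rather than a single averaged figure.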

Read the study summary · Request the full study

What the research enables in practice

Five vocabularies, one underlying pattern

Five different words. One shared pattern. Pensiv connects them.

Cross-department bridge

The pattern that survives vocabulary drift

Security writes "lateral movement." SRE writes "cascading failure." HR writes "escalation pattern." Legal and Finance have their own terms for it. Same underlying problem, five different vocabularies, no shared words. Keyword search has no way to connect them. Meaning-based search hedges and hands you the closest noun, not the closest pattern.

Pensiv matches by the underlying pattern. The same pattern shows up across all five teams — without anyone having to agree on a single term for it.
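A toy sketch of why the two kinds of matching diverge, not a description of Pensiv's matcher. The pattern labels below are hand-written and hypothetical; a real system would have to derive them from the material itself. The point is only that two notes can score zero on shared words and still overlap heavily on structure.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def keyword_overlap(text_a: str, text_b: str) -> float:
    """Word-level overlap, the signal a keyword search relies on."""
    return jaccard(frozenset(text_a.lower().split()), frozenset(text_b.lower().split()))


# Hypothetical, hand-labelled structural features for two teams' notes.
security_note = {
    "text": "lateral movement",
    "pattern": frozenset({"spreads_between_nodes", "uses_trusted_link", "single_origin"}),
}
sre_note = {
    "text": "cascading failure",
    "pattern": frozenset({"spreads_between_nodes", "uses_trusted_link", "overload_driven"}),
}

word_score = keyword_overlap(security_note["text"], sre_note["text"])
pattern_score = jaccard(security_note["pattern"], sre_note["pattern"])
```

Here the two notes share no words at all, yet two of their four distinct pattern features coincide, which is the kind of connection a vocabulary-based index cannot see.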

Sec ↔ SRE ↔ HR ↔ Legal ↔ Finance

Proof

What we measured: same-pattern pairs from a test set drawn across different fields show up in the top results, even when they share no keywords. Open the product, click any claim it puts together, and walk it back to the original source.

See how it ships

How we validate

Evals score answers to fixed questions. We're shipping a different kind of answer, so leaderboard rank isn't the right test. What we run instead — and what each gate measures — is below.

Workflow trial

Per-vertical proof

Tabletop exercises for defense, partner-handoff simulations for legal, multi-year lab data rolled into one view for clinical. Measured in hours to get up to speed, patterns recovered, and decisions reversed.

Replaces public leaderboard rank

A trail to the source

A source trail on every claim

Every claim the brief makes traces back to its source. The audit trail is the proof — open the product, click any claim, and walk it back to the original record.

Auditable, not scored

Internal gates

Tier-graded test gates

Every feature is gated by a tiered test suite covering correctness, contracts, and stability. A feature ships only when its tests pass. The number of tests grows with the product; what matters is whether they pass, not how many there are.
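The gate described above can be sketched as a pass/fail check per tier. This is an illustrative sketch under stated assumptions: `TierResult` and `may_ship` are hypothetical names, and treating a tier that never ran as a failure is a design choice made here, not a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierResult:
    tier: str      # e.g. "correctness", "contracts", "stability"
    passed: int
    failed: int


def may_ship(results: list[TierResult], required_tiers: set[str]) -> bool:
    """True only if every required tier ran and reported zero failures."""
    seen = {r.tier for r in results}
    if not required_tiers <= seen:
        return False  # assumption: a tier that did not run counts as not passing
    return all(r.failed == 0 for r in results if r.tier in required_tiers)
```

Note that the check never looks at how many tests passed, only at whether any failed, which matches the point that the count of tests is not the signal.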

Continuous · pre-ship

No claim without trace

Built-in source trail

The source trail isn't a logging layer bolted on. Every search returns the path that produced the result. Every claim the system puts together links back to a source. In regulated industries that's the price of admission, not a feature.
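One way to make "no claim without trace" structural rather than a convention is to have the data type refuse a claim that has no source. The sketch below assumes hypothetical `SourceRef`, `Claim`, and `SearchResult` types; it illustrates the idea and is not Pensiv's schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    document_id: str
    locator: str  # e.g. a page, paragraph, or record key


@dataclass(frozen=True)
class Claim:
    text: str
    sources: tuple[SourceRef, ...]

    def __post_init__(self) -> None:
        # A claim with no source cannot be constructed at all.
        if not self.sources:
            raise ValueError("no claim without trace: at least one source is required")


@dataclass
class SearchResult:
    claim: Claim
    path: list[str] = field(default_factory=list)  # steps that produced the result


def walk_back(result: SearchResult) -> list[str]:
    """Return the original records behind a result, one 'document#locator' per source."""
    return [f"{s.document_id}#{s.locator}" for s in result.claim.sources]
```

Because the check lives in the constructor, the trail is not something a later logging layer has to reconstruct; an untraceable claim simply never exists.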

By design

Cognitive Infrastructure

The architecture that makes AI spend compound

Three independent research programs reached the same conclusion: curated knowledge turns a smaller, cheaper model into a better one. The knowledge layer — memory that fades by importance, same-pattern matching, automatic pattern discovery — is the differentiator. Not the model size.

Reproducible across model families

Capability	Pensiv	A typical AI search setup
Source trail on every answer	Yes	Rare
Confidence breakdown by component	Yes	No
Temporal accuracy (when did we know what)	Yes	No
Immutable write trail for audit	Yes	No
Decision-Grade Brief output (not just retrieval)	Yes	No

Research roadmap

Now

Open the brief, not the leaderboard.

The headline proof is the product UI on your documents. Run a query, judge the output. The brief is the benchmark.

Now

The plain search, measured in the open.

We're publishing a straight side-by-side: ordinary search on its own, versus search with the thinking layer on top, so you can check that the basics are solid before you take our word for the rest. The brief is what we believe matters, but the search underneath has to hold its own first.

Same-pattern matching across teams at scale.

Extending same-pattern matching across every team's knowledge so a single query finds the right precedent no matter where it lives.

Later

Prospective triggers.

Time-based and condition-based prompts — the system surfaces knowledge when its conditions match new input, without being asked.
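Since this item is still on the roadmap, the following is only a sketch of what a time-based and condition-based trigger could look like. `Trigger`, `fires`, and `surface` are hypothetical names, and the rule that a trigger with no condition and no date always fires is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class Trigger:
    note: str                                        # the knowledge to surface
    condition: Optional[Callable[[str], bool]] = None  # condition on the new input
    not_before: Optional[datetime] = None              # earliest time it may fire

    def fires(self, new_input: str, now: datetime) -> bool:
        """True when the time gate is open and the condition matches the input."""
        if self.not_before is not None and now < self.not_before:
            return False
        if self.condition is not None and not self.condition(new_input):
            return False
        return True


def surface(triggers: list[Trigger], new_input: str, now: datetime) -> list[str]:
    """Notes whose conditions match the new input, without anyone asking for them."""
    return [t.note for t in triggers if t.fires(new_input, now)]
```

A caller would run `surface` on each incoming document or message, so the stored note appears at the moment its conditions are met rather than when someone remembers to search for it.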

Later

Connection discovery across teams.

Surface non-obvious links that no single contributor was positioned to see — across departments, across time, across roles.

What three independent efforts found

A small AI with good knowledge beats a much bigger one without it — for a fraction of the running cost. The model isn't the thing holding you back. Memory is.

Want to see how it works in practice?

Technical deep-dive · See the product