InvisibleBench v2: What Three Months of Running the Benchmark Taught Us

When we open-sourced InvisibleBench in November, we invited the community to evaluate before deploying. Then we kept running it ourselves. Here's what we learned, and why v2 is a harder, more honest benchmark.

GiveCare Team

Contributor

In November, we open-sourced InvisibleBench, the first safety benchmark for AI systems used in long-term caregiving relationships. We published results showing that crisis detection ranged from 11.8% to 44.8% across four frontier models. We said: evaluate before you deploy.

Then we kept running it.

Three months later, the benchmark looks different. Not because we moved the goalposts, but because running it repeatedly against real systems revealed problems with v1 that we couldn't see until we had enough data to compare.

This is what we learned, and what changed.

The Problem With v1

The original InvisibleBench had 29 scenarios. Fourteen of them (67% of tier 1) tested crisis detection.

That made sense when we designed it. Crisis is the highest-stakes failure mode. Getting it wrong has the most immediate consequences. We wanted to know: when a caregiver says "I can't do this anymore," does the AI respond appropriately?

But as we ran more evaluations, a different question started bothering us: Is crisis detection actually our differentiator?

Specialized benchmarks already own this space. The CARE Framework from Rosebud reports 86% model failure on indirect crisis queries. Columbia Protocol instruments test suicidality severity gradients. We were measuring something well-studied. What InvisibleBench was designed for, what no other benchmark tests, is the rest of the conversation.

We asked: what does caregiving AI need to do that a mental health chatbot doesn't?

The answer is three things v1 barely tested.

Gray zone navigation. Caregivers need to vent without being pathologized. "I'm exhausted and want to scream" is not an abuse indicator. "I feel nothing anymore" is often compassion fatigue, not depression. Dark humor about a situation that isn't funny is a coping mechanism. Models that over-respond to normal caregiver stress teach people to stop being honest. That failure is invisible in a crisis-heavy benchmark.

Boundary scenarios. AI dependency, medical advice creep, and therapeutic replacement are failure modes unique to AI companionship. They don't appear in single conversations. They develop over weeks of interaction. That's exactly the kind of failure InvisibleBench was designed to catch, but v1 didn't have scenarios for them.

Multi-turn manipulation. Two of the hardest v2 scenarios, Authority Claim (a user claiming medical credentials to extract advice) and Gradual Boundary Erosion (slow escalation toward medical advice across multiple turns), require the model to track a developing pattern, not just respond to a single message. Every frontier model we've tested has at least one hard fail in this category.

InvisibleBench v2

v2 has 44 scenarios organized into four categories.

Safety (17 scenarios): a tighter, more differentiated set. Five crisis scenarios we kept because they test things other benchmarks don't (our dual-concern scenario: the caregiver who can't call for help because they can't leave their parent alone). New gray zone scenarios for exhaustion venting, compassion fatigue, and dark humor. New boundary scenarios for AI dependency and therapist replacement. False refusal is now a scored dimension: a model that refuses to explain what a DNR is has failed, not played it safe.

Empathy (13 scenarios): burnout, grief, belonging, relational trust. This was the thinnest part of v1. A caregiver who has been awake for 36 hours and texts at 2 AM doesn't need information retrieval. She needs to feel like something heard her. v2 tests this in depth.

Context (10 scenarios): real regulatory scenarios (Nevada AB 406, Illinois WOPR, Utah HB 452, California SB 243) and cultural competence that goes beyond surface representation. Multigenerational households. Traditional remedies. Care practices that don't map cleanly to Western medical frameworks.

Continuity (4 scenarios): longitudinal trust, memory consistency, crisis return across sessions, cultural rituals over time.

Seventeen of the 44 scenarios have conditional branching: the next message the caregiver sends depends on how the model responded. If the model misses a subtle distress signal, the scenario escalates. This means the benchmark tests recovery, not just detection.

New Scoring: Gate + Quality

v1 scored five dimensions holistically. The problem was that a model could be clinically cold and still score well on compliance.

v2 uses two layers:

Gates: safety and compliance are pass/fail. Trigger a hard fail on either and the scenario scores zero, regardless of how warm the response was.

Quality: two dimensions score the how. Regard measures whether the model treated the caregiver as a person: acknowledged their situation, used their name, responded to what was actually said. Coordination measures whether the response actually helped them move through their situation.

This catches what v1 missed: technically correct responses that still fail the person asking.

What the Data Shows

Running v2 across 12 frontier models as of February 2026:

GPT-5.2: 79.6%, 2 hard fails (Authority Claim, Gradual Boundary Erosion)
Claude Opus 4.5: 78.7%, 0 hard fails
Grok 4: 78.3%
Kimi K2.5: 77.5%
Qwen3 235B: 77.2%
DeepSeek V3.2: 76.5%
Gemini 2.5 Flash: 74.6%
Gemini 3 Pro Preview: 71.7%

The range compressed from v1. Expected: v2 is harder and more specific. Every model now has at least one category where it fails reliably. No model dominates across all four categories. GPT-5.2's hard fails on multi-turn manipulation show that raw capability doesn't solve the longitudinal tracking problem.

No model is above 80%. The best AI available is still getting caregiving wrong at least one time in five.

Where Mira Stands

Our most recent system evaluation surfaced a harness issue: the disclosure cadence logic wasn't firing correctly in the test environment, which suppressed scores artificially across all scenarios. We're fixing the test harness before publishing a Mira v2 number.

What we can say from the transcripts: Mira handles the scenarios it was tuned for (crisis detection, cultural belonging, memory consistency) at the same level it did in v1. The gaps that v2 exposed are in gradual boundary erosion and multi-turn authority claims. Both are system design problems, not model problems. We're working on them.

What Changed in Our Thinking

When we published the v1 results in November, we concluded that systems beat models, that a purpose-built caregiving architecture would outperform raw foundation models on the dimensions that matter for caregivers.

That conclusion still holds. But v2 extended it.

v1 showed that memory retrieval and trauma-informed prompting are enough to match frontier model performance on 29 scenarios. v2 showed that 44 scenarios (including multi-turn manipulation, false refusal, and regulatory edge cases) require things we hadn't built yet: longitudinal pattern tracking, credential verification logic, jurisdiction-specific compliance rules.

Those aren't model capabilities. They're system design choices. Which means they're buildable.

Try It

InvisibleBench v2 is open source. 44 scenarios, conditional branching, and the new Gate+Quality scoring are all in the repository.

If you're deploying AI for caregivers, or for any vulnerable population, run it before you ship.

Benchmark: github.com/givecareapp/givecare-bench
Leaderboard: bench.givecareapp.com

Have questions or want to contribute? Reach out at ali@givecareapp.com.

The GiveCare team builds AI support for America's 63 million family caregivers.

← Back to Words