Powered by BrainDrive

Meet the open-source models we already picked for you

Research-backed, transparent recommendations—built in public and fully open source, so every score is traceable down to the last quark.

Open-source

Every prompt, score, and script lives in our repo so you can rerun or fork it.

Explore the repo
Research-backed

Rubrics come from peer-reviewed work (IEEE Xplore, Springer) we translate into plain language.

Referenced, not affiliated.

Use-case focused

We score how models handle real jobs like therapy chats, finance advice, and clinical triage.

Transparent scoring

Judge choices, safeguards, and weighting rules are public so you can trust the leaderboard.

Recommendations

Our current leaderboard

Each score blends five domain runs and public judge notes. Track how we go from raw transcript to the numbers right below.

#1 · Total avg 9.08
Phi-3 Mini 4K Instruct · Hugging Face
Microsoft
Integrated score: 9.08
  • Summary: 9.87
  • Email: 8.35
  • Therapy: 9.05
  • Finance: 8.88
  • Health: 9.27
Context window: 4K · License: MIT · Languages: English
Strengths
  • Balanced performer: Most consistent across five metrics with no wild swings.
  • Excellent factual grounding: Barely hallucinates and hugs the source context.
  • Compact and efficient: Nearly matches 7B models while running cheaper and faster.
  • Clean delivery: Keeps summaries precise without extra fluff.
Weaknesses
  • Slightly rigid phrasing that can feel terse on creative prompts.
  • Stays conservative even when light speculation would help.
When to choose it

Default baseline when you need consistent, factual scoring without burning GPU hours.
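
The integrated scores shown on the cards are consistent with a simple unweighted mean of the five domain scores. A quick arithmetic check against the #1 card above (the authoritative blending rules live in the repo):

```python
# Worked check against the Phi-3 card above: its integrated score matches a
# plain unweighted mean of the five domain scores shown on the card.
phi3_domains = [9.87, 8.35, 9.05, 8.88, 9.27]  # Summary, Email, Therapy, Finance, Health
print(round(sum(phi3_domains) / len(phi3_domains), 2))  # 9.08, the integrated score above
```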

#2 · Total avg 8.87
Mistral 7B Instruct v0.3 · Hugging Face
Mistral
Integrated score: 8.87
  • Summary: 9.94
  • Email: 7.94
  • Therapy: 8.57
  • Finance: 8.58
  • Health: 9.33
Context window: 8K · License: Apache-2.0 · Languages: Multilingual
Strengths
  • High linguistic fluidity with vivid, human-like phrasing.
  • Robust comprehension that keeps longer contexts coherent.
  • Reads nuanced instructions accurately and follows them closely.
Weaknesses
  • Moderate hallucination risk on reasoning-heavy tasks.
  • Tends to over-summarize and flatten fine-grained distinctions.
  • Coverage trails Phi-3 on exhaustive detail capture.
When to choose it

Pick it when tone, flow, and human-like phrasing matter just as much as correctness.

#3 · Total avg 8.79
OpenHermes 2.5 (Mistral 7B) · Hugging Face
Teknium
Integrated score: 8.79
  • Summary: 9.73
  • Email: 8.02
  • Therapy: 8.10
  • Finance: 8.70
  • Health: 9.40
Context window: 8K · License: Apache-2.0 · Languages: English
Strengths
  • Reasoning depth that shines on multi-step inference prompts.
  • Bias and toxicity controls keep even edgy topics neutral.
  • Stylistically adaptive—shifts from formal to conversational without fuss.
Weaknesses
  • Less consistent factual grounding versus Phi-3.
  • Can ramble, adding extra sentences you did not ask for.
When to choose it

Use when you want nuanced, opinionated takes that still stay within alignment guardrails.

Domain Hubs

Five specialist evaluators

Each domain is purpose-built, with rubrics tuned to the workflows BrainDrive teams run in production. Explore the latest leaders and the checks that matter.

Therapy

Safety checks (CareLock) grade therapy-style chats on empathy, boundaries, and support before a score is issued.

Judges & orchestration

GPT-4o and Claude 3.5 Sonnet score each response with CareLock guardrails. Hard fails stop unsafe replies from ranking.

Key metrics

  • Empathy & rapport
  • Emotional relevance
  • Boundary awareness (hard gate)
  • Ethical safety (hard gate)
  • Adaptability & support
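
To illustrate, here is a minimal sketch of how the two hard-gated metrics above could block a response from ranking. The threshold, field names, and scoring rule are assumptions for illustration, not the actual CareLock implementation.

```python
# Illustrative CareLock-style hard gating. Field names and the 7.0 threshold
# are assumptions, not the real configuration.
HARD_GATES = {"boundary_awareness", "ethical_safety"}
GATE_THRESHOLD = 7.0  # assumed minimum score for a gated metric to pass

def therapy_score(metric_scores: dict[str, float]) -> float | None:
    """Return the averaged therapy score, or None if any hard gate fails."""
    for gate in HARD_GATES:
        if metric_scores.get(gate, 0.0) < GATE_THRESHOLD:
            return None  # hard fail: the reply never reaches the leaderboard
    return round(sum(metric_scores.values()) / len(metric_scores), 2)

print(therapy_score({
    "empathy_rapport": 9.1,
    "emotional_relevance": 8.8,
    "boundary_awareness": 9.0,
    "ethical_safety": 6.2,   # below the gate, so the whole response is blocked
    "adaptability_support": 8.9,
}))  # None
```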

Email

EmailEval checks outreach messages for clarity, length, spam risk, personalization, tone, and hygiene.

Judges & orchestration

GPT-4o and Claude 3.5 Sonnet judge clarity and tone while heuristics catch spam signals and sloppy hygiene issues.

Key metrics

  • Clarity & ask framing
  • Length & pacing
  • Spam & deliverability risk
  • Personalization density
  • Tone & hygiene
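
A minimal sketch of what heuristic spam and hygiene checks can look like. The trigger phrases and thresholds below are illustrative assumptions, not EmailEval's actual rule set.

```python
import re

# Illustrative spam/deliverability heuristics. Signals and limits are assumed
# for the sketch, not taken from EmailEval.
SPAM_TRIGGERS = re.compile(r"\b(act now|100% free|guaranteed|no obligation)\b", re.I)

def spam_risk_flags(email_body: str) -> list[str]:
    """Return heuristic spam/deliverability flags for an outreach email."""
    flags = []
    if SPAM_TRIGGERS.search(email_body):
        flags.append("spam trigger phrase")
    if email_body.count("!") > 3:
        flags.append("excessive exclamation marks")
    letters = [c for c in email_body if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.3:
        flags.append("shouty capitalization")
    if len(email_body.split()) > 250:
        flags.append("too long for cold outreach")
    return flags

print(spam_risk_flags("ACT NOW!!! 100% free consultation, guaranteed results!!!"))
```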

Finance

FinanceEval checks advisor-style replies for trust, accuracy, plain-language explainers, client-first tone, and risk safety.

Judges & orchestration

GPT-4o, Claude 3.5 Sonnet, and rule-based guards flag missing disclosures, bias, or risky advice.

Key metrics

  • Trust & transparency
  • Competence & accuracy
  • Explainability
  • Client-centeredness
  • Risk safety
  • Communication clarity
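
For illustration, a toy version of a missing-disclosure guard. The phrase lists are assumptions, not FinanceEval's actual rules.

```python
# Illustrative rule-based disclosure guard. Both phrase lists are assumed
# for the sketch, not taken from FinanceEval.
DISCLOSURE_PHRASES = ("not financial advice", "past performance", "consult a licensed")
ADVICE_MARKERS = ("you should buy", "invest in", "allocate", "sell your")

def missing_disclosure(reply: str) -> bool:
    """Flag advisor-style replies that give advice without any disclosure language."""
    text = reply.lower()
    gives_advice = any(marker in text for marker in ADVICE_MARKERS)
    has_disclosure = any(phrase in text for phrase in DISCLOSURE_PHRASES)
    return gives_advice and not has_disclosure

print(missing_disclosure("You should buy tech ETFs with your entire bonus."))  # True
```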

Health

HealthEval grades clinical answers on transparency, escalation safety, empathy, clarity, plan quality, and user agency.

Judges & orchestration

CareLock health guards with GPT-4o and Claude 3.5 Sonnet enforce escalation and ethics before scores count.

Key metrics

  • Evidence transparency
  • Clinical safety
  • Empathy
  • Clarity
  • Plan quality
  • Trust & agency
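
A minimal sketch of an escalation gate: if the user reports a red-flag symptom, the reply must escalate before its score counts. The red-flag terms and required phrasing are assumptions, not the actual CareLock health guards.

```python
# Illustrative escalation gate. Term lists are assumed for the sketch,
# not taken from the CareLock health guards.
RED_FLAGS = ("chest pain", "shortness of breath", "suicidal", "severe bleeding")
ESCALATION_PHRASES = ("call 911", "emergency", "seek immediate care", "urgent care")

def passes_escalation_gate(user_message: str, model_reply: str) -> bool:
    """Require escalation language whenever the user mentions a red-flag symptom."""
    has_red_flag = any(term in user_message.lower() for term in RED_FLAGS)
    escalates = any(phrase in model_reply.lower() for phrase in ESCALATION_PHRASES)
    return (not has_red_flag) or escalates

print(passes_escalation_gate(
    "I've had chest pain and shortness of breath since this morning.",
    "Try some herbal tea and rest for a few days.",
))  # False: red-flag symptom without escalation, so the score would not count
```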

Summaries

SummEval uses TwinLock (deterministic) and JudgeLock (LLM cross-check) to score coverage, alignment, hallucination, relevance, and bias.

Judges & orchestration

TwinLock rule-based scoring pairs with a GPT-4o + Claude 3.5 Sonnet ensemble to cross-check every summary.

Key metrics

  • Coverage
  • Intent alignment
  • Hallucination control
  • Topical relevance
  • Bias & toxicity
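
A minimal sketch of pairing a deterministic score with an LLM-judge cross-check. The blending rule and disagreement threshold are assumptions, not how TwinLock and JudgeLock are actually combined.

```python
from statistics import mean

# Illustrative blend of rule-based scores with an LLM-judge ensemble.
# The equal-weight mean and the 1.5-point disagreement limit are assumptions.
DISAGREEMENT_LIMIT = 1.5  # flag a metric for review if judges drift this far apart

def summary_score(twinlock: dict[str, float], judgelock: dict[str, list[float]]) -> dict:
    """Blend deterministic metric scores with an ensemble of LLM-judge scores."""
    blended, flagged = {}, []
    for metric, rule_score in twinlock.items():
        judge_scores = judgelock[metric]          # e.g. [GPT-4o, Claude 3.5 Sonnet]
        if max(judge_scores) - min(judge_scores) > DISAGREEMENT_LIMIT:
            flagged.append(metric)                # judges disagree: needs a human look
        blended[metric] = round(mean([rule_score, *judge_scores]), 2)
    return {"metrics": blended, "needs_review": flagged}

print(summary_score(
    {"coverage": 9.1, "hallucination_control": 8.4},
    {"coverage": [9.3, 9.0], "hallucination_control": [8.8, 6.9]},
))
```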
Methodology

How we score models

Every evaluation follows the same four clear steps. Dig into the full framework or watch the walkthrough if you want the deep dive.

Start with real tasks

We gather raw transcripts and prompts from real teams, then map the exact jobs they expect models to do.

Build rubrics from research

We translate peer-reviewed rubrics (IEEE Xplore, Springer, clinical studies) into plain-language, reproducible checklists.
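
To make that concrete, here is one plausible shape for such a checklist. The field names and the example criterion are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass

# Illustrative shape of a plain-language rubric criterion. Fields and the
# example values are assumptions; the real rubrics live in the repo.
@dataclass
class Criterion:
    name: str            # plain-language label shown alongside every score
    question: str        # what the judge is actually asked
    weight: float        # contribution to the domain score
    hard_gate: bool      # if True, a failing score blocks the response outright
    source: str          # the peer-reviewed work the check was translated from

boundary_awareness = Criterion(
    name="Boundary awareness",
    question="Does the reply stay within a supportive, non-clinical role?",
    weight=0.2,
    hard_gate=True,
    source="Peer-reviewed therapy-evaluation literature (see repo citations)",
)
```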

Judge with strong models

We pair GPT-4o and Claude 3.5 Sonnet with rule-based safeties that catch hallucinations, bias, and unsafe advice before a response can rank.

Share scores & notes

We publish prompts, scores, judge notes, and rerun scripts so you can reproduce everything or fork it.

Evaluation Flow

From raw transcripts to leaderboard insights

Here’s how we turn messy transcripts into ranked recommendations you can trust without touching a single spreadsheet.

We run every candidate through the same pipeline—collect the raw chat, blast it through domain evaluators, cross-check the judge notes, then normalize it onto the public board. You sip coffee, we spin atoms and beam you the trusted pick.

  1. Collect

    Drop in the transcript or prompt. We log model info, context window, and any guardrails used by your stack.

  2. Evaluate

    Domain evaluators run the task with our judge stack. Hard fails block unsafe or off-target answers in real time.

  3. Synthesize

    We add up the metric scores, capture judge notes, and flag anything you should double-check before deployment.

  4. Rank

    Scores normalize to a single leaderboard so you can compare models side-by-side—or defend a switch to your team. A minimal ranking sketch follows this list.
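
One plausible version of that final ranking step, assuming every domain score is already on a shared 0-10 scale. The actual normalization rules live in the repo and may differ.

```python
# Minimal ranking sketch: collapse each model's domain scores into one number
# and sort. Assumes scores already share a 0-10 scale (an assumption for the sketch).
def rank_models(domain_scores: dict[str, dict[str, float]]) -> list[tuple[int, str, float]]:
    """Return (rank, model, integrated score) tuples, best first."""
    totals = {
        model: round(sum(scores.values()) / len(scores), 2)
        for model, scores in domain_scores.items()
    }
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model, score) for rank, (model, score) in enumerate(ordered, start=1)]

leaderboard = rank_models({
    "Phi-3 Mini 4K Instruct": {"summary": 9.87, "email": 8.35, "therapy": 9.05,
                               "finance": 8.88, "health": 9.27},
    "Mistral 7B Instruct v0.3": {"summary": 9.94, "email": 7.94, "therapy": 8.57,
                                 "finance": 8.58, "health": 9.33},
})
for rank, model, score in leaderboard:
    print(f"#{rank} {model}: {score}")
```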

Join the BrainDrive community

ModelMatch is open-source and powered by community feedback. Explore the docs, rerun evaluations, and help shape the next wave of trustworthy AI evaluations.