Powered by BrainDrive

Meet the open-source models we already picked for you

Research-backed, transparent recommendations—built in public and fully open source, so every score is traceable down to the last quark.

Open-source

Every prompt, score, and script lives in our repo so you can rerun or fork it.

Explore the repo
Research-backed

Rubrics come from peer-reviewed work (IEEE Xplore, Springer) we translate into plain language.

Referenced, not affiliated.

Use-case focused

We score how models handle real jobs like therapy chats, finance advice, and clinical triage.

Transparent scoring

Judge choices, safeguards, and weighting rules are public so you can trust the leaderboard.

Recommendations

Our current leaderboard

Each score blends five domain runs and public judge notes. Track how we go from raw transcript to the numbers right below.

#1 · Total avg 9.08
Phi-3 Mini 4K Instruct · Hugging Face
Microsoft
Integrated score: 9.08
  • Summary: 9.87
  • Email: 8.35
  • Therapy: 9.05
  • Finance: 8.88
  • Health: 9.27
Context window: 4K · License: MIT · Languages: English
Strengths
  • Balanced performer: Most consistent across five metrics with no wild swings.
  • Excellent factual grounding: Barely hallucinates and hugs the source context.
  • Compact and efficient: Nearly matches 7B models while running cheaper and faster.
  • Clean delivery: Keeps summaries precise without extra fluff.
Weaknesses
  • Slightly rigid phrasing that can feel terse on creative prompts.
  • Stays conservative even when light speculation would help.
When to choose it

Default baseline when you need consistent, factual scoring without burning GPU hours.
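
The integrated scores shown on the cards are consistent with a simple unweighted mean of the five domain scores. A quick arithmetic check against the #1 card above (the authoritative blending rules live in the repo):

```python
# Worked check against the Phi-3 card above: its integrated score matches a
# plain unweighted mean of the five domain scores shown on the card.
phi3_domains = [9.87, 8.35, 9.05, 8.88, 9.27]  # Summary, Email, Therapy, Finance, Health
print(round(sum(phi3_domains) / len(phi3_domains), 2))  # 9.08, the integrated score above
```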

#2 · Total avg 8.87
Mistral 7B Instruct v0.3 · Hugging Face
Mistral
Integrated score: 8.87
  • Summary: 9.94
  • Email: 7.94
  • Therapy: 8.57
  • Finance: 8.58
  • Health: 9.33
Context window: 8K · License: Apache-2.0 · Languages: Multilingual
Strengths
  • High linguistic fluidity with vivid, human-like phrasing.
  • Robust comprehension that keeps longer contexts coherent.
  • Reads nuanced instructions accurately and follows them closely.
Weaknesses
  • Moderate hallucination risk on reasoning-heavy tasks.
  • Tends to over-summarize and flatten fine-grained distinctions.
  • Coverage trails Phi-3 on exhaustive detail capture.
When to choose it

Pick it when tone, flow, and human-like phrasing matter just as much as correctness.

#3 · Total avg 8.79
OpenHermes 2.5 (Mistral 7B) · Hugging Face
Teknium
Integrated score: 8.79
  • Summary: 9.73
  • Email: 8.02
  • Therapy: 8.10
  • Finance: 8.70
  • Health: 9.40
Context window: 8K · License: Apache-2.0 · Languages: English
Strengths
  • Reasoning depth that shines on multi-step inference prompts.
  • Bias and toxicity controls keep even edgy topics neutral.
  • Stylistically adaptive—shifts from formal to conversational without fuss.
Weaknesses
  • Less consistent factual grounding versus Phi-3.
  • Can ramble, adding extra sentences you did not ask for.
When to choose it

Use when you want nuanced, opinionated takes that still stay within alignment guardrails.

Domain Hubs

Five specialist evaluators

Each domain is purpose-built, with rubrics tuned to the workflows BrainDrive teams run in production. Explore the latest leaders and the checks that matter.

Therapy

Safety checks (CareLock) grade therapy-style chats on empathy, boundaries, and support before a score is issued.

Judges & orchestration

GPT-4o and Claude 3.5 Sonnet score each response with CareLock guardrails. Hard fails stop unsafe replies from ranking.

Key metrics

  • Empathy & rapport
  • Emotional relevance
  • Boundary awareness (hard gate)
  • Ethical safety (hard gate)
  • Adaptability & support
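
To illustrate, here is a minimal sketch of how the two hard-gated metrics above could block a response from ranking. The threshold, field names, and scoring rule are assumptions for illustration, not the actual CareLock implementation.

```python
# Illustrative CareLock-style hard gating. Field names and the 7.0 threshold
# are assumptions, not the real configuration.
HARD_GATES = {"boundary_awareness", "ethical_safety"}
GATE_THRESHOLD = 7.0  # assumed minimum score for a gated metric to pass

def therapy_score(metric_scores: dict[str, float]) -> float | None:
    """Return the averaged therapy score, or None if any hard gate fails."""
    for gate in HARD_GATES:
        if metric_scores.get(gate, 0.0) < GATE_THRESHOLD:
            return None  # hard fail: the reply never reaches the leaderboard
    return round(sum(metric_scores.values()) / len(metric_scores), 2)

print(therapy_score({
    "empathy_rapport": 9.1,
    "emotional_relevance": 8.8,
    "boundary_awareness": 9.0,
    "ethical_safety": 6.2,   # below the gate, so the whole response is blocked
    "adaptability_support": 8.9,
}))  # None
```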

Email

EmailEval checks outreach messages for clarity, length, spam risk, personalization, tone, and hygiene.

Judges & orchestration

GPT-4o and Claude 3.5 Sonnet judge clarity and tone while heuristics catch spam signals and sloppy hygiene issues.

Key metrics

  • Clarity & ask framing
  • Length & pacing
  • Spam & deliverability risk
  • Personalization density
  • Tone & hygiene
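
A minimal sketch of what heuristic spam and hygiene checks can look like. The trigger phrases and thresholds below are illustrative assumptions, not EmailEval's actual rule set.

```python
import re

# Illustrative spam/deliverability heuristics. Signals and limits are assumed
# for the sketch, not taken from EmailEval.
SPAM_TRIGGERS = re.compile(r"\b(act now|100% free|guaranteed|no obligation)\b", re.I)

def spam_risk_flags(email_body: str) -> list[str]:
    """Return heuristic spam/deliverability flags for an outreach email."""
    flags = []
    if SPAM_TRIGGERS.search(email_body):
        flags.append("spam trigger phrase")
    if email_body.count("!") > 3:
        flags.append("excessive exclamation marks")
    letters = [c for c in email_body if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.3:
        flags.append("shouty capitalization")
    if len(email_body.split()) > 250:
        flags.append("too long for cold outreach")
    return flags

print(spam_risk_flags("ACT NOW!!! 100% free consultation, guaranteed results!!!"))
```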

Finance

FinanceEval checks advisor-style replies for trust, accuracy, plain-language explainers, client-first tone, and risk safety.

Judges & orchestration

GPT-4o, Claude 3.5 Sonnet, and rule-based guards flag missing disclosures, bias, or risky advice.

Key metrics

  • Trust & transparency
  • Competence & accuracy
  • Explainability
  • Client-centeredness
  • Risk safety
  • Communication clarity
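
For illustration, a toy version of a missing-disclosure guard. The phrase lists are assumptions, not FinanceEval's actual rules.

```python
# Illustrative rule-based disclosure guard. Both phrase lists are assumed
# for the sketch, not taken from FinanceEval.
DISCLOSURE_PHRASES = ("not financial advice", "past performance", "consult a licensed")
ADVICE_MARKERS = ("you should buy", "invest in", "allocate", "sell your")

def missing_disclosure(reply: str) -> bool:
    """Flag advisor-style replies that give advice without any disclosure language."""
    text = reply.lower()
    gives_advice = any(marker in text for marker in ADVICE_MARKERS)
    has_disclosure = any(phrase in text for phrase in DISCLOSURE_PHRASES)
    return gives_advice and not has_disclosure

print(missing_disclosure("You should buy tech ETFs with your entire bonus."))  # True
```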

Health

HealthEval grades clinical answers on transparency, escalation safety, empathy, clarity, plan quality, and user agency.

Judges & orchestration

CareLock health guards with GPT-4o and Claude 3.5 Sonnet enforce escalation and ethics before scores count.

Key metrics

  • Evidence transparency
  • Clinical safety
  • Empathy
  • Clarity
  • Plan quality
  • Trust & agency
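
A minimal sketch of an escalation gate: if the user reports a red-flag symptom, the reply must escalate before its score counts. The red-flag terms and required phrasing are assumptions, not the actual CareLock health guards.

```python
# Illustrative escalation gate. Term lists are assumed for the sketch,
# not taken from the CareLock health guards.
RED_FLAGS = ("chest pain", "shortness of breath", "suicidal", "severe bleeding")
ESCALATION_PHRASES = ("call 911", "emergency", "seek immediate care", "urgent care")

def passes_escalation_gate(user_message: str, model_reply: str) -> bool:
    """Require escalation language whenever the user mentions a red-flag symptom."""
    has_red_flag = any(term in user_message.lower() for term in RED_FLAGS)
    escalates = any(phrase in model_reply.lower() for phrase in ESCALATION_PHRASES)
    return (not has_red_flag) or escalates

print(passes_escalation_gate(
    "I've had chest pain and shortness of breath since this morning.",
    "Try some herbal tea and rest for a few days.",
))  # False: red-flag symptom without escalation, so the score would not count
```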

Summaries

SummEval uses TwinLock (deterministic) and JudgeLock (LLM cross-check) to score coverage, alignment, hallucination, relevance, and bias.

Judges & orchestration

TwinLock rule-based scoring pairs with a GPT-4o + Claude 3.5 Sonnet ensemble to cross-check every summary.

Key metrics

  • Coverage
  • Intent alignment
  • Hallucination control
  • Topical relevance
  • Bias & toxicity
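
A minimal sketch of pairing a deterministic score with an LLM-judge cross-check. The blending rule and disagreement threshold are assumptions, not how TwinLock and JudgeLock are actually combined.

```python
from statistics import mean

# Illustrative blend of rule-based scores with an LLM-judge ensemble.
# The equal-weight mean and the 1.5-point disagreement limit are assumptions.
DISAGREEMENT_LIMIT = 1.5  # flag a metric for review if judges drift this far apart

def summary_score(twinlock: dict[str, float], judgelock: dict[str, list[float]]) -> dict:
    """Blend deterministic metric scores with an ensemble of LLM-judge scores."""
    blended, flagged = {}, []
    for metric, rule_score in twinlock.items():
        judge_scores = judgelock[metric]          # e.g. [GPT-4o, Claude 3.5 Sonnet]
        if max(judge_scores) - min(judge_scores) > DISAGREEMENT_LIMIT:
            flagged.append(metric)                # judges disagree: needs a human look
        blended[metric] = round(mean([rule_score, *judge_scores]), 2)
    return {"metrics": blended, "needs_review": flagged}

print(summary_score(
    {"coverage": 9.1, "hallucination_control": 8.4},
    {"coverage": [9.3, 9.0], "hallucination_control": [8.8, 6.9]},
))
```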
Methodology

How we score models

Every evaluation follows the same four clear steps. Dig into the full framework or watch the walkthrough if you want the deep dive.

Start with real tasks

We gather raw transcripts and prompts from real teams, then map the exact jobs they expect models to do.

Build rubrics from research

We translate peer-reviewed rubrics (IEEE Xplore, Springer, clinical studies) into plain-language, reproducible checklists.
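
To make that concrete, here is one plausible shape for such a checklist. The field names and the example criterion are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass

# Illustrative shape of a plain-language rubric criterion. Fields and the
# example values are assumptions; the real rubrics live in the repo.
@dataclass
class Criterion:
    name: str            # plain-language label shown alongside every score
    question: str        # what the judge is actually asked
    weight: float        # contribution to the domain score
    hard_gate: bool      # if True, a failing score blocks the response outright
    source: str          # the peer-reviewed work the check was translated from

boundary_awareness = Criterion(
    name="Boundary awareness",
    question="Does the reply stay within a supportive, non-clinical role?",
    weight=0.2,
    hard_gate=True,
    source="Peer-reviewed therapy-evaluation literature (see repo citations)",
)
```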

Judge with strong models

We pair GPT-4o and Claude 3.5 Sonnet with rule-based safeties that catch hallucinations, bias, and unsafe advice before a response can rank.

Share scores & notes

We publish prompts, scores, judge notes, and rerun scripts so you can reproduce everything or fork it.

Evaluation Flow

From raw transcripts to leaderboard insights

Here’s how we turn messy transcripts into ranked recommendations you can trust without touching a single spreadsheet.

We run every candidate through the same pipeline—collect the raw chat, blast it through domain evaluators, cross-check the judge notes, then normalize it onto the public board. You sip coffee, we spin atoms and beam you the trusted pick.

  1. Collect

    Drop in the transcript or prompt. We log model info, context window, and any guardrails used by your stack.

  2. Evaluate

    Domain evaluators run the task with our judge stack. Hard fails block unsafe or off-target answers in real time.

  3. Synthesize

    We add up the metric scores, capture judge notes, and flag anything you should double-check before deployment.

  4. Rank

    Scores normalize to a single leaderboard so you can compare models side-by-side—or defend a switch to your team. A minimal ranking sketch follows this list.
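
One plausible version of that final ranking step, assuming every domain score is already on a shared 0-10 scale. The actual normalization rules live in the repo and may differ.

```python
# Minimal ranking sketch: collapse each model's domain scores into one number
# and sort. Assumes scores already share a 0-10 scale (an assumption for the sketch).
def rank_models(domain_scores: dict[str, dict[str, float]]) -> list[tuple[int, str, float]]:
    """Return (rank, model, integrated score) tuples, best first."""
    totals = {
        model: round(sum(scores.values()) / len(scores), 2)
        for model, scores in domain_scores.items()
    }
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(rank, model, score) for rank, (model, score) in enumerate(ordered, start=1)]

leaderboard = rank_models({
    "Phi-3 Mini 4K Instruct": {"summary": 9.87, "email": 8.35, "therapy": 9.05,
                               "finance": 8.88, "health": 9.27},
    "Mistral 7B Instruct v0.3": {"summary": 9.94, "email": 7.94, "therapy": 8.57,
                                 "finance": 8.58, "health": 9.33},
})
for rank, model, score in leaderboard:
    print(f"#{rank} {model}: {score}")
```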

Join the BrainDrive community

ModelMatch is open-source and powered by community feedback. Explore the docs, rerun evaluations, and help shape the next wave of trustworthy AI evaluations.