Retena Quality Dashboard

Best model today

Waiting for ASR ranking evidence.

Can we promote it?

Soniox canary needs accepted/corrected evidence before all-user expansion.

Why / why not?

🏆 ASR Model Ranking

Gold first; Silver is automated/inferred; Live is diagnostic

Soniox is the current primary ASR target for canary routing. Gold WER/CER is scored against human-accepted or human-corrected transcripts and is the only score that qualifies a model for all-user promotion. Silver WER/CER is automated consensus/referee evidence, used for ranking and triage only. Live Drift/Health shows provider failures, truncation risk, latency, and disagreement; it is a diagnostic signal, not a measure of accuracy.
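The tiering described above can be sketched in a few lines of Python. This is an illustration only: the `ModelEvidence` record, its field names, and the model names are hypothetical and not taken from the dashboard's code. It assumes ranking prefers Gold scores, falls back to Silver, and never treats Live signals as accuracy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ModelEvidence:
    """Hypothetical per-model summary; field names are illustrative."""
    model: str
    gold_wer: Optional[float] = None    # scored against human accepted/corrected text
    silver_wer: Optional[float] = None  # automated consensus/referee estimate
    live_failure_rate: Optional[float] = None  # diagnostic only, never ranked as accuracy


def ranking_key(ev: ModelEvidence) -> Tuple[int, float]:
    """Sort key: Gold-scored models first, then Silver, then unscored models."""
    if ev.gold_wer is not None:
        return (0, ev.gold_wer)
    if ev.silver_wer is not None:
        return (1, ev.silver_wer)
    return (2, float("inf"))


def is_promotion_grade(ev: ModelEvidence) -> bool:
    """Only Gold evidence can justify all-user promotion."""
    return ev.gold_wer is not None


# Made-up example values, not dashboard data.
models = [
    ModelEvidence("model_a", silver_wer=0.08, live_failure_rate=0.01),
    ModelEvidence("model_b", gold_wer=0.11, silver_wer=0.09),
    ModelEvidence("model_c", live_failure_rate=0.02),
]
ranked = sorted(models, key=ranking_key)
```

Note that `model_b` ranks ahead of `model_a` even though its Gold WER is numerically higher than `model_a`'s Silver WER: under this assumption, a Silver estimate never outranks a Gold measurement.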

4o-mini backup vs MAI

Side-by-side backup transcript review. 4o-mini is the fallback baseline; MAI is a secondary shadow comparison.

Soniox / OpenAI / Voxtral

Four-column benchmark view for Soniox primary, OpenAI backups, and Voxtral candidate transcripts.

How to read it

Compare text differences quickly, but use Gold/Silver metrics in Overview to judge model quality. Disagreement alone is not WER.

Compare Coverage

Metadata only on this page; transcript text opens in account-scoped compare views

Gold / Silver Evidence

Coverage and review queue, no transcript text

Review Queue

Human review should focus on disagreement, failures, and sparse languages

Advanced Diagnostics

Legacy and raw panels moved out of the decision path

Global ASR Legacy: Previous provider matrix and validation plan.
ASR Models: Accepted eval, shadow drift, badges, and samples.
ASR Accuracy Lab: Older Gold/proxy explanation page.
Production Quality: Heuristic production transcript health, not provider ranking.
MAI Quality: MAI-only heuristic rows for shadow diagnostics.
Gold Review: Human accepted/corrected truth workflow.
Legacy Overview: Older aggregate dashboard cards and recent samples.
Legacy H2H: Old Whisper/Deepgram validation screen.

Live Drift / Health

Diagnostic only, not promotion-grade

Active Provider Ranking

Truth first; screening signals only when truth is missing

Needs Review Now

Metadata only, no transcript or audio path exposure

Advanced Diagnostics

Routing mix, schema warnings, and report readiness

🏆 Overall Model Ranking

ASR accepted eval: WER/CER lower is better

📋 Recent Samples

📏 ASR Accuracy Lab

This page keeps two ideas separate: measured accuracy from accepted/corrected reference truth, and risk proxy signals for every live voice note. Soniox is the current primary ASR canary; OpenAI 4o and 4o-mini remain backup baselines while MAI-Transcribe-1.5 remains a secondary shadow signal.

Guardrail: WER/CER require reference truth. If a row has no accepted/corrected transcript, the dashboard can flag risk, disagreement, or health only. It must not invent true WER.
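A minimal sketch of this guardrail, assuming WER and CER are computed as Levenshtein edit distance divided by reference length (the standard definition; the dashboard's own normalization and tokenization are not specified here). The function names are illustrative. When no reference exists, the functions return `None` instead of a number, so a caller cannot mistake a missing score for a measured one.

```python
from typing import Optional, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: Optional[str], hypothesis: str) -> Optional[float]:
    """Word error rate, or None when there is no reference truth."""
    if reference is None:
        return None
    ref_words = reference.split()
    if not ref_words:
        return None
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: Optional[str], hypothesis: str) -> Optional[float]:
    """Character error rate, or None when there is no reference truth."""
    if reference is None or not reference.strip():
        return None
    ref_chars = list(reference.strip())
    return edit_distance(ref_chars, list(hypothesis.strip())) / len(ref_chars)
```

For example, `wer("a b c d", "a b c")` is 0.25 (one deleted word out of four), while `wer(None, "a b c")` is `None` and should surface as "no measured score" rather than as zero.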

Measured Accuracy

Gold, canary, and human-corrected rows only

Risk Proxy

Transcript health + backup↔MAI disagreement on unlabeled rows

✅ Measured Accuracy

Raw WER/CER from accepted/corrected reference rows only

🛰️ Risk Proxy On All Rows

Use for triage, not accuracy claims

🔁 Human Truth Loop

How the proxy becomes real WER/CER

Voice note

Soniox produces the primary canary transcript.

MAI shadow

MAI runs secondary for side-by-side disagreement.

Risk triage

Health, duration, language, and backup↔MAI delta rank review priority.
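One way such a triage ranking could work is sketched below. Everything in it is an assumption made for illustration: the `VoiceRow` fields, the 0-to-1 scales, the 60-second duration cap, and especially the weights, which are placeholders rather than tuned or documented values. The sketch combines the four signals named above into one priority score and sorts the queue by it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VoiceRow:
    row_id: str
    health: float            # assumed scale: 0.0 (unhealthy) to 1.0 (healthy)
    duration_s: float        # audio length in seconds
    language: str
    backup_mai_delta: float  # assumed scale: 0.0 (agree) to 1.0 (fully disagree)


def review_priority(row: VoiceRow, labeled_per_language: Dict[str, int]) -> float:
    """Higher score means review sooner. Weights are placeholders, not tuned."""
    unhealthy = 1.0 - row.health
    # Languages with few labeled rows get a boost so sparse languages are covered.
    sparse = 1.0 / (1.0 + labeled_per_language.get(row.language, 0))
    # Longer notes carry more text at risk; cap the effect at 60 seconds.
    long_note = min(row.duration_s / 60.0, 1.0)
    return (0.4 * row.backup_mai_delta + 0.3 * unhealthy
            + 0.2 * sparse + 0.1 * long_note)


def review_queue(rows: List[VoiceRow],
                 labeled_per_language: Dict[str, int]) -> List[VoiceRow]:
    """Return rows ordered from highest to lowest review priority."""
    return sorted(rows,
                  key=lambda r: review_priority(r, labeled_per_language),
                  reverse=True)


# Made-up rows for illustration.
rows = [
    VoiceRow("a", health=0.9, duration_s=10, language="en", backup_mai_delta=0.05),
    VoiceRow("b", health=0.4, duration_s=45, language="sw", backup_mai_delta=0.6),
]
queue = review_queue(rows, {"en": 500, "sw": 2})
```

The score is a review-ordering heuristic only; it is not a WER estimate.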

Human truth

Accepted/corrected transcript becomes the reference.

Measured score

Soniox, OpenAI backups, and shadow models get raw WER/CER against the same reference.

🗺️ Deployment Roadmap

Opus 4.7 reviewed, UI-first slice today

Today: UI contract

Separate measured WER/CER from risk proxy.
Show Soniox, OpenAI backup, and MAI raw metrics when truth exists.
Keep Production Quality page intact.

Next: truth pipeline

Capture correction triples automatically.
Version a small frozen gold/canary set.
Nightly WER/CER worker by model and language.
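The aggregation step of such a nightly worker might look like the sketch below. It assumes each scored row already carries an edit-distance error count and a reference word count; the dictionary keys, model names, and the choice of micro-averaging (total errors divided by total reference words, so long notes weigh more than short ones) are assumptions, not a description of the planned worker.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def aggregate_wer(rows: Iterable[Mapping]) -> Dict[Tuple[str, str], float]:
    """Micro-averaged WER per (model, language) from per-row error counts."""
    totals = defaultdict(lambda: [0, 0])  # key -> [errors, reference words]
    for row in rows:
        key = (row["model"], row["language"])
        totals[key][0] += row["errors"]
        totals[key][1] += row["ref_words"]
    # Groups with no reference words are skipped rather than reported as 0.
    return {key: errors / words
            for key, (errors, words) in totals.items() if words > 0}


# Made-up scored rows; "errors" is the word-level edit distance to the reference.
scored_rows = [
    {"model": "model_a", "language": "en", "errors": 1, "ref_words": 10},
    {"model": "model_a", "language": "en", "errors": 3, "ref_words": 30},
    {"model": "model_b", "language": "en", "errors": 5, "ref_words": 40},
    {"model": "model_a", "language": "de", "errors": 0, "ref_words": 0},
]
nightly = aggregate_wer(scored_rows)
```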

Then: calibration

Calibrate proxy risk against labeled rows.
Alert on canary WER drift.
Use model-vs-model disagreement for active learning.
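As an illustration of the first item, calibrating a proxy against labeled rows can start with a simple binned table: group labeled rows by proxy risk and compare each bin's mean measured WER. The function name, the assumption that risk lies in [0, 1], and the equal-width bins are all hypothetical choices for this sketch.

```python
from typing import Iterable, List, Tuple


def calibration_table(pairs: Iterable[Tuple[float, float]],
                      n_bins: int = 5) -> List[Tuple[float, float, int, float]]:
    """Bin (proxy_risk, measured_wer) pairs; return (lo, hi, count, mean_wer) per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for risk, measured_wer in pairs:
        # Clamp so risk values of exactly 1.0 (or slightly out of range) stay in range.
        idx = max(0, min(int(risk * n_bins), n_bins - 1))
        sums[idx] += measured_wer
        counts[idx] += 1
    return [(i / n_bins, (i + 1) / n_bins, counts[i], sums[i] / counts[i])
            for i in range(n_bins) if counts[i] > 0]


# Made-up labeled rows: (proxy risk, measured WER against the human reference).
labeled = [(0.1, 0.05), (0.2, 0.07), (0.8, 0.30), (0.9, 0.40)]
table = calibration_table(labeled, n_bins=2)
```

If the proxy is informative, mean measured WER should rise from the low-risk bins to the high-risk bins; a flat table would suggest the proxy is not tracking accuracy.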

Accepted/corrected eval decides quality. Shadow drift measures disagreement, not truth.

✅ Accepted Eval

Accuracy against verified/corrected transcript pairs

🧪 Shadow Drift

Provider output vs current primary transcript

🏷️ Live Routing Badges

Recent voice rows

🔍 Recent Shadow Samples

Redacted comparison cards

Soniox + Voxtral Provider Compare

Your messages only; OpenAI 4o and 4o-mini are backup baselines

4o-mini Backup vs MAI Transcript Compare

4o-mini is fallback baseline; MAI is secondary comparison only

📈 Translation Quality by Model

Higher = better

Select two models to compare

🔍 Samples — click a row to expand translations

🔍 Cleanup Pass Breakdown

Why was the 2nd pass skipped or run?

📝 By Language

Avg length, bullets, improvement per language

Gold Review Sample

Review Transcription