Garp Independent AI & technology journalism
Tuesday, June 23, 2026



Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech



Whether in casual conversations, contact centers, or IT helpdesks, bilingual speakers fluidly switch to whichever language feels most natural in the moment. Despite the prevalence of bilingual speakers worldwide, little work has focused on how voice agents handle code-switched speech in enterprise settings. So when a customer asked us how our voice agents would perform for their largely bilingual customer base, who routinely code-switch, we decided to build our own benchmark and dataset to evaluate models.

We focused on automatic speech recognition (ASR), the first step in any voice agent pipeline, because transcription errors propagate into every downstream component. In enterprise settings, where a misrouted ticket or misunderstood policy question has real operational consequences, getting the transcript right is especially important.

Our benchmark covers the four language pairs most relevant to our customer base: Spanish-English, French-English, Canadian French-English, and German-English. Each utterance uses the non-English language as the matrix (framing) language, with English embedded in spans of varying length. The data covers a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including employee inquiries about benefits or payroll and support requests such as password resets, VPN access, or device troubleshooting.

To measure how the models perform, we report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). We chose these metrics to capture both (1) the models' exact transcription accuracy and (2) their ability to preserve the meaning of the utterance for downstream tasks.

We release our benchmark and data through AU-Harness, our harness for evaluating voice models. We also provide results from seven ASR systems, including some Large Audio Language Models (LALMs), frontier ASRs, and open-source ASRs.
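
Of the three metrics, WER has a standard definition: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below illustrates that definition only. It assumes lowercasing and whitespace tokenization, which may differ from the normalization the benchmark applies, and the function name `word_error_rate` and the sample utterance are hypothetical rather than taken from AU-Harness. SWER and AER are not defined in this article, so they are not sketched here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: text is lowercased and split on whitespace. A real
    evaluation would likely also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical Spanish-matrix utterance with embedded English words.
reference = "necesito un password reset para mi laptop"
hypothesis = "necesito un pasword reset para mi laptop"

# One substituted word out of seven reference words.
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")  # WER = 0.143
```

A single misspelled English word counts the same as any other error under WER, which is one reason a transcript-level metric alone may not show whether the meaning survived for downstream tasks.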
Our main finding is that the cost of code-switching varies with the language pair and the model tested. ElevenLabs Scribe V2, Google Gemini 3 Flash, and Assembly AI Universal 3-Pro emerge as the top models across metrics for the task.

The Benchmark Data Pipeline

We start with an internal corpus of IT support and HR interactions.