Cohere AI Releases Cohere Transcribe: A SOTA Automatic Speech Recognition (ASR) Model Powering Enterprise Speech Intelligence

In the landscape of enterprise AI, the bridge between unstructured audio and actionable text has often been a bottleneck of proprietary APIs and complex cascaded pipelines. Today, Cohere—a company traditionally known for its text-generation and embedding models—has officially stepped into the Automatic Speech Recognition (ASR) market with the release of its latest model, ‘Cohere Transcribe’.

The Architecture: Why Conformer Matters

To understand the Cohere Transcribe model, one must look past the ‘Transformer’ label. While the model is an encoder-decoder architecture, it specifically utilizes a large Conformer encoder paired with a lightweight Transformer decoder.

A Conformer is a hybrid architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. In ASR, local features (like specific phonemes or rapid transitions in sound) are often handled better by CNNs, while global context (the meaning of the sentence) is the domain of Transformers. By interleaving these layers, Cohere’s model is designed to capture both fine-grained acoustic details and long-range linguistic dependencies.
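That interleaving can be made concrete with a minimal NumPy sketch of a single Conformer block. This is an illustrative simplification, not Cohere's implementation: the toy dimensions, single attention head, and parameter names are assumptions, though the module ordering (feed-forward, self-attention, convolution, feed-forward) follows the original Conformer paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def feed_forward(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

def self_attention(x, wq, wk, wv):
    # global context: every frame attends to every other frame
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def depthwise_conv(x, kernel):
    # local features: each frame mixes only with its near neighbours
    k = kernel.shape[0]
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([(pad[t:t + k] * kernel).sum(0) for t in range(x.shape[0])])

def conformer_block(x, p):
    # ordering from the Conformer paper: FFN -> self-attention -> conv -> FFN
    x = x + 0.5 * feed_forward(layer_norm(x), p["ff1_w1"], p["ff1_w2"])
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"])
    x = x + depthwise_conv(layer_norm(x), p["conv"])
    x = x + 0.5 * feed_forward(layer_norm(x), p["ff2_w1"], p["ff2_w2"])
    return layer_norm(x)

rng = np.random.default_rng(0)
T, d = 20, 16  # 20 acoustic frames, 16-dim features (toy sizes)
p = {name: rng.normal(0.0, 0.1, shape) for name, shape in [
    ("ff1_w1", (d, 4 * d)), ("ff1_w2", (4 * d, d)),
    ("wq", (d, d)), ("wk", (d, d)), ("wv", (d, d)),
    ("conv", (5, d)),  # kernel size 5, one filter per channel
    ("ff2_w1", (d, 4 * d)), ("ff2_w2", (4 * d, d)),
]}
out = conformer_block(rng.normal(0.0, 1.0, (T, d)), p)
print(out.shape)  # (20, 16): time and feature dims are preserved
```

The convolution sees only a five-frame neighbourhood per step, while the attention matrix spans the full sequence—exactly the local/global split described above.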

The model was trained using standard supervised cross-entropy, a classic but robust training objective that minimizes the negative log-likelihood of the ground-truth transcript under the model’s predicted token distribution.
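In toy form, the objective looks like this: average the negative log-probability the decoder assigns to each ground-truth token. The numbers are illustrative only, not Cohere's training code.

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (T, V) decoder scores per step; targets: (T,) ground-truth token ids
    logits = logits - logits.max(-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 1.8, 0.0]])  # 2 decoding steps, vocabulary of 3
targets = np.array([0, 1])            # the reference transcript's token ids
print(round(cross_entropy(logits, targets), 3))  # 0.255
```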

Performance

While some global models aim for 100+ languages with varying degrees of accuracy, Cohere has opted for a ‘quality over quantity’ approach. The model officially supports 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese, Japanese, and Korean.

Cohere positions Transcribe as a high-accuracy, production-oriented ASR model. It ranks #1 on the Hugging Face Open ASR Leaderboard (March 26, 2026) with an average WER of 5.42% across eight benchmark sets: AMI (8.13%), Earnings22 (10.86%), GigaSpeech (9.34%), LibriSpeech clean (1.25%), LibriSpeech other (2.37%), SPGISpeech (3.08%), TED-LIUM (2.49%), and VoxPopuli (5.87%). On that leaderboard average, it outperforms models such as Whisper Large v3 (7.44% WER), ElevenLabs Scribe v2 (5.83%), and Qwen3-ASR-1.7B (5.76%).

https://cohere.com/blog/transcribe
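For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The sketch below implements the standard dynamic-programming computation and also checks that the per-benchmark figures quoted above do average to the 5.42% headline number.

```python
def wer(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6: one substitution

# The reported per-benchmark WERs (%) average to the headline figure.
scores = [8.13, 10.86, 9.34, 1.25, 2.37, 3.08, 2.49, 5.87]
print(round(sum(scores) / len(scores), 2))  # 5.42
```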

The Cohere team also reports stronger human-preference results in English, where annotators preferred Transcribe over competing transcripts in head-to-head comparisons: 78% against IBM Granite 4.0 1B Speech, 67% against NVIDIA Canary Qwen 2.5B, 64% against Whisper Large v3, and 56% against Zoom Scribe v1.


Long-Form Audio: The 35-Second Rule

Handling long-form audio—such as 60-minute earnings calls or legal proceedings—presents a unique challenge for memory-intensive architectures. Cohere addresses this not through sliding-window attention, but through a robust chunking and reassembly logic.

The model is natively designed to process audio in 35-second segments. For any file exceeding this limit, the system automatically:

1. Splits the audio into overlapping chunks.

2. Processes each segment through the Conformer-Transformer pipeline.

3. Reassembles the overlapping text to ensure continuity.

This approach ensures that the model can handle a 55-minute file without exhausting GPU VRAM, provided the engineering pipeline manages the chunking orchestration correctly.
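A minimal sketch of that chunking orchestration is shown below. The 35-second segment length comes from the article; the 5-second overlap, the 8-word overlap-matching window, and the `stitch` heuristic are assumptions for illustration, not Cohere's actual reassembly logic.

```python
CHUNK_S = 35.0   # native segment length from the article
OVERLAP_S = 5.0  # overlap between consecutive windows (assumed value)

def chunk_spans(duration_s):
    """Return (start, end) times covering the file with overlapping windows."""
    spans, start, step = [], 0.0, CHUNK_S - OVERLAP_S
    while start < duration_s:
        spans.append((start, min(start + CHUNK_S, duration_s)))
        start += step
    return spans

def stitch(chunk_texts, max_overlap_words=8):
    """Reassemble transcripts by dropping words duplicated in the overlap."""
    out = chunk_texts[0].split()
    for text in chunk_texts[1:]:
        words = text.split()
        # longest suffix of the running transcript that prefixes the next chunk
        k = 0
        for n in range(min(max_overlap_words, len(out), len(words)), 0, -1):
            if out[-n:] == words[:n]:
                k = n
                break
        out.extend(words[k:])
    return " ".join(out)

spans = chunk_spans(55 * 60)  # a 55-minute earnings call
print(len(spans), spans[:2])  # 110 windows, starting (0, 35) and (30, 65)

print(stitch(["alpha beta gamma delta", "gamma delta epsilon zeta"]))
# -> "alpha beta gamma delta epsilon zeta"
```

Only one 35-second window is resident at a time, which is why peak memory stays flat regardless of file length.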

Key Takeaways

State-of-the-Art Accuracy: The model launched at #1 on the Hugging Face Open ASR Leaderboard (March 26, 2026) with an average Word Error Rate (WER) of 5.42%. It outperforms established models like Whisper Large v3 (7.44%) and IBM Granite 4.0 (5.52%) across benchmarks including LibriSpeech, Earnings22, and TED-LIUM.

Hybrid Conformer Architecture: Unlike standard pure-Transformer models, Transcribe utilizes a large Conformer encoder paired with a lightweight Transformer decoder. This hybrid design allows the model to efficiently capture both local acoustic features (via convolution) and global linguistic context (via self-attention).

Automated Long-Form Handling: To maintain memory efficiency and stability, the model uses a native 35-second chunking logic. It automatically segments audio longer than 35 seconds into overlapping chunks and reassembles them, allowing it to process extended recordings—like 55-minute earnings calls—without performance degradation.

Defined Technical Constraints: The model is a pure ASR tool and does not natively feature speaker diarization or timestamps. It supports 14 specific languages and performs best when the target language is pre-defined, as it does not include explicit automatic language detection or optimized support for code-switching.

Check out the technical details and the model weights on Hugging Face.
The post Cohere AI Releases Cohere Transcribe: A SOTA Automatic Speech Recognition (ASR) Model Powering Enterprise Speech Intelligence appeared first on MarkTechPost.