ElevenLabs charges between $5 and $330 per month for voice AI services, and every audio file you process passes through its cloud servers. For those looking for an open-source alternative to ElevenLabs, OmniVoice Studio is a good fit: a desktop application that runs the same categories of tasks locally. It is an individual project that handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization without sending data to an external server.
What OmniVoice Studio Does
The application bundles six distinct capabilities. Understanding each one helps clarify what the system is doing under the hood.
Voice cloning works from a 3-second audio clip. The system uses zero-shot learning, meaning it clones a voice it has never been trained on before. It does this by conditioning a diffusion-based TTS model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports 600+ languages.
Voice design lets you build a new voice from parameters: gender, age, accent, pitch, speed, emotion, and dialect — without cloning any existing voice.
Video dubbing takes a YouTube URL or a local video file. It runs transcription using WhisperX, translates the transcript, synthesizes new audio using the TTS engine, and exports an MP4. The entire pipeline runs locally.
The dictation widget is a system-wide floating overlay. On macOS it activates via ⌘+⇧+Space from any application. It streams transcription via WebSocket and auto-pastes the result into whatever app is in focus.
The Batch Queue lets you drop up to 50 videos and walk away, with per-job progress bars tracking each one through the full pipeline.
The MCP Server exposes OmniVoice Studio’s capabilities to any MCP client — including Claude, Cursor, or your own tooling.
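The video dubbing flow described above (transcribe, translate, synthesize, export) can be pictured as a chain of functions. The sketch below is illustrative only: every stage is a hypothetical stand-in that mimics the data flow without calling WhisperX or any TTS engine, and none of these names come from the OmniVoice Studio codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


# Hypothetical stand-ins for the real stages; each one only mimics the data flow.
def transcribe(video_path: str) -> List[Segment]:
    return [Segment(0.0, 1.5, "hello world")]


def translate(segments: List[Segment], target_lang: str) -> List[Segment]:
    # Keep the timestamps so the dubbed audio stays aligned with the video.
    return [Segment(s.start, s.end, f"[{target_lang}] {s.text}") for s in segments]


def synthesize(segments: List[Segment]) -> List[dict]:
    # One audio "clip" per segment, placed on the original timeline.
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]


def export_mp4(video_path: str, clips: List[dict]) -> str:
    # A real exporter would mux the clips into the video; this only names the output.
    return video_path.rsplit(".", 1)[0] + ".dubbed.mp4"


def dub(video_path: str, target_lang: str) -> str:
    segments = transcribe(video_path)
    translated = translate(segments, target_lang)
    clips = synthesize(translated)
    return export_mp4(video_path, clips)
```

Carrying the segment timestamps through every stage is the design point here: it is what would let the synthesized speech be placed back at the right moments in the exported file.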
The Architecture
The project uses a React frontend talking to a FastAPI backend. The backend exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite.
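As a rough illustration of what streaming updates over SSE look like on the wire, the sketch below formats progress events by hand using only the standard library. The event names (`progress`, `complete`) and payload fields are assumptions made for this example, not the project's actual schema. In a FastAPI app, a generator like this would typically be wrapped in a `StreamingResponse` with the `text/event-stream` media type.

```python
import json
from typing import Iterator, List


def format_sse(data: dict, event: str = "message") -> str:
    """Encode one SSE frame: an 'event:' line, a 'data:' line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_events(job_id: str, steps: List[str]) -> Iterator[str]:
    """Yield one frame per finished step, then a final 'complete' frame."""
    total = len(steps)
    for done, step in enumerate(steps, start=1):
        payload = {"job": job_id, "step": step, "done": done, "total": total}
        yield format_sse(payload, event="progress")
    yield format_sse({"job": job_id}, event="complete")


if __name__ == "__main__":
    for frame in progress_events("job-1", ["transcribe", "translate", "synthesize"]):
        print(frame, end="")
```

Because each frame is plain text terminated by a blank line, the browser's built-in `EventSource` can consume the stream without a WebSocket handshake, which suits one-way progress updates.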
Four core ML libraries handle the heavy work:
WhisperX handles automatic speech recognition (ASR) with word-level alignment. It supports 99 languages for transcription.
Demucs (Meta) handles source separation. It splits speech from background music and preserves both stems independently.
Pyannote handles speaker diarization — identifying which speaker said which words in a multi-speaker audio file. It is used together with WhisperX.
AudioSeal (Meta) embeds an invisible neural watermark into generated audio. This watermark survives compression and serves as AI provenance metadata.
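To make the Pyannote and WhisperX pairing above more concrete: diarization produces speaker turns with start and end times, ASR produces words with start and end times, and the two can be joined by temporal overlap. The sketch below shows that idea in plain Python. The tuple layouts and speaker labels are assumptions made for illustration, and the code does not use either library's API.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start, end, text)
Turn = Tuple[float, float, str]  # (start, end, speaker label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        # Words that overlap no turn keep None as their speaker.
        labeled.append((text, best_speaker))
    return labeled
```

Word-level alignment matters for this step: with only sentence-level timestamps, a sentence that straddles a speaker change could not be split between the two speakers.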
The desktop wrapper is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase is 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.
For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD). With 8 GB VRAM or less, TTS automatically offloads to CPU during transcription. No configuration is required.
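A minimal sketch of what this kind of auto-detection can look like, assuming the backend relies on PyTorch (an assumption; the article does not show the actual detection code). The function names are hypothetical. Note that ROCm builds of PyTorch report AMD GPUs through the same `torch.cuda` interface, so they land in the `"cuda"` branch here.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed: fall back to CPU
    if torch.cuda.is_available():
        # Covers NVIDIA CUDA and AMD ROCm builds of PyTorch.
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon via Metal
    return "cpu"


def tts_offloads_to_cpu(vram_gb: float) -> bool:
    """Mirror the rule stated above: 8 GB of VRAM or less triggers offloading."""
    return vram_gb <= 8


if __name__ == "__main__":
    print("selected device:", pick_device())
```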
Six TTS Engines, One Backend Registry
OmniVoice Studio ships a pluggable multi-engine TTS backend. You can switch engines in Settings → TTS Engine or by setting the OMNIVOICE_TTS_BACKEND environment variable.
The six built-in engines are OmniVoice (default, 600+ languages), CosyVoice 3 (9 languages plus 18 dialects, Apache-2.0), MLX-Audio (Apple Silicon-only, includes Kokoro and Qwen3-TTS among others), VoxCPM2 (30 languages, Apache-2.0), MOSS-TTS-Nano (20 languages, runs realtime on CPU), and KittenTTS (English-only, CPU-only, MIT).
Adding a custom engine takes roughly 50 lines of Python. You subclass TTSBackend in backend/services/tts_backend.py and register it in the _REGISTRY dictionary at the bottom of that file.
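The subclass-and-register pattern described above might look roughly like the following. This is a self-contained sketch, not the project's code: the `TTSBackend` class here is a stand-in, and the `synthesize` method name, its signature, the `SilenceBackend` toy engine, and the `get_backend` helper are all hypothetical. Only the names `TTSBackend`, `_REGISTRY`, and `OMNIVOICE_TTS_BACKEND` come from the article.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Type


class TTSBackend(ABC):
    """Stand-in base class; the real one in tts_backend.py may differ."""

    @abstractmethod
    def synthesize(self, text: str, language: str) -> bytes:
        ...


class SilenceBackend(TTSBackend):
    """Toy engine: 16-bit PCM silence, 10 ms per character at 16 kHz."""

    def synthesize(self, text: str, language: str) -> bytes:
        samples = 160 * len(text)  # 160 samples = 10 ms at 16 kHz
        return b"\x00\x00" * samples


# Registry mapping a backend name to its class.
_REGISTRY: Dict[str, Type[TTSBackend]] = {"silence": SilenceBackend}


def get_backend(default: str = "silence") -> TTSBackend:
    """Instantiate the backend named by OMNIVOICE_TTS_BACKEND, else the default."""
    name = os.environ.get("OMNIVOICE_TTS_BACKEND", default)
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown TTS backend {name!r}; choices: {sorted(_REGISTRY)}"
        ) from None
```

A dictionary registry keeps engine selection to a single lookup, so adding an engine does not require touching the code that calls it.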
Language Coverage
ElevenLabs supports 32 languages. OmniVoice Studio supports 646 languages for TTS and 99 languages for transcription via WhisperX. Translation coverage depends on the target language pair.
Getting Started
Prerequisites are ffmpeg, Bun, and uv. Clone the repo, then run:
uv sync
bun install
bun dev
The frontend loads at http://localhost:5173 and the API runs on port 8000. Model weights download automatically on first generation.
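Before running the commands above, it can help to confirm that the three prerequisites are actually on your PATH. The small script below is a convenience sketch, not part of the project; it only checks that executables named `ffmpeg`, `bun`, and `uv` can be found, not that their versions are suitable.

```python
import shutil
from typing import Dict, Optional, Sequence

REQUIRED_TOOLS = ("ffmpeg", "bun", "uv")


def find_tools(names: Sequence[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = f"ok: {path}" if path else "MISSING"
        print(f"{name:8} {status}")
```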
Marktechpost’s Visual Explainer
OmniVoice Studio — How to Use It
What Is OmniVoice Studio?
OmniVoice Studio is an open-source desktop application for voice cloning, video dubbing, real-time dictation, and speaker diarization. Everything runs locally on your machine. No API keys, no cloud account, no subscription required.
646 languages supported for TTS via the default OmniVoice engine
99 languages for transcription via WhisperX
Available on macOS, Windows, and Linux
GPU is optional — full pipeline runs on CPU
Free for personal, educational, and research use (FSL-1.1-ALv2)
System Requirements
A GPU is optional. Without one, TTS runs approximately 3× slower on CPU. With ≤8 GB VRAM, TTS automatically offloads to CPU during transcription — no config needed.
| Component | Minimum | Recommended |
| --- | --- | --- |
| OS | Win 10 / macOS 12+ / Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM | 4 GB (auto-offloads) | 8 GB+ (RTX 3060+) |
| Disk | 10 GB free | 20 GB+ SSD |
| Python | 3.10+ | 3.11–3.12 |
| GPU | Optional | CUDA / MPS / ROCm |
Installation
The project recommends running from source. Install three prerequisites first: ffmpeg, Bun (JS runtime), and uv (Python package manager).
git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
uv sync
bun install
bun dev
Frontend loads at http://localhost:5173 | API runs on port 8000.
Model weights download automatically on first generation.
Pre-built installers are also available (macOS DMG, Windows MSI, Linux AppImage and .deb); see the Releases page on GitHub.
Voice Cloning
Voice cloning uses zero-shot learning — it clones a voice from a clip as short as 3 seconds, without prior training on that voice. The default OmniVoice engine conditions a diffusion-based TTS model on the reference audio.
Go to the Voice Clone tab in the UI
Upload or record a 3-second audio clip of the target voice
Enter your text and select a target language (646 available)
Click Generate — output is saved to your project library
Voice Gallery: Search YouTube, browse categories, and download reference clips directly inside the app to build your voice library.
Video Dubbing
The full dubbing pipeline runs locally: transcribe → translate → synthesize → mux. Demucs isolates vocals so the original background audio is preserved in the final export.
Go to the Dub tab — paste a YouTube URL or upload a local file
WhisperX transcribes speech with word-level alignment
Select a target language; translation runs automatically
TTS engine re-voices the transcript; Demucs preserves background audio
Export the final MP4 with dubbed audio mixed in
Batch Queue: Drop up to 50 videos and walk away. Each job has its own progress bar tracking through the full pipeline.
Dictation & Speaker Diarization
Dictation works system-wide from any application. Diarization identifies individual speakers in a multi-speaker audio file using Pyannote + WhisperX.
Press ⌘+⇧+Space (macOS) to open the floating dictation widget
Speech streams via WebSocket and auto-pastes into the active input field
Upload a multi-speaker file to the Diarization tab
Pyannote identifies who said what; each speaker gets an auto-extracted voice profile
Assign a TTS voice per speaker for per-speaker dubbing
Hugging Face token required for Pyannote diarization. See docs/setup/huggingface-token.md in the repo.
TTS Engines
Six TTS engines are built in. Switch via Settings → TTS Engine or the env var:
OMNIVOICE_TTS_BACKEND=cosyvoice
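| Engine | Languages | Clone | Platform |
| --- | --- | --- | --- |
| OmniVoice (default) | 600+ | ✓ | CUDA / MPS / CPU |
| CosyVoice 3 | 9 + 18 dialects | ✓ | CUDA / MPS / CPU |
| MLX-Audio | Multi | Varies | Apple Silicon only |
| VoxCPM2 | 30 | ✓ | CUDA / MPS / CPU |
| MOSS-TTS-Nano | 20 | ✓ | CUDA / CPU |
| KittenTTS | English | ✗ | CPU only |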
Custom engine: Subclass TTSBackend in backend/services/tts_backend.py and add it to _REGISTRY. ~50 lines of Python.
MCP Server & Resources
OmniVoice Studio ships a built-in MCP Server, exposing voice and dubbing capabilities to any MCP-compatible client — Claude, Cursor, or your own tooling — without opening the desktop UI.
MCP Server starts alongside the FastAPI backend on bun dev
Point your MCP client at the local server to access all endpoints
AudioSeal (Meta) embeds an invisible neural watermark in all generated audio for AI provenance
GitHub: github.com/debpalash/OmniVoice-Studio
Install docs: docs/install/ (macos / windows / linux / docker)
Troubleshooting: docs/install/troubleshooting.md
Discord: discord.gg/bzQavDfVV9

