AI Models
- MusicGen
- VITS
- FastSpeech 2
- Stable Audio 2.0
- SpeechT5
- DeepSpeech
Speech & Audio — developer guide
What Are Speech and Audio AI Models?
Speech and Audio AI models cover the complete audio pipeline: converting spoken words into accurate text transcripts, generating natural, expressive synthetic speech, and composing original music. This category includes the models that power voice assistants, podcast editing tools, accessibility features, interactive voice response systems, and AI music generators. A major step forward was OpenAI's Whisper, which set a new open-source baseline for multilingual automatic speech recognition (ASR) across 99 languages.
Core Speech and Audio Tasks
- Automatic speech recognition (ASR): transcribe speech to text from audio files or live microphone streams
- Text-to-speech (TTS): synthesise natural-sounding speech from text with control over voice, emotion, and pace
- Voice cloning: create a voice model from a short audio sample that closely resembles the original speaker
- Speaker diarisation: identify and label who is speaking at each moment in a multi-speaker recording
- Music generation: compose music, in some systems complete songs with vocals and instrumentation, from a text description
- Audio classification: detect speech, music, environmental sounds, or specific acoustic events
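To make the ASR and speaker diarisation tasks above more concrete, here is a minimal sketch of one common way their outputs can be combined: each transcript segment is assigned to the speaker turn it overlaps most in time. The `Segment` class, the `assign_speakers` helper, and the sample timestamps are hypothetical and written by hand for illustration; they are not the output format of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text for ASR segments, speaker id for diarisation turns


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time interval shared by two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(transcript, turns):
    """Pair each transcript segment with the speaker turn that overlaps it most."""
    labelled = []
    for seg in transcript:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        if best is not None and overlap(seg, best) > 0:
            speaker = best.label
        else:
            speaker = "unknown"  # no diarisation turn covers this segment
        labelled.append((speaker, seg.label))
    return labelled


# Hypothetical outputs, hand-written for illustration (not produced by a real model).
transcript = [
    Segment(0.0, 2.5, "Welcome to the show."),
    Segment(2.6, 5.0, "Thanks for having me."),
]
turns = [
    Segment(0.0, 2.55, "SPEAKER_00"),
    Segment(2.55, 5.2, "SPEAKER_01"),
]

for speaker, text in assign_speakers(transcript, turns):
    print(f"{speaker}: {text}")
```

Maximum-overlap assignment is only one possible design choice. It ignores overlapping speech and can mislabel a segment that straddles a speaker change, so word-level timestamps may be preferable where they are available.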
Key Speech and Audio Models
Whisper large-v3 (OpenAI, open-source, MIT license) supports 99 languages and achieves near-human accuracy on standard English benchmarks, which makes it a common default for open-source transcription pipelines. Wav2Vec 2.0 (Meta AI) enables fine-tuning on low-resource languages with minimal labelled data. For TTS, Kokoro TTS and StyleTTS2 are among the open-source leaders in naturalness. MusicGen (Meta AI; code open-source, model weights released under a non-commercial license) generates mono and stereo music up to 30 seconds from text prompts. For speaker diarisation, pyannote.audio 3.0 is a widely used open-source option that is often integrated into transcription pipelines.
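Accuracy claims for ASR models such as those above are usually expressed as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version. The function name `word_error_rate` and the sample sentences are illustrative, and the lowercasing and whitespace splitting are simplifying assumptions; published benchmarks typically apply more careful text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hand-written example: one substitution ("jumps" -> "jumped") and one deletion ("the").
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")  # WER: 0.222
```

Lower is better: a WER of 0 means the output matches the reference word for word, and values above 1 are possible when the hypothesis contains many inserted words.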


