Speech & Audio

Speech recognition in 99 languages (Whisper large-v3), text-to-speech, voice cloning, music generation, and audio classification: open-source models and hosted APIs for a range of use cases and budgets.

3 APIs · 10 AI Models

Google Cloud Speech-to-Text API

API · Speech & Audio

Freemium · OAuth

Google Cloud Speech-to-Text transcribes audio into text with high accuracy across 125+ languages and variants. Send short or long audio, or stream it live, and get transcripts with word timings and confidence scores as JSON.

2000+ users · Not rated yet
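
A minimal sketch of reading such a JSON transcript. The response shape used here (`results`, `alternatives`, `words` with `startTime`/`endTime` strings such as "0.500s") is an assumption modelled on the v1 REST response, and the sample payload is invented for illustration, so field names should be checked against the current API reference.

```python
import json

# Hypothetical sample payload: field names are assumed, values are made up.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hello world",
          "confidence": 0.94,
          "words": [
            {"word": "hello", "startTime": "0s", "endTime": "0.500s"},
            {"word": "world", "startTime": "0.500s", "endTime": "1.100s"}
          ]
        }
      ]
    }
  ]
}
"""


def parse_seconds(value: str) -> float:
    """Convert a duration string such as '1.100s' into float seconds."""
    return float(value.rstrip("s"))


def extract_words(response: dict) -> list[tuple[str, float, float]]:
    """Return (word, start, end) for the top alternative of every result."""
    words = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for item in alternatives[0].get("words", []):
            start = parse_seconds(item["startTime"])
            end = parse_seconds(item["endTime"])
            words.append((item["word"], start, end))
    return words


response = json.loads(SAMPLE_RESPONSE)
timed_words = extract_words(response)
top_confidence = response["results"][0]["alternatives"][0]["confidence"]
```

The `(word, start, end)` tuples are enough to drive subtitles or to align a transcript with speaker turns.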

AssemblyAI

API · Speech & Audio

Freemium · API Key

AssemblyAI is a speech-to-text and audio-intelligence API that transcribes audio and adds speaker labels, summarization, sentiment and topic detection through a simple REST interface.

3500+ users · Not rated yet
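
As a rough sketch of what "a simple REST interface" with API-key auth can look like, the snippet below builds (but does not send) a transcription request with speaker labels enabled. The endpoint URL and the `audio_url` and `speaker_labels` field names are assumptions based on the public v2 interface; the API key and audio URL are placeholders.

```python
import json
import urllib.request

# Assumed endpoint: verify against the provider's current API reference.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for a diarised transcript."""
    payload = {
        "audio_url": audio_url,   # publicly reachable audio file (placeholder)
        "speaker_labels": True,   # assumed flag that enables speaker labels
    }
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_transcript_request("YOUR_API_KEY", "https://example.com/meeting.mp3")
# urllib.request.urlopen(request) would submit the job; it is left out here
# so the sketch runs without network access or a real key.
```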

Async.ai TTS API

API · Speech & Audio

Freemium · API Key

The Async.ai TTS API converts text into natural-sounding speech in multiple voices and languages, returning audio you can embed in apps, voiceovers and accessibility features.

5K+ users · Not rated yet

Distil-Whisper

Model · Hugging Face

MIT

Distil-Whisper is a distilled version of OpenAI's Whisper speech-recognition model. It runs around six times faster and is roughly half the size, while staying within about 1% word error rate (WER) of the original on English transcription.

↓ 780K+ · Not rated yet
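
To make the word-error-rate comparison concrete, here is a small sketch of how WER is usually computed: word-level edit distance divided by the number of reference words. The transcripts are invented for illustration and are not real output from either model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words (rolling-row Levenshtein).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up transcripts standing in for a teacher and a distilled student model.
reference = "the quick brown fox jumps over the lazy dog"
teacher_output = "the quick brown fox jumps over the lazy dog"
student_output = "the quick brown fox jumped over the lazy dog"

teacher_wer = word_error_rate(reference, teacher_output)
student_wer = word_error_rate(reference, student_output)
wer_gap = student_wer - teacher_wer  # the quantity a "within 1% WER" claim bounds
```

In practice the gap is averaged over a full test set rather than a single sentence.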

SeamlessM4T v2

Model · Meta AI

CC BY-NC 4.0

SeamlessM4T v2 is Meta's foundational multilingual and multimodal translation model. A single system handles speech-to-speech, speech-to-text, text-to-speech and text-to-text translation, plus automatic speech recognition, across around 100 languages.

↓ 650K+ · Not rated yet

wav2vec 2.0

Model · Meta AI

MIT

wav2vec 2.0 is Meta's self-supervised speech model that learns rich audio representations from raw, unlabelled speech. Fine-tuned with even a little labelled data, it delivers strong speech recognition across many languages.

↓ 580K+ · Not rated yet
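
To give a feel for what "learning from raw speech" means in practice, the sketch below counts how many latent frames a convolutional feature encoder produces from a raw waveform. The kernel and stride values are an assumption taken from the commonly published base configuration; other checkpoints may differ.

```python
# (kernel, stride) per convolutional layer; assumed from the published base config.
FEATURE_ENCODER_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples: int) -> int:
    """Count the latent frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in FEATURE_ENCODER_LAYERS:
        if length < kernel:
            return 0  # too short to pass through this layer without padding
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of the strides: raw samples advanced per output frame."""
    product = 1
    for _, stride in FEATURE_ENCODER_LAYERS:
        product *= stride
    return product


frames_per_second = num_output_frames(16_000)  # one second of 16 kHz audio
hop_ms = 1000 * total_stride() / 16_000        # milliseconds between frames
```

Under these assumptions one second of 16 kHz audio becomes 49 frames, roughly one every 20 ms, which is the sequence a fine-tuned recognition head then labels.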

AI Models (7)

OpenVoice

🔥 Hot

by MyShell.ai

OpenVoice is an open-source instant voice-cloning model from MyShell and MIT. From a short reference clip it replicates a voice, with flexible control over tone, emotion and accent, and supports cross-lingual cloning.

MIT · ~200M

MusicGen

🔥 Hot

by Meta AI

MusicGen is Meta's open text-to-music model. From a text prompt — and optionally a melody to follow — it generates coherent musical audio in a single stage, producing short instrumental pieces across many genres and moods.

MIT · 300M / 1.5B / 3.3B

VITS

🔥 Hot

by Kakao Enterprise

VITS is an end-to-end text-to-speech model that produces remarkably natural, expressive speech in a single stage. By combining variational inference with adversarial training and flows, it skips the separate vocoder of older pipelines.

MIT · ~33M

FastSpeech 2

🔥 Hot

by Microsoft Research

FastSpeech 2 is Microsoft's non-autoregressive text-to-speech model. It generates speech in parallel — far faster than autoregressive TTS — with explicit control over pitch, energy and duration for natural, controllable voices.

MIT · ~27M
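
The explicit duration control mentioned for FastSpeech 2 can be illustrated with a toy length regulator: each input unit is repeated for its predicted number of output frames, so all frames can then be generated in parallel. This is a simplified sketch, not the model's code; the real model expands hidden-state vectors rather than phoneme labels, and the phonemes, durations and `speed` parameter here are hypothetical.

```python
def length_regulate(phonemes: list[str], durations: list[int], speed: float = 1.0) -> list[str]:
    """Repeat each phoneme for its (scaled) number of output frames."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        # Faster speech (speed > 1) shrinks durations; keep at least one frame.
        scaled = max(1, round(duration / speed))
        frames.extend([phoneme] * scaled)
    return frames


phonemes = ["HH", "AH", "L", "OW"]  # hypothetical phoneme sequence
durations = [2, 4, 3, 6]            # hypothetical predicted frame counts

normal = length_regulate(phonemes, durations)
fast = length_regulate(phonemes, durations, speed=2.0)
```

Because the output length is known before synthesis starts, no frame has to wait for the previous one, which is where the speed advantage over autoregressive TTS comes from.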

Stable Audio 2.0

🔥 Hot

by Stability AI

Stable Audio 2.0 is Stability AI's text-to-audio model. From a prompt it generates full-length, structured music tracks of up to about three minutes, and it also supports audio-to-audio transformation.

Stability AI Community License · 1.1B (Open)

SpeechT5

🔥 Hot

by Microsoft Research

SpeechT5 is Microsoft's unified-modal model for speech and text. A single encoder-decoder backbone handles text-to-speech, speech recognition, voice conversion and speech enhancement, all from shared pretraining.

MIT · ~144M

DeepSpeech

🔥 Hot

by Mozilla

DeepSpeech is Mozilla's open-source speech-to-text engine, based on Baidu's Deep Speech research. An end-to-end model trained with CTC, it runs offline on-device and helped popularise free, private speech recognition.

Mozilla Public License 2.0 · ~47M
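
For readers unfamiliar with CTC, the toy example below shows greedy CTC decoding: take the best symbol per frame, merge consecutive repeats, then drop the blank symbol. It is a simplified sketch with a hypothetical five-symbol alphabet and hand-made frame scores; production decoders typically use beam search, often with a language model.

```python
BLANK = "_"  # assumed blank symbol for this toy example


def ctc_greedy_decode(frame_scores: list[dict[str, float]]) -> str:
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best_path = [max(scores, key=scores.get) for scores in frame_scores]
    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; blanks separate genuine double letters.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


def one_hot(symbol: str, alphabet: str = "_helo") -> dict[str, float]:
    """Hypothetical per-frame scores that put all mass on one symbol."""
    return {s: (1.0 if s == symbol else 0.0) for s in alphabet}


# Ten frames whose best path is "hh_e_ll_lo".
frames = [one_hot(s) for s in "hh_e_ll_lo"]
text = ctc_greedy_decode(frames)
```

The blank between the two runs of "l" is what lets the decoder keep the double letter instead of collapsing it.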


At a glance

Compare the top Speech & Audio APIs

API | Access | Auth | Formats | Rating
Google Cloud Speech-to-Text API | Freemium | OAuth | REST, JSON | Not rated yet
AssemblyAI | Freemium | API Key | REST, JSON | Not rated yet
Async.ai TTS API | Freemium | API Key | REST, JSON | Not rated yet

About this category

Speech & Audio — developer guide

What Are Speech and Audio AI Models?

Speech and Audio AI models cover tasks across the audio pipeline: converting spoken words into text transcripts, generating natural, expressive synthetic speech, and composing original music. This category includes the models that power voice assistants, podcast editing tools, accessibility features, interactive voice response systems, and AI music generators. The field advanced markedly with OpenAI's Whisper, which set a new open-source baseline for multilingual automatic speech recognition (ASR) across 99 languages.

Core Speech and Audio Tasks

Automatic speech recognition (ASR): transcribe speech to text from audio files or live microphone streams
Text-to-speech (TTS): synthesise natural-sounding speech from text with control over voice, emotion, and pace
Voice cloning: build a voice model from a short audio sample that closely resembles the original speaker
Speaker diarisation: identify and label who is speaking at each moment in a multi-speaker recording
Music generation: compose instrumental pieces or, with some models, complete songs with vocals from a text description
Audio classification: detect speech, music, environmental sounds, or specific acoustic events
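
Two of these tasks are often combined: ASR gives timed words and diarisation gives speaker turns, and a small merge step labels each word with a speaker. The sketch below shows one simple way to do that, by picking the turn with the largest time overlap. The `Word` and `Turn` types, speaker names and timings are hypothetical and not tied to any particular library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Attach to each word the speaker whose turn overlaps it the most."""
    labelled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, word.text))
    return labelled


# Invented ASR output and diarisation turns.
words = [Word("hi", 0.0, 0.4), Word("there", 0.4, 0.9),
         Word("hello", 1.2, 1.6), Word("bye", 2.5, 2.8)]
turns = [Turn("SPEAKER_A", 0.0, 1.0), Turn("SPEAKER_B", 1.0, 2.0)]

labelled = label_words(words, turns)
```

Words that fall outside every turn are kept and marked "unknown" rather than dropped, so the transcript stays complete.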

Key Speech and Audio Models

Whisper large-v3 (OpenAI, open-source, MIT license) supports 99 languages and reports near-human accuracy on standard English benchmarks, which makes it a common default for open-source transcription pipelines. wav2vec 2.0 (Meta AI) enables fine-tuning on low-resource languages with minimal labelled data. For TTS, Kokoro TTS and StyleTTS2 are among the open-source leaders in naturalness. MusicGen (Meta AI, open-source) generates mono and stereo music up to 30 seconds from text prompts. For speaker diarisation, pyannote-audio 3.0 is a widely used open-source option integrated into many transcription pipelines.
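
One practical detail when building a transcription pipeline on Whisper is that the model works on 30-second windows of 16 kHz audio, so longer recordings are split into chunks first. The sketch below computes chunk boundaries in samples. The `overlap_seconds` parameter is an illustrative assumption, not a setting taken from any specific library.

```python
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz input audio
WINDOW_SECONDS = 30    # each model pass sees a 30-second window


def chunk_bounds(num_samples: int, overlap_seconds: float = 0.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices that cover the whole recording."""
    window = WINDOW_SECONDS * SAMPLE_RATE
    step = window - int(overlap_seconds * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    bounds = []
    start = 0
    while start < num_samples:
        bounds.append((start, min(start + window, num_samples)))
        if start + window >= num_samples:
            break  # this chunk already reaches the end of the audio
        start += step
    return bounds


recording = 75 * SAMPLE_RATE  # a hypothetical 75-second recording
plain_chunks = chunk_bounds(recording)
overlapping_chunks = chunk_bounds(recording, overlap_seconds=5.0)
```

Overlapping chunks cost extra compute but make it easier to stitch transcripts together without cutting words at the boundaries.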
