
Speech & Audio

Speech recognition in 99 languages (Whisper large-v3), TTS synthesis, voice cloning, music generation, and audio classification models — open-source and API options for every use case and budget.

3 APIs · 10 AI Models
Most Popular In
Automatic Speech Recognition · Text-to-Speech · Voice Cloning
Auth Breakdown
API Key: 67%
OAuth: 33%
Notable Developers
OpenAI (Whisper) · Meta AI (Wav2Vec, MusicGen) · Microsoft · ElevenLabs · pyannote
Updated Jun 12, 2026
Curated by FreeAPIHub editors
Topics: Automatic Speech Recognition · Text-to-Speech · Voice Cloning · Speaker Diarisation · Music Generation · Audio Classification
Top Resources

Google Cloud Speech-to-Text API

API · Speech & Audio
Freemium · OAuth

Google Cloud Speech-to-Text transcribes audio into text with high accuracy across 125+ languages. Send short or long audio, or stream it live, and get transcripts with word timings and confidence as JSON.

2000+ users · Not rated yet

AssemblyAI

API · Speech & Audio
Freemium · API Key

AssemblyAI is a speech-to-text and audio-intelligence API that transcribes audio and adds speaker labels, summarization, sentiment and topic detection through a simple REST interface.

3500+ users · Not rated yet

Async.ai TTS API

API · Speech & Audio
Freemium · API Key

The Async.ai TTS API converts text into natural-sounding speech in multiple voices and languages, returning audio you can embed in apps, voiceovers and accessibility features.

5K+ users · Not rated yet

Distil-Whisper

Model · Hugging Face
MIT

Distil-Whisper is a distilled version of OpenAI's Whisper speech-recognition model. It runs around six times faster and is roughly half the size, while staying within about 1% word-error-rate of the original on English transcription.

↓ 780K+ · Not rated yet

SeamlessM4T v2

Model · Meta AI
CC BY-NC 4.0

SeamlessM4T v2 is Meta's foundational multilingual and multimodal translation model. A single system handles speech-to-speech, speech-to-text, text-to-speech, text-to-text and recognition across around 100 languages.

↓ 650K+ · Not rated yet

wav2vec 2.0

Model · Meta AI
MIT

wav2vec 2.0 is Meta's self-supervised speech model that learns rich audio representations from raw, unlabelled speech. Fine-tuned with even a little labelled data, it delivers strong speech recognition across many languages.

↓ 580K+ · Not rated yet

OpenVoice

🔥 Hot
by MyShell.ai

OpenVoice is an open-source instant voice-cloning model from MyShell and MIT. From a short reference clip it replicates a voice, with flexible control over tone, emotion and accent, and supports cross-lingual cloning.

MIT · ~200M

MusicGen

🔥 Hot
by Meta AI

MusicGen is Meta's open text-to-music model. From a text prompt — and optionally a melody to follow — it generates coherent musical audio in a single stage, producing short instrumental pieces across many genres and moods.

MIT (code), CC BY-NC 4.0 (model weights) · 300M / 1.5B / 3.3B

VITS

🔥 Hot
by Kakao Enterprise

VITS is an end-to-end text-to-speech model that produces remarkably natural, expressive speech in a single stage. By combining variational inference with adversarial training and flows, it skips the separate vocoder of older pipelines.

FastSpeech 2

🔥 Hot
by Microsoft Research

FastSpeech 2 is Microsoft's non-autoregressive text-to-speech model. It generates speech in parallel — far faster than autoregressive TTS — with explicit control over pitch, energy and duration for natural, controllable voices.

Stable Audio 2.0

🔥 Hot
by Stability AI

Stable Audio 2.0 is Stability AI's text-to-audio model that generates full-length, structured music tracks up to about three minutes from a prompt. It also supports audio-to-audio transformation, bringing coherent long-form AI music generation.

Stability AI Community License · 1.1B (Open)

SpeechT5

🔥 Hot
by Microsoft Research

SpeechT5 is Microsoft's unified-modal model for speech and text. A single encoder-decoder backbone handles text-to-speech, speech recognition, voice conversion and speech enhancement, all from shared pretraining.

MIT · ~144M

DeepSpeech

🔥 Hot
by Mozilla

DeepSpeech is Mozilla's open-source speech-to-text engine, based on Baidu's Deep Speech research. An end-to-end model trained with CTC, it runs offline on-device and helped popularise free, private speech recognition.

Mozilla Public License 2.0 · ~47M
Showing 13 of 13 resources

At a glance

Compare the top Speech & Audio APIs

API | Access | Auth | Formats | Rating
Google Cloud Speech-to-Text API | Freemium | OAuth | REST, JSON | Not rated yet
AssemblyAI | Freemium | API Key | REST, JSON | Not rated yet
Async.ai TTS API | Freemium | API Key | REST, JSON | Not rated yet

About this category

Speech & Audio — developer guide

What Are Speech and Audio AI Models?

Speech and Audio AI models handle the complete audio pipeline — from converting spoken words into accurate text transcripts to generating natural, expressive synthetic speech and composing original music. This category includes the models that power voice assistants, podcast editing tools, accessibility features, interactive voice response systems, and AI music generators. The field advanced dramatically with OpenAI's Whisper, which set a new open-source baseline for multilingual automatic speech recognition (ASR) across 99 languages.

Core Speech and Audio Tasks

  • Automatic speech recognition (ASR): transcribe speech to text from audio files or live microphone streams
  • Text-to-speech (TTS): synthesise natural-sounding speech from text with control over voice, emotion, and pace
  • Voice cloning: create a voice model from a short audio sample that closely matches the original speaker
  • Speaker diarisation: identify and label who is speaking at each moment in a multi-speaker recording
  • Music generation: generate music from a text description, ranging from short instrumental clips to full-length tracks depending on the model
  • Audio classification: detect speech, music, environmental sounds, or specific acoustic events
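
ASR and speaker diarisation from the list above are often combined: one step yields timestamped words, the other yields speaker turns, and the two are merged into a labelled transcript. The following is a minimal sketch of that merge, not any particular library's method. The tuple shapes `Word` and `Turn`, the function names, and the speaker labels are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumed (hypothetical) shapes; real ASR and diarisation tools use their own formats.
Word = Tuple[float, float, str]  # (start_seconds, end_seconds, text)
Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Pair each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((best_speaker, text))
    return labelled


def to_transcript(labelled: List[Tuple[str, str]]) -> List[str]:
    """Group consecutive words from the same speaker into 'SPEAKER: text' lines."""
    groups: List[Tuple[str, List[str]]] = []
    for speaker, text in labelled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in groups]


if __name__ == "__main__":
    words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there"), (1.2, 1.6, "hi")]
    turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
    for line in to_transcript(label_words(words, turns)):
        print(line)
```

Assigning by greatest overlap is only one simple choice; words that fall in a gap between turns are labelled "unknown" here rather than guessed.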

Key Speech and Audio Models

Whisper large-v3 (OpenAI, open-source, MIT license) supports 99 languages and achieves near-human accuracy on standard English benchmarks, which makes it a common default for open-source transcription pipelines. wav2vec 2.0 (Meta AI) can be fine-tuned for low-resource languages with minimal labelled data. For TTS, Kokoro TTS and StyleTTS2 are among the most natural-sounding open-source options. MusicGen (Meta AI; code under MIT, model weights under CC BY-NC 4.0) generates mono or stereo music of up to 30 seconds from text prompts. For speaker diarisation, pyannote-audio 3.0 is a widely used open-source choice that many transcription pipelines integrate.
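
Accuracy figures for transcription models are usually reported as word error rate (WER): substitutions, deletions and insertions divided by the number of words in the reference transcript. The sketch below computes WER with a word-level edit distance, which can help when comparing models on your own audio. The function name is hypothetical, and the normalisation (lowercasing and whitespace splitting) is a simplifying assumption; published benchmarks typically apply more careful text normalisation, so scores from this sketch will not match them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    # Simplified normalisation (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is missing from the hypothesis.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.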