open source

wav2vec 2.0

Provided by: Meta AI · Framework: PyTorch

wav2vec 2.0 is a self-supervised speech representation learning model developed by Meta AI, offering state-of-the-art performance in automatic speech recognition (ASR). Built on PyTorch and licensed under MIT, it drastically reduces the need for labeled data, making it ideal for multilingual transcription and voice applications. The model is widely used and integrated into the Hugging Face ecosystem.
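Because the model is integrated into the Hugging Face ecosystem, transcription is typically a few lines of `transformers` code. The sketch below assumes the public `facebook/wav2vec2-base-960h` checkpoint and 16 kHz mono input (both from the Hugging Face model card, not from this page); treat it as illustrative, not canonical.

```python
def transcribe(waveform, sampling_rate=16_000):
    """Rough sketch of CTC transcription with wav2vec 2.0 via Hugging Face
    transformers. `waveform` is a 1-D float array of 16 kHz mono audio.
    Imports are deferred so this file loads even without torch/transformers."""
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # "facebook/wav2vec2-base-960h" is a public checkpoint fine-tuned on
    # 960 hours of LibriSpeech; swap in another checkpoint as needed.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: argmax per frame, then collapse repeats/blanks
    # via the processor's tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

The first call downloads the checkpoint (~360 MB), so production deployments usually cache the processor and model rather than constructing them per request.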

Model Performance Statistics

Released: June 24, 2020
Last Checked: July 20, 2025
Version: v2

Capabilities
  • Speech-to-Text
Performance Benchmarks
WER: 1.8% on LibriSpeech test-clean
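The WER figure above is the word-level edit distance between the model's transcript and the reference, divided by the number of reference words. A minimal, model-agnostic sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) divided by
    reference length, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 1.8% WER on test-clean therefore means roughly 18 word errors per 1,000 reference words.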
Technical Specifications
Parameter Count
N/A
Training & Dataset

Dataset Used

LibriSpeech

Related AI Models

Discover similar AI models that might interest you

Model · open source

SpeechT5

Microsoft

SpeechT5 is a versatile speech processing model developed by Microsoft, designed to handle speech recognition, speech synthesis, and speech translation tasks within a unified framework. Built using PyTorch and released under the MIT license, it leverages transformer architectures for improved accuracy and flexibility in various speech applications, including voice assistants and translation systems.

Speech & Audio · asr · speech-recognition
Model · open source

Distil-Whisper

Hugging Face

Distil-Whisper is a distilled version of OpenAI’s Whisper model created by Hugging Face. Implemented in PyTorch and licensed under MIT, it offers up to six times faster inference with under half the parameters while staying within 1% of Whisper’s word error rate (WER) on English speech tasks. Ideal for real-time transcription in resource-constrained environments.

Speech & Audio · asr · speech-recognition
Model · open source

DeepSpeech

Mozilla

DeepSpeech is an open-source automatic speech recognition (ASR) model developed by Mozilla, utilizing TensorFlow and licensed under the Mozilla Public License 2.0. It enables developers to build reliable, real-time speech-to-text transcription systems optimized for multiple languages and accents. Its architecture is designed for efficient deployment on edge devices and supports custom language model training.

Speech & Audio · speech-recognition · voice