open source

VITS

Provided by: Kakao Enterprise
Framework: PyTorch

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model developed by researchers at Kakao Enterprise. It combines a conditional variational autoencoder with adversarial (GAN) training and normalizing flows to generate high-quality, natural-sounding speech directly from text in a single stage, with no separate vocoder step. Built on PyTorch and released under the MIT license, VITS supports fast end-to-end training and inference, making it popular for voice assistants and media applications.
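As a quick illustration of how such an end-to-end model is typically used, the sketch below synthesizes a WAV file from plain text with a pretrained VITS checkpoint. It assumes the third-party Coqui TTS package and its hosted LJSpeech VITS model, neither of which is referenced on this page; other VITS implementations expose similarly simple one-call inference.

```python
# Minimal inference sketch, assuming the Coqui TTS package (`pip install TTS`)
# and its pretrained LJSpeech VITS checkpoint (assumptions, not from this page).
from TTS.api import TTS

# Load the pretrained VITS model; the checkpoint is downloaded on first use.
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Generate speech directly from raw text and write it to a WAV file.
tts.tts_to_file(
    text="VITS generates speech end to end, directly from raw text.",
    file_path="vits_sample.wav",
)
```

Because VITS is single-stage, there is no intermediate mel-spectrogram hand-off to a separate vocoder between the text input and the waveform output.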

Model Performance Statistics

Released: June 10, 2021
Last Checked: July 20, 2025
Version: v1

Capabilities
  • Text-to-Speech

Performance Benchmarks
MOS: 4.41

Technical Specifications
Parameter Count: N/A
Training & Dataset
Dataset Used: LJSpeech

Related AI Models


Model · open source

FastSpeech 2

Microsoft Research Asia

FastSpeech 2 is an improved non-autoregressive neural text-to-speech model from Microsoft Research Asia that generates natural-sounding speech quickly and efficiently. It conditions synthesis on explicitly predicted duration, pitch, and energy, which improves prosody modeling and robustness and makes it suitable for real-time voice assistants, audiobooks, and accessibility tools. Built with PyTorch and licensed under MIT, the open-source code allows developers to customize and deploy the model easily (a conceptual sketch of its prosody modeling follows this entry).

Speech & Audio · audio · text-to-speech
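To make the prosody-modeling idea concrete, here is a small self-contained PyTorch sketch of the variance-adaptor concept FastSpeech 2 is known for: per-token predictors for quantities such as duration, and a length regulator that expands the encoder output to frame resolution. This is a conceptual illustration, not FastSpeech 2's actual code; all class names and sizes are made up for the example.

```python
# Conceptual sketch (not FastSpeech 2's actual implementation): a per-token
# variance predictor plus a length regulator, the mechanism behind its
# explicit duration/pitch/energy modeling. Names and sizes are illustrative.
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Tiny per-token regressor, standing in for the duration/pitch/energy predictors."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # (batch, tokens, dim) -> (batch, tokens)
        return self.net(h).squeeze(-1)


def length_regulate(h: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each token's hidden vector according to its predicted frame count."""
    expanded = [h[b].repeat_interleave(durations[b], dim=0) for b in range(h.size(0))]
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)


# Toy usage: 2 sentences of 5 encoder tokens each, hidden size 256.
encoder_out = torch.randn(2, 5, 256)
duration_predictor = VariancePredictor()

# Round predicted (log-)durations to integer frame counts, at least 1 frame per token.
durations = torch.clamp(duration_predictor(encoder_out).exp().round().long(), min=1)
frames = length_regulate(encoder_out, durations)
print(frames.shape)  # (2, max_total_frames, 256), ready for the mel-spectrogram decoder
```

Because durations are predicted up front, all frames can then be decoded in parallel, which is what makes this family of models fast at inference time.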
Model · open source

Stable Audio 2.0

Stability AI

Stable Audio 2.0 is an advanced open-source AI model developed by Stability AI for generating music and audio from textual descriptions. Built with PyTorch and licensed under MIT, it offers creators and developers an accessible tool to produce diverse audio content, including music composition and sound design, with high fidelity and creativity.

Speech & Audio · audio · music
Model · open source

MusicGen

Meta AI

MusicGen is a single-stage autoregressive transformer model from Meta AI, released through the AudioCraft library. It generates high-quality music conditioned on text descriptions or a melody prompt, operating on the discrete audio tokens produced by the EnCodec codec, and ships in multiple sizes such as small, medium (1.5B), and large (3.3B). The code is licensed under MIT and the pretrained weights under CC-BY-NC-4.0, enabling controllable, high-fidelity music synthesis across genres (a minimal usage sketch follows this entry).

Speech & Audio · audio · text-to-music
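To make the AudioCraft workflow concrete, here is a minimal usage sketch. It assumes the audiocraft package is installed and uses the smallest published checkpoint; the prompts and output file names are illustrative only and do not come from this page.

```python
# Minimal usage sketch, assuming `pip install audiocraft` (assumptions, not from this page).
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")  # smallest text-to-music checkpoint
model.set_generation_params(duration=8)  # seconds of audio per prompt

descriptions = ["lo-fi hip hop beat with warm piano", "upbeat acoustic folk guitar"]
wavs = model.generate(descriptions)  # one waveform tensor per text prompt

for i, wav in enumerate(wavs):
    # audio_write appends the .wav extension and applies loudness normalization.
    audio_write(f"musicgen_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```

Swapping in the medium or large checkpoint only changes the model ID passed to get_pretrained; the generation interface stays the same.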