Open Source · Video Generation · by UC Berkeley (Wilson Yan et al.)

VideoGPT

VideoGPT is an open research model for video generation that combines a VQ-VAE with a transformer. It learns to generate short video clips by modelling sequences of discrete spatiotemporal tokens, and it was an influential early approach to AI video.

Tags: open-source-ai, research-ai, uc-berkeley, video-generation, videogpt, vq-vae
Quick facts
License: Open source (research)
Type: VQ-VAE + GPT-style transformer (video tokens)
Output: Short clips (low-res)
Use: Research
By: UC Berkeley
Implementation: PyTorch

What is VideoGPT?

VideoGPT is an open research model for video generation that applies the successful 'tokens + transformer' recipe from language and image modelling to video. Introduced by researchers at UC Berkeley, it showed that a relatively simple architecture — a VQ-VAE to compress video into discrete tokens, plus a transformer to model those tokens — could generate plausible short video clips. While modest by today's standards, VideoGPT was an influential early step toward modern AI video, demonstrating a clean, likelihood-based path to generative video.

How it works

VideoGPT works in two stages. First, a 3D VQ-VAE learns to encode video into a compact grid of discrete spatiotemporal tokens (capturing both space and motion) and to decode them back into frames. Second, a GPT-style transformer is trained to model the sequence of these tokens autoregressively — predicting the next token given the previous ones. To generate, the transformer samples a new token sequence, which the VQ-VAE decoder turns into a video clip. This mirrors how GPT generates text, but over video tokens.
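
The two stages above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not VideoGPT's actual code: the grid and codebook sizes are made up, random arrays stand in for the encoder output and the learned codebook, and `toy_next_token_logits` is a hypothetical placeholder for the trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration, not VideoGPT's real settings).
T, H, W, D = 2, 4, 4, 8      # latent grid: time x height x width, D-dim vectors
K = 16                       # number of codebook entries

codebook = rng.normal(size=(K, D))         # learned in a real VQ-VAE
latents = rng.normal(size=(T, H, W, D))    # stand-in for the 3D encoder output


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    flat = latents.reshape(-1, codebook.shape[1])              # (T*H*W, D)
    # Squared Euclidean distance from every latent to every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(latents.shape[:-1])    # (T, H, W)


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for the transformer: logits for the next token.

    A real model would attend over the whole prefix. This one only favours
    repeating the previous token, so the example runs without training.
    """
    logits = np.zeros(vocab_size)
    if len(prefix) > 0:
        logits[prefix[-1]] = 2.0
    return logits


def sample_tokens(length, vocab_size, rng):
    """Autoregressive sampling: draw one token at a time given the prefix."""
    tokens = []
    for _ in range(length):
        logits = toy_next_token_logits(tokens, vocab_size)
        probs = np.exp(logits - logits.max())    # softmax over the vocabulary
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return np.array(tokens)


# Stage 1: latents -> discrete token grid -> flat sequence for the transformer.
token_grid = quantize(latents, codebook)
sequence = token_grid.reshape(-1)            # raster order over (T, H, W)

# Stage 2: sample a new sequence, reshape it, and look up the code vectors.
new_tokens = sample_tokens(T * H * W, K, rng)
new_grid = new_tokens.reshape(T, H, W)
quantized = codebook[new_grid]               # (T, H, W, D), fed to the decoder
```

In a real system the decoder would turn `quantized` back into frames. The sketch stops at the code-vector lookup because the decoder is a trained network.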

What it is good at

VideoGPT is primarily a research and educational model. It generates short, low-to-moderate-resolution clips and is useful for studying generative video and the token-based approach, and as a baseline or building block for other work. Its clean design makes it a good way to understand how discrete latent models and transformers can be applied to video, and it informed later, more capable systems. It is best suited to experimentation rather than to producing polished, long, or high-resolution video.

Licensing & access

VideoGPT is open source, with the code and details available on GitHub for researchers to run and build on. It is implemented in PyTorch and runs on a GPU; training is compute-intensive (especially the transformer), while sampling from a trained model is lighter. As a research artefact, it comes with example datasets and configurations rather than a polished product interface.
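
To get a feel for why training the transformer is the expensive part, it helps to count tokens. The numbers below are assumed for illustration (a 16-frame 64x64 clip with 4x downsampling in time and space), and the cost estimate assumes plain full self-attention, which compares every token with every other token.

```python
def token_count(frames, height, width, t_down, s_down):
    """Number of discrete tokens for one clip after VQ-VAE downsampling."""
    return (frames // t_down) * (height // s_down) * (width // s_down)


# Assumed example: 16 frames at 64x64, downsampled 4x in time and space.
n = token_count(frames=16, height=64, width=64, t_down=4, s_down=4)
pairs = n * n   # entries in one full self-attention score matrix

print(n)        # 4 * 16 * 16 = 1024 tokens
print(pairs)    # 1024 * 1024 = 1048576 token pairs

# Doubling the spatial resolution gives 4x the tokens and 16x the pairs.
n_big = token_count(frames=16, height=128, width=128, t_down=4, s_down=4)
print(n_big // n, (n_big * n_big) // pairs)   # 4 16
```

Sampling still needs one forward pass per generated token, but it involves no backward pass or optimiser state, which is one reason it is lighter than training.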

Practical considerations

Set expectations: VideoGPT produces short, relatively low-resolution clips and is an earlier-generation approach — modern diffusion-based video models far exceed it in quality, length and controllability. Training is resource-heavy, and there is no simple text-to-video prompt interface like newer tools. It is best treated as a research baseline and learning tool rather than a production video generator; for real video work, use a current diffusion-based model.

How it compares

Stable Video Diffusion and AnimateDiff use diffusion to produce higher-quality, more controllable video, and diffusion is now the mainstream approach. VideoGPT instead uses a discrete-token, autoregressive approach. Its significance is historical and conceptual: it was an important demonstration that video could be generated as a sequence of tokens, a thread that influenced later work. For research and understanding, choose VideoGPT; for producing good video today, use a diffusion-based model.

Getting started

Clone the VideoGPT repository, set up the PyTorch environment, and either use a provided configuration/dataset or train the VQ-VAE and transformer on your own short-video data (expect significant GPU time). Sample from a trained model to generate clips. Treat it as a research and learning exercise — to study token-based video generation — and reach for Stable Video Diffusion or AnimateDiff when you need high-quality, controllable video output.
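
If you train on your own footage, one preparation step is cutting longer videos into fixed-length clips. The helper below is a hypothetical sketch and not part of the VideoGPT repository; the clip length, stride, and the [-0.5, 0.5] pixel scaling are assumptions chosen for illustration, so check the repository's own data loading code for the format it expects.

```python
import numpy as np


def make_clips(video, clip_len, stride):
    """Cut a (frames, H, W, C) uint8 video into fixed-length float clips.

    Hypothetical helper for illustration only. Returns an array of shape
    (num_clips, clip_len, H, W, C) with pixel values scaled to [-0.5, 0.5].
    """
    if video.shape[0] < clip_len:
        raise ValueError("video is shorter than one clip")
    starts = range(0, video.shape[0] - clip_len + 1, stride)
    clips = np.stack([video[s:s + clip_len] for s in starts])
    return clips.astype(np.float32) / 255.0 - 0.5


# Stand-in for a decoded video: 40 frames of 64x64 RGB noise.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(40, 64, 64, 3), dtype=np.uint8)

clips = make_clips(video, clip_len=16, stride=8)
print(clips.shape)  # (4, 16, 64, 64, 3)
```

A stride smaller than the clip length produces overlapping clips, which yields more training examples from the same footage at the cost of redundancy.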

Capabilities

🧩
Token-based video
Encodes video into discrete spatiotemporal tokens with a 3D VQ-VAE.
🔁
Autoregressive generation
A GPT-style transformer models and samples token sequences to create clips.
🔬
Research baseline
A clean, influential demonstration of generative video modelling.
🎓
Educational
Illustrates how language-model ideas transfer to video.

Pros & Cons

Pros
  • Clean VQ-VAE + transformer video approach
  • Influential early generative-video research
  • Open source and self-hostable
  • Good for learning and as a baseline
  • Likelihood-based, token-style generation
  • Foundation for understanding AI video
Cons
  • Short, low-resolution clips
  • Earlier generation — diffusion models far exceed it
  • Training is resource-heavy
  • No simple text-to-video interface

Inspiration

VideoGPT use cases & project ideas

Video research

Study generative video models.

Learning

Understand token-based video.

Baseline

Compare video approaches.

Short clips

Generate experimental video.

FAQ

Frequently asked questions

VideoGPT is an open research model that generates short video clips by pairing a VQ-VAE, which turns video into discrete tokens, with a GPT-style transformer that models those tokens.