

Open Source · Video Generation · by UC Berkeley (Wilson Yan et al.)

VideoGPT

VideoGPT is an open research model for video generation that combines a VQ-VAE with a transformer. It learns to generate short video clips by modelling sequences of discrete spatiotemporal tokens, an influential early approach to AI video.

Tags: open-source-ai, research-ai, uc-berkeley, video-generation, videogpt, vq-vae


Quick facts

License: Open source (research)

Type: VQ-VAE + Transformer (GPT-style, over video tokens)

Output: Short clips (low-res)

Use: Research

Origin: UC Berkeley

Framework: PyTorch

What is VideoGPT?

VideoGPT is an open research model for video generation that applies the successful 'tokens + transformer' recipe from language and image modelling to video. Introduced by researchers at UC Berkeley, it showed that a relatively simple architecture — a VQ-VAE to compress video into discrete tokens, plus a transformer to model those tokens — could generate plausible short video clips. While modest by today's standards, VideoGPT was an influential early step toward modern AI video, demonstrating a clean, likelihood-based path to generative video.

How it works

VideoGPT works in two stages. First, a 3D VQ-VAE learns to encode video into a compact grid of discrete spatiotemporal tokens (capturing both space and motion) and to decode them back into frames. Second, a GPT-style transformer is trained to model the sequence of these tokens autoregressively — predicting the next token given the previous ones. To generate, the transformer samples a new token sequence, which the VQ-VAE decoder turns into a video clip. This mirrors how GPT generates text, but over video tokens.
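The two stages can be sketched in a few lines of plain Python. This is a toy illustration, not the VideoGPT code: the codebook, the `quantize`, `dequantize` and `sample_tokens` helpers, and the uniform stand-in for the transformer are all hypothetical, and a real model would use learned PyTorch modules over a 3D grid of tokens.

```python
import random

# Hypothetical 4-entry codebook of 2-D embedding vectors (a real VQ-VAE learns these).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(vector):
    """Stage 1 (encode): map a continuous latent to the index of its nearest codebook entry."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(CODEBOOK)), key=lambda i: sq_dist(CODEBOOK[i]))


def dequantize(tokens):
    """Stage 1 (decode side): look each token index back up in the codebook."""
    return [CODEBOOK[t] for t in tokens]


def next_token_probs(prefix):
    """Stand-in for the transformer. A real model would condition on `prefix`;
    here every token is equally likely, purely for illustration."""
    return [1.0 / len(CODEBOOK)] * len(CODEBOOK)


def sample_tokens(length, rng):
    """Stage 2: autoregressive sampling, one token at a time, given the tokens so far."""
    tokens = []
    for _ in range(length):
        probs = next_token_probs(tokens)
        tokens.append(rng.choices(range(len(CODEBOOK)), weights=probs, k=1)[0])
    return tokens


# Encode: continuous latents become discrete tokens.
latents = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]
encoded = [quantize(v) for v in latents]

# Generate: sample a fresh token sequence, then look up embeddings for a decoder to consume.
rng = random.Random(0)
generated = sample_tokens(length=8, rng=rng)
embeddings = dequantize(generated)
```

In this sketch the decoder is reduced to a codebook lookup; in the real model the looked-up embeddings would pass through the VQ-VAE decoder network to become frames.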

What it is good at

VideoGPT is primarily a research and educational model. It generates short, low-to-moderate-resolution clips and is valuable for studying generative video and the token-based approach, and for use as a baseline or building block. Its clean design makes it a good way to understand how discrete latent models and transformers can be applied to video, and it informed later, more capable systems. It is best suited to experimentation rather than to producing polished, long, or high-resolution video.

Licensing & access

VideoGPT is open source, with the code and details available on GitHub for researchers to run and build on. It is implemented in PyTorch and runs on a GPU; training is compute-intensive (especially the transformer), while sampling from a trained model is lighter. As a research artefact, it comes with example datasets and configurations rather than a polished product interface.
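To see why the transformer tends to dominate training cost, it helps to count tokens. The sketch below uses hypothetical numbers (a 16-frame, 64×64 clip and a 4× downsampling factor along each axis, none of which come from this page) and assumes dense self-attention, where the score matrix grows with the square of the sequence length.

```python
def token_grid(frames, height, width, t_down, s_down):
    """Size of the latent token grid after temporal and spatial downsampling."""
    return (frames // t_down, height // s_down, width // s_down)


def sequence_length(grid):
    """Number of tokens the transformer must model for one clip."""
    t, h, w = grid
    return t * h * w


def attention_pairs(seq_len):
    """Entries in one dense self-attention score matrix (per head, per layer)."""
    return seq_len * seq_len


grid = token_grid(frames=16, height=64, width=64, t_down=4, s_down=4)
n = sequence_length(grid)
print(grid, n, attention_pairs(n))  # (4, 16, 16) 1024 1048576
```

Under these assumptions, doubling the spatial resolution at the same downsampling factor quadruples the token count and multiplies the attention matrix by sixteen, which is one plausible reason clips stay short and low-resolution.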

Practical considerations

Set expectations: VideoGPT produces short, relatively low-resolution clips and is an earlier-generation approach — modern diffusion-based video models far exceed it in quality, length and controllability. Training is resource-heavy, and there is no simple text-to-video prompt interface like newer tools. It is best treated as a research baseline and learning tool rather than a production video generator; for real video work, use a current diffusion-based model.

How it compares

Stable Video Diffusion and AnimateDiff use diffusion to produce higher-quality, more controllable video and represent the current standard; VideoGPT instead uses a discrete-token, autoregressive approach. Its significance is historical and conceptual: it was an important demonstration that video could be generated as a sequence of tokens, a thread that influenced later work. For research and understanding, choose VideoGPT; for producing good video today, use a diffusion-based model.

Getting started

Clone the VideoGPT repository, set up the PyTorch environment, and either use a provided configuration/dataset or train the VQ-VAE and transformer on your own short-video data (expect significant GPU time). Sample from a trained model to generate clips. Treat it as a research and learning exercise — to study token-based video generation — and reach for Stable Video Diffusion or AnimateDiff when you need high-quality, controllable video output.

Capabilities

🧩

Token-based video

Encodes video into discrete spatiotemporal tokens with a 3D VQ-VAE.

🔁

Autoregressive generation

A GPT-style transformer models and samples token sequences to create clips.

🔬

Research baseline

A clean, influential demonstration of generative video modelling.

🎓

Educational

Illustrates how language-model ideas transfer to video.

Pros & Cons

Pros (6)

Clean VQ-VAE + transformer video approach
Influential early generative-video research
Open source and self-hostable
Good for learning and as a baseline
Likelihood-based, token-style generation
Foundation for understanding AI video

Cons (4)

Short, low-resolution clips
Earlier generation — diffusion models far exceed it
Training is resource-heavy
No simple text-to-video interface

Inspiration

VideoGPT use cases & project ideas

Video research

Study generative video models.

Learning

Understand token-based video.

Baseline

Compare video approaches.

Short clips

Generate experimental video.

FAQ

Frequently asked questions

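What is VideoGPT?

An open research model that generates short video by pairing a VQ-VAE (discrete video tokens) with a GPT-style transformer.

How does it generate video?

A GPT-style transformer samples a sequence of discrete tokens, and the VQ-VAE decoder turns that sequence into a video clip.

Is it good for production video?

No. It is best treated as a research baseline and learning tool; for real video work, use a current diffusion-based model.

Is it open source?

Yes, with PyTorch code on GitHub for research and experimentation.

How does it compare to diffusion video models?

Diffusion models such as Stable Video Diffusion and AnimateDiff produce higher-quality, more controllable video. VideoGPT uses a discrete-token, autoregressive approach, and its significance is mainly historical and conceptual.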
