What is FastChat?
FastChat is an open platform from LMSYS (UC Berkeley and collaborators) for training, serving and evaluating large language model chatbots. Rather than a single model, it is the toolkit that made the open-chatbot wave practical: it trained and released Vicuna, runs the well-known Chatbot Arena for human preference evaluation, and provides a production-style serving stack — including an OpenAI-compatible API server — so you can run open models behind the same interface apps already use. It is released under Apache 2.0.
What it provides
FastChat bundles three things developers reach for. A serving system with a controller, model workers and a web/REST gateway lets you host one or many models and scale across GPUs. An OpenAI-compatible endpoint means existing OpenAI SDK code can point at your local model with almost no changes. And it includes training and evaluation utilities — the recipes used to fine-tune Vicuna, plus tools like MT-Bench and the Arena methodology for judging chatbot quality.
What it is good at
Its sweet spot is running open chat models yourself with a familiar API. Teams use it to self-host models like Vicuna, Llama-derived chatbots and others behind a drop-in OpenAI interface, to compare models with MT-Bench and Arena-style evaluation, and as a reference for chatbot fine-tuning. Because it is widely adopted, it is well documented and integrates with the broader open-LLM ecosystem.
Licensing & access
FastChat is open source under Apache 2.0, installed from PyPI or GitHub and run on your own hardware. Note that while FastChat itself is permissive, the models you serve carry their own licences (for example Llama-based weights) that you must respect. It supports a range of GPUs and can serve quantised models, so you can match it to anything from a single consumer card to a multi-GPU server.
Practical considerations
FastChat is a serving and research framework, not a turnkey product — you provide the models, the hardware and the operational care. For maximum raw inference throughput, dedicated engines like vLLM, TensorRT-LLM or MLC-LLM may be faster, and FastChat can integrate with some of them as workers. Mind GPU memory for larger models, and remember the quality of your deployment depends on the model you choose to serve.
How it compares
Compared with pure inference engines (TensorRT-LLM, MLC-LLM) that focus on squeezing maximum speed from a model, FastChat is broader: it covers serving, training and evaluation and made human-preference benchmarking mainstream through the Arena. If you want to self-host an open chatbot behind an OpenAI-style API and also measure how good it is, FastChat is the established, all-in-one choice.
Getting started
Install FastChat with pip, then launch the three serving components — a controller, a model worker that loads your chosen model, and the OpenAI-compatible API server. Point any OpenAI SDK at the local endpoint and start chatting. From there you can add more workers to scale, run the MT-Bench suite to score and rank models, or follow the published Vicuna recipes to fine-tune your own chatbot. Because the API mirrors OpenAI's, swapping a hosted model for a self-hosted one is often a one-line change to the base URL.


