Open Source · Multimodal · by OpenAI

CLIP

CLIP is OpenAI's contrastive language-image model, which learns a shared embedding space for images and text. It enables zero-shot image classification and image-text search, and underpins much of modern multimodal AI.

computer-vision, image-text, multimodal-ai, openai, open-source-ai, zero-shot
Quick facts
License: MIT (open source)
Type: Vision-Language, contrastive VLM (embeddings)
Strength: Zero-Shot
Sizes: ViT-B/L/+ (speed vs accuracy)
By: OpenAI (+ OpenCLIP)

What is CLIP?

CLIP (Contrastive Language-Image Pre-training) is a foundational vision-language model from OpenAI that learns to connect images and text in a shared embedding space. Trained on roughly 400 million image-caption pairs collected from the web, it learns which images and which text descriptions go together. The breakthrough is that this single training objective yields a model capable of zero-shot image classification (recognising categories it was never explicitly trained on) and of image-text search, which has made CLIP one of the most influential building blocks in modern multimodal AI.

How it works

CLIP has two encoders: an image encoder (a Vision Transformer or ResNet) and a text encoder (a Transformer). Training uses a contrastive objective: given a batch of image-text pairs, the model pulls each image's embedding towards its matching caption and pushes it away from the non-matching captions in the batch. The result is a shared space where the similarity between an image and a piece of text is their cosine similarity, that is, the dot product of the two normalised embeddings. To classify zero-shot, you embed candidate labels as text (e.g. 'a photo of a cat') and pick the label whose embedding is closest to the image's.
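The contrastive objective described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not OpenAI's training code: the function names are made up for this example, the embeddings are plain lists of floats, and the temperature is fixed at 0.07 for simplicity (an assumption here; in CLIP the temperature is a learned parameter).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair i matches pair i."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text direction: row i should peak at column i
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should peak at row j
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy two-dimensional embeddings, invented for illustration only
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # matching pairs give the lower loss
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is the signal that shapes the shared embedding space during training.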

What it is good at

CLIP excels at zero-shot classification, image-text retrieval and semantic image search — find images by describing them, or tag images against an open-ended label set without training. Its embeddings are widely reused: they guide and evaluate image-generation models, power content moderation and de-duplication, enable multimodal search, and serve as the vision backbone for many later vision-language systems. Its flexibility and reusability are its defining strengths.
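As a rough sketch of semantic image search, the snippet below ranks a small index of image embeddings against a query embedding by cosine similarity. The three-dimensional vectors and file names are toy values invented for illustration; in practice both the index and the query would come from CLIP's image and text encoders, and `rank_images` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_emb, image_index, top_k=3):
    """Return the top_k (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in image_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for real CLIP image embeddings
toy_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
# Toy embedding standing in for the text embedding of a search query
toy_query = [0.8, 0.2, 0.1]

results = rank_images(toy_query, toy_index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The same similarity score can serve de-duplication: two images whose embeddings have a similarity close to 1 are likely near-duplicates, with the exact threshold being something to tune on your own data.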

Licensing & access

CLIP is open source: OpenAI released the code and weights under the MIT licence. The popular community project OpenCLIP offers many additional models trained on open datasets such as LAION. CLIP is supported in Hugging Face Transformers and runs on a single GPU (small variants even run on CPU). The range of sizes (ViT-B, ViT-L and larger) trades speed for accuracy.

Practical considerations

CLIP is an understanding and embedding model, not a generator: it scores and matches but does not produce images or captions on its own (though it guides models that do). Zero-shot accuracy depends on how the text labels are phrased as prompts, and CLIP can reflect biases and gaps in its web training data, so evaluate it on your own task. For best results, pick an appropriate model size and consider OpenCLIP variants trained on larger open datasets.
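One way to check prompt phrasing on your own task is to measure zero-shot accuracy for several candidate templates on a small labelled set. The sketch below assumes a `score(image, prompt)` callable that returns an image-text similarity; here it is replaced by a trivial stand-in (`toy_score`) so the example runs on its own, and all names are hypothetical rather than part of any CLIP library.

```python
def zero_shot_accuracy(examples, labels, template, score):
    """Fraction of (image, true_label) examples classified correctly.

    score(image, prompt) is assumed to return a similarity, higher is better.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    prompts = [template.format(label=label) for label in labels]
    correct = 0
    for image, true_label in examples:
        scores = [score(image, prompt) for prompt in prompts]
        predicted = labels[scores.index(max(scores))]
        if predicted == true_label:
            correct += 1
    return correct / len(examples)

def compare_templates(examples, labels, templates, score):
    """Map each prompt template to its zero-shot accuracy."""
    return {t: zero_shot_accuracy(examples, labels, t, score) for t in templates}

def toy_score(image, prompt):
    # Stand-in for a real CLIP similarity: the "image" is just a word,
    # and it scores 1.0 when that word appears in the prompt.
    return 1.0 if image in prompt.split() else 0.0

toy_examples = [("cat", "cat"), ("dog", "dog")]
accuracies = compare_templates(
    toy_examples,
    labels=["cat", "dog"],
    templates=["a photo of a {label}", "an image"],
    score=toy_score,
)
for template, accuracy in accuracies.items():
    print(f"{template!r}: {accuracy:.2f}")
```

With a real model, `score` would wrap the image and text encoders, and the same loop can be run per subgroup of your data to look for the biases and gaps mentioned above.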

How it compares

ERNIE-ViL adds structured scene-graph knowledge to vision-language pretraining; CogVLM and LLaVA-NeXT are generative multimodal assistants. CLIP is different and more fundamental: a contrastive embedding model whose strength is flexible matching, retrieval and zero-shot recognition, and which serves as a backbone inside many of those newer systems. For embeddings, search and zero-shot classification, CLIP (or OpenCLIP) is the standard tool.

Getting started

Load CLIP from Hugging Face or OpenCLIP, embed your images and candidate text labels, and compare the embeddings by similarity; classification or search then takes only a few lines of code. For zero-shot classification, phrase labels as descriptive prompts; for search, embed a query and rank images by closeness. Pick a model size that fits your accuracy and speed needs, consider OpenCLIP's larger-dataset variants, and evaluate prompts and bias on your own data.
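A minimal zero-shot classification sketch using Hugging Face Transformers might look like the following. It assumes `torch`, `transformers` and `Pillow` are installed and uses the `openai/clip-vit-base-patch32` checkpoint; the helper names `build_prompts` and `classify_image` and the file name `example.jpg` are illustrative, not part of any library.

```python
def build_prompts(labels, template="a photo of a {label}"):
    """Turn bare class names into descriptive prompts."""
    return [template.format(label=label) for label in labels]

def classify_image(image_path, labels, model_name="openai/clip-vit-base-patch32"):
    """Return (label, probability) pairs sorted from most to least likely."""
    # Imported lazily so build_prompts works without these packages installed
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompts(labels),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, number_of_labels); softmax over the labels
    probs = outputs.logits_per_image.softmax(dim=-1)[0].tolist()
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Example call (the image path is hypothetical):
# for label, prob in classify_image("example.jpg", ["cat", "dog", "car"]):
#     print(f"{label}: {prob:.3f}")
```

Loading the model inside the function keeps the sketch short; for repeated calls you would normally load the model and processor once and reuse them.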

Model variants

  • CLIP ViT-B/32: 151M parameters. Fast; lightweight baseline.
  • CLIP ViT-L/14: 428M parameters. High quality; stronger accuracy.

Capabilities

🎯 Zero-shot recognition
Classifies images against open-ended text labels without task-specific training.
🧮 Shared embedding space
Image-text similarity reduces to a dot product of normalised embeddings, for search and matching.
🔄 Reusable backbone
Its embeddings guide image generators and power many multimodal systems.
🔍 Retrieval
Find images by description, or captions for images, via embedding similarity.

Pros & Cons

Pros
  • Zero-shot image classification without training
  • Capable image-text search and retrieval
  • Reusable embeddings across many tasks
  • Open source (MIT); OpenCLIP adds variants
  • Runs on a single GPU; multiple sizes
  • Foundational backbone for multimodal AI
Cons
  • An embedding model, not a generator
  • Zero-shot accuracy depends on prompt wording
  • Can reflect web-data biases and gaps
  • Requires choosing the right size/variant for the task

Inspiration

CLIP use cases & project ideas

Zero-shot tagging

Classify images by text labels.

Image search

Find images by description.

Moderation / dedupe

Filter and match content.

Guide generation

Steer image models with CLIP.

FAQ

Frequently asked questions

CLIP learns a shared embedding space for images and text, enabling zero-shot image classification and image-text search.
