Explore free Multimodal APIs and AI models on Free API Hub that integrate multiple data types—like text, images, and audio—into cohesive AI solutions. Perfect for developers building innovative applications that require rich context and cross-modal understanding.
10 resources
Emu2-Chat by BAAI is a free open-source 37B generative multimodal model that handles text, image, and video understanding plus image generation in one unified architecture. Best free generative multimodal AI for research.
https://baaivision.github.io/emu2

CogAgent by Tsinghua/Zhipu AI is a free open-source 18B vision-language model specialized for GUI understanding. Reads any screen, clicks buttons, and navigates apps. Best free open-source model for autonomous computer-use agents.
https://github.com/THUDM/CogAgent

CogVLM by Tsinghua/Zhipu AI is a free open-source 17B vision-language model with a visual expert architecture. Outperforms LLaVA on most benchmarks. Strong OCR, chart understanding, and reasoning. Apache 2.0 friendly.
https://github.com/THUDM/CogVLM

DeepSeek-VL is a free open-source vision-language model with strong real-world performance on charts, diagrams, OCR, and scientific images. MIT-style license, sizes 1.3B-7B. DeepSeek-VL2 brings frontier-class quality.
https://github.com/deepseek-ai/DeepSeek-VL

ERNIE-ViL by Baidu is a free open-source vision-language model with strong scene-graph understanding. Excellent for image captioning, visual Q&A, and visual reasoning in both English and Chinese. Top free Chinese multimodal AI.
https://research.baidu.com/Blog/index-view?id=193

CLIP by OpenAI is a free open-source vision-language model that connects images and text in one shared embedding space. Powers zero-shot image classification, semantic search, content moderation, and AI image generators like Stable Diffusion.
https://openai.com/research/clip
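
To make the zero-shot classification use case concrete, here is a minimal sketch using the Hugging Face Transformers CLIP wrapper; the checkpoint name, image path, and candidate labels are illustrative placeholders, and any CLIP variant can be swapped in.

```python
# Zero-shot image classification with CLIP (sketch; pip install transformers torch pillow).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # example label set

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity of the image to each label
probs = logits.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The same text and image encoders can also be run separately, embedding an image catalog once and matching free-text queries against it for semantic search.
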
Gemma 3 27B by Google DeepMind is a free open-weights multimodal LLM with 128K context, 140+ language support, and vision input. Runs on a single GPU. Best free Gemini alternative for self-hosting in 2026.
https://ai.google.dev/gemma
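
As a rough sketch of self-hosted visual Q&A, the snippet below uses the Transformers image-text-to-text pipeline with the gated google/gemma-3-27b-it checkpoint; the pipeline task name, message format, and checkpoint id follow recent Transformers documentation and may vary by version, and the image URL is a placeholder.

```python
# Visual question answering with Gemma 3 27B via the Transformers pipeline
# (sketch; assumes a recent transformers release with Gemma 3 support and that
# the gated checkpoint's license has been accepted on Hugging Face).
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-27b-it",  # 27B weights; plan for a large single GPU
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])
```
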
LLaVA-NeXT is a free open-source multimodal AI that lets you chat with images. Apache 2.0 licensed, supports high-resolution vision, and runs locally with Ollama. Best free GPT-4V alternative for visual Q&A and document understanding.
https://llava-vl.github.io/blog/2024-01-30-llava-next/
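
Because the entry calls out local use with Ollama, here is a minimal sketch using the Ollama Python client; the `llava` model tag (Ollama's LLaVA builds track the 1.6 / NeXT weights) and the image path are assumptions, and the Ollama server must already be running with the model pulled.

```python
# Chat with an image through a local Ollama server
# (sketch; assumes `ollama pull llava` has been run and the daemon is up).
import ollama

response = ollama.chat(
    model="llava",  # assumed tag; use whichever size/variant you pulled
    messages=[
        {
            "role": "user",
            "content": "What is shown in this screenshot? Summarize any visible text.",
            "images": ["screenshot.png"],  # placeholder local image path
        }
    ],
)
print(response["message"]["content"])
```
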
Chameleon 7B by Meta AI is a free open-source early-fusion multimodal LLM that natively understands and generates text and images in a unified token space. Research-only license, foundational mixed-modal architecture.
https://ai.meta.com/research/publications/chameleon-mixed-modal-early-fusion-foundation-models/
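
For readers with access to the research-licensed weights, below is a rough sketch of mixed-modal prompting assuming the Transformers Chameleon integration and the facebook/chameleon-7b checkpoint; the class names, the <image> placeholder convention, and the dtype handling follow the Transformers Chameleon documentation and may differ across versions.

```python
# Image + text prompting with Chameleon 7B (sketch; assumes the Transformers
# Chameleon integration and access to the research-licensed checkpoint;
# text generation from a mixed text/image prompt is shown here).
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("diagram.png")               # placeholder image
prompt = "What does this diagram show?<image>"  # <image> marks where the image goes

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```
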
Kosmos-2.5 by Microsoft is a free multimodal AI specialized in reading text-rich images — receipts, documents, scientific papers, screenshots. State-of-the-art OCR + understanding in one model. MIT license, perfect for document AI.
https://www.microsoft.com/en-us/research/publication/kosmos-2-5-a-multimodal-literate-model/