open source

ERNIE-ViL

Provided by: Baidu

Framework: PaddlePaddle

ERNIE-ViL is a multimodal AI model developed by Baidu that combines vision and language understanding in a single framework. Built on PaddlePaddle and licensed under Apache 2.0, it supports tasks such as image captioning, visual question answering, and cross-modal retrieval.

Model Performance Statistics

Views: 14
Released: September 25, 2019
Last Checked: July 20, 2025
Version: ERNIE-ViL 2.0

Capabilities
  • Visual Question Answering
  • Image Captioning
Performance Benchmarks
VQA Accuracy: 74.5%
Technical Specifications
Parameter Count: N/A
Training & Dataset

Dataset Used: Visual Genome, COCO

Related AI Models

Model (open source)

CLIP

OpenAI

CLIP (Contrastive Language–Image Pretraining) is an open-source multimodal model developed by OpenAI that learns visual concepts from natural language supervision. Built with PyTorch and released under the MIT license, it produces image and text embeddings for applications such as zero-shot classification, semantic search, and cross-modal retrieval. It remains actively used in research and AI product development.

Multimodal, image-text embedding, Multimodal AI
15
Model (open source)

DeepSeek-VL

DeepSeek AI

DeepSeek-VL is an open-source multimodal AI model that combines vision and language processing for tasks such as image captioning, semantic search, and cross-modal retrieval. Developed with PyTorch and released under the MIT license, it is suited to systems that need joint understanding of visual and textual data.

Multimodal, Multimodal AI
13
Model (open source)

CogVLM

Tsinghua University

CogVLM is an open-source vision-language model developed by Tsinghua University. Built with PyTorch and released under the Apache 2.0 license, it supports tasks such as image captioning, visual question answering (VQA), cross-modal retrieval, and semantic understanding, and is intended as a base for building multimodal AI applications.

Multimodal, Multimodal AI
13