The Hugging Face Inference API is the hosted runtime that lets you call any of the 1.5+ million open-source AI models on the Hugging Face Hub through a simple REST request.
Want to run Llama 3.3 70B without owning a GPU? Want sentiment classification, image segmentation, audio transcription, or protein folding? They are all behind one consistent API — you pick a model, send the input, get the result.
The practical value
It lets your project use specialized open-source models without building MLOps infrastructure. Picking the right model for a task (a fine-tuned BERT for medical text classification, a specific Stable Diffusion variant trained on architectural sketches, a Whisper variant optimized for accented English) is now a search query and an API call rather than a six-week ML engineering project.
Three distinct services under one umbrella
- Serverless Inference (free tier with rate limits) — you call a model and Hugging Face spins up the runtime on demand. Suitable for prototyping and low-traffic production.
- Inference Endpoints (paid, dedicated) — you deploy a model to a dedicated GPU instance with predictable latency and dedicated throughput, billed hourly.
- Inference Providers (newest tier, 2025+) — partnerships with Together, Fireworks, Replicate, and others, where Hugging Face acts as a routing layer over those providers' optimized inference clusters with pay-per-token pricing.
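As a concrete sketch of the provider tier (assuming a recent huggingface_hub release that accepts the provider argument, with Together as the backing provider and YOUR_TOKEN as a placeholder):

```python
from huggingface_hub import InferenceClient

# Route the call through a partner provider instead of HF's own serverless
# runtime; billing is per-token via the provider.
client = InferenceClient(provider="together", token=YOUR_TOKEN)

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```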
Where this fits in projects
- Sentiment analysis or text classification on customer feedback at small scale (use the serverless free tier).
- Audio transcription with Whisper variants for podcast platforms or meeting tools.
- Image generation with Stable Diffusion or FLUX models for marketing tools.
- Code generation with smaller open-source LLMs for internal IDE plugins.
- Document understanding with LayoutLMv3 or DocLLM.
- Embedding generation for RAG pipelines using Sentence Transformers or Nomic Embed (see the sketch below).
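A minimal sketch of the embedding case, assuming the serverless tier and the common all-MiniLM-L6-v2 model (swap in your own):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token=YOUR_TOKEN)

# One vector per call; batch your documents client-side for a RAG index.
vector = client.feature_extraction(
    "How do I reset my password?",
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(len(vector))  # 384 dimensions for this model
```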
Where it does not fit
- Production traffic at scale needs Inference Endpoints (dedicated, predictable) or a different provider entirely.
- Low-latency real-time applications (under 200 ms response times) need Inference Endpoints with provisioned warm instances.
- Frontier-quality LLMs (GPT-4, Claude Opus, Gemini Pro level) are not on Hugging Face; for those you need OpenAI, Anthropic, or Google directly.
- Highly regulated industries (HIPAA healthcare, finance) need either Inference Endpoints in a private VPC or a self-hosted deployment.
Getting started — two steps
Create a Hugging Face account at huggingface.co (free), generate an Access Token with Read permissions in your settings, and you can call any public model.
The Python client (pip install huggingface_hub) wraps the API:
```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Llama-3.3-70B-Instruct",
    token=YOUR_TOKEN,  # an Access Token with Read permissions
)
response = client.text_generation("Once upon a time")
```
The REST API is also direct — POST to api-inference.huggingface.co/models/{model_id} with a JSON body and the token in the Authorization header.
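A minimal sketch of that raw call with the requests library, using a small sentiment model as a stand-in (any public model id works in the URL):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {YOUR_TOKEN}"}

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "I love this product!"})
print(response.json())  # e.g. [[{"label": "POSITIVE", "score": 0.99}, ...]]
```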
Pricing has shifted in 2025-2026
The legacy serverless free tier still exists with rate limits (typically 1,000 calls per day for free users, higher for Pro subscribers at $9/month).
Inference Endpoints are billed by GPU instance hours: a small CPU instance is $0.06/hour, an A10G GPU is $1.30/hour, an H100 is $9.50/hour. Instances scale to zero when idle (saving cost during traffic gaps).
Inference Providers (the newest model) charge per million tokens, with prices varying by underlying provider. Llama 3.3 70B on Together AI through Hugging Face is roughly $0.88 per million input tokens and $0.88 per million output tokens.
Real cost example
For a SaaS product running embeddings on user documents at 10M tokens per day, the math works out to roughly $5-10 per day on serverless tiers, or about $31 per day on a dedicated A10G instance running 24/7 ($1.30/hour × 24).
The break-even where dedicated becomes cheaper than serverless is around 5-10M tokens per day depending on the model.
For sporadic batch workloads, serverless wins. For consistent traffic, dedicated wins.
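A back-of-the-envelope calculator for your own crossover point, using the illustrative prices above (treat both rates as assumptions and substitute current ones):

```python
def breakeven_mtok_per_day(dedicated_usd_per_hour: float,
                           serverless_usd_per_mtok: float) -> float:
    """Daily token volume above which a 24/7 dedicated instance is cheaper."""
    return dedicated_usd_per_hour * 24 / serverless_usd_per_mtok

# Cheap embedding model on an A10G: the crossover is high.
print(breakeven_mtok_per_day(1.30, 0.75))  # ~41.6M tokens/day
# A pricier generative model (assumed ~$4 per million tokens serverless)
# crosses over near the 5-10M tokens/day range quoted above.
print(breakeven_mtok_per_day(1.30, 4.00))  # ~7.8M tokens/day
```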
Alternatives matter — Hugging Face is not the only option
- Replicate — most direct competitor. Same idea (run open-source models via API) with a different model selection and a per-second billing model that fits image generation workloads particularly well.
- Together AI — specializes in serving Llama and Mistral models at competitive token prices.
- Fireworks AI — fast inference for popular open models with structured output support.
- Modal Labs — lets you deploy your own custom models on serverless GPUs with more flexibility than Hugging Face's curated catalog.
- RunPod and Vast.ai — give you raw GPU rentals if you want to run inference yourself.
Production details that matter
Cold start latency on serverless is real. First request to an inactive model can take 20-60 seconds while the model loads.
For user-facing latency-sensitive apps, either keep the model warm by sending periodic ping requests or use Inference Endpoints with always-on instances.
Diagnostic response headers (X-Cache-Hit, X-Compute-Time) tell you whether your request hit a warm cache or required a cold start, and how much compute time it consumed.
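A quick way to inspect those headers; the header names follow this article, so verify them against the live API:

```python
import requests

resp = requests.post(
    "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english",
    headers={"Authorization": f"Bearer {YOUR_TOKEN}"},
    json={"inputs": "warm or cold?"},
)
print(resp.headers.get("X-Compute-Time"))  # seconds spent on inference
print(resp.headers.get("X-Cache-Hit"))     # set when a warm cache answered
# A 503 whose JSON body reports an estimated load time means the model
# is cold and still being loaded onto a runtime.
print(resp.status_code)
```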
Batch processing patterns
For batch processing (embedding 1 million documents overnight), the serverless API throttles aggressively after sustained traffic.
Better patterns: deploy a temporary dedicated endpoint, run the batch, then tear it down (a sketch follows below). Or use Replicate's dedicated batch processing pricing, which is optimized for this workload. Or self-host the model on RunPod for a few hours.
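A sketch of that deploy-run-teardown pattern with huggingface_hub's create_inference_endpoint; the instance type and size strings here are assumptions, so check the Endpoints catalog for current values:

```python
from huggingface_hub import create_inference_endpoint

documents = ["first doc", "second doc"]  # stand-in for your real corpus

endpoint = create_inference_endpoint(
    "overnight-embedder",  # hypothetical endpoint name
    repository="sentence-transformers/all-MiniLM-L6-v2",
    framework="pytorch",
    task="feature-extraction",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    type="protected",
)
endpoint.wait()  # block until the instance is running

try:
    client = endpoint.client  # an InferenceClient bound to this endpoint
    embeddings = [client.feature_extraction(doc) for doc in documents]
finally:
    endpoint.delete()  # tear the endpoint down so hourly billing stops
```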
Model selection is the under-appreciated craft
The Hub has 50+ Llama variants, dozens of Stable Diffusion checkpoints, hundreds of Whisper fine-tunes. Picking the right one for your task affects quality dramatically.
The Hugging Face Spaces ecosystem (free demos hosted by users) lets you test models in a browser before committing to integration.
The benchmark leaderboards (Open LLM Leaderboard, MTEB for embeddings, Stable Bias for image models) help compare quantitatively.
Documentation at huggingface.co/docs/api-inference is current. Community forum at discuss.huggingface.co is active and helpful for technical questions. Pricing reference: huggingface.co/pricing covers Inference Endpoints, and the Inference Providers per-token rates are on each provider's partner page.