You've probably hit the same wall every computer vision beginner hits. You read about Detectron2, you see the cool segmentation demos, and then you try to install it locally — and your laptop fan starts sounding like a jet engine. This Detectron2 tutorial API guide skips that pain. We'll call a hosted inference endpoint instead, so you can run real object detection without compiling CUDA or fighting PyTorch versions.
By the end of this post, you'll send an image URL to an API, get back bounding boxes, class labels, and confidence scores, and print them out. We'll use Python first, then do the same thing in JavaScript. Both examples handle errors the way you'd actually want them handled in a real app.
The goal here isn't to teach you the math behind Mask R-CNN. It's to get a working object detection script running on your machine in the next 15 minutes.
What Is Detectron2 and Why Use an API for It?
Detectron2 is Facebook AI's open-source library for object detection, instance segmentation, and keypoint detection. It's the backbone behind a lot of production vision systems — retail analytics, security feeds, medical imaging. The library is fast and accurate. It's also famously annoying to set up.
That's where a hosted Detectron2 tutorial API comes in. Instead of installing 4GB of dependencies, you POST an image (or image URL) to an endpoint and get JSON back. For this tutorial we'll use the public Hugging Face Inference API, which hosts Detectron2-style models like facebook/detr-resnet-50 for free. No credit card. No GPU.
One thing to note up front: the free tier is rate-limited and has a model cold-start delay. We'll handle both later in the error section.
Why Use This Free Object Detection API
- No local install: No PyTorch, no CUDA, no version mismatches.
- Free tier: Hugging Face's Inference API runs without payment for low-volume use.
- No API key needed for public models: Anonymous requests work, though you get higher limits if you add a free token.
- Real model output: Same architecture family used in Facebook AI object detection research.
- Beginner-friendly JSON: The response is a flat list of detections — easy to loop through.
Compared to running Detectron2 locally, you trade some latency for zero setup time. Honestly, for prototyping, that's a great trade.
Step-by-Step Setup
Here's what you need before writing any code.
- Python 3.8 or newer installed.
- The requests library for HTTP calls.
- An image URL to test with (we'll use a public sample).
- Optional: a free Hugging Face account if you want a personal token for higher rate limits.
Install the one dependency:
```shell
pip install requests
```
That's it. No torch, no detectron2, no opencv-python. Just requests.
Code Examples: Python and JavaScript
We'll build this up in three steps. First a basic Python fetch so you can see the response. Then a practical Python version with error handling. Then the JavaScript equivalent.
Python Example: Basic Fetch
This sends an image URL to the model endpoint and prints whatever comes back. Minimal code, no error handling yet — just to confirm the API responds.
```python
import requests

# Public DETR model — same family used in Facebook AI object detection research
API_URL = "https://api-inference.huggingface.co/models/facebook/detr-resnet-50"

# Any public image URL works. This is a sample cat photo from Wikimedia.
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/640px-Cat03.jpg"

# The API accepts raw image bytes in the request body
image_bytes = requests.get(IMAGE_URL, timeout=15).content

response = requests.post(API_URL, data=image_bytes, timeout=30)
print(response.status_code)
print(response.json())
```
Run that. If the model is warm, you'll see a JSON list of detections. If you see a message saying the model is loading, give it 20 seconds and try again — that's normal on the free tier.
Python Example: Practical Object Detection Script
Now let's make it production-shaped. We add error handling, a retry for cold starts, and a clean print of each detection. Free tier note: the endpoint takes one image per call, and you should space calls to avoid 429 errors — there's no published per-minute limit, so time.sleep(1) between calls is a safe spacing.
```python
import requests
import time

API_URL = "https://api-inference.huggingface.co/models/facebook/detr-resnet-50"
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/640px-Cat03.jpg"

# Free tier caps: 1 image per request, ~30 requests/hour anonymous.
# Model cold-start can take 15-25 seconds the first call.
MAX_RETRIES = 3
COLD_START_WAIT = 20


def detect_objects(image_url):
    # Download the image first so we can send raw bytes
    image_response = requests.get(image_url, timeout=15)
    image_response.raise_for_status()
    image_bytes = image_response.content

    for attempt in range(MAX_RETRIES):
        try:
            response = requests.post(API_URL, data=image_bytes, timeout=30)

            # 503 means the model is loading — wait and retry
            if response.status_code == 503:
                print(f"Model is warming up. Waiting {COLD_START_WAIT}s...")
                time.sleep(COLD_START_WAIT)
                continue

            # 429 means we hit the rate limit
            if response.status_code == 429:
                print("Rate limited. Backing off for 60s...")
                time.sleep(60)
                continue

            response.raise_for_status()
            return response.json()
        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}. Retrying...")
            time.sleep(2)

    raise RuntimeError("Could not get a response after retries.")


detections = detect_objects(IMAGE_URL)

# Filter out low-confidence boxes — anything under 0.7 is usually noise
for item in detections:
    if item["score"] >= 0.7:
        label = item["label"]
        score = round(item["score"], 3)
        box = item["box"]
        print(f"{label} ({score}) at x:{box['xmin']}-{box['xmax']} y:{box['ymin']}-{box['ymax']}")
```
Here's what the script prints when the model has detected a cat in the sample image:
```
cat (0.998) at x:14-621 y:43-475
couch (0.812) at x:0-639 y:200-479
```
Two detections, both with confidence well above our threshold. The box values are pixel coordinates in the original image — top-left is (xmin, ymin), bottom-right is (xmax, ymax).
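If you need more than the raw corners, width, height, center, and area all fall out of simple arithmetic. A quick sketch using the cat box from the sample output above:

```python
# Derive box geometry from one detection's coordinates
box = {"xmin": 14, "ymin": 43, "xmax": 621, "ymax": 475}

width = box["xmax"] - box["xmin"]    # 607 px
height = box["ymax"] - box["ymin"]   # 432 px
center = ((box["xmin"] + box["xmax"]) / 2, (box["ymin"] + box["ymax"]) / 2)
area = width * height

print(f"{width}x{height} px, center {center}, area {area} px^2")
```

This comes in handy for things like sorting detections by size or checking whether a box sits inside a region of interest.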
JavaScript Example: Fetch Detections With Error Handling
Same logic, written for Node.js 18+ (which has fetch built in). No npm install needed.
```javascript
// Node.js 18+ or any modern browser
const API_URL = "https://api-inference.huggingface.co/models/facebook/detr-resnet-50";
const IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/640px-Cat03.jpg";

// Free tier caps: 1 image per request. Cold start can take ~20s.
async function detectObjects(imageUrl) {
  try {
    // Step 1 — download the image as a binary Blob
    const imageResponse = await fetch(imageUrl);
    if (!imageResponse.ok) {
      throw new Error(`Image download failed — HTTP ${imageResponse.status}`);
    }
    const imageBlob = await imageResponse.blob();

    // Step 2 — POST raw bytes to the inference endpoint
    const response = await fetch(API_URL, {
      method: "POST",
      body: imageBlob
    });

    // 503 means the model is still loading on the free tier
    if (response.status === 503) {
      console.log("Model warming up. Try again in ~20 seconds.");
      return;
    }
    if (!response.ok) {
      throw new Error(`Inference failed — HTTP ${response.status}`);
    }

    const detections = await response.json();
    if (!Array.isArray(detections) || detections.length === 0) {
      console.log("No objects detected.");
      return;
    }

    // Only print confident predictions
    detections
      .filter(item => item.score >= 0.7)
      .forEach(item => {
        const score = item.score.toFixed(3);
        const { xmin, ymin, xmax, ymax } = item.box;
        console.log(`${item.label} (${score}) at x:${xmin}-${xmax} y:${ymin}-${ymax}`);
      });
  } catch (error) {
    console.error("Detection failed:", error.message);
  }
}

detectObjects(IMAGE_URL);
```
And the console output once the model is warm:
```
cat (0.998) at x:14-621 y:43-475
couch (0.812) at x:0-639 y:200-479
```
Same shape as the Python output. That's by design — the API contract is the same regardless of client.
Understanding the Output
Each item in the response array represents one detected object. Here's the field breakdown:
- score: A float between 0 and 1. Higher means the model is more confident. Anything under 0.5 is usually noise; our examples filter at 0.7 to be safe.
- label: The class name as a string ("cat", "person", "car"). DETR uses the COCO label set — 80 common classes.
- box: An object with xmin, ymin, xmax, ymax — pixel coordinates of the bounding box in the original image.
Sample raw JSON for one detection:
```json
{
  "score": 0.998,
  "label": "cat",
  "box": {"xmin": 14, "ymin": 43, "xmax": 621, "ymax": 475}
}
```
If you want instance segmentation tutorial output instead of bounding boxes, swap the model name for facebook/maskformer-swin-base-coco. The response shape changes — you get a per-pixel mask as a base64 PNG instead of a box. We'll keep the box version here because it's simpler to parse.
Error Handling: What Actually Breaks
This is the part most tutorials skip, and it's the part that'll save you two hours of debugging.
503 — Model is loading. The first request after a cold period boots the model on Hugging Face's servers. Response body says something like {"error": "Model is currently loading", "estimated_time": 20}. Don't treat this as a failure — sleep and retry.
429 — Rate limited. The free tier doesn't publish exact per-minute limits, but anonymous requests get throttled fast. Add time.sleep(1) between calls. If you're processing more than ~30 images an hour, sign up for a free token and pass it as Authorization: Bearer YOUR_TOKEN.
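Passing that token is a one-line change to the POST call. A minimal sketch, assuming the token lives in an environment variable named HF_TOKEN (the variable name is our choice, not part of the API):

```python
import os

import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/detr-resnet-50"


def build_headers(token=None):
    # Attach the bearer token only when one is provided;
    # anonymous calls still work, just with lower limits.
    return {"Authorization": f"Bearer {token}"} if token else {}


def post_image(image_bytes, token=None):
    # Same POST as before, with the optional auth header attached.
    # HF_TOKEN is an assumed variable name, not something the API mandates.
    token = token or os.environ.get("HF_TOKEN")
    return requests.post(API_URL, headers=build_headers(token), data=image_bytes, timeout=30)
```

Keeping the token in an environment variable means the script works both anonymously and authenticated without code changes.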
Empty list response. Sometimes the model returns []. That's not an error — it just means no objects passed the internal confidence floor. Lower your threshold or try a clearer image.
Image too large. The endpoint accepts up to about 10MB per request. Bigger images get rejected with a 413. Resize before sending if you're working with high-res photos.
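A shrink step before upload is a few lines with Pillow (a separate `pip install Pillow`). This sketch downscales so the longest side is at most 1280 px, which keeps typical photos well under the cap:

```python
import io

from PIL import Image


def shrink_image(image_bytes, max_side=1280):
    """Downscale an image so its longest side is <= max_side, return JPEG bytes."""
    img = Image.open(io.BytesIO(image_bytes))
    scale = max_side / max(img.size)
    if scale < 1:
        # Only downscale — never blow up small images
        new_size = (int(img.width * scale), int(img.height * scale))
        img = img.resize(new_size)
    buf = io.BytesIO()
    # Re-encode as JPEG; quality 85 is a reasonable size/fidelity trade-off
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return buf.getvalue()
```

Call it on the downloaded bytes before the POST and the 413 problem goes away for all but enormous panoramas.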
Timeouts. Always set timeout=30 on Python requests. Without it, a stuck request will hang your script forever. Trust me on this one.
Real-World Use Cases
Where would you actually use a free object detection API like this? A few honest examples:
- Retail shelf monitoring: Snap a photo of a store shelf and count product instances. Flag empty slots.
- Security camera triage: Run frames through detection and only alert humans when a "person" appears outside business hours.
- Content moderation prep: Pre-filter user-uploaded images to detect what's in them before a human reviewer sees them.
- Wildlife camera traps: Process SD card dumps and auto-tag images with "deer", "bird", or "empty" so you don't scroll through 4,000 blank frames.
Comparison: Hosted API vs Local Detectron2
| Factor | Hosted API (this tutorial) | Local Detectron2 |
|---|---|---|
| Setup time | 2 minutes | 1-3 hours (CUDA + PyTorch) |
| Cost | Free, ~30 req/hour anonymous | Free, but needs a GPU |
| Latency per image | 500-2000 ms | 50-150 ms on GPU |
| Max image size | 10 MB | Limited by GPU RAM |
| Offline use | No | Yes |
FAQ
Do I need to install Detectron2 locally to follow this tutorial?
No. The whole point of using a hosted endpoint is to skip the local install. If you later need offline inference or sub-100ms latency, that's when installing Detectron2 locally makes sense.
Is this really a free computer vision free api?
Yes, with limits. The Hugging Face Inference API serves these models for free at low volume. For production traffic you'd want a paid tier or a self-hosted setup, but for learning and prototyping it's genuinely free.
What's the difference between object detection and instance segmentation?
Object detection draws a rectangle around each object. Instance segmentation draws the exact outline, pixel by pixel. This tutorial focuses on detection because the response is easier to parse. For an instance segmentation tutorial, swap to a MaskFormer or Mask R-CNN model on the same endpoint.
Why is my first request so slow?
Free-tier models go to sleep when nobody calls them. Your first request wakes the model up, which takes 15-25 seconds. After that, follow-up requests are fast. The retry logic in the practical example handles this for you.
Can I use my own images instead of a URL?
Yes. In Python, open the file with open("photo.jpg", "rb").read() and send those bytes as the data argument. In JavaScript, use a File or Blob object as the body in fetch. The API doesn't care where the bytes came from.
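Here's what that looks like as a minimal Python sketch; `photo.jpg` is a placeholder path, not a file this tutorial provides:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/detr-resnet-50"


def detect_local_file(path):
    # Read the file as raw bytes and send them exactly like the URL version
    with open(path, "rb") as f:
        image_bytes = f.read()
    response = requests.post(API_URL, data=image_bytes, timeout=30)
    response.raise_for_status()
    return response.json()


# detections = detect_local_file("photo.jpg")  # placeholder filename
```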
What object classes does the model recognise?
The DETR ResNet-50 model is trained on COCO, which has 80 classes — common things like person, car, dog, chair, laptop, bottle. It won't recognise specific products or faces. For custom classes you'd need to fine-tune the model.
Conclusion
You now have a working object detection pipeline in two languages, with proper error handling, and you didn't install a single deep learning library. That's the practical value of a Detectron2 tutorial API approach — you get to focus on what you do with the results instead of fighting the setup.
The logical next step is to wire this into something useful. Take the detection output and draw boxes on the image using Pillow in Python, or canvas in the browser. Or batch-process a folder of images and dump results to CSV. Both are 20 lines of code from where you are now.
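The Pillow version of that first step is short. A sketch, assuming you already have the raw image bytes and the detection list from the practical script (Pillow installed separately):

```python
import io

from PIL import Image, ImageDraw


def draw_boxes(image_bytes, detections, threshold=0.7):
    """Draw a labelled rectangle for each detection above the confidence threshold."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    draw = ImageDraw.Draw(img)
    for item in detections:
        if item["score"] < threshold:
            continue
        b = item["box"]
        draw.rectangle([b["xmin"], b["ymin"], b["xmax"], b["ymax"]], outline="red", width=3)
        # Label each box just inside its top-left corner
        draw.text((b["xmin"] + 4, b["ymin"] + 4), f'{item["label"]} {item["score"]:.2f}', fill="red")
    return img


# annotated = draw_boxes(image_bytes, detections)
# annotated.save("annotated.jpg")
```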
Looking for more no-key APIs to plug into your projects? Browse the Free API Hub directory for hosted endpoints that work the same way.