You want AI-powered code completion in your editor or app, but you don't want to pay OpenAI or Anthropic just to autocomplete a Python function. Good news: BigCode's StarCoder2 model is open source, free, and accessible through the Hugging Face Inference API. This StarCoder2 API tutorial walks you through fetching code completions from the model in Python and JavaScript, with real working examples.
By the end of this guide, you'll have a script that sends a prompt to StarCoder2 and gets back generated code. You'll also know how to handle the gotchas — cold starts, token limits, and the weird shape of the response.
Let's get into it.
What Is StarCoder2?
StarCoder2 is a family of open-source code models trained by the BigCode project, a collaboration between Hugging Face and ServiceNow. It comes in three sizes: 3B, 7B, and 15B parameters. The model was trained on The Stack v2, a dataset of permissively licensed source code covering 600+ programming languages.
You can run it locally if you have a beefy GPU, but most developers just hit it through Hugging Face's free Inference API. That's what we'll do here. One thing worth knowing upfront: the free tier has a roughly 1000-token max output per request, and cold starts can take 20–30 seconds the first time you call a model that hasn't been used recently.
Why Use This Free AI Code Generation API?
- It's genuinely free — no credit card needed for basic Inference API usage
- Open source weights, so you're not locked into a vendor
- Trained on real code in 600+ languages, not just Python and JavaScript
- Great for code completion, docstring generation, and small refactors
- The same API endpoint pattern works for thousands of other Hugging Face models
Honest take: StarCoder2 isn't going to replace GPT-4 or Claude for complex reasoning. But for autocomplete-style tasks and short code snippets, it punches well above its weight.
Step-by-Step Setup
You need three things:
- Python 3.8 or newer
- A free Hugging Face account (sign up at huggingface.co)
- A user access token from your HF settings page (read scope is fine)
Wait — didn't I say no API key needed? Half-true. The Hugging Face Inference API technically works without a token for some public models, but you'll get rate-limited hard and hit auth errors on bigger models like the 15B variant. Grab the free token. It takes 30 seconds.
Install the requests library:
pip install requests
Set your token as an environment variable so you don't paste it into code:
export HF_TOKEN="hf_yourtokenhere"
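Before writing any real code, it's worth confirming the token actually works. Here's a minimal sanity-check sketch that hits the Hub's whoami endpoint; it assumes HF_TOKEN is exported as above, and a 200 response means the token is valid.
import os
import requests

# Ask Hugging Face who this token belongs to (a cheap validity check)
token = os.environ["HF_TOKEN"]
resp = requests.get(
    "https://huggingface.co/api/whoami-v2",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.status_code, resp.json().get("name", "unknown"))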
Code Examples for the StarCoder2 API Tutorial
Python Example: Basic Fetch
Here's the smallest working call. It sends a prompt and prints the raw response.
import os
import requests
# StarCoder2-15B endpoint on Hugging Face Inference API
API_URL = "https://api-inference.huggingface.co/models/bigcode/starcoder2-15b"
# Read token from environment — never hardcode it
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
# The prompt is just the start of the code you want completed
payload = {
    "inputs": "def fibonacci(n):\n    \"\"\"Return the nth Fibonacci number.\"\"\"\n",
    "parameters": {
        "max_new_tokens": 100,   # cap output length — free tier max around 1000
        "temperature": 0.2,      # lower = more deterministic code
        "return_full_text": False
    }
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.status_code)
print(response.json())
That's the bare bones. Run it and you'll either get a JSON list with a generated_text field, or a loading message if the model is cold. Let's fix that next.
Python Example: Practical Version With Error Handling
This version handles cold starts, network errors, and weird responses. It's what you'd actually ship.
import os
import time
import requests
API_URL = "https://api-inference.huggingface.co/models/bigcode/starcoder2-15b"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
# StarCoder2 free tier cap: ~1000 max_new_tokens per request
MAX_OUTPUT_TOKENS = 200
def generate_code(prompt: str, retries: int = 3) -> str:
    """Send a prompt to StarCoder2 and return generated code."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": MAX_OUTPUT_TOKENS,
            "temperature": 0.2,
            "return_full_text": False
        },
        "options": {"wait_for_model": True}  # waits during cold start instead of failing
    }
    for attempt in range(retries):
        try:
            response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
            response.raise_for_status()
            data = response.json()
            # Response shape: [{"generated_text": "..."}]
            if isinstance(data, list) and data:
                return data[0].get("generated_text", "")
            # Sometimes you get {"error": "...", "estimated_time": 22.5}
            if isinstance(data, dict) and "error" in data:
                wait = data.get("estimated_time", 10)
                print(f"Model loading. Waiting {wait:.0f}s...")
                time.sleep(wait)
                continue
            return ""
        except requests.exceptions.HTTPError as e:
            print(f"HTTP error: {e} — attempt {attempt + 1}/{retries}")
            time.sleep(2 ** attempt)  # exponential backoff
        except requests.exceptions.Timeout:
            print("Request timed out — retrying")
    raise RuntimeError("Failed to get a response after retries")
if __name__ == "__main__":
    prompt = (
        "# Python function that checks if a string is a valid email address\n"
        "def is_valid_email(email: str) -> bool:\n"
    )
    completion = generate_code(prompt)
    print("--- Generated code ---")
    print(prompt + completion)
The wait_for_model option is the part most tutorials skip. Without it, the first call after a cold start fails with a 503. With it, the API blocks until the model is loaded and then returns your response.
Sample Output
--- Generated code ---
# Python function that checks if a string is a valid email address
def is_valid_email(email: str) -> bool:
    import re
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None
Not bad. The model picked a reasonable regex and matched the type hints from the prompt. This is how a free code completion API earns its keep — fast, decent suggestions for everyday tasks.
JavaScript Example: Fetch StarCoder2 Completions With Error Handling
// Node.js 18+ (built-in fetch). The same fetch code works in a browser,
// but process.env won't exist there — and never expose your token client-side.
const API_URL = "https://api-inference.huggingface.co/models/bigcode/starcoder2-15b";
const HF_TOKEN = process.env.HF_TOKEN;
// Free tier output cap is around 1000 tokens — keep requests modest
const MAX_OUTPUT_TOKENS = 200;
async function generateCode(prompt) {
  const payload = {
    inputs: prompt,
    parameters: {
      max_new_tokens: MAX_OUTPUT_TOKENS,
      temperature: 0.2,
      return_full_text: false
    },
    options: { wait_for_model: true } // handles cold start gracefully
  };
  try {
    const response = await fetch(API_URL, {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${HF_TOKEN}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify(payload)
    });
    if (!response.ok) {
      throw new Error(`Request failed — HTTP ${response.status}`);
    }
    const data = await response.json();
    // Expected shape: [{ generated_text: "..." }]
    if (Array.isArray(data) && data.length > 0) {
      return data[0].generated_text ?? "";
    }
    if (data.error) {
      console.log("Model still loading:", data.error);
      return "";
    }
    return "";
  } catch (error) {
    console.error("Fetch failed:", error.message);
    return "";
  }
}
const prompt = "// JavaScript function that reverses a string\nfunction reverseString(str) {\n";
generateCode(prompt).then(completion => {
  console.log("--- Generated code ---");
  console.log(prompt + completion);
});
Sample Console Output
--- Generated code ---
// JavaScript function that reverses a string
function reverseString(str) {
  return str.split("").reverse().join("");
}
Understanding the Output
The Hugging Face Inference API returns a JSON array. Each entry is an object with one main field. Here's a labeled example:
[
  {
    "generated_text": " return str.split('').reverse().join('')"
  }
]
Field breakdown:
- generated_text — the model's completion. If you set return_full_text: false, this is only the new tokens. If true, it includes your original prompt prepended.
- error — only appears when something went wrong. Common values: "Model is currently loading" or rate limit messages.
- estimated_time — seconds until the model is ready. Only present alongside an error during cold start.
Heads up on token counting: max_new_tokens is the cap on the OUTPUT, not the input. Long prompts still count against the model's total context window of 16k tokens. You won't hit that easily with short completions.
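If you want a cheap guard against blowing that window, a common rule of thumb for source code is roughly four characters per token. This is only an estimate (the real number depends on the tokenizer), but it works fine as a pre-flight check:
CONTEXT_WINDOW = 16_384  # StarCoder2 total context: input + output combined

def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for code.
    # Good enough as a guard rail, not an exact count.
    return len(text) // 4

def fits_in_context(prompt: str, max_new_tokens: int) -> bool:
    return rough_token_count(prompt) + max_new_tokens < CONTEXT_WINDOW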
Error Handling: What Actually Breaks
Here's what I've hit running this in production. None of it is in the official docs in one place.
503 Service Unavailable on first request. The model is cold. Add "options": {"wait_for_model": true} to your payload. Your request will hang for 20–30 seconds, then return.
401 Unauthorized. Your token is missing, expired, or scoped wrong. Regenerate it in HF settings with read scope.
429 Too Many Requests. You hit the free tier rate limit. There's no published exact number — informal throttling lands around a few hundred requests per hour for free accounts. Space your calls with time.sleep(1) between them or upgrade to the Pro plan (see the pacing sketch after this list).
Empty generated_text. Usually means your prompt was malformed, or the model stopped immediately and produced zero new tokens. Try increasing temperature slightly (0.3–0.5) or rewording the prompt.
Output cuts off mid-function. You hit max_new_tokens. Raise it — the cap is roughly 1000 for free tier, but practical use is usually under 500. Going higher just costs you latency.
Stale model output. StarCoder2's training data has a cutoff. Don't expect it to know about libraries released in the last few months.
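For the 429 case specifically, here's the pacing pattern in code. A minimal sketch reusing generate_code from the Python example above; the one-second sleep is a guess, not a documented limit.
import time

prompts = [
    "def add(a: int, b: int) -> int:\n",
    "def slug_from_title(title: str) -> str:\n",
]

results = []
for p in prompts:
    results.append(generate_code(p))  # from the error-handling example above
    time.sleep(1)  # crude pacing to stay under the informal free-tier throttle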
Real-World Use Cases
IDE autocomplete plugin. Wire the API into a VS Code extension that sends the current cursor context as a prompt and shows suggestions inline. The 15B model is fast enough for interactive use if you cap output at 50 tokens.
Code review bot. On every pull request, send the diff to StarCoder2 with a prompt like "# Review this code for bugs:\n". It won't replace a human reviewer, but it catches obvious issues.
Docstring generator. Feed it a function signature and the first line of code, and let it write the docstring. This is the most reliable use of an open-source coding AI — small, scoped tasks with clear inputs (see the sketch after these use cases).
Boilerplate scaffolding. Generate test stubs, Pydantic models, or repetitive CRUD handlers. The model is decent at pattern matching when you give it one good example.
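Here's that docstring generator as a minimal sketch. It assumes generate_code from the Python example is in scope, and the truncation logic is a simple guess at where the model's docstring ends:
def write_docstring(signature: str) -> str:
    # Open a docstring after the signature, so the model's natural
    # completion is the docstring text itself.
    prompt = f'{signature}\n    """'
    completion = generate_code(prompt)  # helper from the Python example above
    # Keep only the docstring body: cut at the first closing triple quote.
    body = completion.split('"""')[0].strip()
    return f'{signature}\n    """{body}"""\n'

print(write_docstring("def slugify(title: str) -> str:"))
Low temperature helps here: you want the boring, conventional docstring, not a creative one.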
StarCoder2 vs Other Free Code AI Options
| Model | Free Access | Context Window | Max Output Tokens (free) | License |
|---|---|---|---|---|
| StarCoder2-15B | HF Inference API, free token | 16,384 | 1,000 | BigCode OpenRAIL-M |
| Code Llama 7B | HF Inference API, free token | 16,384 | 1,000 | Llama 2 Community |
| DeepSeek Coder 6.7B | HF Inference API, free token | 16,384 | 1,000 | DeepSeek License |
| GPT-4 (for comparison) | Paid only | 128,000 | 4,096 | Proprietary |
FAQ
Is the StarCoder2 API really free?
Yes, through Hugging Face's Inference API. You need a free account and an access token, but you don't pay per request on the basic tier. Heavy usage will hit rate limits, and you'd need a Pro plan ($9/month at time of writing) for higher throughput.
Do I need a GPU to use this BigCode API example?
No. That's the whole point of using the Inference API — Hugging Face runs the model on their servers and you just hit an HTTP endpoint. You only need a GPU if you want to run StarCoder2 locally with the transformers library.
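For completeness, here's roughly what a local run looks like with transformers, shown with the 3B variant since 15B needs serious hardware. Treat it as a sketch and check the model card for current loading advice; the memory figure is an estimate.
# Local inference sketch: pip install transformers torch accelerate
# The 3B checkpoint needs roughly 6-8 GB of GPU memory in float16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))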
Which StarCoder2 size should I use?
Start with the 15B variant for best quality. If you need lower latency, try 7B or 3B. The 3B model is fast enough for real-time autocomplete but the suggestions are noticeably weaker on complex code.
Can I use StarCoder2 commercially?
Yes, under the BigCode OpenRAIL-M license. There are some usage restrictions (no malicious use, no surveillance applications), but standard commercial software development is fully allowed. Read the license before shipping.
Why is my first request so slow?
Cold start. Hugging Face spins down models that haven't been used recently. The first request after a quiet period loads the model into GPU memory, which takes 20–30 seconds. Subsequent requests in the next few minutes are fast.
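If that delay hurts your UX, one mitigation is a warm-up ping: fire a tiny throwaway request when your app starts so the model is loaded before a user needs it. A sketch reusing API_URL and HEADERS from the Python example:
import requests

def warm_up() -> None:
    # A 1-token request with wait_for_model blocks until the model
    # is loaded; we simply discard the result.
    requests.post(
        API_URL,
        headers=HEADERS,
        json={
            "inputs": "pass",
            "parameters": {"max_new_tokens": 1},
            "options": {"wait_for_model": True},
        },
        timeout=120,
    )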
How does StarCoder2 compare to GitHub Copilot?
Copilot is more polished and has tighter IDE integration. StarCoder2 is open source and free. For raw completion quality on common languages, StarCoder2-15B is in the same ballpark. For niche languages or unusual frameworks, Copilot's larger underlying model usually wins.
Conclusion
You now have a working free AI code generation setup using StarCoder2. Two Python scripts, one JavaScript script, real error handling, and a clear picture of what breaks and why.
The next step? Pick one of the use cases above and build it. A docstring generator is the easiest weekend project — under 50 lines of code and immediately useful in your own workflow.
Looking for more free APIs to pair with your code generation pipeline? Browse the Free API Hub directory for hundreds of no-auth options.