Overview
Hugging Face provides access to thousands of open-source models through its Inference API, which makes it a good fit for experimenting with cutting-edge models.
Model Class: HuggingFaceModel
Authentication
Environment Variables
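As a minimal sketch of environment-based authentication, the helper below reads an access token from the environment and fails early if it is missing. The variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions made for illustration; check which variable your setup actually reads.

```python
import os
from typing import Mapping, Optional

# Assumption: the token lives in HF_TOKEN; adjust if your setup uses another name.
TOKEN_ENV_VAR = "HF_TOKEN"


def load_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the access token from the environment, or raise if it is missing."""
    source = os.environ if env is None else env
    token = source.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} before creating the model")
    return token
```

Failing at startup rather than on the first request tends to make a missing or empty token easier to diagnose.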
Using infer_model
Manual Configuration
Examples
Basic Usage
With Reasoning Models
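Reasoning models often emit their chain of thought before the final answer. The sketch below assumes the reasoning arrives inline, wrapped in `<think>...</think>` tags; some providers return it in a separate field instead, in which case this step is unnecessary. The helper name `split_reasoning` is hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the raw text.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    reasoning = "\n".join(m.strip() for m in _THINK_BLOCK.findall(text))
    answer = _THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
```

Keeping the two parts separate lets you log the reasoning for debugging while showing users only the answer.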
With Qwen Models
Prompt Caching
Hugging Face does not currently support native prompt caching.
Best Practice: Use memory for conversation context.
Model Parameters
Base Settings
| Parameter | Type | Description | Default |
|---|---|---|---|
| max_tokens | int | Maximum tokens to generate | Model default |
| temperature | float | Sampling temperature | 1.0 |
| top_p | float | Nucleus sampling | 1.0 |
| seed | int | Random seed | None |
| stop_sequences | list[str] | Stop sequences | None |
| presence_penalty | float | Token presence penalty | 0.0 |
| frequency_penalty | float | Token frequency penalty | 0.0 |
Example Configuration
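The sketch below mirrors the base settings table as a plain dataclass with the same names and defaults, plus a few basic range checks. `GenerationSettings` and `to_kwargs` are hypothetical names used for illustration, not the library's own settings class; how the values are passed to the model depends on your setup.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationSettings:  # hypothetical name, not the library's class
    max_tokens: Optional[int] = None  # None means "use the model default"
    temperature: float = 1.0
    top_p: float = 1.0
    seed: Optional[int] = None
    stop_sequences: Optional[list[str]] = None
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0

    def __post_init__(self) -> None:
        # Reject values that are out of range before any request is made.
        if self.max_tokens is not None and self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")

    def to_kwargs(self) -> dict:
        """Return the settings as a dict, dropping entries left as None."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Low temperature plus a fixed seed for more repeatable output.
settings = GenerationSettings(max_tokens=512, temperature=0.2, seed=42)
```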
Available Models
Meta Llama
- meta-llama/Llama-3.3-70B-Instruct: Latest Llama
- meta-llama/Llama-4-Scout-17B-16E-Instruct: Llama 4 small
- meta-llama/Llama-4-Maverick-17B-128E-Instruct: Llama 4 large context
DeepSeek
- deepseek-ai/DeepSeek-R1: Reasoning model
Qwen
- Qwen/Qwen3-235B-A22B: Large model
- Qwen/Qwen3-32B: Efficient model
- Qwen/Qwen2.5-72B-Instruct: Previous generation
- Qwen/QwQ-32B: Reasoning model
Model Selection Guide
| Model | Size | Best For |
|---|---|---|
| Llama 3.3 70B | 70B | Balanced performance |
| Llama 4 Scout | 17B | Fast inference |
| Llama 4 Maverick | 17B | Long contexts |
| DeepSeek R1 | Large | Reasoning tasks |
| Qwen 3 235B | 235B | Complex tasks |
| Qwen 3 32B | 32B | Efficient processing |
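The guide above can be encoded as a small lookup so the model choice lives in one place. The model IDs are copied from the Available Models section; the task labels and the `pick_model` helper are hypothetical names chosen for this sketch.

```python
# Model IDs come from the Available Models list; the task keys are made-up labels.
MODEL_FOR_TASK = {
    "balanced": "meta-llama/Llama-3.3-70B-Instruct",
    "fast": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "long_context": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "reasoning": "deepseek-ai/DeepSeek-R1",
    "complex": "Qwen/Qwen3-235B-A22B",
    "efficient": "Qwen/Qwen3-32B",
}


def pick_model(task: str, default: str = "balanced") -> str:
    """Return the model ID for a task label, falling back to the default label."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK[default])
```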
Best Practices
- Choose Right Model Size: Balance performance and cost
- Check Model Availability: Some models require approval
- Handle Rate Limits: Free tier has limitations
- Use Pro Subscription: For higher limits
- Monitor Costs: Paid usage can add up
- Test Before Production: Verify model quality
Rate Limits
Free Tier
- Limited requests per minute
- May experience queuing
- Good for testing
Pro Subscription
- Higher rate limits
- Priority access
- Better for production
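One common way to cope with per-minute limits is to retry with exponential backoff. The sketch below is a generic pattern, not part of any Hugging Face client: `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on an HTTP 429 response, and `call_with_backoff` is an illustrative helper.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the wait each time; jitter keeps clients from retrying in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Wrap the model call in a zero-argument function or lambda and pass it as `fn`. The `sleep` parameter is injectable so the wait can be replaced in tests.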

