Overview

Hugging Face provides access to thousands of open-source models through its Inference API, making it a great option for experimenting with cutting-edge models.

Model Class: HuggingFaceModel

Authentication

Environment Variables

export HF_TOKEN="hf_..."
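If exporting a shell variable is inconvenient (for example in a notebook), you can also set the token from Python before creating the model. This sketch assumes the provider reads HF_TOKEN from the environment, as the export above implies:

import os

# Set the Hugging Face token programmatically (equivalent to the export above)
os.environ["HF_TOKEN"] = "hf_..."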

Using infer_model

from upsonic import infer_model

model = infer_model("huggingface/meta-llama/Llama-3.3-70B-Instruct")

Manual Configuration

from upsonic.models.huggingface import HuggingFaceModel, HuggingFaceModelSettings

settings = HuggingFaceModelSettings(
    max_tokens=2048,
    temperature=0.7
)

model = HuggingFaceModel(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    settings=settings
)

Examples

Basic Usage

from upsonic import Agent, Task, infer_model

model = infer_model("huggingface/meta-llama/Llama-3.3-70B-Instruct")
agent = Agent(model=model)

task = Task("Explain deep learning")
result = agent.do(task)

With Reasoning Models

from upsonic import Agent, Task, infer_model

# DeepSeek R1 reasoning model
model = infer_model("huggingface/deepseek-ai/DeepSeek-R1")
agent = Agent(model=model)

task = Task("Solve this math problem step by step: ...")
result = agent.do(task)

With Qwen Models

from upsonic import Agent, Task, infer_model

# Qwen's large model
model = infer_model("huggingface/Qwen/Qwen3-235B-A22B")
agent = Agent(model=model)

task = Task("Generate Python code for web scraping")
result = agent.do(task)

Prompt Caching

Hugging Face does not currently support native prompt caching.

Best Practice: Use memory to carry conversation context instead:

from upsonic import Agent, Task, infer_model
from upsonic.storage.memory import Memory
from upsonic.storage.providers.in_memory import InMemoryStorage

storage = InMemoryStorage()
memory = Memory(storage=storage, session_id="session-123")

model = infer_model("huggingface/meta-llama/Llama-3.3-70B-Instruct")
agent = Agent(model=model, memory=memory)
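
With memory attached, follow-up tasks in the same session can build on earlier exchanges instead of resending the full context each time. A minimal sketch (retrieval behavior depends on how Memory is configured):

first = agent.do(Task("Summarize the key ideas behind transformers"))
# Later tasks in the same session can draw on the stored context
follow_up = agent.do(Task("Now compare that with recurrent networks"))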

Model Parameters

Base Settings

Parameter           Type         Description                   Default
max_tokens          int          Maximum tokens to generate    Model default
temperature         float        Sampling temperature          1.0
top_p               float        Nucleus sampling              1.0
seed                int          Random seed                   None
stop_sequences      list[str]    Stop sequences                None
presence_penalty    float        Token presence penalty        0.0
frequency_penalty   float        Token frequency penalty       0.0

Example Configuration

from upsonic.models.huggingface import HuggingFaceModel, HuggingFaceModelSettings

settings = HuggingFaceModelSettings(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9,
    seed=42,
    presence_penalty=0.1,
    frequency_penalty=0.1
)

model = HuggingFaceModel(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    settings=settings
)

Available Models

Meta Llama

  • meta-llama/Llama-3.3-70B-Instruct: Latest Llama
  • meta-llama/Llama-4-Scout-17B-16E-Instruct: Llama 4 small
  • meta-llama/Llama-4-Maverick-17B-128E-Instruct: Llama 4 large context

DeepSeek

  • deepseek-ai/DeepSeek-R1: Reasoning model

Qwen

  • Qwen/Qwen3-235B-A22B: Large model
  • Qwen/Qwen3-32B: Efficient model
  • Qwen/Qwen2.5-72B-Instruct: Previous generation
  • Qwen/QwQ-32B: Reasoning model
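
Any of the model IDs above can be used with infer_model by prefixing them with huggingface/, as in the earlier examples:

from upsonic import infer_model

model = infer_model("huggingface/Qwen/Qwen3-32B")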

Model Selection Guide

Model               Size     Best For
Llama 3.3 70B       70B      Balanced performance
Llama 4 Scout       17B      Fast inference
Llama 4 Maverick    17B      Long contexts
DeepSeek R1         Large    Reasoning tasks
Qwen 3 235B         235B     Complex tasks
Qwen 3 32B          32B      Efficient processing
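
If you switch models based on the task at hand, a small lookup table keeps the choice in one place. This is an illustrative sketch; MODELS_BY_USE_CASE and choose_model are not part of Upsonic:

from upsonic import infer_model

# Illustrative mapping from use case to the models in the guide above
MODELS_BY_USE_CASE = {
    "balanced": "huggingface/meta-llama/Llama-3.3-70B-Instruct",
    "fast": "huggingface/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "long_context": "huggingface/meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "reasoning": "huggingface/deepseek-ai/DeepSeek-R1",
    "complex": "huggingface/Qwen/Qwen3-235B-A22B",
}

def choose_model(use_case: str):
    model_id = MODELS_BY_USE_CASE.get(use_case, MODELS_BY_USE_CASE["balanced"])
    return infer_model(model_id)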

Best Practices

  1. Choose Right Model Size: Balance performance and cost
  2. Check Model Availability: Some models require approval
  3. Handle Rate Limits: Free tier has limitations
  4. Use Pro Subscription: For higher limits
  5. Monitor Costs: Paid usage can add up
  6. Test Before Production: Verify model quality

Rate Limits

Free Tier

  • Limited requests per minute
  • May experience queuing
  • Good for testing

Pro Subscription

  • Higher rate limits
  • Priority access
  • Better for production

Troubleshooting

Model Loading Errors

Some models are gated and require explicit access approval. Visit the model page on Hugging Face to request access before using the model through the Inference API.
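
A gated or unauthenticated model typically fails at request time with an HTTP error. The sketch below assumes such failures surface as ModelHTTPError (the exception used in the rate-limiting example below) with a 401/403 status:

from upsonic import Agent, Task, infer_model
from upsonic.utils.package.exception import ModelHTTPError

model = infer_model("huggingface/meta-llama/Llama-3.3-70B-Instruct")
agent = Agent(model=model)

try:
    result = agent.do(Task("Hello"))
except ModelHTTPError as e:
    # 401/403 usually mean a missing HF_TOKEN or a gated model
    # that still needs access approval on its Hugging Face page
    if e.status_code in (401, 403):
        print("Check HF_TOKEN and request access on the model's page")
    else:
        raise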

Token Limit Exceeded

from upsonic.models.huggingface import HuggingFaceModelSettings

# Reduce max_tokens
settings = HuggingFaceModelSettings(
    max_tokens=1024  # Lower limit
)

Rate Limiting

import time

from upsonic.utils.package.exception import ModelHTTPError

def request_with_retry(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            return agent.do(task)
        except ModelHTTPError as e:
            # Retry only on 429 (rate limited), with exponential backoff
            if e.status_code == 429 and attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s
                time.sleep(wait_time)
            else:
                raise
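
Usage is the same as calling agent.do directly:

result = request_with_retry(agent, task)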