Overview

Ollama lets you run large language models locally on your own machine, which makes it well suited to development, testing, and privacy-sensitive applications.

Model Class: OpenAIChatModel (OpenAI-compatible API)

Authentication

Environment Variables

export OLLAMA_BASE_URL="http://localhost:11434"  # Optional, this is the default
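Ollama requires no API key; the only connection setting is the base URL. If your Ollama server is not at the default address, point the client at it before creating a model. A minimal sketch, assuming upsonic reads OLLAMA_BASE_URL at model-creation time:

import os

# Assumption: OLLAMA_BASE_URL is read when the model is created.
os.environ["OLLAMA_BASE_URL"] = "http://192.168.1.50:11434"  # example remote host

from upsonic import infer_model

model = infer_model("ollama/llama3.2")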

Installation

First, install Ollama:
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Pull a Model

# Pull a model before using it
ollama pull llama3.2
ollama pull mistral
ollama pull codellama

Using infer_model

from upsonic import infer_model

# Make sure Ollama is running and model is pulled
model = infer_model("ollama/llama3.2")

Manual Configuration

from upsonic.models.openai import OpenAIChatModel, OpenAIChatModelSettings

settings = OpenAIChatModelSettings(
    max_tokens=2048,
    temperature=0.7
)

model = OpenAIChatModel(
    model_name="llama3.2",
    provider="ollama",
    settings=settings
)
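
A manually configured model plugs into an Agent exactly like one returned by infer_model:

from upsonic import Agent, Task

agent = Agent(model=model)
task = Task("Summarize the trade-offs of running models locally")
result = agent.do(task)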

Examples

Basic Usage

from upsonic import Agent, Task, infer_model

model = infer_model("ollama/llama3.2")
agent = Agent(model=model)

task = Task("Explain quantum mechanics")
result = agent.do(task)

With Tools

from upsonic import Agent, Task, infer_model

def calculate(expression: str) -> float:
    """Evaluate a mathematical expression."""
    # eval() is used here for brevity; do not pass untrusted input to eval in real code.
    return eval(expression)

model = infer_model("ollama/llama3.2")
agent = Agent(model=model, tools=[calculate])

task = Task("Calculate 25 * 36 + 100")
result = agent.do(task)
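
Because eval() executes arbitrary Python, you may prefer a restricted evaluator for the calculate tool. The sketch below (a hypothetical safe_calculate helper, not part of upsonic) accepts only basic arithmetic:

import ast
import operator

# Map AST operator nodes to the arithmetic they represent.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return float(_eval(ast.parse(expression, mode="eval")))

# safe_calculate("25 * 36 + 100") -> 1000.0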

Code Generation

from upsonic import Agent, Task, infer_model

# Use codellama for code tasks
model = infer_model("ollama/codellama")
agent = Agent(model=model)

task = Task("Write a Python function to find prime numbers")
result = agent.do(task)

Streaming

import asyncio

from upsonic import Agent, Task, infer_model

model = infer_model("ollama/llama3.2")
agent = Agent(model=model)

task = Task("Write a short story about space exploration")

async def main() -> None:
    async for chunk in agent.do_stream(task):
        print(chunk, end="", flush=True)

asyncio.run(main())

With Vision (Multi-modal Models)

from upsonic import Agent, Task, infer_model
from upsonic.messages import ImageUrl

# Use a vision model like llava
model = infer_model("ollama/llava")
agent = Agent(model=model)

task = Task(
    description="What's in this image?",
    attachments=[
        ImageUrl(url="file:///path/to/image.jpg")
    ]
)

result = agent.do(task)

Prompt Caching

Ollama does not support prompt caching in the traditional sense, but it does keep recently used models loaded in memory:
from upsonic import Agent, Task, infer_model

# First request loads model into memory
model = infer_model("ollama/llama3.2")
agent = Agent(model=model)

task1 = Task("Hello")
result1 = agent.do(task1)  # Model loads (slower)

# Subsequent requests are faster as model stays in memory
task2 = Task("How are you?")
result2 = agent.do(task2)  # Faster

Keep Models in Memory

# Keep model loaded
ollama run llama3.2

# Or set the keep_alive option in an API request to control how long
# the model stays loaded (see the sketch below)
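
The Ollama HTTP API exposes a keep_alive option that controls how long a model stays loaded after a request; sending an empty prompt simply loads the model. A minimal sketch against the default local endpoint:

import json
import urllib.request

# Ask Ollama to load llama3.2 and keep it resident for one hour.
payload = json.dumps({"model": "llama3.2", "prompt": "", "keep_alive": "1h"}).encode()
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode())  # reports when the model has been loaded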

Model Parameters

Base Settings

Parameter        Type         Description                  Default
max_tokens       int          Maximum tokens to generate   Model default
temperature      float        Sampling temperature         0.8
top_p            float        Nucleus sampling             0.9
seed             int          Random seed                  None
stop_sequences   list[str]    Stop sequences               None

Example Configuration

from upsonic.models.openai import OpenAIChatModel, OpenAIChatModelSettings

settings = OpenAIChatModelSettings(
    max_tokens=4096,
    temperature=0.7,
    top_p=0.9,
    seed=42,
    stop_sequences=["END"]
)

model = OpenAIChatModel(
    model_name="llama3.2",
    provider="ollama",
    settings=settings
)

Available Models

Meta Llama

  • llama3.2: Latest Llama (3B/1B)
  • llama3.1: Previous generation (8B/70B/405B)
  • llama3: Original Llama 3 (8B/70B)
  • llama2: Llama 2 models

Mistral

  • mistral: Mistral 7B
  • mistral-openorca: Fine-tuned variant
  • mixtral: Mixture of Experts

Code Models

  • codellama: Code generation
  • deepseek-coder: DeepSeek code model
  • starcoder2: StarCoder models

Vision Models

  • llava: Vision understanding
  • bakllava: Vision model variant

Specialized

  • phi3: Microsoft Phi-3
  • gemma2: Google Gemma 2
  • qwen2: Alibaba Qwen 2

Model Selection Guide

Model          Size   RAM    Best For
llama3.2:1b    1B     1GB    Quick responses, low resources
llama3.2:3b    3B     2GB    Balanced performance
llama3.1:8b    8B     8GB    General purpose
llama3.1:70b   70B    48GB   Complex tasks
codellama      7B     5GB    Code generation
llava          7B     8GB    Vision tasks

Hardware Requirements

Minimum Requirements

  • CPU: Modern multi-core processor
  • RAM: 8GB minimum (16GB recommended)
  • Storage: 10GB+ free space
  • OS: macOS 11+, Linux, Windows 10+

Recommended for Larger Models

  • GPU: NVIDIA GPU with 8GB+ VRAM (CUDA support)
  • RAM: 32GB+ for larger models
  • Storage: SSD for faster model loading

GPU Acceleration

Ollama automatically uses GPU if available:
# Check if GPU is being used
ollama ps

# Models will show GPU memory usage

Best Practices

  1. Pull Models Before Use: Avoid delays during first requests (see the sketch after this list)
  2. Choose Appropriate Size: Match model to your hardware
  3. Keep Models Loaded: For faster subsequent requests
  4. Use GPU: Dramatically improves performance
  5. Monitor Resources: Check RAM/VRAM usage
  6. Quantization: Use quantized models to cut memory use and speed up inference
  7. Local Development: Perfect for offline work
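
To avoid first-request delays programmatically, you can check which models are installed and pull a missing one through the Ollama REST API before creating an agent. A sketch, assuming the default local endpoint:

import json
import urllib.request

OLLAMA = "http://localhost:11434"

def ensure_model(name: str) -> None:
    """Pull `name` via the Ollama API if it is not already installed locally."""
    with urllib.request.urlopen(f"{OLLAMA}/api/tags") as response:
        installed = {m["name"] for m in json.load(response)["models"]}
    if not any(tag == name or tag.startswith(name + ":") for tag in installed):
        payload = json.dumps({"model": name, "stream": False}).encode()
        request = urllib.request.Request(
            f"{OLLAMA}/api/pull",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request).read()

ensure_model("llama3.2")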

Model Management

Pull Models

# Pull latest version
ollama pull llama3.2

# Pull specific size
ollama pull llama3.1:70b

List Models

ollama list

Remove Models

ollama rm llama3.2

Update Models

ollama pull llama3.2  # Re-pull to update

Customization

Create Custom Models

Create a Modelfile:
# Modelfile
FROM llama3.2

# Set temperature
PARAMETER temperature 0.8

# Set system prompt
SYSTEM You are a helpful coding assistant

# Set stop sequences
PARAMETER stop "END"

Create the model:

ollama create mycustommodel -f Modelfile

Then use the custom model from upsonic:
from upsonic import infer_model

model = infer_model("ollama/mycustommodel")

Troubleshooting

Connection Errors

# Make sure Ollama is running
# Start Ollama service:
# macOS/Linux: ollama serve
# Or use the Ollama app
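
A quick way to confirm the server is reachable before creating an agent is to hit Ollama's version endpoint; a small sketch:

import urllib.error
import urllib.request

# Probe the default Ollama address; /api/version answers if the server is up.
try:
    with urllib.request.urlopen("http://localhost:11434/api/version", timeout=2) as response:
        print("Ollama is running:", response.read().decode())
except urllib.error.URLError:
    print("Cannot reach Ollama -- start it with `ollama serve` or the desktop app.")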

Out of Memory

# Use smaller model
ollama pull llama3.2:1b

# Or use quantized version
ollama pull llama3.1:8b-q4_0

Slow Performance

# Use GPU acceleration
# Install CUDA drivers for NVIDIA GPUs

# Or use smaller/quantized models
ollama pull llama3.2:3b-q4_0

Advantages of Ollama

  1. Privacy: Data never leaves your machine
  2. No API Costs: Free to run locally
  3. Offline: Works without internet
  4. Customization: Full control over models
  5. Development: Perfect for iteration
  6. Learning: Experiment without costs

Limitations

  1. Hardware: Requires significant resources
  2. Performance: Slower than cloud APIs
  3. Model Selection: Limited to open models
  4. Updates: Manual model updates
  5. Scaling: Not suitable for high-traffic production