Overview
Ollama allows you to run large language models locally on your machine, making it well suited for development, testing, and privacy-sensitive applications.

Model Class: OpenAIChatModel (OpenAI-compatible API)
Authentication
Environment Variables
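Ollama runs entirely on your machine, so no API key is required; OpenAI-compatible clients generally just need any non-empty placeholder string. The only setting you may need is the server address. A sketch using Ollama's own OLLAMA_HOST variable (the value shown is the server's default):

```bash
# No API key is needed for a local server.
# Only set this if Ollama is not on the default address:
export OLLAMA_HOST=http://localhost:11434
```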
Installation
First, install Ollama:
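A sketch of the usual install commands (the official install script on Linux, Homebrew on macOS; Windows users can download the installer from https://ollama.com):

```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama
```

Pull a Model

Download the model weights before the first request (the model name is illustrative):

```bash
ollama pull llama3.2
```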
Using infer_model
Manual Configuration
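Ollama serves an OpenAI-compatible API at http://localhost:11434/v1, so manual configuration amounts to pointing an OpenAI-style client (or the wrapper's OpenAIChatModel) at that base URL with a placeholder API key. A minimal sketch using the openai package; the wrapper's exact constructor arguments may differ:

```python
from openai import OpenAI

# Any non-empty api_key works; a local Ollama server ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
```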
Examples
Basic Usage
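A minimal sketch against the OpenAI-compatible endpoint; the model name and prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what Ollama does in two sentences."},
    ],
)
print(response.choices[0].message.content)
```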
With Tools
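Tool calling works through the standard OpenAI tools parameter for models that support it (for example llama3.1); get_weather below is a hypothetical tool used only for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a real API
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama3.1",  # pick a model with tool-calling support
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decided to call the tool, inspect the requested arguments.
print(response.choices[0].message.tool_calls)
```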
Code Generation
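A sketch using one of the code models listed under Available Models, with a low temperature for more deterministic output:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="codellama",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=0.2,  # lower temperature tends to give more consistent code
)
print(response.choices[0].message.content)
```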
Streaming
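Streaming uses the standard stream=True flag and yields incremental chunks; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about running models locally."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```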
With Vision (Multi-modal Models)
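Multi-modal models such as llava accept images as base64 data URLs in the standard OpenAI content-part format; the file path is illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("photo.jpg", "rb") as f:  # illustrative path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```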
Prompt Caching
Ollama does not support prompt caching in the traditional sense, but it keeps recently used models loaded in memory between requests.

Keep Models in Memory
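How long a model stays loaded is controlled by the keep_alive setting (it defaults to a few minutes). A sketch using the server-wide environment variable plus a per-request override:

```bash
# Keep loaded models resident for 24 hours (set before starting the server)
export OLLAMA_KEEP_ALIVE=24h

# Preload a model and keep it in memory indefinitely (keep_alive: -1)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'

# Show which models are currently loaded and for how long
ollama ps
```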
Model Parameters
Base Settings
| Parameter | Type | Description | Default | 
|---|---|---|---|
| max_tokens | int | Maximum tokens to generate | Model default | 
| temperature | float | Sampling temperature | 0.8 | 
| top_p | float | Nucleus sampling | 0.9 | 
| seed | int | Random seed | None | 
| stop_sequences | list[str] | Stop sequences | None | 
Example Configuration
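A sketch wiring the parameters above through the OpenAI-compatible endpoint; all values are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the benefits of running models locally."}],
    max_tokens=256,
    temperature=0.4,
    top_p=0.9,
    seed=42,
    stop=["\n\n"],
)
print(response.choices[0].message.content)
```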
Available Models
Meta Llama
- llama3.2: Latest Llama (3B/1B)
- llama3.1: Previous generation (8B/70B/405B)
- llama3: Original Llama 3 (8B/70B)
- llama2: Llama 2 models
Mistral
- mistral: Mistral 7B
- mistral-openorca: Fine-tuned variant
- mixtral: Mixture of Experts
Code Models
- codellama: Code generation
- deepseek-coder: DeepSeek code model
- starcoder2: StarCoder models
Vision Models
- llava: Vision understanding
- bakllava: Vision model variant
Specialized
- phi3: Microsoft Phi-3
- gemma2: Google Gemma 2
- qwen2: Alibaba Qwen 2
Model Selection Guide
| Model | Size | RAM | Best For | 
|---|---|---|---|
| llama3.2:1b | 1B | 1GB | Quick responses, low resources | 
| llama3.2:3b | 3B | 2GB | Balanced performance | 
| llama3.1:8b | 8B | 8GB | General purpose | 
| llama3.1:70b | 70B | 48GB | Complex tasks | 
| codellama | 7B | 5GB | Code generation | 
| llava | 7B | 8GB | Vision tasks | 
Hardware Requirements
Minimum Requirements
- CPU: Modern multi-core processor
- RAM: 8GB minimum (16GB recommended)
- Storage: 10GB+ free space
- OS: macOS 11+, Linux, Windows 10+
Recommended for Performance
- GPU: NVIDIA GPU with 8GB+ VRAM (CUDA support)
- RAM: 32GB+ for larger models
- Storage: SSD for faster model loading
GPU Acceleration
Ollama automatically uses a supported GPU (NVIDIA via CUDA, AMD via ROCm, or Apple Silicon via Metal) when one is detected; no extra configuration is required.

Best Practices
- Pull Models Before Use: Avoid delays during first requests
- Choose Appropriate Size: Match model to your hardware
- Keep Models Loaded: For faster subsequent requests
- Use GPU: Dramatically improves performance
- Monitor Resources: Check RAM/VRAM usage
- Quantization: Use quantized model variants to reduce memory use and speed up inference, at a small quality cost
- Local Development: Perfect for offline work
Model Management
Pull Models
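Model names are illustrative:

```bash
ollama pull llama3.2
ollama pull llava
```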
List Models
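Two views are useful: what is downloaded and what is currently loaded:

```bash
# Models downloaded to disk
ollama list

# Models currently loaded in memory
ollama ps
```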
Remove Models
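Deleting a model frees its disk space (the name is illustrative):

```bash
ollama rm llama2
```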
Update Models
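Pulling a model you already have re-downloads any updated layers for that tag:

```bash
ollama pull llama3.2
```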
Customization
Create Custom Models
Create a Modelfile:
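A minimal sketch; the base model, parameter values, and system prompt are illustrative:

```
FROM llama3.2

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

SYSTEM "You are a concise assistant that answers in plain English."
```

Then build and run the custom model:

```bash
ollama create my-assistant -f Modelfile
ollama run my-assistant
```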
Troubleshooting
Connection Errors
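Connection errors usually mean the Ollama server is not running or is listening on a non-default address (the default is http://localhost:11434). A quick check:

```bash
# Is the server up?
curl http://localhost:11434/api/version

# If not, start it (or launch the Ollama desktop app)
ollama serve
```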
Out of Memory
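If a model does not fit in available RAM/VRAM, drop to a smaller or more heavily quantized variant, or inspect a model's size before loading it; the tags below are illustrative:

```bash
# Switch to a smaller model
ollama pull llama3.2:1b

# Inspect model details (parameter count, quantization) before loading
ollama show llama3.1:70b
```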
Slow Performance
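Slow generation is often a sign the model is running on the CPU or is too large for the GPU; check where it is loaded:

```bash
# The PROCESSOR column shows whether a loaded model is on GPU, CPU, or split
ollama ps
```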
Advantages of Ollama
- Privacy: Data never leaves your machine
- No API Costs: Free to run locally
- Offline: Works without internet
- Customization: Full control over models
- Development: Perfect for iteration
- Learning: Experiment without costs
Limitations
- Hardware: Requires significant resources
- Performance: Slower than cloud APIs
- Model Selection: Limited to open models
- Updates: Manual model updates
- Scaling: Not suitable for high-traffic production

