Overview
Ollama allows you to run large language models locally on your machine, making it well suited for development, testing, and privacy-sensitive applications.

Model Class: OpenAIChatModel (OpenAI-compatible API)
Authentication
Environment Variables
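Ollama runs entirely on your machine, so no API key is required; OpenAI-compatible clients generally just need any non-empty placeholder string. The only setting you may need is the server address. A sketch using Ollama's own OLLAMA_HOST variable (the value shown is the server's default):

```bash
# No API key is needed for a local server.
# Only set this if Ollama is not on the default address:
export OLLAMA_HOST=http://localhost:11434
```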
Installation
First, install Ollama:
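A sketch of the usual install commands (the official install script on Linux, Homebrew on macOS; Windows users can download the installer from https://ollama.com):

```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama
```

Pull a Model

Download the model weights before the first request (the model name is illustrative):

```bash
ollama pull llama3.2
```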
Using infer_model
Manual Configuration
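Ollama serves an OpenAI-compatible API at http://localhost:11434/v1, so manual configuration amounts to pointing an OpenAI-style client (or the wrapper's OpenAIChatModel) at that base URL with a placeholder API key. A minimal sketch using the openai package; the wrapper's exact constructor arguments may differ:

```python
from openai import OpenAI

# Any non-empty api_key works; a local Ollama server ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
```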
Examples
Basic Usage
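A minimal sketch against the OpenAI-compatible endpoint; the model name and prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what Ollama does in two sentences."},
    ],
)
print(response.choices[0].message.content)
```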
With Tools
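Tool calling works through the standard OpenAI tools parameter for models that support it (for example llama3.1); get_weather below is a hypothetical tool used only for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a real API
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama3.1",  # pick a model with tool-calling support
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decided to call the tool, inspect the requested arguments.
print(response.choices[0].message.tool_calls)
```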
Code Generation
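A sketch using one of the code models listed under Available Models, with a low temperature for more deterministic output:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="codellama",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=0.2,  # lower temperature tends to give more consistent code
)
print(response.choices[0].message.content)
```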
Streaming
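Streaming uses the standard stream=True flag and yields incremental chunks; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about running models locally."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```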
With Vision (Multi-modal Models)
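Multi-modal models such as llava accept images as base64 data URLs in the standard OpenAI content-part format; the file path is illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("photo.jpg", "rb") as f:  # illustrative path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```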
Prompt Caching
Ollama does not support prompt caching in the traditional sense, but it keeps recently used models loaded in memory between requests.

Keep Models in Memory
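How long a model stays loaded is controlled by the keep_alive setting (it defaults to a few minutes). A sketch using the server-wide environment variable plus a per-request override:

```bash
# Keep loaded models resident for 24 hours (set before starting the server)
export OLLAMA_KEEP_ALIVE=24h

# Preload a model and keep it in memory indefinitely (keep_alive: -1)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'

# Show which models are currently loaded and for how long
ollama ps
```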
Model Parameters
Base Settings
| Parameter | Type | Description | Default | 
|---|---|---|---|
| max_tokens | int | Maximum tokens to generate | Model default | 
| temperature | float | Sampling temperature | 0.8 | 
| top_p | float | Nucleus sampling | 0.9 | 
| seed | int | Random seed | None | 
| stop_sequences | list[str] | Stop sequences | None | 
Example Configuration
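A sketch wiring the parameters above through the OpenAI-compatible endpoint; all values are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the benefits of running models locally."}],
    max_tokens=256,
    temperature=0.4,
    top_p=0.9,
    seed=42,
    stop=["\n\n"],
)
print(response.choices[0].message.content)
```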
Available Models
Meta Llama
- llama3.2: Latest Llama (3B/1B)
- llama3.1: Previous generation (8B/70B/405B)
- llama3: Original Llama 3 (8B/70B)
- llama2: Llama 2 models
Mistral
- mistral: Mistral 7B
- mistral-openorca: Fine-tuned variant
- mixtral: Mixture of Experts
Code Models
- codellama: Code generation
- deepseek-coder: DeepSeek code model
- starcoder2: StarCoder models
Vision Models
- llava: Vision understanding
- bakllava: Vision model variant
Specialized
- phi3: Microsoft Phi-3
- gemma2: Google Gemma 2
- qwen2: Alibaba Qwen 2
Model Selection Guide
| Model | Size | RAM | Best For | 
|---|---|---|---|
| llama3.2:1b | 1B | 1GB | Quick responses, low resources | 
| llama3.2:3b | 3B | 2GB | Balanced performance | 
| llama3.1:8b | 8B | 8GB | General purpose | 
| llama3.1:70b | 70B | 48GB | Complex tasks | 
| codellama | 7B | 5GB | Code generation | 
| llava | 7B | 8GB | Vision tasks | 
Hardware Requirements
Minimum Requirements
- CPU: Modern multi-core processor
- RAM: 8GB minimum (16GB recommended)
- Storage: 10GB+ free space
- OS: macOS 11+, Linux, Windows 10+
Recommended for Performance
- GPU: NVIDIA GPU with 8GB+ VRAM (CUDA support)
- RAM: 32GB+ for larger models
- Storage: SSD for faster model loading
GPU Acceleration
Ollama automatically uses a supported GPU (NVIDIA via CUDA, AMD via ROCm, or Apple Silicon via Metal) when one is detected; no extra configuration is required.

Best Practices
- Pull Models Before Use: Avoid delays during first requests
- Choose Appropriate Size: Match model to your hardware
- Keep Models Loaded: For faster subsequent requests
- Use GPU: Dramatically improves performance
- Monitor Resources: Check RAM/VRAM usage
- Quantization: Use quantized model variants to reduce memory use and speed up inference, at a small quality cost
- Local Development: Perfect for offline work
Model Management
Pull Models
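Model names are illustrative:

```bash
ollama pull llama3.2
ollama pull llava
```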
List Models
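Two views are useful: what is downloaded and what is currently loaded:

```bash
# Models downloaded to disk
ollama list

# Models currently loaded in memory
ollama ps
```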
Remove Models
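Deleting a model frees its disk space (the name is illustrative):

```bash
ollama rm llama2
```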
Update Models
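Pulling a model you already have re-downloads any updated layers for that tag:

```bash
ollama pull llama3.2
```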
Customization
Create Custom Models
Create a Modelfile:
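A minimal sketch; the base model, parameter values, and system prompt are illustrative:

```
FROM llama3.2

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

SYSTEM "You are a concise assistant that answers in plain English."
```

Then build and run the custom model:

```bash
ollama create my-assistant -f Modelfile
ollama run my-assistant
```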
Troubleshooting
Connection Errors
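Connection errors usually mean the Ollama server is not running or is listening on a non-default address (the default is http://localhost:11434). A quick check:

```bash
# Is the server up?
curl http://localhost:11434/api/version

# If not, start it (or launch the Ollama desktop app)
ollama serve
```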
Out of Memory
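If a model does not fit in available RAM/VRAM, drop to a smaller or more heavily quantized variant, or inspect a model's size before loading it; the tags below are illustrative:

```bash
# Switch to a smaller model
ollama pull llama3.2:1b

# Inspect model details (parameter count, quantization) before loading
ollama show llama3.1:70b
```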
Slow Performance
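Slow generation is often a sign the model is running on the CPU or is too large for the GPU; check where it is loaded:

```bash
# The PROCESSOR column shows whether a loaded model is on GPU, CPU, or split
ollama ps
```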
Advantages of Ollama
- Privacy: Data never leaves your machine
- No API Costs: Free to run locally
- Offline: Works without internet
- Customization: Full control over models
- Development: Perfect for iteration
- Learning: Experiment without costs
Limitations
- Hardware: Requires significant resources
- Performance: Slower than cloud APIs
- Model Selection: Limited to open models
- Updates: Manual model updates
- Scaling: Not suitable for high-traffic production

