Overview
Groq provides ultra-fast inference through its Language Processing Unit (LPU) technology. It offers access to open-source models with industry-leading speed and built-in web search capabilities.

Model Class: `GroqModel`
Authentication
Environment Variables
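`GROQ_API_KEY` is the standard variable the Groq SDK reads. Set it before constructing the model:

```python
import os

# In your shell: export GROQ_API_KEY="gsk_..."
# Or set it for the current process before the model is constructed:
os.environ["GROQ_API_KEY"] = "gsk_..."
```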
Using infer_model
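A minimal sketch of `infer_model`; the import path and the `"groq:<model-name>"` identifier format are assumptions, not confirmed by this page:

```python
# Hypothetical import path -- use wherever infer_model lives in your framework.
from framework.models import infer_model

# Resolve a provider-prefixed model string to a GroqModel instance.
# The "groq:<model-name>" format is an assumption based on common conventions.
model = infer_model("groq:llama-3.3-70b-versatile")
```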
Manual Configuration
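A sketch of constructing `GroqModel` directly; the constructor arguments shown (model name and API key) are assumptions:

```python
import os

from framework.models import GroqModel  # hypothetical import path

model = GroqModel(
    "llama-3.3-70b-versatile",
    api_key=os.environ["GROQ_API_KEY"],  # argument name is an assumption
)
```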
Examples
Basic Usage
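A minimal sketch using the official `groq` Python SDK directly (the `GroqModel` wrapper presumably issues an equivalent request):

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain LPUs in one paragraph."}],
)
print(response.choices[0].message.content)
```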
Ultra-Fast Streaming
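Streaming shows tokens as they arrive, which is where LPU latency is most visible. A sketch with the official `groq` SDK:

```python
from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,
)
for chunk in stream:
    # delta.content is None for some chunks (e.g. the final one)
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```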
With Web Search
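How web search is enabled is framework-specific; the `web_search` flag and `run` method below are hypothetical, used purely for illustration:

```python
from framework.models import GroqModel  # hypothetical import path

# web_search is a hypothetical parameter -- check your framework's
# documentation for the actual way to enable Groq web search.
model = GroqModel("llama-3.3-70b-versatile", web_search=True)

result = model.run("What happened in AI news this week?")  # hypothetical method
print(result)
```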
With Reasoning Format
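`groq_reasoning_format` corresponds to a `reasoning_format` field on the Groq API. A sketch with the official `groq` SDK and a reasoning model; if your SDK version does not accept the keyword directly, it can usually be passed via `extra_body`:

```python
from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    # "hidden" omits reasoning, "raw" inlines it with tags,
    # "parsed" returns it as structured output.
    reasoning_format="parsed",
)
print(response.choices[0].message.content)
```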
With Tools
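Groq models accept OpenAI-style tool definitions. A sketch that defines one tool and inspects the model's tool calls:

```python
import json

from groq import Groq

client = Groq()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)

# The model responds with tool calls rather than text when it decides
# a tool is needed; execute them and send results back in a follow-up turn.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```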
Vision Understanding
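Vision models take mixed text-and-image content in the OpenAI-compatible format. A sketch with a placeholder image URL:

```python
from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
        ],
    }],
)
print(response.choices[0].message.content)
```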
Prompt Caching
Groq does not currently support native prompt caching. Best Practice: use memory (accumulated message history) for conversation context, as sketched below.
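Because there is no server-side cache, each request must carry its own context. A minimal sketch with the official `groq` SDK:

```python
from groq import Groq

client = Groq()
history = []  # grows across turns; acts as the conversation "memory"

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=history,  # full context is resent on every request
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("My name is Ada."))
print(ask("What is my name?"))  # answerable only because history is resent
```

Model Parameters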
Base Settings
| Parameter | Type | Description | Default |
|---|---|---|---|
| max_tokens | int | Maximum tokens to generate | 1024 |
| temperature | float | Sampling temperature (0.0-2.0) | 1.0 |
| top_p | float | Nucleus sampling | 1.0 |
| seed | int | Random seed | None |
| stop_sequences | list[str] | Stop sequences | None |
| presence_penalty | float | Token presence penalty | 0.0 |
| frequency_penalty | float | Token frequency penalty | 0.0 |
| parallel_tool_calls | bool | Allow parallel tool calls | True |
| timeout | float | Request timeout (seconds) | 600 |
Groq-Specific Settings
| Parameter | Type | Description |
|---|---|---|
| groq_reasoning_format | 'hidden' \| 'raw' \| 'parsed' | How to format reasoning output |
- hidden: Don’t show reasoning (default)
- raw: Show raw reasoning with tags
- parsed: Show structured reasoning
Example Configuration
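A sketch mapping the settings above onto a request, shown with the official `groq` SDK; the wrapper's `stop_sequences` presumably maps to the API's `stop` field (the name mapping is an assumption):

```python
from groq import Groq

client = Groq(timeout=600)  # request timeout in seconds

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Summarize LPU technology."}],
    max_tokens=1024,
    temperature=0.7,
    top_p=1.0,
    seed=42,          # reproducible sampling where supported
    stop=["\n\n"],    # the wrapper's stop_sequences (mapping assumed)
    presence_penalty=0.0,
    frequency_penalty=0.0,
)
print(response.choices[0].message.content)
```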
Available Models
Production Models
Meta Llama
- llama-3.3-70b-versatile: Latest, most capable
- llama-3.3-70b-specdec: Speculative decoding variant
- llama-3.1-8b-instant: Fast, efficient
- llama3-70b-8192: Extended context
- llama3-8b-8192: Small, fast
Google Gemma
- gemma2-9b-it: Efficient instruction model
Preview Models
Reasoning Models
- qwen-qwq-32b: Qwen reasoning model
- deepseek-r1-distill-qwen-32b: DeepSeek R1 distilled
- deepseek-r1-distill-llama-70b: DeepSeek R1 large
Vision Models
- llama-3.2-90b-vision-preview: Large vision model
- llama-3.2-11b-vision-preview: Efficient vision model
Specialized
- mistral-saba-24b: Mistral variant
- qwen-2.5-coder-32b: Code specialist
- qwen-2.5-32b: General purpose
Model Comparison
| Model | Tokens/sec (approx.) | Context | Best For |
|---|---|---|---|
| llama-3.3-70b-versatile | ~700 | 128K | General purpose, highest quality |
| llama-3.1-8b-instant | ~1500 | 128K | Speed-critical apps |
| qwen-qwq-32b | ~600 | 32K | Reasoning tasks |
| llama-3.2-90b-vision-preview | ~500 | 128K | Vision understanding |
LPU Technology
Groq’s Language Processing Unit delivers:
- Extreme Speed: 10-100x faster than GPUs
- Low Latency: Sub-second first token
- Consistent: Predictable performance
- Cost-Effective: Competitive pricing
- Energy Efficient: Lower power consumption
Performance Benefits
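To observe these numbers yourself, measure time-to-first-token and total latency on a streamed request. A minimal sketch with the official `groq` SDK:

```python
import time

from groq import Groq

client = Groq()

start = time.perf_counter()
first_token_at = None
chars = 0

stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Explain LPUs in 200 words."}],
    stream=True,
)
for chunk in stream:
    text = chunk.choices[0].delta.content or ""
    if text and first_token_at is None:
        first_token_at = time.perf_counter() - start  # time to first token
    chars += len(text)

total = time.perf_counter() - start
print(f"first token: {first_token_at:.2f}s, total: {total:.2f}s, chars: {chars}")
```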
Built-in Web Search
All Groq models support web search; see the "With Web Search" example under Examples above.
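Best Practices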
- Use for Speed-Critical Apps: Leverage LPU performance
- Enable Streaming: Show responses as they generate
- Choose Right Model: Balance speed vs capability
- Use Preview Models: Try latest models for specific tasks
- Enable Web Search: For current information
- Monitor Rate Limits: Free tier has limits
- Implement Retry Logic: Handle rate limiting gracefully (see the sketch below)
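A sketch of exponential-backoff retry around a Groq call, assuming the official `groq` Python SDK and its `RateLimitError` exception:

```python
import random
import time

from groq import Groq, RateLimitError

client = Groq()

def complete_with_retry(messages, retries: int = 5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("rate limited after all retries")

resp = complete_with_retry([{"role": "user", "content": "ping"}])
print(resp.choices[0].message.content)
```

The SDK also performs some retries on its own; its client accepts a `max_retries` option if you prefer to lean on that instead.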
Rate Limits
Free Tier
- Generous limits for testing
- Rate-limited during peak hours
- Suitable for development
Paid Plans
- Higher rate limits
- Priority access
- Production-ready
Use Cases
Real-Time Chat
- Ultra-fast response times
- Great user experience
- Low latency
High-Volume Processing
- Batch processing
- Data analysis
- Content generation at scale
Cost Optimization
- Fast inference = lower costs
- Efficient token usage
- Good price/performance
Advantages
- Speed: Industry-leading inference speed
- Cost-Effective: Competitive pricing
- Quality: Access to top open models
- Web Search: Built-in for all models
- Simple API: Easy integration
- Reliable: Consistent performance
Limitations
- Model Selection: Limited to supported models
- No Caching: Each request is independent
- Rate Limits: Free tier restrictions
- Open Models Only: No proprietary models

