Classes

ModelCapability

Categories of model capabilities. Type: Enum. Values:
  • REASONING - "reasoning"
  • CODE_GENERATION - "code_generation"
  • MATHEMATICS - "mathematics"
  • CREATIVE_WRITING - "creative_writing"
  • ANALYSIS - "analysis"
  • MULTILINGUAL - "multilingual"
  • VISION - "vision"
  • AUDIO - "audio"
  • LONG_CONTEXT - "long_context"
  • FAST_INFERENCE - "fast_inference"
  • COST_EFFECTIVE - "cost_effective"
  • FUNCTION_CALLING - "function_calling"
  • STRUCTURED_OUTPUT - "structured_output"
  • ETHICAL_SAFETY - "ethical_safety"
  • RESEARCH - "research"
  • PRODUCTION - "production"

ModelTier

Model performance tiers. Type: Enum. Values:
  • FLAGSHIP - "flagship" (Top-tier, most capable models)
  • ADVANCED - "advanced" (High performance, balanced cost)
  • STANDARD - "standard" (Good performance, cost-effective)
  • FAST - "fast" (Optimized for speed and low cost)
  • SPECIALIZED - "specialized" (Domain-specific optimizations)
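
For reference, a minimal sketch of how these two enums could be declared from the values above (the str mixin and the module layout are assumptions, not confirmed by this page):

```python
from enum import Enum

class ModelCapability(str, Enum):
    """Categories of model capabilities."""
    REASONING = "reasoning"
    CODE_GENERATION = "code_generation"
    MATHEMATICS = "mathematics"
    CREATIVE_WRITING = "creative_writing"
    ANALYSIS = "analysis"
    MULTILINGUAL = "multilingual"
    VISION = "vision"
    AUDIO = "audio"
    LONG_CONTEXT = "long_context"
    FAST_INFERENCE = "fast_inference"
    COST_EFFECTIVE = "cost_effective"
    FUNCTION_CALLING = "function_calling"
    STRUCTURED_OUTPUT = "structured_output"
    ETHICAL_SAFETY = "ethical_safety"
    RESEARCH = "research"
    PRODUCTION = "production"

class ModelTier(str, Enum):
    """Model performance tiers."""
    FLAGSHIP = "flagship"
    ADVANCED = "advanced"
    STANDARD = "standard"
    FAST = "fast"
    SPECIALIZED = "specialized"
```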

BenchmarkScores

Performance metrics from standard AI benchmarks. Parameters:
  • mmlu (Optional[float], default None): Massive Multitask Language Understanding (0-100)
  • gpqa (Optional[float], default None): Graduate-level questions (0-100)
  • math (Optional[float], default None): MATH benchmark (0-100)
  • gsm8k (Optional[float], default None): Grade school math (0-100)
  • aime (Optional[float], default None): American Invitational Mathematics Examination (0-100)
  • humaneval (Optional[float], default None): Python code generation (0-100)
  • mbpp (Optional[float], default None): Mostly Basic Python Problems (0-100)
  • drop (Optional[float], default None): Discrete Reasoning Over Paragraphs (0-100)
  • mgsm (Optional[float], default None): Multilingual Grade School Math (0-100)
  • arc_challenge (Optional[float], default None): AI2 Reasoning Challenge (0-100)
Functions:

overall_score

Calculate a weighted overall score. Returns:
  • float: The overall benchmark score
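
The weighting scheme behind overall_score is not documented here. Below is a minimal sketch of one plausible implementation that averages only the benchmark fields that are set; both the weights and the None-skipping behavior are assumptions:

```python
# Illustrative weights; the real weighting used by overall_score is not documented.
_WEIGHTS = {
    "mmlu": 2.0, "gpqa": 1.5, "math": 1.5, "gsm8k": 1.0, "aime": 1.0,
    "humaneval": 1.5, "mbpp": 1.0, "drop": 1.0, "mgsm": 1.0, "arc_challenge": 1.0,
}

def overall_score(scores) -> float:
    """Weighted mean over the benchmark fields that are set; None fields are skipped."""
    total = 0.0
    weight_sum = 0.0
    for name, weight in _WEIGHTS.items():
        value = getattr(scores, name, None)
        if value is not None:
            total += weight * value
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```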

ModelMetadata

Complete metadata for an AI model. Parameters:
  • name (str, required): Model name
  • provider (str, required): Model provider
  • tier (ModelTier, required): Model performance tier
  • release_date (str, required): Model release date
  • capabilities (List[ModelCapability], default []): Model capabilities
  • context_window (int, default 8192): Context window size, in tokens
  • benchmarks (Optional[BenchmarkScores], default None): Performance benchmarks
  • strengths (List[str], default []): Model strengths
  • ideal_for (List[str], default []): Ideal use cases
  • limitations (List[str], default []): Model limitations
  • cost_tier (int, default 5): Relative cost on a 1-10 scale, where 1 is cheapest
  • speed_tier (int, default 5): Relative speed on a 1-10 scale, where 10 is fastest
  • notes (str, default ""): Additional notes
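
As an illustration, constructing a record with these fields might look like the following; the module path model_registry and every value shown are assumptions for the example, not entries from the real registry:

```python
from model_registry import (  # illustrative import path, not confirmed by this page
    BenchmarkScores, ModelCapability, ModelMetadata, ModelTier,
)

example = ModelMetadata(
    name="example/model-1",  # hypothetical model, for illustration only
    provider="example",
    tier=ModelTier.STANDARD,
    release_date="2025-01-01",
    capabilities=[ModelCapability.REASONING, ModelCapability.CODE_GENERATION],
    context_window=128_000,
    benchmarks=BenchmarkScores(mmlu=85.0, humaneval=88.0),
    strengths=["Solid general-purpose performance"],
    ideal_for=["Prototyping"],
    limitations=["No vision support"],
    cost_tier=4,
    speed_tier=6,
)
```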

Constants

MODEL_REGISTRY

A comprehensive registry of all available models. Type: Dict[str, ModelMetadata]. Contains model metadata for:
  • OpenAI models (GPT-4o, GPT-4o-mini, o1-pro, o1-mini)
  • Anthropic models (Claude 4 Opus, Claude 3.7 Sonnet, Claude 3.5 Haiku)
  • Google models (Gemini 2.5 Pro, Gemini 2.5 Flash)
  • Meta Llama models (Llama 3.3 70B)
  • DeepSeek models (DeepSeek-R1, DeepSeek-Chat)
  • Qwen models (Qwen 3 235B)
  • Mistral models (Mistral Large, Mistral Small)
  • Cohere models (Command R+)
  • Grok models (Grok 4)
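
Entries can also be read straight from the registry. A brief sketch; keying by the full provider-prefixed name follows the Name fields listed below, and the import path is an assumption:

```python
from model_registry import MODEL_REGISTRY  # illustrative import path

meta = MODEL_REGISTRY.get("openai/gpt-4o")
if meta is not None:
    print(meta.tier, meta.context_window)  # e.g. ModelTier.FLAGSHIP 128000
```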

Functions

get_model_metadata

Get metadata for a specific model. Parameters:
  • model_name (str): The model name (with or without provider prefix)
Returns:
  • Optional[ModelMetadata]: ModelMetadata if found, None otherwise
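
A short usage sketch (the import path is an assumption; the description above only promises that both name forms resolve):

```python
from model_registry import get_model_metadata  # illustrative import path

meta = get_model_metadata("openai/gpt-4o")  # with provider prefix
alt = get_model_metadata("gpt-4o")          # without provider prefix
if meta is not None:
    print(meta.provider, meta.tier)
```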

get_models_by_capability

Get all models that have a specific capability. Parameters:
  • capability (ModelCapability): The capability to filter by
Returns:
  • List[ModelMetadata]: List of ModelMetadata objects with the capability
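
For example, to list every vision-capable model (import path assumed):

```python
from model_registry import ModelCapability, get_models_by_capability  # illustrative path

for model in get_models_by_capability(ModelCapability.VISION):
    print(model.name)
```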

get_models_by_tier

Get all models in a specific tier. Parameters:
  • tier (ModelTier): The tier to filter by
Returns:
  • List[ModelMetadata]: List of ModelMetadata objects in the tier
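
For example, to list flagship models cheapest-first (import path assumed; cost_tier uses the 1-10 scale where 1 is cheapest):

```python
from model_registry import ModelTier, get_models_by_tier  # illustrative path

flagships = sorted(get_models_by_tier(ModelTier.FLAGSHIP), key=lambda m: m.cost_tier)
for model in flagships:
    print(model.name, model.cost_tier)
```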

get_top_models

Get the top N models by overall score or specific benchmark. Parameters:
  • n (int): Number of top models to return (default: 10)
  • by_benchmark (Optional[str]): Specific benchmark to sort by (e.g., 'mmlu', 'humaneval')
Returns:
  • List[ModelMetadata]: List of top ModelMetadata objects
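
A usage sketch covering both ranking modes (import path assumed):

```python
from model_registry import get_top_models  # illustrative import path

top_overall = get_top_models(n=5)                           # ranked by overall score
top_coders = get_top_models(n=5, by_benchmark="humaneval")  # ranked by one benchmark
for model in top_coders:
    print(model.name, model.benchmarks.humaneval if model.benchmarks else None)
```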

Predefined Model Metadata

OpenAI Models

GPT_4O

  • Name: "openai/gpt-4o"
  • Tier: FLAGSHIP
  • Context Window: 128,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Analysis, Multilingual, Vision, Audio, Long Context, Function Calling, Structured Output, Production
  • Benchmarks: MMLU: 88.7, GPQA: 53.6, Math: 76.6, HumanEval: 90.2, GSM8K: 95.8, MGSM: 90.5, DROP: 83.4
  • Cost Tier: 7/10
  • Speed Tier: 6/10

GPT_4O_MINI

  • Name: "openai/gpt-4o-mini"
  • Tier: FAST
  • Context Window: 128,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Multilingual, Vision, Fast Inference, Cost Effective, Function Calling, Structured Output, Production
  • Benchmarks: MMLU: 82.0, Math: 70.2, HumanEval: 87.2, GSM8K: 91.8, MGSM: 86.7, DROP: 80.1
  • Cost Tier: 2/10
  • Speed Tier: 9/10

O1_PRO

  • Name: "openai/o1-pro"
  • Tier: SPECIALIZED
  • Context Window: 128,000 tokens
  • Capabilities: Reasoning, Mathematics, Code Generation, Analysis
  • Benchmarks: MMLU: 91.8, GPQA: 78.3, Math: 94.8, AIME: 79.2, HumanEval: 92.5
  • Cost Tier: 10/10
  • Speed Tier: 3/10

O1_MINI

  • Name: "openai/o1-mini"
  • Tier: SPECIALIZED
  • Context Window: 128,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Cost Effective
  • Benchmarks: MMLU: 85.2, Math: 87.2, HumanEval: 89.3, GPQA: 60.0
  • Cost Tier: 6/10
  • Speed Tier: 5/10

Anthropic Models

CLAUDE_4_OPUS

  • Name: "anthropic/claude-4-opus-20250514"
  • Tier: FLAGSHIP
  • Context Window: 200,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Analysis, Multilingual, Vision, Long Context, Function Calling, Ethical Safety, Production
  • Benchmarks: MMLU: 90.7, GPQA: 59.4, Math: 80.5, HumanEval: 92.0, GSM8K: 96.4, DROP: 85.3
  • Cost Tier: 9/10
  • Speed Tier: 5/10

CLAUDE_3_7_SONNET

  • Name: "anthropic/claude-3-7-sonnet-20250219"
  • Tier: ADVANCED
  • Context Window: 200,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Analysis, Multilingual, Vision, Long Context, Function Calling, Ethical Safety, Production
  • Benchmarks: MMLU: 88.3, GPQA: 54.6, Math: 78.6, HumanEval: 90.0, GSM8K: 94.6, DROP: 84.4
  • Cost Tier: 6/10
  • Speed Tier: 7/10

CLAUDE_3_5_HAIKU

  • Name: "anthropic/claude-3-5-haiku-20241022"
  • Tier: FAST
  • Context Window: 200,000 tokens
  • Capabilities: Reasoning, Code Generation, Creative Writing, Multilingual, Vision, Fast Inference, Cost Effective, Function Calling, Production
  • Benchmarks: MMLU: 81.0, Math: 65.5, HumanEval: 82.0, GSM8K: 88.3
  • Cost Tier: 2/10
  • Speed Tier: 9/10

Google Models

GEMINI_2_5_PRO

  • Name: "google-gla/gemini-2.5-pro"
  • Tier: FLAGSHIP
  • Context Window: 1,000,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Analysis, Multilingual, Vision, Audio, Long Context, Function Calling, Production
  • Benchmarks: MMLU: 89.5, GPQA: 56.1, Math: 76.2, HumanEval: 88.9, GSM8K: 94.6, MGSM: 91.7, DROP: 84.9
  • Cost Tier: 7/10
  • Speed Tier: 7/10

GEMINI_2_5_FLASH

  • Name: "google-gla/gemini-2.5-flash"
  • Tier: FAST
  • Context Window: 1,000,000 tokens
  • Capabilities: Reasoning, Code Generation, Creative Writing, Multilingual, Vision, Fast Inference, Cost Effective, Long Context, Function Calling, Production
  • Benchmarks: MMLU: 83.7, Math: 69.5, HumanEval: 84.7, GSM8K: 89.7
  • Cost Tier: 2/10
  • Speed Tier: 10/10

Other Notable Models

LLAMA_3_3_70B

  • Name: "groq/llama-3.3-70b-versatile"
  • Tier: ADVANCED
  • Context Window: 128,000 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Creative Writing, Multilingual, Function Calling, Research
  • Benchmarks: MMLU: 86.0, Math: 66.0, HumanEval: 79.5, GSM8K: 90.2
  • Cost Tier: 3/10
  • Speed Tier: 7/10

DEEPSEEK_R1

  • Name: "deepseek/deepseek-reasoner"
  • Tier: SPECIALIZED
  • Context Window: 64,000 tokens
  • Capabilities: Reasoning, Mathematics, Code Generation, Analysis, Research
  • Benchmarks: MMLU: 90.8, Math: 97.3, AIME: 79.8, HumanEval: 90.2, GPQA: 71.5
  • Cost Tier: 5/10
  • Speed Tier: 4/10

QWEN_3_235B

  • Name: "huggingface/Qwen/Qwen3-235B-A22B"
  • Tier: ADVANCED
  • Context Window: 32,768 tokens
  • Capabilities: Reasoning, Code Generation, Mathematics, Multilingual, Analysis, Research
  • Benchmarks: MMLU: 88.5, Math: 72.5, HumanEval: 87.2, GSM8K: 93.4
  • Cost Tier: 4/10
  • Speed Tier: 5/10