Multimodal LLMs With an API
As of June 2026, AIDiveForge tracks 3 multimodal LLMs with an API. Listings are verified against each tool's live website and re-checked regularly.
Last updated June 4, 2026 · 3 tools

1. Claude Sonnet 4.5
Claude Sonnet 4.5 is a large language model from Anthropic with particular strengths in software coding, computer use, and agentic tasks where it runs in a loop and calls tools. The model maintains focus for more than 30 hours on complex, multi-step tasks. Pricing is unchanged from Claude Sonnet 4, at $3 per million input tokens and $15 per million output tokens. Anthropic describes it as the most aligned frontier model it has released, with large improvements across several areas of alignment compared to previous Claude models.
Paid
2. Llama 4 Scout
Scout carries a 10M-token context window, meaning you can feed it an entire codebase or a stack of legal documents in a single pass without chunking pipelines or retrieval workarounds. Maverick, its sibling in the Llama 4 family, trades raw context depth for stronger multimodal reasoning, handling interleaved image and text inputs through a native early-fusion architecture rather than a bolted-on vision adapter. Both models ship as open weights, downloadable from Hugging Face after license acceptance, with no API bill if you run them yourself. The limit shows up at inference: the Mixture-of-Experts architecture demands hardware that most teams do not have sitting idle, and using Scout's full 10M-token context window in practice requires more GPU memory than a standard cloud instance provides.
Free · Open Source
3. Muse Spark
Muse Spark is a natively multimodal reasoning model developed by Meta Superintelligence Labs, with support for tool use, visual chain of thought, and multi-agent orchestration.
Paid
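For budgeting against the per-token pricing quoted in the Claude Sonnet 4.5 listing above, the arithmetic can be sketched as follows. This is a minimal illustration, not an official calculator: it reads the $3/$15 figure as the input and output rates respectively, and the function name and constants are hypothetical. It ignores any discounts, caching, or tiered pricing a provider may apply.

```python
from decimal import Decimal

# Rates quoted in the listing: $3 / $15 per million tokens.
# Assumption: the first figure is the input rate, the second the output rate.
INPUT_USD_PER_MILLION = Decimal("3")
OUTPUT_USD_PER_MILLION = Decimal("15")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost in USD of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Each side is billed separately: tokens * rate / 1,000,000.
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost
```

Under these assumptions, a request with 200,000 input tokens and 4,000 output tokens would come to $0.60 + $0.06 = $0.66. Decimal is used instead of float so that small per-request amounts add up without rounding drift.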
Listings on this page are sourced and verified by the AIDiveForge data pipeline. AIDiveForge is editorially independent — no money changes hands for inclusion.