Pixtral Large: Specs, Benchmarks & API Access Guide (2026)


Mistral’s Pixtral Large arrived in November 2024 with a sharp pitch: 124 billion parameters built for vision-first workflows, a 128,000-token context window, and a claim to beat GPT-4o on chart interpretation. It targets document-heavy teams, research labs, and anyone drowning in PDFs who needs structured data out the other end. The hook is architectural: a 123-billion-parameter decoder paired with a 1-billion-parameter vision encoder, purpose-built for images and documents instead of text generation with vision bolted on.

But here’s the catch.

Pixtral Large ships API-only with no public pricing, limited benchmark disclosure, and zero community adoption signals. While competitors like GPT-4o publish full model cards and Claude offers 200,000-token context with transparent rate cards, Mistral asks you to take their word. They claim 88.1% on ChartQA, beating GPT-4o’s 85.2%. They tout “frontier-class” document understanding. They say the model excels at visual reasoning.

And they refuse to tell you what it costs.

This creates a weird evaluation problem. If you’re a PM choosing a vision model in 2026, you need benchmarks you can verify, pricing you can budget, and API docs you can hand to your engineers. Pixtral Large gives you one of those three. The architecture looks promising at 124 billion parameters. The 128K context window matches GPT-4o. The vision-specific benchmarks, where they exist, show real strength on charts and documents. The absence of everything else makes it impossible to recommend for production without running your own tests first.

This guide documents what Mistral has published, fills in competitive context from models with actual data, and flags every gap where you’ll be flying blind. If you’re evaluating vision models for document workflows, you need to know what Pixtral Large actually delivers versus what the marketing deck promises. By the end, you’ll understand when this model makes sense and when you should just use GPT-4o.

Specs at a glance

Model Name: Pixtral Large
Developer: Mistral AI
Release Date: November 2024
Model Family: Pixtral
Architecture: Transformer-based VLM (1B vision encoder + 123B decoder)
Total Parameters: 124B (123B text decoder + 1B vision encoder)
Context Window: 128K tokens
Modalities: Text + vision (images, documents, charts)
Training Data: Undisclosed
Quantization Options: Undisclosed
Open Source Status: Closed-source, API-only
API Endpoint: Via Mistral platform and AWS Bedrock
Input Pricing: Undisclosed (contact sales)
Output Pricing: Undisclosed (contact sales)
Rate Limits: Undisclosed
Function Calling: Undisclosed
Tool Use: Undisclosed
Multilingual Support: Supported (Mistral models typically strong)
Safety Layers: Undisclosed
Compliance: Undisclosed

The 124-billion-parameter count splits into a 123B text decoder and a separate 1B vision encoder, according to AWS Bedrock documentation. This matters because it shows Mistral allocated significant capacity to visual processing, not just text generation with vision as an afterthought. The architecture mirrors GPT-4o’s design philosophy: dedicated vision encoding feeding into a large language decoder.

The 128,000-token context window matches GPT-4o exactly and falls short of Claude 3.5 Sonnet’s 200,000 tokens. In practice, this supports multi-page document analysis, lengthy PDFs with embedded images, and batch processing of visual data. But token counting for images remains undocumented. GPT-4o charges different rates based on image resolution; Mistral hasn’t published how Pixtral Large counts image tokens, which makes cost estimation impossible.

The closed-source, API-only access represents a strategic shift for Mistral. They built their reputation on Mistral 7B, an open-weight model developers could run locally. Pixtral Large follows the OpenAI playbook: proprietary weights, cloud-only inference, enterprise sales. This locks out edge deployment, air-gapped environments, and anyone who needs to audit model weights for security. It’s a bet that vision models are too expensive to run locally and too valuable to open-source.

Pixtral Large leads on charts but lacks the full benchmark suite

Model | MMLU | ChartQA | MathVista | DocVQA | Context | Pricing (per 1M tokens, input / output)
Pixtral Large | Unpublished | 88.1% | 69.4% | “Frontier-class” (exact % unpublished) | 128K | Undisclosed
GPT-4o | 88.7% | 85.2% | Unpublished | Unpublished | 128K | $2.50 / $10.00
Claude 3.5 Sonnet | 88.3% | 89.1% | Unpublished | Unpublished | 200K | $3.00 / $15.00
Gemini 1.5 Flash | 78.9% | Unpublished | Unpublished | Unpublished | 1M | $0.075 / $0.30
Mistral 7B | 62.5% | N/A | N/A | N/A | 32K | Open weights

Pixtral Large claims 88.1% on ChartQA, a vision benchmark testing chart interpretation accuracy, according to Weights & Biases testing. That beats GPT-4o’s 85.2% but trails Claude 3.5 Sonnet’s 89.1%. On MathVista, which tests mathematical reasoning from visual inputs, Pixtral scores 69.4%. Mistral also claims “frontier-class” performance on DocVQA, a document question-answering benchmark, but refuses to publish the actual percentage.

And that’s where the benchmark story ends.

No MMLU score. No MATH benchmark. No MMMU multimodal understanding results. No SWE-bench coding performance. GPT-4o publishes all of these in its system card. Claude does the same in its model card. Mistral cherry-picks the benchmarks where Pixtral wins and ignores the rest. This makes competitive evaluation nearly impossible without running your own tests.

The vision-specific results suggest real strength. ChartQA measures whether a model can extract data from bar charts, line graphs, and scatter plots accurately. Pixtral’s 88.1% means it correctly answers 88 out of 100 chart interpretation questions. That’s strong enough for financial analysis, research automation, and data extraction workflows. MathVista combines visual reasoning with mathematical problem-solving, testing whether a model can solve geometry problems from diagrams or calculate values from charts. At 69.4%, Pixtral performs competitively, though Mistral hasn’t published head-to-head comparisons with GPT-4o or Claude on this benchmark.

But the absence of general-purpose benchmarks raises questions. Does Pixtral sacrifice text reasoning for vision performance? Is the 123B decoder weaker than GPT-4o’s on pure language tasks? Without MMLU scores, you can’t know if this model handles complex instructions as well as competitors. Without MATH benchmarks, you can’t verify mathematical reasoning outside visual contexts. The gaps feel strategic, not accidental.

Vision-first architecture at 124B scale

Pixtral Large splits its 124 billion parameters into a 1-billion-parameter vision encoder and a 123-billion-parameter text decoder. The vision encoder processes images, documents, and charts into embeddings the decoder can understand. The decoder handles language generation, reasoning, and output formatting. This separation lets Mistral optimize each component independently instead of forcing a single model to do both jobs.

Technically, the architecture mirrors modern vision-language models like GPT-4o and Claude 3.5 Sonnet. An image passes through the vision encoder, which outputs a sequence of visual tokens. These tokens feed into the decoder alongside text tokens from the prompt. The decoder processes both streams together, allowing it to reference visual content while generating text. The 128,000-token context window means you can include dozens of images in a single request, though Mistral hasn’t disclosed how many tokens each image consumes.
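
For intuition, the arithmetic of that shared sequence looks roughly like the toy below. This is not Mistral’s code, and Pixtral Large’s internals are unpublished; the embedding width and per-image token count are placeholder assumptions.

```python
import numpy as np

# Toy, runnable illustration of the vision-first flow described above.
# NOT Mistral's implementation; dimensions below are placeholders.

D_MODEL = 64            # illustrative embedding width
TOKENS_PER_IMAGE = 256  # placeholder; real per-image token counts are undisclosed

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for the 1B vision encoder: pixels -> visual token embeddings."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(TOKENS_PER_IMAGE, D_MODEL))

def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for text tokenization and embedding (one vector per word here)."""
    rng = np.random.default_rng(1)
    return rng.normal(size=(len(prompt.split()), D_MODEL))

chart = np.zeros((1024, 1024, 3))  # e.g. a screenshot of a line graph
prompt = "Extract the data points from this line graph as a CSV table."

# Visual and text tokens share one sequence; the 123B decoder attends over
# both while generating, and both count against the 128K context window.
sequence = np.concatenate([vision_encoder(chart), embed_text(prompt)], axis=0)
print(sequence.shape)  # (268, 64): 256 visual tokens + 12 word-level "tokens"
```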

The proof lives in the ChartQA and MathVista results. At 88.1% on ChartQA, Pixtral demonstrates strong visual grounding, the ability to extract precise data from complex visualizations. This requires the vision encoder to capture fine details like axis labels, data points, and legend entries, then pass that information to the decoder without losing accuracy. At 69.4% on MathVista, the model shows it can combine visual understanding with mathematical reasoning, solving problems that require both interpreting diagrams and performing calculations.

Use this architecture when you need accurate extraction from visual documents: invoices, contracts, research papers with embedded charts, financial reports with tables. The dedicated vision encoder should outperform models that treat images as secondary inputs. Skip it when you need pure text generation, coding assistance without visual context, or real-time inference where API latency matters. The 124B parameter count suggests slower inference than smaller models like Gemini 1.5 Flash, though Mistral hasn’t published speed benchmarks.

Eight practical use cases where Pixtral Large competes

Enterprise document processing

Feed Pixtral Large multi-page contracts, regulatory filings, or invoices with complex layouts. The model extracts structured data: dates, amounts, parties, clauses. At 128K context, it handles documents up to roughly 100 pages depending on image density. The ChartQA performance suggests strong table extraction, which matters for financial documents where a single misread number breaks downstream analysis.

Competitors like GPT-4o handle this too, but Pixtral’s vision-first architecture could deliver higher accuracy on documents with dense visual elements. You won’t know without testing both on your specific document types. For workflows requiring AI note-taking tools that integrate vision models, Pixtral’s API-only access limits direct integration compared to OpenAI-compatible alternatives.
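
As a sketch of what that extraction prompt might look like: the type/text/image_url content-block shape below is an assumption borrowed from common multimodal chat APIs rather than confirmed Pixtral documentation, and the field list is purely illustrative.

```python
import json

# Hypothetical message payload for contract field extraction. Verify the
# exact content-block field names against Mistral's current documentation.

FIELDS = ["parties", "effective_date", "termination_date",
          "payment_terms", "governing_law", "auto_renewal"]

def extraction_messages(page_image_urls: list[str]) -> list[dict]:
    instruction = (
        "Extract the following fields from this contract and return ONLY valid "
        f"JSON with these keys: {', '.join(FIELDS)}. "
        "Use null for any field that is not present. Quote clause text verbatim."
    )
    # Images first, instruction last (see the prompting section below).
    content = [{"type": "image_url", "image_url": url} for url in page_image_urls]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

print(json.dumps(extraction_messages(["https://example.com/contract-p1.png"]), indent=2))
```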

Scientific research image analysis

Point Pixtral at academic papers with charts, graphs, and diagrams. Ask it to extract data points, describe experimental setups, or summarize visual results. The 128K context supports multi-figure analysis across lengthy papers. MathVista performance at 69.4% indicates it can handle mathematical reasoning from visual inputs, useful for physics diagrams or statistical charts.

This use case demands accuracy. A hallucinated data point in a literature review cascades into bad research. Pixtral’s lack of published error rates on scientific content makes it risky for production use without validation. While AI models like those discussed in AI solving historical mysteries demonstrate research capabilities, Pixtral Large’s vision focus could accelerate visual data extraction if benchmarks validate its claims.

Visual code generation

Upload UI mockups, wireframes, or screenshots of existing interfaces. Ask Pixtral to generate HTML, CSS, or React components. The vision encoder should capture layout details, color schemes, and component hierarchies. The 123B decoder handles code generation, though Mistral hasn’t published coding benchmarks to verify performance against GPT-4o or Claude.

Coding tools like Cursor AI rely on vision models for screenshot-to-code workflows. Pixtral Large could compete here if API integration becomes standardized, but the lack of SWE-bench scores makes it impossible to compare coding accuracy against established alternatives.

Financial document analysis

Parse earnings reports, balance sheets, and SEC filings for investment research. Pixtral’s ChartQA performance suggests strong table extraction, critical for financial statements where data lives in complex multi-column layouts. The 128K context handles quarterly reports with dozens of pages and embedded charts.

But as noted in AI financial analysis limitations, vision models require validation before production use in high-stakes domains. A single misread number in a balance sheet could trigger incorrect investment decisions. Pixtral’s lack of published accuracy metrics on financial documents raises similar concerns.

Creative asset analysis

Analyze design compositions, color palettes, and visual hierarchies for brand consistency. Feed Pixtral marketing materials, product packaging, or website screenshots. Ask for design critiques, accessibility assessments, or style guide compliance checks. The vision encoder should capture subtle visual details competitors might miss.

While image generation tools like Midjourney v7 dominate creative workflows, vision analysis models like Pixtral Large target the reverse: understanding existing visuals. This complements generative tools in creative pipelines.

Medical imaging interpretation (research only)

Assist radiologists with preliminary analysis of X-rays, MRIs, and CT scans. The vision encoder processes medical images. The decoder generates diagnostic suggestions or highlights areas of concern. But Pixtral Large has no disclosed FDA clearance, HIPAA certification, or medical imaging benchmarks.

Healthcare AI deployments like Claude for Healthcare ship with documented regulatory compliance; Mistral has disclosed nothing comparable for Pixtral Large. That limits medical applications to research contexts only, not clinical decision-making.

E-commerce product cataloging

Auto-generate product descriptions, attributes, and tags from product images. Feed Pixtral photos of clothing, electronics, or home goods. The model extracts color, material, dimensions, and features. At scale, this accelerates catalog creation for online retailers with thousands of SKUs.

E-commerce platforms like those discussed in Shopify’s AI integrations demonstrate vision AI applications, but Pixtral Large’s API-only model requires custom integration versus plug-and-play alternatives.

Geospatial analysis

Interpret satellite imagery, maps, and aerial photography for urban planning or environmental monitoring. The 128K context could support multi-image analysis across time series or geographic regions. Pixtral’s vision capabilities might detect changes in land use, infrastructure development, or environmental damage.

Geospatial AI applications like those in Niantic’s 3D mapping demonstrate vision model potential, though Pixtral Large’s capabilities remain unproven in this domain without published geospatial benchmarks.

How to access the Pixtral Large API

Pixtral Large runs through the Mistral AI platform and AWS Bedrock, according to Mistral’s documentation. You’ll need an API key from Mistral or AWS credentials with Bedrock access. The endpoint structure likely mirrors standard OpenAI-compatible chat completion APIs: a messages array containing text and image content, a model parameter specifying “pixtral-large”, and optional parameters for temperature and max tokens.
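
For illustration, a raw HTTP call following that pattern might look like the sketch below. The endpoint path, the model ID, and the response shape are assumptions drawn from Mistral’s general chat API conventions, not Pixtral-specific documentation; verify all three before relying on them.

```python
import os
import requests

# Hedged sketch of a Pixtral Large request over Mistral's chat-completions
# REST API. Endpoint path and "pixtral-large-latest" model ID are assumptions.

API_URL = "https://api.mistral.ai/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

payload = {
    "model": "pixtral-large-latest",   # assumed model ID; check Mistral's model list
    "temperature": 0.2,                # low temperature for extraction accuracy
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": "https://example.com/q3-revenue-chart.png"},
            {"type": "text", "text": "Extract every data point from this chart as a CSV table."},
        ],
    }],
}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
# Response parsing assumes an OpenAI-style choices/message structure.
print(resp.json()["choices"][0]["message"]["content"])
```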

For image inputs, encode images as base64 strings or provide URLs. The vision encoder processes images alongside text prompts. Token counting for images remains undocumented, so you can’t predict costs accurately. Mistral hasn’t published rate limits, batch processing capabilities, or maximum image dimensions. This creates friction for production deployments where you need to budget API costs and plan for scale.
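
If you’re sending local files rather than URLs, a base64 data-URL payload is the common pattern across multimodal chat APIs; whether Pixtral accepts it in exactly this form, and at what maximum resolution, is something to confirm against Mistral’s docs.

```python
import base64
from pathlib import Path

# Encode a local image as a base64 data URL for inline submission. The
# image_url field accepting a data URL is an assumption to verify, as are
# any undocumented file-size or resolution limits.

def to_data_url(path: str) -> str:
    suffix = Path(path).suffix.lstrip(".").lower() or "png"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:image/{suffix};base64,{encoded}"

image_block = {"type": "image_url", "image_url": to_data_url("invoice_page1.png")}
```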

The lack of official SDK integration examples forces developers to reverse-engineer API patterns from general Mistral documentation or AWS Bedrock guides. Competitors publish comprehensive API docs with code samples in Python, JavaScript, and cURL. Mistral provides basic endpoint information but skips the practical details developers need: error handling patterns, retry logic for rate limits, optimal image resolution for cost versus accuracy tradeoffs.

For actual code snippets and SDK setup, check the official Mistral documentation or AWS Bedrock parameters guide. The API likely accepts standard multimodal message formats with image URLs or base64-encoded data. Expect to specify the model ID as “pixtral-large” or similar. Temperature and max tokens parameters should work like other Mistral models, though vision-specific parameters remain undocumented.

Getting the best results: prompting techniques for vision-first models

Start with image context before text instructions. In multimodal prompts, place image references first, then follow with your question or task. This helps the vision encoder process visual information before the decoder attempts language generation. For example, upload a chart image, then ask “Extract the data points from this line graph as a CSV table.” The model processes the visual input, identifies the chart elements, then structures output according to your text instruction.

Be explicit about task framing. Specify whether you want OCR, description, analysis, or extraction. “Describe this image” produces different output than “Extract all text from this document” or “Analyze the trends in this chart.” Vision models perform better with clear task definitions. Vague prompts like “What do you see?” waste tokens and produce unfocused responses.

Request structured output formats for parseable results. Ask for JSON, markdown tables, or CSV when you need machine-readable data. “Convert this invoice to JSON with fields for date, amount, vendor, and line items” produces structured output you can feed into downstream systems. Unstructured prose descriptions require additional parsing and introduce errors.
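
Even well-prompted models occasionally wrap JSON in code fences or add commentary, so it’s worth validating the response before handing it downstream. The helper below is a generic sketch, not Pixtral-specific.

```python
import json
import re

# Defensive parsing for "return JSON" prompts: strip markdown fences, isolate
# the outermost object, and check that expected fields are present.

REQUIRED_KEYS = {"date", "amount", "vendor", "line_items"}

def parse_invoice_json(model_output: str) -> dict:
    cleaned = re.sub(r"```(?:json)?", "", model_output).strip()
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    data = json.loads(cleaned[start:end + 1])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing expected fields: {sorted(missing)}")
    return data
```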

At 128K tokens, you can include multiple images in a single request, but token counting for images remains undocumented. Test with small batches first to understand token consumption and cost. GPT-4o charges different rates based on image resolution; Pixtral Large might do the same but hasn’t published the formula. Start with low-resolution images for testing, then increase quality if accuracy suffers.
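
For multi-image requests, a rough budget check under an assumed per-image token cost can keep you inside the context window. The per-image figure below is a placeholder, not a published number; replace it with measured usage from real API responses.

```python
# Rough context-budget check for multi-image requests. ASSUMED_TOKENS_PER_IMAGE
# is a placeholder; Mistral has not published how Pixtral Large counts image tokens.

CONTEXT_WINDOW = 128_000
ASSUMED_TOKENS_PER_IMAGE = 2_000   # placeholder assumption, not a published figure
RESERVED_FOR_OUTPUT = 4_000

def fits_in_context(num_images: int, prompt_tokens: int) -> bool:
    used = num_images * ASSUMED_TOKENS_PER_IMAGE + prompt_tokens + RESERVED_FOR_OUTPUT
    return used <= CONTEXT_WINDOW

print(fits_in_context(num_images=40, prompt_tokens=1_500))  # True under these assumptions
```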

Temperature settings affect output randomness. For data extraction tasks where accuracy matters, use low temperature (0.1 to 0.3) to reduce hallucinations. For creative analysis or exploratory tasks, higher temperature (0.7 to 0.9) produces more varied interpretations. Mistral hasn’t published model-specific temperature recommendations, so you’ll experiment to find optimal settings for your use case.

Iterative refinement works well for complex visual tasks. Start with a broad prompt, review the output, then refine with follow-up questions. “Describe this financial chart” produces an overview. Follow with “What’s the exact value for Q3 revenue?” to drill into specifics. The 128K context supports extended conversations where you progressively narrow focus.
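
In code, iterative refinement just means carrying the conversation history forward, image included, so follow-up questions can reference earlier answers. The request shape carries the same assumptions noted in the API access section above.

```python
# Multi-turn refinement: keep the image and prior turns in the messages list.

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": "https://example.com/q3-financials.png"},
        {"type": "text", "text": "Describe this financial chart."},
    ],
}]

# ... send the request as shown earlier, then read the overview from the response ...
overview = "Revenue grew each quarter, with a dip in operating margin in Q2."

# Append the model's answer, then drill into specifics in the next turn.
messages.append({"role": "assistant", "content": overview})
messages.append({"role": "user", "content": "What is the exact value for Q3 revenue?"})
```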

Chain-of-thought prompting might improve reasoning on visual tasks, though Mistral hasn’t published specific guidance. Try prompts like “Analyze this chart step by step: first identify the chart type, then list the axes, then extract data points, then summarize trends.” Breaking complex visual analysis into discrete steps can improve accuracy, especially on tasks requiring both visual understanding and logical reasoning.

What doesn’t work: honest limitations of Pixtral Large

No public pricing means you can’t budget API costs. Competitors publish rate cards: GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens. Claude 3.5 Sonnet runs $3.00 and $15.00. Gemini 1.5 Flash undercuts both at $0.075 and $0.30. Pixtral Large requires contacting Mistral sales for a quote, which blocks rapid prototyping and makes cost comparisons impossible.

Benchmark gaps leave performance unverified outside vision-specific tasks. Mistral publishes ChartQA and MathVista scores but skips MMLU, MATH, and coding benchmarks. You can’t verify whether Pixtral handles complex instructions as well as GPT-4o’s 88.7% MMLU performance or matches Claude’s reasoning strength. The cherry-picked benchmarks suggest Mistral knows where the model underperforms and chooses not to disclose it.

API-only access blocks local deployment, edge inference, and air-gapped environments. Organizations requiring data sovereignty can’t run Pixtral on-premises. Developers optimizing costs through self-hosting have no option. This contrasts with Mistral’s own Mistral 7B, which ships with open weights and supports local inference.

Undocumented API details create integration friction. No published rate limits, no image token counting formula, no maximum resolution specs. Developers can’t plan for scale or optimize costs without this information. GPT-4o publishes detailed API documentation with examples in multiple languages. Pixtral Large forces you to guess.

Zero community adoption signals raise red flags. No GitHub repos integrating Pixtral Large, no Reddit discussions comparing it to alternatives, no production case studies. When a model launches, developers test it, write about it, and share results. The silence around Pixtral suggests limited access, high costs, or disappointing performance that early testers chose not to publicize.

Unknown modality limits block edge cases. Can it process PDFs directly or only images? What’s the maximum image resolution? Which file formats work? JPEG, PNG, WebP? Competitors document these details. Mistral doesn’t, forcing trial-and-error testing.

Multilingual vision performance remains unproven. Mistral models excel at multilingual text, but vision plus non-English text introduces complexity. Can Pixtral accurately OCR Japanese documents? Extract data from Arabic charts? No published benchmarks on multilingual visual tasks means you can’t verify capability before committing.

Security and compliance: what Mistral hasn’t disclosed

Mistral AI has not published security or compliance documentation specific to Pixtral Large. This creates dealbreakers for regulated industries. Healthcare organizations need HIPAA certification before processing patient data. Financial institutions require SOC 2 Type II audits. Government contractors need FedRAMP authorization. Pixtral Large discloses none of these.

Data retention policies for API requests containing images remain unknown. Does Mistral store uploaded images? For how long? Can you request deletion? Competitors publish clear data processing addendums. OpenAI offers SOC 2 Type II certification and publishes a data processing agreement. Anthropic provides HIPAA-eligible Claude for Healthcare with documented compliance.

GDPR compliance for EU-based processing is unclear. Mistral operates in France, which should simplify EU data residency, but the company hasn’t published geographic data processing options. Can you force API requests to stay within EU borders? Competitors offer region-specific endpoints. Pixtral Large doesn’t document this.

Image data encryption in transit and at rest should be standard, but Mistral hasn’t confirmed implementation details. Third-party security audits, penetration testing results, and vulnerability disclosure policies are absent from public documentation. This opacity blocks adoption in security-conscious organizations.

Without published compliance certifications, Pixtral Large cannot be used in healthcare, finance, or government contexts that require audit trails. If you need SOC 2, HIPAA, ISO 27001, or FedRAMP, choose a competitor with documented compliance.

Version history: limited updates since November 2024 launch

Date | Version | Key Changes
November 2024 | Pixtral Large (initial release) | 124B parameters (123B decoder + 1B vision encoder), 128K context, API-only access via Mistral platform and AWS Bedrock. Source: Mistral documentation

Mistral AI has not published subsequent updates or changelogs for Pixtral Large as of April 2026. Competitors version their models with clear update histories. GPT-4o receives monthly updates with published changelogs detailing performance improvements and bug fixes. Claude ships versioned releases like 3.5 and 3.5.1 with documented changes. Gemini publishes version-specific model cards for 1.5, 1.5 Pro, and 1.5 Flash.

Pixtral Large exists in a documentation vacuum. No version numbers beyond the initial “24-11” designation. No changelog tracking improvements. No transparency into whether the model you access today performs the same as the November 2024 launch version. This opacity makes reproducibility impossible and blocks scientific use cases requiring stable model versions.

Common questions

What is Pixtral Large?

Pixtral Large is Mistral AI’s 124-billion-parameter multimodal vision-language model released in November 2024. It combines a 1B vision encoder with a 123B text decoder, targeting document analysis and image interpretation tasks through API-only access. The model claims strong performance on chart interpretation and mathematical visual reasoning.

How much does Pixtral Large cost?

Mistral has not published public pricing for Pixtral Large. You must contact their sales team for quotes. Competitors charge $0.075 to $3.00 per million input tokens and $0.30 to $15.00 per million output tokens, but Pixtral’s rates remain undisclosed, making cost comparisons impossible.

Can I run Pixtral Large locally?

No. Pixtral Large is closed-source and API-only. Unlike Mistral 7B, which ships with open weights for local deployment, Pixtral Large requires cloud access through Mistral’s platform or AWS Bedrock. This blocks edge deployment and air-gapped environments.

What’s the context window size?

128,000 tokens, matching GPT-4o but smaller than Claude 3.5 Sonnet’s 200,000 tokens and Gemini 1.5 Flash’s 1 million tokens. This supports multi-page document analysis, but token counting for images is undocumented, making capacity planning difficult.

Is Pixtral Large better than GPT-4o for vision tasks?

Unknown for most tasks. Pixtral scores 88.1% on ChartQA versus GPT-4o’s 85.2%, suggesting stronger chart interpretation. But Mistral hasn’t published MMLU, MATH, or general vision benchmarks where GPT-4o has documented performance. Without comprehensive comparisons, you can’t verify superiority across vision workflows.

Does Pixtral Large support video analysis?

Undocumented. The model is classified as a vision-language model supporting images and documents, but Mistral hasn’t confirmed video capabilities. Assume image-only unless official documentation states otherwise.

Can I use Pixtral Large for medical imaging?

Not for clinical use. Pixtral Large has no disclosed FDA clearance, HIPAA certification, or medical imaging benchmarks. It might work for research contexts, but lacks the regulatory approval required for diagnostic applications in healthcare settings.

How do I access Pixtral Large?

Through the Mistral AI platform or AWS Bedrock. You’ll need an API key from Mistral or AWS credentials with Bedrock access. No public self-service signup is documented, so expect to contact Mistral directly for access details and pricing.

Alex Morgan
I write about artificial intelligence as it shows up in real life — not in demos or press releases. I focus on how AI changes work, habits, and decision-making once it’s actually used inside tools, teams, and everyday workflows. Most of my reporting looks at second-order effects: what people stop doing, what gets automated quietly, and how responsibility shifts when software starts making decisions for us.