NVIDIA Launches Nemotron 3 Nano Omni, a Single Model That Handles Vision, Audio, and Language for AI Agents

Written by Arooj Ishtiaq

Thu Apr 30 2026

NVIDIA has launched Nemotron 3 Nano Omni, an open multimodal model that combines vision, audio, and language processing in a single architecture for enterprise AI agents. The model has been available since April 28 on Hugging Face, OpenRouter, build.nvidia.com, Amazon SageMaker JumpStart, and more than 25 partner platforms.

What the Model Does and Why It Matters

Most enterprise AI agent systems today stitch together separate models for vision, speech, and language, passing data between them in repeated inference passes. This increases latency, fragments context across modalities, and adds cost and error over time.

Nemotron 3 Nano Omni addresses this by handling all three modalities in a single inference pass. According to NVIDIA's technical report, the model delivers:

9.2x higher system efficiency for video use cases compared to other open omni models with the same interactivity
7.4x higher system efficiency for multi-document use cases
2.9x faster single-stream reasoning speed on multimodal tasks
Top scores on six leaderboards covering document intelligence, video understanding, audio understanding, and GUI navigation

Gautier Cloix, CEO of H Company, said in the official announcement that building on Nemotron 3 Nano Omni lets the company's agents rapidly interpret full HD screen recordings at native 1920x1080 resolution, something that was not practical before. He called it a fundamental shift in how agents perceive and interact with digital environments in real time.

Architecture and Technical Specifications

Nemotron 3 Nano Omni is built on a 30B-A3B hybrid Mixture-of-Experts architecture with the following components:

Language backbone: Nemotron 3 Nano 30B-A3B, combining 23 Mamba selective state-space layers, 23 MoE layers with 128 experts and top-6 routing, and 6 grouped-query attention layers
Vision encoder: C-RADIOv4-H, supporting dynamic resolution from 512x512 to 1840x1840 per image with between 1,024 and 13,312 visual patches
Audio encoder: Parakeet-TDT-0.6B-v2, sampling audio at 16 kHz and supporting up to 20 minutes of audio input per inference, with the LLM context supporting 5+ hours in total
Context window: 131K tokens, with support for up to 256 video frames and up to 2 minutes of video per call
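The image and audio limits above can be sanity-checked with a little arithmetic. The sketch below assumes a 16x16-pixel patch, which the article does not state. Under that assumption a 512x512 image yields exactly the quoted 1,024 patches, and a square 1840x1840 image yields 13,225, which sits just under the quoted ceiling of 13,312. The function names are illustrative only.

```python
import math

# Assumption: 16x16-pixel patches. The article does not give the patch size.
PATCH_EDGE = 16


def patch_count(width: int, height: int, patch_edge: int = PATCH_EDGE) -> int:
    """Number of non-overlapping square patches needed to cover an image."""
    return math.ceil(width / patch_edge) * math.ceil(height / patch_edge)


def audio_samples(minutes: float, sample_rate_hz: int = 16_000) -> int:
    """Raw sample count for a clip of the given length at the given rate."""
    return int(minutes * 60 * sample_rate_hz)


min_patches = patch_count(512, 512)      # smallest supported resolution
max_patches = patch_count(1840, 1840)    # largest supported square resolution
samples_20_min = audio_samples(20)       # longest single audio input

print(min_patches, max_patches, samples_20_min)
```

If the real encoder uses a different patch size, or merges patches before they reach the language model, these counts would change accordingly.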

For video, the model uses Conv3D tubelet embedding to fuse pairs of consecutive frames before the vision encoder, halving the number of tokens the language model processes. A secondary Efficient Video Sampling mechanism then drops redundant static tokens between frames at inference time, further reducing latency without affecting accuracy.
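To make the two video steps concrete, here is a minimal sketch in plain Python. The pairing arithmetic follows directly from the description above. The token-dropping rule (mean absolute difference against a fixed threshold) is a stand-in chosen for illustration, since the article does not describe the criterion Efficient Video Sampling actually uses. All function names and the threshold value are hypothetical.

```python
def tubelet_count(num_frames: int, frames_per_tubelet: int = 2) -> int:
    """Tubelets produced when consecutive frames are fused in fixed groups.

    Uses ceiling division, i.e. assumes a trailing odd frame is padded.
    """
    return -(-num_frames // frames_per_tubelet)


def changed_token_indices(prev_tokens, curr_tokens, threshold=0.05):
    """Indices of tokens that changed enough to keep.

    Hypothetical rule: keep a token when the mean absolute difference from
    the token at the same position in the previous step exceeds `threshold`.
    """
    keep = []
    for i, (prev, curr) in enumerate(zip(prev_tokens, curr_tokens)):
        diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        if diff > threshold:
            keep.append(i)
    return keep


# 256 frames (the stated per-call maximum) fused in pairs.
tubelets = tubelet_count(256)

# Toy 2-dimensional embeddings for three token positions.
prev_tokens = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
curr_tokens = [[0.0, 0.0], [1.0, 0.2], [0.5, 0.51]]
kept = changed_token_indices(prev_tokens, curr_tokens)

print(tubelets, kept)
```

In this toy input only the middle token moves by more than the threshold, so it is the only one passed on; the two near-static positions are dropped.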

What It Is Used For

According to NVIDIA's launch post and the AWS SageMaker JumpStart announcement, enterprise teams are deploying Nemotron 3 Nano Omni across the following workflows:

Document intelligence: Parsing contracts, financial statements, compliance packets, and 100+ page technical documents with cross-page reasoning across tables, charts, figures, and formulas
Computer use agents: Reading and reasoning over GUI screenshots in real time to navigate interfaces, automate browser workflows, and handle incident management dashboards
Audio and video understanding: Analyzing meeting recordings, customer service calls, product demos, and long-form video archives by jointly reasoning over what was said and shown
General multimodal reasoning: Synthesizing information across text, images, tables, and audio in a single reasoning loop for tasks requiring multi-step analysis

Benchmark Performance

According to the model page on Hugging Face, Nemotron 3 Nano Omni outperforms the closest comparable open omni model, Qwen3-Omni 30B-A3B, on the majority of benchmarks.

Availability and Adoption

The model launches with open weights under the NVIDIA Open Model Agreement, with broad industry adoption already underway. The broader Nemotron 3 family has surpassed 50 million downloads in the past year.

It is available for commercial use in three precision formats: BF16, FP8, and NVFP4.
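As a rough guide to what the three precision formats mean for deployment, the sketch below estimates weight storage only, treating the "30B" in the model's designation as 30 billion parameters. It assumes 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for NVFP4. It ignores NVFP4's per-block scale factors, any layers kept at higher precision, the KV cache, and activations, so real footprints will be somewhat larger.

```python
# Assumption: 30 billion total parameters, read off the "30B" in 30B-A3B.
TOTAL_PARAMS = 30e9

# Nominal storage per parameter; scale-factor overhead is ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


footprints = {
    name: weight_gb(TOTAL_PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```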

Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir, and Pyler are already deploying the model in production.

Dell Technologies, Docusign, Infosys, Oracle, and Zefr are currently evaluating the model for their respective use cases.

NVIDIA is also releasing a set of resources for organizations building custom document-understanding datasets:

Training code and curated datasets are included to support fine-tuning and specialization.
NeMo Data Designer pipeline recipes provide a structured starting point for dataset construction.

Early results also point to gains in computer use tasks: in preliminary evaluations on the OSWorld benchmark, H Company's computer use agent powered by Nemotron 3 Nano Omni showed a significant improvement over prior approaches in navigating complex graphical interfaces.