Building a question-answering chatbot with large language models (LLMs) is now a common workflow for text-based interactions. What about creating an AI system that can answer questions about video and image content? This presents a far more complex task.
Traditional video analytics tools struggle due to their limited functionality and a narrow focus on predefined objects. This makes it difficult to build general-purpose systems that understand and extract rich context from video streams. Developers face the following core challenges:
- Limited understanding: Computer vision models struggle to provide contextual insights beyond predefined objects.
- Retaining context: Capturing and maintaining relevant context over time for videos is challenging.
- Integration complexity: Building a seamless user experience requires integrating multiple AI technologies.

In this post, we introduce a solution to these challenges by using the NVIDIA AI Blueprint for video search and summarization. This approach enables the development of a visual AI agent capable of multi-step reasoning over video streams.
By incorporating the NVIDIA Morpheus SDK, NVIDIA Riva for automatic speech recognition (ASR) and text-to-speech (TTS), and the AI Blueprint, the system creates a robust retrieval-augmented generation (RAG) workflow that accepts speech input and delivers audio responses for a hands-free experience.
To show real-world applications, we conclude with a sample use case: open-world question-answering on first-person video streams.
Solving the challenges of traditional video analytics with VLMs
Traditional methods rely on computer vision models pretrained to recognize a predefined set of features or objects. Vision-language models (VLMs) enable generic and adaptable scene understanding.
Limited understanding of predefined objects
Traditional models are often limited to recognizing only predefined objects or events, making it difficult to handle the diverse and evolving range of inputs in real-world environments.
VLMs, as foundation models, overcome this limitation by using large-scale, diverse datasets to understand a wide variety of objects, relationships, and scenarios without explicit retraining. They also exhibit advanced spatial and temporal understanding. This enables them to identify and describe novel objects and events with unprecedented flexibility for real-world applications.
Maintaining context over time
Videos often contain a long sequence of events, and ensuring the system retains relevant context for answering questions becomes a significant challenge, especially for complex multi-step reasoning tasks.
VLMs, with their multimodal capabilities, can incorporate temporal data into their analysis. With multi-frame input, they maintain and process context over time. The AI Blueprint for video search and summarization uses these VLMs to understand context across even longer videos and to build a knowledge graph of the content for future queries.
Combining multiple services for a seamless user experience
Building a system that not only understands the video but also interacts with users through speech requires integrating multiple technologies—video analysis, speech recognition, reasoning, and audio output.
You can use REST APIs for individual services and orchestrate a cohesive workflow. This modular approach simplifies scaling, maintenance, and the addition of new features, enabling a seamless user experience with robust interactions.
Visual AI agent workflow overview
In this workflow, you create a visual AI agent question-answering tool for videos. The tool performs complex multi-step reasoning based on a video stream, providing a hands-free user interface by taking in speech input and providing audio output. You can set up this workflow and try it out using the /via_workflows/video_agentic_rag_with_morpheus_riva Jupyter notebook.
We showcase the blueprint’s broad contextual understanding by providing it with live, first-person, point-of-view video streams of everyday activities, not limited to any specific contextual scope.
Such video feeds could originate from a head-mounted camera, such as one found on augmented reality glasses. Using the provided video, the tool accurately answers questions about the user’s past and present environment. For example, “Where did I leave my concert tickets?” and “What was the name of the coworker I just met?”
With some imagination, this broad capability can easily be adapted to a multitude of industry-specific use cases, from construction site safety to accessibility for the visually impaired.
Integrating a reasoning pipeline, speech inputs, and audio outputs with AI Blueprint
To build this type of workflow, you need the following components:
- AI Blueprint for video search and summarization
- NVIDIA Morpheus SDK
- Riva NIM ASR and TTS microservices
- An LLM NIM microservice, for generating the final response

This workflow consists of the following steps:
1. Video processing: Stored or streamed video is processed using the AI Blueprint to create natural-language summaries of the events. The blueprint also creates a knowledge graph of the video, which can be queried later through REST APIs.
2. Speech-to-text conversion: User audio queries are transcribed into text using the Riva Parakeet model for automatic speech recognition.
3. Reasoning pipeline: The Morpheus SDK powers the LLM reasoning pipeline, generating actionable checklists based on the user query.
4. Context retrieval: Relevant information is fetched from three parallel pipelines:
   - Querying the blueprint to fetch answers from pre-existing summaries and knowledge graphs.
   - Sending new queries to the blueprint to retrieve specific insights from the video that weren’t captured during initial processing.
   - Performing an internet search to supplement the video insights with additional facts relevant to the scene and user query.
5. Final response generation: An LLM NIM microservice synthesizes the gathered context to produce a summarized answer to the user.
6. Text-to-speech conversion: The Riva text-to-speech FastPitch model outputs an audio version of the answer.
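The speech interface in steps 2 and 6 can be prototyped with the nvidia-riva-client Python package. The following is a minimal sketch, assuming a Riva server reachable at localhost:50051, a WAV recording of the user's question, and a stand-in answer_question function in place of the reasoning pipeline; exact class and parameter names may vary across client versions.

```python
import wave
import riva.client

# Connect to a running Riva server (address is an assumption for this sketch)
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)
tts_service = riva.client.SpeechSynthesisService(auth)

def answer_question(question: str) -> str:
    # Placeholder for the Morpheus reasoning pipeline described later in this post
    return f"(answer to: {question})"

# Step 2: transcribe the user's spoken question (offline recognition on a WAV file)
asr_config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
with open("user_query.wav", "rb") as f:
    audio_bytes = f.read()
asr_response = asr_service.offline_recognize(audio_bytes, asr_config)
question = asr_response.results[0].alternatives[0].transcript

# Steps 3-5: hand the transcript to the reasoning pipeline (stand-in function above)
answer = answer_question(question)

# Step 6: synthesize the spoken answer and wrap the raw PCM samples in a WAV container
tts_response = tts_service.synthesize(
    answer,
    voice_name="English-US.Female-1",
    language_code="en-US",
    sample_rate_hz=44100,
)
with wave.open("answer.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit linear PCM
    out.setframerate(44100)
    out.writeframes(tts_response.audio)
```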
Understanding video using the AI Blueprint for video search and summarization
Traditional video analytics applications and their development workflows are typically built on a collection of fixed-function and limited models that are designed to detect and identify only a select set of predefined objects.
With generative AI and vision foundation models, it is now possible to build applications with fewer models, each of which possesses incredibly complex and broad perception as well as rich contextual understanding. This new generation of VLMs gives rise to powerful visual AI agents.
The recently released AI Blueprint for video search and summarization provides a cloud-native solution to accelerate the development of visual AI agents. It offers a modular architecture with customizable model support and exposes REST APIs, enabling easy integration with other technologies. We used these REST APIs to integrate the blueprint into the project discussed in this post.
The AI Blueprint is available through an early access program, so you can download and deploy it on your own infrastructure. For more information about this blueprint, including its components and models, see Build a Video Search and Summarization Agent with NVIDIA AI Blueprint.
This AI Blueprint enables you to understand the video and store the extracted information in vector and graph databases. You process historical video as a preliminary step.
In a real-world use case, you can concurrently curate this historical database as the live video feed is being streamed. After the video is processed, query this blueprint with the checklist that the reasoning pipeline generates.
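As a rough illustration, a single checklist item can be sent to the blueprint over HTTP once the video has been ingested. The base URL, endpoint path, and payload fields below are placeholders for this sketch rather than the blueprint's documented contract; consult the blueprint's REST API reference for the exact schema.

```python
import requests

VSS_BASE_URL = "http://localhost:8100"  # placeholder address for the blueprint service

def ask_blueprint(question: str, video_id: str) -> str:
    """Send one checklist item to the blueprint's Q&A endpoint (hypothetical path and fields)."""
    response = requests.post(
        f"{VSS_BASE_URL}/chat/completions",  # assumed endpoint name for this sketch
        json={"id": video_id, "messages": [{"role": "user", "content": question}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Example: one item from the generated checklist
print(ask_blueprint("Is the stove currently visible, and is it on?", video_id="kitchen-cam"))
```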
LLM reasoning pipeline using Morpheus
Because this workflow tackles an open-world understanding problem, it’s crucial that the LLM reasons iteratively to avoid hallucinations and incorrect answers. This means performing multiple retrieval and inference steps.
Here, the LLM reasoning pipeline is developed using the Morpheus SDK because it provides a powerful LLM engine module built to handle inference efficiently by parallelizing inference calls on the GPU.
The Morpheus SDK is built to provide native support for NIM microservices, so you can easily integrate a variety of AI foundation models, including those for LLM inference and speech, as required by this workflow.
Finally, Morpheus also provides a reference architecture for a dynamic agentic reasoning pipeline, which this workflow uses. Per this architecture, you first generate a preliminary checklist of actionable items to gather context that helps answer the user’s query.
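As an illustration, the checklist-generation step can be prototyped against the Llama 3.1 70B NIM microservice through its OpenAI-compatible endpoint. This is a minimal sketch that assumes a locally deployed NIM at http://localhost:8000/v1 and an illustrative prompt, not the exact prompt used by the Morpheus reference pipeline.

```python
from openai import OpenAI

# OpenAI-compatible client pointed at a NIM LLM endpoint (URL and key are deployment-specific)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-for-local-nim")

def generate_checklist(user_query: str) -> list[str]:
    """Ask the LLM for a short checklist of context-gathering actions (illustrative prompt)."""
    completion = client.chat.completions.create(
        model="meta/llama-3.1-70b-instruct",
        messages=[
            {
                "role": "system",
                "content": "Produce a numbered checklist of 3-5 short, actionable steps for "
                           "gathering the context needed to answer the user's question about "
                           "their video feed. Output only the checklist.",
            },
            {"role": "user", "content": user_query},
        ],
        temperature=0.2,
    )
    text = completion.choices[0].message.content
    # Strip the leading "1.", "2.", ... numbering from each line
    return [line.split(".", 1)[-1].strip() for line in text.splitlines() if line.strip()]

checklist = generate_checklist("Did I turn off the stove before leaving?")
```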
Use a default Llama 3.1 70B NIM microservice as the LLM, and provide it with three tools, built using the standard langchain library for retrieval-augmented generation, as sketched after the list below:
- Google Search: Accesses real-time information from the internet using SerpAPI.
- Present View: Queries the user’s current view using the AI Blueprint.
- Past View: Retrieves insights about the user’s past view using the AI Blueprint.

Due to Morpheus’ GPU-enabled and highly optimized data processing framework, the full end-to-end question-answering process can be completed in near-real time, enabling users to quickly receive feedback. The Morpheus pipeline’s ability to run multiple inference calls in parallel drastically increases throughput.
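These tools can be expressed as LangChain tool wrappers around SerpAPI and the blueprint's REST API. The sketch below assumes the langchain-core and langchain-community packages, a SERPAPI_API_KEY environment variable, and the hypothetical ask_blueprint helper from the earlier sketch; import paths differ across LangChain versions.

```python
from langchain_community.utilities import SerpAPIWrapper  # requires SERPAPI_API_KEY in the environment
from langchain_core.tools import Tool

# Reuse the hypothetical ask_blueprint(question, video_id) helper sketched earlier in this post.
from vss_helpers import ask_blueprint  # assumed local module; swap in your own wrapper

search = SerpAPIWrapper()

tools = [
    Tool(
        name="Google Search",
        func=search.run,
        description="Look up real-time facts on the internet relevant to the scene or the user's query.",
    ),
    Tool(
        name="Present View",
        func=lambda q: ask_blueprint(q, video_id="live"),
        description="Ask about what is currently visible in the user's live video feed.",
    ),
    Tool(
        name="Past View",
        func=lambda q: ask_blueprint(q, video_id="history"),
        description="Ask about events observed earlier in the user's recorded video history.",
    ),
]
```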
Sample use case
Suppose a user is preparing to leave the house for an important meeting. Before heading out, they remember walking through the kitchen, turning off the stove, and quickly leaving. However, while driving, they start to worry. “Did I really turn off the stove?” To put their mind at ease, they ask the video-understanding agent, “Did I turn off the stove before leaving?”
In this scenario, the workflow uses the following checklist to gather information and provide a response. Each checklist item is executed asynchronously, meaning that it is handled by its own separate LLM call, enabling parallel execution; a sketch of this pattern follows the walkthrough below. Here’s how the video-understanding agent would handle these tasks using the available tools:
1. Verify stove status: The LLM pipeline queries the blueprint to check the latest video feed for the stove. If the stove is not currently visible, it relies on the past view tool to determine its last known status. In this case, the agent confirms, “The stove was last seen off at Timestamp <3.00s>.”
2. Check historical video: The LLM calls the blueprint agent’s Q&A handler to analyze past recordings. It verifies that the user did indeed turn off the stove before exiting the kitchen, responding, “Yes, the user turned off the stove before leaving.”
3. Plan if the stove is not in view: The agent checks whether there are additional areas where the user might have gone after leaving the kitchen, such as the living room or outdoors. If necessary, it recommends checking footage from these areas to confirm that the stove was indeed turned off.
4. Inspect kitchen environment: The agent performs a quick scan of the surrounding kitchen area for signs such as leftover food, cooking utensils, or flames. After confirming no signs of active cooking, it concludes, “The stove was turned off.”

After collecting all this information, the LLM uses a summarization model to synthesize a clear and concise response. For example, “Yes, you turned off the stove before leaving the house. The burners were confirmed to be off.”
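The per-item, parallel execution described above can be approximated with asynchronous calls to the same OpenAI-compatible NIM endpoint used earlier. This is a simplified sketch: the actual Morpheus pipeline schedules these calls through its GPU-accelerated LLM engine and routes each item through the tools, both of which are omitted here.

```python
import asyncio
from openai import AsyncOpenAI

# OpenAI-compatible async client pointed at a NIM LLM endpoint (URL and key are deployment-specific)
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-used-for-local-nim")

async def run_checklist_item(item: str, user_query: str) -> str:
    """Resolve one checklist item with its own LLM call (tool calls omitted for brevity)."""
    completion = await client.chat.completions.create(
        model="meta/llama-3.1-70b-instruct",
        messages=[
            {"role": "system", "content": "Answer the checklist item in the context of the user's question."},
            {"role": "user", "content": f"Question: {user_query}\nChecklist item: {item}"},
        ],
    )
    return completion.choices[0].message.content

async def run_checklist(items: list[str], user_query: str) -> list[str]:
    # Each item gets its own LLM call; gather() runs them concurrently.
    return await asyncio.gather(*(run_checklist_item(item, user_query) for item in items))

items = [
    "Verify the stove's current status in the live feed",
    "Check historical video for the user turning off the stove",
    "Scan the kitchen for signs of active cooking",
]
results = asyncio.run(run_checklist(items, "Did I turn off the stove before leaving?"))
```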
You can also ask more nuanced questions that it can handle using historical context. For example, after leaving home, you might wonder, “By any chance, did you see if the dog’s bowl had food in it?” The agent can review past video frames from your kitchen and respond, “Yes, the dog’s bowl had food in it when you last walked through the kitchen.”
This flexibility enables a wide range of queries, making it a powerful assistant for remembering the details that you might have missed in your busy day-to-day routine.
Video 1. Build an Agentic Video Workflow with Video Search and Summarization
Getting started: Unleash the power of vision AI agents
Build powerful vision AI agents using the NVIDIA AI Blueprint for video search and summarization, combined with NVIDIA NIM. This post showed how REST APIs from these advanced tools can integrate seamlessly with existing technologies, enabling innovative workflows for your projects.
Ready to dive in? To try this workflow yourself, follow the step-by-step guide on the /NVIDIA/metropolis-nim-workflows GitHub repo. For technical questions and discussions, see the Visual AI Agents forum.
Explore more resources to deepen your understanding and get started:
- Apply for the Early Access Program: Assess the NVIDIA AI Blueprint for video search and summarization.
- Metropolis NIM Workflows: Discover other GenAI workflows to enhance your projects.
- Build a Video Search and Summarization Agent: Learn how to create powerful vision AI agents with NVIDIA AI Blueprint.