Vision Agents

Name: Vision Agents
Availability: OnlineOnly
Author: Stream

Agent Frameworks

Open-source Video AI framework for building real-time voice and video applications with built-in AI integrations.

At a Glance

Pricing

Open Source

Free open-source framework for building real-time voice and video AI applications

Available On

API

SDK

Stream, Boulder, CO. Est. 2015. $58.1M raised.

Listed Jan 2026

About Vision Agents

Vision Agents is an open-source Video AI framework designed for building real-time voice and video applications. It ships with Stream Video as its default low-latency transport, powered by a global edge network, while remaining edge/transport agnostic so developers can bring any edge layer they prefer. The framework makes it simple to prototype and scale a wide range of AI-powered video applications.
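
To illustrate what "edge/transport agnostic" can mean in practice, here is a minimal sketch in which an agent depends only on a small transport interface, so the transport can be swapped without touching the agent. The names `EdgeTransport`, `InMemoryTransport`, and `EchoAgent` are hypothetical and invented for this example; they are not the Vision Agents API.

```python
from typing import Protocol


class EdgeTransport(Protocol):
    """Hypothetical transport interface; not part of the Vision Agents API."""

    def send(self, payload: bytes) -> None: ...


class InMemoryTransport:
    """Stand-in transport that records payloads instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


class EchoAgent:
    """Toy agent that depends only on the EdgeTransport interface."""

    def __init__(self, transport: EdgeTransport) -> None:
        self.transport = transport

    def reply(self, text: str) -> None:
        # The agent never needs to know which transport is behind the interface.
        self.transport.send(text.encode("utf-8"))


transport = InMemoryTransport()
EchoAgent(transport).reply("hello")
```

Any object with a matching `send` method could be passed in place of `InMemoryTransport`, which is the property the description above attributes to the framework.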

Coaching & Training Applications — Build live sports coaching apps, guided workouts, and interactive training experiences with real-time video AI capabilities.
Collaboration Tools — Create meeting assistants, automated note-taking systems, and transcription services for enhanced team productivity.
Automation & Robotics — Develop IoT control systems, surveillance applications, and manufacturing workflow automation using video AI processing.
Video AI Features — Build video avatars and character agents for interactive and engaging user experiences.
23+ Built-in AI Integrations — Connect with popular providers including OpenAI, Gemini, xAI, OpenRouter for LLMs; Deepgram, Fast-Whisper, Wizper for speech-to-text; ElevenLabs, Cartesia, AWS Polly for text-to-speech; and Ultralytics YOLO, Moondream, Roboflow for video processing.
Realtime API Support — Leverage WebRTC connections through OpenAI, Gemini, AWS Bedrock, and Qwen for low-latency real-time interactions.
Extensible Architecture — Build custom integrations using BaseProcessor or VideoProcessorMixin classes to plug in custom computer-vision models and extend functionality.
Memory & Context Management — Utilize in-memory storage and Stream Chat integration for maintaining conversation context and state.
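
The "Extensible Architecture" item above can be pictured with a small processor pipeline. This is only a sketch of the general pattern: `FrameProcessor`, `BrightnessProcessor`, and `run_processors` are hypothetical names, frames are modeled as nested lists of grayscale values, and the actual `BaseProcessor` and `VideoProcessorMixin` classes may look quite different.

```python
from abc import ABC, abstractmethod

# Grayscale pixels (0-255); a stand-in for a real video frame.
Frame = list[list[int]]


class FrameProcessor(ABC):
    """Hypothetical base class; the real BaseProcessor may differ."""

    @abstractmethod
    def process(self, frame: Frame) -> dict:
        """Return a dict of named results computed from one frame."""


class BrightnessProcessor(FrameProcessor):
    """Example custom processor: reports the mean pixel value of a frame."""

    def process(self, frame: Frame) -> dict:
        pixels = [p for row in frame for p in row]
        if not pixels:
            return {"mean_brightness": 0.0}
        return {"mean_brightness": sum(pixels) / len(pixels)}


def run_processors(frame: Frame, processors: list[FrameProcessor]) -> dict:
    """Run each processor on the frame and merge their results."""
    results: dict = {}
    for processor in processors:
        results.update(processor.process(frame))
    return results


result = run_processors([[0, 255], [255, 0]], [BrightnessProcessor()])
print(result)
```

A custom computer-vision model would take the place of `BrightnessProcessor`, returning detections or labels instead of a brightness value.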

To get started, install Vision Agents and set up your first project by following the installation guide. The documentation covers voice agents, video agents, and integration setup, and includes step-by-step implementation guides and ready-to-use cookbook examples for common use cases such as a golf coach application.

Pricing

Open Source

Full framework access
23+ AI integrations
Voice agents
Video agents
Extensible plugin architecture

Capabilities

Key Features

Real-time voice agents
AI-powered video applications
23+ built-in AI integrations
LLM support (OpenAI, Gemini, xAI, OpenRouter)
Speech-to-text (Deepgram, Fast-Whisper, Wizper)
Text-to-speech (ElevenLabs, Cartesia, AWS Polly)
Video processing (Ultralytics YOLO, Moondream, Roboflow)
Turn detection (Smart Turn, Vogent)
Memory and context management
Extensible plugin architecture
Edge/transport agnostic design
Low-latency transport via Stream Video
WebRTC realtime API support
Video avatars and character agents
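
As a rough picture of the "Memory and context management" feature listed above, the sketch below keeps the most recent conversation turns in memory and discards older ones. `ConversationMemory` and its methods are hypothetical and written only for this example; the framework's own in-memory store and its Stream Chat integration are not shown here.

```python
from collections import deque


class ConversationMemory:
    """Hypothetical in-memory context store keeping the most recent turns."""

    def __init__(self, max_turns: int = 20) -> None:
        # A bounded deque drops the oldest turn once max_turns is reached.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self._turns.append({"role": role, "text": text})

    def context(self) -> list:
        """Return the retained turns, oldest first."""
        return list(self._turns)


memory = ConversationMemory(max_turns=2)
memory.add("user", "How is my golf swing?")
memory.add("agent", "Keep your head still.")
memory.add("user", "And my grip?")
```

With `max_turns=2`, the first user message has been dropped by the time the third turn is added, so only the two most recent turns would be passed along as context.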

Integrations

OpenAI
Gemini
xAI
OpenRouter
Anthropic
AWS Bedrock
Qwen
Deepgram
Fast-Whisper
Wizper
Fish Audio
ElevenLabs
Cartesia
AWS Polly
Inworld
Kokoro
Smart Turn
Vogent
Ultralytics YOLO
Moondream
Roboflow
Decart
HeyGen
Stream Video
Stream Chat

API Available
