agents.components

agents.components#

A Component is the main execution unit in EmbodiedAgents and in essence each component is synctactic sugar over a ROS2 Lifecycle Node. EmbodiedAgents provides the following components. These components can be arbitrarily combined to form an embodied agent graph.

Component Name

Description

LLM

Uses large language models (e.g., LLaMA) to process text input. Can be used for reasoning, tool calling, instruction following, or dialogue. It can also utilize vector DBs for storing and retreiving contextual information.

MLLM

Leverages multimodal LLMs (e.g., Llava) for understanding and processing both text and image data. Inherits all functionalities of the LLM component.

SpeechToText

Converts spoken audio into text using speech-to-text models (e.g., Whisper). Suitable for voice command recognition. It also implements small on-board models for Voice Activity Detection (VAD) and Wakeword recognition, using audio capture devices onboard the robot.

TextToSpeech

Synthesizes audio from text using TTS models (e.g., SpeechT5, Bark). Output audio can be played using the robot’s speakers or published to a topic. Implements say(text) and stop_playback functions to play/stop audio based on events from other components or the environment.

MapEncoding

Provides a spatio-temporal working memory by converting semantic outputs (e.g., from MLLMs or Vision) into a structured map representation. Uses robot localization data and output topics from other components to store information in a vector DB.

SemanticRouter

Routes information between topics based on semantic content and predefined routing rules. Uses a vector DB for semantic matching and decision-making. This allows for creating complex graphs of components where a single input source can trigger different information processing pathways.

Vision

An essential component in all vision powered robots. Performs object detection and tracking on incoming images. Outputs object classes, bounding boxes, and confidence scores. It implements a low-latency small on-board classification model as well.

VideoMessageMaker

This component generates ROS video messages from input image messages. A video message is a collection of image messages that have a perceivable motion. I.e. the primary task of this component is to make intentionality decisions about what sequence of consecutive images should be treated as one coherent temporal sequence. The chunking method used for selecting images for a video can be configured in component config. It can be useful in sending videos to ML models that take image sequences.