Create a conversational agent with audio#

Often times robots are equipped with a speaker system and a microphone. Once these peripherals have been exposed through ROS, we can use ROS Agents to trivially create a conversational interface on the robot. Our conversational agent will use a multimodal LLM for contextual question/answering utilizing the camera onboard the robot. Furthermore, it will use speech-to-text and text-to-speech models for converting audio to text and vice versa. We will start by importing the relavent components that we want to string together.

from agents.components import MLLM, SpeechToText, TextToSpeech

Components are basic functional units in ROS Agents. Their inputs and outputs are defined using ROS Topics. And their function can be any input transformation, for example the inference of an ML model. Lets setup these components one by one. Since our input to the robot would be speech, we will setup the speech-to-text component first.

SpeechToText Component#

This component listens to input an audio input topic, that takes in a multibyte array of audio (captured in a ROS std_msgs message, which maps to Audio msg_type in ROS Sugar) and can publish output to a text topic. It can also be configured to get the audio stream from microphones on board our robot. With this configuration the component is always listening and uses a small Voice Activity Detection (VAD) model, Silero-VAD to filter out any audio that is not speech. We will be using this configuration in our example. First we will setup our input and output topics and then create a config object which we can later pass to our component.

from agents.ros import Topic
from agents.config import SpeechToTextConfig

# Define input and output topics (pay attention to msg_type)
audio_in = Topic(name="audio0", msg_type="Audio")
text_query = Topic(name="text0", msg_type="String")

s2t_config = SpeechToTextConfig(enable_vad=True)  # option to always listen for speech through the microphone

Note

With enable_vad set to True, the component automatically deploys Silero-VAD by default in ONNX format. This model has a small footprint and can be easily deployed on the edge. However we need to install a couple of dependencies for this to work. These can be installed with: pip install pyaudio torchaudio onnxruntime

To initialize the component we also need a model client for a speech to text model. We will be using the HTTP client for RoboML for this purpose.

Note

RoboML is a aggregator library that provides a model serving aparatus for locally serving opensource ML models useful in robotics. Learn about setting up RoboML here.

Additionally, we will use the client with a model called, Whisper, a popular opensource speech to text model from OpenAI. Lets see what the looks like in code.

from agents.clients.roboml import HTTPModelClient
from agents.models import Whisper

# Setup the model client
whisper = Whisper(name="whisper")  # Custom model init params can be provided here
roboml_whisper = HTTPModelClient(whisper)

# Initialize the component
speech_to_text = SpeechToText(
    inputs=[audio_in],  # the input topic we setup
    outputs=[text_query], # the output topic we setup
    model_client=roboml_whisper,
    trigger=audio_in,
    config=s2t_config  # pass in the config object
)

The trigger parameter lets the component know that it has to perform its function (in this case model inference) when an input is received on this particular topic. In our configuration, the component will be triggered using voice activity detection on the continuous stream of audio being received on the microphone. Next we will setup our MLLM component.

MLLM Component#

The MLLM component takes as input a text topic (the output of the SpeechToText component) and an image topic, assuming we have a camera device onboard the robot publishing this topic. And just like before we need to provide a model client, this time with an MLLM model. This time we will use the OllamaClient along with Llava, a popular opensource multimodal LLM.

Note

Ollama is one of the most popular local LLM serving projects. Learn about setting up Ollama here.

Here is the code for our MLLM setup.

from agents.clients.ollama import OllamaClient
from agents.models import Llava

# Define the image input topic and a new text output topic
image0 = Topic(name="image_raw", msg_type="Image")
text_answer = Topic(name="text1", msg_type="String")

# Define a model client (working with Ollama in this case)
llava = Llava(name="llava")
llava_client = OllamaClient(llava)

# Define an MLLM component
mllm = MLLM(
    inputs=[text_query, image0],  # Notice the text input is the same as the output of the previous component
    outputs=[text_answer],
    model_client=llava_client,
    trigger=text_query,
    component_name="vqa" # We have also given our component an optional name
)

We can further customize the our MLLM component by attaching a context prompt template. This can be done at the component level or at the level of a particular input topic. In this case we will attach a prompt template to the input topic text_query.

# Attach a prompt template
mllm.set_topic_prompt(text_query, template="""You are an amazing and funny robot.
Answer the following about this image: {{ text0 }}"""
)

Notice that the template is a jinja2 template string, where the actual name of the topic is set as a variable. For longer templates you can also write them to a file and provide its path when calling this function. After this we move on to setting up our last component.

TextToSpeech Component#

The TextToSpeech component setup will be very similar to the SpeechToText component. We will once again use a RoboML client, this time with the SpeechT5 model (opensource model from Microsoft). Furthermore, this component can be configured to play audio on a playback device available onboard the robot. We will utilize this option through our config. An output topic is optional for this component as we will be playing the audio directly on device.

Note

In order to utilize play_on_device you need to install a couple of dependencies as follows: pip install soundfile sounddevice

from agents.config import TextToSpeechConfig
from agents.models import SpeechT5

# config for playing audio on device
t2s_config = TextToSpeechConfig(play_on_device=True)

speecht5 = SpeechT5(name="speecht5")
roboml_speecht5 = HTTPModelClient(speecht5)
text_to_speech = TextToSpeech(
    inputs=[text_answer],
    trigger=text_answer,
    model_client=roboml_speecht5,
    config=t2s_config
)

Launching the Components#

The final step in this example is to launch the components. This is done by passing the defined components to the launcher and calling the bringup method.

from agents.ros import Launcher

# Launch the components
launcher = Launcher()
launcher.add_pkg(
    components=[speech_to_text, mllm, text_to_speech]
    )
launcher.bringup()

Et voila! we have setup a graph of three components in less than 50 lines of well formatted code. The complete example is as follows:

Multimodal Audio Conversational Agent#
 1from agents.components import MLLM, SpeechToText, TextToSpeech
 2from agents.config import SpeechToTextConfig, TextToSpeechConfig
 3from agents.clients.roboml import HTTPModelClient
 4from agents.clients.ollama import OllamaClient
 5from agents.models import Whisper, SpeechT5, Llava
 6from agents.ros import Topic, Launcher
 7
 8audio_in = Topic(name="audio0", msg_type="Audio")
 9text_query = Topic(name="text0", msg_type="String")
10
11whisper = Whisper(name="whisper")  # Custom model init params can be provided here
12roboml_whisper = HTTPModelClient(whisper)
13
14speech_to_text = SpeechToText(
15    inputs=[audio_in],
16    outputs=[text_query],
17    model_client=roboml_whisper,
18    trigger=audio_in,
19    config=SpeechToTextConfig(enable_vad=True)  # option to always listen for speech through the microphone
20)
21
22image0 = Topic(name="image_raw", msg_type="Image")
23text_answer = Topic(name="text1", msg_type="String")
24
25llava = Llava(name="llava")
26llava_client = OllamaClient(llava)
27
28mllm = MLLM(
29    inputs=[text_query, image0],
30    outputs=[text_answer],
31    model_client=llava_client,
32    trigger=text_query,
33    component_name="vqa"
34)
35
36# config for playing audio on device
37t2s_config = TextToSpeechConfig(play_on_device=True)
38
39speecht5 = SpeechT5(name="speecht5")
40roboml_speecht5 = HTTPModelClient(speecht5)
41text_to_speech = TextToSpeech(
42    inputs=[text_answer],
43    trigger=text_answer,
44    model_client=roboml_speecht5,
45    config=t2s_config
46)
47
48launcher = Launcher()
49launcher.add_pkg(
50    components=[speech_to_text, mllm, text_to_speech]
51    )
52launcher.bringup()

Web Based Client for Interacting with the Robot#

To interact with text and audio based topics on the robot, ROS Agents includes a tiny browser based client made with chainlit. This is useful if the robot does not have a microphone/speaker interface or if one wants to communicate with it remotely. The client can be launched as follows:

ros2 run automatika_embodied_agents tiny_web_client

The client displays a web UI on http://localhost:8000. Open this address from browser. ROS input and output topic settings for text and audio topics can be configured from the web UI by pressing the settings icon.

See also

To customize the webapp, checkout how to set the configuration options for the app and the UI