VLAs in More Sophisticated Agents#
In the previous recipe, we saw how VLAs can be used in EmbodiedAgents to perform physical tasks. However, the real utility of VLAs is unlocked when they become part of a larger cognitive system. With its event-driven agent graphs, EmbodiedAgents allows us to do exactly that.
Most VLA policies are “open-loop” with respect to task completion: they run for a fixed number of steps and then stop, regardless of whether they succeeded or failed.
In this tutorial, we will build a Closed-Loop Agent. We will combine:
The Player (VLA): Attempts to pick up an object.
The Referee (VLM): Watches the camera stream and judges if the task is complete.
We will use the Event System to trigger a stop command on the VLA the moment the VLM confirms success.
The Player: Setting up the VLA#
First, we set up our VLA component exactly as we did in the previous recipe. We will use the same SmolVLA policy trained for picking oranges.
from agents.components import VLA
from agents.config import VLAConfig
from agents.clients import LeRobotClient
from agents.models import LeRobotPolicy
from agents.ros import Topic
# Define Topics
state = Topic(name="/isaac_joint_states", msg_type="JointState")
camera1 = Topic(name="/front_camera/image_raw", msg_type="Image")
camera2 = Topic(name="/wrist_camera/image_raw", msg_type="Image")
joints_action = Topic(name="/isaac_joint_command", msg_type="JointState")
# Setup Policy
policy = LeRobotPolicy(
    name="my_policy",
    policy_type="smolvla",
    checkpoint="aleph-ra/smolvla_finetune_pick_orange_20000",
    dataset_info_file="https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange/resolve/main/meta/info.json",
)
client = LeRobotClient(model=policy)
# Configure VLA (Mapping omitted for brevity, see previous tutorial)
# ... (assume joints_map and camera_map are defined)
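# Purely illustrative shape of the two mappings (hypothetical keys; the
# real ones must match your policy's dataset info.json and your robot's
# joint names, as set up in the previous recipe):
#   joints_map = {"<policy_joint_name>": "<robot_joint_name>", ...}
#   camera_map = {"<policy_camera_key>": <camera_topic>, ...}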
config = VLAConfig(
    observation_sending_rate=5,
    action_sending_rate=5,
    joint_names_map=joints_map,
    camera_inputs_map=camera_map,
    robot_urdf_file="./so101_new_calib.urdf",
)
player = VLA(
    inputs=[state, camera1, camera2],
    outputs=[joints_action],
    model_client=client,
    config=config,
    component_name="vla_player",
)
The Referee: Setting up the VLM#
Now we introduce the “Referee”. We will use a Vision Language Model (like Qwen-VL) to monitor the scene.
We want this component to periodically look at the camera1 feed and answer a specific question: “Are all the oranges in the bowl?”
We use a FixedInput to ensure the VLM is asked the exact same question every time.
from agents.components import VLM
from agents.clients import OllamaClient
from agents.models import OllamaModel
from agents.ros import FixedInput
# Define the topic where the VLM publishes its judgment
referee_verdict = Topic(name="/referee/verdict", msg_type="String")
# Setup the Model
qwen_vl = OllamaModel(name="qwen_vl", checkpoint="qwen2.5vl:7b")
qwen_client = OllamaClient(model=qwen_vl)
# Define the constant question
question = FixedInput(
    name="prompt",
    msg_type="String",
    fixed="Look at the image. Are all the oranges in the bowl? Answer only with YES or NO."
)
# Initialize the VLM
# Note: We trigger periodically (regulated by loop_rate)
referee = VLM(
    inputs=[question, camera1],
    outputs=[referee_verdict],
    model_client=qwen_client,
    trigger=10.0,
    component_name="vlm_referee"
)
Note
To prevent the VLM from consuming too much compute, we have configured a float trigger: instead of being triggered by a topic, our VLM component runs periodically, with a loop_rate of once every 10 seconds.
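If you would rather have the referee judge every incoming frame, a topic can be passed as the trigger instead. A minimal sketch of that variant (keep in mind that running a 7B VLM on every frame is compute-heavy):
referee = VLM(
    inputs=[question, camera1],
    outputs=[referee_verdict],
    model_client=qwen_client,
    trigger=camera1,  # topic trigger: runs on every new image from camera1
    component_name="vlm_referee"
)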
Tip
To make sure that the VLM output is formatted as per our requirement (YES or NO), check out how to use pre-processors in this tutorial.
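For example, a small normalization function can reduce a free-form answer to a strict YES/NO string before the verdict is published. The function below is only a sketch of the transformation logic; how to attach it to the component is shown in the linked pre-processors tutorial.
# Sketch of a verdict normalizer (attaching it to the component is
# covered in the pre-processors tutorial)
def normalize_verdict(output: str) -> str:
    answer = output.strip().strip(".!").upper()
    return "YES" if answer.startswith("YES") else "NO"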
The Bridge: Semantic Event Trigger#
Now comes the “Self-Referential” magic. We simply define an Event that fires when the /referee/verdict topic contains the word “YES”.
from agents.events import OnContains
# Define the Success Event
event_task_success = OnContains(
    event_name="success_verified",
    event_source=referee_verdict,
    nested_attributes="data",  # Check this attribute in the message
    trigger_value="YES",  # The VLM output we are looking for
)
Finally, we attach this event to the VLA using the set_termination_trigger method, setting the mode to event.
# Tell the VLA to stop immediately when the event fires
player.set_termination_trigger(
    mode="event",
    stop_event=event_task_success,
    max_timesteps=500  # Fallback: stop if 500 steps pass without success
)
See also
Events are a very powerful concept in EmbodiedAgents, and you can get infinitely creative with them. For example, imagine setting off the VLA component with a voice command: this can be done by combining the output of a SpeechToText component with an Event that generates an action command, as sketched below. To learn more, check out the recipes for Events & Actions.
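Here is a minimal sketch of the event half of that idea, assuming a SpeechToText component publishes its transcript on a hypothetical String topic /speech_text (the topic name is illustrative; wiring the event to an action that starts the VLA is covered in the Events & Actions recipes):
# Hypothetical transcript topic published by a SpeechToText component
speech_text = Topic(name="/speech_text", msg_type="String")
# Fire when the spoken command contains the word "start"
event_voice_start = OnContains(
    event_name="voice_start",
    event_source=speech_text,
    nested_attributes="data",
    trigger_value="start",
)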
Launching the System#
When we launch this graph:
The VLA starts moving the robot to pick the oranges.
The VLM simultaneously watches the feed.
Once the oranges are in the bowl, the VLM outputs “YES”.
The Event system catches this, interrupts the VLA, and signals that the task is complete.
from agents.ros import Launcher
launcher = Launcher()
launcher.add_pkg(components=[player, referee])
launcher.bringup()
You can send the action command to the VLA as defined in the previous recipe.
Complete Code#
from agents.components import VLA, VLM
from agents.config import VLAConfig
from agents.clients import LeRobotClient, OllamaClient
from agents.models import LeRobotPolicy, OllamaModel
from agents.ros import Topic, Launcher, FixedInput
from agents.events import OnContains

# --- Define Topics ---
state = Topic(name="/isaac_joint_states", msg_type="JointState")
camera1 = Topic(name="/front_camera/image_raw", msg_type="Image")
camera2 = Topic(name="/wrist_camera/image_raw", msg_type="Image")
joints_action = Topic(name="/isaac_joint_command", msg_type="JointState")
referee_verdict = Topic(name="/referee/verdict", msg_type="String")

# --- Setup The Player (VLA) ---
policy = LeRobotPolicy(
    name="my_policy",
    policy_type="smolvla",
    checkpoint="aleph-ra/smolvla_finetune_pick_orange_20000",
    dataset_info_file="https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange/resolve/main/meta/info.json",
)
vla_client = LeRobotClient(model=policy)

# VLA Config (Mappings assumed defined as per previous tutorial)
# joints_map = { ... }
# camera_map = { ... }

config = VLAConfig(
    observation_sending_rate=5,
    action_sending_rate=5,
    joint_names_map=joints_map,
    camera_inputs_map=camera_map,
    robot_urdf_file="./so101_new_calib.urdf",
)

player = VLA(
    inputs=[state, camera1, camera2],
    outputs=[joints_action],
    model_client=vla_client,
    config=config,
    component_name="vla_player",
)

# --- Setup The Referee (VLM) ---
qwen_vl = OllamaModel(name="qwen_vl", checkpoint="qwen2.5vl:7b")
qwen_client = OllamaClient(model=qwen_vl)

# A static prompt for the VLM
question = FixedInput(
    name="prompt",
    msg_type="String",
    fixed="Look at the image. Are all the oranges in the bowl? Answer only with YES or NO."
)

referee = VLM(
    inputs=[question, camera1],
    outputs=[referee_verdict],
    model_client=qwen_client,
    trigger=10.0,  # Periodic trigger: judge the scene every 10 seconds
    component_name="vlm_referee"
)

# --- Define the Logic (Event) ---
# Create an event that looks for "YES" in the VLM's output
event_success = OnContains(
    event_name="task_success",
    event_source=referee_verdict,
    nested_attributes="data",
    trigger_value="YES"
)

# Link the event to the VLA's stop mechanism
player.set_termination_trigger(
    mode="event",
    stop_event=event_success,
    max_timesteps=500  # Failsafe
)

# --- Launch ---
launcher = Launcher()
launcher.add_pkg(components=[player, referee])
launcher.bringup()