Controlling Robots with Vision Language Action Models#

The frontier of Embodied AI is moving away from modular pipelines (perception -> planning -> control) toward end-to-end learning. Vision-Language-Action (VLA) models take visual observations and natural language instructions as input and directly output robot joint commands.

In this tutorial, we will build an agent capable of performing physical manipulation tasks using the VLA component. We will use the LeRobot ecosystem to load a pretrained “SmolVLA” policy and connect it to a robot arm.

Important

To run this tutorial, you will need to install LeRobot, which acts as the model serving platform. You can find the installation instructions here. After installation, run the LeRobot async inference server as follows:

python -m lerobot.async_inference.policy_server --host=<HOST_ADDRESS> --port=<PORT>
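
For example, with the server running on the same machine as the agent, the command might look like this (the host and port values are placeholders, adjust them to your network setup):

python -m lerobot.async_inference.policy_server --host=0.0.0.0 --port=8080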

Simulation Setup#

WILL BE ADDED SOON

Setting up our VLA-based Agent#

We will start by importing the relevant components.

from agents.components import VLA
from agents.clients import LeRobotClient
from agents.models import LeRobotPolicy

Defining the Senses and Actuators#

Unlike purely digital agents, a VLA agent needs to be firmly grounded in its physical body. We need to define the ROS topics that represent the robot’s state (proprioception), its vision (eyes), and its actions (motor commands).

In this example, we are working with an so101 arm setup that requires two camera angles, so we define two camera inputs alongside the robot’s joint states.

from agents.ros import Topic

# 1. Proprioception: The current angle of the robot's joints
state = Topic(name="/isaac_joint_states", msg_type="JointState")

# 2. Vision: The agent's eyes
camera1 = Topic(name="/front_camera/image_raw", msg_type="Image")
camera2 = Topic(name="/wrist_camera/image_raw", msg_type="Image")

# 3. Action: Where the VLA will publish command outputs
joints_action = Topic(name="/isaac_joint_command", msg_type="JointState")
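
Before wiring these topics into the component, it can help to confirm that your simulation or driver is actually publishing on them. The checks below use the standard ROS 2 CLI with the topic names defined above:

ros2 topic echo /isaac_joint_states --once
ros2 topic hz /front_camera/image_raw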

Setting up the Policy#

To drive our VLA component, we need a robot policy. EmbodiedAgents provides the LeRobotPolicy class, which interfaces seamlessly with models trained with LeRobot and hosted on the HuggingFace Hub.

We will use a finetuned SmolVLA model, a lightweight VLA policy trained by the LeRobot team and finetuned on the simulation scenario set up above. We also need to provide a dataset_info_file. This is required because the VLA needs to know the statistical distribution of the training data (normalization stats) to correctly interpret the robot’s raw inputs. This file is part of the standard LeRobot dataset format. We will use the info file from the dataset on which our SmolVLA policy was finetuned.
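
If you want to see exactly which observation and action names the policy expects, you can inspect this info file directly. The snippet below is an optional sketch that assumes the LeRobot dataset format lists them under a "features" field (check the file contents if the key differs):

import json, urllib.request

INFO_URL = "https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange/resolve/main/meta/info.json"
with urllib.request.urlopen(INFO_URL) as response:
    info = json.load(response)

# Print the observation/action feature names recorded in the dataset metadata
print(list(info.get("features", {}).keys()))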

Important

In order to use the LeRobotClient you will need extra dependencies, which can be installed as follows:

pip install grpcio protobuf
pip install torch --index-url https://download.pytorch.org/whl/cpu  # a lightweight CPU-only build of torch (recommended)

# Specify the LeRobot Policy to use
policy = LeRobotPolicy(
    name="my_policy",
    policy_type="smolvla",
    checkpoint="aleph-ra/smolvla_finetune_pick_orange_20000",
    dataset_info_file="https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange/resolve/main/meta/info.json",
)

# Create the client
client = LeRobotClient(model=policy)

Note

The policy_type parameter supports various architectures including diffusion, act, pi0, and smolvla. Ensure this matches the architecture of your checkpoint.
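
For example, if your checkpoint were an ACT policy instead of SmolVLA, the setup would look like the following sketch (the checkpoint and dataset values here are placeholders, not real artifacts):

act_policy = LeRobotPolicy(
    name="my_act_policy",
    policy_type="act",  # must match the architecture the checkpoint was trained with
    checkpoint="<your-hf-user>/<your-act-checkpoint>",
    dataset_info_file="<path-or-url-to-your-dataset>/meta/info.json",
)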

VLA Configuration#

This is the most critical step. Pre-trained VLA models expect inputs to be named exactly as they were in the training dataset (e.g., “shoulder_pan.pos”). However, your robot’s URDF likely uses different names (e.g., “Rotation” or “joint_1”).

We use the VLAConfig to create a mapping layer that translates your robot’s specific hardware signals into the language the model understands.

  1. Joint Mapping: Map dataset keys to your ROS joint names.

  2. Camera Mapping: Map dataset camera names to your ROS image topics.

  3. Safety Limits: Provide the URDF file so the component knows the physical joint limits and can cap actions safely.

from agents.config import VLAConfig

# Map dataset names (keys) -> Robot URDF names (values)
joints_map = {
    "shoulder_pan.pos": "Rotation",
    "shoulder_lift.pos": "Pitch",
    "elbow_flex.pos": "Elbow",
    "wrist_flex.pos": "Wrist_Pitch",
    "wrist_roll.pos": "Wrist_Roll",
    "gripper.pos": "Jaw",
}

# Map dataset camera names (keys) -> ROS Topics (values)
camera_map = {"front": camera1, "wrist": camera2}

config = VLAConfig(
    observation_sending_rate=3, # Hz: How often we infer
    action_sending_rate=3,      # Hz: How often we publish commands
    joint_names_map=joints_map,
    camera_inputs_map=camera_map,
    # URDF is required for safety capping and joint limit verification
    robot_urdf_file="./so101_new_calib.urdf"
)
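
As an optional sanity check before launching, you can verify that every joint name in joints_map actually exists in the URDF to catch typos early. This sketch uses only the Python standard library and the URDF path configured above:

import xml.etree.ElementTree as ET

urdf_joints = {joint.get("name") for joint in ET.parse("./so101_new_calib.urdf").getroot().iter("joint")}
missing = [name for name in joints_map.values() if name not in urdf_joints]
if missing:
    print(f"Joint names not found in URDF: {missing}")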

Warning

If the joint_names_map is incomplete, the component will raise an error during initialization.

The VLA Component#

Now we assemble the component. The VLA component acts as a ROS2 Action Server. It creates a feedback loop: it ingests the state and images, processes them through the LeRobotClient, and publishes the resulting actions to the joints_action topic.

We also define a termination trigger. Since VLA tasks (like picking up an object) are finite, we can tell the component to stop after a specific number of timesteps.

Note

The termination trigger can be timesteps, keyboard, or event. An event trigger can be based on a topic published by another component observing the scene, for example a VLM component that periodically asks itself a question using a FixedInput. Check out the following tutorial.

from agents.components import VLA

vla = VLA(
    inputs=[state, camera1, camera2],
    outputs=[joints_action],
    model_client=client,
    config=config,
    component_name="vla_with_smolvla",
)

# Attach the stop trigger
vla.set_termination_trigger("timesteps", max_timesteps=50)

Launching the Component#

from agents.ros import Launcher

launcher = Launcher()
launcher.add_pkg(components=[vla])
launcher.bringup()

Now we can send our pick-and-place command to the component. Since the VLA component acts as a ROS2 Action Server, we can trigger it directly from the terminal using the standard ros2 action CLI.

Open a new terminal, source your workspace, and send the goal (the natural language instruction) to the component. The action server endpoint defaults to component_name/action_name.

ros2 action send_goal /vla_with_smolvla/vision_language_action automatika_embodied_agents/action/VisionLanguageAction "{task: 'pick up the oranges and place them in the bowl'}"
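
If the goal is not accepted, first confirm that the action server is up, for example:

ros2 action list
# expect an entry like /vla_with_smolvla/vision_language_action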

Note

The task string is the natural language instruction that the VLA model conditions its actions on. Ensure this instruction matches the distribution of prompts used during the training of the model (e.g., “pick orange”, “put orange in bin”).

And there you have it! You have successfully configured an end-to-end VLA agent. The complete code is available below.

Vision Language Action Agent#
from agents.components import VLA
from agents.config import VLAConfig
from agents.clients import LeRobotClient
from agents.models import LeRobotPolicy
from agents.ros import Topic, Launcher
from agents.events import OnEqual

# --- Define Topics ---
state = Topic(name="/isaac_joint_states", msg_type="JointState")
camera1 = Topic(name="/front_camera/image_raw", msg_type="Image")
camera2 = Topic(name="/wrist_camera/image_raw", msg_type="Image")
joints_action = Topic(name="/isaac_joint_command", msg_type="JointState")

# --- Setup Policy (The Brain) ---
policy = LeRobotPolicy(
    name="my_policy",
    policy_type="smolvla",
    checkpoint="aleph-ra/smolvla_finetune_pick_orange_20000",
    dataset_info_file="https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange/resolve/main/meta/info.json",
)

client = LeRobotClient(model=policy)

# --- Configure Mapping (The Nervous System) ---
# Map dataset names -> robot URDF names
joints_map = {
    "shoulder_pan.pos": "Rotation",
    "shoulder_lift.pos": "Pitch",
    "elbow_flex.pos": "Elbow",
    "wrist_flex.pos": "Wrist_Pitch",
    "wrist_roll.pos": "Wrist_Roll",
    "gripper.pos": "Jaw",
}

# Map dataset cameras -> ROS topics
camera_map = {"front": camera1, "wrist": camera2}

config = VLAConfig(
    observation_sending_rate=3,
    action_sending_rate=3,
    joint_names_map=joints_map,
    camera_inputs_map=camera_map,
    # Ensure you provide a valid path to your robot's URDF
    robot_urdf_file="./so101_new_calib.urdf"
)

# --- Initialize Component ---
vla = VLA(
    inputs=[state, camera1, camera2],
    outputs=[joints_action],
    model_client=client,
    config=config,
    component_name="vla_with_smolvla",
)

# Set the component to stop after a certain number of timesteps
vla.set_termination_trigger('timesteps', max_timesteps=50)

# --- Launch ---
launcher = Launcher()
launcher.add_pkg(components=[vla])
launcher.bringup()