Prompt engineering for LLMs/MLLMs using vision models#

In this example we will use the output of an object detection component to enrich the prompt of an MLLM component. Let us start by importing the components.

from agents.components import Vision, MLLM

Setting up the Object Detection Component#

For object detection and tracking, ROS Agents provides a unified Vision component. This component takes as input an image topic published by a camera device onboard our robot. The output of this component can be a detections topic in case of object detection or a trackings topic in case of object tracking. In this example we will use a detections topic.

from agents.ros import Topic

# Define the image input topic
image0 = Topic(name="image_raw", msg_type="Image")
# Create a detection topic
detections_topic = Topic(name="detections", msg_type="Detections")
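The same component can also perform object tracking, in which case it publishes on a trackings topic instead. A minimal sketch of that alternative output topic, assuming the tracking message type is named "Trackings" by analogy with "Detections" above:

from agents.ros import Topic

# Hypothetical alternative output: a trackings topic for object tracking
trackings_topic = Topic(name="trackings", msg_type="Trackings")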

Additionally, the component requires a model client with an object detection model. We will use the RESP client for RoboML along with VisionModel, a convenient model class made available in ROS Agents for initializing any vision model available in the open-source mmdetection library. We will select the model we want to use by setting the checkpoint attribute.

Note

Learn about setting up RoboML with vision here.

See also

Check out all available mmdetection models and their benchmarking results in the mmdetection model zoo.

from agents.models import VisionModel
from agents.clients.roboml import RESPModelClient, HTTPModelClient
from agents.config import VisionConfig

# Add an object detection model
object_detection = VisionModel(name="object_detection",
                               checkpoint="dino-4scale_r50_8xb2-12e_coco")
roboml_detection = RESPModelClient(object_detection)

# Initialize the Vision component
detection_config = VisionConfig(threshold=0.5)
vision = Vision(
    inputs=[image0],
    outputs=[detections_topic],
    trigger=image0,
    config=detection_config,
    model_client=roboml_detection,
    component_name="detection_component",
)
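Swapping in a different detector is just a matter of changing the checkpoint string. As an illustrative sketch (assuming checkpoint names follow the mmdetection config naming, e.g. the RTMDet-tiny config):

# Hypothetical alternative: a lighter detector from the mmdetection model zoo
object_detection_small = VisionModel(name="object_detection_small",
                                     checkpoint="rtmdet_tiny_8xb32-300e_coco")
roboml_detection_small = RESPModelClient(object_detection_small)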

Tip

Notice that we passed an optional config to the component. Component configs can be used to set up various parameters in the component. If the component calls an ML model, then inference parameters for the model can be set in the component config.

Setting up the MLLM Component#

For the MLLM component, we will provide an additional text input topic, which will listen to our queries. The output of the component will be another text topic. We will use the RoboML HTTP client with the multimodal LLM Idefics2 by the good folks at HuggingFace for this example.

from agents.models import Idefics2

# Define MLLM input and output text topics
text_query = Topic(name="text0", msg_type="String")
text_answer = Topic(name="text1", msg_type="String")

# Define a model client (working with roboml in this case)
idefics = Idefics2(name="idefics_model")
idefics_client = HTTPModelClient(idefics)

# Define an MLLM component
# We can pass in the detections topic, which we defined previously, directly as an optional input
# to the MLLM component in addition to its other required inputs
mllm = MLLM(
    inputs=[text_query, image0, detections_topic],
    outputs=[text_answer],
    model_client=idefics_client,
    trigger=text_query,
)

Next, we will set up a component-level prompt to ensure that our text query and the output of the detections topic are sent to the model as we intend. We will do this by passing a jinja2 template to the set_component_prompt function.

mllm.set_component_prompt(
    template="""Imagine you are a robot.
    This image has the following items: {{ detections }}.
    Answer the following about this image: {{ text0 }}"""
)

Caution

The names of the topics used in the jinja2 template are the same as the name parameters set when creating the Topic objects.
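To make the substitution concrete, here is a standalone jinja2 rendering of the same template with made-up values. The actual rendering is performed internally by the component at runtime; the detections and query strings below are purely illustrative:

from jinja2 import Template

# Render the component prompt template outside the component, just to inspect it
template = Template(
    """Imagine you are a robot.
    This image has the following items: {{ detections }}.
    Answer the following about this image: {{ text0 }}"""
)

# Made-up stand-ins for what the detections and text0 topics might carry
print(template.render(detections="person, chair, laptop",
                      text0="How many people are in the room?"))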

Launching the Components#

Finally, we will launch our components as we did in the previous example.

from agents.ros import Launcher

# Launch the components
launcher = Launcher()
launcher.add_pkg(
    components=[vision, mllm],
    activate_all_components_on_start=True)
launcher.bringup()
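Once the components are running, you can exercise the pipeline by publishing a question on the query topic (in a separate process or terminal). Below is a minimal test sketch using rclpy, under the assumption that the String topic is carried as a standard std_msgs/msg/String message (the exact underlying message type is an assumption here):

import time

import rclpy
from std_msgs.msg import String

# Assumes the "text0" topic carries std_msgs/msg/String messages
rclpy.init()
node = rclpy.create_node("query_publisher")
publisher = node.create_publisher(String, "text0", 10)
time.sleep(1.0)  # give discovery a moment before publishing
publisher.publish(String(data="What objects do you see in front of you?"))
node.destroy_node()
rclpy.shutdown()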

And there we have it. The complete code for this example is provided below.

Prompt Engineering with Object Detection#
from agents.components import Vision, MLLM
from agents.models import VisionModel, Idefics2
from agents.clients.roboml import RESPModelClient, HTTPModelClient
from agents.config import VisionConfig
from agents.ros import Topic, Launcher

image0 = Topic(name="image_raw", msg_type="Image")
detections_topic = Topic(name="detections", msg_type="Detections")

object_detection = VisionModel(name="object_detection",
                               checkpoint="dino-4scale_r50_8xb2-12e_coco")
roboml_detection = RESPModelClient(object_detection)

detection_config = VisionConfig(threshold=0.5)
vision = Vision(
    inputs=[image0],
    outputs=[detections_topic],
    trigger=image0,
    config=detection_config,
    model_client=roboml_detection,
    component_name="detection_component",
)

text_query = Topic(name="text0", msg_type="String")
text_answer = Topic(name="text1", msg_type="String")

idefics = Idefics2(name="idefics_model")
idefics_client = HTTPModelClient(idefics)

mllm = MLLM(
    inputs=[text_query, image0, detections_topic],
    outputs=[text_answer],
    model_client=idefics_client,
    trigger=text_query,
)

mllm.set_component_prompt(
    template="""Imagine you are a robot.
    This image has the following items: {{ detections }}.
    Answer the following about this image: {{ text0 }}"""
)
launcher = Launcher()
launcher.add_pkg(
    components=[vision, mllm],
    activate_all_components_on_start=True)
launcher.bringup()