Prompt engineering for LLMs/MLLMs using vision models#

In this example we will use the output of an object detection component to enrich the prompt of an MLLM component. Let us start by importing the components.

from agents.components import Vision, MLLM

Setting up the Object Detection Component#

For object detection and tracking, ROS Agents provides a unified Vision component. This component takes as input an image topic published by a camera device onboard our robot. The output of this component can be a detections topic in case of object detection or a trackings topic in case of object tracking. In this example we will use a detections topic.

from agents.ros import Topic

# Define the image input topic
image0 = Topic(name="image_raw", msg_type="Image")
# Create a detection topic
detections_topic = Topic(name="detections", msg_type="Detections")
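If we wanted to track objects across frames instead, we would swap the output for a tracking topic. A minimal sketch (the msg_type name "Trackings" is an assumption here, not something defined in this example):

# Hypothetical alternative output topic for object tracking
trackings_topic = Topic(name="trackings", msg_type="Trackings")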

Additionally, the component requires a model client with an object detection model. We will use the RoboML RESP client along with VisionModel, a convenient model class made available in ROS Agents for initializing any of the vision models from the open-source mmdetection library. We will specify the model we want to use by setting the checkpoint attribute.

Note

Learn about setting up RoboML with vision here.

See also

Check out all available mmdetection models and their benchmarking results in the mmdetection model zoo.

from agents.models import VisionModel
from agents.clients.roboml import RESPModelClient, HTTPModelClient
from agents.config import VisionConfig

# Add an object detection model
object_detection = VisionModel(name="object_detection",
                               checkpoint="dino-4scale_r50_8xb2-12e_coco")
roboml_detection = RESPModelClient(object_detection)

# Initialize the Vision component
detection_config = VisionConfig(threshold=0.5)
vision = Vision(
    inputs=[image0],
    outputs=[detections_topic],
    trigger=image0,
    config=detection_config,
    model_client=roboml_detection,
    component_name="detection_component",
)
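Any other detection model from the mmdetection model zoo can be swapped in by changing the checkpoint attribute. For instance (the checkpoint name below is only an illustration; confirm the exact string in the model zoo):

# Sketch: use a different mmdetection checkpoint for detection
object_detection = VisionModel(name="object_detection",
                               checkpoint="rtmdet_tiny_8xb32-300e_coco")
roboml_detection = RESPModelClient(object_detection)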

Tip

Notice that we passed an optional config to the component. Component configs can be used to set up various parameters in the component. If the component calls an ML model, then inference parameters for that model can also be set in the component config.
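For example, a stricter detection threshold can be set through the same config object before passing it to the component (threshold is the only inference parameter used in this example; any other available fields depend on the VisionConfig definition):

# Sketch: a stricter detection threshold set through the component config
strict_config = VisionConfig(threshold=0.8)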

Setting up the MLLM Component#

For the MLLM component, we will provide an additional text input topic, which will listen for our queries. The output of the component will be another text topic. For this example we will use the RoboML HTTP client with Idefics2, a multimodal LLM from the good folks at HuggingFace.

from agents.models import Idefics2

# Define MLLM input and output text topics
text_query = Topic(name="text0", msg_type="String")
text_answer = Topic(name="text1", msg_type="String")

# Define a model client (working with roboml in this case)
idefics = Idefics2(name="idefics_model")
idefics_client = HTTPModelClient(idefics)

# Define an MLLM component
# We can pass in the detections topic, which we defined previously, directly as an optional
# input to the MLLM component in addition to its other required inputs
mllm = MLLM(
    inputs=[text_query, image0, detections_topic],
    outputs=[text_answer],
    model_client=idefics_client,
    trigger=text_query,
)

Next we will set up a component-level prompt to ensure that our text query and the output of the detections topic are sent to the model as we intend. We will do this by passing a jinja2 template to the set_component_prompt function.

mllm.set_component_prompt(
    template="""Imagine you are a robot.
    This image has following items: {{ detections }}.
    Answer the following about this image: {{ text0 }}"""
)
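Since set_component_prompt takes a regular jinja2 template, standard jinja2 syntax such as conditionals can also be used. A minimal sketch, using only the topics defined above:

mllm.set_component_prompt(
    template="""Imagine you are a robot.
    {% if detections %}This image has following items: {{ detections }}.{% endif %}
    Answer the following about this image: {{ text0 }}"""
)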

Caution

The names of the topics used in the jinja2 template are the same as the name parameters set when creating the Topic objects.
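For instance, if the query topic had been created with a different name, the template variable would have to change with it (a hypothetical sketch, not part of this example):

# Hypothetical: a query topic named "question" would be referenced
# as {{ question }} in the template (and would also need to be passed
# in the MLLM component's inputs)
question_topic = Topic(name="question", msg_type="String")
mllm.set_component_prompt(
    template="""This image has following items: {{ detections }}.
    {{ question }}"""
)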

Launching the Components#

Finally we will launch our components as we did in the previous example.

from agents.ros import Launcher

# Launch the components
launcher = Launcher()
launcher.add_pkg(
    components=[vision, mllm]
)
launcher.bringup()

And there we have it. The complete code for this example is provided below.

Prompt Engineering with Object Detection#
from agents.components import Vision, MLLM
from agents.models import VisionModel, Idefics2
from agents.clients.roboml import RESPModelClient, HTTPModelClient
from agents.config import VisionConfig
from agents.ros import Topic, Launcher

image0 = Topic(name="image_raw", msg_type="Image")
detections_topic = Topic(name="detections", msg_type="Detections")

object_detection = VisionModel(name="object_detection",
                               checkpoint="dino-4scale_r50_8xb2-12e_coco")
roboml_detection = RESPModelClient(object_detection)

detection_config = VisionConfig(threshold=0.5)
vision = Vision(
    inputs=[image0],
    outputs=[detections_topic],
    trigger=image0,
    config=detection_config,
    model_client=roboml_detection,
    component_name="detection_component",
)

text_query = Topic(name="text0", msg_type="String")
text_answer = Topic(name="text1", msg_type="String")

idefics = Idefics2(name="idefics_model")
idefics_client = HTTPModelClient(idefics)

mllm = MLLM(
    inputs=[text_query, image0, detections_topic],
    outputs=[text_answer],
    model_client=idefics_client,
    trigger=text_query,
)

mllm.set_component_prompt(
    template="""Imagine you are a robot.
    This image has following items: {{ detections }}.
    Answer the following about this image: {{ text0 }}"""
)

launcher = Launcher()
launcher.add_pkg(
    components=[vision, mllm]
)
launcher.bringup()