agents.models#
The following model specification classes define a common interface for initialization parameters of ML models across supported model serving platforms.
Module Contents#
Classes#
- GenericLLM: A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.
- GenericMLLM: A generic Multimodal LLM configuration for OpenAI-compatible APIs.
- GenericTTS: A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.
- GenericSTT: A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.
- TransformersLLM: An LLM model that needs to be initialized with any LLM checkpoint available on HuggingFace transformers. Can be used with a roboml client.
- TransformersMLLM: An MLLM model that needs to be initialized with any MLLM checkpoint available on HuggingFace transformers. Can be used with a roboml client.
- OllamaModel: An Ollama model that needs to be initialized with an ollama tag as checkpoint.
- Whisper: An automatic speech recognition (ASR) system by OpenAI trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
- SpeechT5: A model for text-to-speech synthesis developed by Microsoft.
- Bark: A model for text-to-speech synthesis developed by SunoAI.
- MeloTTS: A model for text-to-speech synthesis developed by MyShell AI using the MeloTTS engine.
- VisionModel: Object Detection Model with Optional Tracking.
API#
- class agents.models.GenericLLM#
Bases: agents.models.Model
A generic LLM configuration for OpenAI-compatible /v1/chat/completions APIs.
This class supports any model served via an OpenAI-compatible endpoint (e.g., vLLM, LMDeploy, DeepSeek, Groq, or OpenAI itself).
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The model identifier on the remote server (e.g., “gpt-4o”, “meta-llama/Llama-3-70b”). For OpenAI models, consult: https://platform.openai.com/docs/models
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
options (dict, optional) –
Optional dictionary to configure default inference behavior. Options that conflict with component config options (such as max_tokens and temperature) are overridden by values set in the component config. Supported keys match standard OpenAI API parameters:
temperature (float): Sampling temperature (0-2).
top_p (float): Nucleus sampling probability.
max_tokens (int): Max tokens to generate.
presence_penalty (float): Penalty for new tokens (-2.0 to 2.0).
frequency_penalty (float): Penalty for frequent tokens (-2.0 to 2.0).
stop (str or list): Stop sequences.
seed (int): Random seed for deterministic sampling.
Example usage:
gpt4 = GenericLLM(name='gpt4', checkpoint="gpt-4o", options={"temperature": 0.7, "max_tokens": 500})
- get_init_params() Dict#
Get init params from models
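A minimal sketch of how the options dict composes with the common model interface (assuming the serving endpoint is configured on the client side; the exact contents of the dict returned by get_init_params() are illustrative):
from agents.models import GenericLLM

# Default inference behavior; component config values (e.g. max_tokens,
# temperature) take precedence over these at runtime.
llm = GenericLLM(
    name="assistant_llm",
    checkpoint="gpt-4o",
    init_timeout=30,
    options={"temperature": 0.2, "top_p": 0.9, "max_tokens": 256},
)

# Inspect the resolved initialization parameters.
print(llm.get_init_params())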
- class agents.models.GenericMLLM#
Bases: agents.models.GenericLLM
A generic Multimodal LLM configuration for OpenAI-compatible APIs.
Use this for models that accept image/audio inputs alongside text (e.g., GPT-4o, Claude 3.5 Sonnet via wrapper, Gemini via OpenAI adapter).
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The model identifier. Consult provider documentation.
options (dict, optional) – Optional dictionary for default inference parameters (see GenericLLM).
Example usage:
gpt4_vision = GenericMLLM(name='gpt4v', checkpoint="gpt-4o")
- get_init_params() Dict#
Get init params from models
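A short sketch showing that the inherited options dict applies to multimodal models as well (the checkpoint below is only an example of a vision-capable model behind an OpenAI-compatible endpoint):
from agents.models import GenericMLLM

vlm = GenericMLLM(
    name="scene_describer",
    checkpoint="gpt-4o",  # example checkpoint; consult your provider's docs
    options={"temperature": 0.0, "max_tokens": 300},
)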
- class agents.models.GenericTTS#
Bases: agents.models.Model
A generic Text-to-Speech model for OpenAI-compatible /v1/audio/speech APIs.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The model identifier (e.g., “tts-1”, “tts-1-hd”). For details: https://platform.openai.com/docs/models/tts
voice (str) – The voice ID to use. OpenAI standard voices: ‘alloy’, ‘echo’, ‘fable’, ‘onyx’, ‘nova’, ‘shimmer’. Other providers may have different IDs.
speed (float) – The speed of the generated audio. Select a value from 0.25 to 4.0. Default is 1.0.
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
tts = GenericTTS(name='openai_tts', checkpoint="tts-1-hd", voice="nova", speed=1.2)
- get_init_params() Dict#
Get init params from models
- class agents.models.GenericSTT#
Bases: agents.models.Model
A generic Speech-to-Text model for OpenAI-compatible /v1/audio/transcriptions APIs.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The model identifier (e.g., “whisper-1”). For details: https://platform.openai.com/docs/models/whisper
language (str, optional) – The language of the input audio (ISO-639-1 format, e.g., ‘en’, ‘fr’). Improves accuracy if known. Default is None (auto-detect).
temperature (float) – The sampling temperature (0-1). Lower values are more deterministic. Default is 0.
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
stt = GenericSTT(name='openai_stt', checkpoint="whisper-1", language="en", temperature=0.2)
- get_init_params() Dict#
Get init params from models
- class agents.models.TransformersLLM#
Bases: agents.models.LLM
An LLM model that needs to be initialized with any LLM checkpoint available on HuggingFace transformers. This model can be used with a roboml client.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “microsoft/Phi-3-mini-4k-instruct”. For available checkpoints consult HuggingFace LLM Models
quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
llm = TransformersLLM(name='llm', checkpoint="meta-llama/Meta-Llama-3.1-8B-Instruct")
- get_init_params() Dict#
Get init params from models
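A minimal sketch of choosing a quantization scheme (assuming the model is served through a roboml client configured elsewhere; the checkpoint is the documented default and can be swapped for any HuggingFace LLM checkpoint):
from agents.models import TransformersLLM

# 4-bit quantization (the default) trades some accuracy for memory.
llm_4bit = TransformersLLM(name="llm_4bit", checkpoint="microsoft/Phi-3-mini-4k-instruct")

# Full-precision weights; larger models may need a longer init timeout.
llm_full = TransformersLLM(
    name="llm_full",
    checkpoint="microsoft/Phi-3-mini-4k-instruct",
    quantization=None,
    init_timeout=600,
)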
- class agents.models.TransformersMLLM#
Bases: agents.models.TransformersLLM
An MLLM model that needs to be initialized with any MLLM checkpoint available on HuggingFace transformers. This model can be used with a roboml client.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “HuggingFaceM4/idefics2-8b”. For available checkpoints consult HuggingFace Image-Text to Text Models
quantization (str or None) – The quantization scheme used by the model. Can be one of “4bit”, “8bit” or None (default is “4bit”).
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
mllm = TransformersMLLM(name='mllm', checkpoint="HuggingFaceM4/idefics2-8b")
- get_init_params() Dict#
Get init params from models
- class agents.models.OllamaModel#
Bases: agents.models.LLM
An Ollama model that needs to be initialized with an Ollama tag as its checkpoint.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. For available checkpoints consult Ollama Models
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
options (dict, optional) –
Optional dictionary to configure generation behavior. Options that conflict with component config options (such as num_predict and temperature) are overridden by values set in the component config. Only the following keys, with the value types shown, are allowed. For details check the Ollama API documentation:
num_keep: int
seed: int
num_predict: int
top_k: int
top_p: float
min_p: float
typical_p: float
repeat_last_n: int
temperature: float
repeat_penalty: float
presence_penalty: float
frequency_penalty: float
penalize_newline: bool
stop: list of strings
numa: bool
num_ctx: int
num_batch: int
num_gpu: int
main_gpu: int
use_mmap: bool
num_thread: int
Example usage:
llm = OllamaModel(name='ollama1', checkpoint="gemma2:latest", options={"temperature": 0.7, "num_predict": 50})
- get_init_params() Dict#
Get init params from models
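A slightly fuller sketch of the allowed option keys (values are illustrative, and the tag must already be pulled on the serving Ollama instance):
from agents.models import OllamaModel

llm = OllamaModel(
    name="planner_llm",
    checkpoint="llama3.2:3b",  # any valid Ollama tag
    options={
        "temperature": 0.3,
        "num_predict": 128,    # overridden by component config if set there
        "num_ctx": 4096,
        "seed": 42,
        "stop": ["</answer>"],
    },
)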
- class agents.models.Whisper#
Bases: agents.models.Model
Whisper is an automatic speech recognition (ASR) system by OpenAI trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Details
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – Size of the model to use (tiny, tiny.en, base, base.en, small, small.en, distil-small.en, medium, medium.en, distil-medium.en, large-v1, large-v2, large-v3, large, distil-large-v2, distil-large-v3, large-v3-turbo, or turbo). For more information check here
compute_type (str or None) – The compute type used by the model. Can be one of “int8”, “fp16”, “fp32”, or None (default is “int8”).
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
whisper = Whisper(name='s2t', checkpoint="small") # Initialize with a different checkpoint
- get_init_params() Dict#
Get init params from models
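A minimal sketch contrasting checkpoint size and compute type (using "fp16" with a large checkpoint assumes a GPU is available on the serving side):
from agents.models import Whisper

# Small English-only model with the default int8 compute type: fast and light.
whisper_fast = Whisper(name="s2t_fast", checkpoint="small.en")

# Larger multilingual model in half precision for higher accuracy.
whisper_accurate = Whisper(
    name="s2t_accurate",
    checkpoint="large-v3",
    compute_type="fp16",
)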
- class agents.models.SpeechT5#
Bases: agents.models.Model
A model for text-to-speech synthesis developed by Microsoft. Details
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. Default is “microsoft/speecht5_tts”.
voice – The voice to use for synthesis. Can be one of “awb”, “bdl”, “clb”, “jmk”, “ksp”, “rms”, or “slt”. Default is “clb”.
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
speecht5 = SpeechT5(name='t2s1', voice="bdl") # Initialize with a different voice
- get_init_params() Dict#
Get init params from models
- class agents.models.Bark#
Bases: agents.models.Model
A model for text-to-speech synthesis developed by SunoAI. Details
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. Bark checkpoints on HuggingFace. Default is “suno/bark-small”.
attn_implementation – The attention implementation to use for the model. Default is “flash_attention_2”.
voice – The voice to use for synthesis. More choices are available here. Default is “v2/en_speaker_6”.
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
bark = Bark(name='t2s2', voice="v2/en_speaker_1") # Initialize with a different voice
- get_init_params() Dict#
Get init params from models
- class agents.models.MeloTTS#
Bases: agents.models.Model
A model for text-to-speech synthesis developed by MyShell AI using the MeloTTS engine.
- Parameters:
name (str) – An arbitrary name given to the model.
language (str) – The language for speech synthesis. Supported values: [“EN”, “ES”, “FR”, “ZH”, “JP”, “KR”]. Default is “EN”.
speaker_id (str) – The speaker ID for the chosen language. Default is “EN-US”. For details check here
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
melotts = MeloTTS(name='melo1', language='JP', speaker_id='JP-1')
- get_init_params() Dict#
Get init params from models
- class agents.models.VisionModel#
Bases: agents.models.Model
Object Detection Model with Optional Tracking.
This vision model provides a flexible framework for object detection and tracking using the mmdet framework. It can be used as a standalone detector or as a tracker to follow detected objects over time, and it can be initialized with any checkpoint available in the mmdet framework.
- Parameters:
name (str) – An arbitrary name given to the model.
checkpoint (str) – The name of the pre-trained model’s checkpoint. All available checkpoints in the mmdet framework. Default is “dino-4scale_r50_8xb2-12e_coco”.
cache_dir (str) – The directory where downloaded models are cached. Default is ‘mmdet’.
setup_trackers (bool) – Whether to set up trackers using norfair or not. Default is False.
tracking_distance_function (str) – The function used to calculate the distance between detected objects. This can be any distance metric string available in scipy.spatial.distance.cdist. Default is “euclidean”.
tracking_distance_threshold (int) – The threshold for determining whether two detections in consecutive frames are close enough to be considered the same object. Default is 30, with a minimum value of 1.
deploy_tensorrt (bool) – Deploy the vision model using NVIDIA TensorRT. To utilize this feature with roboml, check out the instructions here. Default is False.
_num_trackers (int) – The number of trackers to use. This number depends on the number of input image streams being given to the component. It is set automatically if setup_trackers is True.
init_timeout (int, optional) – The timeout in seconds for the initialization process. Defaults to None.
Example usage:
model = VisionModel(name='detection1', setup_trackers=True, tracking_distance_threshold=20) # Initialize the model with tracking enabled
- get_init_params() Dict#
Get init params from models
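A minimal sketch of the two modes, detection only and detection with tracking (the checkpoint is the documented default from the mmdet model zoo; the number of trackers is derived automatically from the component's input image streams):
from agents.models import VisionModel

# Standalone detector using the default mmdet checkpoint.
detector = VisionModel(name="detector", checkpoint="dino-4scale_r50_8xb2-12e_coco")

# Detector with norfair tracking enabled.
tracker = VisionModel(
    name="tracker",
    setup_trackers=True,
    tracking_distance_function="euclidean",
    tracking_distance_threshold=20,
)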