Types & Helpers
image_block
infermesh.image_block
image_block(
source: str | Path | bytes,
*,
detail: Literal["auto", "low", "high"] | None = None,
mime_type: str | None = None,
) -> dict[str, Any]
Build an image content block for a multimodal chat message.
For URL-based images no helper is needed — pass the dict directly::
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
Use this function when the image is a local file or raw bytes that must be base64-encoded before sending to the provider. LLM servers cannot read the caller's filesystem, so local content must be supplied either as a publicly reachable URL or as a base64 data URL embedded in the request body.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` or `Path` or `bytes` | The image source: a plain `str` is treated as a URL, a `Path` as a local file to read and encode, and `bytes` as raw image data. | required |
| `detail` | `'auto'`, `'low'`, or `'high'` | OpenAI vision detail level controlling how many image tokens are consumed. `None` defers to the provider default (`"auto"`). | `None` |
| `mime_type` | `str` or `None` | MIME type of the image data, used when encoding `Path` or `bytes` sources. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | An image content block ready to include in a chat message's `content` list. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the `source` or other arguments are invalid. |
| `FileNotFoundError` | If `source` is a path to a file that does not exist. |
Examples:
URL (plain string is fine for URLs):
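A minimal sketch, reusing the placeholder URL from above (`client` is assumed to be an initialized client):

```python
# No helper needed for URL images: pass the image_url block directly.
msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}
```

Send it with `client.generate([msg])`, exactly as in the local-file example.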
Local file — pass a Path, not a plain string:
>>> msg = {
... "role": "user",
... "content": [
... {"type": "text", "text": "What's in this image?"},
... image_block(Path("photo.jpg")),
... image_block(Path("diagram.png"), detail="high"),
... ],
... }
>>> result = client.generate([msg])
Raw bytes:
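`image_block` is the natural fit here; as an illustration of what such a block contains, this sketch base64-encodes bytes into a data URL by hand (the JPEG bytes and exact block shape are illustrative assumptions, not the helper's guaranteed output):

```python
import base64

raw_jpeg = b"\xff\xd8\xff\xe0"  # stand-in for real JPEG bytes from disk or network
block = {
    "type": "image_url",
    "image_url": {
        # Data URL: the provider receives the image inline; no public URL needed.
        "url": "data:image/jpeg;base64," + base64.b64encode(raw_jpeg).decode("ascii")
    },
}
```

With the helper this collapses to `image_block(raw_jpeg, mime_type="image/jpeg")`.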
Input Types
infermesh.types.ChatMessage
module-attribute
A single chat message dict.
Must contain at least a "role" key and a "content" key, e.g.
{"role": "user", "content": "Hello!"}.
For multimodal (VLM) inputs the "content" value may be a list of content
blocks instead of a plain string. Text blocks have the form
{"type": "text", "text": "..."}; image blocks use
{"type": "image_url", "image_url": {"url": "https://..."}}.
Use image_block to build image blocks from local files or raw bytes.
infermesh.types.ChatInput
module-attribute
A full chat conversation: an ordered list of ChatMessage dicts.
Supports both plain-text and multimodal (VLM) messages; see ChatMessage and
image_block.
infermesh.types.ResponsesInput
module-attribute
Input for the "responses" endpoint.
Contains an "input" key (required) and an optional "instructions"
key for a system prompt.
infermesh.types.GenerateInput
module-attribute
Union of the three accepted generation input formats.
- `str`: plain text; converted to a single user message internally.
- `ChatInput`: a pre-built list of role/content dicts. Supports multimodal messages; see `ChatMessage` and `image_block`.
- `ResponsesInput`: a dict suitable for the `responses` endpoint.
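The `str` shorthand can be pictured as the following normalization (a hypothetical sketch, not the library's actual code):

```python
def as_chat_input(value):
    # Hypothetical: a plain string becomes a one-message conversation;
    # ChatInput lists and ResponsesInput dicts pass through unchanged.
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value
```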
infermesh.types.EmbeddingInput
module-attribute
Accepted embedding input: a single string or a list of strings.
infermesh.types.TranscriptionInput
module-attribute
Accepted transcription input.
- `str` / `pathlib.Path`: path to an audio file on disk; opened and read automatically.
- `bytes`: raw audio bytes.
- `BinaryIO`: any file-like object with a `.read()` method.
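How the three forms reduce to raw audio bytes can be sketched as follows (illustrative, not the library's actual code):

```python
import io
from pathlib import Path

def read_audio_bytes(source):
    """Coerce str/Path, bytes, or a file-like object to raw audio bytes."""
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()  # open and read from disk
    if isinstance(source, bytes):
        return source                     # already raw bytes
    return source.read()                  # any object with a .read() method
```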
infermesh.types.EndpointType
module-attribute
The three supported generation endpoint identifiers.
"chat_completion"(default): standard chat API (/v1/chat/completions)."text_completion": legacy completions API (/v1/completions). Input must be a plain string; LiteLLM'satext_completionis called."responses": OpenAI Responses API (/v1/responses).
Result Types
infermesh.GenerationResult
dataclass
GenerationResult(
model_id: str,
output_text: str,
output_parsed: Any | None = None,
reasoning: str | None = None,
token_usage: TokenUsage | None = None,
finish_reason: str | None = None,
tool_calls: list[ToolCall] | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
cost: float | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of a text-generation request.
Returned by generate,
agenerate, and contained in
BatchResult for *_batch methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `output_text` | `str` | The generated text. | required |
| `output_parsed` | `object` or `None` | The structured result when a `response_format` model is supplied. | `None` |
| `reasoning` | `str` or `None` | Extended chain-of-thought reasoning text, when disclosed by the provider (e.g. certain Anthropic or OpenAI reasoning models). | `None` |
| `token_usage` | `TokenUsage` or `None` | Token-count breakdown. | `None` |
| `finish_reason` | `str` or `None` | The stop condition reported by the provider. Common values are `"stop"`, `"length"`, and `"tool_calls"`. | `None` |
| `tool_calls` | `list[ToolCall]` or `None` | Structured tool calls emitted by the model. | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response object. Useful for accessing provider-specific fields not surfaced by this dataclass. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `cost` | `float` or `None` | Estimated cost in USD, when reported by LiteLLM's cost tracking. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Notes
str(result) returns output_text, so a
GenerationResult can be used directly wherever a string is
expected.
Examples:
Basic generation:
>>> result = client.generate("Summarize backpropagation in one sentence.")
>>> print(result.output_text)
>>> print(f"Cost: ${result.cost:.6f}" if result.cost else "no cost info")
Structured output with a Pydantic model:
>>> from pydantic import BaseModel
>>> class Summary(BaseModel):
... headline: str
... body: str
>>> result = client.generate(
... "Summarize the French Revolution.",
... response_format=Summary,
... )
>>> summary: Summary = result.output_parsed # type: ignore[assignment]
>>> print(summary.headline)
infermesh.EmbeddingResult
dataclass
EmbeddingResult(
model_id: str,
embedding: list[float],
token_usage: TokenUsage | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of an embedding request.
Returned by embed for single-string input and contained in BatchResult for embed_batch calls.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `embedding` | `list[float]` | The dense embedding vector. Its length equals the model's output dimension (e.g. 1536 for OpenAI's `text-embedding-3-small`). | required |
| `token_usage` | `TokenUsage` or `None` | Token-count breakdown. | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response for advanced use cases. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Examples:
>>> import numpy as np
>>> result = client.embed("The quick brown fox jumps over the lazy dog.")
>>> vec = np.array(result.embedding)
>>> print(f"Dim: {vec.shape[0]}, Norm: {np.linalg.norm(vec):.4f}")
infermesh.TranscriptionResult
dataclass
TranscriptionResult(
model_id: str,
text: str,
duration_s: float | None = None,
language: str | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of an audio-transcription request.
Returned by transcribe and atranscribe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `text` | `str` | The transcribed text. | required |
| `duration_s` | `float` or `None` | Duration of the audio clip in seconds, when reported by the provider. | `None` |
| `language` | `str` or `None` | Detected or explicitly requested language code (e.g. `"en"`). | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response for advanced use cases. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Examples:
>>> result = client.transcribe("interview.mp3")
>>> print(result.text)
>>> if result.language:
... print(f"Detected language: {result.language}")
infermesh.BatchResult
dataclass
Bases: Generic[T]
A typed container for the results of a batch request.
Returned by generate_batch, agenerate_batch, embed_batch, aembed_batch, transcribe_batch, and atranscribe_batch.
When return_exceptions=True (the default), a failed item does not
raise and discard the whole batch. Instead, results contains
None at that position and errors holds the exception. Both
lists are always the same length as the input, enabling index-based
correlation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `list[T or None]` | One entry per input item. Successful items have type `T`; failed items are `None`. | required |
| `errors` | `list[BaseException or None]` or `None` | One entry per input item when `return_exceptions=True`: the exception for failed items, `None` for successes. | `None` |
Notes
- `len(batch)` always equals the number of input items.
- Iterating over `batch` yields from `results` (may include `None` values on failure).
- Index access (`batch[i]`) returns `results[i]`.
- To split successes from failures::

    successes = [r for r, e in zip(batch.results, batch.errors or []) if e is None]
    failures = [(i, e) for i, e in enumerate(batch.errors or []) if e is not None]
Examples:
Process a batch tolerating partial failures (default behaviour):
>>> prompts = ["Translate 'cat' to French", "bad-prompt", "What is 42?"]
>>> batch = client.generate_batch(prompts)
>>> for i, (result, error) in enumerate(zip(batch.results, batch.errors or [])):
... if error:
... print(f"[{i}] ERROR: {error}")
... else:
... print(f"[{i}] {result.output_text}")
Opt in to raise-on-first-failure (legacy behaviour):
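A sketch assuming only the `return_exceptions` flag documented above (`client` is any initialized client; `generate_batch_strict` is a hypothetical wrapper, not a library function):

```python
def generate_batch_strict(client, prompts):
    """All-or-nothing batch: with return_exceptions=False, the first failed
    item raises its exception instead of being recorded in batch.errors."""
    try:
        return client.generate_batch(prompts, return_exceptions=False)
    except Exception as exc:
        print(f"batch aborted: {exc!r}")
        raise
```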
__iter__
Iterate over batch items in input order.
Yields:
| Type | Description |
|---|---|
| `T` or `None` | Each item from `results`, in input order; `None` marks a failed item. |
__getitem__
Return the result at index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int` | Zero-based position in the batch. | required |
Returns:
| Type | Description |
|---|---|
| `T` or `None` | `results[index]`: the result at that position, or `None` if that item failed. |
infermesh.TokenUsage
dataclass
TokenUsage(
prompt_tokens: int,
completion_tokens: int,
total_tokens: int,
reasoning_tokens: int | None = None,
)
Token-count information returned by a provider for a single request.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_tokens` | `int` | Number of tokens in the input (prompt / context window content). | required |
| `completion_tokens` | `int` | Number of tokens in the generated output. | required |
| `total_tokens` | `int` | Combined token count as reported by the provider. May differ from `prompt_tokens + completion_tokens` for some providers. | required |
| `reasoning_tokens` | `int` or `None` | Tokens consumed by chain-of-thought reasoning, when disclosed by the provider (e.g. OpenAI reasoning models). | `None` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `output_tokens` | `int` | Provider-neutral alias for `completion_tokens`. |
Notes
Use output_tokens (alias for completion_tokens) when writing
code that should work with multiple providers, as some SDKs use the term
"output tokens" rather than "completion tokens".
Examples:
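A small helper that exercises the fields (`usage_summary` is not part of the library, just an illustration):

```python
def usage_summary(result):
    # token_usage may be None when the provider reports no usage data.
    u = result.token_usage
    if u is None:
        return "no usage reported"
    # output_tokens is the provider-neutral alias for completion_tokens.
    return f"prompt={u.prompt_tokens} output={u.output_tokens} total={u.total_tokens}"
```

For example, `print(usage_summary(client.generate("Hello")))`.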
infermesh.RequestMetrics
dataclass
RequestMetrics(
queue_wait_s: float,
service_time_s: float,
end_to_end_s: float,
deployment: str | None = None,
retries: int = 0,
)
Per-request timing and routing metadata.
Attached to every GenerationResult, EmbeddingResult, and TranscriptionResult produced by LMClient.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queue_wait_s` | `float` | Seconds spent waiting in the concurrency semaphore and/or rate-limiter queue before the request was dispatched to the provider. A persistently high value indicates the client is regularly hitting its configured RPM / TPM limits or its concurrency limit. | required |
| `service_time_s` | `float` | Seconds from request dispatch to response receipt — essentially network round-trip time plus provider inference latency. | required |
| `end_to_end_s` | `float` | Total wall-clock seconds from when the call entered the client to when the response was received. Always equal to `queue_wait_s` plus `service_time_s`. | required |
| `deployment` | `str` or `None` | The deployment label selected for this request in router mode (e.g. `"gpu-0"`). | `None` |
| `retries` | `int` | Number of retry attempts made before this response was received; `0` means the first attempt succeeded. | `0` |
Examples:
>>> result = client.generate("Hello")
>>> m = result.metrics
>>> if m:
... print(
... f"Queue wait: {m.queue_wait_s:.3f}s, "
... f"Service: {m.service_time_s:.3f}s, "
... f"Deployment: {m.deployment}, "
... f"Retries: {m.retries}"
... )
infermesh.ToolCall
dataclass
A tool call emitted by a model during a generation request.
Appears in tool_calls when the model decides to
invoke a function. Use id to correlate the tool result back to
the original call when continuing a multi-turn conversation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | Unique identifier assigned by the provider for this specific tool call. | required |
| `name` | `str` | The function name the model wants to invoke. | required |
| `arguments` | `str` or `None` | JSON-encoded string containing the arguments the model supplied. Parse with `json.loads`. | `None` |
Examples:
>>> import json
>>> result = client.generate("What is the weather in Paris?", ...)
>>> if result.tool_calls:
... for tc in result.tool_calls:
... args = json.loads(tc.arguments or "{}")
... print(f"Call {tc.id}: {tc.name}({args})")
infermesh.DeploymentConfig
dataclass
DeploymentConfig(
model: str,
api_base: str,
api_key: str | None = None,
extra_kwargs: dict[str, Any] | None = None,
)
Configuration for a single deployment replica used in router mode.
In router mode LMClient accepts a mapping of free-form
labels (for example "gpu-0" or "us-east-1") to
DeploymentConfig instances. The client builds
a LiteLLM Router from these configs and load-balances requests across the
replicas.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Full LiteLLM model identifier understood by the provider, e.g. `"openai/gpt-4o"`. | required |
| `api_base` | `str` | Base URL of the server, e.g. `"http://gpu0:8000/v1"`. | required |
| `api_key` | `str` or `None` | API key for this replica, if the server requires one. | `None` |
| `extra_kwargs` | `dict` or `None` | Additional LiteLLM keyword arguments applied only to this deployment. Useful for provider-specific settings such as custom request timeouts or Azure deployment names. | `None` |
Examples:
Create a deployment for a local vLLM replica:
>>> from infermesh import DeploymentConfig
>>> cfg = DeploymentConfig(
... model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
... api_base="http://gpu0:8000/v1",
... )
Create a deployment with an environment-sourced API key and custom timeout:
>>> import os
>>> cfg = DeploymentConfig(
... model="openai/gpt-4o",
... api_base="https://api.openai.com/v1",
... api_key=os.environ["OPENAI_API_KEY"],
... extra_kwargs={"timeout": 30},
... )
Batch Aliases
infermesh.types.GenerationBatchResult
module-attribute
Type alias for a batch of generation results.