Types & Helpers
image_block
infermesh.image_block
image_block(
source: str | Path | bytes,
*,
detail: Literal["auto", "low", "high"] | None = None,
mime_type: str | None = None,
) -> dict[str, Any]
Build an image content block for a multimodal chat message.
For URL-based images no helper is needed — pass the dict directly::
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
Use this function when the image is a local file or raw bytes that must be base64-encoded before sending to the provider. LLM servers cannot read the caller's filesystem, so local content must be supplied either as a publicly reachable URL or as a base64 data URL embedded in the request body.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` or `Path` or `bytes` | The image source: a plain `str` is treated as a URL, a `Path` as a local file to read and encode, and `bytes` as raw image data. | required |
| `detail` | `'auto'`, `'low'`, or `'high'` | OpenAI vision detail level controlling how many image tokens are consumed. `None` defers to the provider default (`"auto"`). | `None` |
| `mime_type` | `str` or `None` | MIME type of the image data, used when encoding `Path` or `bytes` sources. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | An image content block ready to include in a chat message's `content` list. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the `source` or other arguments are invalid. |
| `FileNotFoundError` | If `source` is a path to a file that does not exist. |
Examples:
URL (plain string is fine for URLs):
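A minimal sketch, reusing the placeholder URL from above (`client` is assumed to be an initialized client):

```python
# No helper needed for URL images: pass the image_url block directly.
msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}
```

Send it with `client.generate([msg])`, exactly as in the local-file example.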
Local file — pass a Path, not a plain string:
>>> msg = {
... "role": "user",
... "content": [
... {"type": "text", "text": "What's in this image?"},
... image_block(Path("photo.jpg")),
... image_block(Path("diagram.png"), detail="high"),
... ],
... }
>>> result = client.generate([msg])
Raw bytes:
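`image_block` is the natural fit here; as an illustration of what such a block contains, this sketch base64-encodes bytes into a data URL by hand (the JPEG bytes and exact block shape are illustrative assumptions, not the helper's guaranteed output):

```python
import base64

raw_jpeg = b"\xff\xd8\xff\xe0"  # stand-in for real JPEG bytes from disk or network
block = {
    "type": "image_url",
    "image_url": {
        # Data URL: the provider receives the image inline; no public URL needed.
        "url": "data:image/jpeg;base64," + base64.b64encode(raw_jpeg).decode("ascii")
    },
}
```

With the helper this collapses to `image_block(raw_jpeg, mime_type="image/jpeg")`.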
Input Types
infermesh.types.ChatMessage
module-attribute
A single chat message dict.
Must contain at least a "role" key and a "content" key, e.g.
{"role": "user", "content": "Hello!"}.
For multimodal (VLM) inputs the "content" value may be a list of content
blocks instead of a plain string. Text blocks have the form
{"type": "text", "text": "..."}; image blocks use
{"type": "image_url", "image_url": {"url": "https://..."}}.
Use image_block to build image blocks from local files or raw bytes.
infermesh.types.ChatInput
module-attribute
A full chat conversation: an ordered list of ChatMessage dicts.
Supports both plain-text and multimodal (VLM) messages; see ChatMessage and
image_block.
infermesh.types.ResponsesInput
module-attribute
Input for the "responses" endpoint.
Contains an "input" key (required) and an optional "instructions"
key for a system prompt.
infermesh.types.GenerateInput
module-attribute
Union of the three accepted generation input formats.
- `str`: plain text; converted to a single user message internally.
- `ChatInput`: a pre-built list of role/content dicts. Supports multimodal messages; see `ChatMessage` and `image_block`.
- `ResponsesInput`: a dict suitable for the `responses` endpoint.
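The `str` shorthand can be pictured as the following normalization (a hypothetical sketch, not the library's actual code):

```python
def as_chat_input(value):
    # Hypothetical: a plain string becomes a one-message conversation;
    # ChatInput lists and ResponsesInput dicts pass through unchanged.
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value
```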
infermesh.types.EmbeddingInput
module-attribute
Accepted embedding input: a single string or a list of strings.
infermesh.types.TranscriptionInput
module-attribute
Accepted transcription input.
- `str` / `pathlib.Path`: path to an audio file on disk; opened and read automatically.
- `bytes`: raw audio bytes.
- `BinaryIO`: any file-like object with a `.read()` method.
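How the three forms reduce to raw audio bytes can be sketched as follows (illustrative, not the library's actual code):

```python
import io
from pathlib import Path

def read_audio_bytes(source):
    """Coerce str/Path, bytes, or a file-like object to raw audio bytes."""
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()  # open and read from disk
    if isinstance(source, bytes):
        return source                     # already raw bytes
    return source.read()                  # any object with a .read() method
```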
infermesh.types.EndpointType
module-attribute
The three supported generation endpoint identifiers.
"chat_completion"(default): standard chat API (/v1/chat/completions)."text_completion": legacy completions API (/v1/completions). Input must be a plain string; LiteLLM'satext_completionis called."responses": OpenAI Responses API (/v1/responses).
Result Types
infermesh.GenerationResult
dataclass
GenerationResult(
model_id: str,
output_text: str,
output_parsed: Any | None = None,
reasoning: str | None = None,
token_usage: TokenUsage | None = None,
finish_reason: str | None = None,
tool_calls: list[ToolCall] | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
cost: float | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of a text-generation request.
Returned by generate,
agenerate, and contained in
BatchResult for *_batch methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `output_text` | `str` | The generated text. | required |
| `output_parsed` | `object` or `None` | The structured result when a `response_format` model is supplied. | `None` |
| `reasoning` | `str` or `None` | Extended chain-of-thought reasoning text, when disclosed by the provider (e.g. certain Anthropic or OpenAI reasoning models). | `None` |
| `token_usage` | `TokenUsage` or `None` | Token-count breakdown. | `None` |
| `finish_reason` | `str` or `None` | The stop condition reported by the provider. Common values are `"stop"`, `"length"`, and `"tool_calls"`. | `None` |
| `tool_calls` | `list[ToolCall]` or `None` | Structured tool calls emitted by the model. | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response object. Useful for accessing provider-specific fields not surfaced by this dataclass. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `cost` | `float` or `None` | Estimated cost in USD, when reported by LiteLLM's cost tracking. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Notes
str(result) returns output_text, so a
GenerationResult can be used directly wherever a string is
expected.
Examples:
Basic generation:
>>> result = client.generate("Summarize backpropagation in one sentence.")
>>> print(result.output_text)
>>> print(f"Cost: ${result.cost:.6f}" if result.cost else "no cost info")
Structured output with a Pydantic model:
>>> from pydantic import BaseModel
>>> class Summary(BaseModel):
... headline: str
... body: str
>>> result = client.generate(
... "Summarize the French Revolution.",
... response_format=Summary,
... )
>>> summary: Summary = result.output_parsed # type: ignore[assignment]
>>> print(summary.headline)
infermesh.EmbeddingResult
dataclass
EmbeddingResult(
model_id: str,
embedding: list[float],
token_usage: TokenUsage | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of an embedding request.
Returned by embed for single-string input and contained in BatchResult for embed_batch calls.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `embedding` | `list[float]` | The dense embedding vector. Its length equals the model's output dimension (e.g. 1536 for OpenAI's `text-embedding-3-small`). | required |
| `token_usage` | `TokenUsage` or `None` | Token-count breakdown. | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response for advanced use cases. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Examples:
>>> import numpy as np
>>> result = client.embed("The quick brown fox jumps over the lazy dog.")
>>> vec = np.array(result.embedding)
>>> print(f"Dim: {vec.shape[0]}, Norm: {np.linalg.norm(vec):.4f}")
infermesh.TranscriptionResult
dataclass
TranscriptionResult(
model_id: str,
text: str,
duration_s: float | None = None,
language: str | None = None,
raw_response: Any | None = None,
request_id: str | None = None,
metrics: RequestMetrics | None = None,
)
The typed result of an audio-transcription request.
Returned by transcribe and atranscribe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The provider-reported model identifier. | required |
| `text` | `str` | The transcribed text. | required |
| `duration_s` | `float` or `None` | Duration of the audio clip in seconds, when reported by the provider. | `None` |
| `language` | `str` or `None` | Detected or explicitly requested language code (e.g. `"en"`). | `None` |
| `raw_response` | `object` or `None` | The unmodified provider response for advanced use cases. | `None` |
| `request_id` | `str` or `None` | The provider-assigned request identifier. | `None` |
| `metrics` | `RequestMetrics` or `None` | Queue-wait and service-time metadata for this request. | `None` |
Examples:
>>> result = client.transcribe("interview.mp3")
>>> print(result.text)
>>> if result.language:
... print(f"Detected language: {result.language}")
infermesh.BatchResult
dataclass
Bases: Generic[T]
A typed container for the results of a batch request.
Returned by generate_batch, agenerate_batch, embed_batch, aembed_batch, transcribe_batch, and atranscribe_batch.
When return_exceptions=True (the default), a failed item does not
raise and discard the whole batch. Instead, results contains
None at that position and errors holds the exception. Both
lists are always the same length as the input, enabling index-based
correlation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `list[T or None]` | One entry per input item. Successful items have type `T`; failed items are `None`. | required |
| `errors` | `list[BaseException or None]` or `None` | One entry per input item when `return_exceptions=True`: the exception for failed items, `None` for successes. | `None` |
Notes
- `len(batch)` always equals the number of input items.
- Iterating over `batch` yields from `results` (may include `None` values on failure).
- Index access (`batch[i]`) returns `results[i]`.
- To split successes from failures::

    successes = [r for r, e in zip(batch.results, batch.errors or []) if e is None]
    failures = [(i, e) for i, e in enumerate(batch.errors or []) if e is not None]
Examples:
Process a batch tolerating partial failures (default behaviour):
>>> prompts = ["Translate 'cat' to French", "bad-prompt", "What is 42?"]
>>> batch = client.generate_batch(prompts)
>>> for i, (result, error) in enumerate(zip(batch.results, batch.errors or [])):
... if error:
... print(f"[{i}] ERROR: {error}")
... else:
... print(f"[{i}] {result.output_text}")
Opt in to raise-on-first-failure (legacy behaviour):
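A sketch assuming only the `return_exceptions` flag documented above (`client` is any initialized client; `generate_batch_strict` is a hypothetical wrapper, not a library function):

```python
def generate_batch_strict(client, prompts):
    """All-or-nothing batch: with return_exceptions=False, the first failed
    item raises its exception instead of being recorded in batch.errors."""
    try:
        return client.generate_batch(prompts, return_exceptions=False)
    except Exception as exc:
        print(f"batch aborted: {exc!r}")
        raise
```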
__iter__
Iterate over batch items in input order.
Yields:
| Type | Description |
|---|---|
| `T` or `None` | Each item from `results`, in input order; `None` marks a failed item. |
__getitem__
Return the result at index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int` | Zero-based position in the batch. | required |
Returns:
| Type | Description |
|---|---|
| `T` or `None` | `results[index]`: the result at that position, or `None` if that item failed. |
infermesh.TokenUsage
dataclass
TokenUsage(
prompt_tokens: int,
completion_tokens: int,
total_tokens: int,
reasoning_tokens: int | None = None,
)
Token-count information returned by a provider for a single request.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt_tokens` | `int` | Number of tokens in the input (prompt / context window content). | required |
| `completion_tokens` | `int` | Number of tokens in the generated output. | required |
| `total_tokens` | `int` | Combined token count as reported by the provider. May differ from `prompt_tokens + completion_tokens` for some providers. | required |
| `reasoning_tokens` | `int` or `None` | Tokens consumed by chain-of-thought reasoning, when disclosed by the provider (e.g. OpenAI reasoning models). | `None` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `output_tokens` | `int` | Provider-neutral alias for `completion_tokens`. |
Notes
Use output_tokens (alias for completion_tokens) when writing
code that should work with multiple providers, as some SDKs use the term
"output tokens" rather than "completion tokens".
Examples:
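A small helper that exercises the fields (`usage_summary` is not part of the library, just an illustration):

```python
def usage_summary(result):
    # token_usage may be None when the provider reports no usage data.
    u = result.token_usage
    if u is None:
        return "no usage reported"
    # output_tokens is the provider-neutral alias for completion_tokens.
    return f"prompt={u.prompt_tokens} output={u.output_tokens} total={u.total_tokens}"
```

For example, `print(usage_summary(client.generate("Hello")))`.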
infermesh.RequestMetrics
dataclass
RequestMetrics(
queue_wait_s: float,
service_time_s: float,
end_to_end_s: float,
deployment: str | None = None,
retries: int = 0,
)
Per-request timing and routing metadata.
Attached to every GenerationResult, EmbeddingResult, and TranscriptionResult produced by LMClient.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `queue_wait_s` | `float` | Seconds spent waiting in the concurrency semaphore and/or rate-limiter queue before the request was dispatched to the provider. A persistently high value indicates the client is regularly hitting its configured RPM / TPM limits or its concurrency limit. | required |
| `service_time_s` | `float` | Seconds from request dispatch to response receipt — essentially network round-trip time plus provider inference latency. | required |
| `end_to_end_s` | `float` | Total wall-clock seconds from when the call entered the client to when the response was received. Always equal to `queue_wait_s` plus `service_time_s`. | required |
| `deployment` | `str` or `None` | The deployment label selected for this request in router mode (e.g. `"gpu-0"`). | `None` |
| `retries` | `int` | Number of retry attempts made before this response was received; `0` means the first attempt succeeded. | `0` |
Examples:
>>> result = client.generate("Hello")
>>> m = result.metrics
>>> if m:
... print(
... f"Queue wait: {m.queue_wait_s:.3f}s, "
... f"Service: {m.service_time_s:.3f}s, "
... f"Deployment: {m.deployment}, "
... f"Retries: {m.retries}"
... )
infermesh.ToolCall
dataclass
A tool call emitted by a model during a generation request.
Appears in tool_calls when the model decides to
invoke a function. Use id to correlate the tool result back to
the original call when continuing a multi-turn conversation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | Unique identifier assigned by the provider for this specific tool call. | required |
| `name` | `str` | The function name the model wants to invoke. | required |
| `arguments` | `str` or `None` | JSON-encoded string containing the arguments the model supplied. Parse with `json.loads`. | `None` |
Examples:
>>> import json
>>> result = client.generate("What is the weather in Paris?", ...)
>>> if result.tool_calls:
... for tc in result.tool_calls:
... args = json.loads(tc.arguments or "{}")
... print(f"Call {tc.id}: {tc.name}({args})")
infermesh.DeploymentConfig
dataclass
DeploymentConfig(
model: str,
api_base: str,
api_key: str | None = None,
extra_kwargs: dict[str, Any] | None = None,
)
Configuration for a single deployment replica used in router mode.
In router mode LMClient accepts a mapping of free-form
labels (for example "gpu-0" or "us-east-1") to
DeploymentConfig instances. The client builds
a LiteLLM Router from these configs and load-balances requests across the
replicas.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Full LiteLLM model identifier understood by the provider, e.g. `"openai/gpt-4o"`. | required |
| `api_base` | `str` | Base URL of the server, e.g. `"http://gpu0:8000/v1"`. | required |
| `api_key` | `str` or `None` | API key for this replica, if the server requires one. | `None` |
| `extra_kwargs` | `dict` or `None` | Additional LiteLLM keyword arguments applied only to this deployment. Useful for provider-specific settings such as custom request timeouts or Azure deployment names. | `None` |
Examples:
Create a deployment for a local vLLM replica:
>>> from infermesh import DeploymentConfig
>>> cfg = DeploymentConfig(
... model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
... api_base="http://gpu0:8000/v1",
... )
Create a deployment with an environment-sourced API key and custom timeout:
>>> import os
>>> cfg = DeploymentConfig(
... model="openai/gpt-4o",
... api_base="https://api.openai.com/v1",
... api_key=os.environ["OPENAI_API_KEY"],
... extra_kwargs={"timeout": 30},
... )
Batch Aliases
infermesh.types.GenerationBatchResult
module-attribute
Type alias for a batch of generation results.