
Types & Helpers

image_block

infermesh.image_block

image_block(
    source: str | Path | bytes,
    *,
    detail: Literal["auto", "low", "high"] | None = None,
    mime_type: str | None = None,
) -> dict[str, Any]

Build an image content block for a multimodal chat message.

For URL-based images no helper is needed — pass the dict directly::

{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}

Use this function when the image is a local file or raw bytes that must be base64-encoded before sending to the provider. Provider servers cannot read the caller's filesystem, so an image must arrive either as a publicly reachable URL or as a base64 data URL embedded in the request body.
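The encoding path can be illustrated with a stand-alone sketch (make_data_url_block is a hypothetical name used here only for illustration; it is not part of infermesh):

```python
import base64

def make_data_url_block(data: bytes, mime_type: str) -> dict:
    """Wrap raw image bytes in a base64 data URL inside an
    OpenAI-style image_url content block."""
    b64 = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{b64}"},
    }

block = make_data_url_block(b"\x89PNG...", "image/png")
```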

Parameters:

Name Type Description Default
source str or Path or bytes

The image source:

  • A URL string ("https://..." or "http://...") — returned as-is inside an image_url block.
  • A pathlib.Path — the file is read and base64-encoded automatically. MIME type is inferred from the file extension via mimetypes; supply mime_type to override. Raises ValueError when the MIME type cannot be inferred and mime_type is not provided.
  • bytes — raw image bytes, base64-encoded. mime_type is required.
required
detail ('auto', 'low', 'high') or None

OpenAI vision detail level controlling how many image tokens are consumed. None (default) omits the field and lets the provider choose.

None
mime_type str or None

MIME type string (e.g. "image/png"). Required when source is bytes; optional override when source is a pathlib.Path.

None

Returns:

Type Description
dict

{"type": "image_url", "image_url": {"url": ...}} ready for use as an element in a ChatMessage "content" list.

Raises:

Type Description
ValueError

If source is bytes without mime_type; if source is a pathlib.Path whose MIME type cannot be inferred from the file extension and mime_type is not provided; or if a plain string is passed that is not an http:// or https:// URL — use a pathlib.Path for local files.

FileNotFoundError

If source is a pathlib.Path that does not exist.

Examples:

URL (plain string is fine for URLs):

>>> block = image_block("https://example.com/cat.jpg")

Local file — pass a Path, not a plain string:

>>> msg = {
...     "role": "user",
...     "content": [
...         {"type": "text", "text": "What's in this image?"},
...         image_block(Path("photo.jpg")),
...         image_block(Path("diagram.png"), detail="high"),
...     ],
... }
>>> result = client.generate([msg])

Raw bytes:

>>> with open("photo.jpg", "rb") as f:
...     data = f.read()
>>> block = image_block(data, mime_type="image/jpeg")

Input Types

infermesh.types.ChatMessage module-attribute

ChatMessage: TypeAlias = dict[str, Any]

A single chat message dict.

Must contain at least a "role" key and a "content" key, e.g. {"role": "user", "content": "Hello!"}.

For multimodal (VLM) inputs the "content" value may be a list of content blocks instead of a plain string. Text blocks have the form {"type": "text", "text": "..."}; image blocks use {"type": "image_url", "image_url": {"url": "https://..."}}.

Use image_block to build image blocks from local files or raw bytes.
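For instance, a plain-text message and a multimodal message side by side (the URL is a placeholder):

```python
# Plain-text content: a string.
text_msg = {"role": "user", "content": "Hello!"}

# Multimodal content: a list of typed content blocks.
vlm_msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}
```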

infermesh.types.ChatInput module-attribute

ChatInput: TypeAlias = list[ChatMessage]

A full chat conversation: an ordered list of ChatMessage dicts.

Supports both plain-text and multimodal (VLM) messages; see ChatMessage and image_block.

infermesh.types.ResponsesInput module-attribute

ResponsesInput: TypeAlias = dict[str, Any]

Input for the "responses" endpoint.

Contains an "input" key (required) and an optional "instructions" key for a system prompt.
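A minimal ResponsesInput dict therefore looks like (the strings are placeholders):

```python
responses_input = {
    "instructions": "You are a terse assistant.",  # optional system prompt
    "input": "Name the largest planet.",           # required
}
```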

infermesh.types.GenerateInput module-attribute

GenerateInput: TypeAlias = str | ChatInput | ResponsesInput

Union of the three accepted generation input formats.

  • str: plain text; converted to a single user message internally.
  • ChatInput: a pre-built list of role/content dicts. Supports multimodal messages; see ChatMessage and image_block.
  • ResponsesInput: a dict suitable for the responses endpoint.

infermesh.types.EmbeddingInput module-attribute

EmbeddingInput: TypeAlias = str | list[str]

Accepted embedding input: a single string or a list of strings.

infermesh.types.TranscriptionInput module-attribute

TranscriptionInput: TypeAlias = (
    str | Path | bytes | BinaryIO
)

Accepted transcription input.

  • str / pathlib.Path: path to an audio file on disk; opened and read automatically.
  • bytes: raw audio bytes.
  • BinaryIO: any file-like object with a .read() method.
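Any object with a .read() method qualifies; for example, an in-memory io.BytesIO wrapper over raw bytes (the bytes below are a stand-in, not real audio):

```python
import io

audio_bytes = b"RIFF\x00\x00\x00\x00WAVE"  # stand-in for real audio data
buf = io.BytesIO(audio_bytes)  # file-like: has .read(), so it fits BinaryIO
data = buf.read()
```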

infermesh.types.EndpointType module-attribute

EndpointType: TypeAlias = Literal[
    "text_completion", "chat_completion", "responses"
]

The three supported generation endpoint identifiers.

  • "chat_completion" (default): standard chat API (/v1/chat/completions).
  • "text_completion": legacy completions API (/v1/completions). Input must be a plain string; LiteLLM's atext_completion is called.
  • "responses": OpenAI Responses API (/v1/responses).

Result Types

infermesh.GenerationResult dataclass

GenerationResult(
    model_id: str,
    output_text: str,
    output_parsed: Any | None = None,
    reasoning: str | None = None,
    token_usage: TokenUsage | None = None,
    finish_reason: str | None = None,
    tool_calls: list[ToolCall] | None = None,
    raw_response: Any | None = None,
    request_id: str | None = None,
    cost: float | None = None,
    metrics: RequestMetrics | None = None,
)

The typed result of a text-generation request.

Returned by generate, agenerate, and contained in BatchResult for *_batch methods.

Parameters:

Name Type Description Default
model_id str

The provider-reported model identifier (e.g. "gpt-4o-mini").

required
output_text str

The generated text. For responses-endpoint calls this is the concatenation of all output_text content blocks in the response.

required
output_parsed object or None

The structured result when response_format was supplied, or when parse_output=True was used with a Pydantic model or JSON-schema dict. The type matches the supplied response_format. When response_format is a Pydantic model class the output is validated via model_validate_json; when it is a dict the parsed JSON is validated against the provided JSON Schema before being returned — a schema violation is treated as a parse failure. None when parsing was not requested or failed (a warning is logged on parse failure).

None
reasoning str or None

Extended chain-of-thought reasoning text, when disclosed by the provider (e.g. certain Anthropic or OpenAI reasoning models).

None
token_usage TokenUsage or None

Token-count breakdown. None if the provider did not include usage information in the response.

None
finish_reason str or None

The stop condition reported by the provider. Common values are "stop" (normal completion), "length" (hit max_tokens), and "tool_calls" (model requested a tool).

None
tool_calls list[ToolCall] or None

Structured tool calls emitted by the model. None when the model completed without requesting any tool invocation.

None
raw_response object or None

The unmodified provider response object. Useful for accessing provider-specific fields not surfaced by this dataclass.

None
request_id str or None

The provider-assigned request identifier (e.g. the id field from an OpenAI response).

None
cost float or None

Estimated cost in USD, when reported by LiteLLM's cost tracking.

None
metrics RequestMetrics or None

Queue-wait and service-time metadata for this request.

None
Notes

str(result) returns output_text, so a GenerationResult can be used directly wherever a string is expected.

Examples:

Basic generation:

>>> result = client.generate("Summarize backpropagation in one sentence.")
>>> print(result.output_text)
>>> print(f"Cost: ${result.cost:.6f}" if result.cost else "no cost info")

Structured output with a Pydantic model:

>>> from pydantic import BaseModel
>>> class Summary(BaseModel):
...     headline: str
...     body: str
>>> result = client.generate(
...     "Summarize the French Revolution.",
...     response_format=Summary,
... )
>>> summary: Summary = result.output_parsed  # type: ignore[assignment]
>>> print(summary.headline)

__str__

__str__() -> str

Return the generated text.

Returns:

Type Description
str

The value of output_text.

infermesh.EmbeddingResult dataclass

EmbeddingResult(
    model_id: str,
    embedding: list[float],
    token_usage: TokenUsage | None = None,
    raw_response: Any | None = None,
    request_id: str | None = None,
    metrics: RequestMetrics | None = None,
)

The typed result of an embedding request.

Returned by embed for single-string input and contained in BatchResult for embed_batch calls.

Parameters:

Name Type Description Default
model_id str

The provider-reported model identifier.

required
embedding list[float]

The dense embedding vector. Its length equals the model's output dimension (e.g. 1536 for text-embedding-3-small).

required
token_usage TokenUsage or None

Token-count breakdown. None if the provider did not report usage.

None
raw_response object or None

The unmodified provider response for advanced use cases.

None
request_id str or None

The provider-assigned request identifier.

None
metrics RequestMetrics or None

Queue-wait and service-time metadata for this request.

None

Examples:

>>> import numpy as np
>>> result = client.embed("The quick brown fox jumps over the lazy dog.")
>>> vec = np.array(result.embedding)
>>> print(f"Dim: {vec.shape[0]}, Norm: {np.linalg.norm(vec):.4f}")

infermesh.TranscriptionResult dataclass

TranscriptionResult(
    model_id: str,
    text: str,
    duration_s: float | None = None,
    language: str | None = None,
    raw_response: Any | None = None,
    request_id: str | None = None,
    metrics: RequestMetrics | None = None,
)

The typed result of an audio-transcription request.

Returned by transcribe and atranscribe.

Parameters:

Name Type Description Default
model_id str

The provider-reported model identifier (e.g. "whisper-1").

required
text str

The transcribed text.

required
duration_s float or None

Duration of the audio clip in seconds, when reported by the provider.

None
language str or None

Detected or explicitly requested language code (e.g. "en"), when reported by the provider.

None
raw_response object or None

The unmodified provider response for advanced use cases.

None
request_id str or None

The provider-assigned request identifier.

None
metrics RequestMetrics or None

Queue-wait and service-time metadata for this request.

None

Examples:

>>> result = client.transcribe("interview.mp3")
>>> print(result.text)
>>> if result.language:
...     print(f"Detected language: {result.language}")

infermesh.BatchResult dataclass

BatchResult(
    results: list[T | None],
    errors: list[BaseException | None] | None = None,
)

Bases: Generic[T]

A typed container for the results of a batch request.

Returned by generate_batch, agenerate_batch, embed_batch, aembed_batch, transcribe_batch, and atranscribe_batch.

When return_exceptions=True (the default), a failed item does not raise or discard the whole batch. Instead, results contains None at that position and errors holds the exception. Both lists are always the same length as the input, enabling index-based correlation.
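The index correlation can be sketched with plain lists standing in for results and errors (the values are invented):

```python
# One slot per input item; each index is filled in exactly one of the two lists.
results = ["chat", None, "42"]                   # None where the request failed
errors = [None, ValueError("bad prompt"), None]  # exception at the failed index

successes = [(i, r) for i, (r, e) in enumerate(zip(results, errors)) if e is None]
failures = [(i, e) for i, (r, e) in enumerate(zip(results, errors)) if e is not None]
```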

Parameters:

Name Type Description Default
results list[T or None]

One entry per input item. Successful items have type T; items where an exception occurred are None (only when return_exceptions=True).

required
errors list[BaseException or None] or None

One entry per input item when return_exceptions=True was used. None at positions where the request succeeded; the exception at positions where it failed. This attribute is None itself when return_exceptions=False.

None
Notes
  • len(batch) always equals the number of input items.
  • Iterating over batch yields from results (may include None values on failure).
  • Index access (batch[i]) returns results[i].
  • To split successes from failures::

    successes = [r for r, e in zip(batch.results, batch.errors or []) if e is None]
    failures = [(i, e) for i, e in enumerate(batch.errors or []) if e is not None]

Examples:

Process a batch tolerating partial failures (default behaviour):

>>> prompts = ["Translate 'cat' to French", "bad-prompt", "What is 42?"]
>>> batch = client.generate_batch(prompts)
>>> for i, (result, error) in enumerate(zip(batch.results, batch.errors or [])):
...     if error:
...         print(f"[{i}] ERROR: {error}")
...     else:
...         print(f"[{i}] {result.output_text}")

Opt in to raise-on-first-failure (legacy behaviour):

>>> batch = client.generate_batch(prompts, return_exceptions=False)

__iter__

__iter__() -> Iterator[T | None]

Iterate over batch items in input order.

Yields:

Type Description
T or None

Each item from results. None at positions where the corresponding request failed (when return_exceptions=True).

__getitem__

__getitem__(index: int) -> T | None

Return the result at index.

Parameters:

Name Type Description Default
index int

Zero-based position in the batch.

required

Returns:

Type Description
T or None

results[index]. None if the request at that position failed (when return_exceptions=True).

__len__

__len__() -> int

Return the number of items in the batch.

Returns:

Type Description
int

Always equal to the number of input items, regardless of how many requests succeeded or failed.

infermesh.TokenUsage dataclass

TokenUsage(
    prompt_tokens: int,
    completion_tokens: int,
    total_tokens: int,
    reasoning_tokens: int | None = None,
)

Token-count information returned by a provider for a single request.

Parameters:

Name Type Description Default
prompt_tokens int

Number of tokens in the input (prompt / context window content).

required
completion_tokens int

Number of tokens in the generated output.

required
total_tokens int

Combined token count as reported by the provider. May differ from prompt_tokens + completion_tokens for providers that count internal reasoning tokens separately.

required
reasoning_tokens int or None

Tokens consumed by chain-of-thought reasoning, when disclosed by the provider (e.g. OpenAI o1 / o3 families). None when not reported.

None
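The total_tokens caveat can be checked with plain integers (the counts below are invented, modelling a provider that folds reasoning tokens into the total):

```python
prompt_tokens = 120
completion_tokens = 40
reasoning_tokens = 300  # hidden chain-of-thought, counted by some providers

# Such providers report a total that exceeds the naive sum:
total_tokens = prompt_tokens + completion_tokens + reasoning_tokens
naive_sum = prompt_tokens + completion_tokens
```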

Attributes:

Name Type Description
output_tokens int

Provider-neutral alias for completion_tokens.

Notes

Use output_tokens (alias for completion_tokens) when writing code that should work with multiple providers, as some SDKs use the term "output tokens" rather than "completion tokens".

Examples:

>>> result = client.generate("Explain backpropagation briefly.")
>>> if result.token_usage:
...     u = result.token_usage
...     print(
...         f"Prompt: {u.prompt_tokens}, Output: {u.output_tokens}, "
...         f"Total: {u.total_tokens}"
...     )

output_tokens property

output_tokens: int

Return completion tokens under a provider-neutral alias.

Returns:

Type Description
int

The value of completion_tokens.

infermesh.RequestMetrics dataclass

RequestMetrics(
    queue_wait_s: float,
    service_time_s: float,
    end_to_end_s: float,
    deployment: str | None = None,
    retries: int = 0,
)

Per-request timing and routing metadata.

Attached to every GenerationResult, EmbeddingResult, and TranscriptionResult produced by LMClient.

Parameters:

Name Type Description Default
queue_wait_s float

Seconds spent waiting in the concurrency semaphore and/or rate-limiter queue before the request was dispatched to the provider. A persistently high value indicates the client is regularly hitting its configured RPM / TPM limits or its max_parallel_requests cap.

required
service_time_s float

Seconds from request dispatch to response receipt — essentially network round-trip time plus provider inference latency.

required
end_to_end_s float

Total wall-clock seconds from when the call entered the client to when the response was received. Always equal to queue_wait_s + service_time_s.

required
deployment str or None

The deployment label selected for this request in router mode (e.g. "replica-1"), extracted from LiteLLM's _hidden_params or x-litellm-deployment header. None in single-endpoint mode.

None
retries int

Number of retry attempts made before this response was received. 0 means the first attempt succeeded.

0
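The relationship between the three timings can be sketched with stdlib timers (timed_request is a hypothetical helper, not part of infermesh):

```python
import time

def timed_request() -> dict[str, float]:
    """Hypothetical illustration of how the three timings decompose."""
    t_enter = time.monotonic()
    time.sleep(0.01)              # stand-in for waiting on semaphore / rate limiter
    t_dispatch = time.monotonic()
    time.sleep(0.02)              # stand-in for the provider round trip
    t_done = time.monotonic()

    queue_wait_s = t_dispatch - t_enter
    service_time_s = t_done - t_dispatch
    return {
        "queue_wait_s": queue_wait_s,
        "service_time_s": service_time_s,
        "end_to_end_s": queue_wait_s + service_time_s,  # sum by definition
    }

m = timed_request()
```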

Examples:

>>> result = client.generate("Hello")
>>> m = result.metrics
>>> if m:
...     print(
...         f"Queue wait: {m.queue_wait_s:.3f}s, "
...         f"Service: {m.service_time_s:.3f}s, "
...         f"Deployment: {m.deployment}, "
...         f"Retries: {m.retries}"
...     )

infermesh.ToolCall dataclass

ToolCall(id: str, name: str, arguments: str | None = None)

A tool call emitted by a model during a generation request.

Appears in tool_calls when the model decides to invoke a function. Use id to correlate the tool result back to the original call when continuing a multi-turn conversation.

Parameters:

Name Type Description Default
id str

Unique identifier assigned by the provider for this specific tool call.

required
name str

The function name the model wants to invoke.

required
arguments str or None

JSON-encoded string containing the arguments the model supplied. Parse with json.loads(tool_call.arguments) to obtain a dict. None if the model emitted a tool call with no arguments.

None

Examples:

>>> import json
>>> result = client.generate("What is the weather in Paris?", ...)
>>> if result.tool_calls:
...     for tc in result.tool_calls:
...         args = json.loads(tc.arguments or "{}")
...         print(f"Call {tc.id}: {tc.name}({args})")

infermesh.DeploymentConfig dataclass

DeploymentConfig(
    model: str,
    api_base: str,
    api_key: str | None = None,
    extra_kwargs: dict[str, Any] | None = None,
)

Configuration for a single deployment replica used in router mode.

In router mode LMClient accepts a mapping of free-form labels (for example "gpu-0" or "us-east-1") to DeploymentConfig instances. The client builds a LiteLLM Router from these configs and load-balances requests across the replicas.

Parameters:

Name Type Description Default
model str

Full LiteLLM model identifier understood by the provider, e.g. "hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct" for a vLLM server or "anthropic/claude-3-5-sonnet-20241022" for Anthropic.

required
api_base str

Base URL of the server, e.g. "http://gpu0:8000/v1".

required
api_key str or None

API key for this replica. Pass None (default) when the server does not require authentication, which is typical for local vLLM deployments.

None
extra_kwargs dict or None

Additional LiteLLM keyword arguments applied only to this deployment. Useful for provider-specific settings such as custom request timeouts or Azure deployment names.

None

Examples:

Create a deployment for a local vLLM replica:

>>> from infermesh import DeploymentConfig
>>> cfg = DeploymentConfig(
...     model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
...     api_base="http://gpu0:8000/v1",
... )

Create a deployment with an environment-sourced API key and custom timeout:

>>> import os
>>> cfg = DeploymentConfig(
...     model="openai/gpt-4o",
...     api_base="https://api.openai.com/v1",
...     api_key=os.environ["OPENAI_API_KEY"],
...     extra_kwargs={"timeout": 30},
... )

Batch Aliases

infermesh.types.GenerationBatchResult module-attribute

GenerationBatchResult: TypeAlias = BatchResult[
    GenerationResult
]

Type alias for a batch of generation results.

infermesh.types.EmbeddingBatchResult module-attribute

EmbeddingBatchResult: TypeAlias = BatchResult[
    EmbeddingResult
]

Type alias for a batch of embedding results.