LMClient

infermesh.LMClient

LMClient(
    *,
    model: str | None = None,
    api_base: str | None = None,
    api_key: str | None = None,
    deployments: dict[
        str, DeploymentConfig | dict[str, Any]
    ]
    | None = None,
    endpoint: EndpointType = "chat_completion",
    max_parallel_requests: int | None = None,
    rpm: int | None = None,
    tpm: int | None = None,
    rpd: int | None = None,
    tpd: int | None = None,
    max_request_burst: int | None = None,
    max_token_burst: int | None = None,
    header_bucket_scope: Literal[
        "minute", "day", "auto"
    ] = "auto",
    default_output_tokens: int = 0,
    timeout: float | None = None,
    max_retries: int = 3,
    default_request_kwargs: dict[str, Any] | None = None,
    routing_strategy: str = "simple-shuffle",
    router_kwargs: dict[str, Any] | None = None,
)

Bases: _ClientRuntimeMixin

Batch-friendly language-model interface built on LiteLLM.

LMClient supports two operating modes selected at construction time:

Single-endpoint mode — one model, one server. Provide model and api_base:

client = LMClient(
    model="openai/gpt-4.1-mini",
    api_base="https://api.openai.com/v1",
)
result = client.generate("What is the capital of France?")
print(result.output_text)  # "Paris"
client.close()

Router mode — multiple replicas, load-balanced by a LiteLLM Router. Provide model (the logical model name) and deployments (a dict of free-form label → DeploymentConfig):

client = LMClient(
    model="llama-3",
    deployments={
        "gpu-0": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://gpu0:8000/v1",
        ),
        "gpu-1": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://gpu1:8000/v1",
        ),
    },
)

If you only need a few single requests, plain LiteLLM or the provider SDK is usually simpler. LMClient becomes useful when you need concurrent batches, per-item failure handling, client-side rate limiting, or routing across several replicas of the same logical model.

The client can be used as a context manager (sync or async) to ensure close is always called:

with LMClient(
    model="openai/gpt-4o", api_base="https://api.openai.com/v1"
) as client:
    batch = client.generate_batch(prompts)

The sync methods delegate to their async counterparts through a managed background loop, which keeps notebook and REPL usage simple while preserving retries, throttling, and batching behavior.
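The delegation pattern can be sketched with a small stand-in; this SyncRunner is illustrative only and is not the client's actual implementation:

```python
import asyncio
import threading


class SyncRunner:
    """Minimal sketch of a background-loop runner (illustrative only)."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # A daemon thread keeps the loop alive until close() is called.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit the coroutine to the background loop and block for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def agenerate(prompt: str) -> str:
    return f"echo: {prompt}"


runner = SyncRunner()
print(runner.run(agenerate("hi")))  # echo: hi
runner.close()
```

Because the caller blocks on `future.result()`, sync calls behave like ordinary function calls even inside notebooks that already run their own event loop.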

Notes

Always call close (or use the context-manager form) when the client is no longer needed. close stops the background SyncRunner thread; failing to call it leaves a daemon thread running until process exit.

A single RateLimiter instance is shared between the caller's event loop and the SyncRunner background loop, so sync and async calls are accounted together and do not double the effective rate.

See Also

DeploymentConfig : Per-replica configuration for router mode.

RateLimiter : The rate-limiter used internally; can also be used standalone.

Create an LMClient instance.

Parameters:

Name Type Description Default
model str | None

Provider model name used for all requests. Required in both single-endpoint and router modes; in router mode it is the logical model name that the configured deployments resolve to.

None
api_base str | None

Base URL of the provider endpoint for direct, single-endpoint usage. Leave unset when routing through deployments.

None
api_key str | None

API key for direct, single-endpoint usage. Leave unset when routing through deployments.

None
deployments dict[str, DeploymentConfig | dict[str, Any]] | None

Named deployment definitions used for router mode. Each deployment can override model, API base, API key, and provider-specific kwargs.

None
endpoint EndpointType

Default generation endpoint used by generate and generate_batch unless a per-call override is supplied.

"chat_completion"
max_parallel_requests int | None

Per-event-loop cap on concurrent in-flight requests. When set, generate_batch and agenerate_batch also admit generation work through a bounded in-flight window instead of creating one task per item up front. Must be None or a positive integer.

None
rpm int | None

Maximum requests per minute enforced client-side. When any rate limit is set, the client creates a shared limiter used by sync and async methods.

None
tpm int | None

Maximum tokens per minute enforced client-side by the shared limiter.

None
rpd int | None

Maximum requests per day enforced client-side by the shared limiter.

None
tpd int | None

Maximum tokens per day enforced client-side by the shared limiter.

None
max_request_burst int | None

Maximum request burst allowed above the steady rate by the client-side rate limiter.

None
max_token_burst int | None

Maximum token burst allowed above the steady rate by the client-side rate limiter.

None
header_bucket_scope ('minute', 'day', 'auto')

How provider rate-limit headers are interpreted when updating limiter state after a request.

"auto"
default_output_tokens int

Default completion-token budget used when estimating token reservations for rate limiting.

0
timeout float | None

Default request timeout forwarded to LiteLLM unless a per-call timeout is supplied in kwargs.

None
max_retries int

Number of retry attempts for transient provider failures such as rate limits, timeouts, and internal server errors.

3
default_request_kwargs dict[str, Any] | None

Request kwargs merged into every provider call.

None
routing_strategy str

Router selection strategy used when deployments are configured.

"simple-shuffle"
router_kwargs dict[str, Any] | None

Extra kwargs forwarded when building the LiteLLM router.

None

Warns:

Type Description
UserWarning

A warning is logged when api_base uses insecure HTTP for a non-local host.

Raises:

Type Description
ValueError

If model is missing, endpoint is invalid, deployment mode is mixed with direct api_base/api_key settings, or max_parallel_requests is not a positive integer.

Examples:

>>> client = LMClient(
...     model="openai/gpt-4o-mini",
...     api_base="https://api.openai.com/v1",
...     timeout=30,
... )
>>> client.close()

close

close() -> None

Release background resources used by the synchronous API.

generate, embed, and transcribe run on a managed background event loop. Call close when you are finished with the client, or prefer with / async with so cleanup happens automatically.

generate

generate(
    input_data: GenerateInput,
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    **kwargs: Any,
) -> GenerationResult

Generate one response on the notebook-safe background loop.

Parameters:

Name Type Description Default
input_data GenerateInput

Prompt text, a chat-style message list, or a responses payload. The accepted shape depends on the selected endpoint.

required
endpoint EndpointType | None

Per-call override for the generation endpoint. Defaults to the client-wide endpoint set at construction time.

None
response_format type[BaseModel] | dict[str, Any] | None

Structured output target. Pass a Pydantic model class or provider schema mapping to parse the response into output_parsed.

None
parse_output bool

When True, attempt to parse structured output even without an explicit response_format.

False
**kwargs Any

Additional LiteLLM request kwargs such as temperature or max_tokens.

{}

Returns:

Type Description
GenerationResult

The generated text, optional parsed output, token usage, request id, finish reason, and timing metrics.

Raises:

Type Description
ValueError

If input_data does not match the selected endpoint contract.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.generate("Summarize the French Revolution.")
>>> result.output_text
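When response_format is a Pydantic model class, output_parsed holds a validated instance. The validation step is roughly the following; City is a hypothetical model used only for illustration:

```python
from pydantic import BaseModel


class City(BaseModel):
    name: str
    country: str


# The provider returns JSON text; the client validates it into the model.
raw = '{"name": "Paris", "country": "France"}'
parsed = City.model_validate_json(raw)
print(parsed.name)  # Paris
```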

generate_batch

generate_batch(
    input_batch: Sequence[GenerateInput],
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnGenerationResult = None,
    **kwargs: Any,
) -> GenerationBatchResult

Generate a batch of responses synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[GenerateInput]

Ordered inputs to run. Each item may be a prompt string, chat message list, or responses payload.

required
endpoint EndpointType | None

Per-call override for the generation endpoint, applied to every item in the batch. Defaults to the client-wide endpoint.

None
response_format type[BaseModel] | dict[str, Any] | None

Behaves the same as in generate and is applied to every item in the batch.

None
parse_output bool

Behaves the same as in generate and is applied to every item in the batch.

False
**kwargs Any

Additional LiteLLM request kwargs applied to every item in the batch.

{}
return_exceptions bool

When True, item failures are captured in errors and the rest of the batch keeps running. When False, the first failure cancels remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item finishes.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each completed item.

None

Returns:

Type Description
GenerationBatchResult

A batch result with one slot per input item. Successful items appear in results and failures, when captured, appear in errors.

Notes

For large or memory-sensitive batch runs, set max_parallel_requests on the client. When it is unset, generate_batch may start work for the full batch up front.
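The effect of the concurrency cap can be sketched with an asyncio.Semaphore; unlike the real client, this simplified version still creates all tasks eagerly:

```python
import asyncio


async def run_bounded(items, worker, max_parallel: int):
    """Run worker(item) for each item with at most max_parallel in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # gather preserves input order in the returned results.
    return await asyncio.gather(*(guarded(i) for i in items))


async def fake_generate(prompt: str) -> str:
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_generate, max_parallel=2))
print(results)  # ['A', 'B', 'C']
```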

Raises:

Type Description
ValueError

If any batch item has an invalid input shape.

Exception

The first provider error when return_exceptions is False.

Examples:

>>> batch = client.generate_batch(
...     ["ELI5: Artificial Intelligence", "ELI5: Quantum Computing"]
... )
>>> [item.output_text if item else None for item in batch.results]

agenerate async

agenerate(
    input_data: GenerateInput,
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    **kwargs: Any,
) -> GenerationResult

Generate one response asynchronously.

Parameters:

Name Type Description Default
input_data GenerateInput

Follows the same contract as generate.

required
endpoint EndpointType | None

Follows the same contract as generate.

None
response_format type[BaseModel] | dict[str, Any] | None

Follows the same contract as generate.

None
parse_output bool

Follows the same contract as generate.

False
**kwargs Any

Follows the same contract as generate.

{}

Returns:

Type Description
GenerationResult

The generated text, structured output if requested, and request metadata for a single input.

Raises:

Type Description
ValueError

If input_data does not match the selected endpoint contract.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.agenerate("Summarize the French Revolution.")
>>> result.output_text

agenerate_batch async

agenerate_batch(
    input_batch: Sequence[GenerateInput],
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnGenerationResult = None,
    **kwargs: Any,
) -> GenerationBatchResult

Generate a batch of responses asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[GenerateInput]

Follows the same contract as generate_batch.

required
endpoint EndpointType | None

Follows the same contract as generate_batch.

None
response_format type[BaseModel] | dict[str, Any] | None

Follows the same contract as generate_batch.

None
parse_output bool

Follows the same contract as generate_batch.

False
**kwargs Any

Follows the same contract as generate_batch.

{}
return_exceptions bool

Capture per-item failures in errors when True. When False, the first failure cancels the remaining tasks and is raised.

True
on_progress callable | None

Optional callbacks invoked as items finish.

None
on_result callable | None

Optional callbacks invoked as items finish.

None

Returns:

Type Description
GenerationBatchResult

Batch-sized results and errors collections aligned to the original input order.

Notes

For large or memory-sensitive batch runs, set max_parallel_requests on the client. When it is unset, agenerate_batch may start work for the full batch up front.

Raises:

Type Description
ValueError

If any input item is invalid for the selected endpoint.

Exception

The first provider error when return_exceptions is False.

Examples:

>>> batch = await client.agenerate_batch(
...     ["ELI5: Artificial Intelligence", "ELI5: Quantum Computing"]
... )
>>> len(batch.results)
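The return_exceptions=True behavior resembles settling each item independently and splitting the outcomes into aligned results and errors lists; a simplified sketch:

```python
import asyncio


async def settle_batch(items, worker):
    """Capture per-item failures instead of aborting the batch (sketch)."""
    outcomes = await asyncio.gather(
        *(worker(i) for i in items), return_exceptions=True
    )
    # One slot per input: a result or None, and an error or None.
    results = [o if not isinstance(o, Exception) else None for o in outcomes]
    errors = [o if isinstance(o, Exception) else None for o in outcomes]
    return results, errors


async def flaky(item: str) -> str:
    if item == "bad":
        raise ValueError(item)
    return item.upper()


results, errors = asyncio.run(settle_batch(["ok", "bad"], flaky))
print(results)  # ['OK', None]
```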

embed

embed(input_data: str, **kwargs: Any) -> EmbeddingResult

Create an embedding for a single text string synchronously.

Parameters:

Name Type Description Default
input_data str

Text to embed.

required
**kwargs Any

Additional LiteLLM embedding kwargs such as dimension hints.

{}

Returns:

Type Description
EmbeddingResult

One embedding vector plus request metadata and token-usage details.

Raises:

Type Description
ValueError

If the provider response does not contain an embedding vector.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.embed("The quick brown fox")
>>> len(result.embedding)

embed_batch

embed_batch(
    input_batch: Sequence[str],
    *,
    micro_batch_size: int = 32,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnEmbeddingResult = None,
    **kwargs: Any,
) -> EmbeddingBatchResult

Create embeddings for a batch of text strings synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[str]

Text values to embed together.

required
micro_batch_size int

Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order.

32
return_exceptions bool

When True, failures are isolated per input item and captured in errors. When False, the first terminal failure cancels the remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM embedding kwargs applied to the batch request.

{}

Returns:

Type Description
EmbeddingBatchResult

A result slot for each input string, aligned to the original order.

Raises:

Type Description
ValueError

If micro_batch_size is not a positive integer.

Exception

The provider error when return_exceptions is False.

Examples:

>>> batch = client.embed_batch(["Queen", "King", "Card", "Ace"])
>>> [len(item.embedding) if item else None for item in batch.results]
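The micro-batching described above amounts to chunking the inputs and stitching provider responses back in input order; a sketch using a fake provider call:

```python
def micro_batches(items, size):
    """Split a sequence into contiguous chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def embed_many(texts, micro_batch_size=2):
    def provider_call(chunk):
        # Stand-in for one provider embedding request per micro-batch;
        # real embeddings replaced by fake 1-d vectors for illustration.
        return [[float(len(t))] for t in chunk]

    out = []
    for chunk in micro_batches(texts, micro_batch_size):
        out.extend(provider_call(chunk))  # stitched back in input order
    return out


print(embed_many(["Queen", "King", "Card", "Ace"]))
# [[5.0], [4.0], [4.0], [3.0]]
```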

aembed async

aembed(input_data: str, **kwargs: Any) -> EmbeddingResult

Create an embedding for a single text string asynchronously.

Parameters:

Name Type Description Default
input_data str

Follows the same contract as embed.

required
**kwargs Any

Follows the same contract as embed.

{}

Returns:

Type Description
EmbeddingResult

The embedding vector and request metadata for one text value.

Raises:

Type Description
ValueError

If the provider response does not contain an embedding vector.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.aembed("The quick brown fox")
>>> len(result.embedding)

aembed_batch async

aembed_batch(
    input_batch: Sequence[str],
    *,
    micro_batch_size: int = 32,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnEmbeddingResult = None,
    **kwargs: Any,
) -> EmbeddingBatchResult

Create embeddings for a batch of text strings asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[str]

Text values to embed together.

required
micro_batch_size int

Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order.

32
return_exceptions bool

When True, failures are isolated per input item and captured in errors. When False, the first terminal failure cancels the remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM embedding kwargs applied to the batch request.

{}

Returns:

Type Description
EmbeddingBatchResult

One result slot per input string, aligned to the original order.

Raises:

Type Description
ValueError

If micro_batch_size is not a positive integer.

Exception

The provider error when return_exceptions is False.

Examples:

>>> batch = await client.aembed_batch(["Queen", "King", "Card", "Ace"])
>>> len(batch.results)

transcribe

transcribe(
    input_data: TranscriptionInput,
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    **kwargs: Any,
) -> TranscriptionResult

Transcribe one audio input synchronously.

Parameters:

Name Type Description Default
input_data TranscriptionInput

Local path, raw bytes, or a binary file-like object containing audio data.

required
max_transcription_bytes int | None

Defensive size limit applied before the request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because the client may read and send very large audio files in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionResult

The transcript text plus request metadata, language, and duration when the provider returns them.

Raises:

Type Description
ValueError

If the input cannot be normalized or exceeds max_transcription_bytes.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.transcribe("sample.wav")
>>> result.text

transcribe_batch

transcribe_batch(
    input_batch: Sequence[TranscriptionInput],
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnTranscriptionResult = None,
    **kwargs: Any,
) -> TranscriptionBatchResult

Transcribe a batch of audio inputs synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[TranscriptionInput]

Ordered audio inputs to transcribe.

required
max_transcription_bytes int | None

Defensive size limit applied before each request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because each admitted item may be read and sent in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
return_exceptions bool

When True, per-item failures are captured in errors and successful siblings still complete. When False, the first failure cancels the rest and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionBatchResult

A result slot for each input item, aligned to the original order.

Raises:

Type Description
ValueError

If any input cannot be normalized or exceeds max_transcription_bytes.

Exception

The first terminal provider or normalization error when return_exceptions is False.

Examples:

>>> batch = client.transcribe_batch(["a.wav", "b.wav"])
>>> [item.text if item else None for item in batch.results]

atranscribe async

atranscribe(
    input_data: TranscriptionInput,
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    **kwargs: Any,
) -> TranscriptionResult

Transcribe one audio input asynchronously.

Parameters:

Name Type Description Default
input_data TranscriptionInput

Local path, raw bytes, or a binary file-like object containing audio data.

required
max_transcription_bytes int | None

Defensive size limit applied before the request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because the client may read and send very large audio files in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionResult

The transcript text and request metadata for one audio input.

Raises:

Type Description
ValueError

If the input cannot be normalized or exceeds the configured size limit.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.atranscribe("sample.wav")
>>> result.text

atranscribe_batch async

atranscribe_batch(
    input_batch: Sequence[TranscriptionInput],
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnTranscriptionResult = None,
    **kwargs: Any,
) -> TranscriptionBatchResult

Transcribe a batch of audio inputs asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[TranscriptionInput]

Ordered audio inputs to transcribe.

required
max_transcription_bytes int | None

Defensive size limit applied before each request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because each admitted item may be read and sent in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
return_exceptions bool

When True, per-item failures are captured in errors and successful siblings still complete. When False, the first failure cancels the rest and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionBatchResult

One result slot per input item, aligned to the original order.

Raises:

Type Description
ValueError

If any input cannot be normalized or exceeds the configured size limit.

Exception

The first terminal provider or normalization error when return_exceptions is False.

Examples:

>>> batch = await client.atranscribe_batch(["a.wav", "b.wav"])
>>> len(batch.results)

__enter__

__enter__() -> LMClient

Enter a synchronous context-manager scope.

__exit__

__exit__(exc_type: Any, exc: Any, traceback: Any) -> None

Exit a synchronous context-manager scope.

__aenter__ async

__aenter__() -> LMClient

Enter an asynchronous context-manager scope.

__aexit__ async

__aexit__(exc_type: Any, exc: Any, traceback: Any) -> None

Exit an asynchronous context-manager scope.

__del__

__del__() -> None

Best-effort cleanup for interpreter shutdown.