LMClient

infermesh.LMClient

LMClient(
    *,
    model: str | None = None,
    api_base: str | None = None,
    api_key: str | None = None,
    deployments: dict[
        str, DeploymentConfig | dict[str, Any]
    ]
    | None = None,
    endpoint: EndpointType = "chat_completion",
    max_parallel_requests: int | None = None,
    rpm: int | None = None,
    tpm: int | None = None,
    rpd: int | None = None,
    tpd: int | None = None,
    max_request_burst: int | None = None,
    max_token_burst: int | None = None,
    header_bucket_scope: Literal[
        "minute", "day", "auto"
    ] = "auto",
    default_output_tokens: int = 0,
    timeout: float | None = None,
    max_retries: int = 3,
    default_request_kwargs: dict[str, Any] | None = None,
    routing_strategy: str = "simple-shuffle",
    router_kwargs: dict[str, Any] | None = None,
)

Bases: _ClientRuntimeMixin

Batch-friendly language-model interface built on LiteLLM.

LMClient supports two operating modes selected at construction time:

Single-endpoint mode — one model, one server. Provide model and api_base:

client = LMClient(
    model="openai/gpt-4.1-mini",
    api_base="https://api.openai.com/v1",
)
result = client.generate("What is the capital of France?")
print(result.output_text)  # "Paris"
client.close()

Router mode — multiple replicas, load-balanced by a LiteLLM Router. Provide model (the logical model name) and deployments (a dict of free-form label → DeploymentConfig):

client = LMClient(
    model="llama-3",
    deployments={
        "gpu-0": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://gpu0:8000/v1",
        ),
        "gpu-1": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://gpu1:8000/v1",
        ),
    },
)

If you only need a few single requests, plain LiteLLM or the provider SDK is usually simpler. LMClient becomes useful when you need concurrent batches, per-item failure handling, client-side rate limiting, or routing across several replicas of the same logical model.

The client can be used as a context manager (sync or async) to ensure close is always called:

with LMClient(
    model="openai/gpt-4o", api_base="https://api.openai.com/v1"
) as client:
    batch = client.generate_batch(prompts)

The sync methods delegate to their async counterparts through a managed background loop, which keeps notebook and REPL usage simple while preserving retries, throttling, and batching behavior.
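The delegation pattern can be sketched with a small stand-in; this SyncRunner is illustrative only and is not the client's actual implementation:

```python
import asyncio
import threading


class SyncRunner:
    """Minimal sketch of a background-loop runner (illustrative only)."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # A daemon thread keeps the loop alive until close() is called.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit the coroutine to the background loop and block for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def agenerate(prompt: str) -> str:
    return f"echo: {prompt}"


runner = SyncRunner()
print(runner.run(agenerate("hi")))  # echo: hi
runner.close()
```

Because the caller blocks on `future.result()`, sync calls behave like ordinary function calls even inside notebooks that already run their own event loop.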

Notes

Always call close (or use the context-manager form) when the client is no longer needed. close stops the background SyncRunner thread; failing to call it leaves a daemon thread running until process exit.

A single RateLimiter instance is shared between the caller's event loop and the SyncRunner background loop, so sync and async calls are accounted together and do not double the effective rate.

See Also

DeploymentConfig : Per-replica configuration for router mode.

RateLimiter : The rate-limiter used internally; can also be used standalone.

Create an LMClient instance.

Parameters:

Name Type Description Default
model str | None

Provider model name used for all requests. Required in both single-endpoint and router modes; in router mode it is the logical model name that the configured deployments resolve to.

None
api_base str | None

Base URL of the provider endpoint for direct, single-endpoint usage. Leave unset when routing through deployments.

None
api_key str | None

API key for direct, single-endpoint usage. Leave unset when routing through deployments.

None
deployments dict[str, DeploymentConfig | dict[str, Any]] | None

Named deployment definitions used for router mode. Each deployment can override model, API base, API key, and provider-specific kwargs.

None
endpoint EndpointType

Default generation endpoint used by generate and generate_batch unless a per-call override is supplied.

"chat_completion"
max_parallel_requests int | None

Per-event-loop cap on concurrent in-flight requests. When set, generate_batch and agenerate_batch also admit generation work through a bounded in-flight window instead of creating one task per item up front. Must be None or a positive integer.

None
rpm int | None

Maximum requests per minute enforced client-side. When any rate limit is set, the client creates a shared limiter used by sync and async methods.

None
tpm int | None

Maximum tokens per minute enforced client-side by the shared limiter.

None
rpd int | None

Maximum requests per day enforced client-side by the shared limiter.

None
tpd int | None

Maximum tokens per day enforced client-side by the shared limiter.

None
max_request_burst int | None

Maximum request burst allowed above the steady rate by the client-side rate limiter.

None
max_token_burst int | None

Maximum token burst allowed above the steady rate by the client-side rate limiter.

None
header_bucket_scope ('minute', 'day', 'auto')

How provider rate-limit headers are interpreted when updating limiter state after a request.

"auto"
default_output_tokens int

Default completion-token budget used when estimating token reservations for rate limiting.

0
timeout float | None

Default request timeout forwarded to LiteLLM unless a per-call timeout is supplied in kwargs.

None
max_retries int

Number of retry attempts for transient provider failures such as rate limits, timeouts, and internal server errors.

3
default_request_kwargs dict[str, Any] | None

Request kwargs merged into every provider call.

None
routing_strategy str

Router selection strategy used when deployments are configured.

"simple-shuffle"
router_kwargs dict[str, Any] | None

Extra kwargs forwarded when building the LiteLLM router.

None

Warns:

Type Description
UserWarning

A warning is logged when api_base uses insecure HTTP for a non-local host.

Raises:

Type Description
ValueError

If model is missing, endpoint is invalid, deployment mode is mixed with direct api_base/api_key settings, or max_parallel_requests is not a positive integer.

Examples:

>>> client = LMClient(
...     model="openai/gpt-4o-mini",
...     api_base="https://api.openai.com/v1",
...     timeout=30,
... )
>>> client.close()

close

close() -> None

Release background resources used by the synchronous API.

generate, embed, and transcribe run on a managed background event loop. Call close when you are finished with the client, or prefer with / async with so cleanup happens automatically.

generate

generate(
    input_data: GenerateInput,
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    **kwargs: Any,
) -> GenerationResult

Generate one response on the notebook-safe background loop.

Parameters:

Name Type Description Default
input_data GenerateInput

Prompt text, a chat-style message list, or a responses payload. The accepted shape depends on the selected endpoint.

required
endpoint EndpointType | None

Per-call override for the generation endpoint. Defaults to the client-wide endpoint set at construction time.

None
response_format type[BaseModel] | dict[str, Any] | None

Structured output target. Pass a Pydantic model class or provider schema mapping to parse the response into output_parsed.

None
parse_output bool

When True, attempt to parse structured output even without an explicit response_format.

False
**kwargs Any

Additional LiteLLM request kwargs such as temperature or max_tokens.

{}

Returns:

Type Description
GenerationResult

The generated text, optional parsed output, token usage, request id, finish reason, and timing metrics.

Raises:

Type Description
ValueError

If input_data does not match the selected endpoint contract.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.generate("Summarize the French Revolution.")
>>> result.output_text
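When response_format is a Pydantic model class, output_parsed holds a validated instance. The validation step is roughly the following; City is a hypothetical model used only for illustration:

```python
from pydantic import BaseModel


class City(BaseModel):
    name: str
    country: str


# The provider returns JSON text; the client validates it into the model.
raw = '{"name": "Paris", "country": "France"}'
parsed = City.model_validate_json(raw)
print(parsed.name)  # Paris
```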

generate_batch

generate_batch(
    input_batch: Sequence[GenerateInput],
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnGenerationResult = None,
    **kwargs: Any,
) -> GenerationBatchResult

Generate a batch of responses synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[GenerateInput]

Ordered inputs to run. Each item may be a prompt string, chat message list, or responses payload.

required
endpoint EndpointType | None

Per-call override for the generation endpoint, applied to every item in the batch. Defaults to the client-wide endpoint.

None
response_format type[BaseModel] | dict[str, Any] | None

Behaves the same as in generate and is applied to every item in the batch.

None
parse_output bool

Behaves the same as in generate and is applied to every item in the batch.

False
**kwargs Any

Additional LiteLLM request kwargs applied to every item in the batch.

{}
return_exceptions bool

When True, item failures are captured in errors and the rest of the batch keeps running. When False, the first failure cancels remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item finishes.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each completed item.

None

Returns:

Type Description
GenerationBatchResult

A batch result with one slot per input item. Successful items appear in results and failures, when captured, appear in errors.

Notes

For large or memory-sensitive batch runs, set max_parallel_requests on the client. When it is unset, generate_batch may start work for the full batch up front.
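The effect of the concurrency cap can be sketched with an asyncio.Semaphore; unlike the real client, this simplified version still creates all tasks eagerly:

```python
import asyncio


async def run_bounded(items, worker, max_parallel: int):
    """Run worker(item) for each item with at most max_parallel in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # gather preserves input order in the returned results.
    return await asyncio.gather(*(guarded(i) for i in items))


async def fake_generate(prompt: str) -> str:
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_generate, max_parallel=2))
print(results)  # ['A', 'B', 'C']
```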

Raises:

Type Description
ValueError

If any batch item has an invalid input shape.

Exception

The first provider error when return_exceptions is False.

Examples:

>>> batch = client.generate_batch(
...     ["ELI5: Artificial Intelligence", "ELI5: Quantum Computing"]
... )
>>> [item.output_text if item else None for item in batch.results]

agenerate async

agenerate(
    input_data: GenerateInput,
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    **kwargs: Any,
) -> GenerationResult

Generate one response asynchronously.

Parameters:

Name Type Description Default
input_data GenerateInput

Follows the same contract as generate.

required
endpoint EndpointType | None

Follows the same contract as generate.

None
response_format type[BaseModel] | dict[str, Any] | None

Follows the same contract as generate.

None
parse_output bool

Follows the same contract as generate.

False
**kwargs Any

Follows the same contract as generate.

{}

Returns:

Type Description
GenerationResult

The generated text, structured output if requested, and request metadata for a single input.

Raises:

Type Description
ValueError

If input_data does not match the selected endpoint contract.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.agenerate("Summarize the French Revolution.")
>>> result.output_text

agenerate_batch async

agenerate_batch(
    input_batch: Sequence[GenerateInput],
    *,
    endpoint: EndpointType | None = None,
    response_format: type[BaseModel]
    | dict[str, Any]
    | None = None,
    parse_output: bool = False,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnGenerationResult = None,
    **kwargs: Any,
) -> GenerationBatchResult

Generate a batch of responses asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[GenerateInput]

Follows the same contract as generate_batch.

required
endpoint EndpointType | None

Follows the same contract as generate_batch.

None
response_format type[BaseModel] | dict[str, Any] | None

Follows the same contract as generate_batch.

None
parse_output bool

Follows the same contract as generate_batch.

False
**kwargs Any

Follows the same contract as generate_batch.

{}
return_exceptions bool

Capture per-item failures in errors when True. When False, the first failure cancels the remaining tasks and is raised.

True
on_progress callable | None

Optional callbacks invoked as items finish.

None
on_result callable | None

Optional callbacks invoked as items finish.

None

Returns:

Type Description
GenerationBatchResult

Batch-sized results and errors collections aligned to the original input order.

Notes

For large or memory-sensitive batch runs, set max_parallel_requests on the client. When it is unset, agenerate_batch may start work for the full batch up front.

Raises:

Type Description
ValueError

If any input item is invalid for the selected endpoint.

Exception

The first provider error when return_exceptions is False.

Examples:

>>> batch = await client.agenerate_batch(
...     ["ELI5: Artificial Intelligence", "ELI5: Quantum Computing"]
... )
>>> len(batch.results)
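The return_exceptions=True behavior resembles settling each item independently and splitting the outcomes into aligned results and errors lists; a simplified sketch:

```python
import asyncio


async def settle_batch(items, worker):
    """Capture per-item failures instead of aborting the batch (sketch)."""
    outcomes = await asyncio.gather(
        *(worker(i) for i in items), return_exceptions=True
    )
    # One slot per input: a result or None, and an error or None.
    results = [o if not isinstance(o, Exception) else None for o in outcomes]
    errors = [o if isinstance(o, Exception) else None for o in outcomes]
    return results, errors


async def flaky(item: str) -> str:
    if item == "bad":
        raise ValueError(item)
    return item.upper()


results, errors = asyncio.run(settle_batch(["ok", "bad"], flaky))
print(results)  # ['OK', None]
```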

embed

embed(input_data: str, **kwargs: Any) -> EmbeddingResult

Create an embedding for a single text string synchronously.

Parameters:

Name Type Description Default
input_data str

Text to embed.

required
**kwargs Any

Additional LiteLLM embedding kwargs such as dimension hints.

{}

Returns:

Type Description
EmbeddingResult

One embedding vector plus request metadata and token-usage details.

Raises:

Type Description
ValueError

If the provider response does not contain an embedding vector.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.embed("The quick brown fox")
>>> len(result.embedding)

embed_batch

embed_batch(
    input_batch: Sequence[str],
    *,
    micro_batch_size: int = 32,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnEmbeddingResult = None,
    **kwargs: Any,
) -> EmbeddingBatchResult

Create embeddings for a batch of text strings synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[str]

Text values to embed together.

required
micro_batch_size int

Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order.

32
return_exceptions bool

When True, failures are isolated per input item and captured in errors. When False, the first terminal failure cancels the remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM embedding kwargs applied to the batch request.

{}

Returns:

Type Description
EmbeddingBatchResult

A result slot for each input string, aligned to the original order.

Raises:

Type Description
ValueError

If micro_batch_size is not a positive integer.

Exception

The provider error when return_exceptions is False.

Examples:

>>> batch = client.embed_batch(["Queen", "King", "Card", "Ace"])
>>> [len(item.embedding) if item else None for item in batch.results]
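The micro-batching described above amounts to chunking the inputs and stitching provider responses back in input order; a sketch using a fake provider call:

```python
def micro_batches(items, size):
    """Split a sequence into contiguous chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def embed_many(texts, micro_batch_size=2):
    def provider_call(chunk):
        # Stand-in for one provider embedding request per micro-batch;
        # real embeddings replaced by fake 1-d vectors for illustration.
        return [[float(len(t))] for t in chunk]

    out = []
    for chunk in micro_batches(texts, micro_batch_size):
        out.extend(provider_call(chunk))  # stitched back in input order
    return out


print(embed_many(["Queen", "King", "Card", "Ace"]))
# [[5.0], [4.0], [4.0], [3.0]]
```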

aembed async

aembed(input_data: str, **kwargs: Any) -> EmbeddingResult

Create an embedding for a single text string asynchronously.

Parameters:

Name Type Description Default
input_data str

Follows the same contract as embed.

required
**kwargs Any

Follows the same contract as embed.

{}

Returns:

Type Description
EmbeddingResult

The embedding vector and request metadata for one text value.

Raises:

Type Description
ValueError

If the provider response does not contain an embedding vector.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.aembed("The quick brown fox")
>>> len(result.embedding)

aembed_batch async

aembed_batch(
    input_batch: Sequence[str],
    *,
    micro_batch_size: int = 32,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnEmbeddingResult = None,
    **kwargs: Any,
) -> EmbeddingBatchResult

Create embeddings for a batch of text strings asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[str]

Text values to embed together.

required
micro_batch_size int

Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order.

32
return_exceptions bool

When True, failures are isolated per input item and captured in errors. When False, the first terminal failure cancels the remaining work and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM embedding kwargs applied to the batch request.

{}

Returns:

Type Description
EmbeddingBatchResult

One result slot per input string, aligned to the original order.

Raises:

Type Description
ValueError

If micro_batch_size is not a positive integer.

Exception

The provider error when return_exceptions is False.

Examples:

>>> batch = await client.aembed_batch(["Queen", "King", "Card", "Ace"])
>>> len(batch.results)

transcribe

transcribe(
    input_data: TranscriptionInput,
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    **kwargs: Any,
) -> TranscriptionResult

Transcribe one audio input synchronously.

Parameters:

Name Type Description Default
input_data TranscriptionInput

Local path, raw bytes, or a binary file-like object containing audio data.

required
max_transcription_bytes int | None

Defensive size limit applied before the request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because the client may read and send very large audio files in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionResult

The transcript text plus request metadata, language, and duration when the provider returns them.

Raises:

Type Description
ValueError

If the input cannot be normalized or exceeds max_transcription_bytes.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = client.transcribe("sample.wav")
>>> result.text

transcribe_batch

transcribe_batch(
    input_batch: Sequence[TranscriptionInput],
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnTranscriptionResult = None,
    **kwargs: Any,
) -> TranscriptionBatchResult

Transcribe a batch of audio inputs synchronously.

Parameters:

Name Type Description Default
input_batch Sequence[TranscriptionInput]

Ordered audio inputs to transcribe.

required
max_transcription_bytes int | None

Defensive size limit applied before each request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because each admitted item may be read and sent in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
return_exceptions bool

When True, per-item failures are captured in errors and successful siblings still complete. When False, the first failure cancels the rest and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionBatchResult

A result slot for each input item, aligned to the original order.

Raises:

Type Description
ValueError

If any input cannot be normalized or exceeds max_transcription_bytes.

Exception

The first terminal provider or normalization error when return_exceptions is False.

Examples:

>>> batch = client.transcribe_batch(["a.wav", "b.wav"])
>>> [item.text if item else None for item in batch.results]

atranscribe async

atranscribe(
    input_data: TranscriptionInput,
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    **kwargs: Any,
) -> TranscriptionResult

Transcribe one audio input asynchronously.

Parameters:

Name Type Description Default
input_data TranscriptionInput

Local path, raw bytes, or a binary file-like object containing audio data.

required
max_transcription_bytes int | None

Defensive size limit applied before the request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because the client may read and send very large audio files in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionResult

The transcript text and request metadata for one audio input.

Raises:

Type Description
ValueError

If the input cannot be normalized or exceeds the configured size limit.

Exception

Re-raises provider errors after retries are exhausted.

Examples:

>>> result = await client.atranscribe("sample.wav")
>>> result.text

atranscribe_batch async

atranscribe_batch(
    input_batch: Sequence[TranscriptionInput],
    *,
    max_transcription_bytes: int
    | None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
    return_exceptions: bool = True,
    on_progress: Callable[[int, int], None] | None = None,
    on_result: OnTranscriptionResult = None,
    **kwargs: Any,
) -> TranscriptionBatchResult

Transcribe a batch of audio inputs asynchronously.

Parameters:

Name Type Description Default
input_batch Sequence[TranscriptionInput]

Ordered audio inputs to transcribe.

required
max_transcription_bytes int | None

Defensive size limit applied before each request is sent. Pass None to disable the check. Use that override only in trusted environments where the server is expected to accept larger uploads, because each admitted item may be read and sent in full.

DEFAULT_MAX_TRANSCRIPTION_BYTES
return_exceptions bool

When True, per-item failures are captured in errors and successful siblings still complete. When False, the first failure cancels the rest and is raised.

True
on_progress callable | None

Callback invoked as on_progress(completed, total) each time an item settles.

None
on_result callable | None

Callback invoked as on_result(index, result, error) for each settled item.

None
**kwargs Any

Additional LiteLLM transcription kwargs such as language hints.

{}

Returns:

Type Description
TranscriptionBatchResult

One result slot per input item, aligned to the original order.

Raises:

Type Description
ValueError

If any input cannot be normalized or exceeds the configured size limit.

Exception

The first terminal provider or normalization error when return_exceptions is False.

Examples:

>>> batch = await client.atranscribe_batch(["a.wav", "b.wav"])
>>> len(batch.results)

__enter__

__enter__() -> LMClient

Enter a synchronous context-manager scope.

__exit__

__exit__(exc_type: Any, exc: Any, traceback: Any) -> None

Exit a synchronous context-manager scope.

__aenter__ async

__aenter__() -> LMClient

Enter an asynchronous context-manager scope.

__aexit__ async

__aexit__(exc_type: Any, exc: Any, traceback: Any) -> None

Exit an asynchronous context-manager scope.

__del__

__del__() -> None

Best-effort cleanup for interpreter shutdown.