LMClient
infermesh.LMClient
LMClient(
*,
model: str | None = None,
api_base: str | None = None,
api_key: str | None = None,
deployments: dict[
str, DeploymentConfig | dict[str, Any]
]
| None = None,
endpoint: EndpointType = "chat_completion",
max_parallel_requests: int | None = None,
rpm: int | None = None,
tpm: int | None = None,
rpd: int | None = None,
tpd: int | None = None,
max_request_burst: int | None = None,
max_token_burst: int | None = None,
header_bucket_scope: Literal[
"minute", "day", "auto"
] = "auto",
default_output_tokens: int = 0,
timeout: float | None = None,
max_retries: int = 3,
default_request_kwargs: dict[str, Any] | None = None,
routing_strategy: str = "simple-shuffle",
router_kwargs: dict[str, Any] | None = None,
)
Bases: _ClientRuntimeMixin
Batch-friendly language-model interface built on LiteLLM.
LMClient supports two operating modes selected at construction time:
Single-endpoint mode — one model, one server. Provide model and
api_base:
client = LMClient(
model="openai/gpt-4.1-mini",
api_base="https://api.openai.com/v1",
)
result = client.generate("What is the capital of France?")
print(result.output_text) # "Paris"
client.close()
Router mode — multiple replicas, load-balanced by a LiteLLM Router.
Provide model (the logical model name) and deployments (a dict of
free-form label → DeploymentConfig):
client = LMClient(
model="llama-3",
deployments={
"gpu-0": DeploymentConfig(
model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
api_base="http://gpu0:8000/v1",
),
"gpu-1": DeploymentConfig(
model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
api_base="http://gpu1:8000/v1",
),
},
)
If you only need a few single requests, plain LiteLLM or the provider SDK is usually simpler. LMClient becomes useful when you need concurrent batches, per-item failure handling, client-side rate limiting, or routing across several replicas of the same logical model.
The client can be used as a context manager (sync or async) to ensure close is always called:
with LMClient(
model="openai/gpt-4o", api_base="https://api.openai.com/v1"
) as client:
batch = client.generate_batch(prompts)
The sync methods delegate to their async counterparts through a managed background loop, which keeps notebook and REPL usage simple while preserving retries, throttling, and batching behavior.
Notes
Always call close (or use the context-manager form)
when the client is no longer needed. close stops
the background SyncRunner thread; failing to call it leaves a daemon thread
running until process exit.
A single RateLimiter instance is shared between
the caller's event loop and the SyncRunner background loop, so sync and
async calls are accounted together and do not double the effective rate.
See Also
DeploymentConfig : Per-replica configuration for router mode.
RateLimiter : The rate-limiter used internally; can also be used standalone.
Create an LMClient instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| None` | Provider model name used for all requests. Required in both single-endpoint and router modes. | `None` |
| `api_base` | `str \| None` | Base URL for direct, single-endpoint usage. Leave unset when routing through `deployments`. | `None` |
| `api_key` | `str \| None` | API key for direct, single-endpoint usage. Leave unset when routing through `deployments`. | `None` |
| `deployments` | `dict[str, DeploymentConfig \| dict[str, Any]] \| None` | Named deployment definitions used for router mode. Each deployment can override model, API base, API key, and provider-specific kwargs. | `None` |
| `endpoint` | `EndpointType` | Default generation endpoint; individual calls can override it. | `"chat_completion"` |
| `max_parallel_requests` | `int \| None` | Per-event-loop cap on concurrent in-flight requests. When set, batch methods start at most this many requests at a time. | `None` |
| `rpm` | `int \| None` | Client-side requests-per-minute limit. When any limit is set, the client creates a shared limiter used by sync and async methods. | `None` |
| `tpm` | `int \| None` | Client-side tokens-per-minute limit; shares the same limiter. | `None` |
| `rpd` | `int \| None` | Client-side requests-per-day limit; shares the same limiter. | `None` |
| `tpd` | `int \| None` | Client-side tokens-per-day limit; shares the same limiter. | `None` |
| `max_request_burst` | `int \| None` | Request-burst allowance applied by the client-side rate limiter. | `None` |
| `max_token_burst` | `int \| None` | Token-burst allowance applied by the client-side rate limiter. | `None` |
| `header_bucket_scope` | `Literal["minute", "day", "auto"]` | How provider rate-limit headers are interpreted when updating limiter state after a request. | `"auto"` |
| `default_output_tokens` | `int` | Default completion-token budget used when estimating token reservations for rate limiting. | `0` |
| `timeout` | `float \| None` | Default request timeout forwarded to LiteLLM unless a per-call timeout is supplied in the request kwargs. | `None` |
| `max_retries` | `int` | Number of retry attempts for transient provider failures such as rate limits, timeouts, and internal server errors. | `3` |
| `default_request_kwargs` | `dict[str, Any] \| None` | Request kwargs merged into every provider call. | `None` |
| `routing_strategy` | `str` | Router selection strategy used when `deployments` is provided. | `"simple-shuffle"` |
| `router_kwargs` | `dict[str, Any] \| None` | Extra kwargs forwarded when building the LiteLLM router. | `None` |
Warns:
| Type | Description |
|---|---|
| `UserWarning` | A logger warning is emitted when parts of the supplied configuration are ignored for the selected mode. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the constructor arguments do not form a valid single-endpoint or router configuration. |
Examples:
>>> client = LMClient(
... model="openai/gpt-4o-mini",
... api_base="https://api.openai.com/v1",
... timeout=30,
... )
>>> client.close()
close
Release background resources used by the synchronous API.
generate, embed, and transcribe run on a managed background
event loop. Call close when you are finished
with the client, or prefer with / async with so cleanup happens
automatically.
generate
generate(
input_data: GenerateInput,
*,
endpoint: EndpointType | None = None,
response_format: type[BaseModel]
| dict[str, Any]
| None = None,
parse_output: bool = False,
**kwargs: Any,
) -> GenerationResult
Generate one response on the notebook-safe background loop.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `GenerateInput` | Prompt text, a chat-style message list, or any other supported `GenerateInput` value. | required |
| `endpoint` | `EndpointType \| None` | Per-call override for the generation endpoint. Defaults to the client-wide `endpoint` setting. | `None` |
| `response_format` | `type[BaseModel] \| dict[str, Any] \| None` | Structured output target. Pass a Pydantic model class or provider schema mapping to parse the response into structured output. | `None` |
| `parse_output` | `bool` | When `True`, parse the structured response into the result's parsed output. | `False` |
| `**kwargs` | `Any` | Additional LiteLLM request kwargs such as temperature or max tokens. | `{}` |

Returns:
| Type | Description |
|---|---|
| `GenerationResult` | The generated text, optional parsed output, token usage, request id, finish reason, and timing metrics. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If `input_data` has an invalid shape for the selected endpoint. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
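A minimal sketch of a structured-output call. The model name, endpoint URL, and the `City` schema are illustrative placeholders:

```python
from pydantic import BaseModel

from infermesh import LMClient


class City(BaseModel):
    # Placeholder schema for structured output.
    name: str
    country: str


with LMClient(
    model="openai/gpt-4o-mini",  # placeholder model
    api_base="https://api.openai.com/v1",
) as client:
    result = client.generate(
        "Name one European capital as JSON.",
        response_format=City,
        parse_output=True,
        temperature=0.0,
    )
    print(result.output_text)
```

Passing a Pydantic class as `response_format` with `parse_output=True` asks the client to validate the response against the schema in addition to returning the raw text.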
generate_batch
generate_batch(
input_batch: Sequence[GenerateInput],
*,
endpoint: EndpointType | None = None,
response_format: type[BaseModel]
| dict[str, Any]
| None = None,
parse_output: bool = False,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnGenerationResult = None,
**kwargs: Any,
) -> GenerationBatchResult
Generate a batch of responses synchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[GenerateInput]` | Ordered inputs to run. Each item may be a prompt string, a chat message list, or any other supported `GenerateInput` value. | required |
| `endpoint` | `EndpointType \| None` | Behaves the same as in `generate`. | `None` |
| `response_format` | `type[BaseModel] \| dict[str, Any] \| None` | Behaves the same as in `generate`. | `None` |
| `parse_output` | `bool` | Behaves the same as in `generate`. | `False` |
| `return_exceptions` | `bool` | When `True`, per-item failures are captured in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Callback invoked after each item finishes with the completed and total counts. | `None` |
| `on_result` | `OnGenerationResult` | Callback invoked with each item's result as it completes. | `None` |
| `**kwargs` | `Any` | Behave the same as in `generate`. | `{}` |

Returns:
| Type | Description |
|---|---|
| `GenerationBatchResult` | A batch result with one slot per input item, aligned to input order. Failed slots hold the captured exception when `return_exceptions` is `True`. |
Notes
For large or memory-sensitive Python batch runs, set
max_parallel_requests on the client. When it is unset,
generate_batch may start work for the full batch up front.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If any batch item has an invalid input shape. |
| `Exception` | The first provider error when `return_exceptions` is `False`. |
Examples:
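A hedged sketch of a bounded batch run. The prompts and model are placeholders, and the progress callback assumes `(completed, total)` counts per the `Callable[[int, int], None]` annotation:

```python
from infermesh import LMClient

prompts = [f"Summarize topic {i} in one sentence." for i in range(20)]

with LMClient(
    model="openai/gpt-4o-mini",  # placeholder model
    api_base="https://api.openai.com/v1",
    max_parallel_requests=4,  # bound in-flight work for large batches
) as client:
    batch = client.generate_batch(
        prompts,
        return_exceptions=True,  # keep per-item failures in the result
        on_progress=lambda done, total: print(f"{done}/{total} complete"),
        max_tokens=64,
    )
```

With `return_exceptions=True`, one failing prompt does not abort the other nineteen; the failure is captured in that item's result slot.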
agenerate
async
agenerate(
input_data: GenerateInput,
*,
endpoint: EndpointType | None = None,
response_format: type[BaseModel]
| dict[str, Any]
| None = None,
parse_output: bool = False,
**kwargs: Any,
) -> GenerationResult
Generate one response asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `GenerateInput` | Follows the same contract as `generate`. | required |
| `endpoint` | `EndpointType \| None` | Follows the same contract as `generate`. | `None` |
| `response_format` | `type[BaseModel] \| dict[str, Any] \| None` | Follows the same contract as `generate`. | `None` |
| `parse_output` | `bool` | Follows the same contract as `generate`. | `False` |
| `**kwargs` | `Any` | Follows the same contract as `generate`. | `{}` |

Returns:
| Type | Description |
|---|---|
| `GenerationResult` | The generated text, structured output if requested, and request metadata for a single input. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If `input_data` has an invalid shape for the selected endpoint. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
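A minimal async sketch (placeholder model and endpoint):

```python
import asyncio

from infermesh import LMClient


async def main() -> None:
    # The async context manager ensures close runs even on errors.
    async with LMClient(
        model="openai/gpt-4o-mini",  # placeholder model
        api_base="https://api.openai.com/v1",
    ) as client:
        result = await client.agenerate("Say hello in French.")
        print(result.output_text)


asyncio.run(main())
```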
agenerate_batch
async
agenerate_batch(
input_batch: Sequence[GenerateInput],
*,
endpoint: EndpointType | None = None,
response_format: type[BaseModel]
| dict[str, Any]
| None = None,
parse_output: bool = False,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnGenerationResult = None,
**kwargs: Any,
) -> GenerationBatchResult
Generate a batch of responses asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[GenerateInput]` | Follows the same contract as `generate_batch`. | required |
| `endpoint` | `EndpointType \| None` | Follows the same contract as `generate_batch`. | `None` |
| `response_format` | `type[BaseModel] \| dict[str, Any] \| None` | Follows the same contract as `generate_batch`. | `None` |
| `parse_output` | `bool` | Follows the same contract as `generate_batch`. | `False` |
| `return_exceptions` | `bool` | Captures per-item failures in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Optional callback invoked as items finish. | `None` |
| `on_result` | `OnGenerationResult` | Optional callback invoked as items finish. | `None` |
| `**kwargs` | `Any` | Follows the same contract as `generate_batch`. | `{}` |

Returns:
| Type | Description |
|---|---|
| `GenerationBatchResult` | Batch-sized result with one slot per input item, aligned to input order. |
Notes
For large or memory-sensitive Python batch runs, set
max_parallel_requests on the client. When it is unset,
agenerate_batch may start work for the full batch up front.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If any input item is invalid for the selected endpoint. |
| `Exception` | The first provider error when `return_exceptions` is `False`. |
Examples:
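A hedged async batch sketch; the model, endpoint, and `rpm` value are placeholders:

```python
import asyncio

from infermesh import LMClient


async def main() -> None:
    prompts = ["Translate 'hello' to German.", "Translate 'hello' to Italian."]
    async with LMClient(
        model="openai/gpt-4o-mini",  # placeholder model
        api_base="https://api.openai.com/v1",
        rpm=60,  # client-side rate limit shared by sync and async calls
    ) as client:
        batch = await client.agenerate_batch(prompts, return_exceptions=True)


asyncio.run(main())
```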
embed
embed(input_data: str, **kwargs: Any) -> EmbeddingResult
Create an embedding for a single text string synchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `str` | Text to embed. | required |
| `**kwargs` | `Any` | Additional LiteLLM embedding kwargs such as dimension hints. | `{}` |

Returns:
| Type | Description |
|---|---|
| `EmbeddingResult` | One embedding vector plus request metadata and token-usage details. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the provider response does not contain an embedding vector. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
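A minimal sketch; the embedding model is a placeholder:

```python
from infermesh import LMClient

with LMClient(
    model="openai/text-embedding-3-small",  # placeholder embedding model
    api_base="https://api.openai.com/v1",
) as client:
    result = client.embed("The quick brown fox.")
    # EmbeddingResult carries one vector plus usage metadata; inspect the
    # object to see the exact attribute names in your version.
```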
embed_batch
embed_batch(
input_batch: Sequence[str],
*,
micro_batch_size: int = 32,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnEmbeddingResult = None,
**kwargs: Any,
) -> EmbeddingBatchResult
Create embeddings for a batch of text strings synchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[str]` | Text values to embed together. | required |
| `micro_batch_size` | `int` | Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order. | `32` |
| `return_exceptions` | `bool` | When `True`, per-item failures are captured in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Callback invoked with progress counts as work completes. | `None` |
| `on_result` | `OnEmbeddingResult` | Callback invoked with each result as it completes. | `None` |
| `**kwargs` | `Any` | Additional LiteLLM embedding kwargs applied to the batch request. | `{}` |

Returns:
| Type | Description |
|---|---|
| `EmbeddingBatchResult` | A result slot for each input string, aligned to the original order. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If a provider response does not contain the expected embedding vectors. |
| `Exception` | The provider error when `return_exceptions` is `False`. |
Examples:
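A hedged sketch of micro-batched embedding; the model is a placeholder:

```python
from infermesh import LMClient

texts = [f"document {i}" for i in range(100)]

with LMClient(
    model="openai/text-embedding-3-small",  # placeholder embedding model
    api_base="https://api.openai.com/v1",
) as client:
    batch = client.embed_batch(
        texts,
        micro_batch_size=25,  # 100 texts -> 4 provider requests
        return_exceptions=True,
    )
```

Results are stitched back together in input order regardless of how the micro-batches complete.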
aembed
async
aembed(input_data: str, **kwargs: Any) -> EmbeddingResult
Create an embedding for a single text string asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `str` | Follows the same contract as `embed`. | required |
| `**kwargs` | `Any` | Follows the same contract as `embed`. | `{}` |

Returns:
| Type | Description |
|---|---|
| `EmbeddingResult` | The embedding vector and request metadata for one text value. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the provider response does not contain an embedding vector. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
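A minimal async sketch (placeholder model and endpoint):

```python
import asyncio

from infermesh import LMClient


async def main() -> None:
    async with LMClient(
        model="openai/text-embedding-3-small",  # placeholder model
        api_base="https://api.openai.com/v1",
    ) as client:
        result = await client.aembed("hello world")


asyncio.run(main())
```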
aembed_batch
async
aembed_batch(
input_batch: Sequence[str],
*,
micro_batch_size: int = 32,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnEmbeddingResult = None,
**kwargs: Any,
) -> EmbeddingBatchResult
Create embeddings for a batch of text strings asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[str]` | Text values to embed together. | required |
| `micro_batch_size` | `int` | Maximum number of texts to send in a single provider embedding request. Larger logical batches are split into contiguous micro-batches and stitched back together in input order. | `32` |
| `return_exceptions` | `bool` | Captures per-item failures in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Callback invoked with progress counts as work completes. | `None` |
| `on_result` | `OnEmbeddingResult` | Callback invoked with each result as it completes. | `None` |
| `**kwargs` | `Any` | Additional LiteLLM embedding kwargs applied to the batch request. | `{}` |

Returns:
| Type | Description |
|---|---|
| `EmbeddingBatchResult` | One result slot per input string, aligned to the original order. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If a provider response does not contain the expected embedding vectors. |
| `Exception` | The provider error when `return_exceptions` is `False`. |
Examples:
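A hedged async sketch; the model and `tpm` budget are placeholders:

```python
import asyncio

from infermesh import LMClient


async def main() -> None:
    texts = [f"chunk {i}" for i in range(64)]
    async with LMClient(
        model="openai/text-embedding-3-small",  # placeholder model
        api_base="https://api.openai.com/v1",
        tpm=1_000_000,  # client-side token budget shared across calls
    ) as client:
        batch = await client.aembed_batch(texts, micro_batch_size=16)


asyncio.run(main())
```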
transcribe
transcribe(
input_data: TranscriptionInput,
*,
max_transcription_bytes: int
| None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
**kwargs: Any,
) -> TranscriptionResult
Transcribe one audio input synchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `TranscriptionInput` | Local path, raw bytes, or a binary file-like object containing audio data. | required |
| `max_transcription_bytes` | `int \| None` | Defensive size limit applied before the request is sent. Pass `None` to disable the check. | `DEFAULT_MAX_TRANSCRIPTION_BYTES` |
| `**kwargs` | `Any` | Additional LiteLLM transcription kwargs such as language hints. | `{}` |

Returns:
| Type | Description |
|---|---|
| `TranscriptionResult` | The transcript text plus request metadata, language, and duration when the provider returns them. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the input cannot be normalized or exceeds `max_transcription_bytes`. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
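A minimal sketch; the model, file name, and `language` kwarg (a LiteLLM language hint) are placeholders:

```python
from infermesh import LMClient

with LMClient(
    model="openai/whisper-1",  # placeholder transcription model
    api_base="https://api.openai.com/v1",
) as client:
    # A local path is one of the accepted TranscriptionInput forms.
    result = client.transcribe("meeting.wav", language="en")
```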
transcribe_batch
transcribe_batch(
input_batch: Sequence[TranscriptionInput],
*,
max_transcription_bytes: int
| None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnTranscriptionResult = None,
**kwargs: Any,
) -> TranscriptionBatchResult
Transcribe a batch of audio inputs synchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[TranscriptionInput]` | Ordered audio inputs to transcribe. | required |
| `max_transcription_bytes` | `int \| None` | Defensive size limit applied before each request is sent. Pass `None` to disable the check. | `DEFAULT_MAX_TRANSCRIPTION_BYTES` |
| `return_exceptions` | `bool` | When `True`, per-item failures are captured in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Callback invoked after each item finishes with the completed and total counts. | `None` |
| `on_result` | `OnTranscriptionResult` | Callback invoked with each result as it completes. | `None` |
| `**kwargs` | `Any` | Additional LiteLLM transcription kwargs such as language hints. | `{}` |

Returns:
| Type | Description |
|---|---|
| `TranscriptionBatchResult` | A result slot for each input item, aligned to the original order. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If any input cannot be normalized or exceeds `max_transcription_bytes`. |
| `Exception` | The first terminal provider or normalization error when `return_exceptions` is `False`. |
Examples:
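A hedged batch sketch; the model, file names, and the 25 MiB limit are placeholders:

```python
from infermesh import LMClient

files = ["call_01.wav", "call_02.wav", "call_03.wav"]

with LMClient(
    model="openai/whisper-1",  # placeholder transcription model
    api_base="https://api.openai.com/v1",
) as client:
    batch = client.transcribe_batch(
        files,
        max_transcription_bytes=25 * 1024 * 1024,  # reject oversized audio early
        return_exceptions=True,  # keep per-file failures in the result
    )
```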
atranscribe
async
atranscribe(
input_data: TranscriptionInput,
*,
max_transcription_bytes: int
| None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
**kwargs: Any,
) -> TranscriptionResult
Transcribe one audio input asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_data` | `TranscriptionInput` | Local path, raw bytes, or a binary file-like object containing audio data. | required |
| `max_transcription_bytes` | `int \| None` | Defensive size limit applied before the request is sent. Pass `None` to disable the check. | `DEFAULT_MAX_TRANSCRIPTION_BYTES` |
| `**kwargs` | `Any` | Additional LiteLLM transcription kwargs such as language hints. | `{}` |

Returns:
| Type | Description |
|---|---|
| `TranscriptionResult` | The transcript text and request metadata for one audio input. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the input cannot be normalized or exceeds the configured size limit. |
| `Exception` | Re-raises provider errors after retries are exhausted. |
Examples:
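A minimal async sketch showing the raw-bytes input form; the model and file name are placeholders:

```python
import asyncio
from pathlib import Path

from infermesh import LMClient


async def main() -> None:
    audio = Path("clip.mp3").read_bytes()  # raw bytes are an accepted input form
    async with LMClient(
        model="openai/whisper-1",  # placeholder model
        api_base="https://api.openai.com/v1",
    ) as client:
        result = await client.atranscribe(audio)


asyncio.run(main())
```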
atranscribe_batch
async
atranscribe_batch(
input_batch: Sequence[TranscriptionInput],
*,
max_transcription_bytes: int
| None = DEFAULT_MAX_TRANSCRIPTION_BYTES,
return_exceptions: bool = True,
on_progress: Callable[[int, int], None] | None = None,
on_result: OnTranscriptionResult = None,
**kwargs: Any,
) -> TranscriptionBatchResult
Transcribe a batch of audio inputs asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_batch` | `Sequence[TranscriptionInput]` | Ordered audio inputs to transcribe. | required |
| `max_transcription_bytes` | `int \| None` | Defensive size limit applied before each request is sent. Pass `None` to disable the check. | `DEFAULT_MAX_TRANSCRIPTION_BYTES` |
| `return_exceptions` | `bool` | When `True`, per-item failures are captured in the batch result instead of aborting the run. | `True` |
| `on_progress` | `Callable[[int, int], None] \| None` | Callback invoked after each item finishes with the completed and total counts. | `None` |
| `on_result` | `OnTranscriptionResult` | Callback invoked with each result as it completes. | `None` |
| `**kwargs` | `Any` | Additional LiteLLM transcription kwargs such as language hints. | `{}` |

Returns:
| Type | Description |
|---|---|
| `TranscriptionBatchResult` | One result slot per input item, aligned to the original order. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If any input cannot be normalized or exceeds the configured size limit. |
| `Exception` | The first terminal provider or normalization error when `return_exceptions` is `False`. |
Examples:
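A hedged async batch sketch; the model, file names, and concurrency cap are placeholders:

```python
import asyncio

from infermesh import LMClient


async def main() -> None:
    files = [f"segment_{i:02d}.wav" for i in range(8)]
    async with LMClient(
        model="openai/whisper-1",  # placeholder model
        api_base="https://api.openai.com/v1",
        max_parallel_requests=2,  # keep only two uploads in flight
    ) as client:
        batch = await client.atranscribe_batch(
            files,
            on_progress=lambda done, total: print(f"{done}/{total}"),
        )


asyncio.run(main())
```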
__exit__
Exit a synchronous context-manager scope.
__aexit__
async
Exit an asynchronous context-manager scope.