
Rate Limiter

infermesh.RateLimiter

RateLimiter(
    requests_per_minute: int,
    tokens_per_minute: int | None = None,
    requests_per_day: int | None = None,
    tokens_per_day: int | None = None,
    max_request_burst: int | None = None,
    max_token_burst: int | None = None,
    header_bucket_scope: Literal[
        "minute", "day", "auto"
    ] = "auto",
)

Thread-safe async rate limiter backed by token buckets.

Enforces up to four independent limits simultaneously:

  • RPM (requests per minute) — always active; the only required limit.
  • TPM (tokens per minute) — optional; activated by passing tokens_per_minute.
  • RPD (requests per day) — optional; activated by passing requests_per_day.
  • TPD (tokens per day) — optional; activated by passing tokens_per_day.

A request acquires capacity from all active buckets before it proceeds. Waiters are queued in a min-heap ordered by token cost (smallest first), which prevents small requests from being blocked behind a large one at the head of the queue.
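The smallest-first ordering can be pictured with a plain heapq sketch (illustrative only, not the library's internals; the names and costs are made up):

```python
import heapq

# Hypothetical sketch of smallest-cost-first waiter scheduling.
# Each waiter is pushed with its token cost as the heap key, so the
# cheapest pending request is served first once capacity frees up.
waiters: list[tuple[int, int, str]] = []  # (token_cost, arrival_seq, name)

for seq, (cost, name) in enumerate([(4000, "big-batch"), (150, "chat"), (600, "summary")]):
    heapq.heappush(waiters, (cost, seq, name))

served = [heapq.heappop(waiters)[2] for _ in range(len(waiters))]
# Small requests come out first regardless of arrival order.
print(served)  # ['chat', 'summary', 'big-batch']
```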

After a provider response arrives, adjust refines the token accounting with the actual usage and can also sync the bucket state from the x-ratelimit-* headers returned by OpenAI-compatible APIs.

Parameters:

  • requests_per_minute (int, required): maximum number of requests allowed per 60-second window. This is the only required parameter.
  • tokens_per_minute (int or None, default None): maximum number of tokens allowed per 60-second window. When supplied, a TPM bucket is created and all requests must also fit within this limit.
  • requests_per_day (int or None, default None): maximum requests per 24-hour window. Creates an RPD bucket in addition to the RPM bucket.
  • tokens_per_day (int or None, default None): maximum tokens per 24-hour window. Creates a TPD bucket.
  • max_request_burst (int or None, default None): burst capacity for the RPM bucket. When None, the burst equals requests_per_minute (no burst above the base rate). Set higher to allow short spikes above the steady-state RPM.
  • max_token_burst (int or None, default None): burst capacity for the TPM bucket. Analogous to max_request_burst.
  • header_bucket_scope ("minute", "day", or "auto"; default "auto"): controls which bucket receives updates from x-ratelimit-* response headers.
      • "auto": resets arriving within 120 s → per-minute buckets (RPM/TPM); later resets → per-day buckets (RPD/TPD).
      • "minute": always route header updates to RPM/TPM.
      • "day": always route header updates to RPD/TPD.
    Override "auto" when your provider uses non-standard header conventions.
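The "auto" routing rule can be sketched as a small predicate (a hypothetical helper for illustration, not part of the public API):

```python
# Hypothetical sketch of the "auto" routing rule: header resets that are
# 120 s away or less are treated as per-minute state, later ones as per-day.
def route_header_scope(reset_seconds: float, scope: str = "auto") -> str:
    if scope in ("minute", "day"):
        return scope  # explicit override wins
    if scope != "auto":
        raise ValueError(f"invalid header_bucket_scope: {scope!r}")
    return "minute" if reset_seconds <= 120 else "day"

print(route_header_scope(6.0))               # minute
print(route_header_scope(3600.0))            # day
print(route_header_scope(3600.0, "minute"))  # minute (override)
```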

Raises:

  • ValueError: if header_bucket_scope is not one of the accepted string literals.

Examples:

A simple 100 RPM limiter:

>>> from infermesh import RateLimiter
>>> limiter = RateLimiter(requests_per_minute=100)

A combined 500 RPM / 100 000 TPM limiter appropriate for OpenAI Tier-2:

>>> limiter = RateLimiter(
...     requests_per_minute=500,
...     tokens_per_minute=100_000,
... )

Using the limiter outside of LMClient (advanced):

>>> import asyncio
>>> async def limited_call(limiter: RateLimiter, tokens: int) -> None:
...     handle = await limiter.acquire(tokens)
...     if handle is None:
...         return  # cancelled while waiting; do not dispatch
...     try:
...         response = await some_api_call()
...     except Exception:
...         await limiter.adjust(handle, actual_tokens=0)
...         raise
...     await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)

acquire async

acquire(
    estimated_tokens: int,
) -> RateLimiterAcquisitionHandle | None

Reserve capacity for one request and return an acquisition handle.

Blocks asynchronously until all active buckets (RPM, TPM, RPD, TPD) have enough capacity for the request. Once capacity is available, the tokens are atomically deducted and a RateLimiterAcquisitionHandle is returned.

Call adjust with the returned handle after the request completes to reconcile actual token usage.
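The all-or-nothing deduction across buckets can be sketched like this (a simplified model with made-up bucket names; the real implementation also handles queueing and refill):

```python
# Simplified sketch: capacity is deducted only if every active bucket
# can cover the request, so no bucket is left partially charged.
def try_acquire(buckets: dict[str, float], costs: dict[str, float]) -> bool:
    if any(buckets[name] < cost for name, cost in costs.items()):
        return False  # at least one bucket is short; deduct nothing
    for name, cost in costs.items():
        buckets[name] -= cost
    return True

buckets = {"rpm": 5.0, "tpm": 1000.0}
assert try_acquire(buckets, {"rpm": 1.0, "tpm": 512.0})      # fits both buckets
assert not try_acquire(buckets, {"rpm": 1.0, "tpm": 600.0})  # TPM short: no change
print(buckets)  # {'rpm': 4.0, 'tpm': 488.0}
```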

Parameters:

  • estimated_tokens (int, required): pre-dispatch estimate of the total tokens this request will consume (prompt + expected output). Must be non-negative. A value of 0 means "only reserve one request slot".

Returns:

  • RateLimiterAcquisitionHandle or None: a handle encapsulating the reservation. Returns None if the wait future was cancelled (i.e. the calling task was cancelled while waiting in the queue); the request was not dispatched in that case.

Raises:

  • ValueError: if estimated_tokens exceeds the capacity of any token bucket. This prevents a request that can never fit from blocking the queue indefinitely. Either reduce the estimate or increase max_token_burst.
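The over-capacity check amounts to a simple guard before a request is queued; a hypothetical sketch:

```python
# Hypothetical sketch of the pre-queue guard: a request larger than a
# bucket's total capacity could never be served, so it is rejected up front.
def check_fits(estimated_tokens: int, bucket_capacities: dict[str, int]) -> None:
    for name, capacity in bucket_capacities.items():
        if estimated_tokens > capacity:
            raise ValueError(
                f"estimated_tokens={estimated_tokens} exceeds {name} capacity {capacity}"
            )

check_fits(512, {"tpm": 100_000, "tpd": 2_000_000})  # fits: no error
try:
    check_fits(150_000, {"tpm": 100_000, "tpd": 2_000_000})
except ValueError as exc:
    print(exc)  # estimated_tokens=150000 exceeds tpm capacity 100000
```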

Notes

Waiters are ordered by token cost (smallest first) in a min-heap. This avoids starvation of small requests by large ones but means that a very large request may wait until the bucket refills enough to serve it.

Examples:

>>> handle = await limiter.acquire(estimated_tokens=512)
>>> if handle is None:
...     return  # task was cancelled; do not dispatch
>>> try:
...     response = await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise
>>> await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)

adjust async

adjust(
    handle: RateLimiterAcquisitionHandle,
    actual_tokens: int,
    response_headers: dict[str, Any] | None = None,
) -> None

Reconcile a reservation with the actual outcome of the request.

This method must be called after every acquire call, regardless of whether the request succeeded or failed. It performs three tasks:

  1. Failure refund — when actual_tokens == 0 (signalling a failed request), the one request slot that was deducted by acquire is returned to the RPM bucket so the failed request does not count against the rate.
  2. Token correction — the difference between the pre-dispatch estimate (handle.estimated_tokens) and the actual usage is added back to all token buckets (a negative delta removes tokens; a positive delta returns over-reserved tokens).
  3. Header sync — when response_headers is provided, any x-ratelimit-* headers are parsed and used to authoritatively synchronise the bucket state, overriding the local estimate.
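The token-correction step can be illustrated numerically for a single bucket (a simplified sketch; the capacity clamp is an assumption about bucket behaviour, not a documented detail):

```python
# Simplified single-bucket sketch of the token correction.
# delta = estimated - actual: positive returns over-reserved tokens,
# negative removes the shortfall; the result is clamped to capacity.
def reconcile(available: float, capacity: float, estimated: int, actual: int) -> float:
    return min(capacity, available + (estimated - actual))

# Over-estimate: 512 reserved, only 300 used -> 212 tokens returned.
print(reconcile(available=100.0, capacity=1000.0, estimated=512, actual=300))  # 312.0
# Under-estimate: 512 reserved, 700 used -> 188 more tokens removed.
print(reconcile(available=300.0, capacity=1000.0, estimated=512, actual=700))  # 112.0
```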

Parameters:

  • handle (RateLimiterAcquisitionHandle, required): the handle returned by the preceding acquire call.
  • actual_tokens (int, required): total tokens actually consumed, as reported in the response's usage.total_tokens field. Pass 0 to indicate a failed request (no tokens billed) and trigger the failure refund.
  • response_headers (dict or None, default None): the provider's response headers. When present, any x-ratelimit-limit-*, x-ratelimit-remaining-*, and x-ratelimit-reset-* headers are parsed and used to sync the corresponding buckets. Pass None to skip header syncing (appropriate for local vLLM servers that do not send rate-limit headers).
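Extracting the numeric values from OpenAI-style x-ratelimit-* headers could look like this (a hypothetical helper; only integer limit/remaining values are handled, since reset headers carry duration strings that need separate parsing):

```python
# Hypothetical sketch: pull numeric limit/remaining values out of
# OpenAI-style rate-limit headers, keyed by (kind, field).
def parse_ratelimit_headers(headers: dict[str, str]) -> dict[tuple[str, str], int]:
    parsed: dict[tuple[str, str], int] = {}
    for key, value in headers.items():
        parts = key.lower().split("-")  # e.g. x-ratelimit-remaining-tokens
        if parts[:2] == ["x", "ratelimit"] and len(parts) == 4 and value.isdigit():
            field, kind = parts[2], parts[3]
            if field in ("limit", "remaining"):
                parsed[(kind, field)] = int(value)
    return parsed

headers = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-tokens": "99488",
    "x-ratelimit-reset-tokens": "6m0s",  # duration string: skipped here
    "content-type": "application/json",
}
print(parse_ratelimit_headers(headers))
# {('requests', 'limit'): 500, ('tokens', 'remaining'): 99488}
```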

Raises:

  • ValueError: if actual_tokens or handle.estimated_tokens is negative.

Notes

After the adjustments are applied, any queued waiters that can now proceed are notified.

Examples:

Successful request:

>>> handle = await limiter.acquire(256)
>>> response = await api_call()
>>> await limiter.adjust(
...     handle,
...     actual_tokens=response.usage.total_tokens,
...     response_headers=dict(response.headers),
... )

Failed request (refund the slot):

>>> handle = await limiter.acquire(256)
>>> try:
...     await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise

infermesh.RateLimiterAcquisitionHandle dataclass

RateLimiterAcquisitionHandle(estimated_tokens: int)

Opaque handle returned by acquire.

Pass this handle to adjust after the request completes so the limiter can reconcile the actual token usage against the pre-dispatch estimate and release or reclaim the difference.

Parameters:

  • estimated_tokens (int, required): the number of tokens that were reserved when the handle was created. This field is used internally by adjust; external callers generally do not need to inspect it.
Notes

Do not create RateLimiterAcquisitionHandle instances directly. They are produced exclusively by acquire and should be treated as opaque tokens.