Rate Limiter
infermesh.RateLimiter
RateLimiter(
requests_per_minute: int,
tokens_per_minute: int | None = None,
requests_per_day: int | None = None,
tokens_per_day: int | None = None,
max_request_burst: int | None = None,
max_token_burst: int | None = None,
header_bucket_scope: Literal[
"minute", "day", "auto"
] = "auto",
)
Thread-safe async rate limiter backed by token buckets.
Enforces up to four independent limits simultaneously:
- RPM (requests per minute): always active; the only required limit.
- TPM (tokens per minute): optional; activated by passing tokens_per_minute.
- RPD (requests per day): optional; activated by passing requests_per_day.
- TPD (tokens per day): optional; activated by passing tokens_per_day.
A request acquires capacity from all active buckets before it proceeds. Waiters are queued in a min-heap ordered by token cost (smallest first), which keeps small requests from being held up behind large ones; the trade-off is that a very large request may wait until the buckets have refilled enough to serve it.
After a provider response arrives, adjust
refines the token accounting with the actual usage and can also sync the
bucket state from the x-ratelimit-* headers returned by OpenAI-compatible
APIs.
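To make the "all active buckets" rule concrete, here is a minimal sketch of token buckets with an all-or-nothing deduction. It is not infermesh's implementation: the names Bucket, try_acquire, and per_minute are hypothetical, time is passed in explicitly to keep the example deterministic, and the burst capacity is assumed to equal the per-minute limit.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """A token bucket that refills continuously up to `capacity`."""

    capacity: float          # maximum stored units (assumed equal to the limit)
    refill_per_sec: float    # units regained per second
    level: float             # units currently available
    updated_at: float = 0.0  # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated_at = now


def try_acquire(demands: list[tuple[Bucket, float]], now: float) -> bool:
    """Deduct each cost only if every bucket can cover its own cost."""
    for bucket, _ in demands:
        bucket.refill(now)
    if any(bucket.level < cost for bucket, cost in demands):
        return False  # nothing is deducted when any bucket is short
    for bucket, cost in demands:
        bucket.level -= cost
    return True


def per_minute(limit: int) -> Bucket:
    """Hypothetical helper: a full bucket that refills `limit` units per 60 s."""
    return Bucket(capacity=limit, refill_per_sec=limit / 60, level=limit)


rpm = per_minute(2)
tpm = per_minute(1000)
first = try_acquire([(rpm, 1), (tpm, 600)], now=0.0)   # both buckets can pay
second = try_acquire([(rpm, 1), (tpm, 600)], now=0.0)  # TPM is short, so RPM is untouched
```

In this sketch the second call fails without consuming a request slot, which is the property that matters when several limits are enforced together.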
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| requests_per_minute | int | Maximum number of requests allowed per 60-second window. This is the only required parameter. | required |
| tokens_per_minute | int or None | Maximum number of tokens allowed per 60-second window. When supplied, a TPM bucket is created and all requests must also fit within this limit. | None |
| requests_per_day | int or None | Maximum requests per 24-hour window. Creates an RPD bucket in addition to the RPM bucket. | None |
| tokens_per_day | int or None | Maximum tokens per 24-hour window. Creates a TPD bucket. | None |
| max_request_burst | int or None | Burst capacity for the RPM bucket. When | None |
| max_token_burst | int or None | Burst capacity for the TPM bucket. Analogous to max_request_burst. | None |
| header_bucket_scope | ('auto', 'minute', 'day') | Controls which bucket receives updates from Override | "auto" |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Examples:
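A simple 100 RPM limiter:
>>> limiter = RateLimiter(requests_per_minute=100)
A combined 500 RPM / 100,000 TPM limiter appropriate for OpenAI Tier-2:
>>> limiter = RateLimiter(requests_per_minute=500, tokens_per_minute=100_000)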
Using the limiter outside of LMClient (advanced):
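>>> import asyncio
>>> from infermesh import RateLimiter
>>> async def limited_call(limiter: RateLimiter, tokens: int) -> None:
...     # Reserve one request slot plus the estimated token cost.
...     handle = await limiter.acquire(tokens)
...     try:
...         response = await some_api_call()  # placeholder for the real provider call
...         actual = response.usage.total_tokens
...     except Exception:
...         # Failed request: report zero tokens so the reservation is refunded.
...         await limiter.adjust(handle, actual_tokens=0)
...         raise
...     # Success: reconcile the estimate with the reported usage.
...     await limiter.adjust(handle, actual_tokens=actual)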
acquire
async
Reserve capacity for one request and return an acquisition handle.
Blocks asynchronously until all active buckets (RPM, TPM, RPD, TPD) have enough capacity for the request. Once capacity is available, the tokens are atomically deducted and a RateLimiterAcquisitionHandle is returned.
Call adjust with the returned handle after the request completes to reconcile actual token usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| estimated_tokens | int | Pre-dispatch estimate of the total tokens this request will consume (prompt + expected output). Must be non-negative. A value of | required |
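One way to produce the estimated_tokens value is sketched below. The helper estimate_tokens is hypothetical and not part of infermesh; it assumes roughly four characters per token, a common rule of thumb for English text, and adds the caller's output budget. A real tokenizer would give a tighter estimate.

```python
import math


def estimate_tokens(prompt: str, max_output_tokens: int, chars_per_token: float = 4.0) -> int:
    """Rough pre-dispatch estimate: prompt tokens plus the output budget."""
    if max_output_tokens < 0:
        raise ValueError("max_output_tokens must be non-negative")
    # Assumption: about `chars_per_token` characters per token; round up
    # so short prompts are never estimated at zero tokens.
    prompt_tokens = math.ceil(len(prompt) / chars_per_token)
    return prompt_tokens + max_output_tokens


estimate = estimate_tokens("a" * 400, max_output_tokens=256)
```

Over-estimating is the safer direction here, because adjust returns any over-reserved tokens once the actual usage is known.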
Returns:
| Type | Description |
|---|---|
| RateLimiterAcquisitionHandle or None | A handle encapsulating the reservation. Returns |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Notes
Waiters are ordered by token cost (smallest first) in a min-heap. This avoids starvation of small requests by large ones but means that a very large request may wait until the bucket refills enough to serve it.
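The ordering described in this note can be sketched with the standard heapq module. This is an illustration of smallest-first queuing, not the library's internal code: enqueue and drain are hypothetical names, and the arrival counter used to break ties between equal costs is an assumption.

```python
import heapq
import itertools

# Assumption: waiters with equal cost are served in arrival order.
_arrival = itertools.count()


def enqueue(heap: list, cost: int, name: str) -> None:
    """Add a waiter; the heap keeps the cheapest waiter at index 0."""
    heapq.heappush(heap, (cost, next(_arrival), name))


def drain(heap: list, available: int) -> tuple[list[str], int]:
    """Serve waiters smallest-first until the cheapest one no longer fits."""
    served = []
    while heap and heap[0][0] <= available:
        cost, _, name = heapq.heappop(heap)
        available -= cost
        served.append(name)
    return served, available


waiters: list = []
enqueue(waiters, 900, "big")
enqueue(waiters, 100, "small-a")
enqueue(waiters, 200, "small-b")

# With 500 tokens available, both small requests run and "big" keeps waiting.
served_now, remaining = drain(waiters, 500)
```

The large request is only served once the available capacity reaches its full cost, which is the waiting behaviour the note warns about.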
Examples:
>>> handle = await limiter.acquire(estimated_tokens=512)
>>> if handle is None:
...     return  # task was cancelled; do not dispatch
>>> try:
...     response = await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise
>>> await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)
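The acquire/adjust pattern in the example above can be wrapped in one helper so that no call site forgets the reconciliation step. The sketch below is an assumption about how a caller might do this: call_with_limit is a hypothetical name, and RecordingLimiter is a stand-in that only mimics the two method signatures documented on this page so the helper can be exercised without a provider.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Awaitable, Callable


async def call_with_limit(
    limiter: Any,
    estimated_tokens: int,
    send: Callable[[], Awaitable[Any]],
) -> Any:
    """Acquire, dispatch, and always reconcile the reservation."""
    handle = await limiter.acquire(estimated_tokens)
    if handle is None:
        return None  # nothing was reserved, so do not dispatch
    try:
        response = await send()
    except Exception:
        # Failed request: report zero tokens so the limiter can refund.
        await limiter.adjust(handle, actual_tokens=0)
        raise
    await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)
    return response


class RecordingLimiter:
    """Stand-in limiter (not infermesh's class) that records adjust calls."""

    def __init__(self) -> None:
        self.adjustments: list[int] = []

    async def acquire(self, estimated_tokens: int) -> dict:
        return {"estimated_tokens": estimated_tokens}

    async def adjust(self, handle: dict, actual_tokens: int, response_headers=None) -> None:
        self.adjustments.append(actual_tokens)


async def fake_success() -> SimpleNamespace:
    return SimpleNamespace(usage=SimpleNamespace(total_tokens=42))


demo_limiter = RecordingLimiter()
demo_response = asyncio.run(call_with_limit(demo_limiter, 100, fake_success))
```

With a real limiter, send would be a zero-argument coroutine function that performs the provider call.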
adjust
async
adjust(
handle: RateLimiterAcquisitionHandle,
actual_tokens: int,
response_headers: dict[str, Any] | None = None,
) -> None
Reconcile a reservation with the actual outcome of the request.
This method must be called after every acquire call, regardless of whether the request succeeded or failed. It performs three tasks:
- Failure refund: when actual_tokens == 0 (signalling a failed request), the one request slot that was deducted by acquire is returned to the RPM bucket so the failed request does not count against the rate.
- Token correction: the difference between the pre-dispatch estimate (handle.estimated_tokens) and the actual usage is added back to all token buckets (a negative delta removes tokens; a positive delta returns over-reserved tokens).
- Header sync: when response_headers is provided, any x-ratelimit-* headers are parsed and used to authoritatively synchronise the bucket state, overriding the local estimate.
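The first two tasks reduce to simple arithmetic, sketched below on a pair of bucket levels. Levels and reconcile are hypothetical names used only for illustration; the sketch ignores capacity clamping and header sync, and it is not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class Levels:
    """Available capacity in a request bucket and a token bucket."""

    requests: float
    tokens: float


def reconcile(levels: Levels, estimated_tokens: int, actual_tokens: int) -> Levels:
    """Apply the failure refund and the token correction described above."""
    # Failure refund: actual_tokens == 0 returns the one reserved request slot.
    refund = 1 if actual_tokens == 0 else 0
    # Token correction: delta = estimate - actual. A positive delta returns
    # over-reserved tokens; a negative delta removes the shortfall.
    delta = estimated_tokens - actual_tokens
    return Levels(levels.requests + refund, levels.tokens + delta)


after_acquire = Levels(requests=9, tokens=500)  # state after reserving 300 tokens
over_estimated = reconcile(after_acquire, estimated_tokens=300, actual_tokens=200)
under_estimated = reconcile(after_acquire, estimated_tokens=300, actual_tokens=450)
failed = reconcile(after_acquire, estimated_tokens=300, actual_tokens=0)
```

Note that a failed request gets both effects in this sketch: the request slot comes back and the full token estimate is returned, because the delta equals the estimate when actual usage is zero.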
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| handle | RateLimiterAcquisitionHandle | The handle returned by the preceding acquire call. | required |
| actual_tokens | int | Total tokens actually consumed, as reported in the response's | required |
| response_headers | dict or None | The provider's response headers. When present, any | None |
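For readers wondering what a header sync might read from response_headers, the sketch below extracts the remaining request and token counts from the x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens headers that OpenAI-compatible APIs return. The function parse_remaining is hypothetical, and how infermesh actually maps these values onto its buckets is not shown here.

```python
from typing import Any, Mapping


def parse_remaining(headers: Mapping[str, Any]) -> dict[str, int]:
    """Extract remaining request/token counts from x-ratelimit-* headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {str(key).lower(): value for key, value in headers.items()}
    found: dict[str, int] = {}
    for kind in ("requests", "tokens"):
        raw = lowered.get(f"x-ratelimit-remaining-{kind}")
        if raw is None:
            continue
        try:
            found[kind] = int(str(raw).strip())
        except ValueError:
            continue  # skip malformed values instead of failing the request
    return found


sample = parse_remaining(
    {
        "X-RateLimit-Remaining-Requests": "99",
        "x-ratelimit-remaining-tokens": " 1500 ",
        "content-type": "application/json",
    }
)
```

Skipping malformed or missing headers keeps the local estimate in force, which seems the safer default when a provider omits them.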
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Notes
After the adjustments are applied, any queued waiters that can now proceed are notified.
Examples:
Successful request:
>>> handle = await limiter.acquire(256)
>>> response = await api_call()
>>> await limiter.adjust(
...     handle,
...     actual_tokens=response.usage.total_tokens,
...     response_headers=dict(response.headers),
... )
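Failed request (refund the slot):
>>> handle = await limiter.acquire(256)
>>> try:
...     response = await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise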
infermesh.RateLimiterAcquisitionHandle
dataclass
Opaque handle returned by acquire.
Pass this handle to adjust after the request completes so the limiter can reconcile the actual token usage against the pre-dispatch estimate and release or reclaim the difference.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| estimated_tokens | int | The number of tokens that were reserved when the handle was created. This field is used internally by adjust; external callers generally do not need to inspect it. | required |
Notes
Do not create RateLimiterAcquisitionHandle instances directly. They are produced exclusively by acquire and should be treated as opaque tokens.