
Rate Limiter

infermesh.RateLimiter

RateLimiter(
    requests_per_minute: int,
    tokens_per_minute: int | None = None,
    requests_per_day: int | None = None,
    tokens_per_day: int | None = None,
    max_request_burst: int | None = None,
    max_token_burst: int | None = None,
    header_bucket_scope: Literal[
        "minute", "day", "auto"
    ] = "auto",
)

Thread-safe async rate limiter backed by token buckets.

Enforces up to four independent limits simultaneously:

  • RPM (requests per minute) — always active; the only required limit.
  • TPM (tokens per minute) — optional; activated by passing tokens_per_minute.
  • RPD (requests per day) — optional; activated by passing requests_per_day.
  • TPD (tokens per day) — optional; activated by passing tokens_per_day.

A request acquires capacity from all active buckets before it proceeds. Waiters are queued in a min-heap ordered by token cost (smallest first), which prevents small requests from being blocked behind a large one at the head of the queue.
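The smallest-first ordering can be pictured with a plain heapq sketch (illustrative only, not the library's internals; the names and costs are made up):

```python
import heapq

# Hypothetical sketch of smallest-cost-first waiter scheduling.
# Each waiter is pushed with its token cost as the heap key, so the
# cheapest pending request is served first once capacity frees up.
waiters: list[tuple[int, int, str]] = []  # (token_cost, arrival_seq, name)

for seq, (cost, name) in enumerate([(4000, "big-batch"), (150, "chat"), (600, "summary")]):
    heapq.heappush(waiters, (cost, seq, name))

served = [heapq.heappop(waiters)[2] for _ in range(len(waiters))]
# Small requests come out first regardless of arrival order.
print(served)  # ['chat', 'summary', 'big-batch']
```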

After a provider response arrives, adjust refines the token accounting with the actual usage and can also sync the bucket state from the x-ratelimit-* headers returned by OpenAI-compatible APIs.

Parameters:

  • requests_per_minute (int, required): maximum number of requests allowed per 60-second window. This is the only required parameter.
  • tokens_per_minute (int or None, default None): maximum number of tokens allowed per 60-second window. When supplied, a TPM bucket is created and all requests must also fit within this limit.
  • requests_per_day (int or None, default None): maximum requests per 24-hour window. Creates an RPD bucket in addition to the RPM bucket.
  • tokens_per_day (int or None, default None): maximum tokens per 24-hour window. Creates a TPD bucket.
  • max_request_burst (int or None, default None): burst capacity for the RPM bucket. When None, the burst equals requests_per_minute (no burst above the base rate). Set higher to allow short spikes above the steady-state RPM.
  • max_token_burst (int or None, default None): burst capacity for the TPM bucket. Analogous to max_request_burst.
  • header_bucket_scope ("minute", "day", or "auto"; default "auto"): controls which bucket receives updates from x-ratelimit-* response headers.
      • "auto": resets arriving within 120 s → per-minute buckets (RPM/TPM); later resets → per-day buckets (RPD/TPD).
      • "minute": always route header updates to RPM/TPM.
      • "day": always route header updates to RPD/TPD.
    Override "auto" when your provider uses non-standard header conventions.
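The "auto" routing rule can be sketched as a small predicate (a hypothetical helper for illustration, not part of the public API):

```python
# Hypothetical sketch of the "auto" routing rule: header resets that are
# 120 s away or less are treated as per-minute state, later ones as per-day.
def route_header_scope(reset_seconds: float, scope: str = "auto") -> str:
    if scope in ("minute", "day"):
        return scope  # explicit override wins
    if scope != "auto":
        raise ValueError(f"invalid header_bucket_scope: {scope!r}")
    return "minute" if reset_seconds <= 120 else "day"

print(route_header_scope(6.0))               # minute
print(route_header_scope(3600.0))            # day
print(route_header_scope(3600.0, "minute"))  # minute (override)
```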

Raises:

  • ValueError: if header_bucket_scope is not one of the accepted string literals.

Examples:

A simple 100 RPM limiter:

>>> from infermesh import RateLimiter
>>> limiter = RateLimiter(requests_per_minute=100)

A combined 500 RPM / 100 000 TPM limiter appropriate for OpenAI Tier-2:

>>> limiter = RateLimiter(
...     requests_per_minute=500,
...     tokens_per_minute=100_000,
... )

Using the limiter outside of LMClient (advanced):

>>> import asyncio
>>> async def limited_call(limiter: RateLimiter, tokens: int) -> None:
...     handle = await limiter.acquire(tokens)
...     if handle is None:
...         return  # cancelled while waiting; do not dispatch
...     try:
...         response = await some_api_call()
...     except Exception:
...         await limiter.adjust(handle, actual_tokens=0)
...         raise
...     await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)

acquire async

acquire(
    estimated_tokens: int,
) -> RateLimiterAcquisitionHandle | None

Reserve capacity for one request and return an acquisition handle.

Blocks asynchronously until all active buckets (RPM, TPM, RPD, TPD) have enough capacity for the request. Once capacity is available, the tokens are atomically deducted and a RateLimiterAcquisitionHandle is returned.

Call adjust with the returned handle after the request completes to reconcile actual token usage.
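The all-or-nothing deduction across buckets can be sketched like this (a simplified model with made-up bucket names; the real implementation also handles queueing and refill):

```python
# Simplified sketch: capacity is deducted only if every active bucket
# can cover the request, so no bucket is left partially charged.
def try_acquire(buckets: dict[str, float], costs: dict[str, float]) -> bool:
    if any(buckets[name] < cost for name, cost in costs.items()):
        return False  # at least one bucket is short; deduct nothing
    for name, cost in costs.items():
        buckets[name] -= cost
    return True

buckets = {"rpm": 5.0, "tpm": 1000.0}
assert try_acquire(buckets, {"rpm": 1.0, "tpm": 512.0})      # fits both buckets
assert not try_acquire(buckets, {"rpm": 1.0, "tpm": 600.0})  # TPM short: no change
print(buckets)  # {'rpm': 4.0, 'tpm': 488.0}
```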

Parameters:

  • estimated_tokens (int, required): pre-dispatch estimate of the total tokens this request will consume (prompt + expected output). Must be non-negative. A value of 0 means "only reserve one request slot".

Returns:

  • RateLimiterAcquisitionHandle or None: a handle encapsulating the reservation. Returns None if the wait future was cancelled (i.e. the calling task was cancelled while waiting in the queue); the request was not dispatched in that case.

Raises:

  • ValueError: if estimated_tokens exceeds the capacity of any token bucket. This prevents a request that can never fit from blocking the queue indefinitely. Either reduce the estimate or increase max_token_burst.
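The over-capacity check amounts to a simple guard before a request is queued; a hypothetical sketch:

```python
# Hypothetical sketch of the pre-queue guard: a request larger than a
# bucket's total capacity could never be served, so it is rejected up front.
def check_fits(estimated_tokens: int, bucket_capacities: dict[str, int]) -> None:
    for name, capacity in bucket_capacities.items():
        if estimated_tokens > capacity:
            raise ValueError(
                f"estimated_tokens={estimated_tokens} exceeds {name} capacity {capacity}"
            )

check_fits(512, {"tpm": 100_000, "tpd": 2_000_000})  # fits: no error
try:
    check_fits(150_000, {"tpm": 100_000, "tpd": 2_000_000})
except ValueError as exc:
    print(exc)  # estimated_tokens=150000 exceeds tpm capacity 100000
```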

Notes

Waiters are ordered by token cost (smallest first) in a min-heap. This avoids starvation of small requests by large ones but means that a very large request may wait until the bucket refills enough to serve it.

Examples:

>>> handle = await limiter.acquire(estimated_tokens=512)
>>> if handle is None:
...     return  # task was cancelled; do not dispatch
>>> try:
...     response = await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise
>>> await limiter.adjust(handle, actual_tokens=response.usage.total_tokens)

adjust async

adjust(
    handle: RateLimiterAcquisitionHandle,
    actual_tokens: int,
    response_headers: dict[str, Any] | None = None,
) -> None

Reconcile a reservation with the actual outcome of the request.

This method must be called after every acquire call, regardless of whether the request succeeded or failed. It performs three tasks:

  1. Failure refund — when actual_tokens == 0 (signalling a failed request), the one request slot that was deducted by acquire is returned to the RPM bucket so the failed request does not count against the rate.
  2. Token correction — the difference between the pre-dispatch estimate (handle.estimated_tokens) and the actual usage is added back to all token buckets (a negative delta removes tokens; a positive delta returns over-reserved tokens).
  3. Header sync — when response_headers is provided, any x-ratelimit-* headers are parsed and used to authoritatively synchronise the bucket state, overriding the local estimate.
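The token-correction step can be illustrated numerically for a single bucket (a simplified sketch; the capacity clamp is an assumption about bucket behaviour, not a documented detail):

```python
# Simplified single-bucket sketch of the token correction.
# delta = estimated - actual: positive returns over-reserved tokens,
# negative removes the shortfall; the result is clamped to capacity.
def reconcile(available: float, capacity: float, estimated: int, actual: int) -> float:
    return min(capacity, available + (estimated - actual))

# Over-estimate: 512 reserved, only 300 used -> 212 tokens returned.
print(reconcile(available=100.0, capacity=1000.0, estimated=512, actual=300))  # 312.0
# Under-estimate: 512 reserved, 700 used -> 188 more tokens removed.
print(reconcile(available=300.0, capacity=1000.0, estimated=512, actual=700))  # 112.0
```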

Parameters:

  • handle (RateLimiterAcquisitionHandle, required): the handle returned by the preceding acquire call.
  • actual_tokens (int, required): total tokens actually consumed, as reported in the response's usage.total_tokens field. Pass 0 to indicate a failed request (no tokens billed) and trigger the failure refund.
  • response_headers (dict or None, default None): the provider's response headers. When present, any x-ratelimit-limit-*, x-ratelimit-remaining-*, and x-ratelimit-reset-* headers are parsed and used to sync the corresponding buckets. Pass None to skip header syncing (appropriate for local vLLM servers that do not send rate-limit headers).
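Extracting the numeric values from OpenAI-style x-ratelimit-* headers could look like this (a hypothetical helper; only integer limit/remaining values are handled, since reset headers carry duration strings that need separate parsing):

```python
# Hypothetical sketch: pull numeric limit/remaining values out of
# OpenAI-style rate-limit headers, keyed by (kind, field).
def parse_ratelimit_headers(headers: dict[str, str]) -> dict[tuple[str, str], int]:
    parsed: dict[tuple[str, str], int] = {}
    for key, value in headers.items():
        parts = key.lower().split("-")  # e.g. x-ratelimit-remaining-tokens
        if parts[:2] == ["x", "ratelimit"] and len(parts) == 4 and value.isdigit():
            field, kind = parts[2], parts[3]
            if field in ("limit", "remaining"):
                parsed[(kind, field)] = int(value)
    return parsed

headers = {
    "x-ratelimit-limit-requests": "500",
    "x-ratelimit-remaining-tokens": "99488",
    "x-ratelimit-reset-tokens": "6m0s",  # duration string: skipped here
    "content-type": "application/json",
}
print(parse_ratelimit_headers(headers))
# {('requests', 'limit'): 500, ('tokens', 'remaining'): 99488}
```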

Raises:

  • ValueError: if actual_tokens or handle.estimated_tokens is negative.

Notes

After the adjustments are applied, any queued waiters that can now proceed are notified.

Examples:

Successful request:

>>> handle = await limiter.acquire(256)
>>> response = await api_call()
>>> await limiter.adjust(
...     handle,
...     actual_tokens=response.usage.total_tokens,
...     response_headers=dict(response.headers),
... )

Failed request (refund the slot):

>>> handle = await limiter.acquire(256)
>>> try:
...     await api_call()
... except Exception:
...     await limiter.adjust(handle, actual_tokens=0)
...     raise

infermesh.RateLimiterAcquisitionHandle dataclass

RateLimiterAcquisitionHandle(estimated_tokens: int)

Opaque handle returned by acquire.

Pass this handle to adjust after the request completes so the limiter can reconcile the actual token usage against the pre-dispatch estimate and release or reclaim the difference.

Parameters:

  • estimated_tokens (int, required): the number of tokens that were reserved when the handle was created. This field is used internally by adjust; external callers generally do not need to inspect it.
Notes

Do not create RateLimiterAcquisitionHandle instances directly. They are produced exclusively by acquire and should be treated as opaque tokens.