1. What are Rate Limits?
Rate limits are restrictions that the API places on how many times a user can access the server, or how many tokens a user can consume, within a specific period of time. API rate limiting on current platforms takes several forms:
- RPM (Requests Per Minute, the maximum number of requests initiated in one minute)
- RPH (Requests Per Hour, the maximum number of requests allowed per hour)
- RPD (Requests Per Day, the maximum number of requests allowed per day)
- TPM (Tokens Per Minute, the maximum number of tokens consumed in one minute, including input and output)
- TPD (Tokens Per Day, the maximum number of tokens allowed per day)
- IPM (Images Per Minute, the maximum number of images generated in one minute)
- IPD (Images Per Day, the maximum number of images generated in one day)
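To make a limit such as RPM concrete, here is a minimal sketch of a fixed-window counter. It is an illustration only: the class name `FixedWindowLimiter`, the choice of a fixed (rather than sliding) window, and the example value of 120 requests per minute are assumptions, and the platform's actual accounting may differ.

```python
import time


class FixedWindowLimiter:
    """Hypothetical helper: counts usage in fixed windows of `window_seconds`."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit                    # e.g. 120 for a 120 RPM limit
        self.window_seconds = window_seconds  # 60 s for per-minute limits
        self.clock = clock                    # injectable so tests can fake time
        self.window_start = clock()
        self.used = 0

    def _roll_window(self):
        # Start a fresh window once the current one has elapsed.
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0

    def remaining(self):
        """Units still available in the current window."""
        self._roll_window()
        return self.limit - self.used

    def try_acquire(self, cost=1):
        """Record `cost` units if they fit; return False when the limit is hit."""
        self._roll_window()
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True
```

The same counter can model the other limit types by changing the window and the cost: `window_seconds=86400` for a per-day limit, or `cost` set to a token or image count for TPM, TPD, IPM, and IPD.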
2. Why are Rate Limits needed?
Enforcing rate limits is a common mechanism in API services. Its purposes include:
- Prevent malicious abuse of the interface, for example a large number of invalid requests in a short period of time that would degrade performance or even make the service unavailable.
- Ensure that a few high-frequency users do not infringe on the rights and interests of all other users, so that the API distributes access resources fairly and excessive consumption by some users does not affect the experience of others.
- Help this site provide a consistent and efficient service experience for all users.
3. Specific parameters of API rate limits
A unified rate policy is applied to each account based on the model used, the interface category, and the account type. For example, if an account has a limit of 120 RPM, it can send up to 120 requests per minute; if 30 of them have already been used in the current minute, 90 remain available for other requests.
Restrictions for each interface are detailed as follows: There is currently no limit.
4. Response instructions when the rate exceeds the limit
When the number of requests or the token usage reaches the upper limit within a short period of time, the API returns a rate-limit error message. Subsequent requests are temporarily rejected, and you can continue to access the interface after the cooling time has expired.
5. Rate limit and tokens_to_generate, max_tokens
Since the total number of input and output tokens cannot be known accurately at request time, the system uses the interface parameter max_tokens to estimate the tokens used by the request and applies TPM rate limiting accordingly. After generation completes, the estimate is corrected with the actual number of tokens.
Recommended setting: the value of max_tokens should be as close to the actual needs as possible, to minimize errors caused by exceeding the limit.
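The estimate-then-correct accounting described in this section can be sketched as follows. This is a simplified illustration, not the service's implementation: the class `TokenBudget`, the formula "input tokens plus max_tokens" for the estimate, and the 10,000 TPM limit are all assumptions, and the per-minute window reset is left out for brevity.

```python
class TokenBudget:
    """Hypothetical TPM tracker: charge an estimate first, correct it later."""

    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.charged = 0  # tokens counted against the current minute

    def reserve(self, input_tokens, max_tokens):
        """Charge the estimate up front; return it, or None if it does not fit."""
        estimate = input_tokens + max_tokens  # assumed estimation formula
        if self.charged + estimate > self.tpm_limit:
            return None  # the request would be rejected as rate limited
        self.charged += estimate
        return estimate

    def settle(self, estimate, actual_total_tokens):
        """Replace the estimate with the real usage once generation finishes."""
        self.charged += actual_total_tokens - estimate


budget = TokenBudget(tpm_limit=10_000)

# An oversized max_tokens reserves most of the budget before anything is generated.
first = budget.reserve(input_tokens=500, max_tokens=8_000)

# A second request is rejected while the first one is still charged at its estimate.
second = budget.reserve(input_tokens=500, max_tokens=2_000)

# After generation, the charge is corrected to the actual usage and capacity frees up.
budget.settle(first, actual_total_tokens=1_200)
third = budget.reserve(input_tokens=500, max_tokens=2_000)
```

In this sketch the second request fails only because the first one reserved far more than it eventually used, which is why a max_tokens value close to the real need reduces avoidable rate-limit errors.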