Block the Blast: How Query Rejection Protects Your Cortex Cluster
Categories:
Introduction
Although Cortex includes various safeguards to protect against overload, they can’t prevent every failure scenario. In some environments, a small set of seemingly harmless-looking dashboard queries have repeatedly slipped just under the limits yet still OOM-killed the querier pods. Built-in protections weren’t enough, and the only available option was to throttle all incoming traffic. These queries often came from a specific dashboard or followed a predictable pattern. There was no way to block just those without affecting everything else. This inspired the introduction of query rejection, a last-resort safety net for operators running multi-tenant Cortex clusters.
Why Limits Aren’t Enough
Cortex already includes resource limiting, throttling and other resource safeguards, but these protections can’t cover every edge case. Some limits are enforced too late in the query lifecycle to matter; others are broad and can’t target specific bad actors. As a result, a single heavy query can bypass all service limits and still:
- Cause OOM kills that disrupt other tenants.
- Degrade availability or spike latency for everyone.
- Require manual operator intervention to restore normal service.
We needed a more precise tool—one that lets operators proactively block specific query patterns without harming legitimate traffic.
What Is Query Rejection?
Think of query rejection as an “emergency stop” in a factory. It sits in front of the query engine and checks each request against a set of operator-defined rules. If a request matches criteria, it’s rejected immediately. This allows you to target the handful of queries that cause trouble without imposing a blanket slowdown on everyone else.
Key features:
- Per-tenant control: It’s defined in the tenant limit configuration, which only targets queries from specific tenant.
- Precise matching: You can specify different query attributes to narrow down to specific queries. All fields within a rejection rule must match (AND logic). If needed, you can define multiple independent rejection rules to target different types of queries.
- Pre-processing enforcement: Query rejection is applied before the query is executed, allowing known-bad patterns to be blocked before consuming any resources.
Matching Criteria
Heavy queries often share identifiable traits. Query rejection lets you match on a variety of attributes and reject only those requests. You can combine as many of these as needed:
- API type:
query
,query_range
,series
, etc. - Query string (regex): Match by pattern, e.g., any query containing “ALERT”.
- Time range: Match queries whose range falls between a configured min and max.
- Time window: Match queries based on how far their time window is from now by specifying relative min and max boundaries. This is often used to distinguish queries that hit hot storage versus cold storage.
- Step value (resolution): Block extremely fine resolutions (e.g., steps under 5s).
- Headers: Match by User-Agent, Grafana dashboard UID (
X-Dashboard-Uid
) or panel ID (X-Panel-Id
).
By combining these fields, you can zero in on the exact query patterns causing problems without over-blocking.
Configuring Query Rejection
You define query rejection rules per tenant in a runtime config file. Each rejection rule specifies a set of attributes that must all match for the query to be rejected. The configuration supports multiple such rules.
Here’s an example configuration:
# runtime_config.yaml
overrides:
<tenant_id>:
query_rejection:
enabled: true
query_attributes:
- api_type: query_range
regex: .*ALERT.*
query_step_limit:
min: 6s
max: 20s
dashboard_uid: "dash123"
What this does:
enabled
This allows you to temporarily turn off query rejection without removing the configuration. It can help verify whether previously blocked queries are still causing issues.query_attributes
is a list of rejection rules. A query will only be rejected if it matches all attributes within one rejection rule.
In the example above, the single rejection rule requires the following:
- API type must be
query_range
. - The query string must contain the word
ALERT
. - The step must be between 6 and 20 seconds.
- The request must come from a dashboard with UID
dash123
.
If all of these conditions match, the request is rejected with a 422
response. If even one condition doesn’t match, the query is allowed to run. You can define additional rejection rules in the list to target different patterns.
Practical Example
Imagine a dashboard panel that repeatedly hits your cluster with a query like this:
curl \
'http://localhost:8005/prometheus/api/v1/query?query=customALERTquery&start=1718383304&end=1718386904&step=7s' \
-H "User-Agent: other" \
-H "X-Dashboard-Uid: dash123"
Because this request matches all the configured attributes, it will be blocked. But if even one field is different—such as a longer step duration or a different dashboard UID—the query will go through.
Best Practices and Cautions
Start with narrow rules. Use the most specific fields first—like panel ID or dashboard UID—to reduce risk of over-blocking.
Monitor rejections. Use the
cortex_query_frontend_rejected_queries_total
metric to track rejected queries, and check logs for detailed query information.Communicate with tenants. Let affected tenants know if their queries are being blocked, and help them adjust their dashboards accordingly.
Ruler Queries
Query rejection only applies to API queries and does not apply to ruler queries. However, Ruler queries are typically instant and lightweight, so a complex query‑rejection mechanism isn’t required for them. In situations where a rule group contains heavy queries and no other mitigations are effective, operators can disable the entire rule group. This is especially helpful to prevent Rulers from getting OOMKilled due to resource-intensive rules.
Rule group disabling is configured per tenant, similar to query rejection. When you disable a rule group, Cortex stops evaluating the rules within that group, removing the problematic queries altogether. For example:
# runtime_config.yaml
overrides:
<tenant_id>:
disabled_rule_groups:
- namespace: "keep_firing_for_test"
name: "smallsteps"
This makes it easy to mitigate issues from the ruler without introducing query rejection logic for those queries.
Conclusion
When traditional safeguards fall short, query rejection gives operators precise control to block only what’s harmful; without slowing down everything else.
If you operate a shared Cortex environment, consider learning how to use query rejection effectively. It might just save you from the next incident; by preventing OOM kills, degraded performance, or disruption to other tenants.