How ServiceMonitor can quietly overload your kube-apiserver
You add a ServiceMonitor. Prometheus picks up your metrics. Done, right?
At small scale, yes. At thousands of nodes and tens of thousands of pods, that single CRD can be one of the largest drivers of control-plane load in your cluster — without any single component looking obviously wrong.
This post walks through the chain reaction ServiceMonitor triggers, the
symptoms you’ll see when it starts to hurt, and a pattern that decouples metrics
discovery from data-plane traffic so the two stop fighting each other.
The chain reaction nobody draws
The prometheus-operator’s ServiceMonitor resource doesn’t scrape anything by
itself. It tells Prometheus where to look. To do that, it needs a
Kubernetes Service to discover targets through.
That’s the part everyone knows. Here’s what happens next:
ServiceMonitor
└─ selects a Service (label match)
   └─ Service has Endpoints / EndpointSlices
      └─ EndpointSlices are watched by:
         ├─ Prometheus (for scrape targets)       ← intended
         ├─ kube-proxy on every node              ← every node!
         ├─ Istio control plane (Istiod)          ← if you have a mesh
         └─ Anything else watching that Service   ← e.g. CoreDNS
A single Service backing 100 pods doesn’t just produce 100 endpoint entries. It produces a stream of EndpointSlice updates every time one of those pods rolls, restarts, or fails a readiness probe. Those updates fan out to every watcher on every node.
Now multiply that by thousands of Services across the cluster. Prometheus
is no longer the only consumer: the Service that ServiceMonitor requires has
become a hot resource on the kube-apiserver.
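To get a feel for the fan-out, here is a back-of-the-envelope sketch in Python. Every number in it is a hypothetical input, not a measurement, and the function name `endpoint_update_fanout` is made up for illustration. It assumes each pod restart produces two EndpointSlice updates (endpoint removed, endpoint re-added) and that nothing is batched, so read the result as a rough order of magnitude only.

```python
def endpoint_update_fanout(services, restarts_per_service, updates_per_restart, watchers):
    """Estimate watch events delivered: EndpointSlice updates times watchers."""
    updates = services * restarts_per_service * updates_per_restart
    return updates * watchers


# Hypothetical cluster: 2,000 Services, 100 pod restarts per Service per day
# (about one rollout of a 100-pod app), 2 updates per restart, and kube-proxy
# watching from each of 3,000 nodes.
events = endpoint_update_fanout(
    services=2_000,
    restarts_per_service=100,
    updates_per_restart=2,
    watchers=3_000,
)
print(f"{events:,} watch events per day")  # prints: 1,200,000,000 watch events per day
```

The point of the model is the multiplication: the update count is set by pod churn, but the delivered-event count is set by how many watchers there are.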
Where the cost actually lands
There are three places this hurts, in roughly increasing order of pain:
1. kube-apiserver watch fanout
Every Service and EndpointSlice change is written to etcd through
kube-apiserver, then pushed out to every active watch stream. On a healthy
small cluster you might see a few thousand active watches. At scale, with
every node running kube-proxy and various agents, and with Istiod and other
controllers holding watches of their own, six-figure concurrent watch counts
become normal. Each watch holds a goroutine, an HTTP/2 stream, and a buffer.
This is the kind of thing that looks fine until it doesn’t, then suddenly
your apiserver p99 latency for WATCH and LIST calls falls off a cliff.
2. kube-proxy iptables/IPVS sync
Every node’s kube-proxy watches EndpointSlices and rewrites its local
iptables or IPVS rules whenever they change. The sync time grows with the
total number of endpoints in the cluster — not just the ones routed on that
node.
In one cluster I worked on, the first full IPVS sync on a fresh node took five to ten minutes once endpoint counts climbed into the hundreds of thousands. That’s five to ten minutes during which the node’s pods can’t reach in-cluster services reliably.
3. Istio XDS publication
This is the one that tends to surprise people. Istiod watches all
Service resources in the cluster. By default, it then publishes every
service it sees to every sidecar’s configuration via XDS.
The sidecars don’t need most of that information. A pod in namespace A
probably never calls the -metrics service of a pod in namespace Z. But
Istio doesn’t know that, so it ships the whole catalog.
Result: bloated sidecar memory, slow first-sync on startup (which blocks
Envoy’s LDS/CDS/EDS readiness, which blocks the app from accepting
traffic), and big XDS push storms whenever endpoints churn.
Symptoms
If ServiceMonitor-driven load is your bottleneck, you’ll typically see
some combination of:
- High apiserver_longrunning_requests, specifically on WATCH verbs against endpoints / endpointslices resources.
- p99 pod startup time creeping up, especially the time between “pod scheduled” and “all containers ready,” dominated by the sidecar waiting on its first XDS push.
- kube-proxy logs showing very long sync durations:
proxier.go: numEndpoints=512000
proxier.go: syncProxyRules took 8m23s
- Sidecar memory growing linearly with cluster size, not with the individual pod’s traffic profile.
- Istiod CPU spikes correlated with pod rollouts in completely unrelated namespaces.
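The kube-proxy symptom is easy to check mechanically. The sketch below scans log lines for slow syncs. It assumes the `syncProxyRules took <duration>` wording quoted above, which varies between kube-proxy versions, so adjust the pattern to what your nodes actually log. The helper names and the 60-second threshold are illustrative choices, and the duration parser only handles the `h`, `m`, `s`, and `ms` units.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" must come before "m" so that "500ms" is not read as 500 minutes.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_SYNC = re.compile(r"syncProxyRules took (\S+)")


def go_duration_seconds(text):
    """Convert a Go-style duration such as '8m23s' into seconds."""
    pos = 0
    total = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # unparsed junk between tokens
            raise ValueError(f"cannot parse duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"cannot parse duration: {text!r}")
    return total


def slow_syncs(lines, threshold_seconds=60.0):
    """Return (seconds, line) pairs for syncs slower than the threshold."""
    hits = []
    for line in lines:
        match = _SYNC.search(line)
        if match:
            seconds = go_duration_seconds(match.group(1))
            if seconds > threshold_seconds:
                hits.append((seconds, line))
    return hits


sample = [
    "proxier.go: syncProxyRules took 412ms",
    "proxier.go: syncProxyRules took 8m23s",
]
for seconds, line in slow_syncs(sample):
    print(f"{seconds:.0f}s  {line}")
```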
The fix: separate discovery from traffic
The root cause is conceptual: we’re treating Service as one thing when
it’s really two:
- A traffic primitive — a stable virtual IP for clients to dial.
- A discovery primitive — a set of labels Prometheus can match to find pods to scrape.
These have very different requirements. Traffic services should be in Istio’s catalog, exposed to relevant namespaces, and routable. Discovery services need none of that — Prometheus just needs to enumerate the backing pods.
The fix is to give each app two Services: one for traffic, one for metrics. The metrics one is headless (no ClusterIP) and excluded from Istio’s XDS publication.
A: keep the traffic Service as-is
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
Nothing special. This is what clients dial. It stays in Istio’s catalog because real traffic flows through it.
B: add a headless metrics Service, hidden from Istio
apiVersion: v1
kind: Service
metadata:
  name: my-app-metrics
  labels:
    # ServiceMonitor selectors match Service labels, not names.
    # This label is what the ServiceMonitor in step C selects on.
    app: my-app-metrics
  annotations:
    # Istio reads this Service but never publishes it to sidecars.
    # See note below on choice of value.
    networking.istio.io/exportTo: __non_existent_namespace__
spec:
  clusterIP: None  # headless: no virtual IP, no kube-proxy entry
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
Two things doing the heavy lifting here:
- clusterIP: None means kube-proxy doesn’t install iptables/IPVS rules for this Service. There is no data-plane cost, even though every node still gets the endpoint updates via the watch.
- networking.istio.io/exportTo controls which namespaces Istiod publishes the Service to. The documented values are * (all, the default), . (current namespace only), ~ (hidden, exported to no namespaces), or a comma-separated namespace list. See the Istio configuration scoping docs.
A note on the annotation value. The documented way to hide a Service
is networking.istio.io/exportTo: "~". The example above uses a non-existent
namespace name instead, which is not documented but produces the same
visibility outcome. The reason it shows up in production setups is
istio/istio#46950: using ~
still causes Pilot to trigger a full push on every change to the
annotated Service, which defeats the point of hiding it at scale. Routing
through a non-existent namespace avoids that code path. If you’re not
running at a scale where the full-push behavior is a problem, prefer the
documented ~ value.
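To make the visibility rules concrete, here is a toy Python model of the exportTo values described above. It is a sketch of the documented semantics, not Istio's implementation: the function name `is_visible` is invented, and it assumes the mesh-wide default is still `*`.

```python
def is_visible(export_to, service_ns, client_ns):
    """Would a sidecar in client_ns see a Service that lives in service_ns?"""
    # A missing or empty annotation falls back to the default (assumed "*").
    entries = [e.strip() for e in (export_to or "*").split(",") if e.strip()]
    if "~" in entries:  # hidden from every namespace
        return False
    for entry in entries:
        if entry == "*":  # exported everywhere
            return True
        if entry == "." and client_ns == service_ns:  # same namespace only
            return True
        if entry == client_ns:  # explicit namespace list
            return True
    return False


for value in (None, ".", "~", "__non_existent_namespace__", "team-a,team-b"):
    print(repr(value), is_visible(value, service_ns="team-z", client_ns="team-a"))
```

In this model `~` and a namespace that cannot exist give the same answer for every client, which is the "same visibility outcome" claim above. The difference between the two is in Istiod's push behavior, which the model does not capture.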
C: point ServiceMonitor at the metrics Service only
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app-metrics  # only the metrics Service
  endpoints:
    - port: metrics
      interval: 30s
Prometheus now discovers targets through a Service that doesn’t participate in Istio’s catalog, doesn’t burn iptables rules, and exists for one purpose.
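If you manage many apps, the metrics Service and its ServiceMonitor are regular enough to generate. The following is a minimal sketch, assuming the naming convention used in this post (`<app>-metrics`) and a pod label of `app: <app>`. The function names are hypothetical, and it defaults to the documented `~` value for exportTo. One detail worth noticing: a ServiceMonitor selector matches labels on the Service object, so the generated Service carries an `app: <app>-metrics` label in its metadata.

```python
import json


def metrics_service(app, namespace, port=9090, export_to="~"):
    """Headless, Istio-hidden Service used only for Prometheus discovery."""
    name = f"{app}-metrics"
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"app": name},  # what the ServiceMonitor selects on
            "annotations": {"networking.istio.io/exportTo": export_to},
        },
        "spec": {
            "clusterIP": "None",  # headless: no virtual IP
            "selector": {"app": app},  # same pods as the traffic Service
            "ports": [{"name": "metrics", "port": port, "targetPort": port}],
        },
    }


def service_monitor(app, namespace, interval="30s"):
    """ServiceMonitor that selects only the metrics Service."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": app, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": f"{app}-metrics"}},
            "endpoints": [{"port": "metrics", "interval": interval}],
        },
    }


def render(app, namespace):
    """Bundle both objects into a JSON List document."""
    items = [metrics_service(app, namespace), service_monitor(app, namespace)]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


print(render("my-app", "default"))
```

kubectl accepts JSON as well as YAML, so the output can be piped into `kubectl apply -f -` after review.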
What about a port that’s used for both?
If your app serves both traffic and metrics on the same port (which is
common — /metrics on the HTTP port), you have two options:
- Duplicate the port in both Services. This is fine — Services are metadata; duplicating a port between a ClusterIP and a headless Service adds zero data-plane cost.
- Lose the metrics scrape for that pod. Not great, but sometimes the right call if the service is small and the operational simplicity is worth more than its metrics.
If your scrape configuration framework (Helm chart, operator, etc.) doesn’t support discovery via a separate Service, this might be the first thing to fix — it’s an enabler for everything above.
The takeaway
ServiceMonitor is a discovery primitive that happens to consume a traffic
primitive. At small scale that conflation is invisible; at large scale it
can become one of the biggest sources of control-plane load you have, because
everyone watches Service and EndpointSlice resources, not just
Prometheus.
The pattern of splitting discovery and traffic Services, plus excluding the discovery one from Istio’s XDS publication, is a small structural change with an outsized payoff:
- Sidecars stop carrying state they’ll never use.
- kube-proxy stops syncing rules for Services that don’t route traffic.
- Istiod stops doing pushes triggered by metrics-only churn.
- The work each EndpointSlice update triggers downstream of kube-apiserver shrinks, even though the watch events themselves are still delivered to every node.
If you’re running Prometheus + Istio at scale and you’ve never thought about this, your kube-apiserver is probably doing more work than it needs to. Check your sidecar memory first — the line of best fit against cluster size will tell you most of what you need to know.
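For that last check, a plain least-squares fit is enough. The sketch below uses only the standard library. The sample points are hypothetical and chosen to be exactly linear so the arithmetic is easy to follow; in practice you would export per-sidecar memory and the cluster's Service count from your own monitoring.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical samples: Services in the cluster vs. sidecar memory in MiB.
service_counts = [1_000, 2_000, 4_000, 8_000]
sidecar_mib = [90, 140, 240, 440]

slope, intercept = fit_line(service_counts, sidecar_mib)
print(f"about {slope * 1000:.0f} MiB per 1,000 Services, baseline {intercept:.0f} MiB")
```

A slope near zero means sidecar memory tracks each pod's own traffic profile. A clearly positive slope means every sidecar is paying for the size of the cluster catalog.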