pnj

How ServiceMonitor can quietly overload your kube-apiserver

· 7 min read · kubernetes · prometheus · observability · istio · scaling

You add a ServiceMonitor. Prometheus picks up your metrics. Done, right?

At small scale, yes. At thousands of nodes and tens of thousands of pods, that single CRD can be one of the largest drivers of control-plane load in your cluster — without any single component looking obviously wrong.

This post walks through the chain reaction ServiceMonitor triggers, the symptoms you’ll see when it starts to hurt, and a pattern that decouples metrics discovery from data-plane traffic so the two stop fighting each other.

The chain reaction nobody draws

The prometheus-operator’s ServiceMonitor resource doesn’t scrape anything by itself. It tells Prometheus where to look. To do that, it needs a Kubernetes Service to discover targets through.

That’s the part everyone knows. Here’s what happens next:

ServiceMonitor
   └─ selects a Service (label match)
        └─ Service has Endpoints / EndpointSlices
             └─ EndpointSlices are watched by:
                  ├─ Prometheus (for scrape targets)        ← intended
                  ├─ kube-proxy on every node               ← every node!
                  ├─ Istio control plane (Istiod)           ← if you have a mesh
                  └─ Anything else watching that Service    ← e.g. CoreDNS

A single Service backing 100 pods doesn’t just produce 100 endpoint entries. It produces a stream of EndpointSlice updates every time one of those pods rolls, restarts, or fails a readiness probe. Those updates fan out to every watcher on every node.

Now multiply that by thousands of services across the cluster. Prometheus is no longer the only consumer: the Service that ServiceMonitor requires has become a hot resource on the kube-apiserver.

Where the cost actually lands

There are three places this hurts, in roughly increasing order of pain:

1. kube-apiserver watch fanout

Every Service and EndpointSlice change is written to etcd through kube-apiserver, then pushed out to every watch stream subscribed to that resource type. On a healthy small cluster you might see a few thousand active watches. At scale, with every node running kubelet, kube-proxy, and various agents (Istio sidecars get their view from Istiod rather than from the apiserver directly), six-figure concurrent watch counts become normal. Each watch holds a goroutine, an HTTP/2 stream, and a buffer.

This is the kind of thing that looks fine until it doesn’t, and then your apiserver’s p99 latency for WATCH and LIST calls suddenly spikes.
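To see how many watches are open against these resources, a recording rule along these lines could help. It is a sketch, not a drop-in: the object and rule names are made up, the release label is a placeholder for whatever your Prometheus ruleSelector expects, and it assumes Prometheus already scrapes kube-apiserver.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apiserver-watch-fanout          # hypothetical name
  labels:
    release: prometheus                 # assumption: must match your Prometheus ruleSelector
spec:
  groups:
    - name: apiserver-watch-fanout
      rules:
        # Open WATCH requests per resource, limited to the resources
        # that Service-based discovery touches.
        - record: resource:apiserver_longrunning_requests:sum_watch
          expr: |
            sum by (resource) (
              apiserver_longrunning_requests{
                verb="WATCH",
                resource=~"services|endpoints|endpointslices"
              }
            )
```

Graphing the recorded series against node count over a few weeks shows whether watch counts are growing with the cluster or with something else.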

2. kube-proxy iptables/IPVS sync

Every node’s kube-proxy watches EndpointSlices and rewrites its local iptables or IPVS rules whenever they change. The sync time grows with the total number of endpoints in the cluster — not just the ones routed on that node.

In one cluster I worked on, the first full IPVS sync on a fresh node took five to ten minutes once endpoint counts climbed into the hundreds of thousands. That’s five to ten minutes during which the node’s pods can’t reach in-cluster services reliably.

3. Istio XDS publication

This is the one that tends to surprise people. Istiod watches all Service resources in the cluster. By default, it then publishes every service it sees to every sidecar’s configuration via XDS.

The sidecars don’t need most of that information. A pod in namespace A probably never calls the -metrics service of a pod in namespace Z. But Istio doesn’t know that, so it ships the whole catalog.

Result: bloated sidecar memory, slow first-sync on startup (which blocks Envoy’s LDS/CDS/EDS readiness, which blocks the app from accepting traffic), and big XDS push storms whenever endpoints churn.

Symptoms

If ServiceMonitor-driven load is your bottleneck, you’ll typically see some combination of:

  • High apiserver_longrunning_requests specifically on WATCH verbs against endpoints / endpointslices resources.
  • p99 pod startup time creeping up — especially the time between “pod scheduled” and “all containers ready,” dominated by the sidecar waiting on its first XDS push.
  • kube-proxy logs showing very long sync durations, along the lines of:
      proxier.go: numEndpoints=512000
      proxier.go: syncProxyRules took 8m23s
  • Sidecar memory growing linearly with cluster size, not with the individual pod’s traffic profile.
  • Istiod CPU spikes correlated with pod rollouts in completely unrelated namespaces.
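Several of these symptoms can be turned into alerts. The rules below are a rough sketch under a few assumptions: kube-proxy, cAdvisor, and Istiod metrics are all scraped by your Prometheus, the object and alert names are invented for this example, and every threshold is a placeholder to tune against your own baseline.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: discovery-load-symptoms         # hypothetical name
spec:
  groups:
    - name: discovery-load-symptoms
      rules:
        # kube-proxy rule syncs taking a long time on any node.
        # The histogram's largest finite bucket is fairly small, so keep
        # the threshold below it (10s here is a placeholder).
        - alert: KubeProxySlowSync
          expr: |
            histogram_quantile(0.99,
              sum by (le, instance) (
                rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[10m])
              )
            ) > 10
          for: 15m
        # Median sidecar memory across the cluster; if this tracks cluster
        # size rather than traffic, sidecars are carrying the whole catalog.
        - alert: IstioSidecarMemoryHigh
          expr: |
            quantile(0.5,
              container_memory_working_set_bytes{container="istio-proxy"}
            ) > 256 * 1024 * 1024
          for: 30m
        # Sustained XDS push rate from Istiod (placeholder threshold).
        - alert: IstiodPushRateHigh
          expr: sum(rate(pilot_xds_pushes[5m])) > 100
          for: 15m
```

None of these alerts proves that ServiceMonitor-backed Services are the cause. They only tell you when to start looking at endpoint churn.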

The fix: separate discovery from traffic

The root cause is conceptual: we’re treating Service as one thing when it’s really two:

  • A traffic primitive — a stable virtual IP for clients to dial.
  • A discovery primitive — a set of labels Prometheus can match to find pods to scrape.

These have very different requirements. Traffic services should be in Istio’s catalog, exposed to relevant namespaces, and routable. Discovery services need none of that — Prometheus just needs to enumerate the backing pods.

The fix is to give each app two Services: one for traffic, one for metrics. The metrics one is headless (no ClusterIP) and excluded from Istio’s XDS publication.

A: keep the traffic Service as-is

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080

Nothing special. This is what clients dial. It stays in Istio’s catalog because real traffic flows through it.

B: add a headless metrics Service, hidden from Istio

apiVersion: v1
kind: Service
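metadata:
  name: my-app-metrics
  labels:
    # The ServiceMonitor selects Services by label, so the metrics
    # Service needs a label of its own to match on.
    app: my-app-metrics
  annotations:
    # Istio reads this Service but never publishes it to sidecars.
    # See note below on choice of value.
    networking.istio.io/exportTo: __non_existent_namespace__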
spec:
  clusterIP: None      # headless — no virtual IP, no kube-proxy entry
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090

Two things doing the heavy lifting here:

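  1. clusterIP: None means kube-proxy doesn’t install iptables/IPVS rules for this Service, so there is no data-plane cost. The EndpointSlices of a headless Service also carry the service.kubernetes.io/headless label, which kube-proxy excludes from its watch, so nodes don’t receive those endpoint updates either.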
  2. networking.istio.io/exportTo — controls which namespaces Istiod publishes the Service to. The documented values are * (all, the default), . (current namespace only), ~ (hidden, exported to no namespaces), or a comma-separated namespace list. See the Istio configuration scoping docs.

A note on the annotation value. The documented way to hide a Service is networking.istio.io/exportTo: "~". The example above uses a non-existent namespace name instead, which is not documented but produces the same visibility outcome. The reason it shows up in production setups is istio/istio#46950: using ~ still causes Pilot to trigger a full push on every change to the annotated Service, which defeats the point of hiding it at scale. Routing through a non-existent namespace avoids that code path. If you’re not running at a scale where the full-push behavior is a problem, prefer the documented ~ value.
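For clusters that don’t hit the full-push behavior, the same metrics Service with the documented value might look like the sketch below. Only the annotation value changes. The app: my-app-metrics label is included on the assumption that your ServiceMonitor selects the Service by that label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-metrics
  labels:
    app: my-app-metrics        # assumption: matched by the ServiceMonitor selector
  annotations:
    # Quote the tilde: a bare ~ is YAML for null, not the string "~".
    networking.istio.io/exportTo: "~"
spec:
  clusterIP: None              # headless, same as the variant above
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
```

The quoting is the one easy mistake here. Without it the annotation value parses as null instead of the string Istio expects.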

C: point ServiceMonitor at the metrics Service only

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app-metrics      # only the metrics Service
  endpoints:
    - port: metrics
      interval: 30s

Prometheus now discovers targets through a Service that doesn’t participate in Istio’s catalog, doesn’t burn iptables rules, and exists for one purpose.

What about a port that’s used for both?

If your app serves both traffic and metrics on the same port (which is common — /metrics on the HTTP port), you have two options:

  1. Duplicate the port in both Services. This is fine — Services are metadata; duplicating a port between a ClusterIP and a headless Service adds zero data-plane cost.
  2. Lose the metrics scrape for that pod. Not great, but sometimes the right call if the service is small and the operational simplicity is worth more than its metrics.
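As a sketch of the first option, assume the app serves both traffic and /metrics on container port 8080, as in the traffic Service earlier. The port name http-metrics and the app: my-app-metrics label are choices made for this example, not requirements.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-metrics
  labels:
    app: my-app-metrics        # label the ServiceMonitor selects on
  annotations:
    networking.istio.io/exportTo: "~"   # or the non-existent-namespace value
spec:
  clusterIP: None              # headless: discovery only, no virtual IP
  selector:
    app: my-app
  ports:
    - name: http-metrics       # hypothetical name; same container port as traffic
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app-metrics      # matches the headless Service, not the traffic one
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s
```

The traffic Service stays untouched and keeps exposing port 80 to clients. Prometheus reaches the same container port through the headless Service instead.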

If your scrape configuration framework (Helm chart, operator, etc.) doesn’t support discovery via a separate Service, this might be the first thing to fix — it’s an enabler for everything above.

The takeaway

ServiceMonitor is a discovery primitive that happens to consume a traffic primitive. At small scale that conflation is invisible; at large scale it can become one of the biggest sources of control-plane load you have, because everyone watches Service and EndpointSlice resources, not just Prometheus.

The pattern of splitting discovery and traffic Services, plus excluding the discovery one from Istio’s XDS publication, is a small structural change with an outsized payoff:

  • Sidecars stop carrying state they’ll never use.
  • kube-proxy stops syncing rules for Services that don’t route traffic.
  • Istiod stops doing pushes triggered by metrics-only churn.
  • The watch fanout on kube-apiserver shrinks back to something reasonable.

If you’re running Prometheus + Istio at scale and you’ve never thought about this, your kube-apiserver is probably doing more work than it needs to. Check your sidecar memory first — the line of best fit against cluster size will tell you most of what you need to know.