Cross-account Prometheus replication with end-to-end mTLS via Envoy

There are easier ways to ship metrics across AWS accounts than the one I’m about to describe. You could push to Amazon Managed Service for Prometheus. You could peer the VPCs. You could use PrivateLink. We didn’t: we wanted both ends to be self-hosted Prometheus, separated by an account boundary, and we wanted the link to be the kind of thing a compliance reviewer can audit end-to-end. So the wire ended up going across the public internet, with an IP allow-list and mutual TLS standing between any random host on the internet and our metrics ingest endpoint.

This post walks through the design: how the two Prometheus servers talk to each other, what Envoy does on each side, how the certificates fit together, and the networking and security considerations that shaped the final system.

The setup at a glance

Account A runs Prometheus that scrapes everything it cares about and uses remote_write to ship samples elsewhere. The elsewhere is a local Envoy on the same host (or sidecar), which originates the outbound TLS connection, presents its client certificate, and dials the public endpoint in Account B.
The traffic leaves Account A through a NAT gateway with a pinned Elastic IP, so the egress IP is stable.
It lands in Account B on a public Network Load Balancer listening on 443, which fronts another Envoy (listening on 8443 in the config below). Account B’s security group allows port 443 only from Account A’s NAT EIP, so even a perfectly forged TLS handshake from a random host on the internet never gets a TCP connection to begin with.
Envoy in Account B terminates TLS, validates the client certificate against Account A’s root CA, and forwards the request to a receiver-mode Prometheus running locally.

Three independent layers of access control, in order of how early they reject a bad request:

Network ACL (Account B’s SG): only Account A’s egress IP can even open a connection.
mTLS server validation: Account B’s Envoy rejects clients that don’t present a certificate chaining to Account A’s root CA and carrying the expected SAN.
mTLS client validation: Account A’s Envoy refuses to send to a server whose certificate doesn’t chain to Account B’s root CA.

You need all three to fail for an exfiltration or injection to land.

Prometheus side

The sender’s prometheus.yml is boring on purpose — all of the security logic is delegated to the local Envoy.

# Account A: prometheus.yml (excerpt)
remote_write:
  - url: http://127.0.0.1:9091/api/v1/write
    queue_config:
      capacity: 10000
      max_samples_per_send: 2000
      max_shards: 50
      min_backoff: 100ms
      max_backoff: 30s
    metadata_config:
      send: true
      send_interval: 1m

A few things worth noting here:

The URL is http:// (plain) and 127.0.0.1 (loopback). The crypto happens one hop later, in the Envoy sidecar.
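queue_config is set explicitly because cross-account links have higher latency and more variance than intra-cluster remote_write. The defaults have changed between Prometheus releases, so compare these values against your version; the wider backoff range (100ms to 30s) is the part tuned for a flaky link. A generous max_shards lets Prometheus parallelize when the link is healthy; the WAL acts as the buffer when it isn’t.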
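metadata_config.send: true ships metric metadata (type, help text, unit) alongside the samples. It’s nearly free in bytes. Whether the receiver surfaces it in /api/v1/metadata depends on the receiver and its version, so verify that before relying on it.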
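
To make the backoff settings concrete, here is a small Python sketch of the retry delays implied by min_backoff: 100ms and max_backoff: 30s. It assumes the delay starts at the minimum, doubles on every retry, and is capped at the maximum, which is how the Prometheus configuration docs describe these two settings. The function name is made up for illustration; this is not Prometheus code.

```python
def backoff_schedule(min_backoff_s, max_backoff_s, retries):
    """Delays (in seconds) before each retry: start at the minimum,
    double after every attempt, never exceed the maximum."""
    delays = []
    delay = min_backoff_s
    for _ in range(retries):
        delays.append(delay)
        delay = min(delay * 2, max_backoff_s)
    return delays


# Values from the remote_write excerpt above: 100ms and 30s.
schedule = backoff_schedule(0.1, 30.0, 10)
print(schedule)
print(f"total wait across 10 retries: {sum(schedule):.1f}s")
```

With these values the tenth retry is the first one to hit the 30s cap, so a shard that keeps failing settles into one attempt every 30 seconds after a bit over a minute.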

The receiver is also boring:

# Account B: prometheus startup flags
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=127.0.0.1:9090 \
  --web.enable-remote-write-receiver \
  --enable-feature=memory-snapshot-on-shutdown

--web.enable-remote-write-receiver is the magic flag — it turns Prometheus into a receiver for the remote-write protocol on /api/v1/write. Without it, the endpoint 404s. --web.listen-address binds to loopback only; nothing on the network can reach Prometheus directly. Everything has to come through Envoy.

Envoy on the receiver (Account B)

This is the more interesting half of the configuration. Envoy is the public face of the metrics ingest endpoint — it terminates TLS, validates the client certificate, and acts as a reverse proxy to Prometheus.

# Account B: envoy.yaml (excerpt)
static_resources:
  listeners:
    - name: metrics_ingest
      address:
        socket_address: { address: 0.0.0.0, port_value: 8443 }
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              require_client_certificate: true
              common_tls_context:
                tls_certificates:
                  - certificate_chain: { filename: /etc/envoy/certs/server.crt }
                    private_key:       { filename: /etc/envoy/certs/server.key }
                validation_context:
                  # Trust the *foreign* root CA bundle — Account A's
                  # root, copied over out of band. NOT the OS root store.
                  trusted_ca: { filename: /etc/envoy/certs/account-a-root.crt }
                  match_typed_subject_alt_names:
                    - san_type: DNS
                      matcher:
                        exact: "metrics-sender.account-a.internal"
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  virtual_hosts:
                    - name: prom_ingest
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/api/v1/write" }
                          route: { cluster: prometheus_local }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                      log_format:
                        text_format_source:
                          inline_string: |
                            %START_TIME% %DOWNSTREAM_PEER_SUBJECT% %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %BYTES_RECEIVED%

  clusters:
    - name: prometheus_local
      connect_timeout: 1s
      type: STATIC
      load_assignment:
        cluster_name: prometheus_local
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 9090 }

The bits that actually matter, called out:

require_client_certificate: true — the handshake fails if the client doesn’t present a cert. No “optional” mTLS, no falling back to one-way TLS for clients that “haven’t migrated yet.”
trusted_ca points at Account A’s root CA bundle, not the system root store. This is critical. If you point it at the OS truststore you’ve effectively said “anything Let’s Encrypt or DigiCert signs is fine,” which is the opposite of what mTLS is supposed to give you. The trust anchor must be a small private CA that one of the two parties controls.
match_typed_subject_alt_names pins the expected SAN on the client cert. Without this, any cert that chains to Account A’s root would be accepted, including ones issued for other services that happen to be in the same trust domain. Pinning the SAN means a stolen-but-different cert from inside the trust domain still gets rejected.
access_log includes DOWNSTREAM_PEER_SUBJECT — the client cert’s subject DN is logged on every request. This is the audit trail that says “this metric came from a system holding cert X.” When the compliance reviewer asks “who can write to this Prometheus?” the answer is in the logs.
The route is restricted to /api/v1/write. The receiver Prometheus has plenty of other endpoints (/api/v1/query, /metrics, the UI), and none of them are reachable through Envoy. The sender can write, and only write. (The route uses a prefix match; an exact path match would be stricter still.)
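
The SAN pin is easiest to see as plain logic. The sketch below is a Python analogy, not Envoy code: it takes a certificate in the dict form that Python’s ssl.SSLSocket.getpeercert() returns and accepts it only if one of its DNS SANs equals the pinned name exactly. It assumes chain validation against the trusted CA has already succeeded. The helper name san_is_pinned and the second hostname in the example are hypothetical.

```python
PINNED_SAN = "metrics-sender.account-a.internal"


def san_is_pinned(peer_cert, pinned=PINNED_SAN):
    """Return True if any DNS subjectAltName entry equals the pinned name.

    peer_cert is a dict shaped like ssl.SSLSocket.getpeercert() output,
    where subjectAltName is a tuple of (type, value) pairs.
    """
    sans = peer_cert.get("subjectAltName", ())
    return any(kind == "DNS" and value == pinned for kind, value in sans)


# A cert for the expected sender, and one for a different (hypothetical)
# service that chains to the same CA.
sender = {"subjectAltName": (("DNS", "metrics-sender.account-a.internal"),)}
other = {"subjectAltName": (("DNS", "billing.account-a.internal"),)}

print(san_is_pinned(sender))
print(san_is_pinned(other))
```

The second certificate would pass chain validation in this scenario, and the exact match on the SAN is what rejects it.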

Envoy on the sender (Account A)

The sender’s Envoy is the mirror image: it originates the upstream TLS connection, presents the client cert, and validates the server side.

# Account A: envoy.yaml (excerpt)
static_resources:
  listeners:
    - name: prom_remote_write_egress
      address:
        socket_address: { address: 127.0.0.1, port_value: 9091 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: egress_http
                route_config:
                  virtual_hosts:
                    - name: remote_write
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/api/v1/write" }
                          route: { cluster: remote_prometheus }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: remote_prometheus
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: remote_prometheus
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: metrics-ingest.example.com
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          sni: metrics-ingest.example.com
          common_tls_context:
            tls_certificates:
              - certificate_chain: { filename: /etc/envoy/certs/client.crt }
                private_key:       { filename: /etc/envoy/certs/client.key }
            validation_context:
              trusted_ca: { filename: /etc/envoy/certs/account-b-root.crt }
              match_typed_subject_alt_names:
                - san_type: DNS
                  matcher:
                    exact: "metrics-ingest.account-b.internal"

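Same pinning discipline, but a different trust anchor than you might expect. The sender’s validation_context.trusted_ca points at the receiver’s root CA bundle, not the sender’s own root and not a shared intermediate. The handshake works because the server presents its chain (leaf and intermediate; sending the root is optional because the verifier already holds it) and Envoy will accept any chain that terminates at a root it trusts. Note that the pinned SAN is a name in the server certificate (metrics-ingest.account-b.internal), which is separate from the public DNS name the cluster dials and sends as SNI. The certificate chain section below explains why the choice of trust anchor matters.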

The networking layer

The transport choices were the trickiest part to justify on paper.

Why public internet, not VPC peering or PrivateLink?

VPC peering would have worked, but it requires both sides’ networking teams to coordinate on CIDR blocks, route tables, and security group references — and that coordination needs to be redone any time either side re-IPs. PrivateLink is cleaner but is a single-vendor (AWS) story: the moment one side moves out of AWS or runs from a different cloud for DR, you’re rebuilding the link. The public internet path is portable; the mTLS + IP-allow-list combination provides the security envelope without binding to a specific AWS networking primitive.

The trade-off is latency and reliability: this path crosses internet hops, so latency is higher and more variable, and a regional outage in either direction is a hard cut. Prometheus’s local WAL absorbs that. Samples accumulate and replay when the link comes back, as long as the outage fits within the WAL’s retention window (typically about 2 hours).
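
A back-of-the-envelope sketch in Python shows what “absorbs” means in numbers. Everything numeric here except the roughly 2-hour window is a made-up assumption (ingest rate, outage length, peak send rate), and the function name is hypothetical. The model is deliberately simple: the backlog is ingest rate times outage duration, and it drains at the difference between what the link can send and what keeps arriving.

```python
def catch_up_seconds(ingest_rate, outage_s, max_send_rate):
    """Seconds needed to drain the backlog after an outage, given that
    new samples keep arriving at ingest_rate while we drain."""
    if max_send_rate <= ingest_rate:
        raise ValueError("send rate must exceed ingest rate to ever catch up")
    backlog = ingest_rate * outage_s
    return backlog / (max_send_rate - ingest_rate)


WAL_WINDOW_S = 2 * 60 * 60  # the roughly 2-hour window mentioned above

outage_s = 30 * 60  # hypothetical 30-minute outage
print("outage fits in WAL window:", outage_s <= WAL_WINDOW_S)
print("catch-up seconds:",
      catch_up_seconds(ingest_rate=50_000, outage_s=outage_s, max_send_rate=150_000))
```

The useful takeaway is the denominator: if the link’s peak send rate is only slightly above the steady ingest rate, recovery takes far longer than the outage itself.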

Why a NAT with a pinned EIP?

The Account B security group needs a stable source IP to allow-list. A public NAT gateway always has an Elastic IP attached. If that EIP is allocated implicitly along with the gateway, things work fine until the gateway is re-created (Terraform destroy/apply, a move to another AZ, region migration); then the IP changes and Account B starts silently rejecting traffic. Pinning the NAT to a pre-allocated EIP makes the egress IP part of state you explicitly version.

In Terraform that’s a two-resource pattern:

resource "aws_eip" "metrics_egress" {
  domain = "vpc"
  tags   = { Name = "metrics-egress" }
}

resource "aws_nat_gateway" "metrics_egress" {
  allocation_id = aws_eip.metrics_egress.id
  subnet_id     = aws_subnet.public_a.id
}

The receiver side allow-lists aws_eip.metrics_egress.public_ip and is done. The EIP is its own resource, so it survives NAT gateway recreation as long as the allocation itself isn’t destroyed.
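
What the receiver’s security group rule does amounts to a /32 CIDR match on the source address. Here is a minimal Python sketch of that check using the standard ipaddress module. The address is a placeholder from the documentation range, not our real EIP, and is_allowed is a hypothetical helper, not an AWS API.

```python
import ipaddress


def is_allowed(source_ip, allow_list):
    """Return True if source_ip falls inside any CIDR in allow_list."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow_list)


# Placeholder for Account A's NAT EIP, expressed as a single-host CIDR.
ALLOW_LIST = ["203.0.113.10/32"]

print(is_allowed("203.0.113.10", ALLOW_LIST))  # the pinned egress IP
print(is_allowed("203.0.113.11", ALLOW_LIST))  # any other address
```

A single /32 entry is why an accidental EIP change is an immediate outage: there is no neighbouring address that still matches.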

Why an NLB, not an ALB?

An ALB terminates TLS itself and would either bypass the Envoy behind it or require a second TLS hop. The point of Envoy is to be the single TLS termination point; an NLB lets the TLS connection pass through untouched at L4. The NLB exists only for DNS stability and public IP attachment.

The certificate chain

This is the part that surprised me the most when I went looking at how “normal” cross-org mTLS write-ups described their PKI. They assume one shared trust anchor. We had two completely separate PKIs and cross-trust was established by exchanging root CA bundles, not by sharing an intermediate.

       Account A's PKI                              Account B's PKI

   Root CA A (offline)                       Root CA B (offline)
        │                                          │
        │ signs                                    │ signs
        ▼                                          ▼
   Intermediate A (long-lived)               Intermediate B (long-lived)
        │                                          │
        │ signs                                    │ signs
        ▼                                          ▼
   Client cert                                Server cert
   (metrics-sender.…)                         (metrics-ingest.…)
   Account A Envoy                            Account B Envoy

           ╲                                ╱
            ╲   cross-trust at the root    ╱
             ╲     (manual bundle copy)   ╱
              ▼                          ▼

   Account A Envoy `trusted_ca`     =   Root CA B bundle
   Account B Envoy `trusted_ca`     =   Root CA A bundle

Each account ran its own PKI top to bottom. The two trees never merged and never shared an intermediate. Cross-trust was established once, out of band, by copying each side’s root CA bundle to the other side and pinning it as the foreign trust anchor. After that, every handshake verified the other side’s full chain (leaf → that account’s intermediate → that account’s root) against the copied root bundle.

Why this layout instead of a shared intermediate:

The account boundary stays clean. A shared intermediate means both accounts hold material that can mint certs valid for either side. That defeats the point of putting them in separate accounts in the first place. With separate PKIs, neither account can issue a cert the other will trust as one of its own.
Rotation inside an account is invisible to the other side. Account A can rotate its leaf certs (and even its intermediate, if it ever needed to) on its own schedule, and nothing in Account B has to change as long as the new chain still terminates at the same root. No coordination is needed because the cross-trust is anchored above the level that rotates.
The audit story is cleaner. “Here’s the bundle of foreign roots I trust” is a tiny, version-controlled artifact distinct from your own CA hierarchy. A reviewer can answer “who is allowed to write to me?” by looking at one file.
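
The rotation claim can be illustrated with a toy model in Python. This is not real X.509 validation: there are no signatures, validity dates, or name constraints, only subject and issuer names linked into a chain. The names “Intermediate A2”, “Intermediate X”, and “Root CA X” are hypothetical, and chain_anchors_at is an illustrative helper. The point it shows is that the verifier only cares where the chain ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str


def chain_anchors_at(chain, trusted_roots):
    """Toy check: every cert is issued by the next one in the chain,
    and the last cert is issued by a root the verifier trusts."""
    if not chain:
        return False
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            return False
    return chain[-1].issuer in trusted_roots


# Account B trusts exactly one foreign root.
trusted_by_b = {"Root CA A"}

current = [Cert("metrics-sender", "Intermediate A"),
           Cert("Intermediate A", "Root CA A")]
rotated = [Cert("metrics-sender", "Intermediate A2"),
           Cert("Intermediate A2", "Root CA A")]
foreign = [Cert("metrics-sender", "Intermediate X"),
           Cert("Intermediate X", "Root CA X")]

print(chain_anchors_at(current, trusted_by_b))
print(chain_anchors_at(rotated, trusted_by_b))
print(chain_anchors_at(foreign, trusted_by_b))
```

The rotated chain passes without any change to Account B’s trust set, while a chain from an unrelated PKI fails even though the leaf carries the same name.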

What’s actually long-lived vs. routine:

Roots — almost never rotate. Their job is to be the trust anchor.
Intermediates — long-lived (years). Rotation within an account is a planned, infrequent event.
Leaf certs — short-lived (weeks to months). Rotation is routine, automated, and entirely internal to each account.

Because intermediates rarely move and roots almost never do, the cross-account root bundle copy was a one-time activity. Once each side has the other’s root, the system runs without any further cross-account trust ceremony — leaf cert rotations on either side happen independently, the chain still terminates at the same root, and the handshake keeps working.

The one-time-ness is the design feature. A naive read of mTLS-across-trust-boundaries imagines a permanent dance of cert distribution between organizations. In practice, if you anchor cross-trust at the root and let each side own its own intermediate-and-below, you do the ceremony once and the rest takes care of itself for the lifetime of the roots, which is measured in years.

The price you pay is that root rotation, if it ever happens, does require coordination with the other side. That’s the reason root CA key material gets so much care: rotating it is a multi-step, multi-organization event. Worth it for everything you don’t have to coordinate in between.

What we learned to watch

A few things broke in instructive ways:

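prometheus_remote_storage_samples_failed_total (named prometheus_remote_storage_failed_samples_total in older releases) is the alert you actually want. It catches everything from “cert expired” to “receiver out of disk” to “we tripped the rate limit on the receiver.” Errors that Prometheus treats as recoverable are retried first, so pair it with prometheus_remote_storage_samples_retried_total. Without these, the failure mode is silent: Prometheus keeps trying, keeps failing, WAL grows, eventually the sender’s disk fills up and scrape failures start happening too.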
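prometheus_tsdb_wal_segment_current, compared with prometheus_wal_watcher_current_segment, is the second signal. The segment number itself always grows, so watch the gap: if the remote-write watcher keeps falling further behind the segment being written on the sender, the link is congested or down. The WAL is doing its job, but you have a finite amount of time before it stops doing its job.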
Envoy’s ssl.fail_verify_cert counter on the receiver — non-zero values mean someone (probably your own infrastructure mid-rotation, occasionally something actually wrong) is trying to connect with a cert that doesn’t pass validation. Alert on it.
The NAT EIP — sounds dumb, but: alert if the EIP allocation disappears from state. We had a single incident where a Terraform refactor moved the EIP into a different module without re-using the same allocation ID, the new EIP wasn’t on the allow list, and Account B started rejecting writes within seconds.
The deployed root bundle vs. the canonical one — because the cross-account trust anchor is a static file copied in once, the failure mode if it ever drifts (someone edits it on a host without going through CI, a bad config push overwrites it, a file mode changes and Envoy can’t read it) is “all handshakes fail.” A periodic check that hashes the on-disk trusted_ca and compares it to the value in source control catches this before someone notices the metrics gap.
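
A minimal Python sketch of that drift check, assuming the expected SHA-256 digest is published from source control by some means not shown here. The function names are hypothetical. An unreadable or missing file is treated as drift, since Envoy would fail on it too.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def bundle_matches(path, expected_sha256):
    """True only if the deployed bundle is readable and hashes to the
    expected value; missing or unreadable files count as drift."""
    try:
        return sha256_of(path) == expected_sha256.lower()
    except OSError:
        return False
```

On the receiver this would run periodically as something like bundle_matches("/etc/envoy/certs/account-a-root.crt", expected), with the result exported as a gauge or used as the exit code of a cron job that pages on failure.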

The summary

The shape of the system, distilled:

Prometheus on both sides, both self-hosted, separated by an AWS account boundary.
Envoy sidecars on each side handle all the cryptography and certificate lifecycle. Prometheus stays cleartext, local, simple.
mTLS with two independent PKIs cross-trusted at the root via a one-time root-bundle exchange, plus SAN pinning on both sides, is the inner fence. Each side rotates leaves (and even intermediates) on its own schedule with no cross-account coordination.
A pinned-EIP NAT and a single-IP allow-list on the receiver SG is the outer fence.
An NLB sits between the internet and the receiver Envoy so the public DNS name and IP are stable independent of how Envoy is deployed.
The WAL on the sender absorbs link outages; alerts on failed remote-write samples and on WAL backlog growth catch real problems.

None of this is novel (every component is documented somewhere), but the combination is what gave us a metrics replication link we’d be happy to put in front of an auditor, and one we’re confident will alert us before it fails silently.