Service mesh networking

Put a small proxy beside every service so the network — retries, mTLS, routing, metrics — is handled outside your app code.

The idea

In a microservice fleet, every service needs the same network plumbing: encrypt traffic, retry a flaky peer, balance across replicas, emit metrics. Writing that into each app — in each language — is repetitive and drifts out of sync.

A service mesh moves it out of the app. A sidecar proxy (typically Envoy) runs next to each service instance. Calls never go app→app directly; they go app→local sidecar→remote sidecar→app. The sidecars (the data plane) do mTLS, load balancing, and retries; a control plane tells them how.

Service A (orders) wants to call payments. The request travels app → sidecar → mTLS → sidecar → app, with a retry along the way.

The app code only ever made one plain call to payments. Everything else (encryption, the upstream choice, the retry) happened in the sidecars, invisible to both apps.

How it works

Each pod runs the app and a sidecar side by side. Inbound and outbound traffic is transparently redirected (via iptables or eBPF) into the local sidecar, so the app makes an ordinary http://payments/charge call and never knows a proxy exists. The outbound sidecar resolves payments to a set of healthy endpoints, picks one according to the load-balancing policy, and opens an mTLS connection to that peer’s sidecar (both sides present and verify certificates). If the attempt fails, it retries. The remote sidecar terminates mTLS and forwards the request over loopback to its app.
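The pick-and-retry part of that flow can be sketched in a few lines. This is a toy model, not how Envoy is implemented: the class OutboundSidecar, the send callback, and the UpstreamReset exception are hypothetical names invented for illustration, and mTLS is left out entirely.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Callable, List


class UpstreamReset(Exception):
    """Stands in for a connect failure or connection reset (hypothetical)."""


@dataclass
class OutboundSidecar:
    endpoints: List[str]               # healthy endpoints for one logical service
    send: Callable[[str, str], int]    # hypothetical transport: (endpoint, path) -> status
    num_retries: int = 2               # retries allowed after the first attempt
    attempts: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._next = cycle(self.endpoints)   # round-robin iterator

    def call(self, path: str) -> int:
        last_error = None
        for _ in range(1 + self.num_retries):
            endpoint = next(self._next)
            self.attempts.append(endpoint)
            try:
                status = self.send(endpoint, path)
            except UpstreamReset as exc:
                last_error = exc
                continue                     # retry on the next endpoint in rotation
            if status < 500:
                return status
            last_error = RuntimeError(f"{endpoint} returned {status}")
        raise RuntimeError("all attempts failed") from last_error


def flaky_transport(endpoint: str, path: str) -> int:
    # B1 resets every connection; B2 answers normally.
    if endpoint == "B1":
        raise UpstreamReset(endpoint)
    return 200


sidecar = OutboundSidecar(endpoints=["B1", "B2"], send=flaky_transport)
status = sidecar.call("/charge")
print(status, sidecar.attempts)   # 200 ['B1', 'B2']
```

The caller of call() sees only the final status. The failed attempt against B1 is recorded in attempts, which plays the role of the proxy's access log here.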

The control plane (e.g. Istiod) pushes config to every sidecar. A route/retry policy looks roughly like this (a simplified composite, not literal Istio or Envoy syntax):

# control plane → every sidecar (Envoy) for the "payments" service
route:
  destination: payments          # logical name, not an IP
  load_balancer: round_robin     # spread across healthy endpoints

retry_policy:
  retry_on: connect-failure,5xx,reset
  num_retries: 2                 # up to 2 retries after the first attempt
  per_try_timeout: 250ms

tls:
  mode: ISTIO_MUTUAL             # mTLS: both proxies present a cert
  # certs are issued + auto-rotated by the control plane (SPIFFE identity)

outlier_detection:               # eject an endpoint that keeps failing
  consecutive_5xx: 5
  base_ejection_time: 30s

None of this lives in the application. Change the retry count or turn on mTLS and no app has to be rebuilt or redeployed: the control plane reconfigures the sidecars in place.
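A toy model of that push, assuming an in-process list of proxies. The classes RetryPolicy, Sidecar, and ControlPlane are hypothetical names for illustration; a real control plane streams config to the proxies over the network rather than calling a method on them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetryPolicy:
    num_retries: int
    per_try_timeout_ms: int


class Sidecar:
    def __init__(self, policy: RetryPolicy) -> None:
        self.policy = policy

    def apply(self, policy: RetryPolicy) -> None:
        # Swap the policy in place; the app beside this proxy is untouched.
        self.policy = policy


class ControlPlane:
    def __init__(self) -> None:
        self.sidecars: List[Sidecar] = []

    def register(self, sidecar: Sidecar) -> None:
        self.sidecars.append(sidecar)

    def push(self, policy: RetryPolicy) -> None:
        # Deliver the same policy to every registered proxy.
        for sidecar in self.sidecars:
            sidecar.apply(policy)


control_plane = ControlPlane()
fleet = [Sidecar(RetryPolicy(num_retries=2, per_try_timeout_ms=250)) for _ in range(3)]
for sidecar in fleet:
    control_plane.register(sidecar)

control_plane.push(RetryPolicy(num_retries=3, per_try_timeout_ms=250))
print([s.policy.num_retries for s in fleet])   # [3, 3, 3]
```

The point of the sketch is what is absent: no application object appears anywhere in the update path.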

Cost / trade-offs

Concern	With a mesh	Without (in-app)
Latency	Extra hop in + out of each sidecar (often sub-millisecond, but real)	Direct app→app, no proxy hop
Resource cost	One sidecar per pod: extra CPU + memory, fleet-wide	No per-pod proxy overhead
mTLS	Uniform, auto-rotated certs, on by default	Hand-rolled per service / language; easy to skip
Retries & LB	Declarative policy, consistent everywhere	Re-implemented in every client library
Observability	Golden metrics + tracing headers without app work	Each app instruments itself
Operational load	A whole control plane + sidecars to run and upgrade	Fewer moving parts to operate

The bargain: you accept an extra hop and per-pod overhead to get uniform mTLS, retries, load balancing, and metrics without touching app code or libraries. That pays off most when you have many services in many languages.

Watch out for

Added latency and CPU. Every call now passes through two proxies (out of A, into B). The cost is usually small, but at high RPS the per-pod sidecar CPU and the added hops show up in p99 and in your bill. Measure before assuming “negligible.”
Sidecar startup race. If the app container starts and fires requests before its sidecar is ready, early calls fail or bypass the mesh. Order startup (hold the app until the proxy is healthy) and drain the proxy last on shutdown.
Certificate rotation and expiry. mTLS identities are short-lived and auto-rotated. A stalled control plane, clock skew, or a botched upgrade can let certs expire and silently break all traffic at once. Alert on cert age and control-plane health.
“Is it the mesh?” debugging. A failure can live in the app, the local sidecar, the network, or the remote sidecar. Without proxy access logs and tracing, every incident becomes a finger-pointing hunt across four hops.
A mesh is not a security boundary by itself. mTLS proves identity and encrypts in transit, but you still need authorization policy, network isolation, and app-level checks. “We have a mesh” is not the same as “we are secure.”
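For the startup race in particular, the "hold the app until the proxy is healthy" step can be sketched as a polling gate. This is a minimal sketch under assumptions: wait_for_sidecar and its parameters are hypothetical, and the is_ready probe is injected because the actual readiness check depends on the mesh in use. Many meshes and orchestrators can also enforce this ordering natively, which is usually preferable to hand-rolling it.

```python
import time
from typing import Callable


def wait_for_sidecar(
    is_ready: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake probe that becomes ready on the third poll.
polls = {"count": 0}


def fake_probe() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


ready = wait_for_sidecar(fake_probe, timeout_s=5.0, interval_s=0.0)
print(ready, polls["count"])   # True 3
```

An app entrypoint would call this before opening any outbound connection and exit or fail its own health check if it returns False.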

Worked example

Orders calls payments to charge a card. Payments runs two replicas, B1 and B2. During a deploy, B1 is briefly flapping and resets connections.

orders app:   POST http://payments/charge        # one plain call, no TLS code

sidecar A:    resolve "payments" → [B1, B2]
              pick B1 (round robin)
              mTLS handshake with sidecar B1 … RESET     # B1 is flapping
              retry_policy fires: num_retries=2
              pick B2 (next healthy endpoint)
              mTLS handshake with sidecar B2 … OK         # certs verified
              send encrypted request → sidecar B2

sidecar B2:   terminate mTLS, forward over loopback → payments app
payments app: charge card → 200 OK
              response retraces the path back to orders

orders app:   got 200 OK in ~30ms                        # never saw the reset, retry, or TLS

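A single reset is not enough to eject B1. If B1 keeps failing and reaches the threshold in the policy above (five consecutive errors), outlier detection ejects it for 30s so later calls skip it entirely. The orders team changed nothing and shipped no client-retry code; the mesh absorbed the flap.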
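The ejection rule from the policy shown earlier (consecutive_5xx: 5, base_ejection_time: 30s) can be sketched as a per-endpoint failure counter. OutlierDetector and its method names are hypothetical, and the model is deliberately simplified: real proxies typically add refinements, such as ejection times that grow with repeated ejections and a cap on how much of the pool may be ejected, which are ignored here.

```python
from typing import Dict


class OutlierDetector:
    def __init__(self, consecutive_failures: int = 5, base_ejection_s: float = 30.0) -> None:
        self.threshold = consecutive_failures
        self.base_ejection_s = base_ejection_s
        self.failures: Dict[str, int] = {}
        self.ejected_until: Dict[str, float] = {}

    def record(self, endpoint: str, ok: bool, now: float) -> None:
        if ok:
            self.failures[endpoint] = 0      # any success resets the streak
            return
        self.failures[endpoint] = self.failures.get(endpoint, 0) + 1
        if self.failures[endpoint] >= self.threshold:
            self.ejected_until[endpoint] = now + self.base_ejection_s
            self.failures[endpoint] = 0      # start a fresh streak after ejection

    def is_healthy(self, endpoint: str, now: float) -> bool:
        return now >= self.ejected_until.get(endpoint, 0.0)


detector = OutlierDetector()
for second in range(4):
    detector.record("B1", ok=False, now=float(second))
print(detector.is_healthy("B1", now=4.0))    # True: only 4 failures so far
detector.record("B1", ok=False, now=4.0)     # fifth consecutive failure
print(detector.is_healthy("B1", now=5.0))    # False: ejected until t=34
print(detector.is_healthy("B1", now=34.0))   # True: ejection window over
```

Times are passed in explicitly so the behaviour is deterministic; a load balancer would consult is_healthy before adding an endpoint to its rotation.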

Check yourself

Where does the mutual-TLS encryption actually happen in a sidecar mesh?

The upstream endpoint B1 resets the connection. With a retry policy in the mesh, what does the calling app observe?