Service mesh networking

Put a small proxy beside every service so the network — retries, mTLS, routing, metrics — is handled outside your app code.

The idea

In a microservice fleet, every service needs the same network plumbing: encrypt traffic, retry a flaky peer, balance across replicas, emit metrics. Writing that into each app — in each language — is repetitive and drifts out of sync.

A service mesh moves it out of the app. A sidecar proxy (typically Envoy) runs next to each service instance. Calls never go app→app directly; they go app→local sidecar→remote sidecar→app. The sidecars (the data plane) do mTLS, load balancing, and retries; a control plane tells them how.

Service A (orders) wants to call payments. The request travels app → sidecar → mTLS → sidecar → app, with a retry along the way.

The app code only ever made one plain call to payments. Everything else (encryption, the choice of upstream, the retry) happened in the sidecars, invisible to both apps.

How it works

Each pod runs the app and a sidecar side by side. Inbound and outbound traffic is transparently redirected (via iptables or eBPF) into the local sidecar, so the app makes an ordinary http://payments/charge call and never knows a proxy exists. The outbound sidecar resolves payments to a set of healthy endpoints, picks one according to the load-balancing policy, and opens an mTLS connection to that peer's sidecar, with both sides presenting and verifying certificates. If the attempt fails, it retries. The remote sidecar terminates mTLS and forwards the request over loopback to its app.
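To make "both sides present and verify certificates" concrete, here is a minimal sketch of a mutual-TLS setup using Python's standard ssl module. It illustrates the handshake requirements only and is not how Envoy is implemented. The function names and the file-path parameters are hypothetical; the paths are optional so the contexts can be built without certificate files on hand. In a real mesh the control plane issues and rotates the certificates.

```python
import ssl


def server_mtls_context(cert_file=None, key_file=None, ca_file=None):
    """TLS context for the receiving (remote sidecar) side."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A server context defaults to CERT_NONE. Requiring a client
    # certificate is the step that makes the TLS session *mutual*.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        ctx.load_cert_chain(cert_file, key_file)   # this side's identity
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA used to verify the peer
    return ctx


def client_mtls_context(cert_file=None, key_file=None, ca_file=None):
    """TLS context for the calling (local sidecar) side."""
    # PROTOCOL_TLS_CLIENT already verifies the server certificate
    # (verify_mode=CERT_REQUIRED, check_hostname=True).
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_file and key_file:
        ctx.load_cert_chain(cert_file, key_file)   # cert presented to the server
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The point of the sketch is that this setup, plus certificate distribution and rotation, is work every service would otherwise repeat in its own language.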

The control plane (e.g. Istiod) pushes config to every sidecar. A route/retry policy looks like:

# control plane → every sidecar (Envoy) for the "payments" service
route:
  destination: payments          # logical name, not an IP
  load_balancer: round_robin     # spread across healthy endpoints

retry_policy:
  retry_on: connect-failure,5xx,reset
  num_retries: 2                 # up to 2 retries after the first attempt
  per_try_timeout: 250ms

tls:
  mode: ISTIO_MUTUAL             # mTLS: both proxies present a cert
  # certs are issued + auto-rotated by the control plane (SPIFFE identity)

outlier_detection:               # eject an endpoint that keeps failing
  consecutive_5xx: 5
  base_ejection_time: 30s

None of this lives in the application. Change the retry count or turn on mTLS and the apps redeploy nothing — the control plane reconfigures the sidecars in place.
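One way to picture "reconfigures the sidecars in place": the sidecar holds the policy as data and consults it on every call, so swapping the policy changes behavior without touching the caller. The Python sketch below is a toy model under that assumption. RetryPolicy, Sidecar, push_config, and flaky are hypothetical names, not Envoy or Istio APIs, and the per-try timeout is modeled only as arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    num_retries: int = 2           # retries after the first attempt
    per_try_timeout_ms: int = 250

    def max_attempts(self) -> int:
        return 1 + self.num_retries

    def worst_case_ms(self) -> int:
        # Upper bound on time spent in attempts, ignoring any backoff.
        return self.max_attempts() * self.per_try_timeout_ms


class Sidecar:
    def __init__(self, policy: RetryPolicy):
        self.policy = policy

    def push_config(self, policy: RetryPolicy) -> None:
        """Stand-in for the control plane pushing new config."""
        self.policy = policy

    def call(self, send):
        """Run send() up to max_attempts times; return (result, attempts)."""
        last_error = None
        for attempt in range(1, self.policy.max_attempts() + 1):
            try:
                return send(), attempt
            except ConnectionError as exc:
                last_error = exc
        raise last_error


def flaky(failures: int):
    """Build a send() that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("reset")
        return "200 OK"

    return send


if __name__ == "__main__":
    sidecar = Sidecar(RetryPolicy(num_retries=0))
    sidecar.push_config(RetryPolicy(num_retries=2))  # app code unchanged
    print(sidecar.call(flaky(1)))                    # ('200 OK', 2)
    print(sidecar.policy.worst_case_ms())            # 750
```

With the policy shown earlier (2 retries, 250ms per try), a call can spend up to 3 × 250ms = 750ms in attempts before the caller sees a failure, which is worth keeping in mind when setting the caller's own timeout.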

Cost / trade-offs

Concern | With a mesh | Without (in-app)
Latency | Extra hop in + out of each sidecar (often sub-millisecond, but real) | Direct app→app, no proxy hop
Resource cost | One sidecar per pod: extra CPU + memory, fleet-wide | No per-pod proxy overhead
mTLS | Uniform, auto-rotated certs, on by default | Hand-rolled per service / language; easy to skip
Retries & LB | Declarative policy, consistent everywhere | Re-implemented in every client library
Observability | Golden metrics + tracing headers without app work | Each app instruments itself
Operational load | A whole control plane + sidecars to run and upgrade | Fewer moving parts to operate

The bargain: you accept an extra hop and per-pod overhead to get uniform mTLS, retries, load balancing, and metrics without touching app code or libraries. That pays off most when you have many services in many languages.

Watch out for

Worked example

Orders calls payments to charge a card. Payments runs two replicas, B1 and B2. During a deploy, B1 is briefly flapping and resets connections.

orders app:   POST http://payments/charge        # one plain call, no TLS code

sidecar A:    resolve "payments" → [B1, B2]
              pick B1 (round robin)
              mTLS handshake with sidecar B1 … RESET     # B1 is flapping
              retry_policy fires: num_retries=2
              pick B2 (next healthy endpoint)
              mTLS handshake with sidecar B2 … OK         # certs verified
              send encrypted request → sidecar B2

sidecar B2:   terminate mTLS, forward over loopback → payments app
payments app: charge card → 200 OK
              response retraces the path back to orders

orders app:   got 200 OK in ~30ms                        # never saw the reset, retry, or TLS

If B1 keeps failing (five consecutive errors under the policy above), outlier detection ejects it for 30s so later calls skip it entirely. The orders team changed nothing and shipped no client-retry code; the mesh absorbed the flap.
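The ejection rule can be sketched as a per-endpoint counter of consecutive failures. This is a minimal Python model, assuming the thresholds from the policy shown earlier (consecutive_5xx: 5, base_ejection_time: 30s). OutlierDetector and its methods are hypothetical names, and details that real proxies add (for example, longer ejections for repeat offenders or a cap on how many hosts may be ejected at once) are left out. Time is passed in explicitly so the behavior is easy to follow.

```python
class OutlierDetector:
    def __init__(self, consecutive_failures=5, ejection_seconds=30.0):
        self.threshold = consecutive_failures
        self.ejection_seconds = ejection_seconds
        self.failures = {}        # endpoint -> current consecutive failure count
        self.ejected_until = {}   # endpoint -> time at which it may return

    def record(self, endpoint, ok, now):
        """Record the outcome of one call to `endpoint` at time `now`."""
        if ok:
            self.failures[endpoint] = 0   # any success resets the streak
            return
        self.failures[endpoint] = self.failures.get(endpoint, 0) + 1
        if self.failures[endpoint] >= self.threshold:
            self.ejected_until[endpoint] = now + self.ejection_seconds
            self.failures[endpoint] = 0   # start a fresh streak after ejection

    def healthy(self, endpoints, now):
        """Endpoints the load balancer may still pick at time `now`."""
        return [e for e in endpoints if self.ejected_until.get(e, 0.0) <= now]


if __name__ == "__main__":
    det = OutlierDetector()
    for t in range(5):                            # B1 fails five calls in a row
        det.record("B1", ok=False, now=float(t))
    print(det.healthy(["B1", "B2"], now=5.0))     # ['B2']
    print(det.healthy(["B1", "B2"], now=34.0))    # ['B1', 'B2']
```

A single reset, as in the trace above, only starts the streak; it is the retry policy that hides that one failure from the caller, while ejection handles an endpoint that stays broken.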

Check yourself

Where does the mutual-TLS encryption actually happen in a sidecar mesh?

The upstream endpoint B1 resets the connection. With a retry policy in the mesh, what does the calling app observe?