
Building a Production-Ready Kubernetes Performance Testing Framework


Building a Kubernetes cluster is easy; proving it's production-ready is hard. How do you know if your control plane can scale? Is your storage actually delivering the IOPS promised by the vendor?

To answer these questions, I researched how to validate each layer of a cluster and then built a performance test framework to do it.

In this post, I'll walk you through the objective, the test flow, the tooling, and the evaluation criteria for this framework.

TL;DR 👉 github.com/nh4ttruong/k8s-perf-tests

🎯 The Objective

The goal is to ensure the Kubernetes cluster meets specific performance requirements across all layers:

  • Control Plane: API Server handles expected request volume.

  • etcd: Database meets throughput and latency requirements.

  • Network: Pod-to-pod communication achieves expected bandwidth and latency.

  • DNS: CoreDNS handles query rate from services.

  • Storage: Persistent Volume meets IOPS and throughput requirements.

  • Ingress: Load balancer handles external traffic.

🔄 The Test Flow

We don't test randomly. We execute in a specific order, moving from the infrastructure layer up to the application layer. If the foundation (etcd) is shaky, testing the Ingress is pointless.

  1. etcd → Database performance (foundation for everything)

  2. API Server → Control plane capacity

  3. Network (CNI) → Pod networking performance

  4. CoreDNS → Service discovery latency

  5. Storage → Persistent volume I/O

  6. Ingress → External traffic handling
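The ordering above can be encoded in a small runner so the suite always executes bottom-up and stops as soon as a layer fails. The per-phase script path below is illustrative, not the framework's actual layout:

```shell
#!/bin/sh
# Run the test phases bottom-up; abort on the first failing layer,
# since results measured above a broken foundation are meaningless.
# Phase names mirror the ordered list above.
set -e

for phase in etcd api-server network coredns storage ingress; do
  echo "=== phase: $phase ==="
  # Hypothetical per-phase entry point; adjust to your repo layout:
  # sh "./$phase/run.sh" || { echo "$phase failed, stopping"; exit 1; }
done
```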

🛠 Components & Tools

Here is the stack we use to validate each component. Click on the component name to view the source code and detailed documentation:

| Component | Tool | Test Focus |
|---|---|---|
| API Server & Kubelet | kube-burner | Object CRUD, pod scheduling |
| etcd | etcdctl, benchmark | Read/write latency, throughput |
| CoreDNS | dnsperf | Query throughput, latency |
| Network | k8s-netperf | Pod-to-pod, service latency |
| Storage | fio, kbench | IOPS, throughput, latency |
| Ingress | wrk | HTTP RPS, response time |
| Monitoring | Grafana | Real-time metrics |

Test types

For each component in the following posts, we will look at three types of tests:

  1. Smoke: Validate configuration (1-5 min).

  2. Load: Measure performance at expected load (15-60 min, 70-100% capacity).

  3. Stress: Find the breaking point (10-30 min, 150-200% capacity).
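Since the three test types differ only in duration and load level, they can be driven from a single entry point. The sketch below restates the ranges above (taking the lower bound of each as a default); the function name and values are illustrative:

```shell
# Map a test type to an illustrative duration and load level,
# using the lower bound of each range described above.
test_params() {
  case "$1" in
    smoke)  echo "duration=1m load=minimal" ;;
    load)   echo "duration=15m load=70%" ;;
    stress) echo "duration=10m load=150%" ;;
    *)      echo "unknown test type: $1" >&2; return 1 ;;
  esac
}

test_params smoke
test_params stress
```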

Quick Start

# Clone repository
git clone https://github.com/nh4ttruong/k8s-perf-test.git
cd k8s-perf-test

# 1. etcd smoke test (if you have access)
./etcd/scripts/etcdctl/smoke.sh

# 2. API Server smoke test
kube-burner init -c kube-burner/api-server/smoke.yaml

# 3. Network smoke test
k8s-netperf --config network/smoke_1.yaml --local

# 4. DNS smoke test
kubectl apply -f coredns/
kubectl logs -f -l app=dnsperf -n coredns-perf-test

# 5. Storage smoke test
kubectl create namespace storage-perf-test
kubectl apply -f storage/smoke.yaml -n storage-perf-test
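Before running the quick start, it helps to verify the client-side tools are on your PATH. This check is my own addition, not part of the repository:

```shell
# Check that the CLI tools used by the quick start are installed.
# Missing tools are reported but don't abort, so you can still run
# the subset of tests you have tooling for.
missing=0
for tool in git kubectl kube-burner k8s-netperf; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```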

Evaluation Criteria

Kubernetes SLIs/SLOs

The Kubernetes project defines Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for a properly functioning cluster. These criteria are used for performance evaluation.

Reference: Kubernetes Scalability SLIs/SLOs

API Server SLOs

| SLI | SLO | Description |
|---|---|---|
| Mutating API latency (P99) | ≤ 1s | Time to process CREATE, UPDATE, DELETE |
| Non-mutating API latency (P99) | ≤ 1s (single object) | Time to process GET for a single resource |
| Non-mutating API latency (P99) | ≤ 30s (list objects) | Time to process LIST resources |
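As a concrete example of applying these SLOs, here is one way to compute a nearest-rank P99 from a file of observed request latencies and compare it against the 1s mutating-call budget; the sample data is made up:

```shell
# Compute a nearest-rank P99 from a list of latencies (seconds,
# one per line) and compare it to the 1s mutating-API SLO above.
# The sample values are fabricated for illustration.
printf '%s\n' 0.12 0.30 0.25 0.90 0.41 0.18 0.22 0.35 0.28 0.95 > latencies.txt

p99=$(sort -n latencies.txt | awk '{a[NR]=$1} END {
  idx = int(NR * 0.99)
  if (idx < NR * 0.99) idx = idx + 1   # nearest-rank: round up
  print a[idx]
}')
echo "P99 = ${p99}s"

# SLO check: mutating API latency P99 <= 1s
awk -v p="$p99" 'BEGIN { exit !(p <= 1.0) }' \
  && echo "SLO met" || echo "SLO violated"
```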

Pod Startup SLOs

| SLI | SLO | Condition |
|---|---|---|
| Pod startup latency (P99) | ≤ 5s | Stateless pods, image already present |
| Pod startup latency (P99) | ≤ 20s | Stateless pods, image pull required |

Target Metrics by Component

Specific evaluation criteria for each component:

Control Plane

| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| API Server | Mutating P99 | < 500ms | < 1s | K8s SLOs |
| API Server | Non-mutating P99 | < 200ms | < 1s | K8s SLOs |
| API Server | QPS sustained | > 1000 | > 500 | Depends on cluster size |
| API Server | Error rate | < 0.1% | < 1% | |
| etcd | Write latency P99 | < 25ms | < 50ms | etcd tuning |
| etcd | Read latency P99 | < 10ms | < 25ms | etcd tuning |
| etcd | fsync duration P99 | < 10ms | < 25ms | etcd hardware |

Data Plane

| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| Pod-to-pod | Throughput | > 5 Gbps | > 1 Gbps | Depends on physical network |
| Pod-to-pod | Latency | < 1ms | < 5ms | Same zone |
| Pod-to-service | Latency | < 2ms | < 10ms | Via kube-proxy |
| CoreDNS | Query rate | > 10k QPS | > 5k QPS | CoreDNS plugins |
| CoreDNS | P99 latency | < 10ms | < 50ms | With cache |
| CoreDNS | Cache hit ratio | > 90% | > 80% | |
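The cache hit ratio is simply hits / (hits + misses). The counter values below are made up; in practice they would come from CoreDNS's cache metrics (`coredns_cache_hits_total` and `coredns_cache_misses_total`):

```shell
# Compute the CoreDNS cache hit ratio from hit/miss counters.
# These counter values are fabricated for illustration; real values
# come from the CoreDNS cache plugin's Prometheus metrics.
hits=46000
misses=4000
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "cache hit ratio: ${ratio}%"

# Compare to the 90% target above.
awk -v r="$ratio" 'BEGIN { exit !(r >= 90) }' && echo "target met"
```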

Storage

| Workload Type | Metric | SSD Target | NVMe Target | Reference |
|---|---|---|---|---|
| Database (OLTP) | Random 4K read IOPS | > 10k | > 50k | fio profiles |
| Database (OLTP) | Random 4K write IOPS | > 5k | > 20k | |
| Database (OLTP) | P99 latency | < 5ms | < 1ms | |
| Logging/Streaming | Sequential write MB/s | > 200 | > 1000 | |
| Analytics (OLAP) | Sequential read MB/s | > 300 | > 2000 | |
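For the OLTP profile, a minimal fio job matching the random 4K metrics above might look like the following. The job parameters are a sketch of such a profile, not the framework's actual configuration:

```shell
# Write a minimal fio job for an OLTP-style random 4K test.
# Parameters are illustrative; tune runtime/iodepth/size for your cluster.
cat > oltp-4k.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
runtime=60
time_based=1

[random-read]
rw=randread
iodepth=32
size=1G
filename=/data/fio-test

[random-write]
rw=randwrite
iodepth=32
size=1G
filename=/data/fio-test
stonewall
EOF

# Then run it inside a pod with the PVC mounted at /data:
# fio oltp-4k.fio --output-format=json
```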

Ingress

| Metric | Small cluster | Large cluster | Reference |
|---|---|---|---|
| Requests/sec | > 10k | > 50k | NGINX tuning |
| P99 latency | < 100ms | < 20ms | |
| Error rate (5xx) | < 0.1% | < 0.01% | |
| Connection rate | > 5k/s | > 20k/s | |

Result Evaluation

Pass/Fail Criteria

| Result | Condition |
|---|---|
| PASS | All metrics meet Target |
| CONDITIONAL PASS | All metrics within Critical range, some not meeting Target |
| FAIL | Any metric exceeds Critical threshold |
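The per-metric classification behind this table can be expressed as a small helper. This sketch is my own and handles a "higher is better" metric (IOPS, QPS); "lower is better" metrics such as latencies would invert the comparisons:

```shell
# Classify one "higher is better" metric against its thresholds:
#   >= target             -> PASS
#   >= critical, < target -> CONDITIONAL PASS
#   < critical            -> FAIL
evaluate() { # evaluate <value> <target> <critical>
  awk -v v="$1" -v t="$2" -v c="$3" 'BEGIN {
    if (v >= t)      print "PASS"
    else if (v >= c) print "CONDITIONAL PASS"
    else             print "FAIL"
  }'
}

evaluate 12000 10000 5000   # PASS
evaluate 7000  10000 5000   # CONDITIONAL PASS
evaluate 3000  10000 5000   # FAIL
```

The overall verdict then follows the table: a single FAIL on any metric fails the run, and any CONDITIONAL PASS downgrades an otherwise clean run.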

Pre-Production Checklist

  • API Server P99 latency < 500ms under expected load

  • etcd write latency P99 < 25ms

  • No etcd leader election during 24h test

  • Pod startup time P99 < 5s (image cached)

  • DNS query latency P99 < 10ms

  • Storage IOPS meets workload requirements

  • Network throughput meets inter-zone requirements

  • Ingress handles expected peak traffic

In the next post, we start with the heart of the cluster: etcd → Kubernetes etcd Performance Benchmarks