Building a Production-Ready Kubernetes Performance Testing Framework

Building a Kubernetes cluster is easy; proving it's production-ready is hard. How do you know if your control plane can scale? Is your storage actually delivering the IOPS promised by the vendor?
To answer these questions, I researched the available tooling and built a performance test framework to validate the cluster before it goes to production.
In this post, I'll walk you through the objective, the test flow, and the components and tools needed to set up this framework.
🎯 The Objective
The goal is to ensure the Kubernetes cluster meets specific performance requirements across all layers:
- Control Plane: API Server handles the expected request volume.
- etcd: Database meets throughput and latency requirements.
- Network: Pod-to-pod communication achieves the expected bandwidth and latency.
- DNS: CoreDNS handles the query rate from services.
- Storage: Persistent Volumes meet IOPS and throughput requirements.
- Ingress: Load balancer handles external traffic.
🔄 The Test Flow
We don't test randomly. We execute in a specific order, moving from the infrastructure layer up to the application layer. If the foundation (etcd) is shaky, testing the Ingress is pointless.
1. etcd → Database performance (foundation for everything)
2. API Server → Control plane capacity
3. Network (CNI) → Pod networking performance
4. CoreDNS → Service discovery latency
5. Storage → Persistent volume I/O
6. Ingress → External traffic handling
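The ordered flow above can be sketched as a small runner script. The stage names mirror the list; the commented-out per-stage entry point is hypothetical, not a path from the repo:

```shell
#!/usr/bin/env bash
# Execute test stages strictly in dependency order: if a foundation
# layer is shaky, there is no point testing the layers above it.
set -euo pipefail

stages=(etcd api-server network coredns storage ingress)

for stage in "${stages[@]}"; do
  echo "=== ${stage} ==="
  # ./run/"${stage}".sh || { echo "${stage} failed, aborting"; exit 1; }  # hypothetical entry point
done
```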
🛠 Components & Tools
Here is the stack we use to validate each component. Click on the component name to view the source code and detailed documentation:
| Component | Tool | Test Focus |
|---|---|---|
| API Server & Kubelet | kube-burner | Object CRUD, pod scheduling |
| etcd | etcdctl, benchmark | Read/write latency, throughput |
| CoreDNS | dnsperf | Query throughput, latency |
| Network | k8s-netperf | Pod-to-pod, service latency |
| Storage | fio, kbench | IOPS, throughput, latency |
| Ingress | wrk | HTTP RPS, response time |
| Monitoring | Grafana | Real-time metrics |
Test Types
For each component in the following posts, we will look at three types of tests:
- Smoke: Validate configuration (1-5 min).
- Load: Measure performance at expected load (15-60 min, 70-100% capacity).
- Stress: Find the breaking point (10-30 min, 150-200% capacity).
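A driver script might encode the three profiles like this. The durations and load levels are picked from the ranges above; the function and variable names are my own:

```shell
# Map each test type to a duration and a load level expressed as a
# percentage of expected capacity; a driver would hand these to
# wrk, fio, kube-burner, etc.
profile() {
  case "$1" in
    smoke)  echo "duration=3m load_pct=10" ;;
    load)   echo "duration=30m load_pct=100" ;;
    stress) echo "duration=20m load_pct=200" ;;
    *)      echo "unknown test type: $1" >&2; return 1 ;;
  esac
}

profile load   # prints "duration=30m load_pct=100"
```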
Quick Start
```bash
# Clone repository
git clone https://github.com/nh4ttruong/k8s-perf-test.git
cd k8s-perf-test

# 1. etcd smoke test (if you have access)
./etcd/scripts/etcdctl/smoke.sh

# 2. API Server smoke test
kube-burner init -c kube-burner/api-server/smoke.yaml

# 3. Network smoke test
k8s-netperf --config network/smoke_1.yaml --local

# 4. DNS smoke test
kubectl apply -f coredns/
kubectl logs -f -l app=dnsperf -n coredns-perf-test

# 5. Storage smoke test
kubectl create namespace storage-perf-test
kubectl apply -f storage/smoke.yaml -n storage-perf-test
```
Evaluation Criteria
Kubernetes SLIs/SLOs
The Kubernetes project defines Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for a properly functioning cluster. These criteria are used for performance evaluation.
Reference: Kubernetes Scalability SLIs/SLOs
API Server SLOs
| SLI | SLO | Description |
|---|---|---|
| Mutating API latency (P99) | ≤ 1s | Time to process CREATE, UPDATE, DELETE |
| Non-mutating API latency (P99) | ≤ 1s (single object) | Time to process GET single resource |
| Non-mutating API latency (P99) | ≤ 30s (list objects) | Time to process LIST resources |
Pod Startup SLOs
| SLI | SLO | Condition |
|---|---|---|
| Pod startup latency (P99) | ≤ 5s | Stateless pods, image already present |
| Pod startup latency (P99) | ≤ 20s | Stateless pods, image pull required |
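kube-burner can enforce pod-startup SLOs like these during a run through its podLatency measurement. A rough sketch of the relevant config fragment; double-check the field names against the kube-burner version you use:

```yaml
global:
  measurements:
    - name: podLatency
      thresholds:
        # Fail the job if P99 time-to-Ready exceeds the 5s SLO
        - conditionType: Ready
          metric: P99
          threshold: 5s
```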
Target Metrics by Component
Specific evaluation criteria for each component:
Control Plane
| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| API Server | Mutating P99 | < 500ms | < 1s | K8s SLOs |
| API Server | Non-mutating P99 | < 200ms | < 1s | K8s SLOs |
| API Server | QPS sustained | > 1000 | > 500 | Depends on cluster size |
| API Server | Error rate | < 0.1% | < 1% | |
| etcd | Write latency P99 | < 25ms | < 50ms | etcd tuning |
| etcd | Read latency P99 | < 10ms | < 25ms | etcd tuning |
| etcd | fsync duration P99 | < 10ms | < 25ms | etcd hardware |
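A quick way to sanity-check the etcd rows is etcdctl's built-in benchmark. A guarded sketch: the endpoint is a placeholder for your cluster, and the script skips cleanly when etcdctl is absent:

```shell
if ! command -v etcdctl >/dev/null 2>&1; then
  echo "etcdctl not found, skipping"
  ETCD_CHECK=skipped
else
  # --load=s simulates a small-cluster workload; m/l/xl scale it up.
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 check perf --load=s \
    || echo "etcd perf check failed (no reachable cluster?)"
  ETCD_CHECK=ran
fi
```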
Data Plane
| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| Pod-to-pod | Throughput | > 5 Gbps | > 1 Gbps | Depends on physical network |
| Pod-to-pod | Latency | < 1ms | < 5ms | Same zone |
| Pod-to-service | Latency | < 2ms | < 10ms | Via kube-proxy |
| CoreDNS | Query rate | > 10k QPS | > 5k QPS | CoreDNS plugins |
| CoreDNS | P99 latency | < 10ms | < 50ms | With cache |
| CoreDNS | Cache hit ratio | > 90% | > 80% | |
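The CoreDNS rows map onto a dnsperf run against the cluster DNS service (10.96.0.10 below is a placeholder; substitute your kube-dns ClusterIP). A guarded sketch:

```shell
# dnsperf reads queries as "<name> <type>" lines; a couple of
# always-present cluster names make a reasonable smoke query set.
cat > /tmp/queries.txt <<'EOF'
kubernetes.default.svc.cluster.local A
kube-dns.kube-system.svc.cluster.local A
EOF

if ! command -v dnsperf >/dev/null 2>&1; then
  echo "dnsperf not found, skipping"
else
  # -s server, -d query file, -l duration in seconds, -Q max QPS to attempt
  dnsperf -s 10.96.0.10 -d /tmp/queries.txt -l 60 -Q 10000 \
    || echo "dnsperf run failed"
fi
```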
Storage
| Workload Type | Metric | SSD Target | NVMe Target | Reference |
|---|---|---|---|---|
| Database (OLTP) | Random 4K read IOPS | > 10k | > 50k | fio profiles |
| Database (OLTP) | Random 4K write IOPS | > 5k | > 20k | |
| Database (OLTP) | P99 latency | < 5ms | < 1ms | |
| Logging/Streaming | Sequential write MB/s | > 200 | > 1000 | |
| Analytics (OLAP) | Sequential read MB/s | > 300 | > 2000 | |
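The OLTP row translates directly into an fio job. A guarded sketch of the random-4K-read profile, assuming the PVC under test is mounted at /data (a path I'm inventing for illustration):

```shell
# O_DIRECT random 4K reads, matching the "Random 4K read IOPS" metric;
# iodepth=32 keeps the queue busy enough to approach the device's limit.
FIO_OPTS="--rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based"

if ! command -v fio >/dev/null 2>&1; then
  echo "fio not found, skipping"
else
  fio --name=oltp-randread --filename=/data/fio.test --size=1G \
      $FIO_OPTS --group_reporting \
    || echo "fio run failed (is /data a mounted PVC?)"
fi
```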
Ingress
| Metric | Small cluster | Large cluster | Reference |
|---|---|---|---|
| Requests/sec | > 10k | > 50k | NGINX tuning |
| P99 latency | < 100ms | < 20ms | |
| Error rate (5xx) | < 0.1% | < 0.01% | |
| Connection rate | > 5k/s | > 20k/s | |
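wrk covers the ingress rows. A guarded sketch (the hostname is a placeholder; point it at your ingress):

```shell
INGRESS_URL="https://app.example.invalid/"   # placeholder: your ingress hostname

if ! command -v wrk >/dev/null 2>&1; then
  echo "wrk not found, skipping"
else
  # 8 threads, 256 open connections, 60s run; --latency prints the
  # percentile distribution needed to judge the P99 target above.
  wrk -t8 -c256 -d60s --latency "$INGRESS_URL" || echo "wrk run failed"
fi
```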
Result Evaluation
Pass/Fail Criteria
| Result | Condition |
|---|---|
| PASS | All metrics meet Target |
| CONDITIONAL PASS | All metrics within Critical range, some not meeting Target |
| FAIL | Any metric exceeds Critical threshold |
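The table above is mechanical enough to script. A minimal sketch for one lower-is-better metric, in milliseconds (the function and its interface are my own, not part of the repo):

```shell
# Classify a measured value against its Target and Critical thresholds,
# following the PASS / CONDITIONAL PASS / FAIL rules above.
evaluate() {
  local measured=$1 target=$2 critical=$3
  if [ "$measured" -lt "$target" ]; then
    echo "PASS"
  elif [ "$measured" -lt "$critical" ]; then
    echo "CONDITIONAL PASS"
  else
    echo "FAIL"
  fi
}

evaluate 400 500 1000   # API Server mutating P99 of 400ms: prints "PASS"
```

A real harness would aggregate verdicts across all metrics and take the worst one as the overall result.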
Pre-Production Checklist
- [ ] API Server P99 latency < 500ms under expected load
- [ ] etcd write latency P99 < 25ms
- [ ] No etcd leader election during a 24h test
- [ ] Pod startup time P99 < 5s (image cached)
- [ ] DNS query latency P99 < 10ms
- [ ] Storage IOPS meets workload requirements
- [ ] Network throughput meets inter-zone requirements
- [ ] Ingress handles expected peak traffic
References
- Kubernetes Scalability SLIs/SLOs - Official SLO definitions
- Kubernetes Scalability Thresholds - Cluster size limits
- SIG Scalability - Scalability working group
In the next post, we start with the heart of the cluster, etcd → Kubernetes etcd Performance Benchmarks