
Building a Production-Ready Kubernetes Performance Testing Framework


Building a Kubernetes cluster is easy; proving it's production-ready is hard. How do you know if your control plane can scale? Is your storage actually delivering the IOPS promised by the vendor?

To answer these questions, I researched how to validate each layer of a cluster and then built a performance test framework to do it.

In this post, I'll walk you through the objective, the test flow, the tooling, and the evaluation criteria for this framework.

TL;DR 👉 github.com/nh4ttruong/k8s-perf-tests

🎯 The Objective

The goal is to ensure the Kubernetes cluster meets specific performance requirements across all layers:

  • Control Plane: API Server handles expected request volume.

  • etcd: Database meets throughput and latency requirements.

  • Network: Pod-to-pod communication achieves expected bandwidth and latency.

  • DNS: CoreDNS handles query rate from services.

  • Storage: Persistent Volume meets IOPS and throughput requirements.

  • Ingress: Load balancer handles external traffic.

🔄 The Test Flow

We don't test randomly. We execute in a specific order, moving from the infrastructure layer up to the application layer. If the foundation (etcd) is shaky, testing the Ingress is pointless.

  1. etcd → Database performance (foundation for everything)

  2. API Server → Control plane capacity

  3. Network (CNI) → Pod networking performance

  4. CoreDNS → Service discovery latency

  5. Storage → Persistent volume I/O

  6. Ingress → External traffic handling
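The ordering above can be encoded in a small runner so the suite always executes bottom-up and stops as soon as a layer fails. The per-phase script path below is illustrative, not the framework's actual layout:

```shell
#!/bin/sh
# Run the test phases bottom-up; abort on the first failing layer,
# since results measured above a broken foundation are meaningless.
# Phase names mirror the ordered list above.
set -e

for phase in etcd api-server network coredns storage ingress; do
  echo "=== phase: $phase ==="
  # Hypothetical per-phase entry point; adjust to your repo layout:
  # sh "./$phase/run.sh" || { echo "$phase failed, stopping"; exit 1; }
done
```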

🛠 Components & Tools

Here is the stack we use to validate each component. Click on the component name to view the source code and detailed documentation:

| Component | Tool | Test Focus |
|---|---|---|
| API Server & Kubelet | kube-burner | Object CRUD, pod scheduling |
| etcd | etcdctl, benchmark | Read/write latency, throughput |
| CoreDNS | dnsperf | Query throughput, latency |
| Network | k8s-netperf | Pod-to-pod, service latency |
| Storage | fio, kbench | IOPS, throughput, latency |
| Ingress | wrk | HTTP RPS, response time |
| Monitoring | Grafana | Real-time metrics |

Test types

For each component in the following posts, we will look at three types of tests:

  1. Smoke: Validate configuration (1-5 min).

  2. Load: Measure performance at expected load (15-60 min, 70-100% capacity).

  3. Stress: Find the breaking point (10-30 min, 150-200% capacity).
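Since the three test types differ only in duration and load level, they can be driven from a single entry point. The sketch below restates the ranges above (taking the lower bound of each as a default); the function name and values are illustrative:

```shell
# Map a test type to an illustrative duration and load level,
# using the lower bound of each range described above.
test_params() {
  case "$1" in
    smoke)  echo "duration=1m load=minimal" ;;
    load)   echo "duration=15m load=70%" ;;
    stress) echo "duration=10m load=150%" ;;
    *)      echo "unknown test type: $1" >&2; return 1 ;;
  esac
}

test_params smoke
test_params stress
```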

Quick Start

# Clone repository
git clone https://github.com/nh4ttruong/k8s-perf-test.git
cd k8s-perf-test

# 1. etcd smoke test (if you have access)
./etcd/scripts/etcdctl/smoke.sh

# 2. API Server smoke test
kube-burner init -c kube-burner/api-server/smoke.yaml

# 3. Network smoke test
k8s-netperf --config network/smoke_1.yaml --local

# 4. DNS smoke test
kubectl apply -f coredns/
kubectl logs -f -l app=dnsperf -n coredns-perf-test

# 5. Storage smoke test
kubectl create namespace storage-perf-test
kubectl apply -f storage/smoke.yaml -n storage-perf-test
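Before running the quick start, it helps to verify the client-side tools are on your PATH. This check is my own addition, not part of the repository:

```shell
# Check that the CLI tools used by the quick start are installed.
# Missing tools are reported but don't abort, so you can still run
# the subset of tests you have tooling for.
missing=0
for tool in git kubectl kube-burner k8s-netperf; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```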

Evaluation Criteria

Kubernetes SLIs/SLOs

The Kubernetes project defines Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for a properly functioning cluster. These criteria are used for performance evaluation.

Reference: Kubernetes Scalability SLIs/SLOs

API Server SLOs

| SLI | SLO | Description |
|---|---|---|
| Mutating API latency (P99) | ≤ 1s | Time to process CREATE, UPDATE, DELETE |
| Non-mutating API latency (P99) | ≤ 1s (single object) | Time to process GET for a single resource |
| Non-mutating API latency (P99) | ≤ 30s (list objects) | Time to process LIST resources |
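As a concrete example of applying these SLOs, here is one way to compute a nearest-rank P99 from a file of observed request latencies and compare it against the 1s mutating-call budget; the sample data is made up:

```shell
# Compute a nearest-rank P99 from a list of latencies (seconds,
# one per line) and compare it to the 1s mutating-API SLO above.
# The sample values are fabricated for illustration.
printf '%s\n' 0.12 0.30 0.25 0.90 0.41 0.18 0.22 0.35 0.28 0.95 > latencies.txt

p99=$(sort -n latencies.txt | awk '{a[NR]=$1} END {
  idx = int(NR * 0.99)
  if (idx < NR * 0.99) idx = idx + 1   # nearest-rank: round up
  print a[idx]
}')
echo "P99 = ${p99}s"

# SLO check: mutating API latency P99 <= 1s
awk -v p="$p99" 'BEGIN { exit !(p <= 1.0) }' \
  && echo "SLO met" || echo "SLO violated"
```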

Pod Startup SLOs

| SLI | SLO | Condition |
|---|---|---|
| Pod startup latency (P99) | ≤ 5s | Stateless pods, image already present |
| Pod startup latency (P99) | ≤ 20s | Stateless pods, image pull required |

Target Metrics by Component

Specific evaluation criteria for each component:

Control Plane

| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| API Server | Mutating P99 | < 500ms | < 1s | K8s SLOs |
| API Server | Non-mutating P99 | < 200ms | < 1s | K8s SLOs |
| API Server | QPS sustained | > 1000 | > 500 | Depends on cluster size |
| API Server | Error rate | < 0.1% | < 1% | |
| etcd | Write latency P99 | < 25ms | < 50ms | etcd tuning |
| etcd | Read latency P99 | < 10ms | < 25ms | etcd tuning |
| etcd | fsync duration P99 | < 10ms | < 25ms | etcd hardware |

Data Plane

| Component | Metric | Target | Critical | Reference |
|---|---|---|---|---|
| Pod-to-pod | Throughput | > 5 Gbps | > 1 Gbps | Depends on physical network |
| Pod-to-pod | Latency | < 1ms | < 5ms | Same zone |
| Pod-to-service | Latency | < 2ms | < 10ms | Via kube-proxy |
| CoreDNS | Query rate | > 10k QPS | > 5k QPS | CoreDNS plugins |
| CoreDNS | P99 latency | < 10ms | < 50ms | With cache |
| CoreDNS | Cache hit ratio | > 90% | > 80% | |
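The cache hit ratio is simply hits / (hits + misses). The counter values below are made up; in practice they would come from CoreDNS's cache metrics (`coredns_cache_hits_total` and `coredns_cache_misses_total`):

```shell
# Compute the CoreDNS cache hit ratio from hit/miss counters.
# These counter values are fabricated for illustration; real values
# come from the CoreDNS cache plugin's Prometheus metrics.
hits=46000
misses=4000
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "cache hit ratio: ${ratio}%"

# Compare to the 90% target above.
awk -v r="$ratio" 'BEGIN { exit !(r >= 90) }' && echo "target met"
```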

Storage

| Workload Type | Metric | SSD Target | NVMe Target | Reference |
|---|---|---|---|---|
| Database (OLTP) | Random 4K read IOPS | > 10k | > 50k | fio profiles |
| Database (OLTP) | Random 4K write IOPS | > 5k | > 20k | |
| Database (OLTP) | P99 latency | < 5ms | < 1ms | |
| Logging/Streaming | Sequential write MB/s | > 200 | > 1000 | |
| Analytics (OLAP) | Sequential read MB/s | > 300 | > 2000 | |
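For the OLTP profile, a minimal fio job matching the random 4K metrics above might look like the following. The job parameters are a sketch of such a profile, not the framework's actual configuration:

```shell
# Write a minimal fio job for an OLTP-style random 4K test.
# Parameters are illustrative; tune runtime/iodepth/size for your cluster.
cat > oltp-4k.fio <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
runtime=60
time_based=1

[random-read]
rw=randread
iodepth=32
size=1G
filename=/data/fio-test

[random-write]
rw=randwrite
iodepth=32
size=1G
filename=/data/fio-test
stonewall
EOF

# Then run it inside a pod with the PVC mounted at /data:
# fio oltp-4k.fio --output-format=json
```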

Ingress

| Metric | Small cluster | Large cluster | Reference |
|---|---|---|---|
| Requests/sec | > 10k | > 50k | NGINX tuning |
| P99 latency | < 100ms | < 20ms | |
| Error rate (5xx) | < 0.1% | < 0.01% | |
| Connection rate | > 5k/s | > 20k/s | |

Result Evaluation

Pass/Fail Criteria

| Result | Condition |
|---|---|
| PASS | All metrics meet Target |
| CONDITIONAL PASS | All metrics within Critical range, some not meeting Target |
| FAIL | Any metric exceeds Critical threshold |
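The per-metric classification behind this table can be expressed as a small helper. This sketch is my own and handles a "higher is better" metric (IOPS, QPS); "lower is better" metrics such as latencies would invert the comparisons:

```shell
# Classify one "higher is better" metric against its thresholds:
#   >= target             -> PASS
#   >= critical, < target -> CONDITIONAL PASS
#   < critical            -> FAIL
evaluate() { # evaluate <value> <target> <critical>
  awk -v v="$1" -v t="$2" -v c="$3" 'BEGIN {
    if (v >= t)      print "PASS"
    else if (v >= c) print "CONDITIONAL PASS"
    else             print "FAIL"
  }'
}

evaluate 12000 10000 5000   # PASS
evaluate 7000  10000 5000   # CONDITIONAL PASS
evaluate 3000  10000 5000   # FAIL
```

The overall verdict then follows the table: a single FAIL on any metric fails the run, and any CONDITIONAL PASS downgrades an otherwise clean run.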

Pre-Production Checklist

  • API Server P99 latency < 500ms under expected load

  • etcd write latency P99 < 25ms

  • No etcd leader election during 24h test

  • Pod startup time P99 < 5s (image cached)

  • DNS query latency P99 < 10ms

  • Storage IOPS meets workload requirements

  • Network throughput meets inter-zone requirements

  • Ingress handles expected peak traffic

In the next post, we start with the heart of the cluster: etcd → Kubernetes etcd Performance Benchmarks