HPA allows you to automatically scale the number of pod replicas (container instances) based on the workload. When demand increases, HPA creates more replicas to handle it, and when demand decreases, it reduces the number of replicas to optimize resource usage.
Prerequisites
Before configuring HPA, ensure the following:
- A functioning Kubernetes cluster is set up and accessible.
- The cluster has the Metrics Server installed and verified as operational. The Metrics Server aggregates resource usage data (CPU, memory), which the HPA queries for its decision-making process.
Example: To verify Metrics Server status:
kubectl get deployment metrics-server -n kube-system
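If the deployment is not present, the Metrics Server can typically be installed from the upstream manifest and then verified with kubectl top (the URL below is the standard kubernetes-sigs release location; confirm it matches your cluster version):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top pods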
Implementing the Horizontal Pod Autoscaler
The HPA in Kubernetes is designed to adjust the number of running pods for Deployments, ReplicaSets, or StatefulSets, responding automatically to shifting workloads.
Basic Configuration via Command Line
A typical use case involves scaling a web service that should maintain roughly 50% CPU utilization across its pods. The following command creates the HPA object:
kubectl autoscale deployment webapp --cpu-percent=50 --min=2 --max=8
This auto-generates an HPA resource targeting the webapp deployment, keeping CPU usage close to the specified threshold.
As load increases, HPA increases the pod count up to the defined maximum. As load falls, pod count decreases, but never below the minimum.
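Note that a Utilization target is computed as a percentage of the CPU requests declared on the pods, so the target deployment must set resource requests or the HPA has nothing to measure against. A minimal sketch of the relevant container spec (the image name and values are illustrative):

containers:
- name: webapp
  image: webapp:1.0
  resources:
    requests:
      cpu: 250m
    limits:
      cpu: 500m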
Advanced Customization Using YAML
While the command-line approach covers many scenarios, more granular control is often required. Here is an example YAML definition to demonstrate flexibility:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Apply the configuration with:
kubectl apply -f webapp-hpa.yaml
The HPA then monitors CPU utilization and adjusts the replica count accordingly.
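Beyond metrics and replica bounds, the autoscaling/v2 API also accepts an optional behavior field for tuning how quickly scaling reacts. As a sketch, adding the following to the spec above enforces a five-minute stabilization window before scaling down, which damps replica flapping (the window length is an illustrative choice, not a recommendation):

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300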
Monitoring and Continuous Adjustment
It’s critical to observe the autoscaler’s actions and tweak parameters as use cases evolve. Monitoring can be done via:
kubectl get hpa webapp-hpa
Review results regularly and adjust CPU percentage or replica limits to optimize cost and performance.
Example Output:
NAME         REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
webapp-hpa   Deployment/webapp   40%/50%   2         8         3          5m
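When the numbers are not moving as expected, kubectl describe shows the autoscaler's current conditions and a log of recent scaling events, which is usually the fastest way to diagnose a stuck HPA:

kubectl describe hpa webapp-hpa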
Integrating Custom Metrics
In more complex scenarios, CPU and memory metrics may not reflect application performance needs (e.g., queue length, requests/sec). Kubernetes supports custom metrics integration through APIs such as custom.metrics.k8s.io, and external systems like Prometheus.
For scaling on HTTP requests per second, configure and connect a Prometheus Adapter and set up the HPA's metrics field for the custom metric, as sketched below.
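As a sketch of what that could look like, the following metrics entry scales on a per-pod rate, assuming the Prometheus Adapter has been configured to expose a pod metric named http_requests_per_second (both the metric name and the target value depend entirely on your adapter configuration and are assumptions here):

  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"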
Applications become responsive not just to resource usage but to actual external demand patterns.
Benefits
Deploying and tuning the HPA yields several clear benefits:
- High Availability: Applications maintain service levels even during traffic spikes.
- Resource Optimization: Pod count matches demand, avoiding over-provisioning.
- Minimal Manual Intervention: Autoscaling reduces the need for on-call intervention during load changes.
- Flexibility: Support for custom metrics unlocks more intelligent scaling tied to business logic.
Conclusion
Deploying and customizing the Horizontal Pod Autoscaler in Kubernetes has proven effective for adapting workloads with precision. With the right integrations—such as the Metrics Server and optionally custom metric adapters—teams are equipped to automate scalability, contain costs, and improve user experience with minimal manual effort.
For future projects, further exploration into advanced custom metrics and real-time scaling policies can unlock even greater operational benefits.