HPA allows you to automatically scale the number of pod replicas (container instances) based on the workload. When demand increases, HPA creates more replicas to handle it, and when demand decreases, it reduces the number of replicas to optimize resource usage.
Prerequisites
Before configuring HPA, ensure the following:
- A functioning Kubernetes cluster is set up and accessible.
- The cluster has the Metrics Server installed and verified as operational. The Metrics Server aggregates resource usage data (CPU, memory), which the HPA queries for its decision-making process.
Example: To verify Metrics Server status:
kubectl get deployment metrics-server -n kube-system
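If the deployment is not present, the Metrics Server can typically be installed from the upstream manifest and then verified with kubectl top (the URL below is the standard kubernetes-sigs release location; confirm it matches your cluster version):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top pods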
Implementing the Horizontal Pod Autoscaler
The HPA in Kubernetes is designed to adjust the number of running pods for Deployments, ReplicaSets, or StatefulSets, responding automatically to shifting workloads.
Basic Configuration via Command Line
A typical use case involves scaling a web service that should maintain roughly 50% CPU utilization across its pods. The following command creates the HPA object:
kubectl autoscale deployment webapp --cpu-percent=50 --min=2 --max=8
This auto-generates an HPA resource targeting the webapp deployment, keeping CPU usage close to the specified threshold.
As load increases, HPA increases the pod count up to the defined maximum. As load falls, pod count decreases, but never below the minimum.
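Note that a Utilization target is computed as a percentage of the CPU requests declared on the pods, so the target deployment must set resource requests or the HPA has nothing to measure against. A minimal sketch of the relevant container spec (the image name and values are illustrative):

containers:
- name: webapp
  image: webapp:1.0
  resources:
    requests:
      cpu: 250m
    limits:
      cpu: 500m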
Advanced Customization Using YAML
While the command-line approach covers many scenarios, more granular control is often required. Here is an example YAML definition to demonstrate flexibility:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Apply the configuration with:
kubectl apply -f webapp-hpa.yaml
The HPA then monitors CPU utilization and adjusts the replica count accordingly.
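Beyond metrics and replica bounds, the autoscaling/v2 API also accepts an optional behavior field for tuning how quickly scaling reacts. As a sketch, adding the following to the spec above enforces a five-minute stabilization window before scaling down, which damps replica flapping (the window length is an illustrative choice, not a recommendation):

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300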
Monitoring and Continuous Adjustment
It’s critical to observe the autoscaler’s actions and tweak parameters as use cases evolve. Monitoring can be done via:
kubectl get hpa webapp-hpa
Review results regularly and adjust CPU percentage or replica limits to optimize cost and performance.
Example Output:
NAME         REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
webapp-hpa   Deployment/webapp   40%/50%   2         8         3          5m
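When the numbers are not moving as expected, kubectl describe shows the autoscaler's current conditions and a log of recent scaling events, which is usually the fastest way to diagnose a stuck HPA:

kubectl describe hpa webapp-hpa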
Integrating Custom Metrics
In more complex scenarios, CPU and memory metrics may not reflect application performance needs (e.g., queue length, requests/sec). Kubernetes supports custom metrics integration through APIs such as custom.metrics.k8s.io, and external systems like Prometheus.
For scaling on HTTP requests per second, configure and connect a Prometheus Adapter and set up the HPA's metrics field for the custom metric, as sketched below.
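As a sketch of what that could look like, the following metrics entry scales on a per-pod rate, assuming the Prometheus Adapter has been configured to expose a pod metric named http_requests_per_second (both the metric name and the target value depend entirely on your adapter configuration and are assumptions here):

  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"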
Applications become responsive not just to resource usage but to actual external demand patterns.
Benefits
Deploying and tuning the HPA yields several clear benefits:
- High Availability: Applications maintain service levels even during traffic spikes.
- Resource Optimization: Pod count matches demand, avoiding over-provisioning.
- Minimal Manual Intervention: Autoscaling reduces the need for on-call intervention during load changes.
- Flexibility: Support for custom metrics unlocks more intelligent scaling tied to business logic.
Conclusion
Deploying and customizing the Horizontal Pod Autoscaler in Kubernetes has proven effective for adapting workloads with precision. With the right integrations—such as the Metrics Server and optionally custom metric adapters—teams are equipped to automate scalability, contain costs, and improve user experience with minimal manual effort.
For future projects, further exploration into advanced custom metrics and real-time scaling policies can unlock even greater operational benefits.