## Symptoms & Diagnosis
Kubernetes scheduler performance issues typically manifest as “Pending” pods despite available resources. You may notice a significant lag between pod creation and the node assignment phase.
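A quick first check, before digging into metrics, is to list the pods stuck in Pending and inspect the scheduler's events for one of them (the pod and namespace names below are placeholders):

```shell
# List all pods that have not yet been assigned a node
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Inspect scheduling events for one such pod; look for
# FailedScheduling messages explaining why no node fit
kubectl describe pod <pod-name> -n <namespace>
```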
The first step in diagnosis is checking the kube-scheduler metrics. Specifically, look for high values in `scheduler_scheduling_attempt_duration_seconds` and, on older releases, `scheduler_e2e_scheduling_duration_seconds` (since deprecated).
High CPU utilization on the control plane nodes often accompanies these symptoms. This suggests the scheduler is spending excessive time filtering and scoring every node in a large cluster on each scheduling attempt.
| Metric | Indication of Issue |
|---|---|
| `scheduler_pod_scheduling_duration_seconds` | High end-to-end latency from a pod's first scheduling attempt to its binding. |
| `scheduler_pending_pods` | A growing queue of pods waiting to be scheduled. |
| `scheduler_schedule_attempts_total` | A rising share of `error` or `unschedulable` results, i.e. frequent retries for the same pods. |
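These metrics can be scraped from the scheduler's metrics endpoint. One ad-hoc way to reach it, assuming a recent cluster where the scheduler serves HTTPS on port 10259 and a token authorized for the `/metrics` non-resource URL (the pod name below is a placeholder):

```shell
# Forward the scheduler's secure port to localhost
kubectl -n kube-system port-forward pod/<kube-scheduler-pod> 10259:10259 &

# Fetch metrics; -k skips TLS verification for the self-signed cert
curl -sk -H "Authorization: Bearer $(kubectl create token default)" \
  https://localhost:10259/metrics | grep scheduler_pending_pods
```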

## Troubleshooting Guide
### Adjusting `percentageOfNodesToScore`
In large clusters, evaluating candidate nodes for every pod is computationally expensive. By default the scheduler already adapts, scoring all nodes in small clusters but only a fraction in very large ones; you can tighten this further with the `percentageOfNodesToScore` parameter.
By setting this value lower (e.g., 10–30), the scheduler stops searching for feasible nodes once it has found enough candidates to score, drastically reducing CPU cycles per scheduling attempt.
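As a sketch, the parameter is set in the KubeSchedulerConfiguration file passed to the scheduler via `--config` (the value 20 here is illustrative; tune it for your cluster):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Score only 20% of feasible nodes before picking the best one
percentageOfNodesToScore: 20
```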
```shell
# Example: inspecting the kube-scheduler configuration. The ConfigMap
# name varies by distribution; on kubeadm clusters the config is usually
# a file on the control-plane node rather than a ConfigMap.
kubectl get cm kube-scheduler-config -n kube-system -o yaml
```
### Optimizing Affinity and Anti-Affinity Rules
Complex `podAffinity` and `podAntiAffinity` rules are among the most common causes of scheduler slowdowns: evaluating them requires the scheduler to examine existing pods across the cluster to check topology constraints.
If possible, replace hard `requiredDuringSchedulingIgnoredDuringExecution` rules with `preferredDuringSchedulingIgnoredDuringExecution`. Soft rules let the scheduler make “best effort” placement decisions faster instead of exhaustively proving a hard constraint.
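For example, a hard anti-affinity rule can often be relaxed into a weighted preference like the following (the label and topology key are illustrative):

```yaml
affinity:
  podAntiAffinity:
    # Preferred (soft) rule: the scheduler tries to spread replicas
    # across nodes, but can still place the pod if spreading fails
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web          # illustrative label
        topologyKey: kubernetes.io/hostname
```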
### Debugging with pprof
If the scheduler is consuming excessive memory or CPU, use Go’s pprof tooling to identify the specific function causing the bottleneck. This requires that profiling is enabled on the scheduler (it is by default) and network access to its serving endpoint on the control plane.
```shell
# Capture a 30-second CPU profile. Recent Kubernetes versions serve only
# the secure port 10259 (the insecure port 10251 applies to older
# releases); the token must be authorized for /debug/pprof.
curl -sk -H "Authorization: Bearer $(kubectl create token default)" \
  "https://localhost:10259/debug/pprof/profile?seconds=30" -o cpu.prof
# Analyze the profile
go tool pprof cpu.prof
```
## Prevention
Preventing scheduler performance degradation starts with cluster design. If the node count approaches several thousand, consider splitting the workload across multiple smaller clusters, since scheduling cost grows with both node and pod counts.
Ensure that Pod templates have accurate requests and limits. Accurate resource requests help the scheduler filter out nodes quickly, reducing the scoring overhead.
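A minimal example of explicit requests and limits on a container spec (the values are illustrative):

```yaml
resources:
  requests:
    cpu: "250m"      # what the scheduler uses to filter nodes
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```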
Regularly update Kubernetes to the latest stable version. The community frequently introduces optimizations to the scheduler’s internal algorithms and cache management.
Finally, implement horizontal pod autoscaling (HPA) carefully. Rapid bursts of pod creation can overwhelm the scheduler queue if the underlying infrastructure is not responsive.