# Cluster Autoscaler Alerts

This document describes the alerts generated by the Cluster Autoscaler Operator,
their possible causes, and suggested resolutions.


## ClusterAutoscalerUnschedulablePods
The cluster autoscaler is unable to scale up, and as a result one or more pods
have remained unschedulable for at least 20 minutes.

### Query
```
# for: 20m
cluster_autoscaler_unschedulable_pods_count{service="cluster-autoscaler-default"} > 0
```

### Possible Causes
* The autoscaler is unable to create new machines due to replica limits on the
  MachineAutoscalers.
* The autoscaler is unable to create new machines due to maximum node, CPU, or
  RAM limits on the ClusterAutoscaler.
* Kubernetes is waiting for new nodes to become ready before scheduling pods to
  them.

### Resolution
In many cases this alert is normal and expected, depending on the configuration
of the autoscaler. You should check the replica limits in the MachineAutoscaler
resources to ensure they are large enough. You should also check the maximum
total nodes, CPU, and RAM limits in the ClusterAutoscaler resource to ensure
they are valid.
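
The check described above amounts to comparing each current value with its
configured maximum. The following is a minimal sketch of that comparison, not
part of the operator; the `Limit` type, the `exhausted_limits` helper, and all
values are hypothetical and should be replaced with the numbers read from your
own MachineAutoscaler and ClusterAutoscaler resources.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """A configured maximum and the current value measured against it."""
    name: str
    current: int
    maximum: int


def exhausted_limits(limits):
    """Return the names of limits that leave no room for further scale up."""
    return [limit.name for limit in limits if limit.current >= limit.maximum]


# Illustrative values only (assumptions, not defaults).
limits = [
    Limit("MachineAutoscaler worker-a replicas", current=6, maximum=6),
    Limit("ClusterAutoscaler total nodes", current=9, maximum=24),
    Limit("ClusterAutoscaler total cores", current=72, maximum=128),
]
blocked = exhausted_limits(limits)
print(blocked)
```

Any limit reported here is a candidate for being raised if the unschedulable
pods are expected to run.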

In rare cases it is possible that the cloud provider is taking longer than 20
minutes to create new nodes. This should be investigated with the cloud provider
and their specific process for node creation.

## ClusterAutoscalerNotSafeToScale
The cluster autoscaler has detected that the number of unready nodes is too high
and it is not safe to continue scaling operations. It makes this determination
by checking that the number of ready nodes is greater than the minimum ready count
(default of 3) and the ratio of unready to ready nodes is less than the maximum
unready node percentage (default of 45%). If either of those conditions is not
true, the cluster autoscaler enters an unsafe-to-scale state until the
conditions change.
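
As a rough illustration, the two conditions can be written as a small function.
This is a sketch of the rule as worded in this document only; it is not the
autoscaler's source code, and the function and parameter names are hypothetical.

```python
def is_safe_to_scale(ready, unready, min_ready=3, max_unready_percent=45.0):
    """Apply the two conditions described above; both must hold."""
    # Condition 1: more ready nodes than the minimum ready count.
    # The max() also guards the division below against ready == 0.
    if ready <= max(min_ready, 0):
        return False
    # Condition 2: unready-to-ready ratio, as a percentage, below the maximum.
    unready_percent = 100.0 * unready / ready
    return unready_percent < max_unready_percent


print(is_safe_to_scale(ready=10, unready=2))
print(is_safe_to_scale(ready=10, unready=5))
print(is_safe_to_scale(ready=3, unready=0))
```

With the defaults, 10 ready and 2 unready nodes pass both checks, 10 ready and
5 unready nodes fail the percentage check, and 3 ready nodes fail the minimum
ready count check.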

### Query
```
# for: 15m
cluster_autoscaler_cluster_safe_to_autoscale{service="cluster-autoscaler-default"} != 1
```

### Possible Causes
* The cluster has too many nodes in an unready state.
* A large number of new nodes have been created and are taking longer than 15 minutes to join the
  cluster.

### Resolution
This alert indicates that nodes are not reaching a ready state. You
should investigate the logs associated with your cloud provider controllers and
the Machine API resources to discover the root cause. For more information on
why nodes, or machines, might not become ready, see the
[Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/master/FAQ.md).

## ClusterAutoscalerUnableToScaleCPULimitReached
The total number of cores in the cluster has reached or exceeded the maximum set on the
cluster autoscaler. This is calculated by summing the CPU capacity of all nodes
in the cluster and comparing that number against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
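
The calculation can be sketched as follows. This is an illustration of the
comparison only, assuming per-node core counts are already known; the function
name and the sample values are hypothetical.

```python
def cpu_limit_reached(node_cores, max_cores=320000):
    """Sum per-node CPU capacity and compare it with the maximum cores value."""
    total_cores = sum(node_cores)
    # The alert query uses >=, so reaching the limit exactly also counts.
    return total_cores >= max_cores


# Three hypothetical nodes with 16, 16, and 32 cores against a limit of 64.
print(cpu_limit_reached([16, 16, 32], max_cores=64))
```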

### Query
```
# for: 15m
cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum CPU limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum core count in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the number of cores in the cluster falls below the maximum.

## ClusterAutoscalerUnableToScaleMemoryLimitReached
The total amount of RAM in the cluster, in bytes, has reached or exceeded the maximum set on
the cluster autoscaler. This is calculated by summing the memory capacity of all nodes
in the cluster and comparing that number against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
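
Because node capacity is reported in bytes while the default above is stated in
gigabytes, the comparison involves a unit conversion. The sketch below assumes
binary gigabytes (1024^3 bytes); that conversion factor, the function name, and
the sample values are assumptions made for illustration.

```python
GIB = 1024 ** 3  # assumption: one gigabyte is treated as 2^30 bytes


def memory_limit_reached(node_memory_bytes, max_memory_gigabytes=6400000,
                         bytes_per_gigabyte=GIB):
    """Sum per-node memory capacity in bytes and compare it with the maximum."""
    total_bytes = sum(node_memory_bytes)
    max_bytes = max_memory_gigabytes * bytes_per_gigabyte
    # The alert query uses >=, so reaching the limit exactly also counts.
    return total_bytes >= max_bytes


# Three hypothetical nodes with 64 gigabytes each against a 192 gigabyte limit.
nodes = [64 * GIB] * 3
print(memory_limit_reached(nodes, max_memory_gigabytes=192))
```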

### Query
```
# for: 15m
cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum memory limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum memory bytes in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the total RAM in the cluster, in bytes, falls below the maximum.
