The Misunderstood Load Average in Linux Hosts

Jun 7, 2020

16 mins read

Have you ever wondered what the values in the load average: section mean when someone runs the uptime command on a Linux host? I have wondered about it many times in my career. This should be a simple question for a seasoned Linux administrator or developer, right? Well, not entirely: the load average is probably the most misunderstood value on a Linux host and is often associated with the wrong concepts. In this post I will explain the load average, share a bit of my experience with it, and show how it helps me spot issues in infrastructure every day.

Then, what is it?

To explain what it is, I will start by talking about what it is not.

Is it CPU?

Well, that's a big no. The most common mistake is to equate the load average of a system with CPU usage. High CPU usage is one of the things that can drive this value up, but it is not the only one.
I could spend a long time on what load average is not, but I wanted to emphasize CPU because it is the most common misconception.

Simple explanation

How the load average is calculated can be complex, as there are many moving parts and many situations in a Linux system that can generate high load. I will not go into those details here; instead I will share my perspective and experience over the years from a practical point of view. If you want to understand more deeply how this value is calculated, there is a great article named Linux Load Averages: Solving the Mystery. While that post is very educational, it can be tedious to read, so I will take a simpler approach and talk about my personal experience.
I have my own definition for the term, which can also be found in many articles on the internet: the load average of a Linux system is the average number of processes waiting for CPU time. As you can see, my simple definition does involve the CPU. But wait, didn't I just say load average is not CPU? Yes, I did. Even so, a high load average might not be related to CPU problems at all; it can be caused by many other things, and I think that is why people get confused by the term. In the end a process always has to wait for its turn on the CPU, but there are also situations where a process is kept waiting while the CPU sits idle. Confusing, isn't it? Well, yes. I will try to explain these situations from a practical standpoint, which will make it clear why I said it is not only about CPU.

What do the values mean?

First, let's look at the output of the uptime command; the load average can also be found in the top command, at the top right of its first line.

$ uptime
 23:07:56 up 21 days, 23:59,  1 user,  load average: 0.51, 0.25, 0.22

As a quick explanation of load average: 0.51, 0.25, 0.22: the first value is the average over the last minute, the second is the average over the last 5 minutes, and the third is the average over the last 15 minutes. In this example it means that over the last minute there were on average 0.51 processes waiting for CPU time, over the last 5 minutes an average of 0.25 processes, and over the last 15 minutes an average of 0.22 processes.
Still, this explanation doesn't say much on its own, and this is one of the most controversial topics around these values. First we need to ask ourselves: what is it that we actually want to know? These values can catch situations where performance is starting to become an issue, and defining when they mean our system has performance problems is the key. There is no magic formula or answer that fits all systems, and any senior engineer would respond: it depends.
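These three values (plus some extra information) can also be read directly from /proc/loadavg, which is where tools like uptime and top get them from; the output below is only illustrative:

$ cat /proc/loadavg
0.51 0.25 0.22 1/234 5678

The fourth field shows the number of currently runnable scheduling entities over the total number of scheduling entities in the system, and the last field is the PID of the most recently created process.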

How to calculate thresholds

This is the tricky part: how do we know which value indicates that a system is having performance problems? In the past I have heard that if the load average is above 1.00 then the system is in trouble. Well, as I said before, it depends. Many people state this as the rule for load average, and it was true back when we all had single-core machines. Today even our mobile phones have multiple cores, so that old rule no longer holds.
Going back to my definition, we can say we might be having issues when a system's load average is above the number of cores. Since the load average is the number of processes waiting for CPU time, if we have more processes waiting for the CPU than cores in the system we might be in trouble, as our processes might be waiting too long for resources. The impact really depends on what is running on the server, but as a general rule we can use load average / number of cores to calculate a "load percentage". Of course this percentage can go above 100%, and in my experience once a system goes above 100% it should be investigated to see if there are performance issues. As I mentioned, this depends on the system and on the performance requirements of the applications running on it, but in general a system above 100% has some kind of bottleneck that needs investigation.
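As a quick sketch of that calculation on the command line, the 1-minute value can be divided by the core count reported by nproc. The numbers below are made up for illustration: a hypothetical 8-core machine with a 1-minute load of 9.60, which works out to 120%.

$ nproc
8
$ cat /proc/loadavg
9.60 7.43 5.12 3/612 24512
$ awk -v cores="$(nproc)" '{ printf "load percentage: %.0f%%\n", $1 / cores * 100 }' /proc/loadavg
load percentage: 120%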

Monitoring Load Average

Now that I have covered some concepts that are important for monitoring systems properly, let's look at how to put them to use.
Load average is a good metric for catching when a system might be having performance issues and needs further investigation. But there is a catch: a high load average won't tell you exactly what is wrong with your system. It hints that there might be a bottleneck somewhere, but it doesn't tell you which one, so in many situations another, complementary system metric will be needed to find where the bottleneck is.
I find it extremely useful to add load average to the generic set of metrics I monitor on all systems, as it is general enough to catch multiple issues or situations without having to track a lot of other complex metrics. In my experience, if I have to set a generic threshold that fits most systems, I alert after the load percentage goes above 120%-130%; I have found this to be generic enough to catch issues while not generating too many false positives.

Alerting

When alerting on load average we need to be careful with the thresholds: if we set the value too low it will page the on-call engineer too often, while if we set it too high we might miss an issue with the application. As mentioned before, I usually set a generic value for most systems, but some systems need tuning, and that depends on how tolerant the applications running on them are to performance problems. That being said, I have run systems whose applications had a low tolerance to latency, and since a high load average can mean higher latency, on such systems I used a threshold of 100% or even lower. On the other hand, I have also tuned alerts for systems that could tolerate more load, for example big data and analytics systems, or large databases running big queries that do not need low latency. In those cases I have set values of 200% or more, but it depends on the case and on the latency tolerance of the application.
One thing to keep in mind when defining a new alert or tuning an existing one is to have enough metrics data to look at the history of load average on a system, so you can see the baseline and the maximums. It is always better to have that history, both to understand how the load average behaves and to see how it correlates with other metrics on the system.
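As a concrete illustration, if the system metrics are collected with Prometheus and its node exporter (tools I mention again below), a rule along these lines is one way to express the 120% idea. This is only a sketch to adapt per system; it assumes the standard node_load1 and node_cpu_seconds_total metric names exposed by node_exporter.

groups:
  - name: load-average
    rules:
      - alert: HighLoadAverage
        expr: node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Load average above 120% of the core count on {{ $labels.instance }}"

The for: 10m clause keeps short spikes from paging anyone, and raising or lowering the 1.2 factor is how I would tune the threshold for a particular system.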

How do I find the bottlenecks?

Whenever the load average exceeds the defined thresholds, it means we may have found a bottleneck or resource starvation of some sort. In this section I will mention the most common causes I have come across and what to look for in each case.
The most common causes of high load average are:

  • High CPU usage
  • Heavy read or write to disks
  • Exhausting the memory
  • High peripheral activity

There are many more causes, and combinations of them, but those tend to be less frequent. This is not an exhaustive list, just the ones I have had to deal with most often.
I will mention some command-line tools to catch these issues, but if possible I recommend using a graphical tool to see the history of load average and the other system metrics; things are much clearer when looking at a graph than at values in a console. For this purpose I generally use Prometheus to store metrics in a time series database and Grafana to visualize the history of system metrics. I also use another tool called Netdata to watch metrics in real time.

High CPU usage

Of course, one of the most common causes of a high load average is high CPU usage. This is one of the easier cases to spot, as there are tools that will show you the processes using a lot of CPU right away. top is the most common tool to watch processes and CPU usage, among other metrics. When we spot processes using a lot of CPU we need to find their names and PIDs and investigate why each of them is using so much CPU; that task is out of the scope of this post.
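Besides watching top interactively, a quick one-off way to list the heaviest CPU consumers along with their PIDs is something like the following (flags from the procps version of ps; output omitted):

$ ps -eo pid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -n 10
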
The first thing high CPU usage will impact is latency: if it is a database server, queries will take longer to be served; if it is a server running an HTTP application, we will see the latency of requests rise and in some cases requests will time out. It all depends on what the system is running.
Solving high CPU usage usually means we need more cores in the system, or we need to work with the application owner to optimize the application to use less CPU, or to scale the application horizontally.

Heavy read or write to disks

A very common cause of high load is applications doing heavy reads, writes, or both to disk. This increases the I/O (input/output) happening in the system, and the system will often be found waiting for that I/O to complete; when this happens the load average spikes, as processes spend more time waiting for resources. This one is not as easy to spot as CPU, because I/O values are something we are not used to looking at. On the command line we can use iostat or iotop to see I/O metrics. Another good place to look is the iowait value of the CPU, which tells us the percentage of time the CPU was blocked waiting for I/O to complete; since I/O operations are costly and much slower than the CPU, we often find the CPU waiting for these operations to finish. This iowait value can be found in the top command as wa in the CPU metrics. When we see an increase in iowait, it is most likely what is driving up the load of the system. To check which processes are doing the I/O we can use the sar command; I won't go into the details of sar here, as explaining it would require a blog post of its own.
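As a starting point, the commands below give a quick view of device utilization, iowait and which processes are generating the I/O. iostat and pidstat ship with the sysstat package (the same one that provides sar), and iotop is a separate tool that needs root:

$ iostat -x 1 5      # extended per-device stats, 5 one-second samples; watch %util and await
$ pidstat -d 1 5     # per-process read/write rates over the same interval
$ iotop -o           # interactive view of only the processes currently doing I/O
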
Solving high I/O is also tricky. If we can't optimize the applications, then we might need to increase resources. On bare-metal servers we need to see what can be optimized on the disks to gain more speed; we could switch to faster disks, but that is not always an option. Another good option is to add more disks and configure them in a RAID array in RAID0 or RAID10 mode. These modes use a technique called striping, where reads and writes are balanced across the disks in parallel; combined with a multi-core system, this increases throughput considerably. If we are running on cloud VMs we can make other optimizations: depending on the cloud vendor, we can increase I/O throughput by choosing different disk types or different VM types. Most major cloud providers take a similar approach to IOPS (input/output operations per second), where the size and type of disk you choose determines the IOPS it provides, and VM types also have different maximum IOPS capacities, so I recommend reading the cloud provider's documentation to find the best option. The RAID approach mentioned before can also be used on cloud VMs to increase throughput.

Exhausting memory

Memory pressure on a system can also cause a high load average. When applications request more memory than the system has available to allocate, one of the symptoms is a rising load average: if the system has swap, it will start swapping memory pages out to disk, driving the load up because swapping to disk is much slower than RAM. So it is not the memory itself that drives the load, but the high I/O generated by swapping to disk; memory issues turn into the situation described in the previous section, generating high I/O.
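To confirm that swapping is what is driving the load, free and vmstat are usually enough; in the vmstat output the si and so columns show pages swapped in and out per second, and sustained non-zero values there tend to line up with the iowait spikes described above:

$ free -m        # overall memory and swap usage
$ vmstat 1 5     # watch the si/so (swap in/out) and wa (iowait) columns
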
Exhausting the memory has more serious effects than high CPU or high I/O: running out of memory and swap can cause applications to crash, to be killed by the out-of-memory killer, or to misbehave because they cannot allocate memory. Running out of memory usually ends up leaving the system unstable or even inaccessible, and it might need a reboot to come back.

High peripheral activity

This cause is not as common as the others, and it is also harder to troubleshoot or catch. Sometimes hardware is the source of high load: some peripherals generate a lot of interrupts (IRQs), which can keep the CPU busy until they are handled. This is hard to catch because the tools to watch it are less commonly used; one way to look at interrupts is to run watch -n1 "cat /proc/interrupts". The node exporter for Prometheus also exports IRQ and context switch metrics, as does Netdata, and I recommend using those tools to watch these metrics graphically.
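On machines with many cores the raw /proc/interrupts output gets wide and hard to read, so a rough way to rank interrupt sources by their total count is a small awk one-liner like the one below; it is only a sketch, since the column layout of /proc/interrupts varies a bit between kernels and devices:

$ awk 'NR>1 { total=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) total+=$i; print total, $1, $NF }' /proc/interrupts | sort -rn | head

Running this a few seconds apart and comparing the totals shows which sources are actually growing fast.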
To see what is happening we need to watch the load average metric together with the IRQ metrics and correlate them; once we confirm that the load is being driven by IRQs, we can check which IRQs are spiking at that time and investigate how to fix it. I haven't come across this issue many times, and there are some ways to ease the pressure, for example CPU pinning on multi-core systems so that only one or two specific CPUs handle a particular piece of hardware, but solving this kind of issue usually requires fresh research each time it comes up.

Conclusion

In this post I shared some of my experience with load average on Linux and tried to explain what to look for when a system has high load. I find this metric very useful: it is a generic way to gauge how a system is performing, and it is easy to configure and set thresholds for. It allows you to catch most performance issues in a system with a single metric, which is helpful when managing a large fleet of systems where monitoring and alerts need to be deployed and configured at scale.
The downside of load average, as mentioned before, is that it tells you something is wrong but not what is wrong, so more investigation is needed to find which resource is driving the load up; still, it helps standardize monitoring. Load average is also not a bulletproof metric: there are cases where a system has performance issues that never show up in the load average, especially with latency-sensitive applications. In those cases we should watch other metrics on the applications or the system to catch the errors. For example, the performance of some network applications cannot be captured by load average: I once saw problems with applications using UDP traffic where the load average was very low but the application started seeing dropped packets, so the fix required tuning elsewhere and the load average never caught it.
Bottom line: load average is useful, but you will need to monitor other metrics and the applications themselves to fully catch performance problems.

Sharing is caring!