4/28/20

MySQL Performance Tuning Best Practices

by Jesper Wisborg Krogh

Whatever software you work with, chances are that you sooner or later encounter that the performance is not as good as you would like it to be. MySQL is no different. If you are not familiar with MySQL, then it is a general-purpose relational database management system (RDBMS) (with NoSQL support through the MySQL Document Store). The very nature that it is general-purpose and thus allow arbitrary complex queries (within the syntax rules and limitations) and that it is often used as a data store for business critical and customer facing software, meaning that performance tuning is both important and not straightforward.

This blog introduces MySQL performance tuning by going through some best practices. If some of the recommendations sound familiar from other types of performance tuning, then it is because they are – MySQL performance tuning shares a lot of things with other types of software. Given how broad a subject MySQL performance tuning is, I will just focus on the high-level best practices without going into specific recommendations. My top six best practices are:

Be wary of best practices
Monitor
Work methodically
Consider the full stack
Make small, incremental changes
Understand the change

Let’s go through them in more detail.

Be Wary of Best Practices

It may sound odd to start a best practices list by warning to be wary of best practices. However, that is very important, particularly the more specific the advice becomes. (And yes, that also applies to the best practices I give myself including in the newly released book MySQL 8 Query Performance Tuning.)

What I mean is that you always need to take your specific situation into account. As mentioned, MySQL is a general purpose database, so your usage will be different from how other people uses it, and so is the optimal configuration, environment, and query optimization strategies – or in other words “no two systems are the same”. Furthermore, MySQL evolves and a recommendation that was great in an earlier release may not be so great any longer.

Let me give an example. MySQL has configuration options to specify the size of various buffers; two of these are sort_buffer_size and join_buffer_size. A common best practice is to keep these no larger than 256 KiB except for specific queries that have been proved to benefit from larger values. However, in MySQL 8.0.12 and later, MySQL uses a different algorithm to allocate the sort buffer meaning it is not as problematic to increase it in general, and in MySQL 8.0.17 and later there is support for hash joins that uses the join buffer, so you may need to increase it to take full effect of the new feature. (That said, don’t go ahead and start increasing the buffers just yet as there are still downsides of too large buffers.)

This “what was once true, may no longer apply” concept is not only limited to features and algorithms, but also applies to the hardware. You will still find best practices originating in 2005 or earlier where a server with 8 GiB of memory and four CPUs was quite powerful, but now I have more in my laptop that I use to write this blog. Then later as hardware became more powerful, virtual machines became common, so now you may suddenly see virtual machines that are no more powerful than the 2005 bare metal servers. In short, you need to take these things into account when considering a best practice.

Monitor

Knowledge is the key to success in any kind of performance tuning. This applies at two levels: know the product you work with and have metrics about the work done. The latter is obtained through monitoring. While some metrics are well suited for ad hoc collection, like the time it takes to execute a specific query reported slow or the query plan for it, in general you need a dedicated monitoring solution. Some of my recommendations for monitoring are:

Choose a monitoring solution that also collects query information
Take the time to know your monitoring solution well
Work hard to remove false positive alerts
Configure the alerts to an appropriate severity and alerting mechanism

One of the most useful sources of information when executing a performance problem that is more general than “this query is slow” is a query analyzing feature where you can see which queries executed at the time of the issue. Two monitoring solutions that I know of that have such a feature are MySQL Enterprise Monitor (also known as MEM) which is part of the MySQL Enterprise Edition subscription and SolarWinds Database Performance Monitor (until recently known as VividCortex). Both are commercial products, but they are well worth it.

New Content Item

_{Example of the MySQL Enterprise Monitor Query Analyzer}

No matter how good your monitoring solution is, it is only worth as much as you know how to use it. Spent time before a crisis occur to familiarize yourself with the solution you have chosen, so you can immediately navigate to the data that can help you solve the issues you encounter, and ensure you have it configured to capture the information you need.

Another useful feature that is associated with monitoring is to have alerts sent to you either through email, text messages, to chat channels, or similar. A pet peeve of mine is that every time you get a false positive or that you get woken up at 2:30am by an alert that easily can wait until the morning, the higher risk that you will end up ignoring an important alert. So, it is important that you work hard to avoid false positives and ensure you receive alerts in an appropriate fashion, so you are not interrupted more than necessary.

Good monitoring is essential when you need to solve an issue as will be discussed next.

Work Methodically

When you encounter a performance problem and you have pressure from management and end users, it may feel tempting to follow your got reaction and make the first change that comes to mind in an effort to quickly solve the problem. However, too often that is counter productive and even when it appears to work, you do not have any way to be sure it really was the change that made the difference. Instead, you need to work methodically by going through a series of phases:

Verify the problem.
Determine the cause.
Determine the solution.
Implement the chosen solution.

Each phase consists of one or more steps. By documenting each phase, you can also use this as help in the future to quickly identify and solve similar issues. You can illustrate the phases as a circle with each step following the previous.

New Content Item

_{The performance tuning life cycle}

The phases are shown here as a circle as performance is not a binary state, so you will in general keep working to improve the performance with a short or long delay between cycles. A given problem may also require several cycles before it is fully solved. Let’s go through each of these phases in more detail.

Verify the Problem

Verifying what the issue is being as specific about the problem as possible and find proof that it is so. For example, an end user may state that “the query is slow”, but that does not really explain what the issue is. What query is it? Is it the application, network, MySQL, or something else that is slow (more in the next advice about considering the whole stack)? What does it mean that the query is slow?

It is your job to make sure you have the answer to these kind of questions as well as proof of the problem. The proof may for example be to collect a baseline in your monitoring solution that shows the issue or to execute the query manually and verify the execution time is as slow as claimed. This may seem like distrust, but it is not meant that way. Rather, systems are complex and the one reporting the problem will in general not have the knowledge required to know exactly what the underlying issue is. So, the verification is there to avoid that you end up on a wild goose chase. The baseline is also essential to prove whether the solution worked when you implement the solution.

A part of verifying the problem, you also need to determine what is “good enough”. As mentioned, performance is not a binary state, so if you do not define what “good enough” performance is, you will never be done.

Once you have verified the problem, you can move on to determining the cause.

Determine the Cause

The hardest part of investigating a performance issue is often to determine the cause. Make sure you are open-minded and consider the whole stack as even though the problem clearly shows up in MySQL, the cause may be in a different part of the stack. It also pays off being creative and not just consider the standard solutions. Once you believe you have determined the cause, remember to add justification that you have found the cause.

Determine the Solution

The next step is to determine a solution. Here again, you need to keep your options open and not just consider possible solutions in the specific area where the cause was found. Sometimes (an example will follow) you can find a solution in a different part of the stack or using a different feature or similar. As for the cause, note why you think each potential solution will work. Once done, choose the solution you think is the best to implement. Note that you may choose one solution as a stop gap measure and another for a longer-term solution.

Implement the Solution

The final phase is to implement the solution. You do this by creating an action plan and test the solution on your test and staging systems. Make sure you keep updating the action plan as you go through the test cycles. Optimally, the solution will be implemented in a manner that allow you to apply it automatically, so you are sure the same solution is applied to the production system as has been tested while developing the action plan.

An important part of implementing the solution is also to verify the effect of the change. Otherwise you have no idea if you really solved the problem. This is where you need the baseline from the verification as that allows you to compare how the system performs before and after the change.

I have mentioned the “whole stack” a few times. Let’s discuss that.

Consider the Whole Stack

When you use MySQL, you cannot just focus on MySQL itself to resolve the performance problems that you encounter. MySQL is part of a larger stack starting with the end user and going all the way down to the lowest levels of the hardware. A simplified stack can be seen in the following figure.

New Content Item

_{Simplified stack with MySQL sitting in the middle}

Notice that the arrows go in both directions to illustrate how the interaction goes both ways. In this example, the application sends a query over the network to MySQL that executes it. This execution involves using the operating system (such as the I/O cache and file system) which in turn uses the hardware in the host. The host can be either bare metal or a virtual machine. In case of a virtual machine the stack continues down through the hypervisor to the underlying hardware.

Once the host has performed its actions, the result is sent back up to the operating system and MySQL. This allows MySQL to complete the query and the result can be sent back over the network to the application. To illustrate how these layers in the stack interacts and may mean the cause and solutions are not where you would first expect them to be, it is worth considering two real world examples.

In the first case, the database administrator noticed that MySQL randomly stalled. This was in MySQL NDB Cluster which has separate data nodes where the data is stored. When a stall lasted more than 18 seconds, the data node would be shut down to prevent holding up the rest of the cluster. Initial testing with Linux commands like ^top, ^mpstat, and ^iostat showed that the stalls could last more than a minute. Already that meant that the focus moved outside MySQL as the stalls lasted well after the MySQL data node had shut down. Using the ^perf tool revealed that the root cause was that Linux was doing memory compaction. This was triggered by MySQL trying to send TCP traffic over the network, and as TCP requires a buffer consisting of contiguous memory (^kmalloc()) and the memory was extremely fragmented, a defragmentation was required. So why did that happen? As it turned out the server had much more memory than MySQL used, so Linux had used the rest of the memory for a huge I/O cache and this cache caused the memory fragmentation. The solution was to enable direct I/O in MySQL which greatly reduced the use of the operating system I/O cache and this reduced the memory fragmentation.

The second case is for MySQL Server (what is usually considered “MySQL”). An application was migrated from using Microsoft SQL Server to using MySQL. After the migration, it was reported that queries were very slow and that from time to time, MySQL would block the application from connecting. The cause was a bug in a framework used by the application that meant a ^WHERE clause was missing. This made one specific query scan huge amounts of data meaning it took so long that the application resubmitted it. Eventually up to 50 copies of the same query would execute in parallel on a relatively small virtual machine. As the first of the queries read data into memory, the subsequent queries would become faster and faster, so they would all complete around the same time. Now, 50 queries were all sending huge result sets back to the application overloading the network and making the application run out of memory and thus crash. The overload of the network in turn meant that network packets were dropped causing handshake errors which finally made MySQL blacklist the application host.

These examples are at the extreme end of issues, but they illustrate well how symptoms in one place can be caused by actions or bugs in a different part of the stack. And for the first example, how the solution may not be in the same part of the stack as caused the problem.

Make Small Incremental Changes

When you encounter a performance issue and you have several changes that you think will help improve the performance, it is tempting to make them all at the same time. While this can seem like a shortcut, in reality it may cause frustration and let you believe that the solutions do not work even when some of them do. The problem is that you may end up having two changes cancel out each other, so it looks like neither works. This is the reason, you should aim at making just one change at a time, so it is clear whether each change works.

Similarly, for each change, make small incremental changes. For example, with configuration options, there may likely be a sweet spot where the performance is optimal. If you make large changes, you may bypass the sweet spot and it looks like the change does not work. By making several small changes, you can search for the sweet spot.

Understand the Change

My final best practice is to ensure that you understand the change you are making. Why I raise this point is best illustrated by an example. MySQL has a log called the binary log which records all changes made, so you can apply them again either on a replica or during a point-in-time recovery. Assume you verify that writing the binary log is causing performance issues, and you search for solutions. One piece of advice may be to set the option ^sync_binlog to 0. You test this and sure enough the performance improves. Case closed? No, not so fast.

What happens when you set ^sync_binlog to 0? The option specifies how often the binary log must be flushed to disk. A value of 1 means for every transaction commit, a value of 100 for every one hundred commits, and a value of 0 means “never”.

Flushing writes to disk is relatively expensive which explains why disabling a forced flush improves the write performance. However, it has two side effects which you need to consider. First of all, a write is only persisted once it is flushed – until then it only exists in memory. So, by setting ^sync_binlog to 0, you risk losing writes if MySQL crashes. This also means that you will have to recreate all replicas if the replication source crashes, and you may not be able to recover all data in case of a point-in-time recovery. Are you prepared to pay that price? If not, the solution will not work for you.

A more subtle side effect of setting ^sync_binlog to 0 is that eventually all writes much end up on disk. A binary log file can at most be 1 GiB (plus the size of the last transaction), and a binary log file is always flushed to disk when it is rotated. With server’s having tens to hundreds of gigabytes of memory, usually with ^{sync_binlog = 0}, the whole file ends up being cached in-memory and a gigabyte of transaction events will be flushed at the same time. While the flush is ongoing, no commits can be made, so you may see stalls of several seconds each time the binary log is rotated. In other words, ^{sync_binlog = 0} will improve the throughput, but ^{sync_binlog = 1} gives the most consistent commit latency.

There is no shortcut to understanding these interactions. It requires gaining your own experience by working with MySQL and by studying existing sources of information. One source that can help you with MySQL performance tuning (specifically related to queries) is the new book MySQL 8 Query Performance Tuning written by yours sincerely. One of the aims I had writing the book is to help the reader understand how MySQL works as well as giving more specific tips to improving the performance of your queries. Happy reading.

About the Author

Jesper Wisborg Krogh has worked with MySQL databases since 2006 both as an SQL developer, a database administrator, and for more than eight years as part of the Oracle MySQL Support team. He has spoken at MySQL Connect and Oracle OpenWorld on several occasions, and addition to his books, he regularly blogs on MySQL topics and has authored around 800 documents in the Oracle Knowledge Base. He has contributed to the sys schema and four Oracle Certified Professional (OCP) exams for MySQL 5.6 to 8.0.

He earned a PhD in computational chemistry before changing to work with MySQL and other software development in 2006. Jesper lives in Sydney, Australia, and enjoys spending time outdoors walking, traveling, and reading. His areas of expertise include MySQL Cluster, MySQL Enterprise Backup, performance tuning, and the Performance and sys schemas.

This article was contributed by Jesper Wisborg Krogh, author of MySQL 8 Query Performance Tuning.