Tencent Gains Up to 85% Performance Boost for MySQL Using Intel® oneAPI Tools

Using the Intel® oneAPI DPC++/C++ Compiler, Tencent achieves significant performance improvements for MySQL

Get the Latest on All Things CODE

author-image

作者

           

"

TencentDB for MySQL is designed to provide customers with secure, reliable, high-performance, and easy-to-maintain enterprise-level cloud database services. And Intel® oneAPI DPC++/C++ Compiler can bring Tencent another avenue to get better MySQL performance.

Ben Cheng, TencentDB

About TencentDB for MySQL

Distributed data storage serves a critical role across a myriad of industries and use cases—internet, finance, and e-commerce. TencentDB for MySQL | Tencent Cloud is a reliable, scalable database hosting service through Tencent, a popular cloud service provider. The application is based on the open-source database MySQL and helps users setup, operate, and expand relational databases in the cloud.

Increasing Performance of MySQL

TencentDB for MySQL can be used in different applications and scenarios for a variety of purposes. However, no matter which scenarios it is applied to, a common challenge is that with increasing of storage and data processing requirements, performance requirements for MySQL were also increasing. Therefore, optimizing for higher MySQL performance is critical to meet ongoing business and customer needs.

Tencent collaborated with Intel to build a high-performance MySQL on Intel® Xeon® Processors and optimized by the Intel® oneAPI DPC++/C++ Compiler, an advanced compiler in the Intel® oneAPI Base Toolkit.

Deep Dive into Compiler Optimization

The Intel oneAPI DPC++/C++ compiler is used for parallel programming applications and delivers productivity and performance across CPUs and accelerators. The following are details on how using the compiler combining link-time optimization (LTO) and profile-guided (PGO) methods helped Tencent build a high-performance MySQL.

Link Time Optimization (LTO)

In LTO implemented by GCC or LLVM, the compiler can dump its intermediate representation (GIMPLE bytecode or LLVM bit code) to disk to compose a single executable, making it possible for all the different compilation units to be optimized as a single module. This extends the scope of intraprocedural optimization (IPO) to cover the entire program (everything visible when linking). With LTO, the compiler can apply various forms of IPO to the entire program, allowing for deeper analysis, more optimization, and ultimately better program performance. LTO can use digest-based global program analysis optimization, cross-module optimization, and other technologies to better guide optimizations such as function in lining and useless code deletion, among others. LTO can also speed up secondary compilation through Link cache and reduce code size to gain performance advantage.

Profile-guided Optimization

 PGO improves application performance by shrinking code size, reducing branch mispredictions, and reorganizing code layout to reduce instruction-cache problems. PGO provides information to the compiler about areas of an application that are most frequently executed. By knowing these areas, the compiler can be more selective and specific in optimizing the application. 

PGO consists of three phases or steps. 

  1. Instrument the program. The compiler creates and links an instrumented program from your source code and special code from the compiler. 
  2. Run the instrumented executable. Each time you execute the instrumented code, the instrumented program generates a dynamic information file, which is used in the final compilation. 
  3. Final compilation. When you compile a second time, the dynamic information files are merged into a summary file. Using the summary of the profile information in this file, the compiler attempts to optimize the execution of the most heavily traveled paths in the program. 

Finally, combining IPO and PGO typically produces better results than using IPO alone. PGO produces dynamic profiling information that can usually provide better optimization opportunities than the static profiling information used in IPO. When you use PGO with IPO, the compiler uses the following guidelines for applying heuristics, where the default heuristic: 

  • Focuses on the most frequently executed call sites, based on the profile information gathered for the program.  
  • Always inlines very small functions that meet the minimum inline criteria. 

Performance Evaluation

Performance of olt_read_write.lua in Sysbench with different threads was tested and compared with the performances of MySQL built with Intel® oneAPI DPC++/C++ Compiler vs. the GCC compiler. At the same time, we optimized MySQL with the Intel compiler's LTO and PGO methods. Two key indicators are typically used to measure the performance of a database, transactions per second (TPS) and queries per second (QPS).For TPS, a transaction may consist of one or more queries. Therefore, QPS is usually several times higher than TPS. But if comparing the performance improvements between different versions of MySQL, the percentage increases of TPS and QPS are the same. In this paper, the QPS performance indicator was used to evaluate the performance of MySQL.

MySQL performance is affected by many factors, one of which is its configuration. Different configurations may have a different impact on performance. Tencent used a configuration file, and built MySQL, both with LTO and PGO, and evaluated the performance with Sysbench. The QPS performance of oltp_read_write.lua is shown in Figure 1.

Figure 1. QPS comparison of the optimized MySQL and GCC; Note: Data source from Tencent Internal Evaluation.

Results

The data shown in the above figure is the QPS data of MySQL. Among them, GCC-10.2.0 refers to MySQL compiled with GCC without LTO and PGO enabled. ICX-LTO and ICX-LTO&PGO refer to MySQL compiled with LTO and LTO&PGO respectively using the Intel compiler. From Figure 3, we can see that in the Sysbench test using different threads, the QPS was improved respectively: 

         1) ICX-LTO increased up to 51%. 

         2) ICX-LTO&PGO increased up to 85%. 

By using LTO and PGO optimization techniques, Tencent’s MySQL can achieve a significant performance improvement. Although the PGO optimization technology requires two compilations, the usage method is not complicated and can be used quickly

Performance Analysis Using Intel® VTune™ Profiler

Intel® VTune™ Profiler (VTune) is an advanced performance analysis tool to optimize application performance, system performance, and system configuration for HPC, cloud, IoT, media, storage, and more on Intel CPUs, GPUs and FPGAs. It is part of the Intel oneAPI Base Toolkit, and has many important analysis types, such as hot spots analysis, and microarchitecture exploration.

To understand how LTO and PGO can help MySQL optimization, we used VTune to collect the performance of MySQL based on the default configuration. Hot spot analysis can help users understand an application flow and identify sections of code that get a lot of execution time. Hot spot analysis does not by default capture the function call stacks, so to analyze the stacks, we select the “Collect stacks” option explicitly

Intel® VTune Profiler
Hotspots view for PTO

 Figure 2. Hot spots for the MySQL without optimization shown in Intel VTune Profiler.

Figure 2. Hot spots for the MySQL without optimization shown in Intel VTune Profiler. 

For example, Figure 2 displays the hot spots of unoptimized MySQL, the left part shows the hot spot functions, and the right part shows the call stacks of function log_wait_for_flush (log_t const&, unsigned long, bool*)::$_5::operator() which is a lambda function in function log_wait_for_flush, the source code is shown in Figure 3, line 861. Then the lambda function is called by function os_event_wait_for, as shown in line 890.

Figure 3. Source code of function log_wait_for_flush

Figure 3. Source code of function log_wait_for_flush.

However, in the collected performance results for optimized MySQL, as shown in Figure 4, we can clearly find that the os_event_wait_for function and log_wait_for_flush function have disappeared in the call stacks of log_wait_for_flush(log_t const&, unsigned long, bool*)::$_5::operator(). And this means that the functions os_event_wait_for and log_wait_for_flush are inlined after LTO and PGO optimization. This example shows that using PGO and LTO can help the Intel compiler make better decisions about function inlining.

Figure 4. Hotspots for the MySQL with LTO and PGO optimization

Figure 4. Hotspots for the MySQL with LTO and PGO optimization.

Conclusion

Maximizing performance is essential for TencentDB MySQL applications. There are many approaches to improve MySQL performance, such as using higher performance hardware, high-speed storage, optimizing MySQL's source code, etc. This article optimizes the performance from the perspective of the compiler, adopting the Intel® oneAPI DPC++/C++ Compiler, and combining the LTO and PGO methods to build a high-performance MySQL.

See Related Content

Articles

  • WeBank* Cryptography Case Study
    Download
  • Tutorial: Program Data Parallel C++ (SYCL)
    Read
  • SIMD vectorization in LLVM and GCC for Intel® CPUs and GPUs
    Read
  • Take Data Science Applications to the Next Level with OmniSci
    Read

On-Demand Webinars

  • Get Started with the Latest Intel LLVM-based Compilers
    Watch
  • DPC++ Part 2: Programming Best Practices
    Watch
  • Prepare for the Next Generation of Memory
    Watch

Podcasts

  • DPC++ is the New, Open C++ Programming Model
    Listen

Get the Software

Intel® oneAPI Base Toolkit
Get started with this core set of tools and libraries for developing high-performance, data-centric applications across diverse architectures.
 
Get It Now

See All Tools