Monday, May 4, 2020

Transparent Huge Pages: When being more efficient profiles slower

I was recently tasked with figuring out why our in-house Yocto image was performing noticeably slower than the vendor image on the perf memcpy benchmark, running on an AArch64 SoC evaluation board.

The vendor image produced:

$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       5.870508 GB/sec

Whereas our in-house image was quite a bit slower:

# perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.786695 GB/sec

My first inclination was that there was something different about the way our in-house image was compiled compared to the vendor image (e.g. different CFLAGS, etc.).

This discrepancy is quite weird for a number of reasons, the main one being that perf is using memcpy() to do the copy. After digging around in the glibc source code, I discovered that the memcpy() implementation is written in optimized assembly, so it is very unlikely that its implementation or optimization is the cause of the performance difference. However, I wanted to eliminate that as a possibility, so I mounted the vendor SoC SD card image and used chroot to run their version of perf from their root filesystem. Using chroot is very useful here because it means any shared libraries will be pulled from their image, not ours. Unfortunately, this didn't yield any better results:

$ chroot /run/media/sda2/ /usr/bin/perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.804884 GB/sec

This told me that it wasn't a problem with the way the code was compiled, so next I moved on to the kernel. Replacing our kernel with the vendor kernel did not improve performance, which was not surprising since our kernel was derived heavily from the vendor kernel.

Next, I moved on to trying the bootloader. In some AArch64 SoCs, the SPL configures the memory timings before u-boot even gets a chance to run, so our theory was that there was something wrong in our SPL causing it to incorrectly configure the RAM. However, after copying the vendor SPL and u-boot and convincing them to boot our image, I got the same slow results.

At this point, we were booting the vendor SPL and u-boot, the vendor kernel, and I could use chroot to run the vendor perf with the vendor glibc, but it was still slow. This seemed to indicate that something in the vendor root filesystem's init process was causing the difference in performance. As a test, I switched back to using our SPL, u-boot, and kernel, but made them boot the vendor root filesystem. Finally, this yielded the result we were looking for:

$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       5.875889 GB/sec

From here, I dug through the init process of the vendor image, and found this gem hidden in their /etc/rc.local file:

# Set THP as madvise mode
if [ -e /sys/kernel/mm/transparent_hugepage/enabled ]
then
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
fi

Running this command in our in-house Yocto image suddenly made our perf benchmark match the vendor image. Success!

But why? What are transparent huge pages, and why does changing this setting have such a significant impact on memcpy performance?

Transparent Huge Pages are a performance optimization where the kernel transparently allocates a huge (2MB) page in place of 512 4KB pages when an application requests memory. This is designed to reduce the size of the page tables that the kernel has to track for a process, as well as to increase the efficiency of TLB lookups, since a single TLB entry maps a larger virtual-to-physical address translation (e.g. 2MB instead of 4KB). Importantly, if transparent huge pages are enabled, the default behavior for the kernel (and hence what our AArch64 kernel does by default) is to always use a huge page where possible. This means that when perf does a 1GB malloc() in our benchmark, the kernel will try to use as many 2MB huge pages as possible, in this case up to 512 of them. We can test this out by reworking the test to compare the speed difference between copying 1GB one time and copying 1MB 1024 times:

$ echo always > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.783825 GB/sec
$ perf bench mem memcpy -s 1MB -l 1024
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

       8.419351 GB/sec

The second number is artificially high, either due to time measurement error, or because a 1MB buffer is small enough to fit in the CPU cache, meaning it's not actually measuring the RAM speed. The following table shows the perf benchmark results for each power-of-two size from 1MB to 1GB, both with transparent huge pages enabled and disabled:

Block size   # of loops   Speed with huge pages (GB/sec)   Speed without huge pages (GB/sec)
1MB          1024         8.746687                         8.713989
2MB          512          7.214226                         7.234791
4MB          256          3.734827                         6.506520
8MB          128          3.207688                         6.201166
16MB         64           2.968178                         6.035185
32MB         32           2.867392                         5.945515
64MB         16           2.842613                         5.904721
128MB        8            2.827423                         5.883253
256MB        4            2.811563                         5.876304
512MB        2            2.767668                         5.872991
1GB          1            2.771941                         5.875165

Note that 4MB is the first size that is guaranteed to contain at least one possible huge page. A 2MB allocation could only be backed by a huge page if it also happened to be aligned to a 2MB memory address, which is unlikely. Correspondingly, as soon as the allocation size hits 4MB the copy speed drops significantly.
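
To make the alignment requirement concrete, here is a short sketch (my own illustration, not part of perf; huge_pages_available() is a hypothetical helper) that counts how many whole, 2MB-aligned huge-page-sized chunks fit inside buffers of various sizes returned by malloc():

/* Sketch: only the 2MB-aligned, 2MB-sized chunks that fit entirely inside a
 * buffer can be backed by transparent huge pages, which is why 4MB is the
 * first size guaranteed to contain at least one regardless of where the
 * allocation lands. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define HPAGE (2UL << 20)   /* 2MB huge page size */

/* Number of whole, 2MB-aligned huge pages that fit inside [p, p + len) */
static size_t huge_pages_available(void *p, size_t len)
{
    uintptr_t start = ((uintptr_t)p + HPAGE - 1) & ~(HPAGE - 1); /* round up   */
    uintptr_t end   = ((uintptr_t)p + len) & ~(HPAGE - 1);       /* round down */

    return end > start ? (end - start) / HPAGE : 0;
}

int main(void)
{
    for (size_t mb = 1; mb <= 8; mb *= 2) {
        size_t len = mb << 20;
        void *p = malloc(len);

        if (!p)
            return 1;
        printf("%zuMB buffer at %p: %zu possible huge page(s)\n",
               mb, p, huge_pages_available(p, len));
        free(p);
    }
    return 0;
}

On a typical run, a 1MB or 2MB buffer reports zero possible huge pages, while a 4MB buffer always contains at least one, which lines up with the jump in the table above.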

To narrow down more specifically what is going on, I used the perf stat command to profile a run of the memcpy benchmark both with transparent huge pages enabled and with it disabled. The command I used was:

$ perf stat -e $METRICS perf bench mem memcpy -s 32MB -l 1024

In order to get a more accurate analysis of the memory speed, the memcpy benchmark does an initial memcpy() from the source to the destination buffer to force the kernel to populate the destination buffer pages in memory before performing the copy loops (crucially, the source buffer is untouched, which means it is initialized to zero by the kernel, but more on that in a bit). This initial memcpy() will cause many page faults before the benchmark begins, which is why I chose to run a smaller memcpy more times when profiling. This makes the initial fills of the buffers (which perf stat ends up measuring) a less significant part of the final measurements than they would be when profiling a single pass over a large 1GB buffer. I chose a 32MB buffer as the sweet spot: it is still large enough that 2MB huge pages cover a significant fraction of the buffer, but small enough to minimize the initial page faults.
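
To make that structure concrete, here is a simplified sketch of the shape of the benchmark as I understand it (my own code, not the actual perf source; the sizes match the profiling run above and the bandwidth calculation is only approximate):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LEN   (32UL << 20)   /* 32MB buffer, matching the profiling runs */
#define LOOPS 1024

int main(void)
{
    char *src = malloc(LEN);   /* never written to: stays backed by the kernel's zero page(s) */
    char *dst = malloc(LEN);
    struct timespec start, end;

    if (!src || !dst)
        return 1;

    /* Warm-up copy: faults in the destination pages before timing starts.
     * Note that this only reads the source, so no real pages are allocated
     * for it. */
    memcpy(dst, src, LEN);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < LOOPS; i++)
        memcpy(dst, src, LEN);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.6f GB/sec\n", (double)LEN * LOOPS / secs / 1e9);

    free(src);
    free(dst);
    return 0;
}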

A number of metrics were profiled using this method, which are shown in the following table:

Metric               With Huge Pages     Without Huge Pages    % Change
Copy Speed (GB/s)    2.905628            5.960294              +105%
Elapsed time (s)     11.0365             5.4210                -51%
CPU cycles           17,538,311,364      8,634,255,108         -51%
Instructions         5,406,625,559       5,427,592,614         +0.3%
CPI                  3.24                1.59                  -51%
Cache misses         17,232,385          773,783               -96%
TLB misses           1,525,677           17,201,195            +1027%
Context Switches     1,120               550                   -51%
Bus Cycles           8,870,763,558       4,317,758,724         -51%

The data clearly shows that when transparent huge pages are enabled, the benchmark encounters more than 20 times as many cache misses as when they are disabled, which is very likely the cause of the performance degradation. This was not at all what I expected, since huge pages are designed to reduce the number of TLB misses (which you can see from the data they do), and really shouldn't have anything to do with the cache. After much consternation and research, I came across an LWN article that made everything click: the reason for the slowdown is the huge zero page that is being mapped in for the source buffer.

When a user space process allocates a new page, the kernel zero-initializes it in order to prevent leaking data from another process. However, as an optimization, the Linux kernel doesn't actually allocate a new page and memset() it to zero when an application asks for one. Instead, there is a special "zero page" that the kernel tracks, and when a process first reads from a newly allocated page the kernel simply maps the zero page into the process page table at the correct virtual address. The mapping is marked copy-on-write, so as long as an application only reads from the page it will continue to efficiently get zeros, but as soon as it writes, a freshly zeroed page is allocated and replaces the zero page in the page table.
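
This behaviour is easy to demonstrate. The following sketch is my own (the rss_kb() helper reading /proc/self/statm is just a convenience, not anything from perf or the kernel). If the zero page is doing its job, the resident set should barely grow after the reads, but jump by roughly the buffer size after the writes:

#define _GNU_SOURCE          /* for MAP_ANONYMOUS on some toolchains */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define LEN (256UL << 20)    /* 256MB anonymous buffer */

/* Current resident set size in kB, read from /proc/self/statm */
static long rss_kb(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);

    return resident < 0 ? -1 : resident * (sysconf(_SC_PAGESIZE) / 1024);
}

int main(void)
{
    char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    printf("after mmap:   %ld kB resident\n", rss_kb());

    volatile char sum = 0;
    for (size_t i = 0; i < LEN; i += 4096)
        sum += buf[i];                /* reads only: every page maps the shared zero page */
    printf("after reads:  %ld kB resident\n", rss_kb());

    memset(buf, 1, LEN);              /* writes trigger copy-on-write: real pages are allocated */
    printf("after writes: %ld kB resident\n", rss_kb());

    munmap(buf, LEN);
    return 0;
}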

After careful reading of the perf source code, I realized that it never initializes the source buffer, meaning that the source buffer is effectively the zero page repeated over and over. When transparent huge pages are disabled, this is the 4KB zero page repeated through the buffer. Since the cache on most(?) modern processors uses physical addresses, this means that a copy from the source buffer is reading from the same 4KB chunk of physical memory over and over again. This 4KB easily fits in the L1 cache of the processor, and thus is fast.

However, when pages are coalesced into a huge page transparently, things are different. There is a zero page for huge pages too, but it is necessarily the size of a huge page (i.e. 2MB) so that it can be used in place of an actual huge page in the page tables. This means that the source buffer in this case is the huge zero page repeated. Unlike the 4KB zero page, 2MB is much too large to fit in the L1 cache of our SoC, and thus there are a lot of cache misses when reading from the source buffer. Even though the benchmark is reading the same 2MB of physical memory over and over again, it now has to read the entire source from main memory as well as write the entire destination to main memory, meaning that the performance is just about exactly halved.

So, what does this mean? I have a few observations:

  1. perf bench mem memcpy is probably not the best tool for determining actual main memory speed. If that is your goal, perf bench mem memset is going to serve you much better, since it doesn't involve any source buffers, and thus is immune to the problems I found with the memcpy benchmark.
  2. I wonder what the actual goal of the memcpy benchmark is. If the goal is to get an accurate measure of how quickly memory can be copied from main memory to main memory, it is currently failing. In the "best" case it will repeatedly copy the 4KB zero page from cache to main memory, only giving you the speed at which main memory can be written. This isn't very useful, since the memset benchmark can tell you the same thing without being made unpredictable by transparent huge pages. It is possible that the memcpy benchmark isn't actually intended to measure main memory speed, but rather the speed of various copy implementations (especially given the multiple implementations for x86). However, this also doesn't quite make sense: if you really wanted to benchmark the implementations themselves, you would want them to read and write the L1 cache, otherwise the transfer time to main memory will likely dominate the execution time and the comparison will not be very useful.
  3. It may be worth considering whether "always" is a good default for transparent huge pages. In this specific case transparent huge pages cause a significant reduction in performance. Granted, this is a very synthetic case, as I can't think of many applications that rely on the performance of copying large amounts of zero pages. That said, under the "madvise" mode used by the vendor image, an application that genuinely benefits from huge pages can still opt in explicitly (see the sketch after this list).
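
As a closing aside on that last point, here is a minimal sketch (my own, not taken from the vendor image or perf) of what "madvise" mode asks of applications: a memory region is only considered for transparent huge pages if the program explicitly requests them with madvise(MADV_HUGEPAGE):

#define _GNU_SOURCE          /* for MAP_ANONYMOUS / MADV_HUGEPAGE on some toolchains */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (1UL << 30)   /* 1GB, matching the benchmark above */

int main(void)
{
    /* Anonymous mapping, equivalent to what a large malloc() does internally */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* In "madvise" mode, only regions flagged like this are considered for
     * 2MB huge pages; in "always" mode this call is unnecessary. */
    if (madvise(buf, BUF_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 0xa5, BUF_SIZE);   /* write to the buffer so pages are actually faulted in */

    munmap(buf, BUF_SIZE);
    return 0;
}

With that mode, workloads that know they benefit from huge pages can still get them, while everything else (including this benchmark) keeps ordinary 4KB pages.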
