Sunday, October 25, 2020

Pi Board Automated Testing - Part 2 Power Control

One of the early challenges I encountered with my Pi board testing was how to power up to 6 of the test boards safely and efficiently. I decided that I effectively had two options:
  1. Set up a shared 5 volt power supply that each device can be connected to with its appropriate power connector (Micro USB, USB-C, or barrel plug)
  2. Set up 120 V wall outlets that each device can plug its appropriate power converter into
Initially I was leaning toward a shared 5 volt power supply, but I soured on this for a few reasons. The first reason was that I acquired a RockPi4 board, which recommends a power supply that supports USB quick charging (meaning it can negotiate up to a higher voltage to get more power). The second was that I had trouble sourcing an acceptable power supply, switching, and distribution mechanism that could handle the power requirements I had; namely, I wanted to be able to deliver at least 3A@5V to each board (although not to all of them at the same time). In the end, I decided to use 8 household outlets and control each from an 8 channel USB relay. Each board then plugs into an outlet with its normal power converter, and I am able to turn them on and off independently using the USB relay. This has a number of distinct advantages:
  1. Everything is off the shelf, cheap, and available at the hardware store. I was even able to easily add a "master switch" to turn off the entire device for less than $1 by using a simple light switch.
  2. I don't really have to worry too much about the power ratings. The wires and relay I'm using are sufficiently rated such that I'm not worried about transiently drawing 15A through them, and I was careful to select higher rated components for the main power feed and terminal blocks which have to handle powering all devices.
  3. The wiring is simple. Household wiring is designed to be connected with screw terminals, no soldering required.

Here is a picture of the bottom of the device, where most of the interesting things are happening. If you are an electrician or at all aware of wiring codes, I apologize for what you are about to see:

The black cable going toward the bottom of the image is a "dead-man's plug"; the other end terminates in a standard 3 prong plug and is how power comes into the device. The hot (black) wire from this cable goes up through the light switch (acting as the "master" switch), and then up to the upper terminal block. The upper terminal block has a (red) bus bar that connects all the terminals together, and from there it splits out 8 ways and goes to the 8 relays on the other side of the board through two holes. You will note that I did a terrible job color coding my wiring, as the four wires on the left of the terminal block are obviously hot, but I used white wire instead of black. This was because I ran out of black wire and was too lazy to get more. I don't recommend doing this, as it caused a bit of a problem for me later. From the relays, the wires go back through the holes and then to the hot side lugs of the wall outlets. I turned all my outlets into split outlets by removing the tab that would normally attach the top and bottom hot lugs so that they can be operated independently.

The neutral wire (white) and ground wire (bare copper) from the power feed cable both go to the lower terminal block. It's hard to see from the picture, but there are actually 2 (black) bus bars attached to the lower terminal block, each one connecting alternating terminals together. This makes the lower terminal block alternate between neutral and ground terminals (note, I wouldn't recommend doing this; use two smaller bus bars). From the lower terminal block, the neutral and ground wires split out to each outlet. Only four neutral wires are needed because I left the tab that connects the top and bottom lugs on the neutral side of each outlet in place.

The cable that runs off to the right wraps around to the top of the board to power a 12V power converter, which is used to power the USB hub and Ethernet switch. I "cheated" and tapped its hot wire off of the screw lug on the light switch, which is unused because the upper terminal block is connected using the friction tap. The neutral and ground are attached to the lower terminal block.

The entire bottom of the device is covered with two pieces of polycarbonate (Lexan); despite my atrocities against the household wiring code, I do have a healthy fear of 120V wiring, and I want to make sure there is no way an errant finger or bit of metal debris can find its way to the 120V wiring.

Here is the view from the top:

Note the wires running up from the two holes to either side of the relay. Remember that the wires come up from the terminal blocks through the holes, through the relays, and back down to the backs of the outlets. The 12V power brick is right below the frame, on the left. Also note that I wrote a number on each outlet, indicating which relay controls that outlet. I wasn't really trying to have an ordering that makes sense, and succeeded; it doesn't matter too much because I refer to each relay by name in software. Also note that the relay is covered with a sheet of polycarbonate to prevent any inadvertent touching.
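
For what it's worth, the naming layer on the software side is nothing fancy. Here is a minimal sketch of the idea; the board names and the relay-ctl command are hypothetical stand-ins for whatever boards you have and whatever tool actually drives your particular USB relay:

#!/bin/sh
# Power cycle a board by name. The name-to-relay-channel mapping lives here,
# so nothing else needs to know which outlet a board is plugged into.
board="$1"
case "$board" in
    rockpi4) channel=1 ;;
    board2)  channel=2 ;;   # hypothetical second board
    *) echo "unknown board: $board" >&2; exit 1 ;;
esac
# relay-ctl is a placeholder for the relay's real control utility
relay-ctl "$channel" off
sleep 5
relay-ctl "$channel" on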

When I'm not actively using one of the outlets, I use a simple plastic child outlet cover to keep things from falling in the receptacle holes. I originally intended to cover the outlets with wall plates to protect the lugs from the top, but I poorly planned the spacing of the outlets, and they don't fit. I might still cut a piece of polycarbonate to fit over the outlets.

Overall, I'm quite pleased with the results, and it has worked really well. I highly recommend using the terminal blocks to distribute the electricity; they are easy to use and pretty safe, since this is what they are designed to do. If I were going to make another, I would probably use 2 separate smaller terminal blocks for the neutral and ground. If necessary, I think you could also use a smaller terminal block for the hot wire, since you might be able to double up and put one wire on each side of the screw lug to get two per terminal, in which case a 5 terminal block would cover the 8 outputs (this way, you could use the same size terminal block everywhere, which is nice since I've only found them for sale in lots of 8 to 10).

I would also do a much better job of color coding my wire. As noted, I ran out of black wire, and switched to using white wire for the hot lines. This actually caused me to damage the first relay I was using. The white wires confused me and I mis-wired one of the outlets such that one of the relay switches would short the hot and neutral wires when activated. This subsequently caused quite a spark and you can clearly see where the copper trace on the relay board evaporated:

In the end I wasn't too upset, because I didn't like this relay for reasons I might cover in another post. Were I to make another one, I would make sure to only use black wire for hot, white wire for neutral, and probably red between the relay and the outlets.

Monday, May 4, 2020

Transparent Huge Pages: When being more efficient profiles slower

I was recently tasked with trying to figure out why our in house Yocto image was performing noticeably slower on an AArch64 SoC evaluation board compared to the vendor image on the perf memcpy benchmark.

The vendor image produced:

$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       5.870508 GB/sec

Whereas our in house image was quite a bit slower:

# perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.786695 GB/sec

My first inclination was that there was something different about the way our in-house image was compiled compared to the vendor image (e.g. different CFLAGS, etc.).

This discrepancy is quite weird for a number of reasons, the main one being that perf is using memcpy() to do the copy. After digging around in the glibc source code, I discovered that the memcpy() implementation is written in hand-optimized assembly, so it is very unlikely that compiler flags or optimization settings are the cause of the performance difference. However, I wanted to eliminate that as a possibility, so I mounted the vendor SoC SD card image and used chroot to run their version of perf from their root filesystem. Using chroot is very useful here because it means any shared libraries will be pulled from their image, not ours. Unfortunately, this didn't yield any better results:

$ chroot /run/media/sda2/ /usr/bin/perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.804884 GB/sec

This told me that it wasn't a problem with the way the code was compiled, so next I moved onto the kernel. Replacing our kernel with the vendor kernel did not improve performance, which was not surprising since our kernel was derived heavily from the vendor kernel.

Next, I moved on to trying the bootloader. In some AArch64 SoCs, the SPL configures the memory timings before u-boot even gets a chance to run, so our theory was that there was something wrong in our SPL causing it to incorrectly configure the RAM. However, after copying the vendor SPL and u-boot and convincing them to boot our image, I got the same slow results.

At this point, we were booting the vendor SPL and u-boot, the vendor kernel, and I could use chroot to run the vendor perf with the vendor glibc, but it was still slow. This seemed to indicate that there was something different about the way their root filesystem was initializing that was causing the difference in performance. As a test, I switched back to using our SPL, u-boot, and kernel, but made them boot the vendor root filesystem. Finally, this yielded the result we were looking for:

$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       5.875889 GB/sec

From here, I dug through the init process of the vendor image, and found this gem hidden in their /etc/rc.local file:

# Set THP as madvise mode
if [ -e /sys/kernel/mm/transparent_hugepage/enabled ]
then
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
fi

Running this command in our in-house Yocto image suddenly made our perf benchmark match the vendor image. Success!
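
If you want to double check which mode is active, the sysfs file can simply be read back; the kernel puts brackets around the current setting, so after running the snippet above it looks something like this:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never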

But why? What are transparent huge pages, and why does changing it significantly impact the memcpy performance?

Transparent Huge Pages are intended to be a performance optimization the kernel can perform where it will transparently allocate a huge (2MB) page in place of 512 4K pages when an application requests memory. This is designed to reduce the size of the page tables that the kernel has to track for a process, as well as increase the efficiency of the TLB lookups, since a single TLB entry maps a larger virtual to physical address translation (e.g. 2MB instead of 4K). Importantly, if transparent huge pages are enabled, the default behavior for the kernel (and hence what AArch64 does by default) is to always use a huge page where possible. This means that when perf does a 1GB malloc() in our benchmark, the kernel will try to use as many 2MB huge pages as possible, in this case up to 512 of them. We can test this out by reworking the test to compare the speed difference between copying 1GB one time, and copying 1MB 1024 times:
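
As a quick sanity check (not part of the original investigation), you can watch the kernel's anonymous huge page counter while the benchmark runs; with the mode set to "always", it should jump by roughly the size of the written buffer. The value shown below is only illustrative:

$ grep AnonHugePages /proc/meminfo
AnonHugePages:   1048576 kB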

$ echo always > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memcpy -s 1GB
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1GB bytes ...

       2.783825 GB/sec
$ perf bench mem memcpy -s 1MB -l 1024
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

       8.419351 GB/sec

The second number is artificially high, due either to time measurement error, or because a 1MB buffer is small enough to fit in the CPU cache, meaning it's not actually measuring the RAM speed. The following table shows the perf benchmark results for each power of two size from 1MB to 1GB, with transparent huge pages enabled and disabled:

Block size   # of loops   Speed with huge pages (GB/sec)   Speed without huge pages (GB/sec)
1MB          1024         8.746687                         8.713989
2MB          512          7.214226                         7.234791
4MB          256          3.734827                         6.506520
8MB          128          3.207688                         6.201166
16MB         64           2.968178                         6.035185
32MB         32           2.867392                         5.945515
64MB         16           2.842613                         5.904721
128MB        8            2.827423                         5.883253
256MB        4            2.811563                         5.876304
512MB        2            2.767668                         5.872991
1GB          1            2.771941                         5.875165

Note that 4MB is the first size that is guaranteed to contain at least one full huge page. A 2MB allocation could only be backed by a huge page if it also happened to be aligned to a 2MB memory address, which is unlikely. Correspondingly, as soon as the allocation size hits 4MB the copy speed drops significantly.

To narrow down more specifically what is going on, I used the perf stat command to profile a run of the memcpy benchmark both with transparent huge pages enabled and with it disabled. The command I used was:

$ perf stat -e $METRICS perf bench mem memcpy -s 32MB -l 1024

In order to get a more accurate analysis of the memory speed, the memcpy benchmark does an initial memcpy() from the source to the destination buffer to force the kernel to populate the destination buffer pages in memory before performing the copy loops (crucially, the source buffer is untouched, which means it is initialized to zero by the kernel, but more on that in a bit). This initial memcpy() will cause many page faults before the benchmark begins, which is why I chose to run a smaller memcpy more times when profiling. This makes the initial fills of the buffers (which perf stat ends up measuring) a less significant part of the final measurements compared to profiling a single pass of a large 1GB buffer. I chose a 32MB buffer as the sweet spot: it is still large enough to have a significant number of 2MB huge pages compared to the total buffer size, but small enough to minimize the initial page faults.
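
The exact event list I put in $METRICS isn't reproduced here; for illustration, a set of generic perf events that roughly corresponds to the rows in the table below would look something like this:

$ METRICS=cycles,instructions,cache-misses,dTLB-load-misses,context-switches,bus-cycles
$ perf stat -e $METRICS perf bench mem memcpy -s 32MB -l 1024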

A number of metrics were profiled using this method, which are shown in the following table:

Metric               With Huge Pages    Without Huge Pages    % Change
Copy Speed (GB/s)    2.905628           5.960294              105%
Elapsed time (s)     11.0365            5.4210                -51%
CPU cycles           17,538,311,364     8,634,255,108         -51%
Instructions         5,406,625,559      5,427,592,614         0.3%
CPI                  3.24               1.59                  -51%
Cache misses         17,232,385         773,783               -96%
TLB misses           1,525,677          17,201,195            1027%
Context Switches     1,120              550                   -51%
Bus Cycles           8,870,763,558      4,317,758,724         -51%

The data clearly shows that when transparent huge pages are enabled, over twenty times as many cache misses are encountered as compared to when they are disabled, which is very likely the cause of the performance degradation. This was not at all what I expected, since huge pages are designed to reduce the number of TLB misses (which, as the data shows, they do), and really shouldn't have anything to do with the cache. After much consternation and research, I came across an LWN article that made everything click: the reason for the slowdown is the huge zero page that is being mapped in for the source buffer.

When a user space process allocates a new page, the kernel zero-initializes it in order to prevent leaking data from another process. However, as an optimization the Linux kernel doesn't actually allocate a new page and memset() it to zero when an application asks for one. Instead, there is a special "zero page" that the kernel tracks and when a process is allocated a new page the kernel simply adds the zero page to the process page table at the correct virtual address. The page is marked copy on write, so as long as an application only reads from this page, it will continue to efficiently get zeros, but as soon as it does a write a new page is copied from the zero page and replaces it in the page table.

After careful reading of the perf source code, I realized that it never initializes the source buffer, meaning that the source buffer is effectively a repeated copy of the zero page. When transparent huge pages are disabled, this is the 4KB zero page repeated throughout the buffer. Since the cache on most(?) modern processors uses physical addresses, this means that a copy from the source buffer is reading from the same 4KB chunk of physical memory over and over again. This 4KB easily fits in the L1 cache of the processor, and thus is fast. However, when pages are transparently coalesced into a huge page, things are different. There is a zero page for huge pages, but it is necessarily the size of a huge page (i.e. 2MB) so that it can be used in place of an actual huge page in the page tables. This means that the source buffer in this case is the huge zero page repeated. Unlike the 4KB zero page, 2MB is much too large to fit in the L1 cache of our SoC, and thus there are a lot of cache misses reading from the source buffer. Even though it is reading the same physical memory over and over again, it is having to read the entire source from main memory and write the entire destination to main memory, meaning that the performance is just about exactly halved.
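
A simple way to convince yourself of this explanation is to note that a write-only benchmark has no source buffer, and therefore no zero page to get tripped up on. Assuming the reasoning above is right, something like the following should report roughly the same speed regardless of the transparent huge page setting:

$ echo always > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memset -s 1GB
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memset -s 1GB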

So, what does this mean? I have a few observations:

  1. perf bench mem memcpy is probably not the best tool for determining actual main memory speed. If that is your goal, perf bench mem memset is going to serve you much better, since it doesn't involve a source buffer and thus is immune to the problems I found with the memcpy benchmark.
  2. I wonder what the actual goal of the memcpy benchmark is. If the goal is to try to get an accurate speed for how quickly memory can be copied from main memory to main memory, it is currently failing. In the "best" case it will repeatedly copy the 4KB zero page from cache to main memory, only giving you the speed at which main memory can be written. This isn't very useful, since the memset benchmark can tell you the same thing without being unpredictable in the face of transparent huge pages. It is possible that the memcpy benchmark isn't actually intended to measure main memory speed, but rather the speed of various copy implementations (especially given the multiple implementations for x86), but this also doesn't quite make sense, because if you really wanted to benchmark the implementations themselves you would want them to read and write the L1 cache; otherwise the transfer time to main memory will likely dominate the execution time and the comparison will not be very useful.
  3. It may be worth considering whether "always" is a good default for transparent huge pages. In this specific case transparent huge pages cause a significant reduction in performance. Granted, this is a very synthetic case, as I can't think of many applications that rely on the performance of copying large amounts of zero pages.

Sunday, February 16, 2020

Pi Board Automated Testing - Part 1 Introduction

I have a weakness for Raspberry Pi clones. I like how you can pack such a large amount of variety and power into a small and consistent form factor. As part of my day job, I often have to evaluate new SoCs or SoC IP cores (e.g. GPUs), and I almost always gravitate toward finding an SBC in the Raspberry Pi form factor for these evaluations if at all possible. This also involves using OpenEmbedded and the Yocto Project to run Linux on the SBCs, since I'm an active community member of both projects and also use them extensively at my day job.

After acquiring my 5th such SBC, managing them started to become ungainly. Each board requires at least 3 cables for the tests I was running (power, serial terminal, and ethernet, plus occasionally mouse, keyboard, and HDMI), and shuffling them around was beginning to be a pain, particularly when I wanted to try the same build across different devices at the same time for testing, evaluation, and comparison purposes. Further complicating this setup was that each device needed a unique power supply, and the serial terminal is usually exposed as bare pins on the board (e.g. on the 40-pin Pi header), meaning that constantly connecting and disconnecting it was annoying and error prone.

So, I set about trying to fix this. My initial goal was to create some sort of fixture that I could mount multiple SBCs on that would only require 3 connections for my core use cases: 1 ethernet cable that I could attach to my PC to network all of them (via a gigabit switch), 1 USB cable to access all of the serial consoles (via a USB hub and USB serial converters), and 1 power cable to power all the boards, ethernet switch, and USB hub.

I had a few basic requirements for this project:
  1. No complex fabrication. I tried not to use materials or construction methods that require anything more than basic tools; I don't own a 3D printer, or a CNC machine, so if I can't hammer, saw, drill, or glue it, it's not getting built. In fact, most of the structural components were scraps I had lying in my garage.
  2. Use as many off the shelf components as possible, and try to keep them cheap. The boards I'm testing max out around $100 each and most are less than that, so it's hard to justify spending a ton of (my own) money to test them.
  3. The fixture should support at least 6 boards at a time. I feel like this is a reasonable lower limit for a test fixture.
  4. The fixture should be transportable. I often have to move it around, so it shouldn't be too heavy, ungainly, or require a lot of setup.
  5. Reasonable access to the I/O of each board; namely the HDMI, USB, SD card, and Ethernet ports. These are the peripherals I use the most in my evaluations so I want to be able to access them easily on each board. Access to the 40-pin headers is nice, but secondary.
After putting some thought into this, I realized that it might be possible to use this setup to do automated board testing with a few minor tweaks. The first extra thing required was a way to remotely power cycle the devices, which I designed into the fixture. The second thing required was a way to get code running on the remote devices; there are a number of ways to accomplish this, but I really wanted some mechanism whereby the test controller could write a complete image to the device under test's SD card while it was powered off, then power on the device and have it boot from that card. There are 2 reasons for this:
  1. Often one of the hardest parts of bringing up support for these boards is the bootloader configuration. Being able to flash an SD card fresh each time ensures that this is adequately tested.
  2. This also ensures that the test can be run regardless of the state of the device before test, which should help keep the tests stable.
Fortunately, the internet delivered and I discovered SDWire, which fits the bill perfectly. The only real requirement it added to my fixture was that each board now requires an extra USB connection to the USB hub so that the SDWire adapter can be accessed. I was unable to find anywhere these could be reasonably purchased pre-made, so I will have to build them from scratch, and they are one of the few exceptions to my simple construction techniques.
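
For completeness, the flashing workflow I'm aiming for with SDWire looks roughly like the sketch below. It assumes the stock sd-mux-ctrl tool that is normally used with SDWire; the device serial, image name, and block device are made up for illustration:

# Switch the SD card over to the test controller and write a fresh image
sd-mux-ctrl --device-serial=sd-wire_01 --ts
dd if=core-image-minimal.wic of=/dev/sdX bs=4M conv=fsync
# Hand the card back to the device under test, then power it on via the relay
sd-mux-ctrl --device-serial=sd-wire_01 --dut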

As of this writing, the main fixture is completed, but I do not yet have the SDWire adapter built for a complete automated testing solution. Here is a picture of the completed fixture:


I will go into more detail about the different components of the fixture, how I built it, why I made some of the choices I did, and an overview of the 5 SBCs I'm currently using in subsequent blog posts.