::Reviews.Archive
Loop Unrolling with Duff’s Device

Loop unrolling, or unwinding, is simply reducing the number of overhead instructions that the CPU has to execute in a loop, thus improving the cache hit rate and the loop’s run time.

As a simple demonstration, the following example code copies 100 values from the memory pointed by from, to device, that points to some hardware device we want to send data to.

for (int i = 0; i < 100; ++i)
{
    *device = *from++;
}

Note: We assume device points to a hardware device, so its pointer isn’t incremented in the example.

Without looking at the assembley code , I boldly assume almost any compiler would require at least one condition check, one index increment and one jump instruction to execute each loop cycle in the example code above. We can optimize the code dramatically by simply executing several copy instructions in every loop cycle:

for (int i = 0; i < 20; i++)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

The improvement is easily noticeable - only fifth of the condition checks, fifth of the jump instructions and fifth of the index increments would be executed. This optimization method is called loop unrolling or loop unwinding.

What happens when the number of elements to copy varies? If loop unrolling is to be used as an optimization, it should be used very carefully. In the last example we counted on the fact that the amount of loop cycles can be divided by 5, but what if the total number of elements doesn’t divide by 5? A naive implementation would probably split the logic into two separate parts:

// Handle copy as much as we can with unrolling
int loopCycles = (count + 4) / 5;
for (int i = 0; i < loopCycles; ++i)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

// Take care of the remainder memory to copy
for (int j=0; j < count % 5; ++j)
{
    *device = *from++;
}

The implementation above works, but is seems awkward to have two loops for just copying one buffer - by means of code size, and by means of sheer lack of style.

Duff’s Device

A special C syntax hack can spare us the filth of those two consecutive loops in the naive code above. Tom Duff (link below) was the first to discover it in 1983 while working in Lucas Films, and proudly named it “Duff’s Device”. Duff’s device implements loop unrolling by interlacing the structures of a switch and a loop. It is perhaps the most dramatic use of case label fall-through ever seen in C:

int n = (count + 7) / 8;      /* count > 0 assumed */

switch (count % 8)
{
    case 0:   do {    *device = *from++;
    case 7:           *device = *from++;
    case 6:           *device = *from++;
    case 5:           *device = *from++;
    case 4:           *device = *from++;
    case 3:           *device = *from++;
    case 2:           *device = *from++;
    case 1:           *device = *from++;
              } while (--n > 0);
}

At first the code might seem weird, but it is perfectly valid ANSI C. To clear things out, here’s a brief explanation of the code:

When we enter the switch, we first copy count’s remainder with 8. Note that there is no break instruction in the switch, so we will ‘land’ on count % 8 and fall through the rest of the switch. Execution will reach the do’s while condition, and jump back to the case 0: instruction. From this point onward, we are simply running a normal loop unrolling calculated in n, and the switch is ignored.

For a simple memory copy you can use a fast assembly instruction like rep movsb. However, in this review the subject is copying memory to a hardware device. Besides hardware specific instructions, Duff’s device is still considered to be the fastest code to copy memory to a device for over 22 years.

References

Tom Duff’s original discovery can be found at http://www.lysator.liu.se/c/duffs-device.html

Posted by Tomer Margolin | | No comments
::Reviews.Archive
Journey into CPUID

Some programs rely on specific CPU instructions, with features like MMX, 3DNow!, Hyper Threading, etc. Some of them even rely on the specific CPU version. How would they know for sure that the CPU has what it takes to operate? The documentation for the Win32 API on this matter is scarce, if any. For example, GetSystemInfo returns no more than the CPU vendor, and of course we want to know more.

When Intel released CPU’s with MMX instruction sets, they had to add a special instruction for programmers to know whether this is an MMX machine or not - CPUID. Today, CPUID has become a standard instruction. You can use it on any Intel CPU starting from Intel 486, on any AMD CPUs, and even on some of Cyrix’s CPUs.
The CPUID instruction

This is far from being a complete guide to the CPUID instruction. However, a few pointers will be given as an introduction.

General Usage

CPUID handles EAX as a parameter for the required information type, and returns the data in the EAX, EBX, ECX, EDX registers. Always call CPUID with EAX first set to 0, and get the highest EAX parameter you can pass to CPUID in EAX, and the vendor string (Yes, string) in EBX, ECX, EDX with total of 12 characters. According to the highest EAX, you will know whether to proceed to the next CPUID call with a different EAX value. For example, calling CPUID with EAX = 0 on an AMD machine would get the string “AuthenticAMD” on EBX, ECX and EDX (4 chars each). If you wanted to call CPUID with EAX = 1, you would make sure you got EAX >= 1 on the first call. As mentioned, beware that not all CPUs support this instruction. As any other unsupported CPU instruction, this will raise a structured exception which can easily be caught, as shown in the following MS specific code:

__try
{
  __asm
  {
    xor eax, eax
    cpuid
  }
}
__except (EXCEPTION_EXECUTE_HANDLER)
{
  // The cpu does not support CPUID instruction
  NotifyTheUserThatHisCpuIsProbablyNotWhatHePayedFor();
}

References

A complete reference for the Intel’s retrievable features list can be found in Intel’s CPU reference for Pentium II at ftp://download.intel.com/design/PentiumII/manuals/24319102.PDF starting at page 3-116.

A good guide for using the CPUID instruction can be found at
http://www.paradicesoftware.com/specs/cpuid/

Thoroughly detecting Hyper Threading on intel CPUs:
http://www.intel.com/cd/ids/developer/asmo-na/eng/199701.htm?page=3

Irrelevant anecdote

In 1999, Intel added a serial number feature which can be retrieved by CPUID on Pentium III CPUs. The existence of a retrievable unique serial number for every intel computer world wide has raised the public interest as an invasion of privacy. You can read about it in here:
http://www.cdt.org/privacy/issues/pentium3/990226intelcomplaint.shtml
http://www.cdt.org/privacy/issues/pentium3/990226intelcomplaintsupp.shtml

and about the resolution in here:
http://www.cyber-rights.org/reports/intel-rep.htm

Posted by Tomer Margolin | | No comments
::Sidebar
::Links
  • xkcd
  •  

    Powered by WordPress