Loop unrolling, or loop unwinding, is a technique that reduces the number of overhead instructions (condition checks, index increments and jumps) the CPU has to execute per loop iteration, thus improving the loop's run time.

As a simple demonstration, the following example code copies 100 values from the memory pointed to by from to device, a pointer to some hardware device we want to send data to.

for (int i = 0; i < 100; ++i)
{
    *device = *from++;
}

Note: We assume device points to a hardware device, so its pointer isn't incremented in the example.
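To make the example compilable without real hardware, the device can be simulated with a volatile sink. A minimal sketch, assuming a hypothetical copy_to_device helper and a device_reg stand-in (our names, not part of the original):

```c
/* Hypothetical stand-in for a memory-mapped device register:
 * every write lands at the same address, so only the last value
 * written is observable afterwards. */
static volatile int device_reg;

/* Copy `count` values to the device, one write per loop cycle. */
static void copy_to_device(const int *from, int count)
{
    volatile int *device = &device_reg;
    for (int i = 0; i < count; ++i)
    {
        *device = *from++;   /* device itself is never incremented */
    }
}
```

After copy_to_device(buffer, 100), device_reg holds buffer[99], since every write targets the same register.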

Without looking at the assembly code, I boldly assume almost any compiler would emit at least one condition check, one index increment and one jump instruction for each loop cycle in the example code above. We can speed the code up considerably by simply executing several copy instructions in every loop cycle:

for (int i = 0; i < 20; ++i)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

The improvement is easily noticeable - only a fifth of the condition checks, a fifth of the jump instructions and a fifth of the index increments are executed. This optimization method is called loop unrolling or loop unwinding.

What happens when the number of elements to copy varies? If loop unrolling is to be used as an optimization, it should be used carefully. In the last example we counted on the fact that the number of elements divides evenly by 5, but what if it doesn't? A naive implementation would probably split the logic into two separate parts:

// Handle as much of the copy as we can with unrolling
int loopCycles = count / 5;
for (int i = 0; i < loopCycles; ++i)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

// Take care of the remaining elements to copy
for (int j = 0; j < count % 5; ++j)
{
    *device = *from++;
}
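The same volatile-sink simulation makes the two-loop version testable. In this sketch (our names, not from the original; unrolled_copy returns how many values were written so the chunk math can be checked), the first loop runs count / 5 full chunks and the second mops up the count % 5 leftovers:

```c
/* Hypothetical device register stand-in: all writes hit one address. */
static volatile int device_reg;

/* Unrolled copy in chunks of 5 plus a remainder loop.
 * Returns the number of values written, for verification. */
static int unrolled_copy(const int *from, int count)
{
    volatile int *device = &device_reg;
    const int *start = from;

    /* Full chunks of 5. */
    for (int i = 0; i < count / 5; ++i)
    {
        *device = *from++;
        *device = *from++;
        *device = *from++;
        *device = *from++;
        *device = *from++;
    }

    /* Remaining 0-4 values. */
    for (int j = 0; j < count % 5; ++j)
    {
        *device = *from++;
    }

    return (int)(from - start);
}
```

For count = 17 this performs 3 unrolled chunks plus 2 leftover copies - 17 writes in total, ending with the last buffer value in the register.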

The implementation above works, but it seems awkward to have two loops just to copy one buffer - in terms of code size, and in terms of sheer lack of style.

Duff's Device

A special C syntax hack can spare us the filth of those two consecutive loops in the naive code above. Tom Duff (link below) discovered it in 1983 while working at Lucasfilm, and proudly named it "Duff's Device". Duff's device implements loop unrolling by interlacing the structures of a switch and a loop. It is perhaps the most dramatic use of case label fall-through ever seen in C:

int n = (count + 7) / 8;      /* count > 0 assumed */
  
switch (count % 8)
{
    case 0:   do {    *device = *from++;
    case 7:           *device = *from++;
    case 6:           *device = *from++;
    case 5:           *device = *from++;
    case 4:           *device = *from++;
    case 3:           *device = *from++;
    case 2:           *device = *from++;
    case 1:           *device = *from++;
              } while (--n > 0);
}

At first the code might seem weird, but it is perfectly valid ANSI C. To clear things up, here's a brief explanation of the code:

When we enter the switch, we jump to the case matching count % 8. Since there are no break statements, execution falls through all the remaining cases, copying the first count % 8 values (or a full 8, when the remainder is 0). Execution then reaches the do's while condition and, as long as --n > 0, jumps back to the top of the do block at case 0:. From that point onward we are simply running a normal unrolled loop for the remaining full chunks of 8, and the switch labels are ignored.
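To make this concrete: for count = 13 we get n = (13 + 7) / 8 = 2 and 13 % 8 = 5, so execution lands on case 5, falls through five copies, then loops back once more for a full pass of eight - 13 copies in total. A compilable sketch under the same volatile-sink assumption (our names, not from the original):

```c
/* Hypothetical device register stand-in: all writes hit one address. */
static volatile int device_reg;

/* Duff's device; count > 0 is assumed.
 * Returns the number of values written, for verification. */
static int duff_copy(const int *from, int count)
{
    volatile int *device = &device_reg;
    const int *start = from;
    int n = (count + 7) / 8;

    switch (count % 8)
    {
        case 0: do { *device = *from++;
        case 7:      *device = *from++;
        case 6:      *device = *from++;
        case 5:      *device = *from++;
        case 4:      *device = *from++;
        case 3:      *device = *from++;
        case 2:      *device = *from++;
        case 1:      *device = *from++;
                } while (--n > 0);
    }

    return (int)(from - start);
}
```

Calling duff_copy(buffer, 13) performs exactly 13 writes and leaves buffer[12] in the register.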

For a simple memory-to-memory copy you could use memcpy or a fast string instruction like rep movsb. In this review, however, the subject is copying memory to a hardware device, where the destination address never changes, so those options don't apply. Outside of hardware-specific instructions, Duff's device has been regarded for over 22 years as one of the fastest ways to copy memory to a device.

References

Tom Duff's original discovery can be found at http://www.lysator.liu.se/c/duffs-device.html

Posted by Tomer Margolin