Any 3D engine draws it’s power and speed from the mathematical models and implementations within, and trust John Carmack of ID software for using really good hacks. As it turns out, a very interesting hack is used in Quake III to calculate an inverse square root.
ID software has recently released the source code of Quake III engine with a GPL license. In this article we’ll see Carmack work his black magic to calculate the square root of a floating point number blazingly fast.
A fast glance at the file game/code/q_math.c reveals many interesting performance hacks. The first thing that pops out is the use of the number 0×5f3759df in the function Q_rsqrt, which calculates the inverse square root of a float and why does this function actually work?
Observe the original function from q_math.c:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
#ifndef Q3_VM
#ifdef __linux__
assert( !isnan(y) ); // bk010122 - FPE?
#endif
#endif
return y;
}
Not only does it work, on some CPU’s Carmack’s Q_rsqrt runs up to 4 times faster than (float)(1.0/sqrt(x), eventhough sqrt() is usually implemented using the FSQRT assembley instruction!
In another file, code/common/cm_trace.c , a neater implementation of the same hack can be found. This time it is used for calculating the square root of a float - sqrt(x). Note that the only real difference is in the return value - instead of returning y, return number*y as the square root:
/*
================
SquareRootFloat
================
*/
float SquareRootFloat(float number) {
long i;
float x, y;
const float f = 1.5F;
x = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( f - ( x * y * y ) );
y = y * ( f - ( x * y * y ) );
return number * y;
}
The code above implements the well known Newton Approximation of roots [3]. As in most other iterative approximation calculations, The Newton approximation is supposed to be ran in iterations; each iteration enhances the accuracy until enough iterations have been made for reaching the desired accuracy.
The general idea behind Newton’s approximation is that whenever we have a guess y for the value of the square root of a number x, we can perform a simple manipulation to get a better guess (one closer to the actual square root) by averaging y with x/y. For example we can compute the square root of 2 as follows; Suppose the initial guess is 1:
2/1 = 2 ; (2 + 1) / 2 = 1.5 2/1.5 = 1.3333; ( 1.5 + 1.3333 ) / 2 = 1.4167 2/1.4167 = 1.4117; ( 1.4167 + 1.4117 ) / 2 = 1.4142 And so on...
As mentioned before, Newton’s approximation is a well known method of fast root calculations. However, Carmack picked an unusual first guess for the first iteration that enhances the approximation drastically and lowers the number of required iterations in Quake III’s calculations to only one iteration!
The really interesting aspect of this function is the magic constant 0×5f3759df, used to calculate the initial guess, in :
i = 0x5f3759df - ( i >> 1 );
Hence, dividing the input by 2 and subtracting it from the magic constant. This constant works almost perfectly - Only one Newton approximation iteration is required for a low relative error of 10^-3. . As the code illustrates in the commented out second iteration, this approximation is good enough for the Quake III engine.
It turns out that the magic constant 0×5f3759df is a real mystery. In the article “Fast Inverse Square Root” [2] , mathematician Chris Lomont of Purdue University researches the meaning of this magic constant. Using several elaborate techniques, Lomont tries to mathematically calculate this constant himself. The results are very surprising - Lomont’s well calculated mathematically optimal constant turns out to be slighltly different (0×5f37642f) , and in spite of being theoretically better, it yields worse results then the original constant used in the source code!! Indeed, John Carmack must have used genuine black magic to find this number.
Only by numerically searching for another better constant, Lomont found one, which is slightly better that the original. However, in practice the two constants are likely to yield similar results. Lomont proposes this function with the better constant:
float InvSqrt(float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f375a86- (i>>1); // gives initial guess y0
x = *(float*)&i; // convert bits back to float
x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy
return x;
}
Atomic counters are the foundation of any thread safe reference counter. They are also common in daily use for thread counting and many other uses. Regularly, implementing thread safe atomic counters can be done in several ways, but always with a performance cost. In this review we will examine a method to implement a reference counter with a single assembly instruction.
Mutexes and semaphores exist in all modern POSIX and Microsoft Windows operating systems. The OS kernel supplies user-space applications a way to enforce memory locking. The shortcoming of this method, is that it bears heavy performance penalty, mostly due to the change to kernel space and back.
As a faster alternative, Critical Sections are implemented in all modern operating systems (either via pthread mutex, or WIN32 API’s as in EnterCriticalSection). A critical section implements locking purely in user-space, therefore it is much more efficient. So how exactly critical sections succeed in ensuring us thread safety in user-space? The answer is the LOCK prefix.
All X86 CPUs are equipped with the ability to lock a specific memory address, preventing other system buses to read or modify it while the following instruction runs. The LOCK prefix to an assembly instruction causes the CPUs to assert the LOCK# signal, and practically ensures exclusive use of the memory address in multiprocessors / multi-thread environments.
The LOCK prefix works only with the following instructions:
BT, BTS, BTR, BTC (mem, reg/imm) XCHG, XADD (reg, mem / mem, reg) ADD, OR, ADC, SBB (mem, reg/imm) AND, SUB, XOR (mem, reg/imm) NOT, NEG, INC, DEC (mem)
Note: XCHG and XADD (and all the ‘X’ family of instructions) are planned to be multi-processor safe, and always asserts LOCK# regardless of the presence of the LOCK prefix.
All threads and processes accessing the locked memory must ‘play fair’. This means that any one accessing the memory without trying to lock the address will not be protected by other processes locking the same address. For example:
Thread 1: lock inc ptr [edx]
Thread 2: inc ptr [edx]
If the two threads run simultaneously, thread 2 does not assert LOCK#. This means that it doesn’t check if the memory pointed by edx is locked. This means the lock is useless in this matter.
In order to use a critical section, the user must create a critical section object, and lock / unlock the relevant piece of code. This creates a lot of overhead in our simple case, where we only want to change the value of a variable safely.
Below is a simple implementation of a ‘thread safe increment’. The function takes one volatile parameter, a pointer to an integer value, and increments it:
void __fastcall atomic_inc (volatile int* pNum)
{
__asm
{
lock inc dword ptr [ECX]
ret
}
}
The __fastcall keyword makes sure the parameter is passed as a register, in order to perform the increment in one atomic instruction.
Note that under Microsoft Windows compiler it’s possible to add the __declspec(naked), since we do not need traditional function entry and exit code. This will make the code run with almost no overhead.
Performance-wise, there is almost no overhead to the action, and nothing prevents it from being cached by the CPU upon frequent usage. The same concept can be used for other data types, or implement any of the actions supporting the LOCK prefix.
Win32 API has a set of builtin functions that implement the functionality described above, these are called the Interlocked functions. These functions gets LONG parameters and run in the same way demonstrated above. For example:
LONG InterlockedExchange(IN OUT PLONG plTarget, IN LONG lValue);
Performs two calls for CMPXCHG with the lock prefix.
It is worth mentioning that in the Repertoire Project, John M. Dlugosz’s implemented a very handy C++ template class that enforces thread safe counting. For example, their safe add implementation is:
__declspec(naked) int __fastcall Xadd (volatile int* pNum, int val)
{
__asm
{
lock xadd dword ptr [ECX], EDX
mov EAX, EDX
ret
}
}
If you plan to implement a safe counter yourself, it is very much advised to encapsulate it in it’s own class. The LOCK mechanism will work only if all the threads / processes are modifying the data with the LOCK prefix. This will help make sure that everyone change the data in a synchronous way.
References
Do you know how much memory your structs and classes require? Unless you plan carefully, structs and classes may require as much as twice memory than actually needed, due to the compiler’s default packing alignment. The pack directive can be used in order to define the best packing alignment for your code.
When compiling code with any modern compiler, the compiler determines the default packing alignment of structures, unions and classes. This alignment is set for an entire compilation unit, and can usually be controlled with compiler switches like /Zp (in VC’s cl.exe).
The default packing alignment varies when working on different compilers. However, it defaults to 4 or 8 on most compilers.
As an example, observe the following struct:
struct myPack
{
char m_char;
int m_int;
char m_boolean:1;
};
void main()
{
printf("%d bytes required for myStruct, n", sizeof(myStruct));
}
myPack consists of 42 ( = 8 + 32 + 1) useful bits of information, which may be rounded to 6 bytes of actual memory space needed. Surprisingly, the print out of the program above would yield “12 bytes”. This is beacause under the default packing alignment of 4, myPack would actually require 12 bytes, as illustrated in the following chart:
[m_char , _, _, _, m_int, m_int, m_int, m_int, m_boolean, _, _, _]
(every position in the chart represents one byte)
This inefficient allocation gets worse when declaring an array of myStruct:
struct myStructArray
{
myStruct structArr[256][256];
};
The size of myStructArray is 256 * 256 * sizeof(myStruct) = 786432 bytes, whereas only 42 bits * 256 * 256 = 344064 bytes may be actually used!
In most cases this would not budge anyone as memory is not always among your program’s limitation. However, it is very easy to think of problems that can be occur by this waste of memory space when using large memory structs like myStructArray. Starting from performance issues like the fact that more page faults would occur when addressing that memory, hence your program could be slower than possible. Ending with hardware problems like wasting that memory when serializing large structs buffers into files or hardware devices that could be influenced by this waste.
Note: It is important to mention here that the default packing alignment is set to 4 or 8 mainly in order to fit the CPU’s registers. If there is no special reason like those mentioned above, your program would probably run faster with the default alignment.
Not very surprisingly, a simple C++ pragma instruction can set the compiler’s packing alignment to the required size.
When using #pragma pack(n), the compiler uses n bytes as a packing alignment from that directive until the end of the compilation unit. Under Microsoft’s compiler and GCC , N could be 1,2,4,8 or 16.
In order to preserve the default packing alignment for declarations other than your required declaration, the following syntax can be used to control the internal compiler’s stack:
#pragma pack(push, 1)
struct myStruct
{
char m_char;
int m_int;
char m_boolean:1;
};
#pragma pack(pop)
myStruct would now be compiled with a 1 byte alignment. You can relax as sizeof(myStruct) = 6 bytes only, instead of 12, and of course myStructArray is about 393KB instead of 786KB.
Note that after the pop directive, the default packing alignment influences the rest of the code in the sense that anyone who uses myStruct would get only size of 6 bytes, including myStructArray, even if myStruct is used inside a region of another packing alignment. However, any other struct or class before the push or after the pop would be compiled with the default packing alignment, unless mentioned otherwise.
The pack syntax even lets your declare identifiers and use them to verify that the compiler’s stack is in the state you wanted it to be. For example:
#pragma(push, beforeInclude) // an evil include file that uses #pragma push itself witout popping #include "IDontTrustThis.h" #pragma(pop, beforeInclude)
When popping, the compiler would scream that the stack’s top is not beforeInclude, and this would alert you to check that include file for a missing pop.
The pack pragma is not supported in all architectures (powerPC for example), a less powerful, but platform independent alternative is supported in GCC; The __attribute((packed)) directive. The less powerful equivalent to the above code with the GCC alternative would be:
struct myStruct
{
char m_char;
int m_int;
char m_boolean:1;
} __attribute((packed));
In this case, the compiler will not align the struct, and pack it into 6 bytes.
Using the pack directive does have some dark sides:
10 Lines of code are more than enough. This teaser challenges you to perform the daunting task of hex dumping under 10 lines of code.
The code should read input from STDIN, and print hex values of each input octet to the STDOUT. The code should also prints all printable characters, and print all non-printable characters as ‘.’ (dot). The length of each output line should be modifiable.
Note: As in all other Top 10 Lines challenges, thou shell not exceed 10 lines of code.
9 lines of Perl are more than enough:
$len = 0xC; #12 bytes in each time
for( $offset = 0; read STDIN, $chars, $len; $offset += $len)
{
$printable = $chars;
$printable =~ s/[^x21-x7e]/./g; #Replace non printable chars with dot
#Pad the chars to $len size, and ord them
@hex = unpack("C*", $chars. " " x ($len - length $chars));
printf("%.8X: "."%.2x " x $len ."| %sn", $offset, @hex, $printable);
}
Your Kung-Fu is better than the Maestro’s? Submit your code
Code Challenge - Any language acceptable, no technique too dirty!
Powered by WordPress