Atomic counters are the foundation of any thread safe reference counter. They are also common in daily use for thread counting and many other uses. Regularly, implementing thread safe atomic counters can be done in several ways, but always with a performance cost. In this review we will examine a method to implement a reference counter with a single assembly instruction.

The Alternatives

Mutexes and semaphores exist in all modern POSIX and Microsoft Windows operating systems. The OS kernel supplies user-space applications a way to enforce memory locking. The shortcoming of this method, is that it bears heavy performance penalty, mostly due to the change to kernel space and back.

As a faster alternative, Critical Sections are implemented in all modern operating systems (either via pthread mutex, or WIN32 API’s as in EnterCriticalSection). A critical section implements locking purely in user-space, therefore it is much more efficient. So how exactly critical sections succeed in ensuring us thread safety in user-space? The answer is the LOCK prefix.

The LOCK# Signal and Prefix

All X86 CPUs are equipped with the ability to lock a specific memory address, preventing other system buses to read or modify it while the following instruction runs. The LOCK prefix to an assembly instruction causes the CPUs to assert the LOCK# signal, and practically ensures exclusive use of the memory address in multiprocessors / multi-thread environments.

The LOCK prefix works only with the following instructions:

BT, BTS, BTR, BTC   (mem, reg/imm)
XCHG, XADD  (reg, mem / mem, reg)
ADD, OR, ADC, SBB   (mem, reg/imm)
AND, SUB, XOR   (mem, reg/imm)
NOT, NEG, INC, DEC  (mem)

Note: XCHG and XADD (and all the ‘X’ family of instructions) are planned to be multi-processor safe, and always asserts LOCK# regardless of the presence of the LOCK prefix.

All threads and processes accessing the locked memory must ‘play fair’. This means that any one accessing the memory without trying to lock the address will not be protected by other processes locking the same address. For example:
Thread 1: lock inc ptr [edx]
Thread 2: inc ptr [edx]
If the two threads run simultaneously, thread 2 does not assert LOCK#. This means that it doesn’t check if the memory pointed by edx is locked. This means the lock is useless in this matter.

The Overhead of Critical Sections Usage

In order to use a critical section, the user must create a critical section object, and lock / unlock the relevant piece of code. This creates a lot of overhead in our simple case, where we only want to change the value of a variable safely.

The atomic_inc function

Below is a simple implementation of a ‘thread safe increment’. The function takes one volatile parameter, a pointer to an integer value, and increments it:

void __fastcall atomic_inc (volatile int* pNum)
{
    __asm 
    {
        lock inc dword ptr [ECX]
        ret
    }
}

The __fastcall keyword makes sure the parameter is passed as a register, in order to perform the increment in one atomic instruction.

Note that under Microsoft Windows compiler it’s possible to add the __declspec(naked), since we do not need traditional function entry and exit code. This will make the code run with almost no overhead.

Performance-wise, there is almost no overhead to the action, and nothing prevents it from being cached by the CPU upon frequent usage. The same concept can be used for other data types, or implement any of the actions supporting the LOCK prefix.

Win32 API has a set of builtin functions that implement the functionality described above, these are called the Interlocked functions. These functions gets LONG parameters and run in the same way demonstrated above. For example:

    LONG InterlockedExchange(IN OUT PLONG plTarget, IN LONG lValue); 

Performs two calls for CMPXCHG with the lock prefix.

It is worth mentioning that in the Repertoire Project, John M. Dlugosz’s implemented a very handy C++ template class that enforces thread safe counting. For example, their safe add implementation is:

__declspec(naked) int __fastcall Xadd (volatile int* pNum, int val)
{
    __asm 
    {
        lock xadd dword ptr [ECX], EDX
        mov EAX, EDX
        ret
    }
}

Final Note

If you plan to implement a safe counter yourself, it is very much advised to encapsulate it in it’s own class. The LOCK mechanism will work only if all the threads / processes are modifying the data with the LOCK prefix. This will help make sure that everyone change the data in a synchronous way.
References