
Monday, October 11, 2010

Hot/Cold Data Separation Considered Harmful in Multithreaded Environments

When optimizing your code for performance, an important step along the way is to check that your application utilizes its processor's cache correctly. Of course, the magnitude of importance here depends solely on your own scenario, but in general, it's always a good idea to keep your cache usage patterns in mind when attempting to resolve a performance bottleneck, or when you're simply looking for ways to enhance your application's performance.

Due to the relatively high cost of accessing the main memory, modern processors make use of data and instruction caches in an attempt to reduce that latency.
A processor may choose to keep the values of frequently accessed memory addresses in its cache, or perhaps prefetch the values of adjacent memory addresses for possible future use.
When the processor attempts to access a memory address, it first checks whether it already exists in one of its caches (L1, L2, L3 etc.). If the address isn't found, we have a "cache miss" and the processor will have to perform a round-trip to the main memory in order to obtain the required data, wasting valuable CPU cycles along the way.

Not too long ago, one of the most popular guidelines in designing cache-aware data structures was to split them into "hot" and "cold" parts. This kind of separation makes sense since it could lead to a more efficient utilization of the cache. Frequently used fields (the hot part) are grouped together in the same cache lines, instead of being mixed with infrequently used fields (the cold part). This way, the cache is able to contain more hot data, and cache lines aren't wasted to hold cold data.
For those of you who are interested, the subject is covered in greater detail in the article Splitting Data Objects to Increase Cache Utilization by Michael Franz and Thomas Kistler.

 Red: Hot data, Green: Cold data
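
To make the split concrete, here's a minimal sketch in C# (the Customer/CustomerColdData names are purely illustrative and not taken from the article): frequently accessed fields stay in the main object, while rarely used ones are moved behind a reference to a separate "cold" object.

    class Customer
    {
        // hot part: fields touched on almost every access stay inline,
        // so they share cache lines only with other hot fields
        public int Id;
        public decimal Balance;

        // cold part: rarely used fields are moved behind a reference,
        // so they no longer occupy space in the hot cache lines
        public CustomerColdData Cold;
    }

    class CustomerColdData
    {
        public string MailingAddress;
        public string FaxNumber;
        public DateTime LastSurveyDate;
    }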

While following this guideline could lead to more efficient cache utilization, it won't necessarily lead to a performance improvement in a multithreaded environment. In fact, it's more likely that you'll face a performance degradation than gain any improvement at all.
The problem at hand is that a highly utilized cache could easily send your valuable CPU cycles down the drain due to the effects of false sharing. Once a single cache line holds several frequently written-to fields, the chance that it will get invalidated by another processor grows. Packing multiple frequently modified fields into a single cache line can easily have a negative effect on your cache's locality.
In multithreaded environments, one could actually benefit from allocating frequently used fields sparsely across cache lines. There's an obvious trade-off here between cache utilization (memory usage) and locality (performance): while the cache will contain less hot data (which could result in more round-trips to the main memory), it will benefit from better locality, and hence better scalability across multiple processors that might attempt to modify the same object simultaneously.

When designing a cache-aware data structure, you don't necessarily have to order your fields in such a way that a hot field is always followed by a cold one. Instead, "artificial" padding could be used to fill the excess space left in the cache line that holds the structure. In .NET, decorating the type with a StructLayout(LayoutKind.Explicit) attribute and assigning each field its appropriate offset would do the job.
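
Here's a rough sketch of such padding (assuming 64-byte cache lines; the type and field names are illustrative, and System.Runtime.InteropServices is imported):

    // each hot counter gets its own 64-byte slot, so concurrent writers
    // no longer invalidate each other's cache line (at the cost of memory)
    [StructLayout(LayoutKind.Explicit, Size = 128)]
    public struct PaddedPair
    {
        [FieldOffset(0)]
        public long CounterA;   // written by thread A

        [FieldOffset(64)]
        public long CounterB;   // written by thread B
    }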

Friday, September 24, 2010

Writing a Semi-Local Object Pool

Using object pooling in managed environments can usually benefit us in two ways:
  • Reducing the amount of time required to create "heavyweight" objects (that might involve executing some time consuming tasks).
  • Reducing the amount and rate of dynamic memory allocations. Thus, reducing the GC latency in future collections.
Nevertheless, it's important to remember that under certain scenarios, using object pools might actually have a negative impact on your application's performance. Since dynamic memory allocation in managed environments (e.g. the CLR, the JVM) is considerably fast, using object pools for "lightweight" objects could add unnecessary overhead to the object's allocation process. Brian Goetz summarized this issue:
Allocation in JVMs was not always so fast -- early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice -- because it was, along with everything else in early JVMs -- and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.)
A common, simple pattern for implementing an object pool is to create a single pool instance that is shared across the entire application. To achieve thread-safety, you would usually find a single, global lock around the Allocate and Free methods.
It's obvious that this type of design could introduce major concurrency bottlenecks. The more objects we attempt to pool, the greater the chance that threads will be competing to acquire the pool's lock. And since we only maintain a single, global pool, contention around that lock is bound to appear, effectively ruining our application's scalability.
To demonstrate the issue, I've written a small benchmark that uses all of the available processors to allocate and free a constant number of objects (each thread gets an equal share of objects to pool). Logically speaking, the more processors we use, the faster we should finish allocating and freeing that constant number of objects. However, as the results show, we actually experience a slowdown that gets worse as we add more and more processors.
The results aren't surprising, since they are easily explained by the massive amount of contention around our single lock. (The time axis in the chart is expressed in milliseconds.)

The first implementation of the pool that was used in the test:
(Mind you that the code samples in this post are purely meant to demonstrate the conceptual differences between the pools).
    // holds a dictionary that maintains a pool-per-type correlation
    public class SimpleMainPool
    {
        private Dictionary<Type, ISubPool> m_main;

        // to make things simpler, the dictionary isn't modified
        // after the first initialization
        public SimpleMainPool(Type[] pooledTypes)
        {
            m_main = new Dictionary<Type, ISubPool>();

            foreach (Type curType in pooledTypes)
                m_main.Add(curType, new SimplePool(curType));
        }

        public object Allocate(Type type)
        {
            ISubPool sub = m_main[type];

            object pooledObj = sub.Allocate();
            return pooledObj;
        }

        public void Free(object obj)
        {
            ISubPool sub = m_main[obj.GetType()];
            sub.Free(obj);
        }
    }

    // our simple thread-safe pool
    class SimplePool : ISubPool
    {
        private const int PRIME = 50;

        private Type m_type;

        private Stack<object> m_sharedPool;

        public SimplePool(Type type)
        {
            m_sharedPool = new Stack<object>(PRIME);
            m_type = type;

            for (int i = 0; i < PRIME; i++)
            {
                object sharedObj = Activator.CreateInstance(m_type);
                m_sharedPool.Push(sharedObj);
            }
        }

        public object Allocate()
        {
            lock (m_sharedPool)
            {
                if (m_sharedPool.Count == 0)
                {
                    for (int i = 0; i < PRIME; i++)
                    {
                        object newAlloc = Activator.CreateInstance(m_type);
                        m_sharedPool.Push(newAlloc);
                    }
                }

                object fromLocal = m_sharedPool.Pop();
                return fromLocal;
            }
        }

        public void Free(object obj)
        {
            lock (m_sharedPool)
            {
                m_sharedPool.Push(obj);
            }
        }
    }

    interface ISubPool
    {
        object Allocate();
        void Free(object obj);
    }  
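
For reference, the benchmark harness could look roughly like this (a sketch only; the original harness isn't shown in the post, the MyPooledType name is illustrative, and System.Threading and System.Diagnostics are assumed to be imported):

    static class PoolBenchmark
    {
        public static long Run(SimpleMainPool pool, Type pooledType,
                               int threadCount, int totalOperations)
        {
            int perThread = totalOperations / threadCount;
            Thread[] workers = new Thread[threadCount];

            Stopwatch sw = Stopwatch.StartNew();

            for (int t = 0; t < threadCount; t++)
            {
                workers[t] = new Thread(() =>
                {
                    // each thread allocates and frees an equal share of the objects
                    for (int i = 0; i < perThread; i++)
                    {
                        object obj = pool.Allocate(pooledType);
                        pool.Free(obj);
                    }
                });
                workers[t].Start();
            }

            foreach (Thread worker in workers)
                worker.Join();

            return sw.ElapsedMilliseconds;
        }
    }

    // e.g.: PoolBenchmark.Run(new SimpleMainPool(new[] { typeof(MyPooledType) }),
    //                         typeof(MyPooledType), Environment.ProcessorCount, 1000000);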
As in all things related to concurrency, if you don't have locality, then you've got sharing, and once you have sharing, you will probably end up with contention that is bound to harm your application's performance and waste valuable CPU cycles.
So if we'd like to improve our scalability, our goal is clear: reduce the amount of shared data. For example, if pools weren't shared across different threads, we wouldn't have to worry about synchronizing them and we could avoid the involved contention altogether. A simple way to achieve this is to use thread-local storage (TLS) to allocate an independent pool for every thread. On the one hand, we'd achieve perfect scalability by avoiding any state sharing; on the other hand, this kind of implementation could lead to excessive memory usage. For instance, if a single instance of our pool (including all of its pre-allocated objects) weighs about 10MB, then on a machine with 16 processors we could find ourselves dedicating no less than 160MB to our thread-local pools, even though it's unlikely that every single thread needs to use all of the object types that were allocated in its local pool.
For example, if we're parallelizing some algorithm using 3 threads, where thread 1 needs to use objects of type A, thread 2 needs objects of type B and thread 3 needs objects of type C, then it makes no sense for every one of those threads to hold a pool containing objects of all three types. A fully thread-local pool is sketched below.
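
To illustrate the purely thread-local approach, here's a minimal sketch (the ThreadLocalPool name is my own, it reuses the ISubPool interface shown above, and it assumes .NET 4's System.Threading.ThreadLocal<T>): every thread gets its own stack, so no locking is needed, but the pre-allocated objects are duplicated for every thread.

    class ThreadLocalPool : ISubPool
    {
        private const int PRIME = 50;

        private Type m_type;

        // one independent stack per thread - no synchronization required,
        // but the pre-allocated objects are duplicated for every thread
        private ThreadLocal<Stack<object>> m_localPool;

        public ThreadLocalPool(Type type)
        {
            m_type = type;
            m_localPool = new ThreadLocal<Stack<object>>(() =>
            {
                Stack<object> pool = new Stack<object>(PRIME);
                for (int i = 0; i < PRIME; i++)
                    pool.Push(Activator.CreateInstance(m_type));
                return pool;
            });
        }

        public object Allocate()
        {
            Stack<object> pool = m_localPool.Value;
            if (pool.Count == 0)
                return Activator.CreateInstance(m_type);
            return pool.Pop();
        }

        public void Free(object obj)
        {
            m_localPool.Value.Push(obj);
        }
    }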

A possible solution to this problem is to use a pool hierarchy, where every time a thread attempts to create an object, it directs itself to its "closest" pool instance. If that pool doesn't contain available instances of the requested object, the thread continues to navigate up the hierarchy until it reaches a pool that does. Once the thread finishes using the object, it returns it to a pool located "closer" to that thread; this way we are able to maintain a level of locality between a thread and the objects it uses.
Instead of getting lost in unclear, overly complex hierarchies, I'll demonstrate the concept using a flat hierarchy that offers a single "global" pool shared across all threads, plus a local pool for every thread.
Basically, the idea is that the only place where synchronization is involved is the shared pool. So in the optimal scenario, each local pool will eventually hold just enough objects to keep its thread from ever accessing the shared pool.
Every time a thread needs to create an object, it will first check its local pool. Since this pool only serves the requesting thread, no synchronization is needed here. Only if we've run out of objects do we move on to the shared pool and transfer N more instances of the requested object to the local pool. It could be wise to transfer more objects than the thread initially asked for in order to avoid future accesses to the shared pool. Also, in order to cap the amount of memory we dedicate to each thread, we could decide that each local pool can hold a maximum of X objects. Once we've exceeded that number, every time a thread wants to free an object it will return it to the shared pool instead of its local pool (of course, this could cause some contention, depending on the implementation details [e.g. the pool may buffer object returns], but it's entirely up to the developer to perform this kind of fine-tuning [memory usage vs. scalability]).

To demonstrate the concept, I've come up with this simplistic pool implementation:
    class SemiLocalPool : ISubPool
    {
        private const int SHARED_PRIME = 50;

        private const int LOCAL_PRIME = 20;
        private const int LOCAL_MAX = 1000;

        [ThreadStatic]
        private static Stack<object> t_localPool;

        private Type m_type;
        private Stack<object> m_sharedPool;

        public SemiLocalPool(Type type)
        {
            m_sharedPool = new Stack<object>(SHARED_PRIME);

            m_type = type;

            for (int i = 0; i < SHARED_PRIME; i++)
            {
                object sharedObj = Activator.CreateInstance(m_type);
                m_sharedPool.Push(sharedObj);
            }
        }

        public static void Init()
        {
            t_localPool = new Stack<object>(LOCAL_PRIME);
        }

        public object Allocate()
        {
            // first, try to allocate from the local pool
            if (t_localPool.Count > 0)
            {
                object localObj = t_localPool.Pop();
                return localObj;
            }

            int allocated = 0;

            lock (m_sharedPool)
            {
                // pass objects from shared to local pool
                for (; m_sharedPool.Count > 0 && allocated < LOCAL_PRIME - 1; allocated++)
                {
                    object sharedObj = m_sharedPool.Pop();
                    t_localPool.Push(sharedObj);
                }

                // prime the shared pool
                if (m_sharedPool.Count == 0)
                {
                    for (int i = 0; i < SHARED_PRIME; i++)
                    {
                        // bad practice: holding the lock while executing external code
                        object sharedObj = Activator.CreateInstance(m_type);

                        m_sharedPool.Push(sharedObj);
                    }
                }
            }

            // if the shared pool didn't contain enough elements, prime the remaining items
            for (; allocated < LOCAL_PRIME - 1; allocated++)
            {
                object newAlloc = Activator.CreateInstance(m_type);
                t_localPool.Push(newAlloc);
            }

            // the local pool was just refilled; hand the caller a freshly created instance
            object newObj = Activator.CreateInstance(m_type);
            return newObj;
        }

        public void Free(object obj)
        {
            // first return to local pool
            if (t_localPool.Count < LOCAL_MAX)
            {
                t_localPool.Push(obj);
                return;
            }

            // only after reaching LOCAL_MAX push back to the shared pool
            lock (m_sharedPool)
            {
                m_sharedPool.Push(obj);
            }
        }
    }
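One detail worth noting: since t_localPool is marked with [ThreadStatic], each worker thread has to call SemiLocalPool.Init() before its first allocation, otherwise Allocate will hit an uninitialized local pool. A minimal usage sketch (the surrounding worker code is illustrative):

    // create one shared SemiLocalPool for the type we want to pool
    SemiLocalPool pool = new SemiLocalPool(typeof(object));

    Thread worker = new Thread(() =>
    {
        SemiLocalPool.Init();          // initialize this thread's [ThreadStatic] local pool

        object obj = pool.Allocate();  // served from the local pool when possible
        // ... use obj ...
        pool.Free(obj);
    });
    worker.Start();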
The scalability difference between the two implementations is closely related to the threads' pool usage pattern and to the values given to LOCAL_MAX, LOCAL_PRIME etc. If we reach a situation where there are always enough objects in the local pool, we should enjoy perfect scalability.
For the purpose of the demonstration, here are the results of the previous benchmark, now using the new pool implementation (aside from exhausting the predefined values at the beginning of the benchmark, the benchmark exhibits an optimal usage pattern [accessing only the local pool after a while]).

One problematic characteristic of this type of design is its reliance on thread affinity. While in some scenarios this could actually benefit us, in others it could make the Semi-Local Pool irrelevant.
If every thread in our application is affinitized to a certain section of the code (one that allocates a constant set of objects), then using this design could be optimal, since we dedicate each local pool to a managed thread. We essentially assume that the thread will always allocate objects from a specific, constant set of types.
However, if the threads don't comply with this assumption, it's only a matter of time until each local pool holds the entire set of pooled objects in the application (which will, of course, lead to high memory usage).
In order to handle such scenarios better, we could add an additional hierarchy level that separates the shared pools according to different sections of the code. Meaning, threads that are currently executing code from a network module, for example, will access Pool X, while threads that are currently executing some algorithm will access Pool Y. This way we could achieve object locality not by relying on thread affinity, but on "category affinity" (each section of the code uses a certain set of objects relevant to it). When a thread wants to allocate an object, it tells the pool which area of the code it's currently executing, so it will receive the appropriate "category pool". It's likely that this pool already contains the object types that the current thread will request, since they were already allocated by other threads that previously executed the same code section.
And some code to illustrate the concept:
    public class CategorizedMainPool
    {
        private Dictionary<string, SimpleMainPool> m_main;

        public CategorizedMainPool(Tuple<string, Type[]>[] pooledCategories)
        {
            m_main = new Dictionary<string, SimpleMainPool>();

            foreach (Tuple<string, Type[]> curCategory in pooledCategories)
            {
                SimpleMainPool curSub = new SimpleMainPool(curCategory.Item2);
                m_main.Add(curCategory.Item1, curSub);
            }
        }

        public object Allocate(string category, Type type)
        {
            SimpleMainPool sub = m_main[category];

            object pooledObj = sub.Allocate(type);
            return pooledObj;
        }

        public void Free(string category, object obj)
        {
            SimpleMainPool sub = m_main[category];
            sub.Free(obj);
        }
    }
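
And a short usage sketch (the category names and the SocketBuffer/WorkItem types are illustrative):

    CategorizedMainPool pool = new CategorizedMainPool(new[]
    {
        Tuple.Create("Network", new[] { typeof(SocketBuffer) }),
        Tuple.Create("Algorithm", new[] { typeof(WorkItem) })
    });

    // a thread currently executing the network module:
    object buffer = pool.Allocate("Network", typeof(SocketBuffer));
    // ... use the buffer ...
    pool.Free("Network", buffer);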

Tuesday, August 24, 2010

Reducing AutoResetEvent's Synchronization Overhead

One of the most common design patterns in multithreaded programming is the producer-consumer pattern. In the usual scenario, a consumer thread is spawned into an infinite loop, where it waits on some handle until new data arrives from the producer threads. It then wakes up and processes the new data.
The overall idea can be expressed via this code:

AutoResetEvent m_event = new AutoResetEvent(false);

private void WorkCycles()
{
    while(true)
    {
        // wait for a signal
        m_event.WaitOne();

        // do work..
    }
}

Using the AutoResetEvent class seems very suitable for this kind of scenario since it supplies exactly the behavior we're looking for: causing a thread to wait on a handle until a signal arrives. However, calls to WaitOne/Set come with a price, and it isn't cheap. The AutoResetEvent class is simply a managed-code wrapper for Win32's WaitForSingleObject, and it doesn't keep track of the internal event's state (whether it's currently signaled or not). Thus, every call to AutoResetEvent will result in diving all the way into kernel mode, even though "we don't have to" (if we had tracked the event's internal state).
In some scenarios, especially those where producer threads pass data to the consumer at high frequency, most of the calls to Set() are redundant. If, for example, a producer passes new data every 100ms (which results in a call to Set), but it takes the consumer an entire second to process that data, it means we've called Set 9 times more than we had to. Additionally, after the consumer finishes its processing, it will have to call WaitOne again, only to realize the event was already set by a producer.

So, how expensive are those redundant calls to AutoResetEvent? The following benchmark should demonstrate:

while (true)
{
    AutoResetEvent eventObj = new AutoResetEvent(false);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 1000000; i++)
    {
        eventObj.Set();

        eventObj.WaitOne();
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
}

On my machine, the average result was 1035ms.
A whole second: that's the ridiculously large amount of time it took to complete those few lines of code. We're wasting an entire second just for using a synchronization primitive that, in our real application, might look quite insignificant at first glance. Seemingly, we shouldn't waste more than a few milliseconds running that code, but as you can see, things are a little different in practice.

EconomicAutoResetEvent
The solution to this problem is quite straightforward, since our goal is to reduce the number of redundant calls to Set/WaitOne to a minimum. How could we achieve that? One option is to use an alternative synchronization primitive that, for the sake of this post, will be named EconomicAutoResetEvent. The purpose of this new primitive is to improve AutoResetEvent's performance by keeping track of the event's state, thus avoiding unnecessary transitions into the kernel.
I'll demonstrate a possible implementation of such a primitive, and explain its behavior in more detail afterwards:

public class EconomicResetEvent
{
    private volatile int m_eventState;
    private AutoResetEvent m_event;

    private Thread m_worker;

    private const int EVENT_SET     = 1;
    private const int EVENT_NOT_SET = 2;
    private const int EVENT_ON_WAIT = 3;

    public EconomicResetEvent(bool initialState, Thread workerThread)
    {
        m_event = new AutoResetEvent(initialState);
        m_eventState = initialState ? EVENT_SET : EVENT_NOT_SET;

        m_worker = workerThread;
    }

    public void WaitForWork()
    {
        verifyCaller();

        if (m_eventState == EVENT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_SET) == EVENT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_ON_WAIT, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            m_event.WaitOne();
        }
    }

    public void SignalWork()
    {
        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_SET, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_ON_WAIT && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_ON_WAIT) == EVENT_ON_WAIT)
        {
            m_event.Set();
        }
    }

    // [Conditional("DEBUG")]
    private void verifyCaller()
    {
        if (m_worker != Thread.CurrentThread)
        {
            string errMsg = string.Format(
                "Only the pre-defined Worker thread may call WaitOne (Current: {0}, Worker: {1})",
                Thread.CurrentThread, m_worker);

            throw new SynchronizationLockException(errMsg);
        }
    }
}
 
I assume the question most of you are asking right now is "How fast is it?". Well, after running the same benchmark as before, the average result I get on my machine is only 9ms. An outstanding improvement of 11,500% over using AutoResetEvent directly. And to think that all we did was avoid some calls to the internal kernel object.
The most important part of the implementation is the tracking of the internal kernel object's state (signaled/not-signaled). The state itself is kept in m_eventState; this is how we know which calls to WaitForWork/SignalWork can avoid accessing m_event (the AutoResetEvent).
Additionally, in order to make the store/load operations on m_eventState thread-safe, I've used a couple of CAS operations. Even though they can be cheaper than a "full blown" lock, CAS is still considered a fairly expensive instruction that is usually best avoided when it's not necessary. This is exactly why the double-tests in the if statements are performed. This technique is usually referred to as TATAS (i.e. test-and-test-and-set).
It's worth mentioning that some mild race conditions could occur when the consumer thread is entering/leaving WaitForWork while a producer calls SignalWork at exactly the right moment. This race condition won't cause any functional harm; at worst, it would cause us to perform a redundant call to Set/WaitOne. Resolving this "issue" would mean that we prefer to enforce fairness, and such enforcement could have negative effects on performance.

Since we're already discussing the subject, it would be appropriate to mention that the Parallel Extensions library introduced a new synchronization primitive named ManualResetEventSlim, which supplies extended functionality and better performance than the standard event objects. Beyond keeping track of the state of the internal kernel event, it also spins around the flag for a while before actually blocking the thread, and it uses lazy initialization to create the kernel object (it's created only when needed). The good guys at Microsoft were kind enough to publish a short but informative document that demonstrates the performance differences between the new and old synchronization primitives under a selected group of usage scenarios.
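
For comparison, here's a minimal sketch of a consumer loop using ManualResetEventSlim (note that, unlike AutoResetEvent, it doesn't reset automatically, so the consumer resets it explicitly):

ManualResetEventSlim m_slimEvent = new ManualResetEventSlim(false);

private void WorkCycles()
{
    while (true)
    {
        // spins briefly before falling back to a kernel wait
        m_slimEvent.Wait();
        m_slimEvent.Reset();

        // do work..
    }
}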

Saturday, August 14, 2010

Don't Rely on Environment.ProcessorCount

One of the more elusive questions in multithreaded programming is "How many threads should I use in my application to get the best performance out of my available hardware?". The answer varies, since it depends on the characteristics of the application: is it CPU bound or IO bound? How much of the work is parallelized? And so on... But at the bottom line, all of these "formulas" are based on the number of available processors.

Usually, when an application runs its initialization routine and gets ready to create a collection of worker threads, it checks how many processors are installed on the machine so it can decide exactly how many threads to create. To get the number of installed processors, you would usually call Environment.ProcessorCount. This property simply calls the Win32 GetSystemInfo function and returns the dwNumberOfProcessors field to the caller. The problem with this value is that it doesn't necessarily represent the number of processors available to our process. The user that launched our application might have given it a certain processor affinity, which causes our application's threads to execute only on the processors allowed by that affinity, instead of on all of the installed processors. The problem in this case is that the application completely ignores the processor affinity, and creates more threads than it should (in an even worse case, it could assign its threads affinities to processors that aren't even available to the process).
What will eventually happen is that we'll have too many threads running on our subset of available processors, causing us to suffer a performance penalty from excessive context switching.
The scenario in which we set our process to run on only a subset of the processors is far from far-fetched. In situations where we have a couple of applications that are designed to take advantage of all of the available processors, and are meant to run in parallel on the same machine (without using some kind of VMware), we would probably like to split the installed processors between them, giving each process only a subset of processors to use. If the applications aren't aware of this subset, we'll find ourselves wasting valuable CPU cycles for no good reason.

In .Net, you can get the processor affinity by using the Process.ProcessorAffinity property. It returns a bitmask in which each bit represents a logical CPU on which our process can execute (if the mask is set to 0, the scheduler decides which processors our application will use, so basically all of the processors are available).
Current versions of Windows provide support for more than 64 logical CPUs. In order to address all of those CPUs, they are divided into Groups, where each group can address up to 64 processors. So when looking at the processor affinity bitmask, you should be aware that it belongs only to a specific group (by default, only a single group is used).
So next time you're checking how many processors are available to your process, remember to count the number of lit bits in your processor affinity bitmask.

 static void PrintAffinitizedProcessors()
 {
     // gets the number of affinitized processors in the
     // current processor group (up to 64 logical processors)
     Process currentProcess = Process.GetCurrentProcess();
     long affinityMask = (long)currentProcess.ProcessorAffinity;
 
     // an empty mask means "no explicit affinity", so treat it as all installed processors
     if (affinityMask == 0)
         affinityMask = (1L << Environment.ProcessorCount) - 1;
 
     const int BITS_IN_BYTE = 8;
     int numberOfBits = IntPtr.Size * BITS_IN_BYTE;
 
     int counter = 0;
 
     for (int i = 0; i < numberOfBits; i++)
     {
         if ((affinityMask >> i & 1) == 1)
         {
             Console.WriteLine(i);
             counter++;
         }
     }
 
     Console.WriteLine("Total: " + counter);
 }
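
For machines with more than 64 logical processors, the processor-group APIs mentioned above can tell us how many groups exist and how many logical processors are active overall. A sketch using P/Invoke (assuming the standard kernel32 signatures for GetActiveProcessorGroupCount/GetActiveProcessorCount, available since Windows 7 / Server 2008 R2, and that System.Runtime.InteropServices is imported):

 static class ProcessorGroups
 {
     private const ushort ALL_PROCESSOR_GROUPS = 0xFFFF;

     [DllImport("kernel32.dll")]
     private static extern ushort GetActiveProcessorGroupCount();

     [DllImport("kernel32.dll")]
     private static extern uint GetActiveProcessorCount(ushort groupNumber);

     public static void PrintProcessorGroups()
     {
         // number of processor groups on the machine
         Console.WriteLine("Groups: " + GetActiveProcessorGroupCount());

         // total active logical processors across all groups
         Console.WriteLine("Total logical processors: " +
             GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
     }
 }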

Tuesday, July 20, 2010

Monitor's Locking Primitive

Lately, a discussion in a C# user group raised the question "Which synchronization primitive does the CLR use when I call Monitor.Enter?". Is a Mutex created by the OS? Maybe an Event? Or perhaps it's a user-mode primitive such as a CriticalSection? Apparently there's some vagueness on the subject, so in this post I will demonstrate how we can find the answer using the tools available to us.
In general, the CLR's object synchronization ability is implemented by allocating a SyncBlock for every object that we attempt to lock. By looking at the object's header, the CLR can find the corresponding SyncBlock object that belongs to that object. It's the SyncBlock's responsibility to synchronize the locking requests for that object.
One needs to remember that these synchronization abilities are a feature of the CLR, in the sense that they are implemented in the CLR, and not necessarily in (or using) the operating system's synchronization primitives. This matches the notion that, "theoretically", a managed thread doesn't have to be based on an operating system kernel thread. So basically, no one can guarantee that the CLR will always use one synchronization primitive or another. Today this isn't the case, and in the meanwhile it doesn't seem like things are going to change in the near or far future.

After a quick review of the documentation available in the MSDN, one may get the impression that there's no real documentation about the underlying primitive being used. But since we are talking about a synchronization primitive, we can recall the IHostSyncManager interface that is exposed through the CLR's hosting ability. One of this interface's capabilities is the ability to replace the implementation of the synchronization primitive used by the Monitor class. This ability is exposed by the method CreateMonitorEvent.
Even at this stage, we can pay attention to what is said in the remarks paragraph:
CreateMonitorEvent returns an IHostAutoEvent  that the CLR uses in its implementation of the managed System.Threading.Monitor type. This method mirrors the Win32 CreateEvent function, with a value of false specified for the bManualReset parameter.
Even so, the key word here is "mirrors", so there isn't a true guarantee about what happens in the CLR's internal implementation. In order to verify the thick hint we've just received, we're going to have to pull out the big guns and use WinDbg.
For the purpose of the test, I wrote up an application that results in an endless contention:

static void Main()
{
      Thread t1 = new Thread(() => { lock ("A") { while (true);} });
      t1.Start();

      lock ("A") { while (true);}
}

While the application runs in the background, we can launch WinDbg and attach to the appropriate process.
After loading SOS, our first step will be to find the thread that lost the race to acquire the lock. To do so, we will print the managed stacks of all of our threads, and when we find a "suitable" stack trace, we'll move to its context:

>~*e!clrstack // execute !clrstack on all of the threads 
OS Thread Id: 0xf80 (0) Child SP IP       Call Site
0012f3ec 030300fe ConsoleApplication1.Program.Main(System.String[]) [...]
0012f648 791421db [GCFrame: 0012f648]
OS Thread Id: 0xf4c (1)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0x840 (2)
Child SP IP       Call Site
02ccfe68 7c90e514 [DebuggerU2MCatchHandlerFrame: 02ccfe68]
OS Thread Id: 0xbe0 (3)
Child SP IP       Call Site
0313f67c 7c90e514 [GCFrame: 0313f67c]
0313f794 7c90e514 [GCFrame: 0313f794]
0313f7b0 7c90e514 [HelperMethodFrame_1OBJ: 0313f7b0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
0313f808 79b2e0c4 System.Threading.Monitor.Enter(System.Object, Boolean ByRef)
0313f818 03030163 ConsoleApplication1.Program.<Main>b__3() [...]
0313f848 79b2ae5b System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
0313f858 79ab7ff4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
0313f87c 79ab7f34 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0313f898 79b2ade8 System.Threading.ThreadHelper.ThreadStart()
0313fabc 791421db [GCFrame: 0313fabc]
0313fd80 791421db [DebuggerU2MCatchHandlerFrame: 0313fd80]
OS Thread Id: 0xe60 (4)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process

>~3s // change the current thread to 3

After finding the correct thread, we will have to check its native stack, so we'll use the kb command, which will also display the first 3 parameters that were passed to each method:

>kb
ChildEBP RetAddr  Args to Child             
0313f3c8 7c90df4a 7c809590 00000001 0313f3f4 ntdll!KiFastSystemCallRet
0313f3cc 7c809590 00000001 0313f3f4 00000001 ntdll!ZwWaitForMultipleObjects+0xc
0313f468 791f516a 00000001 001820bc 00000000 KERNEL32!WaitForMultipleObjectsEx+0x12c
0313f4cc 791f4f98 00000001 001820bc 00000000 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x56
0313f4ec 791f4dd8 00000001 001820bc 00000000 clr!Thread::DoAppropriateAptStateWait+0x4d
0313f580 791f4e99 00000001 001820bc 00000000 clr!Thread::DoAppropriateWaitWorker+0x17d
0313f5ec 791f4f17 00000001 001820bc 00000000 clr!Thread::DoAppropriateWait+0x60
0313f640 7919d409 ffffffff 00000001 00000000 clr!CLREvent::WaitEx+0x106
0313f654 792e0160 ffffffff 00000001 00000000 clr!CLREvent::Wait+0x19
0313f6e4 792e0256 001818a0 ffffffff 8079c412 clr!AwareLock::EnterEpilogHelper+0xa8
0313f724 792e029b 001818a0 001818a0 79142c0d clr!AwareLock::EnterEpilog+0x42
0313f744 792c7729 8079cb36 0313f830 00b3c368 clr!AwareLock::Enter+0x5f
0313f800 79b2e0c4 79161f8e 00941f02 0313f840 clr!JIT_MonReliableEnter_Portable+0x104
0313f840 79b2ae5b 00b3c3ec 01b3101c 0313f86c mscorlib_ni+0x2ae0c4
0313f850 79ab7ff4 00b3e010 00000000 00b3c3b8 mscorlib_ni+0x2aae5b
0313f86c 79ab7f34 00000000 00b3c3b8 00000000 mscorlib_ni+0x237ff4
0313f88c 79b2ade8 00b3c3b8 00000000 001818a0 mscorlib_ni+0x237f34
0313f8a4 791421db 000001a7 0313fae0 0313f930 mscorlib_ni+0x2aade8
0313f8b4 79164a2a 0313f980 00000000 0313f950 clr!CallDescrWorker+0x33
0313f930 79164bcc 0313f980 00000000 0313f950 clr!CallDescrWorkerWithHandler+0x8e

At this point we can already see that the last thing the thread did before we interrupted it was call WaitForMultipleObjectsEx, where the first parameter was 1 and the second was 0x001820BC. Having this information, we can understand that we are waiting on a single Handle object, since the first parameter specifies the effective size of the array passed as the second parameter. So all we've got left to do is understand which object hides behind the Handle that was passed to the function.

>dp 0x001820BC 0x001820BC
001820bc  000006c8 // our handle's value
>!handle 000006c8 F // pass "F" as bitmask to display all of the relevant data
Handle 6c8
  Type             Event
  Attributes       0
  GrantedAccess    0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount      2
  PointerCount     4
  Name            
  Object Specific Information
    Event Type Auto Reset
    Event is Waiting


And so, this was our last step. We have confirmed that Monitor's synchronization primitive is in fact an Event object of the AutoReset type.
Whoever still wants to see the creation and usage of the Event in code can open up the sync.cpp file in the SSCLI's implementation, and see how a call to CLREvent::CreateMonitorEvent triggers a call to UnsafeCreateEvent (which is actually a typedef of the familiar CreateEvent function).

Even so, one has to remember that this is only a partial answer. As I mentioned at the beginning of this post, there's no guarantee that every call to Monitor.Enter will end up accessing some kernel object. In fact, in one of his posts, Joe Duffy makes sure to mention that in the CLR's implementation, when a thread encounters contention, it will attempt to spin a little before retrying to acquire the lock, without leaving user mode and waiting on a kernel object. So even though the CLR doesn't provide a full-blown implementation of a synchronization primitive, it may still provide some optimizations over the services supplied by the operating system (much like the CriticalSection type).