Tuesday, August 24, 2010

Reducing AutoResetEvent's Synchronization Overhead

One of the most common design patterns in multithreaded programming is the producer-consumer pattern. In the usual scenario, a consumer thread runs an infinite loop in which it waits on some handle until new data arrives from the producer threads. It then wakes up and processes the new data.
The overall idea can be expressed via this code:

AutoResetEvent m_event = new AutoResetEvent(false);

private void WorkCycles()
{
    while(true)
    {
        // wait for a signal
        m_event.WaitOne();

        // do work..
    }
}

Using the AutoResetEvent class seems very suitable for this kind of scenario, since it supplies exactly the behavior we're looking for: causing a thread to wait on a handle until a signal arrives. However, calls to WaitOne/Set come with a price, and it ain't cheap. The AutoResetEvent class is simply a managed-code wrapper around a Win32 event object, and it doesn't keep track of the event's internal state (whether it's currently signaled or not). Thus, every call on an AutoResetEvent results in a transition all the way into kernel mode, even in cases where we wouldn't have to pay that cost if we had tracked the event's state ourselves.
In some scenarios, especially those where producer threads pass data to the consumer at high frequency, most of the calls to Set() are redundant. If, for example, a producer passes new data every 100ms (each time calling Set), but it takes the consumer an entire second to process that data, then 9 out of every 10 calls to Set were unnecessary. Additionally, after the consumer finishes its processing, it has to call WaitOne again, only to find that the event was already set by a producer.
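To make the redundancy concrete: an AutoResetEvent is a gate, not a counter. Any number of Set calls made while the consumer is busy collapse into a single pending signal, so the extra kernel transitions buy us nothing. A minimal sketch (the class and variable names here are mine, not from the code above):

```csharp
using System;
using System.Threading;

class SignalCoalescing
{
    static void Main()
    {
        AutoResetEvent evt = new AutoResetEvent(false);

        // Ten producer signals arrive while the consumer is busy...
        for (int i = 0; i < 10; i++)
            evt.Set();

        // ...but they coalesce into one pending signal: the first
        // zero-timeout wait succeeds immediately, the second fails.
        Console.WriteLine(evt.WaitOne(0));   // True
        Console.WriteLine(evt.WaitOne(0));   // False
    }
}
```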

So, how expensive are those redundant calls to AutoResetEvent? The following benchmark should demonstrate:

while (true)
{
    AutoResetEvent eventObj = new AutoResetEvent(false);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 1000000; i++)
    {
        eventObj.Set();

        eventObj.WaitOne();
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
}

On my machine, the average result was 1035ms.
An entire second. That is a ridiculously large amount of time to spend completing those few lines of code, wasted entirely on a synchronization primitive that, at first glance, might look quite insignificant in a real application. Seemingly, we shouldn't spend more than a few milliseconds running that code, but as you can see, things are a little different in practice.

EconomicResetEvent
The solution to this problem is quite straightforward, since our goal is to reduce the number of redundant calls to Set/WaitOne to a minimum. How can we achieve that? One option is to use an alternative synchronization primitive that, for the sake of this post, will be named EconomicResetEvent. The purpose of this new primitive is to improve on AutoResetEvent's performance by keeping track of the event's state, thus avoiding unnecessary transitions into the kernel.
I'll demonstrate a possible implementation of such a primitive, and explain its behavior in more detail afterwards:

// Note: mild race conditions are possible when the worker thread enters or
// leaves WaitForWork while a producer calls SignalWork at just the right
// moment; the worst case is a single redundant call to Set/WaitOne.
public class EconomicResetEvent
{
    private volatile int m_eventState;
    private AutoResetEvent m_event;

    private Thread m_worker;

    private const int EVENT_SET     = 1;
    private const int EVENT_NOT_SET = 2;
    private const int EVENT_ON_WAIT = 3;

    public EconomicResetEvent(bool initialState, Thread workerThread)
    {
        m_event = new AutoResetEvent(initialState);
        m_eventState = initialState ? EVENT_SET : EVENT_NOT_SET;

        m_worker = workerThread;
    }

    public void WaitForWork()
    {
        verifyCaller();

        if (m_eventState == EVENT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_SET) == EVENT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_ON_WAIT, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            m_event.WaitOne();
        }
    }

    public void SignalWork()
    {
        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_SET, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_ON_WAIT && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_ON_WAIT) == EVENT_ON_WAIT)
        {
            m_event.Set();
        }
    }

    // [Conditional("DEBUG")]
    private void verifyCaller()
    {
        if (m_worker != Thread.CurrentThread)
        {
            string errMsg = string.Format(
                "Only the pre-defined worker thread may call WaitForWork " +
                "(Current: {0}, Worker: {1})", Thread.CurrentThread, m_worker);

            throw new SynchronizationLockException(errMsg);
        }
    }
}
 
I assume the question most of you are thinking about right now is "How fast is it?". Well, after running the same benchmark as before, the average result I got on my machine was only 9ms. That's an improvement of roughly 115x over using AutoResetEvent directly. And to think that all we did was avoid some calls into the internal kernel object.
The most important part of the implementation is the tracking of the internal kernel object's state (signaled/not signaled). The state itself is kept in m_eventState; this is how we know which calls to WaitForWork/SignalWork can avoid touching m_event (the AutoResetEvent).
Additionally, in order to make the load/store operations on m_eventState thread-safe, I've used a couple of CAS operations. Even though a CAS can be cheaper than a full-blown lock, it is still quite an expensive instruction, and it's best avoided when it isn't necessary. This is exactly why the double tests in the if statements are performed: each one first checks the state with a cheap volatile read, and only then attempts the expensive interlocked operation. This technique is usually referred to as TATAS (test-and-test-and-set).
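The same pattern can be seen in isolation in a minimal TATAS spinlock. This is a sketch of the general technique rather than code from the class above: the cheap volatile read filters out most of the attempts that would have failed the interlocked exchange anyway.

```csharp
#pragma warning disable 420 // passing a volatile field to an Interlocked API
using System.Threading;

class TatasSpinLock
{
    private volatile int m_lock; // 0 = free, 1 = held

    public void Enter()
    {
        while (true)
        {
            // First "test": a cheap volatile read, no bus locking.
            if (m_lock == 0 &&
                // Then "test-and-set": the expensive CAS, attempted
                // only when the lock looked free a moment ago.
                Interlocked.CompareExchange(ref m_lock, 1, 0) == 0)
            {
                return;
            }
        }
    }

    public void Exit()
    {
        m_lock = 0; // the volatile write releases the lock
    }
}
```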
Some mild race conditions could occur when the consumer thread is entering/leaving WaitForWork while a producer calls SignalWork at exactly the wrong moment. This race condition won't cause any functional harm; at worst, it results in one redundant call to Set/WaitOne. Resolving this "issue" would mean that we prefer to enforce fairness, and such enforcement could have a negative effect on performance.

Since we're already discussing the subject, it would be appropriate to mention that the Parallel Extensions library introduced a new synchronization primitive named ManualResetEventSlim, which supplies extended functionality and better performance than the standard event objects. Beyond keeping track of the internal kernel event's state, it also spins around the flag for a short while before actually blocking the thread, and it initializes the kernel object lazily (it's created only when it's actually needed). The good guys at Microsoft were kind enough to publish a short but informative document that demonstrates the performance differences between the new and old synchronization primitives under a selected group of usage scenarios.
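For reference, switching the consumer loop at the top of the post to ManualResetEventSlim could look roughly like the sketch below. Note that, unlike the auto-reset variants, the slim event must be reset explicitly; this naive Wait/Reset pairing can miss a signal that arrives between the two calls, so treat it as an illustration of the API rather than a drop-in replacement.

```csharp
using System.Threading;

class SlimConsumer
{
    private ManualResetEventSlim m_event = new ManualResetEventSlim(false);

    public void WorkCycles()
    {
        while (true)
        {
            // Wait spins briefly in user mode before falling back to a
            // lazily created kernel wait handle.
            m_event.Wait();
            m_event.Reset(); // manual-reset: we clear the signal ourselves

            // do work..
        }
    }

    public void SignalWork()
    {
        m_event.Set();
    }
}
```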

6 comments:

  1. Hi Liran,

    I get an error when trying to compile your code

    'EconomicResetEvent.m_eventState': a reference to a volatile field will not be treated as volatile

    ReplyDelete
    Replies
    1. @Anonymous,
That warning is expected and can be safely ignored.
      Usually it's an incredibly bad idea to pass a volatile field by reference, since it essentially disables its 'volatility'. However, there are some exceptions to this rule, such as when calling an Interlocked API. You can safely instruct the compiler to suppress this warning using '#pragma warning disable 420'.
      This behavior is also well documented in the warning's description on MSDN: http://msdn.microsoft.com/en-us/library/4bw5ewxy%28v=vs.100%29.aspx

      Thanks,
      Liran Chen

      Delete
  2. Hello,
    First, thanks for the code, it's very useful.
    I used your code to implement a handler thread with Send() and Post() methods. Both Send() and Post() just add a job to a queue and call SignalWork(), while the handler thread itself uses WaitForWork() and reads jobs from the queue. I found that in some not-too-heavy scenarios, sometimes (1 case in tens of thousands) the handler thread (WaitForWork()) blocks on m_event.WaitOne() while another thread calls Send() and expects the job to complete, waiting on another event. When this deadlock occurred I stopped the project in Visual Studio and confirmed that the handler thread was blocked on the event with a nonempty queue.
    After some research I found that a possible reason for the problem is that the 'volatile' keyword does not guarantee that you read the latest value. So when you do if(m_eventState==*) you may get an incorrect result.
    Since these checks (m_eventState==*) look like an optimization (there's no other need for them, since the Interlocked.* methods do the same check and are guaranteed to be atomic), I just removed those checks, leaving only the Interlocked calls. The problem now seems to be gone.
    Also, there's no need to use 'volatile' if you use Interlocked calls only, because all Interlocked methods treat their ref parameters in a special way to guarantee both reading the latest value and atomicity, so I just removed 'volatile'.

    Sergey.

    ReplyDelete
    Replies
    1. Actually, the one thing the 'volatile' keyword does guarantee is that the latest value will be read. For example, if someone would like to expose the internal state of the class, he could simply expose 'm_state' without having to use any kind of additional locking/interlocked mechanism, since the 'volatile' keyword is enough to guarantee that the most up-to-date value will be read.
      If you believe your deadlock is caused by the usage of this class, please provide the steps it takes to reproduce the issue.

      Thanks.

      Delete
    2. Liran,
      I'll try to reproduce the problem with your class in few days.
      For now, try to download simple test project which proves that 'volatile' doesn't guarantee that you read latest value: http://yadi.sk/d/X7DhBlHo0snRD
      The idea for this test was taken from this great article:
      http://www.albahari.com/threading/part4.aspx#_The_volatile_keyword

      By the way, I use Intel i7-920 processor, but this doesn't prevent 'volatile' from screwing up.

      Sergey.

      Delete
    3. @Sergey
      The reordering observed in the example is perfectly fine. A dependency has been created between the a,b and x,y variables, and such a dependency needs to be synchronized. In contrast, in a scenario like this:
      T1:
      flag = 0;
      while (flag == 0) { }

      T2:
      while (true)
      flag = 1;

      'flag' will need to be marked as 'volatile' in order to make sure that T1 will see its updated value (instead of reading it from a local register for instance).

      Delete