Sunday, August 29, 2010

Brain Teasing With Strings

And in today's post... a riddle.
Actually, it comes down to a short C# program that deals with string comparisons and demonstrates a series of "unexpected" behaviors. How "unexpected" are the results? Well, that depends on the reader's degree of experience with string comparisons.
Lately, this sort of brain-teaser has become quite popular on various forums and newsgroups, but no matter where these teasers are posted, they rarely come with a full, in-depth explanation of the "dark magic" going on behind the scenes that would account for the "weird" and "unexpected" behaviors being demonstrated. So basically, I'll attempt to make things a little clearer than they are now.
So without any further introduction, please tell me: what would be the output of the following program?

static void Main()
{
    char[] a = new char[0];
    string b = new string(a);
    string c = new string(new char[0]);
    string d = new string(new char[0]);

    Console.WriteLine(object.ReferenceEquals(a, b));
    Console.WriteLine(object.ReferenceEquals(b, c));
    Console.WriteLine(object.ReferenceEquals(c, d));
}

When this question is presented to a group of experienced developers, it usually results in a variety of different answers, each backed up with an explanation that "makes sense". The problem is that these explanations usually only "make sense"; they don't point to any hard proof that uncovers the mystery completely.
So, are you ready with your answers? Let's check the output:

False
True
True
"Hmm... that was interesting".
Let's see what we've got here. On the one hand, a isn't equal to b. This makes sense; after all, they are different instances of different types. On the other hand, we can see that b is equal to c and to d as well. This may appear a little confusing since, as we know, strings are supposed to be immutable, so according to the code we are "supposed" to create three different string instances (mind you, we aren't dealing with hardcoded, interned string literals here).
It appears that even though we've created three different arrays, we've only allocated a single string (the one that b, c and d reference).

Usually, when a char[] (or some other string-representing structure) is passed to a string constructor, a new string is allocated to represent the array's characters. However, as our brief program demonstrates, this isn't always the case.
To understand why this happens, we'll have to dig into the implementation of the string class, and more specifically, into its internal implementation in SString (the following code samples are taken from the SSCLI's implementation).
Looking at SString's internal constructor, we can see that the received char[] argument is checked for being NULL or having a length of zero. If so, the implementation completely "forgets" about that argument, and instead makes the new SString instance reference an interned version of the empty string (so the redundant allocation of a new, empty string is optimized away).
We can see this behavior in SString's Set and Clear functions (Set is invoked from the constructor).

void SString::Set(const WCHAR *string)
{
    if (string == NULL || *string == 0)
        Clear();
    else
    {
        Resize((COUNT_T) wcslen(string), REPRESENTATION_UNICODE);
        wcscpy_s(GetRawUnicode(), GetBufferSizeInCharIncludeNullChar(), string);
    }

    RETURN;
}

void SString::Clear()
{
    SetRepresentation(REPRESENTATION_EMPTY);

    if (IsImmutable())
    {
        // Use shared empty string rather than allocating a new buffer
        SBuffer::SetImmutable(s_EmptyBuffer, sizeof(s_EmptyBuffer));
    }
    else
    {
        SBuffer::TweakSize(sizeof(WCHAR));
        GetRawUnicode()[0] = 0;
    }

    RETURN;
}
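Back in managed code, we can verify the effect of this optimization directly. The following check (a minimal sketch; the result described here assumes the desktop CLR versions discussed above) should show that the string built from an empty char[] is the very same interned instance referenced by string.Empty:

static void Main()
{
    string fromEmptyArray = new string(new char[0]);

    // Expected to print True on the CLR versions discussed here, since the
    // constructor falls back to the interned empty string instead of
    // allocating a new instance.
    Console.WriteLine(object.ReferenceEquals(fromEmptyArray, string.Empty));
}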

Tuesday, August 24, 2010

Reducing AutoResetEvent's Synchronization Overhead

One of the most common design patterns in multithreaded programming is the producer-consumer. In the usual scenario, a consumer thread is spawned into an infinite loop, where it waits on some handle until new data arrives from the producer threads. Then it wakes up and processes the new data.
The overall idea can be expressed via this code:

AutoResetEvent m_event = new AutoResetEvent(false);

private void WorkCycles()
{
    while(true)
    {
        // wait for a signal
        m_event.WaitOne();

        // do work..
    }
}

Using the AutoResetEvent class seems very suitable for this kind of scenario, since it supplies exactly the behavior we're looking for: causing a thread to wait on a handle until a signal arrives. However, calls to WaitOne/Set come with a price, and it isn't cheap. The AutoResetEvent class is simply a managed-code wrapper around a Win32 event (WaitOne boils down to WaitForSingleObject), and it doesn't keep track of the event's state (whether it's currently signaled or not). Thus, every call on the AutoResetEvent results in diving all the way into kernel mode, even when "we don't have to" (that is, if we had tracked the event's state ourselves).
In some scenarios, especially those where producer threads pass data to the consumer at high frequencies, most of the calls to Set() are redundant. If, for example, a producer passes new data every 100ms (which results in a call to Set), but it takes the consumer an entire second to process that data, then we've called Set 9 more times than we had to. Additionally, after the consumer finishes its processing, it has to call WaitOne again, only to realize the event was already set by a producer.

So, how expensive are those redundant calls to AutoResetEvent? The following benchmark should demonstrate:

while (true)
{
    AutoResetEvent eventObj = new AutoResetEvent(false);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 1000000; i++)
    {
        eventObj.Set();

        eventObj.WaitOne();
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
}

On my machine, the average result was 1035ms.
A whole second; that's a ridiculously large amount of time for completing those few lines of code. We're wasting an entire second just on using a synchronization primitive that, in a real application, might look quite insignificant at first glance. Seemingly, we shouldn't spend more than a few milliseconds running that code, but as you can see, things are a little different in practice.

EconomicResetEvent
The solution to this problem is quite straightforward, since our goal is to reduce the number of redundant calls to Set/WaitOne to a minimum. How can we achieve that? One option is to use an alternative synchronization primitive, which for the sake of this post will be named EconomicResetEvent. The purpose of this new primitive is to improve AutoResetEvent's performance by keeping track of the event's state, thus avoiding unnecessary transitions into the kernel.
I'll demonstrate a possible implementation of such a primitive, and explain its behavior in more detail afterwards:

public class EconomicResetEvent
{
    private volatile int m_eventState;
    private AutoResetEvent m_event;

    private Thread m_worker;

    private const int EVENT_SET     = 1;
    private const int EVENT_NOT_SET = 2;
    private const int EVENT_ON_WAIT = 3;

    public EconomicResetEvent(bool initialState, Thread workerThread)
    {
        m_event = new AutoResetEvent(initialState);
        m_eventState = initialState ? EVENT_SET : EVENT_NOT_SET;

        m_worker = workerThread;
    }

    public void WaitForWork()
    {
        verifyCaller();

        if (m_eventState == EVENT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_SET) == EVENT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_ON_WAIT, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            m_event.WaitOne();
        }
    }

    public void SignalWork()
    {
        if (m_eventState == EVENT_NOT_SET && Interlocked.CompareExchange(
            ref m_eventState, EVENT_SET, EVENT_NOT_SET) == EVENT_NOT_SET)
        {
            return;
        }

        if (m_eventState == EVENT_ON_WAIT && Interlocked.CompareExchange(
            ref m_eventState, EVENT_NOT_SET, EVENT_ON_WAIT) == EVENT_ON_WAIT)
        {
            m_event.Set();
        }
    }

    // [Conditional("DEBUG")]
    private void verifyCaller()
    {
        if (m_worker != Thread.CurrentThread)
        {
            string errMsg = string.Format("Only the pre-defined Worker thread may
               call WaitOne (Current: {0}, Worker: {1})", Thread.CurrentThread, m_worker);

            throw new SynchronizationLockException(errMsg);
        }
    }
}
 
I assume the question most of you are thinking about right now is "How fast is it?". Well, after running the same benchmark as before, the average result I got on my machine was only 9ms. An outstanding improvement of roughly 11,500% over using AutoResetEvent directly, and all we did was avoid some calls to the internal kernel object.
The most important part of the implementation is tracking the state of the internal kernel object (signaled/not signaled). The state itself is kept in m_eventState; that's how we know which calls to WaitForWork/SignalWork can avoid touching m_event (the AutoResetEvent).
Additionally, in order to make the store/load operations on m_eventState thread-safe, I've used a couple of CAS operations. Even though a CAS can be cheaper than a "full-blown lock", it's still quite an expensive instruction, and it's best to avoid it when it isn't necessary. This is exactly why the double tests in the if statements are performed. This technique is usually referred to as TATAS (i.e., Test-and-Test-and-Set).
Some mild race conditions could occur when the consumer thread enters or leaves WaitForWork while a producer calls SignalWork at a very specific moment. These race conditions won't cause any functional harm; at worst, they result in a redundant call to Set/WaitOne. Resolving this "issue" would mean preferring to enforce fairness, and such enforcement could have negative effects on performance.
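For completeness, here is a minimal sketch of how the producer-consumer loop from the beginning of the post might use EconomicResetEvent instead of the raw AutoResetEvent (keeping in mind the single, pre-defined consumer thread that verifyCaller enforces):

EconomicResetEvent m_event; // constructed with initialState = false and the consumer thread

private void WorkCycles()
{
    while (true)
    {
        // wait for a signal; stays in user mode when work was already signaled
        m_event.WaitForWork();

        // do work..
    }
}

// called from any producer thread whenever new data is ready
private void OnNewData()
{
    m_event.SignalWork();
}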

Since we're already discussing the subject, it's worth mentioning that the Parallel Extensions library introduced a new synchronization primitive named ManualResetEventSlim, which supplies extended functionality and better performance than the standard event objects. Beyond keeping track of the internal kernel event's state, it also spins on the flag for a while before actually blocking the thread, and it uses lazy initialization to create the kernel object (it's created only when it's actually needed). The good guys at Microsoft were kind enough to publish a short but informative document that demonstrates the performance differences between the new and old synchronization primitives under a selected group of usage scenarios.
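Its basic API looks roughly like this (a minimal sketch; note that its manual-reset semantics differ from AutoResetEvent's, and the spin count used here is an arbitrary choice of mine):

// false = initially non-signaled, 1000 = spin iterations before blocking
ManualResetEventSlim slimEvent = new ManualResetEventSlim(false, 1000);

// waiting side: spins on an internal flag first, and only falls back to the
// lazily-created kernel event if it actually has to block
slimEvent.Wait();

// being a manual-reset event, it stays signaled until explicitly reset
slimEvent.Reset();

// signaling side
slimEvent.Set();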

Friday, August 20, 2010

The Case of Delayed ACKs and Nagle's Algorithm

Even though the negative consequences of combining Nagle's Algorithm with Delayed ACKs over TCP are well documented in the literature, this is still a common pitfall that many people tend to forget about. Since the negative effects of the combination appear only under certain preconditions, it could take a while until one correlates the application's "weird" behavior with the usage of those two algorithms. Only as the performance penalty grows more distinct does it become easier to identify its source.

Delayed ACKs
Since TCP guarantees that every packet sent on the network will arrive at its designated remote host, it needs some kind of acknowledgment from that remote host that the packet actually arrived. So every time a host receives a packet, it needs to send back an ACK message as an acknowledgment.
The problem with using dedicated ACK messages is that we could find ourselves "flooding" the network with messages that have nothing to do with our application (and contain almost no data at all), all of this just to say "Yes, I've received your packet" (combining the minimal sizes of the TCP and IP headers [even without the Ethernet header] already reaches 40 bytes, and for IPv6 that number increases to 60 bytes).
So in order to reduce that overhead, "Delayed ACKs" were defined. The idea is that instead of sending a dedicated ACK message for each received packet, we assume that our application (the receiver) is probably about to send "some" message (not necessarily a response message) back to the remote host that sent us the packet in the first place. This way, we can "piggyback" on the application's message and add the ACK for the previously received packet to its header. Hence, we can substantially reduce the amount of redundant data circulating in our network.
It's customary to use a 200ms delay for sending ACKs (the exact behavior of the timeout depends on the implementation: is a 200ms timer created when the socket opens, or only when the application actually has an ACK to send?).
In Windows, the default timeout is 200ms as well, but if you want to, you can change it by modifying the TcpDelAckTicks value in the registry, setting it anywhere between 0ms and 600ms (as an anecdote, the Host Requirements RFC forbids ACK delays greater than 500ms).
It's worth noting that according to the RFC, there's no need to send back an ACK for every packet we receive, since a single ACK message can acknowledge multiple received packets. If, for example, machine A sends 5 different messages at 10ms intervals to machine B, then the moment the first message arrives at machine B, it starts a 200ms timer so it will be reminded to send back an ACK for that first message (assuming this is a one-way interface, so machine B won't send back any applicative messages). Before that timer elapses, the machine will have received the other 4 messages, so if we're already going to send an ACK for the first message, we might as well make it acknowledge the entire set of 5 messages.
Additionally, receiving packets isn't the only trigger for sending an ACK message. For example, if our receive window (the maximum number of bytes we can receive without sending back an ACK message) fills up, we have to send back an ACK message immediately. Another trigger is the "send an ACK for every second packet" rule, which does exactly what its name implies. In Windows, that default (2) can be changed by modifying the TcpAckFrequency registry value.

Nagle's Algorithm
Even though it isn't directly related to Delayed ACKs, this algorithm attempts to resolve a similar problem that occurs on the sender's side. For the same reason we shouldn't flood the network with small ACK messages, we shouldn't send small application messages ourselves, due to the overhead of sending an entire header for a message whose body is very small.
According to the algorithm, the protocol may delay small "send" operations and buffer them together (thus a single packet can carry more than a couple of small, applicative messages).
Deciding when to stop buffering the data isn't arbitrary; it depends on the rate at which ACKs arrive from the remote host. The idea is that as long as we haven't received an ACK for the previously sent packet, there's no point in sending an additional one. So while we're waiting for that ACK, we buffer all the data we're about to send (within the limits of the MSS, of course). The moment we receive the ACK for the previous packet, all of the buffered data is combined and sent as a single packet.

Seemingly, both of the mentioned algorithms justify their existence, since each attempts to resolve a real problem. However, what happens when we combine them? On the one hand, both try to reduce the number of tinygrams sent from their side (either ACK messages or applicative messages). But on the other hand, under certain conditions, their combination may introduce a significant delay when sending messages on the network. The most obvious example is a one-way interface in which only one side sends messages and the other just receives them (without "answering"). In this case, even if the application continuously sends messages at a high rate, we can expect latencies of up to 200ms between the time a message was "sent" by the application and the time it was received by the recipient (since it was delayed by Nagle's algorithm, waiting for an ACK that the receiver delayed in turn). In other cases, even in two-way interfaces, there may be occasions in which the client or server stops sending messages for a few moments; this too can cause latencies of up to 200ms in the receive rate. In these kinds of cases, identifying the source of the latencies can be harder, since they tend to appear randomly and with latencies that aren't always constant (less than 200ms). The exact behavior is determined by the characteristics of the application.


In order to illustrate this behavior in code, I've written the following program, which measures the time it takes to send two messages that seemingly should be sent instantly, one right after the other.

void Server()
{
    Socket server = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    server.Bind(ServerEndPoint);
    server.Listen(1);
    Socket s = server.Accept();

    while (true)
    {
        // measure how long it takes to receive both messages
        Stopwatch stopwatch = Stopwatch.StartNew();

        s.Receive(new byte[8]);
        s.Receive(new byte[8]);

        // Output: around 200ms
        Console.WriteLine(stopwatch.ElapsedMilliseconds);
    }
}

void Client()
{
    Socket client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    client.Bind(ClientEndPoint);
    client.Connect(ServerEndPoint);

    while (true)
    {
        client.Send(new byte[8]); // will be sent immediately
        client.Send(new byte[8]); // delayed for 200ms

        // wait for an imaginary response
        client.Receive(new byte[0]);
    }
}

Due to the unpleasant effects that the combination of these algorithms might cause, the RFC states that TCP implementations that use Nagle's algorithm must also provide a way to disable it, so applicative messages won't be delayed by the protocol when they're sent. This ability is exposed through the TCP_NODELAY socket option, and in .Net its usage is wrapped by the Socket.NoDelay property.
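For example, disabling Nagle's algorithm for the client socket from the sample above takes a single line (shown here right after the socket is created; the property simply sets the TCP_NODELAY option on the underlying socket):

Socket client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// disable Nagle's algorithm: small sends go out immediately, instead of
// being buffered until the previously sent packet is acknowledged
client.NoDelay = true;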

Saturday, August 14, 2010

Don't Rely on Environment.ProcessorCount

One of the trickiest questions in multithreaded programming is "How many threads should I use in my application to get the best performance out of the available hardware?". The answer may vary since it depends on the characteristics of the application: is it CPU bound or IO bound? How much of the work can be parallelized? and so on. But at the bottom line, all of these "formulas" are based on the number of available processors.

Usually, when an application runs its initialization routine and gets ready to create a collection of worker threads, it checks how many processors are installed on the machine so it can decide exactly how many threads to create. To get the number of installed processors, you would usually call Environment.ProcessorCount. This property simply calls the Win32 GetSystemInfo function and returns the dwNumberOfProcessors field to the caller. The problem with this value is that it doesn't necessarily represent the number of processors available to our process. The user who launched our application might have given it a specific processor affinity, which causes our application's threads to execute only on the processors allowed by that affinity, instead of on all the installed processors. The problem in this case is that an application relying on ProcessorCount completely ignores the processor affinity and creates more threads than it should (in an even worse case, it could assign its threads affinities that point at processors that aren't even available to the process).
What eventually happens is that we have too many threads running on our subset of available processors, causing a performance penalty due to excessive context switching.
The scenario in which we set our process to run on only a subset of the processors is far from far-fetched. When we have a couple of applications that are each designed to take advantage of all the available processors, and they're meant to run in parallel on the same machine (without some kind of virtualization), we would probably want to split the installed processors between them, giving each process only a subset of processors to use. If the applications aren't aware of this subset, we'll find ourselves wasting valuable CPU cycles for no good reason.

In .Net, you can get the processor affinity via the Process.ProcessorAffinity property. It returns a bitmask in which each bit represents a logical CPU on which our process is allowed to execute (if the mask is set to 0, the scheduler decides which processors our application will use, so basically all of the processors are available).
Current versions of Windows provide support for more than 64 logical CPUs. In order to address all of those CPUs, they are divided into Groups, where each group can address up to 64 processors. So when looking at the processor affinity bitmask, you should be aware that it belongs only to a specific group (by default, only a single group is used).
So next time you're checking how many processors are available to your process, remember to count the number of lit bits in your processor affinity bitmask.

 static void PrintAffinitizedProcessors()
 {
     // gets the number of affinitized processors in the
     // current processor group (up to 64 logical processors)
     Process currentProcess = Process.GetCurrentProcess();
     long affinityMask = (long)currentProcess.ProcessorAffinity;
 
     if (affinityMask == 0)
         // a mask of 0 means all installed processors are available,
         // so build a mask with one bit set per installed processor
         affinityMask = ((long)Math.Pow(2, Environment.ProcessorCount)) - 1;
 
     const int BITS_IN_BYTE = 8;
     int numberOfBits = IntPtr.Size * BITS_IN_BYTE;
 
     int counter = 0;
 
     for (int i = 0; i < numberOfBits; i++)
     {
         if ((affinityMask >> i & 1) == 1)
         {
             Console.WriteLine(i);
             counter++;
         }
     }
 
     Console.WriteLine("Total: " + counter);
 }
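If you also need the total number of active logical processors across all processor groups on Windows 7 and later (the sample above only covers the current group), one option is to P/Invoke GetActiveProcessorCount. A minimal sketch, assuming you only need the raw count and not the per-group affinity masks:

 // requires: using System.Runtime.InteropServices;
 static class ProcessorGroups
 {
     private const ushort ALL_PROCESSOR_GROUPS = 0xFFFF;

     // available on Windows 7 / Windows Server 2008 R2 and later
     [DllImport("kernel32.dll")]
     private static extern uint GetActiveProcessorCount(ushort GroupNumber);

     public static uint TotalActiveProcessors()
     {
         // counts active logical processors across every processor group
         return GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
     }
 }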

Saturday, August 7, 2010

Accurately Measuring GC Suspensions

When you analyze the performance of a managed application and look for its main bottlenecks, one place you should always check is the amount of time your application spends in GC. To get that information, you can always run Perfmon and get a general picture of how the GC behaves in your application (most of the time, you'll be looking at the % Time in GC counter).
While looking at those data sets can give us a good overview of how the GC affects our application's performance, it just isn't enough for more in-depth insights into observed "freezes" in the application. For instance, even if we see that 90% of the time we spend only 2% of it in GC collections, that still doesn't mean that at some critical moment there wasn't a generation 2 collection that froze our application for about 200ms, which could cause serious damage in some scenarios.
The problem is that GC collections don't just cause the application to "run more slowly"; they bring the application to a complete stop for unknown amounts of time. So looking solely at the % Time in GC graph doesn't tell us for sure what the real damage of each collection was. Also, remember that Perfmon's highest sampling resolution is only 1 second, so if we suffered a single generation 2 collection, we might not even notice it as we should. More than that, it's almost impossible to make sure that Perfmon runs, gathers and records the correct data on every machine that runs your application, at any given time.

This is why we need a reliable way to monitor when the GC decides to perform a collection, and when it does, to check the collection's generation and the amount of time it took to complete.
Today, the System.GC class doesn't expose an accurate way to collect this data. However, by writing a custom CLR host, we can integrate our code with the CLR and receive the proper notifications that tell us exactly when each and every GC in the process starts and ends.
To do so, we will need to implement the IHostGCManager interface and use two of its three available callback functions. The first is SuspensionStarting, and the second is SuspensionEnding, which also passes us a single DWORD parameter representing the generation that was just collected. So as you might have already figured out, the CLR makes sure to call SuspensionStarting right before it starts the collection, and afterwards, once it's done, it calls SuspensionEnding.
One thing to pay attention to about SuspensionEnding is that the documentation available on MSDN isn't entirely accurate, and might confuse users about the possible triggers for this callback. This is what the documentation says today:

"...Notifies the host that the common language runtime (CLR) is resuming execution of tasks on threads that had been suspended for a garbage collection.
[parameter: generation] The garbage collection generation that is just finishing, from which the thread is resuming."
So according to the documentation, the callback is invoked only for suspensions caused by a GC. However, that isn't the only trigger. The CLR also invokes it when it resumes execution after other kinds of "stop-the-world" operations that it might perform (e.g., loading and unloading of AppDomains). If we look at the place in the SSCLI's implementation where that callback is invoked, we notice the following code:

if (pGCThreadControl)
{
    // If we the suspension was for a GC, tell the host what generation GC.
    DWORD   Generation = (bFinishedGC
        ? GCHeap::GetGCHeap()->GetCondemnedGeneration()
        : ~0U);

    pGCThreadControl->SuspensionEnding(Generation);
}

So we can see that for every suspension that wasn't caused by a GC, the generation parameter passed to the callback will contain the value UINT_MAX, so when implementing your CLR host, remember to check for this special value.

As for the measuring itself, we'll use the QueryPerformanceCounter function (in .Net, wrapped by the Stopwatch class) to achieve the highest possible time resolution for every collection.
Since most of the time the collections we encounter will be very short ones (mostly ephemeral collections that may take only a few milliseconds each), we'll likely want to avoid spending time recording the data (and so avoid unnecessary IO). In such a case, it can be useful to use a logging framework that filters collections according to their severity (e.g., Debug for very short collections, Info for more notable ones, and Warn for lengthy collections that might indicate a problem). After attaching an appender that writes all of our logs to the console window and running an application that constantly allocates memory, we would see a console line per collection, tagged with the appropriate severity.


As a reference, I'm including a sample CLR host that monitors collections and writes their durations to the console:


#include <windows.h>
#include <mscoree.h>
#include <tchar.h>
#include <cassert>
#include <iostream>

using namespace std;

#define APP_STARTUP_EXE L"TestApplication.exe"
#define APP_ENTRY_TYPE L"SomeNamespace.Program"
#define APP_ENTRY_METHOD L"Main"

class MyCLRHost : public IHostControl, public IHostGCManager
{
private:
    LONG m_refCount;
    LARGE_INTEGER m_lastGCStart;
    LARGE_INTEGER m_frequency;

public:
    MyCLRHost() { QueryPerformanceFrequency(&m_frequency); }

    // IHostControl
    HRESULT __stdcall GetHostManager(REFIID riid, void** ppObject)
    {
        if(riid == IID_IHostGCManager)
        {
            *ppObject = static_cast<IHostGCManager*>(this);
            return S_OK;
        }

        *ppObject = NULL;
        return E_NOINTERFACE;
    }

    // IUnknown
    HRESULT __stdcall QueryInterface(REFIID riid, void** ppvObject)
    {
        if (riid == IID_IHostGCManager)
        {
            *ppvObject = static_cast<IHostGCManager*>(this);
            return S_OK;
        }

        *ppvObject = NULL;
        return E_NOINTERFACE;
    }

    HRESULT __stdcall SetAppDomainManager(DWORD appDomain, IUnknown* domainManager)
    {
        return S_OK;
    }

    ULONG __stdcall AddRef() { return InterlockedIncrement(&m_refCount); }
    ULONG __stdcall Release() { return InterlockedDecrement(&m_refCount); }

    // IHostGCManager
    HRESULT __stdcall ThreadIsBlockingForSuspension() { return S_OK; }

    HRESULT __stdcall SuspensionStarting()
    {
        QueryPerformanceCounter(&m_lastGCStart);

        return S_OK;
    }

    HRESULT __stdcall SuspensionEnding(DWORD gen)
    {
        LARGE_INTEGER gcEnd;
        QueryPerformanceCounter(&gcEnd);
        double duration = ((gcEnd.QuadPart - m_lastGCStart.QuadPart))
            * 1000.0 / (double)m_frequency.QuadPart;

        if(gen != UINT_MAX)
            cout<<"GC generation "<<<" ended: "<<<"ms"<
        else
            cout<<"CLR suspension ended: "<<<" ms"<

        return S_OK;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    ICLRRuntimeHost* pCLR;
    DWORD startupFlags = STARTUP_CONCURRENT_GC;
    HRESULT hr = CorBindToRuntimeEx(L"v2.0.50727", L"wks", startupFlags,
        CLSID_CLRRuntimeHost, IID_ICLRRuntimeHost, (LPVOID*)&pCLR);
    assert(SUCCEEDED(hr));

    MyCLRHost customHost;
    hr = pCLR->SetHostControl(&customHost);
    assert(SUCCEEDED(hr));

    hr = pCLR->Start();
    assert(SUCCEEDED(hr));

    DWORD retcode;
    hr = pCLR->ExecuteInDefaultAppDomain(APP_STARTUP_EXE,
        APP_ENTRY_TYPE, APP_ENTRY_METHOD, L"" , &retcode);
    assert(SUCCEEDED(hr));

    return 0;
};

Tuesday, August 3, 2010

Forcing JIT Compilation During Runtime

One of the advantages/disadvantages of the .Net Framework is its usage of JIT Compilation: The process in which machine code is generated from CIL code during the application's run-time, at the production environment.
Depending on your point of view, this behavior can be considered a helpful feature, since during JIT compilation the compiler can utilize special features that the production processor supports (such as more efficient instructions that aren't available on other processors). This way, we can achieve better performance than if we didn't use JIT compilation at all, in which case we would have to target a more basic set of processor features (in order to support a variety of processors).
The most dominant disadvantage of using a JIT compiler is the time spent compiling CIL code on the production machine (every time a managed method is called for the first time, the JIT compiler kicks in and compiles the method for the current and all future calls).
In the CLR, every MethodDesc instance (a structure that exists for every managed method, contains some metadata about it, and is allocated in the EE's (Execution Engine) memory) contains an IsJitted boolean field. As the field name implies, its value tells us whether the JIT compiler has already compiled the relevant method (this field will assist us later on).
When a method is called for the first time, the execution flow is redirected by a special stub into the JIT, which is responsible for compiling the method. Then, the stub that caused the redirection is overwritten with a jmp instruction that directs the execution flow to the JIT's generated machine code (in fact, this mechanism resembles the delay-load feature available in C++). So as you can see, each time a method is called for the first time, we suffer a certain performance penalty.
It's important to emphasize that usually, the JIT's performance penalty is quite minimal, since we only experience it at the method's first invocation. However, for applications that demand high responsiveness, this penalty might be intolerable.

The classic solution for this problem is to use NGen. The idea behind NGen is that when your application is installed on the production machine, the JIT is triggered to compile your entire set of managed assemblies and generate native images for the application. Afterwards, when the application is launched, the CLR makes sure to load the correct native image from disk, and so avoids unnecessary "jitting".
The problem with NGen is that it's quite complicated to use, and if we really want to avoid performance penalties during our application's initialization, we'll have to spend more of our valuable time registering our assemblies in the GAC, setting our DLLs' base addresses correctly to avoid rebasing, and so on.
However, there is an alternative to NGen: triggering the JIT to compile your code during runtime, exactly when you desire it. This solution has its own set of disadvantages and isn't necessarily better than using NGen, but if you want to reduce jitting costs with minimal effort on your part, it's a solution you will definitely want to consider.

So, heading right down to business, I'll explain how this works from the bottom up: first we'll see how we can trigger the JIT compiler ourselves in a noninvasive fashion (without actually executing the method), and then I'll explain how you can utilize this ability in your own existing applications.
Our first step, then, is to learn how to trigger the JIT compiler on a single, specific method. For that, we'll use the RuntimeHelpers.PrepareMethod function. Normally we would use this method when we need to call virtual functions inside a CER (constrained execution region), but what's more interesting in our case is that it causes the JIT to compile the method passed to it (remark: calling PrepareMethod may trigger the static constructor of the target type, if it has one).
Since in this demonstration we'll want to JIT-compile the entire set of methods contained in a given assembly, we'll use a utility method that iterates over the types contained in the assembly and invokes PrepareMethod on every relevant method:

public static void PreJITMethods(Assembly assembly)
{
    Type[] types = assembly.GetTypes();
    foreach (Type curType in types)
    {
        MethodInfo[] methods = curType.GetMethods(
                BindingFlags.DeclaredOnly |
                BindingFlags.NonPublic |
                BindingFlags.Public |
                BindingFlags.Instance |
                BindingFlags.Static);

        foreach (MethodInfo curMethod in methods)
        {
            if (curMethod.IsAbstract ||
                curMethod.ContainsGenericParameters)
                continue;

            RuntimeHelpers.PrepareMethod(curMethod.MethodHandle);
        }
    }
}

One important thing about this method is the placement of the if statement right before the invocation of PrepareMethod. The first part of the condition is obvious: we have to ignore abstract methods since they don't contain any implementation, so the JIT compiler has nothing to do with them (attempting to pass them to PrepareMethod will result in an exception). The second part skips generic methods: we can't make the JIT compile a generic method, since at this stage we still don't know which type arguments are used at the different call sites in the application's code. This is in fact a disadvantage compared to NGen, which is aware of which "versions" of the generic method need to be compiled.
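That said, if you do know ahead of time which instantiations your application actually uses, PrepareMethod has an overload that accepts an array of type handles describing the instantiation. A minimal sketch (the SomeType class and its Process<T> method here are purely illustrative):

// assuming a generic method such as:
// public static void Process<T>(T item) { ... }
MethodInfo genericMethod = typeof(SomeType).GetMethod("Process");

// prepare the T = int instantiation
RuntimeHelpers.PrepareMethod(
    genericMethod.MethodHandle,
    new[] { typeof(int).TypeHandle });

// prepare the T = string instantiation
RuntimeHelpers.PrepareMethod(
    genericMethod.MethodHandle,
    new[] { typeof(string).TypeHandle });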

The next step is to integrate this utility method into our application's initialization routine. When doing so, there are a few things to consider: would we rather trigger the code generation once, right after our application finishes loading? Or perhaps every time a new assembly is loaded into the AppDomain? Do we prefer executing this code synchronously, or asynchronously? What about dedicating a thread with a lower priority than our other applicative threads? These questions will have to be answered according to your own preferences and your application's needs.
Another thing to remember is that by default, the CLR delays loading a referenced assembly until the first time we actually use a type from it in our code (so if we have a reference to a certain assembly but never reach a point where we need it, it won't be loaded into the AppDomain). So if we want to trigger the JIT compiler on all of the statically referenced assemblies, we'll have to load them explicitly. To do so, we can recursively call GetReferencedAssemblies and Load. For example:

// recursively load all of assemblies referenced by the given assembly
public static void ForceLoadAll(Assembly assembly)
{
    ForceLoadAll(assembly, new HashSet<Assembly>());
}

private static void ForceLoadAll(Assembly assembly,
                                 HashSet<Assembly> loadedAssmblies)
{
    bool alreadyLoaded = !loadedAssmblies.Add(assembly);
    if (alreadyLoaded)
        return;

    AssemblyName[] refrencedAssemblies =
        assembly.GetReferencedAssemblies();

    foreach (AssemblyName curAssemblyName in refrencedAssemblies)
    {
        Assembly nextAssembly = Assembly.Load(curAssemblyName);
        if (nextAssembly.GlobalAssemblyCache)
            continue;

        ForceLoadAll(nextAssembly, loadedAssmblies);
    }
}

In this example, I've chosen to ignore assemblies that are loaded from the GAC, so we won't compile the BCL's massive codebase (thus, minimizing our working set size). As you can understand, this method can be customized to fit your specific needs regarding which assemblies to load, and which to ignore.
Obviously, this method will only load assemblies that are statically referenced by our application. In order to trigger JIT compilation on dynamically loaded assemblies (loaded via Reflection, for instance), you may want to subscribe to the AssemblyLoad event, which is fired whenever a new assembly is loaded into the AppDomain.
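A minimal sketch of such a subscription, wiring it up to the PreJITMethods helper from above (whether to do the work synchronously inside the event handler is, again, up to you):

AppDomain.CurrentDomain.AssemblyLoad += (sender, args) =>
{
    // skip GAC assemblies here as well, to stay consistent with ForceLoadAll
    if (!args.LoadedAssembly.GlobalAssemblyCache)
        PreJITMethods(args.LoadedAssembly);
};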

In case you're interested in positively confirming that a certain method was JIT-compiled by PrepareMethod (or by some other means), you'll need to grab WinDbg and SOS.
All you need to do is find the address of your method's MethodDesc instance and check the value of its IsJitted field. Here is a typical example of how you would do so:

> !name2ee OtherAssmebly.dll C.ClassC
Module: 00a457b8 (OtherAssmebly.dll)
Token: 0x02000002
MethodTable: 00a48bbc
EEClass: 00f283d4
Name: C.ClassC

> !dumpmt -md 00a48bbc
EEClass: 00f283d4
Module: 00a457b8
Name: C.ClassC
mdToken: 02000002  (C:\...\OtherAssmebly.dll)
BaseSize: 0xc
ComponentSize: 0x0
Number of IFaces in IFaceMap: 0
Slots in VTable: 6
--------------------------------------
MethodDesc Table
   Entry MethodDesc      JIT Name
00a4c838   00a48bb0     NONE C.ClassC..ctor()
00a4c830   00a48ba0      JIT C.ClassC.FooMethod()

> !dumpmd 00a48ba0
Method Name: C.ClassC.FooMethod()
Class: 00f283d4
MethodTable: 00a48bbc
mdToken: 06000001
Module: 00a457b8
IsJitted: yes
CodeAddr: 00f50778