Liran Chen's Blog: CLR Hosting

Saturday, August 7, 2010

Accurately Measuring GC Suspensions

When you analyze the performance of a managed application and look for its main bottlenecks, one place you should always check is the amount of time the application spends in GC. To get that information, you can always run Perfmon and get a general picture of how the GC behaves in your application (most of the time, you'll be looking at the % Time in GC counter).
While looking at those data sets can give us a good overview of how the GC affects our application's performance, it isn't enough to get in-depth insight into observed "freezes" in the application. For instance, even if we see that 90% of the time we spend only 2% in GC, it still doesn't rule out that at some critical moment a generation 2 collection froze the application for about 200ms, which could cause serious damage in some scenarios.
The problem is that a collection doesn't just cause the application to "run more slowly"; it brings the application to a complete stop for an unknown amount of time. So looking solely at the % Time in GC graph doesn't tell us for sure what the real damage of each collection was. Also, remember that perfmon's highest sampling resolution is only 1 second, so if we've suffered from a single generation 2 collection, we might not even notice it as we should. More than that, it's almost impossible to make sure that perfmon runs, gathers and records the correct data on every machine that runs your application at any given time.
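To see why an average hides the freezes, here is a small sketch in portable C++. It assumes a deliberately simplified model in which the percentage is just the total pause time divided by the length of the observation window, which is not how perfmon actually computes its counter, and the function names are made up for the illustration.

```cpp
#include <numeric>
#include <vector>

// Simplified model (an assumption, not perfmon's real formula):
// share of an observation window that was spent paused, in percent.
double percentTimeInGc(const std::vector<double>& pausesMs, double windowMs)
{
    const double totalPauseMs =
        std::accumulate(pausesMs.begin(), pausesMs.end(), 0.0);
    return 100.0 * totalPauseMs / windowMs;
}

// Longest single pause in the window: the number the average hides.
double worstPauseMs(const std::vector<double>& pausesMs)
{
    double worst = 0.0;
    for (double pause : pausesMs)
        if (pause > worst)
            worst = pause;
    return worst;
}
```

Under this model, one hundred 2ms pauses and a single 200ms pause in a 10-second window both come out as 2%, even though the worst freeze differs by a factor of 100.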

This is why we need a reliable way to monitor when the GC decides to perform a collection, and when it does, check the collection's generation, and the amount of time it takes to complete it.
Today, the System.GC class doesn't expose an accurate way to collect this data. However, by writing a custom CLR host, we can integrate our code with the CLR and receive notifications that tell us exactly when each and every GC starts and ends in the process.
To do so, we need to implement the IHostGCManager interface and use two of its three callback functions. The first is SuspensionStarting, and the second is SuspensionEnding, which also receives a single DWORD parameter that represents the generation that was just collected. As you might have already figured out, the CLR calls SuspensionStarting right before it starts the collection, and afterwards it calls SuspensionEnding.
One thing to note about SuspensionEnding is that the documentation available on MSDN isn't entirely accurate, and might confuse users about which events can trigger the callback. This is what the documentation says today:

"...Notifies the host that the common language runtime (CLR) is resuming execution of tasks on threads that had been suspended for a garbage collection.
[parameter: generation] The garbage collection generation that is just finishing, from which the thread is resuming."

So according to the documentation, the callback is invoked only after freezes caused by a GC. However, that isn't the only trigger. The CLR also invokes it when it resumes execution after other kinds of "stop-the-world" operations that it might perform (e.g., loading and unloading of AppDomains). If we look at where the SSCLI's implementation invokes the callback, we find the following code:

if (pGCThreadControl)
{
    // If we the suspension was for a GC, tell the host what generation GC.
    DWORD   Generation = (bFinishedGC
        ? GCHeap::GetGCHeap()->GetCondemnedGeneration()
        : ~0U);


    pGCThreadControl->SuspensionEnding(Generation);
}

So we can see that for every freeze that wasn't caused by a GC, the generation parameter passed to the callback contains the value UINT_MAX (that is, ~0U). When implementing your CLR host, remember to check for this special value.
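A host might isolate that check in a small helper. The sketch below is plain C++ that uses std::uint32_t as a stand-in for DWORD so that it compiles anywhere; describeSuspension and kNotAGc are hypothetical names chosen for the illustration, not part of the hosting API.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Value passed when the suspension was not caused by a GC (~0U).
constexpr std::uint32_t kNotAGc = std::numeric_limits<std::uint32_t>::max();

// Hypothetical helper: turns the raw generation argument into a label.
std::string describeSuspension(std::uint32_t generation)
{
    if (generation == kNotAGc)
        return "non-GC suspension";
    return "gen " + std::to_string(generation) + " GC";
}
```

Keeping the sentinel comparison in one place makes it harder to accidentally report a "generation 4294967295" collection.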

As for the measurement itself, we'll use the QueryPerformanceCounter function (in .NET, wrapped by the Stopwatch class) to achieve the highest possible time resolution for every collection.
Since most of the collections we'll encounter will be very short (mostly ephemeral collections that take only a few milliseconds each), we'll likely want to avoid spending time recording all of them (and so avoid unnecessary IO). In that case it can be useful to use a logging framework that filters collections according to their severity (e.g., Debug for very short collections, Info for more notable collections, and Warn for lengthy collections that might indicate a problem). After attaching an appender that writes all of our logs to the console window, and running an application that constantly allocates memory, the console shows one line per suspension together with its duration.

As a reference, I'm including a sample CLR Host that monitors collections, and writes to the console their duration:

#include <windows.h>
#include <tchar.h>
#include <mscoree.h>
#include <assert.h>
#include <limits.h>
#include <iostream>


using namespace std;


#define APP_STARTUP_EXE L"TestApplication.exe"
#define APP_ENTRY_TYPE L"SomeNamespace.Program"
#define APP_ENTRY_METHOD L"Main"


class MyCLRHost : public IHostControl, public IHostGCManager
{
private:
    LONG m_refCount;
    LARGE_INTEGER m_lastGCStart;
    LARGE_INTEGER m_frequency;


public:
    MyCLRHost() : m_refCount(0)
    {
        m_lastGCStart.QuadPart = 0;
        QueryPerformanceFrequency(&m_frequency);
    }


    // IHostControl
    HRESULT __stdcall GetHostManager(REFIID riid, void** ppObject)
    {
        if(riid == IID_IHostGCManager)
        {
            *ppObject = static_cast<IHostGCManager*>(this);
            return S_OK;
        }


        *ppObject = NULL;
        return E_NOINTERFACE;
    }


    // IUnknown
    HRESULT __stdcall QueryInterface(REFIID riid, void** ppvObject)
    {
        if (riid == IID_IHostGCManager)
        {
            *ppvObject = static_cast<IHostGCManager*>(this);
            return S_OK;
        }


        *ppvObject = NULL;
        return E_NOINTERFACE;
    }


    HRESULT __stdcall SetAppDomainManager(DWORD appDomain, IUnknown* domainManager)
    { 
        return S_OK;
    }


    ULONG __stdcall AddRef() { return InterlockedIncrement(&m_refCount); }
    ULONG __stdcall Release() { return InterlockedDecrement(&m_refCount); }


    // IHostGCManager
    HRESULT __stdcall ThreadIsBlockingForSuspension() { return S_OK; }


    HRESULT __stdcall SuspensionStarting()
    {
        // Record the moment the runtime starts suspending threads.
        QueryPerformanceCounter(&m_lastGCStart);


        return S_OK;
    }


    HRESULT __stdcall SuspensionEnding(DWORD gen)
    {
        LARGE_INTEGER gcEnd;
        QueryPerformanceCounter(&gcEnd);
        double duration = ((gcEnd.QuadPart - m_lastGCStart.QuadPart)) 
            * 1000.0 / (double)m_frequency.QuadPart;


        if(gen != UINT_MAX)
            cout<<"GC generation "<<gen<<" ended: "<<duration<<"ms"<<endl;
        else
            cout<<"CLR suspension ended: "<<duration<<" ms"<<endl;


        return S_OK;
    }
};


int _tmain(int argc, _TCHAR* argv[])
{
    ICLRRuntimeHost* pCLR;
    DWORD startupFlags = STARTUP_CONCURRENT_GC;
    HRESULT hr = CorBindToRuntimeEx(L"v2.0.50727", L"wks", startupFlags,
        CLSID_CLRRuntimeHost, IID_ICLRRuntimeHost, (LPVOID*)&pCLR);
    assert(SUCCEEDED(hr));


    MyCLRHost customHost;
    hr = pCLR->SetHostControl(&customHost);
    assert(SUCCEEDED(hr));


    hr = pCLR->Start();
    assert(SUCCEEDED(hr));


    DWORD retcode;
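    // ExecuteInDefaultAppDomain requires the target method to have the
    // signature: static int Method(string argument).
    hr = pCLR->ExecuteInDefaultAppDomain(APP_STARTUP_EXE,
        APP_ENTRY_TYPE, APP_ENTRY_METHOD, L"" , &retcode);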
    assert(SUCCEEDED(hr));


    return 0;
}
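
The severity filtering suggested earlier could be layered on top of the measured duration. The following is only a sketch in portable C++: the thresholds are illustrative assumptions rather than values from this post, and Severity, classifyPause and shouldLog are hypothetical names, not part of any logging framework.

```cpp
// Severity levels ordered from least to most important.
enum class Severity { Debug, Info, Warn };

// Illustrative thresholds (assumptions): tune them per application.
constexpr double kInfoThresholdMs = 10.0;
constexpr double kWarnThresholdMs = 100.0;

// Maps a suspension duration to a severity level.
Severity classifyPause(double durationMs)
{
    if (durationMs >= kWarnThresholdMs)
        return Severity::Warn;
    if (durationMs >= kInfoThresholdMs)
        return Severity::Info;
    return Severity::Debug;
}

// True when a pause is severe enough to be written at the configured level.
bool shouldLog(double durationMs, Severity minimumLevel)
{
    return static_cast<int>(classifyPause(durationMs))
        >= static_cast<int>(minimumLevel);
}
```

With the minimum level set to Info, the many short ephemeral collections are dropped before any IO happens, and only the longer pauses reach the log.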

Tuesday, July 20, 2010

Monitor's Locking Primitive

Lately, a discussion in a C# user group raised the question: "Which synchronization primitive does the CLR use when I call Monitor.Enter?" Is a Mutex created by the OS? Maybe an Event? Or perhaps it's a user-mode primitive such as a CriticalSection? Apparently there's some vagueness on the subject, so in this post I will demonstrate how we can find the answer using the tools available to us.
In general, the CLR implements object synchronization by allocating a SyncBlock for every object that we attempt to lock. By looking at the object's header, the CLR can find the SyncBlock that belongs to that object. It's the SyncBlock's responsibility to synchronize the locking requests to that object.
One needs to remember that these synchronization abilities are a feature of the CLR, in the sense that they are implemented in the CLR, and not necessarily in (or using) the operating system's synchronization primitives. This matches the idea that, "theoretically", a managed thread doesn't have to be based on an operating system kernel thread. So basically, no one can guarantee that the CLR will always use one synchronization primitive or another. Today that theoretical separation isn't exercised in practice, and it doesn't seem like things are going to change in the near or far future.

After a quick review of the documentation available on MSDN, one might get the impression that the underlying primitive isn't really documented. But since we are talking about a synchronization primitive, we can recall the IHostSyncManager interface that is exposed by the CLR's hosting API. One of this interface's capabilities is replacing the implementation of the synchronization primitive used by the Monitor class. This ability is exposed by the CreateMonitorEvent method.
Even at this stage, it's worth paying attention to what the remarks section says:

CreateMonitorEvent returns an IHostAutoEvent that the CLR uses in its implementation of the managed System.Threading.Monitor type. This method mirrors the Win32 CreateEvent function, with a value of false specified for the bManualReset parameter.

Still, the keyword here is "mirrors", so there's no real guarantee about what happens in the CLR's internal implementation. To verify the strong hint we've just received, we'll have to pull out the big guns and use WinDbg.
For the test, I wrote an application that results in endless contention:

static void Main()
{
      Thread t1 = new Thread(() => { lock ("A") { while (true);} });
      t1.Start();

      lock ("A") { while (true);}
}

Once the application is running in the background, we can launch WinDbg and attach to the appropriate process.
After loading SOS, our first step is to find the thread that lost the race for the lock. To do so, we print the managed stacks of all of our threads, and when we find a "suitable" stack trace, we switch to its context:

>~*e!clrstack // execute !clrstack on all of the threads
OS Thread Id: 0xf80 (0)
Child SP IP       Call Site
0012f3ec 030300fe ConsoleApplication1.Program.Main(System.String[]) [...]
0012f648 791421db [GCFrame: 0012f648]
OS Thread Id: 0xf4c (1)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0x840 (2)
Child SP IP       Call Site
02ccfe68 7c90e514 [DebuggerU2MCatchHandlerFrame: 02ccfe68]
OS Thread Id: 0xbe0 (3)
Child SP IP       Call Site
0313f67c 7c90e514 [GCFrame: 0313f67c]
0313f794 7c90e514 [GCFrame: 0313f794]
0313f7b0 7c90e514 [HelperMethodFrame_1OBJ: 0313f7b0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
0313f808 79b2e0c4 System.Threading.Monitor.Enter(System.Object, Boolean ByRef)
0313f818 03030163 ConsoleApplication1.Program.<Main>b__3() [...]
0313f848 79b2ae5b System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
0313f858 79ab7ff4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
0313f87c 79ab7f34 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0313f898 79b2ade8 System.Threading.ThreadHelper.ThreadStart()
0313fabc 791421db [GCFrame: 0313fabc]
0313fd80 791421db [DebuggerU2MCatchHandlerFrame: 0313fd80]
OS Thread Id: 0xe60 (4)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process

>~3s // change the current thread to 3

After finding the correct thread, we need to check its native stack, so we'll use the kb command, which also displays the first 3 parameters passed to each function:

>kb
ChildEBP RetAddr Args to Child
0313f3c8 7c90df4a 7c809590 00000001 0313f3f4 ntdll!KiFastSystemCallRet
0313f3cc 7c809590 00000001 0313f3f4 00000001 ntdll!ZwWaitForMultipleObjects+0xc
0313f468 791f516a 00000001 001820bc 00000000 KERNEL32!WaitForMultipleObjectsEx+0x12c
0313f4cc 791f4f98 00000001 001820bc 00000000 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x56
0313f4ec 791f4dd8 00000001 001820bc 00000000 clr!Thread::DoAppropriateAptStateWait+0x4d
0313f580 791f4e99 00000001 001820bc 00000000 clr!Thread::DoAppropriateWaitWorker+0x17d
0313f5ec 791f4f17 00000001 001820bc 00000000 clr!Thread::DoAppropriateWait+0x60
0313f640 7919d409 ffffffff 00000001 00000000 clr!CLREvent::WaitEx+0x106
0313f654 792e0160 ffffffff 00000001 00000000 clr!CLREvent::Wait+0x19
0313f6e4 792e0256 001818a0 ffffffff 8079c412 clr!AwareLock::EnterEpilogHelper+0xa8
0313f724 792e029b 001818a0 001818a0 79142c0d clr!AwareLock::EnterEpilog+0x42
0313f744 792c7729 8079cb36 0313f830 00b3c368 clr!AwareLock::Enter+0x5f
0313f800 79b2e0c4 79161f8e 00941f02 0313f840 clr!JIT_MonReliableEnter_Portable+0x104
0313f840 79b2ae5b 00b3c3ec 01b3101c 0313f86c mscorlib_ni+0x2ae0c4
0313f850 79ab7ff4 00b3e010 00000000 00b3c3b8 mscorlib_ni+0x2aae5b
0313f86c 79ab7f34 00000000 00b3c3b8 00000000 mscorlib_ni+0x237ff4
0313f88c 79b2ade8 00b3c3b8 00000000 001818a0 mscorlib_ni+0x237f34
0313f8a4 791421db 000001a7 0313fae0 0313f930 mscorlib_ni+0x2aade8
0313f8b4 79164a2a 0313f980 00000000 0313f950 clr!CallDescrWorker+0x33
0313f930 79164bcc 0313f980 00000000 0313f950 clr!CallDescrWorkerWithHandler+0x8e

At this point we can already see that the last thing the thread did before we interrupted it was to call WaitForMultipleObjectsEx, where the first parameter was 1 and the second was 0x001820BC. From this we can tell that we are waiting on a single handle, since the first parameter specifies the effective size of the array passed as the second parameter. So all that's left is to find out which object hides behind the handle that was passed to the function.

>dp 0x001820BC 0x001820BC
001820bc 000006c8 // our handle's value
>!handle 000006c8 F // pass "F" as bitmask to display all of the relevant data
Handle 6c8
Type             Event
Attributes       0
GrantedAccess    0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
HandleCount      2
PointerCount     4
Name
Object Specific Information
    Event Type Auto Reset
    Event is Waiting

And so, this was our last step. We have confirmed that Monitor's synchronization primitive is in fact an Event object of the auto-reset type.
Whoever still wants to see the creation and usage of the Event in code can open the sync.cpp file in the SSCLI's implementation, and see how a call to CLREvent::CreateMonitorEvent triggers a call to UnsafeCreateEvent (which is actually a typedef to the familiar CreateEvent function).
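For readers less familiar with what "auto-reset" implies, here is a minimal sketch of the semantics in portable C++, built on a mutex and a condition variable. It is an analogy only: it is neither the Win32 Event object nor the CLR's code, and the AutoResetEvent class with its set, wait and tryWait members is a hypothetical name set chosen for the illustration.

```cpp
#include <condition_variable>
#include <mutex>

// Minimal auto-reset event: set() lets at most one waiter through, and the
// event returns to the non-signaled state as soon as that waiter wakes up.
class AutoResetEvent
{
public:
    void set()
    {
        {
            std::lock_guard<std::mutex> guard(m_mutex);
            m_signaled = true;
        }
        m_cond.notify_one();
    }

    // Blocks until the event is signaled, then consumes the signal.
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return m_signaled; });
        m_signaled = false; // this is the "auto reset"
    }

    // Non-blocking variant: consumes the signal only if it is present.
    bool tryWait()
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (!m_signaled)
            return false;
        m_signaled = false;
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_signaled = false;
};
```

This is a natural fit for a monitor: releasing the lock signals the event once, exactly one blocked thread wakes up to retry, and the rest stay asleep. Note also that signals don't accumulate, so calling set() twice still lets only one wait through.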

Even so, one has to remember that this is only a partial answer. As I mentioned at the beginning of this post, there's no guarantee that calling Monitor.Enter will always end up accessing some kernel object. In fact, in one of his posts, Joe Duffy makes sure to mention that in the CLR's implementation, when a thread encounters contention, it will spin a little before retrying to acquire the lock, without leaving user mode to wait on a kernel object. So even if the CLR doesn't provide a full-blown implementation of a synchronization primitive, it may still provide some optimizations over the services supplied by the operating system (much like the CriticalSection type).
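
The general "spin a little, then wait" idea can be sketched in portable C++ as follows. This is an illustration of the technique under my own assumptions, not the CLR's AwareLock: the class name, the default spin limit, and the use of a condition variable in place of a kernel event are all hypothetical choices.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

class SpinThenWaitLock
{
public:
    // spinLimit is an illustrative tuning knob, not a CLR value.
    explicit SpinThenWaitLock(int spinLimit = 1000) : m_spinLimit(spinLimit) {}

    void lock()
    {
        // Phase 1: retry a bounded number of times without blocking.
        for (int i = 0; i < m_spinLimit; ++i)
            if (try_lock())
                return;

        // Phase 2: stop spinning and block until an owner releases the lock.
        std::unique_lock<std::mutex> guard(m_mutex);
        ++m_waiters;
        m_cond.wait(guard, [this] { return try_lock(); });
        --m_waiters;
    }

    bool try_lock()
    {
        bool expected = false;
        return m_held.compare_exchange_strong(expected, true,
                                              std::memory_order_acquire);
    }

    void unlock()
    {
        m_held.store(false, std::memory_order_release);
        // Wake one blocked thread, if any, so it can retry the acquire.
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_waiters > 0)
            m_cond.notify_one();
    }

private:
    std::atomic<bool> m_held{false};
    std::mutex m_mutex;            // protects m_waiters and the wait itself
    std::condition_variable m_cond;
    int m_waiters = 0;
    const int m_spinLimit;
};
```

Because the class exposes lock, unlock and try_lock, it can be used with std::lock_guard. When the lock is held only briefly, a contending thread usually succeeds during the spin phase and never pays for a blocking wait, which is the optimization described above.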