Liran Chen's Blog: Performance

Showing posts with label Performance. Show all posts

Saturday, August 7, 2010

Accurately Measuring GC Suspensions

When you analyze the performance of a managed application, and look for the application's main bottlenecks, one place you should always check is the amount of time your application spend in GC. In order to get that information, you can always run Perfmon and get a general analysis of how the GC behaves in your application (where the most of the time, you'll be looking at the %Time in GC counter).
While looking at those data sets can give us a good overview about how the GC affects our application's performance, it's just not enough in order to get more in-depth insights about observed "fezzes" in the application. For instance, even if we'll see that in 90% of the time we only spend 2% in GC collections, it still doesn't mean that during some critical moment in our application, there wasn't any generation 2 collection that caused our application to freeze for about 200ms, which could cause some serious damage in some scenarios.
The problem is that GC collections don't just cause the application to "run more slowly", but it puts the application into a complete stop for unknown amounts of time. So solely looking at the %Time in GC graph doesn't tells us for sure what was the real damage of each collection. Also, remember that perfmon's highest sampling resolution is only 1 second. So in case we've suffered from a single generation 2 collection, we've might even won't notice it as we should. More than that, its almost impossible to make sure that perfmon is runs, gathers and records the correct data on every machine that runs your application at any given time.

This is why we need a reliable way to monitor when the GC decides to perform a collection, and when it does, check the collection's generation, and the amount of time it takes to complete it.
Today, the System.GC class doesn't expose us an accurate way to collect this data. However, by writing a custom CLR Host, we could integrate our code with the CLR and receive from it the proper notifications that will tell us exactly when each and every GC starts and end in the process.
In order to do so, we will need to implement the IHostGCManager interface, and use one out of its three available callback functions. The first is SuspensionStarting, and the second is SuspensionEnding, which also passes us a single DWORD parameter that represent the number of generation that ended. So as you might have already figured out, the CLR makes sure to call SuspensionStarting right before it starts the collection, and afterwords, it calls SuspensionEnding.
One thing to pay attention about SuspensionEnding is that the documentation available on MSDN isn't entirely accurate, and might confuse users about what are the possible triggers to call the callback. This is what the documentation says today:

"...Notifies the host that the common language runtime (CLR) is resuming execution of tasks on threads that had been suspended for a garbage collection.
[parameter: generation] The garbage collection generation that is just finishing, from which the thread is resuming."

So according to the documentation, the callback will only be invoked due to freezings caused by a GC. However, this isn't the only trigger for invoking this callback. Actually, the CLR will invoke it also after it continuous execution after other kinds of "stop-the-world" operations that it might perform (e.g, loading and unloading of AppDomains). If we'll look in the SSCLI's implementation to where that callback is invoked, we could notice the following code:

if (pGCThreadControl)
{
    // If we the suspension was for a GC, tell the host what generation GC.
    DWORD   Generation = (bFinishedGC
        ? GCHeap::GetGCHeap()->GetCondemnedGeneration()
        : ~0U);


    pGCThreadControl->SuspensionEnding(Generation);
}

So we can see that for every freezing that wasn't caused due to a GC, the generation parameter that is passed to the callback will contain the value of UINT_MAX, so when implementing your CLR Host, you should remember checking for this special value.

As for the measuring itself, we'll use the QueryPerformanceCounter function (in .Net, wrapped by the Stopwatch class), to achieve the highest possible time resolution for every collection.
Since in most of the time, the collections that we'll encounter will be very short ones (mostly ephemeral collections, that could take only a few millisecond per collection), we'll likely want to avoid spending time recording the data (so we'll avoid unnecessary IO). In such case it could be useful to use a logging framework that will filter collections according to their severity (e.g, Debug for very short collections, Info for more notable collections, and Warn for more lengthy collections that might indicate a problem). After attaching an appender that writes all of our logs to the console window, and running an application that constantly allocates memory, we could get an output such as this:

As a reference, I'm including a sample CLR Host that monitors collections, and writes to the console their duration:

#include 
#include 
#include 
#include 


using namespace std;


#define APP_STARTUP_EXE L"TestApplication.exe"
#define APP_ENTRY_TYPE L"SomeNamespace.Program"
#define APP_ENTRY_METHOD L"Main"


class MyCLRHost : public IHostControl, public IHostGCManager
{
private:
    LONG m_refCount;
    LARGE_INTEGER m_lastGCStart;
    LARGE_INTEGER m_frequency;


public:
    MyCLRHost() { QueryPerformanceFrequency(&m_frequency); }


    // IHostControl
    HRESULT __stdcall GetHostManager(REFIID riid, void** ppObject)
    {
        if(riid == IID_IHostGCManager)
        {
            *ppObject = static_cast(this);
            return S_OK;
        }


        *ppObject = NULL;
        return E_NOINTERFACE;
    }


    // IUnknown
    HRESULT __stdcall QueryInterface(REFIID riid, void** ppvObject)
    {
        if (riid == IID_IHostGCManager)
        {
            *ppvObject = static_cast(this);
            return S_OK;
        }


        *ppvObject = NULL;
        return E_NOINTERFACE;
    }


    HRESULT __stdcall SetAppDomainManager(DWORD appDomain, IUnknown* domainManager)
    { 
        return S_OK;
    }


    ULONG __stdcall AddRef() { return InterlockedIncrement(&m_refCount); }
    ULONG __stdcall Release() { return InterlockedDecrement(&m_refCount); }


    // IHostGCManager
    HRESULT __stdcall ThreadIsBlockingForSuspension() { return S_OK; }


    HRESULT __stdcall SuspensionStarting()
    {
        m_lastGCStart;
        QueryPerformanceCounter(&m_lastGCStart);


        return S_OK;
    }


    HRESULT __stdcall SuspensionEnding(DWORD gen)
    {
        LARGE_INTEGER gcEnd;
        QueryPerformanceCounter(&gcEnd);
        double duration = ((gcEnd.QuadPart - m_lastGCStart.QuadPart)) 
            * 1000.0 / (double)m_frequency.QuadPart;


        if(gen != UINT_MAX)
            cout<<"GC generation "<<<" ended: "<<<"ms"<
        else
            cout<<"CLR suspension ended: "<<<" ms"<


        return S_OK;
    }
};


int _tmain(int argc, _TCHAR* argv[])
{
    ICLRRuntimeHost* pCLR;
    DWORD startupFlags = STARTUP_CONCURRENT_GC;
    HRESULT hr = CorBindToRuntimeEx(L"v2.0.50727", L"wks", startupFlags,
        CLSID_CLRRuntimeHost, IID_ICLRRuntimeHost, (LPVOID*)&pCLR);
    assert(SUCCEEDED(hr));


    MyCLRHost customHost;
    hr = pCLR->SetHostControl(&customHost);
    assert(SUCCEEDED(hr));


    hr = pCLR->Start();
    assert(SUCCEEDED(hr));


    DWORD retcode;
    hr = pCLR->ExecuteInDefaultAppDomain(APP_STARTUP_EXE,
        APP_ENTRY_TYPE, APP_ENTRY_METHOD, L"" , &retcode);
    assert(SUCCEEDED(hr));


    return 0;
};

Tuesday, August 3, 2010

Forcing JIT Compilation During Runtime

One of the advantages/disadvantages of the .Net Framework is its usage of JIT Compilation: The process in which machine code is generated from CIL code during the application's run-time, at the production environment.
Depending on your point of view, this behavior can be considered as a helpful feature since during the JIT compilation, the compiler can utilize special features that the production processor supports (such as using more efficient, unique instructions that aren't supported by other processors). This way, we could achieve better performance than if we didn't use JIT compilation at all, since we would have to address a more basic set of processor features (in order to support a variety of processors).
The most dominant disadvantage of using a JIT compiler is the time being spent compiling CIL code on the production machine (since every time the user triggers a "first-time" call to a managed method, the JIT compiler is triggered, and starts to compile the method for the current and future calls).
In the CLR, every MethodDesc instance (a structure that exists for every managed method. Contains some metadata about the method and is allocated in the EE's (Execution Engine) memory), contains an IsJitted boolean field. Like the field name implies, its value tells us whether the the JIT compiler already compiled the relevant method (this field will assist us later on).
Once a method is called for the first time, the execution flow is redirected by a special stub into the JIT, that will be responsible to compile the method. Then, the stub that caused the redirection is overridden by a jmp instruction that directs the execution flow to the JIT's generated machine code (in fact, this mechanism resembles the delay load feature available in C++). So as you understand, each time a method is called for the first time, we suffer from a certain performance penalty.
Its important to emphasize that usually, that the JIT's performance penalty is quite minimal, since we only have to experience it at the method's first invocation. However, for applications that demand high responsiveness, this performance penalty might be intolerable.

The classic solution for this problem is to use NGen. The idea behind NGen is that when your application is installed on the production machine, the JIT is triggered to compile our entire set of managed assemblies, and generate a native image for the application. Afterwards, when the application is launched, the CLR will make sure to load the correct native image from the disk, and so avoid unnecessary "jitting".
The problem with NGen is that it's quite complicated to use, and if we are really looking to avoid performance penalties during our application's initialization, we will have to spend more of our valuable time in order to register our assemblies in the GAC, correctly set our DLL's base addresses in order to avoid rebasing, etc.
However, there is an alternative for using NGen, and it's triggering the JIT to compile your during runtime, exactly when your desire it. This solution have its own set of disadvantages and isn't necessarily better than using NGen, but in case you want to speed up your JIT process with minimal effort on your part, this a solution that you will definitely want to consider.

So heading right down to business, I'll explain how this works from the bottom-up, so first we will understand how we are able to trigger the JIT compiler ourselves, in an noninvasive fashion (without actually executing the method). And then I'll explain on how you can utilize this ability in your own existing applications.
If so, our first step is to learn how to trigger the JIT compiler on a single, specific method. For that, we will use the RuntimeHelpers.PrepareMethod function. Normally, we would use this method when we need to call virtual functions inside a CER region, but in our case, what is more interesting is that this method causes the JIT to compile the method that was passed to it (Remark: it should be noted that calling PrepareMethod may trigger the static constructor of the destination type, that is, if it has one).
Since in this demonstration we we'll want to JIT-compile the entire set of methods contained in a given assembly, we will use a utility method that iterates over the contained types in a given assembly, and invokes PrepareMethod on every relevant method:

public static void PreJITMethods(Assembly assembly)
{
    Type[] types = assembly.GetTypes();
    foreach (Type curType in types)
    {
        MethodInfo[] methods = curType.GetMethods(
                BindingFlags.DeclaredOnly |
                BindingFlags.NonPublic |
                BindingFlags.Public |
                BindingFlags.Instance |
                BindingFlags.Static);


        foreach (MethodInfo curMethod in methods)
        {
            if (curMethod.IsAbstract ||
                curMethod.ContainsGenericParameters)
                continue;


            RuntimeHelpers.PrepareMethod(curMethod.MethodHandle);
        }
    }
}

One important thing about this method is the placement of the if statement right before the invocation of PrepareMethod. The first part of the statement is obvious, we have to ignore abstract methods since they don't contain any implementation detail, so the JIT compiler have nothing to do with them (attempting to pass them to PrepareMethod will result in an exception). The second part of the statement is the ignorance of generic methods. We can't cause the JIT to compile a generic methods since at this stage we still don't know which parameters are passed to the method in different call sites in the application's code. This is in fact a disadvantage versus NGen, that is aware of which "versions" of the generic method needs to be compiled.

The next step is to integrate this utility method into our application's initialization routine. When doing so, there are a few things to consider. Would we rather to trigger the code generation once right after our application finished loading? Or perhaps every time a new assembly is loaded into the AppDomain? Do we prefer executing this code synchronously, or perhaps asynchronously? And what about dedicating a thread with a lower priority than our other applicative threads? This questions will have to be answered according to your own preferences, and your application's needs.
Another thing to remember is that by default, the CLR delays the loading of referenced assemblies until we reach the first time we actually reference a type from it in our code (so if have a reference to a certain assembly, and we never reached a point where we needed to use it, it won't be loaded into the AppDomain). So if we want to trigger the JIT compiler on all of the statically referenced assemblies, we will have to explicitly load them. In order to do so, we could recursively call GetReferencedAssemblies and Load. For example:

// recursively load all of assemblies referenced by the given assembly
public static void ForceLoadAll(Assembly assembly)
{
    ForceLoadAll(assembly, new HashSet());
}


private static void ForceLoadAll(Assembly assembly,
                                 HashSet loadedAssmblies)
{
    bool alreadyLoaded = !loadedAssmblies.Add(assembly);
    if (alreadyLoaded)
        return;


    AssemblyName[] refrencedAssemblies =
        assembly.GetReferencedAssemblies();


    foreach (AssemblyName curAssemblyName in refrencedAssemblies)
    {
        Assembly nextAssembly = Assembly.Load(curAssemblyName);
        if (nextAssembly.GlobalAssemblyCache)
            continue;


        ForceLoadAll(nextAssembly, loadedAssmblies);
    }
}

In this example, I've chosen to ignore assemblies that are loaded from the GAC, so we won't compile the BCL's massive codebase (thus, minimizing our working set size). As you can understand, this method can be customized to fit your specific needs regarding which assemblies to load, and which to ignore.
Obviously, this method will only load assemblies that are statically referenced to our application. In order to trigger JIT compilation on dynamically loaded assemblies (as a product of using Reflection for instance), you may want to subscribe to the AssemblyLoad event that will be fired during the loading of new assemblies into the AppDomain.

In case you are interested to know how you can positively confirm that a certain method was JIT compiled by the PrepareMethod (or by some other means), then you will need to grab WinDbg and SOS.
All you need to do is to find the address of your method's MethodDesc instance, and check what is the value of the IsJitted field. Here is a typical example to how you would want to do so:

> !name2ee OtherAssmebly.dll C.ClassC
Module: 00a457b8 (OtherAssmebly.dll)
Token: 0x02000002
MethodTable: 00a48bbc
EEClass: 00f283d4
Name: C.ClassC

> !dumpmt -md 00a48bbc
EEClass: 00f283d4
Module: 00a457b8
Name: C.ClassC
mdToken: 02000002  (C:\...\OtherAssmebly.dll)
BaseSize: 0xc
ComponentSize: 0x0
Number of IFaces in IFaceMap: 0
Slots in VTable: 6
--------------------------------------
MethodDesc Table
   Entry MethodDesc      JIT Name
00a4c838   00a48bb0     NONE C.ClassC..ctor()
00a4c830   00a48ba0      JIT C.ClassC.FooMethod()

> !dumpmd 00a48ba0
Method Name: C.ClassC.FooMethod()
Class: 00f283d4
MethodTable: 00a48bbc
mdToken: 06000001
Module: 00a457b8
IsJitted: yes
CodeAddr: 00f50778

Monday, June 28, 2010

String.Format Isn't Suitable for Intensive Logging

One of the first rules that every .Net programmer learns to memorize is "Never use the + operator to combine strings. Only use the StringBuilder class to do so". If you'll ask a random programmer why should we do so, the typical answer would be "Since strings in .Net are immutable, every time we use the + operator we actually cause a memory allocation, so in order to avoid we should use the StringBuilder".
This answer is a common misconception. Does StringBuilder is necessarily better than using the + operator, or perhaps other means for combining strings? No, not necessarily.

First of all, let's look at the + operator's behavior. Basically, what it does is to allocate a new string, and copy the characters from the two older strings into the new one. So theoretically, there's a serious negative impact on memory usage, since when combining multiple strings, we could find ourselves allocation redundant strings that will only be used to contain "parts" of targeted final string.
This is only partialy true, since the C# compiler will detect combinations of 5 strings and above using the + operator, and will convert it to use string.Concat.
For example, this code:

string str2 = 1.ToString() + 2.ToString() + 3.ToString() + 4.ToString() + 5.ToString();

Will be compiled into this code:

string[] CS$0$0000 = new string[] { 1, 2, 3, 4, 5 }; // edit: removed ToString
string text1 = string.Concat(CS$0$0000);

Regarding string.Concat, you shouldn't be worried about excessive memory allocations, since the first thing the method does is to calculate the combined string's total length, creates the string using a call to AllocateFastString, and then simply copies the content of the old strings to the new one using wstrcpy (efficient raw bytes copying using unsafe code).

One of the more common usages for combining strings, is in log formatting. In case your application does lots of intensive logging, then there's a good chance that it's spending some serious amounts of time in formatting and combining strings that will eventually arrive to the logging framework.
The situation might change a little depending on which logging framework you use, but I'll demonstrate using log4net. The case in log4net is that you may choose to use the simple logging methods without formatting (e.g, Deubg/Info/Warn..), or, you may also choose to use the ones that support formatting (e.g, DebugFormat/InfoFormat/WarnFormat..). While the first option receives a parameterless string, the other receives a formatted string, with additional paramters that should be placed inside the string, using string.Format. Now this could introduce a problem.
The issue here is that every time we would like to insert some parameter into our log, we are most likely to use log4net's Format method (that uses string.Format) instead of using string.Concat or perhaps string.Join. Just because this API seems more convenient, and approachable.
The following benchmark demonstrates the differences between the three alternatives:

while(true)
{
    Stopwatch sw = Stopwatch.StartNew();


    for (int i = 0; i < 500000; i++)
    {
        // 960ms
        string str1 = string.Format("{0}, {1}, {2}, {3}, {4}", 1, 2, 3, 4, 5);
        // 665ms
        string str2 = string.Concat(1, ", ", 2, ", ",3, ", ", 4, ", ", 5);
        // 566ms
        string str3 = string.Join(", ", new string[] { 1, 2, 3, 4, 5 });
    }


    Console.WriteLine(sw.ElapsedMilliseconds);
}

In this case, even though I've use a fairly simple log, the differences between the methods are definite. When comparing the performance of string.Format to string.Concat, we see that we achieved a 31% of improvement. And in comparison to string.Join, we achieve a substantial 62% of performance gain (string.Join's implementation doesn't use string.Concat under the covers, but works in a similar way by calling FastAllocateString and using the UnSafeCharBuffer class to construct the new string).
Don't think that this performance improvement comes with the cost of more intensive memory usage, since in this area, Concat and Join still achieve much better results than string.Format.
These results dosn't mean that StringBuilder will always produce lesser performance, since in cases in which you don't always know the number of concatenations you will need to perform, or perhaps in cases in which you could optimize memory allocations by modifying the capacity variable in the StringBuilder's constructor, you could actually achieve better results than these.

Of course that if you perform your logs only "once in a while", this performance difference is almost meaningless. And the fact that using FooFormat might be more comfortable to us might be just enough to justify its usage. But, in cases in which your application performs some intensice logging, these performance differences might be substantial for you.

Wednesday, June 2, 2010

DateTime.Now Causes Boxing

Update 2.9.2010

The content of this post regards DateTime.Now's implementation until v3.5. At the release of v4.0, its internal implementation was updated to avoid unnecessary boxing.

However, looking into the new implementation details revealed a new source for dynamic memory allocations in DateTime.Now. The issue is discussed in detail in the updated post: DateTime.Now in v4.0 Causes Dynamic Memory Allocations.

Perhaps you weren't aware of it, but every time you access DateTime.Now, you cause to a dynamic memory allocation through the boxing of an Int32 variable.
The reason for this lays deep inside the implementation of the Now property. If we will look a bit closer using Reflector, we could see that it actually wraps a call to UtcNow and then converts it to local time using the ToLocalTime method.
While there's nothing wrong with the call to UtcNow (it just wraps a call to the Win32 GetSystemTimeAsFileTime), there is some trouble with ToLocalTime, or actually, with the class CurrentSystemTimeZone that is being used. During the conversion the method GetDaylightChanges is called that causes the boxing. You can look for yourself:

    public override DaylightTime GetDaylightChanges(int year)
    {
        object key = year;
        if (!this.m_CachedDaylightChanges.Contains(key))
        {
            // ..lots of code
        }
 
        return (DaylightTime)this.m_CachedDaylightChanges[key];
    }

This is a good example showing that even in latest versions of the .Net Framework, some types still uses non-generic data structures, from the days of 1.0/1.1. In this case, the data member m_CachedDaylightChanges is actually a Hashtable).
This means that you might find plenty more of cases in which unnecessary boxing occurs without your knowing. But what makes DateTime.Now case especially serious is that it's a very common property to access, especially by logging frameworks for instance.

All of those boxings comes with a price, and it's not cheap. Due to the high frequency of dynamic memory allocations, we could cause some serious stress for the GC.
In order to demonstrate the impact on memory usage, I'll use the following benchmark:

        while (true)
        {
            Stopwatch sw = Stopwatch.StartNew();
 
            for (int i = 0; i < 1000000; i++)
            {
                DateTime now = DateTime.Now;
            }
 
            Console.WriteLine(sw.ElapsedMilliseconds);
        }

After executing this code, we can learn two things. The first, is that it takes on average 456ms to complete each iteration on the while loop (I'll address this result in a moment), and the other, is that we can learn the magnitude of the affect of calling DateTime.Now on our application's memory usage patterns. In this case, I've used perfmon to monitor the application's memory allocation rate:

Even though most of these allocations will be released by the GC in a Generation 0 collection, we still can't take them as granted. Especially if you're developing an application where you need to minimize as possible any dynamic memory allocation, mostly because you want to reduce the number of collections performed by the GC.

How to Avoid The Boxing?
After we've seen why we shouldn't call DateTime.Now (in certain scenarios), raises the question "How could we avoid those unnecessary memory allocations?".
First, there's Microsoft's side, which needs to make sure to update their code to make sure they avoid boxing (e.g, using a Dictionary instead of a Hashtable). But until this happens, we'll need to take matters to our own hands and workaround this issue. My suggestion is to write a wrapper method that is almost identical to UtcNow's implementation (that simply wraps a call to the Win32 API). All we need to do is to call GetLocalTime that will return use the the in SYSTEMTIME format, which we will need to convert back to a DateTime type.
For example:

public static class TimeUtil
{
    [DllImport("kernel32.dll")]
    static extern void GetLocalTime(out SYSTEMTIME time);
 
    [StructLayout(LayoutKind.Sequential)]
    private struct SYSTEMTIME
    {
        public ushort Year;
        public ushort Month;
        public ushort DayOfWeek;
        public ushort Day;
        public ushort Hour;
        public ushort Minute;
        public ushort Second;
        public ushort Milliseconds;
    }
 
    public static DateTime LocalTime
    {
        get
        {
            SYSTEMTIME nativeTime;
            GetLocalTime(out nativeTime);
 
            return new DateTime(nativeTime.Year, nativeTime.Month, nativeTime.Day,
                                nativeTime.Hour, nativeTime.Minute, nativeTime.Second,
                                nativeTime.Milliseconds, DateTimeKind.Local);
        }
    }
}

Running the benchmark from before with this updated implementation, we could see that we completely got rid of the memory allocations, and as a bonus, every iteration takes only 370ms. That is, an 18% improvement over the performance of DateTime.Now's implementation.