Showing posts with label .Net Internals. Show all posts

Sunday, August 31, 2014

Unhandled Exceptions in PerformIOCompletionCallback

Debugging unhandled exceptions in your code might not be how you planned to spend your afternoon. On the other hand, at least it's your code you're debugging and not, let's say... the Framework's. In this post, I'll talk about the latter, and more specifically, about a case that popped up in the PerformIOCompletionCallback function.

The main problem with exceptions thrown before the CLR has a chance to execute your own code is that the call stack is almost useless. A NullReferenceException thrown while handling an IO completion might consist of nothing more than the following line:

System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0x54

Not very helpful. Since none of our code was executed, we can't simply see which variable was null, or even understand which module in our application caused this exception. In order to proceed with the analysis, it's time to pull out the big guns and load a full memory dump of the process into WinDbg.

Once the dump, SOS and the appropriate symbols are loaded, we should be able to find the problematic thread using the ~*e!clrstack command. With the faulting instruction pointer (IP) in hand, we can use the !u command to dump the assembly of the failing function. The output should be similar to this one (truncated for brevity):

0:073> !u 000007FEF33888A4
preJIT generated code
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Begin 000007fef3388850, size 11e. Cold region begin 000007fef39340f0, size 3b
Hot region:
000007fe`f3388850 53              push    rbx
000007fe`f3388851 55              push    rbp
000007fe`f3388852 57              push    rdi
000007fe`f3388853 4883ec40        sub     rsp,40h
000007fe`f3388857 488d6c2420      lea     rbp,[rsp+20h]
000007fe`f338885c 4c894550        mov     qword ptr [rbp+50h],r8
000007fe`f3388860 895548          mov     dword ptr [rbp+48h],edx
000007fe`f3388863 894d40          mov     dword ptr [rbp+40h],ecx
000007fe`f3388866 48896500        mov     qword ptr [rbp],rsp
000007fe`f338886a 33c0            xor     eax,eax
000007fe`f338886c 48894510        mov     qword ptr [rbp+10h],rax
000007fe`f3388870 48c7450800000000 mov     qword ptr [rbp+8],0
000007fe`f3388878 488d5508        lea     rdx,[rbp+8]
000007fe`f338887c 488d0d6d7e71ff  lea     rcx,[mscorlib_ni+0x1b06f0]
000007fe`f3388883 e820a886ff      call    mscorlib_ni+0x3030a8 
000007fe`f3388888 ba9d010000      mov     edx,19Dh
000007fe`f338888d b901000000      mov     ecx,1
000007fe`f3388892 e889a786ff      call    mscorlib_ni+0x303020 
000007fe`f3388897 488b4d50        mov     rcx,qword ptr [rbp+50h]
000007fe`f338889b e8d07b89ff      call    mscorlib_ni+0x330470 (System.Threading.OverlappedData.GetOverlappedFromNative(...)
000007fe`f33888a0 488b7820        mov     rdi,qword ptr [rax+20h]
000007fe`f33888a4 488b4708        mov     rax,qword ptr [rdi+8]
000007fe`f33888a8 488b5818        mov     rbx,qword ptr [rax+18h]
000007fe`f33888ac 4885db          test    rbx,rbx
000007fe`f33888af 741a            je      mscorlib_ni+0xa988cb (000007fe`f33888cb)
...

According to Microsoft's x64 register usage conventions, at the time of the exception the RAX register contained the return value of the GetOverlappedFromNative(...) function. Then, after a few field-offset dereferences, a NullReferenceException is thrown. Such errors often occur when the owner object (socket, file stream, etc.) is disposed while there's an outstanding IO operation. Once the asynchronous operation completes, the CLR attempts to use the disposed object - and we enter the wonderful world of undefined behavior. Though the usual result is quite deterministic: a complete crash of the process due to the unhandled exception. Needless to say, you should never dispose an object before you're sure it has no asynchronous IO operations still in progress.
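
To illustrate the ordering requirement, here's a minimal, contrived sketch (the class and method names are mine, not taken from the dump analyzed above) using the old Begin/End asynchronous pattern on a FileStream. The comment marks the spot where a premature Dispose would set up exactly this kind of crash:

```csharp
using System;
using System.IO;

class SafeDisposeOrdering
{
    public static int Run()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "hello");

        var buffer = new byte[16];
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                    FileShare.Read, 4096, useAsync: true);

        IAsyncResult ar = stream.BeginRead(buffer, 0, buffer.Length, null, null);

        // The crash described above happens when Dispose runs *here*,
        // while the read is still in flight. The safe ordering is to
        // complete the outstanding operation first:
        int bytesRead = stream.EndRead(ar);
        stream.Dispose();

        File.Delete(path);
        return bytesRead;
    }

    static void Main() => Console.WriteLine(SafeDisposeOrdering.Run());
}
```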

This brings us to the final step. How can we check that the target object was indeed already disposed, and how can we tell which part of our codebase is responsible for this unfortunate bug?
First, we'll need to retrieve the OverlappedData instance that the native pointer maps to. This is exactly the purpose of the GetOverlappedFromNative function, but since it's implemented inside the CLR, we can't simply use a decompiler to see its implementation. Instead, we can use the !u command again. This time, the output should be similar to this:

0:073> !u 000007fe`f2c20470
Unmanaged code
000007fe`f2c20470 e9bb8d1a01      jmp     clr!GetOverlappedFromNative (000007fe`f3dc9230)

0:073> !u 000007fe`f3dc9230
Unmanaged code
000007fe`f3dc9230 488d41b0        lea     rax,[rcx-50h]
000007fe`f3dc9234 c3              ret

After a short redirection into the CLR, we can see that the function simply pulls the pointer back by 0x50 bytes (that is, the native structure is actually embedded within the managed one). This is consistent with the function's implementation in the SSCLI:

static OverlappedDataObject* GetOverlapped (LPOVERLAPPED nativeOverlapped)
{
    ...        
    return (OverlappedDataObject*)((BYTE*)nativeOverlapped - offsetof(OverlappedDataObject, Internal));
}

Dumping the contents of the OverlappedData's instance should give an output similar to this:

0:073> !do 0x0000000005043e08-0x50
Name:        System.Threading.OverlappedData
MethodTable: 000007fef2903180
EEClass:     000007fef2a0a110
Size:        120(0x78) bytes
File:        C:\...\mscorlib.dll
Fields:
... Offset      Type         VT   Attr         Value      Name
...   8  System.IAsyncResult  0 instance 0000000000000000 m_asyncResult
...  10 ...ompletionCallback  0 instance 0000000005a46028 m_iocb
...  18 ...ompletionCallback  0 instance 0000000000000000 m_iocbHelper
...  20 ...eading.Overlapped  0 instance 0000000005a46010 m_overlapped
...  28        System.Object  0 instance 0000000013335c18 m_userObject
...  30 ...ppedDataCacheLine  0 instance 00000000050436d8 m_cacheLine
...  38        System.IntPtr  1 instance           221828 m_pinSelf
...  40        System.IntPtr  1 instance                0 m_userObjectInternal
...  48         System.Int32  1 instance                1 m_AppDomainId
...  4c         System.Int16  1 instance               13 m_slot
...  4e          System.Byte  1 instance                0 m_isArray
...  4f          System.Byte  1 instance                0 m_toBeCleaned
...  50 ....NativeOverlapped  1 instance 0000000005043e08 m_nativeOverlapped

Dumping the m_iocb member and then m_target should give you the problematic object. In the demonstrated case, the object is an instance of SocketAsyncEventArgs, so dumping it and looking at the m_DisposeCalled field confirms our theory:

0:073> !do 0000000005a53b58 
Name:        System.Net.Sockets.SocketAsyncEventArgs
MethodTable: 000007fef21248c0
EEClass:     000007fef1e6d550
Size:        440(0x1b8) bytes
File:        C:\...\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
...  Offset                 Type VT     Attr            Value Name
...      80     System.EventArgs  0   shared           static Empty
...       >> Domain:Value  0000000000112a40:0000000003326168 <<
...       8 ...et.Sockets.Socket  0 instance 0000000000000000 m_AcceptSocket
...      10 ...et.Sockets.Socket  0 instance 0000000000000000 m_ConnectSocket
...      18        System.Byte[]  0 instance 0000000013335c18 m_Buffer
...     170 System.Net.WSABuffer  1 instance 0000000005a53cc8 m_WSABuffer
...      e8        System.IntPtr  1 instance         13335c28 m_PtrSingleBuffer
...     130         System.Int32  1 instance            32768 m_Count
...     134         System.Int32  1 instance                0 m_Offset
...      20 ...lib]], mscorlib]]  0 instance 0000000000000000 m_BufferList
...     16c       System.Boolean  1 instance                0 m_BufferListChanged
...      28 ...m.Net.WSABuffer[]  0 instance 0000000000000000 m_WSABufferArray
...     138         System.Int32  1 instance            32768 m_BytesTransferred
...      30 ...entArgs, System]]  0 instance 0000000000000000 m_Completed
...
...     16f       System.Boolean  1 instance                1 m_DisposeCalled
...

In order to find out which part of our code initiated the current IO operation, we could use !gcroot on the target object or the m_userObject member. In some cases, simply examining the contents of m_userObject could give us a rough idea regarding the identity of the culprit.

Friday, September 10, 2010

"!CLRStack -p" Isn't Always Reliable

One of the most commonly used commands in SOS is !CLRStack. When combined with the -p switch, SOS will attempt to display the values of the parameters that were passed to the functions in our managed call stack. It's important to emphasize that SOS will only attempt to display the correct values, since sometimes it simply gets them wrong.

In cases where SOS concludes that it cannot retrieve the value of a parameter, the string <no data> will be displayed. This happens in situations where SOS honestly can't track the value of the parameter just by looking at a specific stack frame. For example, if we're using a calling convention such as fastcall, the first two parameters (starting from the left) that fit in a register will be passed in the ECX and EDX registers instead of being pushed onto the stack. For member functions, the value of the this pointer is usually passed in the ECX register.
This kind of behavior can lead to situations where the values of some of the parameters are missing, since the registers were already overridden by other functions that were called down the stack.
As opposed to situations where SOS is able to conclude that it isn't sure what the correct value of a parameter is, every once in a while it's just going to get things wrong and display incorrect values. This can obviously mislead the person attempting to debug the application.
Perhaps the most concerning thing about this phenomenon is that it's not very hard to reproduce. Let's take the following application as an example:

    class Program
    {
        static void Main(string[] args)
        {
            Foo a = new Foo();
            a.Work(1, 2, 3, 4, 5);
        }
    }
 
    class Foo
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Work(int x, int y, int z, int k, int p)
        {
            // break with debugger here
        }
    }
Now we'll run WinDbg, load SOS and see what the !CLRStack command has to say about the parameters that were passed to the Work method.
(Note: The output might vary slightly, depending on your version of SOS. The following debugging session was performed on version 4.0.30319.1.)

0:000> !clrstack -p
OS Thread Id: 0xfbc (0)
Child SP IP       Call Site
0012f3fc 030300b3 ConsoleApplication1.Foo.Work(Int32, Int32, Int32, Int32, Int32) [Program.cs @ 24]
    PARAMETERS:
        this (<CLR reg>) = 0x00b3c358
        x (<CLR reg>) = 0x00000001
        y (0x0012f40c) = 0x00000003
        z (0x0012f408) = 0x00000004
        k (0x0012f404) = 0x00000005
        p (0x0012f400) = 0x03030092
0012f414 03030092 ConsoleApplication1.Program.Main(System.String[]) [Program.cs @ 16]
    PARAMETERS:
        args =
0012f648 791421db [GCFrame: 0012f648]

0:000> r ecx  // holds "this" pointer
ecx=00b3c358
0:000> r edx  // holds "x" parameter
edx=00000001
If so, it isn't difficult to see that besides the this pointer and the x parameter (which were passed in registers), SOS got all of the other parameters wrong. In fact, one may notice a "shift" in the displayed values (y got the value of z, while z got the value of k, and so on...).
In order to better understand what went on here, we'll print the memory between the relevant stack pointers (in previous versions, this value was represented by the ESP column instead of the Child SP column).

0:000> dp /c 1 0012f3fc  0012f414
0012f3fc  0012f414 // EBP
0012f400  03030092 // Return Address
0012f404  00000005 // p 
0012f408  00000004 // k
0012f40c  00000003 // z
0012f410  00000002 // y
0012f414  0012f424 // EBP

Now, when we compare the stack addresses of the parameters that SOS gave us against the real addresses that we see here, we can confirm that a small shift was applied to the memory addresses of the parameters that were passed on the stack. So every time SOS attempts to display the value of a parameter, it actually displays the value of the parameter that was passed next to it.
This scenario is a classic example of the somewhat "buggy" nature of SOS. It doesn't mean that we have to immediately stop using the !CLRStack command, but it should remind us not to take SOS's output as the "absolute truth" when debugging, and to keep ourselves alert for all kinds of "weird" behaviors such as this one.

Thursday, September 2, 2010

DateTime.Now in v4.0 Causes Dynamic Memory Allocations

A while back I mentioned in a post that calling DateTime.Now causes boxing. Following that post, a feedback item was also opened on Microsoft Connect, reporting the issue.
Yesterday, the feedback's status turned to Resolved after Greg cleverly remarked that this issue was fixed in v4.0 (the boxing occurred up until v3.5).

However, after reading the response I took the same benchmark code from the previous post and ran it again, this time using .Net Framework 4.0. Surprisingly, perfmon kept reporting high allocation rates in the Allocated Bytes/sec counter. So I opened up Reflector and took a look at DateTime.Now's new implementation. And indeed, the boxing issue was resolved, since the new implementation uses the TimeZoneInfo type instead of TimeZone. Unable to find the source of the allocations just by looking at the code, it was time to launch WinDbg.
After letting the application run for a while, I attached WinDbg to the process and executed the !DumpHeap -stat command a couple of times, so I could take a general look at the kinds and amounts of objects currently living on the managed heap. The output was as follows:

0:003> !dumpheap -stat
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
...
79ba1d88       13         1532 System.Char[]
79ba2f20        1         3164 Dictionary[[String, mscorlib],[String, mscorlib]][]
79b9f9ac      449        12220 System.String
001663d8       53        13212      Free
79b56c28       30        17996 System.Object[]
79b8e7b0     9119       291808 System.Globalization.DaylightTime
Total 9771 objects

0:003> !dumpheap -stat 
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
...
79ba1d88       13         1532 System.Char[]
79ba2f20        1         3164 Dictionary[[String, mscorlib],[String, mscorlib]][]
79b9f9ac      449        12220 System.String
001663d8       53        13212      Free
79b56c28       30        17996 System.Object[]
79b8e7b0    20654       660928 System.Globalization.DaylightTime
Total 21306 objects

This output reveals the identity of our memory-consuming bandit: DaylightTime.
Now, all that's left is to spot where this type is being instantiated and used. For this purpose, we could use the !BPMD -md command to place breakpoints on specific managed methods that DaylightTime exposes (you could dump the type's methods using the following command: !DumpMT -md ).
After setting the breakpoint, the application continues its execution and immediately breaks. Looking at the managed call stack using !CLRStack reveals the source method of the allocation: TimeZoneInfo.GetIsDaylightSavingsFromUtc. This method creates an instance of DaylightTime, and since DaylightTime is a class (hence, a reference type), a dynamic memory allocation is made.

   // a snippet from the implementation:
   TimeSpan span = utc;
   DaylightTime daylightTime = GetDaylightTime(Year, rule);
   DateTime startTime = daylightTime.Start - span;
   DateTime endTime = (daylightTime.End - span) - rule.DaylightDelta;

In conclusion, DateTime.Now's implementation was updated in BCL v4.0 and the boxing of Int32s was removed, since the implementation now uses the TimeZoneInfo type instead of TimeZone. However, using TimeZoneInfo results in a new source of dynamic allocations; this time, instead of redundant boxing, the allocations are caused simply because a reference type is being used under the hood. And since each instance of DaylightTime weighs in at 32 bytes (including the object header), the update in the BCL could be considered a turn for the worse regarding memory usage, since DaylightTime instances consume more memory than boxed instances of Int32.
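
For reference, a rough way to observe this from code rather than from perfmon is to snapshot the managed heap size around a burst of DateTime.Now calls. This is only a sketch (the helper name is mine), and the numbers depend heavily on the runtime version; on later runtimes the per-call allocation may be gone entirely:

```csharp
using System;

class NowAllocations
{
    // Returns the approximate change in managed heap size, in bytes,
    // caused by the given number of DateTime.Now calls.
    public static long MeasureNowAllocations(int iterations)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        for (int i = 0; i < iterations; i++)
        {
            DateTime unused = DateTime.Now; // on v4.0, each call could allocate a DaylightTime
        }

        return GC.GetTotalMemory(forceFullCollection: false) - before;
    }

    static void Main() =>
        Console.WriteLine($"Approximate bytes allocated: {MeasureNowAllocations(100000)}");
}
```

Note that the delta is only approximate: a GC may run mid-loop and collect the dead DaylightTime instances before we measure.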

Sunday, August 29, 2010

Brain Teasing With Strings

And in today's post... a riddle.
Actually, it comes down to a short C# program that deals with string comparisons and demonstrates a series of "unexpected" behaviors. How "unexpected" are the results? Well, that depends on the reader's degree of experience with string comparisons.
Lately this sort of "brain-teaser" has become quite popular on different forums and newsgroups, but it seems like no matter where these teasers are posted, they never come with a full in-depth explanation of the "dark magic" going on behind the scenes that could explain the demonstrated "weird" and "unexpected" behaviors. So basically, I'll attempt to make things a little bit clearer than they are now.
So without any more introductions, please tell me: what would be the output of the following program?

static void Main()
{
    char[] a = new char[0];
    string b = new string(a);
    string c = new string(new char[0]);
    string d = new string(new char[0]);

    Console.WriteLine(object.ReferenceEquals(a, b));
    Console.WriteLine(object.ReferenceEquals(b, c));
    Console.WriteLine(object.ReferenceEquals(c, d));
}

When this question is presented to a group of experienced developers, it's likely to produce a variety of different answers, each backed up by some explanation that usually "makes sense". The problem is that these explanations usually just "make sense", without referring to hard proof that could uncover the mystery completely.
So, are you ready with your answers? Let's check the output:

False
True
True
"Hmm... that was interesting".
Let's see what do we got here. On the one hand, a isn't equal to b. This makes sense since after all, they are difference instances of different types. But on the other hand, we can also see that b is equal to c and d as well. This could appear to be a little confusing since as we know, strings are supposed to be immutable, so according to the code, we are "supposed" to create three different string instances (mind you that we aren't dealing with user-hardcoded, interned strings here).
It appears that even though we've create three different arrays, we've only allocated a single string (that one that b, c and d references).
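
In fact, we can push the experiment one step further and check the single surviving instance against string.Empty (a small sketch of my own, testing the assumption that the shared instance is the interned empty string):

```csharp
using System;

class EmptyStringDemo
{
    static void Main()
    {
        string b = new string(new char[0]);
        string c = new string(new char[0]);

        // Both constructor calls are collapsed onto the same shared instance...
        Console.WriteLine(object.ReferenceEquals(b, c));            // True

        // ...and that instance is the interned empty string itself.
        Console.WriteLine(object.ReferenceEquals(b, string.Empty)); // True
    }
}
```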

Usually, when a char[] (or, some other kind of string-representing structure) is passed to a string constructor, a new string is allocated to represent the array's characters. However, as our brief program demonstrates, this isn't always the case.
To understand why this happens, we'll have to dig into the implementation of the string class, and more specifically, into its internal implementation in SString (the following code samples are taken from the SSCLI's implementation).
Looking at SString's internal constructor, we can notice that the char[] argument is checked for being NULL or having a length of zero. If so, the implementation completely "forgets" about that argument and instead points the new SString instance at an interned version of the empty string (so basically, the redundant allocation of a new, empty string is optimized away).
We can see this behavior in SString's Set and Clear functions (the Set function is invoked from the constructor):

void SString::Set(const WCHAR *string)
{
    if (string == NULL || *string == 0)
        Clear();
    else
    {
        Resize((COUNT_T) wcslen(string), REPRESENTATION_UNICODE);
        wcscpy_s(GetRawUnicode(), GetBufferSizeInCharIncludeNullChar(), string);
    }

    RETURN;
}

void SString::Clear()
{
    SetRepresentation(REPRESENTATION_EMPTY);

    if (IsImmutable())
    {
        // Use shared empty string rather than allocating a new buffer
        SBuffer::SetImmutable(s_EmptyBuffer, sizeof(s_EmptyBuffer));
    }
    else
    {
        SBuffer::TweakSize(sizeof(WCHAR));
        GetRawUnicode()[0] = 0;
    }

    RETURN;
}

Tuesday, August 3, 2010

Forcing JIT Compilation During Runtime

One of the advantages/disadvantages of the .Net Framework is its usage of JIT compilation: the process in which machine code is generated from CIL code during the application's run-time, on the production machine.
Depending on your point of view, this behavior can be considered a helpful feature, since during JIT compilation the compiler can utilize special features that the production processor supports (such as more efficient, unique instructions that aren't supported by other processors). This way, we can achieve better performance than if we didn't use JIT compilation at all, since otherwise we would have to target a more basic set of processor features (in order to support a variety of processors).
The most dominant disadvantage of using a JIT compiler is the time being spent compiling CIL code on the production machine (since every time the user triggers a "first-time" call to a managed method, the JIT compiler is triggered, and starts to compile the method for the current and future calls).
In the CLR, every MethodDesc instance (a structure that exists for every managed method, contains some metadata about the method, and is allocated in the EE's (Execution Engine) memory) contains an IsJitted boolean field. As the field's name implies, its value tells us whether the JIT compiler has already compiled the relevant method (this field will assist us later on).
Once a method is called for the first time, the execution flow is redirected by a special stub into the JIT, which is responsible for compiling the method. Then, the stub that caused the redirection is overwritten by a jmp instruction that directs the execution flow to the JIT's generated machine code (in fact, this mechanism resembles the delay-load feature available in C++). So as you understand, each time a method is called for the first time, we suffer a certain performance penalty.
It's important to emphasize that usually the JIT's performance penalty is quite minimal, since we only have to experience it at the method's first invocation. However, for applications that demand high responsiveness, this performance penalty might be intolerable.
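
A crude way to see this first-call penalty is to time the first invocation of a method against the second one. This is just an illustrative sketch (the names are mine); the exact numbers vary wildly between machines, and on newer runtimes with tiered compilation the picture is more complicated:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class JitCost
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Work(int x) => x * 2;

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        int first = Work(21);              // first call: pays the JIT-compilation cost
        long firstTicks = sw.ElapsedTicks;

        sw.Restart();
        int second = Work(21);             // later calls jump straight to the machine code
        long secondTicks = sw.ElapsedTicks;

        Console.WriteLine($"first call: {firstTicks} ticks, second call: {secondTicks} ticks");
    }
}
```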

The classic solution for this problem is to use NGen. The idea behind NGen is that when your application is installed on the production machine, the JIT is triggered to compile our entire set of managed assemblies, and generate a native image for the application. Afterwards, when the application is launched, the CLR will make sure to load the correct native image from the disk, and so avoid unnecessary "jitting".
The problem with NGen is that it's quite complicated to use, and if we are really looking to avoid performance penalties during our application's initialization, we will have to spend more of our valuable time registering our assemblies in the GAC, correctly setting our DLLs' base addresses in order to avoid rebasing, etc.
However, there is an alternative to using NGen: triggering the JIT to compile your code during runtime, exactly when you desire it. This solution has its own set of disadvantages and isn't necessarily better than using NGen, but if you want to speed up your JIT process with minimal effort on your part, this is a solution you will definitely want to consider.

So heading right down to business, I'll explain how this works from the bottom up: first we will understand how we are able to trigger the JIT compiler ourselves, in a noninvasive fashion (without actually executing the method), and then I'll explain how you can utilize this ability in your own existing applications.
If so, our first step is to learn how to trigger the JIT compiler on a single, specific method. For that, we will use the RuntimeHelpers.PrepareMethod function. Normally, we would use this method when we need to call virtual functions inside a CER region, but in our case, what is more interesting is that this method causes the JIT to compile the method that was passed to it (remark: it should be noted that calling PrepareMethod may trigger the static constructor of the destination type, that is, if it has one).
Since in this demonstration we'll want to JIT-compile the entire set of methods contained in a given assembly, we will use a utility method that iterates over the types contained in a given assembly and invokes PrepareMethod on every relevant method:

public static void PreJITMethods(Assembly assembly)
{
    Type[] types = assembly.GetTypes();
    foreach (Type curType in types)
    {
        MethodInfo[] methods = curType.GetMethods(
                BindingFlags.DeclaredOnly |
                BindingFlags.NonPublic |
                BindingFlags.Public |
                BindingFlags.Instance |
                BindingFlags.Static);

        foreach (MethodInfo curMethod in methods)
        {
            if (curMethod.IsAbstract ||
                curMethod.ContainsGenericParameters)
                continue;

            RuntimeHelpers.PrepareMethod(curMethod.MethodHandle);
        }
    }
}

One important thing about this method is the placement of the if statement right before the invocation of PrepareMethod. The first part of the condition is obvious: we have to ignore abstract methods since they don't contain any implementation, so the JIT compiler has nothing to do with them (attempting to pass them to PrepareMethod will result in an exception). The second part of the condition skips generic methods. We can't cause the JIT to compile a generic method, since at this stage we still don't know which type arguments will be supplied at the method's different call sites in the application's code. This is in fact a disadvantage versus NGen, which is aware of which "versions" of a generic method need to be compiled.

The next step is to integrate this utility method into our application's initialization routine. When doing so, there are a few things to consider. Would we rather trigger the code generation once, right after our application finishes loading? Or perhaps every time a new assembly is loaded into the AppDomain? Do we prefer executing this code synchronously, or asynchronously? And what about dedicating a thread with a lower priority than our other applicative threads? These questions will have to be answered according to your own preferences and your application's needs.
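
As one possible answer to these questions, the sketch below (the names are mine, and this is one policy among many, not a canonical recipe) runs the pre-JIT pass on a dedicated background thread with lowered priority, so the compilation work competes less with the application's own startup threads:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Threading;

static class Startup
{
    // Compact version of the PreJITMethods helper shown earlier in the post.
    static void PreJITMethods(Assembly assembly)
    {
        foreach (Type t in assembly.GetTypes())
        foreach (MethodInfo m in t.GetMethods(
            BindingFlags.DeclaredOnly | BindingFlags.NonPublic |
            BindingFlags.Public | BindingFlags.Instance | BindingFlags.Static))
        {
            if (m.IsAbstract || m.ContainsGenericParameters)
                continue;

            RuntimeHelpers.PrepareMethod(m.MethodHandle);
        }
    }

    // Pre-JIT asynchronously, at BelowNormal priority, as a background thread
    // so it never keeps the process alive on its own.
    public static Thread PreJitInBackground(Assembly assembly)
    {
        var worker = new Thread(() => PreJITMethods(assembly))
        {
            IsBackground = true,
            Priority = ThreadPriority.BelowNormal
        };
        worker.Start();
        return worker;
    }
}
```

A caller that wants the work finished before serving requests can simply Join the returned thread.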
Another thing to remember is that by default, the CLR delays the loading of a referenced assembly until the first time we actually reference a type from it in our code (so if we have a reference to a certain assembly, and we never reach a point where we need to use it, it won't be loaded into the AppDomain). So if we want to trigger the JIT compiler on all of the statically referenced assemblies, we will have to load them explicitly. In order to do so, we could recursively call GetReferencedAssemblies and Load. For example:

// recursively load all of the assemblies referenced by the given assembly
public static void ForceLoadAll(Assembly assembly)
{
    ForceLoadAll(assembly, new HashSet<Assembly>());
}

private static void ForceLoadAll(Assembly assembly,
                                 HashSet<Assembly> loadedAssemblies)
{
    bool alreadyLoaded = !loadedAssemblies.Add(assembly);
    if (alreadyLoaded)
        return;

    AssemblyName[] referencedAssemblies =
        assembly.GetReferencedAssemblies();

    foreach (AssemblyName curAssemblyName in referencedAssemblies)
    {
        Assembly nextAssembly = Assembly.Load(curAssemblyName);
        if (nextAssembly.GlobalAssemblyCache)
            continue;

        ForceLoadAll(nextAssembly, loadedAssemblies);
    }
}

In this example, I've chosen to ignore assemblies that are loaded from the GAC, so we won't compile the BCL's massive codebase (thus, minimizing our working set size). As you can understand, this method can be customized to fit your specific needs regarding which assemblies to load, and which to ignore.
Obviously, this method will only load assemblies that are statically referenced to our application. In order to trigger JIT compilation on dynamically loaded assemblies (as a product of using Reflection for instance), you may want to subscribe to the AssemblyLoad event that will be fired during the loading of new assemblies into the AppDomain.

In case you are interested in how you can positively confirm that a certain method was JIT-compiled by PrepareMethod (or by some other means), you will need to grab WinDbg and SOS.
All you need to do is find the address of your method's MethodDesc instance and check the value of its IsJitted field. Here is a typical example of how you would do so:

> !name2ee OtherAssmebly.dll C.ClassC
Module: 00a457b8 (OtherAssmebly.dll)
Token: 0x02000002
MethodTable: 00a48bbc
EEClass: 00f283d4
Name: C.ClassC

> !dumpmt -md 00a48bbc
EEClass: 00f283d4
Module: 00a457b8
Name: C.ClassC
mdToken: 02000002  (C:\...\OtherAssmebly.dll)
BaseSize: 0xc
ComponentSize: 0x0
Number of IFaces in IFaceMap: 0
Slots in VTable: 6
--------------------------------------
MethodDesc Table
   Entry MethodDesc      JIT Name
00a4c838   00a48bb0     NONE C.ClassC..ctor()
00a4c830   00a48ba0      JIT C.ClassC.FooMethod()

> !dumpmd 00a48ba0
Method Name: C.ClassC.FooMethod()
Class: 00f283d4
MethodTable: 00a48bbc
mdToken: 06000001
Module: 00a457b8
IsJitted: yes
CodeAddr: 00f50778

Tuesday, July 20, 2010

Monitor's Locking Primitive

Lately, a discussion in a C# user group raised the question: "Which synchronization primitive does the CLR use when I call Monitor.Enter?". Is a Mutex created by the OS? Maybe an Event? Or perhaps it's a user-mode primitive such as a CriticalSection? Apparently there's some vagueness on the subject, so in this post I will demonstrate how we can find the answer to the question using the tools available to us.
In general, the CLR's object-synchronization ability is implemented by allocating a SyncBlock for every object that we attempt to lock. Looking at the object's header, the CLR can find the corresponding SyncBlock object that belongs to that object. It's the SyncBlock's responsibility to synchronize the locking requests for that object.
One needs to remember that these synchronization abilities are a feature of the CLR, in the sense that they are implemented in the CLR, and not necessarily in (or using) the operating system's synchronization primitives. This matches the notion that, "theoretically", a managed thread doesn't have to be based on an operating system kernel thread. So basically, no one can guarantee that the CLR will always use one synchronization primitive or another. Today this isn't the case, and it doesn't seem like things are going to change in the near or far future.
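
Whatever primitive hides underneath, the observable semantics are those of Monitor itself. Here's a tiny sketch (the names are mine) of the kind of contention we're about to create in WinDbg, using TryEnter with a zero timeout so the losing thread reports failure instead of blocking forever:

```csharp
using System;
using System.Threading;

class MonitorDemo
{
    static readonly object Gate = new object();

    public static bool TryEnterWhileHeld()
    {
        bool acquired = true;

        Monitor.Enter(Gate);                 // this thread now owns the lock
        var t = new Thread(() =>
        {
            // A zero timeout makes TryEnter return immediately instead of
            // parking the thread on the CLR's internal wait primitive.
            acquired = Monitor.TryEnter(Gate, 0);
        });
        t.Start();
        t.Join();
        Monitor.Exit(Gate);

        return acquired;                     // false: the lock was already held
    }

    static void Main() => Console.WriteLine(TryEnterWhileHeld()); // False
}
```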

After a quick review of the documentation available in MSDN, one might get the impression that there's no real documentation about the underlying primitive being used. But since we are talking about a synchronization primitive, we can recall the IHostSyncManager interface that is exposed to us by the CLR's hosting ability. One of this interface's functionalities is the ability to replace the implementation of the synchronization primitive used by the Monitor class. This ability is exposed by the CreateMonitorEvent method.
Even at this stage we may pay attention to what is being said under the remarks paragraph:
CreateMonitorEvent returns an IHostAutoEvent  that the CLR uses in its implementation of the managed System.Threading.Monitor type. This method mirrors the Win32 CreateEvent function, with a value of false specified for the bManualReset parameter.
Even so, the keyword here is "mirrors", so there isn't a true guarantee about what happens in the CLR's internal implementation. In order to verify the thick hint we've just received, we are going to have to pull out the big guns, and use WinDbg.
For the sake of the test, I wrote up an application that produces an endless contention:

static void Main()
{
      Thread t1 = new Thread(() => { lock ("A") { while (true);} });
      t1.Start();

      lock ("A") { while (true);}
}

Once the application is running in the background, we can launch WinDbg and attach to the appropriate process.
After loading SOS, our first step will be to find the thread that lost the race for acquiring the lock. To do so, we will print the managed stacks of all of our threads, and when we find a "suitable" stack trace, we'll move to its context:

>~*e!clrstack // execute !clrstack on all of the threads 
OS Thread Id: 0xf80 (0) Child SP IP       Call Site
0012f3ec 030300fe ConsoleApplication1.Program.Main(System.String[]) [...]
0012f648 791421db [GCFrame: 0012f648]
OS Thread Id: 0xf4c (1)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0x840 (2)
Child SP IP       Call Site
02ccfe68 7c90e514 [DebuggerU2MCatchHandlerFrame: 02ccfe68]
OS Thread Id: 0xbe0 (3)
Child SP IP       Call Site
0313f67c 7c90e514 [GCFrame: 0313f67c]
0313f794 7c90e514 [GCFrame: 0313f794]
0313f7b0 7c90e514 [HelperMethodFrame_1OBJ: 0313f7b0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
0313f808 79b2e0c4 System.Threading.Monitor.Enter(System.Object, Boolean ByRef)
0313f818 03030163 ConsoleApplication1.Program.<Main>b__3() [...]
0313f848 79b2ae5b System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
0313f858 79ab7ff4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
0313f87c 79ab7f34 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0313f898 79b2ade8 System.Threading.ThreadHelper.ThreadStart()
0313fabc 791421db [GCFrame: 0313fabc]
0313fd80 791421db [DebuggerU2MCatchHandlerFrame: 0313fd80]
OS Thread Id: 0xe60 (4)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process

>~3s // change the current thread to 3

After finding the correct thread, we will have to check its native stack, so we'll use the kb command, which will also display the first 3 parameters that were passed to each method:

>kb
ChildEBP RetAddr  Args to Child             
0313f3c8 7c90df4a 7c809590 00000001 0313f3f4 ntdll!KiFastSystemCallRet
0313f3cc 7c809590 00000001 0313f3f4 00000001 ntdll!ZwWaitForMultipleObjects+0xc
0313f468 791f516a 00000001 001820bc 00000000 KERNEL32!WaitForMultipleObjectsEx+0x12c
0313f4cc 791f4f98 00000001 001820bc 00000000 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x56
0313f4ec 791f4dd8 00000001 001820bc 00000000 clr!Thread::DoAppropriateAptStateWait+0x4d
0313f580 791f4e99 00000001 001820bc 00000000 clr!Thread::DoAppropriateWaitWorker+0x17d
0313f5ec 791f4f17 00000001 001820bc 00000000 clr!Thread::DoAppropriateWait+0x60
0313f640 7919d409 ffffffff 00000001 00000000 clr!CLREvent::WaitEx+0x106
0313f654 792e0160 ffffffff 00000001 00000000 clr!CLREvent::Wait+0x19
0313f6e4 792e0256 001818a0 ffffffff 8079c412 clr!AwareLock::EnterEpilogHelper+0xa8
0313f724 792e029b 001818a0 001818a0 79142c0d clr!AwareLock::EnterEpilog+0x42
0313f744 792c7729 8079cb36 0313f830 00b3c368 clr!AwareLock::Enter+0x5f
0313f800 79b2e0c4 79161f8e 00941f02 0313f840 clr!JIT_MonReliableEnter_Portable+0x104
0313f840 79b2ae5b 00b3c3ec 01b3101c 0313f86c mscorlib_ni+0x2ae0c4
0313f850 79ab7ff4 00b3e010 00000000 00b3c3b8 mscorlib_ni+0x2aae5b
0313f86c 79ab7f34 00000000 00b3c3b8 00000000 mscorlib_ni+0x237ff4
0313f88c 79b2ade8 00b3c3b8 00000000 001818a0 mscorlib_ni+0x237f34
0313f8a4 791421db 000001a7 0313fae0 0313f930 mscorlib_ni+0x2aade8
0313f8b4 79164a2a 0313f980 00000000 0313f950 clr!CallDescrWorker+0x33
0313f930 79164bcc 0313f980 00000000 0313f950 clr!CallDescrWorkerWithHandler+0x8e

At this point we can already see that the last thing the thread did before we interrupted it was a call to WaitForMultipleObjectsEx, where the first parameter was 1 and the second was 0x001820BC. Having this information, we can understand that we are waiting on a single Handle object, since the first parameter specifies the effective size of the array that was passed as the second parameter. So all we've got left to do is to understand which object hides behind the Handle that was passed to the function.

>dp 0x001820BC 0x001820BC
001820bc  000006c8 // our handle's value
>!handle 000006c8 F // pass "F" as bitmask to display all of the relevant data
Handle 6c8
  Type             Event
  Attributes       0
  GrantedAccess    0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount      2
  PointerCount     4
  Name            
  Object Specific Information
    Event Type Auto Reset
    Event is Waiting


And so, this was our last step. We have confirmed that Monitor's synchronization primitive is in fact an Event object of the AutoReset type.
Whoever still wants to view the creation and usage of the Event in code can open up the sync.cpp file under the SSCLI's implementation, and see how a call to CLREvent::CreateMonitorEvent triggers a call to UnsafeCreateEvent (which is actually a typedef to the familiar CreateEvent function).

Even so, one has to remember that this is only a partial answer. As I've mentioned at the beginning of this post, there's no guarantee that when we call Monitor.Enter we will always find ourselves accessing some kernel object. In fact, in one of his posts, Joe Duffy makes sure to mention that in the CLR's implementation, when a thread encounters contention, it will attempt to spin a little before re-trying to acquire the lock, without leaving user-mode and waiting on some kernel object. So even if the CLR doesn't provide a full-blown implementation of a synchronization primitive, it may still provide some optimizations over the services supplied by the operating system (much like the CriticalSection type).
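To make the idea concrete, here is a minimal sketch in C# of a lock that spins in user-mode before falling back to an auto-reset event. To be clear, this is not the CLR's actual code; the spin count and the choice of AutoResetEvent are illustrative assumptions only:

```csharp
using System;
using System.Threading;

// A simplified spin-then-wait lock. This is NOT the CLR's implementation -
// just a sketch of the general idea: spin in user-mode first, and only
// fall back to a kernel event when the lock stays contended.
class SpinThenWaitLock
{
    private int _taken; // 0 = free, 1 = held
    private readonly AutoResetEvent _waitEvent = new AutoResetEvent(false);

    public void Enter()
    {
        // Optimistic phase: spin briefly, hoping the owner releases soon,
        // so we never have to touch a kernel object at all.
        for (int i = 0; i < 100; i++)
        {
            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0)
                return;
            Thread.SpinWait(20);
        }

        // Pessimistic phase: block on the kernel event until released.
        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
            _waitEvent.WaitOne();
    }

    public void Exit()
    {
        Interlocked.Exchange(ref _taken, 0);
        _waitEvent.Set(); // wake up one waiter, if any
    }
}

class Demo
{
    static void Main()
    {
        SpinThenWaitLock theLock = new SpinThenWaitLock();
        int counter = 0;

        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 100000; i++)
                {
                    theLock.Enter();
                    counter++; // protected by the lock
                    theLock.Exit();
                }
            });
            threads[t].Start();
        }
        foreach (Thread t in threads) t.Join();

        Console.WriteLine(counter); // 400000 when mutual exclusion holds
    }
}
```

Under low contention the fast path is just one interlocked instruction, which is exactly why spinning first can beat going straight to the kernel.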

Monday, July 5, 2010

Behind The .locals init Flag

As we all probably know, the C# language specification demands that every declared local variable be definitely assigned before its use.
Even so, whoever has used ILDASM to peek into the compiler's generated IL code must have noticed that right after the declaration of a method, there's a peculiar line that starts with .locals init. Something like this:

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       10 (0xa)
  .maxstack  1
  .locals init ([0] int32 x) <--- localsinit flag 
  IL_0000:  ldc.i4.4
  IL_0001:  stloc.0
  IL_0002:  ldloc.0
  IL_0003:  call       void [mscorlib]System.Console::WriteLine(int32) 
}

This line represents the existence of the CorILMethod_InitLocals flag in the current method's header. This flag effectively guarantees that the CLR will initialize all of the local variables declared in the method's body to their default values. Meaning, regardless of which value you have chosen to set your local variable to (in our case, the variable x gets the value of 4), the platform will always make sure that before our code executes, the local variable x will necessarily be initialized to its legal, default value (in this case, 0).

In Microsoft's implementation of the CLI, this flag always exists in the method's header (assuming there are local variables declared in its body). This could make us wonder why the compiler insists on reporting an error every time a programmer forgets to initialize a local variable before using it. This constraint held by the compiler may seem redundant, but in fact, there are several reasons why it is quite necessary.
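For those who want to see the flag without ILDASM, the reflection API exposes it through MethodBody.InitLocals. The following is a small sketch; the Demo method is just a hypothetical example, written with an out argument so that the compiler is forced to emit a local variable slot:

```csharp
using System;
using System.Reflection;

class Program
{
    // A hypothetical method that is guaranteed to have a local variable
    // slot (the out argument forces one), so the generated method header
    // will carry the CorILMethod_InitLocals flag.
    static void Demo()
    {
        int x;
        int.TryParse("4", out x);
        Console.WriteLine(x);
    }

    static void Main()
    {
        MethodBody body = typeof(Program)
            .GetMethod("Demo", BindingFlags.NonPublic | BindingFlags.Static)
            .GetMethodBody();

        // True when the CorILMethod_InitLocals flag is set in the header.
        Console.WriteLine(body.InitLocals);
    }
}
```

On Microsoft's C# compilers this prints True, matching the claim above that the flag is always emitted for methods with locals.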

Before diving into the meaning of the .locals init flag, let's review again the issue of the so-called "duplicate assignment" performed by the compiler and the CLR.
Looking at the IL code from before, one may think that every time we declare a local variable we have to pay some "initialization overhead", since both the compiler and the CLR insist on initializing it to some default value. Even though this overhead is quite minimal, it still gives us a bad vibe, since it just seems redundant.
But in fact, this duplicate assignment doesn't have to survive into the final, optimized code. The reason for that lies in the way the .locals init flag guarantees the default-value assignment. All it does is make sure the JIT compiler will generate code that initializes the variable before its use. In our case, the JIT will have to generate a mov instruction that will set our x variable to 0.
And so, the assembler code we get during run-time (without using optimizations), confirms it:

Normal JIT generated code
ConsoleApplication4.Program.Main(System.String[])
Begin 00e20070, size 30
00E20070 push        ebp
00E20071 mov         ebp,esp
00E20073 sub         esp,8
00E20076 mov         dword ptr [ebp-4],ecx
00E20079 cmp         dword ptr ds:[00942E14h],0
00E20080 je          00E20087
00E20082 call        7A0CA6C1 (JitHelp: CORINFO_HELP_DBG_IS_JUST_MY_CODE)

-------------------- Generated code due to the LocalsInit flag  ----------------
00E20087 xor         edx,edx                 // zero out the EDX register
00E20089 mov         dword ptr [ebp-8],edx // assign the value of EDX to the location of 'X'

--------------------- Our own application's code ---------------------------------
00E2008C mov         dword ptr [ebp-8],4   // assign the value 4 to the location of 'X'

00E20093 mov         ecx,dword ptr [ebp-8]
00E20096 call        79793E74 (System.Console.WriteLine(Int32), mdToken: 060007c3)
00E2009B nop
00E2009C mov         esp,ebp
00E2009E pop         ebp
00E2009F ret

If so, in this code example one can definitely see the effect of the localsInit flag on the generated assembler code. And also, we can see the "duplicate assignment" phenomenon mentioned before.
However, one should remember that this code was generated without any JIT optimizations. Once we allow these optimizations, we would see that the JIT is able to identify the duplicate assignment and treat it as dead code, since it has no effect on the variable. As a result, the first assignment is completely removed, and only the user's value initialization appears in the generated code:

Normal JIT generated code
ConsoleApplication4.Program.Main(System.String[])
Begin 00c80070, size 19
00c80070 push    ebp
00c80071 mov     ebp,esp
00c80073 call    mscorlib_ni+0x22d2f0 (792ed2f0) (System.Console.get_Out(), mdToken: 06000772)
00c80078 mov     ecx,eax
00c8007a mov   edx,4  // assign 4 to the "virtual representation" of X 
00c8007f mov     eax,dword ptr [ecx]
00c80081 call    dword ptr [eax+0BCh]

The first thing we notice is that we don't have a representation of x in our application's main memory. Instead, there's a CPU register that holds its "virtual" representation. But more importantly, we can now see that there's no remnant of the previous duplicate assignment we've witnessed due to the usage of the localsInit flag.

Now, we can check what in fact is the meaning behind the use of localsInit and the compiler's constraint of initializing local variables before their use.
Microsoft's argument regarding the definite-assignment mechanism is that most of the time, a programmer's failure to initialize a local variable is caused by a logical bug, and not by the programmer relying on the compiler to set the variable to its default value. In one of the comments on his blog, Eric Lippert says it himself:
"The reason we require definite assignment is because failure to definitely assign a local is probably a bug. We do not want to detect and then silently ignore your bug! We tell you so that you can fix it."
The importance of the localsInit flag can be summarized in one word: verification.
Verification is the process in which the CLI makes sure that all of the CIL code that exists in the application is "safe". This includes making sure that all of the methods we use are able to receive the number and types of parameters we pass them, that all of the local variables are properly initialized, etc.
In case the CLR detects a piece of code that fails the verification process, a VerificationException will be thrown.
Nonetheless, not every piece of IL code needs to be verifiable, as mentioned in Partition III of the standard:
"It is perfectly acceptable to generate correct CIL code that is not verifiable, but which is known to be memory safe by the compiler writer. Thus, correct CIL  might not be verifiable, even though the producing compiler might know that it is memory safe."
Having said that, every time we write some unverifiable code, we have to update the proper permissions using the SecurityPermissionAttribute, and explicitly tell the CLR not to perform verification on our code using the SkipVerification property (the CLR won't perform definite-assignment analysis on our code). One of the most common cases in which we would want to do this is when we write unsafe code. In such cases, we will have to explicitly mark the required check-box in the project's properties, thus allowing the compiler to compile our unsafe code blocks and making it add the UnverifiableCodeAttribute to the generated assembly, telling the CLR that this module is unverifiable.

The verification process requires that every local variable be initialized. To be exact, it requires that, in case no one has requested to skip the verification process, the localsInit flag must be present in the generated IL. For this reason, looking at the CIL instruction reference, you may encounter remarks such as this:
"Local variables are initialized to 0 before entering the method only if the localsinit on the method is true (see Partition I) ... System.VerificationException is thrown if the localsinit bit for this method has not been set, and the assembly containing this method has not been granted System.Security.Permissions.SecurityPermission.SkipVerification (and the CIL does not perform automatic definite-assignment analysis)"
Later on, the document addresses the overhead of performing the definite-assignment analysis during runtime:
"Performance measurements on C++ implementations (which do not require definite-assignment analysis) indicate that adding this requirement has almost no impact, even in highly optimized code. Furthermore, customers incorrectly attribute bugs to the compiler when this zeroing is not performed, since such code often fails when small, unrelated changes are made to the program."

Monday, June 28, 2010

String.Format Isn't Suitable for Intensive Logging

One of the first rules that every .Net programmer learns to memorize is "Never use the + operator to combine strings. Only use the StringBuilder class to do so". If you ask a random programmer why we should do so, the typical answer would be "Since strings in .Net are immutable, every time we use the + operator we actually cause a memory allocation, so in order to avoid it we should use the StringBuilder".
This answer is a common misconception. Is StringBuilder necessarily better than using the + operator, or perhaps other means for combining strings? No, not necessarily.

First of all, let's look at the + operator's behavior. Basically, what it does is allocate a new string and copy the characters from the two older strings into the new one. So theoretically, there's a serious negative impact on memory usage, since when combining multiple strings, we could find ourselves allocating redundant strings that will only be used to contain "parts" of the targeted final string.
This is only partially true, since the C# compiler will detect combinations of 5 strings and above using the + operator, and will convert them to use string.Concat.
For example, this code:

string str2 = 1.ToString() + 2.ToString() + 3.ToString() + 4.ToString() + 5.ToString();

Will be compiled into this code:

string[] CS$0$0000 = new string[] { 1.ToString(), 2.ToString(), 3.ToString(), 4.ToString(), 5.ToString() };
string text1 = string.Concat(CS$0$0000);

Regarding string.Concat, you shouldn't be worried about excessive memory allocations, since the first thing the method does is calculate the combined string's total length, create the string using a call to FastAllocateString, and then simply copy the contents of the old strings into the new one using wstrcpy (efficient raw byte copying using unsafe code).

One of the more common usages of string combining is log formatting. In case your application does lots of intensive logging, there's a good chance that it's spending some serious amounts of time formatting and combining strings that will eventually arrive at the logging framework.
The situation might change a little depending on which logging framework you use, but I'll demonstrate using log4net. In log4net, you may choose to use the simple logging methods without formatting (e.g. Debug/Info/Warn...), or you may choose to use the ones that support formatting (e.g. DebugFormat/InfoFormat/WarnFormat...). While the first option receives a plain string, the other receives a format string, with additional parameters that should be placed inside the string using string.Format. Now this could introduce a problem.
The issue here is that every time we would like to insert some parameter into our log, we are most likely to use log4net's Format methods (which use string.Format) instead of using string.Concat or perhaps string.Join, just because this API seems more convenient and approachable.
The following benchmark demonstrates the differences between the three alternatives:
 
while(true)
{
    Stopwatch sw = Stopwatch.StartNew();

    for (int i = 0; i < 500000; i++)
    {
        // 960ms
        string str1 = string.Format("{0}, {1}, {2}, {3}, {4}", 1, 2, 3, 4, 5);
        // 665ms
        string str2 = string.Concat(1, ", ", 2, ", ",3, ", ", 4, ", ", 5);
        // 566ms
        string str3 = string.Join(", ", new string[] { "1", "2", "3", "4", "5" });
    }

    Console.WriteLine(sw.ElapsedMilliseconds);
}

In this case, even though I've used a fairly simple log message, the differences between the methods are significant. Comparing the performance of string.Format to string.Concat, we achieved a 31% improvement. And in comparison to string.Join, we achieve a substantial 62% performance gain (string.Join's implementation doesn't use string.Concat under the covers, but works in a similar way by calling FastAllocateString and using the UnSafeCharBuffer class to construct the new string).
Don't think that this performance improvement comes at the cost of more intensive memory usage, since in this area, Concat and Join still achieve much better results than string.Format.
These results don't mean that StringBuilder will always produce worse performance. In cases in which you don't know in advance the number of concatenations you will need to perform, or in cases in which you could minimize memory allocations by setting the capacity parameter in the StringBuilder's constructor, you could actually achieve better results than these.
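For instance, here is a sketch of the capacity scenario mentioned above (the capacity value of 13 is just an illustrative guess, sized to fit the final string):

```csharp
using System;
using System.Text;

class CapacityDemo
{
    static void Main()
    {
        // Presetting the capacity means the internal buffer never has to
        // grow, so no intermediate re-allocations occur during the appends.
        StringBuilder sb = new StringBuilder(13);
        for (int i = 1; i <= 5; i++)
        {
            if (i > 1)
                sb.Append(", ");
            sb.Append(i);
        }

        Console.WriteLine(sb.ToString()); // 1, 2, 3, 4, 5
    }
}
```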

Of course, if you perform your logs only "once in a while", this performance difference is almost meaningless, and the fact that using FooFormat might be more comfortable could be just enough to justify its usage. But in cases in which your application performs some intensive logging, these performance differences might be substantial for you.

Sunday, June 13, 2010

Headaches with Prefix and Temporary Variables

No less than 6 years ago, Luca Bolognese wrote a post that dealt with an issue raised in the C# User Group. The question was "What should be the value of 'x' after executing the following lines of code?"


int x = 3;
x += x++;

If we're aware of the differences between prefix and postfix increments, then we should realize that the value of x will be 6, since the last increment is lost. In fact, we could simplify the expression by writing x = x + x, since we'd still get the same result (it's important to understand that this is the case specifically for C#; in C++, for example, the behavior of this expression is undefined).
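For reference, a tiny snippet that confirms this behavior in C#:

```csharp
using System;

class PostfixDemo
{
    static void Main()
    {
        int x = 3;
        // Evaluated as: x = x + x++;
        // the left operand reads 3, the right operand reads 3 (and only
        // then bumps x to 4), and the assignment overwrites x with 6.
        x += x++;
        Console.WriteLine(x); // 6
    }
}
```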

So after we've passed this starting point, allow me to take one step forward and suggest the following, highly readable, lines of code:


int x = 10;
x = --x + x + --x;

This is the point in which things get a little more interesting. What would be the value of x after execution? This expression can be quite confusing, so I suggest taking a couple of minutes to make sure you're not missing anything.
Ready? If your answer was "26", then you're correct. Why? Well, there's a subtle hint in the post's title.

What's confusing about this expression is that we always need to keep track of where our variable's value is stored. Is it in main memory? A CPU register? Or perhaps a totally different location? In fact, during execution, the program actually reserves 4 bytes on the thread's stack to hold the result of part of the computation.
In the first step, we decrement x's value by 1 and update its value in main memory (it could effectively be in the CPU cache, but that part isn't relevant to our case). At this point, x is set to 9. Next, we multiply x's value by 2 (x+x) and store the result in a temporary variable located on the stack. Note that we don't update the value of the original x variable at this time. Afterwards, we decrement x again by 1 (now x is set to 8), and then add it to the value of the temporary variable. The result of this last computation goes into x. So by the end of this single line of code, x is set to 26.
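The same steps can be written out explicitly with a temporary variable. This expansion mirrors the description above; it's not literally what the compiler emits:

```csharp
using System;

class PrefixDemo
{
    static void Main()
    {
        int x = 10;

        // Step-by-step expansion of: x = --x + x + --x;
        x = x - 1;          // first --x: x is now 9
        int temp = x + x;   // 9 + 9 = 18, held in a stack temporary
        x = x - 1;          // second --x: x is now 8
        x = temp + x;       // 18 + 8 = 26, assigned back to x

        Console.WriteLine(x); // 26
    }
}
```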

Looking at the generated assembler code, things might look a bit clearer:
 
00000019  mov  dword ptr [ebp-4],0Ah   // x = 10
00000020  dec  dword ptr [ebp-4]       // x = 9
00000023  mov  eax, dword ptr [ebp-4]  // x = 9, eax = 9
00000026  add  eax, dword ptr [ebp-4]  // x = 9, eax = 18
00000029  mov  dword ptr [ebp-8],eax   // x = 9, eax = 18, temp = 18
0000002c  dec  dword ptr [ebp-4]       // x = 8, eax = 18, temp = 18
0000002f  mov  eax, dword ptr [ebp-8]  // x = 8, eax = 18, temp = 18
00000032  add  dword ptr [ebp-4],eax   // x = 26

Wednesday, June 2, 2010

DateTime.Now Causes Boxing

Update 2.9.2010
The content of this post regards DateTime.Now's implementation until v3.5. At the release of v4.0, its internal implementation was updated to avoid unnecessary boxing.
However, looking into the new implementation details revealed a new source for dynamic memory allocations in DateTime.Now. The issue is discussed in detail in the updated post: DateTime.Now in v4.0 Causes Dynamic Memory Allocations.


Perhaps you weren't aware of it, but every time you access DateTime.Now, you cause a dynamic memory allocation through the boxing of an Int32 variable.
The reason for this lies deep inside the implementation of the Now property. If we look a bit closer using Reflector, we can see that it actually wraps a call to UtcNow and then converts it to local time using the ToLocalTime method.
While there's nothing wrong with the call to UtcNow (it just wraps a call to the Win32 GetSystemTimeAsFileTime), there is some trouble with ToLocalTime, or actually, with the CurrentSystemTimeZone class that is being used. During the conversion, the method GetDaylightChanges is called, and it causes the boxing. You can look for yourself:

    public override DaylightTime GetDaylightChanges(int year)
    {
        object key = year;
        if (!this.m_CachedDaylightChanges.Contains(key))
        {
            // ..lots of code
        }
 
        return (DaylightTime)this.m_CachedDaylightChanges[key];
    }

This is a good example showing that even in the latest versions of the .Net Framework, some types still use non-generic data structures from the days of 1.0/1.1. In this case, the data member m_CachedDaylightChanges is actually a Hashtable.
This means that you might find plenty more cases in which unnecessary boxing occurs without your knowing it. But what makes the DateTime.Now case especially serious is that it's a very commonly accessed property, especially by logging frameworks, for instance.
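A small sketch of the difference (not log4net's or the BCL's code, just an illustration): with a non-generic Hashtable, every Int32 key is boxed into a new heap object, while a generic Dictionary<int, string> stores the key unboxed:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class BoxingDemo
{
    static void Main()
    {
        int year = 2010;

        // Non-generic Hashtable: the Int32 key must become an object,
        // so each of these lines boxes 'year' into a new heap object.
        Hashtable cached = new Hashtable();
        cached[year] = "boxed key";
        bool found = cached.Contains(year);

        // Generic Dictionary: the key is stored as an unboxed Int32,
        // so no allocation happens for the key on lookups.
        Dictionary<int, string> generic = new Dictionary<int, string>();
        generic[year] = "unboxed key";
        bool foundGeneric = generic.ContainsKey(year);

        Console.WriteLine("{0} {1}", found, foundGeneric); // True True
    }
}
```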

All of those boxings come with a price, and it's not cheap. Due to the high frequency of dynamic memory allocations, we could cause some serious stress on the GC.
In order to demonstrate the impact on memory usage, I'll use the following benchmark:

        while (true)
        {
            Stopwatch sw = Stopwatch.StartNew();
 
            for (int i = 0; i < 1000000; i++)
            {
                DateTime now = DateTime.Now;
            }
 
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
After executing this code, we can learn two things. The first is that it takes 456ms on average to complete each iteration of the while loop (I'll address this result in a moment), and the other is the magnitude of the effect that calling DateTime.Now has on our application's memory usage patterns. In this case, I've used perfmon to monitor the application's memory allocation rate:


Even though most of these allocations will be released by the GC in a generation 0 collection, we still can't take them for granted, especially if you're developing an application where you need to minimize dynamic memory allocations as much as possible, mostly because you want to reduce the number of collections performed by the GC.

How to Avoid The Boxing?
Now that we've seen why we shouldn't call DateTime.Now (in certain scenarios), the question arises: how can we avoid those unnecessary memory allocations?
First, there's Microsoft's side, which needs to update their code to avoid the boxing (e.g. by using a Dictionary instead of a Hashtable). But until that happens, we'll need to take matters into our own hands and work around the issue. My suggestion is to write a wrapper method that is almost identical to UtcNow's implementation (which simply wraps a call to the Win32 API). All we need to do is call GetLocalTime, which will return the current local time in SYSTEMTIME format, and then convert it back to a DateTime.
For example:

public static class TimeUtil
{
    [DllImport("kernel32.dll")]
    static extern void GetLocalTime(out SYSTEMTIME time);
 
    [StructLayout(LayoutKind.Sequential)]
    private struct SYSTEMTIME
    {
        public ushort Year;
        public ushort Month;
        public ushort DayOfWeek;
        public ushort Day;
        public ushort Hour;
        public ushort Minute;
        public ushort Second;
        public ushort Milliseconds;
    }
 
    public static DateTime LocalTime
    {
        get
        {
            SYSTEMTIME nativeTime;
            GetLocalTime(out nativeTime);
 
            return new DateTime(nativeTime.Year, nativeTime.Month, nativeTime.Day,
                                nativeTime.Hour, nativeTime.Minute, nativeTime.Second,
                                nativeTime.Milliseconds, DateTimeKind.Local);
        }
    }
}

Running the benchmark from before with this updated implementation, we can see that we completely got rid of the memory allocations, and as a bonus, every iteration takes only 370ms. That is, an 18% improvement over the performance of DateTime.Now's implementation.