Liran Chen's Blog

Sunday, August 31, 2014

Unhandled Exceptions in PerformIOCompletionCallback

Debugging unhandled exceptions in your code might not be the way you've planned to spend your afternoon. On the other hand, at least its your code you're debugging and not, let's say... the Frameworks'. In this post, I'll talk about the latter, and more specifically, a case that popped up in the PerformIOCompletionCallback function.

The main problem with exceptions that are being thrown before the CLR has a chance to execute your own code, is that the call stack is almost useless. A NullReferenceException that was thrown while handling an IOCompletion might consist only from the following line:

System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0x54

Not very helpful. Since none of our code was executed, we can't simply see which variable was null or even to understand which module in our application caused this exception. In order the proceed with the analysis, it's time to pull out the big guns and load a full memory dump of the process into WinDbg.

Once the dump, SOS and the appropriate symbols are loaded, we should be able to find the problematic thread using the ~*e!clrstack command. Using the faulting instruction pointer (IP), we could use the !u command to dump the assembly of the failing function. The output should be similar to this one (truncated for brevity):

0:073> !u 000007FEF33888A4
preJIT generated code
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Begin 000007fef3388850, size 11e. Cold region begin 000007fef39340f0, size 3b
Hot region:
000007fe`f3388850 53 push rbx
000007fe`f3388851 55 push rbp
000007fe`f3388852 57 push rdi
000007fe`f3388853 4883ec40 sub rsp,40h
000007fe`f3388857 488d6c2420 lea rbp,[rsp+20h]
000007fe`f338885c 4c894550 mov qword ptr [rbp+50h],r8
000007fe`f3388860 895548 mov dword ptr [rbp+48h],edx
000007fe`f3388863 894d40 mov dword ptr [rbp+40h],ecx
000007fe`f3388866 48896500 mov qword ptr [rbp],rsp
000007fe`f338886a 33c0 xor eax,eax
000007fe`f338886c 48894510 mov qword ptr [rbp+10h],rax
000007fe`f3388870 48c7450800000000 mov qword ptr [rbp+8],0
000007fe`f3388878 488d5508 lea rdx,[rbp+8]
000007fe`f338887c 488d0d6d7e71ff lea rcx,[mscorlib_ni+0x1b06f0]
000007fe`f3388883 e820a886ff call mscorlib_ni+0x3030a8
000007fe`f3388888 ba9d010000 mov edx,19Dh
000007fe`f338888d b901000000 mov ecx,1
000007fe`f3388892 e889a786ff call mscorlib_ni+0x303020
000007fe`f3388897 488b4d50 mov rcx,qword ptr [rbp+50h]
000007fe`f338889b e8d07b89ff call mscorlib_ni+0x330470 (System.Threading.OverlappedData.GetOverlappedFromNative(...)
000007fe`f33888a0 488b7820 mov rdi,qword ptr [rax+20h]
000007fe`f33888a4 488b4708 mov rax,qword ptr [rdi+8]
000007fe`f33888a8 488b5818 mov rbx,qword ptr [rax+18h]
000007fe`f33888ac 4885db test rbx,rbx
000007fe`f33888af 741a je mscorlib_ni+0xa988cb (000007fe`f33888cb)
...

According to Microsoft's x64 register usage, at the time of the exception, the RAX register contained the return value of the GetOverlappedFronNative(...) function. Then, after a little playing around with certain field offsets, a NullReferenceException is thrown. Such errors often occur when the owner object (socket, file stream etc.) is disposed while there's an outstanding IO operation. Once the asynchronous operation is completed, the CLR attempts to use the disposed object - and we enter the wonderful world of undefined behavior. Though the usual result is quite deterministic: a complete crash of the process due to the unhandled exception. Needles to say, you should never dispose an object before you're sure it has no remaining asynchronous IO operations in progress.

This brings us to the final step. How can we check that the target object was indeed already disposed, and how can we tell which part of our codebase is responsible for this unfortunate bug?
First, we'll need to retrieve the OverlappedData instance that the native pointer is mapped into. This is exactly the purpose of the GetOverlappedFromNative function, but since it's implemented inside the CLR, we can't simply use a decompiler to see its implementation. Instead, we could use the u command again. This time, the output should be similar to this:

0:073> !u 000007fe`f2c20470
Unmanaged code
000007fe`f2c20470 e9bb8d1a01 jmp clr!GetOverlappedFromNative (000007fe`f3dc9230)

0:073> !u 000007fe`f3dc9230
Unmanaged code
000007fe`f3dc9230 488d41b0 lea rax,[rcx-50h]
000007fe`f3dc9234 c3 ret

After a short redirection into the CLR, we can see the function simply draws the pointer back by 0x50 bytes (that is, the native structure is actually contained by the managed one). This is consistent with the function's implementation in the SSCLI:

static OverlappedDataObject* GetOverlapped (LPOVERLAPPED nativeOverlapped)
{
...
return (OverlappedDataObject*)((BYTE*)nativeOverlapped - offsetof(OverlappedDataObject, Internal));
}

Dumping the contents of the OverlappedData's instance should give an output similar to this:

0:073> !do 0x0000000005043e08-0x50
Name: System.Threading.OverlappedData
MethodTable: 000007fef2903180
EEClass: 000007fef2a0a110
Size: 120(0x78) bytes
File: C:\...\mscorlib.dll
Fields:
... Offset Type VT Attr Value Name
... 8 System.IAsyncResult 0 instance 0000000000000000 m_asyncResult
... 10 ...ompletionCallback 0 instance 0000000005a46028 m_iocb
... 18 ...ompletionCallback 0 instance 0000000000000000 m_iocbHelper
... 20 ...eading.Overlapped 0 instance 0000000005a46010 m_overlapped
... 28 System.Object 0 instance 0000000013335c18 m_userObject
... 30 ...ppedDataCacheLine 0 instance 00000000050436d8 m_cacheLine
... 38 System.IntPtr 1 instance 221828 m_pinSelf
... 40 System.IntPtr 1 instance 0 m_userObjectInternal
... 48 System.Int32 1 instance 1 m_AppDomainId
... 4c System.Int16 1 instance 13 m_slot
... 4e System.Byte 1 instance 0 m_isArray
... 4f System.Byte 1 instance 0 m_toBeCleaned
... 50 ....NativeOverlapped 1 instance 0000000005043e08 m_nativeOverlapped

Dumping the m_iocb member and then m_target, should give you the problematic object. In the demonstrated case, the object is an instance of SocketAsyncEventArgs, so dumping it and looking the the m_DisposedCalled field proves the assumed theory:

0:073> !do 0000000005a53b58
Name: System.Net.Sockets.SocketAsyncEventArgs
MethodTable: 000007fef21248c0
EEClass: 000007fef1e6d550
Size: 440(0x1b8) bytes
File: C:\...\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
... Offset Type VT Attr Value Name
... 80 System.EventArgs 0 shared static Empty
... >> Domain:Value 0000000000112a40:0000000003326168 <<
... 8 ...et.Sockets.Socket 0 instance 0000000000000000 m_AcceptSocket
... 10 ...et.Sockets.Socket 0 instance 0000000000000000 m_ConnectSocket
... 18 System.Byte[] 0 instance 0000000013335c18 m_Buffer
... 170 System.Net.WSABuffer 1 instance 0000000005a53cc8 m_WSABuffer
... e8 System.IntPtr 1 instance 13335c28 m_PtrSingleBuffer
... 130 System.Int32 1 instance 32768 m_Count
... 134 System.Int32 1 instance 0 m_Offset
... 20 ...lib]], mscorlib]] 0 instance 0000000000000000 m_BufferList
... 16c System.Boolean 1 instance 0 m_BufferListChanged
... 28 ...m.Net.WSABuffer[] 0 instance 0000000000000000 m_WSABufferArray
... 138 System.Int32 1 instance 32768 m_BytesTransferred
... 30 ...entArgs, System]] 0 instance 0000000000000000 m_Completed
...
... 16f System.Boolean 1 instance 1 m_DisposeCalled
...

In order to find out which part of our code initiated the current IO operation, we could use !gcroot on the target object or the m_userObject member. In some cases, simply examining the contents of m_userObject could give us a rough idea regarding the identity of the culprit.

Friday, August 12, 2011

Writing a Manual Memory Manager in C#

Garbage collection. Aye? or Nay?
As usual, it depends. That is, on which developer you might ask. Some like to have as much control as possible over the way their code executes, while others simply love the fact that they don't have to deal with the "mundane" job of keeping track on their memory allocations.
Since there aren't really any "absolute truths" in anything related to programming, in reality you'd sometimes want to have complete control over your memory management, while at other times you wouldn't really care about it "as long as it gets done".
Since we're mostly discussing .Net and here, we could say that we've got the "as long as it gets done" part covered quite well, by the CLR's garbage collection mechanism. So it's time to see how we could approach implementing a manual memory manager in C#.

What we've eventually would like to have, is an API that would enable us to allocate and deallocate typed memory on demand. Of course C# doesn't natively support the new and delete keywords we so kindly remember from C++, so we'll have come up with our own utility functions to do the job.
Eventually, our code should look something similar to this:

static void Main()

{

    ITypeAllocator mgr = new ManalocManager();

    IFooData obj = mgr.New<IFooData>();

    obj.Bar = 1;

    obj.Car = 2;

    obj.Jar = 3;

    obj.Tar = 4;

    mgr.Delete(obj);

}

Disabling the garbage collector completely is an unreasonable thing to do in a platform such as .Net. Doing so would probably miss the platform's purpose. Anyone who truly wants to have complete control over the execution of its program wouldn't bother using C# anyway (or any other managed language for that matter).
However, while using C#, there might be some times that we'll want to manage our own memory, instead of having the garbage collector doing it for us. And even if not, it's still a subject interesting enough to explore and mostly play with.

In order to demonstrate how we could do achieve manual memory management in C#, lets have a look at the following interface:

public interface IFooData

{

    int Bar { get; set; }

    long Jar { get; set; }

    double Car { get; set; }

    byte Tar { get; set; }

}

The classic method to implement this interface would be to create a class with four members that will match the coresponding properties. However, doing so will result in a 21 bytes structure that will reside in the GC heap (not counting padding and the preceding object header).
Instead, we could allocate the required memory block in the native heap (using AllocHGlobal) and modify out propertie's to access the native memory at the required offsets (e.g. 0 for Bar, 4 for Jar and 12 for Car). Using a Delete method, we could free the native memory block on demand, when we please.

public class FooData : IFooData

{

    private unsafe byte* _native;

    public FooData()

    {

        unsafe

        {

            _native = (byte*)(Marshal.AllocHGlobal(21).ToPointer());

        }

    }

    public int Bar

    {

        get { unsafe { return *(int*)&_native[0]; } }

        set { unsafe { *(int*)&_native[0] = value; } }

    }

    public long Jar

    {

        get { unsafe { return *(long*)&_native[4]; } }

        set { unsafe { *(long*)&_native[4] = value; } }

    }

    public double Car

    {

        get { unsafe { return *(double*)&_native[12]; } }

        set { unsafe { *(double*)&_native[12] = value; } }

    }

    public byte Tar

    {

        get { unsafe { return *(byte*)&_native[20]; } }

        set { unsafe { *(byte*)&_native[20] = value; } }

    }

    public void Delete()

    {

        unsafe

        {

            Marshal.FreeHGlobal(new IntPtr(((void*)(_native))));

            _native = (byte*)(IntPtr.Zero);

        }

    }

}

The problem with such implementation is that it could be very tedious to code and implement. Even for simple structures like IFooData, the resulting implementation could be quite taunting.
Fortunately enough, we can automate the implementation process by adding a code generator that will implemenet our interfaces on the fly, during runtime.
The following interface should loosely describe the capabilities our manual memory manager should support:

public interface ITypeAllocator

{

    T New();

    void Delete(T instance);

    void PreGenerate(params Type[] types);

}

The generic parameter T accepts user-defined data representing interfaces such as IFooData.
Once the New method is called, our manager should generate code, compile it and instancate it during runtime. The resulting instance is then returned to the caller for it to be used. Once it finishes using it, and wants to release its memory, it calls the Delete method.
The PreGenerate method's purpose is the optimize the code's generation/compilation process. Once the user pre-generates a type, it won't have to wait on the first call to the New method (much like the process of forcing the JIT compiler to execute on your assemblies).

When it comes to code generation, there are basically two ways to choose from: CodeDOM and Templates. Each one of them has its pros and cons, personaly I tend to prefer the CodeDOM way of doing things. While using it could result in quite verbose code, I believe that its easier to maintain in larger projects than templates.
Unforutantly, .Net's CodeDOM model doesn't support unsafe code, so I had to resort to using some workarounds to represent all of the unsafe code blocks.
This should be a good time to mention the Refly library which wraps around .Net CodeDOM API, making it much simpler and innutative to use.
The demonstrated implementation is very naive and limited regarding the kind of types it is able to generate, though it should illustrate the discussed concept.

public class ManalocManager : ITypeAllocator

{

    // key: userType, value: generatedType

    private Dictionary<Type, Type> m_generatedTypesCache;

    public ManalocManager()

    {

        m_generatedTypesCache = new Dictionary<Type, Type>();

    }

    public void Delete(T instance)

    {

        if (!(instance is IManalocGeneratedType))

            throw new ArgumentException("Attempted to delete an unexpected type");

        IManalocGeneratedType generatedType = (IManalocGeneratedType)instance;

        generatedType.Delete();

    }

    public void PreGenerate(params Type[] types)

    {

        foreach (Type curUserType in types)

            generateAndAddToCache(curUserType);

    }

    public T New()

    {

        Type userType = typeof(T);

        Type generatedType;

        bool alreadyGenerated = m_generatedTypesCache.TryGetValue(userType, out generatedType);

        if (!alreadyGenerated)

            generatedType = generateAndAddToCache(userType);

        object result = Activator.CreateInstance(generatedType);

        return (T)result;

    }

    private Type generateAndAddToCache(Type userType)

    {

        Type generatedType = generateProxy(userType);

        m_generatedTypesCache.Add(userType, generatedType);

        return generatedType;

    }

    private Type generateProxy(Type userType)

    {

        NamespaceDeclaration ns;

        string typeName = createType(userType, out ns);

        string sourceFile = generateCode(ns);

        Assembly compiledAssembly = compile(userType, sourceFile);

        Type compiledType = compiledAssembly.GetType(typeName);

        return compiledType;

    }

    private string createType(Type userType, out NamespaceDeclaration namespaceDec)

    {

        PropertyInfo[] userProperties = userType.GetProperties();

        namespaceDec = new NamespaceDeclaration("Manaloc.AutoGenerated");

        namespaceDec.Imports.Add(userType.Namespace);

        ClassDeclaration classDec = namespaceDec.AddClass(userType.Name + "_Manaloc_AutoGenerated");

        classDec.Interfaces.Add(userType);

        classDec.Interfaces.Add(typeof(IManalocGeneratedType));

        FieldDeclaration nativeMember = classDec.AddField("unsafe byte*", "native");

        addConstructor(userProperties, classDec, nativeMember);

        addDeleteMethod(classDec, nativeMember);

        addProperties(userProperties, classDec, nativeMember);

        string typeName = namespaceDec.Name + "." + classDec.Name;

        return typeName;

    }

    private void addConstructor(PropertyInfo[] userProperties, ClassDeclaration classDec, FieldDeclaration nativeMember)

    {

        int totalSize = sumSize(userProperties);

        ConstructorDeclaration ctor = classDec.AddConstructor();

        ctor.Body.Add(Stm.Snippet("unsafe{"));

        ctor.Body.AddAssign(

            Expr.This.Field(nativeMember),

            Expr.Cast(typeof(byte*), Expr.Type(typeof(Marshal)).Method("AllocHGlobal").Invoke(Expr.Prim(totalSize)).

            Method("ToPointer").Invoke()));

        ctor.Body.Add(Stm.Snippet("}"));

    }

    private void addDeleteMethod(ClassDeclaration classDec, FieldDeclaration nativeMember)

    {

        MethodDeclaration disposeMethod = classDec.AddMethod("Delete");

        disposeMethod.Attributes = MemberAttributes.Final | MemberAttributes.Public;

        disposeMethod.Body.Add(Stm.Snippet("unsafe{"));

        disposeMethod.Body.Add(

            Expr.Type(typeof(Marshal)).Method("FreeHGlobal").Invoke(

            Expr.New(typeof(IntPtr), Expr.Cast("void*", Expr.This.Field(nativeMember)))));

        disposeMethod.Body.AddAssign(Expr.This.Field(nativeMember),

            Expr.Cast("byte*", Expr.Type(typeof(IntPtr)).Field("Zero")));

        disposeMethod.Body.Add(Stm.Snippet("}"));

    }

    private void addProperties(PropertyInfo[] userProperties, ClassDeclaration classDec, FieldDeclaration nativeMember)

    {

        int offset = 0;

        foreach (PropertyInfo curProperty in userProperties)

        {

            Type propType = curProperty.PropertyType;

            int propSize = Marshal.SizeOf(propType);

            PropertyDeclaration propDec = classDec.AddProperty(propType, curProperty.Name);

            propDec.Attributes = MemberAttributes.Final | MemberAttributes.Public;

            if (curProperty.CanRead)

                addGetter(nativeMember, offset, propType, propDec);

            if (curProperty.CanWrite)

                addSetter(nativeMember, offset, propType, propDec);

            offset += propSize;

        }

    }

    private void addSetter(FieldDeclaration nativeMember, int offset, Type propType, PropertyDeclaration propDec)

    {

        propDec.Set.Add(Stm.Snippet("unsafe{"));

        propDec.Set.Add(Stm.Snippet("*(" + propType.Name + "*)&"));

        propDec.Set.AddAssign(Expr.This.Field(nativeMember).Item(offset), Expr.Value);

        propDec.Set.Add(Stm.Snippet("}"));

    }

    private void addGetter(FieldDeclaration nativeMember, int offset, Type propType, PropertyDeclaration propDec)

    {

        propDec.Get.Add(Stm.Snippet("unsafe{"));

        propDec.Get.Add(Stm.Snippet("return *(" + propType.Name + "*)&"));

        propDec.Get.Add(Expr.This.Field(nativeMember).Item(offset));

        propDec.Get.Add(Stm.Snippet("}"));

    }

    private string generateCode(NamespaceDeclaration ns)

    {

        string sourceFile = null;

        const string outDir = "ManalocAutoGenerated";

        if (!Directory.Exists(outDir))

            Directory.CreateDirectory(outDir);

        Refly.CodeDom.CodeGenerator generator = new Refly.CodeDom.CodeGenerator();

        generator.CreateFolders = false;

        generator.FileCreated += (object sender, StringEventArgs args) => { sourceFile = args.Value; };

        generator.GenerateCode(outDir, ns);

        if (sourceFile == null)

            throw new Exception("Faliled to generate source file");

        return sourceFile;

    }

    private Assembly compile(Type userType, string sourceFile)

    {

        CompilerParameters compilerParams = new CompilerParameters();

        compilerParams.CompilerOptions = "/unsafe /optimize";

        compilerParams.ReferencedAssemblies.Add(userType.Assembly.Location);

        CompilerResults result =

            Refly.CodeDom.CodeGenerator.CsProvider.CompileAssemblyFromFile(compilerParams, new string[] { sourceFile });

        Assembly compiledAssembly = result.CompiledAssembly;

        return compiledAssembly;

    }

    private int sumSize(PropertyInfo[] userProperties)

    {

        int size = 0;

        foreach (PropertyInfo curProperty in userProperties)

            size += Marshal.SizeOf(curProperty.PropertyType);

        return size;

    }

}

public interface IManalocGeneratedType

{

    void Delete();

}

Monday, October 25, 2010

The Case of NUnit Hanging During Startup

Recently, a coworker of mine encountered a strange behavior in NUnit. Every time he'd open NUnit's graphical interface, it would freeze and stop responding. Even though a test haven't even started to execute, the application simply froze, leaving the user only to wait patiently and stare helplessly at the screen.

In order to investigate the issue, I've fired up Windbg and attached it to the hanged process.
After loading SOS, the !clrstack command was issued so I could get a better understanding to what was keeping the application busy. Since NUnit's graphical interface stopped from responding, one could already assume that the main thread stopped handling messages from the message pump for some reason.
Reviewing the command's output confirmed that assumption. (The outputs in the post were edited for brevity).

0:007> ~*e!clrstack
PDB symbol for mscorwks.dll not loaded
OS Thread Id: 0xe4 (0)
ESP       EIP     
0012ee44 7c90e4f4 Win32Native.GetFileAttributesEx(...)
0012ee58 792e03f6 System.IO.File.FillAttributeInfo(...)
0012eeec 7927ff71 System.IO.File.InternalExists(System.String)
0012ef20 792e96a6 System.IO.File.Exists(System.String)
0012ef4c 03a214c3 NUnit.Util.RecentFileEntry.get_Exists()
0012ef54 00e0fb7a NUnit.Gui.NUnitForm.NUnitForm_Load(Object, EventArgs)
0012ef8c 7b1d4665 System.Windows.Forms.Form.OnLoad(EventArgs)
0012efc0 7b1d4105 System.Windows.Forms.Form.OnCreateControl()
0012efcc 7b1c6d11 System.Windows.Forms.Control.CreateControl(Boolean)
0012f008 7b1c6b14 System.Windows.Forms.Control.CreateControl()
0012f020 7b1d2fc8 System.Windows.Forms.Control.WmShowWindow(Forms.Message)
0012f05c 7b1c8906 System.Windows.Forms.Control.WndProc(Forms.Message)
0012f060 7b1d1d6a [InlinedCallFrame: 0012f060] 
...
0012f418 7b195911 System.Windows.Forms.Application.Run(System.Windows.Forms.Form)
0012f42c 00e00643 NUnit.Gui.AppEntry.Main(System.String[])
0012f464 00e00076 NUnit.Gui.Class1.Main(System.String[])
0012f688 79e71b4c [GCFrame: 0012f688] 
OS Thread Id: 0x1464 (1)
Unable to walk the managed stack. The current thread is likely not a 
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0xdac (2)
Failed to start stack walk: 80004005

Evidently, NUnit is performing a synchronous IO operation on its main thread, which for some reason seems to refuse to complete. It happens to be a good example to why it's considered to be a bad practice to perform "heavy-weight lifting" on the main UI thread, or synchronous IO in general.

After obtaining this piece of information, it was required to understand why the operation doesn't complete. Perhaps the path of the requested file could shed some light on the observed behavior.
Since the path is passed across several managed methods before entering the Win32 API, I've attempted to dump the values of the parameters in the managed call stack using the !clrstack -p command.

0:000> !clrstack -p
OS Thread Id: 0xe4 (0)
ESP       EIP     
0012ee44 7c90e4f4 Win32Native.GetFileAttributesEx(...)
0012ee58 792e03f6 File.FillAttributeInfo(System.String, ...)
    PARAMETERS:
        path = no data
        data = 0x0012eeec
        tryagain = no data
        returnErrorOnNotFound = 0x00000001

0012eeec 7927ff71 System.IO.File.InternalExists(System.String)
    PARAMETERS:
        path = no data

0012ef20 792e96a6 System.IO.File.Exists(System.String)
    PARAMETERS:
        path = no data

0012ef4c 03a214c3 NUnit.Util.RecentFileEntry.get_Exists()
    PARAMETERS:
        this = no data

0012ef54 00e0fb7a NUnit.Gui.NUnitForm.NUnitForm_Load(Object, EventArgs)
    PARAMETERS:
        this = 0x013349ec
        sender = no data
        e = no data

Unfortunately, SOS wasn't able to retrieve the address of the managed string (thus, stating no data). It's likely that either the variable was stored in a register that was already overwritten by some later executing code, or perhaps we're just witnessing some of SOS's good old "buggy nature".

At this point, I take a small step into the world of native debugging and issue the kb command, to display the native call stack, including the first 3 parameters passed to each method.

0:000> kb
ChildEBP RetAddr  Args to Child              
0012ed84 7c90d79c 7c8111ff 0012eddc 0012eda4 ntdll!KiFastSystemCallRet
0012ed88 7c8111ff 0012eddc 0012eda4 00151bb8 ntdll!ZwQueryFullAttributesFile+0xc
0012ee08 0097a310 0139272c 00000000 0012eeec KERNEL32!GetFileAttributesExW+0x84

WARNING: Frame IP not in any known module. Following frames may be wrong.
0012ee2c 792e03f6 0012ee5c 00000000 01392764 +0x97a30f
0012eedc 7927ff71 00000001 00000000 00000000 mscorlib_ni+0x2203f6
0012ef18 792e96a6 00000000 00000000 00000000 mscorlib_ni+0x1bff71
0012ef80 7b1d4665 01336c40 0134f25c 00000000 mscorlib_ni+0x2296a6
0012efb8 7b1d4105 00000004 0012f000 7b1c6d11 System_Windows_Forms_ni+0x204665
0012efc4 7b1c6d11 0133797c 013349ec 00000001 System_Windows_Forms_ni+0x204105
0012f000 7b1c6b14 00000000 00000000 013349ec System_Windows_Forms_ni+0x1f6d11
0012f018 7b1d2fc8 013349ec 00000000 00000000 System_Windows_Forms_ni+0x1f6b14
0012f054 7b1c8906 3cc1af52 79e7a6b8 0012f2d4 System_Windows_Forms_ni+0x202fc8
0012f0ac 7b1d1d6a 013349ec 0012f0c0 7b1d1d20 System_Windows_Forms_ni+0x1f8906
0012f0b8 7b1d1d20 0012f0d0 7b1d2f11 0012f10c System_Windows_Forms_ni+0x201d6a
0012f0c0 7b1d2f11 0012f10c 013349ec 0012f0e4 System_Windows_Forms_ni+0x201d20
0012f0d0 7b1d1af4 00e111ec 0012f10c 01335448 System_Windows_Forms_ni+0x202f11
0012f0e4 7b1c8640 0012f100 7b1c85c1 00000000 System_Windows_Forms_ni+0x201af4
0012f0ec 7b1c85c1 00000000 01335448 0012f130 System_Windows_Forms_ni+0x1f8640
0012f100 7b1c849a 01335448 00291106 00000018 System_Windows_Forms_ni+0x1f85c1
0012f164 7e418734 00291106 00000018 00000001 System_Windows_Forms_ni+0x1f849a

As you can see, the thread eventually made its way into the kernel, calling GetFileAttributesExW.
Knowing that the first parameter that gets passed to the method (lpFileName) represent the name of the file or directory in hand, we are only left to dump its value using the du command.

0:000> du 0139272c
0139272c  "\\123.123.123.123\foo.dll"

With that, the culprit is found. Apparently, some time ago a UT assembly was loaded into NUnit from a certain IP address that is no longer available in the network, causing the thread to stall indefinitely.

In order to fix the issue, the problematic path was needed to be removed from the RecentProjects list in NUnit's settings file (located at %AppData%\NUnit\NUnitSettings.xml).

Monday, October 11, 2010

Hot/Cold Data Separation Considered Harmful in Multithreaded Environments

When optimizing your code for performance, a quite important step in the way is to check that your application correctly utilizes its processor's cache. Of course, the magnitude of importance here is solely depended on your own personal scenario, but in general, it's always a good idea to keep your cache usage patterns in mind when attempting to resolve a performance bottleneck or perhaps for when you're just looking for possible ways to enhance you're application's performance.

Due to the relative high cost of accessing the main memory, modern processors make use of a data and an instruction cache in an attempt to lower the latency involved in accessing the main memory.
A processor may choose to keep in its cache the values of frequently accessed memory addresses, or perhaps prefetch the values of adjacent memory addresses for possible future usage.
When the processor attempts to access a memory address, it first check to see whether it's already exist in one of its caches (L1, L2, L3 etc.). If the address isn't found, then we have a "cache miss" and the processor we'll have to perform a round-trip to the main memory in order to obtain the required data. Wasting valuable CPU cycles on the way.

Not too long ago, one of the most popular guidelines in designing cache-aware data structures was to split them into "hot" and "cold" parts. This kind of separation makes sense since it could lead to a more efficient utilization of the cache. Frequently used fields (the hot part) are grouped together in the same cache lines, instead of being mixed with infrequently used fields (the cold part). This way, the cache is able to contain more hot data, and cache lines aren't wasted to hold cold data.
For those of you who are interested, the subject is covered in greater detail in the article Splitting Data Objects to Increase Cache Utilization by Michael Franz and Thomas Kistler.

Red: Hot data, Green: Cold data

While following this guideline could lead to a more efficient cache utilization, it won't necessarily lead to a performance improvement in a multithreaded environment. In fact, it's more likely that you'll face a performance degradation than gaining any performance improvement at all.
The problem in hand is that highly utilized caches could easily send your valuable CPU cycles down the drain, due to the effects of false sharing. Once a single cache line holds several frequently written-to fields, the chance it will get invalidated by another processor gets greater. Sharing a single cache line with multiple frequently modified fields could easily have a negative effect on your cache's locality.
In multithreaded environments, one could benefit from sparsely allocating frequently used fields in the cache. There's an obvious trade-off between cache utilization (memory usage) and locality (performance). While the cache will contain less hot data (which could result in more round-trips to the main memory), it would benefit from better locality, hence better scalability across multiple processors that might attempt to modify the same object simultaneously.

When designing a cache-aware data structure, you don't necessarily have to order your fields in such way that a hot field will always get followed by a cold field. Instead, "artificial" padding could be used to fill the excessive space left in the cache line which holds the structure. In .Net, decorating the type with an StructLayout(LayoutKind.Explicit) attribute and assigning each field with its appropriate offset would do the job.

Friday, September 24, 2010

Writing a Semi-Local Object Pool

Using object pooling in managed environments can usually benefit us in two ways:

Reducing the amount of time required to create "heavyweight" objects (that might involve executing some time consuming tasks).
Reducing the amount and rate of dynamic memory allocations. Thus, reducing the GC latency in future collections.

Nevertheless, it's important to remember that under certain scenarios, using object pools might actually have a negative impact on your application's performance. Since in managed environments (e.g, CLR, JVM), dynamic memory allocations is considerably fast, using object pools in favor of "lightwight" objects could cause somewhat of an unnecessary overhead during the object's allocation process. Brian Goetz summarized this issue:

Allocation in JVMs was not always so fast -- early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice -- because it was, along with everything else in early JVMs -- and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.)

A common, simple pattern for implementing an object pool is to create a single pool instance that is shared across all of the application. To achieve thread-safety, you would usually find a single, global lock around the Allocate and Free methods.
It's obvious that this type of design could introduce major concurrency bottlenecks. The more objects we'll attempt to pool, the greater the chance that we'll have threads attempting to acquire the pool's lock. And since we only maintain a single, global pool, contentions around that lock are bound to appear. Effectively ruining our application's scalability.
To demonstrate the issue, I've written a small benchmark that uses all of the available processors to allocate and free a constant number of objects (each thread gets an equal amount of objects to pool). Logically speaking, the more processors we use, the faster we should be able to finish allocating and freeing the constant number of objects. However, as the results show, we're actually experiencing a slowdown that gets worse as soon as we add more and more processors.
The results aren't surprising since they can be easily explained due to the massive amount of contentions we're experiencing around our single lock. (The time axis in the chart is expressed in milliseconds).

The first implementation of the pool that was being used in the test:
(Mind you that the code samples in this post are purely meant to demonstrate the conceptual differences between the pools).

    // holds a dictionary that makes a pool-per-type corelation
    public class SimpleMainPool
    {
        private Dictionary<Type, ISubPool> m_main;

        // to make things simpler, the dictionary isn't modified
        // after the first initialization
        public SimpleMainPool(Type[] pooledTypes)
        {
            m_main = new Dictionary<Type, ISubPool>();

            foreach (Type curType in pooledTypes)
                m_main.Add(curType, new SemiLocalPool(curType));
        }

        public object Allocate(Type type)
        {
            ISubPool sub = m_main[type];

            object pooledObj = sub.Allocate();
            return pooledObj;
        }

        public void Free(object obj)
        {
            ISubPool sub = m_main[obj.GetType()];
            sub.Free(obj);
        }
    }

    // our simple thread-safe pool
    class SimplePool : ISubPool
    {
        private const int PRIME = 50;

        private Type m_type;

        private Stack<object> m_sharedPool;

        public SimplePool(Type type)
        {
            m_sharedPool = new Stack<object>(PRIME);
            m_type = type;

            for (int i = 0; i < PRIME; i++)
            {
                object sharedObj = Activator.CreateInstance(m_type);
                m_sharedPool.Push(sharedObj);
            }
        }

        public object Allocate()
        {
            lock (m_sharedPool)
            {
                if (m_sharedPool.Count == 0)
                {
                    for (int i = 0; i < PRIME; i++)
                    {
                        object newAlloc = Activator.CreateInstance(m_type);
                        m_sharedPool.Push(newAlloc);
                    }
                }

                object fromLocal = m_sharedPool.Pop();
                return fromLocal;
            }
        }

        public void Free(object obj)
        {
            lock (m_sharedPool)
            {
                m_sharedPool.Push(obj);
            }
        }
    }

    interface ISubPool
    {
        object Allocate();
        void Free(object obj);
    }

As in all things related to concurrency, if you don't have locality, then you've got sharing, and once you have sharing, you will probably end up with contentions that are bound to harm your application's performance, wasting valuable CPU cycles.
So if we'd like to improve our scalability, then our goal is clear: reducing the amount of shared data. For example, if pools wouldn't be shared across different threads, then we wouldn't had to worry about synchronizing them and we could avoid the involved contentions altogether. A simple way to achieve this, is to use the TLS to allocate an independent pool for every thread. This way, on the one hand we'll achieve perfect scalability due to the avoidance of state sharing, but on the other hand, this kind of implementation could lead to an excessive memory usage. For instance, if a single instance of our pool (including all of its pre-allocated objects) weights about 10Mb, then on a machine with 16 processors, we could find ourselves dedicating no less then 160Mb in favor of our thread-local pools, even though its not likely that every single thread needs to use all the types of objects that we're allocated in its local pool.
For example, if we're parallelizing some algorithm using 3 threads, where thread 1 needs to use objects of type A and thread 2 needs to use objects of type B and thread 3 needs to use objects of type C, then it makes no sense that every one of those threads will hold a pool that will contain objects of all three types.

A possible solution for this problem is to use a pool hierarchy, where every time a thread attempts to create an object, it will direct itself to its "closest" pool instance. If that pool doesn't contain available instances of the requested object, then it will continue to navigate up the hierarchy until it reaches a pool that holds available instances of the object. Once the thread finishes using the object, it will return it to a pool that is located "closer" to that thread, this way we are able to maintain a level of locality between a thread and its used objects.
Instead of getting confused with unclear and too complex hierarchies, I'll demonstrate the concept using a flat hierarchy that offers a single "global" pool that is shared across all threads, and another local pool for every thread.

Basically, the idea is that the only place synchronization is involved is in the shared pool. So in the optimal scenario, each local pool will eventually hold only the amount of objects required to keep the thread from accessing the shared pool.
Every time a thread needs to create an object, it will first check its local pool. Since this pool only serves the requesting thread, we won't have to deal with synchronization here. Only in case where we've ran out of objects, we'll move on to the shared pool and transfer N more instances of the requested object to the local pool. It could be wise to transfer more objects than the thread initially asked for in order to avoid future accesses to the shared pool. Also, in order to cap the amount of memory we'd like to dedicate for each thread, we could decide that each local pool can hold a maximum of X objects. Once we've exceeded that number, every time a thread will want to free an object, it will return it to the shared pool instead of its local pool (of course, this could cause some contentions, depending on the implementation detail [e.g. the pool may buffer object returns]. But its entirely up to the developer to perform this kind of fine-tuning [memory usage vs. scalability]).

To demonstrate to concept, I've came up with this simplistic pool implementation:

    class SemiLocalPool : ISubPool
    {
        private const int SHARED_PRIME = 50;

        private const int LOCAL_PRIME = 20;
        private const int LOCAL_MAX = 1000;

        [ThreadStatic]
        private static Stack<object> t_localPool;

        private Type m_type;
        private Stack<object> m_sharedPool;

        public SemiLocalPool(Type type)
        {
            m_sharedPool = new Stack<object>(SHARED_PRIME);

            m_type = type;

            for (int i = 0; i < SHARED_PRIME; i++)
            {
                object sharedObj = Activator.CreateInstance(m_type);
                m_sharedPool.Push(sharedObj);
            }
        }

        public static void Init()
        {
            t_localPool = new Stack<object>(LOCAL_PRIME);
        }

        public object Allocate()
        {
            // first, try to allocate from the local pool
            if (t_localPool.Count > 0)
            {
                object localObj = t_localPool.Pop();
                return localObj;
            }

            int allocated = 0;

            lock (m_sharedPool)
            {
                // pass objects from shared to local pool
                for (; m_sharedPool.Count > 0 && allocated < LOCAL_PRIME - 1; allocated++)
                {
                    object sharedObj = m_sharedPool.Pop();
                    t_localPool.Push(sharedObj);
                }

                // prime share pool
                if (m_sharedPool.Count == 0)
                {
                    for (int i = 0; i < SHARED_PRIME; i++)
                    {
                        // bad practice: holding the lock while executing external code
                        object sharedObj = Activator.CreateInstance(m_type);

                        m_sharedPool.Push(sharedObj);
                    }
                }
            }

            // if the shared pool didn't contain enough elements, prime the remaining items
            for (; allocated < LOCAL_PRIME - 1; allocated++)
            {
                object newAlloc = Activator.CreateInstance(m_type);
                t_localPool.Push(newAlloc);
            }

            object fromLocal = Activator.CreateInstance(m_type);
            return fromLocal;
        }

        public void Free(object obj)
        {
            // first return to local pool
            if (t_localPool.Count < LOCAL_MAX)
            {
                t_localPool.Push(obj);
                return;
            }

            // only after reaching LOCAL_MAX push back to the shared pool
            lock (m_sharedPool)
            {
                m_sharedPool.Push(obj);
            }
        }
    }

The scalability difference between the two implementations is closely related to the thread's pool usage pattern and to the values given to LOCAL_MAX, LOCAL_PRIME etc. If we reach a situation where there's always enough objects in the local pool, then we'll should enjoy perfect scalability.
For the purpose of the demonstration, here are the results of the previous benchmark, now using the new pool implementation (beside exceeding the predefined values at the beginning of the benchmark, the benchmark's behavior exhibits optimal usage pattern [accessing only the local pool after a while]).

One problematic characteristic of this type of design is its reliance on thread affinity. While in some scenarios it could actually benefit us, in others it could make the Semi-Local Pool irrelevant.
If every thread in our application is affinitized to certain section of the code (that allocates a constant set of objects), then using this design could be optimal since we dedicate each local pool to a managed thread. We actually assume that the thread will always attempt to allocate objects from a specific, constant set of objects.
However, if the threads doesn't comply with this assumption, then its only a matter of time until each local pool will hold the entire set of pooled objects in the applications (which will of course lead to high memory usage).
In order to improve our way of handling with such scenarios, we could decide to add a kind of additional hierarchy level, that will separate the shared pools according to different sections in the code. Meaning, threads that are currently executing code from a network module for example will access Pool X, while threads that are currently executing some algorithm will access Pool Y. This way we could achieve object locality not by relaying on thread affinity, but on "category affinity" (each section of the code uses a certain set of objects, relevant to it). When a thread will want to allocate an object, it will tell the pool which area in the code its currently executing, so it would receive the appropriate "category pool. It's likely that this pool already contains the same type of objects that will be requested by the current thread since they we're already allocated by other threads that previously executed the same code section.

And some code to illustrate the concept:

    public class CategorizedMainPool
    {
        private Dictionary<string, SimpleMainPool> m_main;

        public CategorizedMainPool(Tuple<string, Type[]>[] pooledCategories)
        {
            m_main = new Dictionary<string, SimpleMainPool>();

            foreach (Tuple<string, Type[]> curCategory in pooledCategories)
            {
                SimpleMainPool curSub = new SimpleMainPool(curCategory.Item2);
                m_main.Add(curCategory.Item1, curSub);
            }
        }

        public object Allocate(string category, Type type)
        {
            SimpleMainPool sub = m_main[category];

            object pooledObj = sub.Allocate(type);
            return pooledObj;
        }

        public void Free(string category, object obj)
        {
            SimpleMainPool sub = m_main[catagory];
            sub.Free(obj);
        }
    }

Friday, September 10, 2010

"!CLRStack -p" Isn't Always Reliable

One of the most commonly used commands in SOS is !CLRStack. When combined with the -p switch, SOS will attempt to display the values of the parameters that were passed to the functions in our managed call-stack. It's important to emphasis that SOS will only attempt to display the correct values, since sometimes its just going to get it all wrong.

In the case where SOS comes to the conclusion that it cannot retrieve the value of a parameter, the string will be displayed. This happens at situations where SOS honestly can't track the value of the parameter by just looking at a specific stack frame. For example, if we're using a calling convention such as fast call, the first two parameters (starting from the left) that can be contained in a register, will be passed in the ECX and EDX registers instead of being pushed onto the stack. For member functions, the value of the this pointer is usually passed in the ECX register.
This kind of behavior can lead into situations where the values of some of the function parameters may be missing since the registers we're already overridden by other functions that we're called down the stack.
As opposed to situations where SOS is able to come to the conclusion that it isn't sure what is the correct value of a parameter, every once in a while it's just going to get things wrong, and display incorrect values. This obviously could terribly mislead the person who attempts to debug the application.
The thing that might be the most concerning about this phenomena, is that it's not very hard to reproduce. Let's take the following application as an example:

    class Program
    {
        static void Main(string[] args)
        {
            Foo a = new Foo();
            a.Work(1, 2, 3, 4, 5);
        }
    }
 
    class Foo
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Work(int x, int y, int z, int k, int p)
        {
            // break with debugger here
        }
    }

Now we'll run WinDbg, load SOS and see what the !CLRStack command has to say about the parameters that we're passed to the Work method.
(Note: The output might slightly vary, depending on your version of SOS. The following debugging session was perform on version 4.0.30319.1).

0:000> !clrstack -p 

OS Thread Id:  0xfbc (0) 

Child SP IP       Call Site 

0012f3fc 030300b3  ConsoleApplication1.Foo.Work(Int32, Int32, Int32, Int32, Int32) [Program.cs  @ 24]

    PARAMETERS: 

        this () = 0x00b3c358 

        x () = 0x00000001 

        y (0x0012f40c) = 0x00000003 

        z (0x0012f408) = 0x00000004 

        k (0x0012f404) = 0x00000005 

        p (0x0012f400) = 0x03030092

0012f414 03030092  ConsoleApplication1.Program.Main(System.String[]) [Program.cs @ 16] 

     PARAMETERS: 

        args =  

0012f648 791421db [GCFrame: 0012f648]

0:000> r ecx  //  holds "this" pointer 

ecx=00b3c358  

0:000> r edx  // holds "x" parameter

edx=00000001

If so, it isn't difficult to see that besides the this pointer and the x parameter (that we're passed as registers), SOS got all of the other parameters wrong. In fact, one may notice in a "shift" that was performed on the displayed values (y got the value of z, while z got the value of k, and so on...).

In order to better understand what went on here, we'll print the memory between the relevant stack pointers (in previous versions, this value was represented by the ESP column instead of the Child SP column).

0:000> dp /c 1 0012f3fc  0012f414

0012f3fc  0012f414 // EBP

0012f400  03030092 // Return Address

0012f404  00000005 // p 

0012f408  00000004 //  k 

0012f40c  00000003 // z 

0012f410  00000002 // y 

0012f414  0012f424 //  EBP

Now, when we compare the stack-addresses of the parameters that SOS gave us, against the real addresses that we see here, we can confirm that a small address shift was performed to the memory addresses of the parameters that we're passed on the stack. So every time SOS attempts to display the value of a parameter, it actually displays the value of the parameter that was passed next to it.
This scenario is a classic example to the bit "buggy" nature of SOS. It doesn't mean that we have to immediatly stop using the !CLRStack command, but it should to remind us not to take SOS's output as the "absolute truth" when debugging, and just keep ourselves alert for all kind of "weird" behaviors such as this one.

Thursday, September 2, 2010

DateTime.Now in v4.0 Causes Dynamic Memory Allocations

A while back I've mentioned in a post that calling DateTime.Now causes boxing. Following the posting, a feedback item was also posted on Microsoft Connect, reporting about the issue.
Yesterday, the feedback's status turned to Resolved after Greg cleverly remarked that this issue was fixed in v4.0 (the boxing occurred until v3.5).

However, after reading the response I took the same benchmark code from the previous post an ran it again, this time using .Net Framework 4.0. Surprisingly, perfmon kept reporting on high allocation rates in the Allocated Bytes/sec counter. So I've opened up Reflector and took a look at DateTime.Now's new implementation detail. And indeed, the boxing issue was resolved since the new implementation uses the TimeZoneInfo type instead of TimeZone. Unable to find the source of the allocations from just looking at the code, it was time to launch WinDbg.
After letting the application to run for a while, I've attached WinDbg to the process and executed the !DumpHeap -stat command a couple of times so I could take a general look at the kind and amount of objects that currently live on the managed heap. The output was as follows:

0:003> !dumpheap -stat
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
...
79ba1d88       13         1532 System.Char[]
79ba2f20        1         3164 Dictionary[[String, mscorlib],[String, mscorlib]][]
79b9f9ac      449        12220 System.String
001663d8       53        13212      Free
79b56c28       30        17996 System.Object[]
79b8e7b0     9119       291808 System.Globalization.DaylightTime
Total 9771 objects

0:003> !dumpheap -stat
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
...
79ba1d88       13         1532 System.Char[]
79ba2f20        1         3164 Dictionary[[String, mscorlib],[String, mscorlib]][]
79b9f9ac      449        12220 System.String
001663d8       53        13212      Free
79b56c28       30        17996 System.Object[]
79b8e7b0    20654       660928 System.Globalization.DaylightTime
Total 21306 objects

This output reveals the identity of our memory-consuming bandit: DaylightTime.
Now, all is left is to spot where this type is being instantated and used. For this purpose, we could use the !BPMD -md command in order to place breakpoints on specific managed methods that DaylightTime exposes (you could dump the type's methods using the following command: !DumpMT -md ).
After setting the breakpoint, the application continues its execution and immediately breaks. Looking at the managed callstack using !CLRStack reveals the source method of the allocation: TimeZoneInfo.GetIsDaylightSavingsFromUtc. This method creates an instance of DaylightTime, and since DaylightTime is a class (hence, a reference type), a dynamic memory allocation is being made.

   // a snippet from the implementation:
   TimeSpan span = utc;
   DaylightTime daylightTime = GetDaylightTime(Year, rule);
   DateTime startTime = daylightTime.Start - span;
   DateTime endTime = (daylightTime.End - span) - rule.DaylightDelta;

In conclusion, DateTime.Now's implementation was updated in BCL v4.0 and the boxing of Int32s was removed since the implementation uses the TimeZoneInfo type instead of TimeZone. However, using TimeZoneInfo results in a new source for dynamic allocations, but this time instead of redundant boxings, the allocations are caused just because a reference type is beind used under the hood. And since each instance of DaylightTime is sized up at 32 bytes (including the Object Header), the update in the BCL could be considered as a turn for the worst regarding memory usage since DaylightTime instances are more memory consuming than boxed instances of Int32.