Sunday, August 31, 2014

Unhandled Exceptions in PerformIOCompletionCallback

Debugging unhandled exceptions in your code might not be the way you've planned to spend your afternoon. On the other hand, at least its your code you're debugging and not, let's say... the Frameworks'. In this post, I'll talk about the latter, and more specifically, a case that popped up in the PerformIOCompletionCallback function.

The main problem with exceptions that are being thrown before the CLR has a chance to execute your own code, is that the call stack is almost useless. A NullReferenceException that was thrown while handling an IOCompletion might consist only from the following line:

System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0x54

Not very helpful. Since none of our code was executed, we can't simply see which variable was null or even to understand which module in our application caused this exception. In order the proceed with the analysis, it's time to pull out the big guns and load a full memory dump of the process into WinDbg.

Once the dump, SOS and the appropriate symbols are loaded, we should be able to find the problematic thread using the ~*e!clrstack command. Using the faulting instruction pointer (IP), we could use the !u command to dump the assembly of the failing function. The output should be similar to this one (truncated for brevity):

0:073> !u 000007FEF33888A4
preJIT generated code
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Begin 000007fef3388850, size 11e. Cold region begin 000007fef39340f0, size 3b
Hot region:
000007fe`f3388850 53              push    rbx
000007fe`f3388851 55              push    rbp
000007fe`f3388852 57              push    rdi
000007fe`f3388853 4883ec40        sub     rsp,40h
000007fe`f3388857 488d6c2420      lea     rbp,[rsp+20h]
000007fe`f338885c 4c894550        mov     qword ptr [rbp+50h],r8
000007fe`f3388860 895548          mov     dword ptr [rbp+48h],edx
000007fe`f3388863 894d40          mov     dword ptr [rbp+40h],ecx
000007fe`f3388866 48896500        mov     qword ptr [rbp],rsp
000007fe`f338886a 33c0            xor     eax,eax
000007fe`f338886c 48894510        mov     qword ptr [rbp+10h],rax
000007fe`f3388870 48c7450800000000 mov     qword ptr [rbp+8],0
000007fe`f3388878 488d5508        lea     rdx,[rbp+8]
000007fe`f338887c 488d0d6d7e71ff  lea     rcx,[mscorlib_ni+0x1b06f0]
000007fe`f3388883 e820a886ff      call    mscorlib_ni+0x3030a8 
000007fe`f3388888 ba9d010000      mov     edx,19Dh
000007fe`f338888d b901000000      mov     ecx,1
000007fe`f3388892 e889a786ff      call    mscorlib_ni+0x303020 
000007fe`f3388897 488b4d50        mov     rcx,qword ptr [rbp+50h]
000007fe`f338889b e8d07b89ff      call    mscorlib_ni+0x330470 (System.Threading.OverlappedData.GetOverlappedFromNative(...)
000007fe`f33888a0 488b7820        mov     rdi,qword ptr [rax+20h]
000007fe`f33888a4 488b4708        mov     rax,qword ptr [rdi+8]
000007fe`f33888a8 488b5818        mov     rbx,qword ptr [rax+18h]
000007fe`f33888ac 4885db          test    rbx,rbx
000007fe`f33888af 741a            je      mscorlib_ni+0xa988cb (000007fe`f33888cb)
...

According to Microsoft's x64 register usage, at the time of the exception, the RAX register contained the return value of the GetOverlappedFronNative(...) function. Then, after a little playing around with certain field offsets, a NullReferenceException is thrown. Such errors often occur when the owner object (socket, file stream etc.) is disposed while there's an outstanding IO operation. Once the asynchronous operation is completed, the CLR attempts to use the disposed object - and we enter the wonderful world of undefined behavior. Though the usual result is quite deterministic: a complete crash of the process due to the unhandled exception. Needles to say, you should never dispose an object before you're sure it has no remaining asynchronous IO operations in progress.

This brings us to the final step. How can we check that the target object was indeed already disposed, and how can we tell which part of our codebase is responsible for this unfortunate bug?
First, we'll need to retrieve the OverlappedData instance that the native pointer is mapped into. This is exactly the purpose of the GetOverlappedFromNative function, but since it's implemented inside the CLR, we can't simply use a decompiler to see its implementation. Instead, we could use the u command again. This time, the output should be similar to this:

0:073> !u 000007fe`f2c20470
Unmanaged code
000007fe`f2c20470 e9bb8d1a01      jmp     clr!GetOverlappedFromNative (000007fe`f3dc9230)

0:073> !u 000007fe`f3dc9230
Unmanaged code
000007fe`f3dc9230 488d41b0        lea     rax,[rcx-50h]
000007fe`f3dc9234 c3              ret

After a short redirection into the CLR, we can see the function simply draws the pointer back by 0x50 bytes (that is, the native structure is actually contained by the managed one). This is consistent with the function's implementation in the SSCLI:

static OverlappedDataObject* GetOverlapped (LPOVERLAPPED nativeOverlapped)
{
    ...        
    return (OverlappedDataObject*)((BYTE*)nativeOverlapped - offsetof(OverlappedDataObject, Internal));
}

Dumping the contents of the OverlappedData's instance should give an output similar to this:

0:073> !do 0x0000000005043e08-0x50
Name:        System.Threading.OverlappedData
MethodTable: 000007fef2903180
EEClass:     000007fef2a0a110
Size:        120(0x78) bytes
File:        C:\...\mscorlib.dll
Fields:
... Offset      Type         VT   Attr         Value      Name
...   8  System.IAsyncResult  0 instance 0000000000000000 m_asyncResult
...  10 ...ompletionCallback  0 instance 0000000005a46028 m_iocb
...  18 ...ompletionCallback  0 instance 0000000000000000 m_iocbHelper
...  20 ...eading.Overlapped  0 instance 0000000005a46010 m_overlapped
...  28        System.Object  0 instance 0000000013335c18 m_userObject
...  30 ...ppedDataCacheLine  0 instance 00000000050436d8 m_cacheLine
...  38        System.IntPtr  1 instance           221828 m_pinSelf
...  40        System.IntPtr  1 instance                0 m_userObjectInternal
...  48         System.Int32  1 instance                1 m_AppDomainId
...  4c         System.Int16  1 instance               13 m_slot
...  4e          System.Byte  1 instance                0 m_isArray
...  4f          System.Byte  1 instance                0 m_toBeCleaned
...  50 ....NativeOverlapped  1 instance 0000000005043e08 m_nativeOverlapped

Dumping the m_iocb member and then m_target, should give you the problematic object. In the demonstrated case, the object is an instance of SocketAsyncEventArgs, so dumping it and looking the the m_DisposedCalled field proves the assumed theory:

0:073> !do 0000000005a53b58 
Name:        System.Net.Sockets.SocketAsyncEventArgs
MethodTable: 000007fef21248c0
EEClass:     000007fef1e6d550
Size:        440(0x1b8) bytes
File:        C:\...\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
...  Offset                 Type VT     Attr            Value Name
...      80     System.EventArgs  0   shared           static Empty
...       >> Domain:Value  0000000000112a40:0000000003326168 <<
...       8 ...et.Sockets.Socket  0 instance 0000000000000000 m_AcceptSocket
...      10 ...et.Sockets.Socket  0 instance 0000000000000000 m_ConnectSocket
...      18        System.Byte[]  0 instance 0000000013335c18 m_Buffer
...     170 System.Net.WSABuffer  1 instance 0000000005a53cc8 m_WSABuffer
...      e8        System.IntPtr  1 instance         13335c28 m_PtrSingleBuffer
...     130         System.Int32  1 instance            32768 m_Count
...     134         System.Int32  1 instance                0 m_Offset
...      20 ...lib]], mscorlib]]  0 instance 0000000000000000 m_BufferList
...     16c       System.Boolean  1 instance                0 m_BufferListChanged
...      28 ...m.Net.WSABuffer[]  0 instance 0000000000000000 m_WSABufferArray
...     138         System.Int32  1 instance            32768 m_BytesTransferred
...      30 ...entArgs, System]]  0 instance 0000000000000000 m_Completed
...
...     16f       System.Boolean  1 instance                1 m_DisposeCalled
...

In order to find out which part of our code initiated the current IO operation, we could use !gcroot on the target object or the m_userObject member. In some cases, simply examining the contents of m_userObject could give us a rough idea regarding the identity of the culprit.