Debugging unhandled exceptions in your code might not be the way you've planned to spend your afternoon. On the other hand, at least its your code you're debugging and not, let's say... the Frameworks'. In this post, I'll talk about the latter, and more specifically, a case that popped up in the PerformIOCompletionCallback function.
The main problem with exceptions that are being thrown before the CLR has a chance to execute your own code, is that the call stack is almost useless. A NullReferenceException that was thrown while handling an IOCompletion might consist only from the following line:
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0x54
Not very helpful. Since none of our code was executed, we can't simply see which variable was null or even to understand which module in our application caused this exception. In order the proceed with the analysis, it's time to pull out the big guns and load a full memory dump of the process into WinDbg.
Once the dump, SOS and the appropriate symbols are loaded, we should be able to find the problematic thread using the ~*e!clrstack command. Using the faulting instruction pointer (IP), we could use the !u command to dump the assembly of the failing function. The output should be similar to this one (truncated for brevity):
0:073> !u 000007FEF33888A4
preJIT generated code
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Begin 000007fef3388850, size 11e. Cold region begin 000007fef39340f0, size 3b
Hot region:
000007fe`f3388850 53 push rbx
000007fe`f3388851 55 push rbp
000007fe`f3388852 57 push rdi
000007fe`f3388853 4883ec40 sub rsp,40h
000007fe`f3388857 488d6c2420 lea rbp,[rsp+20h]
000007fe`f338885c 4c894550 mov qword ptr [rbp+50h],r8
000007fe`f3388860 895548 mov dword ptr [rbp+48h],edx
000007fe`f3388863 894d40 mov dword ptr [rbp+40h],ecx
000007fe`f3388866 48896500 mov qword ptr [rbp],rsp
000007fe`f338886a 33c0 xor eax,eax
000007fe`f338886c 48894510 mov qword ptr [rbp+10h],rax
000007fe`f3388870 48c7450800000000 mov qword ptr [rbp+8],0
000007fe`f3388878 488d5508 lea rdx,[rbp+8]
000007fe`f338887c 488d0d6d7e71ff lea rcx,[mscorlib_ni+0x1b06f0]
000007fe`f3388883 e820a886ff call mscorlib_ni+0x3030a8
000007fe`f3388888 ba9d010000 mov edx,19Dh
000007fe`f338888d b901000000 mov ecx,1
000007fe`f3388892 e889a786ff call mscorlib_ni+0x303020
000007fe`f3388897 488b4d50 mov rcx,qword ptr [rbp+50h]
000007fe`f338889b e8d07b89ff call mscorlib_ni+0x330470 (System.Threading.OverlappedData.GetOverlappedFromNative(...)
000007fe`f33888a0 488b7820 mov rdi,qword ptr [rax+20h]
000007fe`f33888a4 488b4708 mov rax,qword ptr [rdi+8]
000007fe`f33888a8 488b5818 mov rbx,qword ptr [rax+18h]
000007fe`f33888ac 4885db test rbx,rbx
000007fe`f33888af 741a je mscorlib_ni+0xa988cb (000007fe`f33888cb)
...
According to Microsoft's x64 register usage, at the time of the exception, the RAX register contained the return value of the GetOverlappedFronNative(...) function. Then, after a little playing around with certain field offsets, a NullReferenceException is thrown. Such errors often occur when the owner object (socket, file stream etc.) is disposed while there's an outstanding IO operation. Once the asynchronous operation is completed, the CLR attempts to use the disposed object - and we enter the wonderful world of undefined behavior. Though the usual result is quite deterministic: a complete crash of the process due to the unhandled exception. Needles to say, you should never dispose an object before you're sure it has no remaining asynchronous IO operations in progress.
This brings us to the final step. How can we check that the target object was indeed already disposed, and how can we tell which part of our codebase is responsible for this unfortunate bug?
First, we'll need to retrieve the OverlappedData instance that the native pointer is mapped into. This is exactly the purpose of the GetOverlappedFromNative function, but since it's implemented inside the CLR, we can't simply use a decompiler to see its implementation. Instead, we could use the u command again. This time, the output should be similar to this:
0:073> !u 000007fe`f2c20470
Unmanaged code
000007fe`f2c20470 e9bb8d1a01 jmp clr!GetOverlappedFromNative (000007fe`f3dc9230)
0:073> !u 000007fe`f3dc9230
Unmanaged code
000007fe`f3dc9230 488d41b0 lea rax,[rcx-50h]
000007fe`f3dc9234 c3 ret
After a short redirection into the CLR, we can see the function simply draws the pointer back by 0x50 bytes (that is, the native structure is actually contained by the managed one). This is consistent with the function's implementation in the SSCLI:
static OverlappedDataObject* GetOverlapped (LPOVERLAPPED nativeOverlapped)
{
...
return (OverlappedDataObject*)((BYTE*)nativeOverlapped - offsetof(OverlappedDataObject, Internal));
}
Dumping the contents of the OverlappedData's instance should give an output similar to this:
0:073> !do 0x0000000005043e08-0x50
Name: System.Threading.OverlappedData
MethodTable: 000007fef2903180
EEClass: 000007fef2a0a110
Size: 120(0x78) bytes
File: C:\...\mscorlib.dll
Fields:
... Offset Type VT Attr Value Name
... 8 System.IAsyncResult 0 instance 0000000000000000 m_asyncResult
... 10 ...ompletionCallback 0 instance 0000000005a46028 m_iocb
... 18 ...ompletionCallback 0 instance 0000000000000000 m_iocbHelper
... 20 ...eading.Overlapped 0 instance 0000000005a46010 m_overlapped
... 28 System.Object 0 instance 0000000013335c18 m_userObject
... 30 ...ppedDataCacheLine 0 instance 00000000050436d8 m_cacheLine
... 38 System.IntPtr 1 instance 221828 m_pinSelf
... 40 System.IntPtr 1 instance 0 m_userObjectInternal
... 48 System.Int32 1 instance 1 m_AppDomainId
... 4c System.Int16 1 instance 13 m_slot
... 4e System.Byte 1 instance 0 m_isArray
... 4f System.Byte 1 instance 0 m_toBeCleaned
... 50 ....NativeOverlapped 1 instance 0000000005043e08 m_nativeOverlapped
Dumping the m_iocb member and then m_target, should give you the problematic object. In the demonstrated case, the object is an instance of SocketAsyncEventArgs, so dumping it and looking the the m_DisposedCalled field proves the assumed theory:
0:073> !do 0000000005a53b58
Name: System.Net.Sockets.SocketAsyncEventArgs
MethodTable: 000007fef21248c0
EEClass: 000007fef1e6d550
Size: 440(0x1b8) bytes
File: C:\...\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
... Offset Type VT Attr Value Name
... 80 System.EventArgs 0 shared static Empty
... >> Domain:Value 0000000000112a40:0000000003326168 <<
... 8 ...et.Sockets.Socket 0 instance 0000000000000000 m_AcceptSocket
... 10 ...et.Sockets.Socket 0 instance 0000000000000000 m_ConnectSocket
... 18 System.Byte[] 0 instance 0000000013335c18 m_Buffer
... 170 System.Net.WSABuffer 1 instance 0000000005a53cc8 m_WSABuffer
... e8 System.IntPtr 1 instance 13335c28 m_PtrSingleBuffer
... 130 System.Int32 1 instance 32768 m_Count
... 134 System.Int32 1 instance 0 m_Offset
... 20 ...lib]], mscorlib]] 0 instance 0000000000000000 m_BufferList
... 16c System.Boolean 1 instance 0 m_BufferListChanged
... 28 ...m.Net.WSABuffer[] 0 instance 0000000000000000 m_WSABufferArray
... 138 System.Int32 1 instance 32768 m_BytesTransferred
... 30 ...entArgs, System]] 0 instance 0000000000000000 m_Completed
...
... 16f System.Boolean 1 instance 1 m_DisposeCalled
...
In order to find out which part of our code initiated the current IO operation, we could use !gcroot on the target object or the m_userObject member. In some cases, simply examining the contents of m_userObject could give us a rough idea regarding the identity of the culprit.