Thursday, February 21, 2008

Shooting the PAGE_GUARD flag with MiniDumpWithIndirectlyReferencedMemory

A colleague of mine (thanks Ralf for pointing this out!) told me that using MiniDumpWithIndirectlyReferencedMemory in MiniDumpWriteDump can cause a nasty crashes.
Following the "in 99.9% of the cases it is your own fault" pattern I suspected the problem to be somewhere else but in dbghelp. Ralf kindly provided me with a sample project which I condensed a bit to fit on a single page:

// GuardPageDump.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "C:/Program Files/Debugging Tools for Windows/sdk/inc/dbghelp.h"
#pragma comment( lib, "C:/Program Files/Debugging Tools for Windows/sdk/lib/i386/dbghelp.lib")

#include "process.h"

void
BigFunc ()
{

char
sBigBuffer [20000] = {'\0'};
printf ("BigFunc was called!\nsBigBuffer: %s\n",sBigBuffer);
}


void
ProblemFunc(HANDLE hWaitForMe)
{

char
sDummy1[] = {'A','\0'};

unsigned long
iDummy = reinterpret_cast<unsigned long>(&(sDummy1[0]));

// let iDummp seem like a pointer pointing to the guarded page
iDummy -= 0x2000;
printf ("Integer value: %d\n", iDummy);

//make sure the integer value pointing to the guard page area is on the stack, when MiniDumpWriteDump is called from another stack
//during wait we will issue a dump creation on the main thread
WaitForSingleObject(hWaitForMe,INFINITE);

printf ("Calling BigFunc crashes since stack can no longer be extended\n");

BigFunc();

printf ("Integer value: %d\n", iDummy);
}


void
ProblemFuncThread(void * p)
{

ProblemFunc(reinterpret_cast<HANDLE>(p));
_endthread();
}


int
_tmain(int argc, _TCHAR* argv[])
{

printf ("This program demonstrates the damaging effect of creating a userdump with MiniDumpWithIndirectlyReferencedMemory\n");

HANDLE hWaitForMe = CreateEvent(NULL,FALSE,FALSE,NULL);

printf ("Calling ProblemFunc on a different thread\n");
uintptr_t hThread = _beginthread(ProblemFuncThread,0,hWaitForMe);
// give the thread time to start
::Sleep(1000);

printf ("Creating a userdump of type MiniDumpWithIndirectlyReferencedMemory\n");
printf ("Resets guard page flag on the page pointed to by iDummy\n");
HANDLE hFile = CreateFile(_T("c:\\temp\\test_indirect.dmp"), GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
MiniDumpWriteDump(GetCurrentProcess(),GetCurrentProcessId(),hFile,(MINIDUMP_TYPE) (MiniDumpWithIndirectlyReferencedMemory),0,0,0);

CloseHandle(hFile);

SetEvent(hWaitForMe);
CloseHandle(hWaitForMe);

WaitForSingleObject(reinterpret_cast<HANDLE>(hThread),INFINITE);
return
0;
}


This sample assumes you have installed Debugging Tools for Windows to the default location (C:/Program Files/Debugging Tools for Windows) and you have selected to install the SDK as well.

I'm setting two breakpoints: One on the line that is calling MiniDumpWriteDump - another on the line before calling BicFunc.

First let's have a look at the state before calling MiniDumpWriteDump:


Switching to thread 001 we will notice the value of iDummy = 0x88df38 on the stack that is waiting:


Now before calling MiniDumpWriteDump let's have a look at the memory layout:

0:001> !vadump

[...]

BaseAddress: 00125000
RegionSize: 00001000
State: 00001000 MEM_COMMIT
Protect: 00000104 PAGE_READWRITE + PAGE_GUARD
Type: 00020000 MEM_PRIVATE

[...]

BaseAddress: 0088d000
RegionSize: 00001000
State: 00001000 MEM_COMMIT
Protect: 00000104 PAGE_READWRITE + PAGE_GUARD
Type: 00020000 MEM_PRIVATE


Ok, there is a PAGE_GUARD flag for each thread...

0:000> ~0s;!teb
[...]
GuardPageDump!wmain+0x91:
00401991 6a00 push 0
TEB at 7ffdf000
[...]
StackLimit: 00126000
[...]
0:000> ~1s;!teb
[...]
ntdll!KiFastSystemCallRet:
771d9a94 c3 ret
TEB at 7ffde000
[...]
StackLimit: 0088e000
[...]

Adding the RegionSize to BaseAddress gives us the StackLimit observed by !teb.

Now comes the clue: iDummy holds the value 0x88df38 (no pointer) that represents an address in the guarded page. Dbghelp does not know by looking at the stack if this is a pointer or value and follows the indirections. Looking at the memory layout after call to MiniDumpWriteDump reveals the problem:
0:001> !vadump
[...]
BaseAddress: 00125000
RegionSize: 00001000
State: 00001000 MEM_COMMIT
Protect: 00000104 PAGE_READWRITE + PAGE_GUARD
Type: 00020000 MEM_PRIVATE

[...]

BaseAddress: 0088d000
RegionSize: 00003000
State: 00001000 MEM_COMMIT
Protect: 00000004 PAGE_READWRITE
Type: 00020000 MEM_PRIVATE

The PAGE_GUARD flag for thread 001 is gone! Now it's just a question of time until your application will crash without giving you any clue on the root cause of the problem:

0:001> g
(13e0.ee4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=0088c000 ebx=00bc2e30 ecx=0088b100 edx=771d9a94 esi=00000000 edi=00000000
eip=00401a07 esp=0088ff24 ebp=0088ff2c iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
GuardPageDump!_chkstk+0x27:
00401a07 8500 test dword ptr [eax],eax ds:0023:0088c000=????????

0:001> g
(13e0.ee4): Access violation - code c0000005 (!!! second chance !!!)


Next thing I will do is to tell Microsoft about this. Meanwhile I'll try to restore the PAGE_GUARD flag as a workaround fix using VirtualProtext but I'm not sure if this is really a good idea.

BTW: I've tested this the latest version of dbghelp.dll I know:
A colleague of mine recently investigated a crash that could be deducted to the usage of MiniDumpWriteDump along with MiniDumpWithIndirectlyReferencedMemory flag. The problem is, that MiniDumpWriteDump clears the PAGE_GUARD flag if a value or pointer is on the a stack other than the stack that calls MiniDumpWriteDump.
As a result the process will crash later if the stack needs to be extended.
I've described the problem in detail here:
http://voneinem-windbg.blogspot.com/2008/02/shooting-pageguard-flag-with.html

I would be nice to see fix for this, because the MiniDumpWithIndirectlyReferencedMemory is really useful.
I didn't investigate into the other flags of MINIDUMP_TYPE Enumeration but MiniDumpNormal and MiniDumpWithFullMemory seem to work without a problem.

I've been testing this with the latest public version of dbghelp.dll
0:001> lmvm dbghelp
start end module name
68af0000 68c05000 dbghelp (deferred)
Image path: C:\Users\voneinem\Documents\Visual Studio 2008\Projects\GuardPageDump\Release\dbghelp.dll
Image name: dbghelp.dll
Timestamp: Thu Sep 27 23:27:05 2007 (46FC2029)
CheckSum: 0010087A
ImageSize: 00115000
File version: 6.8.4.0
Product version: 6.8.4.0

Wednesday, February 13, 2008

Digging into Kernel Space

The company I am working for manufactures a laboratory device that is getting conntected via RS232 interface. As many laptop computers today do not have such an interface we must provide our customers with a good solution how to operate their instruments in such an environment. Therefore we identified a USB to RS232 adapter as a good solution and selected a cable with prolific chipset PL-2303HX and drivers.

Lately I have seen sporadic cases of strange process hangs. The process that in writing into RS232 is stuck. I was able to attach windbg to that process and quickly realized that it was hanging in a call to Kernel32!WriteFile. As we are setting com timeouts via kernel32!SetCommTimeouts to 10 seconds in the worst case I would have expected the function to return in any case. This did not happen :-(

Strange enough, killing the process was not possible anymore. No Task Manager, no Process Explorer and no kill -f was working. Finally I wasn't even able to shut down the system properly (XP SP2) and needed to kill it the hard way.

OK I could not do anything more in user mode space that saying WriteFile is the bad guy. This is a bad explanation for customers. They want the system to work - regardless of how is the culprit.

So I was entering new spaces - the kernel space...

First task was to get a dump. OSR Online offers a tool called Bang which did the job.
Before I needed to configure the system to generate a full dump. Therefore right click on computer and select properties. Go to the advanced tab and click on Settings for Startup and recovery. In the 'Write debugging information' frame select 'Complete memory dump' and the dump file name:


Finish with OK and now start the Bang.exe (it seems to require a local admin account - so I used the runas...):


It is not hard to hit the nice red button and you'll see a nice blue screen ;-)

After reboot the promised memory.dmp was available.

But what next? I'm fully unexperienced in kernel debugging and I tried the 'poking around until you find something' pattern without success. Fortunately I have a MSDN subscription that offers two free support incidents. So I went to Bill and told him that I'm not happy with this. He said, please give me the dump and I'll see. Then he said it'll take songer than a day but not much. Finally took nearly tree weeks - never mind ;-)
I got the dump analysis out of Microsoft hands and this was pretty much pointing to the place I suspected (the USB to RS232 driver). Additionally he said that my process was accessing the same port from two different threads. This touched my pride as a developer and I needed to understand the magic bill was doing my dump. Finally I discovered that it was quite true as the two threads were accessing two different ports (I had in sum 4 ports on the system) - but that's forgiven.

Now to the internals of the kernel dump analyis:

I know the name of the hanging process. So first we need to get the processes:
0: kd> !process 0 0
[...]
PROCESS 820f4838 SessionId: 0 Cid: 01d4 Peb: 7ffd8000 ParentCid: 0608
DirBase: 02a00440 ObjectTable: e4b36400 HandleCount: 209.
Image: MyProcess.exe
[...]


Now we need to get the details of the process:
0: kd> !process 820f4838 f
PROCESS 820f4838 SessionId: 0 Cid: 01d4 Peb: 7ffd8000 ParentCid: 0608
DirBase: 02a00440 ObjectTable: e4b36400 HandleCount: 209.
Image:
MyProcess.exe
VadRoot 81b9ffa8 Vads 288 Clone 0 Private 61253. Modified 84841. Locked 0.
DeviceMap e4c3b450
Token e4c02030
ElapsedTime 1 Day 04:24:53.014
UserTime 00:00:23.156
KernelTime 00:00:20.000
QuotaPoolUsage[PagedPool] 156580
QuotaPoolUsage[NonPagedPool] 12612
Working Set Sizes (now,min,max) (25572, 50, 345) (102288KB, 200KB, 1380KB)
PeakWorkingSetSize 52956
VirtualSize 372 Mb
PeakVirtualSize 379 Mb
PageFaultCount 203878
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 63188
DebugPort 81905360

THREAD 81a81020 Cid 01d4.0c20 Teb: 7ffaf000 Win32Thread: e397ceb0 WAIT: (DelayExecution) KernelMode Non-Alertable
81a81110 NotificationTimer
IRP List:
816ecde0: (0006,0220) Flags: 00000a30 Mdl: 00000000
Not impersonating
DeviceMap e4c3b450
Owning Process 820f4838 Image:
MyProcess.exe
Wait Start TickCount 6562217 Ticks: 0
Context Switch Count 408541 LargeStack
UserTime 00:00:00.109
KernelTime 00:00:00.421
Win32 Start Address 0x77c3a341
Start Address 0x7c810659
Stack Init a9910000 Current a990fc04 Base a9910000 Limit a990d000 Call 0
Priority 10 BasePriority 10 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr
a990fc1c 80502e56 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
a990fc28 804faa13 nt!KiSwapThread+0x8a (FPO: [0,0,0])
a990fc54 8057ca62 nt!KeDelayExecutionThread+0x1c9 (FPO: [Non-Fpo])
a990fc78 8057e7ff nt!IopCancelAlertedRequest+0x52 (FPO: [Non-Fpo])
a990fc94 8057c341 nt!IopSynchronousServiceTail+0xe1 (FPO: [Non-Fpo])
a990fd38 805409ac nt!NtWriteFile+0x5d7 (FPO: [Non-Fpo])
a990fd38 7c90eb94 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ a990fd64)
WARNING: Frame IP not in any known module. Following frames may be wrong.
0208fd70 00000000 0x7c90eb94

THREAD 8169da90 Cid 01d4.0ef0 Teb: 7ff9f000 Win32Thread: 00000000 WAIT: (DelayExecution) KernelMode Non-Alertable
8169db80 NotificationTimer
IRP List:
816b3710: (0006,0220) Flags: 00000a30 Mdl: 00000000
Not impersonating
DeviceMap e4c3b450
Owning Process 820f4838 Image:
MyProcess.exe
Wait Start TickCount 6562217 Ticks: 0
Context Switch Count 64786
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x77c3a341
Start Address 0x7c810659
Stack Init a9a5c000 Current a9a5bc04 Base a9a5c000 Limit a9a59000 Call 0
Priority 6 BasePriority 6 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr
a9a5bc1c 80502e56 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
a9a5bc28 804faa13 nt!KiSwapThread+0x8a (FPO: [0,0,0])
a9a5bc54 8057ca62 nt!KeDelayExecutionThread+0x1c9 (FPO: [Non-Fpo])
a9a5bc78 8057e7ff nt!IopCancelAlertedRequest+0x52 (FPO: [Non-Fpo])
a9a5bc94 8057c341 nt!IopSynchronousServiceTail+0xe1 (FPO: [Non-Fpo])
a9a5bd38 805409ac nt!NtWriteFile+0x5d7 (FPO: [Non-Fpo])
a9a5bd38 7c90eb94 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ a9a5bd64)
WARNING: Frame IP not in any known module. Following frames may be wrong.
04dcfd9c 00000000 0x7c90eb94

Here we can see the two hanging threads in NtWriteFile I already have seen in user space.

I also was able to go that far without Bill's help - now comes the special magic:
The key to the success is lying in the IRP ( I/O request packet) list!

So let's have at the details of the IRPs:
1: kd> !irp 816ecde0
Irp is active with 5 stacks 5 is current (= 0x816ecee0)
No Mdl: System buffer=81e316a8: Thread 81a81020: Irp stack trace.
cmd flg cl Device File Completion-Context
[...]
>[ 4, 0] 0 1 81e36040 81ca1798 00000000-00000000 pending
*** ERROR: Module load completed but symbols could not be loaded for ser2pl.sys
\Driver\Ser2pl
Args: 00000002 00000000 00000000 00000004
1: kd> !irp 816b3710
Irp is active with 5 stacks 5 is current (= 0x816b3810)
No Mdl: System buffer=821b62f0: Thread 8169da90: Irp stack trace.
cmd flg cl Device File Completion-Context
[...]
>[ 4, 0] 0 1 81e37040 81e06a78 00000000-00000000 pending
\Driver\Ser2pl
Args: 00000001 00000000 00000000 00000004

In both cases it is the Ser2pl driver (from prolific) that it not working it's queue with status pending.

But now to Bill statement that two threads are accessing the same port:
Let's see were the data is directed to:

1: kd> !object 81e36040
Object: 81e36040 Type: (823b1958) Device
ObjectHeader: 81e36028 (old version)
HandleCount: 0 PointerCount: 3
Directory Object: e10023d0 Name:
Serial3

1: kd> !object
81e37040
Object: 81e37040 Type: (823b1958) Device
ObjectHeader: 81e37028 (old version)
HandleCount: 0 PointerCount: 3
Directory Object: e10023d0 Name:
Serial2

Finally we can get the port mapping:
1: kd> !object \GLOBAL??
[...]
08 e1632920 SymbolicLink COM2
[...]
e1789a58 SymbolicLink COM3
[...]
e37b2e40 SymbolicLink COM4
[...]
e383d438 SymbolicLink COM5

1: kd> !object e1632920
Object: e1632920 Type: (823ed398) SymbolicLink
ObjectHeader: e1632908 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name: COM2
Target String is '\Device\Serial0'
1: kd> !object e1789a58
Object: e1789a58 Type: (823ed398) SymbolicLink
ObjectHeader: e1789a40 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name:
COM3
Target String is '\Device\
Serial2'
1: kd> !object e37b2e40
Object: e37b2e40 Type: (823ed398) SymbolicLink
ObjectHeader: e37b2e28 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name:
COM4
Target String is '\Device\
Serial3'
1: kd> !object e383d438
Object: e383d438 Type: (823ed398) SymbolicLink
ObjectHeader: e383d420 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name: COM5
Target String is '\Device\Serial4'


So one thread was writing on to COM3 and the other to COM4. I was a bit less concerned that it was not obviously my fault but of course the problem is not gone. Now I hope to get fix from the cable supplier or select different hardware.
I must say I learned a lot with this case but it is not the type of issue I'm begging for!

BTW: after all this investigations I found post on Marks blog about this class of defect: Unkillable Processes

Finally a big thanks to Microsoft Support!

Tuesday, February 12, 2008

Getting the GCRoot on a range of objects

Recently I was in the situation to look at the GCRoot of a bunch of objects in a range.
Eran posted on how to do this taks for getting detailed object information. I used this as a template for my needs:

.foreach (obj { !DumpHeap -short start_address end_address }) { !GCRoot ${obj} }

Now replace start_address and end_address with your adresses...

Friday, February 08, 2008

Rank four on Google for my Blog?

Today I was curious and just googled for windbg.
Surprisingly I noticed, that my blog got rank 4:



I do not understand this, because there are lots of blogs (Tess, Dimtry, John just to name three of the best) outside that earn a much higher ranking than this one. The primary intent of this blog was to extend my brain which unfortunately is easily loosing things over time. So if you should hit my blog just googling for windbg make sure to visit the blogs on my roll...

BTW: Tess is currently conducting an online training on windbg. Check it out...