Wednesday, February 13, 2008

Digging into Kernel Space

The company I am working for manufactures a laboratory device that is getting conntected via RS232 interface. As many laptop computers today do not have such an interface we must provide our customers with a good solution how to operate their instruments in such an environment. Therefore we identified a USB to RS232 adapter as a good solution and selected a cable with prolific chipset PL-2303HX and drivers.

Lately I have seen sporadic cases of strange process hangs. The process that in writing into RS232 is stuck. I was able to attach windbg to that process and quickly realized that it was hanging in a call to Kernel32!WriteFile. As we are setting com timeouts via kernel32!SetCommTimeouts to 10 seconds in the worst case I would have expected the function to return in any case. This did not happen :-(

Strange enough, killing the process was not possible anymore. No Task Manager, no Process Explorer and no kill -f was working. Finally I wasn't even able to shut down the system properly (XP SP2) and needed to kill it the hard way.

OK I could not do anything more in user mode space that saying WriteFile is the bad guy. This is a bad explanation for customers. They want the system to work - regardless of how is the culprit.

So I was entering new spaces - the kernel space...

First task was to get a dump. OSR Online offers a tool called Bang which did the job.
Before I needed to configure the system to generate a full dump. Therefore right click on computer and select properties. Go to the advanced tab and click on Settings for Startup and recovery. In the 'Write debugging information' frame select 'Complete memory dump' and the dump file name:


Finish with OK and now start the Bang.exe (it seems to require a local admin account - so I used the runas...):


It is not hard to hit the nice red button and you'll see a nice blue screen ;-)

After reboot the promised memory.dmp was available.

But what next? I'm fully unexperienced in kernel debugging and I tried the 'poking around until you find something' pattern without success. Fortunately I have a MSDN subscription that offers two free support incidents. So I went to Bill and told him that I'm not happy with this. He said, please give me the dump and I'll see. Then he said it'll take songer than a day but not much. Finally took nearly tree weeks - never mind ;-)
I got the dump analysis out of Microsoft hands and this was pretty much pointing to the place I suspected (the USB to RS232 driver). Additionally he said that my process was accessing the same port from two different threads. This touched my pride as a developer and I needed to understand the magic bill was doing my dump. Finally I discovered that it was quite true as the two threads were accessing two different ports (I had in sum 4 ports on the system) - but that's forgiven.

Now to the internals of the kernel dump analyis:

I know the name of the hanging process. So first we need to get the processes:
0: kd> !process 0 0
[...]
PROCESS 820f4838 SessionId: 0 Cid: 01d4 Peb: 7ffd8000 ParentCid: 0608
DirBase: 02a00440 ObjectTable: e4b36400 HandleCount: 209.
Image: MyProcess.exe
[...]


Now we need to get the details of the process:
0: kd> !process 820f4838 f
PROCESS 820f4838 SessionId: 0 Cid: 01d4 Peb: 7ffd8000 ParentCid: 0608
DirBase: 02a00440 ObjectTable: e4b36400 HandleCount: 209.
Image:
MyProcess.exe
VadRoot 81b9ffa8 Vads 288 Clone 0 Private 61253. Modified 84841. Locked 0.
DeviceMap e4c3b450
Token e4c02030
ElapsedTime 1 Day 04:24:53.014
UserTime 00:00:23.156
KernelTime 00:00:20.000
QuotaPoolUsage[PagedPool] 156580
QuotaPoolUsage[NonPagedPool] 12612
Working Set Sizes (now,min,max) (25572, 50, 345) (102288KB, 200KB, 1380KB)
PeakWorkingSetSize 52956
VirtualSize 372 Mb
PeakVirtualSize 379 Mb
PageFaultCount 203878
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 63188
DebugPort 81905360

THREAD 81a81020 Cid 01d4.0c20 Teb: 7ffaf000 Win32Thread: e397ceb0 WAIT: (DelayExecution) KernelMode Non-Alertable
81a81110 NotificationTimer
IRP List:
816ecde0: (0006,0220) Flags: 00000a30 Mdl: 00000000
Not impersonating
DeviceMap e4c3b450
Owning Process 820f4838 Image:
MyProcess.exe
Wait Start TickCount 6562217 Ticks: 0
Context Switch Count 408541 LargeStack
UserTime 00:00:00.109
KernelTime 00:00:00.421
Win32 Start Address 0x77c3a341
Start Address 0x7c810659
Stack Init a9910000 Current a990fc04 Base a9910000 Limit a990d000 Call 0
Priority 10 BasePriority 10 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr
a990fc1c 80502e56 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
a990fc28 804faa13 nt!KiSwapThread+0x8a (FPO: [0,0,0])
a990fc54 8057ca62 nt!KeDelayExecutionThread+0x1c9 (FPO: [Non-Fpo])
a990fc78 8057e7ff nt!IopCancelAlertedRequest+0x52 (FPO: [Non-Fpo])
a990fc94 8057c341 nt!IopSynchronousServiceTail+0xe1 (FPO: [Non-Fpo])
a990fd38 805409ac nt!NtWriteFile+0x5d7 (FPO: [Non-Fpo])
a990fd38 7c90eb94 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ a990fd64)
WARNING: Frame IP not in any known module. Following frames may be wrong.
0208fd70 00000000 0x7c90eb94

THREAD 8169da90 Cid 01d4.0ef0 Teb: 7ff9f000 Win32Thread: 00000000 WAIT: (DelayExecution) KernelMode Non-Alertable
8169db80 NotificationTimer
IRP List:
816b3710: (0006,0220) Flags: 00000a30 Mdl: 00000000
Not impersonating
DeviceMap e4c3b450
Owning Process 820f4838 Image:
MyProcess.exe
Wait Start TickCount 6562217 Ticks: 0
Context Switch Count 64786
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x77c3a341
Start Address 0x7c810659
Stack Init a9a5c000 Current a9a5bc04 Base a9a5c000 Limit a9a59000 Call 0
Priority 6 BasePriority 6 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr
a9a5bc1c 80502e56 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
a9a5bc28 804faa13 nt!KiSwapThread+0x8a (FPO: [0,0,0])
a9a5bc54 8057ca62 nt!KeDelayExecutionThread+0x1c9 (FPO: [Non-Fpo])
a9a5bc78 8057e7ff nt!IopCancelAlertedRequest+0x52 (FPO: [Non-Fpo])
a9a5bc94 8057c341 nt!IopSynchronousServiceTail+0xe1 (FPO: [Non-Fpo])
a9a5bd38 805409ac nt!NtWriteFile+0x5d7 (FPO: [Non-Fpo])
a9a5bd38 7c90eb94 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ a9a5bd64)
WARNING: Frame IP not in any known module. Following frames may be wrong.
04dcfd9c 00000000 0x7c90eb94

Here we can see the two hanging threads in NtWriteFile I already have seen in user space.

I also was able to go that far without Bill's help - now comes the special magic:
The key to the success is lying in the IRP ( I/O request packet) list!

So let's have at the details of the IRPs:
1: kd> !irp 816ecde0
Irp is active with 5 stacks 5 is current (= 0x816ecee0)
No Mdl: System buffer=81e316a8: Thread 81a81020: Irp stack trace.
cmd flg cl Device File Completion-Context
[...]
>[ 4, 0] 0 1 81e36040 81ca1798 00000000-00000000 pending
*** ERROR: Module load completed but symbols could not be loaded for ser2pl.sys
\Driver\Ser2pl
Args: 00000002 00000000 00000000 00000004
1: kd> !irp 816b3710
Irp is active with 5 stacks 5 is current (= 0x816b3810)
No Mdl: System buffer=821b62f0: Thread 8169da90: Irp stack trace.
cmd flg cl Device File Completion-Context
[...]
>[ 4, 0] 0 1 81e37040 81e06a78 00000000-00000000 pending
\Driver\Ser2pl
Args: 00000001 00000000 00000000 00000004

In both cases it is the Ser2pl driver (from prolific) that it not working it's queue with status pending.

But now to Bill statement that two threads are accessing the same port:
Let's see were the data is directed to:

1: kd> !object 81e36040
Object: 81e36040 Type: (823b1958) Device
ObjectHeader: 81e36028 (old version)
HandleCount: 0 PointerCount: 3
Directory Object: e10023d0 Name:
Serial3

1: kd> !object
81e37040
Object: 81e37040 Type: (823b1958) Device
ObjectHeader: 81e37028 (old version)
HandleCount: 0 PointerCount: 3
Directory Object: e10023d0 Name:
Serial2

Finally we can get the port mapping:
1: kd> !object \GLOBAL??
[...]
08 e1632920 SymbolicLink COM2
[...]
e1789a58 SymbolicLink COM3
[...]
e37b2e40 SymbolicLink COM4
[...]
e383d438 SymbolicLink COM5

1: kd> !object e1632920
Object: e1632920 Type: (823ed398) SymbolicLink
ObjectHeader: e1632908 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name: COM2
Target String is '\Device\Serial0'
1: kd> !object e1789a58
Object: e1789a58 Type: (823ed398) SymbolicLink
ObjectHeader: e1789a40 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name:
COM3
Target String is '\Device\
Serial2'
1: kd> !object e37b2e40
Object: e37b2e40 Type: (823ed398) SymbolicLink
ObjectHeader: e37b2e28 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name:
COM4
Target String is '\Device\
Serial3'
1: kd> !object e383d438
Object: e383d438 Type: (823ed398) SymbolicLink
ObjectHeader: e383d420 (old version)
HandleCount: 0 PointerCount: 1
Directory Object: e1000788 Name: COM5
Target String is '\Device\Serial4'


So one thread was writing on to COM3 and the other to COM4. I was a bit less concerned that it was not obviously my fault but of course the problem is not gone. Now I hope to get fix from the cable supplier or select different hardware.
I must say I learned a lot with this case but it is not the type of issue I'm begging for!

BTW: after all this investigations I found post on Marks blog about this class of defect: Unkillable Processes

Finally a big thanks to Microsoft Support!

2 comments:

Anonymous said...

Volker,

Did you ever get any results from the Prolific or the cable supplier?

Thanks,

EAK

Volker von Einem said...

Unfortunately not - but they silently put a new revision on their website. I don't know if this fixes the problem :-(