Friday, May 30, 2008

Debugging reference count leaks

One of the hardest things to debug is a reference count leak. A COM object's lifetime depends on its reference count (read here for more...). Each client of a COM object must therefore call AddRef on the IUnknown interface when it starts using the object and Release when it is done. If any client (and a single object might have very many) violates this rule, you get into severe trouble.

Scenarios

1.) Number of Release calls = Number of AddRef calls

This is the normal scenario: as soon as no client needs the server object anymore, it is destroyed.

2.) Number of Release calls > Number of AddRef calls

If Release is called one time too often, another client might crash because the server object gets destroyed too early. The bad thing here is that you see the crash in one place, but this does not tell you where the root cause is located. All you know is which object's reference count has been corrupted.

3.) Number of AddRef calls > Number of Release calls

If AddRef is called one time too often, the reference count never reaches 0 and hence the server object never gets destroyed. This causes memory leaks and might also cause resource leaks. The effect of this scenario is much less obvious: you might see memory usage increase over time, performance degrade, or resources stay locked when they should have been unlocked again.
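The three scenarios can be reproduced with a toy model. This is only a sketch in Python, not real COM: the class ToyComObject and the function run are made-up names, and the model assumes that the creator holds the first reference and that touching a destroyed object is reported as an error instead of crashing.

```python
class ToyComObject:
    """Toy stand-in for a COM server object; this is not real COM."""

    def __init__(self):
        self.ref_count = 1      # assumption: the creator holds the first reference
        self.destroyed = False

    def add_ref(self):
        if self.destroyed:
            raise RuntimeError("AddRef on a destroyed object")
        self.ref_count += 1
        return self.ref_count

    def release(self):
        if self.destroyed:
            # In native code this would be a use-after-free, typically
            # crashing far away from the call that caused it.
            raise RuntimeError("Release on a destroyed object")
        self.ref_count -= 1
        if self.ref_count == 0:
            self.destroyed = True
        return self.ref_count


def run(extra_add_refs=0, extra_releases=0):
    """Two clients share one object; optionally unbalance the calls."""
    obj = ToyComObject()          # client A creates it (count 1)
    obj.add_ref()                 # client B starts using it (count 2)
    for _ in range(extra_add_refs):
        obj.add_ref()             # scenario 3: one AddRef too many
    for _ in range(extra_releases):
        obj.release()             # scenario 2: one Release too many
    obj.release()                 # client A is done
    try:
        obj.release()             # client B is done
    except RuntimeError:
        return "crash"
    return "destroyed" if obj.destroyed else "leaked"
```

In this model, run() ends with the object destroyed, run(extra_add_refs=1) leaves it leaked, and run(extra_releases=1) makes the innocent second client fail on its own, perfectly correct Release call.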

Finding the place where the unbalanced AddRef/Release occurred can be like finding a needle in a haystack. I searched the Google-reachable web but didn't find a good tool that really assists in this task. Luckily, Sara Ford described in this post the first step you need to take in order to get the data necessary to drill down into the problem.

Somehow I didn't manage to set the trace points in Visual Studio 2005 (can anybody tell me how to set a breakpoint on a single object's AddRef and Release methods?), so I launched my beloved WinDbg.

First I created a script that writes an XML snippet for each event that alters the ref count (I didn't find a better name, so I called it ToXml.txt and placed it into my scripts folder):

.printf "-->\n<Event><Ref>%d</Ref><![CDATA[",poi(${$arg1})
k100
.printf "]]></Event>\n<!--\n"

Then I placed a breakpoint on the server object's constructor:

bp MyServer!CMyClass::CMyClass

When the breakpoint was hit, I stepped out (<Shift>+<F11>) into CComCreator::CreateInstance and then stepped over the p->SetVoid(pv); call there.
(I think it should be possible to set a breakpoint directly at MyServer!ATL::CComCreator<ATL::CComObject<CMyClass> >::CreateInstance+0xb1, but I didn't try...)

Now I obtained the address of m_dwRef with:

0:000> ?? &(p->m_dwRef)
long * 0x110d724c

The next step is to set the data breakpoint on that address and pass the same address to the script, which reads the ref count from it:

ba w4 0x110d724c "$$>a<C:/windbg/scripts/ToXml.txt 0x110d724c;gc"

(you might need to change the path 'C:/windbg/scripts/')

With .logopen we make sure that all events are written directly into a log file:

.logopen c:\temp\Events.xml

Now let the application run with 'g' or <F5> and do whatever creates your ref counting problem.

When done, break into the debugger and close the log with .logclose.

At this point we are halfway through. The Events.xml we created is not yet valid XML. You need to add

<?xml version="1.0" encoding="UTF-8"?>
<Events>
<!--

at the beginning and

--></Events>

at the end.
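This manual step can also be scripted. The following is a minimal sketch in Python, assuming the log looks like what the ToXml.txt script above produces; wrap_log, parse_events and the SAMPLE_LOG content (including its stack frames) are made up for illustration. It also assumes that the debugger output between two events never contains "--", which would be illegal inside an XML comment and would need to be cleaned up first.

```python
import xml.etree.ElementTree as ET

HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<Events>\n<!--\n'
FOOTER = "--></Events>\n"


def wrap_log(raw_log):
    """Add the header and footer that turn the raw log into valid XML."""
    return HEADER + raw_log + FOOTER


def parse_events(xml_text):
    """Return a list of (ref_count, stack_text) tuples, in log order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    events = []
    for event in root.iter("Event"):
        ref = event.find("Ref")
        # The CDATA section follows </Ref>, so the parser delivers it
        # as the tail text of the Ref element.
        stack = (ref.tail or "").strip()
        events.append((int(ref.text), stack))
    return events


# Hypothetical log content; real call stacks are much longer.
SAMPLE_LOG = """Opened log file 'c:\\temp\\Events.xml'
0:000> g
-->
<Event><Ref>2</Ref><![CDATA[ChildEBP RetAddr
0012f000 00401000 MyServer!CMyClass::AddRef+0x10
]]></Event>
<!--
-->
<Event><Ref>1</Ref><![CDATA[ChildEBP RetAddr
0012f000 00402000 MyServer!CMyClass::Release+0x12
]]></Event>
<!--
0:000> .logclose
"""

events = parse_events(wrap_log(SAMPLE_LOG))
```

Everything the debugger printed between two events ends up inside an XML comment and is skipped by the parser, so only the Event elements remain.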

Now comes the tooling. In my scenario I had around 1400 events, which is rather tedious to analyze by hand.

So I created "Volkers RefCount Buster" which does the following:

1.) After loading the file (enter the path in the first text box and press Start), each event is classified as either an AddRef or a Release.

2.) Then the call stack is taken to group the events:

First action is to exclude events that match the pattern entered in the second text box (exclude pattern):

var includeQuery = from frame in this.StackFrames.Frames
                   where String.IsNullOrEmpty(ExcludePattern) ? true : !excludePattern.IsMatch(frame)
                   select frame;

Then the remaining frames are searched for the selection pattern:

var selectionQuery = from frame in includeQuery
                     let match = selectionPattern.Match(frame)
                     where match.Success
                     select match.Value;

and the topmost match is taken:

string sourceGroup = selectionQuery.FirstOrDefault();

Then all events are grouped by the source group found for them:

var ResultQuery = from refCountEvent in refCountEvents
                  group refCountEvent by refCountEvent.SourceGroup into g
                  select g;

Then the number of AddRefs and Releases is calculated for each group and accumulated:

foreach (var ResultSet in ResultQuery)
{
    long numOfAddRefs = (from rce in ResultSet.AsEnumerable()
                         where rce.RefCountType == EventType.AddRef
                         select rce).Count();

    long numOfReleases = (from rce in ResultSet.AsEnumerable()
                          where rce.RefCountType == EventType.Release
                          select rce).Count();

    long balance = numOfAddRefs - numOfReleases;

    ...

Now it's up to you to find the exclude pattern and selection pattern that point you directly to the component or file that is the culprit. Then you just need to look at the stacks that belong to the bad guy you found, and you will also be able to see the source line that created the problem.
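To see how the two patterns interact, here is the same pipeline as a compact Python sketch. It is not the tool's actual code: the function names are made up, the modules ClientA and ClientB and all frames are hypothetical, and the AddRef/Release classification is an assumption (each ref count is compared with the previous one, starting from an assumed count of 0 when the data breakpoint was set).

```python
import re
from collections import defaultdict


def classify(ref_counts, initial=0):
    """Label each event by comparing its count with the previous one.

    Assumption: `initial` is the ref count at the time the data
    breakpoint was set.
    """
    kinds, previous = [], initial
    for count in ref_counts:
        kinds.append("AddRef" if count > previous else "Release")
        previous = count
    return kinds


def source_group(frames, exclude_pattern, selection_pattern):
    """Return the topmost selection match among the non-excluded frames."""
    for frame in frames:
        if exclude_pattern and re.search(exclude_pattern, frame):
            continue  # frame is filtered out by the exclude pattern
        match = re.search(selection_pattern, frame)
        if match:
            return match.group(0)
    return None  # no frame matched the selection pattern


def balances(events, exclude_pattern, selection_pattern):
    """events: list of (kind, frames). Returns {group: AddRefs - Releases}."""
    result = defaultdict(int)
    for kind, frames in events:
        group = source_group(frames, exclude_pattern, selection_pattern)
        result[group] += 1 if kind == "AddRef" else -1
    return dict(result)


# Hypothetical call stacks, topmost frame first.
stacks = [
    ["MyServer!CMyClass::AddRef+0x10", "ClientA!Loader::Open+0x20", "App!main+0x5"],
    ["MyServer!CMyClass::AddRef+0x10", "ClientB!Viewer::Show+0x31", "App!main+0x9"],
    ["MyServer!CMyClass::Release+0x12", "ClientA!Loader::Close+0x18", "App!main+0xc"],
]
kinds = classify([1, 2, 1])
result = balances(list(zip(kinds, stacks)), r"^MyServer!", r"^Client\w+")
```

Here the exclude pattern hides the server's own frames and the selection pattern picks the calling module. ClientA ends up with a balance of 0, while ClientB keeps a balance of +1 and is therefore the first place to look.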

VolkersRefCountBuster

You can download the sources and binaries here...

Have fun,

Volker

7 comments:

PJ Gray said...

I used your technique with much success, thanks for posting it. I even wrote a similar application to yours to parse the XML output.

However, I was confused by your explanation of setting the data breakpoint. In your example, you get the address of the refcount variable, but then set the breakpoint on a different address. Was this just a misprint, or am I missing an important step?

I ended up doing things slightly differently, where I had two windbg scripts, one for addref, and one for release. I then used 'bm /a MyModule!*MyClass*AddRef*' to set breakpoints on calls to addref, and likewise to release. I had to delete the 'thunks' that were in there, but it seemed to work.

Could I have simplified this using your data breakpoint?

Volker von Einem said...

Thanks for the hint. This is really a misprint which I will correct now.

For your last question: I also started with the approach of setting normal breakpoints on AddRef and Release, and ended up with the data breakpoint, which makes things much easier.

PJ Gray said...

Again I am curious about something.

Is your break-on-access method REQUIRED for making sure that you get every addref/release to the object in question? Can COM somehow change the reference count of the object without going through that object's addref/release functions? I.e., perhaps by using the unknown interface directly?

My problem is that I don't know which object instance is leaking! I have run UMDH to get a list of leaks, so I know the type of the object, but that's it! That is what led me to use regular breakpoints based on the class name listed in the UMDH stack trace.

Any hints would be helpful...

Volker von Einem said...

Hard to tell if the memory breakpoint is required, but I would say it's a good idea because the COM server has direct access to the ref count member.

How do you know that you have a COM object leak if you don't know which object is leaking?

PJ Gray said...

Well, I know that there should be no COM objects left allocated after a certain point in our application. (when we close a file, after parsing it). Every COM object should have been deleted when the file was closed. Therefore, I run UMDH after the file is closed, and any COM objects listed are leaking.

The problem is that a given stacktrace in UMDH might have a few hundred (or even thousand) different allocations. Let alone the number of times each one of those allocations is addref/released!!!

So I can set a memory breakpoint on the first object allocated, but there is no guarantee that it is leaking! Because there might be some objects of that same type that are correctly addref/released in other parts of our application that do not leak!

It is almost like I need to set a breakpoint in the construction of the object type, but only have the breakpoint set a memory breakpoint on ITS ref count object pointer that triggers the xml output script, and then continues. Basically, automating the first part of your method.

hmmmmmm.......the magic of windbg! I don't know if it'll work, but I'll report back if it does!

PJ Gray said...

Just a little update. I was actually able to finish fixing my leaks by using the bm /a technique to put separate AddRef/Release breakpoints. This allowed me to track every object instance created, rather than just one. There were some gotchas, but those were due to our own implementation details, not a failure in the debugging technique.

So anyway, thank you so much for posting this. It is what allowed me to fix this error!

Volker von Einem said...

Good to hear that!