Technical Blog for Jim Beveridge: July 2010

There are few features of STL that have caused me more aggravation than the erase() functions.The problems with erase() are, first that it takes an iterator as an argument and second that erase() invalidates all iterators - sometimes. This makes for a catch-22 situation if you don't know "the secret."

Consider what happens if you want to erase each element of a collection that meets some condition. Deleting a single element is not an issue because you don't need the iterator afterward. Deleting all elements is similarly easy because you can simply call clear(). Deleting an arbitrary number of elements is generally written this way by most beginners:

set<int>coll;

/* ... */

for (set<int>::iterator itr=coll.begin(); itr!=coll.end(); ++itr)

   if (*itr % 2 == 0)

      coll.erase(itr);

Predictably, this code doesn't work. The iterator is invalidated when erase() is called and so the ++itr statement in the for() loop fails. Most modern versions of STL have debugging code to warn you if you use an invalidated iterator, but no such training wheels existed in the early days of STL. I got a lot of bug reports because of this mistake.

My first attempted workaround was for the vector class. I wanted to ignore the iterator completely and just step through the collection with an index:

for (unsigned int i=0; i<coll.size(); ++i)

When I tried to call coll.erase(), I discovered that I needed an iterator, which is what I was trying to avoid in the first place. Years after I first tried to make this solution work, I finally learned the solution. Here's an example to delete all even values when using an index-based for loop;

vector<int> coll;

/* ... */

for (unsigned int i=0; i<coll.size(); ++i)

    if (coll[i]%2 == 0)

        coll.erase(coll.begin() + i--);

coll.begin() is always a valid iterator (even if it equals "end()"), and you are allowed to use array arithmetic on iterators for vector<> and for deque<>. Although this solutions works, it's a bit of a hack, especially because you have to "fix up" i after the call to erase() so that the index of the array won't change. This strategy also won't work for collections such as list<> and set<> that can't be indexed by a scalar.

So I was back to figuring out how to get a "valid" iterator after calling erase().

In some STL collections, the erase() function returns an iterator to the next element. Therefore, even though the original iterator was invalidated, you were given a new iterator that would work, so there was no conflict. I thought the code would be easy to write, once I learned about that return value:

set<int> coll;

/* ... */

for (set<int>::iterator itr=coll.begin(); itr!=coll.end(); ++itr)

   if (*itr%2 == 0)

      itr = coll.erase(itr);

I liked this solution. No tricks, easy to write, easy to read. But it crashed. Why? Because erase() could return coll.end(), which the for loop would try to increment, which wasn't allowed. So the solution looks like this:

set<int> coll;

/* ... */

set<int>::iterator itr=coll.begin();

while ( itr!=coll.end() )

{

   if (*itr%2 == 0)

      itr = coll.erase(itr);

else
++itr;
}

The Microsoft version of STL implements map::erase() by returning an iterator, but this is non-standard. I therefore ran into trouble when I switched to STLport and map::erase() no longer returned an iterator. I found the solution buried in the STL source code, where I learned that the iterators in some collections are stable if they are incremented before erase() is called, in which case you end up with this result:

set<int> coll;

/* ... */

set<int>::iterator itr=coll.begin();

while ( itr!=coll.end() )

{

   if (*itr%2 == 0)

   {

      set<int>::iterator next = itr;

      ++next;

      coll.erase(itr);

      itr = next;

   }
   else
   {
      ++itr;

   }

}

Later I learned that the code could be simplified:

set<int> coll;

/* ... */

set<int>::iterator itr=coll.begin();

while ( itr!=coll.end())

{

   if (*itr%2 == 0)

      coll.erase(itr++);

else
++itr;
}

Note that we can't use the "hack" from the beginning of this article, which would allow us to write coll.erase(itr--) and then increment the iterator in a surrounding for loop. The problem is that a scalar can decrement below zero, but an iterator cannot decrement before begin().

Personally, I don't like the autoincrement code above as much as when erase() returns a value. It relies on the reader understanding the "trick" and it makes it more difficult for collection classes to know in advance what to do. For example, this technique can't be used with a vector, which is why vector does return an iterator. This creates a confusing inconsistency. (Which is why the Microsoft version of map::erase() is non-standard, but much more consistent.)

The question is, why does this trick work? If you've spent much time in C++, you've inevitably learned about the dangers of code like this:

int i = 5;

i = i++;

The code could end with i equal to either 5 or 6 - the result is undefined because the C++ standard only guarantees that ++ will be executed after i is read and before the semi-colon. It could be incremented before or after the assignment happens. (If you are wondering why, remember that assignments in C++ can be embedded in expressions, unlike most languages.)

To reword the guarantee in the language of the C++ Standard, the only guarantee is that ++ will be executed after i is read and before the next sequence point. The sequence point is the key to understanding our call to erase().

In the expression above, the semicolon is the sequence point. Now look at our code example:

coll.erase(itr++);

If you apply the same evaluation rules, the undefined timing of the execution of ++ would mean that the iterator could be evaluated after erase() returns and the iterator had been invalidated. This would crash, but experience shows that it doesn't. The reason is that the function call to erase() introduces a new sequence point. The code is guaranteed to be evaluated as:

The value of itr is loaded and pushed on the stack for erase().
itr is incremented.
erase() is called.

All of which gives the desired result. QED.

One of my biggest annoyances with debugging MFC code is constantly stepping through CString functions and COM object functions. In Visual C++ 6 there was some functionality in autoexp.dat for handing this, but I never got it to work to my satifaction.

This week I was able to solve the problem. Here's how (note that the 10.0 is for Visual Studio 2010. The value will be 9.0 for VS2008. For earlier versions, see the link at the end of this article.)

Open RegEdit
On 32-bit Windows, go to:
HKEY_LOCAL_MACHINE\Software\Microsoft\ VisualStudio\10.0\NativeDE\StepOver

On 64-bit Windows, go to:
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\VisualStudio\10.0\NativeDE\StepOver

At least for Visual Studio 2010, there are several default functions defined.
Add a new String value. Name it CStringT.
Edit the key and set the value to:

ATL\:\:CStringT.*

The colons must be backslashed as shown. CStringT is the template class used for CString, CStringA and CStringW. The CStringT class is shared with ATL, so that's why it's in the ATL namespace, even if you are using MFC. The "dot star" at the end matches anything for the rest of the line.
Here are a couple of other examples:

Key: CSimpleString
Value: ATL\:\:CSimpleStringT.*

Key: CComPtrBase
Value: ATL\:\:CComPtrBase.*

Key: vector
Value: std\:\:vector.*

Update 7/26/2010: There are additional features documented by Andy Pennell at Microsoft.

Technical Blog for Jim Beveridge

Sunday, July 25, 2010

Safely using erase() in STL

Friday, July 16, 2010

"Invalid pointer" error in Visual Studio 2010 in Configuration Manager

Thursday, July 1, 2010

Debug "Just My Code" for C++ and MFC

Blog Archive

Labels