Using sequential containers and analyzing strings
5.1 Separating students into categories
5.1.1 Erasing elements in place
Our extract_fails function does what we want, and is reasonably efficient, but it has a subtle flaw: It requires enough memory to hold two copies of each student record. The reason is that as it builds up pass and fail, the original records are still around. When the function is done with its for statement, and is ready to copy the results and return, there are two copies of each student record.
We would like to avoid keeping multiple copies of data around any longer than necessary. One way to do so is to eliminate pass entirely. Instead of creating two vectors, we will create a single local variable, named fail, to hold the value that we intend to return. For each record in students, we will compute the grade. If it is a passing grade, we'll leave the record alone; if it's a failing grade, we'll append a copy of it to fail and remove it from students.
To use this strategy, we need a way to remove an element from a vector. The good news is that such a facility exists; the bad news is that removing elements from vectors is slow enough to argue against using this approach for large amounts of input data. If the data we process get really big, performance degrades to an astonishing extent.
For example, if all of our students were to fail, the execution time of the function that we are about to see would grow proportionally to the square of the number of students. That means that for a class of 100 students, the program would take 10,000 times as long to run as it would for one student.
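To make the quadratic growth concrete, here is a minimal sketch that counts how many elements get moved when every element is erased from the front of a vector, which is what happens when every student fails. The function name shifts_when_all_fail is hypothetical, and the count assumes that each erase moves every element after the erased one exactly once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the element moves needed to erase all n
// elements of a vector from the front, one at a time.
std::size_t shifts_when_all_fail(std::size_t n)
{
    std::vector<int> v(n);
    std::size_t shifts = 0;

    while (!v.empty()) {
        shifts += v.size() - 1;   // every element after the first moves down
        v.erase(v.begin());
    }
    return shifts;
}
```

Under these assumptions the total is n(n - 1)/2 moves, so doubling the number of students roughly quadruples the work.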
The problem is that our input records are stored in a vector, which is optimized for fast random access. One price of that optimization is that it can be expensive to insert or delete elements other than at the end of the vector.
We shall see two ways to solve the performance problem: We can use a data structure that is better suited to our algorithm, or we can use a smarter algorithm that avoids the overhead of our initial design. From here through §5.5.2/87, we'll develop a solution that uses a more appropriate data structure. We'll show an algorithmic solution in §6.3/116.
Before we can understand why these solutions are improvements, we must have something to improve. Therefore, we'll begin by looking at the slow but direct solution:
// second try: correct but potentially slow
vector<Student_info> extract_fails(vector<Student_info>& students)
{
    vector<Student_info> fail;
    vector<Student_info>::size_type i = 0;

    // invariant: elements [0, i) of students represent passing grades
    while (i != students.size()) {
        if (fgrade(students[i])) {
            fail.push_back(students[i]);
            students.erase(students.begin() + i);
        } else
            ++i;
    }
    return fail;
}
We begin this version by creating fail, which is the vector into which we'll copy the records for students with failing grades. We next define i, which we'll use as an index into students. We'll process each record, iterating through students until we've seen all the entries in students.
For each record in students, we determine whether it represents a failing grade. If so, then we need to copy that record into fail and remove it from students. The push_back call to append a copy of students[i] to fail is nothing new. What is new is the way we remove the element from students:
students.erase(students.begin() + i);
The vector type includes a member named erase, which removes an element from the vector. The argument to erase indicates which element to remove. As it happens, there is no version of the erase function that operates on indices, because, as we shall see in §5.5/85, not all containers support indices, and it is more useful for the library to offer a form of erase that will work the same way with all containers. Instead, the erase function takes a type that we shall discuss in §5.2.1/80.
What's important to understand now is that we can indicate which element to erase by adding our index to the value returned by students.begin(). Recall that students.begin() returns a value that denotes the vector's initial element, the one with index 0. If we add an integer, such as i, to that value, the result denotes the element with index i. We can now see that this call to erase removes the element with index i from students.
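As a small illustration of that arithmetic, the sketch below returns whatever element begin() + i denotes. The helper name element_at_offset is hypothetical, and the * operator that it uses to reach the element belongs to the type discussed in §5.2.1/80, so treat that detail as a preview rather than something we have explained.

```cpp
#include <vector>

// Hypothetical helper: returns the element that v.begin() + i denotes.
// The caller must ensure that i is less than v.size().
int element_at_offset(const std::vector<int>& v,
                      std::vector<int>::size_type i)
{
    return *(v.begin() + i);   // same element as v[i]
}
```

For any valid index, element_at_offset(v, i) and v[i] yield the same value, which is why begin() + i can stand in for the index in the call to erase.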
Once we have removed an element from the vector, the vector now has one fewer element than it did before.
In addition to changing the size of the vector, erase removes the element with index i, thereby causing i to denote the next element in the sequence. Each element after position i is copied to the preceding position. Thus, although i does not change, erase has the effect of adjusting the index to denote the next element in the vector, which means that we must not increment it for the next iteration.
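The effect is easy to observe on a vector of integers. In this sketch, erase_at is a hypothetical helper that works on a copy of its argument, so the caller's vector is left alone.

```cpp
#include <vector>

// Hypothetical helper: returns a copy of v with the element at index i
// removed. The caller must ensure that i is less than v.size().
std::vector<int> erase_at(std::vector<int> v, std::vector<int>::size_type i)
{
    v.erase(v.begin() + i);   // each later element moves down one position
    return v;
}
```

Erasing index 1 from a vector holding 10, 20, 30, 40 leaves 10, 30, 40: index 1 now denotes 30, the element that used to follow the erased one.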
If the record we're looking at does not contain a failing grade, then we want to leave it in students. In that case, we must increment i, so that i will refer to the next record on the next trip through the while.
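To try the function in isolation, we can pair it with stand-ins for the pieces it depends on. The sketch below is not the Student_info or fgrade developed earlier in the book: the two-member struct and the passing threshold of 60 are assumptions made only so that the example is self-contained.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for Student_info (an assumption for this sketch).
struct Student_info {
    std::string name;
    double final_grade;
};

// Stand-in predicate: a grade below 60 is treated as failing (assumed threshold).
bool fgrade(const Student_info& s)
{
    return s.final_grade < 60;
}

// Same logic as the version above, written with explicit std:: qualification.
std::vector<Student_info> extract_fails(std::vector<Student_info>& students)
{
    std::vector<Student_info> fail;
    std::vector<Student_info>::size_type i = 0;

    // invariant: elements [0, i) of students represent passing grades
    while (i != students.size()) {
        if (fgrade(students[i])) {
            fail.push_back(students[i]);
            students.erase(students.begin() + i);   // i now denotes the next record
        } else
            ++i;                                    // keep this record and move on
    }
    return fail;
}
```

After the call, students holds only the passing records, in their original order, and the returned vector holds the failing ones.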
We determine whether we have seen all the records in students by comparing i with students.size(). When we erase an element from the vector, the vector has one fewer element than it did before. Therefore, it is essential that we call students.size on each trip through the condition. If, instead, we precomputed and stored the result of size
// this code will fail because of misguided optimization
vector<Student_info>::size_type size = students.size();
while (i != size) {
    if (fgrade(students[i])) {
        fail.push_back(students[i]);
        students.erase(students.begin() + i);
    } else
        ++i;
}
our program would fail, because calling erase would have changed the number of elements in students. If we precomputed the size and actually erased any records for failing students, then we would make too many trips through students, and the references to students[i] would be to nonexistent elements! Fortunately, calls to size() are usually fast, so the expected overhead from calling size each time is negligible.
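We can measure how far a saved size drifts from the real one without ever indexing past the end. The sketch below uses a vector of integers and treats negative values as the ones to erase; both choices, and the name stale_size_error, are assumptions made for illustration. The loop itself calls size() on every trip, so it stays within bounds.

```cpp
#include <vector>

// Hypothetical helper: erases every negative value from a copy of v and
// returns how much a size saved before the loop overstates the final size.
std::vector<int>::size_type stale_size_error(std::vector<int> v)
{
    const std::vector<int>::size_type saved = v.size();   // never updated
    std::vector<int>::size_type i = 0;

    while (i != v.size()) {            // size() is reevaluated on each trip
        if (v[i] < 0)
            v.erase(v.begin() + i);
        else
            ++i;
    }
    return saved - v.size();           // one for each element erased
}
```

A nonzero result is the number of nonexistent elements that a loop bounded by the saved size would have tried to examine.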