
Using sequential containers and analyzing strings


5.5.2 Why bother?

The code that extracts records for failing students is a good example of the effect of data structure choices on performance. The code accesses elements sequentially, which generally implies that a vector is the best choice. On the other hand, we also delete elements from the interior of the container, thus favoring lists.

As with any performance-related question, the data structure that is "best" depends on whether performance even matters. Performance is a tricky subject that is generally outside the scope of this book, but it is worth noting that the choice of data structure can have a profound effect on a program's performance. For small inputs, lists are slower than vectors.

For large inputs, a program that uses vectors in an inappropriate way can run much more slowly than it would if it were to use lists. It can be surprising how quickly performance degrades as the input grows.

To test our programs' performance, we used three files of student records. The first file had 735 records. The second file was ten times as big, and the third, ten times bigger than that, or 73,500 records. The following table records the time, in seconds, that it took to execute the programs on each file size:

File size    list    vector
735          0.1     0.1
7,350        0.8     6.7
73,500       8.8     597.1

For the file with 73,500 records, the list version of the program took less than nine seconds to run, whereas the vector version took nearly ten minutes. The discrepancy would have been even greater had there been more failing students.
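The gap comes from erase. Removing an element from the interior of a vector moves every element after it, whereas a list only relinks the neighboring elements. The following is a minimal sketch of the list-based extraction, not the program that was timed above. It assumes bare double grades in place of student records and a passing threshold of 60; the names failing and extract_failing are hypothetical.

```cpp
#include <list>

// Hypothetical stand-in for a student record: a bare grade.
// Assumption: a grade below 60 counts as failing.
bool failing(double grade) { return grade < 60; }

// Move the failing grades out of `grades` and return them.
// Each erase on a list takes constant time; the same loop over a vector
// would shift every later element on each erase.
std::list<double> extract_failing(std::list<double>& grades)
{
    std::list<double> fail;
    std::list<double>::iterator iter = grades.begin();
    while (iter != grades.end()) {
        if (failing(*iter)) {
            fail.push_back(*iter);
            iter = grades.erase(iter);   // erase returns the next valid iterator
        } else {
            ++iter;
        }
    }
    return fail;
}
```

After the call, grades holds only the passing grades, in their original order, and the returned list holds the failing ones.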

5.6 Taking strings apart

Now that we've seen some of what we can do with containers, we're going to turn our attention back to strings. Until now, we've done only a few things with strings: We've created them, read them, concatenated them, written them, and looked at their size. In each of these uses, we have dealt with the string as a single entity. Often, this kind of abstract usage is what we want: We want to ignore the detailed contents of a string. Sometimes, though, we need to look at the specific characters in a string.

As it turns out, we can think of a string as a special kind of container: It contains only characters, and it supports some, but not all, of the container operations. The operations that it does support include indexing, and the string type provides an iterator that is similar to a vector iterator. Thus, many of the techniques that we can apply to vectors apply also to strings.

For example, we might want to break a line of input into words, separated from each other by whitespace (space, tab, or the end of the line). If we can read the input directly, we can get the words from the input trivially. After all, that's exactly how the string input operator executes: It skips any leading whitespace and then reads characters up to the next whitespace character.

However, there are times when we want to read an entire line of input and examine the words within that line. We'll see examples in §7.3/126 and §7.4.2/131.

Because such an operation might be generally useful, we'll write a function to do it. The function will take a string and return a vector<string>, which will contain an entry for each whitespace-separated word in that string. In order to understand this function, you need to know that strings support indexing much the same way as vectors do. So, for example, if s is a string that contains at least one character, the first character of s is s[0], and the last character of s is s[s.size() - 1].
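As a small illustration of string indexing, here is a sketch of two helper functions. The names first_char and last_char are ours, introduced only for this example. Each one asserts the precondition stated above, that the string contains at least one character.

```cpp
#include <cassert>
#include <string>

// Return the first character of s. Precondition: s is not empty.
char first_char(const std::string& s)
{
    assert(!s.empty());
    return s[0];
}

// Return the last character of s. Precondition: s is not empty.
char last_char(const std::string& s)
{
    assert(!s.empty());
    return s[s.size() - 1];   // valid indices run from 0 to size() - 1
}
```

For a one-character string, both functions return the same character, because 0 and s.size() - 1 are then the same index.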

Our function will define two indices, i and j, that will delimit each word in turn. The idea is that we will locate a word by computing values for i and j such that the word will be the characters in the range [i, j). For example, if s is the string "a word", then the first word is delimited by i = 0 and j = 1, and the second by i = 2 and j = 6.

Once we have these indices, we'll use the characters that they delimit to create a new string, which we will copy into our vector. When we are done, we will return the vector to our caller:

vector<string> split(const string& s)
{
    vector<string> ret;
    typedef string::size_type string_size;
    string_size i = 0;

    // invariant: we have processed characters [original value of i, i)
    while (i != s.size()) {
        // ignore leading blanks
        // invariant: characters in range [original i, current i) are all spaces
        while (i != s.size() && isspace(s[i]))
            ++i;

        // find end of next word
        string_size j = i;
        // invariant: none of the characters in range [original j, current j) is a space
        while (j != s.size() && !isspace(s[j]))
            ++j;

        // if we found some nonwhitespace characters
        if (i != j) {
            // copy from s starting at i and taking j - i chars
            ret.push_back(s.substr(i, j - i));
            i = j;
        }
    }
    return ret;
}

In addition to the system headers that we have already encountered, this code needs the <cctype> header, which defines isspace. More generally, this header defines useful functions for processing individual characters. The c at the beginning of cctype is a reminder that the ctype facility is part of C++'s inheritance from C.

The split function has a single parameter, which is a reference to a const string that we'll name s. Because we will be copying words from s, split does not need to change the string. As in §4.1.2/54, we can pass a const reference to avoid the cost of copying the string, while still ensuring that split will not change its argument.

We start off by defining ret, which will hold the words from the input string. The next two statements define and initialize our first index, i. As we saw in §2.4/22, string::size_type is the name for the appropriate type to index a string. Because we need to use this type more than once, we start by defining a shorter synonym for this type, as we did in §3.2.2/43, to simplify the subsequent declarations. We will use i as the index that finds the start of each word, advancing i through the input string one word at a time.

The test in the outermost while ensures that once we've processed the last word in the input, we'll stop.

Inside the while, we start by positioning our two indices. First, we find the first non-space character in s that is at or after the position currently indicated by i. Because there might be multiple whitespace characters in the input, we increment i until it denotes a character that is not whitespace.

There is a lot going on in this statement:

while (i != s.size() && isspace(s[i])) ++i;

The isspace function is a predicate that takes a char and returns a value that indicates whether that char is whitespace. The && operator tests whether both its operands are true, failing if either of them is false. In this expression, the operation will succeed if i is not equal to the size of s (meaning that we have not reached the end of the string), and s[i] is a whitespace character. In that case, we will increment i and check again.

As we described in §2.4.2.2/26, the logical && operation uses a short-circuit strategy for evaluating its operands. Unlike our earlier examples, this one relies on the short-circuit property of &&. The binary logical operations (operators && and ||) execute by testing their left-hand operands first. If that test suffices to determine the overall result, then the right-hand operand is not evaluated. In the case of the &&, the second condition is evaluated if and only if the first condition is true. Thus, the condition in the while executes by first checking whether i != s.size(). Only if this test succeeds does it use i to look at a character in s. Of course, if i is equal to s.size(), then there are no more characters left to examine, and so we drop out of the loop.
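To see the guard in isolation, here is a sketch that wraps the whitespace-skipping loop in a function of its own. The name skip_spaces and the asserted precondition are ours, not part of split. The cast to unsigned char is also our addition, a common precaution because passing a negative char value to isspace is undefined.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Return the index of the first non-whitespace character at or after i,
// or s.size() if there is none. Precondition: i <= s.size().
std::string::size_type skip_spaces(const std::string& s,
                                   std::string::size_type i)
{
    assert(i <= s.size());
    // The left operand guards the right one: s[i] is read only when
    // i != s.size() is true, so the loop never indexes past the end.
    while (i != s.size() && std::isspace(static_cast<unsigned char>(s[i])))
        ++i;
    return i;
}
```

If the operands were written in the opposite order, the loop would evaluate s[i] with i equal to s.size() whenever the string ends in whitespace, which is exactly the access that the short-circuit rule lets us avoid.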

Once we fall out of this while, we know either that i denotes a character that is not whitespace, or that we've run out of input without finding such a character.

Assuming that i is still a valid index, the next while will find the space that terminates the current word in s. We start by creating our other index, j, and initializing it to the value of i. The next while,

while (j != s.size() && !isspace(s[j])) ++j;

executes similarly to the previous one, but this time the while stops when it encounters a whitespace character. As before, we start by ensuring that j is still in range. If so, we again call isspace on the character indexed by j. This time, we negate the return from isspace using the logical negation operator, !. In other words, we want the condition to be true if isspace(s[j]) is not true.

Having completed the two inner while loops, we know that we have either found another word or run out of input while looking for a word. If we have run out of input, then both i and j will be equal to s.size(). Otherwise, we have found a word, which we must push onto ret:

// if we found some nonwhitespace characters
if (i != j) {
    // copy from s starting at i and taking j - i chars
    ret.push_back(s.substr(i, j - i));
    i = j;
}

The call to push_back uses a member of the string class, named substr, that we have not previously seen. It takes an index and a length, and creates a new string that contains a copy of characters from the initial string, starting at the index given by the first argument, and copying as many characters as indicated by its second argument. The substring that we extract starts at i, which is the first character in the word that we just found. We copy characters from s starting with the one indexed by i, and continuing until we have copied the characters in the (half-open) range [i, j). Remembering from §2.6/31 that the number of elements in a half-open range is the difference between the bounds, we see that we will copy exactly j - i characters.
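Because substr takes a start index and a length rather than two indices, it is easy to pass j where j - i is needed. The following sketch makes the conversion explicit. The function name slice is hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <string>

// Return the characters of s in the half-open range [i, j).
// Precondition: i <= j <= s.size().
std::string slice(const std::string& s,
                  std::string::size_type i, std::string::size_type j)
{
    assert(i <= j && j <= s.size());
    return s.substr(i, j - i);   // second argument is a length, not an end index
}
```

With the nine-character string "some text", the range [0, 4) yields "some" and the range [5, 9) yields "text". An empty range such as [3, 3) yields an empty string.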

5.7 Testing our split function

Having written our function, we'd like to test it. The easiest way to do so is to write a program that reads a line of input and passes that line to the split function. We can then write the contents of the vector that split returns. Such a test program will make it easy to inspect the output, and to verify that the split function generates the words that we expect.

More usefully, this test program should produce the same results as a program that just reads words from the standard input and writes the words one per output line. We can write this latter program, run it and our test program on the same input files, and verify that our programs generate identical output. If so, we can be fairly confident in our split function.

Let's start by writing the test program for split:

int main()
{
    string s;

    // read and split each line of input
    while (getline(cin, s)) {
        vector<string> v = split(s);

        // write each word in v
        for (vector<string>::size_type i = 0; i != v.size(); ++i)
            cout << v[i] << endl;
    }
    return 0;
}

This program needs to read the input an entire line at a time. Fortunately, the string library provides what we need in the getline function, which reads input until it reaches the end of the line. The getline function takes two arguments. The first is the istream from which to read; the second is a reference to the string into which to store what is read. As usual, the getline function returns a reference to the istream from which we read, so that we can test that istream in a condition. If we hit end-of-file or encounter invalid input, then the return from getline will indicate failure and we'll break out of the while.

As long as we can read a line of input, we store that line in s and pass it to split, storing the return value from split in v. Next, we loop through v, writing each string in that vector on a separate line.
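The behavior of getline can be tried without typing at a terminal by reading from an in-memory stream. The sketch below assumes an istringstream as a stand-in for cin; the function name lines_of is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Break `text` into lines with getline, reading from an in-memory
// stream in place of cin. getline discards the newline that ends
// each line, so the stored strings do not contain it.
std::vector<std::string> lines_of(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string s;
    while (std::getline(in, s))   // the condition fails at end-of-file
        lines.push_back(s);
    return lines;
}
```

An empty line in the input produces an empty string in the result, and a final line without a trailing newline is still read.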

Assuming that we added the proper #includes, including one for our own header that contained a declaration for split, we could run this program and visually verify that it and split work as expected. We can do even better, though, by comparing the output of this program with a program that lets the library do all the work:

int main() {

string s;

while (cin >> s) cout << s << endl;

return 0;

}

This program and the previous one should generate identical output. Here, we let the string input operator separate the input stream into a series of words, which we write one to a line.

By running both programs on the same complex input and comparing their output, we can be reasonably confident that our split function works.
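The same comparison can be made inside a single program, one line at a time. The sketch below repeats split in compact form so that it stands alone, and adds two hypothetical helpers, words_from_stream and split_agrees. It assumes an istringstream in place of cin, and it casts to unsigned char before calling isspace as a precaution.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Compact copy of split, repeated here so that the sketch stands alone.
std::vector<std::string> split(const std::string& s)
{
    std::vector<std::string> ret;
    std::string::size_type i = 0;
    while (i != s.size()) {
        while (i != s.size() && std::isspace(static_cast<unsigned char>(s[i])))
            ++i;
        std::string::size_type j = i;
        while (j != s.size() && !std::isspace(static_cast<unsigned char>(s[j])))
            ++j;
        if (i != j) {
            ret.push_back(s.substr(i, j - i));
            i = j;
        }
    }
    return ret;
}

// Let the library do the work: read words with the string input operator.
std::vector<std::string> words_from_stream(const std::string& line)
{
    std::istringstream in(line);
    std::vector<std::string> ret;
    std::string word;
    while (in >> word)
        ret.push_back(word);
    return ret;
}

// True if both approaches produce the same words for this line.
bool split_agrees(const std::string& line)
{
    return split(line) == words_from_stream(line);
}
```

Lines that are empty, contain only whitespace, or mix spaces and tabs are the interesting inputs to try, because they exercise the boundary conditions in both inner loops.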

5.8 Putting strings together

In §1.2/12 and §2.5.4/29, we wrote a program to write someone's name centered in a box of asterisks. However, we never actually created a string to hold our program's output. Instead, we wrote the various parts of our output, one at a time, and let the output file combine those fragments into a picture.

We will now revisit this problem, with the aim of building a single data structure that represents the entire framed string. This program is a simplified version of one of our favorite examples, called character pictures. A character picture is a rectangular array of characters that can be displayed. It is a simplification of what happens in a real application, in this case an application based on bitmap graphics. The simplifications are to use characters instead of bits, and to write onto ordinary files instead of displaying on graphical hardware. The problem builds on an exercise originally presented in the first edition of Stroustrup's The C++ Programming Language (Addison-Wesley, 1986), and that we explored in some depth in Ruminations on C++ (Addison-Wesley, 1997).