
Variable Typing and Comparison Expressions

From the document GAWK: Effective AWK Programming (pages 148-153)

6.3 Truth Values and Conditions

6.3.2 Variable Typing and Comparison Expressions

The Guide is definitive. Reality is frequently inaccurate.

—Douglas Adams, The Hitchhiker’s Guide to the Galaxy

Unlike in many other programming languages, in awk variables do not have a fixed type. Instead, they can be either a number or a string, depending upon the value that is assigned to them. We look now at how variables are typed, and how awk compares variables.

6.3.2.1 String Type versus Numeric Type

Scalar objects in awk (variables, array elements, and fields) are dynamically typed. This means their type can change as the program runs, from untyped before any use,[4] to string or number, and then from string to number or number to string, as the program progresses.

(gawk also provides regexp-typed scalars, but let’s ignore that for now; see Section 6.1.2.2 [Strongly Typed Regexp Constants], page 117.)

You can’t do much with untyped variables, other than tell that they are untyped. The following program tests a against "" and 0; the test succeeds when a has never been assigned a value. It also uses the built-in typeof() function (not presented yet; see Section 9.1.7 [Getting Type Information], page 213) to show a’s type:

$ gawk 'BEGIN { print (a == "" && a == 0 ?
> "a is untyped" : "a has a type!") ; print typeof(a) }'
-| a is untyped
-| unassigned

A scalar has numeric type when assigned a numeric value, such as from a numeric constant, or from another scalar with numeric type:

$ gawk 'BEGIN { a = 42 ; print typeof(a)
> b = a ; print typeof(b) }'
-| number
-| number

Similarly, a scalar has string type when assigned a string value, such as from a string constant, or from another scalar with string type:

$ gawk 'BEGIN { a = "forty two" ; print typeof(a)
> b = a ; print typeof(b) }'
-| string
-| string

So far, this is all simple and straightforward. What happens, though, when awk has to process data from a user? Let’s start with field data. What should the following command produce as output?

echo hello | awk '{ printf("%s %s < 42\n", $1,
                           ($1 < 42 ? "is" : "is not")) }'

Since ‘hello’ is alphabetic data, awk can only do a string comparison. Internally, it converts 42 into "42" and compares the two string values "hello" and "42". Here’s the result:

$ echo hello | awk '{ printf("%s %s < 42\n", $1,
>                            ($1 < 42 ? "is" : "is not")) }'
-| hello is not < 42

However, what happens when data from a user looks like a number? On the one hand, in reality, the input data consists of characters, not binary numeric values. But, on the other hand, the data looks numeric, and awk really ought to treat it as such. And indeed, it does:

[4] gawk calls this unassigned, as the typeof() example earlier in this section shows.

$ echo 37 | awk '{ printf("%s %s < 42\n", $1,
>                         ($1 < 42 ? "is" : "is not")) }'
-| 37 is < 42

Here are the rules for when awk treats data as a number, and for when it treats data as a string.

The POSIX standard uses the term numeric string for input data that looks numeric.

The ‘37’ in the previous example is a numeric string. So what is the type of a numeric string? Answer: numeric.

The type of a variable is important because the types of two variables determine how they are compared. Variable typing follows these definitions and rules:

• A numeric constant or the result of a numeric operation has the numeric attribute.

• A string constant or the result of a string operation has the string attribute.

• Fields, getline input, FILENAME, ARGV elements, ENVIRON elements, and the elements of an array created by match(), split(), and patsplit() that are numeric strings have the strnum attribute.[5] Otherwise, they have the string attribute. Uninitialized variables also have the strnum attribute.

• Attributes propagate across assignments but are not changed by any use.

The last rule is particularly important. In the following program, a has numeric type, even though it is later used in a string operation:

BEGIN {
     a = 12.345
     b = a " is a cute number"
     print b
}
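As a minimal sketch of this rule, the program can be extended with two comparisons against the constant 2. These comparisons are an addition for illustration and are not part of the original program. If a had become a string, "12.345" would sort before "2" and the first test would be true; it is false because a is still compared as a number. The second test is true because b, the result of a string operation, is compared as a string:

```shell
awk 'BEGIN {
    a = 12.345
    b = a " is a cute number"
    print (a < 2)    # a is still numeric: 12.345 < 2 is false, prints 0
    print (b < 2)    # b is a string: it sorts before "2", prints 1
}'
```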

When two operands are compared, either string comparison or numeric comparison may be used. This depends upon the attributes of the operands, according to the following symmetric matrix:

          STRING    NUMERIC   STRNUM
STRING    string    string    string
NUMERIC   string    numeric   numeric
STRNUM    string    numeric   numeric

The basic idea is that user input that looks numeric (and only user input) should be treated as numeric, even though it is actually made of characters and is therefore also a string. Thus, for example, the string constant " +3.14", when it appears in program source code, is a string (even though it looks numeric) and is never treated as a number for comparison purposes.

In short, when one operand is a “pure” string, such as a string constant, then a string comparison is performed. Otherwise, a numeric comparison is performed. (The primary difference between a number and a strnum is that for strnums gawk preserves the original string value that the scalar had when it came in.)

[5] Thus, a POSIX numeric string and gawk’s strnum are the same thing.

This point bears additional emphasis: Input that looks numeric is numeric. All other input is treated as strings.

Thus, the six-character input string ‘ +3.14’ receives the strnum attribute. In contrast, the eight characters " +3.14" appearing in program text comprise a string constant. The following examples print ‘1’ when the comparison between the two different constants is true, and ‘0’ otherwise:

$ echo ' +3.14' | awk '{ print($0 == " +3.14") }'    True
-| 1
$ echo ' +3.14' | awk '{ print($0 == "+3.14") }'     False
-| 0
$ echo ' +3.14' | awk '{ print($0 == "3.14") }'      False
-| 0
$ echo ' +3.14' | awk '{ print($0 == 3.14) }'        True
-| 1
$ echo ' +3.14' | awk '{ print($1 == " +3.14") }'    False
-| 0
$ echo ' +3.14' | awk '{ print($1 == "+3.14") }'     True
-| 1
$ echo ' +3.14' | awk '{ print($1 == "3.14") }'      False
-| 0
$ echo ' +3.14' | awk '{ print($1 == 3.14) }'        True
-| 1
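The comparison matrix can be exercised in the same way with ordering tests. The following sketch uses the hypothetical input ‘10 9’, chosen because 10 is greater than 9 numerically while "10" sorts before "9" as a string, so the output reveals which kind of comparison was used:

```shell
echo 10 9 | awk '{
    print ($1 < $2)              # strnum vs. strnum: numeric, prints 0
    print ($1 < 9)               # strnum vs. numeric constant: numeric, prints 0
    print ($1 < "9")             # strnum vs. string constant: string, prints 1
    print (($1 "") < ($2 ""))    # concatenation results are strings: prints 1
}'
```

Only the last two tests involve a “pure” string operand, and only those are performed as string comparisons.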

You can see the type of an input field (or other user input) using typeof():

$ echo hello 37 | gawk '{ print typeof($1), typeof($2) }'
-| string strnum

6.3.2.2 Comparison Operators

Comparison expressions compare strings or numbers for relationships such as equality. They are written using relational operators, which are a superset of those in C. Table 6.3 describes them.

Expression            Result
x < y                 True if x is less than y
x <= y                True if x is less than or equal to y
x > y                 True if x is greater than y
x >= y                True if x is greater than or equal to y
x == y                True if x is equal to y
x != y                True if x is not equal to y
x ~ y                 True if the string x matches the regexp denoted by y
x !~ y                True if the string x does not match the regexp denoted by y
subscript in array    True if the array array has an element with the subscript subscript

Table 6.3: Relational operators

Comparison expressions have the value one if true and zero if false. When comparing operands of mixed types, numeric operands are converted to strings using the value of CONVFMT (see Section 6.1.4 [Conversion of Strings and Numbers], page 119).
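The effect of CONVFMT on a mixed comparison can be sketched as follows. The value 22 / 7 and the format "%.2f" are arbitrary choices for illustration. With the default CONVFMT of "%.6g", the number is converted to "3.14286" before the string comparison; with "%.2f" it is converted to "3.14":

```shell
# Default CONVFMT: 22/7 becomes "3.14286", which differs from "3.14"; prints 0
awk 'BEGIN { x = 22 / 7; print (x == "3.14") }'

# CONVFMT set to "%.2f": the same value becomes "3.14"; prints 1
awk 'BEGIN { CONVFMT = "%.2f"; x = 22 / 7; print (x == "3.14") }'
```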

Strings are compared by comparing the first character of each, then the second character of each, and so on. Thus, "10" is less than "9". If there are two strings where one is a prefix of the other, the shorter string is less than the longer one. Thus, "abc" is less than "abcd".

It is very easy to accidentally mistype the ‘==’ operator and leave off one of the ‘=’ characters. The result is still valid awk code, but the program does not do what is intended:

if (a = b)   # oops! should be a == b
   ...
else
   ...

Unless b happens to be zero or the null string, the if part of the test always succeeds.

Because the operators are so similar, this kind of error is very difficult to spot when scanning the source code.
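A short sketch makes the consequence visible. The values assigned to a and b are arbitrary; the point is that the test both overwrites a and takes its truth value from b:

```shell
awk 'BEGIN {
    a = 1; b = 2
    if (a = b)                # assignment, not comparison
        print "test succeeded; a is now " a
    a = 1; b = 0
    if (a = b)                # assigns zero, so the test fails
        print "not reached"
    else
        print "test failed; a is now " a
}'
```

The first test succeeds even though a and b were different, and in both cases the original value of a is lost.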

The following list of expressions illustrates the kinds of comparisons awk performs, as well as what the result of each comparison is:

1.5 <= 2.0
     Numeric comparison (true)

"abc" >= "xyz"
     String comparison (false)

1.5 != " +2"
     String comparison (true)

"1e2" < "3"
     String comparison (true)

a = 2; b = "2"
a == b
     String comparison (true)

a = 2; b = " +2"
a == b
     String comparison (false)

In this example:

$ echo 1e2 3 | awk '{ print ($1 < $2) ? "true" : "false" }'
-| false

the result is ‘false’ because both $1 and $2 are user input. They are numeric strings, so both have the strnum attribute, dictating a numeric comparison, and 100 is not less than 3.

The purpose of the comparison rules and the use of numeric strings is to attempt to produce the behavior that is “least surprising,” while still “doing the right thing.”
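The same input can be pushed through either kind of comparison. The sketch below relies on two common awk idioms that this section does not introduce, so treat it as illustrative: adding zero yields the result of a numeric operation, and concatenating the null string yields the result of a string operation.

```shell
echo 1e2 3 | awk '{
    print (($1 + 0) < ($2 + 0) ? "true" : "false")    # numeric: 100 < 3, prints false
    print (($1 "") < ($2 "") ? "true" : "false")      # string: "1e2" < "3", prints true
}'
```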

String comparisons and regular expression comparisons are very different. For example:

x == "foo"

has the value one, or is true, if the variable x is precisely ‘foo’. By contrast:

x ~ /foo/

has the value one if x contains ‘foo’, such as "Oh, what a fool am I!".

The righthand operand of the ‘~’ and ‘!~’ operators may be either a regexp constant (/.../) or an ordinary expression. In the latter case, the value of the expression as a string is used as a dynamic regexp (see Section 3.1 [How to Use Regular Expressions], page 47; also see Section 3.6 [Using Dynamic Regexps], page 55).
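The difference between the two kinds of test, and the use of an ordinary expression as the righthand operand, can be sketched as follows. The variable name re and its pattern are hypothetical, chosen only for illustration:

```shell
awk 'BEGIN {
    x = "Oh, what a fool am I!"
    print (x == "foo")    # x is not exactly "foo": prints 0
    print (x ~ /foo/)     # x contains "foo": prints 1
    re = "fo+l"           # hypothetical pattern held in a variable
    print (x ~ re)        # string value of re used as a dynamic regexp: prints 1
}'
```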

A constant regular expression in slashes by itself is also an expression. /regexp/ is an abbreviation for the following comparison expression:

$0 ~ /regexp/

One special place where /foo/ is not an abbreviation for ‘$0 ~ /foo/’ is when it is the righthand operand of ‘~’ or ‘!~’. See Section 6.1.2 [Using Regular Expression Constants], page 115, where this is discussed in more detail.
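As a small sketch of the abbreviation, a bare regexp constant can serve as a pattern or appear inside an ordinary expression such as the condition of an if statement. In both places it is matched against $0. The two input lines are made up for the example:

```shell
printf 'foo\nbar\n' | awk '
/foo/      { print "pattern: " $0 }       # same as $0 ~ /foo/
{ if (/bar/) print "expression: " $0 }    # /bar/ used as an ordinary expression
'
```

This prints ‘pattern: foo’ for the first line and ‘expression: bar’ for the second.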

6.3.2.3 String Comparison Based on Locale Collating Order

The POSIX standard used to say that all string comparisons are performed based on the locale’s collating order. This is the order in which characters sort, as defined by the locale (for more discussion, see Section 6.6 [Where You Are Makes a Difference], page 138). This order is usually very different from the results obtained when doing straight byte-by-byte comparison.

Because this behavior differs considerably from existing practice, gawk only implemented it when in POSIX mode (see Section 2.2 [Command-Line Options], page 31). Here is an example to illustrate the difference, in an en_US.UTF-8 locale:

$ gawk 'BEGIN { printf("ABC < abc = %s\n",
>                     ("ABC" < "abc" ? "TRUE" : "FALSE")) }'
-| ABC < abc = TRUE
$ gawk --posix 'BEGIN { printf("ABC < abc = %s\n",
>                             ("ABC" < "abc" ? "TRUE" : "FALSE")) }'
-| ABC < abc = FALSE

Fortunately, as of August 2016, comparison based on locale collating order is no longer required for the == and != operators. However, comparison based on locales is still required for <, <=, >, and >=. POSIX thus recommends as follows:

Since the == operator checks whether strings are identical, not whether they collate equally, applications needing to check whether strings collate equally can use:

a <= b && a >= b

As of version 4.2, gawk continues to use locale collating order for <, <=, >, and >= only in POSIX mode.
