Using Regular Expression Constants

From the document GAWK: Effective AWK Programming (pages 135-138)

6.1 Constants, Variables, and Conversions

6.1.2 Using Regular Expression Constants

Regular expression constants consist of text describing a regular expression enclosed in slashes (such as /the +answer/). This section describes how such constants work in POSIX awk and gawk, and then goes on to describe strongly typed regexp constants, which are a gawk extension.

6.1.2.1 Standard Regular Expression Constants

When used on the righthand side of the ‘~’ or ‘!~’ operators, a regexp constant merely stands for the regexp that is to be matched. However, regexp constants (such as /foo/) may be used like simple expressions. When a regexp constant appears by itself, it has the same meaning as if it appeared in a pattern (i.e., ‘($0 ~ /foo/)’). See Section 7.1.2 [Expressions as Patterns], page 141. This means that the following two code segments:

if ($0 ~ /barfly/ || $0 ~ /camelot/) print "found"

and:

if (/barfly/ || /camelot/) print "found"

are exactly equivalent. One rather bizarre consequence of this rule is that the following Boolean expression is valid, but does not do what its author probably intended:

# Note that /foo/ is on the left of the ~
if (/foo/ ~ $1) print "found foo"

This code is “obviously” testing $1 for a match against the regexp /foo/. But in fact, the expression ‘/foo/ ~ $1’ really means ‘($0 ~ /foo/) ~ $1’. In other words, first match the input record against the regexp /foo/. The result is either zero or one, depending upon the success or failure of the match. That result is then matched against the first field in the record. Because it is unlikely that you would ever really want to make this kind of test, gawk issues a warning when it sees this construct in a program. Another consequence of this rule is that the assignment statement:

matches = /foo/

assigns either zero or one to the variable matches, depending upon the contents of the current input record.
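Both consequences can be observed with a small experiment. The following is only a sketch: the three input records are made up for the demonstration, and it assumes an awk that follows the POSIX rules described above. Any warning that gawk prints for the first program goes to standard error, which is discarded here.

```shell
# Hypothetical input: only the last two records contain "foo".
input='0 bar
1 foo
foo baz'

# /foo/ ~ $1 really means ($0 ~ /foo/) ~ $1.
left=$(printf '%s\n' "$input" |
    awk '{ if (/foo/ ~ $1) print "found foo: " $0 }' 2>/dev/null)
printf '%s\n' "$left"

# matches = /foo/ stores the truth value of $0 ~ /foo/.
assigned=$(printf '%s\n' "$input" |
    awk '{ matches = /foo/; print matches ": " $0 }')
printf '%s\n' "$assigned"
```

The first program reports ‘0 bar’ (which does not contain foo) and ‘1 foo’, but not ‘foo baz’, because the zero or one produced by the match is what gets compared against $1. The second program prints 0 or 1 in front of each record.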

Constant regular expressions are also used as the first argument for the gensub(), sub(), and gsub() functions, as the second argument of the match() function, and as the third argument of the split() and patsplit() functions (see Section 9.1.3 [String-Manipulation Functions], page 189). Modern implementations of awk, including gawk, allow the third argument of split() to be a regexp constant, but some older implementations do not.
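As an illustration of these argument positions, here is a minimal sketch that uses only the functions available in POSIX awk (gensub() and patsplit() are left out because they are gawk extensions). The strings and variable names are invented for the example.

```shell
builtins=$(awk 'BEGIN {
    # split(): regexp constant as the third argument (the separator).
    n = split("a1b22c", parts, /[0-9]+/)
    print n, parts[1], parts[2], parts[3]

    # match(): regexp constant as the second argument.
    where = match("foobar", /o+/)
    print where, RSTART, RLENGTH

    # sub() and gsub(): regexp constant as the first argument.
    s = "hi! hi yourself!"
    sub(/hi/, "howdy", s)
    print s
    gsub(/hi/, "hello", s)
    print s
}')
printf '%s\n' "$builtins"
```

In each call the regexp constant stands for the regexp itself; it is not first matched against the current input record.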

Because some built-in functions accept regexp constants as arguments, confusion can arise when attempting to use regexp constants as arguments to user-defined functions (see Section 9.2 [User-Defined Functions], page 214). For example:

function mysub(pat, repl, str, global)
{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
}

{
    ...
    text = "hi! hi yourself!"
    mysub(/hi/, "howdy", text, 1)
    ...
}

In this example, the programmer wants to pass a regexp constant to the user-defined function mysub(), which in turn passes it on to either sub() or gsub(). However, what really happens is that the pat parameter is assigned a value of either one or zero, depending upon whether or not $0 matches /hi/. gawk issues a warning when it sees a regexp constant used as a parameter to a user-defined function, because passing a truth value in this way is probably not what was intended.
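The effect is easy to reproduce. The sketch below feeds one made-up input record (‘hi there’) to a program built around mysub(), and contrasts the regexp constant with a plain string argument, which awk treats as a computed regexp. Standard error is discarded so that gawk's warning does not clutter the output.

```shell
mysub_out=$(printf 'hi there\n' | awk '
function mysub(pat, repl, str, global)
{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
}

{
    text = "hi! hi yourself!"
    # $0 matches /hi/, so pat is 1 and gsub() searches for the text "1".
    print mysub(/hi/, "howdy", text, 1)
    # A string is passed through unchanged and used as a computed regexp.
    print mysub("hi", "howdy", text, 1)
}' 2>/dev/null)
printf '%s\n' "$mysub_out"
```

The first call returns the text unchanged, because the pattern it receives is the number one rather than the regexp. The second call performs the intended replacement.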

6.1.2.2 Strongly Typed Regexp Constants

This section describes a gawk-specific feature.

As we saw in the previous section, regexp constants (/.../) hold a strange position in the awk language. In most contexts, they act like an expression: ‘$0 ~ /.../’. In other contexts, they denote only a regexp to be matched. In no case are they really a “first class citizen” of the language. That is, you cannot define a scalar variable whose type is “regexp” in the same sense that you can define a variable to be a number or a string:

num = 42        # Numeric variable
str = "hi"      # String variable
re = /foo/      # Wrong! re is the result of $0 ~ /foo/

For a number of more advanced use cases, it would be nice to have regexp constants that are strongly typed; in other words, that denote a regexp useful for matching, and not an expression.

gawk provides this feature. A strongly typed regexp constant looks almost like a regular regexp constant, except that it is preceded by an ‘@’ sign:

re = @/foo/     # Regexp variable

Strongly typed regexp constants cannot be used everywhere that a regular regexp constant can, because this would make the language even more confusing. Instead, you may use them only in certain contexts:

• On the righthand side of the ‘~’ and ‘!~’ operators: ‘some_var ~ @/foo/’ (see Section 3.1 [How to Use Regular Expressions], page 47).

• In the case part of a switch statement (see Section 7.4.5 [The switch Statement], page 151).

• As an argument to one of the built-in functions that accept regexp constants: gensub(), gsub(), match(), patsplit(), split(), and sub() (see Section 9.1.3 [String-Manipulation Functions], page 189).

• As a parameter in a call to a user-defined function (see Section 9.2 [User-Defined Functions], page 214).

• On the righthand side of an assignment to a variable: ‘some_var = @/foo/’. In this case, the type of some_var is regexp. Additionally, some_var can be used with ‘~’ and ‘!~’, passed to one of the built-in functions listed above, or passed as a parameter to a user-defined function.

You may use the typeof() built-in function (see Section 9.1.7 [Getting Type Information], page 213) to determine if a variable or function parameter is a regexp variable.

The true power of this feature comes from the ability to create variables that have regexp type. Such variables can be passed on to user-defined functions, without the confusing aspects of computed regular expressions created from strings or string constants. They may also be passed through indirect function calls (see Section 9.3 [Indirect Function Calls], page 224) and on to the built-in functions that accept regexp constants.

When used in numeric conversions, strongly typed regexp variables convert to zero.

When used in string conversions, they convert to the string value of the original regexp text.
