GAWK: Effective AWK Programming

Nguyễn Gia Hào

Academic year: 2023

Weinberger's The AWK Programming Language (Addison-Wesley, 1988). awk's simple programming paradigm—find a pattern in the input and then perform an action—often reduces complex or tedious data manipulations to a few lines of code. A copy of the GPL is included in this book for your reference (see [GNU General Public License], page 509).

How to Run awk Programs

  • One-Shot Throwaway awk Programs
  • Running awk Without Input Files
  • Running Long Programs
  • Executable awk Programs
  • Comments in awk Programs
  • Shell Quoting Issues
    • Quoting in MS-Windows Batch Files

The first argument in the list is the full file name of the awk program. The shell does not interpret the quoted text, but passes it verbatim to the command.

Data files for the Examples

In this example, the '\042' is the octal code for a double quote; gawk converts it into a real double quote for output by the print statement. For each double quote in the original string, let N be the number of backslashes before it; N can be zero.
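
The octal-escape technique can be sketched as follows. This is a minimal illustration, not the manual's own example; the function name `quote_greeting` is hypothetical.

```shell
# Hypothetical helper: print a greeting wrapped in double quotes.
# "\042" is the octal escape for a double quote, so the awk program can
# stay inside shell single quotes without any nested quoting.
quote_greeting() {
  awk 'BEGIN { print "\042Hello, world\042" }'
}
```

Calling `quote_greeting` prints the greeting with literal double quotes around it.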

Some Simple Examples

If lines containing 'li' are found, they are printed, because 'print $0' means print the current line. (Just 'print' by itself means the same thing, so we could have written that instead.)
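
A minimal sketch of that one-rule program, wrapped in a shell function so it can be reused. The function name `print_li_lines` is an assumption made for illustration.

```shell
# Hypothetical helper: print every input line that contains the string "li".
# With no file arguments, awk reads standard input.
print_li_lines() {
  awk '/li/ { print $0 }' "$@"
}
```

For example, `printf 'Amelia\nBill\nJulie\n' | print_li_lines` prints the lines for Amelia and Julie but not Bill.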

An Example with Two Rules

This is a simple way to delete blank lines from a file (or rather, to create a new file similar to the old file, but from which the blank lines have been removed). If you used the expression 'NR % 2 == 1' instead, the program would print the odd-numbered lines.
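
Both one-liners can be sketched as shell functions. The names `drop_blank_lines` and `odd_lines` are hypothetical. Note that testing NF also drops lines that contain only whitespace, because such lines have no fields.

```shell
# Hypothetical helper: copy input to output, omitting blank lines.
# A line with no fields (NF == 0) is empty or contains only whitespace.
drop_blank_lines() {
  awk 'NF > 0' "$@"
}

# Hypothetical helper: print only the odd-numbered lines (1, 3, 5, ...).
odd_lines() {
  awk 'NR % 2 == 1' "$@"
}
```

Neither rule has an action, so awk applies the default action, which prints the matching record.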

A More Complex Example

Note how the line beginning with 'Jean-Paul' in the mail-list file is printed twice, once for each rule. By manipulating fields and using print statements, you can produce some very useful reports.

Once awk sees the '#' that begins a comment, everything else on the line is ignored. NOTE: The requirement that rules on the same line be separated by a semicolon was not in the original awk language; it was added for consistency with the treatment of statements within an action.

Other Features of awk

To have the pattern and action on separate lines, you must use backslash continuation; there is no other option. Another thing to keep in mind is that backslash continuation and comments don't mix.

When to Use awk

Summary

This chapter covers how to run awk, both the POSIX-standard and the gawk-specific command-line options, and what awk and gawk do with nonoption arguments. Many of the options and features described here are discussed in more detail later in the book; feel free to skip things in this chapter that don't interest you right now.

Invoking awk

It then discusses how gawk searches for source files, reading standard input along with other files, gawk's environment variables, gawk's exit status, using include files, and obsolete and undocumented options and/or features.

Command-Line Options

See Section A.7 [Overview of Common Extensions], page 459, for a summary of the extensions to gawk that are disabled by this option. However, this option remains (both for backward compatibility and for use in combination with --traditional).

Other Command-Line Arguments

If --lint is supplied on the command line and gawk turns on POSIX mode because of POSIXLY_CORRECT, then it issues a warning message indicating that POSIX mode is in effect. Variable values given on the command line are processed for escape sequences (see Section 3.2 [Escape Sequences], page 48).

Naming Standard Input

The variable assignment feature is most useful for assigning to variables such as RS, OFS, and ORS, which control input and output formats, before scanning the data files. Given the variable assignment feature, the -F option for setting the value of FS is not strictly necessary.
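
As a rough sketch of a command-line variable assignment, the function below sets OFS with an operand of the form var=value placed before the data file. The function name `colon_join` is hypothetical.

```shell
# Hypothetical helper: rejoin the fields of each line of file "$1" with ":".
# OFS=: is a command-line variable assignment; awk performs it when it
# reaches that argument, which is before it opens the data file after it.
colon_join() {
  awk '{ $1 = $1; print }' OFS=: "$1"
}
```

The assignment `$1 = $1` changes nothing in the field itself, but it makes awk rebuild the record using the current value of OFS.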

The Environment Variables gawk Uses

The AWKPATH Environment Variable

NOTE: To include the current directory in the path, either place '.' as an entry in the path or write a null entry in the path. Different past versions of gawk would also look explicitly in the current directory, either before or after the path search.

The AWKLIBPATH Environment Variable

Other Environment Variables

If this variable exists, gawk switches to reading source files one line at a time, instead of reading in blocks.

TIDYMEM: If this variable exists, gawk uses the mtrace() library calls from the GNU C library to help track down possible memory leaks.

Including Other Files into Your Program

If you have a large script with useful, general-purpose functions, you can break it down into library files and put those files in a separate directory. Of course, you can keep library files in more than one directory; the more complex the work environment, the more directories you may need to organize the files to be included.

Loading Dynamic Extensions into Your Program

You can then include those "libraries" either by using the full pathnames of the files, or by setting the AWKPATH environment variable accordingly and then using @include with just the file part of the full pathname. The rules for finding a source file described in Section 2.5.1 [The AWKPATH Environment Variable], page 39, also apply to files loaded with @include.

Obsolete Options and/or Features

Undocumented Options and Features

Summary

You can use a single minus sign ('-') to refer to standard input on the command line. gawk also allows you to use the special filename /dev/stdin. gawk pays attention to a number of environment variables. AWKPATH, AWKLIBPATH and POSIXLY_CORRECT are the most important. The exit status of gawk conveys information to the program that called it.

How to Use Regular Expressions

Because regular expressions are such a fundamental part of awk programming, their format and use deserve a separate chapter. A regular expression enclosed in slashes ('/') is an awk pattern that matches every input record whose text belongs to that set.

Escape Sequences

Because the regexp is delimited by slashes, you need to escape any slash that is part of the pattern, in order to tell awk to keep processing the rest of the regexp. Because the string is delimited by double quotes, you need to escape any quote that is part of the string, in order to tell awk to keep processing the rest of the string.
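
Both kinds of escaping can be seen together in one short sketch. The function name `quote_usr_bin` and the sample path are assumptions chosen for illustration.

```shell
# Hypothetical helper: print lines that contain the path "/usr/bin",
# wrapping each one in double quotes.
#   \/  escapes a slash inside the slash-delimited regexp constant
#   \"  escapes a double quote inside the double-quoted string constant
quote_usr_bin() {
  awk '/\/usr\/bin/ { print "\"" $0 "\"" }' "$@"
}
```

Without the backslashes, awk would treat the first inner slash as the end of the regexp, and the first inner quote as the end of the string.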

Regular Expression Operators

Regexp Operators in awk

The '$' is an anchor and does not match the end of a line (the point right before a newline character) embedded in a string. This is called a bracket expression. It matches any one of the characters that are enclosed in the square brackets.

Some Notes On Interval Expressions

If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. Initially, because old programs may use '{' and '}' in regexp constants, gawk did not match interval expressions in regexps.

Using Bracket Expressions

Use a complemented character list ('[^\x00-\x7F]') to match any single-byte characters that are not in the ASCII range. This applies to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character.

How Much Text Matches?

WARNING: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes. Inside a bracket expression, an opening bracket ('[') that does not start a character class, collating element, or equivalence class is taken literally.

Using Dynamic Regexps

The first backslash escapes the second one, so that the string actually contains two characters: '\' and the character that follows it. When a string constant is used, awk must first convert the string into this internal form and then perform the pattern matching.

Because '^' and '$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities for awk. The current method of using '\y' for the GNU '\b' appears to be the lesser of two evils.

Case Sensitivity in Matching

Prior to version 5.0, single-byte characters were tested based on the ISO-8859-1 (ISO Latin-1) character set. However, as of version 5.0, single-byte characters are also tested based on the values of the locale's character set.

Summary

IGNORECASE can be set on the command line or in a BEGIN rule (see Section 2.3 [Other Command Line Arguments], page 38; see also Section 7.1.4.1 [Startup and Cleanup Actions], page 144). In multibyte locales, case equivalence is tested against the wide character value of the locale character set.

How Input Is Split into Records

Record Splitting with Standard awk

The name of the current input file can be found in the predefined variable FILENAME (see Section 7.5 [Predefined Variables], page 157). The value of the built-in variable NF is the number of fields in the current record.

Record Splitting with gawk

Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in RS. When the end of the record is determined, gawk sets the variable RT to the text in the input that matched RS.

Examining Fields

This works for certain special files, such as /proc/environ on GNU/Linux systems, where the null character is in fact the record separator. This example prints every record in the file mail-list whose first field contains the string.

Nonconstant Field Numbers

If you try to reference a field beyond the last one (such as $8 when the record has only seven fields), you get the empty string. (If used in a numeric operation, you get zero.) As mentioned in Section 4.2 [Examining Fields], page 65, awk stores the current record's number of fields in the built-in variable NF (see also Section 7.5 [Predefined Variables], page 157).

Changing the Contents of a Field

For example, the NF value is set to the number of the highest field you create. Note, however, that simply referencing an out-of-range field does not change the value of $0 or NF.
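
The difference between referencing and assigning an out-of-range field can be sketched as follows. The function name `field_demo` and the sample record are assumptions.

```shell
# Hypothetical demo: reading a field past NF versus assigning to one.
field_demo() {
  echo 'a b c' | awk '{
    x = $8                 # reading an out-of-range field yields ""
    print NF, (x == "")    # NF is still 3; the comparison is true (1)
    $5 = "e"               # assigning to $5 creates fields 4 and 5
    print NF               # NF is now 5
    print $0               # the record is rebuilt; field 4 is empty
  }'
}
```

The last line of output contains two spaces between 'c' and 'e', because the empty fourth field still gets a separator on each side.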

Specifying How Fields Are Separated

  • Whitespace Normally Separates Fields
  • Using Regular Expressions to Separate Fields
  • Making Each Character a Separate Field
  • Setting FS from the Command Line
  • Making the Full Line Be a Single Field
  • Field-Splitting Summary

But when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. The '-' in the person's name was used as the field separator, instead of the '-' in the phone number that was originally intended.

Reading Fixed-Width Data

  • Processing Fixed-Width Data
  • Skipping Intervening Fields
  • Capturing Optional Trailing Data
  • Field Values With Fixed-Width Data

As of version 4.2, gawk allows you to add a final '*' character to the value of FIELDWIDTHS to provide a way to say "everything else in the record after the defined fields". The idea is that while there aren't as many characters as expected, there are still some, so the data has to be made available to the program.

Defining Fields by Content

More on CSV Files

Checking How gawk Is Splitting Records

Multiple-Line Records

If not in compatibility mode (see Section 2.2 [Command Line Options], page 31), gawk sets RT to the input text that matched the value specified by RS. But if the input file ended without any text that matches RS, then gawk sets RT to the null string.

Explicit Input with getline

  • Using getline with No Arguments
  • Using getline into a Variable
  • Using getline from a File
  • Using getline into a Variable from a File
  • Using getline from a Pipe
  • Using getline into a Variable from a Pipe
  • Using getline from a Coprocess
  • Using getline into a Variable from a Coprocess
  • Points to Remember About getline
  • Summary of getline Variants

When you use 'command | getline var', the output of command is sent through a pipe to getline and into the variable var. When you use 'command |& getline var', the output from the coprocess command is sent through a two-way pipe to getline and into the variable var.
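
The pipe form can be sketched with standard awk features only. The function name `first_line_of` and the variable names `cmd` and `line` are hypothetical.

```shell
# Hypothetical helper: run the command given in "$1" and read the first
# line of its output into an awk variable with "command | getline var".
first_line_of() {
  awk -v cmd="$1" 'BEGIN {
    if ((cmd | getline line) > 0)   # getline returns 1 when it reads a record
      print "got: " line
    close(cmd)                      # done with the pipe
  }'
}
```

For example, `first_line_of 'echo hello'` prints 'got: hello'. The parentheses around the getline expression avoid any ambiguity with the '>' comparison.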

Reading Input with a Timeout

The duration can be set or changed at any time and will take effect on the next attempt to read from the input device. Because of this, changing the timeout value, as in the previous example, is not very useful.

Retrying Reads After Certain Input Errors

If the PROCINFO element is not present and the GAWK_READ_TIMEOUT environment variable exists, gawk uses its value to initialize the timeout value. A network client can sometimes take a long time to establish a connection before it can start reading any data, or an attempt to open a specific FIFO file for reading can block indefinitely until another process opens it for writing.

Directories on the Command Line

Summary

Use getline in its various forms to read additional records from the default input stream, from a file, or from a pipe or coprocess. Directories on the command line are fatal for standard awk; gawk ignores them if not in POSIX mode.

Exercises

However, with two exceptions, you cannot specify how to print it: how many columns, whether to use exponential notation or not, and so on. (For the exceptions, see Section 5.3 [Output Separators], page 95, and Section 5.4 [Controlling Numeric Output with print], page 96.)

The print Statement

One of the most common programming actions is to print or output some or all of the input data.

A common mistake in using the print statement is to omit the comma between two items. This is why the printf statement was created (see Section 5.5 [Using printf Statements for Fancier Printing], page 96); one of its specialties is lining up columns of data.
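
The effect of the missing comma is easy to see side by side. This is a small sketch; the function name `print_comma_demo` and the sample record are assumptions.

```shell
# Hypothetical demo: the same two fields printed with and without a comma.
print_comma_demo() {
  echo 'Amelia 555-5553' | awk '{
    print $1, $2   # comma: items are separated by OFS (a space by default)
    print $1 $2    # no comma: the two values are concatenated
  }'
}
```

The first print produces 'Amelia 555-5553'; the second runs the two values together as 'Amelia555-5553'.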

Output Separators

Controlling Numeric Output with print

Using printf Statements for Fancier Printing

  • Introduction to the printf Statement
  • Format-Control Letters
  • Modifiers for printf Formats
  • Examples Using printf

(The '4.3' represents two modifiers, discussed in the next subsection.) '%E' uses 'E' instead of 'e' in the output. Instead of supplying explicit width and/or precision values in the format string, they are passed in the argument list.

Redirecting Output of print and printf

The difference between this and the single '>' redirection is that the old contents (if any) of the output file are not erased. See Section 10.2.9 [Quoting Strings to Pass to the Shell], page 244, for a function that can help generate command lines to be passed to the shell.

Special Files for Standard Preopened Data Streams

It is a common mistake to use '>' redirection for the first output to a file and then to use '>>' for all subsequent output. In this kind of case, a program should use '>' for all the print statements, because the output file is only opened once.
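
A minimal sketch of this behavior: every print uses '>', yet all three lines end up in the file, because awk opens the file once and keeps it open. The function name `write_all` and the variable `out` are hypothetical.

```shell
# Hypothetical helper: write three lines to the file named by "$1".
# The same ">" redirection runs once per record, but the file is
# truncated only when it is first opened, not on every print.
write_all() {
  printf 'one\ntwo\nthree\n' | awk -v out="$1" '{ print $0 > out }'
}
```

In the shell, by contrast, each `command > file` reopens and truncates the file, which is where the habit of switching to '>>' comes from.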

Special File names in gawk

Accessing Other Open Files with gawk

The filenames /dev/stdin, /dev/stdout, and /dev/stderr are essentially aliases for /dev/fd/0, /dev/fd/1, and /dev/fd/2, respectively. Note that using close() on a filename of the form "/dev/fd/N", for file descriptor numbers greater than two, actually closes the given file descriptor.

Special Files for Network Communications

Unless special pains are taken in the shell from which gawk is invoked, only descriptors 0, 1, and 2 are available.

Special File name Caveats

Closing Input and Output Redirections

Its value must exactly match the string that was used to open the file or start the command (spaces and other "unimportant" characters are included). The return value is -1 if the argument names something that was never opened with a redirect, or if there is a system problem closing the file or process.
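
One case where close() matters is writing a file and then reading it back in the same program. The sketch below assumes a hypothetical function name `write_then_read`; the argument passed to close() is the same string that was used in the redirection.

```shell
# Hypothetical helper: write one line to the file named by "$1",
# close it, then read the line back with getline.
write_then_read() {
  awk -v f="$1" 'BEGIN {
    print "hello" > f
    close(f)                      # same string that opened the file
    if ((getline line < f) > 0)   # reopen for reading from the start
      print line
    close(f)
  }'
}
```

Without the first close(), the output might still be buffered and the file would still be open for writing when getline tries to read it.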

Enabling Nonfatal Output

In gawk, starting with version 4.2, when closing a pipe or coprocess (input or output), the return value is the exit status of the command, as described in Table 5.1. Otherwise, it is the return value from the system's close() or fclose() C functions when closing input or output files, respectively. In general, different implementations vary in what they report when closing pipes; thus, the return value cannot be used portably.

Summary

When trying to open a TCP/IP socket (see Section 12.4 [Using gawk for Network Programming], page 327), gawk tries multiple times. The GAWK_SOCK_RETRIES environment variable (see Section 2.5.3 [Other Environment Variables], page 41) allows you to override gawk's builtin default number of attempts.

Exercises

An expression evaluates to a value that you can print, test, or pass to a function. In addition, an expression can assign a new value to a variable or a field by using an assignment operator.

Constants, Variables, and Conversions

  • Constant Expressions
    • Numeric and String Constants
    • Octal and Hexadecimal Numbers
    • Regular Expression Constants
  • Using Regular Expression Constants
    • Standard Regular Expression Constants
    • Strongly Typed Regexp Constants
  • Variables
    • Using Variables in a Program
    • Assigning Variables on the Command Line
  • Conversion of Strings and Numbers
    • How awk Converts Between Strings and Numbers
    • Locales Can Influence Conversion

Most of the time, you'll want to use octal or hexadecimal constants when working with built-in bit manipulation functions; see Section 9.1.6 [Bit Manipulation Functions], page 210, for more information. When used in string conversions, they are converted to the string value of the original regexp text.

Operators: Doing Something with Values

  • Arithmetic Operators
  • String Concatenation
  • Assignment Expressions
  • Increment and Decrement Operators

It is the simplest assignment operator because the value of the righthand operand is stored unchanged. Pre-incrementing a variable adds one to its value; that new value is also the value of the expression.

Truth Values and Conditions

  • True and False in awk
  • Variable Typing and Comparison Expressions
    • String Type versus Numeric Type
    • Comparison Operators
    • String Comparison Based on Locale Collating Order
  • Boolean Expressions
  • Conditional Expressions

Evaluation of the full expression is "short-circuited" if the result can be determined partway through its evaluation. When a line is seen whose first field is 'START', the value of interesting is toggled to true, using '!'.
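
A flag toggled with '!' can be sketched as follows. This is an illustration under stated assumptions, not necessarily the manual's exact program: the function name `print_between_markers`, the variable name `interesting`, and the 'END' marker are chosen for the example.

```shell
# Hypothetical helper: print the lines that fall between a line whose
# first field is START and a line whose first field is END.
# "interesting" starts out uninitialized, which counts as false.
print_between_markers() {
  awk '
    $1 == "START" { interesting = ! interesting; next }
    $1 == "END"   { interesting = ! interesting; next }
    interesting   { print }
  ' "$@"
}
```

The next statements keep the marker lines themselves out of the output.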

Table 6.3: Relational operators

Function Calls

Operator Precedence (How Operators Nest)

A redirection does not produce an expression that could be the operand of another operator. As a result, it does not make sense to use a redirection operator near another operator of lower precedence without parentheses.

Where You Are Makes a Difference

The locale setting once affected regexp matching, but this is no longer true (see Section A.8 [Regexp Ranges and Locales: A Long Sad Story], page 459). Finally, the locale affects the value of the decimal point character used when gawk parses input data.

Summary

Expressions used to pass parameter values are fully evaluated before the function is called. Locale can affect the format of data output by an awk program, and sometimes the format of data read as input.

Pattern Elements

  • Regular Expressions as Patterns
  • Expressions as Patterns
  • Specifying Record Ranges with Patterns
  • The BEGIN and END Special Patterns
    • Startup and Cleanup Actions
    • Input/Output from BEGIN and END Rules
  • The BEGINFILE and ENDFILE Special Patterns
  • The Empty Pattern

As long as the range pattern stays turned on, it automatically matches every input record read. For example, suppose there is text between two identical markers (e.g., the '%' symbol), each on its own line, that should be ignored.
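
A basic range pattern, with distinct begin and end markers, might look like the sketch below. The function name `print_on_to_off` and the marker words 'on' and 'off' are assumptions for illustration.

```shell
# Hypothetical helper: print every record from one whose first field is
# "on" through the next one whose first field is "off", inclusive.
print_on_to_off() {
  awk '$1 == "on", $1 == "off"' "$@"
}
```

When the two markers are identical, a plain range pattern does not help: the record that turns the range on also satisfies the end pattern, so the range matches only that single record. A toggled flag is the usual alternative in that case.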

Using Shell Variables in Programs

In this case, your program can execute the nextfile statement (see Section 7.4.9 [The nextfile Statement], page 155). The next statement (see Section 7.4.8 [The next Statement], page 154) is not allowed inside either a BEGINFILE or an ENDFILE rule.

Actions

A better method is to use awk's variable assignment feature (see Section 6.1.3.2 [Assigning Variables on the Command Line], page 118) to assign the shell variable's value to an awk variable. Also provided in awk are the next statement (see Section 7.4.8 [The next Statement], page 154) and the nextfile statement (see Section 7.4.9 [The nextfile Statement], page 155).
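
Passing a shell variable through the variable assignment feature can be sketched with the -v option. The names `count_matches`, `pattern`, `pat`, and `nmatches` are hypothetical.

```shell
# Hypothetical helper: count the input lines that match the regexp in "$1".
# The shell value is handed to awk as the variable "pat", so the awk
# program itself can stay entirely inside single quotes.
count_matches() {
  pattern=$1
  shift
  awk -v pat="$pattern" '
    $0 ~ pat { nmatches++ }
    END      { print nmatches + 0, "matches" }
  ' "$@"
}
```

Adding zero to `nmatches` makes the count print as 0 rather than as an empty string when nothing matches. Keep in mind that values passed this way are processed for escape sequences.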

Control Statements in Actions

  • The if-else Statement
  • The while Statement
  • The do-while Statement
  • The for Statement
  • The switch Statement
  • The break Statement
  • The continue Statement
  • The next Statement
  • The nextfile Statement
  • The exit Statement

That also causes the next record to be read immediately, but it doesn't change the control flow in any way (i.e. the rest of the current action is performed with a new input record). It jumps to the end of the body of this implicit loop and executes the increment (which reads another record).

Predefined Variables

Built-in Variables That Control awk

LINT # When this variable is true (nonzero or non-null), gawk behaves as if the --lint command-line option is in effect (see Section 2.2 [Command Line Options], page 31). It works by being passed as the first argument to the sprintf() function (see Section 9.1.3 [String-Manipulation Functions], page 189).

Built-in Variables That Convey Information

If it is a regular expression, the records are separated by matches of the regular expression in the input text. A subarray indexed by the names of all identifiers used in the text of the awk program.

Using ARGC and ARGV

If the value of ARGC is decreased, it eliminates input files from the end of the list. To remove a file from the middle of the list, store the null string in ARGV instead of the file name.
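
Removing a file from the middle of the list can be sketched as below. The function name `skip_second_file` is hypothetical, and the example assumes that at least two file names are passed.

```shell
# Hypothetical helper: print the contents of all files given as
# arguments, except the second one. ARGV[0] is the program name ("awk"),
# so the second file name sits in ARGV[2].
skip_second_file() {
  awk 'BEGIN { ARGV[2] = "" } { print }' "$@"
}
```

Because the BEGIN rule runs before any input is read, the empty element is already in place when awk walks the argument list, and the file is skipped.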

Summary

A program can process ARGV in order to examine, and then remove, its own command-line options. When designing your program, you should choose options that don't conflict with gawk's, because it will process any options that it accepts before passing the rest of the command line on to your program.

The Basics of Arrays

  • Introduction to Arrays
  • Referring to an Array Element
  • Assigning Array Elements
  • Basic Array Example
  • Scanning All Elements of an Array
  • Using Predefined Array Scanning Orders with gawk

A reference to an array element that has no recorded value returns a value of the null string. Second, it is valid (if a bit unusual) to set an array element equal to the empty string.
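
These points can be checked with the in operator, which tests whether an index exists without creating it. The sketch below uses a hypothetical function name, `array_demo`, and hypothetical variable names.

```shell
# Hypothetical demo: existence of array elements versus their values.
array_demo() {
  awk 'BEGIN {
    a["x"] = ""                   # valid: the element exists, value is ""
    has_x = ("x" in a)            # 1: the element exists
    has_y = ("y" in a)            # 0: the "in" test creates nothing
    print has_x, has_y
    if (a["y"] == "")             # a plain reference yields "" ...
      has_y = ("y" in a)          # ... and also creates the element
    print has_y
  }'
}
```

The output is '1 0' followed by '1': merely referencing `a["y"]` brings the element into existence, which is why the in operator is the safer way to test for membership.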

Figure 8.1: A contiguous array

Using Numbers to Subscript Arrays

You can also get the default behavior by assigning a null string to PROCINFO["sorted_in"] or by simply deleting the "sorted_in" element from the PROCINFO array with the delete statement. Octal and hexadecimal constants (see Section 6.1.1.2 [Octal and Hexadecimal Numbers], page 114) are internally converted to numbers, and their original form is forgotten.

Using Uninitialized Variables as Subscripts

As with many things in awk, most of the time it works as you would expect. But it is useful to have a precise knowledge of the actual rules, as they can sometimes have a subtle effect on your programs.

The delete Statement
