Subcomponent Design: Data Preparer - PDF Presented by: University

As discussed in Section 4.2, the Data Preparer prepares the incoming raw log files for further processing and analysis and performs two functions: pre-parsing and data transforming.

This section describes this processing stage in more detail, together with the design and implementation of the subcomponent.

The Pre-Parser function parses each log message according to the log message format to extract the metadata and content of each log message. To more clearly illustrate the pre-parsing function, an example excerpt from a log file is shown in Figure 4.2.

2020-07-16 10:20:06,151 - INFO - array_2.wide.bc8n856m4k_XEngOps fxcorrelator_xengops.py:638 - X-engine hardware auto rst/resync mechanism enabled.

2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server server.py:567 - Reading loop for client 127.0.0.1:48278 completed

2020-07-16 10:20:06,267 - WARNING - array_2.wide.bc8n856m4k.katcp.server server.py:576 - Stream for client 127.0.0.1:48278 closed with error error(32, 'Broken pipe')

Figure 4.2: Example excerpt from a log file highlighting the log message preamble and the log message content

Figure 4.2 shows three log messages from the MeerKAT CBF system and clearly distinguishes the metadata from the actual log message. In this dissertation, the collection of log message metadata will be referred to as the log message preamble.

With sufficient knowledge about the design and operation of the MeerKAT CBF, the log message preamble can be further dissected to identify all the individual metadata components it comprises. This decomposition is shown in Figure 4.3 below.

2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server server.py:567 - Reading loop for client 127.0.0.1:48278 completed

Figure 4.3: Decomposition of the MeerKAT CBF log message preamble

It can be seen that the log message preamble consists of the following metadata components: Date, Time, Level, Instrument, Package and Line Number. The actual log message component is simply referred to as Content. The goal of the Pre-Parser, then, is to identify and extract the log message preamble and its components, and to identify and extract the log message content for further analysis.

The Pre-Parser is adapted from the LogParser toolkit of Zhu et al. [40][35][7]. The implementation requires the format of the entire log message, with the preamble decomposed into its constituents, to be provided as a text string. From this format, the Pre-Parser generates a regular expression for matching and extracting the components of the log message. It also extracts the names of the components, which are used as column headings in the final structured dataset created by the Data Transformer. An example of the input and outputs of the Pre-Parser is shown in Figure 4.4.

Input: string representing log message format

“<Data> <Time> - <Level> - <Instrument> <Component> - <Content>”

Pre-Parser

Output: regex for extracting the various components of the log message

r'^(?P<Data>.*?)\s+(?P<Time>.*?)\s+-\s+(?P<Level>.*?)\s+(?P<Instrument>.*?)\s+(?P<Component>.*?)\s+-\s+(?P<Content>.*?)$'

Output: list of headers for naming columns in structured dataset

['Data', 'Time', 'Level', 'Instrument', 'Component', 'Content']

Figure 4.4: Input to and outputs of the Pre-Parser subcomponent

As shown in Figure 4.4, the components of the log message are specified using enclosing <, > characters and the delimiters separating the components are explicitly indicated. The Pre-Parser uses these delimiters, including white space characters, as the basis for generating the regular expression that extracts the components from the log message. The grouping feature of regular expressions is used to group each component individually so that it is easily accessible later in the processing pipeline.
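The benefit of grouping can be illustrated with a minimal sketch. The shortened two-component pattern and the variable names below are hypothetical and chosen only for illustration; the named-group syntax follows the regular expression shown in Figure 4.4.

```python
import re

# Hypothetical, shortened pattern with two named groups (illustration only).
pattern = re.compile(r"^(?P<Level>.*?)\s+-\s+(?P<Content>.*?)$")

message = "INFO - Reading loop for client 127.0.0.1:48278 completed"
match = pattern.match(message)

# Each component can be retrieved by name rather than by position.
level = match.group("Level")

# groupdict() returns all named components at once, keyed by component name.
fields = match.groupdict()

print(level)
print(fields)
```

Because each group carries the name of its component, later stages can refer to a component by its heading instead of by its position in the message.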

The Pre-Parser is implemented as a Python function whose only input is a string representing the log message format, from which it generates a regular expression for pattern matching. The function uses the built-in Python re module [76] to compile the regular expression and to split the input string on the occurrence of enclosing <, > characters. The functionality of the Pre-Parser is illustrated through a UML Activity Diagram shown in Figure 4.5.

The Pre-Parser initially creates an empty string for storing the regular expression pattern to be used to extract the various components of the log message. It also initialises an empty list for storing the names of the components of the log message. The provided input string is split on the occurrence of enclosing <, > characters. This is achieved using the re.split method and results in a list of substrings containing the named components and delimiters. Each of the resulting substrings is then processed in turn until none remain.
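The splitting step can be sketched as follows. The format string is the one shown in Figure 4.4; the splitting pattern is an assumption, modelled on a capturing group so that the <…> tokens are kept in the result rather than discarded.

```python
import re

# Log format string as given in Figure 4.4.
log_format = "<Data> <Time> - <Level> - <Instrument> <Component> - <Content>"

# Assumed splitting pattern: the capturing group makes re.split keep each
# <...> token, so the list alternates between delimiters and components.
substrings = re.split(r"(<[^<>]+>)", log_format)

for substring in substrings:
    print(repr(substring))
```

Since the format string begins and ends with a component, the first and last substrings are empty delimiters; they contribute nothing to the pattern when appended.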

If a given substring contains a component name, the name is extracted from the substring by removing the enclosing <, > characters. The name is then appended to the list of component names. To pattern match the particular component in a given log message, which may be of varying length and contain an arbitrary collection of characters, (.*) is appended to the log format regular expression. The sequence (.*) will match 0 or more of any character.

If a given substring does not contain a component name, it is regarded as a delimiter. The regular expression generated to extract the components of a log message relies on the delimiters between components to ensure correct decomposition. If a delimiter substring contains white space, the white space is replaced by the regular expression syntax for white space before the substring is appended to the log format regular expression. If the delimiter substring does not contain white space, the substring is simply appended to the log format regular expression.

[Activity diagram. Input to function: string representing log message format. Output from function: list of headings of log message components, and regular expression for extracting components from log message.]

Figure 4.5: UML Activity Diagram illustrating the functionality of the Pre-Parser process

Once all substrings have been processed, the log format regular expression is compiled and is returned alongside the list of component names. The resultant regular expression is able to match and extract the various components, including the message content, of a given log message corresponding to the provided log message format.
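The steps described above can be sketched as a single function. This is a minimal sketch and not the implementation used in this work: the function name build_log_format_regex is hypothetical, the named lazy groups follow the output shown in Figure 4.4, and delimiters are assumed to contain no regular expression metacharacters since they are appended without escaping.

```python
import re


def build_log_format_regex(log_format):
    """Return (headers, compiled regex) for a log format string (sketch)."""
    headers = []   # names of the log message components
    pattern = ""   # regular expression built up piece by piece

    # Split on <...> tokens; the capturing group keeps them in the list.
    for substring in re.split(r"(<[^<>]+>)", log_format):
        if re.fullmatch(r"<[^<>]+>", substring):
            # Component: strip the enclosing <, > and add a named group.
            name = substring[1:-1]
            headers.append(name)
            pattern += "(?P<%s>.*?)" % name
        elif re.search(r"\s", substring):
            # Delimiter with white space: replace each run with \s+.
            pattern += re.sub(r"\s+", lambda m: r"\s+", substring)
        else:
            # Delimiter without white space: append unchanged (unescaped).
            pattern += substring

    # Anchor the pattern so that it must match the whole log message.
    return headers, re.compile("^" + pattern + "$")


log_format = "<Data> <Time> - <Level> - <Instrument> <Component> - <Content>"
headers, regex = build_log_format_regex(log_format)

line = ("2020-07-16 10:20:06,267 - INFO - "
        "array_2.wide.bc8n856m4k.katcp.server server.py:567 - "
        "Reading loop for client 127.0.0.1:48278 completed")
match = regex.match(line)
print(headers)
print(match.groupdict())
```

The lazy quantifier in each group stops a component at the first delimiter that allows the rest of the pattern to match, which is why the delimiters must be specified exactly in the format string.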

The Pre-Parser stage does not use, nor does it require, the actual log files. Instead, it generates information used by the Data Transformer.

The Data Transformer uses the regular expression generated by the Pre-Parser to extract the various components of the log message preamble and the log message content. Using the extracted preamble headers, the Data Transformer processes the raw log messages into a structured, matrix-like dataset in which each row is a unique log message and each column is a component of the log message. This structured version of the raw log files allows for data manipulation and processing by later stages in the pipeline.
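A minimal sketch of this transformation is given below, using the three log messages from Figure 4.2. The function name transform, the hand-written regular expression, the use of plain Python lists for the dataset, and the decision to skip lines that do not match are all assumptions made for illustration; the component names follow Figure 4.4.

```python
import re

# Assumed regex and headers, written by hand in the style of Figure 4.4.
HEADERS = ["Data", "Time", "Level", "Instrument", "Component", "Content"]
LOG_REGEX = re.compile(
    r"^(?P<Data>.*?)\s+(?P<Time>.*?)\s+-\s+(?P<Level>.*?)\s+-\s+"
    r"(?P<Instrument>.*?)\s+(?P<Component>.*?)\s+-\s+(?P<Content>.*?)$"
)


def transform(lines, regex, headers):
    """Turn raw log lines into rows with one column per component (sketch)."""
    rows = []
    for line in lines:
        match = regex.match(line.strip())
        if match is None:
            continue  # assumption: blank or non-conforming lines are skipped
        rows.append([match.group(name) for name in headers])
    return rows


raw_lines = [
    "2020-07-16 10:20:06,151 - INFO - array_2.wide.bc8n856m4k_XEngOps "
    "fxcorrelator_xengops.py:638 - X-engine hardware auto rst/resync "
    "mechanism enabled.",
    "",
    "2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server "
    "server.py:567 - Reading loop for client 127.0.0.1:48278 completed",
    "2020-07-16 10:20:06,267 - WARNING - array_2.wide.bc8n856m4k.katcp.server "
    "server.py:576 - Stream for client 127.0.0.1:48278 closed with error "
    "error(32, 'Broken pipe')",
]

rows = transform(raw_lines, LOG_REGEX, HEADERS)
for row in rows:
    print(dict(zip(HEADERS, row)))
```

Each row holds one log message and each position in the row corresponds to one heading, which mirrors the matrix-like layout described above.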
