As discussed in Section 4.2, the Data Preparer prepares the incoming raw log files for further processing and analysis by performing two functions: pre-parsing and data transformation.
This section describes this processing stage in more detail, together with the design and implementation of the subcomponent.
The Pre-Parser function parses each log message according to the log message format to extract the metadata and content of each log message. To illustrate the pre-parsing function, an example excerpt from a log file is shown in Figure 4.2.
2020-07-16 10:20:06,151 - INFO - array_2.wide.bc8n856m4k_XEngOps fxcorrelator_xengops.py:638 - X-engine hardware auto rst/resync mechanism enabled.
2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server server.py:567 - Reading loop for client 127.0.0.1:48278 completed
2020-07-16 10:20:06,267 - WARNING - array_2.wide.bc8n856m4k.katcp.server server.py:576 - Stream for client 127.0.0.1:48278 closed with error error(32, 'Broken pipe')
Figure 4.2: Example excerpt from a log file highlighting the log message preamble and the log message content
Figure 4.2 shows three log messages from the MeerKAT CBF system and clearly distinguishes the metadata from the actual log message. In this dissertation, the collection of log message metadata will be referred to as the log message preamble.
With sufficient knowledge about the design and operation of the MeerKAT CBF, the log message preamble can be further dissected to identify all the individual metadata components it comprises. This decomposition is shown in Figure 4.3.
2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server server.py:567 - Reading loop for client 127.0.0.1:48278 completed
Figure 4.3: Decomposition of the MeerKAT CBF log message preamble (labelled preamble components: Date, Time, Level, Instrument, Package, Line Number; the remainder of the message is labelled Content)
It can be seen that the log message preamble consists of the following metadata components: Date, Time, Level, Instrument, Package and Line Number. The actual log message component is simply referred to as Content. The goal of the Pre-Parser, then, is to identify and extract the log message preamble and its components, as well as the log message content, for further analysis.
The Pre-Parser is adapted from the LogParser toolkit of Zhu et al. [40][35][7]. The Pre-Parser implementation requires the format of the entire log message, with the preamble decomposed into its constituents, to be provided as a text string. From this format, the Pre-Parser generates a regular expression for matching and extracting the components of the log message. It also extracts the names of these components, which are used as headings in the final structured dataset created by the Data Transformer. An example of the input and outputs of the Pre-Parser is shown in Figure 4.4.
Input (string representing the log message format):
“<Data> <Time> - <Level> - <Instrument> <Component> - <Content>”
Output (regex for extracting the various components of the log message):
r'^(?P<Data>.*?)\s+(?P<Time>.*?)\s+-\s+(?P<Level>.*?)\s+-\s+(?P<Instrument>.*?)\s+(?P<Component>.*?)\s+-\s+(?P<Content>.*?)$'
Output (list of headers for naming columns in the structured dataset):
['Data', 'Time', 'Level', 'Instrument', 'Component', 'Content']
Figure 4.4: Input to and outputs of the Pre-Parser subcomponent
As shown in Figure 4.4, the components of the log message are specified using enclosing < and > characters, and the delimiters separating the components are given explicitly. The Pre-Parser uses these delimiters, including white space characters, as the basis for generating the regular expression that extracts the components from the log message. The grouping feature of regular expressions is used to capture each component individually, so that it can be accessed easily later in the processing pipeline.
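To make the role of grouping concrete, the following minimal sketch applies a pattern written by hand for the format string in Figure 4.4 to one of the log messages from Figure 4.2. The variable names are illustrative only, and the pattern is an assumption about what the Pre-Parser would generate for that format rather than a copy of its actual output.

```python
import re

# Hand-written pattern for the format string in Figure 4.4 (an assumption,
# not the Pre-Parser's verified output). Each component is a named group.
pattern = re.compile(
    r"^(?P<Data>.*?)\s+(?P<Time>.*?)\s+-\s+(?P<Level>.*?)\s+-\s+"
    r"(?P<Instrument>.*?)\s+(?P<Component>.*?)\s+-\s+(?P<Content>.*?)$"
)

line = (
    "2020-07-16 10:20:06,267 - INFO - "
    "array_2.wide.bc8n856m4k.katcp.server server.py:567 - "
    "Reading loop for client 127.0.0.1:48278 completed"
)

match = pattern.match(line)

# A single component can be read by name ...
level = match.group("Level")

# ... or all components can be retrieved at once as a dictionary.
fields = match.groupdict()
```

Because the groups are named, later stages can refer to a component by its heading rather than by its position in the log message.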
The Pre-Parser is implemented as a Python function that takes a string representing the log message format and generates a regular expression for pattern matching. The function uses the built-in Python re module [76] both to split the input string on each occurrence of a component enclosed in < and > characters and to compile the resulting regular expression. The functionality of the Pre-Parser is illustrated by the UML Activity Diagram in Figure 4.5.
The Pre-Parser initially creates an empty string for storing the regular expression pattern to be used to extract the various components of the log message. It also initialises an empty list for storing the names of the components of the log message. The provided input string is split on each occurrence of a component enclosed in < and > characters. This is achieved using the re.split method and results in a list of substrings containing the named components and delimiters. Each of the resulting substrings is then processed in turn until none remain.
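The splitting step can be sketched as follows. The exact splitting pattern used by the Pre-Parser is not given in the text, so the pattern below is an assumption: it places the <...> token inside a capturing group, which causes re.split to keep the component tokens in the result alongside the delimiters.

```python
import re

log_format = "<Data> <Time> - <Level> - <Instrument> <Component> - <Content>"

# Assumed splitting pattern: a capturing group around "<...>" makes
# re.split return the matched tokens as well as the text between them.
substrings = re.split(r"(<[^<>]+>)", log_format)

# re.split yields empty strings at the start and end here, because the
# format begins and ends with a component; these are dropped.
substrings = [s for s in substrings if s]

print(substrings)
# ['<Data>', ' ', '<Time>', ' - ', '<Level>', ' - ', '<Instrument>',
#  ' ', '<Component>', ' - ', '<Content>']
```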
If a given substring contains a component name, the name is extracted from the substring by removing the enclosing < and > characters. The name is then appended to the list of component names. To pattern match the particular component in a given log message, which may be of varying length and contain an arbitrary collection of characters, (.*) is appended to the log format regular expression. The sequence (.*) matches zero or more of any character (other than a newline, by default) and captures the result as a group.
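The text describes the appended group as (.*), whereas the regular expression in Figure 4.4 uses the lazy, named form (?P<name>.*?). The difference matters when the message content itself contains a delimiter. The sketch below contrasts the two on a hypothetical two-component log line that is not taken from the MeerKAT CBF logs.

```python
import re

# Hypothetical log line whose content contains the " - " delimiter.
line = "INFO - retry - giving up"

# Greedy groups: the first group takes as much as it can, so the split
# happens at the LAST " - " in the line.
greedy = re.match(r"^(.*)\s+-\s+(.*)$", line)
print(greedy.groups())  # ('INFO - retry', 'giving up')

# Lazy groups: the first group takes as little as it can, so the split
# happens at the FIRST " - " and the content keeps its own delimiter.
lazy = re.match(r"^(.*?)\s+-\s+(.*?)$", line)
print(lazy.groups())    # ('INFO', 'retry - giving up')
```

With lazy groups, each preamble component stops at the first delimiter that follows it, and the final group absorbs the remainder of the line.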
Figure 4.5: UML Activity Diagram illustrating the functionality of the Pre-Parser process. Input to function: string representing the log message format. Flow: create an empty regex string and an empty list for headings; split the log format into substrings on each <…> substring; while unprocessed substrings remain, select the next substring; if the substring contains a component name, extract the name, append it to the list of headers and append (.*) to the log format regex, which matches each component of the log message as a group; otherwise, if the substring contains whitespace, replace the whitespace with regex syntax, and then append the substring to the log format regex, which adds tokens to match the delimiters between the log message components; finally, compile the regular expression. Output from function: list of headings of the log message components and regular expression for extracting components from a log message.
If a given substring does not contain a component name, it is treated as a delimiter. The regular expression generated to extract the components of a log message relies on the delimiters between components to ensure correct decomposition. If a delimiter substring contains white space, the white space is replaced by the regular expression syntax for white space before the substring is appended to the log format regular expression. If the delimiter substring does not contain white space, it is simply appended to the log format regular expression.
Once all substrings have been processed, the log format regular expression is compiled and is returned alongside the list of component names. The resultant regular expression is able to match and extract the various components, including the message content, of a given log message corresponding to the provided log message format.
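The complete procedure can be sketched as a single function. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name build_log_format_regex is hypothetical, the splitting pattern is assumed, and the named lazy groups and the ^ and $ anchors follow the form of the regular expression shown in Figure 4.4. Delimiters are appended without escaping, as described above, so a delimiter containing regex metacharacters would need additional handling.

```python
import re


def build_log_format_regex(log_format):
    """Return (headers, compiled regex) for a log format string.

    Hypothetical helper sketching the Pre-Parser steps described above.
    """
    headers = []   # names of the log message components
    regex = ""     # regular expression built up piece by piece

    # Split on <...> tokens, keeping the tokens themselves in the result.
    for substring in re.split(r"(<[^<>]+>)", log_format):
        if not substring:
            continue  # skip empty strings produced at the ends by re.split
        if substring.startswith("<") and substring.endswith(">"):
            # Component: record its name and add a named, lazy group.
            name = substring[1:-1]
            headers.append(name)
            regex += "(?P<%s>.*?)" % name
        else:
            # Delimiter: replace each run of whitespace with \s+ and
            # append the rest unchanged (no escaping is applied).
            regex += re.sub(r"\s+", lambda m: r"\s+", substring)

    return headers, re.compile("^" + regex + "$")


headers, pattern = build_log_format_regex(
    "<Data> <Time> - <Level> - <Instrument> <Component> - <Content>"
)
print(headers)
print(pattern.pattern)
```

A function is passed to re.sub as the replacement so that the returned text \s+ is inserted literally, without being interpreted as a replacement-string escape.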
The Pre-Parser stage does not use, nor does it require, the actual log files. Instead, it generates information used by the Data Transformer.
The Data Transformer uses the regular expression generated by the Pre-Parser to extract the various components of the log message preamble and the log message content. Using the extracted preamble headers, the Data Transformer processes the raw log messages into a structured, matrix-like dataset in which each row is a unique log message and each column is a component of the log message. This structured version of the raw log files allows for data manipulation and processing by later stages in the pipeline.
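The transformation can be sketched as follows, using the three log messages from Figure 4.2. The sketch makes several assumptions: the function name transform is hypothetical, the pattern and headers are written out by hand for the format in Figure 4.4, a plain list of rows stands in for the structured dataset (the data structure actually used is not stated here), and lines that do not match the format are skipped.

```python
import re

# Headers and pattern as the Pre-Parser would supply them (written out by
# hand here; an assumption based on the format in Figure 4.4).
headers = ["Data", "Time", "Level", "Instrument", "Component", "Content"]
pattern = re.compile(
    r"^(?P<Data>.*?)\s+(?P<Time>.*?)\s+-\s+(?P<Level>.*?)\s+-\s+"
    r"(?P<Instrument>.*?)\s+(?P<Component>.*?)\s+-\s+(?P<Content>.*?)$"
)

raw_lines = [
    "2020-07-16 10:20:06,151 - INFO - array_2.wide.bc8n856m4k_XEngOps "
    "fxcorrelator_xengops.py:638 - X-engine hardware auto rst/resync "
    "mechanism enabled.",
    "2020-07-16 10:20:06,267 - INFO - array_2.wide.bc8n856m4k.katcp.server "
    "server.py:567 - Reading loop for client 127.0.0.1:48278 completed",
    "2020-07-16 10:20:06,267 - WARNING - array_2.wide.bc8n856m4k.katcp.server "
    "server.py:576 - Stream for client 127.0.0.1:48278 closed with error "
    "error(32, 'Broken pipe')",
]


def transform(lines, pattern, headers):
    """Turn raw log lines into rows with one column per header."""
    rows = []
    for line in lines:
        match = pattern.match(line.strip())
        if match is None:
            continue  # assumption: lines not fitting the format are skipped
        rows.append([match.group(name) for name in headers])
    return rows


rows = transform(raw_lines, pattern, headers)
for row in rows:
    print(dict(zip(headers, row)))
```

Each row holds one log message and each position in the row corresponds to one header, which mirrors the row and column layout described above.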