
Thread divergence free and space efficient GPU implementation of NFA AC


Academic year: 2023


Full text

A thread divergence free GPU implementation of the DFA AC algorithm is available, but no GPU implementation of the NFA AC algorithm exists. We have developed a thread divergence free implementation of the NFA AC algorithm, together with a space-efficient representation of the NFA AC automaton. The space requirement of our NFA AC algorithm is log(N) times smaller than that of the DFA AC implementation, where N is the number of nodes in the automaton.

Our NFA AC implementation can store up to 2K nodes in 64 KB of shared memory on the GPU, and this can be very fast compared to the DFA AC implementation.

Multipattern String Matching Problem

Snort uses the AC algorithm because its runtime is independent of the number of patterns in the dictionary and is linear in the length of the target string. For word-processing and security applications the AC algorithm is the most commonly used algorithm, because only it performs multipattern matching in time linearly proportional to the length of the input document. A parallel version of the Aho-Corasick (AC) algorithm using deterministic finite automata has also been described for the IBM Cell Broadband Engine (CBE).

[13] developed GPU adaptations of the Aho-Corasick pattern matching algorithm for the two cases GPU-to-GPU and host-to-host.

Document Retrieval Problem

Algorithm development for the GPU can be classified based on the location of the input data and the location of the result/output. The AC algorithm first builds a multipattern matching machine/automaton for a dictionary of patterns, and then uses this precomputed automaton to perform multipattern string matching in time linearly proportional to the length of the input data or document. In [14], a linear-space index for the text reports all occurrences of a pattern P of length |P| in O(|P| + occ) time, where occ is the number of occurrences of P in the text; the runtime is independent of the length of the text.

The main point here is that there may be many occurrences of the pattern in collection D, but the total number of documents in which the pattern appears may be much less than the number of documents in collection D.

Overview of the Thesis Work

The document retrieval problem is to retrieve all documents di in which the query pattern P occurs. A method that first finds all occurrences and then reports the unique documents cannot be efficient, because its O(|P| + count) running time depends on count, the total number of occurrences of P across all documents. Muthukrishnan gave an optimal O(n)-space data structure that answers a document retrieval query in O(|P| + occ) time, where occ is the number of documents containing the pattern P.

Thesis Overview

These two algorithms are the Boyer-Moore pattern search algorithm and the Aho-Corasick (AC) multipattern search algorithm. Aho-Corasick (AC) multipattern search algorithm: the AC algorithm uses a pattern matching machine for multipattern search in a document. Pattern matcher: a pattern matcher is a program that takes a text string as input and returns all patterns of the dictionary P that occur in it as substrings.

A pattern matching machine is a set of states, and each state has a numeric state ID. The pattern matching machine uses three functions to process any node: the goto function (g), the failure function (f), and the output function (out). The AC algorithm defines two versions of the machine, depending on how the pattern matching machine is constructed for the dictionary of input patterns.
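The construction of these three functions can be sketched as follows (a host-side Python sketch of the standard AC construction, not the thesis's GPU representation; states are numbered consecutively from the root):

```python
from collections import deque

def build_ac_machine(patterns):
    """Build the AC pattern matching machine: goto (g), failure (f)
    and output (out) functions, with state 0 as the start state."""
    g = [{}]          # g[s][c] -> next state (goto trie)
    out = [set()]     # out[s] -> patterns recognized on reaching state s
    for p in patterns:
        s = 0
        for c in p:
            if c not in g[s]:
                g.append({})
                out.append(set())
                g[s][c] = len(g) - 1
            s = g[s][c]
        out[s].add(p)
    f = [0] * len(g)  # failure function, computed level by level (BFS)
    q = deque(g[0].values())
    while q:
        s = q.popleft()
        for c, t in g[s].items():
            q.append(t)
            r = f[s]
            while r and c not in g[r]:   # follow failure links upward
                r = f[r]
            f[t] = g[r].get(c, 0)
            out[t] |= out[f[t]]          # inherit outputs of the fail state
    return g, f, out
```

Because every failure link points to a state of strictly smaller depth, the level-by-level (BFS) order guarantees f is already known for all shallower states when it is needed.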

The multipattern search begins by setting the start state of the matching machine/automaton as the current state and the first character of the text string as the current character. After executing a state transition, the machine lists the patterns associated with the reached/next state as output, along with the position of the current character in the string T. The number of state transitions made by the DFA while performing a multipattern search in a string of length n is n.

When the NFA version is used, the number of state transitions made while performing a multipattern search is at most 2n. Input: a text string T = t1, t2, ..., tn where every ti is an input character, and a pattern matching machine M with the goto function g, the failure function f and the output function out, as described above.
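The NFA search loop described above can be sketched as follows (a sequential Python sketch, not the GPU kernel; `g`, `f`, `out` are the three functions defined earlier):

```python
def ac_search(text, g, f, out):
    """NFA AC multipattern search: on a mismatch, follow failure links
    until a goto transition exists (state 0 accepts every character).
    Returns (position, pattern) pairs; for a text of length n at most
    2n transitions are made, since each failure step undoes at most
    one earlier goto step."""
    matches = []
    s = 0
    for i, c in enumerate(text):
        while s and c not in g[s]:   # failure transitions
            s = f[s]
        s = g[s].get(c, 0)           # goto transition (or stay at root)
        for p in out[s]:
            matches.append((i, p))
    return matches
```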

Figure 2.1: AC matching machine

Document Retrieval Problem

This can be answered using a suffix tree in O(m) time and O(n2) space, using KMP (the Knuth-Morris-Pratt string search algorithm) in O(n) time and O(m) space, and using a suffix array in O(m log n) time and O(n) space. The occurrence-listing query can be answered by a suffix tree in O(m + occ) time, where occ is the number of occurrences of the pattern P in the text T, and O(n2) space using longest-common-prefix information. A suffix tree built for multiple string documents is called a generalized suffix tree.
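The O(m log n) suffix-array bound comes from two binary searches over the sorted suffixes, each comparing at most m characters per step. A minimal sketch (with a naive builder for illustration; a real index would use an O(n log n) construction):

```python
def suffix_array(text):
    """Naive suffix array construction; fine for a sketch."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text, sa, p):
    """Two binary searches, O(m log n): return [start, end) — the
    contiguous range of suffixes that have p as a prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:                     # leftmost suffix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                     # leftmost suffix whose prefix > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo
```

The width of the returned range is occ, and `sa[start:end]` lists the occurrence positions.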

In the generalized suffix tree, each leaf has dummy child leaves, each associated with a document in which the corresponding suffix occurs. The locus μp of a pattern p is the node of the generalized suffix tree such that σμp has p as a prefix and |σμp| is the smallest among all such nodes having p as a prefix. The generalized suffix tree of a document library can be built in O(n) time and space, as shown by Weiner and others.

Suppose lca(u, v) denotes the lowest common ancestor of two nodes u and v in the suffix tree T. Muthukrishnan's preprocessing takes O(n) time, and after preprocessing, document listing queries can be answered in O(m + occ) time, where occ is the number of documents in which p is present. Label the leaves of the suffix tree l1, ..., ln in the order in which they appear in a traversal of the suffix tree.
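The core of Muthukrishnan's chaining idea can be sketched as follows: each leaf stores the index of the previous leaf (in traversal order) belonging to the same document, and a leaf in the query range [lo, hi] is reported exactly when that predecessor falls before lo. This sketch scans the range linearly; Muthukrishnan reaches O(occ) per query by replacing the scan with range-minimum queries over the prev array.

```python
def build_prev(doc_of_leaf):
    """prev[i] = index of the previous leaf of the same document
    as leaf i (in traversal order), or -1 if there is none."""
    last, prev = {}, []
    for i, d in enumerate(doc_of_leaf):
        prev.append(last.get(d, -1))
        last[d] = i
    return prev

def list_documents(doc_of_leaf, prev, lo, hi):
    """Report every distinct document with a leaf in [lo, hi] once:
    leaf i is its document's first leaf in the range iff prev[i] < lo."""
    return [doc_of_leaf[i] for i in range(lo, hi + 1) if prev[i] < lo]
```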

The GPU is well suited for compute-intensive tasks because it can execute and handle a large number of threads in parallel. Let us now see how the GPU runs a large number of threads in parallel. Different models of data-parallel execution are available.

Figure 2.2 describes the suffix tree and suffix array for the above-mentioned example.

CUDA Programming Model

Compiler Model

Brief overview of CPU-GPU executable file generation and CUDA program compilation is given in Figure 3.5.

Execution Model

CUDA Memory Architecture

The number of state transitions made by the DFA while performing a multipattern search in a string of length n is n. The NFA version of the matching machine uses less memory than the DFA version. In the NFA version, the states of the matching automaton have few success transitions and are therefore more compressible than DFA states.

Sblock is the number of input characters for which a thread block computes output; B, the number of blocks, equals n/Sblock. Sthread is the number of input characters for which a thread computes output, equal to Sblock/T, where T is the number of threads per block. TW is the total work, meaning the effective string length processed by the GPU. We now discuss the computation strategy proposed by Zha and Sahni.

Thus, overall, a block processes an array of length Sblock + maxL − 1 and produces Sblock elements of the output, and the total number of blocks is B = n/Sblock. Each thread then computes Sthread = Sblock/T of the output values to be computed by the block. We now discuss the definition of the nextState function for the NFA and DFA versions.
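The partition arithmetic above can be made concrete (names follow the text; the specific values below are only an illustration, where maxL is the length of the longest pattern):

```python
def partition(n, Sblock, T, maxL):
    """Zha-Sahni-style tiling of an n-character input.
    Each block outputs Sblock characters but must read maxL - 1
    extra characters so patterns straddling the block boundary
    are still detected."""
    B = n // Sblock                 # number of thread blocks
    Sthread = Sblock // T           # output characters per thread
    per_block_in = Sblock + maxL - 1  # input characters read per block
    TW = B * per_block_in           # total work: effective string length
    return B, Sthread, per_block_in, TW
```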

The DFA version of nextState can be used in Algorithm 2, but the NFA version of nextState cannot, because its definition would generate thread divergence. We now discuss how thread divergence arises from the nextState definition of the NFA version of the AC algorithm (Algorithm 4).

Figure 3.7: CUDA Memory Architecture

Thread Divergence Free NFA AC algorithm for GPU

Thread divergence: threads of a block are grouped into fixed-size warps for execution on a CUDA core, and threads within a warp must follow the same execution path. If the threads within a warp do not follow the same execution path, this creates thread divergence within the warp. Thread divergence is undesirable because the GPU uses the single-instruction multiple-thread (SIMT) model for parallel execution of the threads within a warp.

In the next section we will see how to remove thread divergence from the NFA AC algorithm by redefining nextState. The while loop in Algorithm 4 can execute a different number of times per node to find the next goto node. The NFA version of the AC algorithm therefore has thread divergence due to the nextState definition of the NFA version (Algorithm 4).

We can use the above-mentioned property of the fail function to redefine the nextState function so that thread divergence is removed. In Algorithm 5, we replace the while loop of Algorithm 4 with a new while loop that walks through the levels the same number of times for each thread. Thus, Algorithm 5 is free from thread divergence because all threads follow the same execution path.
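A minimal sketch of the idea (this reconstructs the principle in host-side Python, assuming the fail-function property that every failure link strictly decreases node depth; it is not a reproduction of the thesis's exact Algorithm 5):

```python
def next_state_uniform(s, c, g, f, max_depth):
    """Divergence-free nextState sketch: instead of a data-dependent
    while loop, every thread iterates exactly max_depth times (the trie
    depth bounds any failure chain). Once a goto edge for c exists the
    state simply stops changing, so all threads in a warp execute the
    same instruction sequence; only the selected value differs."""
    for _ in range(max_depth):
        has_goto = c in g[s]
        s = s if has_goto else f[s]   # predicated update, no branch exit
    return g[s].get(c, 0)             # goto edge, or root on total miss
```

On a GPU the `s if has_goto else f[s]` select compiles to predication rather than a branch, which is what removes the divergence.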

So, overall, in this section we developed a thread divergence-free NFA AC algorithm for the GPU. In the next section we will see how the NFA AC automaton can be represented in less space than the DFA AC automaton.

Figure 4.1: NFA AC matching automata

Space requirement optimization for AC algorithm

DFA AC space requirement

NFA AC space requirement

Here bit i of num indicates whether an output edge, i.e. a next node, for character i is available. We can therefore obtain the next node id for character i at each node using num and the node id of the leftmost child: the next node id is the sum of the node id of the leftmost child of the current node and the number of bits set before bit i in the 256-bit number. Here nnid is the next node id for character i, leftNN is the node id of the leftmost child, and count is the number of bits set before bit i in the 256-bit number stored in the node.
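As a concrete sketch of this rank computation (Python on the host; variable names follow the text, with num handled as a big integer standing in for the 256-bit word):

```python
def next_node_id(num, left_nn, i):
    """nnid = leftNN + count, where count is the number of set bits
    below bit i in the 256-bit occupancy word num. Children of a node
    are numbered consecutively starting at leftNN, so the rank of bit i
    among the set bits gives the child's offset."""
    if not (num >> i) & 1:
        return None                     # no goto edge on character i
    below = num & ((1 << i) - 1)        # keep only bits 0 .. i-1
    return left_nn + bin(below).count("1")
```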

The solution for the problem of representation 2 is as follows: use an offset array of length 32, where each offset[i] is a 1-byte number storing the count of 1s in the first 8 ∗ i bits of num. Since the offset array has 32 one-byte entries, it requires a total of 256 bits. Representations 2 and 3 of the goto trie do not differ much in space requirement, but the latter is more efficient for calculating the next node id.
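With the byte-granularity offsets precomputed, only the partial byte containing bit i needs an on-the-fly popcount. A sketch (again treating the 256-bit word as a Python integer):

```python
def build_offset(num):
    """offset[i] = number of 1s in the first 8*i bits of num
    (32 one-byte entries, 256 bits in total)."""
    return [bin(num & ((1 << (8 * i)) - 1)).count("1") for i in range(32)]

def next_node_id_fast(num, offset, left_nn, i):
    """Rank of bit i = full bytes below it (from offset) plus the set
    bits in the partial byte below bit i."""
    partial = (num >> (8 * (i // 8))) & ((1 << (i % 8)) - 1)
    return left_nn + offset[i // 8] + bin(partial).count("1")
```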

Figure 4.2: Trie/Automata

Comparison of the NFA and DFA AC algorithms for the GPU

The space requirement for DFA AC is N ∗ 256 ∗ log(N) bits, and for our NFA AC it is N ∗ (log(N) + 256) + N ∗ log(N) bits, where N is the total number of nodes in the automaton. Thus, our compacted version of the NFA AC automaton takes about log(N) times less space than the DFA AC automaton. A more optimized version of the NFA AC algorithm can be given, and the space requirement for the NFA AC automaton can be reduced further.
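Plugging in concrete numbers makes the comparison tangible. Taking the 2K-node case mentioned earlier and reading log(N) as the number of bits in a node id, ceil(log2 N):

```python
from math import ceil, log2

def space_bits(N):
    """Space formulas from the text, in bits."""
    b = ceil(log2(N))                 # bits per node id
    dfa = N * 256 * b                 # 256 next-state ids per node
    nfa = N * (b + 256) + N * b       # leftmost-child id + 256-bit
                                      # occupancy word, plus fail ids
    return dfa, nfa

dfa, nfa = space_bits(2048)           # ratio comes out close to log2(N)
```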


