Why and How to Replace Perl Compatible Regular Expressions (PCRE) with Hyperscan

ID 标签 689752
已更新 3/23/2018
版本 Latest
公共

author-image

作者

Introduction to PCRE and Hyperscan

Perl Compatible Regular Expressions (PCRE), is a widely used regular expression matching library written in the C language, inspired by the regular expression capabilities of the Perl programming language. Its syntax is much more powerful and flexible than many other regular expression libraries, such as the Portable Operating System Interface for UNIX* (POSIX).

Hyperscan is a high performance, multi-pattern regular expression matching library developed by Intel, which supports the same syntax as PCRE. In this article, we will describe the API differences and provide a performance contrast between PCRE and Hyperscan, then show how to replace PCRE with Hyperscan in a typical scenario.

Functionality Comparison

PCRE supports only block mode compilation and matching, while Hyperscan supports both block and streaming mode. Streaming mode is more practical and flexible in real network scenarios.

PCRE supports only single pattern compilation and matching, while Hyperscan can support multiple patterns. In real scenarios, it is common that multiple patterns are applied, and Hyperscan can efficiently complete all work in one compilation and scan.

API Comparison

Both PCRE and Hyperscan interfaces have compile time and runtime phases. When replacing PCRE with Hyperscan, the basic idea is to replace the PCRE API with the Hyperscan API at both compile time and runtime.

Compile time API changes

With PCRE, we often use the following API to compile each pattern:

#include <pcre.h>
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr); 

It is very easy to replace this API with the Hyperscan compile time API:

#include <hs_compile.h>
hs_error_t hs_compile(const char *expression, unsigned int flags,
               unsigned int mode, const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

Parameters:

expression – single pattern. Corresponding to “pattern” of pcre.
flags – flag of single pattern. Corresponding to “options” of pcre.
mode – mode selection. What corresponds to pcre is HS_MODE_BLOCK.
platform – platform information.
db – Generated Hyperscan database. Corresponding to the return value of pcre_compile2.
error – return the compile error.

Return value is HS_SUCCESS(0) or error code.

Hyperscan also provides an API for multi-pattern compilation:

hs_error_t hs_compile_multi(const char *const *expressions,
               const unsigned int *flags, const unsigned int *ids,
               unsigned int elements, unsigned int mode,
               const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

This API supports the compiling of several patterns together to generate one Hyperscan database. There are some differences in the parameters:

expressions – the array of patterns.
flags – the flag array of patterns.
ids – the id array of patterns.
elements – the number of patterns.

The others remain the same.

Run time API replacement

To build the compiled PCRE database, we often use the following API to scan the input data:

#include <pcre.h>
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize); 

It is easy to replace it with the Hyperscan runtime API:

#include <hs_runtime.h>
hs_error_t hs_scan(const hs_database_t *db, const char *data,
               unsigned int length, unsigned int flags,
               hs_scratch_t *scratch, match_event_handler onEvent,
               void *context);

Parameters:

db – Hypersan data base. Corresponding to “code” of pcre.
data – input data. Corresponding to “subject” of pcre.
length – length of the input data. Corresponding to “length” of pcre.
flags – reserved options.
scratch – the space storing temporary state during runtime, allocated by hs_alloc_scratch().
onEvent – callback function at match, user-defined.
context – callback function parameter, user-defined.

Return value is HS_SUCCESS(0) or error code.

Using prefilter mode

Hyperscan does not completely match PCRE syntax; for example, it doesn’t support Back Reference and Zero-Width Assertion. However, Hyperscan’s performance advantage over PCRE makes it worthwhile to convert the unsupported pattern to its superset, which can be supported by Hyperscan’s prefilter mode. For example:

Convert /foo(\d)+bar\1+baz/ to /foo(\d)+bar(\d)+baz/

At compile time, each pattern goes through the classifier first, which decides whether to use Hyperscan, prefilter mode, or PCRE. The patterns supported by Hyperscan and prefilter mode are multi-compiled together to generate one Hyperscan database; in addition, every pattern that uses prefilter mode or PCRE is compiled separately to generate the PCRE database:


Figure 1. Compile time.

At runtime, the input data is scanned against the Hyperscan database and each non-prefilter mode PCRE database. If Hyperscan finds a match, it should be reconfirmed with PCRE:


Figure 2. Runtime.

Please refer to API Reference: Constants for more API information.

Performance Comparison

To contrast Hyperscan and PCRE performance, we used the Hyperscan performance testing tool hsbench. We selected the following 15 regular expressions, which include both plain text strings and regular expression rules:

ID Signature
1 Twain
2 (?i)Twain
3 [a-z]shing
4 Huck[a-zA-Z]+|Saw[a-zA-Z]+
5 \b\w+nn\b
6 [a-q][^u-z]{13}x
7 Tom|Sawyer|Huckleberry|Finn
8 (?i)Tom|Sawyer|Huckleberry|Finn
9 .{0,2}(Tom|Sawyer|Huckleberry|Finn)
10 .{2,4}(Tom|Sawyer|Huckleberry|Finn)
11 Tom.{10,25}river|river.{10,25}Tom
12 [a-zA-Z]+ing
13 \s[a-zA-Z]{0,12}ing\s
14 ([A-Za-z]awyer|[A-Za-z]inn)\s
15 ["'][^"']{0,30}[?!\.]["']

We ran the test on a single core of an Intel® Core™ i7-8700K processor at 3.70 GHz. We chose an e-book from The Entire Project Gutenberg Works of Mark Twain by Mark Twain, containing about 20M words (18,905,427 bytes) as input, and then looped for 200 times using hsbench. Time spent and throughput of PCRE (v8.41, just-in-time mode) and Hyperscan (v4.7.0) are as follows:

  Corpus: mtent.txt Total Data: 18905427 Bytes x 200
  Time(s) Throughput (Mbit/s)
ID pcre_jit hs pcre_jit hs
1 3.518 1.452 8,598.26 20,832.43
2 3.631 1.498 8,330.68 20,192.71
3 3.327 2.355 9,091.88 12,844.45
4 1.582 2.246 19,120.53 13,467.80
5 12.901 2.067 2,344.68 14,634.10
6 1.462 1.585 20,689.93 19,084.34
7 4.755 3.037 6,361.45 9,960.05
8 11.075 3.08 2,731.26 9,821.00
9 31.684 3.037 954.70 9,960.05
10 31.014 3.042 975.32 9,943.68
11 3.143 2.097 9,624.14 14,424.74
12 7.42 3.279 4,076.64 9,224.97
13 8.463 4.45 3,574.23 6,797.46
14 6.423 2.89 4,709.43 10,466.67
15 2.395 4.267 12,629.93 7,088.98
Total 132.793 40.382 227.79 749.0635234
Multi 132.793 13.4 227.79 2,257.36

The results show that Hyperscan has a performance advantage over PCRE for most of the rules tested. The highest throughput (see test 9) is 10.4 times greater using Hyperscan.

Multiple pattern matching test results also show the advantage of using Hyperscan. Multi-pattern matching is very common in practical use. Hyperscan can compile all the patterns simultaneously into one database which is scanned against the input corpus only once. The results above (see the rows labeled Total and Multi) show that it takes Hyperscan only 13.4 seconds to perform multiple pattern matching, and a total of 40.382 seconds when the 15 rules are scanned separately in single pattern scans. Because PCRE supports only single pattern compilation and scanning, each of the 15 rules must be compiled separately and scanned against the input corpus. Altogether, it takes PCRE 132.793 seconds to complete all scans. The throughput histogram is as follows:

Replacement Pseudo-code Samples

Assume that we have a pattern set and input corpus:

// patterns
const char *pats[];
unsigned flags[];

// input data
const char *buf = "blah...................blah";
size_t buf_len = strlen(buf);

When using PCRE, we may have this kind of implementation:

// pcre compile
for each pattern i
    pcres[i] = pcre_compile2(pats[i], flags[i], …);

// pcre runtime
for each pattern i
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern i match at ovector[1]

Now we’ll describe the details of replacing PCRE with Hyperscan.

In addition to a possible requirement for prefiltering mode, we also have to be careful about a pattern having variable width, which means that a pattern may consume a different amount of data to get matches. This is because Hyperscan reports all matches but PCRE only reports one match under greedy or ungreedy mode. We also need to reconfirm the match from a variable width pattern with PCRE. We may use the following function to check whether a pattern has variable width or not:

bool is_variable_width(re, flags) {
    hs_expr_info_t *info = NULL;
    ret = hs_expression_info(re, flags, &info, ...);
    if (ret == HS_SUCCESS) and info and (info->min_width == info->max_width)
        return false
    else
        return true
}

Here we show two different scenarios - single compile or multi-compile.

Single pattern compile

// try compile hs, prefilter mode and pcre compile
for each pattern i
    dbs[i].pcre = NULL;
    ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &dbs[i].hs, …);
    if ret == HS_SUCCESS
        hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        if pats[i] has variable width
            dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
else
        dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &dbs[i].hs, …);
        if ret == HS_SUCCESS
            hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        else
            dbs[i].hs = NULL;

// runtime
on_match(…, to, …, ctx) { // hs callback
    if !ctx->pcre // not prefilter mode pattern
        report pattern ctx->i match at to
        return
    
    // got a match from a prefilter mode or variable width pattern, need pcre confirm
    ret = pcre_exec(ctx->pcre, …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern ctx->i match at ovector[1]
    return
}

for each pattern i
    if dbs[i].hs
        ctx = pack dbs[i].pcre and i
        hs_scan(dbs[i].hs, buf, buf_len, 0, dbs[i].scratch, on_match, &ctx);
    else
        ret = pcre_exec(dbs[i].pcre, …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern i match at ovector[1]

// house clean
for each pattern i
    if dbs[i].hs
        hs_free_scratch(dbs[i].scratch);
        hs_free_database(dbs[i].hs);

Multi-pattern compile

// try hs, use prefilter mode and pcre compile if failed
for each pattern i
ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &hs, …);
if ret == HS_SUCCESS
        store pats[i] to hs_pats[]
        store flags[i] | HS_FLAG_PREFILTER to hs_flags[]
        store ids[i] to hs_ids[]
        if pats[i] has variable width
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
    else
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &hs, …);
        if ret == HS_SUCCESS
            store pats[i] to hs_pats[]
            store flags[i] | HS_FLAG_PREFILTER to hs_flags[]
            store ids[i] to hs_ids[]
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
        else
            pcres[n_pcre++] = pcre_compile2(pats[i], flags[i], …);

// hs multi compile
hs_compile_multi(hs_pats, hs_flags, hs_ids, n_hs, HS_MODE_BLOCK, …, &hs, …);
hs_alloc_scratch(hs, &scratch);

// hs runtime for multi compiled part
on_match(id, …, to, …, ctx) {
    if ctx[id] // got match from a prefilter mode or variable width pattern, need pcre confirm
        ret = pcre_exec(ctx[id], …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern id match at ovector[1]
    else
        report pattern id match at to
    return
}

ctx = id2pcre; // user defined on_match context
hs_scan(hs, buf, buf_len, 0, scratch, on_match, ctx);

// pcre runtime for the rest
for each db i in pcres[]
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern ids[i] match at ovector[1]

// house clean
hs_free_scratch(scratch);
hs_free_database(hs_db);

Summary

Hyperscan is a high performance regular expression matching library that is very suitable for multi-pattern matching and is faster than PCRE. In this article, we showed how to replace PCRE with Hyperscan in a typical scenario. In addition to its performance advantage, Hyperscan has another superior feature, the streaming mode. It enables us to deal with the case that input data is separated into pieces of blocks. For these reasons, Hyperscan is expected to replace many of the existing regular expression matching engines in a greater number of scenarios.

"