Regular Expressions

Pikt regular expressions follow the usual regular expression rules, with the clarifications and amplifications given below.

Here are the regular expression operators:

OPERATOR           MEANING

a =~ b             string b matches at least one
                   substring within a
a =~~ b            like the above, but without case sensitivity
a !~ b             string b matches no substring within a
a !~~ b            like the above, but without case sensitivity

For example, all of the following are true:

"this is a test" =~ "is"
"this is a test" =~~ "IS"
"this is a test" !~ "THIS"
"this is a test" !~~ "that"

"this is a test" =~ ""
"" !~ "this is a test"

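The behavior of the four operators can be approximated outside Pikt. The following is a minimal sketch that uses Python's re module as a stand-in, not Pikt itself. The helper names match and nomatch are hypothetical, and Python's regexp dialect differs from Pikt's in places.

```python
import re

def match(a, b, ignore_case=False):
    """Approximate a =~ b (or a =~~ b when ignore_case is True)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.search succeeds if the pattern matches any substring of a
    return re.search(b, a, flags) is not None

def nomatch(a, b, ignore_case=False):
    """Approximate a !~ b (or a !~~ b when ignore_case is True)."""
    return not match(a, b, ignore_case)

# The six statements listed above, in the same order
results = [
    match("this is a test", "is"),
    match("this is a test", "IS", ignore_case=True),
    nomatch("this is a test", "THIS"),
    nomatch("this is a test", "that", ignore_case=True),
    match("this is a test", ""),
    nomatch("", "this is a test"),
]
print(all(results))
```

The empty pattern matches the empty substring at the start of any string, which is why the fifth statement holds. The sixth holds because an empty left-hand string has no substring long enough to match.
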
These characters have special meaning within Pikt regular expressions:

CHARACTER(S)    MEANING

.               matches any single character
*               matches zero or more instances of the preceding
                character/pattern
?               matches zero or one instance(s) of the preceding
                character/pattern
+               matches one or more instances of the preceding
                character/pattern
{m,n}           matches at least m, and at most n, instances
                of the preceding character/pattern

( )             enclose a subexpression, or set of subexpressions
                separated by |
|               separates subexpressions (think of "or")
[ ]             enclose a set of characters/character ranges
^               as the first character in a [ ] subexpression,
                indicates set negation; as the first character
                in a regular expression, anchors to the
                beginning of the string expression on the
                left-hand side of the regexp operator
$               anchors to the end of the string expression
                on the left-hand side of the regexp operator
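
As a rough illustration of the special characters in the table above, here is a sketch in Python, whose re module treats these particular characters the same way. It is a stand-in under that assumption, not Pikt code. The helper found and the sample strings are made up for the example.

```python
import re

def found(pattern, text):
    """True if pattern matches some substring of text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result)
cases = [
    ("a.c", "abc", True),                # . matches any single character
    ("ab*c", "ac", True),                # * allows zero instances of b
    ("^ab?c$", "abbc", False),           # ? allows at most one b
    ("ab+c", "ac", False),               # + requires at least one b
    ("^a{2,3}$", "aaa", True),           # {2,3} accepts three a's
    ("^a{2,3}$", "aaaa", False),         # ... but not four
    ("^(cat|dog)s$", "dogs", True),      # ( ) and | give alternatives
    ("[0-9a-f]", "xyz7", True),          # [ ] set with ranges
    ("[^0-9]", "12345", False),          # ^ inside [ ] negates the set
    ("^test", "this is a test", False),  # ^ anchors to the beginning
    ("test$", "this is a test", True),   # $ anchors to the end
]

outcomes = [found(p, t) == expected for p, t, expected in cases]
print(all(outcomes))
```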

In addition to user-specified character classes, Pikt supports these predefined character classes:

[[:alnum:]]     the set of alphanumeric characters
[[:alpha:]]     the set of letters
[[:blank:]]     tab and space
[[:cntrl:]]     the control characters
[[:digit:]]     the decimal digits
[[:graph:]]     the printable characters except space
[[:lower:]]     the lower-case letters
[[:print:]]     the printable characters
[[:punct:]]     the punctuation characters
[[:space:]]     whitespace characters
[[:upper:]]     the upper-case letters
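
For readers who want to try such patterns in a tool without the bracketed class names, the classes can be spelled out as explicit ranges. The sketch below does this in Python, whose re module does not understand the [: :] notation. The table POSIX_CLASSES and the function expand_posix are hypothetical helpers, cover only some of the classes, and assume plain ASCII.

```python
import re

# ASCII-only expansions for a subset of the predefined classes (an assumption;
# a locale-aware implementation may include more characters)
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "space": r" \t\n\r\f\v",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with explicit ranges."""
    # An unknown class name raises KeyError rather than being passed through
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

expanded = expand_posix("^[[:alpha:]]+[[:digit:]]{2}$")
print(expanded)
print(re.search(expanded, "abc42") is not None)
```

Because only the inner [:name:] part is replaced, several classes can share one bracket expression: "[[:upper:][:digit:]]" expands to "[A-Z0-9]".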

A backslash escape suppresses a character's special meaning. So, "\\*" is a literal asterisk, and the following are all true:

"fo*bar" !~ "fo*bar"        // left side literal string,
                            // right side regexp
"fo*bar" !~ "fo\*bar"

"fo*bar" =~ "fo\\*bar"

"fo*bar" =~ "\\*"

"*" =~ "\\*"

In any of the above left-hand expressions, you could substitute "fo\*bar", and the statements would all still be true.

In most regexp implementations, a single backslash suffices for this purpose. In Pikt, however, the backslash is a general escape character. If, for example, you want to output the literal text string "$x" without the $x being interpreted as a variable (which Pikt would attempt to resolve to a value), you would use "\$x". So, if you require a backslash in the final product (here, in the regexp itself), you must supply a double backslash going in. See the sample config files for examples of double-backslash usage.
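
One way to picture the double-backslash rule is as two passes: a general escape pass that removes one level of backslashes, followed by the regexp engine proper. The Python sketch below models that idea. The two-pass model and the helper names pikt_unescape and pikt_match are assumptions made for illustration, not a description of Pikt's actual implementation.

```python
import re

def pikt_unescape(s):
    """Assumed model of the general escape pass: \\X becomes X for any X."""
    return re.sub(r"\\(.)", r"\1", s)

def pikt_match(lhs, rhs):
    """Approximate lhs =~ rhs after both sides go through the escape pass."""
    text = pikt_unescape(lhs)
    regex = pikt_unescape(rhs)
    return re.search(regex, text) is not None

# Raw strings keep the backslashes exactly as they would be typed in Pikt
checks = [
    pikt_match(r"fo*bar", r"fo*bar") is False,    # * is still a quantifier
    pikt_match(r"fo*bar", r"fo\*bar") is False,   # escape pass eats the backslash
    pikt_match(r"fo*bar", r"fo\\*bar") is True,   # regexp engine sees fo\*bar
    pikt_match(r"fo*bar", r"\\*") is True,
    pikt_match(r"*", r"\\*") is True,
    pikt_match(r"fo\*bar", r"fo\\*bar") is True,  # left side unescapes to fo*bar
]
print(all(checks))
```

Under this model the single-backslash pattern "fo\*bar" reaches the regexp engine as fo*bar, which is why it behaves exactly like the unescaped pattern.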

Note that every time a regular expression containing matching parentheses is invoked, for example in any of the following situations

dat "([^:]*):([^:]*)"

if $line =~ "^([^:]*):([^:]*)"

do #split($rdlin, "([^:]*):([^:]*)")

you can reference the first parenthesized matched subexpression with $1, the second with $2, and so on. $0 references the entire matched string.
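
To make the numbering concrete, here is a small sketch in Python, where group(0), group(1), and group(2) play the roles of $0, $1, and $2. The sample input line is made up, and Python is used only as a stand-in for Pikt.

```python
import re

line = "root:x:0:0"  # hypothetical colon-separated input line

m = re.search(r"^([^:]*):([^:]*)", line)

whole = m.group(0)   # like $0: everything the pattern matched
first = m.group(1)   # like $1: the first parenthesized subexpression
second = m.group(2)  # like $2: the second parenthesized subexpression

print(whole, first, second)
```

Note that the whole match covers only the first two fields, because the pattern stops after the second parenthesized subexpression.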

Note well: $0, $1, and so on persist only until the next regexp pattern match. The next time you use =~ (or any of the other regexp operators), or the next time you invoke the #split() function (in any of its forms), any previous $0, $1, ... values are supplanted by the values from the latest regexp. You will encounter many strange bugs unless you keep this in mind!

Alternate forms for referencing regexp matches are: $[0], $[1], $[2], and so on. These make it possible to reference the matched expressions within for loops:

set #n = #split($rdlin)
for #i=1 #i<=#n #i+=1
        output $[#i]
endfor

Here is a technique for saving $0, $1, ... before a subsequent regexp action:

set #n = #split($rdlin)
for #i=1 #i<=#n #i+=1
        set $f[#i] = $[#i]
endfor
...
if $f[3] =~ "cantata|sonata|toccata"    // wipes out
                                        // $3 & $[3] value
        output $f[3]
fi

Better still, use the three-argument form of #split() (all three arguments are required) this way:

do #split($f, $rdlin, " ")
...
if $f[3] =~ "cantata|sonata|toccata"    // wipes out
                                        // $3 & $[3] value
        output $f[3]
fi

If you had not saved the previous regexp values in the $f[] array and had simply referenced $3 or $[3], that value would be undefined, because the =~ test puts no ( )'s around a third subexpression. Even if it did (say, around "toccata"), your previous $3 value would still be lost.
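
The save-before-you-match pattern can be modeled in a few lines of Python. The class LastMatch below is a hypothetical stand-in that keeps only the most recent match, as $0, $1, ... do. It is a sketch of the pitfall, not Pikt's implementation, and it assumes that a failed match also clears the old values.

```python
import re

class LastMatch:
    """Keeps only the groups of the most recent search (assumed model)."""

    def __init__(self):
        self.groups = []

    def search(self, pattern, text):
        m = re.search(pattern, text)
        # Each search replaces whatever the previous one stored
        self.groups = [m.group(0), *m.groups()] if m else []
        return m is not None

    def ref(self, i):
        """Like $[i]: None stands in for an undefined value."""
        return self.groups[i] if i < len(self.groups) else None

rx = LastMatch()
rx.search(r"^(\S+) (\S+) (\S+)", "bach organ toccata")

saved = list(rx.groups)  # copy the values before the next match

rx.search("cantata|sonata|toccata", saved[3])  # supplants the earlier groups
after = rx.ref(3)

print(saved[3], after)
```

After the second search only the whole match is stored, so the third group is gone while the saved copy is still usable.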

For further coverage of regular expressions, see the GNU RX info pages.

Refer to the sample alarms.cfg for examples.
