Index of /dev/null

a wordlist folding algorithm

Assumed you wish to match a large wordlist against a huge chunk of text. As a small test case, let

for, far, bar, foo, boofaz, boofar, boof, faz, foobaz, foobars, boofar

be your wordlist.

Now, you may apply the according regualar expression:

(1) /\b(for|far|bar|foo|boofaz|boofar|boof|faz|foobaz|foobars|boofar)\b/

But which way a regex engine would implement the assignment? There are different options. The very worst algorithm would be surely to look up every word separately in the whole text. That would be the same as doing

foreach (qw( for far bar foo boofaz boofar boof faz foobaz foobars boofar ))
{
  return "matching!" if $text =~ m/\b$_\b/;
}
return "not matching.";

Assumed you would match m words against a text consisting of n letters, this peace of coding horror would have a runtime estimation of O(m*n).

Now, a better approach would be to run only once through the text, using a matching stack. Thus, assume " foobar " would appear somewhere in the text, the stack trace might look as follows then (read from bottom to top):

[7] ' ' => nothing matches.
[6] 'r' => "foobars" might match.
[5] 'a' => "foobaz" or "foobars" might match.
[4] 'b' => "foobaz" or "foobars" might match.
[3] 'o' => "foo", "foobaz", or "foobars" might match.
[2] 'o' => "for", "foo", "foobaz", or "foobars" might match.
[1] 'f' => "for", "far", "foo", "faz", "foobaz", or "foobars" might match.
[0] ' ' => "\b" matches.

So, but what if the wordlist is getting large? It seems that we should run nearly through the whole list each time a character is pushed onto the stack in order to find out whether the current stack contents still may be matched or not.

It's clear that a considerable optimization would be to sort the word list in advance. Moreover, instead of looking up one item after another, a really smart approach would be to walk downwards a search tree instead. As a tree, the wordlist above would appear like this:

          _____________|_____________
          |                         |
          b                         f
    ______|______        ___________|___________
    |           |        |                     |
   oof          ar       a                     o
    |                 ___|___            ______|______
    a ?               |     |            |           |
 ___|___              r     z            o           r
 |     |                                 |
 r     z                                 ba ?
                                     ____|____
                                     |       |
                                     rs      z

Here, the "?" denotes an optional node. Remember the length of the way downwards such a tree is in logarithmic relation to the number of nodes. Thus, loosely speeking, we have improved the worst algorithm above up to O(n*log(m)) at least.

Actually I'm not sure whether regex engines would apply optimizations like that when compiling. I guess they do, so it might be needless to replace the regex (1) above by the optimized version, implementing the sorted tree of alternative and optional nodes:

(2) /\b(b(?:ar|oof(?:a(?:r|z))?)|f(?:a(?:r|z)|o(?:o(?:ba(?:rs|z))?|r)))\b/

Nevertheless I couldn't help to create a little Perl routine that folds a wordlist into an optimized regex. Now, here it is:

sub foldWordsToRegex {

  local *toString = sub {
    ## node: [ prefix, [ nodes ], opt ]

    my ($prefix, $nodes, $opt) = ${$_[0]};
    my $rv = quotemeta $prefix;
    if ( ref $nodes eq q|ARRAY| && @$nodes )
    {
      $rv .= '(?:'.(join '|', map { toString($_) } @$nodes).')';
      $rv .= '?' if $opt;
    }
    $rv;
  };

  local *fold = sub(@_) {

    sub reduce($);
    local *reduce = sub {
      my ($prefix, $nodes, $opt) = ${$_[0]};

      return $_[0] unless ref $nodes eq q|ARRAY| && @$nodes > 1;

      ## 1st char of the prefix of 1st node in list
      my ($c, $qc);

      ## check whether 2nd prefix starts with same letter as the 1st
      if ( length $nodes->[0][0] )
      {
        $c = substr $nodes->[0][0], 0, 1;
        $qc = quotemeta $c;
        $nodes->[1][0] =~ m/^$qc/ or undef $c;
      }

      unless ( defined $c )
      {
        return $_[0] unless @$nodes > 2;

        ## try to reduce next list part
        my $first = shift @$nodes;
        my $next = reduce ['', $nodes, 0];
        return [ $prefix, [ $first, $next ], $opt] if length $next->[0];

        ## couldn't be reduced
        return [ $prefix, [ $first, ${$next->[1]} ], $opt ];
      }

      ## reduce any ensuing node whose prefix starts with $c
      my @new;
      my $newopt = 0;
      while ( @$nodes )
      {
        $nodes->[0][0] =~ s/^$qc// or last;

        ## reduce node or detect new optional node
        my $n = shift @$nodes;
        if ( length $n->[0] )
        {
          push @new, $n;
          next;
        }
        $newopt = 1;
      }

      if ( @$nodes || $opt )
      {
        my $new = reduce [ $c, [ @new ], $newopt ];
        if ( @$nodes )
        {
          ## reduce remaining nodes
          my $next = reduce ['', $nodes, 0];
          return [ $prefix, [ $new, $next ], $opt] if length $next->[0];

          ## couldn't be reduced
          return [ $prefix, [ $new, ${$next->[1]} ], $opt ];
        }

        ## current node is optional
        return [ $prefix, [ $new, @$nodes ], $opt ];
      }

      ## nothing left to reduce
      reduce [ $prefix.$c, [ @new ], $newopt ];
    };

    reduce [ '', [( map { [$_] } sort @_ )], 0];
  };

  toString(fold(@_));
};

Well, not so easy, but it works :)

Here, the inner recursion fold will create the actually tree, where nodes having the form of arrays consisting of prefix, subnodes and a flag denoting optional nodes. The second inner function toString then creates the actual regular expression string from that tree. So, for instance, calling

&foldWordsToRegex(qw( for far bar foo boofaz boofar boof faz foobaz foobars boofar ))

would return the regex (2).

no comments

post a comment

Leave a comment
You may use HTML tags for style