More Regex Work

One of the overall guiding principles of my work is that anything that can be considered an editorial statement within the context of a particular DH project should be exposed to the project owner instead of being hidden away in Ruby code (or any other language of the day). The Fabulator+Radiant combination allows the researcher to see in the content management system everything that pertains to their project. Any tweaks in how they interpret their data are done within the CMS, not within Ruby code.

With that in mind, I’m slowly making progress on a general grammar engine that will let us write markup parsers within the CMS. This is useful when we are working with markup that was developed before TEI, such as in the Donne project. Right now, the parsing of the transcriptions is done in Ruby, so any modification in the parsing and possible translation to HTML or TEI is done in Ruby, away from the eyes of the people with an editorial interest in that translation.

First, I’ll mention the limitations of the Ruby 1.8 regex engine.

It doesn’t really get Unicode, or if it does, Ruby integers don’t. I can only create characters in the range of 0x00-0xff. Thus, the initial grammar engine will only work with ascii. I know this is a major limitation for projects using something other than simple Latin characters. I have good reason to believe that this limitation can be removed when Radiant moves to Ruby 1.9. On the other hand, patches are welcome once I push the current development code to github. :-)

Ruby doesn’t support conditional branching within a regular expression. This limits the complexity we can support. As a result, we’ve had to rethink how we handle character class algebra and will require the ‘bitset’ library. However, I think the resulting facility of specifying character sets is worth the extra installation effort.

We are currently working on two different regular expression languages. One is used to specify tokens and the other for rules. The major difference is that the atomic unit in a token is the character while the atomic unit in a rule is the token. Rules can also have actions associated with them while tokens do not.

Both tokens and rules will act like functions in a library. You can call them in an expression just like a function. If you call the token or rule without a trailing question mark, you’ll get any data structure that results from matching the token/rule against the provided string. If you call it with the trailing question mark, you’ll get a boolean response indicating the success/failure of trying to match the string.

Tokens

As a test case, I’m starting to put together a grammar for the Donne markup. For tokens, I have:

You’ll notice that I’ve specified a mode for almost all of the tokens (using a g:context element as shorthand). This is similar to the mode attribute in XSLT in that the token is only active if the grammar engine is in that mode. If no mode is specified, then the token is active in all modes.

I’ve also used named character classes instead of specifying explicit ranges. This will help when the engine supports Unicode since the named classes will automatically expand to encompass the standard Unicode equivalents. It also makes the expressions a little more readable.

Character Set Algrebra

Character sets can be more than just simple expressions as in the above token definitions. You can add and subtract them to get exactly the set of characters you want.

Some examples of character set algebra (assuming strict 7-bit ascii):

  • [:xdigit:] == [:digit: + [a-f] + [A-F]]
  • [:alnum:] == [:alpha: + :digit:] == [:upper: + :lower: + :digit:]
  • consonants: [:alpha: – [aeiouAEIOU]]
  • everything except vowels: [ – [aeiouAEIOU] ]

Ordering is important. Set operations are evaluated left to right, so the following are not equivalent: [:alpha: – [aeiouAEIOU]] and [ -[aeiouAEIOU] + :alpha: ]. The first is the set of consonants (alpha from which we remove the vowels) while the second is everything (all characters from which we remove the vowels and then add all alphabetical characters).

Set operations can be parenthesized if it helps make things clearer. For example, the following two are equivalent: [:alpha: – [aeiou] – [AEIOU]] and [:alpha: – ([aeiou] + [AEIOU])].

Rules

Rules describe how tokens are related. For example, the top-level rule in the Donne transcription grammar is:

This says that a document consists of a line followed by zero or more lines separated by one or more new lines and ending with an optional newline. If this successfully matches, then the result of the match should be whatever was produced by matching the individual ‘line’ rules.

The ‘[NL]’ is just the token from before. The ‘line’ rule is defined as:

This rule is a bit more complex than the ‘lines’ rule. It also shows us how we might use the mode to change the token set we have available.

Here, we expect to begin in the ‘linenumber’ mode so that we don’t have to worry about the linetext tokens matching when we expect only the ‘linenumber’ rule and tokens to match. Once we are past the ‘linenumber’, we switch to expecting tokens in the ‘linetext’ mode. If we find an optional control character and text, then we successfully match a line and run the associated code that sets up the data for that line.

The linenumber rule is defined as:

You’ll also notice that the actual match is in a g:when element. A rule can match multiple patterns, each with its own associated code. Here, we are using the linenumber tokens defined above and expecting them to be separated by dots (.). You can think of the “’.’” notation as defining an anonymous token that matches a literal string. Spaces in the rule pattern are ignored. To match an explicit space in the string you’re parsing, you will need to include it explicitly in the pattern.

If we match a line number, we just translate the names of the items from uppercase to lowercase in the result that is passed on to the ‘line’ rule.

Grammars

Grammars are collections of rules and tokens. The current vision is that grammars will live inside libraries (yet to be coded) and make available as functions those rules and tokens that are not attached to a particular mode.

The resulting system will provide named regular expressions (tokens) and parsers (rules). I’m debating how I want to construct the parser: either top-down or bottom-up. Top-down is easier in some ways because we don’t need to construct a state machine to manage building up more complex rules from less-complex sets of tokens/rules. Bottom-up is nice because it allows the higher-level patterns to emerge from the collection of tokens/rules produced from processing the given string without having to determine beforehand what our target is: we can stop processing as soon as we have a single rule encompassing everything we’ve seen.

Right now, we have a working token expression parser that gives us basic regular expression support. We’re developing the rule engine and the general library support in Fabulator in which we’ll embed the grammar definitions. Once we have libraries and grammars available in the Radiant+Fabulator combination, we’ll remove the ‘matches’ function from the grammar extension. Regular expressions shouldn’t be embedded in a program if they are subject to change. Instead, they can be documented by putting them in a project-specific library/grammar.

Published by

James

James is a software developer and self-published author. He received his B.S. in Math and Physics and his M.A. in English from Texas A&M University. After spending almost two decades in academia, he now works in the Washington, DC, start up world.