Matching all characters EXCEPT a certain subset

Jun 19, 2013 at 4:35 PM
I'm hoping you can help me... I'm using gplex to create an analyzer for Lucene.Net, and I'm having trouble matching a range of characters with a predicate when that predicate also includes characters I want to ignore or process separately. Specifically, I need to process Chinese characters individually, even though they are also matched by the [:IsLetter:] predicate.

Lucene.Net's StandardAnalyzer (defined in JFlex) accomplishes this by defining a rule '!(!a|b)', which matches everything in set 'a' that is not part of set 'b' (by De Morgan's law, the negation of "not a, or b" is "a and not b"). This rule is then used as a component in other rules:
{cj} b // where b defines the subset of all Chinese or Japanese characters
{letter} !(!a|b) // where 'a' is a predicate that matches any letter, so this would match any letter EXCEPT Chinese/Japanese
{alphanumeric} {letter}|{number}
The desired result is that the literal string 'abc' returns 1 alphanumeric token, '惝掭掝' returns 3 tokens, and 'abc惝' returns 2 tokens.
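
To make the intended behaviour concrete, here is a minimal Python sketch of the same set logic. It is not gplex or JFlex syntax, only an illustration. The names is_cj, is_letter and tokenize are hypothetical, and the code point ranges used for set 'b' are a simplifying assumption rather than the ranges StandardAnalyzer actually uses.

```python
def is_cj(ch):
    # Assumption: a simplified stand-in for set 'b' (Chinese/Japanese).
    # Covers CJK Unified Ideographs plus Hiragana/Katakana only.
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def is_letter(ch):
    # !(!a|b) is equivalent to (a and not b): any letter EXCEPT set 'b'.
    return ch.isalpha() and not is_cj(ch)


def tokenize(text):
    tokens = []
    run = []  # characters of the alphanumeric token being built

    def flush():
        # Emit the pending alphanumeric run, if there is one.
        if run:
            tokens.append("".join(run))
            run.clear()

    for ch in text:
        if is_cj(ch):
            flush()
            tokens.append(ch)  # each CJ character is its own token
        elif is_letter(ch) or ch.isdigit():
            run.append(ch)  # {alphanumeric} = {letter}|{number}
        else:
            flush()  # anything else acts as a separator
    flush()
    return tokens


if __name__ == "__main__":
    for sample in ("abc", "惝掭掝", "abc惝"):
        print(sample, len(tokenize(sample)))
```

With these assumptions, 'abc' yields one token, '惝掭掝' yields three, and 'abc惝' yields two, which matches the counts described above.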

I'm reasonably certain I can accomplish this using exclusive rules without delving into custom character predicates, but this is my first foray into lex/yacc and I'm a little lost... I'd appreciate any pointers or help you can offer.
Jun 20, 2013 at 4:52 PM
Aaaand I'm a moron. As stated, I'm a tad new to the whole lex/yacc scene, and found I was completely missing the yacc/gppg side of the equation, which I should be able to leverage to achieve the result I want.
Jun 21, 2013 at 12:15 AM
Hi Joseph
If you want to create a character set like [a-zA-Z], but one too large to define character by character, then you can use a character predicate. Character predicates, with syntax like [:isFoo:], are evaluated when gplex runs to generate the scanner (rather than when the scanner itself runs). Section 8.3.6 of the V1.2 gplex.pdf explains how to define your own character predicates in case the .NET base library does not supply the one you need. Send me a message if you need further info.