How to fix ambiguity or invalid detection of identifier?

Jun 26, 2013 at 1:46 AM
Hi,

I am trying to create a SQL parser for the SQL-92 BNF that was converted by me to the lex/yacc grammar. I have the following two rules in the lexer's grammar:
[a-zA-Z][a-zA-Z_0-9]* {
    Console.WriteLine("SQL_IDENTIFIER");
    return (int) Tokens.SQL_IDENTIFIER;
}
[a-zA-Z_][a-zA-Z_0-9]* {
    Console.WriteLine("REGULAR_IDENTIFIER");
    return (int) Tokens.REGULAR_IDENTIFIER;
}
These two rules corresponds to the <SQL language identifier> and the <regular identifier> rules of the BNF.

After compilation when I try to check SQL files with test samples Parser recognizes all identifiers as SQL_IDENTIFIER. For example for the following SQL:
SELECT * FROM (SELECT * FROM table_name) t;
it shows:
Parsing SQL: SELECT * FROM (SELECT * FROM table_name) t;
SELECT
ASTERISK
FROM
LEFT_PAREN
SELECT
ASTERISK
FROM
SQL_IDENTIFIER
RIGHT_PAREN
SQL_IDENTIFIER
SEMICOLON
EOF
DONE!
All parser's rules are in the following format
regular_identifier:
    SQL_ID;
without any C# actions and respective rules are:
...
actual_identifier:
    DOUBLE_QUOTED_STRING | regular_identifier;

regular_identifier:
    SQL_ID;
...

character_set_specification:
    schema_name PERIOD sql_language_identifier
    | sql_language_identifier;

sql_language_identifier:
    SQL_ID;
I expect that program (parser + lexer) should show REGULAR_IDENTIFIER instead of all SQL_IDENTIFIER but it shows SQL_IDENTIFIER for some reason. I'm stuck with it and don't know what to do next to fix the problem.

Could you help me to resolve this problem.

Thanks ;)
Coordinator
Jun 27, 2013 at 7:19 AM
Edited Jun 27, 2013 at 7:20 AM
Hi Jeff.
The problem is that the grammar that you are dealing with is descriptive rather than designed for recognition. All that it really tells you is that ordinary identifiers can start with an underscore "_" while sql identifiers cannot. The grammar tells you how to invent a legal identifier for each case, but does not help at all to recognize identifiers as the grammar is ambiguous. For example, if you take a random identifier, such as "foo" to choose a random example, then this matches both patterns. The best that you can do is to separate the cases "definitely a regular identifier" and "could be either regular or sql identifier".

Once you do this you have to then hack on your syntactic grammar so that it can deal with the "could be either" identifiers.

As to the "always overrides" message, the rule with LEX is that if more than one pattern matches you return the longest. If the length measurement is a tie, then you choose the lowest numbered rule. Suppose you have two rules

"xyz" { return (int)Tokens.xyz; }
{alpha}{3} {return (int)Tokens.TLA; }

Then if the scanner sees "xyz" this will also match the second rule. However xyz will be returned because of the tie-breaking rule. However, if you put the rules in a different order --

{alpha}{3} {return (int)Tokens.TLA; }
"xyz" { return (int)Tokens.xyz; }

then when the scanner sees "xyz" the tie-breaking rule will return TLA. However gplex will warn you that nothing matches the second rule, so that you realize that you have got the order wrong.

John
Jun 27, 2013 at 9:51 PM
Edited Jun 28, 2013 at 2:50 PM
Thanks John for reply. It has clarified for me how the lexer handles of ambiguous cases. As I understood I need to change the lexer's grammar as:
[a-zA-Z][a-zA-Z_0-9]* {
    Console.WriteLine("SQL_ID_OR_REGULAR_ID");
    return (int) Tokens.SQL_ID_OR_REGULAR_ID;
}
[a-zA-Z_][a-zA-Z_0-9]* {
    Console.WriteLine("REGULAR_IDENTIFIER");
    return (int) Tokens.REGULAR_IDENTIFIER;
}
am I right? if yes, I not completely understand how can I handle the SQL_ID_OR_REGULAR_ID case, i.e. what should I change in the parser's grammar to make it recognize this case.
Coordinator
Jun 30, 2013 at 1:02 AM
Edited Jun 30, 2013 at 1:03 AM
Hi. Well that is one way of doing it. However if you order the rules in this way it would be more helpful to name them "IDENT_FOR_EITHER" and "IDENT_NOT_FOR_SQL". All the identifiers that do not start with an underscore will be recognized by the first rule (using the tie-breaking rule) leaving only the identifiers that are not legal for SQL to be matched by the second rule. You might prefer shorter names, if you can think of ones that still signal the meaning.

I have just taken a look at the grammar fragment in your original post. There is something missing here.
The token REGULAR_IDENTIFIER does not appear in the grammar fragment, although you have the production
regular_identifier:
    SQL_ID;
I suspect that what you need, in terms of the two token that I gave names to is
regular_identifer:
    IDENT_FOR_EITHER | IDENT_NOT_FOR_SQL ;
...
sql_language_identifier:
    IDENT_FOR_EITHER ;
...
You might need to check my understanding of these two symbols reflects their use in the rest of the grammar.