The lexer hack
From Wikipedia, the free encyclopedia
In computer programming, the
lexer hack (as opposed to "a lexer hack") describes a common solution to the problems in parsing ANSI
C, due to the reference lexical grammar being context-sensitive. In C, classifying
a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer based
only on the regular grammar of token classes.
Contents
[hide]
Problem[edit]
The problem is that in the following code, the lexical class of A
cannot be determined without
further contextual information:
(A)*B
This code could be multiplication of two variables, in which case A
is a variable
;
written unambiguously:
A * B
Alternatively, it could be casting the dereferenced value of B
to the type A
,
in which case A
is a typedef-name
; written in
usual human-readable form, but still ambiguously from the point of view of the grammar:
(A) *B
In more detail, in a compiler, the lexer performs one
of the earliest stages of converting the source code to a program. It scans the text to
extract meaningful tokens, such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs
here if a single sequence of tokens can ambiguously match more than one syntax rule.
This ambiguity can happen in C if the lexer does not distinguish between variable and typedef identifiers.[1] For
example, in the C expression:
(A) * B
the lexer may find these tokens:
- left parenthesis
- identifier 'A'
- right parenthesis
- operator '*'
- identifier 'B'
The problem is precisely that the lexical class of A cannot be determined without further context: the parser can interpret this as variable A multiplied
by B or as type A casting the dereferenced value of B. This is known as the "typedef-name: identifier" problem, due to the name of the problematic production
rule.[2]
The hack solution[edit]
The solution generally consists of feeding information from the parser's symbol
table back into the lexer. That is, rather than functioning as a pure one-way pipeline from
the lexer to the parser, there is a backchannel from the parser back to the lexer.
This mixing of the lexer and parser is generally regarded as inelegant, which is why it is called a "hack".
Without added context, the lexer cannot distinguish type identifiers from other identifiers without extra context because all identifiers have the same format. With the hack in the above example, when the
lexer finds the identifier A it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears.
The problem also exists in C++ and parsers can
use the same hack.[1]
Alternative solutions[edit]
This problem does not arise (and hence needs no "hack" in order to solve) when using lexerless
parsing techniques, as these are intrinsically contextual. These are generally seen as less elegant designs, however, because they lack the modularity of having a concurrent lexer
and parser in a pipeline.
Some parser generators, such as the yacc-derived BtYacc ("Backtracking
Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules.[3]
The relatively new LLVM-based parser Clang handles
the situation a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's semantic
analysis library to determine the nature of the identifier. This allows a much cleaner separation
of concerns and encapsulation of the lexer and parser, and is therefore
considered a much more elegant solution than The Lexer Hack by most modern software design metrics.[4] This
is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.
See also[edit]
References[edit]
- ^ Jump
up to:a b Roskind,
James A. (1991-07-11). "A YACC-able C++
2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES". - Jump
up^ Bendersky, Eli (2007-11-24). "The
context sensitivity of C's grammar". - Jump
up^ "BtYacc
3.0". Based
on yacc with modifications by Chris Dodd and Vadim Maslov. - Jump
up^ Bendersky, Eli. "How
Clang handles the type / variable name ambiguity of C/C++".
Citations[edit]
- http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html
- http://cs.nyu.edu/rgrimm/papers/pldi06.pdf
- http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html
- http://www.springerlink.com/index/YN4GQ2YMNQUY693L.pdf
- http://news.gmane.org/find-root.php?group=gmane.comp.lang.groovy.jsr&article=843&type=blog
- http://groups.google.com/group/comp.compilers/browse_frm/thread/db7f68e9d8b49002/fa20bf5de9c73472?lnk=st&q=%2B%22the+lexer+hack%22&rnum=1&hl=en#fa20bf5de9c73472