=head1 NAME

bt_language - the BibTeX data language, as recognized by B<btparse>

=head1 SYNOPSIS

   AT           \@
   NEWLINE      \n
   COMMENT      \%~[\n]*\n
   WHITESPACE   [\ \r\t]+
   JUNK         ~[\@\n\ \r\t]+

   NEWLINE      \n
   COMMENT      \%~[\n]*\n
   WHITESPACE   [\ \r\t]+
   NUMBER       [0-9]+
   NAME         [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
   LBRACE       \{
   RBRACE       \}
   LPAREN       \(
   RPAREN       \)
   EQUALS       =
   HASH         \#
   COMMA        ,
   QUOTE        \"

   bibfile : ( entry )*

   entry : AT NAME body

   body : STRING
        | ENTRY_OPEN contents ENTRY_CLOSE

   contents : ( NAME | NUMBER ) COMMA fields
            | fields
            | value

   fields : field { COMMA fields }
          |

   field : NAME EQUALS value

   value : simple_value ( HASH simple_value )*

   simple_value : STRING
                | NUMBER
                | NAME

=head1 DESCRIPTION

One of the problems with BibTeX is that there is no formal
specification of the language.  This means that users exploring the
arcane corners of the language are largely on their own, and
programmers implementing their own parsers are completely on their
own---except for observing the behaviour of the original
implementation.

Other parser implementors (Nelson Beebe of C<bibclean> fame, in
particular) have taken the trouble to explain the language accepted by
their parser, and in that spirit the following is presented.

If you are unfamiliar with the arcana of regular and context-free
languages, you will not have an easy time understanding this.  This is
I<not> an introduction to the BibTeX language; any LaTeX book would be
more suitable for learning the data language itself.

=head1 LEXICAL GRAMMAR

The lexical scanner has three distinct modes: top-level, in-entry, and
string.  Roughly speaking, top-level is the initial mode; we enter
in-entry mode on seeing an C<@> at top-level; and on seeing the C<}> or
C<)> that ends the entry, we return to top-level.  We enter string mode
on seeing a C<"> or non-entry-delimiting C<{> from in-entry mode.  Note
that the lexical language is both non-regular (because braces must
balance) and context-sensitive (because C<{> can mean different things
depending on its syntactic context).  That said, we will use regular
expressions to describe the lexical elements, because they are the
starting point used by the lexical scanner itself.  The rest of the
lexical grammar will be informally explained in the text.

From top-level, the following tokens are recognized according to the
regular expressions on the right:

   AT           \@
   NEWLINE      \n
   COMMENT      \%~[\n]*\n
   WHITESPACE   [\ \r\t]+
   JUNK         ~[\@\n\ \r\t]+

(Note that this is PCCTS regular expression syntax, which should be
fairly familiar to users of other regex engines.  One oddity is that a
character class is negated as C<~[...]> rather than C<[^...]>.)

On seeing an C<at> token at top-level, we enter in-entry mode.
Whitespace, junk, newlines, and comments are all skipped, with the
latter two incrementing a line counter.  (Junk is explicitly recognized
to allow for C<bibtex>'s "implicit comment" scheme.)
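
To make the top-level behaviour concrete, here is a stand-alone sketch
in C of how a scanner I<might> dispose of top-level input.  It is not
B<btparse>'s actual scanner (which is generated by PCCTS), and the
function name C<skip_toplevel> is invented for illustration.

   #include <stdio.h>

   /* Illustrative only: dispose of top-level text as the lexical
    * grammar describes -- whitespace, newlines, comments, and junk
    * are all skipped, and '@' switches the (hypothetical) scanner
    * into in-entry mode.
    */
   static const char *skip_toplevel (const char *p, int *line)
   {
      while (*p && *p != '@') {
         if (*p == '\n') {                 /* NEWLINE: count and skip */
            (*line)++;
            p++;
         }
         else if (*p == '%') {             /* COMMENT: skip to end of line */
            while (*p && *p != '\n')
               p++;
         }
         else if (*p == ' ' || *p == '\r' || *p == '\t') {
            p++;                           /* WHITESPACE: skip */
         }
         else {                            /* JUNK: implicit comments */
            while (*p && *p != '@' && *p != '\n'
                   && *p != ' ' && *p != '\r' && *p != '\t')
               p++;
         }
      }
      return p;                            /* "" (EOF) or "@..." */
   }

   int main (void)
   {
      int line = 1;
      const char *text = "implicit comment\n% real comment\n@article{key,}\n";
      const char *at = skip_toplevel (text, &line);
      printf ("entry starts at line %d: %s", line, at);
      return 0;
   }
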
From in-entry mode, we recognize newline, comment, and whitespace
identically to top-level mode. In addition, the following tokens are
recognized:

   NUMBER       [0-9]+
   NAME         [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
   LBRACE       \{
   RBRACE       \}
   LPAREN       \(
   RPAREN       \)
   EQUALS       =
   HASH         \#
   COMMA        ,
   QUOTE        \"

At this point, the lexical scanner starts to sound suspiciously like a
context-free grammar, rather than a collection of independent regular
expressions. However, it is necessary to keep this complexity in the
scanner because certain characters (C<{> and C<(> in particular) have
very different lexical meanings depending on the tokens that have
preceded them in the input stream.

In particular, C<{> and C<(> are treated as "entry openers" if they
follow one C<at> and one C<name> token, unless the value of the C<name>
token is C<"comment">.  (Note the switch from top-level to in-entry
between the two tokens.)  In the C<@comment> case, the delimiter is
considered as starting a string, and we enter string mode.  Otherwise,
the delimiter is saved, and when we see a corresponding C<}> or C<)> it
is considered an "entry closer".  (Braces are balanced for free here
because the string lexer takes care of counting brace-depth.)

Anywhere else, C<{> is considered as starting a string, and we enter
string mode.  C<"> always starts a string, regardless of context.  The
other tokens (C<name>, C<number>, C<equals>, C<hash>, and C<comma>) are
recognized unconditionally.
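
The context-dependence of C<{> and C<(> can be stated as a small
decision function.  The following is only a sketch of the rule just
described, not B<btparse>'s implementation; the type and function
names are invented, and it assumes the entry type has already been
folded to lowercase.

   #include <stdio.h>
   #include <string.h>

   typedef enum { TOK_NONE, TOK_AT, TOK_NAME } prev_token;

   /* Illustrative only: decide what '{' or '(' means, given the two
    * previously seen tokens and the text of the last name token.  It
    * is an entry opener only if it follows AT NAME and the name is
    * not "comment"; otherwise it starts a string.
    */
   static int opens_entry (prev_token before_last, prev_token last,
                           const char *last_name)
   {
      if (before_last != TOK_AT || last != TOK_NAME)
         return 0;                   /* anywhere else: starts a string */
      if (strcmp (last_name, "comment") == 0)
         return 0;                   /* @comment{...}: delimits a string */
      return 1;                      /* ENTRY_OPEN */
   }

   int main (void)
   {
      printf ("@article{ -> %s\n",
              opens_entry (TOK_AT, TOK_NAME, "article")
                 ? "entry opener" : "string start");
      printf ("@comment{ -> %s\n",
              opens_entry (TOK_AT, TOK_NAME, "comment")
                 ? "entry opener" : "string start");
      return 0;
   }
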
Note that C<name> is a catch-all token used for entry types, citation
keys, field names, and macro names; because BibTeX has slightly
different (largely undocumented) rules for these various elements, a
bit of trickery is needed to make things work.  As a starting point,
consider BibTeX's definition of what's allowed for an entry key: a
sequence of any characters I<except>

   " # % ' ( ) , = { }

plus space.  There are a couple of problems with this scheme.  First,
without specifying the character set from which those "magic 10"
characters are drawn, it's a bit hard to know just what is allowed.
Second, allowing C<@> characters could lead to confusing BibTeX syntax
(it doesn't confuse BibTeX, but it might confuse a human reader).
Finally, allowing certain characters that are special to TeX means that
BibTeX can generate bogus TeX code: try putting a backslash (C<\>) or
tilde (C<~>) in a citation key.  (This last exception is rather
specific to the "generating (La)TeX code from a BibTeX database"
application, but since that's the major application for BibTeX
databases, it will presumably be the major application for B<btparse>,
at least initially.  Thus, it makes sense to pay attention to this
problem.)

In B<btparse>, then, a name is defined as any sequence of letters,
digits, underscores, and the following characters:

   ! $ & * + - . / : ; < > ? [ ] ^ _ ` |

This list was derived by removing BibTeX's "magic 10" from the set of
printable 7-bit ASCII characters (32-126), and then further removing
C<@>, C<\>, and C<~>.  This means that B<btparse> disallows some of the
weirder entry keys that BibTeX would accept, such as C<\foo@bar>, but
still allows a string with initial digits.  In fact, from the above
definition it appears that B<btparse> would accept a string of all
digits as a "name;" this is not the case, though, as the lexical
scanner recognizes such a digit string as a number first.  There are
two problems here: BibTeX entry keys may in fact be entirely numeric,
and field names may not begin with a digit.  (Those are two of the
not-so-obvious differences in BibTeX's handling of keys and field
names.)  The tricks used to deal with these problems are implemented in
the parser rather than the lexical scanner, so are described in
L<"SYNTACTIC GRAMMAR"> below.
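
That character-set rule translates directly into a predicate.  The
following is a sketch built from the list above, not a call into
B<btparse>'s API; the function name C<is_name_char> is made up.

   #include <stdio.h>
   #include <string.h>
   #include <ctype.h>

   /* Illustrative only: is 'c' legal in a "name" token?  Letters,
    * digits, and the punctuation that survives after removing the
    * "magic 10", space, '@', '\', and '~' from printable ASCII.
    */
   static int is_name_char (int c)
   {
      return isalnum ((unsigned char) c)
         || strchr ("!$&*+-./:;<>?[]^_`|", c) != NULL;
   }

   int main (void)
   {
      const char *samples[] = { "donald:knuth-1984", "99luftballons",
                                "bad{key", "also bad" };
      int i;
      for (i = 0; i < 4; i++) {
         const char *s = samples[i];
         int ok = 1;
         while (*s)
            if (!is_name_char (*s++)) { ok = 0; break; }
         printf ("%-18s %s\n", samples[i],
                 ok ? "name characters only" : "not a name");
      }
      return 0;
   }
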
The string lexer recognizes C<lbrace>, C<rbrace>, C<lparen>, and
C<rparen> tokens in order to count brace- or parenthesis-depth.  This
is necessary so it knows when to accept a string delimited by braces or
parentheses.  (Note that a parenthesis-delimited string is only allowed
after C<@comment>---this is not a normal BibTeX construct.)  In
addition, it converts each non-space whitespace character (newline,
carriage-return, and tab) to a single space.  (Sequences of whitespace
are not collapsed; that's the domain of string post-processing, which
is well removed from the scanner or parser.)  Finally, it accepts C<">
to delimit quote-delimited strings.  Apart from those restrictions, the
string lexer accepts anything up to the end-of-string delimiter.
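
As an illustration of the depth-counting and whitespace conversion just
described, here is a stand-alone sketch of reading one brace-delimited
string.  It is not B<btparse>'s string lexer (that is part of the
PCCTS-generated scanner); the function name is invented.

   #include <stdio.h>

   /* Illustrative only: copy a brace-delimited string (starting just
    * past the opening '{') into 'out', converting newline, carriage
    * return, and tab to single spaces, and stopping when brace depth
    * returns to zero.  Returns the number of characters consumed, or
    * -1 if the string is unterminated.  Note that whitespace is
    * converted but not collapsed.
    */
   static int read_braced_string (const char *in, char *out, int outsize)
   {
      int depth = 1;
      int i, o = 0;

      for (i = 0; in[i] != '\0'; i++) {
         char c = in[i];
         if (c == '{')
            depth++;
         if (c == '}' && --depth == 0) {   /* matching closer: done */
            out[o] = '\0';
            return i + 1;
         }
         if (c == '\n' || c == '\r' || c == '\t')
            c = ' ';                       /* non-space whitespace -> space */
         if (o + 1 < outsize)
            out[o++] = c;
      }
      return -1;                           /* unbalanced braces */
   }

   int main (void)
   {
      char value[256];
      const char *input = "The {\\TeX book},\n  volume 1}, and more";
      if (read_braced_string (input, value, sizeof value) > 0)
         printf ("[%s]\n", value);
      return 0;
   }
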
=head1 SYNTACTIC GRAMMAR

(The language used to describe the grammar here is the extended
Backus-Naur Form (EBNF) used by PCCTS.  Terminals are represented by
uppercase strings, non-terminals by lowercase strings; terminal names
are the same as those given in the lexical grammar above.  C<( foo )*>
means zero or more repetitions of the C<foo> production, and C<{ foo }>
means an optional C<foo>.)

A file is just a sequence of zero or more entries:

   bibfile : ( entry )*

An entry is an at-sign, a name (the "entry type"), and the entry body:

   entry : AT NAME body

A body is either a string (this alternative is only tried if the entry
type is C<"comment">) or the entry contents:

   body : STRING
        | ENTRY_OPEN contents ENTRY_CLOSE

(C<ENTRY_OPEN> and C<ENTRY_CLOSE> are either C<{> and C<}> or C<(> and
C<)>, depending on what is seen in the input for a particular entry.)
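
For example, both of the following hypothetical entries are accepted,
one with braces and one with parentheses as C<ENTRY_OPEN> and
C<ENTRY_CLOSE>; the C<@comment> entry takes the C<STRING> alternative
instead.  (The entries are invented for illustration and simply wrapped
in a trivial C program.)

   #include <stdio.h>

   /* Invented examples: one entry in each delimiter style, plus an
    * @comment entry whose body is lexed as a single string.
    */
   static const char *bodies[] = {
      "@misc{a-key, note = \"braces as entry delimiters\" }\n",
      "@misc(a-key, note = \"parens as entry delimiters\" )\n",
      "@comment{everything in here is one STRING token}\n"
   };

   int main (void)
   {
      int i;
      for (i = 0; i < 3; i++)
         fputs (bodies[i], stdout);
      return 0;
   }
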
There are three possible productions for the "contents" non-terminal.
Only one applies to any given entry, depending on the entry metatype
(which in turn depends on the entry type).  Currently, B<btparse>
supports four entry metatypes: comment, preamble, macro definition, and
regular.  The first two correspond to C<@comment> and C<@preamble>
entries; "macro definition" is for C<@string> entries; and "regular" is
for all other entry types.  (The library will be extended to handle
C<@modify> and C<@alias> entry types, and corresponding "modify" and
"alias" metatypes, when BibTeX 1.0 is released and the exact syntax is
known.)  The "metatype" concept is necessary so that all entry types
that aren't specifically recognized fall into the "regular" metatype.
It's also convenient not to have to C<strcmp> the entry type all the
time.

   contents : ( NAME | NUMBER ) COMMA fields
            | fields
            | value

Note that the entry key is not just a C<NAME>, but C<( NAME | NUMBER )>.
This is necessary because BibTeX allows all-numeric entry keys, but
B<btparse>'s lexical scanner recognizes such digit strings as C<NUMBER>
tokens.
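
For concreteness, here is a hypothetical entry for each metatype that
reaches the C<contents> rule, paired (in comments) with the alternative
it uses.  The entries are invented examples; a comment entry never gets
this far, since its body is a single string.

   #include <stdio.h>

   /* Invented example entries, one per alternative of "contents". */
   static const char *examples[] = {
      /* regular:  contents : ( NAME | NUMBER ) COMMA fields */
      "@article{knuth:84,\n"
      "  author = \"Donald E. Knuth\",\n"
      "  year   = 1984\n"
      "}\n",

      /* macro definition:  contents : fields */
      "@string{ aw = \"Addison-Wesley\" }\n",

      /* preamble:  contents : value */
      "@preamble{ \"\\newcommand{\\noop}[1]{}\" }\n"
   };

   int main (void)
   {
      int i;
      for (i = 0; i < 3; i++)
         fputs (examples[i], stdout);
      return 0;
   }
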
C<fields> is a comma-separated list of fields, with an optional single
trailing comma:

   fields : field { COMMA fields }
          |
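
For instance, this hypothetical entry is accepted; the comma after the
last field matches the optional trailing C<COMMA>.  (Again, the sample
is just data embedded in a trivial C program.)

   #include <stdio.h>

   int main (void)
   {
      /* Invented example: two fields plus a single trailing comma. */
      fputs ("@book{key,\n"
             "  title  = \"Some Title\",\n"
             "  editor = \"Somebody\",\n"    /* trailing comma is allowed */
             "}\n", stdout);
      return 0;
   }
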
A C<field> is a single "field = value" assignment:

   field : NAME EQUALS value

Note that C<NAME> here is a restricted version of the "name" token
described in L<"LEXICAL GRAMMAR"> above.  Any "name" token will be
accepted by the parser, but it is immediately checked to ensure that it
doesn't begin with a digit; if it does, an artificial syntax error is
triggered.  (This is for compatibility with BibTeX, which doesn't allow
field names to start with a digit.)
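
A sketch of that check (the function name is invented; B<btparse>
performs the equivalent test inside its parser):

   #include <ctype.h>
   #include <stdio.h>

   /* Illustrative only: a field name is a lexical "name" that must
    * not begin with a digit.  Entry keys have no such restriction.
    */
   static int ok_field_name (const char *name)
   {
      return name[0] != '\0' && !isdigit ((unsigned char) name[0]);
   }

   int main (void)
   {
      printf ("title      -> %s\n",
              ok_field_name ("title") ? "ok" : "syntax error");
      printf ("2nd-author -> %s\n",
              ok_field_name ("2nd-author") ? "ok" : "syntax error");
      return 0;
   }
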
A C<value> is a series of simple values joined by C<'#'> characters:

   value : simple_value ( HASH simple_value )*

A simple value is a string, number, or name (for macro invocations):

   simple_value : STRING
                | NUMBER
                | NAME
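
So a compound value such as the one in this invented entry concatenates
a quoted string, a macro invocation, and a bare number (the C<@string>
definition is included so the macro name means something):

   #include <stdio.h>

   int main (void)
   {
      /* Invented example: simple values joined by '#' -- a STRING,
       * a NAME (macro invocation), and a NUMBER.
       */
      fputs ("@string{jan = \"January\"}\n"
             "@misc{key,\n"
             "  month = \"1--7 \" # jan # 1999\n"
             "}\n", stdout);
      return 0;
   }
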
=head1 SEE ALSO

L<btparse>

=head1 AUTHOR

Greg Ward <gward@python.net>