
NAME

bt_language - the BibTeX data language, as recognized by btparse

INTRODUCTION

One of the problems with BibTeX is that there is no formal specification of the language. This means that users exploring the arcane corners of the language are largely on their own, and programmers implementing their own parsers are completely on their own---except for observing the behaviour of the original implementation.

Other parser implementors (Nelson Beebe of bibclean fame, in particular) have taken the trouble to explain the language accepted by their parser, and in that spirit the following is presented.

If you are unfamiliar with the arcana of regular and context-free languages, you will not have an easy time understanding this. This is not an introduction to the BibTeX language; any LaTeX book would be more suitable for learning the data language itself.

LEXICAL GRAMMAR

The lexical scanner has three distinct modes: top-level, in-entry, and string. Roughly speaking, top-level is the initial mode; we enter in-entry mode on seeing an @ at top-level; and on seeing the } or ) that ends the entry, we return to top-level. We enter string mode on seeing a " or non-entry-delimiting { from in-entry mode. Note that the lexical language is both non-regular (because braces must balance) and context-sensitive (because { can mean different things depending on its syntactic context). That said, we will use regular expressions to describe the lexical elements, because they are the starting point used by the lexical scanner itself. The rest of the lexical grammar will be informally explained in the text.

From top-level, the following tokens are recognized according to the regular expressions on the right:

   at                    \@
   newline               \n
   comment               \%~[\n]*\n
   whitespace            [\ \r\t]+
   junk                  ~[\@\n\ \r\t]+

(Note that this is PCCTS regular expression syntax, which should be fairly familiar to users of other regex engines. One oddity is that a character class is negated as ~[...] rather than [^...].)

On seeing an at token at top-level, we enter in-entry mode. Whitespace, junk, newlines, and comments are all skipped, with the latter two incrementing a line counter. (Junk is explicitly recognized to allow for bibtex's "implicit comment" scheme.)
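As a rough illustration of the top-level mode, the sketch below translates the five patterns into Python's re syntax (so ~[...] becomes [^...]) and skips input until an at token is found. The function name skip_to_entry is hypothetical and not part of btparse. The sketch also assumes that trying the comment pattern before the junk pattern gives the same result as a longest-match scanner, since a comment always extends to the newline while junk stops at the first whitespace.

```python
import re

# Python translations of the PCCTS top-level patterns.
# comment is listed before junk so that "%..." up to a newline wins,
# mimicking a longest-match scanner (an assumption of this sketch).
TOP_LEVEL = re.compile(
    r"(?P<at>@)"
    r"|(?P<newline>\n)"
    r"|(?P<comment>%[^\n]*\n)"
    r"|(?P<whitespace>[ \r\t]+)"
    r"|(?P<junk>[^@\n \r\t]+)"
)


def skip_to_entry(text, pos=0, line=1):
    """Skip top-level input starting at pos.

    Returns (offset just past the '@', current line number), or
    (None, line number) if the input ends without an at token.
    """
    while pos < len(text):
        # The five patterns cover every possible character, so this
        # match cannot fail.
        m = TOP_LEVEL.match(text, pos)
        pos = m.end()
        if m.lastgroup == "at":
            return pos, line
        if m.lastgroup in ("newline", "comment"):
            line += 1  # both tokens consume exactly one newline
    return None, line
```

Everything before the returned offset has been discarded as implicit comment; the caller would then switch to in-entry mode.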

From in-entry mode, we recognize newline, comment, and whitespace identically to top-level mode. In addition, the following tokens are recognized:

   name                  [a-z][a-z0-9:\+/'\.\-_]*
   number                [0-9]+
   lbrace                \{
   rbrace                \}
   lparen                \(
   rparen                \)
   equals                =
   hash                  \#
   comma                 ,
   quote                 \"
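The in-entry patterns can be tried out the same way. The following is only a sketch of the regular-expression layer: it tokenizes a fragment that contains no strings and ignores the mode switches described in the text. The name in_entry_tokens is hypothetical. The name pattern is used exactly as written above, so uppercase letters are rejected here; whether and where btparse folds case is not covered by this sketch.

```python
import re

# Python translations of the in-entry patterns, plus the three
# skipped token kinds shared with top-level mode.
IN_ENTRY = re.compile(
    r"(?P<newline>\n)"
    r"|(?P<comment>%[^\n]*\n)"
    r"|(?P<whitespace>[ \r\t]+)"
    r"|(?P<name>[a-z][a-z0-9:+/'.\-_]*)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<lbrace>\{)"
    r"|(?P<rbrace>\})"
    r"|(?P<lparen>\()"
    r"|(?P<rparen>\))"
    r"|(?P<equals>=)"
    r"|(?P<hash>#)"
    r"|(?P<comma>,)"
    r'|(?P<quote>")'
)

SKIPPED = ("newline", "comment", "whitespace")


def in_entry_tokens(text):
    """Yield (kind, lexeme) pairs for text, dropping skipped tokens."""
    pos = 0
    while pos < len(text):
        m = IN_ENTRY.match(text, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at offset {pos}"
            )
        pos = m.end()
        if m.lastgroup not in SKIPPED:
            yield m.lastgroup, m.group()
```

For example, the fragment year = 1994 # jan, yields name, equals, number, hash, name, and comma tokens in that order.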

At this point, the lexical scanner starts to sound suspiciously like a context-free grammar, rather than a collection of independent regular expressions. However, it is necessary to keep this complexity in the scanner because certain characters ({ and ( in particular) have very different lexical meanings depending on the tokens that have preceded them in the input stream.

In particular, { and ( are treated as "entry openers" if they follow one at and one name token, unless the value of the name token is "comment". (Note the switch from top-level to in-entry between the two tokens.) In the @comment case, the delimiter is considered as starting a string, and we enter string mode. Otherwise, the delimiter is saved, and when we see a corresponding } or ) it is considered an "entry closer". (Braces are balanced for free here because the string lexer takes care of counting brace-depth.)
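The decision described above can be sketched as a small function. This is an illustration under stated assumptions, not btparse's code: classify_delimiter and the role names are hypothetical, the entry type is compared case-insensitively (an assumption), and a ( that is not an entry opener is reported as a plain lparen token.

```python
# Closing delimiter that corresponds to each possible entry opener.
CLOSERS = {"{": "}", "(": ")"}


def classify_delimiter(preceding, delim):
    """Classify an opening '{' or '(' seen in in-entry mode.

    preceding: (kind, text) pairs for the tokens seen so far in this
    entry, starting with the at token.
    Returns (role, expected_closer); expected_closer is None unless the
    delimiter opens an entry.
    """
    after_at_and_name = (
        len(preceding) == 2
        and preceding[0][0] == "at"
        and preceding[1][0] == "name"
    )
    if after_at_and_name:
        if preceding[1][1].lower() == "comment":
            # @comment: the delimiter starts a string instead.
            return "string_start", None
        # Save the matching closer so the entry can be ended later.
        return "entry_open", CLOSERS[delim]
    if delim == "{":
        return "string_start", None
    return "lparen", None
```

Saving the expected closer is what lets the scanner treat only the matching } or ) as the end of the entry.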

Anywhere else, { is considered as starting a string, and we enter string mode. " always starts a string, regardless of context. The other tokens (name, number, equals, hash, and comma) are recognized unconditionally. Note that name here (used for entry types, citation keys, field names, and macro names) is quite different from the original BibTeX definition. According to the BibTeX documentation, a name is anything except a certain sequence of characters; this means a field name of @!__37} is perfectly legal with BibTeX; in the name of decency, btparse rejects such a monstrosity.

The string lexer recognizes lbrace, rbrace, lparen, and rparen tokens in order to count brace- or parenthesis-depth. This is necessary so it knows when to accept a string delimited by braces or parentheses. (Note that a parenthesis-delimited string is only allowed after @comment---this is not a normal BibTeX construct.) In addition, it converts each non-space whitespace character (newline, carriage-return, and tab) to a single space. (Sequences of whitespace are not collapsed; that's the domain of string post-processing, which is well removed from the scanner or parser.) Finally, it accepts " to delimit quote-delimited strings. Apart from those restrictions, the string lexer accepts anything up to the end-of-string delimiter.
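A minimal sketch of the brace-counting part of the string lexer follows. It handles only brace-delimited strings, and it assumes the returned string excludes the outer delimiters; scan_braced_string is a hypothetical name, not a btparse function.

```python
def scan_braced_string(text, start):
    """Scan a brace-delimited string beginning at text[start] == '{'.

    Returns (contents without the outer braces, offset just past the
    closing brace). Newline, carriage return, and tab each become one
    space; runs of whitespace are not collapsed.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at start of string")
    depth = 1
    out = []
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # The brace that balances the opener ends the string.
                return "".join(out), i + 1
        out.append(" " if ch in "\n\r\t" else ch)
    raise ValueError("unterminated string")
```

Inner braces are kept in the contents and only affect the depth count, so a nested group such as {TeX} survives intact.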

SYNTACTIC GRAMMAR

(The language used to describe the grammar here is the extended Backus-Naur Form (EBNF) used by PCCTS. Terminals are represented by uppercase strings, non-terminals by lowercase strings; terminal names are mostly the same as those given in the lexical grammar above, apart from the conversion to uppercase. ( foo )* means zero or more repetitions of the foo production, and { foo } means an optional foo.)

A file is just a sequence of zero or more entries:

   bibfile : ( entry )*

An entry is an at-sign, a name (the "entry type"), and the entry body:

   entry : AT NAME body

A body is either a string (this alternative is only tried if the entry type is "comment") or the entry contents:

   body : STRING
        | ENTRY_OPEN contents ENTRY_CLOSE

(ENTRY_OPEN and ENTRY_CLOSE are either { and } or ( and ), depending on what is seen in the input for a particular entry.)

There are three possible productions for the "contents" non-terminal. Only one applies to any given entry, depending on the entry metatype (which in turn depends on the entry type). Currently, btparse supports four entry metatypes: comment, preamble, macro definition, and regular. The first two correspond to @comment and @preamble entries; "macro definition" is for @string entries; and "regular" is for all other entry types. (The library will be extended to handle @modify and @alias entry types, and corresponding "modify" and "alias" metatypes, when BibTeX 1.0 is released and the exact syntax is known.) The "metatype" concept is necessary so that all entry types that aren't specifically recognized fall into the "regular" metatype. It's also convenient not to have to strcmp the entry type all the time.
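The mapping from entry type to metatype is simple enough to sketch directly. The function name and the string labels below are hypothetical, and the case-insensitive comparison is an assumption of the sketch.

```python
# Entry types that get a metatype of their own; everything else is regular.
SPECIAL_METATYPES = {
    "comment": "comment",
    "preamble": "preamble",
    "string": "macro definition",
}


def metatype(entry_type):
    """Return the metatype label for an entry type such as 'article'."""
    return SPECIAL_METATYPES.get(entry_type.lower(), "regular")
```

Computing the metatype once per entry is what avoids repeated string comparisons on the entry type later in the parse.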

   contents : NAME COMMA fields     # for regular entries
            | fields                # for macro definition entries
            | value                 # for preamble entries

fields is a possibly empty, comma-separated list of fields, with an optional single trailing comma:

   fields : field { COMMA fields }
          | 

A field is a single "field = value" assignment:

   field : NAME EQUALS value

A value is a series of simple values joined by '#' characters:

   value : simple_value ( HASH simple_value )*

A simple value is a string, number, or name (for macro invocations):

   simple_value : STRING
                | NUMBER
                | NAME
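The fields, field, value, and simple_value productions map naturally onto a recursive-descent recognizer. The sketch below works on a list of (KIND, text) token pairs; the token representation, the function names, and the shape of the returned data are all assumptions made for illustration and do not reflect btparse's actual interface or AST.

```python
def parse_fields(toks, i=0):
    """fields : field { COMMA fields } | (empty)

    Returns (list of (name, value) pairs, next token index).
    """
    if i >= len(toks) or toks[i][0] != "NAME":
        return [], i  # the empty alternative
    field, i = parse_field(toks, i)
    rest = []
    if i < len(toks) and toks[i][0] == "COMMA":
        rest, i = parse_fields(toks, i + 1)
    return [field] + rest, i


def parse_field(toks, i):
    """field : NAME EQUALS value"""
    name = expect(toks, i, "NAME")
    expect(toks, i + 1, "EQUALS")
    value, i = parse_value(toks, i + 2)
    return (name, value), i


def parse_value(toks, i):
    """value : simple_value ( HASH simple_value )*"""
    parts = [parse_simple_value(toks, i)]
    i += 1
    while i < len(toks) and toks[i][0] == "HASH":
        parts.append(parse_simple_value(toks, i + 1))
        i += 2
    return parts, i


def parse_simple_value(toks, i):
    """simple_value : STRING | NUMBER | NAME"""
    if i < len(toks) and toks[i][0] in ("STRING", "NUMBER", "NAME"):
        return toks[i]
    raise SyntaxError(f"expected a simple value at token {i}")


def expect(toks, i, kind):
    """Return the text of token i if it has the given kind, else fail."""
    if i < len(toks) and toks[i][0] == kind:
        return toks[i][1]
    raise SyntaxError(f"expected {kind} at token {i}")
```

Because fields may be empty, the recursive call after a COMMA accepts a single trailing comma without any special case.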
