The Boost C++ Libraries has been updated. The second edition was published in September 2014 and introduces 72 Boost libraries with more than 430 examples. It is available at Amazon, Barnes and Noble, for Kindle, as an Epub and as a PDF file. The second edition is based on C++11 and the Boost libraries 1.55.0 and 1.56.0 with the latter version having been released in August 2014.

Find the second edition online at http://theboostcpplibraries.com/

The Boost C++ Libraries


Chapter 12: Parser


Table of Contents

This book is licensed under a Creative Commons License.


12.1 General

Parsers are used to read formats that allow a flexible but potential complicated structure of data. A good example of such a format is C++ code. The parser of the compiler needs to understand the various language constructs in all possible combinations of C++ in order to translate them to a binary form.

The major issue while developing parsers is usually the sheer amount of rules by which the corresponding data is structured. For example, C++ supports so many language constructs that the development of a corresponding parser would require countless if expressions to recognize any imaginable C++ code as being valid.

The library Boost.Spirit introduced in this chapter turns the tables on developing parsers. Instead translating explicit rules into code and validating this code using countless if expressions, Boost.Spirit allows to express the rules using the so-called Extended Backus-Naur Form. Using these rules, Boost.Spirit then can parse a C++ source code file accordingly.

The basic idea of Boost.Spirit is similar to regular expressions. Instead of searching a text for a specific pattern using if expressions, the pattern is rather specified as a regular expression. The search is then performed by a library such as Boost.Regex so that the developer does not need to care about the details.

This chapter shows how to use Boost.Spirit to read complicated formats for which regular expressions are no longer feasible. Since Boost.Spirit is a comprehensive library introducing different concepts, a simple parser for the JSON format is developed throughout the chapter. JSON is an existing format which is used by e.g. Ajax applications to exchange data similar to XML between applications that potentially may even run on different platforms.

Even though Boost.Spirit simplifies the development of parsers, no one succeeded to write a C++ parser based on this library yet. The development of such a parser is still a long-term goal of Boost.Spirit, however, due to the complexity of the C++ language, has not been achieved so far. Boost.Spirit is currently not well suited for these kind of complex or binary formats.


12.2 Extended Backus-Naur Form

The Backus-Naur Form, abbreviated BNF, is a language to precisely describe rules and is used in many technical specifications. For example, many specifications of the numerous Internet protocols, the so-called request for comments, contain the rules in BNF in addition to annotations in text.

Boost.Spirit supports the Extended Backus-Naur Form (EBNF) which allows to specify rules in a shorter form than BNF does. The main advantage of EBNF is a shortened and thus simplified notation.

Please note that there are different variations of EBNF which may differ in their syntax. This chapter as well as Boost.Spirit itself uses the EBNF which syntax resembles the one of regular expressions.

In order to use Boost.Spirit, one needs to understand EBNF accordingly. Most of the time, developers already know EBNF and therefore select Boost.Spirit to reuse rules previously specified in EBNF. Below is a short introduction to EBNF; for a quick reference of the syntax used in this chapter and by Boost.Spirit, please refer to the W3C XML specification that contains a short summary.

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Strictly speaking, EBNF denotes rules as production rules. Any number of production rules can be combined to describe a particular format. The above format consists of only one production rule. It defines a digit that is made up of a number between 0 and 9.

Definitions such as digit are called nonterminal symbols. The numeric values 0 to 9 in the above definition are called terminal symbols instead. These symbols do not have any special meaning and can be easily identified since they are enclosed by double quotes.

All numeric values are connected by the pipe operator which has the same meaning as the || operator in C++: It denotes an alternative.

Summarized, the production rule specifies that any number between 0 and 9 is a digit.

integer = ("+" | "-")? digit+

The new nonterminal symbol integer consists of at least a digit that can be preceded by a plus or minus sign.

The definition of integer uses a couple of new operators. Parenthesis are used to create partial expressions just like in mathematics. Other operators can then be applied to these expressions. The question mark denotes that the partial expression can only be declared either once or not at all.

The plus sign following digit indicates that the corresponding expression must be declared at least once.

This new production rule defines an arbitrary positive or negative integer. While a digit consists of exactly one digit, an integer can consist of any combination of digits which can also be marked unsigned or signed. Whereas 5 is both a digit and an integer, +5 is solely an integer. The same applies to 169 or -8 which are also integer only.

More and more complex production rules can be created by simply defining and combining nonterminal symbols.

real = integer "." digit*

While the definition of integer represents integers, the definition of real represents floating-point numbers. The rule is based on the already defined nonterminal symbols integer and digit separated by a dot. The asterisk after digit specifies that digits after the dot are optional: There can either be an arbitrary number or none.

Floating-point numbers such as 1.2, -16.99 or even 3. satisfy the definition of real. However, the current definition does not allow floating-point numbers without a leading zero such as .9.

As mentioned in the beginning of this chapter, a parser for the JSON format is going to be developed using Boost.Spirit. For this purpose, the rules of the JSON format need to be expressed in EBNF.

object = "{" member ("," member)* "}"
member = string ":" value
string = '"' character* '"'
value = string | number | object | array | "true" | "false" | "null"
number = integer | real
array = "[" value ("," value)* "]"
character = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

The JSON format is based on objects that contain pairs of keys and values enclosed by curly braces. While the keys are simple strings, values can range from strings, numeric values, arrays, other objects or the literal true, false or null. Strings are a consecutive character stream enclosed by double quotes. Numeric values can either be integers or floating-point numbers. Arrays contain values separated by comma that are enclosed by squared brackets.

Please note that the above definition is not complete. On the one hand, the definition of character lacks capital letters as well as additional characters; on the other hand, JSON supports specifics such as Unicode or control characters. This can be currently ignored since Boost.Spirit defines frequently used nonterminal symbols such as alphanumeric characters to save one from typing endless character streams. In addition, a string will be later defined in code as a consecutive stream of arbitrary characters excluding double quotes. Since double quotes finalize the string, all other characters can be therefore used within a string. The above EBNF does not express this accordingly since EBNF requires the definition of a nonterminal symbol for all characters except the ones that should be excluded in order to define an exception.

The following is an example for a JSON format for which above definition applies.

{
  "Boris Schäling" :
  {
    "Male": true,
    "Programming Languages": [ "C++", "Java", "C#" ],
    "Age": 31
  }
}

The global object is characterized by the outer curly braces and contains a key-value pair. The key is named "Boris Schäling", the value is a new object that contains multiple key-value pairs. While all the keys are strings, the values are the literal true, an array containing several strings and a numeric value.

The EBNF rules defined above can now be used to develop a parser using Boost.Spirit that is able to read the above JSON format.


12.3 Grammar

After defining the rules for the JSON format in EBNF in the previous section, they now need to be used together with Boost.Spirit somehow. Boost.Spirit actually allows to define the EBNF rules as C++ code by overloading the different operators used by EBNF.

Please note that the EBNF rules need to be modified slightly in order to create valid C++ code. Symbols that are combined by a space in EBNF need to be combined using some operator in C++. Additionally, operators such as the asterisk, the question mark and the plus sign, placed directly behind the corresponding symbol in EBNF, must be placed in front of the symbol in order to use them as unary operators in C++.

The following shows the EBNF rules for the JSON format expressed in C++ code for Boost.Spirit.

#include <boost/spirit.hpp> 

struct json_grammar 
  : public boost::spirit::grammar<json_grammar> 
{ 
  template <typename Scanner> 
  struct definition 
  { 
    boost::spirit::rule<Scanner> object, member, string, value, number, array; 

    definition(const json_grammar &self) 
    { 
      using namespace boost::spirit; 
      object = "{" >> member >> *("," >> member) >> "}"; 
      member = string >> ":" >> value; 
      string = "\"" >> *~ch_p("\"") >> "\""; 
      value = string | number | object | array | "true" | "false" | "null"; 
      number = real_p; 
      array = "[" >> value >> *("," >> value) >> "]"; 
    } 

    const boost::spirit::rule<Scanner> &start() 
    { 
      return object; 
    } 
  }; 
}; 

int main() 
{ 
} 

To use the different classes of Boost.Spirit, the header boost/spirit.hpp needs to be included. The classes are provided within the namespace boost::spirit.

In order to create a parser with Boost.Spirit, a so-called grammar must be created which among other things defines the rules after which the data is structured. In the above example the json_grammar class has been developed which is derived from the template class boost::spirit::grammar and is instantiated with the name of the class. json_grammar defines the complete grammar necessary to understand the JSON format.

One important component of the grammar are the rules to read structured data correctly. These rules are defined within an inner class named definition - this name is mandatory. This class is a template with one argument which is instantiated with a so-called scanner by Boost.Spirit. A scanner is a concept internally used by Boost.Spirit. Even though it is mandatory for definition to be a template taking a scanner type as its argument, it is irrelevant for the daily use of Boost.Spirit what these scanners actually are and why they are defined.

definition must define a method named start() which is called by Boost.Spirit to obtain the complete rules and standards of the grammar. The return value of this method is a constant reference to boost::spirit::rule that is also a template class instantiated with the scanner type.

The class boost::spirit::rule is used to define rules. Nonterminal symbols are defined by means of this class. The previously defined nonterminal symbols object, member, string, value, number and array are all of type boost::spirit::rule.

All of these objects are defined as properties of the definition class which is not mandatory but alleviates the definition especially if rules reference each other recursively. As seen with the EBNF examples in the previous section, recursive references are not an issue.

At first glance, the definitions of the rules within the constructor of definition look similar to the production rules of the EBNF seen in the previous section. This does not come as a surprise though since this is exactly the goal of Boost.Spirit: To reuse production rules defined in the EBNF.

While the C++ code resembles the rules established in the EBNF, there are certainly small differences in order to write valid C++. For example, all of the symbols are concatenated via the >> operator. EBNF operators such as the asterisk are placed in front of the symbols instead of following them. Despite these syntax changes, Boost.Spirit strives to convert EBNF rules to C++ code without many changes.

The constructor of definition uses two classes provided by Boost.Spirit: boost::spirit::ch_p and boost::spirit::real_p. Frequently used rules are provided in the form of parsers that can easily be reused. One example is boost::spirit::real_p which allows storing positive and negative integers and floating-point numbers without the need to define nonterminal symbols such as digit or real.

boost::spirit::ch_p can be used to create a parser for a single character which is equal to enclosing the character with double quotes. In the above example, the usage of boost::spirit::ch_p is mandatory since the tilde and asterisk are applied to the double quote sign. Without the class, the code would read *~"\"" which would be rejected by the compiler as invalid code.

The tilde actually allows the trick mentioned in the previous section: By prefixing the double quote with the tilde, all characters except the double quote are accepted.

After the rules for identifying the JSON format are defined, the following example shows how to use them.

#include <boost/spirit.hpp> 
#include <fstream> 
#include <sstream> 
#include <iostream> 

struct json_grammar 
  : public boost::spirit::grammar<json_grammar> 
{ 
  template <typename Scanner> 
  struct definition 
  { 
    boost::spirit::rule<Scanner> object, member, string, value, number, array; 

    definition(const json_grammar &self) 
    { 
      using namespace boost::spirit; 
      object = "{" >> member >> *("," >> member) >> "}"; 
      member = string >> ":" >> value; 
      string = "\"" >> *~ch_p("\"") >> "\""; 
      value = string | number | object | array | "true" | "false" | "null"; 
      number = real_p; 
      array = "[" >> value >> *("," >> value) >> "]"; 
    } 

    const boost::spirit::rule<Scanner> &start() 
    { 
      return object; 
    } 
  }; 
}; 

int main(int argc, char *argv[]) 
{ 
  std::ifstream fs(argv[1]); 
  std::ostringstream ss; 
  ss << fs.rdbuf(); 
  std::string data = ss.str(); 

  json_grammar g; 
  boost::spirit::parse_info<> pi = boost::spirit::parse(data.c_str(), g, boost::spirit::space_p); 
  if (pi.hit) 
  { 
    if (pi.full) 
      std::cout << "parsing all data successfully" << std::endl; 
    else 
      std::cout << "parsing data partially" << std::endl; 
    std::cout << pi.length << " characters parsed" << std::endl; 
  } 
  else 
    std::cout << "parsing failed; stopped at '" << pi.stop << "'" << std::endl; 
} 

Boost.Spirit offers a free-standing function named boost::spirit::parse(). By creating an instance of a grammar, a parser is created accordingly which is passed to boost::spirit::parse() as the second argument. The first argument denotes the text to be parsed while the third argument is a parser indicating which characters should be skipped in the given text. To skip spaces, an object of type boost::spirit::space_p is passed as the third argument. This simply means that an arbitrary number of spaces is allowed between the data to be captured - in other words, everywhere the >> operator was applied in the rules. Tabulators and line breaks are included which allows for a more flexible notation of data formats.

boost::spirit::parse() returns an object of type boost::spirit::parse_info that offers four properties indicating whether or not the text was successfully parsed. The hit property is set to true if the text was successfully parsed. If all characters in the text were parsed without e.g. remaining spaces at the end, full is also set to true. Only if hit equals true, length is valid and contains the number of characters parsed successfully.

The length property must not be accessed if the text could not be parsed successfully. In this case, the stop property allows to access the text location where parsing stopped. While stop can also be accessed if the text was parsed successfully, it does not make much sense since it will point behind the parsed text in this case.


12.4 Actions

So far, one knows how to define a grammar in order to obtain a new parser that can be used to identify whether a specific text is structured according to the rules of the grammar. However, the data format still is not interpreted at this point since the individual data read from a structured format such as JSON are not processed further.

In order to process data satisfying a specific rule as identified by the parser, actions are used. Actions are functions that are associated with rules. If the parser identifies data satisfying a specific rule, the associated action is executed, passing the data to be processed accordingly as shown in the following example.

#include <boost/spirit.hpp> 
#include <string> 
#include <fstream> 
#include <sstream> 
#include <iostream> 

struct json_grammar 
  : public boost::spirit::grammar<json_grammar> 
{ 
  struct print 
  { 
    void operator()(const char *begin, const char *end) const 
    { 
      std::cout << std::string(begin, end) << std::endl; 
    } 
  }; 

  template <typename Scanner> 
  struct definition 
  { 
    boost::spirit::rule<Scanner> object, member, string, value, number, array; 

    definition(const json_grammar &self) 
    { 
      using namespace boost::spirit; 
      object = "{" >> member >> *("," >> member) >> "}"; 
      member = string[print()] >> ":" >> value; 
      string = "\"" >> *~ch_p("\"") >> "\""; 
      value = string | number | object | array | "true" | "false" | "null"; 
      number = real_p; 
      array = "[" >> value >> *("," >> value) >> "]"; 
    } 

    const boost::spirit::rule<Scanner> &start() 
    { 
      return object; 
    } 
  }; 
}; 

int main(int argc, char *argv[]) 
{ 
  std::ifstream fs(argv[1]); 
  std::ostringstream ss; 
  ss << fs.rdbuf(); 
  std::string data = ss.str(); 

  json_grammar g; 
  boost::spirit::parse_info<> pi = boost::spirit::parse(data.c_str(), g, boost::spirit::space_p); 
  if (pi.hit) 
  { 
    if (pi.full) 
      std::cout << "parsing all data successfully" << std::endl; 
    else 
      std::cout << "parsing data partially" << std::endl; 
    std::cout << pi.length << " characters parsed" << std::endl; 
  } 
  else 
    std::cout << "parsing failed; stopped at '" << pi.stop << "'" << std::endl; 
} 

Actions are implemented as functions or function objects. The latter are beneficial if the action should be initialized or maintain state information between repeated executions. The above example implements the action as a function object.

The class print is a function object which writes data to the standard output stream. When called, the overloaded operator()() operator receives a pointer to the beginning and end of the data as identified by the rules executing the particular action.

The example associates the action with the nonterminal symbol string which appears as the first symbol after member. An instance of type print is passed to the nonterminal symbol string inside squared brackets. As string represents the key within key-value pairs of JSON objects, the overloaded operator()() operator of class print is called for every discovered key which simply writes the key to the standard output stream.

It is now possible to define an arbitrary number of actions or to associate them with an arbitrary number of symbols. In order to associate an action with a literal, a parser must be specified explicitly. This is no different from the definition of the nonterminal symbol string specifying the boost::spirit::ch_p class. The following example uses the boost::spirit::str_p class to associate an object of type print with the literal true.

#include <boost/spirit.hpp> 
#include <string> 
#include <fstream> 
#include <sstream> 
#include <iostream> 

struct json_grammar 
  : public boost::spirit::grammar<json_grammar> 
{ 
  struct print 
  { 
    void operator()(const char *begin, const char *end) const 
    { 
      std::cout << std::string(begin, end) << std::endl; 
    } 

    void operator()(const double d) const 
    { 
      std::cout << d << std::endl; 
    } 
  }; 

  template <typename Scanner> 
  struct definition 
  { 
    boost::spirit::rule<Scanner> object, member, string, value, number, array; 

    definition(const json_grammar &self) 
    { 
      using namespace boost::spirit; 
      object = "{" >> member >> *("," >> member) >> "}"; 
      member = string[print()] >> ":" >> value; 
      string = "\"" >> *~ch_p("\"") >> "\""; 
      value = string | number | object | array | str_p("true")[print()] | "false" | "null"; 
      number = real_p[print()]; 
      array = "[" >> value >> *("," >> value) >> "]"; 
    } 

    const boost::spirit::rule<Scanner> &start() 
    { 
      return object; 
    } 
  }; 
}; 

int main(int argc, char *argv[]) 
{ 
  std::ifstream fs(argv[1]); 
  std::ostringstream ss; 
  ss << fs.rdbuf(); 
  std::string data = ss.str(); 

  json_grammar g; 
  boost::spirit::parse_info<> pi = boost::spirit::parse(data.c_str(), g, boost::spirit::space_p); 
  if (pi.hit) 
  { 
    if (pi.full) 
      std::cout << "parsing all data successfully" << std::endl; 
    else 
      std::cout << "parsing data partially" << std::endl; 
    std::cout << pi.length << " characters parsed" << std::endl; 
  } 
  else 
    std::cout << "parsing failed; stopped at '" << pi.stop << "'" << std::endl; 
} 

In addition, the example associates an action with boost::spirit::real_p. While most parsers pass a pointer to the beginning and end of the identified data, boost::spirit::real_p passes the discovered number as double instead. This certainly makes processing of numbers much easier since they do not need to be converted explicitly. In order to pass a value of type double to the action, a corresponding overloaded operator()() operator has been added to print.

Besides the parsers introduced in this chapter such as boost::spirit::str_p or boost::spirit::real_p, Boost.Spirit offers many more. For example, boost::spirit::regex_p is helpful if regular expressions should be used. In addition, parsers to verify conditions or execute loops exist as well. This helps creating dynamic parsers which process data differently depending on conditions. In order to get an overview over the utilities provided by Boost.Spirit, one should take a look at its documentation.


12.5 Exercises

You can buy solutions to all exercises in this book as a ZIP file.

  1. Develop a calculator that can add and subtract arbitrary integers and floating-point numbers. The calculator should accept input such as =-4+8 + 1.5 and display 5.5 as the result accordingly.