The Boost C++ Libraries


Chapter 5: String Handling



This book is licensed under a Creative Commons License.



5.1 General

Strings in the C++ standard are handled by the std::string class which offers many functions for manipulating them. Among these are functions searching a string for a specific character or functions returning a substring. Even though std::string provides more than 100 functions, which makes it one of the more bloated classes of the C++ standard, many developers still miss additional functionality throughout their daily routine. For example, while Java and .NET provide functions to convert a string to uppercase, there is no equivalent available in std::string. The Boost C++ Libraries presented in this chapter try to close this gap.


5.2 Locales

Before the Boost C++ Libraries are introduced though, one should at least take a brief look at locales. Many functions outlined in this chapter will expect a locale as an additional parameter.

Locales are used in the C++ standard to encapsulate cultural conventions such as the currency symbol, date and time formats, the symbol used to separate the integer portion of a number from the fractional one (radix character) as well as the symbol used for grouping numbers with more than three digits (thousands separator).

In terms of string handling, the locale describes which letters a particular culture uses and in what order they are sorted. For instance, whether an alphabet contains umlauts and what place they take in the alphabet depends on the culture.

If a function is called that converts a given string to uppercase, the individual steps taken depend on the particular locale. In the German language, it is obvious that the letter 'ä' is converted to 'Ä'; however, this does not necessarily hold true for other cultures as well.

When working with std::string, locales can be ignored since none of its functions depends on a particular culture. In order to work with the Boost C++ Libraries in this chapter though, this knowledge is mandatory.

The C++ standard defines a class named std::locale in the header locale. Every C++ program automatically has one instance of this class: the global locale, which cannot be accessed directly. Instead, a separate object of type std::locale must be created with the default constructor; it is initialized with the same properties as the global locale.

#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale loc; 
  std::cout << loc.name() << std::endl; 
} 

The above program will output C on the standard output stream which is the name of the classic locale. This locale contains descriptions used by default in programs developed with the C language.

This also happens to be the default global locale for every C++ application. It contains descriptions used by the American culture. For example, the dollar sign is used as the currency symbol, the radix character is a period, and displaying a date causes the month to be written in English.
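The conventions stored in the classic locale can be queried through the facets of std::locale. The following is a minimal sketch using only the standard library; the function names radix_character() and classic_name() are made up for this illustration.

```cpp
#include <locale>
#include <string>

// Returns the radix character (decimal point) prescribed by the given locale.
char radix_character(const std::locale& loc)
{
  return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}

// Returns the name of the classic locale, which is "C".
std::string classic_name()
{
  return std::locale::classic().name();
}
```

Calling radix_character(std::locale::classic()) yields a period, matching the description above.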

The global locale can be changed using the static function global() of the std::locale class.

#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::locale loc; 
  std::cout << loc.name() << std::endl; 
} 

The static global() function expects a new object of type std::locale as its sole parameter. Using a different constructor of the class, expecting a character string of type const char*, a locale object for a particular culture can be created. However, names of locales are not standardized except for the C locale which is named "C" correspondingly. It therefore depends on the individual C++ standard library which names are actually accepted. In case of Visual Studio 2008, the definitions for the German culture can be selected using the language string "German" as outlined in the documentation of language strings.

The program will output German_Germany.1252 if executed. Specifying "German" as the language string selects the definitions for the German primary language and sublanguage as well as the character map 1252.

In case the sublanguage should be set to a different location of the German culture such as the Swiss, a different language string can be used.

#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German_Switzerland")); 
  std::locale loc; 
  std::cout << loc.name() << std::endl; 
} 

Now, the program will output German_Switzerland.1252 instead.
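Because locale names are not standardized, a portable program may want to check whether a name is accepted before installing it as the global locale. The sketch below relies on the fact that the constructor of std::locale throws std::runtime_error for a name it does not recognize; the helper locale_available() is a hypothetical name chosen for this illustration.

```cpp
#include <locale>
#include <stdexcept>

// Returns true if the C++ standard library in use accepts the locale name.
bool locale_available(const char* name)
{
  try
  {
    std::locale loc(name);   // throws std::runtime_error for unknown names
    (void)loc;
    return true;
  }
  catch (const std::runtime_error&)
  {
    return false;
  }
}
```

Only the name "C" is guaranteed to be accepted everywhere. Whether "German" or "German_Switzerland" is available depends on the platform and has to be probed at run time.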

After getting an understanding about locales in general and how the global one can be changed, the following example shows how locales affect string handling.

#include <locale> 
#include <iostream> 
#include <cstring> 

int main() 
{ 
  std::cout << std::strcoll("ä", "z") << std::endl; 
  std::locale::global(std::locale("German")); 
  std::cout << std::strcoll("ä", "z") << std::endl; 
} 

The example uses the std::strcoll() function defined in cstring to compare two strings lexicographically, in other words, to determine which of the two would be found first in a dictionary. The function returns a negative value if the first string sorts before the second one, a positive value if it sorts after it, and 0 if both are equal.

If executed, the program prints 1 followed by -1. Even though the function is called with the same input parameters, the results are different. The reason is quite simple: when std::strcoll() is called the first time, the global C locale is used. However, when it is called the second time, the global locale has been changed to incorporate definitions for the German culture instead. The order of the two characters 'ä' and 'z' is different for these locales as indicated by the output.

Numerous C functions as well as C++ streams access locales. Although the functions of the std::string class work independently of locales, many of the functions outlined in the following sections do not. Hence, locales are met again several times throughout this chapter.


5.3 Boost.StringAlgorithms

The Boost C++ library Boost.StringAlgorithms provides many stand-alone functions for string manipulation. Strings can be of type std::string, std::wstring or any other instantiation of the template class std::basic_string.

The functions are categorized within different header files. For example, functions converting from uppercase to lowercase are defined in boost/algorithm/string/case_conv.hpp. Since Boost.StringAlgorithms consists of more than 20 different categories and as many header files, boost/algorithm/string.hpp acts as the common header including all other header files for convenience. All of the following examples will use this combined header.

As mentioned in the previous section, many functions of the Boost.StringAlgorithms library expect an object of type std::locale as an additional parameter. However, this parameter is optional: if not provided, the global locale is used.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 
#include <clocale> 

int main() 
{ 
  std::setlocale(LC_ALL, "German"); 
  std::string s = "Boris Schäling"; 
  std::cout << boost::algorithm::to_upper_copy(s) << std::endl; 
  std::cout << boost::algorithm::to_upper_copy(s, std::locale("German")) << std::endl; 
} 

The boost::algorithm::to_upper_copy() function is used to convert a string to uppercase. Naturally, there also exists a function doing the opposite: boost::algorithm::to_lower_copy() converts a string to lowercase. Both functions return the converted string as result. If the passed string itself should be converted, the functions boost::algorithm::to_upper() or boost::algorithm::to_lower() can be used instead.

The above example converts the string "Boris Schäling" to uppercase using boost::algorithm::to_upper_copy(). The first call uses the default global locale while the second call explicitly states the locale for the German culture.

Using the latter certainly will result in a correctly converted string since the corresponding uppercase character 'Ä' exists for the lowercase 'ä'. For the C locale instead, 'ä' is an unknown character and thus is not converted. To yield correct results, either pass the correct locale explicitly or modify the global locale before calling boost::algorithm::to_upper_copy().

Note that the program uses std::setlocale(), defined in clocale, to set the locale for any C function. Internally, std::cout uses C functions to display information on the screen. By setting the correct locale, umlauts such as 'ä' and 'Ä' are displayed correctly.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  std::cout << boost::algorithm::to_upper_copy(s) << std::endl; 
  std::cout << boost::algorithm::to_upper_copy(s, std::locale("German")) << std::endl; 
} 

The above program sets the German culture for the global locale which causes the first call to boost::algorithm::to_upper_copy() to use the corresponding definitions for converting 'ä' to 'Ä'.

Please note that std::setlocale() is not called in this example. By setting the global locale using the std::locale::global() function, the C locale is automatically set as well. In practice, C++ programs almost always set the global locale using std::locale::global() rather than using std::setlocale() as seen in the previous example.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  std::cout << boost::algorithm::erase_first_copy(s, "i") << std::endl; 
  std::cout << boost::algorithm::erase_nth_copy(s, "i", 0) << std::endl; 
  std::cout << boost::algorithm::erase_last_copy(s, "i") << std::endl; 
  std::cout << boost::algorithm::erase_all_copy(s, "i") << std::endl; 
  std::cout << boost::algorithm::erase_head_copy(s, 5) << std::endl; 
  std::cout << boost::algorithm::erase_tail_copy(s, 8) << std::endl; 
} 

Boost.StringAlgorithms provides several functions to delete individual characters from a string. How and where the deletion should occur can be explicitly specified. For example, a particular character can be removed from the complete string by using boost::algorithm::erase_all_copy(). If only the first occurrence of the character should be removed, boost::algorithm::erase_first_copy() ought to be used instead. To shorten the string by a specific number of characters on either end, the functions boost::algorithm::erase_head_copy() and boost::algorithm::erase_tail_copy() can be used accordingly.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::iterator_range<std::string::iterator> r = boost::algorithm::find_first(s, "Boris"); 
  std::cout << r << std::endl; 
  r = boost::algorithm::find_first(s, "xyz"); 
  std::cout << r << std::endl; 
} 

Different functions such as boost::algorithm::find_first(), boost::algorithm::find_last(), boost::algorithm::find_nth(), boost::algorithm::find_head() and boost::algorithm::find_tail() are available to find strings within strings.

All of these functions have in common that they return a pair of iterators of type boost::iterator_range. This class originates from the Boost C++ Library Boost.Range which defines a range concept based on the iterator concept. Since the << operator is overloaded for boost::iterator_range, the result of the individual search algorithm can be directly written to the standard output stream. The above program prints Boris for the first result and an empty string for the second one.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 
#include <vector> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::vector<std::string> v; 
  v.push_back("Boris"); 
  v.push_back("Schäling"); 
  std::cout << boost::algorithm::join(v, " ") << std::endl; 
} 

A container of strings is passed as the first parameter to the boost::algorithm::join() function which concatenates them separated by the second parameter. The example will output Boris Schäling accordingly.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  std::cout << boost::algorithm::replace_first_copy(s, "B", "D") << std::endl; 
  std::cout << boost::algorithm::replace_nth_copy(s, "B", 0, "D") << std::endl; 
  std::cout << boost::algorithm::replace_last_copy(s, "B", "D") << std::endl; 
  std::cout << boost::algorithm::replace_all_copy(s, "B", "D") << std::endl; 
  std::cout << boost::algorithm::replace_head_copy(s, 5, "Doris") << std::endl; 
  std::cout << boost::algorithm::replace_tail_copy(s, 8, "Becker") << std::endl; 
} 

Just like functions for searching strings or removing characters from a string, Boost.StringAlgorithms also provides functions for replacing a substring within a string. Among these functions are boost::algorithm::replace_first_copy(), boost::algorithm::replace_nth_copy(), boost::algorithm::replace_last_copy(), boost::algorithm::replace_all_copy(), boost::algorithm::replace_head_copy() and boost::algorithm::replace_tail_copy(). They can be applied the same way as the functions used for searching and removing except that they expect an additional parameter - the replacement string.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "\t Boris Schäling \t"; 
  std::cout << "." << boost::algorithm::trim_left_copy(s) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_right_copy(s) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_copy(s) << "." << std::endl; 
} 

In order to automatically remove spaces on either end of a string, boost::algorithm::trim_left_copy(), boost::algorithm::trim_right_copy() and boost::algorithm::trim_copy() can be used. Which characters count as spaces depends on the global locale.

For several functions, Boost.StringAlgorithms accepts a predicate as an additional parameter that determines to which characters of the string the function is applied. The predicated versions for trimming a string are named boost::algorithm::trim_left_copy_if(), boost::algorithm::trim_right_copy_if() and boost::algorithm::trim_copy_if() accordingly.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "--Boris Schäling--"; 
  std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl; 
} 

The program in the above example accesses another function named boost::algorithm::is_any_of(), a helper function that creates a predicate verifying whether or not a character exists in the string passed as the parameter. Using boost::algorithm::is_any_of(), the characters to trim can be specified; the example uses the hyphen.

Boost.StringAlgorithms already provides numerous helper functions returning commonly used predicates.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "123456789Boris Schäling123456789"; 
  std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl; 
  std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl; 
} 

The predicate returned by boost::algorithm::is_digit() indicates a numeric character by returning the boolean value true. Helper functions are also provided to check whether or not a character is uppercase or lowercase: boost::algorithm::is_upper() and boost::algorithm::is_lower() respectively. All of these functions use the global locale by default unless otherwise specified by passing a different locale as a parameter.

Besides the predicates that verify individual characters of a string, Boost.StringAlgorithms also offers functions that work with strings instead.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  std::cout << boost::algorithm::starts_with(s, "Boris") << std::endl; 
  std::cout << boost::algorithm::ends_with(s, "Schäling") << std::endl; 
  std::cout << boost::algorithm::contains(s, "is") << std::endl; 
  std::cout << boost::algorithm::lexicographical_compare(s, "Boris") << std::endl; 
} 

The functions boost::algorithm::starts_with(), boost::algorithm::ends_with(), boost::algorithm::contains() and boost::algorithm::lexicographical_compare() all compare two individual strings.

The following example shows a function that splits a string into smaller parts.

#include <boost/algorithm/string.hpp> 
#include <locale> 
#include <iostream> 
#include <vector> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  std::vector<std::string> v; 
  boost::algorithm::split(v, s, boost::algorithm::is_space()); 
  std::cout << v.size() << std::endl; 
} 

Using boost::algorithm::split(), a given string can be split into a container based on a certain delimiter. The function requires a predicate as its third parameter indicating for each character whether the string should be split at the given position. The example uses the helper function boost::algorithm::is_space() to create a predicate that will split the string at every space character.

Many of the functions introduced in this section also exist in a version that ignores the case of the string. They typically have the same name except for a leading 'i'. For example, the equivalent to boost::algorithm::erase_all_copy() is boost::algorithm::ierase_all_copy().

Finally, it should be noted that many functions of Boost.StringAlgorithms also support regular expressions. The following program uses the boost::algorithm::find_regex() function to search for a regular expression.

#include <boost/algorithm/string.hpp> 
#include <boost/algorithm/string/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::iterator_range<std::string::iterator> r = boost::algorithm::find_regex(s, boost::regex("\\w\\s\\w")); 
  std::cout << r << std::endl; 
} 

In order to use the regular expression, the program accesses a class named boost::regex which is defined inside the Boost C++ Library Boost.Regex and is presented in the following section.


5.4 Boost.Regex

The Boost C++ Library Boost.Regex allows the usage of regular expressions in C++. Regular expressions are a powerful feature of many languages that simplifies searching for a particular string pattern. While nowadays C++ still needs to resort to a Boost C++ Library, support for regular expressions will become part of the C++ standard library in the future: Boost.Regex is expected to be included in the next revision of the C++ standard.

The two most important classes in Boost.Regex are boost::regex and boost::smatch, both defined in boost/regex.hpp. While the former is used to define a regular expression, the latter will save the search results.

Boost.Regex provides three different functions to search for regular expressions which are introduced below.

#include <boost/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::regex expr("\\w+\\s\\w+"); 
  std::cout << boost::regex_match(s, expr) << std::endl; 
} 

boost::regex_match() is used to compare a string with a regular expression. It will return true only if the expression matches the complete string.

To search a string for a regular expression, boost::regex_search() is available.

#include <boost/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::regex expr("(\\w+)\\s(\\w+)"); 
  boost::smatch what; 
  if (boost::regex_search(s, what, expr)) 
  { 
    std::cout << what[0] << std::endl; 
    std::cout << what[1] << " " << what[2] << std::endl; 
  } 
} 

boost::regex_search() expects a reference to an object of type boost::smatch as an additional parameter that is used to store the results. Besides the substring matching the complete regular expression, boost::regex_search() stores the substrings matching the individual groupings. Thus, the example yields two such results based on the two groupings found in the regular expression.

The result storage class boost::smatch is actually a container holding elements of type boost::sub_match which can be accessed using an interface similar to the one of std::vector. For example, elements can be accessed via operator[].

The class boost::sub_match on the other hand saves iterators to the specific positions inside a string corresponding to the grouping of a regular expression. Since it is derived from std::pair, the individual iterators referencing a particular substring can be accessed using first and second. In order to write a substring to the standard output stream, these iterators do not necessarily need to be accessed though as seen in the above example. Using the overloaded << operator, the substring can be directly written instead.

Please note that since results are stored using iterators, boost::sub_match does not copy them. This certainly implies that they are accessible only as long as the corresponding string - referenced by the iterators - exists.

Furthermore, please note that the first element of the container boost::smatch stores iterators referencing the string that matches the complete regular expression. The first substring that matches the first grouping is accessible at index 1.

The third function offered by Boost.Regex is boost::regex_replace().

#include <boost/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = " Boris Schäling "; 
  boost::regex expr("\\s"); 
  std::string fmt("_"); 
  std::cout << boost::regex_replace(s, expr, fmt) << std::endl; 
} 

Besides the string to search as well as the regular expression, boost::regex_replace() requires a format that defines how substrings, matching individual groupings of the regular expression, are replaced. In case the regular expression does not contain any grouping, corresponding substrings are replaced one-to-one using the given format. Thus, the above program will output _Boris_Schäling_ as the result.

boost::regex_replace() always searches through the complete string for the regular expression. Hence, the program actually replaced all three spaces with underscores.

#include <boost/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::regex expr("(\\w+)\\s(\\w+)"); 
  std::string fmt("\\2 \\1"); 
  std::cout << boost::regex_replace(s, expr, fmt) << std::endl; 
} 

The format can access substrings returned by groupings of the regular expression. The example uses this technique to swap the first with the last name, displaying Schäling Boris as the result.

Please note that there exist different standards for regular expressions and formats. Each of the three functions takes an additional parameter that selects a specific standard. Whether or not special characters should be interpreted in a specific format or whether the format should rather replace the complete string matching the regular expression can be specified as well.

#include <boost/regex.hpp> 
#include <locale> 
#include <iostream> 

int main() 
{ 
  std::locale::global(std::locale("German")); 
  std::string s = "Boris Schäling"; 
  boost::regex expr("(\\w+)\\s(\\w+)"); 
  std::string fmt("\\2 \\1"); 
  std::cout << boost::regex_replace(s, expr, fmt, boost::regex_constants::format_literal) << std::endl; 
} 

The program passes the boost::regex_constants::format_literal flag as the fourth parameter to boost::regex_replace() to suppress handling of special characters in the format. Since the complete string that matches the regular expression is replaced with the format, the output of the example is \2 \1.

As indicated at the end of the previous section, regular expressions can also be used with Boost.StringAlgorithms. The library accesses Boost.Regex to provide functions such as boost::algorithm::find_regex(), boost::algorithm::replace_regex(), boost::algorithm::erase_regex() and boost::algorithm::split_regex(). Since Boost.Regex is expected to be part of the upcoming revision of the C++ standard, it is advisable to be proficient in applying regular expressions without the usage of Boost.StringAlgorithms though.


5.5 Boost.Tokenizer

The library Boost.Tokenizer makes it possible to iterate over partial expressions in a string by interpreting certain characters as separators.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  tokenizer tok(s); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

Boost.Tokenizer defines a template class named boost::tokenizer in boost/tokenizer.hpp. It expects a class that identifies coherent expressions for its template parameter. The above example uses the class boost::char_separator which interprets spaces and punctuation marks as separators.

By default, a tokenizer must be initialized with a string of type std::string. Using the begin() and end() methods, the tokenizer can be accessed just like a container. Partial expressions of the string used to initialize the tokenizer are available via iterators. How partial expressions are evaluated depends on the kind of class passed as the template parameter.

Since boost::char_separator interprets spaces and punctuation marks as separators by default, the example displays Boost, C, +, + and libraries. In order to identify these characters, boost::char_separator utilizes both std::isspace() and std::ispunct(). Boost.Tokenizer distinguishes between separators that should be displayed and separators that should be suppressed: By default, spaces are suppressed while punctuation marks are displayed. Hence the two plus signs are displayed accordingly.

If punctuation marks should not be interpreted as separators, the boost::char_separator object can be initialized accordingly before being passed to the tokenizer. The following example does exactly that.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" "); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

The constructor of boost::char_separator expects a total of three parameters of which only the first one must be supplied. It describes the individual separators that are suppressed. For the given example, spaces are treated as separators just like with the previous example.

The second parameter specifies the separators that are displayed. In case this parameter is omitted, it is empty and thus no separators are displayed at all. If the program is now executed, it displays Boost, C++ and libraries.

If a plus sign is passed for the second parameter, the example program behaves just like the first one.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" ", "+"); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

The third parameter determines whether or not empty partial expressions are displayed. If two separators are found back-to-back, the corresponding partial expression is empty. By default, these empty expressions are not displayed. Using the third parameter, the default behavior can be manipulated.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer; 
  std::string s = "Boost C++ libraries"; 
  boost::char_separator<char> sep(" ", "+", boost::keep_empty_tokens); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

If executed, the above program displays two additional empty partial expressions. The first one is found between the two plus signs while the second one is found between the second plus sign and the following space.

A tokenizer can also be used with different string types.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::char_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tokenizer; 
  std::wstring s = L"Boost C++ libraries"; 
  boost::char_separator<wchar_t> sep(L" "); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::wcout << *it << std::endl; 
} 

This example iterates over a string of type std::wstring instead. In order to allow this type of string, the tokenizer must be initialized using additional template parameters. The same applies to the boost::char_separator class; it also must be initialized using wchar_t for its template parameter.

Besides boost::char_separator, Boost.Tokenizer provides two additional classes to identify partial expressions.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::escaped_list_separator<char> > tokenizer; 
  std::string s = "Boost,\"C++ libraries\""; 
  tokenizer tok(s); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

boost::escaped_list_separator is used to read multiple values separated by a comma. This format is commonly known as CSV (comma separated values). It also considers double quotes as well as so-called escape sequences accordingly. The output of the example is therefore Boost and C++ libraries.

The second class provided is boost::offset_separator which must be instantiated. The corresponding object must be passed to the constructor of boost::tokenizer as the second parameter.

#include <boost/tokenizer.hpp> 
#include <string> 
#include <iostream> 

int main() 
{ 
  typedef boost::tokenizer<boost::offset_separator> tokenizer; 
  std::string s = "Boost C++ libraries"; 
  int offsets[] = { 5, 5, 9 }; 
  boost::offset_separator sep(offsets, offsets + 3); 
  tokenizer tok(s, sep); 
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) 
    std::cout << *it << std::endl; 
} 

boost::offset_separator specifies the locations within the string at which individual partial expressions end. The above program specifies that the first partial expression ends after 5 characters, the second after an additional 5 characters and the third and last after the following 9 characters. The output is therefore Boost, then C++ with one leading and one trailing space, and finally libraries.


5.6 Boost.Format

Boost.Format offers a replacement for the std::printf() function defined in cstdio. std::printf() originates from the C standard and allows formatted data output. However, it is neither type-safe nor extensible. In C++ applications, Boost.Format is usually the preferred choice when data should be output in a formatted way.

The library Boost.Format provides a class named boost::format which is defined in boost/format.hpp. Similar to std::printf(), a string containing special characters to control formatting is passed to the constructor of boost::format. The actual data replacing these special characters in the output is linked via the % operator as shown in the following example.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%1%.%2%.%3%") % 16 % 9 % 2008 << std::endl; 
} 

Boost.Format uses numbers placed between two percent signs as placeholders that are later linked to the actual data using the % operator. The above program uses the numbers 16, 9 and 2008 to form the date string 16.9.2008. If the month should appear in front of the day, which is common in the United States, the placeholders can simply be swapped.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%2%/%1%/%3%") % 16 % 9 % 2008 << std::endl; 
} 

The program now displays 9/16/2008 instead.

To format data using the C++ manipulators, Boost.Format offers a function named boost::io::group().

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%1% %2% %1%") % boost::io::group(std::showpos, 99) % 100 << std::endl; 
} 

The example will display +99 100 +99 as the result. Since the manipulator std::showpos() has been linked to the number 99 via boost::io::group(), the plus sign is automatically added whenever 99 is displayed.

If the plus sign should only be shown for the first output of 99, the format placeholder needs to be customized.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%|1$+| %2% %1%") % 99 % 100 << std::endl; 
} 

The placeholder %1% has been replaced with %|1$+|. Customizing a format involves more than adding two pipe signs: the reference to the data is also placed between the pipe signs and is written as 1$ instead of 1%. With this change, the output becomes +99 100 99.

Please note that, even though references to data are optional in general, they must be specified either for all placeholders or for none. The following example only provides references for the second and third placeholders but omits them for the first one, which generates an error during execution.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  try 
  { 
    std::cout << boost::format("%|+| %2% %1%") % 99 % 100 << std::endl; 
  } 
  catch (boost::io::format_error &ex) 
  { 
    std::cout << ex.what() << std::endl; 
  } 
} 

This program will throw an exception of type boost::io::format_error. Strictly speaking, Boost.Format throws boost::io::bad_format_string. Since the different exception classes are all derived from boost::io::format_error, it is usually easier to catch exceptions of this type.

The following example shows how to write the format string %|+| %2% %1% without any references to data.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%|+| %|| %||") % 99 % 100 % 99 << std::endl; 
} 

The pipe signs for the second and third placeholders can safely be omitted since they do not specify a format in this case. The resulting syntax then closely resembles that of std::printf().

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%+d %d %d") % 99 % 100 % 99 << std::endl; 
} 

While the format may look like that of std::printf(), Boost.Format still provides the advantage of type safety. The letter 'd' within the format string does not state that a number is output; it merely applies the std::dec() manipulator to the internal stream object used by boost::format. This makes it possible to use format strings that would not make sense for std::printf() and might crash the application there.

#include <boost/format.hpp> 
#include <iostream> 

int main() 
{ 
  std::cout << boost::format("%+s %s %s") % 99 % 100 % 99 << std::endl; 
} 

While std::printf() uses the letter 's' only for strings of type const char*, the above program works as expected. Boost.Format does not necessarily expect a string; the letter merely selects the manipulators that configure the internal stream. The numbers can therefore still be written to the internal stream as shown above.


5.7 Exercises

You can buy solutions to all exercises in this book as a ZIP file.

  1. Create a program that extracts and displays data such as first and last name, birthday and account balance from the following XML stream: <person><name>Karl-Heinz Huber</name><dob>1970-9-30</dob><account>2,900.64 USD</account></person>.

    The first name should be displayed separated from the last name. The birthday should be shown using the typical format of 'day.month.year' while the account balance should omit any decimal place. Test your application with different XML streams that contain additional spaces, a second first name, a negative number for the account balance and so forth.

  2. Create a program that formats and displays data records such as the following: Munich Hamburg 92.12 8:25 9:45. This record describes a flight from Munich to Hamburg that costs 92.12 Euro, departs at 8:25 AM and arrives at 9:45 AM. It should be displayed as: Munich    -> Hamburg      92.12 EUR (08:25-09:45).

    In more detail, each city should occupy 10 characters and be left-aligned, while the price should occupy 7 characters and be right-aligned. After the price, the currency should be displayed. The departure and arrival times should be shown in parentheses, without spaces and separated by a hyphen. For times prior to 10 AM/PM, a leading 0 should be added. Test your application with different data records, for example by adding a city whose name contains more than 10 characters.