strip_headers.cpp File Reference
Detailed Description
This program preprocesses text files from Project Gutenberg.
It strips the headers and footers from a Project Gutenberg ebook text file (http://www.gutenberg.org/). This is necessary because there is no standard delimiter that separates the actual text from the header and footer, so the program has to rely on heuristics.
This program has been tested on nearly all Project Gutenberg texts. For most of the thousands of files it determines the boundaries correctly. For a few files it may leave in some header lines or remove too many lines.
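To illustrate what a boundary heuristic of this kind can look like, here is a minimal sketch. It is not the code from strip_headers.cpp: the marker phrases and the names kStartMarkers, kEndMarkers, to_upper, contains_any and find_body are assumptions made up for this example. The sketch treats the text between the last start-like marker and the first end-like marker after it as the body, and keeps everything if no marker is found.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative marker phrases (assumptions, not taken from strip_headers.cpp).
const std::vector<std::string> kStartMarkers = {"*** START OF", "***START OF"};
const std::vector<std::string> kEndMarkers = {"*** END OF", "***END OF"};

// Returns an upper-case copy of a line so that matching ignores letter case.
std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// True if the line contains at least one of the given marker phrases.
bool contains_any(const std::string& line,
                  const std::vector<std::string>& markers) {
    const std::string upper = to_upper(line);
    return std::any_of(markers.begin(), markers.end(),
                       [&](const std::string& m) {
                           return upper.find(m) != std::string::npos;
                       });
}

// Returns the half-open line range [first, last) that is assumed to hold
// the actual text. Without any marker the whole file is kept.
std::pair<std::size_t, std::size_t> find_body(
        const std::vector<std::string>& lines) {
    std::size_t first = 0;
    std::size_t last = lines.size();
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (contains_any(lines[i], kStartMarkers)) {
            first = i + 1;  // body starts after the latest start marker
        } else if (contains_any(lines[i], kEndMarkers)) {
            last = i;       // body ends before the first end marker
            break;
        }
    }
    return {first, last};
}
```

A single pair of marker phrases like this would miss files that use other wording, which is why a tool of this kind needs a broader set of heuristics.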
Usage:
strip_headers INFILE OUTFILE
- Parameters:
- INFILE The name of the input file downloaded from Project Gutenberg.
- OUTFILE The name of the output file. It must be different from the input file, because the two files are read and written at the same time.
- Returns:
- 0 on success, a nonzero value on error
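The calling convention described above can be sketched as follows. This is a hypothetical driver, not the actual source: the function name run_strip_headers and the error code 1 are assumptions, and the stripping step is replaced by a plain line-by-line copy. The name comparison is purely textual, so it would not detect two different paths that refer to the same file.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical driver. A real main() could simply
// return run_strip_headers(argc, argv).
int run_strip_headers(int argc, const char* const argv[]) {
    if (argc != 3) {
        std::cerr << "Usage: strip_headers INFILE OUTFILE\n";
        return 1;
    }
    const std::string in_name = argv[1];
    const std::string out_name = argv[2];

    // Input and output are open at the same time, so they must differ.
    if (in_name == out_name) {
        std::cerr << "INFILE and OUTFILE must be different files\n";
        return 1;
    }

    std::ifstream in(in_name);
    if (!in) {
        std::cerr << "Cannot open input file " << in_name << "\n";
        return 1;
    }
    std::ofstream out(out_name);
    if (!out) {
        std::cerr << "Cannot open output file " << out_name << "\n";
        return 1;
    }

    // Placeholder for the header/footer heuristics: copy every line.
    std::string line;
    while (std::getline(in, line)) {
        out << line << '\n';
    }
    out.flush();
    return out ? 0 : 1;  // 0 on success, nonzero if writing failed
}
```

Returning 0 only on success lets shell scripts that process many downloaded files detect failures through the exit status.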
Download:
- The newest version of this tool can be downloaded from http://wwwmayr.in.tum.de/spp1307/downloads.html