file_statistics.cpp File Reference
Detailed Description
Efficient computation of some text statistics for a given file, including the text length, alphabet size, number of distinct q-grams and empirical entropy.
The input is read from stdin (with a single pass over the text) and the output is written to stdout. The output is formatted as comma-separated values so that it can easily be written into a CSV file. UTF-8 encoded strings can also be handled.
Depending on the command line parameter STATISTIC, different values can be computed.
- simple: The output will be the text length (number of actual characters, works also with multi-byte encoding), the alphabet characters (concatenated set of characters occurring in the text, non-printable characters replaced with a "_"), and the size of the alphabet (equal to the length of the former string).
- qgrams_X: The output will be the number of different substrings of length X.
- entropy_X: The output will be the empirical entropy of order X. The calculation of the empirical entropy is based on the following paper: Giovanni Manzini (2001): "An analysis of the Burrows-Wheeler transform" http://dx.doi.org/10.1145/382780.382782
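As an illustration of the order-X empirical entropy, here is a minimal C++ sketch following the definition in the cited paper: H_k(s) = (1/n) * sum over all contexts w of length k of |w_s| * H_0(w_s), where w_s collects the characters that directly follow the occurrences of w in s. It is not the code of file_statistics.cpp. The function names are hypothetical, the text is held in memory, and every byte is treated as one character (no UTF-8 handling).

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Zeroth-order empirical entropy in bits per character, computed from
// character counts that sum up to `total`.
static double entropy0(const std::map<char, std::size_t>& counts,
                       std::size_t total) {
    double h = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) /
                         static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}

// Order-k empirical entropy of `text` (hypothetical helper, byte-based).
// For every context w of length k, the characters following w are counted;
// the result is the sum of |w_s| * H_0(w_s) divided by the text length.
double empirical_entropy(const std::string& text, std::size_t k) {
    const std::size_t n = text.size();
    if (n == 0 || k >= n) return 0.0;  // no character has a full context

    std::map<std::string, std::map<char, std::size_t>> followers;
    for (std::size_t i = 0; i + k < n; ++i) {
        ++followers[text.substr(i, k)][text[i + k]];
    }

    double sum = 0.0;
    for (const auto& context : followers) {
        std::size_t total = 0;
        for (const auto& entry : context.second) total += entry.second;
        sum += static_cast<double>(total) * entropy0(context.second, total);
    }
    return sum / static_cast<double>(n);
}
```

Under this definition, empirical_entropy("mississippi", 0) evaluates to about 1.82 bits per character. Storing every context in a std::map keeps the sketch short, but it needs far more memory than would be practical for multi-gigabyte inputs.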
Usage:
file_statistics STATISTIC [ENCODING=single-byte [FILETYPE=plain]] < FILE
- Parameters:
- STATISTIC: The name of the statistic to compute. Has to be one of the following: (simple | qgrams_X | entropy_X), where X is an integer.
- ENCODING: The encoding of the input file. Has to be one of the following: (single-byte | UTF-8).
- FILETYPE: Whether the input file should be treated as a regular text file or as a FASTA file. The only difference is that for a FASTA file all lines starting with a > will be ignored. Has to be one of the following: (plain | fasta).
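The STATISTIC grammar above (simple | qgrams_X | entropy_X) could be parsed roughly as follows. This is only a sketch under assumptions: StatisticSpec and parse_statistic are hypothetical names that do not come from file_statistics.cpp, and how the real tool reports invalid arguments is not documented here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical result type: the kind of statistic and the integer X
// (0 for "simple", which takes no X).
struct StatisticSpec {
    std::string kind;     // "simple", "qgrams" or "entropy"
    unsigned long order;  // the X in qgrams_X / entropy_X
};

// Parse a STATISTIC argument such as "simple", "qgrams_3" or "entropy_2".
// Throws std::invalid_argument for anything that does not match the grammar.
StatisticSpec parse_statistic(const std::string& arg) {
    if (arg == "simple") return StatisticSpec{"simple", 0};

    const char* const prefixes[] = {"qgrams_", "entropy_"};
    for (const char* p : prefixes) {
        const std::string prefix(p);
        if (arg.compare(0, prefix.size(), prefix) == 0) {
            const std::string digits = arg.substr(prefix.size());
            if (digits.empty() ||
                digits.find_first_not_of("0123456789") != std::string::npos) {
                throw std::invalid_argument("X must be an integer: " + arg);
            }
            // Drop the trailing '_' from the prefix to obtain the kind.
            return StatisticSpec{prefix.substr(0, prefix.size() - 1),
                                 std::stoul(digits)};
        }
    }
    throw std::invalid_argument("unknown STATISTIC: " + arg);
}
```

With this sketch, parse_statistic("qgrams_3") yields the kind "qgrams" with order 3, while an argument such as "qgrams_x" is rejected.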
Examples:
- Determine length, alphabet characters, and alphabet size of the single line "mississippi":
    $ ./file_statistics simple < mississippi.txt
    "11";"psim";"4";
- Determine the number of different 3-grams in a Chinese text:
    $ ./file_statistics qgrams_3 UTF-8 plain < chinese.txt
    "135344";
- Calculate the empirical entropy of a FASTA file:
    $ ./file_statistics entropy_2 single-byte fasta < test.fasta
    "1.94";
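To make the text length and q-gram figures in these examples concrete, the following C++ sketch counts characters and distinct q-grams for an in-memory string. It is an illustration only: split_characters and count_distinct_qgrams are hypothetical names, the input is assumed to be valid UTF-8 when that mode is selected, and the single-pass, memory-efficient approach of the actual tool is not reproduced.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split `text` into characters. With utf8 == true, a character is a lead byte
// plus all following continuation bytes (bit pattern 10xxxxxx); otherwise
// every byte is its own character.
std::vector<std::string> split_characters(const std::string& text, bool utf8) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t len = 1;
        if (utf8) {
            while (i + len < text.size() &&
                   (static_cast<unsigned char>(text[i + len]) & 0xC0) == 0x80) {
                ++len;
            }
        }
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Number of different substrings consisting of exactly q characters.
// q == 0 is treated as "no q-grams" in this sketch.
std::size_t count_distinct_qgrams(const std::vector<std::string>& chars,
                                  std::size_t q) {
    if (q == 0 || chars.size() < q) return 0;
    std::set<std::string> seen;
    for (std::size_t i = 0; i + q <= chars.size(); ++i) {
        std::string gram;
        for (std::size_t j = 0; j < q; ++j) gram += chars[i + j];
        seen.insert(gram);
    }
    return seen.size();
}
```

For "mississippi" in single-byte mode, split_characters returns 11 characters and count_distinct_qgrams gives 4 for q = 1, which matches the length and alphabet size in the first example, and 7 for q = 3.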
- Returns:
- 0 on success, a non-zero value on error
- Remarks:
- With this program, on a regular desktop computer (2 gigabytes of RAM), it is possible to compute statistics for texts of several gigabytes, for example the DNA sequence of the human genome. The following table shows some examples of computed values. (Empty cells mark values that could not be computed efficiently because of large text and/or alphabet sizes.)
File                | length        | qgrams_1 | qgrams_2  | qgrams_3   | ent_0 | ent_1 | ent_2 | ent_3 | ent_4 | ent_5
--------------------+---------------+----------+-----------+------------+-------+-------+-------+-------+-------+-------
DNA of human genome | 3,095,677,412 | 7        | 31        | 133        | 2.21  | 1.79  | 1.78  | 1.77  | 1.77  | 1.76
Texts (English)     | 8,790,836,971 | 184      | 10,762    | 324,678    | 4.50  | 3.03  |       |       |       |
Texts (German)      | 210,528,730   | 189      | 10,121    | 137,459    | 4.52  | 3.59  | 2.96  | 2.49  | 2.15  | 1.90
Texts (Chinese)     | 51,649,808    | 16,564   | 2,706,626 | 16,329,056 | 9.51  | 7.38  | 4.82  |       |       |
Download:
- The newest version of this tool can be downloaded from http://wwwmayr.in.tum.de/spp1307/downloads.html