Statistics on Numbers and Punctuation
Numbers and punctuation marks are surprisingly rare. The Brown Corpus, the corpus most widely used in linguistic analysis, shows commas at 0.98% and periods at 0.83%. The Brown Corpus is fairly old now, and punctuation usage has decreased significantly in the past twenty years, to the point where periods outnumber commas in modern text.
Below are statistics from four different large text collections:
- Bill Machrone is a set of columns that Bill Machrone published in PC Magazine.
- MRoth Email File is a set of email messages collected over a year. (They are unedited and contain a lot of computer-generated headers, so they contain more numbers than text written by mortals.)
- Model Business Act is legal and administrative language.
- Dictal is a large set of medical transcription reports.
Here are the results:
|Collection|Letter E|
|---|---|
|Bill Machrone|8.9721%|
|MRoth Email File|8.2004%|
|Model Business Act|9.5633%|
|Medical Transcription Dictal|9.1752%|
They confirm Gordon Walker's analysis: punctuation and digits are so rare that they fall within the margin of error of what you can expect in a text of 182 characters.
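To make the scale concrete, here is a minimal Python sketch of how percentages like these could be computed and what they imply for a short text. It is not the tool used to produce the figures above. The function names `char_class_percentages` and `expected_count` are hypothetical, and the sketch assumes that percentages are taken over all characters (spaces included) and that only ASCII digits and ASCII punctuation are counted; the original statistics may have used a different denominator.

```python
import string
from collections import Counter


def char_class_percentages(text: str) -> dict[str, float]:
    """Share of letter E, digits, and punctuation, as a percentage of all characters.

    Assumption: the denominator is every character in the text, including spaces.
    """
    classes = ("letter_e", "digits", "punctuation")
    total = len(text)
    if total == 0:
        return {name: 0.0 for name in classes}

    counts = Counter()
    for ch in text:
        if ch in "eE":
            counts["letter_e"] += 1
        elif ch in string.digits:        # ASCII digits 0-9 only
            counts["digits"] += 1
        elif ch in string.punctuation:   # ASCII punctuation only
            counts["punctuation"] += 1

    return {name: 100.0 * counts[name] / total for name in classes}


def expected_count(percentage: float, length: int) -> float:
    """Expected number of occurrences in a text of the given length."""
    return percentage / 100.0 * length


if __name__ == "__main__":
    print(char_class_percentages("Hello, 42!"))
    # Brown Corpus figures quoted above, applied to a 182-character text.
    print(round(expected_count(0.98, 182), 2))  # commas
    print(round(expected_count(0.83, 182), 2))  # periods
```

With the Brown Corpus figures, a 182-character text would be expected to contain roughly 1.8 commas and 1.5 periods, so a single extra or missing mark shifts the observed percentage by more than half a point.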