The zebra jumps quickly over a fence, vexed by a lazy ox. Eden tries to alter soft stone near it. Tall giants often need to rest, and open roads invite no pause. Some long lines appear there. In bright cold night, stars drift, and people watch them. A few near doors step out. Much light finds land slowly, while men feel deep quiet. Words run in ways, forward yet true. Look ahead, and things form still, yet dreams stay hidden. Down the path, close skies come, forming hard arcs. High above, quiet kites drift, fast on pure wind, yanking joints.
What's so special about the nonsense paragraph above? It's attempting to match the average distribution of letters in texts written in the English language.
This article by Peter Norvig discusses a 2012 study of letter frequency using the Google Books data set. The distribution looks like this:
For font-fallback matching purposes (more on this later) I want a shorter paragraph with a roughly similar distribution. One can, of course, just create a paragraph like "Zzzzzzzzz" (9 Zs), followed by 12 Qs and so on, all the way to 1249 Es. But where's the fun in that? Plus, texts have spaces and punctuation too.
So after some tweaking and coaching AI, this is the paragraph that came out. It looks more realistic and matches the letter frequency pretty well.
Here's a CSV that shows:
- each letter,
- Norvig's frequencies (based on 3,563,505,777,820 letters in the dataset) and
- my frequencies too (based on a mere 424 letters, once you take out spaces and punctuation)
Letter,Norvig,Tall giants
E,12.49%,12.26%
T,9.28%,8.73%
A,8.04%,7.55%
O,7.64%,7.08%
I,7.57%,6.60%
N,7.23%,7.55%
S,6.51%,6.84%
R,6.28%,6.13%
H,5.05%,4.01%
L,4.07%,4.48%
D,3.82%,5.42%
C,3.34%,1.89%
U,2.73%,2.36%
M,2.51%,2.12%
F,2.40%,2.83%
P,2.14%,2.59%
G,1.87%,2.12%
W,1.68%,2.12%
Y,1.66%,2.12%
B,1.48%,0.94%
V,1.05%,0.94%
K,0.54%,1.18%
X,0.23%,0.47%
J,0.16%,0.47%
Q,0.12%,0.71%
Z,0.09%,0.47%
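If you want to check a paragraph of your own against these numbers, counting letters takes only a few lines. Here is a minimal sketch in Python. The language and the helper name `letter_frequencies` are my assumptions for illustration, and this is not necessarily how the figures above were produced. It drops spaces, punctuation and digits, folds case, and reports each letter's share of the remaining letters. The sample is only the first sentence of the paragraph, so its shares will differ from the table.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return each letter's share of all letters in `text`, most frequent first.

    Hypothetical helper for illustration: only ASCII letters are counted,
    case is ignored, and everything else (spaces, punctuation) is skipped.
    """
    letters = [ch.upper() for ch in text if ch.isascii() and ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: n / total for letter, n in counts.most_common()}


# First sentence of the paragraph only; pass the whole paragraph to compare
# against the table.
sample = "The zebra jumps quickly over a fence, vexed by a lazy ox."
freqs = letter_frequencies(sample)

# Print in the same "Letter,share" shape as the CSV.
for letter, share in freqs.items():
    print(f"{letter},{share:.2%}")
```

Feeding it the full paragraph instead of the single sentence should give shares comparable to the "Tall giants" column.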
Here's the same data represented graphically:
Well, what's the point of this?
Similar to the nonsense etaoin shrdlu used by typesetters, this paragraph can be used to find out the average character width of a font.
Just render the paragraph in a non-wrapping inline-block DOM element, measure the width of the element and divide by the length of the text.
How is this useful? Welp, to set the CSS size-adjust descriptor of a fallback font so that it matches a custom web font. A further write-up is coming, stay tuned!
Close enough
As you can see in the graph, the two lines do not match exactly. I think this is OK. It's extremely unlikely that any text on your page will have the exact average distribution of letters in it. So we're talking about an approximation to begin with. The distribution may also be site-dependent. E.g. on an adult site maybe the X character will occur more often than in the average book.
Also Norvig's analysis doesn't mention spaces and punctuation. In my paragraph, these exist, maybe making it possible to match the average text on a web page just a little bit closer.
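To put a number on "close enough", one option is to look at the per-letter gap between the two columns. The sketch below is an assumption about how one might do that, not part of the original analysis: it takes the figures from the CSV in this post (percent signs stripped) and lists the letters where the paragraph strays furthest from Norvig's frequencies. `largest_gaps` is a made-up helper name.

```python
import csv
import io

# Copied from the CSV in this post: letter, Norvig %, paragraph %.
TABLE = """\
E,12.49,12.26
T,9.28,8.73
A,8.04,7.55
O,7.64,7.08
I,7.57,6.60
N,7.23,7.55
S,6.51,6.84
R,6.28,6.13
H,5.05,4.01
L,4.07,4.48
D,3.82,5.42
C,3.34,1.89
U,2.73,2.36
M,2.51,2.12
F,2.40,2.83
P,2.14,2.59
G,1.87,2.12
W,1.68,2.12
Y,1.66,2.12
B,1.48,0.94
V,1.05,0.94
K,0.54,1.18
X,0.23,0.47
J,0.16,0.47
Q,0.12,0.71
Z,0.09,0.47
"""


def largest_gaps(table: str, top: int = 3) -> list[tuple[str, float]]:
    """Return the `top` letters with the biggest absolute gap, in percentage points."""
    rows = csv.reader(io.StringIO(table))
    gaps = [(letter, abs(float(norvig) - float(mine))) for letter, norvig, mine in rows]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)[:top]


worst = largest_gaps(TABLE)
for letter, gap in worst:
    print(f"{letter}: off by {gap:.2f} percentage points")
```

With the table above, D and C come out as the biggest outliers, each off by roughly a percentage point and a half.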
Aside: why not just Lorem Ipsum
Well, it doesn't attempt to match the character distribution in English. (Duh, it's not even English!)
Here's what it looks like in the same diagram:
Note: no K, J, Z, W or Y. Barely any H.
Here are the stats in CSV and .numbers for your perusal.
May "The zebra jumps quickly over a fence, vexed by a lazy ox" be always in your favor!