i before e, except after c?
Rumour has it that children are no longer taught “i before e, except after c (unless it rhymes with bee!)” at school, because it is more often wrong than right. I wondered whether this was only the case for obscure words, or for all words - or whether it was even true at all?
To look at this I considered the 20,000 most common English words (helpfully provided by first20hours) from Google’s Trillion Word Corpus. According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas (dictionary form, i.e. eat, ate, eats are all the same lexeme) account for approximately 90% of usage. Therefore, 20,000 words should capture almost all schoolboy usage. However, note that this corpus contains words, not lemmas, so inflected forms are counted separately.
Simply using the search in my browser, I can test whether i really is before e.
Characters | # Appearances |
---|---|
ie | 592 |
ei | 147 |
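For anyone who would rather not count in a browser, the same kind of tally can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the function name `count_occurrences` and the toy `sample` list are made up for illustration, and the real word list would need to be loaded from wherever you saved it (one word per line). Like a browser search, it counts every occurrence of the string, not the number of distinct words.

```python
from typing import Iterable


def count_occurrences(words: Iterable[str], pattern: str) -> int:
    """Total number of times `pattern` appears across all words.

    Matching is case-insensitive, mimicking a default browser search.
    """
    pattern = pattern.lower()
    return sum(word.lower().count(pattern) for word in words)


# Toy sample for illustration only; this is NOT the 20,000-word list.
sample = ["science", "receive", "field", "their", "species"]

for pattern in ("ie", "ei", "cie", "cei"):
    print(pattern, count_occurrences(sample, pattern))
```

To run it on the real list, replace `sample` with the lines of the downloaded file, for example `[line.strip() for line in open(path)]` where `path` is your local copy.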
It does seem that i is before e! What happens if we look only after c?
Characters | # Appearances |
---|---|
cie | 43 |
cei | 14 |
Even after c, it looks like i should still be before e. Perhaps the adage is wrong? There is a little refinement to do though. Often -cie occurs at the end of a word as part of the plural ending -cies, which we know isn’t covered by the rule. This can be overcome by looking at lemmas or by going through the list by hand. I found 15 words ending in -cies. By the same token, words ending in -iest should be removed; there were 2 of these (earliest and easiest, if we keep priest!). Thus the table should read:
Characters | # Appearances |
---|---|
cie | 28 |
cei | 12 |
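The by-hand refinement can also be approximated in code. The sketch below is one possible way to do it, not the procedure I actually followed: `words_with_pattern`, its `excluded_suffixes` default and the `keep` set are hypothetical names chosen for illustration. It drops words ending in -cies or -iest unless they are explicitly kept (priest, for instance), and returns the remaining words that contain the pattern.

```python
from typing import AbstractSet, Iterable, List, Tuple


def words_with_pattern(
    words: Iterable[str],
    pattern: str,
    excluded_suffixes: Tuple[str, ...] = ("cies", "iest"),
    keep: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return words containing `pattern`, skipping excluded suffixes.

    Words listed in `keep` survive even if they end in an excluded suffix.
    """
    result = []
    for word in words:
        w = word.lower()
        if pattern not in w:
            continue
        # str.endswith accepts a tuple and matches any of its entries.
        if w.endswith(excluded_suffixes) and w not in keep:
            continue
        result.append(w)
    return result


# Toy sample for illustration only; this is NOT the real word list.
sample = ["science", "agencies", "policies", "receive",
          "earliest", "easiest", "priest", "ancient"]

print(words_with_pattern(sample, "cie"))
print(words_with_pattern(sample, "ie", keep={"priest"}))
```

A suffix filter like this is cruder than proper lemmatisation, so the output is still worth checking by eye.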
Even with this refinement, cie is still more common than cei!
A further thought is whether the cei rule is helpful for frequent words, with cie words being mostly obscure. So I reran the final analysis above with only the 10,000 most common words:
Characters | # Appearances |
---|---|
cie | 13 |
cei | 9 |
This brings the two strings much closer together, but cie is still more frequent! Moreover, the most common cei word (received) comes in at position 891, whereas the most common cie word (science) comes in at 440. It turns out, then, that the rule of thumb should be to always place your i before your e, and ignore the c.
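Looking up the position of the most common word for each string is easy to script as well. The following is a minimal sketch, assuming the word list is already sorted from most to least frequent; `first_rank` and the `top_n` parameter are names invented here, and the `ranked` list is a toy example, not the real data.

```python
from typing import Optional, Sequence, Tuple


def first_rank(
    words: Sequence[str],
    pattern: str,
    top_n: Optional[int] = None,
) -> Optional[Tuple[int, str]]:
    """Return (1-based rank, word) for the first word containing `pattern`.

    If `top_n` is given, only the `top_n` most common words are searched.
    Returns None when no word matches.
    """
    limited = words if top_n is None else words[:top_n]
    for rank, word in enumerate(limited, start=1):
        if pattern in word.lower():
            return rank, word
    return None


# Toy frequency-ordered list for illustration only.
ranked = ["the", "of", "science", "and", "received", "ceiling"]

print(first_rank(ranked, "cie"))
print(first_rank(ranked, "cei"))
print(first_rank(ranked, "cei", top_n=4))
```

Passing `top_n=10000` on the full list would mirror the restriction to the 10,000 most common words.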
Does anyone want to try rhyming with bee?