Winning arguments with R
You don’t really know, do you?
There’s an okay episode of Star Trek: The Next Generation in which Captain Picard and Dr. Crusher become telepathically linked after getting stuck on an unknown planet. There’s a scene in this episode that I think about sometimes.
Picard and Crusher have to pick a direction to travel in, without knowing much about their surroundings. Picard confidently starts in one direction, but because Crusher can read his mind their exchange goes like this:
PICARD: This way.
CRUSHER: You don’t really know, do you?
PICARD: What?
CRUSHER: I mean, you’re acting like you know exactly which way to go, but you’re only guessing. Do you do this all the time?
PICARD: No, but there are times when it is necessary for a captain to give the appearance of confidence.
I try to remember this scene in real life: a confident answer is not necessarily built on true knowledge.
Die, der oder das?
How is the previous scene related to German? Let me tell you! My husband (Mike) and I live in Switzerland, and we each speak German at a different level. One thing I find challenging about German is memorizing articles. Every noun takes one of three articles: die, der, or das. You need to know which one, because the article changes depending on the grammatical case.
Recently, I told Mike I had noticed that you can sometimes guess the article from the last letter of the noun. Specifically, I noticed that nouns ending with ‘e’ are often die. He told me that no, it’s not true. Because he speaks much better German than I do, and answered so confidently, normally I would just accept this answer. But thanks to Dr. Crusher and her telepathic revelations, I investigated for myself instead. Here’s what I found and how I did it.
Acquiring and checking out the data
First, I downloaded a list of German nouns with their articles compiled from WiktionaryDE from this source. After cleaning up the list, I have 87404 nouns.
I extracted the last letter of each word, and first checked the proportion of nouns in each category. I’ll look at it with a treemap, where the size of each rectangle corresponds to the proportion of nouns in each category.
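Here is a minimal sketch of that extraction step in base R. It uses a handful of example nouns instead of the real list, and the column names `noun` and `article` are my assumptions, not necessarily what the downloaded file uses.

```r
# Toy stand-in for the cleaned noun list (the real one has 87404 rows).
# Column names are assumed for illustration.
nouns <- data.frame(
  noun    = c("Katze", "Hund", "Haus", "Blume", "Brief", "Zeitung", "Lampe", "Kissen"),
  article = c("die", "der", "das", "die", "der", "die", "die", "das"),
  stringsAsFactors = FALSE
)

# Last letter of each noun, lower-cased so "E" and "e" land in the same group.
n_chars <- nchar(nouns$noun)
nouns$last_letter <- tolower(substr(nouns$noun, n_chars, n_chars))

# Share of nouns per final letter, largest first.
letter_share <- sort(prop.table(table(nouns$last_letter)), decreasing = TRUE)
print(round(letter_share, 2))
```

These shares are what the rectangle sizes in the treemap represent.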

So we can see that there are many nouns that end with e, n, t, r. Very few nouns end with v, w, c, q. In fact, only one noun ends with ‘q’: Nasdaq. I’m going to exclude that. Interestingly, no noun on this list ends with ‘j’.
Now I’ll split these last letter categories by their article.

Already we can see that most nouns that end with ‘e’ have die as their article: 87.2% of them, to be specific.
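A split like this comes down to a two-way table. The sketch below shows the idea on a few example nouns (again with assumed column names); on the full list, the ‘e’ row is where the 87.2% would come from.

```r
# Toy example: final letter and article for a few nouns.
nouns <- data.frame(
  noun        = c("Katze", "Blume", "Lampe", "Name", "Brief", "Hof", "Kissen", "Wagen"),
  article     = c("die", "die", "die", "der", "der", "der", "das", "der"),
  last_letter = c("e", "e", "e", "e", "f", "f", "n", "n"),
  stringsAsFactors = FALSE
)

# Counts of each article within each final letter.
counts <- table(last_letter = nouns$last_letter, article = nouns$article)

# Row-wise proportions: each row (final letter) sums to 1.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```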
But is that statistically significant? We can run a chi-squared test on each final letter to check whether its nouns are spread equally across die / der / das.
Turns out, only three letters are consistent with an equal distribution: c, w, y. For every other letter, the test rejects an equal split. In the previous figure we can see that these three final letters are more or less equally distributed across the three articles. Keep in mind that c and w are also among the rarest endings, so the test has little data to work with there.
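For the curious, a per-letter test of this kind can be sketched as a chi-squared goodness-of-fit test against a one-third / one-third / one-third split. The counts below are invented for illustration and are not the real data.

```r
# Invented article counts per final letter (rows) for illustration only.
counts <- rbind(
  e = c(die = 870, der = 80, das = 50),
  f = c(die = 10,  der = 86, das = 4),
  w = c(die = 12,  der = 11, das = 13)
)

# Goodness-of-fit test per letter: null hypothesis is an equal split.
p_values <- apply(counts, 1, function(row) {
  chisq.test(row, p = rep(1 / 3, 3))$p.value
})

# Letters where we cannot reject an equal split at the 5% level.
print(names(p_values)[p_values > 0.05])
```

Since one test is run per letter, adjusting the p-values for multiple testing (for example with `p.adjust()`) would be a reasonable extra step.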
To guess or not to guess
So is it generally a good idea to guess that die is the article for a German noun that ends with ‘e’?
We can check by randomly picking 100 nouns that end with e, and see how often they have die for their article. In other words, how many times would we be right if we just guessed?
## [1] 0.88
Victory occurs in 88% of this subsample! Not too shabby.
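The subsample check could look roughly like this. The vector below is simulated: it is built to contain 87.2% die, and the der / das split is made up, so treat it as a stand-in for the real articles of nouns ending in ‘e’.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Simulated articles for 1000 nouns ending in "e": 87.2% die,
# with an invented der/das split for the remainder.
e_articles <- rep(c("die", "der", "das"), times = c(872, 78, 50))

# Draw 100 nouns at random (without replacement) and always guess "die".
guess_sample <- sample(e_articles, size = 100)
hit_rate <- mean(guess_sample == "die")
print(hit_rate)
```

The hit rate will wobble from one random sample to the next, which is why a single draw like 0.88 sits near, but not exactly on, the overall proportion.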
But I could use all the help I can get for guessing articles. Are there any other letters which are heavily biased towards die, der or das?
I defined a “heavy” bias as a final letter whose nouns take one particular article at least 85% of the time.
| article | last letter | proportion |
|---|---|---|
| die | e | 0.87 |
| der | f | 0.86 |
So for two different situations we would have a good chance of success if we guessed die (word ends with e) or der (word ends with f).
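Applying the 85% cut-off is just a filter on the table of proportions. A small sketch with invented counts (chosen so the ‘e’ and ‘f’ rows match the proportions above):

```r
# Invented counts; only the resulting e/die and f/der proportions
# mirror the table in the text.
counts <- matrix(
  c(870, 80, 50,
     10, 86,  4,
     40, 35, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(last_letter = c("e", "f", "n"),
                  article     = c("die", "der", "das"))
)

# Row-wise proportions, reshaped to one row per (letter, article) pair.
props <- prop.table(counts, margin = 1)
long  <- as.data.frame(as.table(props), responseName = "proportion")

# Keep only the heavily biased combinations.
heavy <- long[long$proportion >= 0.85, ]
print(heavy)
```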
Greedy for guesses
Are there any other “rules” we can find? I’ll try to identify additional patterns by looking at not only the final letter, but also the last two, three, and four letters of each noun.
I have limited memory, so I’ll only look for really good rules that I consider worth remembering. For me, that means guessing works out at least 85% of the time and the rule covers at least 900 nouns.
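One way to search for such rules is sketched below. The helper `suffix_rules()` is a name I made up for illustration, and the example runs on ten nouns with a much lower count threshold, since a toy list can’t reach 900.

```r
# Hypothetical helper: for suffix lengths 1 to 4, find the most common
# article per suffix and keep suffixes that pass both thresholds.
suffix_rules <- function(nouns, lengths = 1:4, min_prop = 0.85, min_n = 900) {
  per_length <- lapply(lengths, function(k) {
    keep   <- nchar(nouns$noun) >= k          # skip nouns shorter than k
    words  <- tolower(nouns$noun[keep])
    suffix <- substr(words, nchar(words) - k + 1, nchar(words))
    counts <- table(suffix, nouns$article[keep])
    total  <- unname(rowSums(counts))
    data.frame(
      suffix     = rownames(counts),
      article    = colnames(counts)[apply(counts, 1, which.max)],
      proportion = unname(apply(counts, 1, max)) / total,
      n_words    = total,
      stringsAsFactors = FALSE
    )
  })
  rules <- do.call(rbind, per_length)
  rules[rules$proportion >= min_prop & rules$n_words >= min_n, ]
}

# Toy data; column names are assumed.
nouns <- data.frame(
  noun    = c("Zeitung", "Wohnung", "Rechnung", "Katze", "Blume",
              "Name", "Lehrerin", "Brief", "Hof", "Haus"),
  article = c("die", "die", "die", "die", "die",
              "der", "die", "der", "der", "das"),
  stringsAsFactors = FALSE
)

rules <- suffix_rules(nouns, min_n = 3)
print(rules)
```

On the toy list this keeps the nested endings g, ng and ung, which is the same parent and child pattern that shows up in the real results.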
| article | last letter(s) | proportion | # words |
|---|---|---|---|
| das | hen | 0.89 | 988 |
| die | rin | 0.99 | 3340 |
| die | ung | 0.99 | 5575 |
| die | ion | 0.97 | 1702 |
| die | e | 0.87 | 16842 |
| die | it | 0.86 | 2180 |
| der | ler | 1.00 | 927 |
| der | f | 0.86 | 1334 |
So now we have eight different rules to remember, which altogether cover 37.6% of our nouns. Some of these endings encompass multiple child suffixes that have the same or a greater proportion of the article. We can use a Sankey plot to check this out:

Of course, remembering die for nouns that end with ‘e’ is easier than remembering each of the seven child suffixes.
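As a quick sanity check, the 37.6% coverage figure can be recomputed from the word counts in the table. This assumes the eight endings never overlap (no noun can match two of them, since none is a suffix of another) and uses the 87404 total from earlier as the denominator.

```r
# Word counts per rule, copied from the table above.
rule_counts <- c(hen = 988, rin = 3340, ung = 5575, ion = 1702,
                 e = 16842, it = 2180, ler = 927, f = 1334)

# Total number of nouns after cleaning (assumed denominator).
total_nouns <- 87404

coverage <- sum(rule_counts) / total_nouns
print(round(100 * coverage, 1))  # 37.6
```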
In conclusion
In the end, our rules can help us a bit going forward, but they won’t substitute for actually memorizing each article. The main takeaways are that:
- I was right
- Mike was wrong
- German is hard
- The projection of confidence is important for starship captains
