How do nouns get assigned a gender?

How do languages assign grammatical gender to “foreign” nouns? Why is phone masculine in Urdu? Why is Auto neuter in German? Most native speakers struggle to explain why a foreign noun gets a specific gender, and yet most of them would agree that it just sounds right. They simply know.

Given the entire corpus of nouns in a specific language, could we infer the rules behind gender assignment? And could we then teach a machine to assign genders to nouns that it has never seen before?

To answer these questions, I collected the entire corpus of German nouns (from this repo) and trained a few models on it. I must admit that this was partially driven by my own frustration as a student of the German language: genders are essential not just for articles, but also for pronouns and declensions in German. That often left me stuttering after every two words in a sentence, or referring to the wrong object – a problem both programmers and language students can appreciate.

The code base can be found here, and the web app here.

Methodology

Data cleaning and feature generation

I needed to decompose each noun in some form to create features. While I reasoned that some structure within the noun may be predictive of grammatical gender, I did not know what that structure may be. The dimensions I may care about in a sub-word could include: the position of the sub-word within the word, the length of the word, the letters preceding and following the sub-word, etc. Unable to reduce this problem meaningfully, I used the toddler’s approach to problem solving: brute force.

That’s not completely true: each word was wrapped in a start and an end symbol, since I reasoned that there may be information in prefixes and suffixes. Then, each word was decomposed into its constituent n-grams for values of n from 1 to 5. The number 5 reflected my German fluency: the longest suffixes I could think of were four characters long (e.g. -chen), and a four-character suffix plus the end symbol forms a 5-gram.
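To make this concrete, here is a minimal sketch of the decomposition; the marker symbols and function name are my own choices for illustration, not necessarily the ones in the repo:

```python
# Minimal sketch of the n-gram decomposition: wrap each word in start/end
# symbols, then collect every n-gram for n = 1..5.
START, END = "^", "$"

def ngram_features(word: str, max_n: int = 5) -> list[str]:
    padded = START + word.lower() + END
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# "Mädchen" (neuter) yields, among others, the end-anchored gram "chen$".
print(ngram_features("Mädchen")[:10])
```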

I tallied the frequency of each n-gram for each gender, and compared it to the overall distribution of the three genders across German nouns. There’s an assumption baked in here: that the distribution of n-grams across genders should mirror the distribution of words across genders. There’s no reason that should be the case: words are not a random collection of letters, after all. Nevertheless, I figured it may be a good way to map out outliers, where the distribution is clearly skewed towards one gender. For each n-gram, I calculated a p-value using a Chi-square test for a difference in distributions, and used it purely to rank the n-grams, since the number itself is not interpretable in any meaningful manner.
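As a sketch of that ranking step (the gender shares and counts below are illustrative, not the actual corpus numbers):

```python
# Rank an n-gram by how strongly its gender distribution deviates from the
# corpus-wide gender distribution; smaller p-values mean a more skewed n-gram.
from scipy.stats import chisquare

def skew_pvalue(counts: dict, overall: dict) -> float:
    """counts: observed gender counts for one n-gram; overall: corpus-wide shares."""
    total = sum(counts.values())
    observed = [counts.get(g, 0) for g in ("m", "f", "n")]
    expected = [overall[g] * total for g in ("m", "f", "n")]
    return chisquare(f_obs=observed, f_exp=expected).pvalue

overall = {"m": 0.35, "f": 0.43, "n": 0.22}             # illustrative shares
print(skew_pvalue({"m": 2, "f": 5, "n": 93}, overall))  # a "chen$"-like gram
```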

Model selection and refinement

Interpretability matters to me, so I chose a decision tree to build the first pass of the classification model. I also chose a subset of 10,000 nouns to create the first model (with a 60:20:20 training:test:validation split), because I knew that using n-grams from the full list of nouns to build a large number of features could tank my CPU. In addition, I was wary of overfitting the decision tree, given the number of features I was going to use.
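A minimal sketch of that first pass, reusing the ngram_features helper above; the handful of nouns here are stand-ins for the 10,000-noun subset, and the variable names are mine rather than the repo’s:

```python
# First-pass model: binary n-gram indicator features, a 60:20:20 split, and an
# unpruned decision tree.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

nouns = ["Mädchen", "Hund", "Blume", "Auto", "Tisch", "Lampe", "Buch", "Garten", "Zeitung", "Tür"]
genders = ["n", "m", "f", "n", "m", "f", "n", "m", "f", "f"]

# One binary column per observed n-gram.
X = DictVectorizer().fit_transform([{g: 1 for g in ngram_features(w)} for w in nouns])

X_train, X_tmp, y_train, y_tmp = train_test_split(X, genders, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test), "| depth:", tree.get_depth())
```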

The accuracy of the first model on test data was around 70%, compared to the naive benchmark accuracy of 43% (i.e., just assigning the highest-frequency gender to every noun). The initial tree was unsurprisingly quite deep (more than 100 levels), and I chose to experiment with both pre- and post-pruning techniques to refine its hyperparameters.
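That naive benchmark amounts to always predicting the most frequent gender, which scikit-learn’s DummyClassifier makes explicit (reusing the split from the sketch above):

```python
# Majority-class baseline: always predict the most frequent gender in training.
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```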

For pre-pruning, I used various combinations of maximum tree depth, minimum samples to split, and minimum samples in a leaf; for post-pruning, I used cost-complexity pruning, a technique that penalizes each additional leaf. Ultimately, cost-complexity pruning with a very small “alpha” (the per-leaf penalty) outperformed any combination of pre-pruning parameters: 80.4% accuracy versus a best of 74.0%.
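A sketch of the post-pruning loop, assuming the alpha is picked on the validation split (the exact search used in the repo may differ):

```python
# Sweep the candidate alphas from the cost-complexity pruning path and keep the
# tree that scores best on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = pruned.score(X_val, y_val)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("pruned test accuracy:", final_tree.score(X_test, y_test))
```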

Random forests did not perform significantly differently from the post-pruned decision tree. Even after extensive hyperparameter tuning, the accuracy on test data hovered around 80.2%.
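For reference, the tuning could look roughly like the sweep below; the grid of values is a guess at reasonable settings, not the one I actually searched:

```python
# Illustrative random-forest sweep, selecting on the validation set.
from itertools import product
from sklearn.ensemble import RandomForestClassifier

best_forest, best_acc = None, 0.0
for n_est, leaf in product([100, 300], [1, 5]):
    forest = RandomForestClassifier(n_estimators=n_est, min_samples_leaf=leaf,
                                    random_state=0).fit(X_train, y_train)
    acc = forest.score(X_val, y_val)
    if acc > best_acc:
        best_forest, best_acc = forest, acc

print("forest test accuracy:", best_forest.score(X_test, y_test))
```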

An important addendum here is that I resorted to drastic feature reduction by keeping only the n-grams at the end of each word, i.e., those that included the special “end” character. I stumbled upon this both out of necessity and luck. It was necessary because using p-values to reduce features wasn’t enough to fit a model within the confines of my CPU, and it was luck because my wife had a gut instinct that only the final syllable or two matter.
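In code, that reduction is just a filter over the n-grams from the earlier sketch, keeping only word-final grams:

```python
# Keep only the n-grams that carry the end-of-word symbol, i.e. word-final grams.
def end_ngrams(word: str, max_n: int = 5) -> list[str]:
    return [g for g in ngram_features(word, max_n) if g.endswith(END)]

print(end_ngrams("Mädchen"))  # ['$', 'n$', 'en$', 'hen$', 'chen$']
```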

Finally, there’s an inherent difference between the samples this model is trained on (German words) and what it’s being asked to predict (English words). My hypothesis is that there’s something within the structure of German words that informs gender assignment for all other nouns. I could very well be wrong here, though.

Discussion

Why did deeper trees perform better? And why did the best random forest not outperform the best decision tree? 

Simply because decision trees are better at uncovering “rules”. Consider a hypothetical language where every word’s gender is assigned strictly by a system of rules based on the last three letters of the word. In theory, a decision tree could achieve 100 percent accuracy on such data. The tree would also be exhaustive, reaching its maximum possible depth. A random forest would, at best, perform as well as that decision tree, since it cannot improve on a tree that has already recovered the rules exactly.
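Here’s a toy version of that argument, with a made-up language whose gender is a deterministic function of the last three letters; the alphabet and rule table are invented purely for illustration:

```python
# Toy rules-based language: gender is a fixed function of the last three
# letters, so a decision tree over those letters fits the data perfectly.
import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

random.seed(0)
letters = "abcdefghij"
rule = {}  # last-three-letters -> gender, drawn once, then applied consistently

def make_word():
    w = "".join(random.choices(letters, k=random.randint(4, 8)))
    rule.setdefault(w[-3:], random.choice(["m", "f", "n"]))
    return w, rule[w[-3:]]

words, genders = zip(*(make_word() for _ in range(5000)))
X = DictVectorizer().fit_transform([{"suffix": w[-3:]} for w in words])

clf = DecisionTreeClassifier().fit(X, list(genders))
print(clf.score(X, list(genders)))  # 1.0: the rules are fully recoverable
```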

My hypothesis, therefore, is that roughly 80 percent of noun gender assignment is rules-based.  

What’s next?

There are two avenues I want to explore to increase accuracy:

(i) I will use an LLM to break each word into syllables and use those syllables as features instead of n-grams.

(ii) I will build a neural network to see if I can replicate the “intuition” that many native speakers of a language have when assigning a gender to a new noun.

Oh, and in the meantime, if you’re interested in feeding new nouns to my algorithm and seeing what gender it spits out, here’s the link again.

