Linguistic analysis of my PhD thesis

Image credit: Olaf Lipinski

Analysing my PhD thesis with Python

I recently saw someone post about the statistics of their PhD thesis, and I thought I could join in on the fun, but with an emergent communication twist! I will analyse my own thesis with some linguistics-inspired metrics and see what interesting patterns emerge.

In this post I will:

  • Start with some simple overall stats
  • Analyse the thesis section by section
    • This is very interesting in my opinion!
    • You can see all the metrics change significantly across different sections
  • Compare my work to Shakespeare’s writing
    • Just for fun, to see how we stack up!

I use several readability metrics including “Flesch Reading Ease”, “SMOG Index”, “Flesch-Kincaid Grade Level”, and “Automated Readability Index”, which can be summarized as:

  • Flesch Reading Ease: Higher scores indicate easier readability (100 is very easy, 0 is very difficult)
  • SMOG Index: Estimates the years of education needed to understand a text
  • Flesch-Kincaid Grade Level: Corresponds to US grade level required to comprehend the text
  • Automated Readability Index: Similar to Flesch-Kincaid, indicates US grade level needed
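
If you want to compute these scores yourself, here is a minimal sketch using the textstat package (one option; the full pipeline is in the code linked at the top, and the file path here is hypothetical):

```python
# Computing the four readability metrics with `textstat`.
import textstat

with open("thesis.txt", encoding="utf-8") as f:  # hypothetical cleaned-text file
    text = f.read()

print("Flesch Reading Ease:        ", textstat.flesch_reading_ease(text))
print("SMOG Index:                 ", textstat.smog_index(text))
print("Flesch-Kincaid Grade Level: ", textstat.flesch_kincaid_grade(text))
print("Automated Readability Index:", textstat.automated_readability_index(text))
```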
Click here for readability score breakdowns/ranges!

Readability Score Interpretations

  • Flesch Reading Ease:

    • 90-100: Age 11 (UK Year 6/US 5th grade) - very easy
    • 80-90: Age 12 (UK Year 7/US 6th grade) - easy
    • 70-80: Age 13 (UK Year 8/US 7th grade) - fairly easy
    • 60-70: Ages 14-15 (UK Years 9-10/US 8th-9th grade) - plain English
    • 50-60: Ages 16-18 (UK Years 11-13/US 10th-12th grade) - fairly difficult
    • 30-50: University level (difficult)
    • 0-30: Postgraduate level (very difficult)
  • SMOG Index:

    • 6-10: Up to GCSE level (UK)/High school sophomore (US) - accessible to general public
    • 11-12: A-level (UK)/High school completion (US)
    • 13-16: Undergraduate degree level
    • 17+: Postgraduate/professional level
  • Flesch-Kincaid Grade Level:

    • 1-6: Primary school (UK)/Elementary school (US)
    • 7-9: Lower secondary (UK Years 7-9)/Junior high (US)
    • 10-12: Upper secondary (UK Years 10-13)/High school (US)
    • 13-16: Undergraduate level
    • 17+: Postgraduate level
  • Automated Readability Index:

    • 1-6: Primary school (UK)/Elementary school (US)
    • 7-9: Lower secondary (UK Years 7-9)/Junior high (US)
    • 10-12: Upper secondary (UK Years 10-13)/High school (US)
    • 13-16: Undergraduate level
    • 17+: Postgraduate level

Join me on this journey through my thesis statistics! You can also generate your own analysis using the code linked at the top of this post, with some small modifications (feel free to leave a comment or suggest changes on GitHub!).

All the plots here are interactive! Hover over bars, lines, and points to find out more! You can even zoom in, rotate, and explore other amazing features with these plots. I’m experimenting with Plotly, which I think is a great alternative to static images. I’m looking forward to finding more productive uses for their interactivity beyond just thesis statistics.
This analysis is mostly for fun, so while I tried to make the plots accurate, they are almost certainly not perfect. For one, I only performed cursory text sanitization, so there are likely still LaTeX commands in the text affecting the statistics. You may notice this in some plots, where we get slightly negative values, or common “words” like “ssec”, which I used for cross-referencing subsections. For any frequency plots, shorter text sections can inflate frequencies due to the smaller denominator when calculating relative usage. Additionally, the code was written with assistance from Claude 3.7 Sonnet, so there may be unforeseen issues.
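
For a flavor of what cursory sanitization looks like, here is a rough regex-based sketch. It is not the actual pipeline (which, as noted, still let tokens like “ssec” slip through), just an illustration of the approach:

```python
import re

def sanitize_latex(text: str) -> str:
    """Very rough LaTeX stripping: a sketch, not the pipeline behind the plots."""
    text = re.sub(r"(?<!\\)%.*", "", text)                  # strip comments
    text = re.sub(r"\$\$?[^$]*\$\$?", " ", text)            # drop inline/display math
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)     # drop environment markers
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])*(\{[^}]*\})*", " ", text)  # drop \commands
    return re.sub(r"[{}]", "", text)                        # remove leftover braces
```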

Overall statistics

Let’s see what we’re working with. These are the stats extracted with the pipeline I used for the plots. Interestingly, my pipeline counts about 6,000 more words than Overleaf’s word count does.

General Statistics

  Total Words                    33,833
  Unique Words                   4,753
  Lexical Diversity              0.140
  Average Sentence Length        26.99 words
  Number of Sections             30

Readability Scores

  Flesch Reading Ease            31.62
  SMOG Index                     16.4
  Flesch-Kincaid Grade Level     14.5
  Automated Readability Index    17.2

Top 15 Words (by frequency)

  temporal         326
  agents           284
  language         180
  message          168
  communication    162
  messages         151
  emergent         145
  agent            131
  references       120
  trg              112
  integer          111
  compositional    101
  sec              100
  target            95
  ngram             89

Distributions

I also plotted the word distributions in a nice histogram, which some might find easier to read than a table.
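
For reference, a chart like this takes only a few lines with Plotly. This sketch assumes the cleaned `text` from the sanitization snippet above and skips the stop-word filtering a real pipeline would need:

```python
from collections import Counter

import plotly.express as px

# Count word frequencies; a real pipeline would first drop stop words like
# "the" so that the interesting terms rise to the top.
words = [w.lower() for w in text.split() if w.isalpha()]
top_words, counts = zip(*Counter(words).most_common(15))

fig = px.bar(x=list(top_words), y=list(counts),
             labels={"x": "Word", "y": "Frequency"},
             title="Top 15 words")
fig.show()
```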

Using SpaCy, I could also identify different parts of speech used in my thesis! As you can see from my distribution chart, my thesis is heavy on nouns, which is typical for academic writing - lots of concepts and things to discuss!

For a quick overview of parts of speech terminology, click here!

Parts of Speech Explained

When I analyzed my thesis with SpaCy, it tagged every word with these parts of speech:

  • NOUN: Entities, concepts, or objects (e.g., “language,” “agents,” “communication”)
  • VERB: Words expressing actions, processes, or states (e.g., “analyze,” “develop,” “communicate”)
  • PROPN: Proper nouns referring to specific entities or names (e.g., “Shakespeare,” “Python,” “NeurIPS”)
  • ADJ: Adjectives modifying nouns by describing attributes or qualities (e.g., “temporal,” “compositional,” “emergent”)
  • NUM: Numerals representing quantities or ordinal positions (e.g., “one,” “third,” “16.4”)
  • ADV: Adverbs modifying verbs, adjectives, or other adverbs (e.g., “significantly,” “mostly”)
  • PUNCT: Punctuation marks
  • X: Miscellaneous category for elements not classified elsewhere
  • ADP: Adpositions expressing spatial, temporal, or logical relationships (e.g., “in,” “of,” “through”)
  • SYM: Symbols representing concepts rather than linguistic content (e.g., mathematical notation)
  • INTJ: Interjections expressing emotional reactions or sentiments
  • CCONJ: Coordinating conjunctions connecting equivalent syntactic elements (e.g., “and,” “or,” “but”)
  • AUX: Auxiliary verbs supporting the main verb’s tense, mood, or voice (e.g., “is,” “have,” “can”)
  • PART: Particles fulfilling grammatical functions without clear lexical meaning (e.g., “to” in infinitives, “not”)
  • PRON: Pronouns substituting for nouns or noun phrases (e.g., “I,” “they,” “this”)
  • SCONJ: Subordinating conjunctions introducing dependent clauses (e.g., “if,” “because,” “while”)
  • DET: Determiners specifying referential properties of nouns (e.g., “the,” “a,” “these”)

The proportions of these different word types help characterize my writing style and differentiate technical sections from more accessible ones.
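
The tagging step itself is short. A minimal sketch using SpaCy’s small English model (installed once with `python -m spacy download en_core_web_sm`):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # the cleaned thesis text from earlier

# Count how often each part-of-speech tag appears.
pos_counts = Counter(token.pos_ for token in doc)
for pos, count in pos_counts.most_common():
    print(f"{pos:6} {count}")
```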

Section by section

Let’s start with a simple word count per section. As expected, my literature review is the largest section of all. The other significant spikes correspond to experimental sections and, my favorite, the appendices! My supervisors often commented that I have a tendency to pack the appendices with additional data.

You can also see the word frequency per section, per 1,000 words, in heatmap format.

Next, let’s examine how the most frequently used words vary across sections. As expected, “temporal” peaks in the experimental chapters of my thesis where I discuss temporality extensively!

Now, let’s look at the sentiment analysis per section. Interestingly, my chapter on the game of Werewolf appears to be more polarizing and subjective. This could be due to the frequent use of words like “villagers” and “voting,” which sentiment algorithms might associate with political discourse.
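
As an illustration, polarity and subjectivity scores like these can be computed with TextBlob, one common choice (a minimal sketch, with hypothetical section texts, not necessarily the exact setup in my linked code):

```python
from textblob import TextBlob

# `sections` maps section titles to cleaned text; hypothetical contents here.
sections = {"Introduction": "...", "Werewolf": "..."}

for name, body in sections.items():
    sentiment = TextBlob(body).sentiment
    print(f"{name}: polarity={sentiment.polarity:+.3f}, "
          f"subjectivity={sentiment.subjectivity:.3f}")
```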

We can also examine lexical diversity across the thesis. These metrics may be slightly skewed due to incomplete text sanitization. It’s interesting to see that lexical diversity peaks in the introduction of the chapter based on my NeurIPS paper.

Click for an explanation of lexical diversity!

Lexical Diversity

Lexical diversity measures how varied your vocabulary is - basically, are you using the same words over and over, or changing it up? The simplest way to calculate this is by dividing unique words by total words. My thesis scored 0.140, which means about 14% of all words are unique. This might seem low, but academic writing typically scores lower than fiction because we keep using the same technical terms (like how “temporal” appears 326 times in my thesis!). Different sections have different diversity scores depending on whether they are introducing broad concepts or diving into technical details.

In this post, I use three measures of lexical diversity:

  • TTR (Type-Token Ratio): This is that simple ratio I mentioned (unique words divided by total words). While straightforward, TTR has a major drawback - it decreases as text length increases, since you naturally run out of new words to use as you write more.

  • Moving Window TTR: This addresses the text length problem by calculating TTR within fixed-size windows (say, 100 words each) and then averaging the results. This gives a more consistent measure across texts of different lengths, which is important when comparing sections that vary in size.

  • MTLD (Measure of Textual Lexical Diversity): This is a more advanced metric that calculates how many times the TTR drops below a threshold (usually 0.72) when moving through the text. MTLD gives us a factor score indicating how many times we can divide the text while maintaining that TTR threshold. Higher MTLD values indicate greater lexical diversity. This metric is much less sensitive to text length than raw TTR, making it great for comparing my thesis sections, which range from brief introductions to lengthy literature reviews.

In my analysis, I’ve used a combination of these metrics to get a comprehensive view of how my vocabulary varies throughout different thesis sections. The spikes in lexical diversity often correlate with sections where I’m introducing new concepts or discussing broader implications, whereas the more technical sections tend to reuse the same terminology consistently (which is actually good practice in academic writing - consistency is key!).
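
Here is a minimal sketch of all three measures. The 100-word window and 0.72 threshold match the values mentioned above; note that this MTLD is the simple forward pass only, whereas the full metric averages forward and backward passes:

```python
def ttr(words):
    """Type-Token Ratio: unique words divided by total words."""
    return len(set(words)) / len(words)

def moving_window_ttr(words, window=100):
    """Average TTR over consecutive fixed-size windows."""
    if len(words) < window:
        return ttr(words)
    chunks = [words[i:i + window]
              for i in range(0, len(words) - window + 1, window)]
    return sum(ttr(c) for c in chunks) / len(chunks)

def mtld(words, threshold=0.72):
    """Forward-pass MTLD: count how many 'factors' the text splits into
    before TTR falls to the threshold, then divide length by that count."""
    factors, start = 0.0, 0
    for i in range(1, len(words) + 1):
        if ttr(words[start:i]) <= threshold:
            factors += 1
            start = i
    if start < len(words):  # credit the leftover partial factor
        factors += (1 - ttr(words[start:])) / (1 - threshold)
    return len(words) / factors if factors else float("inf")
```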

We can quantify the readability of each section using established metrics. Again, the values may be slightly inflated due to text sanitization issues, but I’m pleased to see that the conclusions and introduction score well on readability, as these should be the most accessible sections. The technical sections score lower, as expected, likely due to their density of technical jargon.

Academic-ness of my writing

Let’s examine some common academic writing patterns: passive voice usage, sentence length, lexical diversity, and first-person pronouns. As expected, we can see that technical sections tend to use more passive voice. Interestingly, my architecture descriptions appear to have notably shorter sentences!
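
One simple way to flag passive sentences is to look for SpaCy’s passive-subject and passive-auxiliary dependency labels. The detector below is a minimal sketch, an approximation rather than a definitive one:

```python
def passive_ratio(doc):
    """Fraction of sentences containing a passive construction."""
    sentences = list(doc.sents)
    passive = sum(
        1 for sent in sentences
        if any(tok.dep_ in ("nsubjpass", "auxpass") for tok in sent)
    )
    return passive / len(sentences) if sentences else 0.0

print(f"Passive sentences: {passive_ratio(doc):.1%}")  # `doc` from the POS step
```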

I also decided (with help from Claude) to analyze whether I overuse specific “academic” terms (a sketch of the per-section counting follows the list), including:

  • “analysis”
  • “research”
  • “data”
  • “method”
  • “results”
  • “theory”
  • “approach”
  • “study”
  • “model”
  • “framework”
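
A minimal sketch of the per-section counting, normalized per 1,000 words and reusing the hypothetical `sections` mapping from the sentiment sketch above:

```python
ACADEMIC_TERMS = ["analysis", "research", "data", "method", "results",
                  "theory", "approach", "study", "model", "framework"]

for name, body in sections.items():
    words = body.lower().split()
    if not words:
        continue
    # Occurrences per 1,000 words, so sections of different lengths compare fairly.
    rates = {t: 1000 * words.count(t) / len(words) for t in ACADEMIC_TERMS}
    print(name, {t: round(r, 2) for t, r in rates.items()})
```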

It’s interesting to see that I apparently use “method” frequently in the conclusions of my NeurIPS paper. However, this observation comes again with the caveat that shorter text sections can produce less reliable frequencies.

Comparing my writing with Shakespeare

An academic-looking person shaking hands with William Shakespeare
Academics v Shakespeare. (Generated with DALL-E 3)

I thought it might be interesting to compare the statistics from my thesis to Shakespeare’s writing. Fortunately, Andrej Karpathy has compiled a tiny Shakespeare dataset, which I used for this comparison.
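
Loading it takes just a few lines; the URL below is the canonical one from his char-rnn repository:

```python
import urllib.request

# Karpathy's tiny Shakespeare dataset (about 1 MB of plain text).
URL = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")

with urllib.request.urlopen(URL) as response:
    shakespeare = response.read().decode("utf-8")

print(shakespeare[:100])  # peek at the first few lines
```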

Let’s start with sentiment analysis. I’m pleased to see that my thesis is both less polarizing and less subjective than Shakespeare’s works—suggesting at least some degree of scientific rigor!

We can also see that my sentences are somewhat longer, and my passive voice usage is significantly higher than Shakespeare’s.

Somewhat surprisingly, my thesis is less readable than Shakespeare’s writing according to every metric. (Note that for Flesch Reading Ease, higher scores indicate better readability, while for the other metrics, lower scores are better.) This is rather humbling, considering Shakespeare uses archaic terms like “thou” which aren’t particularly accessible to modern readers. But then again, he was a master playwright after all…

At least I can claim slightly higher lexical diversity! Though I’m concerned this might also be influenced by LaTeX commands that weren’t fully removed during text sanitization.

Finally, let’s examine the top words used by Shakespeare and myself. Unsurprisingly, there’s virtually no overlap. Emergent communication wasn’t particularly popular in the late 16th and early 17th centuries when Shakespeare was writing.

Summary

While this post is primarily for fun, I think it nicely showcases a few important points. First, text sanitization is surprisingly challenging and time-consuming—I now have much greater appreciation for data preparation specialists. Second, there are numerous fascinating approaches to analyzing text and, by extension, language! Exploring these different metrics and seeing how they compare has been super fun. This kind of analysis is precisely what made me pursue emergent communication and linguistics in my PhD, and I hope I’ve managed to convey some of that enthusiasm to you, dear reader!

Olaf Lipinski
Researcher in Artificial Intelligence

Making sure AI actually helps in real-world scenarios.