Words, words, words. So many words, so little time!

Text Simplifier

A work-in-progress idea.

What is a text simplifier?

After writing one too many English essays that looked more like a thesaurus’ index page than something I was submitting for a grade, I decided my teachers might need something to help them read the unintelligible page of words left on their desk.

Lexiphanic • /ˌlɛksɪˈfænɪk/

adj. Using or characterized by pompous, ostentatious, or bombastic language; overly grandiloquent.

Synonyms: bombastic, grandiose, purple, florid, sesquipedalian
Antonyms: simple, direct, plain, concise

How is this going to work?

There are plenty of ways to simplify text, but I decided to do something more fun. Here’s our example sentence:

The ressentiment of the weak, being an internal fecundation of a venomous passion, finds its expression in the transvaluation of values.

On the Genealogy of Morality by Friedrich Nietzsche

Now for each word we are going to run an algorithm. I thought, if we used a thesaurus to get here, why not use one to get back out?

This algorithm picks the highest-ranked synonym of each word, then the highest-ranked synonym of that synonym, and so on. Once the chain loops back onto a word it has already visited, it chooses whatever word was last in that loop. Seems to work pretty well, huh? Nope!
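Here is a minimal sketch of that synonym-chasing loop. The `THESAURUS` dictionary and the function name `follow_top_synonyms` are made up for illustration; a real version would look words up in an actual thesaurus, and its rankings would differ.

```python
# Hypothetical toy thesaurus: each word maps to its synonyms, best-ranked first.
# These entries are invented for the example, not taken from a real thesaurus.
THESAURUS = {
    "loquacious": ["talkative", "garrulous"],
    "talkative": ["chatty", "loquacious"],
    "chatty": ["talkative", "gossipy"],
    "big": ["large", "huge"],
    "large": ["sizable", "big"],
    "sizable": ["large", "considerable"],
}


def follow_top_synonyms(word, thesaurus):
    """Keep hopping to the top-ranked synonym until the chain would repeat."""
    seen = set()
    current = word
    while True:
        seen.add(current)
        synonyms = thesaurus.get(current, [])
        if not synonyms or synonyms[0] in seen:
            # Dead end, or the next hop revisits a word: the loop has closed,
            # so the current word is the last one in it.
            return current
        current = synonyms[0]


print(follow_top_synonyms("loquacious", THESAURUS))  # chatty
print(follow_top_synonyms("big", THESAURUS))         # sizable
```

With this toy data the fancy word lands somewhere friendlier, but the plain word "big" ends up as "sizable", which is exactly the failure mode to watch for.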

As you can see, especially for simple words, this algorithm makes them more complex! That is the opposite of what we want. So how can we ensure we always land on less complex words? Well, luckily someone has an answer: word frequency analysis.

It turns out (and this isn’t all too surprising) that words that come up more frequently in a large body of English text are also better known. The word frequency list I’ve decided to use is called SUBTLEX-US, which is built from the subtitles of American films and TV series.

This data set has a value called the Zipf value (you can learn more on Wikipedia!). In short, the Zipf value is a word’s frequency measured on a base-10 log scale that runs from roughly 1 to 7, where 1 is the lowest frequency and 7 is the highest.
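For the curious, the Zipf scale is usually defined as the base-10 log of how often a word occurs per billion words. The sketch below computes that from a raw count; the function name `zipf_value` is my own, and the published SUBTLEX-US values also apply a small smoothing correction, so treat this as an approximation rather than the data set's exact formula.

```python
import math


def zipf_value(occurrences, corpus_size):
    """Return log10 of a word's frequency per billion words (unsmoothed)."""
    if occurrences <= 0 or corpus_size <= 0:
        raise ValueError("occurrences and corpus_size must be positive")
    per_billion = occurrences / corpus_size * 1_000_000_000
    return math.log10(per_billion)


# A word seen once per million words sits at Zipf 3;
# a word seen once per thousand words sits at Zipf 6.
print(round(zipf_value(1, 1_000_000), 2))      # 3.0
print(round(zipf_value(1_000, 1_000_000), 2))  # 6.0
```

Because the scale is logarithmic, each extra point means the word is about ten times more common.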

Here are some example words:

word        Zipf
and         7.1
there       6.6
apple       4.4
loquacious  2.2
talkative   3.0
garrulous   2.1
chatty      3.0

Pretty cool! Let’s make this more practical.

White represents best synonym; red represents second-best synonym

Now our algorithm works just like before, but if the result has a lower Zipf score than the initial word, we try again, this time starting from the second-best synonym of our first word. If that also fails, we can safely assume our initial word is okay and leave it as it is.
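A rough sketch of that retry logic might look like the following. The Zipf scores are the ones from the table above, but the `THESAURUS` entries, the function names, and the choice to treat unlisted words as having a score of 0 are all assumptions made for the example.

```python
# Zipf scores taken from the example table in this post.
ZIPF = {"loquacious": 2.2, "talkative": 3.0, "garrulous": 2.1, "chatty": 3.0}

# Hypothetical thesaurus rankings, invented to show the fallback in action.
THESAURUS = {
    "loquacious": ["garrulous", "talkative"],
    "garrulous": ["loquacious"],
    "talkative": ["loquacious", "chatty"],
    "chatty": ["talkative"],
}


def follow_chain(start, thesaurus, seen):
    """Hop to the top-ranked synonym until the chain would revisit a word."""
    current = start
    while True:
        seen.add(current)
        synonyms = thesaurus.get(current, [])
        if not synonyms or synonyms[0] in seen:
            return current
        current = synonyms[0]


def simplify(word, thesaurus, zipf):
    """Try the best synonym's chain, then the second-best, else keep the word."""
    original_score = zipf.get(word, 0.0)  # assumption: unknown words score 0
    for first_hop in thesaurus.get(word, [])[:2]:
        candidate = follow_chain(first_hop, thesaurus, seen={word})
        # Reject any result that is rarer than the word we started with.
        if zipf.get(candidate, 0.0) >= original_score:
            return candidate
    return word


print(simplify("loquacious", THESAURUS, ZIPF))  # talkative
print(simplify("talkative", THESAURUS, ZIPF))   # chatty
```

In both calls the best synonym leads to "garrulous", which is rarer than the starting word, so the second-best synonym gets its turn and wins.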

There is still just one problem with our algorithm:

What!? If we made that change we would completely change the semantics of our statement. Let’s fix this by implementing two things:

  1. A minimum delta (e.g. if the selected word’s Zipf score is less than 0.5 higher than the original word’s, we won’t change it, for safety)
  2. A maximum difficulty (if the original word’s Zipf score is already at or above this value, it is well known and there is no need to change it)

We can change these values later, but my guesstimates would be 0.5 for the minimum delta and 6 for the maximum difficulty.
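As a sketch, the two guards could be a single check that runs before any swap. The constant and function names here are hypothetical, the values are the guesstimates above, and reading "maximum difficulty" as "leave anything with a Zipf score of 6 or more alone" is my own interpretation.

```python
MIN_DELTA = 0.5       # guesstimate from above: required Zipf improvement
MAX_DIFFICULTY = 6.0  # guesstimate from above: words this common are left alone


def should_replace(original_zipf, candidate_zipf,
                   min_delta=MIN_DELTA, max_difficulty=MAX_DIFFICULTY):
    """Decide whether swapping in the candidate word is worth the risk."""
    if original_zipf >= max_difficulty:
        # The original is already well known, so leave it alone.
        return False
    # Only swap when the candidate is clearly more common than the original.
    return candidate_zipf - original_zipf >= min_delta


# Scores from the example table: loquacious 2.2, talkative 3.0,
# chatty 3.0, there 6.6.
print(should_replace(2.2, 3.0))  # True: a 0.8 jump clears the minimum delta
print(should_replace(3.0, 3.0))  # False: talkative to chatty gains nothing
print(should_replace(6.6, 7.1))  # False: "there" is already well known
```

Keeping both thresholds as parameters makes it easy to tune them later without touching the rest of the algorithm.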

Okay, now all we have to do is run this algorithm for every word in our sentence, and that is an exercise for the reader! Keep an eye on my projects page, because I might be adding this one there soon!

Until our next floccinaucinihilipilificatious foray,
~Ilan Bernstein