Readability and Complexity

What do we mean when we talk about readability?

Readability is a broad concept that expresses the ease with which a reader can understand a text. It depends on the audience as well as on the text’s content, quality, legibility, formatting, and design structure, so its definition shifts with the audience and the type of content being presented. In “natural language” (any language that arises in a human community through use, repetition, and change, without conscious planning or premeditation), the readability of a text depends on its content, meaning the complexity of its vocabulary and syntax, and on its presentation, meaning the typographic aspects that affect legibility, such as font size, line height, character spacing, and line length. Research on readability dates back to the 19th century, and several ways of measuring it have been proposed since then. For a detailed list of the different approaches and some of their history, we refer to the Wikipedia entry on readability.

Here, we describe our module for text readability and complexity: a set of statistical measures that indicate how readable or complex a text is. The module includes several metrics, each providing insight into the text’s readability and the level of education required to understand it, along with a few consensus metrics. All of them rely on statistical properties of the text: number of words, number of characters, number of difficult words, etc.

The final metric we report combines all the readability tests into a single score by majority vote.

Consensus Metrics:

Text Standard: A consensus measure that integrates all the other readability tests (except other consensus metrics) to provide a comprehensive assessment of the text’s complexity. It represents the most common (mode) US-grade level required to understand the text, ranging from 0 to around 20, with higher scores indicating more complexity.
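
The majority vote can be sketched as follows. This is a minimal illustration that assumes each test has already been rounded to an integer US grade level; the function name consensus_grade, the sample values, and the tie-breaking rule are hypothetical, and the underlying package may round and break ties differently.

```python
from collections import Counter

def consensus_grade(grades):
    """Return the most common (mode) grade level among the individual tests."""
    if not grades:
        raise ValueError("need at least one grade estimate")
    counts = Counter(grades)
    top = max(counts.values())
    # Assumption: ties are broken by taking the lowest tied grade.
    return min(g for g, c in counts.items() if c == top)

# Hypothetical rounded grade estimates from six tests on the same text.
print(consensus_grade([8, 9, 8, 10, 8, 9]))  # 8
```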

Our metric: We normalize and invert Text Standard into a metric that ranges from 0 to 100, where 0 is the hardest text and 100 the easiest.
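
One way to picture the normalization and inversion is the sketch below. It assumes a simple linear mapping with the grade clamped to the range 0 to 20; the function name readability_score and the max_grade cutoff are illustrative assumptions, not the module's confirmed parameters.

```python
def readability_score(grade, max_grade=20.0):
    """Map a grade level (0 = easiest) to a 0-100 score (100 = easiest)."""
    # Clamp first, since grade estimates can fall slightly outside the range.
    clamped = min(max(float(grade), 0.0), max_grade)
    # Invert so that a higher grade level yields a lower score.
    return 100.0 * (max_grade - clamped) / max_grade

print(readability_score(8))   # 60.0
print(readability_score(25))  # 0.0
```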

Standardized metrics:

Gunning Fog Index: A readability metric that emphasizes sentence complexity and the use of complex words. It calculates the average number of words per sentence and the proportion of complex words in the text. For more info, check Wikipedia.
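
As a rough illustration, the commonly published form of the index is 0.4 times the sum of the average sentence length and the percentage of complex words. The sketch below takes precomputed counts as input, because syllable counting is itself a heuristic; the function name and the example counts are hypothetical.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning Fog index from precomputed counts.

    complex_words: number of words with three or more syllables.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_sentence_length = total_words / total_sentences
    percent_complex = 100.0 * complex_words / total_words
    return 0.4 * (avg_sentence_length + percent_complex)

# 100 words in 5 sentences, 10 of them complex.
print(round(gunning_fog(100, 5, 10), 2))  # 12.0
```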

Automated Readability Index (ARI): This index estimates the grade level needed to comprehend the text. It considers the number of characters per word and the number of words per sentence. For more info, check Wikipedia.
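
Because ARI needs only character, word, and sentence counts, it can be sketched directly from raw text. The example below uses the standard published coefficients with a deliberately naive tokenizer (whitespace-separated words, sentences split on terminal punctuation); these tokenization choices are assumptions and will differ from what a full package does.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and sentence")
    # Count only letters and digits as characters.
    chars = sum(1 for ch in text if ch.isalnum())
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

sample = "The cat sat on the mat. It was happy."
print(round(automated_readability_index(sample), 2))  # -5.05
```

Very short, simple texts can yield values below zero, as in this example.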

Linsear Write Formula: This measure estimates the US grade level required to understand the text. It is computed on a 100-word sample: words of up to two syllables score one point, words of three or more syllables score three points, and the total is divided by the number of sentences before a final adjustment. For more info, check Wikipedia.
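
The procedure as described on Wikipedia can be sketched as below: easy words (two syllables or fewer) score one point, hard words (three or more syllables) score three points, the total is divided by the number of sentences, and the provisional result is halved, with one subtracted when it is 20 or less. The function takes precomputed counts for a 100-word sample; its name and the example counts are hypothetical.

```python
def linsear_write(easy_words, hard_words, total_sentences):
    """Linsear Write grade estimate from counts over a 100-word sample."""
    if total_sentences <= 0:
        raise ValueError("sentence count must be positive")
    points = easy_words * 1 + hard_words * 3
    provisional = points / total_sentences
    # Final adjustment depends on the provisional result.
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1

print(linsear_write(90, 10, 5))   # 12.0
print(linsear_write(95, 5, 10))   # 4.5
```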

Dale-Chall Score: A readability score that utilizes a specific list of 3000 commonly understood English words. It assesses the ratio of difficult words to total words and the average sentence length. For more info, check Wikipedia.
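
A sketch of the commonly cited (new) Dale-Chall formula is shown below: 0.1579 times the percentage of difficult words plus 0.0496 times the average sentence length, with 3.6365 added when more than 5% of the words are difficult. It takes precomputed counts, since the familiar-word list is not reproduced here; the function name and example counts are hypothetical.

```python
def dale_chall(total_words, total_sentences, difficult_words):
    """Dale-Chall score from precomputed counts.

    difficult_words: words not found in the familiar-word list.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    percent_difficult = 100.0 * difficult_words / total_words
    avg_sentence_length = total_words / total_sentences
    score = 0.1579 * percent_difficult + 0.0496 * avg_sentence_length
    # Adjustment applied only when difficult words exceed 5% of the text.
    if percent_difficult > 5.0:
        score += 3.6365
    return score

print(round(dale_chall(100, 5, 10), 2))  # 6.21
print(round(dale_chall(100, 5, 4), 2))   # 1.62
```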

Spache Formula: A readability score specifically designed for children’s literature, suitable for texts up to grade 4. For more info, check Wikipedia.

McAlpine EFLAW: Tailored for non-native English speakers, this score evaluates readability based on the number of ‘miniwords’ and sentence length. For more info, check the original post in the archive.
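
In its usual statement, the EFLAW score is the number of words plus the number of miniwords (words of three letters or fewer), divided by the number of sentences. The sketch below applies that definition with a naive tokenizer that keeps only runs of ASCII letters; the tokenization and the function name are assumptions made for illustration.

```python
import re

def mcalpine_eflaw(text):
    """EFLAW = (words + miniwords) / sentences; lower means easier."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and sentence")
    # Miniwords: words of three letters or fewer.
    miniwords = sum(1 for w in words if len(w) <= 3)
    return (len(words) + miniwords) / len(sentences)

print(mcalpine_eflaw("The cat sat on the mat. It was happy."))  # 8.5
```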

Reading Time: An estimate of the time required to read the text, calculated from an average reading speed of approximately 30 milliseconds per character. The original package indicates a default of about 15 ms per character, but we found empirically that this corresponds to a very fast reading speed.
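
The estimate amounts to a character count multiplied by a per-character duration. In the sketch below, whitespace is excluded from the count and the result is returned in seconds; both choices, as well as the function name, are assumptions for illustration.

```python
def reading_time_seconds(text, ms_per_char=30.0):
    """Estimate reading time in seconds from the character count."""
    # Assumption: whitespace does not count toward reading time.
    n_chars = sum(1 for ch in text if not ch.isspace())
    return n_chars * ms_per_char / 1000.0

print(reading_time_seconds("Hello world"))                    # 0.3
print(reading_time_seconds("Hello world", ms_per_char=15.0))  # 0.15
```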

Flesch Reading Ease: This score typically ranges from 0 to 100 (in theory it has no lower bound and a maximum of 121.22) and assesses how easy a text is based on words per sentence and syllables per word. Higher scores indicate easier reading. For more info, check Wikipedia.
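
The standard formula makes the 121.22 ceiling easy to verify: it is reached by a text made of one-word sentences with one syllable each. The sketch below takes precomputed counts, because syllable counting is a heuristic step of its own; the function name and the example counts are hypothetical.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """206.835 - 1.015 * words/sentence - 84.6 * syllables/word."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Theoretical maximum: a single one-syllable word forming one sentence.
print(round(flesch_reading_ease(1, 1, 1), 2))      # 121.22
# 100 words, 4 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 4, 150), 2))  # 54.56
```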

Polysyllabcount: A count of the polysyllabic words in the text, conventionally those with three or more syllables.

Monosyllable Count: A count of the monosyllabic (single-syllable) words in the text.

Difficult Words: A tally of words considered difficult, which can vary based on the specific criteria or word lists used.

Flesch-Kincaid Grade Level: Built on the same inputs as the Flesch Reading Ease, this metric estimates the U.S. school grade level required for comprehension, based on words per sentence and syllables per word. For more info, check Wikipedia.
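
For reference, the standard grade-level formula can be sketched from the same counts used by the Flesch Reading Ease. The function name and the example counts below are hypothetical.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# 100 words, 4 sentences, 150 syllables.
print(round(flesch_kincaid_grade(100, 4, 150), 2))  # 11.86
```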

Coleman-Liau Index: This index predicts the grade level needed to understand a text using the average number of letters per 100 words and the average number of sentences per 100 words. For more info, check Wikipedia.
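
The index is usually written as 0.0588 L - 0.296 S - 15.8, where L is the number of letters per 100 words and S the number of sentences per 100 words. A minimal sketch from precomputed counts follows; the function name and the example counts are hypothetical.

```python
def coleman_liau(total_letters, total_words, total_sentences):
    """Coleman-Liau index from precomputed counts."""
    if total_words <= 0:
        raise ValueError("word count must be positive")
    letters_per_100 = 100.0 * total_letters / total_words
    sentences_per_100 = 100.0 * total_sentences / total_words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# 450 letters, 100 words, 5 sentences.
print(round(coleman_liau(450, 100, 5), 2))  # 9.18
```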