Automatic diagnosis of Alzheimer’s disease using lexical features extracted from language samples
Original Article


M. Zakaria Kurdi

Department of Computer Science, University of Lynchburg, Lynchburg, VA, USA

Correspondence to: M. Zakaria Kurdi, PhD. Department of Computer Science, University of Lynchburg, 1501 Lakeside Drive, Lynchburg, VA, USA. Email:

Background: This study has a twofold goal. First, it aims to improve the understanding of the impact of Alzheimer’s disease (AD) dementia on different aspects of the lexicon. Second, it aims to demonstrate that such aspects of the lexicon, when used as features of a machine learning classifier, can help achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD.

Methods: The main dataset used is derived from the ADReSS challenge, which is part of the DementiaBank dataset. This dataset consists of transcripts of descriptions of the cookie theft picture, produced by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is 108 in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features is studied using both the training and testing parts of the main dataset. Then, machine learning experiments were conducted on the task of classifying transcribed speech samples, separating those produced by people with AD from those produced by healthy subjects. Several experiments were conducted to compare the different areas of lexical complexity, identify the subset of features that helps achieve optimal performance, and study the impact of the size of the input on the classification. To evaluate the generalization of the models built on narrative speech to new tasks, two generalization tests were conducted using written data from two British authors, Iris Murdoch and Agatha Christie, and the transcriptions of some speeches by former President Ronald Reagan.

Results: Using lexical features only, state-of-the-art classification performance, with F1 scores and accuracies of over 91%, was achieved in distinguishing language samples produced by individuals with AD from those produced by healthy control subjects. This confirms the substantial impact of AD on lexical processing. The task generalization tests show that the system scales well to a new task.

Conclusions: AD has a substantial impact on the lexicon. Hence, with lexical information only, one can achieve state-of-the-art classification of language samples produced by patients with AD vs. healthy controls.

Keywords: Alzheimer’s disease automatic diagnosis (AD automatic diagnosis); lexical complexity; lexical diversity; lexical density (LD); lexical sophistication

Received: 24 August 2023; Accepted: 25 March 2024; Published online: 24 June 2024.

doi: 10.21037/jmai-23-104

Highlight box

Key findings

• Alzheimer’s disease (AD) has a substantial impact on the usage of the emotional lexicon, as measured by lexicon-based sentiment analysis algorithms.

• It is also worth noting that the features that made it to the top list are distributed over all the lexical complexity areas. This distribution shows that all aspects of lexicon are affected by AD.

• With lexical information alone, spoken language samples can be classified with state-of-the-art accuracy.

What is known and what is new?

• It is well known that AD has a substantial impact on lexicon access.

• This paper confirmed that AD can be diagnosed with both transcribed spoken texts as well as edited literary texts, such as published novels.

• This paper showed what aspects of lexicon are most affected by AD.

• It also showed that with lexical information only, state-of-the-art classification of texts produced by patients with AD vs. normal controls can be achieved.

What is the implication, and what should change now?

• This work showed that a generic, task-independent system for diagnosing AD is achievable.

• It also showed that using the right size of language sample can help optimize AD diagnosis.


Alzheimer’s disease (AD) is a type of dementia: a progressive neurodegenerative disease that affects cognitive function, including language processing. One of the most significant language-related changes that occur in individuals with AD is a decline in their lexicon, which refers to the vocabulary and words they use to communicate (1). The loss of lexicon can be particularly challenging, as it can make it difficult for individuals with AD to express themselves clearly and to understand others. Early detection of AD is critical for ensuring timely and appropriate treatment, as well as for improving patient outcomes (2). In recent years, there has been a growing interest in using machine learning with different types of linguistic features as a means of detecting AD at an early stage. Such an approach has several advantages, as it is less intrusive, has virtually no side effects, and is cheaper than traditional approaches (3).

One promising approach is the use of lexical features, which are linguistic elements related to vocabulary and word usage. This study uses an automated analysis of transcribed speech as a screening tool for AD. Furthermore, two written datasets have been used to test the generalization of the built models to new tasks and to a different type of language (transcribed spoken language vs. written language).

Although there has been a plethora of works using lexical features of different types to detect AD and dementia in general, the lexicon has not been covered in enough depth and breadth. While current studies have shown promising results in using lexical features for detecting AD, there are still several gaps in our understanding of the impact of AD on lexicon processing. One major limitation of existing studies is that they often rely on a small number of lexical features, limiting the generalizability of their findings. Additionally, there is a lack of consistency in the types of lexical features that are measured and analyzed, making it difficult to compare results across studies.

To overcome these limitations, this study covers 99 lexical features. Some of these features have been used in previous studies about AD, such as the Brunet Index (BI) and the type token ratio (TTR). Other features have been used in other areas of research but have not been applied to AD detection, such as sentiment analysis, text focus, and knowledge depth. Finally, some new features are proposed in this paper, such as the pointwise mutual information (PMI) of word embeddings and the diversity of Ngrams. To the best of our knowledge, no such extensive lexical-level analyses have been carried out on bodies of linguistic production in AD. Using many lexical features within the same study gives the advantage of comparing the benefits of those features individually and identifying the optimal combinations of those features in AD classification. Hence, this paper combines two goals. First, to study the impact of AD on a large number of lexical complexity measures. Second, to show how these features, when grouped, can help detect AD based on a subject’s transcribed spoken production or written text. To achieve this second goal, an extensive evaluation is carried out to examine several key factors, such as the best set of features, the optimal input size, and the generalization of the trained models to a new task and language type.

AD in its different types affects a large proportion of seniors around the world. Recent studies have shown that early diagnosis of this disease may help delay its symptoms (2). Clinical diagnosis of AD is an intrusive procedure that is both expensive and stressful to patients. In contrast, recent studies have shown the potential of machine learning classification of linguistic features as a tool for detecting and monitoring cognitive decline in individuals with AD.

Three types of work have been conducted on dementia and AD and their impact on the linguistic abilities of patients. First, some theoretical works have been performed. For example, Hier et al. (1) used a standardized picture description task with 39 patients with dementia to show that lexical deficits tended to be more severe than syntactic ones. Other researchers conducted longitudinal studies of the novels of Iris Murdoch, who was diagnosed with AD after the publication of her last novel, and concluded that the lexical decline across her novels, as the author aged, was more substantial than the syntactic one (4,5). In a follow-up to the works of (4,5), Le et al. conducted a longitudinal analysis of the works of three British authors: Iris Murdoch; Agatha Christie, an author who was suspected of having dementia during her last years; and the novelist P. D. James, who aged healthily (6). They used several measures of lexical complexity, such as TTR, phrase repetition, and word-type introduction rate (WTIR), as well as knowledge from other linguistic levels, such as syntax and disfluencies. They concluded that signs of dementia can be found in diachronic analyses of patients’ writings, including lexical aspects of texts.

Inspired by theoretical findings like the ones above, several works were conducted to build machine learning classifiers to diagnose patients with AD based on their language production. Among the studies that investigated lexical features for automatic diagnosis of AD, the work by Fraser et al. (7) is worth mentioning. These authors examined the use of lexical richness measures, such as moving-average TTR (MATTR), TTR, mean length of utterance, BI, and Honoré’s statistic, to distinguish between positive AD cases and healthy controls. The results showed that lexical richness measures were effective in differentiating between the two groups, with individuals with AD exhibiting lower lexical richness scores. Orimaye et al. (8) developed several machine learning models on 198 language transcripts from the DementiaBank1. Those transcripts were evenly selected from 99 patients with probable AD (PrAD) and 99 healthy control subjects. They combined syntactic features, such as average dependencies per sentence and average number of predicates, with lexical features, such as word count, word repetitions, and unique words, and with the counts of specific Ngrams of words, like “the window”, “is open”, and “girl is”. They reported an area under the curve (AUC) score of 0.93 on the 1,000 Ngram model and 0.82 on the syntactic and lexical model. The main issue with this study is that it did not rely on an extensive set of features that systematically cover the linguistic aspects that can be affected by AD. Another drawback, acknowledged by the authors, is that Ngrams of words are known to have scalability issues when applied to different tasks.

In 2020, several researchers from the University of Edinburgh and Carnegie Mellon University launched a campaign about AD called the ADReSS Challenge (9). This challenge aimed to make available a benchmark dataset of spontaneous speech, balanced in terms of age and gender, to help compare different approaches to automatic diagnosis of AD. The ADReSS challenge consists of two tasks, the first of which is relevant to this work: an AD classification task, where participants must produce a model to predict whether a speech session was produced by someone with AD or by a healthy subject. Several approaches have been proposed within the ADReSS Challenge campaign, some relying on speech data, transcribed speech, or both. Among those approaches, Chlasta and Wołk (10) used a two-stage hybrid architecture as a baseline. This architecture first uses VGGish, a pretrained convolutional neural network (CNN) from Google, for audio feature extraction; support vector machine (SVM) and neural network (NN) classifiers are then used to diagnose AD. They also proposed DemCNN, a raw waveform-based CNN model, whose accuracy of 0.636 is 7% higher than the baseline. According to the results sheet2, shared by the organizers of the ADReSS Challenge, this system achieved an F1 of 0.875 on the campaign’s evaluation test set (10). Yuan et al. (11) proposed a model that fine-tuned BERT and ERNIE, pre-trained modern language models, to diagnose AD. They achieved 0.896 accuracy on the test set of the ADReSS challenge, which is the leading performance of this campaign. Edwards et al. (12) adopted a combination of audio, word, and phoneme features. They analyzed the text data at both the word level and the phoneme level. Experiments with larger neural language models did not result in improvement, given the small amount of text data available. However, they showed that the phoneme representation helps improve the classification. This approach gave an F1 score of 0.854.
Deep learning led to key unprecedented achievements within natural language processing (NLP). However, like other NNs, this paradigm does not provide clear clues to understand the underlying problem. It is also known for demanding large amounts of data to achieve optimal performance, which is hard to find within the medical field.

Yamada et al. (13) collected speech responses in Japanese from 121 older adults comprising AD, dementia with Lewy bodies (DLB), and cognitively normal (CN) groups and investigated their acoustic features (jitter and shimmer), prosodic features (such as pitch variation, pause duration, and phoneme rate per second), and linguistic features (such as the number of correct answers and TTR). They reported an F1 score of 0.864 for AD vs. CN. Despite the promising results of this work, it offers neither a systematic nor a theoretically motivated approach.

The above-mentioned studies suggest that lexical features can be effective indicators of AD, and that the use of NLP techniques can further improve classification performance. However, previous studies lack systematic, in-depth coverage of those features.

This paper is organized as follows: the Methods section presents the methodology and the dataset used. It also presents the lexical features adopted in this study, along with their analysis of variance (ANOVA) F-tests and their rankings according to the three other adopted feature selection techniques. Machine learning experiments and their results are presented in the Results section. The Discussion section discusses the results and findings of this paper. Conclusions and future work are presented in the Conclusions section.



The main aim of this paper is to explore systematically the discriminatory impact of lexical features in distinguishing between patients with AD and healthy users. Such exploration will help lay the ground to build a machine learning classifier that can achieve this task automatically. Hence, the conducted work is organized into two main phases. First, individual examination of the impact of AD on carefully selected lexical features that cover different aspects of lexicon, such as precision, diversity, and density. Second, finding the best set of features and the optimal length of input so that the classification performance is optimal.


Several previous studies suggested that AD impacts the lexicon more strongly than other key linguistic levels, such as syntax (1,4,5). This leads us to assume that the lexicon of subjects who are aging normally is less affected than that of patients with AD. Furthermore, given the types of lexical features covered in previous works, it is assumed that AD impacts all aspects of the lexicon covered in this paper. Hence, AD is expected to cause a sharp decrease in density, diversity, sophistication, and specificity. The psychological and miscellaneous features are also expected to show some changes toward less complexity with patients suffering from AD. Since, to the best of our knowledge, no previous systematic study of the lexicon has been conducted, it is hard to predict which of the lexical features will be impacted more by AD. Therefore, feature ranking is one of the goals of this paper.


The first dataset used in this study is the one provided by the ADReSS challenge organizers (9). This dataset is part of the DementiaBank dataset, which is in turn part of the larger TalkBank project (14). The dataset consists of recordings and transcripts of cookie theft picture3 descriptions by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is as follows: the training set contains 108 samples, and the test set contains 48 samples. Only the transcripts were used in this study, since the main goal of this work is to study the impact of AD on the lexicon, not on speech. The ADReSS competition dataset is adopted in this study because it is balanced in terms of age, gender, and conditions. For example, ages range between 50 and 80 years, with the largest group between 65 and 70. Within both the training and testing datasets, the ratio of males to females is about 80%, which shows a high level of balance [see (9) for more details about the distribution of genders, ages, and conditions of the participants]. Since the goal of the classification is to identify the subjects with AD, the transcriptions of the utterances produced by a subject in one dialogue were grouped together into one sample. Another reason for this grouping is that individual utterances can be too short to provide information about the subject’s mental abilities. Moreover, given that the ADReSS dataset was used in several previous studies, it makes it possible to compare our results with those reported previously on this dataset.

Three other datasets were collected from two renowned authors and a public personality to test the system’s generalization. Random passages of about 350 words were collected from those texts and speeches, as described below.

Previous works, like (15,16), suggested that President Reagan had early signs of AD during his second term, years before his formal diagnosis in 1994. Forty passages were randomly selected from 10 of President Reagan’s speeches. Twenty passages were picked from speeches given between 1964 and 1980, before he presumably started to show signs of AD. The other 20 passages were selected from speeches made between 1989 and 1990, when he presumably had started to show signs of AD.

The second dataset was collected from novels written by the British novelist Iris Murdoch, who died with AD. Previous works suggested that the quality of her language was deteriorating with time (4,6). A balanced dataset was made of forty-four passages selected randomly from different parts of three of her novels: Jackson’s Dilemma, published in 1995, which was written when she was suspected of having AD, and two of her earliest novels: Under the Net, published in 1954, and The Sandcastle, published in 1957.

Although she was not formally diagnosed with AD during her lifetime, Le et al. (6) showed that the British novelist Agatha Christie exhibited symptoms of dementia in her late writings. Hence, a dataset was made of thirty-eight passages extracted from three of her novels: two of her earliest, The Mysterious Affair at Styles (1921) and The Secret Adversary (1922), and her penultimate one, Elephants Can Remember (1972), which is about a female novelist with memory issues. This latter novel was written when she was suspected of having dementia. The 19 passages extracted from her earlier works are considered negative cases, while the 19 passages extracted from Elephants Can Remember are deemed positive.

Data analysis approach

The aim of this paper is twofold. From a scientific perspective, it aims to compare the impact of AD on a large body of lexical features of different types. From an application perspective, it aims to build a classifier that can distinguish transcribed and written language samples produced by healthy subjects from those produced by subjects with AD. To do so, one of the most important steps consists of identifying the key features to be extracted from the language sample. After a comprehensive survey of the lexical features proposed in the literature, and the conception of some new ones, 99 lexical features were extracted from each document. These features are distributed across six lexical complexity types: density, diversity, focus, sophistication, specificity, and psychological. Given the large number of lexical diversity features, they are broken down into three subtypes in the presentation. Besides, the feature set includes 16 miscellaneous features covering readability formulas, discourse connectors, Ngram profiles, and sentiment analysis. This means that every document Di in the dataset is converted into a vector Vi of lexical features, made of sub-vectors each representing a subtype of lexical complexity:

Vi = <ΦDensity, ΦDiversity, ΦLexicalFocus, ΦSophistication, ΦSpecificity, ΦPsychology, ΦMisc>. The problem is that, on the one hand, Vi has many redundancies and hence does not necessarily lead to optimal machine learning performance. On the other hand, it would be useful, from a scientific point of view, to get an idea of which features are more impacted by AD and which are less impacted. Hence, to improve the machine learning performance and to get a clear picture of the impact of AD on the lexicon, feature ranking and selection seem necessary. The features of the same type are mainly ranked using ANOVA’s F-test, which is a common method for feature selection and ranking. The larger the F-score, the stronger the feature. Besides, since no single statistical test is definitive, three other popular feature selection techniques are used: χ2, information gain, and ReliefF.
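As an illustration of the ranking step, the one-way ANOVA F-score for a single feature over two groups can be computed directly from its definition. This is a minimal sketch (the function name and the toy data are our own, not from the study):

```python
def anova_f(group_a, group_b):
    """One-way ANOVA F-statistic for a single feature measured on two groups."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / na, sum(group_b) / nb
    grand = (sum(group_a) + sum(group_b)) / (na + nb)
    # Between-group sum of squares (1 degree of freedom for two groups)
    ss_between = na * (mean_a - grand) ** 2 + nb * (mean_b - grand) ** 2
    # Within-group sum of squares (na + nb - 2 degrees of freedom)
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    return (ss_between / 1) / (ss_within / (na + nb - 2))
```

For two groups, this F-statistic equals the square of the two-sample (pooled-variance) t-statistic; larger values indicate a stronger separation between the AD and control samples on that feature.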

Feature set

This section aims to describe the impact of AD on individual features. For every feature, a brief description and justification are provided, along with its one-way ANOVA F-test. Three other ranking methods are also considered: ReliefF, information gain, and χ2. Information about these methods’ rankings is provided when the feature is among the top features of their ranked lists.

Measures of lexical density (LD)

LD refers to the percentage of content or lexical words in a given text or language sample. According to the study of Laufer and Nation (17), LD is also a measure of the information density of a text. LD has been studied in relation to AD. For example, Shellikeri et al. (18) analyzed lexical-semantic and acoustic features of picture descriptions and found a decline of LD in texts produced by patients with AD. As seen in the density formula (Eq. [1]), a higher LD score indicates a greater concentration of lexical or content words and a lower proportion of functional words in a text or speech sample.


LD = Nlex / N	[1]

where Nlex is the number of lexical words and N is the total number of words.

Lexical words belong to open classes, whose membership is theoretically unlimited. Closed-class words typically include grammatical words, whose number is limited. Unfortunately, such a broad definition is not sufficient for algorithmic implementation, as there is disagreement in the literature about the categories of words to include in each class. For instance, O’Loughlin considered all adverbs of time, manner, and place as lexical (19). On the other hand, Engber and Lu consider the following categories as lexical (20,21): nouns, adjectives, verbs (excluding modal verbs and auxiliary verbs), adverbs with an adjective base, like fast, and those formed by adding the suffix -ly, like especially. Hence, to account for the above divergence, two versions of LD were implemented. In the first version (LD1), the following word categories are considered lexical: verbs, including modals; adverbs, including comparative and superlative adverbs; adjectives, including comparative and superlative adjectives; gerunds and present participles; and common and proper nouns. In the second version (LD2), the following categories are considered lexical: all common and proper nouns; adjectives, including comparative and superlative adjectives, as well as comparative and superlative adverbs; verbs, except the modals and the auxiliaries have and be; and adverbs ending with the suffix -ly or having the same form as an adjective, like half, late, or low. As shown in Figure 1, the ANOVA F-tests of these two versions are not significant. A possible interpretation of this result is that the information density is the same between the populations with and without AD, given that the task is the same.
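A minimal sketch of an LD computation in the spirit of LD1 follows; the Penn Treebank tag set used here is a simplified approximation of our own, not the paper’s exact category list:

```python
# Hypothetical content-word tag set approximating LD1 (Penn Treebank tags)
CONTENT_TAGS = {
    "NN", "NNS", "NNP", "NNPS",                      # common and proper nouns
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD",   # verbs, incl. modals/gerunds
    "JJ", "JJR", "JJS",                              # adjectives, incl. degrees
    "RB", "RBR", "RBS",                              # adverbs, incl. degrees
}

def lexical_density(tagged_tokens):
    """LD as the proportion of content (lexical) words among all words.

    tagged_tokens: list of (word, pos_tag) pairs."""
    if not tagged_tokens:
        return 0.0
    n_lex = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return n_lex / len(tagged_tokens)
```

The function returns a proportion; multiplying by 100 gives the percentage form described for Eq. [1].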

Figure 1 ANOVA F-test ranking of the features of lexical density; the features ANR, rtAdj, rtAdv, and rtVb are significant, with P<0.03. ANR, adjective-to-noun ratio; FWR, functional word ratio; LD1/2, lexical density 1st/2nd version; NVR, noun-to-verb ratio; rtAdj, ratio of adjectives; rtAdv, ratio of adverbs; rtNn, ratio of the nouns; rtPrep, ratio of prepositions; rtVb, ratio of the verbs; WFI, word frequency index; ANOVA, analysis of variance.

Word frequency index (WFI) is a variant of LD where the sum of the functional words is divided by the total number of words (Eq. [2]).


WFI = (Σi wi) / N	[2]

where wi is a functional word from the text and N is the total number of words.

As shown in Figure 1, WFI did not yield a significant ANOVA F-test either. The probable reason for this non-significance is the limited task of the conversation.

One can look at density from the opposite angle too. Hence, some previous works used the functional word ratio (FWR), which is the ratio of the number of functional words, such as articles, pronouns, and prepositions, to the total number of words in a text or speech sample. A higher FWR indicates a lower lexical density. FWR has been used in studies of language development and author profiling (22,23). FWR did not yield a significant ANOVA F-test (Figure 1). Despite that, FWR is among the top 25 features within the ranked lists of ReliefF and information gain.

Another approach to measuring density is to take every word’s morphosyntactic category, or part of speech (POS) tag, and calculate its ratio to the number of all words in the text. For example, the work of Cho et al. (24) showed that patients with amnestic AD produced fewer verbs and adjectives and more fillers compared to normal control subjects. Hence, in this study, five features of this type are considered: the ratio of adjectives (rtAdj), the ratio of nouns (rtNn), the ratio of verbs (rtVb), the ratio of prepositions (rtPrep), and the ratio of adverbs (rtAdv). Some more specific ways to consider the ratio of morphological categories have been explored in the literature, by taking the ratio of a given category to another, like the noun-to-verb ratio (NVR). A higher NVR indicates a higher lexical density. Conversely, a higher adjective-to-noun ratio (ANR) indicates a lower lexical density. This aspect of density was considered by Ahmed et al. (25) in their study of 12 proportional frequencies of open-class (nouns, verbs, and descriptive terms) and closed-class, or grammatical, words. They found no significant overall differences in group comparisons between normal controls and patients with AD. On the other hand, using noun and verb naming, sentence completion, and narrative tasks, Kim and Thompson (26) examined the nature of verb deficits in 14 individuals with PrAD. Production was tested, controlling both semantic and syntactic features of verbs. This study also covered noun and verb comprehension and a grammaticality judgment task. Results showed both that PrAD subjects had impaired verb naming and that the NVR in PrAD is consistent with grammatical aphasia. As shown in Figure 1, only the features ANR, rtAdj, rtAdv, and rtVb have significant ANOVA F-tests. Furthermore, the features ANR, FWR, rtAdj, rtAdv, rtVb, and NVR are among the top 25 features within the ReliefF ranked list. The features rtAdj, rtAdv, rtVb, rtPrep, and NVR are among the top 25 features within the χ2 ranked list. The features rtAdj, rtNn, rtVb, and FWR are among the top 25 features within the information gain ranked list. Overall, these results show that many of the lexical density measures used in this paper are substantially impacted by AD.

Lexical diversity

Lexical diversity, or lexical variation, measures the variety of words or vocabulary used to express ideas about a given subject. Lexical diversity is an important aspect of language that can impact the effectiveness and clarity of written and spoken communication. Several measures of lexical diversity have been studied in the context of work on language acquisition and education, as well as on dementia (5,27,28).

The intuitive way to describe lexical diversity is to count the different lexical forms in the text, i.e., the size of the vocabulary. This is called the number of different words (NDW). NDW was used in areas like language acquisition (29) and English as a Second Language (ESL) (28). NDW was also used by Shin et al. (30) to identify the core vocabulary for adults with complex communication needs. The obvious limitation of NDW is its dependence on the length of the sample: the longer the sample, the greater the chance of observing different lexical forms. This limitation motivated the creation of several extensions that measure diversity independently of the text length. As seen in Figure 2, the ANOVA F-test of NDW is not significant.

Figure 2 The ANOVA F-test values. The following features are significant, with P<0.05: Brunet, CTTR, GTTR, HDD, Herdan, HLM_v2, Maas, MaasLog, MTLD, Sichel, Summer Index, TTR, Uber Index, and Yule Index. TTR, type token ratio; MATTR, moving-average TTR; CTTR, Carroll TTR; GTTR, Guiraud’s corrected TTR; HDD, hypergeometric distribution D; HLM, Honoré’s lexical measure; MTLD, measure of textual lexical diversity; WFI, word frequency index; ANOVA, analysis of variance.

TTR is a linguistic measure used to determine the lexical diversity of a text or speech sample (31). TTR is an extension of NDW that aims to take into consideration the length of the text or speech sample. It is calculated by dividing the NDW of a text by the total number of words in that text (Eq. [3]).


TTR = NDW / N	[3]

where N is the total number of words.

A high TTR indicates that a text has a greater variety of vocabulary and is more lexically diverse. TTR has been used in several studies about dementia, with contradictory results. For example, TTR did not give significant results in the study conducted by Shinkawa and Yamada (32), which aimed to characterize the atypical repetition of words on different days observed in patients with dementia, while Le et al. (6) reported significant results in their longitudinal study of lexical and syntactic changes in three British novelists. These contradictory results of TTR are probably due to its bias toward the size of the text, as shown by (33-35) [see (27) for a detailed discussion], since the ratio decreases as the number of words in the text increases. As seen in Figure 2, the ANOVA F-test of TTR is significant.

To compensate for the change in the size of the text and consequently turn TTR into a constant over the whole text, several mathematical transformations of TTR were attempted. Guiraud’s corrected TTR (GTTR) (36) is one of these transformations (Eq. [4]).


GTTR = NDW / √N	[4]

where NDW is also the size of the vocabulary V.

Carroll (37) also proposed a similar transformation, the Carroll type token ratio (CTTR) (Eq. [5]).

CTTR = NDW / √(2N)	[5]
Both CTTR and GTTR yielded a significant ANOVA F-test and are among the top 25 features within the χ2 ranked list.
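As a minimal sketch, TTR and its two corrected variants (Eqs. [3]-[5]) can be implemented as follows, assuming tokens are already lower-cased word forms:

```python
import math

def ttr(tokens):
    """Type-token ratio: distinct word forms over total words (Eq. [3])."""
    return len(set(tokens)) / len(tokens)

def gttr(tokens):
    """Guiraud's corrected TTR: V / sqrt(N) (Eq. [4])."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def cttr(tokens):
    """Carroll's corrected TTR: V / sqrt(2N) (Eq. [5])."""
    return len(set(tokens)) / math.sqrt(2 * len(tokens))
```

Note that CTTR is simply GTTR scaled by a constant factor of 1/√2, so the two rank texts identically.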

A more recent transformation called MATTR has been proposed by Covington and McFall (38). MATTR addresses lexical diversity using a moving window that calculates the TTRs for each successive window of a given length. The final MATTR score is calculated as the average of the TTRs of the windows (Eq. [6]).


W is the number of windows.

Two versions of MATTR have been implemented. In the first version (MATTR_v1), a fixed window of ten words with a step of five words was used. In the second version (MATTR_v2), the window is one tenth of the text size and moves by half the window size. Neither version yielded a significant ANOVA F-test; however, MATTR_v1 ranks among the top 50 features of the information gain list.
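The windowing logic of MATTR can be sketched as below; the defaults mirror MATTR_v1 (window of ten, step of five), while MATTR_v2 would pass window=len(tokens)//10 and step=window//2. This is an illustrative reimplementation, not the paper's code:

```python
def mattr(tokens, window=10, step=5):
    """Moving-average TTR (Eq. [6]): mean of the TTRs of successive windows."""
    if len(tokens) <= window:  # text shorter than one window: fall back to plain TTR
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(0, len(tokens) - window + 1, step)]
    return sum(ttrs) / len(ttrs)
```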

The measure of textual lexical diversity (MTLD) is another alternative to TTR. MTLD is calculated as the mean length of sequential word strings in a text with a given TTR value (39). As shown in Figure 2, the one-way ANOVA F-test of this feature is significant.

The D measure is another approach to calculating lexical diversity independently of the length of the text (40). This measure is based on the predicted decrease of the TTR as the size of the text increases. This mathematical curve is compared with textual data collected by McCarthy and Jarvis (MJ corpus) as well as the Lancaster-Oslo-Bergen corpus [see (39,41) for more details]. In addition, McCarthy and Jarvis showed that Vocd-D is a complex approximation of the hypergeometric distribution; to show this, they proposed an index that they called hypergeometric distribution D (HD-D or HDD). The hypergeometric distribution gives the probability of drawing a certain number of tokens of a specific type from a text sample of a certain size. As shown in Figure 2, the HDD feature has a significant ANOVA F-test.

Dugast’s Uber Index (Uber) is a transformation of TTR designed to measure lexical diversity based on the frequency of different word lengths in a text (42) (Eq. [7]).


Uber was used by Lissón and Ballier to study vocabulary progression in foreign language learners (43) and by Nasseri to model the lexical complexity of academic writing (44). As shown in Figure 2, Uber yielded a significant ANOVA F-test.

Within the context of his work on author identification, Yule proposed a measure of repetition, K (Yule), to account for lexical diversity, assuming that K would differ between authors (Eq. [8]) (45). Yule has a significant ANOVA F-test (Figure 2).


Vr is the number of word types that occur r times in a text of length N.
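Yule's K is conventionally computed from the frequency spectrum as 10^4 * (sum over r of r^2 * Vr, minus N) / N^2; the sketch below assumes Eq. [8] follows this standard form:

```python
from collections import Counter

def yule_k(tokens):
    """Yule's characteristic K: 1e4 * (sum_r r^2 * Vr - N) / N^2,
    where Vr is the number of word types occurring exactly r times."""
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # maps r -> Vr
    s2 = sum(r * r * vr for r, vr in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)
```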

The Brunet Index (BI), also known as W, is a measure of the lexical diversity of a text or speech sample. BI is calculated based on the number of unique words within a text sample (Eq. [9]).


A higher BI indicates a higher lexical diversity. BI was used as a feature in several dementia classification models, such as (7,15,46-48). As shown in Figure 2, BI produced a significant ANOVA F-test.
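Brunet's W is usually given as N raised to the power V^(-0.165), with 0.165 an empirically fixed constant; assuming Eq. [9] uses this common form, a sketch is:

```python
def brunet_w(tokens):
    """Brunet's index W = N ** (V ** -0.165); 0.165 is the constant
    conventionally used in the literature."""
    n, v = len(tokens), len(set(tokens))
    return n ** (v ** -0.165)
```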

Honoré’s lexical measure (HLM) is another alternative to TTR for measuring the lexical diversity of a text or a speech sample (49). The main assumption behind HLM is that the frequency of a word is inversely proportional to its rank in the frequency list. HLM was used as a feature to detect AD with machine learning in several previous studies, such as (7,15,45). Based on the literature, three versions of HLM were implemented (HLM-V1, HLM-V2, and HLM-V3 in Eq. [10], Eq. [11], and Eq. [12], respectively).




NSingOc = number of words with a single occurrence in the text, V = total number of different words in the text (i.e., the vocabulary).

As shown in Figure 2, only the second version of HLM has a significant ANOVA F-test.
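For reference, the most commonly cited form of Honoré's statistic can be sketched as below; which of the three implemented variants it corresponds to is left open here, so the snippet is indicative rather than a reproduction of the paper's equations:

```python
import math
from collections import Counter

def honore(tokens):
    """Honoré's statistic: 100 * log(N) / (1 - NSingOc / V),
    where NSingOc is the number of hapax legomena and V the vocabulary size."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    n_sing = sum(1 for c in counts.values() if c == 1)
    if n_sing == v:        # every word is a hapax: denominator would be zero
        n_sing = v - 1     # simple guard, an assumption rather than the paper's choice
    return 100 * math.log(n) / (1 - n_sing / v)
```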

Summer Index (SI or Summer) is a measure of lexical diversity (Eq. [13]) that was used by Lissón and Ballier (43) to study vocabulary progression in second language learning. As shown in Figure 2, SI gave a significant ANOVA F-test.


Herdan’s Index (HI) is another lexical diversity measure that is also known as LogTTR (50) (Eq. [14]). HI has been used in studies about foreign language progression (43). As shown in Figure 2, HI yielded a significant ANOVA F-test.


V (or Vocabulary) is the number of different words.
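Herdan's LogTTR is the ratio of log-vocabulary to log-length; assuming Eq. [14] matches this standard definition:

```python
import math

def herdan(tokens):
    """Herdan's index (LogTTR): log(V) / log(N)."""
    n, v = len(tokens), len(set(tokens))
    return math.log(v) / math.log(n)
```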

Maas Index (MI) is a different measure of diversity proposed by Maas to improve TTR (51). MI is based on the association between the number of unique words and the total number of words (Eq. [15]).


A variant of this measure called Maas log (MaasLog) has also been used in literature (Eq. [16]) (43). Both measures yielded a significant ANOVA F-test (Figure 2).
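The usual Maas formulation is a^2 = (log N - log V) / (log N)^2, with lower values indicating higher diversity; the sketch below assumes Eq. [15] takes this form (MaasLog, Eq. [16], is a further transformation not reproduced here):

```python
import math

def maas(tokens):
    """Maas index a^2 = (log(N) - log(V)) / log(N) ** 2;
    lower values correspond to higher lexical diversity."""
    n, v = len(tokens), len(set(tokens))
    log_n = math.log(n)
    return (log_n - math.log(v)) / (log_n ** 2)
```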


The Sichel measure was used to study language progression by foreign language learners (52), and it was also used as a feature in the model proposed by Wang et al. (15) to detect AD (Eq. [17]).

As seen in Figure 2, Sichel yielded a significant ANOVA F-test.


Entropy is a thermodynamic concept that was proposed to measure the degree of disorder or randomness (Eq. [18]). Entropy has been used to measure the uniformity of the vocabulary distribution (53). Hernández-Domínguez et al. (47) showed that entropy has a significant, but weak, correlation with cognitive impairment.


p(w) is the probability of a word w occurring in the text T.

However, as shown in Figure 2, entropy did not yield a significant F-test; it nonetheless figures among the top 50 features of reliefF’s ranked list.
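Concretely, the Shannon entropy of the word distribution of a sample (Eq. [18]) can be computed as:

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Shannon entropy of the word distribution: -sum p(w) * log2 p(w)."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())
```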

Diversity through Lexical Focus and PMI

Another way to measure the diversity of a text is by measuring its focus. Text focus was proposed by Kurdi (54) as the mean distance between keywords extracted from the text. Keywords are extracted using the rapid automatic keyword extraction (RAKE) algorithm (55). Three distances between the keywords are then calculated using the WordNet dictionary: the Wu-Palmer similarity (distWUP), the Path similarity (distPath), and the Leacock-Chodorow similarity (distLCH). The features distWUP and distPath do not have a significant ANOVA F-test, but distPath is among the top 25 reliefF ranked features. Even though the ANOVA F-test of distLCH is not significant, this feature is among the top 25 features in the ranked lists of reliefF, χ2, and information gain.

A fourth approach to measure the text focus is also considered. This approach consists of using the average of the cosine distances between the glove and word2vec vectors of the keywords respectively (Eq. [19]).


T is the text sample, Cos is the cosine distance, Vwi is the vector of the word wi, and K is the number of tokens in T.

Given the possible impact of the size of those vectors, the experiments were conducted with the following sizes: 50, 100, 200, and 300. This yielded four features with GloVe (AVG_GlvSIZE) and two with word2vec (AVG_W2VSIZE), to avoid redundancy. Only the features AVG_Glv300 and AVG_W2V300 have a significant ANOVA F-test (Figure 3). However, AVG_Glv50 is among the top 25 reliefF features and AVG_Glv100 is among the top 25 information gain features.

Figure 3 F-values of the ANOVA statistic of the text focus and pointwise mutual information features. The features average glove size 300 (AVG_Glv300) and average word to vector size 300 (AVG_W2V300) are significant, with P<0.01. ANOVA, analysis of variance.

In statistics, PMI is used to measure the association between two events. When it comes to the lexicon, it indicates a high probability of co-occurrence, which is, among other things, a measure of semantic relatedness. Hence, PMI can be used as a measure of diversity. Two approaches to calculating PMI are adopted in this paper. First, PMI is calculated according to the traditional approach, which consists of calculating the probability of co-occurrence of two terms within the same text (Eq. [20]).


The average PMI of the pairs of content words within the text is calculated. This approach did not give a significant ANOVA F-test. However, it ranked among the top 25 features within the χ2, reliefF, and information gain lists. The second approach to PMI is an extension proposed in this paper to consider the semantic relatedness of the pairs of words (Eq. [21]).
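The core of Eq. [20] is the classical PMI of a word pair; given probability estimates for the pair and for each word, it is:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) ).
    Positive when x and y co-occur more often than chance, zero under independence."""
    return math.log2(p_xy / (p_x * p_y))
```

The feature itself is then the average of this score over all pairs of content words in the sample, with probabilities estimated from co-occurrence counts.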


Two types of word embedding have been used: word2vec and glove, with varying vector sizes: 50, 100, and 300 (pmi_glvSize) and one size for word2vec, to avoid redundancy: pmi_W2V300. All the PMI features of word embedding gave non-significant ANOVA F-tests (Figure 3). However, pmi_Glv50 is among the top 25 features of reliefF and pmi_Glv100, pmi_Glv300, and pmi_W2V300 are among the top 50 features of the same list.

Diversity of ngrams of words

Using the DementiaBank language dataset, Orimaye et al. (8) compared the 20 most frequent Ngrams between patients with preliminary AD and healthy elderly people and found significant differences between those two groups. Such an approach relies on pattern matching, which depends on the literal form of the text. Therefore, it does not scale well to a change of task. Hence, in this paper, three lexical diversity measures were extended to Ngrams: TTR, CTTR, and GTTR. Each of these measures was implemented with bigrams, trigrams, and fourgrams, giving the nine measures whose F-tests are presented in Figure 4.
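Extending TTR to n-grams amounts to counting distinct n-grams over total n-grams; a sketch of this extension (bigTTR, trigTTR, and frgTTR correspond to n = 2, 3, 4):

```python
def ngram_ttr(tokens, n):
    """TTR over word n-grams: distinct n-grams divided by total n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```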

Figure 4 F-values of the ANOVA statistic of Ngram diversity measures. Only the bigrams, trigrams, and fourgrams of TTR, respectively bigTTR, trigTTR, and frgTTR, are significant, with P<0.004. TTR, type token ratio; ANOVA, analysis of variance.

Among those measures, only bigTTR, trigTTR, and frgTTR gave a significant ANOVA F-test. TrigTTR is also among the top 25 features in the information gain ranked list. The features trigCTTR and trigGTTR are among the top 50 features of the information gain and χ2 ranked lists. The feature frgGTTR is among the top 50 features within the χ2 ranked list.

Lexical sophistication

Lexical sophistication (Soph) is also called lexical rareness or Basic Lexical Sophistication (21,56). It is calculated as the ratio of the number of sophisticated words to the number of lexical words (Eq. [22]).


Nsoph is the number of sophisticated words. N is the total number of tokens (that remain after filtering the stop words).

Hyltenstam considers as sophisticated those words whose frequency rank is beyond 7,000 in the list of the most frequent Swedish words (57). Astell and Harley induced tip-of-the-tongue (TOT) states in their experiments with elderly participants, some of whom had PrAD (58). The results of these experiments showed that TOT states occurred more often in patients with PrAD when the targets were words of low frequency and imageability. Given the above considerations, three versions of lexical sophistication are implemented. Soph is a basic version of lexical sophistication, where a word is considered sophisticated if its frequency rank is higher than 3,000. Word frequencies are obtained from the word frequency data (WFD), a freely available list of the 5,000 most frequent words in English that is calculated from the Corpus of Contemporary American English (59). The words that are not in the WFD are given the same frequency value, which is lower than the lowest score in the list. Words are stemmed using the Snowball stemmer. This helps match the plural of regular forms (e.g., book, books). The ANOVA F-test of Soph is significant (Figure 5).
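The Soph feature described above can be sketched as follows; freq_rank stands in for the WFD lookup (applied after stemming), and the toy rank dictionary used in the example is hypothetical:

```python
def soph(tokens, freq_rank, cutoff=3000):
    """Lexical sophistication (Eq. [22]): ratio of sophisticated words to all
    content words. A word is sophisticated if its frequency rank exceeds the
    cutoff; words absent from the frequency list are treated as maximally rare."""
    if not tokens:
        return 0.0
    n_soph = sum(1 for w in tokens if freq_rank.get(w, cutoff + 1) > cutoff)
    return n_soph / len(tokens)
```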

Figure 5 ANOVA F-tests of the sophistication features. HLR, gapVb, rtIrregVb, and VSM are significant, with P<0.04. CLS, continuous lexical sophistication; gapVb, general all-purpose verbs; HLR, hapax legomena ratio; VSM, verb sophistication measure; ANOVA, analysis of variance.

A related measure was proposed by Harley and King: the verb sophistication measure (VSM) (60). It is calculated as the ratio of the number of sophisticated verbs to the total number of verbs (Eq. [23]).


Vsoph is the number of sophisticated verbs. V is the total number of verbs.

Sophisticated verbs are defined as verbs outside the list of the most frequent verbs. Harley and King used two lists, with 20 and 200 verbs respectively, in two different studies. In both studies, they reported a significant difference between native and non-native writers. In this paper, the Macmillan English Dictionary was used. This dictionary provides the list of the 330 most frequent verbs in English. To find the uninflected form of a verb, the verb conjugation module provided within the Pattern.en toolbox is used. As seen in Figure 5, this feature yielded a significant ANOVA F-test. It is also among the top 25 reliefF features. A modified version of VSM, which uses the square to reduce the sample size effect, was proposed by Wolfe-Quintero et al. (61) (Eq. [24]). VSMsq did not give a significant ANOVA F-test (Figure 5), but it is among the top 50 χ2 ranked features.


General all-purpose (GAP) verbs, sometimes called light verbs, are a limited set of high-frequency, often monosyllabic, verbs with general semantic meanings, such as make, do, and go (62). As suggested by Maouene et al., heavy verbs have a strong association with specific objects (63). There have been conflicting findings concerning the high usage of GAP verbs in children with specific language impairment (SLI), but it is now thought that making extensive use of GAP verbs is a normal phase both in children developing normally and in those with SLI (64). Given the relevance of GAP verbs to the process of language acquisition and to verb specificity, they seem a relevant indicator of dementia. In our data, the ratio of GAP verbs to the total number of verbs yielded a significant ANOVA F-test; it is also among the top 25 features in the three other feature selection lists: reliefF, χ2, and information gain. This result suggests that patients with dementia tend to use more generic GAP verbs than control subjects.

Another measure related to verbs is the ratio of regular vs. irregular verbs (rtIrregVb). Previous works did not provide clear evidence of a difference in lexical access between regular and irregular verbs. For example, Feldman et al. concluded that normal native speakers did not show a substantial difference in access to irregular verbs compared to regular ones (65), while Justus et al. observed a dissociation between the two types of verbs (66). Nonetheless, the divergence between the two types of verbs requires extra cognitive processing that can be altered by AD. This assumption is supported by the significance of the ANOVA F-test (Figure 5) and by the fact that this feature is among the top 25 features of the three other considered feature selection techniques: reliefF, χ2, and information gain.

The continuous lexical sophistication (CLS) measure was proposed by Kurdi (28). It consists of calculating the average frequency of the content words within a text, after removing stop words (Eq. [25]).


WFD(wi) is the frequency of the word wi within WFD.

Like with Soph, words are also stemmed here. This feature did not give a significant ANOVA F-test, but it is among the top 50 ranked features by the information gain and reliefF methods.

Hapax legomena ratio (HLR) takes a local approach to sophistication, since it defines complexity as the ratio of the number of words that occur only once in a text or a speech sample to the total number of words in the sample (Eq. [26]).


NsinglOc is the number of words with a single occurrence in the text.

Hernández-Domínguez et al. (47) showed that HLR has a negative but significant correlation with cognitive impairment. HLR yielded a significant ANOVA F-test (Figure 5), and it is among the top 50 χ2 ranked features.
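HLR (Eq. [26]) reduces to counting hapax legomena:

```python
from collections import Counter

def hapax_legomena_ratio(tokens):
    """HLR: number of words occurring exactly once divided by total words."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)
```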

Lexical specificity

Lexical specificity concerns how precisely the lexicon is used in a text or a speech sample. The lexicon is said to be precise when it designates a precise object or action; vagueness and ambiguity result from using a lexicon that lacks specificity. Lexical specificity has been linked to AD in the literature. For example, Eyigoz et al. showed that using semantically generic terms, such as boy, girl, and woman, instead of more specific words, such as son, brother, sister, daughter, and mother, to refer to the subjects in the picture is associated with a higher risk of AD (67). Le et al., in their longitudinal study of the writings of three British novelists, showed that there is a significant decrease in specificity associated with dementia (6).

Several features have been proposed in this paper to capture lexical specificity. The first is the depth of knowledge required by a set of words (depthKnow). A word is considered to require more knowledge, or to be specialized, if it belongs to the University Word List compiled by Xue and Nation (68), which is made up of 836 words. This feature gave a significant ANOVA F-test (Figure 6).

Figure 6 ANOVA F-tests of the specificity features. The significant features are depthKnow, rtHypo, rtOneMeaning, and rtMeaningWords, with P<0.05. depthKnow, depth of knowledge; rtAbb, ratio of abbreviations; rtHypo, ratio of hyponyms; rtOneMeaning, ratio of monosemic words; rtMeaningWords, ratio of the total number of meanings; ANOVA, analysis of variance.

To capture the polysemy of the content words within the texts (those words that remain after removing the stop words), two approaches have been adopted. The first consists of calculating the ratio of monosemic words to the number of content words (rtOneMeaning), and the second consists of calculating the ratio of the total number of meanings to the number of words within the text (rtMeaningWords). WordNet is used to calculate both ratios. As shown in Figure 6, both rtOneMeaning and rtMeaningWords yielded a significant ANOVA F-test. The feature rtOneMeaning is among the top 25 features of the χ2 ranked list and among the top 50 of the information gain ranked list. The feature rtMeaningWords is among the top 50 features within the reliefF ranked list.

Abbreviations, such as DIY, BTW, and LOL, can be seen as a special case of monosemic words; therefore, their ratio is calculated. This feature did not yield significant results, because such forms are rarely used given the nature of the task.

Hyponyms are a way to capture the specificity of a word: the more hyponyms a word has, the less specific it is. Using WordNet, the ratio of hyponyms to the words within the text (rtHypo) is calculated. This feature yielded a significant ANOVA F-test (Figure 6). It is also among the top 25 features of the three other feature selection lists: reliefF, χ2, and information gain.

Psycholinguistic aspects of the lexicon

Several psycholinguistic aspects of the lexicon have been studied in the literature. These studies led to standard measures in psycholinguistics. Some of those key standards have been coded within the MRC psycholinguistic database, which covers 150,837 words and provides 26 linguistic and psycholinguistic properties. In this paper, the following psycholinguistic features are considered relevant to this study: Kucera-Francis written frequency (kf_freq), Kucera-Francis number of categories (kf_ncat), Kucera-Francis number of samples (kf_nsamp) (69), Thorndike-Lorge written frequency (TL_freq), Brown verbal frequency, familiarity rating (familiarity), concreteness rating, imageability rating, meaningfulness with Colorado norms (Colorado) (70), meaningfulness with Paivio norms (Paivio), as well as the age of acquisition rating (ageAquis). Some of these features have been used in studies about dementia. For example, in their longitudinal study of three subjects with semantic dementia, Le et al. (6) confirmed the finding of Astell and Harley (58) that AD has a significant impact on the imageability of spoken production. On the other hand, Bird et al. designed an experiment where they divided words into regular and exception words, whose pronunciation does not follow directly from the orthography (71). They tested whether the patient could distinguish those words from false font strings, i.e., sequences of non-orthographic symbols. The results of their study were not significant for words with high and low Kucera-Francis frequencies respectively.

The psycholinguistic features are calculated as follows. First, the list of nouns in the text is extracted and their total number of occurrences is counted. Then, the scores of every psycholinguistic feature are extracted from the MRC dictionary and their average within the text is calculated. Finally, the ratio of the number of nouns that are in the MRC to the number of nouns in the text (rtNouns) is also used as a feature.
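This averaging procedure can be sketched as below; mrc_scores stands in for one MRC property lookup (e.g., imageability), and the mapping used in the example is toy data, not actual MRC norms:

```python
def mrc_features(nouns, mrc_scores):
    """Average one psycholinguistic property over the nouns found in the MRC
    lookup, and compute rtNouns, the share of nouns covered by the lookup."""
    found = [mrc_scores[w] for w in nouns if w in mrc_scores]
    rt_nouns = len(found) / len(nouns) if nouns else 0.0
    average = sum(found) / len(found) if found else 0.0
    return average, rt_nouns
```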

As seen in Figure 7, all the psycholinguistic features yielded a significant ANOVA statistic except for the three Kucera-Francis features and the Thorndike-Lorge written frequency. Furthermore, several psycholinguistic features ranked high with the other feature selection techniques. As shown in Table 1, kf_ncat is the only feature that did not yield a significant ANOVA statistic in addition to not appearing in any top list. This is probably because this feature is designed for written language. On the other hand, the high rankings of many psycholinguistic features confirm their importance for AD and dementia detection.

Figure 7 ANOVA F-tests of the psycholinguistic features. The significant features are ageAquis, Brown, Colorado, concreteness, familiarity, imageability, Paivio, and rtNouns, with P<0.05. ageAquis, age of acquisition; rtNouns, ratio of the number of nouns; ANOVA, analysis of variance.

Table 1

Rankings of the psycholinguistic features among the top 25 features of information gain, reliefF, and χ2

Feature Info. Gain ReliefF χ2
ageAquis x
Brown x x
Colorado x x x
Concreteness x x x
imageability x x x
familiarity x x x
Kf_freq x
Kf_nsamp x
Paivio x x
rtNouns x x x
TL_freq x

Miscellaneous lexical features

Discourse connectors, or simply connectors, are a simple and reliable way to measure the cohesion of a text or speech sample. Connectors, such as or, and, so, and also, indicate complex content. Edwards et al. used the number of coordinating conjunctions and the number of subordinating conjunctions as features in their AD diagnosis system (12). The literature does not provide a clear account of the impact of dementia on the use of discourse markers. For example, Davis and MacLagan (72) reported that the patient they followed in their longitudinal study did not show changes in her use of uh, the discourse marker that they focused on. On the other hand, in their study of a polysynthetic agglutinating language, Cowell et al. (73) reported a decline in subordination with clausal connectors. In our study, a distinction is made between general discourse connectors and argumentative discourse connectors, such as but, therefore, and hence, a subset of discourse connectors that indicates a higher level of reasoning and argumentation. Two features of this type are calculated: the ratio of discourse connectors to the number of words (discWrd) and to the number of sentences (discSent). Two similar features are calculated for the argumentative discourse connectors: discArgWrd and discArgSent. As seen in Figure 8, among the four discourse features, only discWrd yielded a significant ANOVA F-test. This is due to the limited usage of argumentative discourse connectors within the cookie theft description task and to the difficulty of delimiting sentences within spoken language.

Figure 8 ANOVA F-tests of the miscellaneous features. The features discWrd, distBigCt, distBigDem, distFourgDem, distTrigCt, distTrigDem, and sentiBlob are significant, with P<0.04. discWrd, word discourse connectors; distBigCt, distance bigram control; distBigDem, distance bigram with dementia of type AD; distFourgDem, distance fourgram with dementia of type AD; distTrigCt, distance trigram control; distTrigDem, distance trigram dementia of type AD; AD, Alzheimer’s disease; ANOVA, analysis of variance.

People with dementia tend to have issues with emotion regulation processes (74). Although the cookie theft task itself is descriptive, several subjects expressed emotions about what they saw by using emotion-carrying words such as good, sad, bad, or mess. Therefore, it seems relevant to consider the emotional lexicon as a feature to detect AD. There have been several approaches to the sentiment analysis of a sentence (75); among them, three lexicon-based approaches are considered: TextBlob, sentiVader, and sentiWordNet. The approach based on sentiWordNet consists of calculating the average sentiment of all the content words (those that remain after removing stop words) within a text or a speech sample. As shown in Figure 8, only sentiBlob yielded a significant ANOVA F-test. Despite that, sentiment analysis features ranked high among the three other adopted feature selection techniques. For example, sentiBlob ranked among the top 25 information gain features, while both sentiWordNet and sentiVader ranked among the top 25 features of the three other feature selection techniques: reliefF, χ2, and information gain. These results confirm that AD has a substantial impact on emotion and sentiment expression. This difference between the subjects shows that some people perceive the same situation from an emotional point of view, while others see it more objectively. Hence, one can expect more substantial differences in emotion expression with tasks involving judging products or people.

As shown in (28), there are several readability formulas that rely on different knowledge sources. Only the readability formulas that rely mainly on lexical characteristics are considered in this paper. Hence the following readability formulas have been selected: Gunning’s Fog Score (76), Spache Score, and the Dale-Chall formula (Eq. [27], Eq. [28], and Eq. [29], respectively).


N is the number of words, S is the number of sentences, and Ncomplex is the number of complex words, i.e., those with three or more syllables.


Nunfam is the number of unfamiliar words, i.e., words that are not in the Spache word list.


Ndif is the number of difficult words, i.e., words that are not in the Dale-Chall list of common words.

As seen in Figure 8, the three adopted readability formulas yielded non-significant ANOVA results, although the Gunning Fog index is among the top 25 features within the χ2 list. A probable reason for this outcome is that those formulas are not designed for spoken language.
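As an example of the lexically driven formulas, Gunning's Fog index (Eq. [27]) combines average sentence length with the share of complex words:

```python
def gunning_fog(n_words, n_sentences, n_complex):
    """Gunning's Fog index: 0.4 * (N / S + 100 * Ncomplex / N),
    where complex words are those with three or more syllables."""
    return 0.4 * (n_words / n_sentences + 100 * n_complex / n_words)
```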

Another way of using the lexicon to detect AD consists of building Ngram profiles of the two types of texts in the data set: AD and control. Three profiles are built for each of those two types: bigram, trigram, and fourgram (respectively big, trig, and fourg). Then, during the feature extraction phase, three profiles are built for the current text and their distances to the corresponding training models are calculated (e.g., the distance between the bigram training model and the bigram model of the current text). Those distances are used as features and are calculated as follows: for every Ngram in the text’s model, the absolute value of the difference between the rank of this Ngram in the text and its rank in the level profile is added to a delta variable. The delta variable is the actual distance between the text and the level profile (77). This approach gives six distances: distFourgCt, distFourgDem, distTrigCt, distTrigDem, distBigCt, and distBigDem. As seen in Figure 8, the six features related to the Ngram profiles yielded significant results. Despite the significance of these features, it is worth noting that they are known to have generalization issues when used outside of the specific task they are trained on.
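The rank-based delta described above can be sketched as follows; profiles are lists of n-grams ordered by decreasing frequency, and the penalty rank for n-grams missing from the level profile is an assumption of this sketch:

```python
def profile_distance(text_profile, level_profile):
    """Out-of-place distance between two n-gram profiles: sum over the text's
    n-grams of |rank in text profile - rank in level profile|."""
    level_rank = {gram: r for r, gram in enumerate(level_profile)}
    penalty = len(level_profile)  # rank assigned to n-grams unseen in the level profile
    return sum(abs(r - level_rank.get(gram, penalty))
               for r, gram in enumerate(text_profile))
```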


The goal of this section is to provide a quantitative evaluation of the predictive power of the features presented in the Feature set section. The following machine learning algorithms have been used in the experiments: logistic regression, SVM, adaptive boosting (AB) with 100 estimators, bagging (78), random forest (RF) with a maximum depth of 2 (79), and eXtreme Gradient Boosting (XG) (80). The following parameters have been used with XG: learning_rate =0.001, n_estimators =5,600, max_depth =5, min_child_weight =1, gamma =0, subsample =0.8, colsample_bytree =0.9, objective = ‘binary:logistic’, seed =25. In addition, two versions of the multilayer perceptron (MLP) were used. The first (MLP) has the following parameters: max_iter =200, hidden_layer_sizes =50, activation function: tanh, solver: adam, alpha =1e−8, while the second version (NN) has the following parameters: max_iter =100, hidden_layer_sizes =40, activation function: relu, solver: adam, alpha =0.0001. All the machine learning parameters were selected empirically: after trying multiple combinations, the ones that gave the optimal results were adopted. These machine learning algorithms were selected for their better performance in preliminary experiments with other algorithms, such as decision trees and naïve Bayes. All experiments were implemented in Python using scikit-learn (81) and other relevant libraries, such as XGBoost. To measure the performance of the different algorithms, accuracy, recall, precision, and F1-score (Eqs. [30-33]), as well as AUC, are reported when relevant; otherwise, only F1 is reported.





TP stands for true positive, TN for true negative, FN for false negative, FP for false positive, and N for the number of cases.
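From these counts, the four scores of Eqs. [30-33] follow directly:

```python
def scores(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```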

Comparison of lexical types

In the Feature set section, we examined the individual impact of the lexical features, presented by type. Here, we ask whether these features, grouped by type, are significant for classifying transcribed speech samples produced by patients with AD vs. those produced by normal control subjects.

In this evaluation, the data is split into two parts: one for training and one for testing. The split is the same as the one proposed by the ADDReSS challenge (see Datasets section for more details). Such a split allows testing the model’s generalization within the same task by exposing it to data unseen during training. Another advantage of using the same split is that the papers from the ADDReSS challenge can serve as baselines for comparison with our models.

From a methodological point of view, although this study does not pretend to cover all lexical features, it nonetheless collected a large number of lexical features per type. Hence, this comparison should give a general tendency. Before presenting the actual classification results per type, it is necessary to present the percentages of features that made it to the top 50, according to the four adopted feature selection techniques: ANOVA, reliefF, χ2, and information gain. As seen in Figure 9, there are some disagreements between the feature selection techniques about the selected percentages of the lexical areas. In all techniques, diversity accounts for about half of the features, which is expected given the large number of features of this type. Lexical sophistication has between 6% and 10% of the features, while lexical density holds less than 10% in all the feature selection techniques. The number of specificity features varies between 2% and 6%. Psycholinguistics also has varied shares, between 6% and 16%. Finally, miscellaneous features occupy about 20% in the four feature selection techniques.

Figure 9 Rankings of the top 50 features according to four different feature selection techniques. ANOVA, analysis of variance.

The question now is whether these numbers of features have an impact on evaluation scores of the classification with the different lexical types. Because of the special nature of the Ngrams of words, two models were built out of miscellaneous features: misc1 and misc2. The model misc1 includes the Ngrams profiles, while misc2 includes all the miscellaneous features, except for the Ngrams profiles. The F1 and accuracy results of the classification are reported in Figure 10, where the eight used machine learning algorithms are trained on the ADDReSS training data and tested on the testing data of the same data set.

Figure 10 F1 scores and accuracies of the lexical areas, on the ADDReSS challenge data set. XG, eXtreme Gradient Boosting; RF, random forest; NN, neural network; AB, adaptive boosting; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; SVM, support vector machine.

Several observations emerge from Figure 10. The psycholinguistic features provide the top classification results with the XG algorithm, with over 0.8 for both accuracy and F1. These results confirm the high impact of the psycholinguistic features, as demonstrated in Table 1. The model misc1 gives higher results than misc2 because it includes the distances between Ngram profiles. The overall ranking of the features does not seem to systematically affect the results: although diversity accounts for about half of the top 50 features while density and specificity each account for less than 10%, both density and specificity yield higher F1 and accuracy scores than diversity. Moreover, all the lexical areas performed below the best result reported in the ADReSS Challenge (11), which is 0.89 for both F1 and accuracy. This is probably because large groups of features, such as the diversity group, contain noisy and partially redundant features.

Nonetheless, all the lexical complexity types provide decent classification results, with misc2 providing the lowest performance. No machine learning algorithm dominates across the models: XG provides the best performance with psycholinguistics, RF with density and sophistication, MLP with specificity and misc1, and LR with misc2. Finally, no substantial differences are observed between accuracy and F1 scores.

To better understand the results of the classifiers, an examination of the confusion matrices of the four top classifiers is provided in Table 2. Those classifiers are selected because they provide the best performance per lexical type.

Table 2

Confusion matrices of the four top classifiers with the lexical areas: psycholinguistics, density, miscellaneous 1, and specificity

Model                       True class   Predicted control   Predicted AD
Psycholinguistics with XG   Control      92%                 8%
                            AD           25%                 75%
Density with RF             Control      92%                 8%
                            AD           41%                 59%
Misc1 with MLP              Control      16%                 84%
                            AD           0%                  100%
Specificity with MLP        Control      75%                 25%
                            AD           25%                 75%

AD, Alzheimer’s disease; XG, eXtreme Gradient Boosting; RF, random forest; MLP, multilayer perceptron.

As shown in Table 2, both psycholinguistics with XG and density with RF are better at detecting control cases, while specificity with MLP performs identically on control and AD cases. On the other hand, misc1 with MLP detects AD perfectly but performs poorly on control cases. This suggests that the outcome pattern depends on the combination of the feature set and the machine learning algorithm.
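The percentages in Table 2 are confusion matrices normalized by true class, so that each row sums to 100%. A sketch of this computation, using illustrative labels and predictions (not the study's actual outputs), is shown below; with these toy vectors the result reproduces the 92/8 and 25/75 pattern of the first model in Table 2.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions: 12 control subjects, 12 AD subjects
y_true = ["control"] * 12 + ["AD"] * 12
y_pred = ["control"] * 11 + ["AD"] + ["AD"] * 9 + ["control"] * 3

# normalize="true" divides each row by the number of true samples in that class
cm = confusion_matrix(y_true, y_pred, labels=["control", "AD"], normalize="true")
cm_pct = (cm * 100).round()  # rows: true class; columns: predicted class
```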

In the previous evaluation, the models were tested on data from the same data set, unseen during training. However, key similarities remain between the training and test data, as both concern the same cookie theft task. Therefore, the language samples in both sets are shaped by the same collection conditions, such as the dialog context and the semantic field. It is therefore interesting to test how the different lexical complexity areas covered in this paper scale up when applied to texts of a different task and even a different language type (transcribed spoken language vs. edited written language). Such a task scalability evaluation is useful from two perspectives. From a theoretical perspective, it allows us to compare the task dependence of the tested models. From a practical perspective, it gives good insight into the feasibility of a task-independent system for AD testing, in which spoken or written language samples about an undefined task are collected and used to perform the test. Hence, a task scalability evaluation is conducted by training the classifiers on the entire ADReSS data set and testing them on the three data sets extracted from the novels of Iris Murdoch and Agatha Christie and from President Reagan’s speeches (see Datasets section). Given that no substantial differences were observed between F1 and accuracy scores, only F1 results are reported.

The results of the task scalability test on the data set extracted from Iris Murdoch novels are presented in Figure 11.

Figure 11 F1 scores of the task scalability evaluation conducted on Iris Murdoch data Set. XG, eXtreme Gradient Boosting; RF, random forest; NN, neural network; AB, adaptive boosting; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; SVM, support vector machine.

As seen in this figure, density with RF achieves the highest F1 score, while diversity’s best score is much lower. As expected, misc2 scales up better than misc1, which relies on Ngram profiles.

The task scalability results on Agatha Christie’s data are presented in Figure 12. The first observation is that, unlike on Iris Murdoch’s data, diversity scales better than density, and sophistication and psycholinguistics also scale better here. The difference between misc1 and misc2 is confirmed here as well. This difference between the two authors’ data can be attributed to differences in their writing styles, to the specific type of dementia each had, or to a combination of both.

Figure 12 F1 scores of the task scalability evaluation on Agatha Christie’s data. XG, eXtreme Gradient Boosting; RF, random forest; NN, neural network; AB, adaptive boosting; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; SVM, support vector machine.

The patterns of the results on President Reagan’s data (Figure 13) are closer to those on Iris Murdoch’s data: density scales better than diversity and the rest of the lexical types. Sophistication and psycholinguistics do not perform as well here as on Agatha Christie’s data.

Figure 13 F1 scores of the task scalability evaluation on President Ronald Reagan’s data. XG, eXtreme Gradient Boosting; RF, random forest; NN, neural network; AB, adaptive boosting; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; SVM, support vector machine.

Let’s examine some confusion matrices of the top classifiers to understand the generalization patterns. These classifiers are selected because they provide the highest F1 scores on their respective data sets. As seen in Table 3, the error patterns differ depending on the machine learning algorithm, the test data, and the set of features. For example, with Iris Murdoch’s data, the RF algorithm, and the density features, the classifier correctly classifies AD texts 95% of the time, while it has trouble with control cases. With Ronald Reagan’s data, the XG algorithm, and the density features, the classifier also classifies AD cases better than control cases, but the overall score is lower than on Iris Murdoch’s data. Finally, with Agatha Christie’s data, the RF algorithm, and the diversity features, the classifier classifies control data perfectly but makes many mistakes on AD cases.

Table 3

Confusion matrices of the top classifiers with the three task generalization datasets

Model                                True class   Predicted control   Predicted AD
Density with XG, Ronald Reagan       Control      45%                 55%
                                     AD           25%                 75%
Diversity with RF, Agatha Christie   Control      100%                0%
                                     AD           42%                 58%
Density with RF, Iris Murdoch        Control      50%                 50%
                                     AD           5%                  95%

AD, Alzheimer’s disease; XG, eXtreme Gradient Boosting; RF, random forest.

Continuous feature selection

Having explored a large number of lexical features, the question that needs to be addressed is the following: which combinations of those features lead to the optimal classification result? To answer it, four feature selection techniques were adopted in the experiments: ANOVA, χ2, information gain, and reliefF. Each of these methods provides a different ranking of the features. Hence, for each technique, the nested subsets formed by the top-ranked feature, the top two, and so on up to the entire feature set, were tested with seven of the machine learning algorithms used in the Comparison of lexical types section; NN was discarded because it systematically gave low results. The selected features did not include the Ngram profiles, because of their generalization issues. In this evaluation, the ADDReSS challenge scenario was replicated: all the machine learning models were trained on the training data provided on the ADDReSS challenge website and tested on the test data from the same website. As mentioned before, besides being balanced, this data set makes it possible to compare our results with those reported by other works that used the same evaluation scenario.

The recall, precision, F1, and AUC results of this evaluation are presented in Figure 14. As shown in this figure, the patterns of recall, precision, and AUC are similar to those of F1; therefore, only the F1 results will be discussed. The figure also shows that, when it comes to features, quality matters more than quantity. AB is among the most demanding algorithms in terms of number of features, as it reaches its top performance with χ2 and reliefF with more than 50 features. MLP is the least stable, as its performance swings substantially up and down as the number of features increases. MLP is also the fastest to peak with χ2, requiring only 3 features.
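The nested subset search described above can be sketched as follows: rank the features once per technique (ANOVA here), then evaluate the subsets of the top 1, 2, …, p features and keep the best F1. The data and the single-classifier choice are toy assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(3)
X_train, y_train = rng.random((108, 20)), rng.integers(0, 2, 108)  # toy train split
X_test, y_test = rng.random((48, 20)), rng.integers(0, 2, 48)      # toy test split

scores, _ = f_classif(X_train, y_train)   # ANOVA F-scores per feature
order = np.argsort(scores)[::-1]          # best-ranked feature first

best_f1, best_k = -1.0, 0
for k in range(1, X_train.shape[1] + 1):  # nested subsets: top 1, top 2, ...
    cols = order[:k]
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    f1 = f1_score(y_test, clf.predict(X_test[:, cols]))
    if f1 > best_f1:
        best_f1, best_k = f1, k
```

In the study, this loop is run for each of the four ranking techniques and each of the seven algorithms.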
The top performance obtained with information gain is 0.875, achieved with 39 features (the length of AtopInfoGain). The top overall result of 0.895 was achieved with ANOVA; the corresponding list, AtopANOVA, has 15 features. This F1 score is equal to the best performance reported in the ADReSS Challenge (11). Consequently, this result shows that lexical information alone allows us to achieve such high performance.

Figure 14 Recall, precision, area under the curve, and F1 scores with seven machine learning algorithms and the four adopted feature selection techniques: ANOVA, χ2, information gain, and reliefF. ANOVA, analysis of variance; XG, eXtreme Gradient Boosting; RF, random forest; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; AB, adaptive boosting; SVM, support vector machine.

To showcase the contribution of this paper, the union of the four lists that gave the best performances in Figure 14 was computed to obtain the overall top feature list: AtopOverall = AtopANOVA ∪ AtopChi ∪ AtopInfoGain ∪ AtopReliefF. The resulting list AtopOverall contains 27 features, whose distribution is presented in Figure 15. As shown in this figure, most of the features that made it to the top are either proposed in this paper or used for the first time for AD diagnosis; about 37% of the features have been used in previous works.
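The overall top list is a plain set union of the four best-performing lists. A tiny sketch, using a few feature abbreviations from Figure 15 as illustrative stand-ins for the real lists:

```python
# Hypothetical top lists per feature selection technique (toy contents)
top_anova = {"TTR", "GTTR", "rtAdj"}
top_chi = {"TTR", "PMI"}
top_info_gain = {"GTTR", "discWrd"}
top_relieff = {"TTR", "rtVb"}

# AtopOverall = AtopANOVA ∪ AtopChi ∪ AtopInfoGain ∪ AtopReliefF
top_overall = top_anova | top_chi | top_info_gain | top_relieff
```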

Figure 15 Distribution of the most impactful features. PMI, point wise mutual information; VSM, verb sophistication measure; ANR, adjective-to-noun ratio; TTR, type token ratio; CTTR, Caroll TTR; NVR, noun-to-verb ratio; GTTR, Guiraud’s corrected TTR; rtAdj, ratio of adjectives; rtVb, ratio of the verbs; rtAdv, ratio of adverbs; discWrd, word discourse connectors; rtNouns, ratio of the number of nouns.

Input length experiments

Another key question that, to the best of our knowledge, has not been addressed in previous works is the length of the input used in training and testing machine learning models. Knowing the optimal input size can help maximize the yield of machine learning algorithms. Moreover, optimizing the input size helps design psychological tests such that the right amount of data is gathered from the subjects, thus reducing the effort and cost of the process. Before presenting the results, some basic statistics help clarify the distribution of, and disparities between, the lengths of the speech samples within the ADDReSS data set (Table 4).

Table 4

Statistics of the numbers of tokens of speech samples within the training and testing parts of the ADDReSS data set

Data       Min   Max   Mean      SD
Training   44    559   135.129   73.164
Testing    48    523   138.833   84.096

SD, standard deviation.

A test is conducted with gradual input lengths ranging from 5 to 225 tokens, with an increment of 1. The upper bound of 225 is chosen because it exceeds the mean length by about one standard deviation in both the test and training data. This test is repeated with the four optimal feature sets identified with the feature selection techniques. Given that some samples are shorter than a given bound, the whole sample gathered from a subject is used when its length is below the bound, and the sample is truncated to the bound when its length equals or exceeds it. The F1 results are presented in Figure 16. Accuracy results were similar to the F1 scores and are therefore not presented.
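The capping rule above can be sketched as a small helper; the token lists here are placeholders for the real transcripts.

```python
def limit_tokens(tokens, bound):
    """Keep the whole sample when shorter than `bound`; truncate otherwise."""
    return tokens if len(tokens) < bound else tokens[:bound]

# Gradual upper limits, increment of 1, as in the experiment
bounds = range(5, 226)
```

For each bound in `bounds`, the features are recomputed on the capped samples and the classifiers are retrained and retested.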

Figure 16 Experiments conducted with gradual lengths between 5 and 225 tokens, with an increment of 1. ANOVA, analysis of variance; XG, eXtreme Gradient Boosting; RF, random forest; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; AB, adaptive boosting; SVM, support vector machine.

As we can see in Figure 16, the performance with all four feature selection techniques is slightly improved by reducing the input length. For example, XG peaks at 0.895 with around 150 tokens and the optimal information gain features, whereas the same algorithm peaks at 0.854 without input limitation. Similar improvements were observed with other machine learning algorithms, such as RF and MLP, meaning that the benefit of limiting the input length is not specific to a single algorithm. The best overall performance of 0.917 was obtained with the optimal set of χ2 features and XG, with an input limited to 80 tokens. This score slightly exceeds the best F1 score reported in the ADReSS Challenge (11).

To study the impact of input length on generalization to other tasks, similar tests were conducted on the Iris Murdoch data: the best sets of features according to the four adopted feature selection techniques were used, with input lengths ranging from 5 to 225 tokens in increments of 1 (Figure 17).

Figure 17 Generalization test on Iris Murdoch data with gradual input sizes between 5 and 225 tokens. ANOVA, analysis of variance; XG, eXtreme Gradient Boosting; RF, random forest; LR, logistic regression; LD, lexical density; MLP, multilayer perceptron; AB, adaptive boosting; SVM, support vector machine.

As seen in Figure 17, the best overall F1 performance of about 0.78 is achieved with AB, information gain, and an input length of 12 tokens. This shows that, despite the decline in performance when training and testing are done on two different data sets, lexical models can still achieve reasonable generalization to other tasks when the input length is limited. This result is very promising given the known differences between the spoken language used for training and the written language samples used for testing (82).


Discussion

The results reported in the Results section showed that lexical features help achieve state-of-the-art classification performance between speech samples produced by normal subjects and those produced by patients with AD, confirming the importance of the lexicon in detecting AD. Some points deserve further discussion.

As seen in the confusion matrices (Tables 2,3), there are substantial differences in the behavior of the leading classifiers. Some are better at classifying control or AD cases, while others behave in a more balanced way. This difference can be explained by the partial redundancy among the features considered by the different models and by the aspects of the data captured by their machine learning algorithms. In any case, these confusion matrices provide a good tool for picking the most balanced models.

The tests conducted on the features grouped by lexical area (Figure 10) did not lead to optimal performance. This is expected given the higher inner redundancy among features of the same type. Nonetheless, besides allowing a comparison of the different lexical types, those experiments showed decent performance for the larger groups of features such as psycholinguistics, specificity, density, and diversity.

The feature selection tests with the four adopted techniques show that this approach leads to better performance than grouping the features by lexical type. As shown in Figure 14, several combinations of feature subsets and machine learning algorithms provided state-of-the-art performance compared to the best results reported in the ADReSS challenge (11).

Besides leading to a performance improvement over the best reported result in the ADReSS challenge, the tests conducted with gradual input lengths show that there is a window of input length within which the classifiers are optimal, suggesting that too many words in the test sample can introduce noise and biases that cause performance deterioration.

As expected, task generalization led to a decrease in the classifiers’ performance. However, this decrease was moderate. On the one hand, this shows that lexicon-based models can scale up well across tasks. On the other hand, it suggests that it is reasonable to envision a generic classifier that can detect AD independently of the task. Two of the task generalization tests, which use written samples produced by professionally published writers, suggest that this generalization goes beyond the semantic aspect of the task to the type of language itself. Unlike spoken language, written language gives the author the possibility of planning and of editing, whether by the author or by others. Another key difference between the datasets is that the speech samples in the ADDReSS dataset were produced by ordinary people, whereas the tested texts were written by two of the most successful British novelists, whose mastery of language is above that of average speakers, and by an American president known for his superior communication skills. All these strong differences could not eclipse the impact of AD on the lexicon, which still allowed a reasonable classification rate of about 0.8.

Previous works have shown that patients with AD, and dementia patients in general, can experience cognitive fluctuations, alternating periods of lucidity, or normal cognitive performance, with periods of cognitive impairment (83,84). In our data, this phenomenon can potentially affect only the part gathered from people with AD, as normal subjects are not known to go through similar substantial alternations. The only potential bias introduced by such fluctuations therefore makes the tests conducted in this paper, with all the used data sets, more challenging, as it is not clear whether the text or speech samples were collected during a lucid or an impaired period. In other words, the results reported in this paper are pessimistic, or worst-case, figures of the classifiers’ performance. In a real-world scenario, the human experts who collect the language samples can avoid data generated during lucid periods, which should bring the classifiers’ performance up.


Limitations

Unlike previous works, this study is not limited to a single data set. Nonetheless, despite the variety of the data sets used, the presented results are likely to improve as the number of data samples increases. Given the substantial progress in automatic speech recognition (ASR) and the fact that this paper confirmed previous results suggesting that AD can be detected in both edited and unedited written language, it was assumed here that ASR errors would not substantially affect the classification results. A future work involving the output of an ASR module could clarify the actual impact of speech recognition errors on AD classification.

This study is limited to lexical features, which have been shown to play a key role in AD detection in several previous works. However, lexical features are not the only ones that can help detect AD; others, such as acoustic, morphological, and syntactic features, need to be explored. Moreover, as the generalization tests showed, the Ngram profiles used in this study do not scale up well to other tasks.


Conclusions

This paper explored the impact of AD on the lexicon and showed how lexical features can help build an automatic classifier that predicts whether a transcribed speech sample or a written text was produced by someone with AD.

The set of 99 features covered in this study includes features that have been used before in AD studies, such as ANR, TTR and some of its derivatives like the Brunet index and GTTR. It also includes features that have been explored in other application areas, such as sentiment analysis scores, gap verbs, and irregular verbs. Finally, some new features are proposed by the author, such as content focus, Wu-Palmer similarity, path similarity, and the ratio of hyponyms to words. The individual examination of this wide body of features sheds more light on the impact of AD on the lexicon in general. Among the 27 features that helped reach the peak machine learning classification scores, about 63% were either proposed here or used for the first time to detect AD. Among the findings worth mentioning is the substantial impact of AD on the usage of the emotional lexicon, as measured by lexicon-based sentiment analysis algorithms. It is also worth noting that the features that made it to the top list are distributed over all the lexical complexity areas.
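For reference, three of the diversity measures named above can be computed from a tokenized sample as follows. These are the standard textbook definitions (TTR, Guiraud's corrected TTR, Carroll's corrected TTR), not necessarily the author's exact implementation.

```python
import math

def ttr(tokens):
    """Type-token ratio: number of distinct word types over total tokens."""
    return len(set(tokens)) / len(tokens)

def gttr(tokens):
    """Guiraud's corrected TTR: types over the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def cttr(tokens):
    """Carroll's corrected TTR: types over the square root of twice the tokens."""
    return len(set(tokens)) / math.sqrt(2 * len(tokens))
```

For example, the sample ["a", "b", "a", "c"] has 3 types over 4 tokens, giving a TTR of 0.75.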

The second part of this paper used the features identified in the first part to build machine learning models that detect AD from written or spoken language samples produced by patients with AD and normal control subjects. The experiments on features grouped by type of lexical complexity showed that the psycholinguistic features were the most important when testing on data from the same task. Depending on the data used, density and diversity scaled better than the other types of lexical complexity when tested on a different task. All the groups of features by lexical complexity type performed below the state-of-the-art. Feature selection experiments, with the four adopted techniques, helped optimize the performance of the seven adopted machine learning algorithms: with the 15 best features selected by ANOVA and the XG algorithm, state-of-the-art performance was achieved. The experiments with gradual input lengths showed that the input length is a key factor that influences the results, whether it is too short or too long. Hence, an improvement in the performance of all the best models was achieved when testing with an optimal input length. The reported performance of 0.917 was obtained with an input length of 80 tokens and the optimal χ2 set of features with the XG algorithm, a slight improvement over the best result reported in the ADReSS challenge. Furthermore, the task scalability evaluation showed that optimizing the input length can lead to performance improvement as well.

Overall, this paper showed the importance of the lexicon as a means to detect AD. Nonetheless, it would be interesting in a future work to combine lexical features with other linguistic and acoustic models to further optimize the classification performance. Conducting experiments with other feature selection approaches can help identify other optimal sets of features and perhaps bring further improvement to the classification performance.


Funding: None.


Peer Review File: Available at

Conflicts of Interest: The author has completed the ICMJE uniform disclosure form (available at The author has no conflicts of interest to declare.

Ethical Statement: The author is accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). No additional ethics approval or informed consent was required because the study was based on public databases.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See:


2 last accessed on February 12, 2024


4based on the depth of the two senses in the hierarchy and that of their least common subsumer.

5returns a score based on the shortest path that connects the senses in the hypernym hierarchy.

6based on the shortest path that connects the senses and the maximum depth of the hierarchy in which the senses occur.





11The Dale-Chall word list of common words:


  1. Hier DB, Hagenlocker K, Shindler AG. Language disintegration in dementia: effects of etiology and severity. Brain Lang 1985;25:117-33. [Crossref] [PubMed]
  2. Rasmussen J, Langerman H. Alzheimer’s Disease - Why We Need Early Diagnosis. Degener Neurol Neuromuscul Dis 2019;9:123-30. [Crossref] [PubMed]
  3. Fernández Montenegro JM, Villarini B, Angelopoulou A, et al. A Survey of Alzheimer’s Disease Early Diagnosis Methods for Cognitive Assessment. Sensors (Basel) 2020;20:7292. [Crossref] [PubMed]
  4. Garrard P, Maloney LM, Hodges JR, et al. The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author. Brain 2005;128:250-60. [Crossref] [PubMed]
  5. Van Velzen M, Garrard P. From hindsight to insight — retrospective analysis of language written by a renowned Alzheimer’s patient. Interdiscip Sci Rev 2008;33:278-86. [Crossref]
  6. Le X, Lancashire I, Hirst G, et al. Longitudinal detection of dementia through lexical and syntactic changes in writing: a case study of three British novelists. Literary and Linguistic Computing 2011;26:435-61. [Crossref]
  7. Fraser KC, Meltzer JA, Rudzicz F. Linguistic Features Identify Alzheimer’s Disease in Narrative Speech. J Alzheimers Dis 2016;49:407-22. [Crossref] [PubMed]
  8. Orimaye SO, Wong JS, Golden KJ, et al. Predicting probable Alzheimer’s disease using linguistic deficits and biomarkers. BMC Bioinformatics 2017;18:34. [Crossref] [PubMed]
  9. Luz S, Haider F, De la Fuente S, et al. Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge. In: Proceedings of the Interspeech Conference, September 14-18, Shanghai, China. 2020.
  10. Chlasta K, Wołk K. Towards Computer-Based Automated Screening of Dementia Through Spontaneous Speech. Front Psychol 2020;11:623237. [Crossref] [PubMed]
  11. Yuan J, Bian Y, Cai X, et al. Disfluencies and Fine-Tuning Pre-trained Language Models for Detection of Alzheimer’s Disease. In: Proceedings of the Interspeech Conference, September 14-18, 2020, Shanghai, China. 2020:2162-6.
  12. Edwards E, Dognin C, Bollepalli B, et al. Multiscale System for Alzheimer’s Dementia Recognition through Spontaneous Speech. In: Proceedings of the Interspeech Conference, September 14-18, Shanghai, China. 2020:2197-201.
  13. Yamada Y, Shinkawa K, Nemoto M, et al. Speech and language characteristics differentiate Alzheimer’s disease and dementia with Lewy bodies. Alzheimers Dement (Amst) 2022;14:e12364. [Crossref] [PubMed]
  14. Macwhinney B, Fromm D, Forbes M, et al. AphasiaBank: Methods for Studying Discourse. Aphasiology 2011;25:1286-307. [Crossref] [PubMed]
  15. Wang N, Luo F, Peddagangireddy V, et al. Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches. In: Proceedings of the BioNLP 2020 workshop, Seattle. 2020:133-9.
  16. Berisha V, Wang S, LaCross A, et al. Tracking discourse complexity preceding Alzheimer’s disease diagnosis: a case study comparing the press conferences of Presidents Ronald Reagan and George Herbert Walker Bush. J Alzheimers Dis 2015;45:959-63. [Crossref] [PubMed]
  17. Laufer B, Nation P. Vocabulary size and use: lexical richness in L2 written production. Applied Linguistics 1995;16:307-22. [Crossref]
  18. Shellikeri S, Cho S, Cousins KAQ, et al. Natural speech markers of Alzheimer’s disease co-pathology in Lewy body dementias. Parkinsonism Relat Disord 2022;102:94-100. [Crossref] [PubMed]
  19. O’Loughlin K. Lexical Density in Candidate Output on Direct and Semi-direct Versions of an Oral Proficiency Test. Language Testing 1995;12:217-37. [Crossref]
  20. Engber CA. The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing 1995;4:139-55. [Crossref]
  21. Lu X. The Relationship of Lexical Richness to the Quality of ESL Learners’ Oral Narratives. The Modern Language Journal 2012;96:190-208. [Crossref]
  22. Anastassiou F, Andreou G. Speech Production of Trilingual Children: A Study on Their Transfers in Terms of Content and Function Words and the Effect of Their L1. International Journal of English Linguistics 2017;2017: [Crossref]
  23. Litvinova A, Seredin P, Litvinova O, et al. Differences in type-token ratio and part-of-speech frequencies in male and female Russian written texts. In: Proceedings of the Workshop on Stylistic Variation, Copenhagen, Denmark. 2017:69-73.
  24. Cho S, Cousins KAQ, Shellikeri S, et al. Lexical and Acoustic Speech Features Relating to Alzheimer Disease Pathology. Neurology 2022;99:e313-22. [Crossref] [PubMed]
  25. Ahmed S, Haigh AM, de Jager CA, et al. Connected speech as a marker of disease progression in autopsy-proven Alzheimer’s disease. Brain 2013;136:3727-37. [Crossref] [PubMed]
  26. Kim M, Thompson CK. Verb deficits in Alzheimer’s disease and agrammatism: implications for lexical organization. Brain Lang 2004;88:1-20. [Crossref] [PubMed]
  27. Malvern D, Richards B, Chipere N, et al. Lexical Diversity and Language Development: Quantification and Assessment. New York: Palgrave Macmillan, 2004.
  28. Kurdi MZ. Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL. Journal of Data Mining and Digital Humanities 2020. doi: 10.46298/jdmdh.6012.
  29. Klee T. Developmental and diagnostic characteristics of quantitative measures of children’s language production. Top Lang Disord 1992;12:28-41. [Crossref]
  30. Shin S, Park H, Hill K. Identifying the Core Vocabulary for Adults With Complex Communication Needs From the British National Corpus by Analyzing Grouped Frequency Distributions. J Speech Lang Hear Res 2021;64:4329-43. [Crossref] [PubMed]
  31. Templin MC. Certain language skills in children. Minneapolis: University Of Minnesota Press; 1957.
  32. Shinkawa K, Yamada Y. Word Repetition in Separate Conversations for Detecting Dementia: A Preliminary Evaluation on Data of Regular Monitoring Service. AMIA Jt Summits Transl Sci Proc 2018;2017:206-15. [PubMed]
  33. Hess CW, Sefton KM, Landry RG. Sample size and type-token ratios for oral language of preschool children. J Speech Hear Res 1986;29:129-34. [Crossref] [PubMed]
  34. Richards B. Type/Token Ratios: what do they really tell us? J Child Lang 1987;14:201-9. [Crossref] [PubMed]
  35. Arnaud PJL. Objective Lexical and Grammatical Characteristics of L2 Written Compositions and the Validity of Separate-Component Tests. In: Arnaud PJL, Béjoint H, editors. Vocabulary and Applied Linguistics. London: Palgrave Macmillan; 1992. doi: 10.1007/978-1-349-12396-4_13.
  36. Guiraud P. Problèmes et Méthodes de la Statistique Linguistique. 1959. [Problems and Methods of Linguistic Statistics].
  37. Carroll JB. Language and Thought. Englewood Cliffs, NJ: Prentice-Hall; 1954.
  38. Covington MA, McFall JD. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics 2010;17:94-100. [Crossref]
  39. McCarthy PM, Jarvis S. MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav Res Methods 2010;42:381-92. [Crossref] [PubMed]
  40. Richards BJ. Quantifying lexical diversity in the study of language development. Reading: Faculty of Education and Community Studies; 1997.
  41. Johansson V. Lexical diversity and lexical density in speech and writing: a developmental perspective. Working Papers, Lund University, Department of Linguistics and Phonetics 2008;53:61-79.
  42. Daniel D. Vocabulaire et Stylistique: Théâtre et Dialogue. Paris: Slatkine; 1979. [Vocabulary and Stylistics: Theater and Dialogue].
  43. Lissón P, Ballier N. Investigating Lexical Progression through Lexical Diversity Metrics in a Corpus of French L3. Discours 2018. doi: 10.4000/discours.9950.
  44. Nasseri M. Statistical modelling of lexical and syntactic complexity of postgraduate academic writing: a genre and corpus-based study of EFL, ESL, and English L1 M.A. dissertations. University of Birmingham; 2018.
  45. Yule GU. The statistical study of literary vocabulary. Cambridge: Cambridge University Press; 2014.
  46. Bucks RS, Singh S, Cuerden JM, et al. Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology 2000;14:71-91. [Crossref]
  47. Hernández-Domínguez L, Ratté S, Sierra-Martínez G, et al. Computer-based evaluation of Alzheimer’s disease and mild cognitive impairment patients during a picture description task. Alzheimers Dement (Amst) 2018;10:260-8. [Crossref] [PubMed]
  48. Guinn CI, Habash A. Language Analysis of Speakers with Dementia of the Alzheimer’s Type. In: Proceedings of AAAI Fall Symposium. 2012.
  49. Honoré A. Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin 1979;7:172-7.
  50. Herdan G. Type-token mathematics. The Hague: Mouton; 1960.
  51. Maas HD. Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik 1972;8:73-9. [On the relation between vocabulary size and length of a text].
  52. Zhang Y, Wu W. How effective are lexical richness measures for differentiations of vocabulary proficiency? A comprehensive examination with clustering analysis. Language Testing in Asia 2021;11:15. [Crossref]
  53. Thoiron P. Diversity Index and Entropy as Measures of Lexical Richness. Computers and the Humanities 1986;20:197-202. [Crossref]
  54. Kurdi MZ. Measuring content complexity of technical texts: machine learning experiments, In: Proceedings of The 20th International Conference on Artificial Intelligence in Education (AIED), Chicago. 2019;148-52.
  55. Rose S, Engel D, Cramer N, et al. Automatic Keyword Extraction from Individual Documents. In: Berry M, Kogan J, editors. Text Mining: Applications and Theory. John Wiley & Sons, Ltd; 2010. doi: 10.1002/9780470689646.ch1.
  56. Read J. Assessing vocabulary. Cambridge: Cambridge University Press; 2000.
  57. Hyltenstam K. Lexical characteristics of near-native second-language learners of Swedish. Journal of Multilingual & Multicultural Development 1988;9:67-84. [Crossref]
  58. Astell AJ, Harley TA. Tip-of-the-tongue states and lexical access in dementia. Brain Lang 1996;54:196-215. [Crossref] [PubMed]
  59. Davies M. The 385+ million word Corpus of Contemporary American English (1990–2008): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics 2009;14:159-90. [Crossref]
  60. Harley B, King ML. Verb Lexis in the Written Compositions of Young L2 Learners. Studies in Second Language Acquisition 1989;11:415-39. [Crossref]
  61. Wolfe-Quintero K, Inagaki S, Kim HY. Second Language Development in Writing: Measures of Fluency, Accuracy, and Complexity. Honolulu, HI: University of Hawaii Press; 1998.
  62. Pinker S. Learnability and Cognition: the Acquisition of Argument Structure. Boston: MIT Press; 1989.
  63. Maouene J, Laakso A, Smith LB. Object associations of early-learned light and heavy English verbs. First Lang 2011; [Crossref] [PubMed]
  64. Thordardottir ET, Ellis Weismer S. High-frequency verbs and verb diversity in the spontaneous speech of school-age children with specific language impairment. Int J Lang Commun Disord 2001;36:221-44. [Crossref] [PubMed]
  65. Feldman LB, Kostić A, Basnight-Brown DM, et al. Morphological facilitation for regular and irregular verb formations in native and non-native speakers: Little evidence for two distinct mechanisms. Biling (Camb Engl) 2010;13:119-35. [Crossref] [PubMed]
  66. Justus T, Larsen J, de Mornay Davies P, et al. Interpreting dissociations between regular and irregular past-tense morphology: evidence from event-related potentials. Cogn Affect Behav Neurosci 2008;8:178-94. [Crossref] [PubMed]
  67. Eyigoz E, Mathur S, Santamaria M, et al. Linguistic markers predict onset of Alzheimer’s disease. EClinicalMedicine 2020;28:100583. [Crossref] [PubMed]
  68. Xue G, Nation ISP. A University Word List. Language Learning and Communication 1984;3:215-29.
  69. Francis WN, Kucera H. Frequency analysis of English usage: lexicon and grammar. Boston: Houghton Mifflin; 1982.
  70. Nickerson CA, Cartwright DS. The University Of Colorado Meaning Norms. Behavior Research Methods, Instruments, & Computers 1984;16:355-82. [Crossref]
  71. Bird H, Lambon Ralph MA, Patterson K, et al. The rise and fall of frequency and imageability: noun and verb production in semantic dementia. Brain Lang 2000;73:17-49. [Crossref] [PubMed]
  72. Davis BH, MacLagan MA. UH as a pragmatic marker in dementia discourse. Journal of Pragmatics 2020;156:83-99. [Crossref]
  73. Cowell A, Ramsberger G, Menn L. Dementia and Grammar in a Polysynthetic Language: An Arapaho Case Study. Language 2017;93:97-120. [Crossref]
  74. Perach R, Rusted J, Harris PR, et al. Emotion regulation and decision-making in persons with dementia: A scoping review. Dementia (London) 2021;20:1832-54. [Crossref] [PubMed]
  75. Birjali M, Kasri M, Beni-Hssane A. A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowledge-Based Systems 2021;226:107134. [Crossref]
  76. Gunning R. The Technique of Clear Writing. McGraw-Hill; 1952.
  77. Cavnar WB, Trenkle JM. N-Gram-Based Text Categorization. In: Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94), Las Vegas. 1994.
  78. Breiman L. Bagging Predictors. Department of Statistics. Technical Report No. 421. Berkeley, CA: University of California; 1994.
  79. Ho TK. Random Decision Forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC. 1995.
  80. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, 2016:785-794.
  81. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011;12:2825-30.
  82. Kurdi MZ. Natural Language Processing and Computational Linguistics 1: Speech, Morphology and Syntax. London: ISTE-Wiley; 2016.
  83. Escandon A, Al-Hammadi N, Galvin JE. Effect of cognitive fluctuation on neuropsychological performance in aging and dementia. Neurology 2010;74:210-7. [Crossref] [PubMed]
  84. Van Dyk K, Towns S, Tatarina O, et al. Assessing Fluctuating Cognition in Dementia Diagnosis: Interrater Reliability of the Clinician Assessment of Fluctuation. Am J Alzheimers Dis Other Demen 2016;31:137-43. [Crossref] [PubMed]
doi: 10.21037/jmai-23-104
Cite this article as: Kurdi MZ. Automatic diagnosis of Alzheimer’s disease using lexical features extracted from language samples. J Med Artif Intell 2024;7:13.
