Automatic language analysis identifies and predicts schizophrenia in first-episode of psychosis

Nature

ABSTRACT Automated language analysis of speech has been shown to distinguish healthy control (HC) vs chronic schizophrenia (SZ) groups, yet the predictive power on first-episode psychosis

patients (FEP) and the generalization to non-English speakers remain unclear. We performed a cross-sectional and longitudinal (18 months) automated language analysis in 133 Spanish-speaking

subjects from three groups: healthy control or HC (_n_ = 49), FEP (_n_ = 40), and chronic SZ (_n_ = 44). Interviews were manually transcribed, and the analysis included 30 language features

(4 verbal fluency; 20 verbal productivity; 6 semantic coherence). Our cross-sectional analysis showed that using the top ten ranked and decorrelated language features, an automated HC vs SZ

classification achieved 85.9% accuracy. In our longitudinal analysis, 28 FEP patients were diagnosed with SZ at the end of the study. Here, combining demographics, PANSS, and language

information, the prediction accuracy reached 77.5% mainly driven by semantic coherence information. Overall, we showed that language features from Spanish-speaking clinical interviews can

distinguish HC vs chronic SZ, and predict SZ diagnosis in FEP patients.

INTRODUCTION Schizophrenia (SZ) is a severe neurodevelopmental psychotic disorder with a lifetime prevalence of 0.7% that causes emotional, behavioral, sensory,

psychomotor, and cognitive alterations with a chronic and deteriorating course1. It is common, at least in Chile2, to require clinical follow-up and the treating team’s combined effort to

confirm or rule out the diagnosis. Moreover, in the case of teenagers, it is a process that spans several months or even a year of transition cycling in and out of mental health services.

Among the research lines, an extensive search for potential biomarkers to improve clinical diagnostic categorization has been performed. In this sense, language biomarkers offer a window to

understand the thinking in SZ research3,4. In general, individuals with SZ have impaired communicative competencies in fluency, verbal productivity, and speech coherence5,6. However, these

studies have been performed mainly in English-speaking subjects, and they have used different methodologies to assess language competencies, targeting a wide range of language aspects. In

this context, recent authors have begun to explore automated English language assessment in communication tasks, which allows the classification of healthy controls (HC) vs individuals with

SZ7. However, the use of such a tool remains in the pilot stage8,9. The main reasons provided are the need to better understand language assessment methodologies as well as when and why

automated language analysis fails. Therefore, three actions could point towards breaking through the pilot stage of computational tools for schizophrenia language analysis: a better

understanding of cross-language variations, dissecting multiple levels of discriminative and predictive language feature capabilities, and focusing on clinically relevant tasks. Given the

reported potential of language biomarkers obtained from clinical interviews of people with SZ and considering our pool of unstructured psychiatric interviews in psychotic subjects, we chose

three aspects of language according to this setup to differentiate between HC, first-episode psychosis subjects (FEP), and chronic SZ: fluency, verbal productivity, and coherence. VERBAL

FLUENCY Verbal fluency (VF) is a complex dimension of communication. Crystal and Davy10 point out that VF is synonymous with discursive continuity and includes several elements that are part

of this continuous discourse, in particular, pauses and hesitations. Noncommunicative pauses are usually recognized as part of formal thought disorders (FTD) in the mental status

examination. Crockford and Lesser11 have suggested a relationship between neurocognitive impairment and the appearance of pauses (≥2 s) in aphasia. Interestingly, phonological studies of

pauses in English-speaking SZ subjects have shown similar results12. Figueroa and Martínez13 have also described nonfunctional pauses in Spanish-speaking people with SZ, specifically

reporting a longer duration of pauses in FEP subjects. So, the speech of individuals with SZ is interrupted due to frequent and more prolonged pauses with the wrong timing and correlated

with negative symptoms14. In this context, it is not surprising that automatic pause assessment has also been shown to classify English speakers in HC vs SZ groups, but it is still

constrained by the English language15. More recently, Stanislawski et al.16 studied aberrant pauses in clinical high risk (CHR). Another element of VF is word production and utterances per

time as proposed by Clemmer17, who studied their patterns in SZ. VERBAL PRODUCTIVITY Verbal productivity (VP) is the ability to utter a number of words and sentences, such as the number of

total words and different words per sentence, average word length, and determiner or pronoun count. In SZ, a low VP, so-called _poverty of speech_, is considered one of the inherent language

characteristics in the linguistic profile of SZ patients18. In fact, differentiation between HC vs SZ patients19 and those affected by antipsychotics20 has been demonstrated. On the other

hand, some VP measurements such as the number of words and different words, either in interview transcripts or written narratives21,22,23,24, differentiate subjects at CHR.

Finally, automated VP analysis techniques are also being used as predictors in subjects at CHR showing that pronouns and deictics work as predictive markers of SZ, at least for English

speakers22, and also to explain cognitive deficit variance25. SEMANTIC COHERENCE Semantic coherence (SC) consists of the logical organization of meaning in discourse through interrelated

linguistic structures. For example, in interviews with people with SZ, conversation topics can abruptly change. Furthermore, in SZ patients, erroneous and lax use of words or

expressions affects concordance, referentiality, and therefore, speech comprehension21,22,26,27. Moreover, lax speech requires the listener to make an extra effort to understand what the

affected person said. Manual linguistic approaches have been proposed to identify SC, for instance, identifying each sentence’s role in the speech18,28,29 and computing indexes such as the

Communication Disturbance Index30. The pioneering work of Elvevåg et al.31,32 proposed automated incoherence measurement. Corcoran et al.22 proposed the use of latent semantic analysis (LSA)

combined with VP measurements to predict psychosis in CHR populations. Other related work23 deals with referential cohesion and its relation to semantic coherence. Since it accounts for the

semantic relations that maintain the continuity of discourse, referential coherence is a deeper level of spoken or written semantic coherence, as proposed in systemic functional

linguistics33. LANGUAGE ANALYSIS IN NON-ENGLISH-SPEAKING GROUPS In a multilingual context, there are several studies related to schizophrenia in other languages besides English. In Spanish,

our group has reported a longer pause duration in the FEP group13 and a positive correlation with negative symptoms14, the identification of 24 hierarchical candidate language features to

automatize34, and the loss of integrity and coherence in FEP and SZ subjects27. In Italian, Frau et al.35 proposed a semiautomated clustering analysis of speech and its correlation with the

speech of SZ patients. The novelty of this work is that it sheds light on the variations of language within schizophrenia groups such as SZ, eventually as a way to measure treatment

effectiveness. In Dutch, Wouts et al.36 proposed the use of a deep-learning transformer model to capture long-distance language relations. The effectiveness of the method is shown for a

3-class classification problem: control, depressed, and psychotic subjects. In Portuguese, Mota et al.37 proposed a computational assessment using graph analysis of syntactic coherence for

specific tasks (e.g., memory reports of a dream and negative image) and reported that it provides accurate quantification of speech characteristics and a correlation with clinical symptoms.

The work by Mota et al.38 is applied to distinguish HC, FEP, and SZ and to do a longitudinal analysis of FEP’s diagnosis. There are multiple reports of language biomarkers with the clinical

potential for analyzing SZ communication skills. However, there are not many studies of SZ onset prediction based on the analysis of languages other than English. In this study,

we propose that language biomarker analysis of VF, VP, and SC can be automatized even in unstructured ecological Spanish-speaking interviews. More specifically, the first goal of this study

is to use language to automatically distinguish between healthy controls, first-episode psychosis patients and schizophrenic subjects, and our second goal is to predict which FEP patients

convert or do not convert to SZ. In order to achieve these aims, we will evaluate 30 automated linguistic features in a sample of Spanish-speaking HC, FEP, and SZ individuals, and then we

will measure their stability and their diagnostic and prognostic capacity in SZ. In addition, we assess the relative contribution of clinical, sociodemographic, and linguistic information for

classification purposes. RESULTS One hundred and thirty-three interviews (HC = 49; FEP = 40; and chronic SZ = 44) were recorded and manually transcribed for further automated analysis. The

overall data collection process is shown in Fig. 1. HCs were exclusively Spanish-speaking subjects from Chile, without self-reported psychiatric disorders or substance abuse. SZ diagnosis

was confirmed by a team of three adult psychiatrists, who used the DSM-IV structured clinical interview39. PANSS positive and negative symptom subscales were used for measuring symptom

severity of FEP and SZ40. FEP was defined as up to two years after presenting their first psychotic episode. At the end of follow-up, 28 FEP subjects had a confirmed SZ diagnosis (converted to SZ,

C-SZ, see Table 2), and 12 transitioned to other nonschizophrenic psychoses (50% transitioned to mood disorders). The full set of 30 language features presented in this study was applied to

HC, FEP, and SZ interviews (see Fig. 2 and details in Supplementary Tables S2–S4). When taking a closer look at the information contributed by each feature, it can be seen that from the 30

evaluated features, 9 clusters of at least two correlated variables (Pearson coefficient) were detected, which provide similar information, as shown in Supplementary Fig. S2. Moreover, sets

of correlated variables could be observed; some of them are expected, such as TTR500 grouped with TTR1000 (cluster G in Supplementary Fig. S1) as they represent similar information

(type-token ratio) at different text spans. Interestingly, clusters B and C indicate a correlation between word-level features (word length) and sentence features (count of

questions–answers). We also looked for associations between language features and symptoms. In the FEP group, two correlations were statistically significant (Pearson, _P_ < 0.05):

possessive pronouns (_r_ = 0.38; _P_ = 0.0153), and min cos similarity six levels (_r_ = 0.33; _P_ = 0.0427). In the SZ group, five measurements were statistically significant: demonstrative

and relative pronouns (_r_ = −0.49; _P_ = 0.007 and _r_ = −0.30; _P_ = 0.0455, respectively), question–answer pairs per time (_r_ = −0.40; _P_ = 0.0065), different word per time (_r_ =

0.30; _P_ = 0.0464), and TTR500 (_r_ = 0.32; _P_ = 0.0343). Furthermore, in the SZ group pauses were near significant (_r_ = −0.29; _P_ = 0.0503). Multiple testing Bonferroni correction was

applied to the above-mentioned correlations (_k_ = 30), even though many features are correlated, and only negative PANSS and demonstrative pronoun correlations hold. CROSS-SECTIONAL ANALYSIS

The first goal of this study was to automatically distinguish between subject groups (HC, FEP, SZ) and rank the most informative linguistic variables. A variable importance list was compiled

using an initial random forest classifier to differentiate between HC, FEP, and SZ subjects, selecting the top 10 most relevant, as shown in Fig. 3B. Using the top ten ranked variables, the

accuracies obtained in differentiating between HC and patient groups were 80.97% (HC vs SZ), 85.93% (HC vs FEP + SZ), and 91.11% (HC vs FEP) using a random forest classifier (Fig. 3A).
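The ranking-then-classification procedure described above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the number of trees, the five-fold cross-validation, and the synthetic data standing in for the 30 language features are all hypothetical choices made for the example.

```python
# Illustrative sketch only: synthetic data, hypothetical names and settings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def rank_features(X, y, n_top=10, seed=0):
    """Return indices of the n_top features by random forest importance."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]  # most important first
    return order[:n_top]


def top_feature_accuracy(X, y, top_idx, folds=5, seed=0):
    """Mean cross-validated accuracy of a forest restricted to top_idx."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    scores = cross_val_score(forest, X[:, top_idx], y, cv=folds, scoring="accuracy")
    return scores.mean()


# Synthetic stand-in for an HC (label 0) vs SZ (label 1) feature matrix.
rng = np.random.default_rng(0)
y = np.array([0] * 49 + [1] * 44)
X = rng.normal(size=(y.size, 30))
X[:, :3] += 1.5 * y[:, None]  # assumption: only the first three features carry signal

top_idx = rank_features(X, y)
accuracy = top_feature_accuracy(X, y, top_idx)
print("top features:", top_idx.tolist())
print("cross-validated accuracy:", round(float(accuracy), 3))
```

In this sketch the ranking is computed on the full sample before cross-validation, which can inflate the accuracy estimate; repeating the ranking inside each training fold is the more conservative design.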

LONGITUDINAL ANALYSIS The second goal of this study was to predict which FEP patients convert (C-SZ) or do not convert to SZ (NC-SZ). Our first analysis was similar (correlation analysis is

reported in Supplementary Fig. 2) to that of the cross-sectional study: only language variables were used; later, we added clinical (PANSS, duration of disorder) and demographic variables

(gender, age, education, first-degree relative with psychotic disorder). Then a new list of top ten features was computed (Fig. 4B). In this ranking, PANSS total score ranked fourth, and all

the remaining features were language-related. Compared with the cross-sectional analysis, we observed similar informative features in both scenarios, such as cosine similarity minimum, mean

TTR500 and TTR 750, and interrogative/possessive determiners (compare Figs. 3B and 4B). To evaluate FEP conversion to SZ, we measured accuracy. Using only patient demographic information,

results were poor (43.33%), but they improved by using PANSS information (65.83%). PANSS information allowed a 67.5% prediction accuracy. Interestingly, language-only provided 75.83%

accuracy. When all information was combined and the top ten features were selected, 77.5% accuracy was achieved to predict if an FEP patient would have a confirmed SZ diagnosis, as shown in

Fig. 4A. A visual report of all 40 FEP patients is shown in Fig. 5A, where the response of all classifiers for the reported feature set selected is displayed. As shown, the demographic

information-based classifier overestimated SZ conversion (second row, mainly red). When more language information was included, the classification improved (match of green and red colors

with reference). PANSS and language-based classifiers failed to predict six NC-SZ patients’ conversion (subjects 5, 7, 8, 9, 10, 12), which were mainly (5 out of 6) affective disorders. In

addition, we compared how much each feature category contributes to FEP diagnosis prediction (Supplementary Fig. S3). FEP diagnosis accuracy ranged from 56% (fluency) and 64% (verbal

productivity) to 77% (semantic coherence). DISCUSSION This study expands language biomarkers in SZ and their automated computation, considering non-English speakers and the biomarkers’ overall

relation with SZ groups. LANGUAGE MARKERS In terms of VF, we found that four out of four markers were statistically different between groups (_P_ < 0.001). In terms of pauses, it has

already been shown that these markers can identify English-speaking HC vs SZ patients15, and here we confirmed that the same occurs in Spanish-speaking subjects, even in the case of the FEP

group. The total/unique words or total sentences per time also showed differences, which to the best of our knowledge, has not been reported in the literature to date. Moreover, as shown in

Supplementary Fig. S1, these features are correlated with productivity markers such as word total mean per answer, giving opportunities for alternative measuring approaches. Regarding

productivity markers, we confirmed that raw volume (total unique words or per answer) or normalized volume (type-token ratio or TTR) could distinguish groups in Spanish, just like in

English20,21. We also suggest a new productivity marker: mean word length, which can also identify groups. This measurement illustrates the speaker’s greater or lesser linguistic complexity,

considering that the frequency of appearance of words in Spanish is concentrated in words composed of one and two syllables (RAE-Corpus CREA) and is calculated by the number of syllables

per word. In the case of syntactic markers, such as the determiners and the pronoun counts, we found that specific pronouns and determiners were different between study groups (see

Supplementary Table 3). Previous studies in English22 have used syntactic markers such as possessive and interrogative pronouns, reporting a decrease in possessive pronouns in SZ patients.

Interestingly, we observed that indefinite pronouns were significantly different (_P_ < 0.001), while personal and interrogative pronouns were close to significantly different between

groups (_P_ < 0.01), as well as indefinite and demonstrative determiners (_P_ < 0.01), which may all be related to reduction41,42. Referential coherence accounts for the

functional architecture of speech, and it is known to be altered in individuals with SZ; thus, syntactic markers are a direct and straightforward way to measure this coherence.
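As a rough illustration of the productivity markers discussed above, the following Python sketch computes a type-token ratio (TTR) over fixed text spans and a per-answer pronoun rate. It is not the authors' implementation: the function names, the non-overlapping windowing, the averaging across windows, and the tiny possessive-pronoun lexicon are assumptions made for the example, and a real pipeline would likely rely on a part-of-speech tagger rather than a word list.

```python
# Illustrative sketch only: names, windowing scheme, and lexicon are assumptions.

# Hypothetical mini-lexicon of Spanish possessive pronouns (not exhaustive).
POSSESSIVE_PRONOUNS = {
    "mío", "mía", "míos", "mías",
    "tuyo", "tuya", "tuyos", "tuyas",
    "suyo", "suya", "suyos", "suyas",
    "nuestro", "nuestra", "nuestros", "nuestras",
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:¿?¡!()\"'") for w in text.lower().split())
    return [w for w in words if w]


def type_token_ratio(tokens):
    """Distinct words divided by total words (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_windowed_ttr(tokens, span):
    """Mean TTR over consecutive, non-overlapping windows of `span` tokens.

    A trailing partial window is dropped; a text shorter than `span`
    falls back to the whole-text TTR.
    """
    windows = [tokens[i:i + span] for i in range(0, len(tokens) - span + 1, span)]
    if not windows:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(w) for w in windows) / len(windows)


def lexicon_rates(answers, lexicon):
    """Return (total matches per answer, distinct matched words per answer)."""
    if not answers:
        return 0.0, 0.0
    matches = [w for a in answers for w in tokenize(a) if w in lexicon]
    return len(matches) / len(answers), len(set(matches)) / len(answers)


answers = ["El libro es mío.", "La casa es nuestra y el perro es mío."]
tokens = [w for a in answers for w in tokenize(a)]
whole_ttr = type_token_ratio(tokens)
span_ttr = mean_windowed_ttr(tokens, span=4)
total_rate, distinct_rate = lexicon_rates(answers, POSSESSIVE_PRONOUNS)
print(whole_ttr, span_ttr, total_rate, distinct_rate)
```

Because TTR tends to fall as a text grows longer, measuring it at fixed spans keeps interviews of different lengths comparable, which is presumably the motivation for span-based features such as TTR500.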

Verbal coherence markers have been proposed before in English22. We encoded sentences with a different method (word2vec) in our Spanish-speaker database; nonetheless, computing coherence

with a span of five or six words can still significantly identify subject groups. We evaluated minimum coherence and mean coherence, and mean values showed more discriminating power, as

shown by the _P_ value ranking. Concerning the associations of negative symptoms and language features, in SZ we found a statistically significant VP (TTR500) and VF (question–answer pairs

per time, different words per time, and weakly with pauses) as reported by Frau et al.35 and Stanislawski et al.16. Interestingly, in the FEP group, pronouns and semantic coherence (min cos

similarity 6 levels) were associated with negative symptoms. Taking into account that PANSS’s negative score was higher in the SZ group (Table 1), we could interpret that previously

reported correlations for the poverty of speech and pauses are found with more severe negative symptoms, but lower negative symptom correlations are found only at semantic coherence and

specific verbal productivity measurements (possessive pronouns). In the literature, it is reported that semantic alterations are associated with a decrease in the functional connectivity of

gamma frequencies, and this alteration is correlated with psychotic symptoms in schizophrenia43. Thus, patterns of semantic alterations and their association with both positive and negative

symptoms could shed light on some general mechanisms of functional connectivity alteration. As shown in Table 1, age among study groups is significantly different, and there are reports of

differences between adolescents and older adults (+60) in VF and VP features44,45. However, in our study, subjects of age 60 years or older were a very small percentage: 8.1% (4/49) in HC,

0% (0/40) in FEP, and 0% (0/44) in SZ. To further investigate, we compared, in the case of total words per time (VF) and TTR250-500-750-1000 (VP), two linear models with and without the age,

and there was no significant difference between models (ANOVA, significance threshold _P_ < 0.05). CROSS-SECTIONAL ANALYSIS Automatic classification of healthy controls vs study participants with schizophrenia

shown in this work has up to 80% accuracy using only language-related features, and HC vs FEP has 91.11% accuracy. Thus, we quantitatively demonstrate that distinguishing between HC and SZ

is more complex than distinguishing between HC vs FEP, which can be expected since SZ patients are stabilized under regular medication. Literature reports accuracies from 72% (in similar

conditions) to 100% for CHR populations22. Here, we showed that language analysis has the potential to be used as a psychiatric diagnostic screening tool. In this work, we highlight that

many kinds of language biomarkers can solve this problem. Consequently, clinical applications should privilege language independence and ease of implementation. In that regard, transcription

should be avoided, as language processing is community dependent. For instance, in a Spanish text (from Chilean subjects), we had to create new stop words to perform analyses that are not of

everyday use in other Spanish-speaking countries such as Spain or Mexico. LONGITUDINAL ANALYSIS To our knowledge, there are no Spanish-speaking studies that predict schizophrenia from the

first episode of psychosis. Interestingly, when demographic, PANSS, and language features are combined, higher accuracy is achieved (77%), which may be an indication that these are measuring

different aspects of SZ. Furthermore, language biomarkers provide more information than demographic information (75% vs 43% accuracy), and language biomarkers were better than a highly

specialized PANSS score (75% vs 67% accuracy). We took a closer look at the most relevant features to predict SZ onset in FEP subjects (word length, pauses, coherence, pronoun use).

According to Zipf’s law46, in all languages, there is a close relationship between the length of a word and the frequency of occurrence, so longer words are less frequent. According to

the RAE (Royal Spanish Academy), in Spanish, there is a high frequency of two-syllable words. We observed a higher occurrence of longer words in participants with SZ, which in general are

infrequent words, supporting the findings of several studies47,48,49. On the other hand, the use of short words in interaction with the occurrence of aberrant pauses generates a fragmented

speech that is not observed in controls. Likewise, we observed differences in the use of personal and possessive pronouns; it is possible that these findings are clues to referential

anomalies in the discourse. We can interpret that TTR, word length, pauses, and determiners are related dysfunctional characteristics of SZ that reduce communication effectiveness, in

contrast with HC, and they can contribute to identifying FTD. Overall, our proposed prediction system showed that affective disorders were the most difficult differential diagnosis of SZ, as

more prediction errors are accounted for by these subjects. It has been shown that pathologies such as affective disorders show similar formal thought disorders as SZ at an early stage50;

hence, we can interpret our results as detecting thought disorders that strongly relate to psychosis. Interestingly, our work shows that VF, VP, and SC can predict diagnosis in the case of

FEP, as well as a different language aspect such as syntactic coreferences, as proposed by Mota et al.38 in a task-specific protocol. A promising perspective is to explore if taken together

they can identify SZ and other psychosis-related pathologies more often and/or more accurately at the same time. Neuroimaging biomarkers have also been proposed using structural MRI, EEG, and PET. Kambeitz

et al.51 performed a meta-analysis evaluating studies that combined neuroimaging techniques and found an overall sensitivity of 80% (CI 77–84%) after evaluating 38 studies. Similarly, Shim

et al.52 proposed the use of automated EEG analysis to classify between SZ and control subjects, reaching a maximum accuracy of 88.24%. More recently, Zeng et al.53 have proposed a

deep-learning approach based on MRI, achieving 85% accuracy. However, MRI, PET, and EEG are difficult to apply in clinical settings due to their access, cost, and technical difficulties in

low-income countries. In our opinion, language analysis represents an interesting approach that, despite having a lower prediction accuracy, is simpler to apply in medical settings. We

summarize our contributions as (1) a better understanding of cross-language variations. English and Spanish have multiple differences (e.g., longer words are more frequent in Spanish than in

English, Zipf’s law). Thus, it is not evident a priori that the same discriminative or predictive features and methods in English will work in Spanish. One of our results is that most

discriminative and predictive language features hold in Spanish for group discrimination, contributing to the understanding of cross-language variations. Furthermore, we can predict

diagnosis in FEP, for a small group of subjects. (2) Dissecting multiple levels of discriminative and predictive language feature capabilities. To this aim, we compared how much each feature

category contributes to the classification of three groups (HC, FEP, SZ) and FEP diagnosis prediction (Supplementary Fig. S3). Interestingly, group classification and FEP diagnosis accuracy

are higher for semantic coherence. We argue that more operational tasks such as VF and VP can be impaired differently among subjects. Still, their speech effectiveness is finally affected,

and this is more related to semantic coherence. This hypothesis is consistent with our results that rank the semantic coherence dimension as more informative than VF or VP. In this sense,

our findings support the proposals of Hinzen and Roselló41, who hypothesize that alterations in linguistic cognition may cause alterations in thinking in schizophrenia. An example of these

alterations in linguistic cognition is the loss of meta-reflexive abilities derived from higher thought processes, implying a significant impairment of semantic coherence that integrates the

selective mechanisms guided by linguistic cognition. (3) Focusing on clinically relevant tasks. Previous works24,38 use psychiatric interviews, where participants are asked to perform a

communicative task such as narrating a dream or anecdote. This interviewer-modeled discourse elicitation provides a different communicative framework than the clinical phenomenological

interview we used for this study. In the phenomenological interview, discourse elicitation is not determined by a task but follows a natural course of interaction. LIMITATIONS This study

also has some limitations. First, HCs were exclusively Chilean Spanish speakers, and comorbidities like drug abuse were self-reported. Second, healthy and psychotic recruited subjects had

different demographic variables, which could be a potential bias. Third, there was no register of refusals at recruitment. Fourth, the chosen predictive method (random forest) has a

relatively simple and broad interpretation. Finally, we used limited samples, which may lead to overfitting, and the longitudinal analysis classes were unbalanced. CONCLUSION In this work,

we determined which information is language-independent and concluded that linguistic phenomena are broadly invariant, with a few exceptions that must be carefully considered, such as

syntactic features (determiners, pronouns). In addition, we performed automated language analysis and combined it with clinical information using machine learning techniques; these

procedures have achieved classification results comparable to neuroimaging or EEG methodologies, but they have the significant advantage of being easy to apply in a clinical context. To our

knowledge, this is the first time that automated language analysis, using unstructured clinical interviews with open-ended questions, has been used in non-English-speaking countries to

classify and predict SZ. METHODS PARTICIPANTS The HC interviews were selected from the ESECH’s study54, which consists of the construction of a corpus of more than 300 interviews with

neurotypical native speakers of Chilean Spanish. The duration of HC interviews ranged from 32 to 83 min (53.5 ± 10.2 min) with open-ended questions. The data were organized according to the

sociodemographic characteristics of the speakers, selecting subjects with ages and education levels similar to those in the chronic SZ group (Table 1). FEP and SZ subjects were recruited

from Barros Luco Trudeau Clinical Hospital (CABL). Psychiatric interviews ranged from 5 to 102 min (mean 28.6 ± 16.5 min), depending on the patient’s. All the interviews were conducted with

clinically compensated patients. Among the FEP group, three subjects (7.5%), and among the SZ group, six subjects (13.6%) were hospitalized at the time of the study. Thus, 89.3% (75/84) were

receiving outpatient treatment in a mental health service. Substance use was self-reported, and within the FEP group, 20% of subjects (8/40) reported cannabis or alcohol use (3 females, 5

males). In the FEP group, 7.5% of subjects initiated FEP due to substance use (3/40). The clinical variables used for further analysis were age (years), education (years), disease duration

(years), and clinical history of psychiatric disorder in first-degree relatives (yes or no) as shown in Tables 1 and 2. Each patient read and signed an informed written consent form, and the

protocol was authorized by the “Comité ética científico del Complejo Asistencial Barros Luco” local committee (ID 155). See Supplementary Methods for more information. SPEECH PROCESSING The

pauses were determined when the temporal separation between two consecutive speech segments was longer than 2 s. Since audio signals had different recording qualities, a noise reduction

algorithm was used before pause detection (see details in Supplementary Methods). For text processing, all punctuation marks, phonetic transcription, expression sounds, onomatopoeias, and

stop words were eliminated, while words were lemmatized. Stop words were extended with 73 typical Chilean expressions that fit the definition of the stop word (see details in Supplementary

Methods). To improve the performance of classification methods55,56, words were encoded as high-dimensional vectors before being passed to the classifier, consistent with the notion that the meaning of a word

depends on the context of neighboring words. To this aim, we used the word2vec algorithm available as an Open Source software package for Python57, building a word model specifically for

Chilean Spanish (see Supplementary Methods). LINGUISTIC FEATURES AND SPEECH ANALYSIS An individual’s verbal fluency was assessed using the number of pauses longer than two seconds at any

time during the interview, as shown in Fig. 2A. As an additional measurement of verbal fluency, we propose the measurement of the number of paired questions–answers divided by the time or

duration of the interview, the number of total words, and different words by the hour. Supplementary Table S2 shows the list of verbal fluency features. Twenty measurements of verbal

productivity were analyzed through four approaches: lexical volume (number of total words and different words per answer), type-token ratio (TTR), the average length of words, and count of

determiners or pronouns in two variants: total number of words and non-repeated words, both normalized by the number of responses during the interview, and the average per response (see

Supplementary Methods). A total of six semantic measurements were performed. The semantic lexical coherence between sentences (or cosine similarity) was defined from the sum of each of the

semantic vectors of the words that compose them between question and answer, and every 5 or 6 words (see Supplementary Methods). STATISTICAL METHODS, VARIABLE SELECTION, AND CLASSIFICATION

The Shapiro–Wilk test was used to check whether the data were normally distributed. In addition, for each attribute, statistical tests were performed to assess differences between groups. A Mann–Whitney _U_ test was used to compare pairs of groups (HC vs FEP, HC vs SZ, and FEP vs SZ), and a Kruskal–Wallis test was used to compare the three groups. We used correlation and random forest analyses for variable ranking58 (see details in Supplementary Methods).

DATA AVAILABILITY

The datasets used in this study are not publicly available due to participant privacy and security concerns. Researchers may contact the corresponding author for access.

CODE AVAILABILITY

All analyses in this work were performed in Python, and the code is publicly available at https://github.com/busmangit/nlpezq. The code is organized in Jupyter notebooks and is commented. Please cite this article if you use the code in whole or in part.

REFERENCES

* Tandon, R., Nasrallah, H. A. & Keshavan, M. S. Schizophrenia, ‘Just the facts’ 5. Treatment and prevention. Past, present, and future. _Schizophr. Res._ 122, 1–23 (2010).
* Gaspar, P. A. et al. Early psychosis detection program in Chile: a first step for the South American challenge in psychosis research. _Early Interv. Psychiatry_ 13, 328–334 (2019).
* Mckenna, P. & Oh, T. M. _Schizophrenic Speech: Making Sense of Bathroots and Ponds that Fall in Doorways_ (Cambridge University Press, 2005).

* Kuperberg, G. R. Language in schizophrenia Part 1: an introduction. _Lang. Linguist. Compass_ 4, 576–589 (2010).
* Pawełczyk, A., Kotlicka-Antczak, M., Łojek, E., Ruszpel, A. & Pawełczyk, T. Schizophrenia patients have higher-order language and extralinguistic impairments. _Schizophr. Res._ 192, 274–280 (2018).
* Covington, M. A. et al. Schizophrenia and the structure of language: the linguist’s view. _Schizophr. Res._ 77, 85–98 (2005).
* Cecchi, G. & Corcoran, C. O2.3. Automated analysis of recent-onset and prodromal schizophrenia. _Schizophr. Bull._ 44, S76–S76 (2018).
* Hitczenko, K., Mittal, V. A. & Goldrick, M. Understanding language abnormalities and associated clinical markers in psychosis: the promise of computational methods. _Schizophr. Bull._ 47, 344–362 (2021).
* Foltz, P. W., Rosenstein, M. & Elvevåg, B. Detecting clinically significant events through automated language analysis: Quo imus? _npj Schizophr._ 2, 15054 (2016).
* Crystal, D. & Davy, D. _Advanced Conversational English_ (Longman Publishing Group, 1975).
* Crockford, C. & Lesser, R. Assessing functional communication in aphasia: clinical utility and time demands of three methods. _Eur. J. Disord. Commun._ 29, 165–182 (1994).
* Alpert, M., Kotsaftis, A. & Pouget, E. R. At issue: speech fluency and schizophrenic negative signs. _Schizophr. Bull._ 23, 171–177 (1997).
* Barra, A. I. F. & Herrera, C. J. M. Las pausas en personas con diagnóstico de esquizofrenia de primer episodio. _Pragmalinguistica_ 26, 88–108 (2018).
* León, M. _Relación entre nivel plasmático de BDNF y las pausas en el discurso en Esquizofrenia_ (Universidad de Chile, 2020).
* Cohen, A. S., Mitchell, K. R., Docherty, N. M. & Horan, W. P. Vocal expression in schizophrenia: less than meets the ear. _J. Abnorm. Psychol._ 125, 299–309 (2016).
* Stanislawski, E. R. et al. Negative symptoms and speech pauses in youths at clinical high risk for psychosis. _npj Schizophr._ 7, 3 (2021).
* Clemmer, E. J. Psycholinguistic aspects of pauses and temporal patterns in schizophrenic speech. _J. Psycholinguist. Res._ 9, 161–185 (1980).
* Andreasen, N. C. Scale for the assessment of thought, language, and communication (TLC). _Schizophr. Bull._ 12, 473–482 (1986).
* Sabbe, B., Beheydt, L., De Picker, L., Goetschalckx, J. & Daelemans, W. Computational language analysis for assessment of schizophrenia. In _2017 Annual International Conference on Cognitive & Behavioral Psychology_. https://doi.org/10.5176/2251-1865_CBP17.37 (GSTF, 2017).
* de Boer, J. N., Voppel, A. E., Brederoo, S. G., Wijnen, F. N. K. & Sommer, I. E. C. Language disturbances in schizophrenia: the relation with antipsychotic medication. _npj Schizophr._ 6, 24 (2020).
* Rezaii, N., Walker, E. & Wolff, P. A machine learning approach to predicting psychosis using semantic density and latent content analysis. _npj Schizophr._ 5, 9 (2019).
* Corcoran, C. M. et al. Prediction of psychosis across protocols and risk cohorts using automated language analysis. _World Psychiatry_ 17, 67–75 (2018).
* Gupta, T., Hespos, S. J., Horton, W. S. & Mittal, V. A. Automated analysis of written narratives reveals abnormalities in referential cohesion in youth at ultra high risk for psychosis. _Schizophr. Res._ 192, 82–88 (2018).
* Bedi, G. et al. Automated analysis of free speech predicts psychosis onset in high-risk youths. _npj Schizophr._ 1, 15030 (2015).

* Minor, K. S., Willits, J. A., Marggraf, M. P., Jones, M. N. & Lysaker, P. H. Measuring disorganized speech in schizophrenia: automated analysis explains variance in cognitive deficits beyond clinician-rated scales. _Psychol. Med._ 49, 440–448 (2019).
* Minor, K. S. et al. Conceptual disorganization weakens links in cognitive pathways: disentangling neurocognition, social cognition, and metacognition in schizophrenia. _Schizophr. Res._ 169, 153–158 (2015).
* Figueroa, A., Durán, E. & Oyarzún, S. La gestión temática como marcador de déficit lingüístico primario en personas con diagnóstico de primer episodio de Esquizofrenia: un estudio en una muestra chilena. _RLA. Revista de lingüística teórica y aplicada_ 55, 117–147 (2017).
* Docherty, N. M., Gordinier, S. W., Hall, M. J. & Cutting, L. P. Communication disturbances in relatives beyond the age of risk for schizophrenia and their associations with symptoms in patients. _Schizophr. Bull._ 25, 851–862 (1999).
* Gordinier, S. W. & Docherty, N. M. Factor analysis of the communication disturbances index. _Psychiatry Res._ 101, 55–62 (2001).
* Docherty, N. M., DeRosa, M. & Andreasen, N. C. Communication disturbances index. _PsycTESTS Dataset_ https://doi.org/10.1037/t39394-000 (2015).
* Elvevåg, B., Foltz, P. W., Weinberger, D. R. & Goldberg, T. E. Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia. _Schizophr. Res._ 93, 304–316 (2007).
* Elvevåg, B., Foltz, P. W., Rosenstein, M. & Delisi, L. E. An automated method to analyze language use in patients with schizophrenia and their first-degree relatives. _J. Neurolinguistics_ 23, 270–284 (2010).
* Halliday, M. A. K. & Hasan, R. _Language, Context, and Text: Aspects of Language in a Social-semiotic Perspective_ (Deakin University Press, 1985).
* Figueroa, A. _Análisis pragmalingüístico de los marcadores de coherencia en el discurso de sujetos con esquizofrenia crónica y de primer episodio_ (Universidad de Valladolid, 2015).
* Frau, F. et al. Can language detect different clinical profiles in schizophrenia? A semi-automated analysis on Italian-speaking patients. In _Architectures and Mechanisms for Language Processing_. https://amlap2021.github.io/program/174.pdf (AMLaP, 2021).
* Wouts, J. et al. belabBERT: a Dutch RoBERTa-based language model applied to psychiatric classification. Preprint at https://arxiv.org/abs/2106.01091 (2021).
* Mota, N. B. et al. Speech graphs provide a quantitative measure of thought disorder in psychosis. _PLoS ONE_ 7, e34928 (2012).
* Mota, N. B., Copelli, M. & Ribeiro, S. Thought disorder measured as random speech structure classifies negative symptoms and schizophrenia diagnosis 6 months in advance. _npj Schizophr._ 3, 18 (2017).

* Kay, S. R. et al. SCID-PANSS: two-tier diagnostic system for psychotic disorders. _Compr. Psychiatry_ 32, 355–361 (1991).
* Kay, S. R. _Positive and negative syndromes in schizophrenia: assessment and research (No. 5)_ (Brunner/Mazel, 1991).
* Hinzen, W. & Rosselló, J. The linguistics of schizophrenia: thought disturbance as language pathology across positive symptoms. _Front. Psychol._ 6, 971 (2015).
* Docherty, N., Schnur, M. & Harvey, P. D. Reference performance and positive and negative thought disorder: a follow-up study of manics and schizophrenics. _J. Abnorm. Psychol._ 97, 437–442 (1988).
* Spironelli, C. & Angrilli, A. Language-related gamma EEG frontal reduction is associated with positive symptoms in schizophrenia patients. _Schizophr. Res._ 165, 22–29 (2015).
* Kemper, S., Marquis, J. & Thompson, M. Longitudinal change in language production: effects of aging and dementia on grammatical complexity and propositional content. _Psychol. Aging_ 16, 600–614 (2001).
* Burke, D. M. & Shafto, M. A. Aging and language production. _Curr. Dir. Psychol. Sci._ 13, 21–24 (2004).
* Chao, Y. R. & Zipf, G. K. _Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology,_ Vol. 26 (Addison-Wesley, 1949).
* Chaika, E. & Lambe, R. The locus of dysfunction in schizophrenic speech. _Schizophr. Bull._ 11, 8–15 (1985).
* Chaika, E. _Linguistics, Pragmatics and Psychotherapy: A Guide for Therapists_ (John Wiley & Sons, 2008).
* Piro, S. _El lenguaje esquizofrénico_ (Fondo de Cultura Economica USA, 1987).
* Minor, K. S., Marggraf, M. P., Davis, B. J., Mehdiyoun, N. F. & Breier, A. Affective systems induce formal thought disorder in early-stage psychosis. _J. Abnorm. Psychol._ 125, 537–542 (2016).
* Kambeitz, J. et al. Detecting neuroimaging biomarkers for schizophrenia: a meta-analysis of multivariate pattern recognition studies. _Neuropsychopharmacology_ 40, 1742–1751 (2015).
* Shim, M., Hwang, H.-J., Kim, D.-W., Lee, S.-H. & Im, C.-H. Machine-learning-based diagnosis of schizophrenia using combined sensor-level and source-level EEG features. _Schizophr. Res._ 176, 314–319 (2016).
* Zeng, L.-L. et al. Multi-site diagnostic classification of schizophrenia using discriminant deep learning with functional connectivity MRI. _EBioMedicine_ 30, 74–85 (2018).
* San Martín Núñez, A. & Guerrero González, S. Estudio Sociolingüístico del Español de Chile (ESECH): recogida y estratificación del corpus de Santiago. _Bol. filol._ 50, 221–247 (2015).
* Zhang, D., Xu, H., Su, Z. & Xu, Y. Chinese comments sentiment classification based on word2vec and SVMperf. _Expert Syst. Appl._ 42, 1857–1863 (2015).

* Lilleberg, J., Zhu, Y. & Zhang, Y. Support vector machines and Word2vec for text classification with semantic features. In _2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)_ (IEEE, 2015).
* Rehurek, R. & Sojka, P. Software framework for topic modelling with large corpora. In _Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks_. http://www.lrec-conf.org/proceedings/lrec2010/workshops/W10.pdf (ELRA, 2010).
* Genuer, R., Poggi, J.-M. & Tuleau-Malot, C. Variable selection using random forests. _Pattern Recognit. Lett._ 31, 2225–2236 (2010).

ACKNOWLEDGEMENTS

We thank Rolando Castillo for his critical review. Jim Hesson copyedited the manuscript (https://www.academicenglishsolutions.com/editing-service). This work was supported by the Millennium Science Initiative Program (grant numbers P09-015F, NCS17_035, ACE210007); Agencia Nacional de Investigación y Desarrollo Fondecyt program (grant number 11191122) to A.F., (grant numbers 1211988, 1190806, 1221696) to M.C., Fondequip program

(grant EQM210020), Fondef program (grant ID20I10371) to M.C., PIA program (grant ACT192015) to M.C.; Guillermo Puelma Foundation award to P.G.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Department of Psychiatry, Faculty of Medicine, Universidad de Chile, Santiago, Chile: Alicia Figueroa-Barra & Pablo A. Gaspar
* Biomedical Neuroscience Institute, Santiago, Chile: Alicia Figueroa-Barra, Mauricio Cerda & Pablo A. Gaspar
* Millennium Nucleus to Improve the Mental Health of Adolescents and Youths (IMHAY), Santiago, Chile: Alicia Figueroa-Barra & Pablo A. Gaspar
* Translational Psychiatry Laboratory Psiquislab, Faculty of Medicine, Universidad de Chile, Santiago, Chile: Alicia Figueroa-Barra & Pablo A. Gaspar
* Artificial Intelligence Development Department, BiosIntelligence, GrupoBios, Santiago, Chile: Daniel Del Aguila
* Integrative Biology Program, Institute of Biomedical Sciences, Faculty of Medicine, Universidad de Chile, Santiago, Chile: Mauricio Cerda, Manuel Durán & Camila Valderrama
* Center for Medical Informatics and Telemedicine, Faculty of Medicine, Universidad de Chile, Santiago, Chile: Mauricio Cerda, Manuel Durán & Camila Valderrama
* Department of Neuroscience, Faculty of Medicine, Universidad de Chile, Santiago, Chile: Pablo A. Gaspar
* Laboratory for System Dynamics & Signal Processing, Universidad Nacional de Rosario and CIFASIS, Santa Fe, Argentina: Lucas D. Terissi

CONTRIBUTIONS

D.D., M.C., L.T., M.D., and C.V. designed and performed the experiments, derived the models, and analyzed the data. A.F. was a key contributor to data collection. A.F., D.D., M.C., and P.G. wrote the manuscript.

CORRESPONDING AUTHOR

Correspondence to Alicia Figueroa-Barra.

ETHICS DECLARATIONS

COMPETING INTERESTS

The authors declare no competing interests.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE

Springer Nature remains neutral with regard to

jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY MATERIAL

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Figueroa-Barra, A., Del Aguila, D., Cerda, M. _et al._ Automatic language analysis identifies and predicts schizophrenia in first-episode of psychosis. _Schizophr_ 8, 53 (2022). https://doi.org/10.1038/s41537-022-00259-3

* Received: 18 July 2021
* Accepted: 18 April 2022
* Published: 01 June 2022
* DOI: https://doi.org/10.1038/s41537-022-00259-3