Oct 17, 2018 ● Cami Rosso
MIT Creates AI That Predicts Depression From Speech

Innovative neural network detects depression from conversation

Depression is one of the most common disorders worldwide, affecting more than 300 million people and contributing to nearly 800,000 suicides annually, according to World Health Organization figures from March 2018. Diagnosing depression can be a challenging, complex endeavor. According to the Mayo Clinic, symptoms of depression vary, and doctors may use a physical exam, lab tests, a psychiatric evaluation questionnaire, and criteria from the American Psychiatric Association’s DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) to determine a diagnosis of depression [1]. For a mental health professional, asking the right questions and interpreting the answers is a key part of the diagnosis. But what if a diagnosis could be reached through natural conversation, rather than requiring the context of specific questions and answers?

An innovative Massachusetts Institute of Technology (MIT) research team, consisting of Tuka Alhanai and James Glass at CSAIL (Computer Science and Artificial Intelligence Laboratory) and Mohammad Ghassemi at IMES (Institute for Medical Engineering and Science), discovered a way for AI to detect depression in individuals by identifying patterns in natural conversation [2].

The MIT researchers developed a neural-network model that could predict depression by identifying speech patterns in audio and text transcriptions of interviews. Using a data set of 142 recorded patient interviews, the team aimed to model sequences for depression detection. The researchers ran experiments in context-free modeling, weighted modeling, and sequence modeling [3].

First, the team sought to evaluate the prediction accuracy of audio and text features “when considered independently of the type of question asked, and time it was asked during the interview session” – in other words, “context-free” modeling. The team fed 279 audio and 100 text features into a logistic regression model with L1 regularization [4]. For the text features, the team harnessed Doc2Vec from the Python Gensim library, yielding “a total of 8,050 training examples, 272,418 words, and a vocabulary size of 7,411” [5]. For audio features, the team “extracted an initial set of 553 features representing each subject response” [6].
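
For readers who want to see what such a context-free pipeline might look like, here is a minimal Python sketch combining Gensim’s Doc2Vec with an L1-regularized logistic regression. The corpus, labels, and dimensions are illustrative placeholders, not the team’s actual data or code.

```python
# Minimal sketch of "context-free" depression prediction: Doc2Vec text
# embeddings fed to an L1-regularized logistic regression.
# The corpus, labels, and dimensions below are illustrative placeholders.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-tokenized subject responses and labels (1 = depressed).
responses = [["i", "have", "trouble", "sleeping", "most", "nights"],
             ["work", "has", "been", "going", "fine", "lately"],
             ["nothing", "really", "interests", "me", "anymore"],
             ["i", "spend", "weekends", "hiking", "with", "friends"]]
labels = [1, 0, 1, 0]

# Learn a fixed-length vector per response with Gensim's Doc2Vec,
# as the paper did for its text features.
tagged = [TaggedDocument(words=r, tags=[i]) for i, r in enumerate(responses)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)
X_text = np.array([d2v.dv[i] for i in range(len(responses))])

# L1 regularization drives uninformative feature weights to exactly zero,
# a common choice when features outnumber training examples.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X_text, labels)
print(clf.predict(X_text))
```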

In the second experiment, the team aimed to understand predictive performance “when conditioning on the type of question asked, and independent of the time it was asked during the interview session.” To achieve this, they created a weighted model similar to the context-free model, with a key differentiator: weights were assigned to the model based on the “predictive power of the question found in the training set.”
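
The paper does not spell out the weighting mechanics here, but one plausible reading is sketched below: train a classifier per interview question and weight each question’s vote by its accuracy on the training set. The data structures and the accuracy-as-weight rule are assumptions for illustration, not the paper’s exact formulation.

```python
# Illustrative sketch of question-conditioned weighting: one classifier per
# interview question, each weighted by how predictive that question proved
# on the training set. The weighting rule is an assumption, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_model(features_by_question, labels_by_question):
    """features_by_question: dict of question id -> (n_samples, n_features)."""
    models, weights = {}, {}
    for q, X in features_by_question.items():
        m = LogisticRegression(penalty="l1", solver="liblinear")
        m.fit(X, labels_by_question[q])
        models[q] = m
        # Training accuracy as a proxy for the question's predictive power.
        weights[q] = m.score(X, labels_by_question[q])
    return models, weights

def predict_weighted(models, weights, features_by_question):
    """Weighted average of per-question probabilities of the positive class."""
    total = norm = 0.0
    for q, X in features_by_question.items():
        total += weights[q] * models[q].predict_proba(X)[:, 1].mean()
        norm += weights[q]
    return total / norm  # values above 0.5 suggest depression
```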

For the third experiment, the team focused on “modeling temporal changes of the interview” and used a bi-directional Long Short-Term Memory (LSTM) neural network because it had “the additional advantage of modeling sequential data.”
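
As a concrete illustration of this sequence-modeling step, here is a minimal bidirectional LSTM in Keras; the layer sizes, sequence length, and feature dimension are placeholder assumptions, not the paper’s architecture.

```python
# Minimal bidirectional LSTM over an interview, modeled as a sequence of
# per-response feature vectors. Shapes and hyperparameters are illustrative.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

SEQ_LEN, FEAT_DIM = 30, 100  # e.g. up to 30 responses, 100 features each

model = Sequential([
    # Reads the interview both forward and backward, capturing how
    # speech patterns evolve across consecutive question-answer turns.
    Bidirectional(LSTM(64), input_shape=(SEQ_LEN, FEAT_DIM)),
    Dense(1, activation="sigmoid"),  # probability of depression
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data: 8 interviews, each a sequence of 30 feature vectors.
X = np.random.rand(8, SEQ_LEN, FEAT_DIM).astype("float32")
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=2, batch_size=4, verbose=0)
```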

Interestingly, the researchers discovered that the model needed over four times more data when using audio than when using text to predict depression: on average, 30 sequences of audio, compared with only seven sequences of question-and-answer text. The team observed that sequence modeling is more accurate for predicting depression, and that the multi-modal model combining text and audio performed best. Ironically, the nature of neural-network models obscures exactly which patterns they discover in the input data; this opacity stems from the inherent complexity of neural nets, with their intricate connections between nodes and vast number of parameters. Regardless, this MIT study represents an innovative step toward a new tool that could one day assist doctors and mental health professionals in tackling the complexities of diagnosing depression.
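
One simple way to realize such a multi-modal model is to concatenate the audio and text feature vectors for each turn before the recurrent layer; concatenation is a common fusion pattern and an assumption here, not necessarily the team’s exact design.

```python
# Sketch of early multi-modal fusion: concatenate per-turn audio and text
# features, then run a bidirectional LSTM over the fused sequence.
# Fusion by concatenation is an assumption for illustration.
from tensorflow.keras import Input, Model, layers

SEQ_LEN, AUDIO_DIM, TEXT_DIM = 30, 553, 100  # dims echo the feature counts above

audio_in = Input(shape=(SEQ_LEN, AUDIO_DIM))
text_in = Input(shape=(SEQ_LEN, TEXT_DIM))
fused = layers.Concatenate()([audio_in, text_in])  # 653 features per turn
h = layers.Bidirectional(layers.LSTM(64))(fused)
out = layers.Dense(1, activation="sigmoid")(h)

model = Model(inputs=[audio_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```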


This article originally appeared in Psychology Today

Article by:

Cami Rosso