Language Models in Mental Health Conversations – How Empathetic Are They Really?

December 3, 2024

Why We Should Care

Artificial intelligence and large language models (LLMs) are becoming increasingly intertwined with our daily interactions and lives. What began as relatively simple models for paraphrasing or text summarization has evolved into conversational tools capable of explaining unfamiliar concepts or offering advice for specific situations.

Instead of turning to a friend, a family member, or even a Google search for answers, it is increasingly convenient to go straight to a product like ChatGPT and get answers in a conversational format. With this shift in behavior and the growing role of AI as a trusted advisor, it becomes crucial that we evaluate not only the helpfulness of these models but also the emotional quality of their responses.

In this blog post, we will examine how empathetic LLMs really are. First, we will break down methods for analyzing the responses of generative models and propose a strategy for evaluating their empathetic quality. Then we will evaluate how empathetic LLaMA – Meta’s open-source LLM – actually is.

Evaluating Responses for Text and Language Models

Before we discuss evaluating empathy, let’s start with a simpler idea: how can we even evaluate the outputs and responses of generative models? What makes a “good” or “preferred” response? For classification tasks, the answer is straightforward. A model can output either the true class or the wrong class: either a pedestrian is in frame or they are not. There isn't much room for variability in evaluation.

The Challenge of Evaluating Generative Responses

With generative text models, evaluating responses presents a much more nuanced challenge. Whether you are a teacher grading the free-response section of an exam or a researcher analyzing the outputs of a model you fine-tuned, there is no single correct response, nor a strict structure responses must follow. “To make a sandwich, take out the bread, apply your spread” is just as viable as “Grab the bread out of the packaging, scoop out the peanut butter with a spoon, and apply it onto your bread”.

When evaluating, you might look for specific synonymous words, phrases, or terms. You consider whether the responses are factually correct. Is there coherence and continuity between sentences? How verbose is the content? How similar are the responses to a set of reference responses that you have deemed correct or sufficient?

Metrics for Text Evaluation

We want a metric that will give us a quantifiable measure of similarity between generated and reference responses.

Word-Based Metrics: BLEU and ROUGE

The most straightforward (and still widely used) approach involves comparing the raw content: individual words, phrases, and sequences (n-grams) in the generated response against a reference.

The BLEU (Bilingual Evaluation Understudy) score, one of the earliest automated methods for evaluating machine translation, compares n-grams (sequences of words) in a generated response to those in a set of reference responses. It penalizes outputs with fewer shared n-grams and rewards higher overlap. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) works similarly but focuses on recall, and it is often used for summarization tasks.
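If you want to compute these scores yourself, here is a minimal sketch using the nltk and rouge_score Python packages; exact values depend on tokenization and smoothing choices, so they may differ slightly from the numbers reported below.

```python
# Minimal sketch: word-overlap metrics with nltk (BLEU) and rouge_score (ROUGE).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The weather today is perfect for a hike, with clear skies and a gentle breeze."
candidate = "The weather today is clear and perfect for a hike."

# BLEU: n-gram precision of the candidate against the reference tokens.
bleu = sentence_bleu(
    [reference.lower().split()],
    candidate.lower().split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 / ROUGE-2: unigram and bigram overlap, reported here as F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE_1: {rouge['rouge1'].fmeasure:.2f}, ROUGE_2: {rouge['rouge2'].fmeasure:.2f}")
```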

Both BLEU and ROUGE are simple, intuitive, and have remained popular for their ease of implementation. However, they fall short of capturing the nuances of language and meaning. Consider the following example:

Reference:

The weather today is perfect for a hike, with clear skies and a gentle breeze.


Generated Response 1 (BLEU/ROUGE Aligned):

The weather today is clear and perfect for a hike.

BLEU Score: 0.29, ROUGE_1 Score: 0.8, ROUGE_2 Score: 0.52


Generated Response 2 (Semantically Similar):

It’s a beautiful day for outdoor activities, with sunny skies and a mild wind.

BLEU Score: 0.0, ROUGE_1 Score: 0.4, ROUGE_2 Score: 0.14


Constructed Negative Example:

The weather today sucks. Clear and perfect weather is not perfect for a hike.

BLEU Score: 0.27, ROUGE_1 Score: 0.69, ROUGE_2 Score: 0.37

Using BLEU or ROUGE, Generated Response 1 scores higher because it directly matches more n-grams with the reference. However, Generated Response 2, despite conveying essentially the same meaning as the reference, scores poorly due to fewer shared words and phrases. And even though the Negative Example conveys something very different from the reference text, it still scores high on BLEU/ROUGE. This discrepancy highlights the limitation of traditional metrics: they focus on surface-level similarity but struggle to assess deeper meaning or context.

Going Beyond Words: Comparing Sentence Embeddings

Sentence Transformers are deep learning architectures built on top of models like BERT and MPNet, designed to capture the semantics of language. They generate embeddings, numerical representations of text, that position semantically similar inputs closer together in a high-dimensional space, effectively minimizing the distance between related text. To get a measure of similarity, we generate an embedding for both the response text and the reference text and take the cosine similarity between the two vectors.

Reference:

The weather today is perfect for a hike, with clear skies and a gentle breeze.


Generated Response 1 (BLEU/ROUGE Aligned):

The weather today is clear and perfect for a hike.

Cosine Similarity: 0.93


Generated Response 2 (Semantically Similar):

It’s a beautiful day for outdoor activities, with sunny skies and a mild wind.

Cosine Similarity: 0.72


Constructed Negative Example:

The weather today sucks. Clear and perfect weather is not perfect for a hike.

Cosine Similarity: 0.68

Notice how the Sentence Transformer assigns noticeably higher scores to the meaningful comparisons while correctly giving the Negative Example the lowest score. This highlights the strength of embedding-based metrics: they capture and compare the underlying meaning of the text, offering a more holistic and human-like evaluation of similarity.
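As a rough sketch, computing these similarity scores with the sentence-transformers library might look like the following; the checkpoint name here is an assumption for illustration, and scores will vary by model.

```python
# Minimal sketch: semantic similarity via sentence embeddings and cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # illustrative checkpoint choice

reference = "The weather today is perfect for a hike, with clear skies and a gentle breeze."
candidates = [
    "The weather today is clear and perfect for a hike.",
    "It's a beautiful day for outdoor activities, with sunny skies and a mild wind.",
    "The weather today sucks. Clear and perfect weather is not perfect for a hike.",
]

# Encode the reference and candidates into dense vectors.
ref_emb = model.encode(reference, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the reference embedding and each candidate embedding.
scores = util.cos_sim(ref_emb, cand_embs)[0]
for text, score in zip(candidates, scores):
    print(f"{score:.2f}  {text}")
```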

Strategies For Evaluating Empathy

To evaluate the empathy of language models, we will first build a comprehensive dataset of empathetic responses to user inputs, then evaluate LLaMA’s responses with both word-based and semantic metrics:

  • Obtain a diverse set of references which specifically demonstrate a variety of empathetic responses to an input.

  • For every input, generate a response from an LLM.

  • Run BLEU/ROUGE/Sentence Transformer evaluation metrics (a minimal sketch of this loop follows the list).
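The loop itself is simple. The sketch below assumes a hypothetical generate_response() helper that wraps whichever LLM is being evaluated, plus metric functions like those sketched in the previous sections.

```python
# Sketch of the evaluation loop. generate_response() is a hypothetical placeholder
# for the LLM under test; score_fns holds metric functions such as the BLEU/ROUGE
# and cosine-similarity helpers sketched earlier.
def evaluate_empathy(pairs, generate_response, score_fns):
    """pairs: iterable of (user_input, reference_response) tuples.
    score_fns: dict mapping metric name -> fn(reference, candidate) -> float."""
    totals = {name: 0.0 for name in score_fns}
    count = 0
    for user_input, reference in pairs:
        candidate = generate_response(user_input)      # query the model under test
        for name, fn in score_fns.items():
            totals[name] += fn(reference, candidate)   # score against the reference
        count += 1
    return {name: total / max(count, 1) for name, total in totals.items()}
```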

The challenge here is not building a model or running your evaluation, but instead obtaining a dataset specific and diverse enough to meet our needs.

EmotionQueen

In EmotionQueen (Chen et al., 2024), the authors describe one method for generating such data artificially. First, given a set of inputs, they use an emotion detection model to identify the type of emotion present in the input text. Then, based on that emotion, they engineer a prompt containing the input text and the emotion they’ve classified, and explicitly ask GPT-3 to generate an empathetic response to the input.

Context: A person just received a rejection letter from a job they really wanted.


Emotion: Disappointment

User Input: "I can't believe I didn't get the job. I really thought this was it."

Prompt: “Generate an empathetic response to the user. They are feeling disappointment”. 


Empathetic Response: "I'm really sorry to hear that. It's tough when something you hope for doesn't work out, but I know you'll find something even better."

By employing this strategy, they were able to build a dataset of reference responses specifically designed to be empathetic. This approach relies heavily on the assumption that GPT, when explicitly told to respond empathetically and given the specific emotion present in the input, will produce strong empathetic responses.
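A rough sketch of this kind of pipeline is shown below; the emotion-classifier checkpoint and the OpenAI model name are illustrative stand-ins, not the exact models used in the EmotionQueen paper.

```python
# Illustrative EmotionQueen-style generation step: classify the emotion in the
# input, then ask a large model for an explicitly empathetic reply. The classifier
# checkpoint and the OpenAI model name are assumptions for illustration only.
from transformers import pipeline
from openai import OpenAI

emotion_clf = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def empathetic_reference(user_input: str) -> str:
    emotion = emotion_clf(user_input)[0]["label"]          # e.g. "sadness"
    prompt = (f"Generate an empathetic response to the user. "
              f"They are feeling {emotion}.\n\nUser: {user_input}")
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                               # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```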

Counsel Chat

However, when performing my analysis, I wanted data vetted by professionals. I used publicly released conversational data between mental health professionals and their clients from Counsel Chat’s online Mental Health Platform. These interactions are inherently empathetic and focused on addressing a wide range of emotions in a supportive and compassionate manner. You can find the full dataset published on Hugging Face, a platform to share datasets and pre-trained models.
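The dataset loads directly with the datasets library; the column names in this sketch ("Context" for the patient's message and "Response" for the therapist's reply) are based on my reading of the dataset card and are worth verifying.

```python
# Minimal sketch of loading the counseling conversations dataset from Hugging Face.
# The column names ("Context", "Response") are assumptions to verify against the card.
from datasets import load_dataset

ds = load_dataset("Amod/mental_health_counseling_conversations", split="train")

for example in ds.select(range(3)):
    print("Patient:  ", example["Context"][:80], "...")
    print("Therapist:", example["Response"][:80], "...\n")
```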

Evaluating LLaMA in Mental Health Dialogue

Setup

To generate my responses, I gave LLaMA the patient's message along with the instruction: “You are a licensed therapist. Your role is to respond to the user's question with empathy, understanding, and support”. A sketch of this setup follows.
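One way to reproduce this setup is with the transformers chat interface; the specific checkpoint below is an assumption, not necessarily the model used for the results in this post, and it requires a recent transformers version that accepts chat-formatted inputs.

```python
# Sketch of the prompting setup using a LLaMA chat checkpoint via transformers.
# The checkpoint name is an assumption; any LLaMA instruct/chat model with a chat
# template should work the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

system_prompt = ("You are a licensed therapist. Your role is to respond to the "
                 "user's question with empathy, understanding, and support.")
patient_message = "Often times I find myself thinking scary thoughts..."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": patient_message},
]

output = generator(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])  # the assistant's reply
```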

Here is a subset of the responses LLaMA generated, alongside the original therapist responses.

Example 1

Patient: Often times I find myself thinking scary thoughts and sometimes I even scare myself into thinking that something bad is going to happen to me. Once it starts, the thought continues going through my head and I can't get it out. How can I stop these thoughts?

Therapist: There are some great thoughts offered by others here. I would just add that typically the most natural response to fearful thoughts is to want to stop, avoid, or get rid of them - which doesn't work if you're really caught up in a cycle of OCD or other form of...

LLaMA: You're describing a common phenomenon that many people experience, known as intrusive thoughts. These thoughts can be distressing and feel overwhelming. First, please know that you're not alone in experiencing these thoughts...

Example 2

Patient: Sometimes I find myself waking up 3 times a night.

Therapist: Dreams are our chance to make sense of our waking lives. If you've got a lot going on in your life then your dreams are your assistants in understanding these happenings and dynamics.

LLaMA: You're telling me that you're waking up multiple times during the night. That can be really frustrating and disrupt your sleep patterns. Can you tell me more about what's happening when you wake up? Is there anything specific that's causing you to wake up, or is it more of a general feeling of being awake?

Results

Metric               Baseline (Therapist Responses)   LLaMA
BLEU                 0.2439                           0.0184
ROUGE_1              0.4019                           0.2845
ROUGE_2              0.1834                           0.0410
Cosine Similarity    0.6585                           0.6492

The BLEU and ROUGE scores for LLaMA are notably lower compared to the baseline therapist responses. These metrics are typically sensitive to exact word overlap and n-grams, and as expected, LLaMA's performance in this regard is weaker. This is a common observation for models that are trained to generate diverse and contextually appropriate responses rather than merely mimicking exact phrasing from a reference dataset. LLaMA does not necessarily use the same wording or structure as a human therapist, which likely accounts for the low BLEU and ROUGE scores.

However, when we turn to Cosine Similarity, which measures the semantic similarity between the embeddings of the generated and reference responses, LLaMA's performance is much closer to human-like responses. The cosine similarity score of 0.6492 indicates that while LLaMA may not match the exact phrasing of the therapist's responses, the underlying meaning and emotional tone are comparable. This suggests that LLaMA is able to capture the essence of empathy and provide contextually appropriate responses, even though the surface-level expression may differ.

This indicates that LLaMA is likely focusing on the semantic content of responses, delivering emotionally appropriate feedback even if it doesn't always align with the exact language used by humans. While LLaMA's performance on traditional metrics suggests room for improvement in domain-specific phrasing and surface-level accuracy (how therapists are taught to respond), its cosine similarity score demonstrates that it is still capable of generating responses that reflect the empathic intent found in the baseline therapist responses.

Limitations and Future Work

This experiment focused specifically on analyzing the empathetic quality of LLaMA in the context of mental health conversations. While LLaMA performed comparably to human therapists, it is worth noting that empathetic responses come relatively naturally in this domain, given the emotional context of the conversations. This raises the question of how LLaMA would perform in environments where empathy must be inferred more subtly, such as educational or workplace settings.

For example, imagine a scenario where a student becomes increasingly frustrated while trying to understand a concept. Can a language model recognize this emotional shift and adapt its response accordingly—not only by explaining the concept but also by offering reassurance and affirmation? As emotions fluctuate over the course of an interaction, can an agent continuously evaluate the emotional state of the user and adjust its tone to deliver a response that fits the context of the conversation?

Beyond analyzing performance in other contexts, how does empathetic quality vary across the many conversational agents that exist: ChatGPT, Gemini, Claude? How has the empathetic quality of responses changed over time? Are we seeing improvements with each successive iteration of these models?

Given that language models are becoming integral to our daily lives, not just as assistive tools but as trusted agents of information, it is paramount that we continuously analyze, emphasize, and improve the empathetic quality of these models, because these interactions will shape the evolution of society. I believe neglecting this aspect could have lasting and harmful effects on the mental health and the very humanity of our communities - perhaps even society as a whole.

However, for now, it seems we are off to a pretty decent start.

References

  1. https://arxiv.org/html/2409.13359v1

  2. https://huggingface.co/datasets/Amod/mental_health_counseling_conversations

  3. https://clementbm.github.io/theory/2021/12/23/rouge-bleu-scores.html