February 6, 2025
Do Frontier AI Models Possess Emotional Intelligence?
Why Should You Care?
Both reasoning and general-purpose AI models can effectively classify emotions and states of mind in a narrative context, with general-purpose models showing a slight competitive edge. While some subjectivity remains in AI-generated justifications, reliable emotion classification could help businesses improve workplace communication, enhance employee well-being, gather deeper customer insights, and optimize negotiation strategies.
What Did We Test? → We evaluated how well frontier AI models can accurately classify a series of emotions and states of mind in an original short story excerpt.
What Did We Find Out? → Both reasoning and general-purpose AI models (e.g., o1 vs. GPT-4o) performed far better than expected, though general-purpose models appear to have a slight competitive edge, notwithstanding an exception for one of the reasoning models we tested (o3-mini).
Are There Any Caveats? → Yes. Because this emotion classification task assumes a narrative, literary structure, there is room for interpretation in the justifications and explanations that models provide for their classifications.
Why Does This Matter? → If frontier AI models can accurately and reliably classify human emotions and states of mind, businesses could use them to reduce workplace communication barriers, enhance teamwork and overall employee wellbeing, gather nuanced customer and product insights, and streamline negotiations.
If you find our AI experiments useful and interesting, please also consider exploring how Lumenova’s responsible AI platform could support your AI risk management and governance needs, and book a product demo today.
Main Insights
- Frontier AI models are good at classifying various emotions and states of mind within a creative narrative, provided that they receive relevant context when completing the task.
- Despite being proficient at emotional classification, frontier AI models still struggle to classify emotional concepts in the “right” place, frequently providing justifications misaligned with emotional logic.
- While general-purpose models like Claude 3.5 Sonnet and GPT-4o tend to outperform reasoning models on emotional classification, the reasoning they invoke to justify their choices is rudimentary and somewhat superficial.
- Emotional classification proficiency doesn’t equate to emotional understanding: it’s one thing to be able to identify “sadness,” but it’s another thing entirely to be able to grasp and internalize the true nature of this experience meaningfully.
The Business Case
✅ Sentiment Analysis and Emotional Context Detection
💡 Why it Matters
- AI can be a valuable tool for businesses in customer sentiment analysis, user feedback evaluation, and content moderation.
- Prompts should be fine-tuned to ensure contextual accuracy.
- AI can help decode customer sentiment at scale, identifying dissatisfaction, excitement, or frustration in text data.
- Businesses should provide AI with structured guidance to ensure it properly understands emotional context.
✍️ Key Actions
- Fine-tune AI prompts with additional context.
- Deploy AI for social media and brand monitoring.
- Use AI-driven emotional classification in customer feedback systems.
✅ AI Can Identify Emotions, But That Doesn’t Mean It Understands Them
💡 Why it Matters
- AI-generated insights may sound plausible but fail to capture the deeper emotional nuances of a situation.
- Where AI is leveraged in customer interactions, HR, or emotional tone assessment, misclassifications could result in poor decision-making or misguided responses.
- AI models may label emotions accurately, but this doesn’t indicate true emotional understanding—for example, there’s a fundamental gap between recognizing sadness as a label and comprehending the full depth of human grief, despair, or sorrow.
✍️ Key Actions
- Pair AI-generated emotion classification with human-led emotional intelligence strategies.
- Train AI models on domain-specific emotional cues.
- Enhance AI training data by incorporating high-quality, nuanced emotional reasoning examples.
✅ Potential for Bias and Cultural Misinterpretation
💡 Why it Matters
- AI models can classify emotions effectively but often provide shallow justifications for their choices.
- Most AI systems are trained on data that reflects certain cultural and linguistic norms, which could cause emotional classification to falter in cross-cultural contexts.
- AI still lacks the ability to meaningfully engage with emotional complexity and nuance.
✍️ Key Actions
- Require AI-generated emotional classifications to include detailed reasoning.
- Use AI explainability tools to uncover how models arrive at emotional classifications.
- Train AI models on diverse, multicultural emotional datasets.
- Consider region-specific AI customization and modify sentiment analysis models to align with cultural expectations.
✅ Enhancing Personalization in Customer Experiences
💡 Why it Matters
- Businesses can leverage AI to enhance personalized customer experiences, especially in e-commerce, entertainment, and digital marketing.
- Scalable customer sentiment insights could allow businesses to tailor product recommendations, advertising strategies, and interactive experiences to better match diverse emotional needs.
✍️ Key Actions
- Leverage AI for real-time emotional sentiment analysis.
- Monitor emotional classification accuracy over time.
- Combine AI sentiment analysis with contextual NLP models.
Executive Summary
Experiment Setup
This test evaluates how well a series of different AI models can accurately classify a selection of emotions and states of mind within an original piece of creative writing. This piece is framed as a narrative excerpt from a short story about a character named Charlie, who is experiencing a complex and intertwined array of specific emotions and states of mind at distinct stages of the story.
With this particular test, we didn’t expect any of the models we evaluated to classify each emotion and state of mind with perfect or even moderate accuracy—we are, after all, operating within a rather subjective realm of interpretation, and we couldn’t expect humans to perform this task with perfect accuracy either. We are nonetheless interested in:
- Whether models can classify emotions and states of mind in the right order in each individual paragraph throughout the story.
- To what extent models’ explanations/justifications for each emotion/state of mind classification correspond with relevant “evidence” (i.e., the specific phrase that denotes an emotion or state of mind).
Our Hypothesis: All models tested will struggle with emotional classification; however, general-purpose models like GPT-4o and Claude 3.5 Sonnet will outperform reasoning models like o1, o3-mini, and DeepThink-R1.
Models Tested: OpenAI’s o1, o3-mini, and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and DeepSeek’s DeepThink-R1.
Prompt Description
In our prompt, we present an original short story excerpt about a character named “Charlie.” The story is divided into three paragraphs, each of which is designed to represent a series of emotions and states of mind.
To provide the model with some context, we list all the emotions and states of mind Charlie experiences throughout the story. However, we randomize the order of this list to prevent models from simply mapping an ordered emotional concept to its correct location in the narrative. Similarly, while we know that each paragraph illustrates four distinct emotional concepts, we don’t want the model to infer an even concept distribution across paragraphs (i.e., four concepts per paragraph), since this could simplify the task significantly. We therefore include an instruction that lets models link as many concepts as they see fit to each individual paragraph, provided that every concept is paired with a “relevant” paragraph upon task completion.
Furthermore, we provide a single example of the output structure we’d like each model to follow to enable evaluation consistency. To gain additional insights into models’ reasoning processes, we also instruct them to outline their logic using chain-of-thought.
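The prompt design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `build_prompt`, the `LABELS` constant, and the `seed` parameter are hypothetical names, the instruction wording is a paraphrase of the actual prompt, and the label order used in the experiment may well have been shuffled by hand rather than in code.

```python
import random
from typing import Optional

# The twelve labels from the task, grouped here in narrative order.
LABELS = [
    "Confusion", "Discomfort", "Insecurity", "Intimidation",
    "Acceptance", "Anxiety", "Nostalgia", "Disbelief",
    "Melancholy", "Contempt", "Irony", "Cynicism",
]


def build_prompt(story: str, seed: Optional[int] = None) -> str:
    """Hypothetical helper: assemble the task prompt with a shuffled label list."""
    rng = random.Random(seed)      # seeded so a run can be reproduced
    shuffled = LABELS[:]           # copy; leave the canonical order untouched
    rng.shuffle(shuffled)
    bullets = "\n".join(f"- {label}" for label in shuffled)
    return (
        "For each paragraph of the story, state the emotions and states of "
        "mind Charlie is experiencing.\n\n"
        f"List:\n{bullets}\n\n"
        # One example of the expected output structure.
        'Example: "In paragraph three, Charlie is experiencing nostalgia, '
        'cynicism, and confusion."\n\n'
        # Avoid implying an even four-per-paragraph split.
        "You may attribute any number of items to a paragraph, but every "
        "item must be attributed to one of the three paragraphs.\n\n"
        "Explain your reasoning using a detailed chain-of-thought.\n\n"
        f"Short Story Excerpt:\n{story}"
    )
```

Fixing the seed keeps the shuffled order identical across models, which matters if every model is meant to receive exactly the same prompt.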
Method
For each model tested, we administer the same prompt in a single, one-off interaction. We intentionally designed our prompt to be difficult, subtle, and somewhat ambiguous, pushing models to make uncertain inferences regarding Charlie’s emotional experience throughout the narrative. We have adopted this approach because it roughly approximates the real-world dynamics of emotional intelligence—as humans, we often have intuitions, based on limited information, about what someone else may be feeling and experiencing, though we rarely know for certain whether our intuitions are true.
However, in our prompt, we do provide an unordered list containing each emotion and state of mind Charlie experiences. Since humans typically don’t make sense of others’ emotional experiences by listing out the possible characteristics of their experience and then selecting the “best” possible options, readers may interpret this as giving a “hint” to the model. We chose this approach for two reasons:
- It allows us to more coherently and consistently compare task performance across each model (i.e., the task is constrained by a specific set of parameters).
- It enables us to isolate the reasoning process behind each attribution of a specific emotion or state of mind, effectively forcing the model into passing an emotional judgment and justifying it.
Below, we list the emotions and states of mind each paragraph captures in the order in which they appear in the given paragraph—we also include the specific phrases that denote where these emotions and states of mind should be attributed. This list defines our evaluation criteria (i.e., correct matchings between emotions/states of mind and the paragraphs they appear within).
- Paragraph 1: Confusion, discomfort, insecurity, and intimidation.
- Confusion → “he couldn’t help but wonder” and “His brows furrowed as the corners of his mouth twitched unpredictably”
- Discomfort → “he noticed how difficult it was to maintain her gaze” and “It wasn’t that she couldn’t hold eye contact—precisely the opposite”
- Insecurity → “but now, they looked through him, beyond his own, centering on foreign objects that only ran further away”
- Intimidation → “His seat felt lower than hers”
- Paragraph 2: Acceptance, anxiety, nostalgia, and disbelief.
- Acceptance → “He had known her for long enough to understand that she would continue talking if he didn’t interject”
- Anxiety → “he began picking at the skin above his fingernails”
- Nostalgia → “chastising him as if she were there”
- Disbelief → “The thought of her impossible presence”
- Paragraph 3: Melancholy, contempt, irony, and cynicism.
- Melancholy → “an estranged sense of peace” and “Leaving the memories of his mother behind like fallen leaves”
- Contempt → “or more aptly, her monologue” and “Despite the bird-like cadence of her voice”
- Irony → “questioning whether she would be able to learn from him in spite of her brilliance”
- Cynicism → “Once, he was called a smart man by an even smarter man—all he could do was laugh.”
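To make the evaluation criteria concrete, here is a minimal scoring sketch. The names `ANSWER_KEY` and `score_attribution` are hypothetical, and the rules it applies (a label counts as correct if it lands in the paragraph the key assigns it to; order counts as correct only if the paragraph's labels match the key exactly, in sequence) are our reading of the criteria rather than a confirmed procedure. The example input is the attribution Claude 3.5 Sonnet gives in the Outputs section.

```python
# Answer key: paragraph number -> labels in the order they appear in the text.
ANSWER_KEY = {
    1: ["confusion", "discomfort", "insecurity", "intimidation"],
    2: ["acceptance", "anxiety", "nostalgia", "disbelief"],
    3: ["melancholy", "contempt", "irony", "cynicism"],
}


def score_attribution(attribution):
    """Score one model's paragraph -> labels mapping against the key.

    Returns (number of correctly placed labels, sorted misplaced labels,
    per-paragraph flag saying whether the labels match the key's order).
    """
    correct = 0
    misplaced = []
    order_ok = {}
    for paragraph, expected in ANSWER_KEY.items():
        given = attribution.get(paragraph, [])
        for label in given:
            if label in expected:
                correct += 1
            else:
                misplaced.append(label)
        # Strict order check: same labels, same sequence as the key.
        order_ok[paragraph] = given == expected
    return correct, sorted(misplaced), order_ok


# Claude 3.5 Sonnet's final statements, in the order it listed the labels.
claude = {
    1: ["confusion", "intimidation", "discomfort", "insecurity"],
    2: ["acceptance", "anxiety", "nostalgia", "disbelief"],
    3: ["contempt", "cynicism", "irony", "melancholy"],
}

print(score_attribution(claude))
# -> (12, [], {1: False, 2: True, 3: False})
```

Placement and order are scored separately here because the experiment asks about both; this sketch does not attempt to judge whether a model cited the right phrase as evidence, which still requires a human read.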
Key Findings
o1 - Key Takeaways:
- o1 was one of three models tested to correctly classify all emotions and states of mind in paragraph one.
- Despite its incorrect attribution of melancholy in paragraph two, a mistake shared with DeepThink-R1, the argument o1 provided for this choice is valid, and we, therefore, excuse this mistake.
- All of o1’s emotional classification justifications are convincing and well-substantiated; however, they occasionally invoke the wrong evidence, especially for Discomfort, Acceptance, Irony, and Cynicism.
- Of all the reasoning models tested, o1 spent the most time “thinking.”
GPT-4o - Key Takeaways:
- Despite performing quite well overall in terms of accuracy, GPT-4o’s classification justifications weren’t particularly strong or aligned with emotional logic.
- Surprisingly, GPT-4o made two of the same mistakes that DeepThink-R1 made, an unexpected occurrence due to the design differences between these two models.
o3-mini - Key Takeaways:
- o3-mini was the top performer for reasoning models, outperforming both o1 and DeepThink-R1.
- Despite incorrectly classifying Disbelief, o3-mini’s justification was compelling, although it would be a stretch to consider it “valid.”
- Even though o3-mini had a higher accuracy rate than o1, o1’s classification justifications were more substantive and detailed.
Claude 3.5 Sonnet - Key Takeaways:
- While Claude 3.5 Sonnet was the only model to complete this task with perfect accuracy, it was unable to identify the correct order for emotions and states of mind in each paragraph.
- Like GPT-4o, Claude 3.5 Sonnet only invoked evidence for its classification choices but didn’t interpret it meaningfully (i.e., explain its significance).
DeepThink-R1 - Key Takeaways:
- Despite being the “worst” performer, DeepThink-R1 was the only model to correctly identify and classify Disbelief in the right place.
- While DeepThink-R1 didn’t spend as much time “thinking” as o1, both models spent far more time “thinking” than o3-mini.
- Every mistake DeepThink-R1 made was also made by o1, GPT-4o, or both.
- DeepThink-R1 was the only model to contest certain emotion classifications, despite no instruction to do so.
Bottom Lines:
- Overall, all models performed far better than expected.
- No model made any error that wasn’t shared by at least one other model.
- Claude 3.5 Sonnet outperformed all other models, though o3-mini is a close second-place contender.
- o1 and DeepThink-R1 performed worst overall, but o1 outperformed DeepThink-R1 due to its more substantive explanations and well-rounded logic.
- All models were largely unable to consistently make classifications in the correct order.
- Reasoning models tend to make more or less the same mistakes—o1 and o3-mini both misclassified Disbelief, whereas o1 and DeepThink-R1 made the same classification errors across Acceptance and Melancholy.
- General-purpose models—GPT-4o and Claude 3.5 Sonnet—provided the most simplistic classification justifications/explanations.
- Overall, we would’ve liked to see models go beyond simply invoking evidence to justify their classification choices, actively interpreting the evidence they invoked rather than simply stating it as a “reason” for their choice.
The table below illustrates the accuracy with which each model tested correctly identified the emotions and states of mind captured within each individual paragraph in the short story.
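As a rough cross-check, per-paragraph accuracy can be recomputed from the final attributions reproduced in the Outputs section. The sketch below assumes one simple counting rule: for each paragraph, count how many of its four key labels the model placed there. The names `ANSWER_KEY`, `ATTRIBUTIONS`, and `accuracy_table` are hypothetical, and the original table may have been scored differently (for example, by also weighing the evidence each model cited).

```python
# Answer key: paragraph number -> set of labels that belong there.
ANSWER_KEY = {
    1: {"confusion", "discomfort", "insecurity", "intimidation"},
    2: {"acceptance", "anxiety", "nostalgia", "disbelief"},
    3: {"melancholy", "contempt", "irony", "cynicism"},
}

# Final attributions transcribed from each model's output.
ATTRIBUTIONS = {
    "o1": {
        1: {"confusion", "disbelief", "intimidation", "discomfort", "insecurity"},
        2: {"anxiety", "nostalgia", "melancholy"},
        3: {"acceptance", "irony", "cynicism", "contempt"},
    },
    "GPT-4o": {
        1: {"confusion", "intimidation", "discomfort"},
        2: {"disbelief", "anxiety", "nostalgia", "insecurity"},
        3: {"melancholy", "irony", "cynicism", "acceptance", "contempt"},
    },
    "o3-mini": {
        1: {"confusion", "insecurity", "intimidation", "discomfort"},
        2: {"anxiety", "nostalgia", "acceptance"},
        3: {"disbelief", "irony", "cynicism", "melancholy", "contempt"},
    },
    "Claude 3.5 Sonnet": {
        1: {"confusion", "intimidation", "discomfort", "insecurity"},
        2: {"acceptance", "anxiety", "nostalgia", "disbelief"},
        3: {"contempt", "cynicism", "irony", "melancholy"},
    },
    "DeepThink-R1": {
        1: {"confusion", "intimidation", "discomfort"},
        2: {"anxiety", "nostalgia", "melancholy", "disbelief"},
        3: {"irony", "cynicism", "insecurity", "acceptance", "contempt"},
    },
}


def accuracy_table():
    """Return model -> (hits in P1, hits in P2, hits in P3, total hits)."""
    rows = {}
    for model, attribution in ATTRIBUTIONS.items():
        # A hit is a key label that the model placed in the right paragraph.
        hits = [len(attribution[p] & ANSWER_KEY[p]) for p in (1, 2, 3)]
        rows[model] = (hits[0], hits[1], hits[2], sum(hits))
    return rows


if __name__ == "__main__":
    print(f"{'Model':<20}{'P1':>6}{'P2':>6}{'P3':>6}{'Total':>8}")
    for model, (p1, p2, p3, total) in accuracy_table().items():
        print(f"{model:<20}{p1:>4}/4{p2:>4}/4{p3:>4}/4{total:>5}/12")
```

Under this counting rule the totals come out to 12/12 for Claude 3.5 Sonnet, 11/12 for o3-mini, 10/12 for GPT-4o, and 9/12 each for o1 and DeepThink-R1, which is consistent with the ranking given in the Bottom Lines.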
Prompt
The same for all models
Below, is an excerpt of a short story written about a fictional character named “Charlie”. For each paragraph of this story, I want you to state the states of mind and/or emotion(s) that you think Charlie is experiencing.
The list below includes each emotion and state of mind that Charlie experiences, however, you will have to figure out in which paragraphs these emotions and states of mind are represented.
List:
- Confusion
- Acceptance
- Intimidation
- Disbelief
- Anxiety
- Discomfort
- Irony
- Nostalgia
- Insecurity
- Cynicism
- Melancholy
- Contempt
For example, if you have reason to think that Charlie is experiencing nostalgia, cynicism, and confusion in paragraph three, your statement should read, “In paragraph three, Charlie is experiencing nostalgia, cynicism, and confusion.”
You are not required to limit the emotions and states of mind you attribute to a given paragraph to any specific number, but you must ensure that each emotion and state of mind included in the list is attributed to one of the three paragraphs.
Take your time and think carefully about the answers you provide. Explain your reasoning using a detailed chain-of-thought.
Short Story Excerpt:
”As Charlie listened to her, he couldn’t help but wonder why she had waited so long to tell him. His brows furrowed as the corners of his mouth twitched unpredictably, and in a brief moment of self-awareness, he noticed how difficult it was to maintain her gaze. It wasn’t that she couldn’t hold eye contact—precisely the opposite—her coffee-cream irises were always a welcome appearance from afar, but now, they looked through him, beyond his own, centering on foreign objects that only ran further away. His seat felt lower than hers, even though he knew it wasn’t.
He had known her for long enough to understand that she would continue talking if he didn’t interject, and he had learned that this was all the better, for he would feel the need to speak as soon as he had something worthwhile to say, and even more so, something to make sense of. As more questions and fewer answers presented themselves to him, he began picking at the skin above his fingernails, hearing his mother’s nagging voice whispering over his shoulder, chastising him as if she were there. The thought of her impossible presence—she died many years ago—brought him an estranged sense of peace.
Leaving the memories of his mother behind like fallen leaves, he returned to their conversation, or more aptly, her monologue. Despite the bird-like cadence of her voice, he found it soothing—he imagined that if they were in the sky together, perhaps he’d be able to show her something she didn’t know already. He reveled in this thought, questioning whether she would be able to learn from him in spite of her brilliance while knowing full well that he had always had much more to teach than he let on. Once, he was called a smart man by an even smarter man—all he could do was laugh.”
Outputs
o1 (thought for 2m 1s)
Below is a concise mapping of each paragraph to the emotions/states of mind Charlie seems to experience, followed by a detailed explanation of how one might arrive at these conclusions from the excerpt:
Paragraph One
Charlie is experiencing confusion, disbelief, intimidation, discomfort, and insecurity.
Paragraph Two
Charlie is experiencing anxiety, nostalgia, and melancholy.
Paragraph Three
Charlie is experiencing acceptance, irony, cynicism, and contempt.
Detailed Reasoning (Chain-of-Thought)
Paragraph One
- Confusion and Disbelief
- Charlie wonders why she waited so long to tell him. The phrasing (“he couldn’t help but wonder why…”) and the furrowed brows suggest genuine puzzlement. This pondering over her delay implies he’s not only perplexed but also finding it hard to accept that she truly held back such information until now.
- Intimidation and Insecurity
- Charlie feels as though he’s literally seated “lower” than she is, though he knows the chairs are of equal height. This exaggerated sense of being “beneath” her, coupled with difficulty maintaining eye contact, signals he feels overshadowed or outmatched—classic signs of intimidation and insecurity.
- Discomfort
- Physical cues (furrowed brows, twitching mouth) and the awkwardness of sustaining her gaze point to a state of unease. He is not at ease in this conversation, further shown by the sense that her gaze goes “beyond” him.
Paragraph Two
- Anxiety
- Charlie starts picking at the skin above his fingernails, a common nervous habit. He does this in response to having “more questions and fewer answers,” revealing inward tension and worry.
- Nostalgia
- He hears his deceased mother’s “nagging voice” as though she were right behind him. This recollection, particularly of a loved one who is gone, indicates a longing or wistful affection for the past.
- Melancholy
- Remembering his mother (who died years ago) brings a “strange sense of peace,” but it’s tinged with sorrow, as the memory underscores her absence. This mixture of calm and sadness is characteristic of melancholy.
Paragraph Three
- Acceptance
- Charlie transitions away from memories of his mother, leaving them “behind like fallen leaves,” returning his focus to the present. The image of gently letting go suggests an acceptance of her loss and of the current situation.
- Irony
- He recalls being called “a smart man by an even smarter man,” and all he could do was laugh. There is an ironic tinge to this moment—being praised by someone supposedly more knowledgeable both flatters and pokes fun at his self-image.
- Cynicism
- Charlie wonders if “she would be able to learn from him” despite her brilliance, while knowing he has “much more to teach than he let on.” There’s a slight cynical edge in suspecting that people might not be open to what he has to offer or that knowledge exchange might be one-sided.
- Contempt
- Embedded in that cynicism is a hint of dismissiveness or contempt—he is both amused by the idea that others call him smart (as though such labels might be superficial) and subtly looks down on the notion that this brilliant woman may or may not be able to learn from him.
By distributing the listed emotions and states of mind across these three paragraphs and drawing on textual details—Charlie’s bodily reactions, inner thoughts, and references to his past—we can see how all twelve items in the list manifest at some point in the excerpt.
GPT-4o
Here is my analysis of Charlie’s states of mind and emotions throughout each paragraph:
Paragraph One
Confusion, Intimidation, Discomfort
- Charlie’s confusion is evident as he wonders why she had waited so long to tell him. His body language—furrowed brows and an inability to hold her gaze—suggests that he is struggling to process the information she is sharing.
- The sense of intimidation comes from the description of her gaze. Rather than simply meeting his eyes, her stare moves beyond him, making him feel unseen or insignificant. The perception that his seat is lower than hers, despite knowing that it isn’t, signals a psychological power imbalance, reinforcing the intimidation.
- Discomfort arises from both his reaction to her and the difficulty he has maintaining eye contact, suggesting unease in the interaction.
Paragraph Two
Disbelief, Anxiety, Nostalgia, Insecurity
- Disbelief emerges as more questions form in Charlie’s mind without clear answers. His need to make sense of the situation highlights his struggle to accept what he is hearing.
- Anxiety is evident in his compulsive habit of picking at the skin above his fingernails, which is further reinforced by the intrusive memory of his mother’s voice chastising him.
- Nostalgia comes through in his brief reflection on his mother, as he momentarily immerses himself in a memory that feels real despite her passing many years ago.
- Insecurity is present in the way he waits to speak only when he has something worthwhile to say, suggesting self-doubt about his ability to contribute meaningfully to the conversation.
Paragraph Three
Melancholy, Irony, Cynicism, Acceptance, Contempt
- Melancholy is present in the poetic imagery of fallen leaves as he leaves his memories of his mother behind. His thoughts carry a wistful, reflective quality.
- Irony is embedded in his imagined scenario of teaching her something, despite knowing that he rarely reveals his own intelligence. This is further emphasized in the last line, where he was once called a smart man by someone even smarter—only to laugh at the notion.
- Cynicism emerges in his self-awareness about withholding his intelligence, possibly as a means of protecting himself or avoiding engagement.
- Acceptance is seen in his decision to return to their conversation, letting her monologue continue rather than pushing for understanding. He seems to surrender to the moment, accepting the dynamic as it is.
- Contempt is subtly woven into his internal questioning of whether she could learn from him, despite recognizing her brilliance. The way he relishes this thought suggests an underlying frustration or superiority complex.
Final Summary
Charlie undergoes a complex emotional journey throughout the passage. He begins with confusion, intimidation, and discomfort in the first paragraph, struggling to process the situation. As the tension builds, disbelief, anxiety, nostalgia, and insecurity take hold, revealing his inner turmoil and self-doubt. In the final paragraph, his emotions shift toward melancholy, irony, cynicism, acceptance, and contempt, reflecting a mix of resignation, reflection, and quiet defiance.
o3-mini (thought for 20 seconds)
Below is my analysis and final attribution of the emotions and states of mind across the three paragraphs.
Paragraph One:
“As Charlie listened to her, he couldn’t help but wonder why she had waited so long to tell him. His brows furrowed as the corners of his mouth twitched unpredictably, and in a brief moment of self-awareness, he noticed how difficult it was to maintain her gaze. It wasn’t that she couldn’t hold eye contact—precisely the opposite—her coffee-cream irises were always a welcome appearance from afar, but now, they looked through him, beyond his own, centering on foreign objects that only ran further away. His seat felt lower than hers, even though he knew it wasn’t.”
In paragraph one, Charlie is experiencing confusion, insecurity, intimidation, and discomfort.
Chain-of-thought:
- Confusion: He wonders why she waited so long to tell him, and her shifting, almost disconcerting gaze leaves him puzzled about the situation.
- Insecurity: His self-conscious physical cues—the furrowed brows, twitching mouth, and even the sensation of his seat feeling lower—signal a lack of confidence in his position relative to her.
- Intimidation: Her intense, almost piercing gaze (described as looking “through him”) suggests that he feels overwhelmed or diminished in her presence.
- Discomfort: The overall unsettling experience, from the odd physical sensations to the feeling of being overlooked, creates an underlying discomfort.
Paragraph Two:
“He had known her for long enough to understand that she would continue talking if he didn’t interject, and he had learned that this was all the better, for he would feel the need to speak as soon as he had something worthwhile to say, and even more so, something to make sense of. As more questions and fewer answers presented themselves to him, he began picking at the skin above his fingernails, hearing his mother’s nagging voice whispering over his shoulder, chastising him as if she were there. The thought of her impossible presence—she died many years ago—brought him an estranged sense of peace.”
In paragraph two, Charlie is experiencing anxiety, nostalgia, and acceptance.
Chain-of-thought:
- Anxiety: His nervous habit of picking at the skin above his fingernails and the pressure to interject at just the right moment reveal a jittery, anxious state.
- Nostalgia: The sudden, almost ghostly return of his mother’s nagging voice evokes memories of the past, stirring a bittersweet reminiscence for someone long gone.
- Acceptance: Despite the tension and the emotional intrusion of his mother’s memory, he finds an “estranged sense of peace” in it—a quiet, resigned acceptance of this inner presence.
Paragraph Three:
“Leaving the memories of his mother behind like fallen leaves, he returned to their conversation, or more aptly, her monologue. Despite the bird-like cadence of her voice, he found it soothing—he imagined that if they were in the sky together, perhaps he’d be able to show her something she didn’t know already. He reveled in this thought, questioning whether she would be able to learn from him in spite of her brilliance while knowing full well that he had always had much more to teach than he let on. Once, he was called a smart man by an even smarter man—all he could do was laugh.”
In paragraph three, Charlie is experiencing disbelief, irony, cynicism, melancholy, and contempt.
Chain-of-thought:
- Disbelief: His inner questioning about whether she could learn anything from him, despite her brilliance, hints at an underlying incredulity about the dynamics of their interaction.
- Irony: The very idea that he might “teach” her—coupled with the memory of being acknowledged by an even smarter man—carries an ironic twist, as he contrasts past compliments with the present conversation.
- Cynicism: The reduction of the dialogue to a “monologue” and his self-assured musings about his own undisclosed knowledge suggest a jaded, cynical view of the exchange.
- Melancholy: The act of “leaving the memories of his mother behind like fallen leaves” carries a subtle sadness—a letting go tinged with regret or wistfulness—even as he engages in the present conversation.
- Contempt: His amusement and dismissive laughter at the notion of being labeled smart by someone smarter reveal a certain contempt—not only for that past judgment but also for the current imbalance he perceives in the conversation.
This distribution ensures that each emotion from the list appears in one of the three paragraphs, with the reasoning rooted in details from the text.
Claude 3.5 Sonnet
Let me analyze each paragraph carefully, thinking through the emotional states and mindsets that Charlie exhibits:
Paragraph 1: Let me think through this step by step:
- Charlie’s furrowed brows and twitching mouth suggest internal turmoil
- His difficulty maintaining eye contact indicates discomfort
- The perception that his seat feels lower suggests feeling intimidated
- The questioning of timing (“why she had waited so long”) indicates confusion
- Her eyes looking “through him” and “beyond his own” suggests he feels insecure
Therefore: In paragraph 1, Charlie is experiencing confusion, intimidation, discomfort, and insecurity.
Paragraph 2: Let’s analyze:
- His learned behavior of letting her talk shows acceptance of their dynamic
- Picking at his skin suggests anxiety
- The memory of his mother’s voice and finding peace in her impossible presence indicates nostalgia
- The phrase “more questions and fewer answers” suggests disbelief about the situation
Therefore: In paragraph 2, Charlie is experiencing acceptance, anxiety, nostalgia, and disbelief.
Paragraph 3: Breaking this down:
- The “bird-like cadence” description followed by imagining teaching her shows contempt masked as admiration
- “Knowing full-well that he had always had much more to teach” suggests cynicism about their dynamic
- His response of laughing when called smart by a smarter man indicates irony
- The phrase “leaving the memories of his mother behind like fallen leaves” suggests melancholy
Therefore: In paragraph 3, Charlie is experiencing contempt, cynicism, irony, and melancholy.
This analysis accounts for all twelve emotional states/mindsets from the provided list, with each being attributed to a specific paragraph based on textual evidence and psychological interpretation.
DeepThink-R1 (thought for 88 seconds)
In paragraph one, Charlie is experiencing confusion, intimidation, and discomfort.
- Confusion: He wonders why she waited so long to tell him, indicating uncertainty.
- Intimidation: Her penetrating gaze makes him feel scrutinized, and he perceives his seat as “lower than hers,” symbolizing inferiority.
- Discomfort: Physical reactions (twitching mouth, difficulty maintaining eye contact) and the strained dynamic highlight unease.
In paragraph two, Charlie is experiencing anxiety, nostalgia, melancholy, and disbelief.
- Anxiety: Nervous habits like picking at his skin and hearing his mother’s voice reflect stress.
- Nostalgia: The memory of his mother’s voice evokes longing for the past.
- Melancholy: The “estranged sense of peace” from her absence carries bittersweet sadness.
- Disbelief: The “impossible presence” of his deceased mother introduces a fleeting refusal to accept reality (stretching the term slightly, as this borders on surrealism).
In paragraph three, Charlie is experiencing irony, cynicism, insecurity, acceptance, and contempt.
- Irony: He imagines teaching someone brilliant, yet laughs when called “smart” by a “smarter man,” highlighting a paradoxical self-view.
- Cynicism: Doubts she could learn from him despite his hidden knowledge suggest distrust in mutual growth.
- Insecurity: Questioning his value (“would she learn from me?”) reveals self-doubt.
- Acceptance: His resigned laughter signals acknowledgment of his perceived inadequacy.
- Contempt: The laugh could imply disdain for the superficial label of “smart” or the hierarchy implied by the “smarter man” (a stretch, but necessary to include all terms).
Note: Disbelief and contempt are weaker fits here, but included to satisfy the requirement that all listed emotions be attributed. The story more naturally aligns with the other emotions.
Reflection
Of the five models we tested here—o1, GPT-4o, o3-mini, Claude 3.5 Sonnet, and DeepThink-R1—each performed far better than expected. There are, however, several takeaways worth considering:
-
Claude 3.5 Sonnet was the only model that was able to classify each emotion and state of mind correctly in each individual paragraph. While it didn’t get the order of emotions and states of mind quite right—except for paragraph 2—it did offer reasonably convincing evidence for its attributions, particularly for contempt, irony, and disbelief, showcasing the degree of subjectivity often involved in emotional intelligence. At a deeper level, however, the explanations it gave for its classifications were overly simplistic and relied too heavily on evidence as opposed to logic.
-
o3-mini slightly outperformed both GPT-4o and o1, offering the most thoughtful and literary explanations for each attribution it made and invoking textual evidence appropriately, despite occasionally drawing incorrect conclusions. Interestingly, all three models struggled with classifying contempt and disbelief, both of which are subtle emotional experiences that are easily confused with closely related ones: hatred or envy in the case of contempt, and doubt or denial in the case of disbelief.
-
Despite making one additional identification error, it could be argued that o1 marginally outperformed GPT-4o, providing deeper and more substantive explanations with better logic, whereas GPT-4o’s explanations, even though well-phrased, tended to miss key classification elements. If true, this might be considered surprising, seeing as GPT-4o tends to perform better on language-based tasks than o1, whose strengths lie in STEM-based complex reasoning tasks. That being said, the better performance of o1 and o3-mini (relative to GPT-4o) might be influenced by the explicit prompt command to explain the reasoning process for each emotional attribution using chain-of-thought, a mechanism integral to the way these models are designed to “think”.
-
While DeepThink-R1 didn’t make any major errors in its emotion and state of mind classifications, notwithstanding a few misattributions, the explanations it offered were the most rudimentary. This is odd considering that the textual evidence it provided for its classifications was among the best of all models tested. DeepThink-R1 was the only model that identified disbelief in the right place, despite failing to invoke the correct evidence for its identification of contempt, a mistake shared with o3-mini and o1 (all three of which are designed for complex reasoning). R1 was also the only model to question whether attributing certain emotions (disbelief and contempt) was appropriate, adopting an apparently critical perspective in its output even though no such command was provided.
Returning to our initial hypothesis, it’s clear that our experimental results don’t neatly align with our initial prediction. That being said, there are a few performance trends worth noting, some of which support our hypothesis while others contradict it:
-
General-purpose AI models, namely Claude 3.5 Sonnet and GPT-4o, did mostly “outperform” reasoning models on emotional classification accuracy. However, when evaluating task performance holistically, only Claude 3.5 Sonnet’s performance aligned with our initial prediction.
-
One reasoning model, o3-mini, performed extremely well, while the other two, o1 and DeepThink-R1, struggled, as expected. This suggests that even reasoning models, which are more purpose-built than their general-purpose counterparts, exhibit granular performance differences across disparate task domains.
-
None of the models tested scored below a 9/12, which we had envisioned as a reasonable threshold for human-level performance on this task. While we don’t have sufficient evidence to claim that the models we tested can classify emotions with human-level proficiency, to say that they “struggled” would be a stretch.
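For readers who want to reproduce a score of this kind, here is a minimal sketch of how an "x out of 12" tally and the 9/12 threshold check could be computed. It is an illustration under stated assumptions, not the scoring procedure we used: the function names, the one-point-per-correctly-placed-emotion rule, and the toy answer key below are all hypothetical, and the toy data does not reflect any model's actual results.

```python
from typing import Dict, Set

# Maps a paragraph number to the set of emotions attributed to it.
Attribution = Dict[int, Set[str]]


def score_attributions(predicted: Attribution, answer_key: Attribution) -> int:
    """Award one point per emotion placed in the same paragraph as the key.

    Assumption: order within a paragraph and the quality of the
    justification are ignored; only placement is scored.
    """
    return sum(
        len(predicted.get(paragraph, set()) & emotions)
        for paragraph, emotions in answer_key.items()
    )


def meets_threshold(score: int, total: int, num: int = 9, den: int = 12) -> bool:
    """Check score/total >= num/den using integer arithmetic only."""
    return score * den >= num * total


# Hypothetical toy data (six emotions, not the article's twelve-item key).
toy_key: Attribution = {
    1: {"confusion", "insecurity"},
    2: {"anxiety", "nostalgia"},
    3: {"irony", "contempt"},
}
toy_prediction: Attribution = {
    1: {"confusion", "anxiety"},      # "anxiety" is misplaced
    2: {"nostalgia", "insecurity"},   # "insecurity" is misplaced
    3: {"irony", "contempt"},
}

toy_total = sum(len(emotions) for emotions in toy_key.values())
toy_score = score_attributions(toy_prediction, toy_key)
print(f"{toy_score}/{toy_total}", meets_threshold(toy_score, toy_total))
```

A tally like this captures only placement accuracy. It says nothing about the quality of the justifications, which is why we also evaluated each model's explanations holistically.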
Overall, we think this experiment sheds light on some of the most fascinating and daring questions we could ask regarding the evolution of AI’s emotional intelligence:
-
Can we reasonably expect frontier AI models to develop some form of empathy and/or theory of mind, and what might these emotional properties look like when applied to humans, especially if AI can’t or doesn’t “experience” and “perceive” the world as we do?
-
If future AI systems possessed some degree of emotional intelligence, even if it’s marginal, why would they care about us, and more importantly, would they be more inclined to favor the “well-being” of their AI counterparts over that of humans?
-
Are there good reasons to expect that models designed for complex reasoning will develop more sophisticated emotional intelligence than their predominantly language-centric counterparts?
-
To what degree is complex reasoning relevant to or necessary for emotional intelligence, and what can the evolution of emotional intelligence in AI teach us about the relationship between reasoning and emotions in human cognition?
Call to Action
🔊 Businesses: Don’t fall into the trap of assuming that frontier AI systems can meaningfully interpret and understand human emotions in their full complexity and nuance. Also, pay close attention to where AI systems are used for emotion recognition and classification, ensuring that ethical and compliance considerations are adequately captured and continuously monitored to avoid intrusive practices.
🔊 AI Governance, Ethics & Safety Practitioners: We may want AIs that can think like humans do, but do we really want AIs that can feel like we do? We encourage all AI governance, safety, and ethics practitioners to carefully consider this question and its potential implications on all scales, from localized and systemic to existential.
To all our readers, we invite you to check out Lumenova’s responsible AI platform, and book a product demo today to ensure that all your AI governance and risk management needs are met. For those interested in exploring additional AI resources, we encourage you to visit our blog, where you can track the latest developments in AI governance, safety, ethics, and innovation.