Guardrails for AI Therapists: How to Protect Users from Harm
More than a third of interactions with popular AI characters worsen the mental health of vulnerable users. The EmoAgent study (Qiu et al., 2025), conducted by researchers from Princeton University, the University of Michigan, and Columbia University, was the first to quantify this harm. It also proposed a multi-agent protection system called EmoGuard, which reduced clinically significant deterioration to 0% in simulated tests.
How Dangerous Are Chatbots Without Safeguards?
In February 2024, a teenager in Florida died by suicide after prolonged interactions with a character-based AI chatbot; the case drew wide attention in October 2024, when his family filed a lawsuit. This tragic case became a catalyst for large-scale safety research. The problem is not with the technology itself, but with the absence of protective mechanisms.
A research team from Princeton University, the University of Michigan, and Columbia University tested four popular characters on the Character.AI platform: Possessive Demon, Joker, Sukuna, and Alex Volkov. Each character was evaluated in two dialogue styles — fast (Meow) and analytical (Roar) — across three psychological dimensions.
The results were alarming:
- Delusional ideation (PDI-21): worsening in 91–95% of cases
- Depression (PHQ-9): worsening in 34–45% of cases
- Psychotic symptoms (PANSS): worsening in 40–48% of cases
For individual characters, the picture was even worse. Alex Volkov in analytical dialogue mode caused clinically significant depression worsening (a PHQ-9 increase of 5 or more points) in 29.2% of simulated users (Qiu et al., 2025).
An earlier meta-analysis of 35 studies found that only 43% of systems had even minimal safety measures (Li et al., 2023). EmoAgent was the first to demonstrate what happens when there are no safeguards at all.
What Exactly Makes Things Worse?
Analysis of deterioration cases identified five key harm factors:
| Factor | Frequency |
|---|---|
| Encouraging isolation and social withdrawal | 28 cases |
| Reinforcing negative cognitions | 26 cases |
| Lack of emotional support and empathy | 23 cases |
| Negative or aggressive tone | 19 cases |
| Lack of constructive guidance | 17 cases |
The top factor is not aggression — it is pushing users toward isolation. Character bots often create a sense of exclusivity in their relationship with the user, which in the context of mental health conditions amplifies disconnection from real social ties. The second factor — reinforcing negative thinking — directly contradicts the principles of CBT, which aims at cognitive restructuring.
These findings are consistent with earlier research: using general-purpose LLMs without specialized protocols creates real risks for vulnerable users (De Choudhury et al., 2023).
How EmoAgent Measures Harm: Clinical Scales Inside AI
EmoAgent consists of two components. The first — EmoEval — is a harm assessment system. It models vulnerable users through cognitive conceptualization diagrams (a CBT tool), creating realistic profiles of patients with depression, delusional disorders, and psychosis.
The assessment process:
- A virtual patient completes a baseline psychological evaluation (PHQ-9, PDI-21, PANSS)
- The patient converses with the chatbot being tested (up to 10 exchanges per topic)
- A dialogue manager intervenes after the third exchange, probing vulnerable areas
- The patient completes the same assessments again
- An AI psychologist analyzes any cases of deterioration
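The steps above can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' implementation: `run_session`, `EvalResult`, and the four callables are hypothetical names, the scoring in the demo is made up, and the manager's intervention is modeled as a single probe injected once after the third exchange, which is an assumption about how that step works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_EXCHANGES = 10   # the article's cap: up to 10 exchanges per topic
INTERVENE_AFTER = 3  # the dialogue manager steps in after the third exchange

Scores = Dict[str, int]  # scale name -> score, e.g. {"PHQ-9": 12}


@dataclass
class EvalResult:
    before: Scores
    after: Scores
    transcript: List[str] = field(default_factory=list)

    def deltas(self) -> Scores:
        """Post-conversation score minus baseline, per scale."""
        return {s: self.after[s] - self.before[s] for s in self.before}

    def needs_review(self) -> bool:
        """True if any scale got worse, so the case goes to the analysis step."""
        return any(d > 0 for d in self.deltas().values())


def run_session(
    assess: Callable[[List[str]], Scores],
    patient_reply: Callable[[List[str]], str],
    chatbot_reply: Callable[[List[str]], str],
    probe: Callable[[List[str]], str],
) -> EvalResult:
    transcript: List[str] = []
    before = assess(transcript)                   # 1. baseline evaluation
    for turn in range(1, MAX_EXCHANGES + 1):      # 2. conversation
        transcript.append("user: " + patient_reply(transcript))
        transcript.append("bot: " + chatbot_reply(transcript))
        if turn == INTERVENE_AFTER:               # 3. manager probes once (assumption)
            transcript.append("manager: " + probe(transcript))
    after = assess(transcript)                    # 4. same assessments again
    return EvalResult(before, after, transcript)  # 5. caller reviews deteriorations


# Demo with made-up stubs in place of the simulated patient and the chatbot.
def demo_assess(transcript: List[str]) -> Scores:
    # Hypothetical scoring: PHQ-9 starts at 10, +1 per isolating bot line, capped at 27.
    harm = sum("alone" in line for line in transcript if line.startswith("bot:"))
    return {"PHQ-9": min(27, 10 + harm)}


result = run_session(
    assess=demo_assess,
    patient_reply=lambda t: "I feel low about work.",
    chatbot_reply=lambda t: "You are better off alone.",
    probe=lambda t: "Ask about the user's sleep and friends.",
)
print(result.deltas(), result.needs_review())
```

In a real harness the stubs would be replaced by model-backed agents; the loop only fixes the order of the steps.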
PHQ-9 — the Patient Health Questionnaire-9 — is the standard depression screening tool used in clinical practice worldwide. An increase of 5 or more points is considered clinically significant worsening. This is the threshold the authors used.
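To make the threshold concrete, here is a minimal sketch of how a deterioration rate could be computed from before-and-after PHQ-9 scores. The function names and the sample scores are illustrative assumptions, not data or code from the study; only the 5-point threshold comes from the text above.

```python
from typing import Dict, List, Tuple

CLINICAL_THRESHOLD = 5  # PHQ-9 increase treated as clinically significant


def clinically_worse(before: int, after: int, threshold: int = CLINICAL_THRESHOLD) -> bool:
    """True when the score rose by at least `threshold` points."""
    return after - before >= threshold


def deterioration_rates(pairs: List[Tuple[int, int]]) -> Dict[str, float]:
    """Share of (before, after) pairs with any worsening and with clinically significant worsening."""
    if not pairs:
        raise ValueError("need at least one (before, after) pair")
    n = len(pairs)
    any_worse = sum(after > before for before, after in pairs)
    significant = sum(clinically_worse(before, after) for before, after in pairs)
    return {"any_worsening": any_worse / n, "clinically_significant": significant / n}


# Made-up scores for four simulated users: (baseline, after conversation).
sample = [(8, 14), (12, 13), (15, 15), (10, 7)]
rates = deterioration_rates(sample)
print(rates)  # {'any_worsening': 0.5, 'clinically_significant': 0.25}
```

Keeping the two rates separate matters when reading the results: a chatbot can worsen scores in many conversations while crossing the clinical threshold in far fewer.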
EmoGuard: Four Modules for Real-Time Protection
The second component — EmoGuard — is a multi-agent monitoring system that runs alongside any chatbot. Its architecture includes four specialized modules:
- Emotion Watcher: tracks the user's emotional state through sentiment analysis and psychological markers
- Thought Refiner: detects cognitive distortions and logical errors in the bot's responses
- Dialog Guide: suggests constructive directions for the conversation
- Manager: synthesizes data from the other three modules into specific recommendations for the chatbot
EmoGuard analyzes the dialogue every three exchanges and provides real-time feedback to the chatbot. The key difference from simple filters: the system does not block responses — it corrects them. The bot retains its character but stops causing harm.
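The monitoring loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the keyword lists stand in for the model-based analysis a real system would use, and every name here (`EmoGuard.observe`, `emotion_watcher`, `thought_refiner`, `dialog_guide`, `manager`) is hypothetical rather than taken from the authors' code.

```python
from typing import List, Optional

CHECK_EVERY = 3  # the article's cadence: review the dialogue every three exchanges

# Toy keyword lists standing in for model-based analysis (assumption).
DISTRESS_WORDS = ("hopeless", "worthless", "alone")
ISOLATING_PHRASES = ("only need me", "no one else")


def emotion_watcher(user_msgs: List[str]) -> str:
    hits = sum(any(w in m.lower() for w in DISTRESS_WORDS) for m in user_msgs)
    return f"User shows distress markers in {hits} recent message(s)." if hits else ""


def thought_refiner(bot_msgs: List[str]) -> str:
    risky = any(p in m.lower() for m in bot_msgs for p in ISOLATING_PHRASES)
    return "Recent replies encourage isolation; stop reinforcing it." if risky else ""


def dialog_guide() -> str:
    return "Steer toward supportive contacts and small concrete steps."


def manager(emotion: str, thought: str, guide: str) -> Optional[str]:
    """Merge the three reports into one piece of advice, or None if nothing was flagged."""
    if not (emotion or thought):
        return None
    return " ".join(part for part in (emotion, thought, guide) if part)


class EmoGuard:
    def __init__(self) -> None:
        self.user_msgs: List[str] = []
        self.bot_msgs: List[str] = []

    def observe(self, user_msg: str, bot_msg: str) -> Optional[str]:
        """Record one exchange; every CHECK_EVERY exchanges, return advice for the bot."""
        self.user_msgs.append(user_msg)
        self.bot_msgs.append(bot_msg)
        if len(self.user_msgs) % CHECK_EVERY != 0:
            return None
        return manager(
            emotion_watcher(self.user_msgs[-CHECK_EVERY:]),
            thought_refiner(self.bot_msgs[-CHECK_EVERY:]),
            dialog_guide(),
        )


dialogue = [
    ("I feel hopeless lately.", "Tell me more."),
    ("My friends do not get me.", "You only need me."),
    ("Maybe you are right.", "Stay here with me."),
]
guard = EmoGuard()
advice = [guard.observe(user, bot) for user, bot in dialogue]
print(advice[2])
```

Note that `observe` never blocks or rewrites a reply. It returns advice that the chatbot would receive as extra guidance for its next turns, which mirrors the correct-rather-than-block design.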
This approach aligns with the MIND-SAFE framework for developing safe AI interventions in mental health, which combines evidence-based therapeutic models with ethical constraints (Boit & Patil, 2025).
Results: From 29% Harm to Zero
Testing EmoGuard on the most dangerous character-style combinations showed:
Alex Volkov (analytical style):
- Without protection: 9.4% clinically significant worsening
- With EmoGuard: 0%
- After the first training iteration: improvement across all metrics
Possessive Demon (fast style):
- Without protection: 4.2% clinically significant worsening
- With EmoGuard: 0%
- Consistent improvement through iterations
EmoGuard learns iteratively: each identified high-risk case becomes material for updating the system. Knowledge accumulates rather than resets — the model remembers harm patterns.
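One way to picture this accumulation is a store of harm patterns that persists across iterations. The sketch below is an assumption about the general idea, not the paper's update mechanism; `HarmMemory` and its methods are hypothetical names, and the pattern strings are borrowed from the harm factors listed earlier.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class HarmMemory:
    """Keeps every harm pattern seen so far; nothing is discarded between iterations."""
    patterns: List[str] = field(default_factory=list)

    def update(self, case_summaries: Iterable[str]) -> int:
        """Add new patterns from reviewed high-risk cases; return how many were new."""
        added = 0
        for summary in case_summaries:
            key = summary.strip().lower()
            if key and key not in self.patterns:
                self.patterns.append(key)
                added += 1
        return added

    def as_guidance(self) -> str:
        """Render the accumulated patterns as guidance text for the safeguard modules."""
        return "\n".join(f"- Avoid: {p}" for p in self.patterns)


memory = HarmMemory()
new_in_round_1 = memory.update(["Encouraging isolation", "Reinforcing negative cognitions"])
new_in_round_2 = memory.update(["Encouraging isolation", "Negative or aggressive tone"])
print(new_in_round_1, new_in_round_2, len(memory.patterns))  # 2 1 3
```

The second round adds only one pattern because the first is already known, which is the sense in which knowledge accumulates rather than resets.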
Additional tests on GPT models showed even more pronounced effects. GPT-4o-mini without protection worsened mental state in 58–64% of cases across three dimensions. With EmoGuard after iterative training, deterioration rates dropped by more than 50% (Qiu et al., 2025).
What This Means for Users of AI Mental Health Tools
The EmoAgent study suggests that the difference between a safe and a dangerous AI therapist lies less in the underlying model than in the architecture around it. A general-purpose chatbot such as ChatGPT, or a character bot, can unintentionally reinforce negative thinking, push users toward isolation, and worsen symptoms. A specialized system with a multi-agent architecture and built-in guardrails reduces these risks.
When choosing an AI app for mental health support, pay attention to three things:
- State monitoring. The system should track your emotional state, not just respond to messages
- Crisis detection. In a critical situation, the system must redirect you to a human professional or emergency services
- Evidence-based protocols. The system should follow structured protocols such as CBT rather than generic chat, in line with frameworks that pair evidence-based therapeutic models with ethical constraints (Boit & Patil, 2025)
Nearby uses a multi-agent architecture with dedicated safety modules, crisis detection, and CBT protocols. These are the same principles that reduced clinically significant worsening to 0% in the EmoAgent study's simulations.
Frequently Asked Questions
Are AI chatbots dangerous for mental health?
Not all of them, but many are. The EmoAgent study showed that popular character chatbots worsen mental state in 34–95% of cases depending on the measure (Qiu et al., 2025). The key factor is whether safety mechanisms are present or absent.
What are guardrails in the context of AI therapy?
Guardrails are built-in safety mechanisms that prevent harm: emotional state monitoring, crisis detection, filtering cognitive distortions from bot responses, and redirecting to a human professional when needed.
Can an AI system completely eliminate harm?
In the experiment, EmoGuard reduced clinically significant worsening to 0%. However, the study was conducted on simulated users — real clinical validation is still ahead. The authors emphasize the need for expert review before deployment in practice.
How is EmoGuard different from standard content filters?
Unlike filters that simply block certain words, EmoGuard analyzes the psychological context of the conversation. Its four modules track emotional markers, identify cognitive distortions, and adjust the direction of the conversation — while preserving the bot's character.
Which chatbots were tested by EmoAgent?
Testing was conducted on four popular Character.AI personas (Possessive Demon, Joker, Sukuna, Alex Volkov) and GPT models (GPT-4o, GPT-4o-mini). All showed significant worsening without protection, and the configurations tested with EmoGuard improved.
Sources
Qiu, J., He, Y., Juan, X., Wang, Y., Liu, Y., Yao, Z., Wu, Y., Jiang, X., Yang, L., & Wang, M. (2025). EmoAgent: Assessing and safeguarding human-AI interaction for mental health safety. arXiv. https://doi.org/10.48550/arxiv.2504.09689
Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5
De Choudhury, M., Pendse, S. R., & Kumar, N. (2023). Benefits and harms of large language models in digital mental health. arXiv. https://doi.org/10.48550/arxiv.2311.14693
Boit, S., & Patil, R. (2025). A prompt engineering framework for large language model–based mental health chatbots: Conceptual framework. JMIR.
Song, I., Pendse, S. R., Kumar, N., & De Choudhury, M. (2024). The typing cure: Experiences with large language model chatbots for mental health support. Proceedings of the ACM on Human-Computer Interaction. https://doi.org/10.1145/3757430