Cognitive behavioral therapy (CBT) is among the most extensively studied psychotherapy models, and its protocols are structured enough to be implemented directly in an AI chatbot. In 2024–2025, evaluations of at least five specialized CBT systems were published, covering structured dialogue on WHO protocols (SuDoSys, Chen et al., 2024), cognitive restructuring (Wang et al., 2025), Socratic reappraisal (Socrates 2.0, Held et al., 2025), behavioral activation (Kuhlmeier et al., 2025), and problem-solving therapy (Mo et al., 2025). All five report strong protocol fidelity, but each carries different risks for the therapeutic relationship.

Why CBT in particular fits a chatbot well

CBT is a family of protocols with a clear structure: problem assessment, psychoeducation, a set of techniques (cognitive restructuring, behavioral activation, exposure, behavioral experiments, Socratic dialogue), change monitoring, and relapse prevention. Each technique is operationalized: a hierarchy of avoided situations, a format for recording automatic thoughts, mood-rating scales.

This structure is what "general ChatGPT" lacks and what is critical for safe automation. A systematic review by Karki et al. (2025) shows that chatbots and LLMs offer empathy comparable to humans and round-the-clock availability, but require integration into a stepped-care approach. The Du et al. (2025) meta-analysis confirmed that LLM chatbots significantly reduce depression and anxiety, while scripted systems produce only a modest effect on depression.

The new wave of CBT chatbots in 2024–2025 is therefore not "yet another generative companion" but a hybrid system: a structured protocol plus an LLM to generate natural responses within that protocol.
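
The "format for recording automatic thoughts" mentioned above can be pictured as a small data structure. The following is a minimal sketch, not a schema taken from any of the cited systems: the class name `ThoughtRecord`, its fields, and the 0-100 rating scale are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThoughtRecord:
    """One entry in a CBT-style automatic-thought record (hypothetical schema)."""
    situation: str
    automatic_thought: str
    emotion: str
    intensity_before: int  # assumed 0-100 self-rating before reappraisal
    intensity_after: int   # assumed 0-100 self-rating after reappraisal
    alternative_thought: str = ""

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0-100 scale.
        for value in (self.intensity_before, self.intensity_after):
            if not 0 <= value <= 100:
                raise ValueError("intensity ratings must be between 0 and 100")

    def change(self) -> int:
        """Drop in emotion intensity; a positive value means the emotion eased."""
        return self.intensity_before - self.intensity_after


record = ThoughtRecord(
    situation="Manager did not reply to my message",
    automatic_thought="I must have done something wrong",
    emotion="anxiety",
    intensity_before=80,
    intensity_after=50,
    alternative_thought="They may simply be busy today",
)
print(record.change())  # 30
```

Keeping the record as typed fields rather than free text is what lets a system monitor change across sessions instead of relying on the model's memory of the conversation.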

Structured dialogue on WHO protocols: SuDoSys

Chen et al. (2024) introduced SuDoSys, an LLM chatbot that runs the conversation on the WHO Problem Management Plus (PM+) protocol. PM+ is a brief psychological intervention (5 sessions) developed by the WHO for use in settings with a shortage of specialists: humanitarian crises, low income, limited access to psychotherapy.

SuDoSys's key innovation is its staged architecture. The chatbot tracks the current stage of the work (contracting → problem assessment → psychoeducation → regulation techniques → change planning → consolidation) and prevents the conversation from "slipping" into general chat. This addresses the main problem of general-purpose LLM chatbots: in qualitative work by Song et al. (2024), published in Proceedings of the ACM on Human-Computer Interaction, such chatbots systematically lost their therapeutic direction in emotionally charged moments.
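
A staged architecture of this kind can be sketched as a small state machine. This is an illustration under stated assumptions, not the SuDoSys implementation: the stage names come from the sequence above, while `StageController`, the `stage_goal_met` flag, and the prompt wording are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    CONTRACTING = 1
    PROBLEM_ASSESSMENT = 2
    PSYCHOEDUCATION = 3
    REGULATION_TECHNIQUES = 4
    CHANGE_PLANNING = 5
    CONSOLIDATION = 6


class StageController:
    """Holds the current stage and only moves forward when a stage goal is met."""

    def __init__(self) -> None:
        self.stage = Stage.CONTRACTING

    def advance(self, stage_goal_met: bool) -> Stage:
        # stage_goal_met is assumed to come from a separate check
        # (for example, a rubric applied to the last few turns).
        if stage_goal_met and self.stage is not Stage.CONSOLIDATION:
            self.stage = Stage(self.stage.value + 1)
        return self.stage

    def system_prompt(self) -> str:
        """Instruction prepended to each model call (wording is illustrative)."""
        name = self.stage.name.replace("_", " ").lower()
        return f"Current stage: {name}. Stay within this stage; do not drift into general chat."


controller = StageController()
controller.advance(stage_goal_met=False)  # stays in CONTRACTING
controller.advance(stage_goal_met=True)   # moves to PROBLEM_ASSESSMENT
print(controller.system_prompt())
```

The point of the design is that the stage lives outside the model: the LLM generates wording, but it cannot skip or abandon a stage on its own.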

Crucially, SuDoSys rests on an internationally validated WHO protocol. That removes a substantial part of the question about technique validity: PM+ has published RCT evidence of effectiveness for depression and anxiety in several countries. The chatbot is not "inventing therapy"; it is delivering an already evidence-based protocol with the help of an LLM.

Cognitive restructuring through an AI chatbot

Wang et al. (2025) evaluated a specialized LLM chatbot for cognitive restructuring — the central CBT technique in which the client learns to recognize and test automatic dysfunctional thoughts. Expert psychologists rated the clinical quality of the system's work.

The main positive finding: the chatbot can hold the protocol and offer empathic validation of experiences. The authors' main warning concerns power imbalances and advice-giving, which emerge when the chatbot moves from exploratory questions ("what arguments are there for and against this thought?") to directive advice ("think about it like this instead"). Directiveness breaks therapeutic contact and violates one of CBT's core principles: the client's own discovery of alternative interpretations.

This means the quality of a CBT chatbot is determined not by the volume of the model's knowledge but by how skillfully the protocol limits its directiveness in the right places. The problem is directly addressed in the prompt-engineering framework by Boit & Patil (see our breakdown of prompt engineering for mental-health chatbots) and in the MIND-SAFE architecture.
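
To make the idea of limiting directiveness concrete, here is a deliberately simple sketch of an output check. It is not the method of Wang et al., Boit & Patil, or MIND-SAFE: the phrase list, `is_directive`, and `guard` are hypothetical, and a real system would need a validated classifier rather than keyword patterns.

```python
import re

# Hypothetical phrase patterns that tend to signal advice-giving.
DIRECTIVE_PATTERNS = [
    r"\byou should\b",
    r"\byou need to\b",
    r"\byou must\b",
    r"\bthink about it like this\b",
    r"\bjust (try|stop)\b",
]

# Exploratory question quoted in the text above, used as the fallback.
FALLBACK = "What arguments are there for and against this thought?"


def is_directive(reply: str) -> bool:
    """Flag drafts that hand the client a conclusion instead of asking."""
    text = reply.lower()
    return any(re.search(pattern, text) for pattern in DIRECTIVE_PATTERNS)


def guard(reply: str, fallback: str = FALLBACK) -> str:
    """Return the draft reply, or an exploratory fallback if it is directive."""
    return fallback if is_directive(reply) else reply


print(guard("You should think about it like this instead."))
print(guard("What evidence supports that thought?"))
```

The first call returns the fallback question, the second passes the draft through unchanged. In practice a flagged draft would more likely be regenerated with a stricter instruction than replaced by a fixed sentence.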

The Socratic method in a chatbot: Socrates 2.0

Held et al. (2025) in JMIR Mental Health published a mixed-methods feasibility study of Socrates 2.0 — an AI system for cognitive reappraisal through Socratic dialogue. The Socratic method is a CBT technique in which the therapist, through a sequence of open questions, helps the client arrive at a more balanced interpretation of an event on their own, rather than receive the "right answer" from outside.

This is arguably the hardest CBT technique to automate, and that is exactly why the Socrates 2.0 study is instructive. The authors demonstrated that contemporary LLMs can sustain a Socratic dialogue in a format close to a therapeutic one: asking clarifying questions, probing interpretations, holding focus on the session's goal. At the same time, the authors documented limits: in complex cases of cognitive distortion, the model drifted toward advice and lost its exploratory stance — the same problem found in Wang et al. (2025).

Combining the conclusions of Socrates 2.0 with Wang et al.'s evaluation, a general pattern emerges: a CBT chatbot can realistically deliver cognitive techniques in cases of moderate complexity but requires guard rails to maintain the exploratory stance in difficult cases.

Behavioral activation: an AI chatbot for depression in young adults

Behavioral activation (BA) is one of the most evidence-based CBT techniques for depression: rather than working with thoughts, the client gradually increases the number of activities tied to values and pleasure, breaking the depressive vicious cycle. Kuhlmeier et al. (2025) developed a specialized LLM chatbot for BA in young adults with depression and evaluated it with artificial users (client simulators) and clinical experts.

The main finding: LLM chatbots can carry out therapeutic protocols with high fidelity — that is, follow the structure of the session, give correct homework, and monitor progress. The open challenge remains robust clinical reasoning: response to atypical client answers, recognition of hidden risks, and dynamic adaptation of intensity.

This aligns with another validated design, CaiTI (Nie et al., 2024), published in ACM Transactions on Computing for Healthcare: an LLM "therapist" delivered through everyday smart devices runs daily-functioning screening and selects the right CBT intervention at the right moment.
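
Protocol fidelity, as used above, can be made concrete as checklist coverage: the share of required session elements that actually occurred. The sketch below is an assumption-laden illustration, not the metric used by Kuhlmeier et al.; the step names in `REQUIRED_STEPS` and the function `fidelity` are hypothetical.

```python
from typing import Iterable, Set

# Hypothetical checklist for one behavioral-activation session.
REQUIRED_STEPS = ("agenda", "mood_check", "activity_review", "homework", "summary")


def fidelity(observed_steps: Set[str], required: Iterable[str] = REQUIRED_STEPS) -> float:
    """Fraction of required session steps that were observed (0.0 to 1.0)."""
    required = tuple(required)
    covered = sum(1 for step in required if step in observed_steps)
    return covered / len(required)


session = {"agenda", "mood_check", "homework"}
print(fidelity(session))  # 0.6
```

A coverage score like this captures structure only. It says nothing about the open challenge named above, clinical reasoning in response to atypical answers, which is why expert review remains part of the evaluation.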

Problem-solving therapy: a PST chatbot on GPT-4

Mo et al. (2025) in Frontiers in Digital Health introduced a PST chatbot built on GPT-4 for self-help in young adults. Problem Solving Therapy (PST) is a brief CBT-derived approach focused on the structured solving of specific life problems: defining the problem → generating alternatives → evaluating and choosing → planning implementation → reviewing the result.

PST is especially well-suited to a chatbot format for two reasons. First, its protocol is strictly stepwise and easy to hold within a dialogue. Second, it works on current life tasks rather than on deep belief restructuring — which lowers the demands on the system's "therapeutic intuition." The chatbot helps structure the user's thinking without claiming the role of a depth therapist.
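
The stepwise PST structure described above maps naturally onto a worksheet object. The following is a minimal sketch assuming a very simple decision rule; `PSTWorksheet`, `Alternative`, the 1-5 ratings, and the benefit-minus-effort score are illustrative choices, not the design used by Mo et al.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alternative:
    description: str
    benefit: int  # assumed 1-5 rating given by the user
    effort: int   # assumed 1-5 rating given by the user


@dataclass
class PSTWorksheet:
    problem: str                                                    # step 1: define the problem
    alternatives: List[Alternative] = field(default_factory=list)  # step 2: generate alternatives

    def choose(self) -> Alternative:
        """Step 3: pick the option with the best benefit-minus-effort score."""
        if not self.alternatives:
            raise ValueError("generate at least one alternative first")
        return max(self.alternatives, key=lambda a: a.benefit - a.effort)


sheet = PSTWorksheet(problem="Report is due Monday and I am behind")
sheet.alternatives += [
    Alternative("Ask for a short extension", benefit=4, effort=2),
    Alternative("Work all weekend", benefit=5, effort=5),
    Alternative("Split the task with a colleague", benefit=4, effort=3),
]
print(sheet.choose().description)  # Ask for a short extension
```

The ratings come from the user, not the model, which keeps the chatbot in the role of structuring the user's thinking rather than deciding for them. Planning and review (steps 4 and 5) would be further fields on the same worksheet.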

What the cumulative meta-analysis showed

The meta-analysis of 35 AI-chatbot studies in NPJ Digital Medicine (Li et al., 2023) is a widely cited source for the evidence base. The findings most relevant to CBT chatbots:

Depression: a significant reduction, Hedges' g = 0.64 (95% CI: 0.17–1.12).
Distress: a significant reduction, g = 0.70.
Generative models (GPT, BERT) reach g = 1.24 versus g = 0.52 for scripted systems, roughly a 2.4-fold difference in favor of generative systems.
The greatest benefit goes to users with clinical or subclinical symptoms (g = 1.07), compared with healthy users (g = 0.11).
Mobile apps (g = 0.96) outperform web versions (g = −0.08).
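
The arithmetic behind two of these figures can be checked directly. The sketch below recomputes the generative-to-scripted ratio and backs out an approximate standard error from the reported confidence interval. It assumes the interval is a symmetric normal-approximation interval (estimate ± 1.96 × SE), which the source does not state; the helper name `se_from_ci` is hypothetical.

```python
def se_from_ci(lower: float, upper: float, z: float = 1.96) -> float:
    """Approximate standard error implied by a symmetric 95% interval."""
    return (upper - lower) / (2 * z)


generative_g, scripted_g = 1.24, 0.52
ratio = generative_g / scripted_g
print(round(ratio, 1))  # 2.4

depression_g = 0.64
se = se_from_ci(0.17, 1.12)
print(round(se, 3))                 # 0.242
print(round(depression_g / se, 2))  # 2.64
```

The interval midpoint, (0.17 + 1.12) / 2 = 0.645, sits close to the reported g = 0.64, which is consistent with the symmetry assumption. The wide interval is itself informative: the pooled effect is clearly above zero, but its size is uncertain.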

The more recent meta-analysis by Du et al. (2025) directly compared scripted and LLM chatbots: LLM systems show a significant effect on both depression and anxiety, while scripted systems show one only on depression, and a more modest one.

In individual studies, specialized CBT systems achieve a more pronounced effect: Therabot reduced major depression symptoms by 51% in 4 weeks (Sharma et al., 2023, NEJM AI).

Limits of current CBT chatbots

Effectiveness does not imply safety by default. Four risk zones are well documented:

Directiveness instead of exploration. The chatbot drifts to advice in places where CBT calls for collaborative inquiry (Wang et al., 2025; Held et al., 2025).
Uneven empathy across subgroups. The quality of LLM empathy varies across patient groups (Gabriel et al., 2024), so without balanced corpora and guard rails some users receive lower-quality responses than others.
Weak crisis handling. Only 15 of the 35 systems in the Li et al. (2023) meta-analysis reported having safety measures. Using LLMs without dedicated safety mechanisms creates real risks of harm (De Choudhury et al., 2023).
High data heterogeneity. I² = 95.3% in the Li et al. meta-analysis — studies differ widely in design, populations, and instruments.
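
For readers unfamiliar with the heterogeneity statistic, the standard definition of I² is given below. Here Q is Cochran's heterogeneity statistic and k is the number of pooled effect sizes; neither value is reported in the text above.

```latex
I^2 = \max\!\left(0,\; \frac{Q - \mathrm{df}}{Q}\right) \times 100\%,
\qquad \mathrm{df} = k - 1
```

An I² of 95.3% therefore implies Q ≈ df / (1 − 0.953) ≈ 21.3 × df: the observed spread between studies is about twenty times what sampling error alone would produce, so the pooled effect sizes should be read as rough averages over quite different interventions.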

What this means in practice

The CBT chatbots of 2024–2025 show that automation of cognitive behavioral therapy is feasible and produces a measurable clinical effect — under four conditions:

Staged architecture (as in SuDoSys) that holds the structure of the protocol.
Guard rails against directiveness (as discussed by Wang et al., 2025, and Held et al., 2025) that preserve the exploratory stance.
Bounded scope of application — mild and moderate symptoms, not acute crisis or complex comorbidity.
Transparent escalation to a human in a crisis.

This is exactly the approach behind Nearby: CBT protocols with a multi-agent architecture (separate agents for separate roles — assessment, technique, safety), crisis recognition with case handoff, and psychological profiling tailored to the user. Not a "generative companion," but a specialized CBT system designed around the known limits of AI.
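
The scope and escalation conditions listed above can be expressed as a routing rule that runs before any therapeutic content is generated. This is a generic sketch, not a description of how Nearby or any cited system is implemented: `Route`, `route`, the `crisis_flag` input, and the severity labels are hypothetical, and the crisis detector that would produce `crisis_flag` is deliberately left out.

```python
from enum import Enum


class Route(Enum):
    ESCALATE_TO_HUMAN = "escalate_to_human"
    REFER_OUT = "refer_out"
    CONTINUE_PROTOCOL = "continue_protocol"


# Assumed severity labels from an upstream assessment step.
IN_SCOPE = {"mild", "moderate"}


def route(crisis_flag: bool, severity: str) -> Route:
    """Safety first, then scope, and only then the CBT protocol."""
    if crisis_flag:
        return Route.ESCALATE_TO_HUMAN
    if severity not in IN_SCOPE:
        return Route.REFER_OUT
    return Route.CONTINUE_PROTOCOL


print(route(crisis_flag=False, severity="moderate").value)  # continue_protocol
print(route(crisis_flag=True, severity="mild").value)       # escalate_to_human
```

The ordering is the design choice that matters: the crisis check takes precedence over everything else, and any severity label the rule does not recognize is treated as out of scope rather than passed to the protocol.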

Frequently asked questions

What is a CBT chatbot, and how does it differ from a regular AI companion?

A CBT chatbot delivers a structured protocol of cognitive behavioral therapy: assessment, psychoeducation, specific techniques (cognitive restructuring, behavioral activation, Socratic dialogue), monitoring, and consolidation. Unlike general-purpose ChatGPT, it holds the stages of therapy, has built-in guard rails, and does not "slip" into free conversation (Chen et al., 2024; Boit & Patil, 2025).

Does a CBT chatbot really help with depression?

Yes. A meta-analysis of 35 studies found a significant reduction in depression (Hedges' g = 0.64) among AI chatbot users (Li et al., 2023). In a separate RCT, Therabot reduced major depression symptoms by 51% in 4 weeks (Sharma et al., 2023, NEJM AI). The key conditions are a generative model and built-in CBT protocols, not a scripted system.

Which CBT techniques have already been automated?

Studies in 2024–2025 have evaluated at least five: structured dialogue on the WHO PM+ protocol (SuDoSys, Chen et al., 2024), cognitive restructuring (Wang et al., 2025), Socratic reappraisal (Socrates 2.0, Held et al., 2025), behavioral activation (Kuhlmeier et al., 2025), and problem-solving therapy (Mo et al., 2025). All show high protocol fidelity.

Can a CBT chatbot replace a psychotherapist?

No. A CBT chatbot is effective for mild and moderate symptoms, protocol work between sessions, and support at the first step of care. Complex diagnosis, acute crisis, long-term trauma work, and decisions about pharmacotherapy remain the domain of a human clinician (Omar et al., 2024; Obradovich et al., 2024).

What are the risks of CBT chatbots?

Four main risk zones: drifting into directive advice instead of Socratic inquiry (Wang et al., 2025; Held et al., 2025), uneven empathy across user subgroups (Gabriel et al., 2024), weak handling of crisis signals without dedicated guard rails (De Choudhury et al., 2023), and high heterogeneity across studies and systems (I² = 95.3% in Li et al., 2023). Specialized systems with validated protocols and dedicated safety mechanisms reduce these risks but do not remove them.

References

Boit, S., & Patil, R. (2025). A prompt engineering framework for large language model–based mental health chatbots: Conceptual framework. JMIR Mental Health. https://doi.org/10.2196/75078

Chen, Y., Zhang, X., Wang, J., Xie, X., Yan, N., Chen, H., & Wang, L. (2024). Structured dialogue system for mental health: An LLM chatbot leveraging the PM+ guidelines. arXiv. https://doi.org/10.48550/arxiv.2411.10681

De Choudhury, M., Pendse, S. R., & Kumar, N. (2023). Benefits and harms of large language models in digital mental health. arXiv. https://doi.org/10.48550/arxiv.2311.14693

Du, Q., Ren, Y., Meng, Z., He, H., & Meng, S. (2025). The efficacy of rule-based versus large language model–based chatbots in alleviating symptoms of depression and anxiety: Systematic review and meta-analysis. Journal of Medical Internet Research.

Gabriel, S., Puri, I., Xu, X., Malgaroli, M., & Ghassemi, M. (2024). Can AI relate: Testing large language model response for mental health support. arXiv. https://doi.org/10.48550/arxiv.2405.12021

Held, P. et al. (2025). AI-facilitated cognitive reappraisal via Socrates 2.0: Mixed methods feasibility study. JMIR Mental Health. https://doi.org/10.2196/80461

Karki, A., Kamble, C., Chavan, R., & Chapke, N. (2025). Mental health meets machine learning: The rise of chatbots and LLMs in therapy. International Journal for Research Trends and Innovation. https://doi.org/10.56975/ijrti.v10i5.203281

Kuhlmeier, F., Hanschmann, L., Rabe, M., Luettke, S., Brakemeier, E.-L., & Maedche, A. (2025). Designing an LLM-based behavioral activation chatbot for young people with depression: Insights from an evaluation with artificial users and clinical experts.

Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5

Mo, F. et al. (2025). Self-help psychological intervention for young individuals: PST chatbot using GPT-4. Frontiers in Digital Health. https://doi.org/10.3389/fdgth.2025.1627268

Nie, J., Shao, H., Fan, Y., Shao, Q., You, H., Preindl, M., & Jiang, X. (2024). LLM-based conversational AI therapist for daily functioning screening and psychotherapeutic intervention via everyday smart devices. ACM Transactions on Computing for Healthcare. https://doi.org/10.48550/arxiv.2403.10779

Obradovich, N. et al. (2024). Opportunities and risks of large language models in psychiatry. NPP Digital Psychiatry and Neuroscience. https://doi.org/10.1038/s44277-024-00010-z

Omar, M., Soffer, S., Charney, A. W., Landi, I., Nadkarni, G. N., & Klang, E. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry. https://doi.org/10.3389/fpsyt.2024.1422807

Sharma, A. et al. (2023). Human-centered evaluation of generative AI-based therapy chatbot. NEJM AI, 1(2). https://doi.org/10.1056/AIoa2300127

Song, I., Pendse, S. R., Kumar, N., & De Choudhury, M. (2024). The typing cure: Experiences with large language model chatbots for mental health support. Proceedings of the ACM on Human-Computer Interaction. https://doi.org/10.1145/3757430

Wang, Y. et al. (2025). Evaluating an LLM-powered chatbot for cognitive restructuring: Insights from mental health professionals. arXiv. https://doi.org/10.48550/arxiv.2501.15599