A 2025 meta-analysis uncovered a paradox: rule-based chatbots with rigid scripts moderately reduce depression symptoms, while chatbots powered by large language models do not. The systematic review by Du et al. (2025) analyzed randomized controlled trials of both system types and reached a conclusion that challenges the narrative of generative AI's superiority in therapy.

What exactly did the meta-analysis find?

A team of researchers led by Qiuxue Du conducted a systematic review and meta-analysis of RCTs comparing two types of chatbots for people with depression and anxiety symptoms (Du et al., 2025). They divided the systems into two categories: rule-based (scripted, operating on predefined algorithms) and LLM-based (built on large language models).

The headline result: rule-based chatbots demonstrated a modest but statistically significant improvement in depression symptoms. LLM chatbots showed no significant effect.

This is a counterintuitive finding. Language models generate more natural responses, understand context better, and can display empathy close to a human level (Karki et al., 2025). How can a system that responds with pre-written phrases outperform them?

Why did rule-based chatbots "win"?

The answer isn't that scripts are better than AI. The answer lies in the evidence base.

Nearly a decade of clinical data. Rule-based systems like Woebot and Wysa have existed since 2017. Over that time, they've been through dozens of randomized trials with large samples and extended follow-up periods. As early as 2019, a review by Vaidyam et al. documented the growing evidence base for scripted chatbots in psychiatry, well before the ChatGPT era (Vaidyam et al., 2019).

Therapeutic protocols. Woebot strictly follows cognitive behavioral therapy. Every conversation is a structured session with a specific goal: identify an automatic thought, conduct cognitive restructuring, assign a behavioral experiment. A script cannot deviate from the protocol — and that's its advantage.

Very few RCTs for LLMs. Large language models only became available for therapeutic applications in 2023-2024. The number of completed RCTs for LLM chatbots can be counted on one hand. A meta-analysis combining three or four small trials has little statistical power: unless the true effect is large, the pooled estimate is unlikely to reach statistical significance.
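The power argument can be made concrete with a rough calculation. The sketch below is a minimal illustration, assuming a fixed-effect (inverse-variance) model, identical trials with equal arm sizes, and a hypothetical true effect of d = 0.3. The trial counts and sample sizes are invented for illustration and are not taken from Du et al. (2025).

```python
from statistics import NormalDist


def smd_variance(n_per_arm: int, d: float) -> float:
    """Approximate sampling variance of a standardized mean difference
    for a two-arm trial with equal arm sizes."""
    n_total = 2 * n_per_arm
    return n_total / (n_per_arm * n_per_arm) + d * d / (2 * n_total)


def meta_power(k: int, n_per_arm: int, d: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test on the pooled effect of k identical trials
    under a fixed-effect (inverse-variance) model."""
    norm = NormalDist()
    # With k identical trials, the pooled variance is the single-trial variance / k.
    pooled_se = (smd_variance(n_per_arm, d) / k) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d / pooled_se
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Hypothetical scenarios, not the actual trial counts from any review.
few_small = meta_power(k=3, n_per_arm=20, d=0.3)
many_large = meta_power(k=30, n_per_arm=60, d=0.3)
print(f"3 trials x 20 per arm:  power = {few_small:.2f}")
print(f"30 trials x 60 per arm: power = {many_large:.2f}")
```

Under these assumptions, the three-trial scenario falls well short of the conventional 80% power threshold, while the thirty-trial scenario detects the same effect almost every time. Heterogeneity between trials, which a random-effects model would account for, lowers power further.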

What went wrong with early LLM studies?

The problem isn't just the number of trials. Early LLM chatbots for mental health were often built without any therapeutic structure.

A typical 2023 scenario: researchers take GPT-3.5 or GPT-4, write a system prompt saying "you are an empathic psychologist," and release users into free-form conversation. Such a chatbot can comfort, listen, and find the right words. But it doesn't guide a person along a therapeutic pathway. It's reactive — responding to what the user says instead of steering the conversation toward specific therapeutic goals.

Ma et al. (2023) described this fundamental challenge: LLM agents possess impressive language capabilities, but without additional architecture they lack structured clinical reasoning (Ma et al., 2023). The review by Pavlopoulos et al. (2024) confirmed this: among AI tools for depression and anxiety, the greatest effect sizes belong to those embedded in evidence-based therapeutic frameworks (Pavlopoulos et al., 2024).

Kuhlmeier et al. (2025) ran an experiment with an LLM chatbot for behavioral activation and found a telling contradiction: the model can execute therapeutic protocols with high fidelity, but "reliable clinical reasoning remains an open challenge" (Kuhlmeier et al., 2025).

Context: other meta-analyses disagree

The Du et al. finding doesn't exist in a vacuum. The largest meta-analysis, by Li et al. (2023), covered 35 studies and over 17,000 participants and showed a significant reduction in depression for AI chatbots overall: Hedges' g = 0.64 (Li et al., 2023). That headline estimate pools all system types together rather than contrasting rule-based and LLM systems the way Du et al. did.

When Li et al. did compare system types, the result pointed the other way: generative models outperformed scripted ones by roughly 2.4 times in effect size (g = 1.24 vs g = 0.52). Granted, only five generative systems were in the sample, and some of them were fine-tuned on therapeutic data rather than being vanilla LLMs.
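For readers unfamiliar with the metric, Hedges' g is a standardized mean difference with a small-sample correction. The following is a minimal sketch using the standard formula and made-up group statistics. The numbers are hypothetical and do not come from any trial cited here.

```python
import math


def hedges_g(mean_ctrl, mean_treat, sd_ctrl, sd_treat, n_ctrl, n_treat):
    """Hedges' g: Cohen's d multiplied by the small-sample correction factor J.
    Positive values mean the treatment group scored lower (fewer symptoms)."""
    df = n_ctrl + n_treat - 2
    pooled_sd = math.sqrt(
        ((n_ctrl - 1) * sd_ctrl**2 + (n_treat - 1) * sd_treat**2) / df
    )
    d = (mean_ctrl - mean_treat) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # common approximation of the correction factor
    return j * d


# Hypothetical post-treatment depression scores (lower is better).
g = hedges_g(mean_ctrl=14.0, mean_treat=11.0, sd_ctrl=5.0, sd_treat=5.0,
             n_ctrl=30, n_treat=30)
print(f"g = {g:.2f}")
```

By Cohen's widely used rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, which gives some context for the figures quoted above.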

Individual clinical trials also give reason for optimism. Therabot, an LLM chatbot with built-in therapeutic structure, demonstrated a 51% reduction in depression symptoms in a pilot RCT (Sharma et al., 2023). A comparison of an AI therapist with a human clinician in behavioral activation showed comparable effectiveness (Napiwotzki et al., 2025).

The meta-analysis by Li et al. (2025) confirmed that chatbots — including LLM systems — significantly reduce psychological distress in young people (Li et al., 2025).

Not "scripts vs LLMs," but "structure vs chaos"

When you bring all the data together, the picture becomes clear. The dividing line doesn't run between "scripted vs language model." It runs between "structured therapy vs unstructured conversation."

Rule-based chatbots win not because scripts are better. They win because every rule-based chatbot is structured by definition. It has no choice — it follows the protocol. Early LLM chatbots, by contrast, often had no protocol at all.

The new generation of LLM systems is already fixing this. SuDoSys (Chen et al., 2024) exemplifies the structured approach: the system uses the WHO's Problem Management Plus (PM+) guidelines as a framework for LLM-driven dialogue. The model doesn't just chat — it guides the user through specific therapeutic techniques defined by the protocol (Chen et al., 2024).

Kuhlmeier et al. (2025) demonstrated a similar approach: an LLM chatbot for behavioral activation that follows the protocol step by step. Protocol adherence was high. This is a fundamentally different architecture from "talk to ChatGPT about your problems."

Limitations of the Du et al. meta-analysis

Several important caveats about the results:

Sample asymmetry. Rule-based chatbots are represented by dozens of RCTs with thousands of participants. LLM chatbots have only a handful of trials with small samples. With so little data, the pooled LLM estimate is imprecise: its confidence interval is wide, so a real effect can easily fail to reach significance. "No significant effect" in the less-studied group is therefore not evidence of no effect.

LLM system heterogeneity. The "LLM chatbot" category lumps together wildly different systems: from an off-the-shelf ChatGPT steered only by a prompt to specialized therapeutic platforms. Model size matters too: compact models trained on therapeutic data can outperform general-purpose giants. Grouping them together is like comparing "medications" as a single category without distinguishing aspirin from antidepressants.
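One way meta-analysts quantify this kind of lumping is the I² statistic, the share of variation across trials that is not explained by sampling error. The sketch below computes a DerSimonian-Laird random-effects estimate on invented effect sizes. Both the data and the choice of estimator are assumptions for illustration, not the model or data used by Du et al. (2025).

```python
import math


def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling.
    Returns (pooled effect, 95% CI low, 95% CI high, I^2 in percent)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # estimated between-trial variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, i2


# Hypothetical "LLM chatbot" subgroup: two unstructured systems with near-zero
# effects and two protocol-driven systems with larger ones.
effects = [0.05, -0.10, 0.70, 0.85]
variances = [0.08, 0.10, 0.09, 0.12]
pooled, low, high, i2 = random_effects(effects, variances)
print(f"pooled g = {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}], I^2 = {i2:.0f}%")
```

With these invented inputs, the pooled confidence interval spans zero even though half of the systems show sizable effects, and more than half of the observed variation is attributed to real differences between systems rather than chance.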

No long-term data. Most LLM studies lasted 2-4 weeks. For evaluating therapeutic effects, this is an insufficient timeframe — CBT typically requires 8-12 weeks.

Rapid obsolescence. A meta-analysis captures the state of the evidence at the time of the literature search. Given the pace of LLM therapy development, 2025 results may not reflect the capabilities of 2026 systems.

What does this mean in practice?

The Du et al. finding is not a death sentence for LLM therapy. It's an indication of a specific problem: a language model without therapeutic structure is a conversation, not therapy.

The effective AI therapist of the future isn't a choice between scripts and LLMs. It's an LLM embedded within a therapeutic protocol. The language model provides flexibility, empathy, and conversational naturalness. The protocol provides direction, consistency, and a therapeutic goal for every session.

This is exactly the principle behind the Nearby platform: an LLM core operates within structured CBT protocols, and a multi-agent architecture separates empathic dialogue from clinical reasoning. This approach combines the strengths of both system types — the flexibility of language models and the proven effectiveness of therapeutic protocols.
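As an architectural sketch of the "LLM inside a protocol" idea, the protocol can own the session state while the language model only phrases each turn. Everything here is hypothetical: the step names follow the CBT sequence described earlier in this article, `ProtocolSession` is an invented class, and `stub_llm` stands in for a real model call so the example runs without any external API. It is not the implementation of Nearby or of any system cited here.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical CBT-style protocol: (step name, instruction for the model).
CBT_STEPS = [
    ("identify_thought", "Help the user name the automatic thought."),
    ("restructure", "Guide the user to examine evidence for and against it."),
    ("behavioral_experiment", "Agree on one small experiment to try this week."),
]


@dataclass
class ProtocolSession:
    """The protocol decides WHAT happens next; the generator decides HOW to say it."""
    generate: Callable[[str, str], str]  # (instruction, user_message) -> reply
    step_index: int = 0
    log: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return self.step_index >= len(CBT_STEPS)

    def respond(self, user_message: str) -> str:
        if self.finished:
            return "Session complete."
        name, instruction = CBT_STEPS[self.step_index]
        reply = self.generate(instruction, user_message)
        self.log.append(name)
        self.step_index += 1  # a real system would check step completion first
        return reply


def stub_llm(instruction: str, user_message: str) -> str:
    """Placeholder for a real LLM call; it simply echoes its inputs."""
    return f"[{instruction}] (in reply to: {user_message!r})"


session = ProtocolSession(generate=stub_llm)
for message in ["I always fail.", "Well, not always.", "I could try one call."]:
    print(session.respond(message))
print(session.log)
```

The design choice worth noting is that the model never selects the next step. Swapping `stub_llm` for a real model changes the wording of each reply but not the order of the session, which is the property that scripted chatbots get for free.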

Frequently asked questions

Is it true that basic chatbots help with depression better than ChatGPT?

The Du et al. (2025) meta-analysis showed a modest effect for rule-based chatbots and no significant effect for LLM chatbots. But this reflects a difference in evidence base, not the superiority of scripts: rule-based systems have years of RCTs behind them, while LLMs have only a handful of small trials.

Do AI chatbots help with anxiety?

The evidence is mixed. Li et al. (2023) found no significant effect of AI chatbots on anxiety (g = 0.65, confidence interval crossing zero). However, individual studies, including Napiwotzki et al. (2025), show reductions in anxiety symptoms with structured LLM interventions.

Why is therapeutic protocol structure so important for a chatbot?

Rule-based chatbots follow a protocol by definition — every step is pre-scripted. An LLM without structure engages in free-form conversation, which is closer to emotional support than to therapy. Studies by Kuhlmeier et al. (2025) and Chen et al. (2024) show that LLMs can execute therapeutic protocols with high fidelity when the structure is explicitly defined.

Should I use a chatbot instead of a therapist?

A chatbot is not a replacement for a professional. The meta-analysis by Li et al. (2023) showed an effect of g = 0.64 for depression — significant, but smaller than traditional CBT with a therapist. A chatbot is useful as a self-help tool between sessions, for people on a waitlist, or for those not yet ready to seek help in person (Karki et al., 2025).

Sources

Chen, Y., Zhang, X., Wang, J., Xie, X., Yan, N., Chen, H., & Wang, L. (2024). Structured dialogue system for mental health: An LLM chatbot leveraging the PM+ guidelines. ArXiv. https://doi.org/10.48550/arxiv.2411.10681

Du, Q., Ren, Y., Meng, Z., He, H., & Meng, S. (2025). The efficacy of rule-based versus large language model-based chatbots in alleviating symptoms of depression and anxiety: Systematic review and meta-analysis.

Karki, A., Kamble, C., Chavan, R., & Chapke, N. (2025). Mental health meets machine learning: The rise of chatbots and LLMs in therapy. International Journal for Research Trends and Innovation, 10(5). https://doi.org/10.56975/ijrti.v10i5.203281

Kuhlmeier, F., Hanschmann, L., Rabe, M., Luettke, S., Brakemeier, E.-L., & Maedche, A. (2025). Designing an LLM-based behavioral activation chatbot for young people with depression: Insights from an evaluation with artificial users and clinical experts.

Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5

Li, Y., et al. (2025). Chatbot interventions for young people: A meta-analysis. Worldviews on Evidence-Based Nursing.

Ma, Z., Mei, Y., & Su, Z. (2023). Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. AMIA Annual Symposium Proceedings. https://doi.org/10.48550/arxiv.2307.15810

Napiwotzki, L., et al. (2025). AI versus human therapist in depression: A behavioral activation comparison. Journal of Medical Internet Research.

Pavlopoulos, A., Rachiotis, T., & Maglogiannis, I. (2024). An overview of tools and technologies for anxiety and depression management using AI. Applied Sciences, 14(19), 9068. https://doi.org/10.3390/app14199068

Sharma, A., et al. (2023). Human-centered evaluation of generative AI-based therapy chatbot. NEJM AI, 1(2). https://doi.org/10.1056/AIoa2300127

Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S., & Torous, J. B. (2019). Chatbots and conversational agents in mental health: A review of the psychiatric landscape. The Canadian Journal of Psychiatry, 64(7), 456–464.