AI chatbots achieve symptom reductions comparable to live psychotherapy across several conditions (depression −51%, Sharma et al., 2023; pooled effect size g = 0.64, Li et al., 2023) and form a therapeutic alliance averaging 3.76 out of 5 — close to in-person outpatient psychotherapy (Schäfer et al., 2025). Three limitations remain critical: LLM empathy varies across patient subgroups (Gabriel et al., 2024), some models systematically overestimate the risk of a negative outcome (Elyoseph et al., 2024), and conversational breakdowns continue to disrupt the therapeutic contact for vulnerable users.

What does it actually mean to "replace a therapist"?

Before comparing, we need to pin down one thing: a "therapist" is not one function but at least four. In the stepped-care model used by health systems from the UK to Australia, the clinician simultaneously plays the roles of:

Diagnostician — distinguishing depression from anxiety, PTSD, and the bipolar spectrum.
Technique deliverer — running CBT, ACT, and behavioral activation protocols.
Alliance partner — creating a safe space and validating experiences.
Clinical judge — assessing risk and deciding when to escalate.

AI systems in 2024–2025 cover these roles unevenly. A systematic review by Omar et al. (2024) in Frontiers in Psychiatry (Q1, 50 citations), drawing on 28 studies, concludes that LLMs show "promising results" in the first two roles and are noticeably weaker at clinical risk assessment, especially around suicidality. The right question is therefore not "will AI replace the therapist altogether" but "in which roles, and for which users, does AI already perform at a level comparable to a human?"

Does an AI chatbot reduce depression as much as a live therapist does?

In individual clinical trials — yes. The most cited piece of direct evidence is the Therabot randomized clinical trial published in NEJM AI in 2023: after 4 weeks of working with a generative AI chatbot, participants' major depression symptoms dropped by 51% (Sharma et al., 2023). That is comparable to the effect of structured short-term CBT.

At the level of pooled data the picture is more modest. A meta-analysis of 35 studies (17,123 participants) in NPJ Digital Medicine (Q1) found a statistically significant reduction in depression symptoms (Hedges' g = 0.64; 95% CI: 0.17–1.12) and psychological distress (g = 0.70) (Li et al., 2023). The effect lands in the "medium" range on Cohen's scale and matches several traditional psychotherapeutic interventions in order of magnitude. The key caveat: effect size depends on the type of AI. For generative models it was g = 1.24; for scripted systems, g = 0.52 — a 2.4-fold difference in favor of generative systems.
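For readers who want to see how a figure like g = 0.64 arises, here is a minimal Python sketch of the standard Hedges' g calculation for a single two-group study, plus the conventional Cohen bands. The function names and the sample inputs are illustrative assumptions, not taken from Li et al. (2023); a meta-analysis pools many such per-study estimates with weighting methods that are not shown here.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference between two groups, with the
    usual small-sample correction (Hedges' g).

    Pass improvement scores (e.g. baseline minus post-treatment) so that
    a positive g means the treatment group improved more than control.
    """
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )
    d = (mean_t - mean_c) / pooled_sd            # Cohen's d
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)   # small-sample correction
    return d * correction


def cohen_band(g):
    """Conventional rule-of-thumb labels for an effect size."""
    size = abs(g)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical single study: improvement scores, not data from any cited paper.
g = hedges_g(mean_t=6.0, mean_c=3.0, sd_t=5.0, sd_c=5.0, n_t=60, n_c=60)
print(round(g, 2), cohen_band(g))

# Figures quoted in the text above.
print(cohen_band(0.64))        # medium
print(round(1.24 / 0.52, 1))   # 2.4
```

Note that the rule-of-thumb bands put the scripted-system estimate (g = 0.52) in the medium range and the generative estimate (g = 1.24) in the large range.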

This pattern was confirmed in 2025 by the meta-analysis of Du et al. (35 RCTs, 4,224 participants), which compared scripted and LLM chatbots head-to-head: LLM systems produced significant reductions in depression and anxiety, while scripted ones delivered only a modest effect on depression.

Direct comparison: AI vs. therapist in behavioral activation

One of the most interesting designs of 2025 is the study by Napiwotzki and colleagues in JMIR Formative Research (Napiwotzki et al., 2025). The authors directly compared an AI chatbot and live therapists on behavioral activation (BA) — one of the most evidence-based CBT techniques for depression. BA is convenient for comparison because its protocol is tightly structured: a values list, an activity hierarchy, mood monitoring, and homework.

A similar design in JMIR Mental Health was implemented by Scholich et al. (2025), who compared therapeutic communication of LLM chatbots and live therapists with a mixed-methods approach. The shared finding across both studies: in protocol fidelity and basic empathic responses, AI chatbots reach scores comparable to humans. In the finer work of handling resistance, complex framing of a request, and adapting to the in-the-moment state, they fall noticeably short.

This is consistent with earlier qualitative work by Song et al. (2024) in Proceedings of the ACM on Human-Computer Interaction (Q1): users of LLM chatbots for mental health valued accessibility and the absence of judgment, but regularly ran into conversational breakdowns — irrelevant or formulaic responses in emotionally charged moments.

Therapeutic alliance with AI: 3.76 out of 5

The alliance — the working bond between client and therapist — predicts the outcome of psychotherapy better than the chosen method does, per Bordin (1979). So the critical question is: does an alliance form with an AI?

A cross-sectional study of 527 users of the AI chatbot Clare measured alliance on the Working Alliance Inventory-Short Revised (WAI-SR; Schäfer et al., 2025). The mean was 3.76 out of 5, comparable to in-person outpatient psychotherapy (3.9–4.2) and group CBT (3.5–3.8). The alliance with AI was stronger among lonely users (r = 0.25) and among people with marked anxiety or depression symptoms (r = 0.37).

An important nuance: the alliance with AI is structurally asymmetric. The Bond component (emotional connection) is lower with AI than with a human therapist; the Goal and Task components (agreement on goals and methods) are comparable. In other words, AI holds the structure of therapy well but builds trust more slowly.
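The asymmetry is easier to see when the subscales are scored separately. The sketch below assumes the usual WAI-SR layout of 12 items rated 1 to 5, four per subscale (Bond, Goal, Task). The function name and the ratings are invented for illustration and are not data from Schäfer et al. (2025).

```python
from statistics import mean

SUBSCALES = ("bond", "goal", "task")


def score_wai_sr(ratings):
    """Return the mean of each subscale and the overall mean.

    ratings: dict mapping each subscale name to its four 1-5 item ratings.
    """
    for name in SUBSCALES:
        items = ratings[name]
        if len(items) != 4 or any(not 1 <= r <= 5 for r in items):
            raise ValueError(f"{name}: expected four ratings between 1 and 5")
    scores = {name: mean(ratings[name]) for name in SUBSCALES}
    # Overall score is the mean across all 12 items.
    scores["total"] = mean(r for name in SUBSCALES for r in ratings[name])
    return scores


# Hypothetical respondent: agreement on goals and tasks, weaker bond.
example = {
    "bond": [3, 3, 4, 3],
    "goal": [4, 4, 4, 4],
    "task": [4, 4, 3, 5],
}
print(score_wai_sr(example))
```

A respondent like this one ends up with a total close to the reported average even though the Bond subscale sits well below Goal and Task, which is why a single overall number can hide the asymmetry.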

LLM empathy varies across patient subgroups

Gabriel et al. (2024), in their paper Can AI Relate (29 citations), asked a simple but uncomfortable question: is an LLM equally empathic to all groups of users? The answer: no. Models' empathy levels differed significantly across patient subgroups, and the appropriateness of responses against motivational interviewing principles needed improvement.

This is not an abstract technical flaw. It means that for some users — especially groups underrepresented in training data — an AI chatbot may produce less empathic responses than for others. A live therapist regulates empathy consciously; an LLM does so statistically, and where the statistics are thinner, empathy is lower too.

In practice, this gap is addressed in two ways: (a) fine-tuning the model on balanced psychotherapy corpora (Mental-LLM, Xu et al., 2023, NPJ), and (b) adding a layer of guard rails and toxicity checks (EmoAgent, Qiu et al., 2025). Without these layers, general-purpose ChatGPT is not suitable for mental health: De Choudhury et al. (2023, 63 citations) described 12 categories of potential harm from LLMs in digital mental health support.

Depression prognosis: AI errs toward pessimism

A less obvious but clinically important risk is systematic distortion in prognosis. Elyoseph and colleagues (2024) in Family Medicine and Community Health ran a comparative analysis of four LLMs (ChatGPT-3.5, ChatGPT-4, Claude, Bard) against general practitioners, psychiatrists, clinical psychologists, psychiatric nurses, and the general public. All four LLMs correctly identified depression in most cases and recommended a combination of psychotherapy and antidepressants.

But prognosis differed. ChatGPT-3.5 was significantly more pessimistic than all other LLMs, professionals, and the general public, predicting more negative long-term outcomes. ChatGPT-4, Claude, and Bard generally aligned with professional opinion. The authors warn directly: an LLM's pessimistic prognosis can reduce a patient's motivation to start or continue therapy.

This is an argument against using general-purpose ChatGPT as a "therapist." Specialized systems with vetted prompts and protocols (see our breakdown of prompt engineering for mental-health chatbots) can mitigate this distortion, but only when it is recognized and addressed in the design.

Where AI loses to humans by design

Obradovich et al. (2024) in NPP Digital Psychiatry and Neuroscience (56 citations) consolidated the opportunities and risks of LLMs in psychiatry into four blocks. From their analysis and adjacent work, four zones stand out where a live therapist remains irreplaceable:

Complex diagnosis and comorbidity. Differentiating the bipolar spectrum, PTSD, and personality disorders requires sustained observation and context that a chatbot cannot reach in a single session.
Acute suicide risk and crisis escalation. Even specialized systems miss some crisis signals. An AI chatbot must therefore have a hard protocol for handing off to a hotline and a live clinician, rather than trying to "treat" through a crisis.
Long-term trauma work. Therapeutic work with childhood trauma and complex PTSD requires moment-to-moment regulation of the client's emotional state — non-verbal attunement, vocal pacing, pauses. AI systems cannot yet do this, even in multimodal formats.
Clinical supervisory context. Decisions about pharmacotherapy, hospitalization, and family involvement remain a human's responsibility.

What follows is a practical division of labor: an AI chatbot is a first step of care and between-session support, not a replacement for a therapist with a long case history.
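The four zones listed above can be written down as an explicit routing rule. This is a minimal sketch, not a clinical tool: the Assessment fields, the Route names, and the assumption that an upstream classifier supplies a crisis flag are all hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CONTINUE_PROTOCOL = "continue_protocol"    # chatbot keeps running its protocol
    REFER_TO_CLINICIAN = "refer_to_clinician"  # non-urgent handoff to a human
    CRISIS_HANDOFF = "crisis_handoff"          # hotline and live clinician, immediately


@dataclass
class Assessment:
    # Each flag is assumed to come from an upstream component (hypothetical).
    crisis_signal: bool = False
    needs_complex_diagnosis: bool = False
    trauma_focus: bool = False
    medication_or_hospital_question: bool = False


def route(assessment: Assessment) -> Route:
    """Apply the hard rule: crisis first, then the other human-only zones."""
    if assessment.crisis_signal:
        return Route.CRISIS_HANDOFF
    if (
        assessment.needs_complex_diagnosis
        or assessment.trauma_focus
        or assessment.medication_or_hospital_question
    ):
        return Route.REFER_TO_CLINICIAN
    return Route.CONTINUE_PROTOCOL


print(route(Assessment(crisis_signal=True, trauma_focus=True)))
```

The point of the ordering is that a crisis flag always wins: the check runs before anything else, so no other condition can keep the user in the chatbot flow.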

What this means in practice

The correct answer to the question "can AI replace a therapist" in 2025 is no — but it can cover a substantial share of mass demand for structured support and protocol-driven work with mild and moderate symptoms. This is consistent with the stepped-care approach: AI takes the first step, freeing live clinicians for cases where their competence is critical.

The conditions under which an AI chatbot actually works as support:

An evidence-based protocol. CBT, behavioral activation, problem-solving therapy — not "general conversation."
Guard rails and crisis recognition. Without these, harm exceeds benefit for vulnerable users.
Memory and personalization. Otherwise alliance does not accumulate between sessions.
Transparency about limits. The user must know when a live clinician is needed.

This is exactly the approach behind Nearby: CBT protocols, a multi-agent architecture with crisis recognition, and psychological profiling tailored to the user. Not "another ChatGPT," but a specialized system designed around the known boundaries of AI's applicability in mental health.

Frequently asked questions

Can an AI chatbot fully replace a psychotherapist?

No. The authors of the largest meta-analyses (Li et al., 2023; Du et al., 2025) and systematic reviews (Omar et al., 2024) converge on the same point: AI chatbots are a complementary tool, not a replacement. They are effective for mild to moderate symptoms of depression and anxiety, especially with CBT protocols, but cannot handle complex diagnosis, crisis escalation, or long-term trauma work.

How effective is AI therapy compared with live therapy?

In individual clinical trials, AI shows an effect comparable to live therapy: Therabot reduced major depression symptoms by 51% in 4 weeks (Sharma et al., 2023). At the level of pooled data, the effect size is g = 0.64 for depression — medium on Cohen's scale, matching several traditional interventions in order of magnitude (Li et al., 2023).

Can you trust an AI like a real therapist?

The therapeutic alliance with AI scores 3.76 of 5 on the WAI-SR (Schäfer et al., 2025), which is close to in-person outpatient psychotherapy. However, LLM empathy is uneven across patient subgroups (Gabriel et al., 2024), and some models overestimate the risk of a negative outcome (Elyoseph et al., 2024). Trust is therefore built on specialized systems with guard rails and validated protocols, not on general-purpose ChatGPT.

Who is an AI therapist best suited for?

The Clare cross-sectional study showed that the alliance with AI is stronger among lonely users (r = 0.25) and among people with marked anxiety or depression symptoms (r = 0.37) (Schäfer et al., 2025). The Li et al. (2023) meta-analysis adds that the benefit goes mostly to people with clinical and subclinical symptoms, not healthy participants (g = 1.07 vs. g = 0.11).

When is a live clinician strictly necessary instead of AI?

An AI chatbot is not acceptable in four zones: complex comorbid diagnosis (bipolar disorder, PTSD, personality disorders), acute suicide risk and crisis, long-term trauma work, and decisions about pharmacotherapy or hospitalization (Obradovich et al., 2024; Omar et al., 2024). In these cases AI must hand the user off to a live clinician via a hard protocol.

References

De Choudhury, M., Pendse, S. R., & Kumar, N. (2023). Benefits and harms of large language models in digital mental health. arXiv. https://doi.org/10.48550/arxiv.2311.14693

Du, Q., Ren, Y., Meng, Z., He, H., & Meng, S. (2025). The efficacy of rule-based versus large language model–based chatbots in alleviating symptoms of depression and anxiety: Systematic review and meta-analysis. Journal of Medical Internet Research.

Elyoseph, Z., Levkovich, I., & Shinan-Altman, S. (2024). Assessing prognosis in depression: Comparing perspectives of AI models, mental health professionals and the general public. Family Medicine and Community Health.

Gabriel, S., Puri, I., Xu, X., Malgaroli, M., & Ghassemi, M. (2024). Can AI relate: Testing large language model response for mental health support. arXiv. https://doi.org/10.48550/arxiv.2405.12021

Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5

Napiwotzki, F. et al. (2025). Comparing human and AI therapists in behavioral activation for depression. JMIR Formative Research. https://doi.org/10.2196/78138

Obradovich, N., Khalsa, S., Khan, W. U., Suh, J., Perlis, R. H., Ajilore, O., & Paulus, M. P. (2024). Opportunities and risks of large language models in psychiatry. NPP Digital Psychiatry and Neuroscience. https://doi.org/10.1038/s44277-024-00010-z

Omar, M., Soffer, S., Charney, A. W., Landi, I., Nadkarni, G. N., & Klang, E. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry. https://doi.org/10.3389/fpsyt.2024.1422807

Schäfer, S. K. et al. (2025). User characteristics, motives, and therapeutic alliance in mental health conversational AI Clare. Frontiers in Digital Health. https://doi.org/10.3389/fdgth.2025.1576135

Scholich, T. et al. (2025). Comparison of human therapists and LLM chatbots for therapeutic communication: Mixed methods study. JMIR Mental Health. https://doi.org/10.2196/69709

Sharma, A. et al. (2023). Human-centered evaluation of generative AI-based therapy chatbot. NEJM AI, 1(2). https://doi.org/10.1056/AIoa2300127

Song, I., Pendse, S. R., Kumar, N., & De Choudhury, M. (2024). The typing cure: Experiences with large language model chatbots for mental health support. Proceedings of the ACM on Human-Computer Interaction. https://doi.org/10.1145/3757430