CBT-I in an AI Chatbot for Insomnia: A Meta-Analysis of 29 RCTs and an Eight-LLM Experiment

By Nearby | Published on April 28, 2026 | Updated on May 17, 2026 | 12 min read

A meta-analysis of 29 randomized clinical trials with 9,475 participants (Hwang et al., 2025) showed that fully automated digital cognitive behavioral therapy for insomnia (FA dCBT-I) reduces insomnia severity with a moderate-to-large effect size (SMD = −0.71; 95% CI: −0.88, −0.54; p < 0.001), and the effect is sustained for at least a year. Bao et al. (2025), in Journal of Translational Medicine, compared eight LLMs on a corpus of 2,387 CBT-I dialogues and showed that a compact Qwen2-7b model with a RAG architecture produces non-harmful answers in 91.2% of cases.

Why insomnia is a particularly good fit for digital therapy

Cognitive behavioral therapy for insomnia (CBT-I) is the first-line gold standard in clinical guidelines from the American Academy of Sleep Medicine and the European Sleep Research Society. The protocol consists of clearly separable components: sleep hygiene, sleep restriction, stimulus control, relaxation/mindfulness, and cognitive restructuring of dysfunctional beliefs about sleep.

The structure of the protocol makes CBT-I an almost ideal candidate for digital and chatbot delivery. Unlike psychotherapy for severe depression or PTSD, where trauma work requires fine clinical calibration in the moment, CBT-I is a sequence of algorithmic steps with a sleep diary, sleep-window calculations, and a checklist-based examination of beliefs. Bao and colleagues (2025) note this directly: "The structure of CBT-I aligns well with digital dialogue systems because it can be represented as modular sessions with measurable behavioral goals."

This explains why digital CBT-I products were the first to move beyond research prototypes and obtain regulatory clearance.

Meta-analysis of 29 RCTs: SMD = −0.71 sustained over time

Hwang et al. (2025), in NPJ Digital Medicine, conducted the largest systematic review to date of fully automated dCBT-I, meaning programs delivered with no therapist in the loop. The review included 29 RCTs and 9,475 participants (4,847 in intervention arms; 73.3% women; mean age 45.7 years).

Time point | SMD | Interpretation
Immediately post-treatment | −0.71 | moderate-to-large
Short-term follow-up | −0.54 | moderate
Medium-term | −0.54 | moderate
Long-term (≥12 mo) | −0.76 | moderate-to-large

The key practical finding is durability. Unlike antidepressants or hypnotics, whose effects typically fade after discontinuation, the effect of digital CBT-I is sustained — and even slightly amplified — a year after the program ends. This is consistent with the underlying CBT-I model: the therapy changes behavior and beliefs around sleep, not the symptom directly, so changes are reinforced by daily life.

Key takeaway: Across 29 RCTs, fully automated digital CBT-I reduced insomnia severity (ISI) by SMD = −0.71 immediately post-treatment and held the effect at SMD = −0.76 at 12+ months (Hwang et al., 2025).
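As a quick sanity check on the headline numbers, the standard error and z statistic can be recovered from the reported 95% confidence interval (−0.88, −0.54). The sketch below assumes the interval is symmetric and based on a normal approximation; the function names are illustrative and not taken from the paper.

```python
import math

def se_from_ci(lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z_crit)

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Values reported by Hwang et al. (2025) for the post-treatment effect.
smd, ci_low, ci_high = -0.71, -0.88, -0.54

se = se_from_ci(ci_low, ci_high)   # about 0.087
z = smd / se                       # about -8.19
p = two_sided_p(z)

print(f"SE = {se:.3f}, z = {z:.2f}, p < 0.001: {p < 0.001}")
```

A z statistic of roughly −8 is consistent with the reported p < 0.001: the interval is narrow relative to the size of the effect.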

The authors' data also suggest that completing the program is not, by itself, what drives results. Average completion was 59.3%, and meta-regression found no influence of completion percentage on effect size (p = 0.310). A plausible reading is that what matters is not how many modules a user opened, but how many they actually applied in their bedroom.

Bao et al. (2025): eight LLMs against the CBT-I protocol

Until 2024, most digital CBT-I products relied on rule-based "dialogue trees" — pre-scripted scenarios. The arrival of LLMs raised the question: can the same protocol fidelity be achieved with the flexibility of generative AI?

Bao, Zhu, Yang, and colleagues (2025) answered experimentally. Their paper, published in Journal of Translational Medicine, describes the eCBT-I architecture — a RAG system in which a CBT-I knowledge base is connected to an LLM as a source of vetted answers, while the model handles natural dialogue and adaptation to the client.
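A minimal sketch of the retrieve-then-generate flow described above, assuming a toy knowledge base and word-overlap ranking in place of the vector search a real system would likely use. The passages and the names `Passage`, `retrieve`, and `build_prompt` are hypothetical and are not taken from Bao et al. (2025); the LLM call itself is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    component: str  # CBT-I component the passage belongs to
    text: str

# Hypothetical, hand-written stand-ins for a vetted CBT-I knowledge base.
KNOWLEDGE_BASE = [
    Passage("sleep restriction",
            "Limit time in bed to match average sleep time recorded in the diary."),
    Passage("stimulus control",
            "Leave the bed if you are awake for a long time and return only when sleepy."),
    Passage("sleep hygiene",
            "Keep a regular wake time and avoid caffeine late in the day."),
]

def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(query: str, k: int = 1) -> list[Passage]:
    """Rank passages by word overlap with the query (toy stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda p: len(query_words & tokenize(p.text)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Place retrieved passages ahead of the user message so the model answers from them."""
    context = "\n".join(f"[{p.component}] {p.text}" for p in retrieve(query))
    return f"Answer using only the CBT-I guidance below.\n{context}\n\nUser: {query}"

prompt = build_prompt("I lie awake in bed for a long time, what should I do?")
print(prompt)
```

The design point is the division of labor: the knowledge base supplies the clinical content, and the model is asked only to phrase it for the person in the conversation.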

The fine-tuning corpus was assembled from 22,780 raw CBT-I dialogue records and, after rigorous filtering, reduced to 2,387 (1,909 for training, 239 for validation, 239 for test). The system implemented all key CBT-I components: sleep hygiene, sleep restriction, stimulus control, relaxation/mindfulness, and cognitive therapy.
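The split sizes are what an 80/10/10 partition of 2,387 records produces when the validation and test sets are rounded first and the remainder goes to training. The sketch below reproduces the counts under that assumption; the exact procedure used by the authors is not described here, and the helper names are hypothetical.

```python
import random

def split_sizes(n: int, val_frac: float = 0.1, test_frac: float = 0.1) -> tuple[int, int, int]:
    """Return (train, validation, test) sizes; training takes whatever is left."""
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    return n - n_val - n_test, n_val, n_test

def split_corpus(records: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle with a fixed seed, then cut into train/validation/test slices."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train, n_val, _ = split_sizes(len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

raw_count, kept_count = 22_780, 2_387  # counts reported for the eCBT-I corpus
train_set, val_set, test_set = split_corpus(list(range(kept_count)))

print(len(train_set), len(val_set), len(test_set))      # 1909 239 239
print(f"retained after filtering: {kept_count / raw_count:.1%}")  # 10.5%
```

Only about one in ten raw records survived filtering, which is worth keeping in mind when reading the benchmark: the model is tuned on a small, heavily curated set.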

Eight open-weight LLMs were compared — ChatGLM2-6b, ChatGLM3-6b, Baichuan-7b, Baichuan-13b, Qwen-7b, Qwen2-7b, Llama-2-7b-chat-hf, Llama-2-13b-chat-hf — across three adaptation strategies: LoRA, QLoRA, and Freeze (most parameters frozen, only top layers updated).
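To give a feel for why these strategies are cheap compared with full fine-tuning, the sketch below counts trainable parameters. LoRA learns a low-rank update B·A to a weight matrix, so a d_out × d_in matrix needs only r·(d_out + d_in) trainable values; QLoRA trains the same kind of adapters on top of a quantized base model. The layer width, rank, and layer counts used here are hypothetical illustration values, not the configuration reported by Bao et al. (2025).

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values when a d_out x d_in update is factored as B (d_out x r) @ A (r x d_in)."""
    return rank * (d_out + d_in)

def freeze_trainable_fraction(n_layers: int, n_trainable_top: int) -> float:
    """Share of equally sized layers left trainable when only the top ones are updated."""
    return n_trainable_top / n_layers

# Hypothetical dimensions chosen only to make the arithmetic concrete.
d_model, rank = 4096, 8
full_matrix = d_model * d_model                            # 16,777,216
lora_matrix = lora_trainable_params(d_model, d_model, rank)  # 65,536

print(f"LoRA trains {lora_matrix / full_matrix:.2%} of one weight matrix")
print(f"Freeze (top 2 of 32 layers) trains {freeze_trainable_fraction(32, 2):.2%} of layers")
```

Under these assumed numbers LoRA touches well under 1% of a matrix, while Freeze updates a small block of whole layers; which one works better on a given task is an empirical question, which is what the comparison in the paper addresses.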

The best result came from compact Qwen2-7b with the Freeze strategy:

Metric | Value
BLEU-4 | 0.2097
ROUGE-1 | 0.3267
ROUGE-L | 0.2914
C-eval (overall accuracy) | 0.8076
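For readers unfamiliar with overlap metrics, the sketch below shows what ROUGE-1 and ROUGE-L measure: shared words and the longest shared word sequence between a model answer and a reference answer. It is a simplified F1 variant with whitespace tokenization and made-up English sentences; the paper's evaluation will have used its own tooling and tokenization, so the numbers are not comparable with the table above.

```python
from collections import Counter

def _f1(matches: int, cand_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0:
        return 0.0
    precision, recall = matches / cand_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge_1_f1(candidate: str, reference: str) -> float:
    """Unigram overlap, counting each word at most as often as it appears in both texts."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f1(overlap, len(cand), len(ref))

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Longest-common-subsequence overlap, so word order matters."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))

reference = "get out of bed if you cannot sleep"
candidate = "leave the bed if you cannot fall asleep"
print(rouge_1_f1(candidate, reference), rouge_l_f1(candidate, reference))  # 0.5 0.5
```

The example answer gives clinically equivalent advice yet scores only 0.5, which illustrates why overlap scores in the 0.2 to 0.3 range are not alarming on their own and why they cannot stand in for a clinical evaluation.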

In substance, this means a 7-billion-parameter model fine-tuned on 1,909 dialogues with the right strategy retains CBT-I professional knowledge and answer quality at a level exceeding many 13-billion-parameter models on the same task. The result is consistent with independent work by Maurya et al. (2025), which showed the advantage of compact models in psychotherapeutic dialogues more broadly — we discussed this earlier.

Safety of responses: 91.2% non-harmful — what this means

Any published report on an AI chatbot for mental health must include a safety evaluation — otherwise high BLEU metrics say nothing. Bao et al. (2025) ran a separate clinical evaluation: 180 randomly sampled dialogue sessions from the best model were rated on a 5-point Likert scale for harmfulness.

The mean score was 4.89/5 toward "clearly non-harmful." Distribution: 91.2% of sessions classified as "strongly disagree (non-harmful)," 2.2% neutral, 0% "extremely harmful." In other words, across 180 sessions raters did not find a single response judged clinically dangerous.

This is a strong result, but its boundaries should be understood. First, raters scored typical CBT-I conversations, not crisis scenarios involving suicidal ideation: the dialogue sample was representative of ordinary sessions, not of rare acute situations. Second, the rating is subjective: "harmful" here means "deviation from CBT-I protocol in a direction that could worsen sleep or mental state," not clinical danger in a crisis sense.
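A third boundary is statistical. Observing zero "extremely harmful" sessions out of 180 does not show that the true rate is zero. The sketch below computes a one-sided 95% upper bound for an event seen 0 times in n trials, both exactly and with the common "rule of three" approximation. It assumes the 180 sessions are independent draws from the same kind of conversation, which is a simplification and is not a claim made in the paper.

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """Largest event rate still compatible with seeing 0 events in n independent trials."""
    return 1 - (1 - confidence) ** (1 / n)

n_sessions = 180  # sessions rated in the Bao et al. (2025) safety evaluation

exact = upper_bound_zero_events(n_sessions)
rule_of_three = 3 / n_sessions

print(f"exact upper bound: {exact:.2%}")          # about 1.65%
print(f"rule of three:     {rule_of_three:.2%}")  # about 1.67%
```

Under these assumptions the data are compatible with a rate of up to roughly 1 in 60 sessions for typical conversations, and they say nothing about crisis scenarios that were not in the sample.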

For comparison, Li et al. (2023), in a meta-analysis of 35 AI agents for mental health, found that only 43% of systems had at least minimal crisis guardrails. The eCBT-I system from Bao et al., through its RAG anchoring to a vetted corpus, de facto solves part of this problem — but does not cover it fully. We unpacked the full picture of safety mechanisms in Guard rails for AI therapy.

Sleepio and Somryst: digital CBT-I already cleared by regulators

Digital CBT-I is one of the few areas of digital mental health with regulator-cleared products.

Sleepio (Big Health) is a program built on Colin Espie's algorithms. In a large RCT, Espie et al. (2019), published in JAMA Psychiatry, use of Sleepio significantly improved functional health, psychological well-being, and sleep-related quality of life compared with sleep hygiene education. Since 2022, Sleepio has been recommended by the UK's NICE for patients with insomnia, replacing first-line sleeping pills in a substantial portion of cases.

Somryst, developed by Pear Therapeutics, became the first prescription digital therapeutic for CBT-I to receive FDA marketing authorization, in 2020. (Pear filed for bankruptcy in 2023, and its product assets were sold to other companies.) Somryst is prescribed for the treatment of chronic insomnia in adults. Authorization means the product is not just an "app" but a registered medical product subject to its own quality and post-market surveillance requirements.

These products are the benchmark for evaluating current AI-chatbot systems. Sleepio and Somryst are built on rule-based algorithms (or hybrids with light AI), not LLMs. Bao et al. (2025) showed that a transition to a generative architecture is technically feasible while preserving accuracy, but clinical evidence specifically for LLM-CBT-I is still accumulating.

Where automated CBT-I falls short of a therapist

The most honest moment in Hwang et al. (2025) is a separate subsample where FA dCBT-I was compared with therapist-assisted CBT-I. Therapist-assisted CBT-I was significantly more effective: SMD = 0.61 (95% CI: 0.37, 0.85) in favor of human therapy.

This is not "AI is worse" in absolute terms — both modalities work and reduce insomnia. But if there is a choice and the person reaches a clinician, the specialist adds about 0.6 standard deviations of improvement on top of what the chatbot delivers alone.
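To translate standardized effects into points on the Insomnia Severity Index (ISI, scored 0 to 28), an SMD can be multiplied by a pooled standard deviation. The sketch below does this for a range of assumed standard deviations. These SD values are hypothetical placeholders, not figures from Hwang et al. (2025), so the outputs are illustrations of scale, not results.

```python
def smd_to_raw_points(smd: float, pooled_sd: float) -> float:
    """Approximate raw-score difference implied by a standardized mean difference."""
    return smd * pooled_sd

# SMDs reported in the meta-analysis.
AUTOMATED_VS_CONTROL = -0.71
THERAPIST_VS_AUTOMATED = 0.61

# Hypothetical pooled SDs for ISI scores, used only to show sensitivity.
for assumed_sd in (4.0, 5.0, 6.0):
    automated = smd_to_raw_points(AUTOMATED_VS_CONTROL, assumed_sd)
    therapist_gain = smd_to_raw_points(THERAPIST_VS_AUTOMATED, assumed_sd)
    print(f"SD={assumed_sd}: automated {automated:+.2f} ISI points, "
          f"therapist adds about {therapist_gain:.2f}")
```

With an assumed SD of 5 points, for example, the automated program corresponds to about 3.5 ISI points of improvement over control, and therapist assistance to about 3 more.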

Where exactly does the automated scheme break down? The authors propose three places. First, in individual calibration of the sleep window: the clinician sees the diary and decides in the moment whether to adjust the restriction protocol; the chatbot applies a generic algorithm. Second, in working with comorbid disorders — depression, anxiety, apnea — which require reassessing the protocol. Third, in emotional support during the restriction phase, when the patient complains of daytime sleepiness and wants to quit — here the alliance with a human holds better.

The meta-analysis authors' practical conclusion: a "hybrid model" — digital CBT-I plus targeted therapist support — yields the optimal result, especially in complex cases.

What a product needs for digital CBT-I to work

The combined evidence from Bao et al. (2025), Hwang et al. (2025), Espie et al. (2019), and the Sleepio/Somryst experience yields a product formula for a workable AI-CBT-I.

Anchoring to the protocol via RAG, not "general empathy." Bao et al. (2025) showed that the model must answer from a vetted CBT-I knowledge base, not generate "sleep advice" from general weights. Without this anchoring, a 7-billion-parameter model tends to drift into platitudes such as "try chamomile tea."

Sleep diary with automatic calculations. Sleep restriction is the most effective component of CBT-I, and it requires precise calculation of the sleep window from actual time in bed and time asleep. Without a structured diary (rather than "tell me about your sleep"), a chatbot cannot perform the key step.
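The calculation behind the diary requirement can be sketched as follows. Sleep efficiency is time asleep divided by time in bed; the starting sleep window is commonly set to the average time asleep, with a floor, and then adjusted week by week. The thresholds (85% and 80%), the 15-minute step, and the 5-hour floor are commonly cited textbook values used here as assumptions; they are not taken from the cited papers and are not clinical guidance.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class DiaryNight:
    minutes_in_bed: int
    minutes_asleep: int

# Assumed protocol constants (illustrative, not from the cited studies).
MIN_WINDOW_MIN = 300   # never restrict below 5 hours
STEP_MIN = 15          # weekly adjustment step
EXTEND_AT = 0.85       # extend the window at or above this efficiency
SHORTEN_BELOW = 0.80   # shorten the window below this efficiency

def sleep_efficiency(nights: list[DiaryNight]) -> float:
    """Total time asleep divided by total time in bed over the diary period."""
    return sum(n.minutes_asleep for n in nights) / sum(n.minutes_in_bed for n in nights)

def initial_window(nights: list[DiaryNight]) -> int:
    """Starting sleep window: average time asleep, but not below the floor."""
    return max(MIN_WINDOW_MIN, round(mean(n.minutes_asleep for n in nights)))

def next_window(nights: list[DiaryNight], current_window: int) -> int:
    """Weekly titration based on the most recent diary."""
    efficiency = sleep_efficiency(nights)
    if efficiency >= EXTEND_AT:
        return current_window + STEP_MIN
    if efficiency < SHORTEN_BELOW:
        return max(MIN_WINDOW_MIN, current_window - STEP_MIN)
    return current_window

baseline = [DiaryNight(b, a) for b, a in zip(
    [480, 510, 480, 450, 480, 495, 465],
    [350, 370, 360, 340, 365, 380, 355],
)]
week_two = [DiaryNight(360, 330)] * 7

print(sleep_efficiency(baseline))      # 0.75
print(initial_window(baseline))        # 360
print(next_window(week_two, 360))      # 375
```

None of this is possible from a free-text "tell me about your sleep" exchange: the function inputs are exactly the structured fields a diary has to capture.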

Adaptation without losing the protocol. Hadar-Shoval et al. (2023) showed that LLMs are plastic and adapt to the user. In CBT-I this is potentially a problem: "talking" the bot into letting you go to bed earlier because of fatigue means breaking sleep restriction. The architecture should allow tone and pacing to adapt, but protocol parameters must not.
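One way to express this separation in code is to keep protocol parameters and conversational style in different objects and let user requests change only the latter. The sketch below is a hypothetical design, assuming the field names and reply texts shown; it does not describe how any cited system is implemented.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ProtocolParams:
    """Set from the sleep diary; not negotiable in conversation."""
    bedtime: str
    rise_time: str

@dataclass(frozen=True)
class StyleParams:
    """Free to adapt to the user."""
    tone: str = "neutral"
    pace: str = "standard"

PROTOCOL_FIELDS = {"bedtime", "rise_time"}
STYLE_FIELDS = {"tone", "pace"}

def apply_user_request(style: StyleParams, field: str, value: str) -> tuple[StyleParams, str]:
    """Apply style changes; refuse protocol changes and explain why."""
    if field in PROTOCOL_FIELDS:
        return style, ("Your sleep window comes from your diary and is reviewed weekly, "
                       "so I can't change it here.")
    if field in STYLE_FIELDS:
        return replace(style, **{field: value}), f"Okay, {field} set to {value}."
    return style, "I'm not sure what you would like to change."

protocol = ProtocolParams(bedtime="00:30", rise_time="06:30")
style = StyleParams()

style, reply = apply_user_request(style, "tone", "warm")       # accepted
style, refusal = apply_user_request(style, "bedtime", "22:30")  # refused
print(reply)
print(refusal)
print(protocol)  # unchanged
```

Because `ProtocolParams` is frozen and never passed to the request handler, a persuasive user message has no code path through which it could move the bedtime.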

Clinician in the loop for complex cases. In Hwang et al. (2025), therapist-assisted CBT-I held an SMD advantage of 0.61 over the fully automated scheme. At the product level this means a built-in escalation route to a clinician at the first signs of sleep apnea (such as breathing pauses during sleep) or severe depression, conditions a chatbot alone should not treat.

Transparency about limitations. The certified products Sleepio and Somryst openly declare their context of use (adults, chronic insomnia without untreated comorbid apnea). Any AI chatbot for insomnia should do the same.

Limitations of the studies

Both the meta-analysis and the Bao et al. experiment carry important caveats.

Hwang et al. (2025) included 29 RCTs, but many tested rule-based products from a previous generation, not LLM chatbots. Direct transfer of the SMD = −0.71 estimate to current generative systems requires caution — there are no large RCTs yet specifically testing LLM-CBT-I.

Bao et al. (2025) ran a strong benchmark of models and adaptation strategies, but they did not compare clinical effectiveness with a human and did not run an RCT. BLEU-4 = 0.21 speaks to similarity with reference answers, not to ISI reduction in patients. The authors state plainly: "the effectiveness of the system must be confirmed by multi-center clinical trials."

Additionally, the eCBT-I system was evaluated on a single-center local dataset, primarily of Chinese-language CBT-I dialogues. Cross-cultural applicability is a separate question: beliefs about sleep, work schedules, and stress factors differ across countries.

Finally, neither study covered multimodal signals — voice, tone, face — that a clinician uses when diagnosing insomnia in a complex clinical picture.

Frequently asked questions

Does an AI chatbot help with insomnia?

Yes. A meta-analysis of 29 RCTs with 9,475 participants showed that fully automated digital CBT-I reduces insomnia severity with a mean effect size of SMD = −0.71 immediately post-treatment, with the result sustained at SMD = −0.76 at 12+ months (Hwang et al., 2025).

How is CBT-I in a chatbot different from sleep hygiene education?

CBT-I is not "sleep tips" but a structured five-component protocol: sleep hygiene, sleep restriction, stimulus control, relaxation, and cognitive restructuring of beliefs about sleep (Bao et al., 2025). Sleep hygiene education is only one of the five components, and on its own it is clinically modest; the bulk of the effect comes from sleep restriction and stimulus control.

Which LLMs handle CBT-I best?

In the comparative experiment by Bao et al. (2025) across eight models, the best result came from compact Qwen2-7b with the Freeze adaptation strategy (BLEU-4 = 0.21; C-eval = 0.81). This aligns with the broader finding that small fine-tuned models outperform larger ones in psychotherapeutic dialogues (Maurya et al., 2025).

Does digital CBT-I replace a therapist?

Not entirely. In a subsample of Hwang et al. (2025), therapist-assisted CBT-I had a significant advantage over fully automated CBT-I (SMD = 0.61). The authors recommend a hybrid model: a digital program plus targeted specialist support — especially for comorbid depression, apnea, or anxiety.

Are AI chatbots safe for treating insomnia?

In the Bao et al. (2025) safety evaluation across 180 dialogue sessions, 91.2% of sessions were rated "strongly disagree" on harmfulness (that is, non-harmful), 0% were rated "extremely harmful," and the mean Likert score was 4.89/5. However, this result applies to typical CBT-I dialogues, not to acute crisis scenarios; for suicidal ideation or severe comorbidity, separate guardrails and an escalation route to a human are required.

Practical takeaway

Insomnia is the most "mature" scenario for digital AI therapy. The combined evidence (a meta-analysis of 29 RCTs with sustained effect, the Sleepio RCT in JAMA Psychiatry, FDA clearance of Somryst, and the LLM comparison of Bao et al.) supports the claim that a well-designed AI chatbot built on the CBT-I protocol genuinely reduces insomnia severity and holds the effect for at least a year.

But "well-designed" here is not a marketing phrase but a set of concrete requirements: anchoring to the protocol via RAG, a structured sleep diary with sleep-window calculation, protection of sleep-restriction parameters from being "talked out" by the user, an escalation route to a clinician for comorbidities, and an explicit declaration of limitations.

At Nearby we use an approach compatible with this formula: CBT protocols at the system-prompt level, structured between-session diary work, memory of the user for continuity, and transparent boundaries — what the AI chatbot does, and what is left to a human specialist. For chronic insomnia with suspected apnea or severe depression, the chatbot does not replace a clinic visit — but as a first entry point to working on sleep behavior, it is a workable tool.

Related reading: Small AI models outperform giants in therapy, Prompt engineering for an AI therapist, Meta-analysis of 35 AI chatbot studies.


References

Bao, X., Zhu, X., Yang, D., Lou, H., Wang, R., Wu, Y., Li, W., Xia, Y., Zeng, L., Pan, Y., Wang, X., Zhang, X., Ling, C., Ling, Y., Zhang, Y., Zhao, Q., & Yang, M. (2025). eCBT-I dialogue system: A comparative evaluation of large language models and adaptation strategies for insomnia treatment. Journal of Translational Medicine, 23, 862. https://doi.org/10.1186/s12967-025-06871-y

Espie, C. A., Emsley, R., Kyle, S. D., Gordon, C., Drake, C. L., Siriwardena, A. N., Cape, J., Ong, J. C., Sheaves, B., Foster, R., Freeman, D., Costa-Font, J., Marsden, A., & Luik, A. I. (2019). Effect of digital cognitive behavioral therapy for insomnia on health, psychological well-being, and sleep-related quality of life: A randomized clinical trial. JAMA Psychiatry, 76(1), 21–30. https://doi.org/10.1001/jamapsychiatry.2018.2745

Hadar-Shoval, D., Elyoseph, Z., & Lvovsky, M. (2023). The plasticity of ChatGPT's mentalizing abilities: Personalization for personality structures. Frontiers in Psychiatry, 14, 1234397. https://doi.org/10.3389/fpsyt.2023.1234397

Hwang, J. W., Lee, G. E., Woo, J. H., Kim, S. M., & Kwon, J. Y. (2025). Systematic review and meta-analysis on fully automated digital cognitive behavioral therapy for insomnia. NPJ Digital Medicine, 8(1), 159. https://doi.org/10.1038/s41746-025-01514-4

Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5

Maurya, R. K., Pal, A., Chouhan, S. S., & Maurya, A. K. (2025). Exploring the potential of lightweight LLMs for AI-based mental health counselling: A novel comparative study. Scientific Reports, 15(1), 5012. https://doi.org/10.1038/s41598-025-05012-1
