DSM5AgentFlow is a multi-agent system of three AI agents that screens for mental health conditions through natural conversation and backs every conclusion with references to specific DSM-5 criteria. In testing on 8,000 synthetic dialogues, the best model achieved 70% accuracy and an F1 score of 77%, with per-disorder scores reaching about 94% for panic disorder, PTSD, and social anxiety (Ozgun et al., 2025).

Why Diagnostic Transparency Is Critical

Most AI mental health systems operate as a "black box": they deliver a result without explaining how they arrived at it. For users, this looks like "the AI says you have depression" — with no way to understand why.

In clinical practice, transparency is a baseline requirement. A therapist explains their hypotheses, references diagnostic criteria, and ties observations to specific statements the client has made. This allows both the patient and the supervisor to verify the reasoning.

Systematic reviews document the growing use of LLMs in psychiatry (Guo et al., 2024; Omar et al., 2024), but systems with explainable diagnostics remain rare. DSM5AgentFlow, developed by a team from Vrije Universiteit Amsterdam and Eindhoven University of Technology, addresses exactly this problem.

Three Agents: Therapist, Client, Diagnostician

The system's architecture models a real diagnostic process through three specialized agents:

The therapist agent conducts the clinical interview. It takes the 23 standard questions from the DSM-5 Level-1 Cross-Cutting Symptom Measure, which together cover 13 symptom domains, and rephrases them as natural, conversational questions. Instead of "Rate the frequency of your panic attacks from 0 to 4," it asks: "Can you tell me, are there moments when fear or panic suddenly overwhelms you?"

The client agent simulates a patient with a given psychological profile. It responds in the first person, describing symptoms without using diagnostic terminology. This allows the system to be tested at scale: 8,000 dialogues cover 10 major disorders — from anxiety and depression to schizophrenia and substance use.

The diagnostician agent analyzes the conversation transcript and produces a structured report in four parts:

A compassionate summary of the patient's condition
A diagnostic hypothesis
Justification with quotes from the dialogue and references to DSM-5 criteria
Treatment recommendations
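The three roles and the four-part report can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the LLM type, the prompt wording, and the function names run_interview and diagnose are all hypothetical, and calling the diagnostician once per report section is a simplification chosen here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]
Transcript = List[Tuple[str, str]]  # (therapist question, client answer)

# The four report parts listed in the article.
REPORT_SECTIONS = ("summary", "hypothesis", "justification", "recommendations")


@dataclass
class DiagnosticReport:
    summary: str          # compassionate summary of the patient's condition
    hypothesis: str       # diagnostic hypothesis
    justification: str    # dialogue quotes plus DSM-5 criteria
    recommendations: str  # treatment recommendations


def run_interview(items: List[str], therapist: LLM, client: LLM) -> Transcript:
    """Ask every screening item once, in fixed order (no adaptive follow-ups)."""
    transcript: Transcript = []
    for item in items:
        question = therapist(f"Rephrase this screening item conversationally: {item}")
        answer = client(f"Answer in the first person, without diagnostic terms: {question}")
        transcript.append((question, answer))
    return transcript


def diagnose(transcript: Transcript, diagnostician: LLM) -> DiagnosticReport:
    """Build the four-part report from the finished transcript."""
    dialogue = "\n".join(f"Therapist: {q}\nClient: {a}" for q, a in transcript)
    parts = {
        section: diagnostician(f"Write the {section} section for this dialogue:\n{dialogue}")
        for section in REPORT_SECTIONS
    }
    return DiagnosticReport(**parts)
```

Passing any function that maps a prompt to a string as therapist, client, and diagnostician is enough to exercise the flow, which is also how a simulated client makes large-scale testing possible.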

The multi-agent approach, in which each agent is responsible for a single role, has already shown promise over monolithic solutions in therapy-oriented systems (Chen et al., 2025) and in state assessment. DSM5AgentFlow extends this trend to diagnosis.

How RAG Ensures Evidence-Based Reasoning

The key technical feature is RAG (Retrieval-Augmented Generation) integration with the full text of DSM-5. The diagnostician does not rely on knowledge baked into model weights. Instead, it:

Receives the dialogue transcript
Retrieves the 5 most relevant DSM-5 fragments (chunks of 512–1,024 tokens)
Formulates a diagnosis, explicitly linking patient statements to criteria
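The retrieval step above can be illustrated with a small, self-contained sketch. It ranks text chunks by bag-of-words cosine similarity, which is an assumption made here to keep the example dependency-free; the source does not say which embedding model or vector store the authors used, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_chunks(transcript: str, chunks: List[str], k: int = 5) -> List[str]:
    """Return the k chunks most similar to the transcript (k=5 as in the article)."""
    query = tokenize(transcript)
    ranked = sorted(chunks, key=lambda c: cosine(query, tokenize(c)), reverse=True)
    return ranked[:k]


# Placeholder strings for illustration only; these are NOT actual DSM-5 text.
example_chunks = [
    "sudden surges of intense fear with a pounding heart",
    "persistent low mood and loss of interest",
    "trouble falling asleep or staying asleep",
]
example_hits = top_k_chunks(
    "My heart starts pounding and I feel sudden intense fear", example_chunks, k=2
)
```

In a real pipeline the retrieved chunks would be pasted into the diagnostician's prompt next to the transcript, so the model reasons over the retrieved text rather than over whatever it memorized in training.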

XML tags mark the connections: <sym> for symptoms, <quote> for direct quotes from the dialogue, <med> for medical criteria. This makes the reasoning chain traceable: a specific patient statement maps to a specific DSM-5 criterion, which in turn supports a diagnostic conclusion.
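Because the evidence is tagged, it can be pulled back out of a report mechanically. The sketch below assumes each tag has a matching closing tag (the source names only the opening tags), and the sample report text and the function name extract_evidence are invented for illustration.

```python
import re
from typing import Dict, List

# Matches <sym>...</sym>, <quote>...</quote>, and <med>...</med>.
# Assumption: tags are properly closed and not nested inside the same tag type.
TAG_PATTERN = re.compile(r"<(sym|quote|med)>(.*?)</\1>", re.DOTALL)


def extract_evidence(report: str) -> Dict[str, List[str]]:
    """Group tagged spans by tag name so each claim can be checked by a reader."""
    evidence: Dict[str, List[str]] = {"sym": [], "quote": [], "med": []}
    for tag, content in TAG_PATTERN.findall(report):
        evidence[tag].append(content.strip())
    return evidence


# Hypothetical report fragment, not taken from the paper.
sample_report = (
    "The client describes <sym>sudden surges of fear</sym>, saying "
    "<quote>my heart races out of nowhere</quote>, which is consistent with "
    "<med>panic disorder</med>."
)
sample_evidence = extract_evidence(sample_report)
```

Counting the entries under each key gives a rough measure of how much evidence a report cites, although a count alone says nothing about whether the reasoning that links the pieces is sound.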

DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th edition) is the standard classification of the American Psychiatric Association, containing diagnostic criteria for all major mental health conditions. Using it as the RAG knowledge base ties each conclusion to an authoritative clinical source rather than to the model's recall alone.

Accuracy: From 70% Overall to 94% for Anxiety and Trauma-Related Disorders

The system was tested on four language models: Llama-4-Scout-17B, Mistral-Saba-24B, Qwen-QWQ-32B, and GPT-4.1-Nano. The strongest results by F1 came from Qwen-QWQ, a model optimized for reasoning:

Overall accuracy: 70%, F1: 77%
Panic disorder: 93.65%
PTSD: 94.36%
Social anxiety: 93.89%

GPT-4.1-Nano achieved 83% accuracy but with a lower F1 (73%). Dialogue quality was evaluated separately: Llama-4 and Mistral scored 4.26–4.41 out of 5 on an LLM rubric scale, while GPT-4.1-Nano scored only 1.89–2.54 (Ozgun et al., 2025).
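Accuracy and F1 can rank models differently, which is why one model can lead on one metric and trail on the other. The toy example below shows the mechanism with invented labels. It uses macro-averaged F1, which is an assumption: the source does not state how the reported F1 was averaged, and none of these numbers come from the paper.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: List[str], y_pred: List[str], label: str) -> float:
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# Invented data: 8 "anxiety" cases and 2 "adjustment" cases.
truth = ["anxiety"] * 8 + ["adjustment"] * 2
# Model A always predicts the common class.
model_a = ["anxiety"] * 10
# Model B finds both rare cases but mislabels three common ones.
model_b = ["anxiety"] * 5 + ["adjustment"] * 3 + ["adjustment"] * 2

acc_a, f1_a = accuracy(truth, model_a), macro_f1(truth, model_a)
acc_b, f1_b = accuracy(truth, model_b), macro_f1(truth, model_b)
```

Here model A has the higher accuracy while model B has the higher macro F1, because A never detects the rare class at all. A gap of this kind is one reason to report both metrics.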

The weakest area was adjustment disorder, with F1 ranging from 2.78% to 40.25%. The system consistently confused it with depression, which is unsurprising: differentiating these diagnoses remains one of the most challenging tasks in clinical practice as well.

Explanation Quality: Not All Models Are Equally Transparent

Explainability — the model's ability to justify its conclusions — was evaluated separately. The differences were significant:

Qwen-QWQ (best): 11 symptom tags, 4 direct quotes from the dialogue, explicit references to DSM criteria, numbered reasoning steps. A fully transparent process from observation to conclusion.

GPT-4.1-Nano: many tags, but without structured reasoning. Even when the answer is correct, it is unclear why, because the connection between observations and conclusions is lost.

Llama-4: minimal justification, no references to criteria. Essentially the same "black box" the system was designed to eliminate.

This result matters: diagnostic accuracy without explanation has limited value in a clinical context. A clinician must be able to verify each step of the reasoning — just as computational psychiatry strives to make mathematical models of mental processes transparent.

Limitations: Why This Is Not Yet a Replacement for a Psychiatrist

The authors are upfront about the study's boundaries:

Synthetic data only — all 8,000 dialogues were AI-generated. Ecological validity has not been confirmed
Single-pass generation — the system does not adapt questions during the interview based on previous answers
Limited model pool — testing was conducted only on Groq-hosted and OpenAI models
Overlapping symptoms — disorders with similar clinical presentations (adjustment vs. depression) are poorly differentiated
The authors' position: the system is a research tool, not a medical device

All data and code are open for reproduction by other researchers — an important step for scientific transparency in a field where trust is critical.

What This Means for the Future of AI Screening

DSM5AgentFlow shows what the next step might look like: not replacing clinicians, but providing a transparent preliminary screening tool. A system that explains every conclusion can:

Help users make sense of their symptoms before visiting a specialist
Give therapists a structured report to accelerate initial assessment
Standardize screening in regions with a shortage of psychiatrists

For Nearby, this supports the multi-agent approach: splitting responsibility among therapeutic, analytical, and supervisory agents can produce results that are both more accurate and more transparent.

What Computational Models Change About Diagnosis

There's a deeper tension worth naming. DSM-5 is a categorical system: you either meet the checklist for a disorder or you don't, and the line between "diagnosed" and "not" is a hard threshold. A system like DSM5AgentFlow inherits that logic — it maps a conversation onto DSM criteria. But much of computational psychiatry is pulling in the opposite direction, toward a dimensional view of mental illness.

Network theory, proposed by Borsboom (2017, World Psychiatry), is a clear example: it treats a disorder not as a single hidden disease entity but as a network of symptoms that trigger one another. Research initiatives like the NIMH's Research Domain Criteria (RDoC) push the same way, describing where a person sits along continuous parameters rather than sorting them into boxes. This matters for diagnosis because two people with the identical DSM label can differ enormously in what's actually driving their distress — and therefore in what will help them. Our overview of computational models of mental disorders unpacks this shift in detail.

It also comes with a warning. Hitchcock and colleagues (2023, Neuroscience & Biobehavioral Reviews) found that many computational measures suffer from low test-retest reliability — meaning the same person can score differently on different days. Any AI that leans on such measures inherits that instability, which is one more reason to treat an automated result as a starting point for a conversation with a clinician, not a verdict.
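Test-retest reliability can be made concrete with a short calculation. The sketch below uses the Pearson correlation between two sessions as a simple proxy; reliability studies often use the intraclass correlation instead, so treat this as an approximation. All scores are invented for illustration.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation between paired scores from two sessions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical scores for five people on day 1.
session_1 = [0.2, 0.4, 0.6, 0.8, 1.0]
# A stable measure: everyone scores about the same on day 2.
stable_retest = [0.25, 0.35, 0.65, 0.75, 1.05]
# An unstable measure: the ordering of people is scrambled on day 2.
unstable_retest = [0.9, 0.1, 0.5, 0.3, 0.7]

r_stable = pearson_r(session_1, stable_retest)
r_unstable = pearson_r(session_1, unstable_retest)
```

A value near 1 means people keep their relative positions between sessions; a value near 0 means a score from one day tells you little about the next, which is the instability the warning above describes.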

Frequently Asked Questions

Can AI diagnose a mental health condition?

Not yet — not in a clinical sense. DSM5AgentFlow achieves 70% accuracy and 77% F1 under controlled conditions, but it was tested only on synthetic data. The authors position the system as a research tool, not a replacement for psychiatric diagnosis (Ozgun et al., 2025).

What is DSM-5 and why does an AI system need it?

DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th edition) is the standard classification of the American Psychiatric Association. It includes diagnostic criteria for all major mental health conditions. DSM5AgentFlow uses it as a knowledge base via RAG, grounding every conclusion in a specific criterion.

Which disorders does the system diagnose most accurately?

Anxiety and trauma-related disorders: panic disorder (93.65%), PTSD (94.36%), and social anxiety (93.89%). The weakest performance is on adjustment disorder (F1 from 2.78% to 40.25%), which the system frequently confuses with depression.

How is DSM5AgentFlow different from standard AI screening?

Three key differences: (1) a multi-agent architecture with separated roles, (2) RAG integration with the full text of DSM-5, (3) structured justification for every conclusion with symptom tags and dialogue quotes. Most conventional AI screening tools deliver results without explanation.

Can DSM5AgentFlow results be used for self-diagnosis?

No. The authors explicitly state that the system is a research tool, not a medical device. Any screening — whether AI-based or a paper questionnaire — is a reason to consult a specialist, not a basis for drawing your own conclusions.

Why is DSM-5 criticized in computational psychiatry?

Because it is categorical — it sorts people into diagnostic boxes with hard thresholds, when much of the evidence suggests mental health is dimensional. Approaches like network theory and the RDoC framework argue that two people with the same label can be driven by very different underlying processes. DSM-5 remains the clinical standard and a useful common language, but computational researchers see its either/or structure as a poor fit for how disorders actually vary.

Will AI replace psychiatric diagnosis?

There is no credible sign of that. Even the most transparent research systems, tested only on synthetic data, are positioned by their own authors as screening and support tools — not medical devices. The realistic near-term role is to help people make sense of symptoms before a visit and to give clinicians a structured starting report, with a human always making the actual diagnosis.

Sources

Ozgun, M. C., Pei, J., Hindriks, K. V., Donatelli, L., Liu, Q., & Wang, J. (2025). Trustworthy AI psychotherapy: Multi-agent LLM workflow for counseling and explainable mental disorder diagnosis. Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025). https://doi.org/10.1145/3746252.3761164

Guo, J., et al. (2024). Large language models for mental health: A systematic review. arXiv. https://doi.org/10.48550/arxiv.2403.15401

Omar, A., et al. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry, 15. https://doi.org/10.3389/fpsyt.2024.1422807

Chen, Y., et al. (2025). MIND: Towards immersive psychological healing with multi-agent inner dialogue. arXiv. https://doi.org/10.48550/arxiv.2502.19860