AI Diagnosis with DSM-5: Transparency Instead of a Black Box
DSM5AgentFlow is a multi-agent system of three AI agents that screens for mental health conditions through natural conversation and backs every conclusion with references to specific DSM-5 criteria. In testing on 8,000 dialogues, the model with the best F1 score (Qwen-QWQ-32B) achieved 70% accuracy and an F1 score of 77%, with per-disorder results of up to 94% for anxiety-related disorders (Ozgun et al., 2025).
Why Diagnostic Transparency Is Critical
Most AI mental health systems operate as a "black box": they deliver a result without explaining how they arrived at it. For users, this looks like "the AI says you have depression" — with no way to understand why.
In clinical practice, transparency is a baseline requirement. A therapist explains their hypotheses, references diagnostic criteria, and ties observations to specific statements the client has made. This allows both the patient and the supervisor to verify the reasoning.
Systematic reviews document the growing use of LLMs in psychiatry (Guo et al., 2024; Omar et al., 2024), but systems with explainable diagnostics remain rare. DSM5AgentFlow, developed by a team from Vrije Universiteit Amsterdam and Eindhoven University of Technology, addresses exactly this problem.
Three Agents: Therapist, Client, Diagnostician
The system's architecture models a real diagnostic process through three specialized agents:
The therapist agent conducts the clinical interview. It takes 23 standard questions from the DSM-5 Level-1 Cross-Cutting Symptom Measure and rephrases them into natural, conversational questions. Instead of "Rate the frequency of your panic attacks from 0 to 4," it asks: "Can you tell me — are there moments when fear or panic suddenly overwhelms you?" It covers 13 symptom domains.
The client agent simulates a patient with a given psychological profile. It responds in the first person, describing symptoms without using diagnostic terminology. This allows the system to be tested at scale: 8,000 dialogues cover 10 major disorders — from anxiety and depression to schizophrenia and substance use.
The diagnostician agent analyzes the conversation transcript and produces a structured report in four parts:
- A compassionate summary of the patient's condition
- A diagnostic hypothesis
- Justification with quotes from the dialogue and references to DSM-5 criteria
- Treatment recommendations
The multi-agent approach, in which each agent is responsible for its own role, has been reported to outperform monolithic solutions in both therapy and state assessment. DSM5AgentFlow extends this trend to the diagnostic side.
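The three roles described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the function names, the stubbed replies, and the sample questionnaire items are all hypothetical, and a real system would replace each stub with an LLM call.

```python
from dataclasses import dataclass
from typing import List, Tuple

Turn = Tuple[str, str]  # (therapist question, client answer)


@dataclass
class DiagnosticReport:
    """The four-part report described in the article."""
    summary: str
    hypothesis: str
    justification: List[str]
    recommendations: List[str]


def therapist_agent(item: str) -> str:
    # Hypothetical stub: a real agent would ask an LLM to rephrase the item.
    return f"In your own words, how often have you been bothered by {item}?"


def client_agent(question: str, profile: str) -> str:
    # Hypothetical stub: a real agent would answer in the first person.
    return f"[first-person answer consistent with profile: {profile}]"


def diagnostician_agent(transcript: List[Turn]) -> DiagnosticReport:
    # Hypothetical stub: a real agent would read the transcript together
    # with retrieved DSM-5 passages and write the report.
    return DiagnosticReport(
        summary=f"Summary of {len(transcript)} question-answer turns.",
        hypothesis="(placeholder hypothesis)",
        justification=[answer for _, answer in transcript[:2]],
        recommendations=["(placeholder recommendation)"],
    )


def run_session(items: List[str], profile: str) -> Tuple[List[Turn], DiagnosticReport]:
    transcript: List[Turn] = []
    for item in items:  # fixed question order, no adaptive follow-ups
        question = therapist_agent(item)
        transcript.append((question, client_agent(question, profile)))
    return transcript, diagnostician_agent(transcript)


if __name__ == "__main__":
    # Invented sample items, not the official questionnaire wording.
    sample_items = ["feeling nervous or on edge", "trouble sleeping"]
    turns, report = run_session(sample_items, "simulated anxiety profile")
    print(report.summary)
```

Keeping the report as a typed structure makes it easy to check that all four parts are present before a result is shown to anyone.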
How RAG Ensures Evidence-Based Reasoning
The key technical feature is RAG (Retrieval-Augmented Generation) integration with the full text of DSM-5. The diagnostician does not rely on knowledge baked into model weights. Instead, it:
- Receives the dialogue transcript
- Retrieves the 5 most relevant DSM-5 fragments (chunks of 512–1,024 tokens)
- Formulates a diagnosis, explicitly linking patient statements to criteria
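The retrieval step in this list can be illustrated with a toy top-k search. The sketch below assumes a simple bag-of-words cosine similarity and invented placeholder chunks; the article does not say which embedding model or vector store the authors use, and the chunk texts here are not quotations from DSM-5.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 5) -> List[Tuple[float, str]]:
    # Score every chunk against the query and keep the k best.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented placeholder chunks standing in for passages of a reference text.
CHUNKS = [
    "sudden surges of intense fear with pounding heart and shortness of breath",
    "persistent low mood and loss of interest in activities",
    "fear of social situations and being judged by others",
]

if __name__ == "__main__":
    for score, chunk in top_k("my heart starts pounding and I feel sudden fear", CHUNKS, k=2):
        print(f"{score:.2f}  {chunk}")
```

A production system would more likely use dense embeddings, but the ranking logic (score every chunk, keep the top k, pass them to the diagnostician) has the same shape.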
XML tags mark the connections: <sym> for symptoms, <quote> for direct quotes from the dialogue, and <med> for medical criteria. This makes the reasoning chain traceable: a specific patient statement maps to a specific DSM-5 criterion, which in turn supports a diagnostic conclusion.
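Tagged output of this kind is straightforward to check by machine. The sketch below assumes each tag has a matching closing tag (for example </sym>) and that tags are not nested; the sample report text is invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches <sym>...</sym>, <quote>...</quote>, and <med>...</med>.
# The backreference \1 requires the closing tag to match the opening one.
TAG_PATTERN = re.compile(r"<(sym|quote|med)>(.*?)</\1>", re.DOTALL)


def extract_tags(report: str) -> Dict[str, List[str]]:
    # Group the tagged spans of a report by tag name.
    found: Dict[str, List[str]] = defaultdict(list)
    for tag, content in TAG_PATTERN.findall(report):
        found[tag].append(content.strip())
    return dict(found)


# Invented sample text, not output from the actual system.
SAMPLE_REPORT = (
    "The client describes <sym>sudden episodes of intense fear</sym>, saying "
    "<quote>my heart races and I can't breathe</quote>. This is consistent with "
    "<med>recurrent unexpected panic attacks</med> and "
    "<sym>avoidance of crowded places</sym>."
)

if __name__ == "__main__":
    tags = extract_tags(SAMPLE_REPORT)
    for name, spans in tags.items():
        print(name, len(spans), spans)
```

Counting spans per tag gives a rough proxy for how much of a report is tied to evidence, which is the kind of comparison made between models later in the article.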
DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th edition) is the standard classification of the American Psychiatric Association, containing diagnostic criteria for all major mental health conditions. Using it as a RAG knowledge base is intended to ground every conclusion in an authoritative clinical source.
Accuracy: From 70% Overall to 94% for Anxiety Disorders
The system was tested on four language models: Llama-4-Scout-17B, Mistral-Saba-24B, Qwen-QWQ-32B, and GPT-4.1-Nano. The best F1 score came from Qwen-QWQ, a model optimized for reasoning:
- Overall accuracy: 70%, F1: 77%
- Panic disorder: 93.65%
- PTSD: 94.36%
- Social anxiety: 93.89%
GPT-4.1-Nano achieved higher accuracy (83%) but a lower F1 (73%). Dialogue quality was evaluated separately: Llama-4 and Mistral scored 4.26–4.41 out of 5 on an LLM-scored rubric, while GPT-4.1-Nano scored only 1.89–2.54 (Ozgun et al., 2025).
The weakest area was adjustment disorder: F1 ranging from 2.78% to 40.25%. The system systematically confused it with depression — which is unsurprising, since differentiating these diagnoses remains one of the most challenging tasks in clinical practice as well.
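Accuracy and F1 can diverge, and a single confused class can pull per-class F1 down sharply even when overall accuracy looks good. The toy example below uses invented labels, not the study's data, and assumes a simple unweighted (macro) average of per-class F1; the article does not state which averaging the authors used.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    # F1 = 2*TP / (2*TP + FP + FN), computed separately for each label.
    scores: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    # Unweighted mean of the per-class F1 scores.
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented toy labels: both adjustment cases are predicted as depression.
Y_TRUE = ["anxiety"] * 4 + ["depression"] * 4 + ["adjustment"] * 2
Y_PRED = ["anxiety"] * 4 + ["depression"] * 4 + ["depression"] * 2

if __name__ == "__main__":
    print("accuracy:", accuracy(Y_TRUE, Y_PRED))
    print("per-class F1:", per_class_f1(Y_TRUE, Y_PRED))
    print("macro F1:", round(macro_f1(Y_TRUE, Y_PRED), 3))
```

In this toy case accuracy is 0.8, yet the F1 for the adjustment class is 0 because every adjustment case is absorbed by the depression label. This is the same confusion pattern the article describes.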
Explanation Quality: Not All Models Are Equally Transparent
Explainability, the model's ability to justify its conclusions, was evaluated separately. The differences between models were substantial:
Qwen-QWQ (best): 11 symptom tags, 4 direct quotes from the dialogue, explicit references to DSM criteria, numbered reasoning steps. A fully transparent process from observation to conclusion.
GPT-4.1-Nano: many tags, but without structured reasoning. The answer is correct, but it is unclear why — the connection between observations and conclusions is lost.
Llama-4: minimal justification, no references to criteria. Essentially the same "black box" the system was designed to eliminate.
This result matters: diagnostic accuracy without explanation has limited value in a clinical context. A clinician must be able to verify each step of the reasoning — just as computational psychiatry strives to make mathematical models of mental processes transparent.
Limitations: Why This Is Not Yet a Replacement for a Psychiatrist
The authors are upfront about the study's boundaries:
- Synthetic data only — all 8,000 dialogues were AI-generated. Ecological validity has not been confirmed
- Single-pass generation — the system does not adapt questions during the interview based on previous answers
- Limited model pool — testing was conducted only on Groq-hosted and OpenAI models
- Overlapping symptoms — disorders with similar clinical presentations (adjustment vs. depression) are poorly differentiated
- The authors' position: the system is a research tool, not a medical device
All data and code are open for reproduction by other researchers — an important step for scientific transparency in a field where trust is critical.
What This Means for the Future of AI Screening
DSM5AgentFlow shows what the next step might look like: not replacing clinicians, but providing a transparent preliminary screening tool. A system that explains every conclusion can:
- Help users make sense of their symptoms before visiting a specialist
- Give therapists a structured report to accelerate initial assessment
- Standardize screening in regions with a shortage of psychiatrists
For Nearby, this supports the multi-agent approach: splitting responsibility among therapeutic, analytical, and supervisory agents can make results both more accurate and more transparent.
Frequently Asked Questions
Can AI diagnose a mental health condition?
Not yet — not in a clinical sense. DSM5AgentFlow achieves 70% accuracy and 77% F1 under controlled conditions, but it was tested only on synthetic data. The authors position the system as a research tool, not a replacement for psychiatric diagnosis (Ozgun et al., 2025).
What is DSM-5 and why does an AI system need it?
DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th edition) is the standard classification of the American Psychiatric Association. It includes diagnostic criteria for all major mental health conditions. DSM5AgentFlow uses it as a knowledge base via RAG, grounding every conclusion in a specific criterion.
Which disorders does the system diagnose most accurately?
Anxiety and trauma-related disorders: panic disorder (93.65%), PTSD (94.36%), social anxiety (93.89%). The weakest performance is on adjustment disorder (F1 from 2.78% to 40.25%), which the system frequently confuses with depression.
How is DSM5AgentFlow different from standard AI screening?
Three key differences: (1) a multi-agent architecture with separated roles, (2) RAG integration with the full text of DSM-5, (3) structured justification for every conclusion with symptom tags and dialogue quotes. Most conventional AI screening tools deliver results without explanation.
Can DSM5AgentFlow results be used for self-diagnosis?
No. The authors explicitly state that the system is a research tool, not a medical device. Any screening — whether AI-based or a paper questionnaire — is a reason to consult a specialist, not a basis for drawing your own conclusions.
Sources
Ozgun, M. C., Pei, J., Hindriks, K. V., Donatelli, L., Liu, Q., & Wang, J. (2025). Trustworthy AI psychotherapy: Multi-agent LLM workflow for counseling and explainable mental disorder diagnosis. Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025). https://doi.org/10.1145/3746252.3761164
Guo, J., et al. (2024). Large language models for mental health: A systematic review. arXiv. https://doi.org/10.48550/arxiv.2403.15401
Omar, A., et al. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry, 15. https://doi.org/10.3389/fpsyt.2024.1422807
Chen, Y., et al. (2025). MIND: Towards immersive psychological healing with multi-agent inner dialogue. arXiv. https://doi.org/10.48550/arxiv.2502.19860