A live therapist plays at least four distinct clinical roles, and AI in 2024–2025 replaces them at very different rates. On routine protocol delivery and basic empathic responding, AI now matches humans on validated scales. On in-the-moment regulation, suicide-risk assessment, and complex differential diagnosis, the gap stays wide. This article maps each role to the strongest evidence and to the boundary where a chatbot stops being safe.

The pooled effect-size question ("does AI reduce depression?") has already been answered in our breakdown of the Li et al. 2023 meta-analysis and the LLM-vs-scripted comparison by Du et al. 2025. Below we focus on the harder question: what happens in head-to-head designs where the chatbot and the clinician are doing the same task on the same population?

A therapist is four roles, not one

The right question is not "can AI replace a therapist," but "in which of the therapist's roles, and for which users, does AI already perform at a level comparable to a human?" Health systems running stepped care from the UK to Australia operationalize the clinician as four functions:

Diagnostician — distinguishing depression from anxiety, PTSD, and the bipolar spectrum.
Technique deliverer — running CBT, ACT, and behavioral activation protocols turn-by-turn.
Alliance partner — building a working bond, validating experience, tolerating silence and resistance.
Clinical judge — assessing risk, deciding when to escalate, owning the case across sessions.

A systematic review by Omar et al. (2024) in Frontiers in Psychiatry (Q1, 50 citations) synthesized 28 studies and reached a precise verdict: LLMs are "promising" on technique delivery and parts of alliance, noticeably weaker on clinical risk assessment, and not yet evaluated against humans on long-horizon judgment. We walk through each role with the strongest 2024–2025 evidence.

Role 1 — Technique deliverer: AI matches humans on protocol fidelity

The most informative 2025 design is Napiwotzki et al. (JMIR Formative Research), which put an AI chatbot and live therapists side by side on behavioral activation — one of the most evidence-based CBT techniques for depression. BA is the ideal comparison surface because its protocol is tightly operationalized: values clarification, activity hierarchy, mood monitoring, homework review. There is little ambiguity about what "doing it right" looks like.

A mixed-methods study in JMIR Mental Health (Scholich et al., 2025) compared the therapeutic communication of LLM chatbots and live therapists. The shared finding across both designs: on protocol fidelity and basic empathic responses, AI matches humans or comes close. The gap opens up in the finer work: handling client resistance, decoding ambiguous framings of a request, and adapting intensity to the client's in-the-moment state.

Song et al. (2024) in Proceedings of the ACM on Human-Computer Interaction (Q1) tracked the failure mode qualitatively. Users of LLM chatbots for mental health valued accessibility and the absence of judgment, but ran into conversational breakdowns — irrelevant or formulaic responses in emotionally charged moments. This is not a knowledge gap. It is the cost of statistical generation when the protocol script runs out.

Verdict on role 1: AI can deliver a tight CBT protocol turn-by-turn at near-human fidelity. It cannot improvise around a protocol when the client breaks the expected pattern.

Role 2 — Alliance partner: 3.76 of 5 on the WAI, but asymmetric

The alliance — the working bond between client and therapist — predicts the outcome of psychotherapy better than the chosen method does, per Bordin (1979). So the second question for AI is whether an alliance even forms.

A cross-sectional study of 527 users of the AI chatbot Clare measured alliance on the Working Alliance Inventory — Short Revised (Schäfer et al., 2025). The mean was 3.76 out of 5 — comparable to in-person outpatient psychotherapy (3.9–4.2) and group CBT (3.5–3.8). Two findings sharpen the picture:

Alliance with AI was strongest among lonely users (r = 0.25) and people with marked anxiety or depression symptoms (r = 0.37). The chatbot is most valued precisely where the human service is most rationed.
The alliance is structurally asymmetric: the Bond component (emotional connection) is lower than with a human therapist; the Goal and Task components (agreement on goals and methods) are comparable.
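
The Goal/Task versus Bond comparison rests on subscale means, which can be sketched as follows. This is a minimal illustration, not the scoring code used by Schäfer et al.: it assumes the 12-item WAI-SR with a 1 to 5 response scale and four items per subscale, and the item-to-subscale key in SUBSCALES is an assumption that should be checked against the published scoring key. The function name score_wai_sr and the example respondent are hypothetical.

```python
from statistics import mean

# ASSUMPTION: item-to-subscale key for the 12-item WAI-SR.
# Verify against the published scoring key before any real use.
SUBSCALES = {
    "goal": (4, 6, 8, 11),
    "task": (1, 2, 10, 12),
    "bond": (3, 5, 7, 9),
}


def score_wai_sr(responses: dict[int, int]) -> dict[str, float]:
    """Return the three subscale means and the overall mean for one respondent."""
    if set(responses) != set(range(1, 13)):
        raise ValueError("expected responses for items 1..12")
    if any(not 1 <= value <= 5 for value in responses.values()):
        raise ValueError("each response must be on the 1-5 scale")
    scores = {
        name: float(mean(responses[item] for item in items))
        for name, items in SUBSCALES.items()
    }
    scores["total"] = float(mean(responses.values()))
    return scores


# Hypothetical respondent: agrees on goals and tasks (4), weaker bond (3).
example = {item: 4 for item in range(1, 13)}
for bond_item in SUBSCALES["bond"]:
    example[bond_item] = 3
print(score_wai_sr(example))
```

Reporting the three subscales separately, rather than only the total, is what makes an asymmetry like the one described above visible at all.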

Translated: AI holds the structure of therapy well but builds trust more slowly. For a client whose primary need is structured weekly work (the kind a live therapist would describe as "good homework compliance"), AI competes credibly. For a client whose work is primarily relational (long-term grief, complex PTSD), the Bond gap makes AI the wrong starting point.

Verdict on role 2: AI builds enough alliance to deliver protocol work; not enough to be the relational vehicle for depth therapy.

Role 3 — Clinical judge: prognosis drift and uneven empathy

Two head-to-head designs against clinicians expose this role's weakness.

Elyoseph et al. (2024, Family Medicine and Community Health) compared four LLMs (ChatGPT-3.5, ChatGPT-4, Claude, Bard) against general practitioners, psychiatrists, clinical psychologists, psychiatric nurses, and the general public on prognosis. All four LLMs correctly identified depression and recommended psychotherapy plus antidepressants. But ChatGPT-3.5 was significantly more pessimistic than all other LLMs, professionals, and the public, predicting more negative long-term outcomes. The authors warn directly: an LLM's pessimistic prognosis can reduce a patient's motivation to start or continue therapy. ChatGPT-4, Claude, and Bard generally aligned with professional opinion — but the variance across "the LLM tier" is now a clinical variable in itself.

Gabriel et al. (2024), in the arXiv preprint "Can AI relate" (29 citations), asked whether an LLM is equally empathic to all groups of users. It is not. Empathy levels differed significantly across patient subgroups, and the appropriateness of responses against motivational interviewing principles needed improvement. For users from groups underrepresented in training data, the chatbot is statistically less empathic, a failure mode a live therapist regulates consciously and an LLM does not.

This is the cost of using general-purpose ChatGPT for mental-health work. De Choudhury et al. (2023, 63 citations) catalogued 12 categories of potential harm from LLMs in digital mental health support — most of them occurring at the boundary between "delivery of a technique" (role 1) and "clinical judgment" (role 3). Specialized systems close this gap with two layers: fine-tuning on balanced psychotherapy corpora (Mental-LLM, Xu et al., 2023, NPJ) and explicit guard rails (EmoAgent, Qiu et al., 2025; see our breakdown of guardrails for mental health).

Verdict on role 3: without specialized prompts, vetted protocols, and explicit safety layers, an LLM as clinical judge is negative-utility for vulnerable users. With them, it becomes triage-grade, not decision-grade.

Role 4 — Diagnostician and case owner: still mostly human

Obradovich et al. (2024) in NPP Digital Psychiatry and Neuroscience (56 citations) consolidated opportunities and risks of LLMs in psychiatry. The boundary they draw is the one most replicated across other reviews. AI cannot yet substitute for the clinician on:

Complex differential diagnosis and comorbidity. Differentiating the bipolar spectrum, PTSD, and personality disorders requires sustained observation and case context that a chatbot cannot reach in a single session.
Acute suicide risk and crisis escalation. Even specialized systems miss some crisis signals. The correct design is therefore a hard handoff protocol — to a hotline and a live clinician — rather than an attempt to "treat" through a crisis.
Long-term trauma work. Childhood trauma and complex PTSD require moment-to-moment regulation of the client's emotional state — non-verbal attunement, vocal pacing, pauses. AI systems cannot yet do this even in multimodal formats.
Clinical supervisory context. Decisions about pharmacotherapy, hospitalization, and family involvement remain a human's legal and clinical responsibility.
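
The "hard handoff" idea in the crisis item above can be sketched as a latch: once a crisis is flagged, the session stays on the handoff path and the chatbot does not resume technique delivery on its own. This is a minimal sketch under stated assumptions, not a description of any cited system. The names Mode, Session and next_mode are hypothetical, and crisis_flag is assumed to come from an upstream risk detector that is not shown here and that, as noted above, will miss some signals.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    TECHNIQUE = "technique"  # normal protocol delivery
    HANDOFF = "handoff"      # crisis path: hotline and live clinician


@dataclass
class Session:
    mode: Mode = Mode.TECHNIQUE


def next_mode(session: Session, crisis_flag: bool) -> Mode:
    """Latch into HANDOFF when a crisis is flagged.

    ASSUMPTION: crisis_flag is produced by a separate, hypothetical detector.
    The latch is one-way: later turns without a flag do not switch back.
    """
    if crisis_flag:
        session.mode = Mode.HANDOFF
    return session.mode


session = Session()
print(next_mode(session, crisis_flag=False).value)
print(next_mode(session, crisis_flag=True).value)
print(next_mode(session, crisis_flag=False).value)
```

Releasing a session back to technique delivery would be a separate, clinician-triggered action, which is deliberately left out of the sketch.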

Verdict on role 4: unchanged from a decade ago. The role boundary for AI is the case-level decision; everything below it is in play.

The role-by-role map

Role	What it requires	AI in 2024–2025	Where it breaks
Technique deliverer	Protocol fidelity, structured homework	Near-human on BA (Napiwotzki 2025) and CBT communication (Scholich 2025)	Resistance, atypical client framings (Song 2024)
Alliance partner	Working bond, validation	WAI = 3.76/5 on Clare (Schäfer 2025), Goal/Task components match humans	Lower Bond; relational depth therapy
Clinical judge	Risk assessment, motivational stability	Triage-grade with guard rails	Prognosis drift (Elyoseph 2024), uneven empathy (Gabriel 2024)
Diagnostician / case owner	Differential dx, escalation, longitudinal context	Not evaluated head-to-head against humans	Comorbidity, acute crisis, trauma, pharmacotherapy decisions

What this means in practice

"Can AI replace a therapist" is the wrong frame. Two of the four roles already have a credible AI substitute (technique delivery, parts of alliance). One is triage-only with guard rails (clinical judge). One remains the live clinician's domain (diagnostician and case owner).

A coherent stepped-care design therefore reads:

First step: AI handles routine CBT protocol delivery and between-session support, relying on an alliance that is sufficient for protocol work.
Second step: the live clinician owns differential diagnosis, crisis escalation, long-term trauma work, and pharmacotherapy.
Boundary: AI must surface clear escalation triggers without trying to "treat through" them.
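
The three lines above can be read as a routing rule, sketched here in Python. It is an illustration only: the need categories are hypothetical labels taken from the wording of the two steps, the names Owner and route are invented for the example, and sending unknown needs to the clinician by default is an assumption of this sketch rather than something the cited studies prescribe.

```python
from enum import Enum


class Owner(Enum):
    AI = "ai"
    CLINICIAN = "clinician"


# Hypothetical need categories, named after the two steps above.
AI_ELIGIBLE = {"cbt_protocol_delivery", "between_session_support"}
CLINICIAN_ONLY = {
    "differential_diagnosis",
    "crisis_escalation",
    "long_term_trauma_work",
    "pharmacotherapy",
}


def route(need: str, escalation_trigger: bool = False) -> Owner:
    """Decide who owns a request under the stepped-care split."""
    # Boundary rule: an escalation trigger overrides everything else.
    if escalation_trigger or need in CLINICIAN_ONLY:
        return Owner.CLINICIAN
    if need in AI_ELIGIBLE:
        return Owner.AI
    # ASSUMPTION: anything unrecognized goes to the human by default.
    return Owner.CLINICIAN


print(route("cbt_protocol_delivery").value)
print(route("cbt_protocol_delivery", escalation_trigger=True).value)
print(route("pharmacotherapy").value)
```

Checking the escalation trigger before the need category means a routine request is still routed to the clinician whenever a trigger is present.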

Nearby is designed around exactly this role map: CBT protocols for role 1, structured profiling that builds Goal/Task alliance for role 2, multi-agent architecture with separate agents for technique and safety to keep role 3 honest, and explicit handoff for role 4.

Frequently asked questions

In which of the therapist's roles can AI replace a human?

AI in 2024–2025 reaches near-human performance on technique delivery (Napiwotzki 2025 for behavioral activation; Scholich 2025 for therapeutic communication) and on the Goal/Task components of the working alliance (Schäfer 2025, WAI-SR = 3.76/5 on Clare, 527 users). Two roles remain out of reach: clinical judgment (Elyoseph 2024 shows prognosis drift; Gabriel 2024 shows uneven empathy across subgroups) and case ownership including differential diagnosis and crisis escalation (Obradovich 2024; Omar 2024).

What does "head-to-head AI vs. therapist" actually mean methodologically?

Two 2025 designs compared chatbots and live therapists on identical tasks: Napiwotzki et al. (JMIR Formative Research) on behavioral activation, and Scholich et al. (JMIR Mental Health) on therapeutic communication using mixed methods. Both isolate protocol fidelity and empathic responding as the comparison axes. Both find AI competitive on those axes, with the gap opening up around resistance and ambiguous client framings.

Why is alliance with AI lower on the Bond component than on Goal and Task?

Bond captures emotional connection; Goal and Task capture agreement on what to work on and how. AI matches humans on Goal/Task because protocol agreement is verbal and structured. AI lags on Bond because emotional connection accumulates through non-verbal attunement, vocal pacing, and inferred subtext that an LLM does not produce reliably. The asymmetry is structural, not a question of model size.

Can general-purpose ChatGPT serve as a therapist?

No. Elyoseph et al. (2024) found ChatGPT-3.5 systematically more pessimistic about prognosis than clinicians and the general public, a distortion that can reduce a client's motivation to start or continue therapy. De Choudhury et al. (2023) catalogued 12 categories of potential harm from general-purpose LLMs in mental-health contexts. Triage-grade safety requires specialized prompts, vetted protocols, and explicit guard rails (see our articles on prompt engineering for mental-health chatbots and on guardrails for mental health).

When is a live clinician strictly necessary instead of AI?

AI is unacceptable as the primary actor in four zones: complex differential diagnosis (bipolar spectrum, PTSD, personality disorders), acute suicide risk and crisis, long-term trauma work requiring moment-to-moment regulation, and decisions about pharmacotherapy or hospitalization (Obradovich et al., 2024; Omar et al., 2024). In these cases AI must hand the user off to a live clinician via a hard protocol rather than attempt to "treat through" the case.

References

De Choudhury, M., Pendse, S. R., & Kumar, N. (2023). Benefits and harms of large language models in digital mental health. ArXiv. https://doi.org/10.48550/arxiv.2311.14693

Du, Q., Ren, Y., Meng, Z., He, H., & Meng, S. (2025). The efficacy of rule-based versus large language model–based chatbots in alleviating symptoms of depression and anxiety: Systematic review and meta-analysis. Journal of Medical Internet Research.

Elyoseph, Z., Levkovich, I., & Shinan-Altman, S. (2024). Assessing prognosis in depression: Comparing perspectives of AI models, mental health professionals and the general public. Family Medicine and Community Health.

Gabriel, S., Puri, I., Xu, X., Malgaroli, M., & Ghassemi, M. (2024). Can AI relate: Testing large language model response for mental health support. ArXiv. https://doi.org/10.48550/arxiv.2405.12021

Li, H., Zhang, R., Lee, Y.-C., Kraut, R. E., & Mohr, D. C. (2023). Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digital Medicine, 6(1), 236. https://doi.org/10.1038/s41746-023-00979-5

Napiwotzki, F. et al. (2025). Comparing human and AI therapists in behavioral activation for depression. JMIR Formative Research. https://doi.org/10.2196/78138

Obradovich, N., Khalsa, S., Khan, W. U., Suh, J., Perlis, R. H., Ajilore, O., & Paulus, M. P. (2024). Opportunities and risks of large language models in psychiatry. NPP Digital Psychiatry and Neuroscience. https://doi.org/10.1038/s44277-024-00010-z

Omar, M., Soffer, S., Charney, A. W., Landi, I., Nadkarni, G. N., & Klang, E. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry. https://doi.org/10.3389/fpsyt.2024.1422807

Schäfer, S. K. et al. (2025). User characteristics, motives, and therapeutic alliance in mental health conversational AI Clare. Frontiers in Digital Health. https://doi.org/10.3389/fdgth.2025.1576135

Scholich, T. et al. (2025). Comparison of human therapists and LLM chatbots for therapeutic communication: Mixed methods study. JMIR Mental Health. https://doi.org/10.2196/69709

Sharma, A. et al. (2023). Human-centered evaluation of generative AI-based therapy chatbot. NEJM AI, 1(2). https://doi.org/10.1056/AIoa2300127

Song, I., Pendse, S. R., Kumar, N., & De Choudhury, M. (2024). The typing cure: Experiences with large language model chatbots for mental health support. Proceedings of the ACM on Human-Computer Interaction. https://doi.org/10.1145/3757430