| Task | Model | Accuracy | Weighted F1 | AUC |
|---|---|---|---|---|
| Depression | TEXT (XGB) | 0.862 | 0.860 | 0.964 |
| Depression | Fusion ALL (XGB) | 0.852 | 0.850 | 0.961 |
| PTSD | TEXT (XGB) | 0.862 | 0.861 | 0.971 |
| PTSD | Fusion ALL (XGB) | 0.854 | 0.854 | 0.970 |
Depression and PTSD frequently co-occur, complicating automated assessment when framed as binary, disorder-specific tasks. We present a unified tri-modal severity framework that synchronizes sentence-level transformer text embeddings, log-Mel audio statistics with deltas, and OpenFace-based facial action units, gaze, and head-pose descriptors to output graded severities for depression (PHQ-8; five classes) and PTSD (three classes). A calibrated late-fusion classifier produces disorder-specific probabilities that support decision-curve analysis and robustness to missing/noisy modalities. Errors concentrate between adjacent severities; extremes are reliably identified. SHAP attributions align with clinical markers: linguistic cues dominate depression, audio–facial cues strengthen PTSD.
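The calibrated late-fusion step can be sketched as a weighted average of per-modality class probabilities, which also makes the graceful handling of missing streams concrete (an absent modality is simply omitted). The weights and toy numbers below are illustrative, not the tuned values from the experiments:

```python
import numpy as np

def late_fusion(probs_by_modality, weights=None):
    """Weighted average of per-modality class-probability matrices.

    probs_by_modality: list of (n_samples, n_classes) arrays, one per
    available modality (text, audio, face); a missing or rejected
    stream is simply left out of the list, which is how the fusion
    degrades gracefully.
    """
    if weights is None:
        weights = [1.0] * len(probs_by_modality)
    stacked = np.stack(probs_by_modality)             # (n_mod, n, k)
    w = np.asarray(weights, dtype=float)[:, None, None]
    fused = (w * stacked).sum(axis=0) / w.sum()
    return fused / fused.sum(axis=1, keepdims=True)   # renormalize

# Toy example: two modalities, two samples, three PTSD severity classes
text = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
audio = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])
fused = late_fusion([text, audio], weights=[2.0, 1.0])
```

Calibration (e.g. Platt or isotonic scaling of each head) would be applied to the unimodal probabilities before this averaging step.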

Vertical bands = voiced speech; troughs = pauses and low energy.
We summarize speech into time–frequency statistics (means/SDs over Mel bins and Δ) that capture tempo, spectral slope, and stability—signals linked to affect and arousal. This compact 256-D audio embedding complements text and facial cues in fusion.
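Under the stated design (means/SDs over Mel bins and Δ), a 64-bin log-Mel spectrogram yields exactly 256 dimensions: 64 bins × {mean, SD} × {static, delta}. The 64-bin choice is an assumption made here so the arithmetic works out; a minimal numpy sketch, starting from an already-computed log-Mel matrix:

```python
import numpy as np

def audio_embedding(log_mel):
    """Summarize a log-Mel spectrogram (n_mels x n_frames) into a
    fixed-length vector: per-bin means and SDs of the static
    coefficients and of their frame-to-frame deltas.
    With n_mels == 64 this gives 64 * 2 * 2 = 256 dimensions."""
    delta = np.diff(log_mel, axis=1)        # temporal deltas
    feats = [log_mel.mean(axis=1), log_mel.std(axis=1),
             delta.mean(axis=1), delta.std(axis=1)]
    return np.concatenate(feats)

# 64 Mel bins, 300 frames of synthetic data standing in for real audio
emb = audio_embedding(np.random.default_rng(0).normal(size=(64, 300)))
```

The delta statistics are what capture tempo and stability; the static means/SDs capture spectral slope and overall energy distribution.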

Text separates severities; audio/face stabilize near boundaries.
Linguistic markers dominate PHQ-8 grading. Audio adds robustness on short, disfluent turns, while micro-expressive facial AUs help around Mild↔Moderate boundaries—where presentations are mixed and ambiguity is highest.

Text+Face strongest; audio–facial cues boost Severe detection.
For PTSD, gaze/head-pose regularity and AU clusters, together with lexical content, strengthen separation—especially for Severe—where text alone saturates.

Extremes are easy; adjacent severities confuse more.
Minimal and Severe are reliably identified. Confusions are local (Mild↔Moderate, Moderate↔Mod-Severe), consistent with clinical ambiguity and overlapping symptom profiles.

Moderate stays challenging; Severe is sharper with fusion.
Fusion preserves sensitivity in Moderate while sharpening Severe—evidence that non-lexical arousal cues (prosody, micro-stability) complement text in PTSD grading.

Higher net benefit across clinically relevant thresholds.
Calibrated late-fusion probabilities yield higher net benefit than unimodal baselines, suggesting fewer unnecessary escalations at similar hit rates in stepped-care workflows.
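Net benefit at a probability threshold t is the standard decision-curve quantity TP/n − (FP/n)·t/(1−t); a small sketch for a binary "escalate care" decision, with toy labels rather than the study's data:

```python
import numpy as np

def net_benefit(y_true, p, threshold):
    """Decision-curve net benefit at threshold t:
    TP/n - FP/n * t / (1 - t), where predictions are p >= t.
    y_true in {0, 1}; p is the predicted probability of the
    positive ('escalate care') class."""
    y_true = np.asarray(y_true)
    pred = np.asarray(p) >= threshold
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# Sanity check: a perfect classifier's net benefit equals prevalence
y = np.array([1, 1, 0, 0, 0])
perfect = net_benefit(y, y.astype(float), 0.5)
```

Sweeping `threshold` over the clinically relevant range and plotting against the treat-all and treat-none baselines reproduces a decision curve.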

Errors cluster locally between adjacent PHQ-8 tiers.
Locality of errors is desirable for triage: near-misses often map to similar care pathways. Fusion reduces off-diagonal, non-local mistakes compared to unimodal heads.

Severe separates; Mild vs Moderate retains overlap.
Overlap in low–mid tiers reflects neutral lexical content and short utterances; Severe benefits from audio–facial arousal patterns and posture/gaze regularity.

Removing text collapses mid-tier precision.
Audio alone underfits PHQ-8 mid-tiers, motivating late-fusion calibration and disorder-specific heads that can gracefully degrade when streams are noisy/missing.

High AUC with balanced trade-offs across classes.
Complementary cues lift discrimination near borderline classes, aligning with PR/decision-curve gains and narrow CIs under stratified CV.

Robust separation; slightly flatter than PHQ-8.
Consistent AUC across folds; lingering overlap tracks the Mild↔Moderate confusion pattern observed in the matrix and embeddings.

Language cues dominate; audio/face refine mid-tier splits.
SHAP aligns with clinical theory: negative affect frames, self-referential style, and reduced lexical diversity drive PHQ-8; prosodic stability and AUs help Moderate vs Mod-Severe.
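SHAP attributions require the fitted tree ensembles; as a model-agnostic stand-in that illustrates the same "which features drive this head" question, permutation importance can be sketched as follows (the scorer and data here are toys, purely illustrative):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=10, seed=0):
    """Mean drop in score when each feature column is shuffled;
    larger drops mean the model leans harder on that feature (a
    coarse, model-agnostic cousin of SHAP's per-feature values)."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    imps = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])           # break column j only
            drops.append(base - score_fn(Xp, y))
        imps[j] = np.mean(drops)
    return imps

# Toy "model" whose prediction depends only on feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
acc = lambda X, y: np.mean((X[:, 0] > 0).astype(int) == y)
imp = permutation_importance(acc, X, y)
```

Grouping feature columns by modality before shuffling gives the per-modality importances discussed above.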

Prosodic bursts & head-pose micro-stability are salient.
Startle-like prosodic activity and stable head-pose/gaze increase importance for Severe PTSD, complementing lexical markers when text saturates.

Coarse gradients with mid-tier overlap.
Linear projections separate extremes; mid-tiers intermix—mirroring per-class F1 and indicating nuanced, mixed lexical–prosodic profiles.
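The "linear projection" view is typically PCA; a minimal SVD-based sketch, with random vectors standing in for the real multimodal features:

```python
import numpy as np

def pca_project(X, k=2):
    """Project feature vectors onto their top-k principal components
    via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# 50 placeholder 10-D feature vectors -> 2-D coordinates for plotting
coords = pca_project(np.random.default_rng(0).normal(size=(50, 10)))
```

Coloring the 2-D coordinates by severity tier reproduces the gradient-with-overlap picture described here; the non-linear neighborhoods below would come from t-SNE or UMAP instead.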

Dense Severe clusters; diffuse Mild regions.
Non-linear neighborhoods expose compact Severe structure and spread-out Mild instances, matching the confusion pattern and modality analysis above.
Read the full manuscript on arXiv.