Tri-Modal Severity Fused Diagnosis across Depression & PTSD

Macquarie University
Architecture overview: text 768-D, audio 256-D, face 512-D fused to 1,536-D with calibrated late fusion.

Abstract

Depression and PTSD frequently co-occur, which complicates automated assessment when it is framed as separate binary, disorder-specific tasks. We present a unified tri-modal severity framework that synchronizes sentence-level transformer text embeddings, log-Mel audio statistics with deltas, and OpenFace-based facial action units, gaze, and head-pose descriptors to output graded severities for depression (PHQ-8; five classes) and PTSD (three classes). A calibrated late-fusion classifier produces disorder-specific probabilities that support decision-curve analysis and remain robust when modalities are missing or noisy. Errors concentrate between adjacent severities, while the extremes are reliably identified. SHAP attributions align with clinical markers: linguistic cues dominate depression, and audio–facial cues strengthen PTSD.
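To make the five depression classes concrete, here is a minimal sketch that maps a PHQ-8 total score to a severity class using the standard published PHQ-8 bands. The class names follow this page; the assumption is that the framework uses exactly these cutoffs, which the page does not state. The function name `phq8_to_class` is hypothetical.

```python
# Standard PHQ-8 severity bands (total score 0-24).
# Assumption: the framework's five classes use these cutoffs.
PHQ8_BANDS = [
    (0, 4, "Minimal"),
    (5, 9, "Mild"),
    (10, 14, "Moderate"),
    (15, 19, "Mod-Severe"),
    (20, 24, "Severe"),
]


def phq8_to_class(score: int) -> tuple[int, str]:
    """Return (class index, label) for a PHQ-8 total score."""
    if not 0 <= score <= 24:
        raise ValueError(f"PHQ-8 total must be in 0..24, got {score}")
    for idx, (lo, hi, label) in enumerate(PHQ8_BANDS):
        if lo <= score <= hi:
            return idx, label
    raise AssertionError("unreachable: bands cover 0..24")
```

The PTSD bands are not sketched here because the page does not give the cutoffs used for its three classes.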

Figure Walkthrough

Representative log-Mel spectrogram

Signal view — Audio (Mel)

Vertical bands = voiced speech; troughs = pauses and low energy.

What this shows

We summarize speech into time–frequency statistics (means/SDs over Mel bins and Δ) that capture tempo, spectral slope, and stability—signals linked to affect and arousal. This compact 256-D audio embedding complements text and facial cues in fusion.
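As a rough illustration of the pooling described above, the sketch below turns a log-Mel matrix into per-bin means and SDs for the static track and for a first-order delta track. Two assumptions are ours, not the page's: 64 Mel bins (so 4 × 64 = 256 values) and a plain frame-to-frame difference as the delta. The function name `audio_embedding` is hypothetical.

```python
from statistics import fmean, pstdev


def audio_embedding(log_mel: list[list[float]]) -> list[float]:
    """Pool a log-Mel matrix into summary statistics.

    log_mel[b][t] is the log-Mel energy of bin b at frame t.
    Returns [mean, SD] per bin for the static track, followed by
    [mean, SD] per bin for the delta track: 4 * n_mels values.
    """
    static, delta = [], []
    for band in log_mel:
        if len(band) < 2:
            raise ValueError("each Mel bin needs at least two frames")
        # Assumption: delta = simple difference between consecutive frames.
        diffs = [b - a for a, b in zip(band, band[1:])]
        static += [fmean(band), pstdev(band)]
        delta += [fmean(diffs), pstdev(diffs)]
    return static + delta


# Synthetic input (not real speech): 64 bins, 10 frames each.
toy = [[float(t + b) for t in range(10)] for b in range(64)]
embedding = audio_embedding(toy)  # 256 values under the 64-bin assumption
```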

Modality contributions in Depression

Modality roles — Depression

Text separates severities; audio/face stabilize near boundaries.

What this shows

Linguistic markers dominate PHQ-8 grading. Audio adds robustness on short, disfluent turns, while micro-expressive facial AUs help around Mild↔Moderate boundaries—where presentations are mixed and ambiguity is highest.

Modality contributions in PTSD

Modality roles — PTSD

Text+Face strongest; audio–facial cues boost Severe detection.

What this shows

For PTSD, gaze/head-pose regularity and AU clusters, together with lexical content, strengthen separation—especially for Severe—where text alone saturates.

Per-class F1 (Depression)

Per-class F1 — Depression

Extremes are easy; adjacent severities confuse more.

What this shows

Minimal and Severe are reliably identified. Confusions are local (Mild↔Moderate, Moderate↔Mod-Severe), consistent with clinical ambiguity and overlapping symptom profiles.
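For readers who want to reproduce this kind of breakdown, here is a minimal sketch of per-class F1 and support-weighted F1 computed from a confusion matrix. The matrix in the example is made up for illustration and is not taken from the study; the function names are hypothetical.

```python
def per_class_f1(cm: list[list[int]]) -> list[float]:
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # true c, predicted elsewhere
        fp = sum(cm[r][c] for r in range(k)) - tp     # predicted c, true elsewhere
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def weighted_f1(cm: list[list[int]]) -> float:
    """Average of per-class F1 weighted by each class's support (row sum)."""
    supports = [sum(row) for row in cm]
    total = sum(supports)
    return sum(f * s for f, s in zip(per_class_f1(cm), supports)) / total


# Illustrative 3-class matrix (made-up counts).
toy_cm = [[8, 2, 0],
          [1, 6, 3],
          [0, 1, 9]]
f1_by_class = per_class_f1(toy_cm)
f1_weighted = weighted_f1(toy_cm)
```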

Per-class F1 (PTSD)

Per-class F1 — PTSD

Moderate stays challenging; Severe is sharper with fusion.

What this shows

Fusion preserves sensitivity in Moderate while sharpening Severe—evidence that non-lexical arousal cues (prosody, micro-stability) complement text in PTSD grading.

Decision curve (Depression)

Higher net benefit across clinically relevant thresholds.

What this shows

Calibrated late-fusion probabilities yield better net benefit than unimodal baselines, suggesting fewer unnecessary escalations for similar hit-rates in stepped-care workflows.
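Net benefit in decision-curve analysis has a simple closed form: true positives per case minus false positives per case, weighted by the odds of the threshold probability. The sketch below assumes the graded output has been binarized into an "escalate" decision (for example, Moderate or above); that binarization is our assumption, and the labels and probabilities in the example are made up.

```python
def net_benefit(y_true: list[int], p_pos: list[float], threshold: float) -> float:
    """Net benefit of escalating every case with p_pos >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pos) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pos) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: list[int], threshold: float) -> float:
    """Reference curve: escalate everyone regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Made-up example: four cases, two of them truly positive.
labels = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.1]
nb_model = net_benefit(labels, probs, threshold=0.2)
nb_all = net_benefit_treat_all(labels, threshold=0.2)
```

A decision curve is this quantity plotted over a range of thresholds for each model, alongside the treat-all and treat-none (zero) references.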

Confusion matrix (Depression)

Confusion — Depression

Errors cluster locally between adjacent PHQ-8 tiers.

What this shows

Locality of errors is desirable for triage: near-misses often map to similar care pathways. Fusion reduces off-diagonal, non-local mistakes compared to unimodal heads.
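One way to quantify locality is to split misclassifications into adjacent-tier errors (off by one severity level) and non-local errors (off by two or more). The sketch below assumes classes are indexed in severity order; the counts are made up, and `error_locality` is a hypothetical helper rather than a metric reported on this page.

```python
def error_locality(cm: list[list[int]]) -> tuple[float, float]:
    """Return (adjacent_share, non_local_share) of all misclassifications.

    cm[i][j] = count of true class i predicted as class j, with classes
    ordered by severity so that |i - j| is the size of the miss.
    """
    adjacent = non_local = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i == j:
                continue  # correct predictions are not errors
            if abs(i - j) == 1:
                adjacent += count
            else:
                non_local += count
    errors = adjacent + non_local
    if errors == 0:
        return 0.0, 0.0
    return adjacent / errors, non_local / errors


# Made-up 3-tier matrix: 7 adjacent errors, 1 non-local error.
toy_cm = [[8, 2, 0],
          [1, 6, 3],
          [1, 1, 8]]
adjacent_share, non_local_share = error_locality(toy_cm)
```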

Confusion matrix (PTSD)

Confusion — PTSD

Severe separates; Mild vs Moderate retains overlap.

What this shows

Overlap in low–mid tiers reflects neutral lexical content and short utterances; Severe benefits from audio–facial arousal patterns and posture/gaze regularity.

Unimodal failure (Audio-only)

Unimodal Failure — Audio

Removing text collapses mid-tier precision.

What this shows

Audio alone underfits PHQ-8 mid-tiers, motivating late-fusion calibration and disorder-specific heads that can gracefully degrade when streams are noisy/missing.
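The page does not spell out the fusion rule, so the following is only one plausible sketch of graceful degradation: a weighted average of per-modality class probabilities in which the weights are renormalized over whichever streams are present. The weights, modality keys, and function name `late_fuse` are all hypothetical.

```python
from typing import Optional


def late_fuse(
    probs: dict[str, Optional[list[float]]],
    weights: dict[str, float],
) -> list[float]:
    """Weighted average of per-modality probabilities, skipping missing streams.

    probs maps a modality name to its class-probability vector, or to None
    when that stream is missing. Weights are renormalized over the
    modalities that are actually available.
    """
    available = {m: p for m, p in probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality is required")
    total = sum(weights[m] for m in available)
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for m, p in available.items():
        w = weights[m] / total
        for c in range(n_classes):
            fused[c] += w * p[c]
    return fused


# Hypothetical weights; audio is missing for this sample.
weights = {"text": 0.5, "audio": 0.25, "face": 0.25}
sample = {"text": [0.2, 0.8], "audio": None, "face": [0.6, 0.4]}
fused = late_fuse(sample, weights)
```

Because each input vector sums to one and the renormalized weights sum to one, the fused vector is still a valid probability distribution when a stream drops out.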

ROC (Depression)

ROC — Depression

High AUC with balanced trade-offs across classes.

What this shows

Complementary cues lift discrimination near borderline classes, aligning with PR/decision-curve gains and narrow CIs under stratified CV.
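For a multi-class problem, per-class ROC curves are commonly computed one-vs-rest. The sketch below computes AUC as the probability that a positive case outranks a negative one, counting ties as half. How the reported AUC is aggregated across classes is not stated on this page, so this shows only the per-class building block; the function names and data are illustrative.

```python
def binary_auc(y_true: list[int], scores: list[float]) -> float:
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), by pairwise comparison."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(labels: list[int], probs: list[list[float]], cls: int) -> float:
    """AUC for class `cls` against all other classes."""
    y = [1 if label == cls else 0 for label in labels]
    s = [row[cls] for row in probs]
    return binary_auc(y, s)


# Made-up binary example: 3 of the 4 positive/negative pairs are ranked correctly.
auc = binary_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The pairwise form is quadratic in the number of cases, which is fine for a sketch but slower than a rank-based implementation on large folds.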

ROC (PTSD)

ROC — PTSD

Robust separation; slightly flatter than PHQ-8.

What this shows

Consistent AUC across folds; lingering overlap tracks the Mild↔Moderate confusion pattern observed in the matrix and embeddings.

SHAP summary (Depression)

Attribution — Depression

Language cues dominate; audio/face refine mid-tier splits.

What this shows

SHAP aligns with clinical theory: negative affect frames, self-referential style, and reduced lexical diversity drive PHQ-8; prosodic stability and AUs help Moderate vs Mod-Severe.
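A modality-level view of attributions can be obtained by summing mean absolute SHAP values over each modality's block of features. The sketch below assumes the 1,536-D vector is laid out as text (768), then audio (256), then face (512); that ordering is our assumption, and the SHAP values themselves would come from whatever explainer was run upstream. The names `MODALITY_SLICES` and `modality_importance` are hypothetical.

```python
# Assumed feature layout of the 1,536-D fused vector: text, audio, face.
MODALITY_SLICES = {
    "text": (0, 768),
    "audio": (768, 1024),
    "face": (1024, 1536),
}


def modality_importance(shap_rows: list[list[float]]) -> dict[str, float]:
    """Share of total mean |SHAP| carried by each modality.

    shap_rows holds one list of 1,536 SHAP values per sample.
    """
    n = len(shap_rows)
    totals = {}
    for name, (start, stop) in MODALITY_SLICES.items():
        # Mean over samples of the summed |SHAP| within this modality's block.
        totals[name] = sum(abs(v) for row in shap_rows for v in row[start:stop]) / n
    grand = sum(totals.values())
    return {name: (t / grand if grand else 0.0) for name, t in totals.items()}


# Synthetic attributions (not study data).
row = [1.0] * 768 + [-0.5] * 256 + [0.0] * 512
shares = modality_importance([row, row])
```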

SHAP summary (PTSD)

Attribution — PTSD

Prosodic bursts & head-pose micro-stability are salient.

What this shows

Startle-like prosodic activity and stable head-pose/gaze gain importance for Severe PTSD, complementing lexical markers when text saturates.

PCA embedding (Depression)

Embedding — PCA (DEP)

Coarse gradients with mid-tier overlap.

What this shows

Linear projections separate extremes; mid-tiers intermix—mirroring per-class F1 and indicating nuanced, mixed lexical–prosodic profiles.

t-SNE embedding (PTSD)

Embedding — t-SNE (PTSD)

Dense Severe clusters; diffuse Mild regions.

What this shows

Non-linear neighborhoods expose compact Severe structure and spread-out Mild instances, matching the confusion pattern and modality analysis above.

Selected Results

Depression · TEXT (XGB): ACC 0.862 · F1w 0.860 · AUC 0.964
PTSD · TEXT (XGB): ACC 0.862 · F1w 0.861 · AUC 0.971
Depression · Fusion (ALL): ACC 0.852 · F1w 0.850 · AUC 0.961
PTSD · Fusion (ALL): ACC 0.854 · F1w 0.854 · AUC 0.970

Severity, not diagnosis, drives the gains. Graded severity (PHQ-8 and PCL-5 bands) concentrates errors between adjacent tiers and boosts utility for triage decisions. Decision-curve net benefit stays higher across clinical thresholds.
Text carries depression; audio and face lift Severe PTSD. Linguistic markers dominate PHQ-8 separation, while prosodic bursts and head-pose or gaze regularity sharpen Severe PTSD where text saturates.
Fusion is robust to missing or noisy streams. Removing TEXT is the only ablation that materially drops ACC and F1. Calibrated late fusion preserves severe-case sensitivity when audio or face degrade.

Cross-validated Performance (full table)

Task        Model             ACC    F1w    AUC
Depression  TEXT (XGB)        0.862  0.860  0.964
Depression  Fusion ALL (XGB)  0.852  0.850  0.961
PTSD        TEXT (XGB)        0.862  0.861  0.971
PTSD        Fusion ALL (XGB)  0.854  0.854  0.970

Full Manuscript

Read the manuscript on arXiv and cite it using the BibTeX below.

BibTeX