| Task | Model | Accuracy | Weighted F1 | AUC |
|---|---|---|---|---|
| Depression | TEXT (XGB) | 0.862 | 0.860 | 0.964 |
| Depression | Fusion ALL (XGB) | 0.852 | 0.850 | 0.961 |
| PTSD | TEXT (XGB) | 0.862 | 0.861 | 0.971 |
| PTSD | Fusion ALL (XGB) | 0.854 | 0.854 | 0.970 |
Depression and PTSD frequently co-occur, complicating automated assessment when framed as binary, disorder-specific tasks. We present a unified tri-modal severity framework that synchronizes sentence-level transformer text embeddings, log-Mel audio statistics with deltas, and OpenFace-based facial action units, gaze, and head-pose descriptors to output graded severities for depression (PHQ-8; five classes) and PTSD (three classes). A calibrated late-fusion classifier produces disorder-specific probabilities that support decision-curve analysis and robustness to missing/noisy modalities. Errors concentrate between adjacent severities; extremes are reliably identified. SHAP attributions align with clinical markers: linguistic cues dominate depression, audio–facial cues strengthen PTSD.
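The calibrated late-fusion step can be sketched as a weighted average of per-modality class probabilities, which also makes the graceful handling of missing streams concrete (an absent modality is simply omitted). The weights and toy numbers below are illustrative, not the tuned values from the experiments:

```python
import numpy as np

def late_fusion(probs_by_modality, weights=None):
    """Weighted average of per-modality class-probability matrices.

    probs_by_modality: list of (n_samples, n_classes) arrays, one per
    available modality (text, audio, face); a missing or rejected
    stream is simply left out of the list, which is how the fusion
    degrades gracefully.
    """
    if weights is None:
        weights = [1.0] * len(probs_by_modality)
    stacked = np.stack(probs_by_modality)             # (n_mod, n, k)
    w = np.asarray(weights, dtype=float)[:, None, None]
    fused = (w * stacked).sum(axis=0) / w.sum()
    return fused / fused.sum(axis=1, keepdims=True)   # renormalize

# Toy example: two modalities, two samples, three PTSD severity classes
text = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
audio = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])
fused = late_fusion([text, audio], weights=[2.0, 1.0])
```

Calibration (e.g. Platt or isotonic scaling of each head) would be applied to the unimodal probabilities before this averaging step.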

Vertical bands = voiced speech; troughs = pauses and low energy.
We summarize speech into time–frequency statistics (means/SDs over Mel bins and Δ) that capture tempo, spectral slope, and stability—signals linked to affect and arousal. This compact 256-D audio embedding complements text and facial cues in fusion.
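Under the stated design (means/SDs over Mel bins and Δ), a 64-bin log-Mel spectrogram yields exactly 256 dimensions: 64 bins × {mean, SD} × {static, delta}. The 64-bin choice is an assumption made here so the arithmetic works out; a minimal numpy sketch, starting from an already-computed log-Mel matrix:

```python
import numpy as np

def audio_embedding(log_mel):
    """Summarize a log-Mel spectrogram (n_mels x n_frames) into a
    fixed-length vector: per-bin means and SDs of the static
    coefficients and of their frame-to-frame deltas.
    With n_mels == 64 this gives 64 * 2 * 2 = 256 dimensions."""
    delta = np.diff(log_mel, axis=1)        # temporal deltas
    feats = [log_mel.mean(axis=1), log_mel.std(axis=1),
             delta.mean(axis=1), delta.std(axis=1)]
    return np.concatenate(feats)

# 64 Mel bins, 300 frames of synthetic data standing in for real audio
emb = audio_embedding(np.random.default_rng(0).normal(size=(64, 300)))
```

The delta statistics are what capture tempo and stability; the static means/SDs capture spectral slope and overall energy distribution.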

Text separates severities; audio/face stabilize near boundaries.
Linguistic markers dominate PHQ-8 grading. Audio adds robustness on short, disfluent turns, while micro-expressive facial AUs help around Mild↔Moderate boundaries—where presentations are mixed and ambiguity is highest.

Text+Face strongest; audio–facial cues boost Severe detection.
For PTSD, gaze/head-pose regularity and AU clusters, together with lexical content, strengthen separation—especially for Severe—where text alone saturates.

Extremes are easy; adjacent severities confuse more.
Minimal and Severe are reliably identified. Confusions are local (Mild↔Moderate, Moderate↔Mod-Severe), consistent with clinical ambiguity and overlapping symptom profiles.

Moderate stays challenging; Severe is sharper with fusion.
Fusion preserves sensitivity in Moderate while sharpening Severe—evidence that non-lexical arousal cues (prosody, micro-stability) complement text in PTSD grading.

Higher net benefit across clinically relevant thresholds.
Calibrated late-fusion probabilities yield higher net benefit than unimodal baselines, suggesting fewer unnecessary escalations at similar hit rates in stepped-care workflows.
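Net benefit at a probability threshold t is the standard decision-curve quantity TP/n − (FP/n)·t/(1−t); a small sketch for a binary "escalate care" decision, with toy labels rather than the study's data:

```python
import numpy as np

def net_benefit(y_true, p, threshold):
    """Decision-curve net benefit at threshold t:
    TP/n - FP/n * t / (1 - t), where predictions are p >= t.
    y_true in {0, 1}; p is the predicted probability of the
    positive ('escalate care') class."""
    y_true = np.asarray(y_true)
    pred = np.asarray(p) >= threshold
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# Sanity check: a perfect classifier's net benefit equals prevalence
y = np.array([1, 1, 0, 0, 0])
perfect = net_benefit(y, y.astype(float), 0.5)
```

Sweeping `threshold` over the clinically relevant range and plotting against the treat-all and treat-none baselines reproduces a decision curve.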

Errors cluster locally between adjacent PHQ-8 tiers.
Locality of errors is desirable for triage: near-misses often map to similar care pathways. Fusion reduces off-diagonal, non-local mistakes compared to unimodal heads.

Severe separates; Mild vs Moderate retains overlap.
Overlap in low–mid tiers reflects neutral lexical content and short utterances; Severe benefits from audio–facial arousal patterns and posture/gaze regularity.

Removing text collapses mid-tier precision.
Audio alone underfits PHQ-8 mid-tiers, motivating late-fusion calibration and disorder-specific heads that can gracefully degrade when streams are noisy/missing.

High AUC with balanced trade-offs across classes.
Complementary cues lift discrimination near borderline classes, aligning with PR/decision-curve gains and narrow CIs under stratified CV.

Robust separation; slightly flatter than PHQ-8.
Consistent AUC across folds; lingering overlap tracks the Mild↔Moderate confusion pattern observed in the matrix and embeddings.

Language cues dominate; audio/face refine mid-tier splits.
SHAP aligns with clinical theory: negative affect frames, self-referential style, and reduced lexical diversity drive PHQ-8; prosodic stability and AUs help Moderate vs Mod-Severe.
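SHAP attributions require the fitted tree ensembles; as a model-agnostic stand-in that illustrates the same "which features drive this head" question, permutation importance can be sketched as follows (the scorer and data here are toys, purely illustrative):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=10, seed=0):
    """Mean drop in score when each feature column is shuffled;
    larger drops mean the model leans harder on that feature (a
    coarse, model-agnostic cousin of SHAP's per-feature values)."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    imps = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])           # break column j only
            drops.append(base - score_fn(Xp, y))
        imps[j] = np.mean(drops)
    return imps

# Toy "model" whose prediction depends only on feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
acc = lambda X, y: np.mean((X[:, 0] > 0).astype(int) == y)
imp = permutation_importance(acc, X, y)
```

Grouping feature columns by modality before shuffling gives the per-modality importances discussed above.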

Prosodic bursts & head-pose micro-stability are salient.
Startle-like prosodic activity and stable head-pose/gaze increase importance for Severe PTSD, complementing lexical markers when text saturates.

Coarse gradients with mid-tier overlap.
Linear projections separate extremes; mid-tiers intermix—mirroring per-class F1 and indicating nuanced, mixed lexical–prosodic profiles.
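The "linear projection" view is typically PCA; a minimal SVD-based sketch, with random vectors standing in for the real multimodal features:

```python
import numpy as np

def pca_project(X, k=2):
    """Project feature vectors onto their top-k principal components
    via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# 50 placeholder 10-D feature vectors -> 2-D coordinates for plotting
coords = pca_project(np.random.default_rng(0).normal(size=(50, 10)))
```

Coloring the 2-D coordinates by severity tier reproduces the gradient-with-overlap picture described here; the non-linear neighborhoods below would come from t-SNE or UMAP instead.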

Dense Severe clusters; diffuse Mild regions.
Non-linear neighborhoods expose compact Severe structure and spread-out Mild instances, matching the confusion pattern and modality analysis above.
Read the full manuscript on arXiv.