J Affect Disord. 2025 Oct 10:120385. doi: 10.1016/j.jad.2025.120385. Online ahead of print.
ABSTRACT
BACKGROUND: Depression is a prevalent and debilitating mental health disorder, and early detection is critical for effective intervention and treatment. Traditional unimodal approaches often fail to capture the complex interplay between linguistic and acoustic cues indicative of depressive symptoms. Multimodal learning, integrating both textual and auditory features, has shown promise in improving detection accuracy. However, optimizing cross-modal interactions and effectively leveraging high-order feature representations remain key challenges.
METHODS: This study proposes a multimodal depression detection model (BCMA-MBF) that integrates Bidirectional Cross-Modal Attention (BCMA), Multi-Task Learning and Bilinear Fusion (MBF) to enhance feature interactions and classification performance. Textual features are extracted using BERT, while acoustic features combine Mel spectrogram (prosodic and spectral characteristics) and Chroma representations (harmonic and musical structures), with an audio filtering module refining speech signals. BCMA facilitates dynamic cross-modal attention between text and speech, while bilinear fusion explicitly models high-order feature interactions. The MBF framework jointly optimizes unimodal and multimodal classification tasks, improving feature learning and generalization.
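For readers who want a concrete picture of the fusion mechanism described in the METHODS, the following PyTorch-style sketch illustrates one plausible wiring of bidirectional cross-modal attention, bilinear fusion, and multi-task classification heads. It is not the authors' implementation: the module names, feature dimensions, pooling choices, and loss weights are illustrative assumptions.

    # Minimal sketch (assumed architecture, not the authors' code) of BCMA-style
    # cross-modal attention and bilinear fusion with multi-task heads.
    import torch
    import torch.nn as nn

    class BidirectionalCrossModalAttention(nn.Module):
        """Text attends to audio and audio attends to text, in both directions."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, text, audio):
            # text: (B, T_t, dim), e.g. projected BERT token embeddings
            # audio: (B, T_a, dim), e.g. projected Mel + Chroma frame features
            text_ctx, _ = self.text_to_audio(query=text, key=audio, value=audio)
            audio_ctx, _ = self.audio_to_text(query=audio, key=text, value=text)
            # mean-pool over time to obtain one vector per modality (assumed pooling)
            return text_ctx.mean(dim=1), audio_ctx.mean(dim=1)

    class BilinearFusionMultiTask(nn.Module):
        """Bilinear layer models high-order text-audio interactions; unimodal heads
        provide the auxiliary tasks for multi-task learning."""
        def __init__(self, dim=256, n_classes=2):
            super().__init__()
            self.bilinear = nn.Bilinear(dim, dim, dim)
            self.fusion_head = nn.Linear(dim, n_classes)  # multimodal task
            self.text_head = nn.Linear(dim, n_classes)    # auxiliary text-only task
            self.audio_head = nn.Linear(dim, n_classes)   # auxiliary audio-only task

        def forward(self, text_vec, audio_vec):
            fused = torch.relu(self.bilinear(text_vec, audio_vec))
            return (self.fusion_head(fused),
                    self.text_head(text_vec),
                    self.audio_head(audio_vec))

    # Multi-task objective: weighted sum of fused and unimodal losses
    # (the 0.3 weights are assumptions for illustration only), e.g.
    # loss = ce(fused_logits, y) + 0.3 * ce(text_logits, y) + 0.3 * ce(audio_logits, y)

Under these assumptions, jointly minimizing the fused and unimodal losses encourages each encoder to remain individually discriminative while the bilinear layer captures pairwise interactions between the pooled text and audio representations.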
RESULTS: Across two corpora and languages, our models consistently outperform recent baselines. On the EATD-Corpus (Chinese) dataset, the proposed BCMA-MBF fusion model attains F1 = 0.94, exceeding prior multimodal systems; the unimodal text (BMM) and audio (MC, Mel+Chroma) models reach F1 = 0.93 and 0.88, respectively, both surpassing the best published unimodal results. On the DAIC-WOZ (English) dataset, BCMA-MBF achieves F1 = 0.95, while BMM and MC each obtain F1 = 0.94, improving over strong state-of-the-art comparators. Ablations confirm that each component (BCMA attention, MBF interaction, and Chroma features) contributes to the gains. Inference profiling shows sub-10 ms per-sample latency with 1.0-1.2 GB GPU memory on an RTX 3090, indicating practical deployability. Collectively, these results demonstrate effective multimodal and unimodal learning and cross-lingual adaptability for depression detection.
LIMITATIONS: Future research should explore larger datasets, cross-lingual generalization, and real-world clinical applications to further assess the model’s robustness and adaptability. Incorporating additional acoustic features and conversational context may enhance detection accuracy.
CONCLUSIONS: This study underscores the importance of cross-modal attention, multi-task learning, and bilinear fusion in multimodal depression detection. By effectively integrating text and speech cues, along with meaningful musical features, the proposed model enhances predictive performance. The result is a robust and scalable framework for intelligent speech- and text-based mental health assessment, contributing to advancements in automated depression diagnosis.
PMID:41077153 | DOI:10.1016/j.jad.2025.120385