Small Language Model Fusion Network for Multimodal Affective Computing
Keywords:
Affective Computing, Sentiment Analysis, Multimodal Fusion, Small Language Models

Abstract
Conventional affective computing models aim to identify sentiments, emotions, hate speech, and fake news in written text, while multimodal affective computing models identify emotions, sentiments, and opinions expressed in multimodal data such as images with captions, memes, videos, audio, emojis, text, and physiological signals. In a multimodal setup, other modalities such as speech and visuals accompany the text modality. With the advent of language foundation models such as ChatGPT and small language models such as Phi-3 Mini, Llama, and Gemini, the potential of these models can be harnessed for affective computing tasks. This study focuses on one specific multimodal affective computing task, sentiment analysis, using the capabilities of small language models. The proposed approach employs two subnetworks whose outputs are fused to obtain a more comprehensive understanding of the associated sentiment: a language subnetwork that uses a small language model as its base model, and an audio-visual subnetwork. To validate the proposed framework, named the small language model fusion network (SLMFN), extensive experiments are performed on two benchmark multimodal datasets, CMU-MOSI and CMU-MOSEI. This study offers insights into the practical applications of small language models by fine-tuning them for language-specific tasks, advancing sentiment analysis and emotion recognition techniques. Additionally, the use of a quantized small language model makes the proposed approach more suitable for mobile and edge-device applications.











