Multimodal Sentiment Analysis on Product Reviews Using Machine Learning Techniques
Keywords:
Twitter, Sentiment analysis (SA), Opinion mining, Machine learning, Naive Bayes (NB), Maximum Entropy, Support Vector Machine (SVM), Multi-modal sentiment analysis, ALBert, Feature extraction

Abstract
Research on text-based sentiment analysis has seen significant advancements; however, emotions in real life are predominantly multi-modal, expressed not only through text but also through images, audio, video, and other formats, and these modalities complement and reinforce one another. With the rapid expansion of e-commerce platforms, users generate vast amounts of multimodal data comprising text, images, and sometimes audio in product reviews. Traditional sentiment analysis approaches that rely solely on textual input often fail to capture the complete sentiment. Multimodal Sentiment Analysis (MSA) integrates multiple modalities to provide a more accurate and holistic understanding of user opinions. This paper presents a comprehensive analysis of product reviews using machine learning techniques across text, image, and audio data. We evaluate models for each modality and explore fusion strategies to improve sentiment classification performance; a simple fusion sketch is given below.
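To make the fusion idea concrete, the following is a minimal sketch of late fusion over per-modality classifier outputs. The `late_fusion` helper and the modality weights are illustrative assumptions, not the exact configuration evaluated in this paper.

```python
# Minimal late-fusion sketch for multimodal sentiment analysis.
# Each unimodal classifier (e.g. Naive Bayes or SVM for text, a CNN for
# images) produces class probabilities; they are combined with fixed weights.
import numpy as np

def late_fusion(text_probs, image_probs, audio_probs=None,
                weights=(0.5, 0.3, 0.2)):
    """Combine per-modality class probabilities into one prediction."""
    parts = [np.asarray(text_probs) * weights[0],
             np.asarray(image_probs) * weights[1]]
    if audio_probs is not None:
        parts.append(np.asarray(audio_probs) * weights[2])
    fused = np.sum(parts, axis=0)
    fused /= fused.sum()                    # renormalise to a distribution
    return int(np.argmax(fused)), fused     # predicted class and fused scores

# Example: three-class sentiment (negative, neutral, positive)
label, scores = late_fusion([0.2, 0.3, 0.5], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3])
```

More sophisticated strategies replace the fixed weights with a learned fusion layer or with attention over modality features, as in the cross-attention model described next.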
If the relationships among different modalities can be effectively exploited, the accuracy of sentiment analysis can be further improved. Accordingly, this paper presents a cross-attention-based multi-modal fusion model for images and text, referred to as MCAM. We first employ the ALBert pre-trained model to extract text features and then use a BiLSTM to capture contextual features of the text. We use DenseNet121 to extract image features and apply CBAM to highlight the regions of an image associated with emotion. Finally, we use multi-modal cross-attention to integrate the text and image features and classify the fused representation to determine emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, the proposed model outperforms the baseline models, achieving accuracy and F1 scores of 86.5% and 75.3% on MVSA, and 85.5% and 76.7% on TumEmo, respectively. Ablation experiments further confirm that multi-modal fusion outperforms single-modal sentiment analysis.

Sentiment analysis has significantly expanded in scope in recent years. It initially concentrated on textual data, but advances have since been made in other modalities, including audio and visual information. Researchers worldwide have shown strong interest in this area and have developed a variety of techniques toward this goal. This paper outlines several methods commonly employed in sentiment analysis, examines approaches used by other authors in their experiments, and surveys applications where sentiment analysis is currently deployed.
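For a concrete picture of the pipeline described above, the sketch below outlines an image-text cross-attention model in PyTorch in the spirit of MCAM. The `MCAMSketch` and `SimpleCBAM` classes, the hidden sizes, the pooling, and the backbone weights are assumptions for illustration; they are not the authors' exact implementation, and the CBAM block here is a reduced stand-in.

```python
# Sketch of a cross-attention image-text fusion model in the spirit of MCAM.
# Text: ALBert -> BiLSTM; image: DenseNet121 -> simplified CBAM;
# fusion: bidirectional cross-attention; output: sentiment logits.
import torch
import torch.nn as nn
from torchvision.models import densenet121
from transformers import AlbertModel

class SimpleCBAM(nn.Module):
    """Reduced stand-in for CBAM: channel attention then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg, mx = x.mean(dim=(2, 3)), x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca[:, :, None, None]                        # channel attention
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)))
        return x * sa                                       # spatial attention

class MCAMSketch(nn.Module):
    def __init__(self, num_classes=3, hidden=512):
        super().__init__()
        self.albert = AlbertModel.from_pretrained("albert-base-v2")
        self.bilstm = nn.LSTM(768, hidden // 2, batch_first=True,
                              bidirectional=True)
        self.cnn = densenet121(weights="IMAGENET1K_V1").features  # (B,1024,7,7)
        self.cbam = SimpleCBAM(1024)
        self.img_proj = nn.Conv2d(1024, hidden, kernel_size=1)
        self.txt2img = nn.MultiheadAttention(hidden, 8, batch_first=True)
        self.img2txt = nn.MultiheadAttention(hidden, 8, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask, images):
        txt = self.albert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        txt, _ = self.bilstm(txt)                            # (B, L, hidden)
        img = self.img_proj(self.cbam(self.cnn(images)))     # (B, hidden, 7, 7)
        img = img.flatten(2).transpose(1, 2)                 # (B, 49, hidden)
        t_attn, _ = self.txt2img(txt, img, img)               # text queries image
        i_attn, _ = self.img2txt(img, txt, txt)               # image queries text
        fused = torch.cat([t_attn.mean(1), i_attn.mean(1)], dim=-1)
        return self.classifier(fused)                        # sentiment logits
```

Mean pooling over the attended token and region sequences is one simple way to obtain fixed-size vectors before classification; attention pooling or using the [CLS] token are common alternatives.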











