Real-Time Fraud ML on Spark Structured Streaming: Micro-Batch vs. Continuous Processing
Keywords:
Real-Time Fraud Detection, Apache Spark Structured Streaming, Micro-Batch vs. Continuous Processing, Machine Learning Inference, Scalability and Latency
Abstract
The growing volume and velocity of digital financial transactions have made them highly vulnerable to fraud, which makes real-time fraud detection systems a necessity. This article critically examines machine learning-based fraud detection on Apache Spark Structured Streaming, which offers two processing modes: Micro-Batch and Continuous Processing. A working pipeline is built and trained on the IEEE-CIS Fraud Detection dataset, combining feature engineering, supervised learning models, and stream processing to perform near-real-time fraud classification. Micro-Batch mode supports complex transformations and fault tolerance, and it delivers reliability and analytical power at the cost of a marginal increase in latency. Continuous Processing mode offers significantly lower latency and higher throughput, which makes it best suited to issuing alerts quickly in high-risk environments, but it is limited in the transformations and recovery actions it supports. Comprehensive experiments compare the two modes in terms of latency, precision, recall, resource utilization, and reliability, and assess what each mode can achieve operationally. The results indicate that neither mode is inherently optimal; the choice should instead follow the requirements of the particular use case. The paper concludes with practical advice, outlines the current limitations of Spark's Continuous mode, and presents opportunities for future work, including active learning, graph ML, and hybrid systems that combine the strengths of both modes for real-time fraud mitigation.











