Accepted to IJCAI-ECAI 2026 · Special Track (Main Conference)
ICFD-31k: A Large-Scale Dataset and Benchmark for Real-Time Conversational Fraud Detection
†Department of Information Technology, Dr. B.R. Ambedkar National Institute of Technology Jalandhar
Accepted to the IJCAI-ECAI 2026 main conference special track. Presentation scheduled in Bremen, Germany (Aug 15-21, 2026).
TL;DR
ICFD-31k introduces 31,000+ Indian English/Hinglish fraud-call transcripts with chunk-level streaming labels and slow-thinking rationales, plus RoBERTa baselines that reach 99.40 F1 in-domain and 92.97 F1 on unseen scam types.
Abstract
Sophisticated telephone scams pose a significant societal and economic threat, and in a country like India they cut across diverse linguistic contexts. The lack of large-scale, publicly available datasets remains a critical barrier to research on robust, real-time countermeasures. To address this gap, this work introduces ICFD-31k, the first Indian Conversational Fraud Dataset, a new benchmark containing over 31,000 realistic conversational transcripts. The transcripts are systematically generated and cover 10 distinct fraud umbrellas, ranging from financial impersonation to job scams. They feature rich annotations comprising a final verdict, chunk-level streaming labels, and detailed slow-thinking rationales. A human-in-the-loop evaluation of ICFD-31k's quality achieves a Cohen's Kappa of 0.534, which indicates moderate inter-annotator agreement. The work also introduces two fine-tuned RoBERTa models that serve as strong baselines: M1 for non-streaming data and M2 for streaming data. Comprehensive experiments with M1 and M2 further demonstrate ICFD-31k's utility.
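For readers unfamiliar with the agreement statistic quoted in the abstract, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that formula; the label lists are made up for illustration and are not taken from ICFD-31k, and the function name is a hypothetical helper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Degenerate case (one shared label everywhere); returning 1.0
        # is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels only (assumption, not real annotations).
annotator_1 = ["fraud", "fraud", "legit", "legit", "fraud", "legit", "fraud", "legit"]
annotator_2 = ["fraud", "legit", "legit", "legit", "fraud", "fraud", "fraud", "legit"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the annotators agree on 6 of 8 items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5.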
Highlights
- 31,000+ realistic conversational transcripts built for the Indian fraud landscape.
- Bilingual coverage across English and Hinglish, including code-switching patterns.
- Chunk-level streaming labels for real-time fraud detection experiments.
- Slow-thinking rationales for explainability and auditability.
- RoBERTa baselines for both static and streaming settings.
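To make the annotation layers listed above concrete, here is a sketch of how one transcript with a final verdict, chunk-level labels, and a rationale might be represented, and how cumulative prefixes for a streaming experiment could be derived from it. The field names (`chunks`, `chunk_labels`, `verdict`, `rationale`), the label strings, and the example dialogue are all hypothetical; the released schema may differ.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    # Hypothetical layout; not the published ICFD-31k schema.
    chunks: list[str]        # consecutive segments of the call
    chunk_labels: list[str]  # one streaming label per chunk
    verdict: str             # final label for the whole call
    rationale: str           # free-text explanation of the verdict

def streaming_prefixes(transcript):
    """Yield (text_heard_so_far, label_at_this_chunk) pairs."""
    for i, label in enumerate(transcript.chunk_labels, start=1):
        yield " ".join(transcript.chunks[:i]), label

# Invented example record for illustration.
example = Transcript(
    chunks=[
        "Hello sir, I am calling from your bank.",
        "Aapka card block ho jayega, verify now.",
        "Please tell me the OTP sent to your phone.",
    ],
    chunk_labels=["legit", "suspicious", "fraud"],
    verdict="fraud",
    rationale="Caller creates urgency and then requests an OTP.",
)
prefixes = list(streaming_prefixes(example))
```

Each prefix pairs everything heard up to a given chunk with the label at that point, which is the input shape a streaming classifier would see.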
Why it matters
Most fraud-detection datasets focus on transactions, emails, or monolingual contexts. This paper pushes toward a more deployable benchmark for live-call settings, where the hard part is not just classification but early intervention under incomplete context.
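The early-intervention setting can be sketched as a loop that scores each growing prefix of a call and raises an alert at the first chunk whose score crosses a threshold. The scorer below is a toy keyword counter standing in for a trained streaming model such as M2; the cue list, the threshold, and the example call are illustrative assumptions, not part of the paper.

```python
def toy_score(text):
    """Toy stand-in for a model score in [0, 1]: share of cue words present."""
    cues = ("otp", "kyc", "blocked", "urgent")  # assumed cue list
    lowered = text.lower()
    return sum(cue in lowered for cue in cues) / len(cues)

def first_alert(chunks, score_fn, threshold=0.5):
    """Return the 1-based index of the first chunk at which the score of
    the text heard so far reaches the threshold, or None if it never does."""
    heard = []
    for i, chunk in enumerate(chunks, start=1):
        heard.append(chunk)
        if score_fn(" ".join(heard)) >= threshold:
            return i
    return None

# Invented call for illustration.
call = [
    "Hello, am I speaking with the account holder?",
    "Your KYC has expired and the account will be blocked today.",
    "Please share the OTP you just received.",
]
alert_at = first_alert(call, toy_score)
# Fraction of the call heard before the alert; lower means earlier detection.
fraction_heard = alert_at / len(call) if alert_at else None
```

Here the alert fires at the second of three chunks, before the OTP request is ever made. Reporting how much of a call was heard before the alert, alongside accuracy, is one simple way to capture the trade-off between acting early and acting on incomplete context.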
Release note
I will add camera-ready metadata and any public artifacts here once the final proceedings and release flow are ready.