The goal was to create a scalable, intelligent classification pipeline that could help law enforcement agencies automatically detect and categorize cybercrimes using big data and machine learning.
I’m passionate about applying AI and big data tools to solve real-world problems. This project reflects my mission to drive smarter, faster decision-making through scalable data solutions.
Results
This project involved building a distributed machine learning pipeline in PySpark that classifies cybercrimes from text descriptions of network activity. Features were extracted with NLP techniques such as TF-IDF and Word2Vec and fed to an ensemble of Naïve Bayes and Logistic Regression classifiers, which achieved 89% accuracy.
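To make the TF-IDF step concrete, here is a minimal single-machine sketch in plain Python. It is not the project's PySpark code: the function name `tfidf` and the sample descriptions are invented for illustration, tokenization is a simple whitespace split, and the smoothed IDF formula log((N + 1) / (df + 1)) is assumed because it is the one Spark MLlib documents for its `IDF` estimator.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term frequency * smoothed IDF, where
    IDF = log((N + 1) / (df + 1)) and N is the number of documents.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical network activity descriptions, invented for this example.
docs = [
    "repeated failed login attempts from single host",
    "large outbound transfer to unknown host",
    "failed login followed by large outbound transfer",
]
weights = tfidf(docs)
```

Terms that occur in only one description (such as "repeated") receive a higher weight than terms shared across descriptions (such as "host"), which is what lets a downstream classifier focus on distinctive vocabulary.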
We parsed packet captures with PyShark, engineered features from them, cleaned noisy records, and built models capable of categorizing various cybercrimes. The approach shows how big data and AI can support law enforcement efforts in real time.
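As a rough illustration of what engineering features from packet captures can look like, the sketch below aggregates packets into per-flow statistics. It assumes the packets have already been parsed (for example with PyShark) into plain dictionaries. The field names `src`, `dst`, `protocol`, `length`, and `timestamp`, the function `flow_features`, and the chosen statistics are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


def flow_features(packets):
    """Aggregate packet records into features keyed by (src, dst, protocol)."""
    grouped = defaultdict(list)
    for pkt in packets:
        grouped[(pkt["src"], pkt["dst"], pkt["protocol"])].append(pkt)

    features = {}
    for key, pkts in grouped.items():
        sizes = [p["length"] for p in pkts]
        times = [p["timestamp"] for p in pkts]
        features[key] = {
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "mean_length": sum(sizes) / len(sizes),
            # Seconds between the first and last packet of the flow.
            "duration": max(times) - min(times),
        }
    return features


# Hypothetical, already-parsed packet records.
packets = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "protocol": "TCP", "length": 60, "timestamp": 0.0},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "protocol": "TCP", "length": 1500, "timestamp": 0.5},
    {"src": "10.0.0.7", "dst": "10.0.0.9", "protocol": "UDP", "length": 120, "timestamp": 0.2},
]
features = flow_features(packets)
```

Grouping by source, destination, and protocol is only one possible choice of flow key; a real pipeline might also include ports or time windows.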
Challenges
Real-world packet data is a powerful source of signal, but working with it brought data quality and class imbalance challenges. Handling inconsistent patterns and ensuring that the models generalized across crime categories required rigorous preprocessing and tuning.
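One common way to counter class imbalance is to weight each training example inversely to the frequency of its class. The write-up does not say which technique was actually used, so the sketch below is only an assumption: it applies the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same rule scikit-learn uses for `class_weight="balanced"`. The function name and the crime category labels are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} so that rare classes count more during training."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, deliberately imbalanced label distribution.
labels = ["phishing"] * 6 + ["malware"] * 3 + ["ddos"] * 1
weights = balanced_class_weights(labels)
```

With these weights the rarest class contributes as much total weight to the loss as the most common one. Such weights could, for instance, be stored in a column and passed to a classifier that accepts per-row weights.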