Crime Classification using PySpark

Services

Big Data Analysis

Category

Big Data

Client

Personal Project

The goal was to create a scalable, intelligent classification pipeline that could help law enforcement agencies automatically detect and categorize cybercrimes using big data and machine learning.

I’m passionate about applying AI and big data tools to solve real-world problems. This project reflects my mission to drive smarter, faster decision-making through scalable data solutions.

Results

This project involved building a distributed machine learning pipeline in PySpark to classify crimes based on network activity descriptions. Features were extracted with NLP techniques (TF-IDF and Word2Vec), and an ensemble of Naïve Bayes and Logistic Regression reached 89% accuracy.
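To make the TF-IDF step concrete, here is a minimal pure-Python sketch of what the weighting computes. It is an illustration, not the project's code: the sample descriptions are made up, and the smoothed IDF formula log((N + 1) / (df + 1)) is assumed because it is the one documented for Spark MLlib's IDF. In a PySpark pipeline the same step would typically be handled by the feature transformers in `pyspark.ml.feature`.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw term count; IDF is the smoothed form
    log((N + 1) / (df + 1)), assumed here to mirror Spark MLlib.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy, made-up descriptions (not data from the project).
docs = [
    "repeated failed login attempts from one host".split(),
    "large outbound transfer to unknown host".split(),
    "failed login followed by large outbound transfer".split(),
]
weights = tf_idf(docs)
```

Terms that appear in fewer descriptions, such as "attempts", end up with a higher weight than terms shared across descriptions, such as "host", which is what makes the features useful for separating categories.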

I engineered features from packet captures, used PyShark to handle noisy data, and built models that categorize several types of cybercrime. The approach shows how big data and AI can support law enforcement efforts in real time.
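As a rough sketch of what feature engineering from packet data can look like, the example below groups per-packet records into flows and computes a few summary statistics. The record layout (`src`, `dst`, `proto`, `length`, `ts`), the `flow_features` helper, and the chosen features are all hypothetical. The project's actual features are not described here, and the step that reads packets from a capture file is omitted.

```python
from collections import defaultdict


def flow_features(packets):
    """Aggregate per-packet records into per-flow features.

    `packets` is an iterable of dicts with hypothetical keys
    'src', 'dst', 'proto', 'length' (bytes) and 'ts' (seconds).
    """
    flows = defaultdict(list)
    for p in packets:
        try:
            key = (p["src"], p["dst"], p["proto"])
            row = (float(p["ts"]), int(p["length"]))
        except (KeyError, TypeError, ValueError):
            # Skip malformed or incomplete records instead of failing.
            continue
        flows[key].append(row)

    features = {}
    for key, rows in flows.items():
        times = [t for t, _ in rows]
        sizes = [s for _, s in rows]
        features[key] = {
            "packet_count": len(rows),
            "total_bytes": sum(sizes),
            "mean_length": sum(sizes) / len(sizes),
            "duration": max(times) - min(times),
        }
    return features


# Made-up records; the third one has an unparseable length and is dropped.
packets = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "TCP", "length": 60, "ts": 0.0},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "TCP", "length": 1500, "ts": 0.5},
    {"src": "10.0.0.7", "dst": "10.0.0.9", "proto": "UDP", "length": "n/a", "ts": 1.0},
]
features = flow_features(packets)
```

Dropping records that fail to parse is only one way to deal with noise. Depending on the data, imputing or flagging them may be a better choice.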

Challenges

Real-world packet data is powerful, but it brought data quality and class imbalance challenges. Handling inconsistent patterns and making sure the models generalized across crime categories required rigorous preprocessing and tuning.
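One common way to counter class imbalance is to weight each training example inversely to the frequency of its class. The sketch below shows that idea in plain Python. It is an assumption for illustration, not necessarily the technique used in this project, and the category names and counts are made up. In Spark ML, per-row weights like these can be passed to `LogisticRegression` through its `weightCol` parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up, imbalanced label distribution.
labels = ["phishing"] * 6 + ["ddos"] * 3 + ["malware"] * 1
class_weights = balanced_class_weights(labels)

# One weight per training row, ready to be stored in a weight column.
sample_weights = [class_weights[y] for y in labels]
```

With this scheme the rare class receives the largest weight, and the weights sum to the number of samples, so the overall scale of the loss stays unchanged.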

Available for work

Let's create something extraordinary together.

Let’s make an impact

Ayesha Saif

Data Scientist

Ready to translate raw data into strategy? Reach out and let’s get started.

Ayesha Saif
