Project Domain / Category
Data Science/Machine Learning
Abstract / Introduction
Cyberbullying is bullying that takes place through digital devices such as computers, cell phones, and tablets. It can occur on social media and online forums where people can view, participate in, comment on, or share other people’s content. It includes sending, posting, or sharing negative, inaccurate, or harmful material about someone else, as well as sharing personal or private information that may cause embarrassment. Because social networks give bullies a rich environment in which to threaten and attack victims, it is important to find appropriate measures for detecting cyberbullying on social media. In this project, we shall apply appropriate machine learning techniques (e.g. Bayesian classifiers, Support Vector Machines, decision trees, and random forests) to cyberbullying datasets and measure their accuracy. We shall also compare which techniques are better at detecting cyberbullying, and why.
Functional Requirements:
The administrator will perform all of the following tasks.
- Data-Collection
- For this project, you can collect data from any social media platform (such as Facebook, Twitter, or YouTube) to detect cyberbullying. Your dataset must contain at least 2000 comments. A dataset is shared via the link below; you can collect more data through a platform API or manually and add it to the shared dataset.
- Data-Preparation
- After collecting the data, you need to prepare the dataset. In this step, you will label each comment as one of two classes: B (bullying) or NB (non-bullying). You will also need to remove punctuation marks and digits from the dataset.
- Pre-processing
- Most real-world data is incomplete and contains noisy and missing values, so you have to pre-process your data. In pre-processing, you will normalize the dataset, remove duplicate values and stop words, and handle noise, outliers, and missing values.
- Feature Extraction
- After the pre-processing step, you will apply a feature extraction method. You can use TF-IDF, Word2Vec, or n-gram features (unigram, bigram, trigram, or general n-gram).
- Train & Test Data
- Split the data into 70% training and 30% testing sets.
- Machine learning Techniques
- In this project, you will use at least four classifiers/models drawn from four machine learning techniques/algorithms (e.g. Naïve Bayes, Multinomial Naïve Bayes, SVM with a polynomial kernel, SVM with an RBF kernel, Decision Tree, Random Tree, and Random Forest).
- Confusion Matrix
- Create a confusion matrix table to describe the performance of a classification model.
- Accuracy Evaluation
- Find the accuracy of every technique and compare the results.
- This comparison will also tell us which machine learning technique is best for detecting cyberbullying.
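The data-preparation and pre-processing requirements above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the function names `clean_comment` and `prepare`, the tiny stop-word list, and the sample comments are all assumptions made for the example. It removes only ASCII punctuation and digits, so a real dataset may need extra handling for emoji or typographic quotes.

```python
import re
import string

# Assumed, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "so", "to", "and", "of"}

# Translation table that deletes every ASCII punctuation mark and digit.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def clean_comment(text):
    """Lowercase, drop punctuation and digits, then remove stop words."""
    text = text.lower().translate(_STRIP_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def prepare(rows):
    """Clean and de-duplicate (comment, label) pairs labelled 'B' or 'NB'.

    Rows with a missing comment, an unknown label, or nothing left after
    cleaning are skipped.
    """
    seen = set()
    prepared = []
    for comment, label in rows:
        if not comment or label not in {"B", "NB"}:
            continue  # missing value or invalid label
        cleaned = clean_comment(comment)
        if not cleaned or cleaned in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(cleaned)
        prepared.append((cleaned, label))
    return prepared


# Made-up comments, for illustration only.
sample = [
    ("You are SO stupid!!! 123", "B"),
    ("you are so stupid", "B"),       # duplicate once cleaned
    ("Great video, thanks :)", "NB"),
    (None, "NB"),                     # missing comment
    ("12345 !!!", "NB"),              # nothing left after cleaning
]
print(prepare(sample))
```

Skipping rows with missing values is only one way to handle them; whether to drop or repair such rows is a choice the brief leaves open.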
Tools:
- Python (programming language)
- Anaconda (Python distribution platform)
- Jupyter Notebook (open-source web application)
- Machine Learning (Technique)
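With Python as the working language, the feature-extraction, train/test split, classification, confusion-matrix, and accuracy requirements might be sketched as follows. The sketch assumes scikit-learn, which the brief does not name, and picks four classifiers from the suggested list as one possible selection. The twenty comments are made up and already cleaned; they exist only so the example runs, and the accuracies they produce say nothing about real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-cleaned comments; the real dataset needs at least 2000 rows.
bullying = [
    "you are stupid", "nobody likes you loser", "go away ugly idiot",
    "you are worthless", "shut up dumb loser", "everyone hates you",
    "you are such an idiot", "ugly and stupid", "what a pathetic loser",
    "you dumb fool",
]
friendly = [
    "great video thanks", "nice work keep going", "love this song",
    "very helpful tutorial", "congratulations well done",
    "thanks for sharing this", "beautiful photo", "really enjoyed this post",
    "good luck with the exam", "happy birthday friend",
]
texts = bullying + friendly
labels = ["B"] * len(bullying) + ["NB"] * len(friendly)

# 70% training / 30% testing, stratified so both classes appear in each split.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)

# TF-IDF over unigrams and bigrams, fitted on the training split only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    # Rows are true classes, columns are predicted classes, in the order B, NB.
    matrix = confusion_matrix(y_test, predicted, labels=["B", "NB"])
    results[name] = (accuracy_score(y_test, predicted), matrix)

# Report the classifiers from most to least accurate.
for name, (accuracy, matrix) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: accuracy={accuracy:.2f}")
    print(matrix)
```

Fitting the vectorizer on the training split alone keeps vocabulary and document frequencies from the test comments out of the model, which makes the accuracy comparison fairer.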
Guide for Feature Extraction Techniques:
- https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-realworld-dataset-796d339a4089
- https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-abeginners-guide-to-understand-natural-language-processing/
- http://uc-r.github.io/creating-text-features
Dataset:
- https://drive.google.com/file/d/1AfUdn70MfnFirnb7NTu2DTS1AVasnofG/view?usp=sharing