Detection of Cyberbullying using Machine Learning Computer Sci FYP Idea

Project Domain / Category

Data Science/Machine Learning

Abstract / Introduction

Cyberbullying is bullying that takes place over digital devices such as computers, cell phones, and tablets. It can occur on social media platforms and online forums where people can view, participate in, comment on, or share other people’s content. It includes sending, posting, or sharing negative, inaccurate, or harmful material about someone else, as well as sharing personal or private information that may cause embarrassment. Because social networks give bullies a rich environment in which to threaten and attack victims, it is important to find appropriate measures to detect cyberbullying on social media. In this project, we shall apply appropriate machine learning techniques (e.g. Bayesian, Support Vector Machine, and tree-based methods such as Decision Tree and Random Forest) to cyberbullying datasets and measure their accuracy. We shall also compare which techniques are better at detecting cyberbullying and why.

Functional Requirements:

The administrator will perform all of the following tasks.

Data-Collection

For this project, you can collect data from any social media platform (such as Facebook, Twitter, or YouTube) to detect cyberbullying. Your dataset must contain at least 2000 comments. A dataset is shared at the link below. You can collect more data through the platform's API or manually, and add the collected data to the shared dataset.

Data-Preparation

After collecting the data, you need to prepare the dataset. In this step, you will label each comment as one of two classes: B (bullying) or NB (non-bullying). You will also need to remove punctuation marks and digits from the comments.
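The preparation step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `clean_comment` and `prepare` and the (comment, label) row layout are assumptions, not part of the shared dataset's format. Only ASCII punctuation and digits are handled here.

```python
import string

# Label codes from the brief: B = bullying, NB = non-bullying.
VALID_LABELS = {"B", "NB"}

# Map every ASCII punctuation mark and digit to a space so that words
# separated only by punctuation ("hello,world") do not get glued together.
_STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def clean_comment(text):
    """Remove punctuation marks and digits, then collapse extra whitespace."""
    return " ".join(text.translate(_STRIP_TABLE).split())


def prepare(rows):
    """Clean each (comment, label) pair and reject unknown labels.

    The row layout is a hypothetical one chosen for this sketch.
    """
    prepared = []
    for comment, label in rows:
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        prepared.append((clean_comment(comment), label))
    return prepared


print(prepare([("You are SO dumb!!! 123", "B"), ("Great video, thanks :)", "NB")]))
```

Replacing the unwanted characters with spaces rather than deleting them is a design choice: it keeps neighbouring words apart, at the cost of splitting contractions such as "don't" into two tokens.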

Pre-processing

Most real-world data is incomplete and contains noise and missing values, so you have to pre-process your data. In pre-processing, you will normalize the dataset, remove duplicate values, handle noise, outliers, and missing values, and remove stop words.
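A rough sketch of these pre-processing steps is shown below. It assumes the data is a list of (comment, label) pairs and uses a tiny, hypothetical stop-word list for illustration; a real project would likely use a fuller list from an NLP library. Normalization here means lower-casing and collapsing whitespace, which is one reasonable reading of the requirement.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}


def preprocess(rows):
    """Normalize comments, drop missing and duplicate rows, remove stop words.

    rows: iterable of (comment, label) pairs; either value may be None.
    """
    seen = set()
    result = []
    for comment, label in rows:
        # Missing values: skip rows that lack a comment or a label.
        if not comment or not label:
            continue
        # Normalization: lower-case and split on any run of whitespace.
        tokens = comment.lower().split()
        # Stop words: drop very common words that carry little signal.
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
        text = " ".join(tokens)
        # Noise and duplicates: skip empty results and repeated comments.
        if not text or text in seen:
            continue
        seen.add(text)
        result.append((text, label))
    return result


rows = [
    ("You are a loser", "B"),
    ("you are a  LOSER", "B"),   # duplicate after normalization
    (None, "NB"),                # missing comment
    ("Have a nice day", "NB"),
    ("the", "NB"),               # nothing left after stop-word removal
]
print(preprocess(rows))
```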

Feature Extraction

After the pre-processing step, you will apply a feature extraction method. You can use TF-IDF, Word2Vec, or n-gram features (uni-gram, bi-gram, tri-gram, or higher-order n-grams).
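As one possible illustration, TF-IDF and n-gram features can be combined with scikit-learn's `TfidfVectorizer`. The brief does not name a library, so the use of scikit-learn is an assumption, and the three comments are made-up examples. Word2Vec is not shown because it needs a separate library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up comments for illustration only; not taken from the shared dataset.
comments = [
    "you are a loser",
    "have a nice day",
    "nobody likes you loser",
]

# ngram_range=(1, 3) builds uni-gram, bi-gram, and tri-gram features together.
# Note: the default tokenizer ignores one-character tokens such as "a".
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
features = vectorizer.fit_transform(comments)  # sparse matrix: comments x n-grams

print(features.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of `features` is the weighted n-gram vector for one comment and can be passed directly to a classifier.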

Train & Test Data

Split the data into a 70% training set and a 30% testing set.
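A minimal sketch of the 70/30 split, assuming scikit-learn is used. The placeholder comments and labels are invented for the example; `stratify` and `random_state` are optional choices that keep the B/NB ratio similar in both sets and make the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Placeholder data for illustration: 10 comments with alternating labels.
comments = [f"comment {i}" for i in range(10)]
labels = ["B", "NB"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    comments,
    labels,
    test_size=0.30,     # 30% testing, 70% training
    stratify=labels,    # keep the class balance similar in both sets
    random_state=42,    # fixed seed so the split is reproducible
)

print(len(X_train), len(X_test))
```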

Machine Learning Techniques

In this project, you will use at least four classifiers/models (e.g. Naïve Bayes, Multinomial Naïve Bayes, SVM with a polynomial kernel, SVM with an RBF kernel, Decision Tree, Random Tree, and Random Forest) drawn from four machine learning techniques/algorithms.
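The sketch below shows one way to train four such classifiers with scikit-learn. The mapping from the names in the brief to scikit-learn classes is an assumption, and the training comments are made-up toy examples, far too few for meaningful results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up comments for illustration only; not from the shared dataset.
train_comments = [
    "you are a loser", "nobody likes you", "you are so stupid", "go away idiot",
    "have a nice day", "great video thanks", "love this song", "well done team",
]
train_labels = ["B", "B", "B", "B", "NB", "NB", "NB", "NB"]

# Assumed mapping of the brief's classifier names to scikit-learn classes.
# SVC(kernel="poly") would give the polynomial-kernel variant.
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

fitted = {}
for name, model in models.items():
    # Each pipeline turns raw text into TF-IDF features, then classifies.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(train_comments, train_labels)
    fitted[name] = pipeline

for name, pipeline in fitted.items():
    print(name, pipeline.predict(["you are a loser"])[0])
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training data only, which avoids leaking information from the test set.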

Confusion Matrix

Create a confusion matrix table to describe the performance of each classification model.
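To make the table concrete, here is a small standard-library sketch that counts a 2x2 confusion matrix for the B and NB classes. The true and predicted labels are invented for the example. Rows are true classes and columns are predicted classes, which is the same layout `sklearn.metrics.confusion_matrix` produces when given `labels=["B", "NB"]`.

```python
from collections import Counter

LABELS = ["B", "NB"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a matrix where cell [i][j] counts true labels[i] predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]


# Invented labels for illustration only.
y_true = ["B", "B", "B", "NB", "NB", "NB"]
y_pred = ["B", "B", "NB", "NB", "NB", "B"]

matrix = confusion_matrix(y_true, y_pred)

# Print the matrix as a small table.
print("true \\ pred", *LABELS, sep="\t")
for label, row in zip(LABELS, matrix):
    print(label, *row, sep="\t")
```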

Accuracy Evaluation

Find the accuracy of each technique and compare the results.
This comparison will also show which machine learning technique is best for detecting cyberbullying.
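Accuracy is the number of correct predictions divided by the total number of predictions. The sketch below ranks models by accuracy; the model names and predictions are hypothetical placeholders, not real results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(true == pred for true, pred in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical test labels and per-model predictions (placeholders only).
y_true = ["B", "NB", "B", "NB", "B"]
predictions = {
    "model_a": ["B", "NB", "B", "NB", "NB"],
    "model_b": ["B", "B", "NB", "NB", "NB"],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}

# Sort from highest to lowest accuracy to see which model did best.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```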

Tools/Techniques:

Python (programming language)
Anaconda (Python distribution platform)
Jupyter Notebook (Open source web application)
Machine Learning (Technique)

Prerequisite:

Artificial Intelligence, Machine Learning, and Natural Language Processing Concepts, “Students will cover a short course relevant to the mentioned concepts besides SRS and Design initial documentation or see the links below.”

Helping Material

Machine Learning Techniques:

https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0
https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c
https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-shouldknow-3cc96e0eeee9
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623

Feature Extraction Method:

https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be

Guide For Feature Extraction Techniques

https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-realworld-dataset-796d339a4089
https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-abeginners-guide-to-understand-natural-language-processing/
http://uc-r.github.io/creating-text-features

Dataset:

https://drive.google.com/file/d/1AfUdn70MfnFirnb7NTu2DTS1AVasnofG/view?usp=sharing