Sachet-Samachar: Hindi Fake News Detector
by Parul Mann · Updated: Jan 28, 2022
सचेत-समाचार Hindi Fake News Detector
YouTube: https://youtu.be/5wdvW-OnuKU
This is the submission by team Hackstreet Girls under the topic 'Combating Disinformation'.

COMBATING MISINFORMATION IN HINDI

The Covid-19 pandemic of the last few years has been accompanied by a rampant rise in fake news. In a country such as India, where news circulates in more than 22 languages, keeping a check on the spread of false news is a complex task. Although many resources are available to check the validity of news in English, little to no work has been done for regional languages. In India, nearly 55 crore (550 million) people speak and understand Hindi, making it the primary language for the circulation of news. With little check on Hindi fake news, a great deal of misinformation circulates in the country, causing socio-political tensions among other problems. This is especially dangerous during a pandemic, when the circulation of false medical information can even cost lives.

To combat the spread of Hindi misinformation, we built a Hindi Fake News Detector using Machine Learning and Deep Learning algorithms. As native Hindi speakers, we understood this problem well and used a dataset we scraped and annotated ourselves. Our one-of-a-kind detector provides accuracy up to 82.32 percent. The project aims to identify whether a Hindi news article is fake or not from its link.

We worked on a Hindi news dataset that we prepared on our own. We scraped over 200 news articles from Hindi fact-checking websites such as Aaj Tak Fact Check and Alt News Fact Check, and manually annotated the entire dataset of about 2206 data points. We used Jupyter Notebook for writing the programs. For scraping URLs we used libraries such as urllib, BeautifulSoup, difflib, re, and requests; for pre-processing and building the machine learning models we used pandas, NumPy, scikit-learn, Keras, and TensorFlow.
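The link-scraping step can be sketched with a minimal link extractor. This is a hedged illustration using only the standard library's html.parser (the project itself uses requests and BeautifulSoup); the sample HTML and URLs below are made up:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all hyperlinks found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# In practice the HTML would come from requests.get(url).text;
# here a hard-coded snippet stands in for a fact-check listing page.
sample = (
    '<html><body>'
    '<a href="https://example.com/fact-check-1">Story 1</a>'
    '<a href="https://example.com/fact-check-2">Story 2</a>'
    '</body></html>'
)
print(extract_links(sample))
```

The extracted links can then be filtered (e.g. with re or difflib) before annotation.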
After annotation, we performed basic pre-processing on the dataset. This was one of the major challenges of the project: numerous pre-processing tools are available for cleaning an English dataset, but since our entire dataset is in Hindi, the task was difficult due to the lack of available resources. We took a simple, direct approach: we removed stop words and punctuation marks, then stemmed and lemmatized the text, and finally vectorized the entire dataset with a TF-IDF vectorizer before using it to train the models. We used Seaborn for the visualization of data shown on our webpage.

After the pre-processing steps and hyperparameter tuning, we tested our dataset on various Machine Learning models: Logistic Regression, SVM, Random Forest, k-NN, and Gradient Boosting Classifier. We also tested a Deep Learning LSTM model for 10, 25, 50, and 100 epochs. On a benchmark dataset consisting of 932 fake and 1274 not-fake news links, the model identified most of the fake news links: we achieved an accuracy of 82.35% for the Random Forest model on a 10% test split and 60.42% for the LSTM model on a 30% test split.

We have also built a simple website where the URL of an article can be entered in the search box to check whether it is fake. In the future, we intend to connect the website and the ML/DL models with a suitable backend, and we will keep working on improving the accuracy of the models.

Link to GitHub repository for the project: https://github.com/A-nn-e/Sachet-Samachar
Link to YouTube video explaining the project: https://youtu.be/5wdvW-OnuKU
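The vectorize-then-classify pipeline described above can be sketched with scikit-learn. This is a minimal illustration of TF-IDF features feeding a Random Forest (the best-performing model above); the toy corpus and labels here are made up and in English, whereas the real dataset is Hindi text that has already been stripped of stop words and punctuation, stemmed, and lemmatized:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = fake, 0 = not fake.
texts = [
    "miracle cure guarantees instant recovery from covid",
    "drinking hot water kills the virus in minutes",
    "health ministry releases updated vaccination schedule",
    "government announces new guidelines for public transport",
]
labels = [1, 1, 0, 0]

# TF-IDF vectorization followed by a Random Forest classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(texts, labels)

# Classify a previously unseen headline.
print(model.predict(["miracle cure kills the virus instantly"]))
```

The same pipeline shape works for the other classifiers (swap the `clf` step for LogisticRegression, SVC, and so on), which makes model comparison straightforward.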
Kit Solution Source
Jupyter Notebook · 0 · Version: Current · License: No License
Follow the instructions below to run the solution:
- Run the web-scraping code, providing a URL as input; you will get the scraped links.
- Run the ML and DL code cell by cell to get the accuracy of the different models.
- For deploying the website, we suggest Heroku. Connect the local git repository to the Heroku app: add the Heroku app's remote to the git repository and push the changes, and the website will be live on Heroku.
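The Heroku step above boils down to a few commands. A sketch assuming the Heroku CLI is installed, you are inside the project repository, and an app already exists; "sachet-samachar-app" is a placeholder name:

```shell
# Authenticate with Heroku (opens a browser prompt).
heroku login

# Attach the existing Heroku app as a git remote named "heroku".
heroku git:remote -a sachet-samachar-app

# Push the local branch to deploy; the site goes live on Heroku.
git push heroku main
```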
We have used Jupyter Notebook for the deployment.
Jupyter metapackage for installation, docs and chat
Python · 14167 · Version: Current · License: Permissive (BSD-3-Clause)
Exploratory Data Analysis
For extensive analysis and exploration of data, these libraries were used.
The fundamental package for scientific computing with Python.
Python · 22957 · Version: v1.24.2 · License: Permissive (BSD-3-Clause)
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Python · 37275 · Version: v2.0.0rc1 · License: Permissive (BSD-3-Clause)
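A typical first exploration step with pandas is checking the class balance of the annotated labels. A minimal sketch on a hypothetical miniature of the dataset (the full set has 932 fake and 1274 not-fake links); the texts here are placeholders:

```python
import pandas as pd

# Hypothetical miniature of the annotated Hindi news dataset.
df = pd.DataFrame({
    "text": ["story one", "story two", "story three", "story four"],
    "label": ["fake", "not-fake", "fake", "not-fake"],
})

# Class balance: the first thing to inspect before training,
# since a skewed split would bias accuracy figures.
print(df["label"].value_counts())
```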
Data Scraping and Cleaning
For data scraping and cleaning, these libraries were used.
Request HTTP(s) URLs in a complex world.
TypeScript · 692 · Version: v3.10.0 · License: Permissive (MIT)
Web scraping using bs4, by ashutoshdhondkar
Project submitted at NIIT
Python · 0 · Version: Current · License: No License
A scalable system (using multiprocessing in Python) to find similarity between thousands of documents using difflib SequenceMatcher, Levenshtein distance, cosine similarity, or word embeddings generated by word2vec
Python · 1 · Version: Current · License: Permissive (Apache-2.0)
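The project lists difflib among its scraping utilities, presumably for matching near-duplicate headlines or links. A minimal sketch of SequenceMatcher-based similarity; the example strings are made up:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Ratio of matching characters over total length, in [0, 1];
    # 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()


print(similarity("fake news article", "fake news articles"))  # close to 1.0
```

A threshold on this ratio (e.g. above 0.9) is a simple way to drop near-duplicate scraped entries before annotation.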
Machine Learning and Deep Learning
For applying Machine Learning and Deep Learning, these libraries were used.
Machine learning with scikit-learn and TensorFlow (V)
Jupyter Notebook · 7 · Version: Current · License: Permissive (MIT)
An Open Source Machine Learning Framework for Everyone
C++ · 172263 · Version: v2.11.1 · License: Permissive (Apache-2.0)