सचेत-समाचार Hindi Fake News Detector

YouTube: https://youtu.be/5wdvW-OnuKU

This is the submission for team Hackstreet Girls under the topic 'Combating Disinformation'.

COMBATING MISINFORMATION IN HINDI

During the Covid-19 pandemic of the last few years, there has been a rampant rise in fake news. In a country such as India, where news circulates in more than 22 languages, keeping a check on the spread of false news is a complex task. Although there are many resources available for checking the validity of news in English, little to no work has been done for regional languages. Nearly 55 crore people in India speak and understand Hindi, making it the primary language for the circulation of news. With no proper check on Hindi fake news, a great deal of misinformation circulates in our country, causing socio-political tensions among other issues. This is especially dangerous during the current pandemic, as the circulation of false medical information can even cause loss of life.

To combat the spread of Hindi misinformation, we have built a Hindi Fake News Detector using Machine Learning and Deep Learning algorithms. Being native Hindi speakers, we understood this problem well and used a dataset we scraped and annotated on our own. Our one-of-a-kind detector achieves an accuracy of up to 82.35 percent.

The project aims to identify whether a Hindi news article is fake or not from its link. We worked on a Hindi news dataset that we prepared ourselves: we scraped over 200 news articles from Hindi fact-checking websites such as Aaj Tak Fact Check and Alt News Fact Check, and manually annotated the entire dataset of about 2206 data points. We used Jupyter Notebook for writing the programs. For scraping the URLs, we used libraries such as urllib, BeautifulSoup, difflib, re, and requests. For pre-processing and building the machine learning models, we used Pandas, NumPy, scikit-learn, Keras, and TensorFlow.

After annotation, we performed basic pre-processing on the dataset. This was one of the major challenges of the project: numerous pre-processing tools are available for cleaning an English dataset, but since our entire dataset is in Hindi, the task was difficult due to the lack of resources for Hindi data pre-processing. We took a simple, direct approach: we removed stop words and punctuation marks, applied stemming and lemmatization, and then vectorized the entire dataset with a TF-IDF vectorizer before training the models (a minimal sketch of this step appears after this overview). We used Seaborn for the data visualizations shown on our webpage.

After the pre-processing steps and hyperparameter tuning, we evaluated various Machine Learning models on our dataset: Logistic Regression, SVM, Random Forest, k-NN, and Gradient Boosting Classifier. We also trained a Deep Learning LSTM model for 10, 25, 50, and 100 epochs. On a benchmark dataset consisting of 932 fake and 1274 not-fake news links, the models identified most of the fake news links: we achieved an accuracy of 82.35% with the Random Forest model on a 10% test split and 60.42% with the LSTM model on a 30% test split. We have also built a simple website where the user can enter the URL of an article in the search box and check whether it is fake or not.
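As a rough illustration of the cleaning and TF-IDF step described above, here is a minimal sketch. The stop-word list is only a tiny illustrative subset, the regular expression and the placeholder stemming comment are simplifications of our own, and none of this is the exact code from the repository.

import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative subset of Hindi stop words; the project used a fuller list.
HINDI_STOPWORDS = {"और", "का", "की", "के", "में", "है", "यह", "से", "को", "पर"}

def clean_hindi_text(text: str) -> str:
    # Keep only Devanagari characters and whitespace, then drop the danda (।).
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)
    text = text.replace("।", " ")
    # Remove stop words; a Hindi stemmer/lemmatizer would be applied to the tokens here.
    tokens = [tok for tok in text.split() if tok not in HINDI_STOPWORDS]
    return " ".join(tokens)

# Two toy articles: "This news is completely false." / "The government announced a new scheme."
articles = ["यह खबर पूरी तरह से झूठी है।", "सरकार ने नई योजना की घोषणा की।"]
cleaned = [clean_hindi_text(a) for a in articles]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)   # sparse TF-IDF matrix used to train the models
print(X.shape)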
In the future, we intend to connect the website and the ML/DL models through a suitable backend. We will also keep working on improving the accuracy of the models.

Link to the GitHub repository for the project: https://github.com/A-nn-e/Sachet-Samachar
Link to the YouTube video explaining the project: https://youtu.be/5wdvW-OnuKU
Deployment Environment
We have used Jupyter Notebook for the deployment.
Exploratory Data Analysis
For extensive analysis and exploration of the data, we used Pandas, NumPy, and Seaborn.
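For illustration, here is a minimal sketch of the kind of class-distribution plot shown on the webpage. The file name hindi_news.csv and the column name "label" are assumptions for this example, not necessarily the names used in the repository.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed layout: one row per article with a "label" column (fake / not fake).
df = pd.read_csv("hindi_news.csv")
print(df["label"].value_counts())      # e.g. roughly 1274 not-fake vs 932 fake

sns.countplot(x="label", data=df)      # bar chart of the two classes
plt.title("Class distribution of the Hindi news dataset")
plt.xlabel("label")
plt.ylabel("number of articles")
plt.tight_layout()
plt.show()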
Data Scraping and Cleaning
For data scraping and cleaning, we used urllib, BeautifulSoup, difflib, re, and requests.
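A minimal sketch of fetching an article body with requests and BeautifulSoup follows. The URL is a placeholder and pulling the text of all paragraph tags is an illustrative simplification; the actual scraper handles each fact-checking site separately.

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url: str) -> str:
    # Download the page and concatenate the text of its paragraph tags.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return " ".join(paragraphs)

if __name__ == "__main__":
    # Placeholder URL; in the project the links come from the fact-checking sites.
    print(fetch_article_text("https://example.com/hindi-news-article")[:500])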
Machine Learning and Deep Learning
For applying Machine Learning and Deep Learning, we used scikit-learn, Keras, and TensorFlow.
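A minimal sketch of comparing the classical models on TF-IDF features with a 10% test split is shown below. The CSV and column names are assumptions carried over from the earlier examples, and the hyperparameters are scikit-learn defaults rather than our tuned values, so the numbers it prints will not match the reported 82.35% exactly.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Assumed dataset layout: "text" (cleaned Hindi article) and "label" (1 = fake, 0 = not fake).
df = pd.read_csv("hindi_news.csv")
X = TfidfVectorizer().fit_transform(df["text"])
y = df["label"]

# 10% of the data held out for testing, as in the reported Random Forest result.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "k-NN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")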
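For the deep learning side, here is a minimal Keras LSTM sketch with a 30% test split. The vocabulary size, sequence length, and layer sizes are illustrative choices, not the tuned configuration behind the reported 60.42%.

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

MAX_WORDS, MAX_LEN = 20000, 200   # illustrative vocabulary size and sequence length

df = pd.read_csv("hindi_news.csv")            # assumed "text" and "label" columns
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(df["text"])
X = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=MAX_LEN)
y = df["label"].values

# 30% of the data held out for testing, as in the reported LSTM result.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=64),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# The project trained for 10, 25, 50 and 100 epochs; 10 is used here as an example.
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))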