सचेत-समाचार Hindi Fake News Detector
YouTube: https://youtu.be/5wdvW-OnuKU
This is the submission for team Hackstreet Girls under the topic 'Combating Disinformation'.
COMBATING MISINFORMATION IN HINDI
During the Covid-19 pandemic of the last few years, the spread of fake news has risen sharply. In a country such as India, where news circulates in more than 22 languages, keeping the spread of false news in check is a complex task. Although many resources are available for checking the validity of news in English, little to no work has been done for regional languages.
In India, nearly 55 crore (550 million) people speak and understand Hindi, making it the primary language for the circulation of news. Because Hindi news is rarely fact-checked, a great deal of misinformation circulates in the country, causing socio-political tensions among other issues. This is especially dangerous during the current pandemic, as the circulation of false medical information can even cause loss of life.
To combat the spread of Hindi misinformation, we built a Hindi Fake News Detector using machine learning and deep learning algorithms. As native Hindi speakers, we understand this problem well, and we used a dataset that we scraped and annotated ourselves. Our detector achieves an accuracy of up to 82.35 percent!
The project aims to identify whether a Hindi news article is fake or not from its link. We worked on a Hindi news dataset that we prepared ourselves: we scraped over 200 news articles from Hindi fact-checking websites such as Aaj Tak Fact Check and Alt News Fact Check, and manually annotated the entire dataset of 2206 data points.
We wrote the programs in Jupyter Notebook. For scraping URLs, we used the libraries urllib, BeautifulSoup, difflib, re and requests. For pre-processing and building the machine learning models, we used Pandas, NumPy, scikit-learn, Keras and TensorFlow.
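As a rough illustration of the scraping step, the sketch below downloads a page with urllib and pulls out its paragraph text with BeautifulSoup. The helper names `extract_paragraphs` and `fetch_article`, the User-Agent header, and the assumptions that pages are UTF-8 encoded and keep their article text in `<p>` elements are ours for illustration; the project's actual scraper may work differently.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def extract_paragraphs(html: str) -> list[str]:
    """Return the non-empty text of every <p> element in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return [text for text in paragraphs if text]


def fetch_article(url: str, timeout: float = 10.0) -> list[str]:
    """Download a page and extract its paragraphs (hypothetical helper)."""
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not raised.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_paragraphs(html)


# Offline demonstration on an inline HTML snippet (made-up content).
sample_html = (
    "<html><body><h1>शीर्षक</h1>"
    "<p>पहला पैराग्राफ।</p><p> </p><p>दूसरा <b>पैराग्राफ</b></p>"
    "</body></html>"
)
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Keeping the parsing separate from the download makes the extraction logic easy to check on saved pages without hitting the fact-checking sites repeatedly.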
After annotation, we performed basic pre-processing on the dataset, which turned out to be one of the major challenges of the project. Numerous tools are available for cleaning an English dataset, but since our entire dataset is in Hindi, very few of those resources could be used. We therefore took a simple and direct approach: we removed stop words and punctuation marks, then applied stemming and lemmatization. Finally, we vectorized the entire dataset with the TF-IDF Vectorizer before using it to train the models. We used Seaborn to visualize the data, as shown on our webpage.
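The stop-word and punctuation removal described above can be sketched with the standard library alone. The stop-word list here is a tiny hypothetical sample, and stemming and lemmatization are left out because the text does not say which tool was used for them.

```python
import string

# Hypothetical stop-word list for illustration only; a real list is much longer.
HINDI_STOP_WORDS = {"यह", "से", "है", "की", "का", "के", "और", "में", "पर"}

# ASCII punctuation plus the Devanagari danda and double danda.
PUNCTUATION = string.punctuation + "।॥"
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def preprocess(text: str) -> list[str]:
    """Replace punctuation with spaces, split on whitespace, drop stop words."""
    no_punct = text.translate(_PUNCT_TO_SPACE)
    return [tok for tok in no_punct.split() if tok not in HINDI_STOP_WORDS]


tokens = preprocess("यह खबर पूरी तरह से फर्जी है।")
print(tokens)
```

The sketch strips an explicit set of punctuation characters instead of matching words with a `\w+` regular expression, because Python's `\w` does not match Devanagari vowel signs and would split Hindi words apart.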
After pre-processing and hyperparameter tuning, we evaluated various machine learning models on our dataset: Logistic Regression, SVM, Random Forest, k-NN and Gradient Boosting Classifier. We also trained a deep learning LSTM model for 10, 25, 50 and 100 epochs.
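A minimal sketch of such a comparison with scikit-learn is shown below. The toy headlines and labels are made up for illustration and are not taken from the project dataset, the hyperparameters are library defaults or arbitrary choices rather than the tuned values, and whitespace tokenization via `analyzer=str.split` is our assumption. The LSTM model is not covered here.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up, already-cleaned toy headlines (1 = fake, 0 = not fake).
texts = [
    "चमत्कारी दवा से कोरोना तुरंत ठीक",
    "सरकार ने टीकाकरण अभियान शुरू किया",
    "वायरल संदेश दावा नींबू से कोरोना खत्म",
    "मंत्रालय ने नए आंकड़े जारी किए",
    "फर्जी वीडियो वायरल दावा गलत",
    "अदालत ने याचिका पर सुनवाई की",
    "चमत्कारी नुस्खा वायरल दावा",
    "सरकार ने बजट पेश किया",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # analyzer=str.split tokenizes on whitespace only, keeping Hindi words whole.
    pipeline = make_pipeline(TfidfVectorizer(analyzer=str.split), model)
    pipeline.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline fits the TF-IDF vocabulary on the training split only, so the test articles do not leak into the features. Scores on eight toy samples are meaningless in themselves; the point is the shape of the comparison loop.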
On our benchmark dataset of 932 fake and 1274 not-fake news links, the models identified most of the fake news links. The Random Forest model achieved an accuracy of 82.35% with 10% of the data held out for testing, and the LSTM model achieved an accuracy of 60.42% with 30% of the data held out for testing.
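To put these numbers in context, the short calculation below derives the dataset size, the accuracy of a trivial classifier that always predicts the majority class, and the approximate size of each held-out set. The rounding rule in `held_out_size` (rounding up to a whole sample) is an assumption on our part; the class counts and split fractions are the ones reported above.

```python
import math

n_fake, n_real = 932, 1274      # class counts reported above
n_total = n_fake + n_real

# Accuracy of a trivial model that always predicts the majority class ("not fake").
majority_baseline = max(n_fake, n_real) / n_total


def held_out_size(n_samples: int, test_fraction: float) -> int:
    """Assumed rounding: the held-out set is rounded up to a whole sample."""
    return math.ceil(n_samples * test_fraction)


n_test_rf = held_out_size(n_total, 0.10)    # Random Forest split
n_test_lstm = held_out_size(n_total, 0.30)  # LSTM split

print(f"total samples:      {n_total}")                 # 2206
print(f"majority baseline:  {majority_baseline:.2%}")   # 57.75%
print(f"Random Forest test: {n_test_rf} samples")       # 221
print(f"LSTM test:          {n_test_lstm} samples")     # 662
```

Against a majority-class baseline of about 57.75%, the Random Forest result is a clear gain, while the LSTM result of 60.42% is only a few points above it.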
We have also built a simple website where a user can enter the URL of an article in the search box and check whether the article is fake or not. In the future, we intend to connect the website to the ML/DL models through a suitable backend. We will also work on improving the accuracy of the models.
Link to GitHub Repository for the project: https://github.com/A-nn-e/Sachet-Samachar
Link to YouTube video explaining the project: https://youtu.be/5wdvW-OnuKU