The greatest threat to the golden era of the internet age is fake news. With millions of sources and a growing corpus of data being put up on the web daily, combatting disinformation is the need of the hour.
Apart from its obvious effects, disinformation on the web has drastic social consequences. When falsehoods spread under the guise of reliability, public acceptance of disinformation becomes a peril. A prominent example was seen during the COVID-19 pandemic, when incorrect information about the disease, ailments, cures and vaccines spread rapidly among netizens and affected public opinion for the worse. Moreover, the spread of falsehoods can negatively affect the communities being described, often patronising or villainising them. In the age of information, any wrong information can tarnish genuine efforts.
2. Scope and Scale:
The obvious solution is to tackle the issue at its roots, which demands an application that works across a variety of sources and technological interfaces. Fake news spreads on the internet through the following channels: clickbait headlines, link sharing on WhatsApp, Facebook and Twitter, and unfettered news sites. We therefore developed a single application that works on all of these channels. Our solution is a browser extension that is triggered by right-clicking a link or by opening a news link. Machine learning models then tell the user about the reliability, bias and objectivity of the given news item. Additionally, our Artificial Intelligence interface finds the most relevant and credible articles related to the one being read, and summarises articles before the user reads them.
3. Social Impact:
As a result, the user not only becomes more aware of which sources to refer to for news, but can also verify a shared news or data item in real time. Moreover, the feedback mechanisms in the app allow community resolution, which further limits the spread of disinformation. Since the extension works across platforms and websites and has a simple UI/UX, it is easy to use and can be adopted by a large number of people. Tackling disinformation at its root helps stop its spread and makes citizens more aware of real issues.
1. Machine Learning:
(i) NLP for News Metrics
A dataset of ~50k data points, containing news details and a binary True/Fake label, is used. The textual data is transformed into vectors by first applying a count vectoriser, which gives a one-hot-like (bag-of-words count) scheme for our words. TF-IDF (term frequency-inverse document frequency) weighting is then used to turn these word vectors into a better representation. A logistic regression model with the SAGA solver is trained on the dataset, achieving a validation accuracy of 96%. The model is then wrapped into a pipeline, and the pipeline is saved in the Open Neural Network Exchange (ONNX) format for use in our Flask server.
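The vectorise-then-classify steps described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's training script: the toy `texts` and `labels` stand in for the ~50k-row dataset, and the step names and `max_iter` value are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the labelled dataset (1 = true, 0 = fake); illustrative only.
texts = [
    "Health ministry publishes vaccine trial results",
    "Officials confirm new case numbers in weekly report",
    "Miracle cure hidden by doctors, share before deleted",
    "Shocking secret trick destroys virus overnight",
]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("counts", CountVectorizer()),    # raw word counts per document
    ("tfidf", TfidfTransformer()),    # re-weight counts by TF-IDF
    ("clf", LogisticRegression(solver="saga", max_iter=1000, random_state=0)),
])
pipeline.fit(texts, labels)

# Class probabilities for an unseen headline, ordered as pipeline.classes_.
proba = pipeline.predict_proba(["Doctors share shocking miracle cure"])[0]
print(dict(zip(pipeline.classes_, proba)))
```

Keeping the vectoriser and classifier in one `Pipeline` means the server only has to load a single object and can pass it raw text.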
(ii) Extractive Text Summarisation
Extractive methods summarise an article by identifying the important sentences or phrases in the original text and stitching them together to produce a condensed version. We compute the cosine similarity between each pair of sentences, rank the sentences by these scores, and add the highest-rated sentences to our summary.
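A rough sketch of similarity-based extractive summarisation is shown below. It assumes simple bag-of-words sentence vectors and scores each sentence by its total cosine similarity to all other sentences; the function names and the scoring rule are illustrative choices, not necessarily the ones used in the project.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words counts for one sentence (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarise(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [sentence_vector(s) for s in sentences]
    # Score each sentence by how similar it is to every other sentence.
    scores = [
        sum(cosine_similarity(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(top[:num_sentences]))


article = (
    "The vaccine trial reported strong results. "
    "The vaccine results were reported by the trial team. "
    "Bananas are yellow."
)
print(summarise(article))
```

Sentences that share vocabulary with many others score highest, so an off-topic sentence tends to be left out of the summary.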
2. Browser Extension
Chrome extensions work by using either page actions or browser actions. A page action is an action that is specific to certain pages. A browser action is relevant no matter where you are in the browser. We use page actions in order to generate a portfolio of relevant information for the site being considered.
Our extension has two set-ups:
(i) Right-Click Fact Check:
In the right-click fact check scenario, a user right-clicks a specific link that they want checked. Our extension adds a “Fact Check Link” option to the right-click menu. When it is selected, the link is sent to the backend server, which returns the article summary, the percentage reliability, and the bias and objectivity of the article. The result is then shown to the user using a Swal-2 based UI.
• Application: Checking links while browsing social media or reading links in your inbox, and summarising the linked article.
(ii) Current Article Fact Check:
In the current article fact check, the extension works when the user opens an article and clicks on our extension icon. The extension then extracts information from the web page, including the tab title and the URL. This is sent as a POST request to the backend, which returns the relevant details; these are displayed using a React.js based frontend. The extension also displays other relevant, highly credible articles to read if you’re interested in the topic.
The user then has an option to give feedback to our model about whether the report received is accurate or not, and the feedback is sent to the server as another request on the ‘/feedback’ route.
• Application: When referring to an article, reading to
3. Backend Server
(i) Rest API
Flask is a lightweight WSGI (Web Server Gateway Interface) web application framework.
• Feedback route
The ‘/feedback’ route receives the feedback from the user and stores it in a TinyDB database. This can then be used to retrain our model and optimise its performance.
• Predict route
The '/predict' route is a POST route that takes in the URL of the article and runs our ML model on the given article. It also uses the newspaper library to generate a summary and extract keywords from the article. The keywords are then used to make an API call which gives us similar articles.
(ii) Web scraping
We use Beautiful Soup to extract article data when a user right-clicks on an article link and runs our extension. We also use the newspaper library to extract article metadata by scraping the article.
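As an illustration of the scraping step, the sketch below pulls a title and paragraph text out of an HTML string with Beautiful Soup. Fetching the page is omitted, and the choice of `<title>` and `<p>` tags is an assumption; real news sites often need site-specific selectors.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return the page title and the concatenated paragraph text."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {"title": title, "text": "\n".join(p for p in paragraphs if p)}


sample_html = """
<html>
  <head><title>Example headline</title></head>
  <body>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </body>
</html>
"""
print(extract_article(sample_html))
```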
(iii) Classification Pipeline
The classification model is loaded using the joblib library and runs on the samples received.
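A minimal sketch of persisting and reloading a pipeline with joblib is given below. It assumes the pipeline was saved with `joblib.dump`; the tiny model, the temporary path and the file name are placeholders for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny placeholder model; the real pipeline is trained on the full dataset.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(["verified report from officials", "shocking secret miracle cure"], [1, 0])

# Persist the fitted pipeline, then load it back as the server would at start-up.
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

samples = ["officials release verified report"]
print(loaded.predict(samples))
```

Loading the model once at start-up, rather than per request, avoids repeated disk reads on every prediction.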
KIT SOLUTION SOURCE
ASATYA | A Browser extension to Combat Disinformation
Asatya is a browser extension that bundles a suite of tools to fight disinformation on the web. These tools inform the user of a news report's reliability, media bias and objectivity, and even summarize the report for the user's convenience.
VSCode and Jupyter Notebook are used for development and debugging. Jupyter Notebook is a web-based interactive environment often used for experiments, whereas VSCode provides a typical IDE experience for developers.
cmder is a console emulator package for Windows which supports bash.
EXPLORATORY DATA ANALYSIS
For extensive analysis and exploration of data, and to deal with arrays, these libraries are used. They are also used for performing scientific computation and data manipulation.
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Web scraping is an automatic method to obtain large amounts of data from websites.
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
urllib3 is a powerful, user-friendly HTTP client for Python.
Libraries in this group are used for the analysis and processing of unstructured natural language. The data is not used in its original form; it has to go through a processing pipeline to become suitable for machine learning techniques and algorithms.
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.
A REST API (also known as RESTful API) is an application programming interface (API or web API) that conforms to the constraints of REST architectural style and allows for interaction with RESTful web services. REST stands for representational state transfer.
Flask is a web application framework written in Python. Flask is based on the Werkzeug WSGI toolkit and Jinja2 template engine. flask-cors is a Cross Origin Resource Sharing ( CORS ) support for Flask.
TinyDB is a lightweight document-oriented database optimized for your database applications. It is written in pure Python and has no external dependencies. It targets small apps that would be blown away by a SQL-DB or an external database server.
Extractive methods attempt to summarise articles by identifying the important sentences or phrases from the original text and stitching together portions of the content to produce a condensed version.
Newspaper extracts article text and metadata, detects the article's language, and can run NLP routines such as keyword extraction and summarisation on the extracted text.
NEWS PARAMETER PREDICTION
We predict the veracity of news and use machine learning methods to find parameters like objectivity, bias and reliability.
scikit-learn includes simple and efficient tools for predictive data analysis. It is accessible to everybody and reusable in various contexts.
Joblib is a set of tools to provide lightweight pipelining in Python.
Kit Deployment Instructions
Follow the instructions below to run the Asatya Browser Extension.
git clone https://github.com/MananSuri27/CombattingDisinformation.git
npm run build
pip install -r requirements.txt
python3 -m flask run
Now open Chrome and go to chrome://extensions/ (or “More tools -> Extensions”).
Then, turn on Developer mode at the top-right corner.
Then, in the top-left corner, click on “Load unpacked” and select this path:
CombattingDisinformation -> extension -> build
Now pin the “Asatya” extension at the top right and enjoy the seamless experience :)
Aaryak Garg (github.com/Darthfire )
Arsh Kohli ( github.com/arshxyz )
Manan Suri ( github.com/MananSuri27 )