Authors
Vedavyas Mallela, Lena Pang, Aadit Trivedi, Annie Wang, Alice Zhou, Jonathan Mak, Cindy Nguyen
Abstract
With the outbreak of SARS-CoV-2 (COVID-19), obtaining up-to-date, location-specific news has become increasingly important. However, with the rise of misinformation, accurate, region-specific, and user-relevant information becomes buried. Our primary research objective was to deliver precise data to communities through a website with a categorical layout of accurate, region-specific information. Because COVID-19 policies vary by location, we believe region-specific news can help users in their day-to-day activities. While global news on COVID-19 is still beneficial for general knowledge, it does not provide people with the specifics they need when going outside in their community. We created an algorithm that automates COVID-19 search engine queries for localized news specific to the user's region. We also defined a metric that quantifies search query efficiency in order to fine-tune our query algorithm. Using this algorithm, we created a website called COVerage that displays localized news in five COVID-19-specific categories: policies, education, biology, economy, and statistics. Each news article is displayed with an image, a summary generated by an NLP model, and the three most relevant keywords, which help users find more articles related to those words. We also include a county-specific map showing up-to-date statistics on the number of cases based on the user's location.
Figure 1. Website interface of our COVerage website. The left side displays the most relevant news articles, each with a headline, summary, and tags. The right displays a map corresponding to the user's location with up-to-date statistics.
Introduction
- SARS-CoV-2 and Misinformation
The SARS-CoV-2 (COVID-19) outbreak was initially reported on December 31, 2019, in Wuhan, China, but rapidly spread into a pandemic, threatening lives globally. Due to the large scale of this issue, there has been a rise in misinformation: demand for information is high, and people are eager to share supposed "remedies", "causes", and other "news" about the coronavirus that is largely inaccurate and potentially even detrimental to the well-being of many. To help combat this issue, many recognized organizations, researchers, scientists, and public officials are focusing their efforts on providing trustworthy, up-to-date information that can be easily accessed by the general population.
- Neural Networks
Neural networks are multi-layer algorithms that use machine learning to recognize patterns in data sets. Each layer in the network is made up of many neurons, each with an activation function, and the activations of one layer feed into the activations of the next. The network takes images or text as input, passes them through its hidden layers, and produces an output. Given a suitable loss function for the particular network [2], backpropagation computes how each weight should be nudged to reduce the loss, and stochastic gradient descent applies these nudges, thereby optimizing the model.
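To make the forward pass, backpropagation, and gradient update concrete, the following is a minimal sketch of a single sigmoid neuron trained by stochastic gradient descent on one toy example; the squared-error loss, learning rate, and data are illustrative assumptions, not the models used in this work.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-neuron "network": y_hat = sigmoid(w . x + b)
rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.0
x, y = np.array([0.5, -1.2, 0.3]), 1.0  # one illustrative training example
lr = 0.1                                # learning rate

for _ in range(100):
    y_hat = sigmoid(w @ x + b)                    # forward pass
    grad = 2 * (y_hat - y) * y_hat * (1 - y_hat)  # backpropagation through loss and sigmoid
    w -= lr * grad * x                            # stochastic gradient descent update
    b -= lr * grad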
Using these neural networks coupled with a large language model such as GPT-3, we can organize data from a wide range of online news sources. Artificial neural networks (ANNs) can be used to classify many kinds of text, and their outputs can then be translated back into words, providing an accurate and coherent summary of the source material.
- Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field of machine learning that allows computers to analyze, understand, and generate human language [1]. NLP and text information retrieval (IR) research have begun to intersect, enabling features such as sentiment analysis, machine translation, and the extraction of meaning from text. After gathering data, NLP allows for succinct, unbiased text summarization, utilizing existing language models like OpenAI's GPT-3 to break down sentence meanings syntactically.
There are currently many ways in which neural networks are being used and refined for both abstractive and extractive text summarization. The two approaches differ in that extractive summarization takes words and phrases directly from the document and pieces them together, while abstractive summarization re-words the document using natural language processing [3]. Historically, extractive summarization has been much more common because it is easier to compute with existing methods. However, newer abstractive methods are being developed, as abstractive summarization is a more concise and accurate way of summarizing information, as shown in multiple studies directly comparing the two [4]. Some of the most recent methods for text summarization include a pointer-generator model [5] developed by researchers at Stanford University and Google Brain, a document-context-based Seq2Seq model using RNNs [6], and a combination of BERTSUM, a general news summarization BERT model, with Generative Pretrained Transformer 2 (GPT-2) [7].
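For illustration, here is a minimal frequency-based extractive summarizer in Python; it is a simple sketch of the extractive approach, not an implementation of the cited pointer-generator, Seq2Seq, or BERT/GPT-2 models, and the scoring scheme is our own assumption.

from collections import Counter
from nltk.corpus import stopwords                        # requires nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize   # requires nltk.download('punkt')

def extractive_summary(text, n_sentences=3):
    # Score each sentence by the document frequency of its non-stopword tokens.
    stop = set(stopwords.words("english"))
    tokens = [w.lower() for w in word_tokenize(text)
              if w.isalpha() and w.lower() not in stop]
    freq = Counter(tokens)
    sentences = sent_tokenize(text)
    score = {s: sum(freq[w.lower()] for w in word_tokenize(s)) for s in sentences}
    top = set(sorted(sentences, key=score.get, reverse=True)[:n_sentences])
    # Return the highest-scoring sentences in their original order.
    return " ".join(s for s in sentences if s in top)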
Related Works
Due to the recent shift in information retrieval from keyword matching to semantic vector search, transformer-based models like BERT are widely used to map queries and documents into semantic vectors. By calculating vector distance rather than looking at keywords, it becomes easier to compare relevant and related search results. However, these self-attention models are inefficient in that they lack the necessary prior knowledge. To make up for this flaw of the BERT model, researchers at Microsoft developed BISON, a BM25-weighted self-attention framework: a document query architecture that calculates semantic representations to rank documents accordingly [8]. BM25 scores words based on their rarity in a query or document and assigns them different weights. In BISON, BM25 determines the respective weights for the document and query before each is tokenized; the tokens are then passed through the BISON encoder before cosine similarity is calculated. To test the quality of BISON, the researchers compared its performance against other models such as XLNet and BERT, using Bing's intrinsic query set and MS MARCO, a collection of datasets focused on deep learning in search. For Bing's intrinsic query set, the models were evaluated using Normalized Discounted Cumulative Gain (NDCG) and Normalized Cumulative Gain (NCG), which measure precision and recall respectively; for MS MARCO, the models were evaluated using Mean Reciprocal Rank (MRR). In both cases, BISON performed best. This work contributes a new method by which search engines can improve their document retrieval quality. We also use the BM25 metric later, in the development of our document query metric, the LQM (see Section 1.2), to validate our metric by looking for similar patterns between the two.
Furthermore, to address the issue of long queries not returning relevant results, TinySearch, a semantics-based search engine, was developed [9]. Its architecture consists of a BERT server that generates embedding vectors, combined with a neural network built in Keras and trained on the Quora Question Pairs dataset to produce a similarity score. TinySearch ultimately demonstrated improvements on longer or more complex queries, while it struggled with vaguer, shorter ones. TinySearch was useful to our metric development because its query-document similarity score is close in spirit to the abstractive intersection between query and document that we compute. However, our metric differs in that TinySearch uses BERT embeddings for semantic similarity, whereas we use the Natural Language Toolkit's (NLTK) WordNet corpus, whose synsets map words to their synonyms.
Methods and Materials
- COVerage Algorithm
1.1 Localized Query Algorithm
For our main website to load responses faster, we created a RESTful API hosted on Heroku; it is live at http://coveragee-api.herokuapp.com. The API was built in Python and served with Flask. To browse news articles, we used the GoogleNews Python library, which scrapes news articles from Google's search engine based on customized queries. We tested three differently worded queries per category to achieve a wide range of results, and after determining the most effective query (in terms of news relevancy; see Section 1.2), we used that query in our searching algorithm. To ensure the recency of these news articles, we used Python's datetime library to verify that each article was published no earlier than 7 days before the API request.
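A minimal sketch of this step is shown below, assuming the GoogleNews package's search interface and a 'datetime' field on each result; field availability varies by result, so the exact keys should be treated as assumptions.

from datetime import datetime, timedelta
from GoogleNews import GoogleNews  # pip install GoogleNews

def fetch_recent_news(query, max_age_days=7):
    # Ask Google News for results from roughly the last week.
    googlenews = GoogleNews(period=f"{max_age_days}d")
    googlenews.search(query)
    cutoff = datetime.now() - timedelta(days=max_age_days)
    # Re-verify publish dates with Python's datetime library, keeping
    # results whose 'datetime' field is missing or within the window.
    return [r for r in googlenews.result()
            if not isinstance(r.get("datetime"), datetime) or r["datetime"] >= cutoff]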
1.2 COVerage Localized Query Metric (LQM)
A prominent issue with localized queries is that many of the returned websites were only tangentially related to the query in terms of the content presented. To select appropriate queries, we crafted our own metric to shape our localized query algorithm; the metric was intended to help fine-tune the algorithm so that the API presents the most relevant content to our users.
S_query, the query distribution, represents the effectiveness of a query based on the website content the search engine returns. First, the intersection of the words from the query, w_query, with the words from the website content text, w_text, is calculated. This can be done extractively (word for word) or abstractively (using vocabulary distributions). We used the Natural Language Toolkit (NLTK) and wordnet from nltk.corpus to calculate the abstractive intersection, cross-checking synonyms of the words in the news article against the query; the vocabulary distributions, specifically, were obtained through wordnet and its synsets. To maintain proportionality to the website content text, the size of the intersection is divided by the number of words in the text, and the result is scaled by 10 so that S_query typically falls between 0 and 1.
Finally, we transformed the query distribution through the sigmoid activation function. The purpose of this transformation was to achieve a scalar value bounded by the limits 0.5 and 1.
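Putting the pieces together, a minimal sketch of the LQM computation might look like the following; the tokenization, the exact synonym-expansion scheme, and the function names are our own assumptions about one plausible implementation.

import math
from nltk.corpus import wordnet          # requires nltk.download('wordnet')
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def synonyms(word):
    # The word itself plus every WordNet lemma in its synsets.
    names = {word}
    for synset in wordnet.synsets(word):
        names.update(lemma.name().lower() for lemma in synset.lemmas())
    return names

def lqm(query, text):
    text_words = [w.lower() for w in word_tokenize(text)]
    expanded_query = set()
    for w in word_tokenize(query):
        expanded_query |= synonyms(w.lower())    # abstractive expansion of the query
    # Abstractive intersection, kept proportional to the text length.
    intersection = set(text_words) & expanded_query
    s_query = 10 * len(intersection) / max(len(text_words), 1)
    return 1 / (1 + math.exp(-s_query))          # sigmoid bounds the score in (0.5, 1)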
Our queries were organized into the following categories: statistics, school information, policies and laws, vaccine progress, information about the economy, and donations. The news articles displayed on our website were organized by region-specific and categorical relevancy, as determined by the LQM.
1.3 Categorization
To group our queries by category and deliver them to users in a cleaner format, we searched for news on policies, education, biology, economy, statistics, and donations and displayed each category as a row on our front end. Responses from the COVerage API are grouped in this format as well. For each article, the categories are ranked by the LQM (which, in our results, falls around 0.55-0.60) to determine how relevant the article is to each specific category; a sketch of this ranking follows below. By displaying local donation websites to users, we can help them find ways to aid the prevention of COVID-19.
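As a sketch of how this per-category ranking could work with the lqm function above, consider the snippet below; the category query templates are hypothetical placeholders, not the exact queries used in production.

# Hypothetical query templates; the production queries were chosen via the
# LQM evaluation described in Section 1.2.
CATEGORY_QUERIES = {
    "policies":  "{place} coronavirus policies laws",
    "education": "{place} coronavirus school education",
    "economy":   "{place} coronavirus economy",
}

def rank_categories(article_text, place):
    # Score the article against each category's localized query with the LQM.
    scores = {cat: lqm(q.format(place=place), article_text)
              for cat, q in CATEGORY_QUERIES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)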
At first, we automated the scraping of news URLs, images, and text ourselves, but to optimize our website's speed, we switched to Python's newspaper3k library, which provides our users with the headline, summary, images, and tags more quickly.
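The newspaper3k extraction step can be sketched as follows; the card-field names are our own, but Article's download/parse/nlp methods and its title, summary, top_image, and keywords attributes are the library's standard interface.

from newspaper import Article  # pip install newspaper3k

def article_card(url):
    article = Article(url)
    article.download()
    article.parse()   # populates title, text, and top_image
    article.nlp()     # populates summary and keywords
    return {
        "headline": article.title,
        "summary": article.summary,
        "image": article.top_image,
        "tags": article.keywords[:3],  # shown as the article's keyword tags
    }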
- Location-Based Queries and Maps
2.1 U.S. Queries and Transmission Maps
To make our news and data localized, we used the HTML5 Geolocation API to obtain the user's location as longitude and latitude points. These points are fed into an API from the Federal Communications Commission (FCC) that returns the county information corresponding to those coordinates in the form of Federal Information Processing Standards (FIPS) data. The FIPS data include the county and state information that is later vital for our queries, and because the API is backed by government census data, it gives reliable localization results. The coordinates are sent in a POST request, received by Flask, and then used for queries.
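A minimal Flask sketch of this lookup is shown below; the route name and response fields are our assumptions, and the FCC Census Block API endpoint and JSON structure should be verified against the FCC's documentation.

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# FCC Census Block API (endpoint and response fields assumed from public docs).
FCC_URL = "https://geo.fcc.gov/api/census/block/find"

@app.route("/location", methods=["POST"])
def location():
    coords = request.get_json()  # e.g. {"lat": 37.42, "lon": -122.17}
    fcc = requests.get(FCC_URL, params={
        "latitude": coords["lat"],
        "longitude": coords["lon"],
        "format": "json",
    }).json()
    # The county/state FIPS data drives the localized queries.
    return jsonify({
        "county_fips": fcc["County"]["FIPS"],
        "county": fcc["County"]["name"],
        "state": fcc["State"]["name"],
    })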
To generate maps of SARS-CoV-2 spread rates in every county in the US, the county FIPS numbers are fed into the embedded COVID-19 spread map from Big Local News (a joint initiative from Google and Stanford Journalism) [10], which then displays localized, regularly-updated COVID-19 statistics from The New York Times. The map is useful to our users because it displays information about COVID-19 rates in their area right next to the news.
2.2 International Queries and Global Map View
If users access our platform internationally, a reverse-geocoding Python library is used to map their coordinates to a city and country, which are then used as queries in the search API. The news displayed will be relevant to the specific city the user is located in. However, because the map from Big Local News is limited to counties within the U.S., the site instead shows a separate global-view map of spread rates by country.
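One offline option for this step is the reverse_geocoder package, assumed in the sketch below along with its result fields; it is one plausible choice rather than a confirmed part of our stack.

import reverse_geocoder as rg  # pip install reverse_geocoder (works offline)

def city_country(lat, lon):
    # rg.search takes a list of (lat, lon) tuples and returns the nearest
    # known place for each, with 'name' (city) and 'cc' (country code).
    match = rg.search([(lat, lon)])[0]
    return match["name"], match["cc"]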
When users decline location permissions, the displayed data default to a global view: an international map and news from across the world, rather than region-specific news and maps [11].
Figure 2. The website view users will see if they decline to share their location. On the left, the website displays global news, and on the right, a worldwide coronavirus map is shown.
Figure 3. An example news card of a global article that includes an extracted image, a headline, a summary, and keywords linked at the bottom.
- Design
To create a website and mobile app, we first had to determine a basic design that would be user-friendly and intuitive. Here we discuss our design process.
3.1 Website
Our ultimate goal for the web design was to make it concise and easy to navigate. The web design process began with a mockup in Google Slides, where we went through multiple versions before deciding on the final layout and colors. The color scheme is composed of Stanford's Cardinal, Cloud, and white, creating a lighter appearance overall. The fonts are also from Stanford's identity toolkit: Crimson Text for the title and Source Sans Pro for the body. Given that the website's purpose is to keep information concise, we wanted to keep everything on one page. As such, we created a two-column format: the left column holds scrollable rows of news by category, and the right column contains our location-specific map, where users can see statistics for their county. The About Us section is also in the right column to even out the lengths of the two columns.
3.2 App
Having settled on the web design, we used MarvelApp to create a mobile app prototype.
Figure 4. Mockups of the COVerage app. Tags allow users to navigate from each category of articles, and the bottom menu bar provides an About Us page as well as maps of COVID-19 cases in specific counties.
The app's colors and fonts are consistent with the website, but we reorganized the layout to better fit the smaller window dimensions. We took inspiration from the Instagram and LinkedIn mobile apps, using a bottom navigation menu and category tags, respectively, as intuitive ways for users to navigate the app and distinguish its sections. We made multiple tabs: one for the articles, one for the map, and one for the About Us section. We integrated the optional ZIP code entry into a small settings button in the top right, as it is a minor feature and should not distract from the rest of the design. For the app, we prioritized simplicity and readability rather than keeping all the information on one page.
Results
Figure 5. LQM Results for Yellowstone County, Montana Coronavirus Economy-related Queries | Query made on August 3, 2020 | Similar trend between average values of LQM / BM25
Figure 6. LQM and BM25 Results for Providence County, Rhode Island Coronavirus Education-related Queries | Query made on August 3, 2020 | Trend resemblance between LQM and BM25 values indicates the viability of LQM
Figure 7. LQM Results for International vs. Local (Manhattan, NY) Vaccine Progress and Biology | Query made on August 6, 2020 | Similarities between the LQM results of both graphs led us to display international vaccine news to users regardless of location
Discussion
For our results, we compared the LQM metric we created for document relevance to BM25, another ranking function widely used by search engines. Using test data from Google searches, we compared LQM scores with BM25 scores for the first three results of each Google News search, to validate our formula and verify that the two metrics, while not the same, show similar patterns. In each pair of graphs, one for LQM and one for BM25, the y-axis shows the score given by the metric and the x-axis shows the three different queries used. The search queries were determined by the specific group of information we were looking for, and represent three similar ways to search for news on the same overarching category.
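For reference, BM25 scores like those we compared against can be computed with an off-the-shelf implementation; the rank_bm25 package below is an assumption for illustration, not necessarily the implementation used in our experiments.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def bm25_scores(query, documents):
    # Score each document against the query with the Okapi BM25 variant.
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    return bm25.get_scores(query.lower().split())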
Figure 6 is the best example of the general matching trend between LQM and BM25. The LQM graph for the query "Providence County Rhode Island coronavirus education" shows the expected trend: the first Google News result is the most relevant, the second result the second most relevant, and so on. We hypothesize that the discrepancy in this trend among the other graphs arises because our metric prioritizes location- and COVID-19-category-specific news rather than query keywords.
Another comparison we chose to make was between two LQM graphs: one for international coronavirus vaccine/biology news and one for localized coronavirus vaccine/biology news. The LQM metric gives similar results for the international queries and for queries made specifically for Manhattan, NY (Fig. 7). Because the data points were very close, we ultimately decided that there was no added benefit in customizing the news in the Biology category by user location, as localization did not obtain a higher LQM score. Therefore, our COVerage website displays international vaccine and biology news to all users no matter what region they are in. Beyond the graph results, this decision also reflects our reasoning that COVID-19 vaccine progress is a global effort that users across the world would be interested in regardless of where the vaccine development and testing take place.
Conclusion
We created COVerage with the intent of allowing users from both national and international backgrounds to receive localized, relevant news in a matter of seconds. To ensure news article relevancy, we created a search algorithm optimized for COVID-19 news that actively calculates relevancy to the region-specific, local query using the LQM metric, which indicates the efficiency of a custom COVID-19-related query in the context of Google News results. Our algorithm allows users to receive accurate news in five specific overarching categories of COVID-19.
While other resources provide accurate COVID-19 data, COVerage is different in that it provides both localized news and map statistics on one page, allowing easy access for the user. Unlike a regular news search, COVerage lets users go directly to the exact categorical news they are looking for.
Future Directions
We aim to implement a real-time machine learning-based summarizer on our site in the future as opposed to relying on newspaper3k. We believe that by training and running our own text summarizer, the model can be tailored towards both local and categorical summaries – the two primary focuses of COVerage.
We also plan on making our website into a progressive web application and releasing it on the Apple App Store and Google Play Store. We have already created several mockups of potential app designs and are working on making the app more intuitive and user-friendly.
The COVerage Search Query Algorithm can be utilized to improve the efficiency of automated searches. Search engines could also use the COVerage Query Metric to optimize web crawlers to look specifically for results whose content more closely reflects the user's queries.
The code for COVerage’s main website can be found at: https://github.com/vmallela0/COVerage
The code for COVerage’s REST API (Region-specific algorithm + LQM Metric) and the notebook for LQM Metric Visualization can be found at: https://github.com/AaditT/coverage-api
Acknowledgments
We would like to express our gratitude to our mentors, Cindy Nguyen, and Jonathan Mak, for their continuous support, guidance, and insightful feedback throughout the project. We would also like to sincerely thank Professor Tsachy Weissman, Cindy Nguyen, and the Stanford Compression Forum for providing us with the opportunity to engage in a research internship. Without them, our work would not have been possible.
References
[1] Nadkarni, P. M., Ohno-Machado, L., & Chapman, W. W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association: JAMIA, 18(5), 544–551. https://doi.org/10.1136/amiajnl-2011-000464
[2] Janocha, K., Czarnecki, W. M. (2017). On loss functions for deep neural networks in classification. Retrieved August 2, 2020, from the arXiv database.
[3] Tretyak, V., Stepanov, D. (2020). Combination of abstractive and extractive approaches for summarization of long scientific texts. Retrieved August 3, 2020, from the arXiv database.
[4] Kallimani, J., Srinivasa, K., & Reddy, B. (2016). Statistical and analytical study of guided abstractive text summarization. Current Science, 110(1), 69-72. Retrieved August 3, 2020, from www.jstor.org/stable/24906612
[5] See, A., Liu, P. J., & Manning, C. D. (2017, April 25). Get to the point: summarization with pointer-generator networks. Retrieved July 13, 2020, from https://arxiv.org/abs/1704.04368
[6] Khatri, C., Singh, G., & Parikh, N. (2018, July 29). Abstractive and extractive text summarization using document context vector and recurrent neural networks. Retrieved July 13, 2020, from https://arxiv.org/abs/1807.08000
[7] Kieuvongngam, V., Tan, B., & Niu, Y. (2020, June 03). Automatic text summarization of COVID-19 medical research articles using BERT and GPT-2. Retrieved July 13, 2020, from https://arxiv.org/abs/2006.01997
[8] Shan, X., Liu, C., Xia, Y., Chen, Q., Zhang, Y., Luo, A., & Luo, Y. (2020, July 10). BISON: BM25-weighted self-attention framework for multi-fields document search. Retrieved August 3, 2020, from https://arxiv.org/abs/2007.05186
[9] Patel, M. (2019, August 07). TinySearch — Semantics based Search Engine using Bert Embeddings. Retrieved August 3, 2020, from https://arxiv.org/abs/1908.02451
[10] COVID-19 Case Mapper. (n.d.). Retrieved August 07, 2020, from https://covid19.biglocalnews.org/county-maps/index.html
[11] Coronavirus (COVID-19) Map: Cases Worldwide | Domo. (n.d.). Retrieved August 07, 2020, from https://www.domo.com/covid19/geographics/global/