About this project
Who built this?
My name is Gonzalo. I work as an NLP & AI engineer in Madrid, Spain. Feel free to reach out to me with any questions or suggestions you may have; my contact info can be found below.
What is the purpose of this site?
This website hosts a multilingual search tool aimed at easing coronavirus research by making the available literature more accessible and centralised. Its main audience is researchers and doctors, but anyone interested in learning about the pandemic may also benefit from it.
How is it used?
You can search 44K+ papers on coronavirus-related topics from publishers (namely the American Society for Microbiology, BMJ, Elsevier, the New England Journal of Medicine, SAGE, Springer Nature and Wiley) as well as the main health science preprint servers (bioRxiv and medRxiv, among others). The search engine is multilingual: thanks to the AI model it uses, it can handle queries in several languages and retrieve papers in different languages for the same query. Also note that the results will only be as good as the papers available.
How does it differ from other search engines?
The first difference is the scope: this search engine contains a curated selection of resources related to COVID-19 research rather than documents of any type or subject. Furthermore, it takes into account not only the title but also the content of the paper, so it can return more relevant results that, for example, use related terms even when they are not mentioned in the title. In this way the tool is semantically more flexible, thanks to the AI model used.
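As a toy illustration of this semantic flexibility (not the actual model used by the site), the sketch below shows how embedding-based similarity can match a query to a paper even when the query shares no keywords with the title. The titles, the query and the four-dimensional vectors are hand-made stand-ins for real sentence embeddings:

```python
import numpy as np

# Hand-made 4-dimensional "embeddings" standing in for real sentence
# vectors; in the actual tool these come from a neural sentence encoder.
docs = {
    "Transmission dynamics of SARS-CoV-2": np.array([0.9, 0.1, 0.0, 0.1]),
    "Gastrointestinal symptoms in paediatric patients": np.array([0.1, 0.9, 0.2, 0.0]),
}
# Query: "how does the virus spread?" -- no word overlap with either title.
query = np.array([0.8, 0.2, 0.1, 0.1])

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank papers by semantic similarity, not by shared title words.
ranking = sorted(docs, key=lambda t: cosine(docs[t], query), reverse=True)
print(ranking[0])  # the transmission paper ranks first
```

Because ranking works on vector similarity, the transmission paper wins despite sharing no words with the query, which is exactly what a keyword-only engine would miss.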
How was it done?
The project was developed using the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI. A subset of papers, those including a title and a valid abstract, was first selected with the help of scispaCy. To index the articles, a deep Natural Language Processing model from Google (the Multilingual Universal Sentence Encoder) was applied to each abstract (plus title) to encode it into a 512-dimensional vector. Finally, the vectors were indexed using Facebook's Faiss library, which is also used to retrieve the relevant documents. The site was deployed with the Flask microframework, served by Gunicorn behind an Nginx server running on an AWS EC2 instance.
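The encode-index-retrieve pipeline above can be sketched in a few lines. Since the real Multilingual Universal Sentence Encoder and Faiss are heavy dependencies, this minimal NumPy sketch stands in for both: `encode` is a placeholder for the encoder (which maps title+abstract text to a 512-dimensional vector), and the inner-product search mimics what a flat Faiss index (e.g. `IndexFlatIP`) computes.

```python
import hashlib
import numpy as np

DIM = 512  # embedding size of the Multilingual Universal Sentence Encoder

def encode(text: str) -> np.ndarray:
    """Placeholder for the sentence encoder: a deterministic pseudo-embedding
    derived from a hash of the text. In the real pipeline this would be the
    encoder model applied to the title plus abstract."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=DIM)
    return v / np.linalg.norm(v)  # L2-normalise so inner product = cosine

# 1. Encode every title+abstract and stack the vectors into one matrix.
abstracts = ["paper one abstract ...", "paper two abstract ...",
             "paper three abstract ..."]
index = np.stack([encode(a) for a in abstracts])  # shape: (n_docs, DIM)

# 2. At query time, encode the query and take the top-k documents by
#    inner product -- the same scoring a flat Faiss index performs.
def search(query: str, k: int = 2):
    scores = index @ encode(query)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

results = search("paper one abstract ...")
```

Querying with the exact text of the first abstract returns that document with a score of 1.0, since both sides map to the same unit vector; real queries land somewhere below that, ranked by semantic closeness.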