Skim: Scalable and Robust Machine Learning
Platform for Research Community
Omkar Prabhu
Cogito Corp
prabhuomkar@pm.me
Jane D’souza
ConnectWise
janeadsouza@pm.me
John D’souza
Quantiphi
dsouzaajohn30@pm.me
Akshay Pithadiya
Tata Consultancy Services
akshaypithadiya@gmail.com
Abstract—In the last few years, Machine Learning has evolved at a rapid pace. With its burgeoning application in domains ranging from health care to finance to media and entertainment, the list has been endless and the amount of research devoted has been tremendous. Each day, arXiv alone receives almost 100 papers in categories related to ML. In this fast-moving world of research, it is hard for the research community to stay up-to-date with the latest developments. We introduce a platform called Skim (https://skimhq.tech), built specifically to solve the problems users face due to the meteoric rate of growth in the field of ML.
This paper presents the design and implementation of Skim, a scalable and reliable platform. It also delineates the implementation of the Paper Similarity functionality for Skim, along with a multitude of other features like Conferences, Paper Previews, Charts and Racks.
Index Terms—machine learning, natural language processing
I. INTRODUCTION
It has always been challenging to sift through the clutter of hundreds of papers arriving on arXiv daily, to keep up with the enormous quantum of research generated. Many initiatives to facilitate ease of access to these papers have been made in the past, and these have been widely appreciated and used by the research community. However, another challenge faced by these conventional websites has been ensuring scalability and high availability for a growing number of users. Recognizing the paucity of platforms addressing these problems, we built Skim. Skim serves a wide range of users in the community, from ML enthusiasts who track the trends to hardcore researchers who need conference information and statistics along with quick access to research papers and organizing capabilities. In this paper, we discuss the implementation of these features provided by Skim.
II. RELATED WORK
arXiv is a renowned repository for research papers, hosting nearly 2 million papers as of October 2021. It is easy to build software on top of it thanks to its public API1 and dataset2.
A. Projects and Platforms
arxiv-sanity [1] is one of the most popular websites for reading arXiv papers. It covers the ML categories, providing functionalities like paper previews and searching for similar papers via TF-IDF. Zeta Alpha [2] is a mature navigator with Transformer-powered search for similar papers and articles; apart from this, it offers largely the same features as arxiv-sanity, minus the previews. Onikle [3] is a paper list management platform which recommends papers to users based on their "chronicle", which is simply a list of papers. Onikle, too, has limited capabilities compared to arxiv-sanity.

1arXiv API Access, https://arxiv.org/help/api/
2arXiv Kaggle Dataset, https://www.kaggle.com/Cornell-University/arxiv
B. Papers & Datasets
Springstein et al. [4] describe the initial set of tools built around arXiv using the programming interfaces provided by arXiv, and sketch out a design for boosting a paper based on its popularity. Scharpf et al. (2020) [5] study the classification and clustering of mathematical arXiv documents. They observe that text and mathematical encodings are separate features of a document with relatively low correlation between them, and find that docText tfidf outperforms other methods like doc2vecText on average when it comes to classification accuracy. Alvarez et al. (2020) [6] discuss possible applications of LDA for topic modeling and recommendations to help users view papers based on their preferences.
III. OVERVIEW AND MOTIVATION
A. Benefits for Research Community
With the help of Skim, people can keep up with the rapid growth in the number of papers. This is especially valuable in academia, where researchers need to work on many areas intersecting with CS, such as biology, astrophysics and chemistry.
Fig. 1. arXiv submissions for CS category
On arXiv, it is very hard to take a quick glance through a paper. There is also no way to view arXiv papers that are part of conference proceedings as a list. With Skim's Conferences feature, one can easily see all arXiv papers that are part of each year's proceedings. With Skim, researchers who are focusing on a specific area or task can get similar papers, datasets, surveys, etc.
B. Goals
• Paper Previews: Image preview for each page of a lengthy paper, with an index to follow through the order, allowing readers to easily jump between sections.
• Papers associated with Conferences: arXiv papers linked with conference proceedings and made available as year-wise Racks.
• Better visibility into Conferences: Users can find all the available information about ML conferences, such as deadlines and locations, and glance through yearly statistics such as acceptance rates and topic trends.
• Paper Similarity & Recommendation: A smart finder for similar papers and a recommender system based on users' reading history and interests.
• Social platform for ML enthusiasts: Users can follow other users who have signed up on the platform and share the papers they are reading by creating public Racks, which others can like and follow.
Given our goals and further study, it was hard to ignore the platform's similarity to Spotify3: papers are tantamount to songs, racks are equivalent to playlists, conferences are similar to albums, and trending papers are nothing but charts. This inspired us to build Skim as "The Spotify for ML research".
IV. ARCHITECTURE
In this section, we present the proposed architecture by
describing the components, their design and implementation
details. We needed Skim to have a technology stack that encourages reusability, minimizes time-to-market, and is best suited to the types of tasks our service needs to perform. Since the platform was highly experimental at the beginning, we also wanted the architecture to be easy to deploy and scale. Based on these requirements, the architecture has the following components: a Frontend and Backend API Service, a Worker and a Machine Learning Service.
A. Frontend and Backend Service
Skim's frontend is built using React.js and a popular UI framework. The backend is a GraphQL API built using gqlgen in Golang, with MongoDB as the database. To improve performance, we have implemented dataloaders as well as several caching strategies using Redis.
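To illustrate the dataloader idea, which batches per-request lookups to avoid N+1 database queries, the following toy sketch is written in Python for brevity, although Skim's backend service uses gqlgen/Golang; all names here are illustrative, not Skim's actual code.

# A toy sketch of the dataloader pattern: resolvers enqueue keys and get
# back thunks; forcing the first thunk flushes the queue with ONE batched
# query instead of N individual lookups.
class DataLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn          # fetches many keys in a single query
        self.queue, self.cache = [], {}

    def load(self, key):
        if key not in self.cache and key not in self.queue:
            self.queue.append(key)
        return lambda: self._resolve(key) # deferred until the batch is flushed

    def _resolve(self, key):
        if self.queue:                    # flush all pending keys at once
            for k, v in zip(self.queue, self.batch_fn(list(self.queue))):
                self.cache[k] = v
            self.queue.clear()
        return self.cache[key]

# Usage: many resolvers call load(), but only one batched fetch runs.
authors = DataLoader(lambda ids: [f"author-of-{i}" for i in ids])
thunks = [authors.load(i) for i in ("p1", "p2", "p1")]
print([t() for t in thunks])  # ['author-of-p1', 'author-of-p2', 'author-of-p1']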
B. Worker Service
Worker, as the name suggests, is the core component of the Skim architecture. This Golang microservice handles most of the crucial tasks, from sending email notifications, user invites and logins to running scheduled jobs such as syncing arXiv papers. It also extracts the embedded images, diagrams and thumbnails for the Paper Preview & Thumbnail feature. In addition, it crawls Twitter for papers that researchers around the world are talking about; these tweets form the basis of the Charts4. By processing the data synced from Twitter, the worker also posts the trending papers to Twitter daily.

3Spotify: Music streaming service powered by AI, https://www.spotify.com/

Fig. 2. Platform Architecture
C. Machine Learning Service
This service provides similar paper suggestions to users based on the paper they are currently reading. Inference is performed by loading in-house research models. To provide high throughput, Skim's machine learning component is a gRPC service fronted by a TTL-based cache.
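As a minimal sketch of the caching side of this design (the gRPC plumbing is omitted), the snippet below wraps inference in a TTL cache; run_model and the cache parameters are hypothetical stand-ins, not Skim's actual code.

# A minimal sketch of TTL-based caching in front of model inference.
from cachetools import TTLCache, cached

def run_model(arxiv_id: str) -> tuple:
    # Hypothetical stand-in for loading the in-house model objects and
    # scoring; returns the ids of the most similar papers.
    return ("arxiv-id-1", "arxiv-id-2")

cache = TTLCache(maxsize=10_000, ttl=3600)  # cached results expire after 1 hour

@cached(cache)
def similar_papers(arxiv_id: str) -> tuple:
    return run_model(arxiv_id)  # executed only on a cache miss

print(similar_papers("some-arxiv-id"))  # first call computes; repeats hit the cache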
V. MACHINE LEARNING WORKFLOW
To provide the Machine Learning Service with the required
model objects for paper similarity, a workflow was designed
that would get triggered by the Worker Service. When this
workflow is triggered, it runs a Python script that extracts all the data from the MongoDB database and pre-processes the paper titles. Other features of the data include the arXiv id, summary and author(s) of the research papers. Using Gensim5, TF-IDF vectors are computed for all
the papers in our database. These TF-IDF vectors are saved in shards, which are made available for inference once the machine learning server is initiated, along with other model objects used to optimize query results. A .zip file of all the required model objects is then version-tagged by date and released on a GitHub repository, from where it is made available to the machine learning service. This workflow is triggered weekly to update the TF-IDF vectors with all the new papers that enter our database and to provide better suggestions to the end user.
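As a rough illustration of this job, the following Gensim sketch builds and shards TF-IDF vectors from paper titles; the MongoDB database/collection names and output paths are assumptions for illustration, not Skim's actual configuration.

# A minimal sketch of the weekly TF-IDF build under stated assumptions:
# the "skim"/"papers" names and "shards/" paths are illustrative guesses.
from pymongo import MongoClient
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import Similarity
from gensim.utils import simple_preprocess

# Pull paper titles out of MongoDB (names are hypothetical).
papers = MongoClient()["skim"]["papers"].find({}, {"title": 1})
titles = [p["title"] for p in papers]

tokenized = [simple_preprocess(t) for t in titles]    # lowercase + tokenize
dictionary = Dictionary(tokenized)                    # token -> integer id
bow = [dictionary.doc2bow(doc) for doc in tokenized]  # bag-of-words corpus
tfidf = TfidfModel(bow)                               # fit IDF weights

# Gensim's Similarity index shards itself to disk, mirroring the sharded
# TF-IDF vectors described above; the shards can then be zipped and released.
index = Similarity("shards/skim", tfidf[bow], num_features=len(dictionary))

dictionary.save("shards/skim.dict")
tfidf.save("shards/skim.tfidf")
index.save("shards/skim.index")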
A. Model Details
After computing the TF-IDF vectors, we need to measure the similarity between two papers in order to provide suggestions. The metric used to quantify the similarity of two papers from their TF-IDF vectors is Cosine Similarity. Sitikhu et al. (2019) [7] explain the process of calculating Cosine Similarity using TF-IDF vectors. In the following diagrams, we demonstrate how the model interprets the similarity between two papers based on their Cosine Similarity.

4Skim Charts: https://skimhq.tech/charts
5Gensim: Python framework for vector space modelling, https://radimrehurek.com/gensim/index.html

Fig. 3. Machine Learning Workflow
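Concretely, for two TF-IDF vectors $a$ and $b$, the metric is the cosine of the angle $\theta$ between them:

\[ \mathrm{CosineSimilarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \]

Since TF-IDF weights are non-negative, the value lies in $[0, 1]$: 0 for papers sharing no weighted terms, and values near 1.0 for near-identical titles.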
Dissimilar Papers: Consider the titles of two papers. Say Paper 1 is Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing, a paper from the NLP domain, and Paper 2 is YOLOv4: Optimal Speed and Accuracy of Object Detection, a paper from the Computer Vision domain. The two papers are from distinct domains and have obviously dissimilar titles. When we compute their Cosine Similarity using their TF-IDF vectors and plot it on a Venn diagram, we see two non-intersecting sets.
Fig. 4. Dissimilar Papers
You can think of each set as a vector representing the title of the paper in a multi-dimensional space. This differs in the case of similar papers:
Similar Papers: Consider the titles of two papers. Say Paper 1 is YOLOv4: Optimal Speed and Accuracy of Object Detection, and Paper 2 is YOLO and K-Means Based 3D Object Detection Method on Image and Point Cloud. Both papers are from the same domain of Computer Vision and use models from the same family (YOLO). In this case, when we compute their Cosine Similarity using TF-IDF vectors and plot it on a Venn diagram, we see two intersecting sets.
Fig. 5. Similar Papers

The area of intersection is directly proportional to the Cosine Similarity: the higher the Cosine Similarity, the greater the intersecting region. You can think of these intersecting sets as slightly overlapping vectors forming a very small angle between them, the cosine of which is a value close to 1.0, thus indicating that the two papers may be similar.
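Under the same illustrative assumptions as the workflow sketch in Section V, a similarity query for Paper 2's title against the prebuilt shards might look as follows; the paths and the top-10 cutoff are assumed for illustration, not Skim's actual parameters.

# A hedged sketch of a similarity lookup against the prebuilt shards.
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import Similarity
from gensim.utils import simple_preprocess

dictionary = Dictionary.load("shards/skim.dict")
tfidf = TfidfModel.load("shards/skim.tfidf")
index = Similarity.load("shards/skim.index")

title = "YOLO and K-Means Based 3D Object Detection Method on Image and Point Cloud"
query = tfidf[dictionary.doc2bow(simple_preprocess(title))]

sims = index[query]                # cosine similarity against every indexed paper
top10 = sims.argsort()[::-1][:10]  # indices of the ten most similar papers
for i in top10:
    print(i, sims[i])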
VI. CONCLUSION AND FUTURE WORK
This paper proposes a scalable architecture for Skim, a platform built for reading research papers quickly with the assistance of ML-powered systems. In the near future, we plan to extend support to more arXiv categories. Technically, we plan to optimize the model for finding similar papers to overcome the drawbacks of TF-IDF discussed in Shahmirzadi et al. (2018) [8], alongside several new platform features and improvements.
ACKNOWLEDGMENT
We would like to thank our peers from the university and
colleagues from the industry for reviewing our work.
REFERENCES
[1] A. Karpathy, "Arxiv Sanity Preserver," https://github.com/karpathy/arxiv-sanity-preserver, 2015.
[2] "AI Research Navigator," https://www.zeta-alpha.com.
[3] "The Preprint Search Platform," https://www.onikle.io.
[4] M. Springstein, H. H. Nguyen, A. Hoppe, and R. Ewerth, "Tib-arxiv: An alternative search portal for the arxiv pre-print server," 2018.
[5] P. Scharpf, M. Schubotz, A. Youssef, F. Hamborg, N. Meuschke, and B. Gipp, "Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language," 2020.
[6] E. Alvarez, F. Lamagna, C. Miquel, and M. Szewc, "Intelligent arxiv: Sort daily papers by learning users topics preference," 2020.
[7] P. Sitikhu, K. Pahi, P. Thapa, and S. Shakya, "A comparison of semantic similarity methods for maximum human interpretability," 2019.
[8] O. Shahmirzadi, A. Lugowski, and K. Younge, "Text similarity in vector space models: A comparative study," 2018.