PY2SRC: Towards the Automatic (and Reliable) Identification of Sources for PyPI Packages
Duc-Ly Vu
University of Trento, Italy
ducly.vu@unitn.it
Abstract—Selecting which libraries (‘dependencies’ or ‘packages’ in the industry’s jargon) to adopt in a project is an essential
task in software development. The quality of the corresponding
source code is a key factor behind this selection (from security
to timeliness). Yet, how easy is it to find the ‘actual’ source?
How reliable is this information? To address this problem, we
developed an approach called PY2SRC to automatically identify
GitHub source code repositories corresponding to packages in
PyPI and automatically provide an indicator of the reliability
of such information. We also report a preliminary empirical
evaluation of the approach on the top PyPI packages.
Index Terms—Mining software repository, quantitative study,
Python packages, PyPI, Software factors, Software supply chain
I. MOTIVATION AND RESEARCH QUESTIONS
The selection of high-quality libraries can be supported by the possibility of inspecting the libraries' source code [2]. Several recent papers discussed qualitative reasons behind developers' choice of dependencies [1], [2]. The most frequently mentioned factors are connected with the libraries' source code repositories (e.g., on GitHub), such as the number of stars, forks, etc.
Furthermore, links to source code repositories are required to investigate the reproducibility of packages [7] and to identify discrepancies between sources and packages [3]; such discrepancies pose both operational risks (e.g., making dependent projects unable to compile) and security risks (e.g., deploying malicious code during installation [3]–[6]) in the software supply chain, as was recently demonstrated in npm and RubyGems [3], [4], [6].
Large-scale analysis, software integration, verification of source code, and software bills of materials all require access to source code URLs, but these can be difficult to find. The scientific challenge is whether one can find a reliable URL automatically.
OSSGADGET FIND SOURCE [8] (OFS for short), the only existing tool that automatically extracts the GitHub URL of a package, does not provide any reliability information about its findings (e.g., information for deciding whether a returned URL can be trusted). Table I shows that such results are often unreliable.

TABLE I
STATISTICS OF URLS RETURNED BY DIFFERENT INFORMATION SOURCES
Among the top 4000 packages [9] in PyPI, very few ‘obvious’ fields of a package (e.g., Readthedocs) actually contain a URL to the source code. Instead, one needs to dig into other, less visible areas (e.g., the metadata) to find the source, and it is unclear whether the URLs found there are correct, as they may differ from the URL reported by the majority of the other fields.

# Packages             Homepage  Metadata  Readthedocs  OFS    PY2SRC
Field with URL         2127      2896      822          3369   3493
  of which different   349       222       0            478    0
Without any URL        1873      1104      3178         631    507
This paper aims to develop a methodology to identify the
source URL of a package and its reliability information:
RQ1: How can PyPI package information be combined to identify the corresponding GitHub URL of the source?
RQ2: How reliable is this information in practice?
II. FINDING SOURCE CODE REPOS OF PYPI PACKAGES
To find the GitHub URL of a PyPI package, we can rely on the information about the package extracted from PyPI. For example, the PyPI page of a package might reference the corresponding source code repository in the package description or in the statistics area. Also, each PyPI package has metadata in JSON format that may contain a GitHub URL reference. OFS mainly uses the package metadata as its primary information source.
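To make this concrete, the following minimal sketch pulls candidate GitHub URLs out of the JSON metadata that PyPI exposes for every package. It is an illustration under stated assumptions, not the PY2SRC implementation; the function name and the URL pattern are our own choices.

import json
import re
import urllib.request

def metadata_github_urls(package: str) -> set:
    """Collect candidate GitHub URLs from the PyPI JSON metadata of a package."""
    with urllib.request.urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
        info = json.load(resp)["info"]
    # Fields that commonly carry a link to the source code repository.
    texts = [info.get("home_page") or "", info.get("description") or ""]
    texts += list((info.get("project_urls") or {}).values())
    pattern = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE)
    return {m.group(0).rstrip("/").removesuffix(".git")
            for text in texts if text for m in pattern.finditer(text)}

# For instance, metadata_github_urls("awscli") is expected to include
# https://github.com/aws/aws-cli (cf. Example 1 below).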
Intuitively, PY2SRC combines the URLs reported by the information sources as follows (see the sketch after this list):
• Extract all working URLs from the information sources (possibly resolving URL redirections).
• If all information sources (e.g., Badge, Homepage, Metadata, Readthedocs, etc.) point to the same URL, return it.
• If the URLs point to different GitHub repositories, return the URL supported by more information sources.
• If several URLs are equally supported, return the most reliable URL, according to the metrics in Table II.
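A minimal sketch of this combination logic is shown below. It is an illustration, not the actual PY2SRC code; in particular, the reliability_score callable is a hypothetical stand-in for the automatic scoring of Table II.

from collections import Counter

def combine_urls(urls_by_source: dict, reliability_score):
    """urls_by_source maps an information source (e.g., 'Badge', 'Homepage',
    'Metadata', 'Readthedocs') to the GitHub URL it reports, already resolved
    through redirections, or None if it reports nothing.
    reliability_score(url) returns the automatic score in [-4, 4] (Table II)."""
    candidates = [url for url in urls_by_source.values() if url]
    if not candidates:
        return None
    support = Counter(candidates)
    # All sources agree: return the common URL.
    if len(support) == 1:
        return candidates[0]
    # Otherwise prefer the URL backed by more information sources...
    best = max(support.values())
    most_supported = [url for url, count in support.items() if count == best]
    if len(most_supported) == 1:
        return most_supported[0]
    # ...and break ties with the automatically computed reliability score.
    return max(most_supported, key=reliability_score)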
For each identified URL, PY2SRC extracts reliability metrics (Table II) that support the correspondence of this URL to the PyPI package. A metric has a value of 1 if its condition is satisfied and -1 otherwise; if the corresponding information is not available (e.g., there is no PyPI badge), the metric is assigned the value zero. Hence, the automatic score for a URL ranges from -4 to 4. For manual validation, this paper uses two additional metrics that have not been automated yet (the last two rows in Table II).
TABLE II
RELIABLE ATTRIBUTION METRICS
A metric has value 1 if the condition is satisfied, -1 if the condition is not satisfied, and 0 if it cannot be computed (e.g., a badge is absent). All six metrics have been used to score URLs during the manual validation, while only the first four are computed automatically once the GitHub URL is identified.

Name similarity: We use the Levenshtein distance [10] to calculate the difference between the names of a package in PyPI and of its GitHub repository. The distance ranges from 0 (the strings are equal) to the maximum length of the compared strings. Sometimes the Levenshtein distance is non-zero because one name is a substring of the other (e.g., the source code repository of the future package is named python-future); hence, we also check whether the names are substrings of each other, so that the single metric matches the human intuition that the package names are similar. Condition for score: Distance < 2, or the names are substrings of each other.

Description similarity: The Levenshtein distance between the project descriptions (truncated at 1500 characters) on the PyPI and GitHub pages; the lower the distance, the more credible the URL. Condition for score: Distance / Length < 0.5.

Python language: Whether Python is included in the Languages section of the GitHub page. Condition for score: Python appears in the Languages section.

PyPI badge: Presence of a badge on the GitHub page that points to the PyPI page of the package under analysis. Condition for score: a badge is present and points to the analyzed package.

Authors similarity: PyPI package maintainers are present in the list of the GitHub repository contributors. Condition for score: N/A (manual so far).

Tags matching: Alignment of GitHub repository tags and PyPI package releases. Condition for score: N/A (manual so far).

EXAMPLE 1. The GitHub repository name (https://github.com/aws/aws-cli) of awscli (https://pypi.org/project/awscli) differs from the package name by one character (+1), and the GitHub and PyPI descriptions are essentially the same (+1). The repository has a Python percentage of 100% (+1). The list of GitHub tags matches the PyPI releases (+1). The PyPI maintainers are in the list of GitHub contributors (+1). There is no badge pointing to the PyPI page on the GitHub page (0). The total reliability score is 5.
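As an illustration of the first metric in Table II, the following sketch computes the name-similarity score with a plain Levenshtein distance [10] and the substring check. It is an illustrative implementation, not the PY2SRC code.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance [10]."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,             # insertion
                               previous[j] + 1,                # deletion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def name_similarity_score(pypi_name: str, repo_name: str) -> int:
    """+1 if the names are 'similar' in the sense of Table II, -1 otherwise."""
    a, b = pypi_name.lower(), repo_name.lower()
    similar = levenshtein(a, b) < 2 or a in b or b in a
    return 1 if similar else -1

# 'future' vs. 'python-future' are substrings of each other (+1);
# 'awscli' vs. 'aws-cli' differ by a single character (+1), as in Example 1.
assert name_similarity_score("future", "python-future") == 1
assert name_similarity_score("awscli", "aws-cli") == 1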
III. EMPIRICAL VALIDATION
For an initial manual validation, 45 random packages from the top 4000 packages [9] were selected to manually identify their URLs and reliable attribution metrics (Table II), and to compare the results with the automatically computed reliability metrics. The 45 packages were chosen as follows:
• 15 packages where OFS and PY2SRC report the same URL;
• 15 packages where the URL from PY2SRC differed from OFS and the manually identified URL coincided with the former (PY2SRC);
• 15 additional packages where PY2SRC differed from OFS and the manually identified URL coincided with the latter (OFS).
We observed a high Pearson correlation coefficient of 0.725 between the manually and the automatically computed reliability scores, which statistically confirms their relationship. Hence, automatically calculated URL reliability scores are likely to correspond to the scores assigned by a human.
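A minimal sketch of this correlation check, assuming two parallel lists of manual and automatic scores for the sampled packages (the values below are made-up placeholders, not the paper's data):

from statistics import correlation  # Pearson's r (Python >= 3.10)

manual_scores    = [5, 4, 3, -2, 1]  # placeholder values, one per package
automatic_scores = [4, 4, 2, -1, 1]  # placeholder values, one per package
r = correlation(manual_scores, automatic_scores)
print(f"Pearson r = {r:.3f}")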
TABLE III
PY2SRC SOURCES AND TOOLS COMPARISON IN TERMS OF PRECISION AND RECALL AMONG THE 325 PACKAGES

Information Source   Precision  Recall
Badge                93%        23%
Homepage             92%        32%
Metadata             97%        19%
Readthedocs          54%        11%
Statistics           95%        27%
OFS                  80%        48%
PY2SRC               86%        78%
PY2SRC + OFS         83%        90%
The experiment was then scaled to 325 manually verified URLs of additional packages among the top 4000 most popular packages [9] for which OFS and PY2SRC produce different results. As shown in Table III, we found that OFS has lower precision but higher recall than the individual information sources: it reports GitHub URLs for more packages but also returns more incorrect URLs. On the other hand, PY2SRC demonstrates better precision (+6%) and a significant improvement in recall (+30%) compared to OFS. In addition, combining PY2SRC and OFS (falling back to OFS when PY2SRC could not identify a URL) improves the recall to 90% (+12%) at the cost of slightly lower precision compared to PY2SRC alone (-3%).
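The following sketch summarizes how such precision and recall figures can be computed and how the "PY2SRC + OFS" combination works. It is an illustration with hypothetical tool callables, not the evaluation scripts used for the paper.

def evaluate(packages, tool_url, ground_truth):
    """precision = correct URLs / URLs returned by the tool;
    recall = correct URLs / packages with a manually verified URL."""
    returned = {p: tool_url(p) for p in packages}
    returned = {p: url for p, url in returned.items() if url}
    correct = sum(1 for p, url in returned.items() if url == ground_truth[p])
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(packages) if packages else 0.0
    return precision, recall

def combined_url(package, py2src_url, ofs_url):
    """'PY2SRC + OFS' row of Table III: fall back to OFS only when
    PY2SRC does not identify any URL for the package."""
    return py2src_url(package) or ofs_url(package)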
IV. CONCLUSION AND FUTURE WORK
The proposed approach for finding the GitHub URL of a package by extracting and combining information sources outperforms the state-of-the-art tool OFS. It is also the first to provide a set of reliable attribution metrics that allow developers to validate whether the automatically reported GitHub URLs (by PY2SRC, OFS, or similar tools) correspond to their expectations. The comparison with an existing approach shows the improvement in precision and recall.
Future work is directed towards increasing the number of automatically computed reliability indicators (author similarity and release similarity) and using the found URLs for large-scale analysis. A major challenge arises when there is no URL in the package, so one must reverse-engineer it from the internet by using, e.g., i) the authors' "presence" (e.g., membership in multiple repositories related to PyPI packages), ii) the presence of scientific work supporting the code (e.g., a published paper corresponding to a repository), or iii) the dependencies.
Another interesting direction is to investigate ‘badness metrics’ to capture deliberately misleading URLs, for reliability or security purposes (e.g., for finding the difference between sources and packages [3]), such as the Damerau-Levenshtein distance for transposed strings in typosquatting [6].
ACKNOWLEDGMENTS
This research has been partly funded by the EU H2020 projects CyberSec4Europe (Grant No. 830929) and AssureMoss (Grant No. 952647). We thank Simone Pirocca for his support in the implementation, Fabio Massacci and Ivan Pashchenko for their help and advice, and the anonymous reviewers for their insightful reviews.
REFERENCES
[1] R. G. Kula et al. “Do developers update their library dependencies?”
ESEJ 23(1): 384–417, 2018.
[2] I. Pashchenko et al. “A qualitative study of dependency management
and its security implications,” in Proc. of CCS’20, 2020.
[3] D. L. Vu et al. “Lastpymile: identifying the discrepancy between
sources and packages,” in Proc. of ESEC/FSE’21, 2021.
[4] M. Ohm et al. “Backstabber’s knife collection: A review of open source
software supply chain attacks,” Proc. of DIMVA’20, 2020.
[5] D. L. Vu et al. “Towards using source code repositories to identify software supply chain attacks,” in Proc. of CCS’20, 2020.
[6] D.-L. Vu et al. “Typosquatting and combosquatting attacks on the python
ecosystem,” in Proc. (EuroS&PW’2020), 2020.
[7] P. Goswami et al. “Investigating the reproducibility of npm packages,”
in Proc. ICSME’20, 2020.
[8] Microsoft, “OSS Find Source: Attempts to locate the source code (on GitHub, currently) of a given package,” https://github.com/microsoft/OSSGadget/wiki/OSS-Find-Source, 2020.
[9] Hugovk, “Top PyPI packages,” https://hugovk.github.io/top-pypi-packages/, 2020.
[10] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, 10(8):707–710, 1966.
[11] E. Larios Vargas et al. “Selecting third-party libraries: The practitioners’
perspective,” in Proc. ESEC/FSE’20, 2020.
[12] A. Agresti and C. Franklin, “Statistics: The Art and Science of Learning from Data,” Pearson Education Limited, 2018.