Conference PaperPDF Available

Correlation Analysis of Popularity and Interoperability in Open Source Projects

Authors:

Figures

Content may be subject to copyright.
Correlation Analysis of Popularity and
Interoperability in Open Source Projects
Fatima Al Shamsi
Abu Dhabi Systems and Information Centre
Abu Dhabi, UAE
Fatima.AlShamsi@adsic.abudhabi.ae
Sarah Bamatraf, Talal Rahwan, Zeyar Aung, and Davor Svetinovic
Khalifa University of Science and Technology
Abu Dhabi, UAE
{sarah.bamatraf, talal.rahwan, zeyar.aung, davor.svetinovic}@ku.ac.ae
Abstract—In order to examine the success of open source
projects, we analyzed 252,008 projects in the SourceForge
database. We restricted our study to projects that are written
in the top ranked programming languages. We measured the
correlation between popularity, interoperability, productivity and
the success rate of each project. The developers’ contribution to
the project was also examined in terms of the developer’s team
size, and the success in terms of the total number of committers
(contributors). Our results indicate that using multiple program-
ming languages requires additional team members. Also, there is
a significant increase in the functionality of the developed project.
The results also demonstrated that there is a positive correlation
between the development team size, and the total number of
committers.
Index Terms—open-source software, programming languages,
data mining
I. INTRODUCTION
In spite of decades of studies, what makes a program-
ming language used by people of all skill levels remains
elusive. Although a considerable number of programming
languages is still used by the community, the reasons behind
preferring specific languages is a subject of debate [1]. This
paper examines the success of different open-source projects,
especifically those written in the top-ranked programming
languages, to measure their success rate. The concept of open
source software is increasingly gaining attention worldwide.
Open source software is usually developed by programmers
spread across the whole world. They take pride in contributing
towards the development of such software projects [2]. Open
source projects have a significant impact on the software
business globally. In a 2009 survey, it was found that 24% of
all responding enterprises were using or implementing open
source software [3].
In our study, different correlations are observed that asso-
ciate software development and common programming lan-
guages with the success of open source projects in terms of
the developers team size, and the total number of committers.
Software development is a closely-coupled activity, which of-
ten involves tight integration and inter-dependencies between
modules. Thus, it normally requires a substantial amount of
coordination and communication between its developers [4].
The term interoperability will be used to refer to the process
of developing software projects using multiple programming
languages.
To date, the number of papers that study the popularity,
interoperability, and impact of diverse programming languages
is limited. Hence, there is an urgent need for a comprehensive
analysis of the mentioned factors, by analyzing the projects
that are written in several programming languages. Open
source practices and tools have been remarkably success-
ful. But, in several areas there are still opportunities for
improvement. These areas include supporting collaboration
among developers, supporting new developers, supporting the
broader end-user community including non-developers, and
understanding end-users’ requirements [5].
For the purpose of our study, 494,158 SourceForge code
repositories were analyzed. The number of committers, as
well as the total number of developers in each given project
were investigated. Upon these information, our hypotheses was
developed in a structured manners, based on the popularity and
interoperability of programming languages.
Our research is centered around two research hypotheses
that are designed to find correlation between popularity, inter-
operability, and the impact of programming languages in open
source projects.
H1:Being written in more than five programming language
increases the productivity and the success rate of projects.
We verify whether increasing the level of interoperability
contributes to the success of the projects, or the contrary
as writing in multiple programming languages requires larger
team sizes of skillful developers.
H2:Being written in top-ranked programming languages
increases the potential of project success. We investigate the
correlation between the top 10 programming languages in
SourceForge and the success of the projects, while considering
metrics such as committers, and the size of the development
team.
II. RELATED WORK
The history, the fundamentals, the trends, and the advance-
ment of programming languages have been studied by several
research works. Open source users have conventionally been
thought of as similar to developers [6]. The findings further
suggest that a reflexive project tends to be more popular among
users than those intended for a more general audience.
It was found that files modified by a developer can be used
to give an approximation of developer expertise [7]. The new

 !!!""#$%&&& 181
users can then find out who asks for advice while modifying
certain codes. Code contributors have the ability to investigate
who to suggest as reviewers for the code. This process can
facilitate the identification of the best candidates to fix any
given problem. Other research has focused on the different
versions of a software system, in order to analyze the source
codes of these versions. Also, documentation and other meta-
data can reveal regulations and anomalies in the development
process of the system [8].
The Concurrent Versions System (CVS) repositories of
9,999 open source projects that are hosted on the SourceForge
website were examined. The 10 most popular programming
languages have been tested, and it was found that a program-
ming languages is an essential factor when determining the
rate at which source code is written, even after accounting for
the variations between programmers and projects [9].
Feitelson and Heller [10] wanted to determine a success
criterion for open source software projects. To this end, they
analyzed 122,205 projects in the SourceForge database, and
found that there were 80,597 projects with no downloads
at all. The remaining 41,608 projects were categorized into
three categories: super projects (those with over 1.1 million
downloads; this category contained 85 projects), successful
project (those with more than 1,680 downloads each; this
category contained 10,000 projects), and struggling projects
(those with less than 1,680 downloads).
As for the team size, Grottke et al. [11] noticed that a larger
team might increase the average pending time. Higher team
dispersion also affects the pending time, since identifying the
developer that is responsible for a certain failure become more
complicated. As for the team experience, more experience
increases the efficiency of handling failures, since members get
more accustomed when dealing with failures. After processing
numerous failure reports, the authors concluded that team
members typically know who is responsible for each given task
of the coding part, and they are also able to solve problems
quickly.
The success factors of 134,549 open source software
projects that are available at SourceForge.net were identified
by using the data mining 2-itemset association rule [12].
Data from each projects such as potential audience, database
environment, description, development, file name, file size,
downloads, topic and programming language were all col-
lected and stored into a database to be analyzed. The result
of this study demonstrated six success factors that may be
applied by project initiators and developers for the purpose of
increasing the probability of success of each project.
In [13], 100,000 open source software projects from GitHub
were examined to study the popularity, interoperability and
impact of various languages by measuring different metrics
such as the number of lines of code, the size of development
teams, the number of issues, etc.
In distributed and dynamic software project environments,
in addition to source code, traces of the developers’ activities
(a.k.a. contribution logs) are important for the development
process even though they have no direct effects on the software
product itself. By combining the traditional contribution logs
and the data mined from software repositories, an accurate
model for measuring the developers’ contributions can be built
[14]. The model creates clusters of similar projects to extract
weights that are then assigned to the actions that a developer
performed in order to extract a combined measurement of
the developer’s contribution. This is particular relevant to our
study, since we are considering the contributors for a given
project inferred from the commits, and the total number of
developers in each project.
III. RESEARCH METHOD
The research methodology used in analyzing SourceForge
Projects and their communities is divided into two phases:
1) Data collection: this phase involves collecting informa-
tion of 252,008 projects.
2) Statistical analysis: this phase involves analyzing the
data in order to measure the effect of committers, de-
velopers, and top programming languages on the success
of the project..
To perform our empirical study, we collected data on
252,008 open source projects, written for numerous purposes
by collaborative teams using a wide range of languages. A
manual analysis of the projects on SourceForge was per-
formed, exploring the developers, audiences, and the different
programming languages used in each project. It has been found
that there is a clear diversity in the developed projects that
are hosted by SourceForge, and that it provides the needed
requirements of this particular study.
The data that was used in the analysis was collected from the
repositories of open source projects hosted on SourceForge.
The tool that was used to extract the datasets is Boa. The
first step was to collect datasets about the most popular
programming languages, and about the number of commits and
committers in each project. Data related to file changes, users,
project audiences, topics, and repositories per category of use
was collected. The resulting data contains records for 252,008
projects, 494,158 code repositories, 15,063073 revisions, and
147,074,540 files.
All the quantitative analyses were performed using the
statistical software package RStudio [15]. In addition, we
used Gephi [16], which is one of the most popular tools for
graph visualization, in order to better understand data and to
quickly test alternative visual approaches that best display the
programming languages popularity.
We focused on the correlations in order to determine
whether and how strongly pairs of each of the above variables
are related to each other. We suspected that if a project
is written by five or more programming languages or has
many team members then the success rate should be relatively
high. Hence, in order to verify our hypotheses, the Pearson’s
coefficient of correlation was computed to identify any strong
positive correlation(s). More formally, the formula is defined
in Equation 1:

182
r=n
i=1((xi¯x)(yi¯y))
n
i=1((xi¯x)2(yi¯y)2)(1)
The equation demonstrates the ratio of the covariation between
xand yto the total variation in both xand y, where ¯xand ¯y
are their respective mean values.
IV. RESULTS
Due to the space limitations, in this section we present the
partial results of our study. The popularity was measured by
considering the total number of committers and developers.
It is evident that the top ten programming languages are
more popular in terms of the number of committers and
developers. Moreover, less people tend to view projects written
in the lower-ranked programming languages. Higher number
of developers and committers was an indicator of the project’s
success. Furthermore, the higher number of developers and
committers in projects indicated greater popularity and suc-
cess. Finally, the projects that are written in the top ten
programming languages have a greater potential for success.
A. H1: On Multi-Language Projects and Success
To test our first hypothesis, we examined the percentage of
language interoperability by calculating the total number of
projects that are written in 5 or more top ranked programming
languages. The relationship between project interoperability
and the size of the development team is measured based on
the Pearson’s correlation coefficient.
Developers implement various functionalities of a single
project while considering the different programming lan-
guages. Let us first investigate the projects written in more
than one programing language. It was assumed that if a project
is written in more than one programming language, then the
programming languages used in a single project are classified
as interoperable.
Developers that use JavaScript, Perl, and Python are more
willing to adopt different programming languages in their
projects, with the percentage of such developers reaching
32.38%, 30.12%, and 28.26%, respectively. In contrast, the
programming language with the least percentage of developers
using multiple programming languages is PHP, with a per-
centage of just 15.84%. It has been noticed throughout the
years that some programming languages persist, while others
fail. By identifying the factors that influence this phenomenon,
developers can make informed decisions when determining
whether and when to invest the time and effort necessary to
learn a new programming language. To date, the language
adoption process has not been quantitatively studied at a large
scale [17].
Table I shows programming language interoperability of
using five or more programming language in a single project.
The programming language that has the highest percentage in
terms of interoperability is Unix Shell with a percentage of
18.66%, followed by Perl (18.18%), Delphi/Kylix (15.90%),
and python (10.45%). PHP, JavaScript, C, and C# also appear
in respectively 10.15%, 8.11%, 4.91%, and 4.85% of the
multi language projects. C++ and java appear to be the least
programming languages used in projects written in five or
more programming languages.
TABLE I
LANGUAGE INTEROPERABILITY.NUMBERS AND PERCENTAGES OF
PROJECTS WRITTEN IN MULTIPLE PROGRAMMING LANGUAGES,THAT
INCLUDE DIFFERENT PROGRAMMING LANGUAGES.
Programming Multi-Language Percentage of Multi-
Language Projects Language Projects
Java 65 1.74%
C++ 99 3.30%
PHP 59 10.15%
C 88 4.91%
Python 65 10.45%
C# 48 4.85%
JavaScript 47 8.11%
Perl 58 18.18%
Unix Shell 28 18.66%
Delphi/Kylix 14 15.90%
The heat-map correlation matrix in Figure 1 illustrates
the language interoperability, taking into consideration the
projects that are written in 5 or more top ranked programming
languages. The highest relationship is found between C++ and
Java, with a total of 70 projects in common. C++ also seems
to work well with C, as there are 68 projects written in both
languages along with other programming languages. C also
work well with Java with a total number of 64 projects in
common along with other programming languages. The lowest
relationship is found between Perl and Delphi/Kylix with only
3 projects in common.
TABLE II
CORRELATION BETWEEN PROGRAMMING LANGUAGES AND DEVELOPERS
IN OPEN SOURCE PROJECTS.
Programming r t t statistics
Language
Java 0.16 29.488 -52.443
C++ 0.18 13.326 -21.605
PHP 0.22 8.725 -12.147
C 0.15 9.3345 -14.505
Python 0.14 5.4098 -11.873
C# 0.14 6.4307 -10.897
JavaScript 0.19 7.6198 -11.513
Perl 0.11 3.1919 -6.9357
Unix Shell 0.36 7.3377 -7.2427
Delphi/Kylix 0.26 3.5508 -4.7649
Overall Correlation 0.17 30.228 -50.301
The relationship between the total number of programming
languages and the total number of developers in each project
is investigated. A positive correlation would indicate that
increasing the number of programming languages in a project
implies the need for a larger team sizes of developers.

183
Fig. 1. Heat-map of language interoperability.
Table II demonstrates that there is a positive correlation
which implies that increasing the number of programming
languages requires a larger team sizes of developers. A mod-
erate correlation is found with projects written in Unix Shell,
Delphi/Kylix, and PHP.
For the data in Table II, Pearson’s correlation coefficient be-
tween top programming languages and developers per projects
is demonstrated. Let ρbe the true correlation between pro-
gramming language interoperability and developers. The null
hypothesis is that there is no relationship between these two
variables and the research hypothesis is that the two variables
are positively related. These hypotheses are:
H0 : ρ=0
H1 : ρ=0
The test statistics is rand the standardized t statistic is:
t=rn2
1r2(2)
There are n2degrees of freedom. Choosing the 0.95 con-
fidence level for a two tailed test, the region of the correlation
coefficient and the corresponding tvalue is specified for each
programming language. RStudio was used to conduct the t-
test. While the correlation coefficient is not close to 1,itis
evident that it is not equal to 0which is enough to reject the
hypothesis of no relationship.
B. H2: On Top Programming Languages and Success
To test the second hypothesis, we measured the percentage
of committers in projects that are written in top ranked
programming languages. The relationship between projects
that are written in top ranked programming languages and the
total number of committers is measured using the Pearson’s
correlation coefficient.
Table III shows the correlation between the programming
languages and committers as well as the t, p-value and t-
test result which implies that increasing the number of pro-
gramming languages requires a larger team of developers. A
moderate correlation is found with projects written in Unix
Shell, Delphi/Kylix, and C++. A low correlation is found with
projects written in Perl, JavaScript, Python, and PHP.
For the data in Table III, Pearson’s correlation coefficient
between top ranked programming languages and committers
per projects are computed in a similar way as in the above
Section IV-B. Again, while the correlation coefficient is not
close to 1, it is evident that it is not equal to 0which is enough
to reject the hypothesis of no relationship.
Figure 2 shows a scatter plot matrix that show how pro-
gramming language interoperability is correlated with the total
number of committers per projects. The green line interpret
the relation, and it is evident that projects written in top
ranked programming languages result in a higher number of
committers. This is considered as an evidence that if a project
is written using top ranked programming languages, then the
probability of project success is higher, and the evaluated
metric which is the total number of committers does have a
positive affect on the overall project success rate.

184
Fig. 2. Correlation between programming language interoperability and committers.
V. E VALUATION
The empirical findings are based on open source projects
found on SourceForge which lead to results that may not
generalize to the universe of open source projects produced
by developers. The study focused on open source projects
written in top ranked programming languages based on the
obtained data-sets from the year 2013. The findings of the
study remain relevant as Microsoft, and Apache developed
several successful projects on SourceForge since then.
This study is based on 252,008 open source projects, which
is a sizable sample, yet it does not equate the millions of
software programs whose code can be retrieved from the
World Wide Web. SourceForge alone contains over 430,000
open source projects. It is to be noted however that the
sample data-set is sufficiently representative of the universe
of SourceForge projects. Therefore, we believe our results
generalize to other open source projects.
To ensure content validity, projects that are written in top
programming languages and are written in 5 or more program-
ming languages were selected. The percentage of developers
that are in multi-language projects were analyzed. Projects
that are written using a single programming language and had
one developer were discarded from this study. It is difficult to
conclude that the proposed hypotheses hold for projects that
are written in other combinations of lower- and higher- level
programming languages.
Different visualization techniques were interpreted to
demonstrate the information using a variety of methods. The
external validity was ensured by considering a large sample
of data in order to generalize the study result to other studies
of open source projects, which measure the success rate using
metrics such the development team, and the total number of
committers. The experiment of t-test was conducted to ensure
correlation significance level.

185
TABLE III
CORRELATION BETWEEN PROGRAMMING LANGUAGES AND COMMITTERS.
Programming r t t statistics
Language
Java 0.12 22.646 -72.199
C++ 0.17 12.534 -26.324
PHP 0.08 3.1538 -16.579
C 0.12 7.198 -19.252
Python 0.08 3.1102 -15.631
C# 0.13 5.9351 -13.722
JavaScript 0.04 1.8102 -16.054
Perl 0.03 0.92751 -10.369
Unix Shell 0.28 5.7351 -6.3342
Delphi/Kylix 0.19 2.5633 -5.9096
Overall Correlation 0.13 23.143 -69.678
VI. CONCLUSIONS
In this paper, we studied a dataset of 252,008 open-
source projects. Our study endorses different assumptions on
literature regarding programming language popularity. The
findings demonstrated that some programming languages are
more popular than others. In particular, earlier programming
languages such as C are common with a large code base.
Moreover, programming languages such as JavaScript and
Ruby are pervasive in the area of web development. The
success of some programming languages is linked to the
success of certain products. For example, Apple uses the
Objective C programming language, which is gaining much
attention nowadays. Several programing languages are created
but after a while they are no longer used. However, there exists
some programming languages that are preferable by a huge
number of users, and these types of languages are focused
on and improved upon to avoid fading away as is the case
with many programming languages. Many studies focus on the
methods in which programming languages should be created.
Nonetheless, less attention has been given to quantifying the
popularity of these programming languages, and understanding
why these languages happen to be the most prevalent.
To evaluate our first hypothesis, an approach was presented
based on the percentage of developers and the percentage of
committers. In terms of developers, what was focused on was
the developers that adopt 5 or more top ranked programming
languages in the projects that were studied. The percentage of
language interoperability is measured taking into consideration
the number of projects that are written in 5 or more top-ranked
programming languages. The relationship between project
interoperability and the number of the development team is
measured by applying Pearson’s correlation coefficient. As
for our second hypothesis, the percentage of committers in
projects that are written in top ranked programming languages
was measured. The relationship between project that are
written in top ranked programming languages and the total
number of committers was measured by applying Pearson’s
correlation coefficient.
Our study has implications for future research on pro-
gramming languages. We demonstrated that using various
programming languages requires additional team members.
However, there is a notable increase in the functionality of the
developed project. Also, there is a positive correlation between
the developer team size, and the total number of committers.
REFERENCES
[1] A. Stefik and S. Siebert, “An empirical investigation into programming
language syntax,” ACM Transactions on Computing Education, vol. 13,
no. 4, 2013, article no. 19.
[2] G. Hertel, S. Niedner, and S. Herrmann, “Motivation of software devel-
opers in open source projects: An Internet-based survey of contributors
to the Linux kernel,” Research Policy, vol. 32, no. 7, pp. 1159–1177,
2003.
[3] D. Riehle, “Controlling and steering open source projects,” Computer,
vol. 7, no. 44, pp. 93–96, 2011.
[4] D. German and A. Mockus, “Automating the measurement of open
source projects,” in Proceedings of the 3rd Workshop on Open Source
Software Engineering, 2003, pp. 63–67.
[5] A. Ankolekar, J. D. Herbsleb, and K. Sycara, “Addressing challenges to
open source collaboration with the semantic web,” in Proceedings of 3rd
Workshop on Open Source Software Engineering, the 25th International
Conference on Software Engineering (ICSE). IEEE, 2003, pp. 9–14.
[6] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language
and infrastructure for analyzing ultra-large-scale software repositories,
in Proceedings of the 2013 International Conference on Software
Engineering (ICSE). IEEE, 2013, pp. 422–431.
[7] H. Siy, P. Chundi, and M. Subramaniam, “Summarizing developer
work history using time series segmentation: Challenge report,” in
Proceedings of the 2008 International Working Conference on Mining
Software Repositories (MSR). ACM, 2008, pp. 137–140.
[8] M. Burch, S. Diehl, and P. Weibgerber, “Visual data mining in software
archives,” in Proceedings of the 2005 ACM Symposium on Software
Visualization (SOFTVIS). ACM, 2005, pp. 37–46.
[9] D. P. Delorey, C. D. Knutson, and S. Chun, “Do programming languages
affect productivity? a case study using data from open source projects,
in Proceedings of the 2007 First International Workshop on Emerging
Trends in FLOSS Research and Development (FLOSS). IEEE, 2007,
pp. 1–5.
[10] D. G. Feitelson, G. Z. Heller, and S. R. Schach, “An empirically-based
criterion for determining the success of an open-source project,” in
Proceedings of the 2006 Australian Software Engineering Conference
(ASWEC). IEEE, 2006, pp. 1–6.
[11] M. Grottke, L. M. Karg, and A. Beckhaus, “Team factors and failure
processing efficiency: An exploratory study of closed and open source
software development,” in Proceedings of the 2010 IEEE 34th Annual
Computer Software and Applications Conference (COMPSAC). IEEE,
2010, pp. 188–197.
[12] A. W. R. Emanuel, R. Wardoyo, J. E. Istiyanto, and K. Mustofa,
“Success factors of OSS projects from SourceForge using datamining
association rule,” in Proceedings of the 2010 International Conference
on Distributed Framework and Applications (DFmA). IEEE, 2010, pp.
1–8.
[13] T. F. Bissyande, F. Thung, D. Lo, L. Jiang, and L. Reveillere, “Popularity,
interoperability, and impact of programming languages in 100,000
open source projects,” in Proceedings of the 2013 IEEE 37th Annual
Computer Software and Applications Conference (COMPSAC). IEEE,
2013, pp. 303–312.
[14] G. Gousios, E. Kalliamvakou, and D. Spinellis, “Measuring developer
contribution from software repository data,” in Proceedings of the 2008
International Working Conference on Mining Software Repositories
(MSR). ACM, 2008, pp. 129–132.
[15] M. P. J. Van der Loo, Learning RStudio for R Statistical Computing.
Packt Publishing Ltd, 2012.
[16] K. Cherven, Network Graph Analysis and Visualization with Gephi.
Packt Publishing Ltd, 2013.
[17] L. A. Meyerovich and A. S. Rabkin, “Empirical analysis of programming
language adoption,” ACM SIGPLAN Notices, vol. 48, no. 10, pp. 1–18,
2013.

186
Chapter
Extracting value from open (government) data has been in the limelight for quite some time now, with the Open Government Data (OGD)-also referred to as the Public Sector Information (PSI)- Directive, emphasizing the need for discovery of potential related to PSI and the re-use of data. This study provides insight on high-value datasets (HVD), i.e., the OGD matching the rigorous quality standards in terms of particularly the completeness, timeliness, and machine-processability, in the European Data Portal (EDP) in order to measure various aspects of the datasets, which will help obtain further insights on the current availability status, potential interoperability possibilities, drivers and barriers. The rationale behind this approach is to identify HVD in line with the thematic categories of the PSI Directive and to put a focus on the EDP HVD across three dimensions, viz., data publishing, availability, and technical requirements which need to be met. Findings attest the need for provisioning qualitatively robust HVDs with requisite technical specifications for facilitating value derivation by the range of stakeholders. Apart from contributing towards the HVD and OGD literature, the study lends insights across the availability-value generation focus via the EDP, thereby providing policy implications and further research pointers along these lines.KeywordsOpen government dataEuropean Data PortalHigh value datasetsTechnical specificationsInteroperabilityData publishersData availability
Article
Full-text available
This paper describes the software for graph storage, analysis and visualization. The article presents a comparative analysis of existing software for analysis and visualization of graphs, describes the overall architecture of application and basic principles of construction and operation of the main modules. Furthermore, a description of the developed graph storage oriented to storage and processing of large-scale graphs is presented. The developed algorithm for finding communities and implemented algorithms of autolayouts of graphs are the main functionality of the product. The main advantage of the developed software is high speed processing of large size networks (up to millions of nodes and links). Moreover, the proposed graph storage architecture is unique and has no analogues. The developed approaches and algorithms are optimized for operating with big graphs and have high productivity.
Conference Paper
Full-text available
Programming languages have been proposed even before the era of the modern computer. As years have gone, computer resources have increased and application domains have expanded, leading to the proliferation of hundreds of programming languages, each attempting to improve over others or to address new programming paradigms. These languages range from procedural languages like C, object-oriented languages like Java, and functional languages such as ML and Haskell. Unfortunately, there is a lack of large scale and comprehensive studies that examine the "popularity", "interoperability", and "impact" of various programming languages. To fill this gap, this study investigates a hundred thousands of open source software projects from GitHub to answer various research questions on the "popularity", "interoperability" and "impact" of various languages measured in different ways (e.g., in terms of lines of code, development teams, issues, etc.).
Conference Paper
Full-text available
In today's software-centric world, ultra-large-scale software repositories, e.g. SourceForge (350,000+ projects), GitHub (250,000+ projects), and Google Code (250,000+ projects) are the new library of Alexandria. They contain an enormous corpus of software and information about software. Scientists and engineers alike are interested in analyzing this wealth of information both for curiosity as well as for testing important hypotheses. However, systematic extraction of relevant data from these repositories and analysis of such data for testing hypotheses is hard, and best left for mining software repository (MSR) experts! The goal of Boa, a domain-specific language and infrastructure described here, is to ease testing MSR-related hypotheses. We have implemented Boa and provide a web-based interface to Boa's infrastructure. Our evaluation demonstrates that Boa substantially reduces programming efforts, thus lowering the barrier to entry. We also see drastic improvements in scalability. Last but not least, reproducing an experiment conducted using Boa is just a matter of re-running small Boa programs provided by previous researchers.
Conference Paper
Full-text available
We present a research to find the success factors of 134,549 Open Source Software (OSS) Projects at sourceforge.net portal using Datamining 2-Itemset Association Rule. Data from each OSS Project i.e. project's audience, database environment, description, developer, development status, download, filename, file size, operating system, programming language, review, thumb (up and down), topic, translation, and user interface are collected and then stored into database to be analyzed using Weka Datamining Tool. The Datamining Association Rule is used to find the success factors of these projects by assuming that the success is reflected by the number of downloads. The success factors of these OSS Projects are found by performing 2-Itemset Association Rules. The result is then formulated into six success factors that should be utilized by future initiators and other developers of OSS Project to increase the possibilty of success of their OSS Projects.
Article
Full-text available
Despite the remarkable success of open source software, there are a number of challenges to collaboration in open source software development, in particular, with respect to supporting collaboration among developers, supporting potential contributors, and in bringing users and developers together. In this paper, we examine some of the possible enhancements of open source development environments, and consider the application of Semantic Web technology to address these.
Article
Full-text available
The motives of 141 contributors to a large Open Source Software (OSS) project (the Linux kernel) was explored with an Internet-based questionnaire study. Measured factors were both derived from discussions within the Linux community as well as from models from social sciences. Participants’ engagement was particularly determined by their identification as a Linux developer, by pragmatic motives to improve own software, and by their tolerance of time investments. Moreover, some of the software development was accomplished by teams. Activities in these teams were particularly determined by participants’ evaluation of the team goals as well as by their perceived indispensability and self-efficacy.
Article
Some programming languages become widely popular while others fail to grow beyond their niche or disappear altogether. This paper uses survey methodology to identify the factors that lead to language adoption. We analyze large datasets, including over 200,000 SourceForge projects, 590,000 projects tracked by Ohloh, and multiple surveys of 1,000-13,000 programmers. We report several prominent findings. First, language adoption follows a power law; a small number of languages account for most language use, but the programming market supports many languages with niche user bases. Second, intrinsic features have only secondary importance in adoption. Open source libraries, existing code, and experience strongly influence developers when selecting a language for a project. Language features such as performance, reliability, and simple semantics do not. Third, developers will steadily learn and forget languages. The overall number of languages developers are familiar with is independent of age. Finally, when considering intrinsic aspects of languages, developers prioritize expressivity over correctness. They perceive static types as primarily helping with the latter, hence partly explaining the popularity of dynamic languages.
Article
Recent studies in the literature have shown that syntax remains a significant barrier to novice computer science students in the field. While this syntax barrier is known to exist, whether and how it varies across programming languages has not been carefully investigated. For this article, we conducted four empirical studies on programming language syntax as part of a larger analysis into the, so called, programming language wars. We first present two surveys conducted with students on the intuitiveness of syntax, which we used to garner formative clues on what words and symbols might be easy for novices to understand. We followed up with two studies on the accuracy rates of novices using a total of six programming languages: Ruby, Java, Perl, Python, Randomo, and Quorum. Randomo was designed by randomly choosing some keywords from the ASCII table (a metaphorical placebo). To our surprise, we found that languages using a more traditional C-style syntax (both Perl and Java) did not afford accuracy rates significantly higher than a language with randomly generated keywords, but that languages which deviate (Quorum, Python, and Ruby) did. These results, including the specifics of syntax that are particularly problematic for novices, may help teachers of introductory programming courses in choosing appropriate first languages and in helping students to overcome the challenges they face with syntax.
Article
The proliferation of open source projects raises a num- ber of vital economic, social, and software engineering questions that are subject of intense research. Based on experience analyzing numerous open source and commer- cial projects we propose a set of tools to support extrac- tion and validation of software project data. Such tools would streamline empirical investigation of open source projects and make it possible to test existing and new the- ories about the nature of open source projects. Our soft- ware includes tools to extract and summarize information from mailing lists, CVS logs, ChangeLog les, and defect tracing databases. More importantly, it cross-links records from various data sources and identies all contributors for a software change. We illustrate some of the capabilities by analyzing data from Ximian Evolution project.
Conference Paper
Researchers in the field of software engineering economics have associated team factors, such as team size and team experience, with productivity and quality. Since distributed and open source development have gained significance in the past few years, further empirical investigation is needed. Our study contributes to the empirical body of knowledge by addressing this development. In particular, we investigate the association between team factors and failure processing efficiency for closed source software releases of a large commercial software vendor and for open source software projects registered with SourceForge.net. We find significant links between team experience and the failure processing efficiency. However, our data does not show any evidence for adverse effects of distributed development. Our results further suggest that service level agreements and process governance are good tools to guarantee satisfactory processing times.