GitterCom - A Dataset of Open Source Developer
Communications in Gitter
Esteban Parra
parrarod@cs.fsu.edu
Florida State University
Tallahassee, Florida
Ashley Ellis
ake17@my.fsu.edu
Florida State University
Tallahassee, Florida
Sonia Haiduc
shaiduc@cs.fsu.edu
Florida State University
Tallahassee, Florida
Abstract
Team communication is essential for the development of modern
software systems. For distributed software development teams, such
as those found in many open source projects, this communication
usually takes place using electronic tools. Among these, modern
chat platforms such as Gitter are becoming the de facto choice
for many software projects due to their advanced features geared
towards software development and effective team communication.
Gitter channels contain numerous messages exchanged by developers
regarding the state of the project, issues and features of the
system, team logistics, etc. These messages can contain important
information to researchers studying open source software systems,
developers new to a particular project and trying to get familiar
with the software, etc. Therefore, uncovering what developers are
communicating about through Gitter is an essential first step
towards successfully understanding and leveraging this information.
We present a new dataset, called GitterCom, which is meant to
enable research in this direction and represents the largest
manually labeled and curated dataset of Gitter developer messages.
The dataset comprises 10,000 Gitter messages collected from
10 Gitter communities associated with the development of open
source software systems. Each message was manually annotated
and verified by two of the authors, capturing the purpose of the
communication expressed by the message. While the dataset has
not yet been used in any publication, we discuss how it can enable
interesting research opportunities in the field.
CCS Concepts
• Software and its engineering → Collaboration in software
development; Open source model; Documentation;
Keywords
datasets, communication, chat, social media, team communication
platforms
ACM Reference Format:
Esteban Parra, Ashley Ellis, and Sonia Haiduc. 2020. GitterCom - A Dataset
of Open Source Developer Communications in Gitter. In MSR’20: MSR,
May 25–26, 2020, Seoul, South Korea. ACM, New York, NY, USA, 5 pages.
https://doi.org/10.1145/1122445.1122456
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
MSR’20, May 25–26, 2020, Seoul, South Korea
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-9999-9/18/06...$15.00
https://doi.org/10.1145/1122445.1122456
1 Introduction
Modern, complex open source software systems often require large
teams in order to be developed. The teams are usually geographically
distributed across different locations, countries and even
continents. In order to collaborate, communicate, and coordinate,
these teams make use of electronic tools such as instant messaging,
email, etc. [4, 5, 8, 10]. Recently, modern messaging and
collaboration platforms such as Gitter¹ and Slack² have revolutionized
team communications and project coordination by providing a
user-friendly way of managing and organizing conversations, facilitating
knowledge sharing, and by integrating with external software
development tools such as GitHub, Asana, and Jira [9]. Given their
features and the support for software development, many open
source projects have adopted Gitter and Slack as their preferred
communication means [5]. In particular, Gitter is currently the most
popular instant messaging platform in open source development
teams [5]. It also presents some advantages over Slack, such as:
• Open access to communications: in Slack, communities are
controlled by the administrators, whereas in Gitter, access to the
user-generated data is public. In particular, public messages and
user-generated content in Gitter are subject to the Creative Commons
license: Attribution + Non-Commercial + ShareAlike (BY-NC-SA)³.
• Free access to historical data: in Slack communities, only the latest
10,000 messages are accessible without paying. Since most public
Slack channels use the free tier [3], their historical data is
unavailable. Conversely, messages posted to public Gitter channels are
preserved and accessible indefinitely in chat room logs.
Despite its advantages over Slack, its greater popularity among
open source developers, and the availability of tens of thousands
of message exchanges between developers of open source soft-
ware, there have been no papers so far investigating developer
communications in Gitter. Rather, existing works analyzing devel-
oper communications in modern instant messaging platforms have
so far focused solely on Slack [1–3, 6].
We argue that Gitter developer communications are an untapped
information resource that could be leveraged by researchers in
order to get a deeper understanding about the nature of developer
communications in open source software. With this paper, we aim
to encourage research in this direction by introducing GitterCom,
the first manually labeled dataset of Gitter instant message histories
in open source systems. The dataset consists of 10,000 messages
¹ https://gitter.im/
² https://slack.com/
³ https://creativecommons.org/licenses/by-nc-sa/3.0/us/
MSR’20, May 25–26, 2020, Seoul, South Korea Esteban Parra, Ashley Ellis, and Sonia Haiduc
Table 1: Distribution of messages per purpose category
Category                                  Cucumber  Freezing  ImageJ  Jhipster  JSPM  MJS⁴  Sklearn⁵  THW⁶  UIKit  Xenko  Overall
Communication                                  325       794     490       446   635   695       506   583    321    480     5275
Customer support                               442         0     150       239     0     0         4   145    451      0     1431
Dev-Ops                                        198       183     308       269   305   235       464   240    190    383     2775
Discovery and news                              13         1      10         9     7     2         3    15      5     32       97
Fun                                              0         2       0         0     0    39         0     0      1      0       42
Networking and social activities                 0         0       1         3     0     3         0     0      0     32       39
Participation in Communities of Practice         4         2       9        15    21    13        13    10     12     54      153
Team Collaboration                              18        18      32        19    32    13        10     7     20     19      188
across ten Gitter communities devoted to the development of ten
different open source systems. The messages were automatically
extracted and then manually labeled by two of the authors with
respect to the communication purpose they express, based on the
categories identified in previous work by Lin et al. [6] through
surveys of developers. GitterCom is overall the largest manually
labeled dataset of developer instant messages; the only other
manually labeled dataset available comprises 500 developer Slack
messages in one software company [10].
The rest of the paper is structured as follows. Section 2 presents
an overview of the dataset, Section 3 outlines the data collection
process we followed, Section 4 discusses potential research
directions using this dataset, Section 5 presents limitations and future
improvements that could be made to the dataset, and lastly, Section
6 concludes the paper.
2 Dataset Description
GitterCom includes data about 10,000 messages collected from
10 open source software development Gitter communities (1,000
messages per community). Each message was manually labeled with
information about the purpose of the communication it expresses,
based on the categories identified by Lin et al. [6].
GitterCom is available in CSV file format online⁷. In the CSV file,
each line is a data record. Each record contains the information for
a single message and consists of seven information fields, separated
by commas and using quotes as the text delimiter. In particular, each
row contains: (i) the channel/system the message belongs to, (ii)
a unique messageID, (iii) the date and time at which the message
was posted, (iv) the author of the message, (v) the content of the
message in plain text, (vi) the corresponding purpose category
(manual label), and (vii) the purpose subcategory (manual label).
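The record layout described above can be read with Python's standard csv module. The following is a minimal sketch, assuming the seven fields appear in the listed order; the field names and the sample record are hypothetical, chosen for illustration rather than taken from the released file.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Message(NamedTuple):
    # Hypothetical field names: the paper lists the seven fields but not
    # the column headers used in the released CSV file.
    channel: str
    message_id: str
    posted_at: str
    author: str
    text: str
    category: str
    subcategory: str


def read_messages(handle) -> Iterator[Message]:
    """Yield one Message per comma-separated, quote-delimited record."""
    reader = csv.reader(handle, delimiter=",", quotechar='"')
    for row in reader:
        if len(row) != 7:
            continue  # assumption: rows without exactly seven fields are skipped
        yield Message(*row)


# Made-up sample record, for demonstration only.
sample = io.StringIO(
    'jspm,abc123,2019-03-30T12:00:00Z,some_user,'
    '"How do I install this, exactly?",Customer support,Example subcategory\n'
)
messages = list(read_messages(sample))
print(messages[0].category)
```

Quoting matters here because message text routinely contains commas; the csv module handles the quoted field so the record still splits into seven values.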
Next, we present brief descriptions of the different purposes,
their categories and subcategories we used to manually label the
messages in GitterCom. These were first identified by Lin et al. [6],
who surveyed software developers about their use of Slack.
The first purpose, called Personal benefits, includes messages
in which the developer’s main purpose is to fulfill personal needs.
Messages within this purpose can be further divided into three
categories: discovery and aggregation of news and information, where
developers post reliable, interesting, and relevant blogs or other
sources of information; networking and social activities, where
developers interact with other developers who share similar interests
or jobs; and fun, which are messages sharing gifs and memes or
meant for participating in gaming activities.
⁴ MarionetteJS
⁵ SciKit-Learn
⁶ TheHolyWaffle
⁷ https://figshare.com/s/9b3df36e22a8a8f77169
The second purpose relates to Team-wide activities and includes
messages aimed towards carrying out software development
activities related to the system being developed. Messages within this
purpose can be further divided into the following four categories:
communication messages in which the developers engage in
activities such as communication with teammates (e.g., members of a
distributed team) during meetings and note-taking, communication
with other stakeholders, or discussing non-work topics; team
collaboration messages in which the developers engage in activities such
as team management, file, and code sharing; Dev-Ops messages in
which the developers engage in activities such as communicating
updates regarding the status of the project (e.g., development
operation notifications about recent changes to the system, commits,
bug fixes, pushes to the repository, merges), software deployments,
and team Q&As; and finally, customer support messages in which
the developers assist new or existing users of the system on how
to perform certain tasks, identify bugs, and troubleshoot errors.
The last purpose is represented by Community support messages,
where developers participate in communities of practice or
special interest groups. These messages are characterized by
developers aiming to keep up with specific frameworks/communities, to
learn about new tools and frameworks for developing applications,
or to brainstorm ideas with other people in the community.
Table 1 shows the number of messages per category in GitterCom,
for each of the 10 open source systems/communities we
considered, while Figure 1 shows the overall distribution of
messages associated with each category across all the communities in
GitterCom.
Based on the hierarchy presented above, we notice that the
majority of Gitter messages in GitterCom belong to Team-wide purposes.
Figure 1 shows that the distribution of messages varies significantly
across categories. In particular, 83% of the messages are meant to
support activities directly associated with the development of the
system. On the other hand, 14.31% of the messages are related to
community support and engagement with communities of practice,
and only 2.69% of the messages are linked to personal benefits.
Moreover, 53% of the messages involve communication between
the developers and stakeholders, 28% of the messages communicate
GitterCom - A Dataset of Open Source Developer Communications in Gitter MSR’20, May 25–26, 2020, Seoul, South Korea
updates regarding the status of the system, and 15% of the messages
involve customer support.
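Per-category and per-purpose shares can be recomputed directly from the Overall column of Table 1. The sketch below is illustrative only: the counts are copied from Table 1, while the category-to-purpose grouping is our reading of the hierarchy described in Section 2 and should be treated as an assumption. The percentages quoted in the prose may rest on a different grouping, which this sketch does not attempt to reproduce.

```python
# Overall message counts per category, copied from Table 1.
OVERALL = {
    "Communication": 5275,
    "Customer support": 1431,
    "Dev-Ops": 2775,
    "Discovery and news": 97,
    "Fun": 42,
    "Networking and social activities": 39,
    "Participation in Communities of Practice": 153,
    "Team Collaboration": 188,
}

# Assumed grouping of categories into purposes, following Section 2.
PURPOSE = {
    "Communication": "Team-wide",
    "Customer support": "Team-wide",
    "Dev-Ops": "Team-wide",
    "Team Collaboration": "Team-wide",
    "Discovery and news": "Personal benefits",
    "Fun": "Personal benefits",
    "Networking and social activities": "Personal benefits",
    "Participation in Communities of Practice": "Community support",
}


def category_shares(counts):
    """Return each category's share of all messages, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def purpose_shares(counts, purpose_of):
    """Sum the counts per purpose and return each purpose's share, in percent."""
    total = sum(counts.values())
    grouped = {}
    for category, n in counts.items():
        purpose = purpose_of[category]
        grouped[purpose] = grouped.get(purpose, 0) + n
    return {purpose: round(100 * n / total, 2) for purpose, n in grouped.items()}


category_pct = category_shares(OVERALL)
purpose_pct = purpose_shares(OVERALL, PURPOSE)
for name, pct in sorted(category_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}%")
```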
[Pie chart: Communication 52.75%; Dev-Ops 27.75%; Customer support 14.31%; Team collaboration 1.88%; Communities of practice 1.53%; Discovery and news 0.97%; Fun 0.42%; Networking and social 0.39%]
Figure 1: Distribution of messages by category
3 Data Collection
This section presents in detail the data collection and curation
procedure we used to create the GitterCom dataset. We first gathered
the list of all the Gitter communities listed in Gitter’s Explore
interface⁸ on April 1, 2019. We then excluded the channels in which the
conversations were not in English, resulting in a list of 139 Gitter
communities. Afterwards, using the Gitter API⁹, we extracted all
of the messages in the main channels of these communities, from
their inception until April 1, 2019. This data collection resulted in a
set of 2,939,335 messages across all 139 channels.
To extract the raw data for GitterCom, we used a custom Python
script, which uses pycurl to connect to Gitter’s REST API and obtain
all the messages and their corresponding metadata. Afterwards, to
facilitate the labeling process, we ran a custom Java script to convert
the extracted messages from the JSON format provided by Gitter’s
API to CSV format. The data collection scripts and instructions on
their usage are found in our replication package [7].
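The JSON-to-CSV conversion step can be illustrated with a short sketch. It is not the conversion script from the replication package (which the text describes as a custom Java script); Python is swapped in here purely for illustration. The field names `id`, `sent`, `text`, and `fromUser.username` are assumed to match the message objects returned by Gitter's API, and the sample payload is made up.

```python
import csv
import io
import json


def gitter_json_to_csv(raw_json: str, channel: str, out) -> int:
    """Write one quoted CSV row per message and return the number of rows."""
    messages = json.loads(raw_json)
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for msg in messages:
        user = msg.get("fromUser") or {}  # tolerate a missing or null author
        writer.writerow([
            channel,
            msg.get("id", ""),
            msg.get("sent", ""),
            user.get("username", ""),
            msg.get("text", ""),
        ])
    return len(messages)


# Made-up payload in the assumed shape, for demonstration only.
sample_json = json.dumps([
    {"id": "m1", "sent": "2019-03-30T12:00:00Z",
     "fromUser": {"username": "alice"},
     "text": 'Try "jspm install", then rebuild'},
    {"id": "m2", "sent": "2019-03-30T12:01:00Z",
     "fromUser": None, "text": "ok"},
])
buffer = io.StringIO()
written = gitter_json_to_csv(sample_json, "jspm", buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(written, rows[0][4])
```

Quoting every field keeps commas and embedded quotes in message text from breaking the record structure, which matters for the manual labeling that follows.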
The 139 channels collected as raw data vary in three main ways:
by membership - the channels contain between 100 and 17,000 mem-
bers per channel, by level of activity - the smallest channel contains
21 messages, whereas the largest channel contains over 423,000
messages, and by type - channels can be made for the development
of a particular software system, where the developers communicate
with each other and with the system’s stakeholders, or made for
building communities of practice in which the members’ discussion
revolves around particular topics, frameworks, or programming
languages, but does not involve discussion about the active devel-
opment of a system.
While we make the entire data we extracted for all the 139
channels available for download to other researchers¹⁰, our main goal
for GitterCom was to manually curate and label a subset of the
messages, based on the purposes/intents identified by Lin et al.
[6] (as described in Section 2). We therefore selected the first ten
channels which met the following criteria: (i) they are linked to
an active GitHub repository, (ii) they are used as a communication
tool for the active development of an open-source software system,
(iii) they cover different application domains, (iv) they have been
active in the past year, and (v) they contain at least 1,000 messages.
Table 2 shows the details of the selected systems/channels.
⁸ https://gitter.im/explore
⁹ https://developer.gitter.im
¹⁰ https://figshare.com/s/3fd5af0b869b8fd010bb
Table 2: Subset of Gitter communities included in GitterCom
Community      Members  Messages  Application domain
Marionette        3014    181108  JavaScript framework
jspm              1103     27245  Package manager
scikit-learn      3188      9844  Machine Learning
Xenko3d            103      2890  Game engine
FreezingMoon       109    207925  Video game
UIkit             2155     41265  Front-end framework
jHipster          2575     39418  Application generator
Cucumber           337      2030  Testing framework
Imagej             209      8149  Image processing
TheHolyWaffle      196     15046  VoIP communication
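The five selection criteria can be expressed as a simple filter over candidate channels. This is a sketch under stated assumptions, not the procedure actually used: the `Channel` fields are hypothetical, criteria (i), (ii), and (iv) are modeled as flags that a person would set after inspecting each channel, and criterion (iii) is read here as "no application domain is repeated". The entries marked hypothetical are invented for the example; the Cucumber and jspm message counts come from Table 2.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Channel:
    name: str
    domain: str
    message_count: int
    has_active_repo: bool       # (i) linked to an active GitHub repository
    develops_oss_system: bool   # (ii) used for active development of a system
    active_in_past_year: bool   # (iv) activity within the past year


def select_channels(candidates: Iterable[Channel], limit: int = 10,
                    min_messages: int = 1000) -> List[Channel]:
    """Return the first `limit` channels that satisfy all five criteria."""
    selected: List[Channel] = []
    seen_domains = set()
    for ch in candidates:
        if len(selected) >= limit:
            break
        meets_flags = (ch.has_active_repo and ch.develops_oss_system
                       and ch.active_in_past_year)
        if not meets_flags or ch.message_count < min_messages:  # (i), (ii), (iv), (v)
            continue
        if ch.domain in seen_domains:  # (iii), under the assumed reading
            continue
        selected.append(ch)
        seen_domains.add(ch.domain)
    return selected


candidates = [
    Channel("Cucumber", "Testing framework", 2030, True, True, True),
    Channel("example-cop", "Community of practice", 50000, False, False, True),  # hypothetical
    Channel("example-small", "Package manager", 21, True, True, True),           # hypothetical
    Channel("jspm", "Package manager", 27245, True, True, True),
    Channel("example-dup", "Testing framework", 9000, True, True, True),         # hypothetical
]
chosen = [ch.name for ch in select_channels(candidates)]
print(chosen)
```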
From each of the ten selected channels we then collected the
1,000 most recent consecutive messages up to April 1, 2019, for a
total of 10,000 messages. The first two authors then carried out a
coding procedure to label these messages, using the categories and
subcategories identified by Lin et al. [6] as labels. More specifically,
each message was assigned a category describing the main purpose
of the message and a subcategory describing the specific activity
the message relates to. If a message did not provide any meaningful
information by itself (e.g., a single emoji, "ok", "great"), it was
classified as "Uninformative". After the individual coding, the two
authors met, discussed, and resolved any coding conflicts. The
messages for which a classification of "Uninformative" was agreed upon
were discarded and replaced by an equal number of messages from
the same channel. Then, the coding process was applied to these
new messages. This procedure was repeated until 1,000 messages
were obtained for each channel, all having a label other than
"Uninformative". Across all channels, a total of 1,061 messages were
labeled as "Uninformative" during the labeling process.
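The discard-and-replace loop described above can be sketched as follows. The labeling itself was manual; `label_fn` is a hypothetical stand-in for the two coders' agreed label. The sketch also assumes that replacement messages are the next older ones in the channel, which the text does not state explicitly.

```python
UNINFORMATIVE = "Uninformative"


def label_channel(messages, label_fn, target=1000):
    """Label messages (ordered newest first) until `target` informative ones
    are collected; return (labeled pairs, number discarded)."""
    labeled = []
    discarded = 0
    cursor = 0
    while len(labeled) < target and cursor < len(messages):
        needed = target - len(labeled)
        batch = messages[cursor:cursor + needed]  # replacements for discarded ones
        cursor += len(batch)
        for msg in batch:
            label = label_fn(msg)
            if label == UNINFORMATIVE:
                discarded += 1
            else:
                labeled.append((msg, label))
    return labeled, discarded


# Toy stand-in for the manual coding step, for demonstration only.
def toy_label(msg):
    return UNINFORMATIVE if msg in {"ok", "great"} else "Communication"


history = ["a", "ok", "b", "great", "c", "d"]
labeled, discarded = label_channel(history, toy_label, target=3)
print([msg for msg, _ in labeled], discarded)
```

Each pass requests exactly as many new messages as were discarded in the previous one, mirroring the "equal number of messages" wording.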
During the coding process, when the content of a message was
insufficient to determine a category, we used the list of contributors
to the system’s repository as a source of additional information
that could give an insight into the nature of the message. One
example of such ambiguous messages were questions which could be
interpreted as either a customer asking about the system (Customer
Support) or a developer of the system asking about a part of the
system they are unfamiliar with (Team Q&A). In this particular
case, if a question was made by a contributor to the system, it was
classified as Team Q&A, and Customer Support otherwise.
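The tie-breaking rule for ambiguous questions amounts to a membership test against the repository's contributor list. A minimal sketch, assuming that a message author's Gitter username can be compared directly with the contributors' GitHub logins; the usernames are invented.

```python
def resolve_question_label(author: str, contributors: set) -> str:
    """Apply the contributor rule to an otherwise ambiguous question."""
    return "Team Q&A" if author in contributors else "Customer Support"


# Hypothetical contributor list for one repository.
contributors = {"alice", "bob"}
print(resolve_question_label("alice", contributors))
print(resolve_question_label("newcomer42", contributors))
```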
The manual coding procedure took the two authors overall three
weeks to complete. After completing the manual labeling, we
obtained GitterCom, a dataset comprising 10,000 Gitter messages,
1,000 per Gitter channel, classified according to their purpose.
4 Potential Research Applications
Previous studies have investigated the growing use of alternative
communication means by developers [5, 6, 10]. The results of these
Figure 2: GitterCom sample
studies show the rise of instant messaging tools and the impact
they have on reshaping team dynamics and the communication
landscape in increasingly distributed software development
environments. Future studies could make use of GitterCom to study the
relationship between open source development activity and
communication trends. In particular, GitterCom enables further research to
analyze and understand patterns in developer communications and
to address important questions such as: How do software teams
use tools like Gitter to communicate among themselves and with
other stakeholders? How do team dynamics reflect in team
communications? Do developers exchange different types of messages
at different times in the software life cycle? Do developers new
to a project post different types of messages than the more senior
developers?
GitterCom could also be used as a training dataset for machine
learning approaches for automatically classifying new developer
messages based on their purpose. This could, in turn, be useful
to automatically organize messages into threads or to create sum-
maries of developer conversations based on their purpose, such that
developers that were away for a while or newcomers to a project
could quickly catch up on important conversations they missed.
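As an illustration of the classification use case, the sketch below trains a small multinomial Naive Bayes model with add-one smoothing on (text, category) pairs. It is a toy baseline written with the standard library only, not an approach evaluated on GitterCom; the training messages are invented, and a real study would train on the labeled records and use a proper evaluation protocol.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / self.total_docs)  # class prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training messages, for demonstration only.
train = [
    ("merged the pull request into master", "Dev-Ops"),
    ("build failed on the latest commit", "Dev-Ops"),
    ("how do I install the plugin", "Customer support"),
    ("I get an error when I install it", "Customer support"),
]
model = NaiveBayes().fit([t for t, _ in train], [c for _, c in train])
print(model.predict("install error on my machine"))
print(model.predict("commit merged into master"))
```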
Another avenue for future work would be to use GitterCom in
order to perform large scale replications of previous studies that
analyzed developer communications in Slack [1, 2, 10], but used
much smaller or restricted datasets (e.g., communications in student
projects or a particular software company). These replications on
GitterCom could help corroborate previous findings or uncover
new information about how developers communicate through
instant messaging tools. One example of such work that could benefit
from a large scale replication is work on the identification of
messages that contain rationale for the decisions made by developers
throughout the software life cycle [2]. Thus far, work on rationale
has been limited to analyzing the chat messages of three student
teams working on a multi-project capstone course.
5 Limitations and Future Improvements
Although GitterCom is the largest dataset of curated and manually
labeled developer instant messages, it still encompasses a small subset
of all the existing Gitter developer communications. Therefore, one
limitation of GitterCom could be that the collected projects are
not representative of all open-source projects and that the most
recent 1,000 messages for a project are not representative of all
the messages exchanged by developers in a project. Improvements
that would help increase the generalizability of the results of future
studies analyzing this dataset include the expansion of the labeled
data in GitterCom to include more messages from more projects.
For this purpose we also release the raw, unlabeled data extracted
by our crawler script, containing over 2 million messages from 139
open source projects at https://figshare.com/s/3fd5af0b869b8fd010bb.
We hope other researchers will join our effort and will
select more of this raw data to label and contribute to GitterCom.
6 Conclusions
Due to the rapid growth in the adoption of instant messaging tools
in open source development communities, there is a strong need
to study the nature of this type of communication between devel-
opers, and its implications for open source software development.
However, such analysis is not possible without data to explore.
We introduced GitterCom, the largest manually labeled and cu-
rated dataset of Gitter developer messages. It comprises 10,000
messages and their corresponding purpose labels across multiple
open source Gitter channels, corresponding to systems covering
a wide range of application domains. We believe that our dataset
provides immense opportunities for researchers to perform large
scale empirical research and further analysis on developer discus-
sions, communication with stakeholders, and team dynamics in
open source systems. Our hope is that the initial dataset
presented in this paper will spur interest in the continued collection and
analysis of developer instant communications.
References
[1] Rana Alkadhi, Jan Ole Johanssen, Emitza Guzman, and Bernd Bruegge. 2017.
REACT: An Approach for Capturing Rationale in Chat Messages. In Proceedings
of the 11th ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement (ESEM’17). IEEE, Toronto, ON, Canada, 175–180.
[2] R. Alkadhi, T. Lata, E. Guzman, and B. Bruegge. 2017. Rationale in Development
Chat Messages: An Exploratory Study. In Proceedings of the 14th IEEE/ACM
International Conference on Mining Software Repositories (MSR’17). 436–446.
[3] Preetha Chatterjee, Kostadin Damevski, Lori Pollock, Vinay Augustine, and
Nicholas A Kraft. 2019. Exploratory Study of Slack Q&A Chats as a Mining Source
for Software Engineering Tools. In Proceedings of the 16th IEEE International
Conference on Mining Software Repositories (MSR’19). IEEE, Montreal, Canada,
490–501.
[4] Shaiful Alam Chowdhury and Abram Hindle. 2015. Mining StackOverflow to
Filter out Off-topic IRC Discussion. In Proceedings of the 12th IEEE Working
Conference on Mining Software Repositories (MSR’15). IEEE, Florence, Italy,
422–425.
[5] Verena Käfer, Daniel Graziotin, Ivan Bogicevic, Stefan Wagner, and Jasmin
Ramadani. 2018. Communication in Open-Source Projects-End of the E-mail Era?
In Proceedings of the 40th IEEE/ACM International Conference on Software
Engineering (ICSE’18). IEEE, Gothenburg, Sweden, 242–243.
[6] Bin Lin, Alexey Zagalsky, Margaret-Anne Storey, and Alexander Serebrenik.
2016. Why Developers Are Slacking Off: Understanding How Software Teams
Use Slack. In Proceedings of the 19th ACM Conference on Computer Supported
Cooperative Work and Social Computing (CSCW’16). ACM, 333–336.
[7] Esteban Parra. 2020. GitterCom, dataset.
https://figshare.com/s/9b3df36e22a8a8f77169
[8] M. Storey, A. Zagalsky, F. F. Filho, L. Singer, and D. M. German. 2017. How Social
and Communication Channels Shape and Challenge a Participatory Culture in
Software Development. IEEE Transactions on Software Engineering 43, 2 (Feb.
2017), 185–204.
[9] Margaret-Anne Storey, Leif Singer, Brendan Cleary, Fernando Figueira Filho,
and Alexey Zagalsky. 2014. The (R) Evolution of Social Media in Software
Engineering. In Proceedings of the 36th ACM/IEEE International Conference in
Software Engineering, Future of Software Engineering (FOSE’14). ACM, Hyderabad,
India, 100–116.
[10] Viktoria Stray, Nils Brede Moe, and Mehdi Noroozi. 2019. Slack Me if You Can!:
Using Enterprise Social Networking Tools in Virtual Agile Teams. In Proceedings
of the 14th International Conference on Global Software Engineering (ICGSE’19).
IEEE, Montreal, Quebec, Canada, 101–111.