
Big Data Reduction: Lessons Learned From Analyzing One Billion Dota 2 Matches

Authors: Marco Lussetti, Dyson Fraser; Supervisor: Dr. Mila Kwiatkowska

Why study a Video Game? The availability of large datasets of player choices in popular online computer games presents an opportunity to identify sudden changes in choice patterns and explore what factors may contribute to such changes. We were particularly interested in the challenges posed by the scale of such large datasets (we attempted to obtain another >1TB dataset as well).
What is Dota 2? Dota 2 is a popular MOBA (Multiplayer Online Battle Arena) video game. The core gameplay revolves around teams of five players choosing from a character pool of over 100 characters and facing off against another five-person team in a race to destroy a large structure located in the enemy’s base called the Ancient. The game is also known for its annual tournament, The International, whose prize pool runs into the millions of dollars ($25 million in 2018).
Why study Dota 2? Neither of the authors has significant experience with the game; we had initially explored the availability of datasets for a wide gamut of games with which we have more experience. However, Dota 2 is unique in having a long-running community project to collect data on the game, which is periodically released for use in research. The OpenDota (formerly YASP) project’s most recent “data dump” covers >1 billion matches from March 2011 to March 2016 [1] and clearly boosts the appeal of Dota 2 as a subject of study.
Research Objectives
Detect metagame shifts from hero pick ratios, that is, attempt to recognize significant changes in players’ propensity for picking certain heroes caused by external events (patches to the game, major tournaments, etc.).
“Tame” the dataset’s enormity using the simplest tools possible, relying only on the resources of a normal machine such as might be available to an average researcher without extensive funding.
Marco Lussetti
www.marcolussetti.com
marco@marcolussetti.com
Dyson Fraser
www.linkedin.com/in/dysonfraser
dysonfraser@gmail.com
Dr. Mila Kwiatkowska
Supervisor
mkwiatkowska@tru.ca
Metagame Shifts
The “diversity” graph records the difference between a day’s hero picks and the average of the previous 14 days of picks. This is computed using a simple Manhattan distance where each hero’s pick rate represents a dimension in the vector.
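For illustration, a minimal sketch of this calculation (not the production code), assuming each day’s hero pick rates have already been aggregated into a double[] indexed by hero id:

import java.util.List;

public class DiversityScore {
    // Manhattan (L1) distance between today's pick-rate vector and the
    // mean pick-rate vector of the previous 14 days.
    public static double diversity(double[] today, List<double[]> previous14Days) {
        double[] mean = new double[today.length];
        for (double[] day : previous14Days)
            for (int h = 0; h < mean.length; h++)
                mean[h] += day[h] / previous14Days.size();
        double distance = 0.0;
        for (int h = 0; h < mean.length; h++)
            distance += Math.abs(today[h] - mean[h]);
        return distance;
    }
}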
We sought to match prominent peaks to external events that might influence player behavior, such as game patches [5] and tournaments (tournaments turned out to have no discernible effect).
2012-01-12 Major rebalancing of heroes & item changes
2012-06-11 Chaos Knight, Phantom Assassin, Gyrocopter released
2012-07-26 Nyx Assassin, Keeper of the Light, Visage released
2012-10-30 Recently released Centaur Warrunner is nerfed
2012-12-19 Major rebalance of most/all heroes
2013-11-14 Three Spirits Patch, significant out-of-game & economy changes
2013-12-12 Skeleton King removed shortly before, Legion Commander added, Wraith King added shortly after
2014-01-29 Terrorblade, Phoenix released
2014-02-05 Year Beast Brawl (special game mode)
2014-11-20 Oracle released
2015-02-18 Year Beast Brawl (special game mode)
2015-04-30 Major balance changes
2015-05-03 No major changes, but The Summit 3 Tournament tickets released
2015-09-25 Major balance changes (prev. day)
2015-12-16 Arc Warden released
We believe our approach largely works, even if some major spikes, such as that of 2015-05-03, remain unexplained. A subject-matter expert might be able to help explain such spikes.
Poor Man’s Solutions
Java 8 Streams (Bad idea! Keep it simpler!). With such a large amount of data to process, it is not viable to load the data into memory; it must be processed iteratively. Initially, we sought to use Java 8 Streams for this, to avoid having to decompress the file ahead of time. However, we quickly discovered that Streams add significant overhead and quickly exhaust a consumer computer’s memory. A simple for-loop (well, an iterator really) was much more performant!
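A minimal sketch of the plain-iteration approach that replaced the Stream pipeline; the file name matches.csv.gz and the counting body are placeholders, not the actual implementation:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class PlainIteration {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("matches.csv.gz"))))) {
            String line;
            long rows = 0;
            // A simple while-loop keeps only one line in memory at a time,
            // unlike a Stream pipeline that accumulates per-stage overhead.
            while ((line = reader.readLine()) != null) {
                rows++; // parse/aggregate the row here instead of just counting
            }
            System.out.println("rows: " + rows);
        }
    }
}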
CSV Parsers (ey matter!). Key to the performance of iterating through such a
large dataset. We originally attempted to use the fastest CSV parser available, uniVoci-
ty-parsers [3]. However, memory usage from this parser was extremely substantial and
would slow down aer ~400,000 and crash at ~450,000 records. We fell back to one of
the most common and well supported parsers, opencsv, however we found its perfor-
mance insucient (around 2,000 lines/s). We found that the second-fastest parser, Sim-
pleFlatMapper, has negligible performance loss but maintains a constant low memory
prole. e entire processing could be done with a few hundred megabytes of RAM!
JSON Parsers (Take your pick). As each row contains a JSON field that needs to be parsed, a JSON parser must be employed. We did not see any significant differences among the parsers we briefly auditioned and made our choice by and large based on intuitiveness. We found Jsoniter to be perfectly serviceable.
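A small illustrative sketch of pulling hero ids out of a row’s JSON field with Jsoniter; the field layout shown here (“players” / “hero_id”) is hypothetical and may not match the dump’s actual JSON column:

import com.jsoniter.JsonIterator;
import com.jsoniter.any.Any;

public class JsonFieldExample {
    public static void main(String[] args) {
        String json = "{\"players\":[{\"hero_id\":14},{\"hero_id\":27}]}";
        // Lazily deserialize the row's JSON field into Jsoniter's Any view.
        Any match = JsonIterator.deserialize(json);
        for (Any player : match.get("players")) {
            System.out.println("picked hero " + player.get("hero_id").toInt());
        }
    }
}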
HashMap Optimization (Doesn’t matter). We were concerned about the memory used by the HashMap/dictionary in which we intended to store the condensed data while it was processed, and selected a high-performance library, Trove4J. This library from Palantir stores a THashMap using only (8 * CAPACITY) bytes, compared to Java’s HashMap, which uses (32 * SIZE + 4 * CAPACITY) bytes [4]. However, it became evident that the HashMaps’ memory utilization was not a significant concern because of the coarseness of the granularity.
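For scale, a rough worked example of the two formulas quoted above [4]; the 120-hero size and 256-entry capacity are hypothetical, purely to show the order of magnitude involved:

public class MapMemoryEstimate {
    public static void main(String[] args) {
        int size = 120, capacity = 256;
        long trove = 8L * capacity;              // THashMap: 8 * CAPACITY bytes
        long jdk   = 32L * size + 4L * capacity; // java.util.HashMap: 32 * SIZE + 4 * CAPACITY bytes
        System.out.println("THashMap ~" + trove + " B, HashMap ~" + jdk + " B");
        // Either way the footprint is a few kilobytes: at daily granularity the
        // map is tiny, which is why this optimization turned out not to matter.
    }
}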
FileReader leReader = new FileReader(input);
Iterator<String[]> cReader = CsvParser.iterator(leReader);
while (cReader.hasNext()) parseRow(cReader.next(), onlyCount);
We have identied several promising venues for extension of the work which we hope
to pursue, in addition to overdue cleanup of the codebase which is already in progress.
Exploring further metrics that are available from the underlaying data, chiey the
wins-losses ratio for each hero which we currently export but do not explore may be a
signicant metric which can lead to metagame shi detection.
Another possible avenue for improved detection we have identied is to explore pick
ratios not in term of individual heroes but in term of the role these heroes ll.
Lastly, the high dimensionality of the dataset means that using as simple a distance
function as Manhattan or Euclidean distance is likely not the best idea, and that a more
meaningful dissimilarity measure might be more eective.
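As a purely illustrative example of such an alternative (not part of the present work), a cosine-based dissimilarity normalizes away overall pick volume and compares only the direction of the pick-rate vectors:

public class CosineDissimilarity {
    // 1 - cosine similarity: 0 means identical direction, values near 1 mean
    // very different pick distributions regardless of total match volume.
    public static double dissimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int h = 0; h < a.length; h++) {
            dot += a[h] * b[h];
            normA += a[h] * a[h];
            normB += b[h] * b[h];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
    }
}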
https://git.io/JHU https://git.io/JHU
References
[1] The OpenDota Project, “Data Dump (March 2011 to March 2016),” OpenDota, 24-Mar-2017. [Online]. Available: https://blog.opendota.com/2017/03/24/datadump2/. [Accessed: 25-Feb-2019].
[2] M. H. ur Rehman, C. S. Liew, A. Abbas, P. P. Jayaraman, T. Y. Wah, and S. U. Khan, “Big Data Reduction Methods: A Survey,” Data Sci. Eng., vol. 1, no. 4, pp. 265–284, Dec. 2016.
[3] uniVocity Software Pty Ltd, Comparisons among all Java-based CSV parsers in existence: uniVocity/csv-parsers-comparison. univocity, 2018.
[4] M. Vorontsov, “Trove library: using primitive collections for performance,” Java Performance Tuning Guide, 19-Jul-2014.
[5] Dota 2 Wiki Contributors, “Category:Patches,” Dota 2 Wiki. [Online]. Available: https://dota2.gamepedia.com/Category:Patches. [Accessed: 25-Mar-2019].
Big Data Reduction
The Dataset. The dataset ranges from March 2011 to April 2016 and contains data on 1,191,768,403 (>1 billion) matches played during that time [1]. The data is publicly available as a gzip-compressed CSV file (151 GB compressed, 1.2 TB uncompressed). At our disposal was only a personal machine with good but not outstanding performance, which could not possibly handle the data in its original format. As such, we had to resort to Big Data Reduction techniques.
Dimensionality reduction. The curse of dimensionality is a well-known problem in which the high number of dimensions in a dataset causes an increasingly heavy computational burden [2]. In our case, the dataset has over 50 dimensions. We established that only 22 dimensions were needed to achieve our objectives and processed the dataset accordingly. We calculated that, on an average day in the dataset, this alone could reduce the space needed from 671 MB to as little as 99 MB (a 6.7x reduction).
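A minimal sketch of this column projection, assuming rows arrive as String[] from the CSV iterator; the indices kept below are placeholders, not the actual 22 columns:

public class ColumnProjection {
    // Hypothetical indices of the columns we keep, e.g. match id, start time, picks JSON.
    static final int[] KEPT_COLUMNS = {0, 2, 5};

    static String[] project(String[] row) {
        String[] reduced = new String[KEPT_COLUMNS.length];
        for (int i = 0; i < KEPT_COLUMNS.length; i++)
            reduced[i] = row[KEPT_COLUMNS[i]]; // copy only the columns we need
        return reduced;
    }
}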
Granularity reduction. We also established that, for our purpose of detecting points where the metagame shifted, we do not need extremely granular data. After all, we are interested in the day or few days when a shift occurs, rather than the minute. Thus, we can afford a much coarser granularity than is present in the dataset: we do not need per-match data, just per-day or per-week summaries of which heroes were picked. A sum of the daily picks (and wins/losses) for each hero was what we sought to produce. Condensing the data in this manner produced the expected extraordinary space savings: the potential reduction could be as great as from 99 MB to 3 KB (a 35,545x reduction).
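A minimal sketch of this aggregation, assuming each processed match yields a date and the ten picked hero ids; the class and method names are illustrative only:

import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class DailyPickCounter {
    // date -> (hero id -> number of picks that day)
    private final Map<LocalDate, Map<Integer, Long>> picksPerDay = new HashMap<>();

    public void record(LocalDate day, int[] pickedHeroIds) {
        Map<Integer, Long> counts = picksPerDay.computeIfAbsent(day, d -> new HashMap<>());
        for (int heroId : pickedHeroIds)
            counts.merge(heroId, 1L, Long::sum); // add one pick for this hero on this day
    }

    public Map<LocalDate, Map<Integer, Long>> summary() {
        return picksPerDay;
    }
}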
Resulting compression. These are theoretical results on simulated data; in our actual implementation, we observed the original 1.2 TB dataset being reduced to 3 MB, a 396,514x compression. The larger reduction in the real-world application is likely a result of JSON compression and a level of data sparsity that was not present in the hand-crafted test case.
Figure: Heroes Pick Rates (2011-11-22 to 2016-04-23), with the most prominent diversity peaks annotated by date and rank.