Steam Review Dataset - new, large scale sentiment dataset
Antoni Sobkowicz, Wojciech Stokowiec
Ośrodek Przetwarzania Informacji – Państwowy Instytut Badawczy
al. Niepodległości 188b, 00-608 Warszawa, Poland
{antoni.sobkowicz, wojciech.stokowiec}@opi.org.pl
Abstract
In this paper we present a new binary sentiment classification dataset containing 3,640,386 reviews from Steam User Reviews, together with a detailed analysis of the dataset's properties and initial sentiment analysis results on the collected data.
1. Introduction
This paper introduces a binary sentiment classification dataset containing 3,640,386 reviews in English. Contrary to other popular sentiment corpora (like the Amazon reviews dataset (McAuley et al., 2015) or the IMDb reviews dataset (Maas et al., 2011)), the Steam Review Dataset (Antoni Sobkowicz, 2016)¹ is also annotated by Steam community members, providing insightful information about what other users consider helpful or funny. Additionally, for each game we have gathered all available screenshots, which could be used for learning inter-modal correspondences between textual and visual data. We believe that our dataset opens new directions of research for the NLP community.
Steam User Reviews, the online review section of the Steam gaming platform developed by Valve Corporation, is one of the more prominent ways Steam users interact, allowing them to share their views on and experiences with games sold on the platform. This lets users drive sales of a game up, or slow them down to the point of the product being removed from sale, as online user reviews are known to influence purchasing decisions through both their content (Ye et al., 2009) and their volume (Duan et al., 2008).
Each review is manually tagged by its author as either positive or negative before posting. It also contains the author's user name (Steam Display Name), the number of hours the user played the game, the number of games owned by the user and the number of reviews written by the user.
Once a review is online, other Steam users can tag it as Useful/Not Useful (both of which add to its Total Score) or as Funny. The Useful/Not Useful votes are used to generate a Usefulness score (the percentage of the Total Score that is Useful). The Funny score works differently: it does not count towards the Total Score, and only lets users tag a review as Funny.
In the rest of the paper we describe the dataset in detail and provide basic analysis, based both on review scores and on review texts. We also provide baselines for sentiment analysis and topic modelling on the dataset. We encourage everyone to explore the dataset, especially:
- relations between games, genres and reviews
- network properties of the dataset – connections between users and groups of people
- inter-modal correspondences between reviews and game screenshots

¹ Availability information is described in section 6.
2. Detailed dataset description and analysis
Figure 1: Typical Steam game review.
We gathered 3,640,386 reviews in English for 6,158 games spanning multiple genres, which, to the best of our knowledge, cover over 80% of all games in the Steam store. We have also gathered screenshots and basic metadata for each game that we processed. For each review we extracted the Review Text, the Review Sentiment, and three scores (Usefulness, Total and Funny), plus the derived Funny Ratio. Detailed descriptions of the scores are as follows:
- Usefulness Score – the number of users who marked a given review as useful
- Total Score – the number of users rating the usefulness of a given review
- Funny Score – the number of users who marked a given review as funny
- Funny Ratio – the fraction of the Funny Score to the Total Score
We stored all extracted data, along with the raw downloaded HTML of each review (for extracting more information in the future), in a database.
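The fields and derived ratios above can be sketched as a simple record type (a hypothetical illustration; the actual database schema is not given in the paper):

```python
from dataclasses import dataclass

@dataclass
class SteamReview:
    """One review, with the fields described above; names are illustrative."""
    text: str
    sentiment: int        # +1 positive, -1 negative (author-assigned)
    useful_score: int     # users who marked the review as useful
    total_score: int      # users who rated the review's usefulness
    funny_score: int      # users who marked the review as funny

    @property
    def usefulness(self) -> float:
        """Usefulness = Useful Score / Total Score (0.0 if unrated)."""
        return self.useful_score / self.total_score if self.total_score else 0.0

    @property
    def funny_ratio(self) -> float:
        """Funny Ratio = Funny Score / Total Score (0.0 if unrated)."""
        return self.funny_score / self.total_score if self.total_score else 0.0

review = SteamReview("Great game!", 1, useful_score=3, total_score=4, funny_score=1)
print(review.usefulness)   # 0.75
print(review.funny_ratio)  # 0.25
```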
2.1. Review sentiment/score
We calculated basic statistics for the gathered data: the 3,640,386 collected reviews were written by 1,692,556 unique users. The global positive-to-negative review ratio was 0.81 to 0.19. The average review Total Score was 6.39, with a maximum of 22,649. The average Useful Score to Total Score ratio for reviews with Total Score > 1 was 0.29, with a maximum of 1.0 and a minimum of 0.0. The average Funny Score was 0.95 (329,278 reviews had a Funny Score of at least 1), with a maximum of 20,875.
Sentiment   Usefulness average   σ
Positive    0.624                0.369
Negative    0.394                0.307
Table 1: Usefulness average comparison for positive and negative reviews.
An analysis of Usefulness (the Useful Score to Total Score ratio) for positive and negative reviews showed that the average Usefulness of positive reviews is statistically significantly higher than that of negative reviews (unpaired t-Test, P-value < 0.0001). Averages and standard deviations are shown in Table 1.
Distributions of Usefulness and of Funny Score against review length are shown in Figure 5. Additionally, as shown in Figure 4, we binned Usefulness into 100 logarithmic bins. The usefulness of a review is roughly (apart from some outliers) a logarithmic function of the comment's length, for both positive and negative reviews: the fitted log function has R² = 0.954 for positive reviews and R² = 0.979 for negative reviews. The Funny Score seems to be unrelated to review length.
After both qualitative and quantitative analysis, we decided to mark reviews as popular when they are in the top 20% of reviews by Total Score (per game). Reviews were marked as funny if they are popular and have a Funny Ratio (Funny Score to Total Score) greater than or equal to 20% (after excluding reviews with a zero Funny Score). The distribution of the Funny Ratio is shown in Figure 2.
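The marking rules above can be sketched as follows (a minimal illustration: the thresholds follow the text, while the dict-based data layout is hypothetical):

```python
def mark_reviews(reviews):
    """Flag each review of one game as popular and/or funny.

    popular: in the top 20% of the game's reviews by Total Score.
    funny:   popular, with a nonzero Funny Score and Funny Ratio >= 0.2.
    Each review is a dict with 'total_score' and 'funny_score' keys.
    """
    ranked = sorted(reviews, key=lambda r: r["total_score"], reverse=True)
    cutoff = max(1, round(0.2 * len(ranked)))  # top 20%, at least one review
    for i, r in enumerate(ranked):
        r["popular"] = i < cutoff
        ratio = r["funny_score"] / r["total_score"] if r["total_score"] else 0.0
        r["funny"] = r["popular"] and r["funny_score"] > 0 and ratio >= 0.2
    return reviews

reviews = [
    {"total_score": 10, "funny_score": 3},  # top scorer, 30% of votes are funny
    {"total_score": 5,  "funny_score": 0},
    {"total_score": 2,  "funny_score": 2},
    {"total_score": 1,  "funny_score": 0},
    {"total_score": 0,  "funny_score": 0},
]
mark_reviews(reviews)
```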
2.2. Review content
The average review was 371 characters / 78 words long, with the longest review being 8,623 characters long. The distribution of review length measured in characters is lognormal with μ = 4.88 and σ = 1.17 (R² = 0.990), which is consistent with the findings of (Sobkowicz et al., 2013). A histogram of review length with the fitted distribution is shown in Figure 6. The long tail of the distribution (reviews over 1,500 characters long) makes up 4.7% of all reviews. However, there is a large number of reviews with lengths above 8,000 characters that do not fit this distribution. A closer inspection showed that these texts are the result of a "copy/paste" of Martin Luther King's 'I Have a Dream' speech, posted 16 times by one unique user (who, besides that, posted only one relevant review). The rest of these very long reviews are not informative, e.g. one word repeated many times, or other long non-review stories. These outliers in the length distribution pointed out (without reliance on contextual analysis) the existence of trolling behavior, even in a community of supposedly dedicated users sharing common interests.
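The lognormal parameters can be estimated directly as the mean and standard deviation of the log of review lengths (a sketch using only the standard library; the paper's exact fitting procedure is not specified):

```python
import math
from statistics import mean, stdev

def fit_lognormal(lengths):
    """Estimate lognormal parameters (mu, sigma) from positive lengths:
    mu and sigma are the mean and sample stdev of log(length)."""
    logs = [math.log(n) for n in lengths if n > 0]
    return mean(logs), stdev(logs)

# For a lognormal, exp(mu) is the median length.
lengths = [50, 120, 130, 200, 800]  # toy character counts
mu, sigma = fit_lognormal(lengths)
```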
Average lengths (in characters and words) for positive and negative reviews are aggregated in Table 2. A t-Test performed on the data converted to log scale showed that the length difference is statistically significant (P-value < 0.0001), with negative reviews being longer.
Figure 2: Distribution of funny ratio
Sentiment   Avg. words   σ       Avg. chars   σ
Positive    73.5         134.2   348.7        633.3
Negative    98.9         162.0   464.9        763.4
Table 2: Average length in words and characters comparison for positive and negative reviews.
2.3. Users
There were 1,692,556 unique users, with 35,369 users writing more than 10 reviews, and an average of 2.15 reviews per user. We also identified a group of 94 users who each had one or two prepared reviews of their own and posted them repeatedly – reviews in this group ranged from short informative ones to "copy/paste" material, like the aforementioned Martin Luther King speech or recipes for pancakes.
There were 6,252 users who wrote more than ten reviews, all of them positive, and only 47 users who wrote more than ten reviews, all of them negative.
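User-level counts like these can be computed in one pass over (user, sentiment) pairs (a sketch; the field layout is hypothetical):

```python
from collections import defaultdict

def user_stats(reviews, min_reviews=10):
    """Count users with more than min_reviews reviews that are
    all positive, and those whose reviews are all negative."""
    per_user = defaultdict(list)
    for user, sentiment in reviews:   # sentiment: +1 or -1
        per_user[user].append(sentiment)
    all_pos = sum(1 for s in per_user.values()
                  if len(s) > min_reviews and all(x == 1 for x in s))
    all_neg = sum(1 for s in per_user.values()
                  if len(s) > min_reviews and all(x == -1 for x in s))
    return all_pos, all_neg
```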
3. Sentiment Analysis
We performed basic sentiment analysis on the collected dataset to establish a baseline for future work and comparisons.
3.1. Experiment description
We used the full dataset with a 30/70 test/train split: 1,120,325 of the 3,640,386 reviews were used as test data, and the rest as training data. Each review was represented as a TF-IDF vector over the space of all available reviews. Using the obtained vectors, we trained two models: one based on a Maximum Entropy classifier (described in (Menard, 2002)) and the other on a Multinomial Naive Bayes classifier (described in (McCallum et al., 1998)).
Model evaluation details are shown in Tables 3 and 4.

Figure 3: Distribution of extracted Total Scores.
Figure 4: Usefulness of review vs. length, binned by length, with a fitted log function for positive and negative reviews.
Figure 5: Funny Score of review vs. length. The Funny Score for negative reviews is shown as negative for better readability.
Figure 6: Review text length histogram with fitted lognormal distribution.
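The experimental setup (TF-IDF features with Maximum Entropy and Multinomial Naive Bayes classifiers) can be sketched with scikit-learn on a toy corpus; note that a Maximum Entropy classifier corresponds to scikit-learn's LogisticRegression, and that the paper's exact preprocessing and hyperparameters are not specified:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the 3.6M Steam reviews (labels: +1 / -1).
texts = ["great fun, highly recommended", "best game I played",
         "wonderful art and great story", "boring and broken",
         "crashes constantly, refund", "terrible port, do not buy"]
labels = [1, 1, 1, -1, -1, -1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)  # TF-IDF vectors over the review vocabulary

maxent = LogisticRegression().fit(X, labels)  # Maximum Entropy baseline
nb = MultinomialNB().fit(X, labels)           # Multinomial Naive Bayes baseline

x_new = vec.transform(["great game, highly recommended"])
print(maxent.predict(x_new))
```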
4. Toolset
The Steam Review Dataset (SRD) was gathered using a custom toolset written in Python and Selenium. We also created basic analytical tools using Python with the Gensim (Řehůřek and Sojka, 2010) and Scikit-learn (Pedregosa et al., 2011) packages.
4.1. Data gatherer
The data gathering package was created using Python with Selenium. The package reads a game id list from a CSV file and, for each id found, scrapes the game's front page and two review pages – one for positive and one for negative reviews. The package handles large numbers of reviews per game (restricted by the RAM of the machine it runs on), age verification pages, cache cleaning and, with additional tools, the gathering of screenshots for each game. For each scraped game it creates two JSON files – one with front page information and one with all review data. The JSON files can then be parsed using the provided scripts and saved into a database (currently SQLite, but only a few changes are needed to use other SQL-based DB engines).

Emotion       precision   recall   f1-score   support
-1            0.80        0.64     0.71       212,704
1             0.92        0.96     0.94       907,621
Avg / Total   0.90        0.90     0.90       1,120,325
Table 3: Results for the Maximum Entropy model.

Emotion       precision   recall   f1-score   support
-1            0.90        0.05     0.09       212,704
1             0.82        1.00     0.90       907,621
Avg / Total   0.83        0.82     0.75       1,120,325
Table 4: Results for the Multinomial Naive Bayes model.
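The parse-and-store step of section 4.1 can be sketched with the standard library (field and table names here are hypothetical; the actual schema ships with the dataset tools):

```python
import json
import sqlite3

def load_reviews(json_path, db_path):
    """Parse one per-game review JSON file and insert its rows into SQLite."""
    with open(json_path, encoding="utf-8") as f:
        reviews = json.load(f)  # assumed: a list of review objects
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS reviews
                   (game_id TEXT, text TEXT, sentiment INTEGER,
                    useful INTEGER, total INTEGER, funny INTEGER)""")
    con.executemany(
        "INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?)",
        [(r["game_id"], r["text"], r["sentiment"],
          r["useful"], r["total"], r["funny"]) for r in reviews])
    con.commit()
    con.close()
```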
4.2. Analytical and auxiliary tools
For performing basic analysis, we created several Python scripts.
The classification script, which was used for the sentiment analysis part of this work, allows for easy text classification using one of several algorithms provided by the scikit-learn package. The tool allows for simple algorithm evaluation (with training and test sets) as well as 10-fold cross-validation.
The word2vec and doc2vec scripts can be used to perform word2vec and doc2vec (Mikolov et al., 2013) analysis on the gathered review and game description data, and are implemented using the Gensim package. The tools are interactive and allow for easy comparison of terms/reviews.
The CSV export tool is used for exporting CSV files from the dataset database. It can export any columns with an additional SQL modifier, and split the resulting file in two (with a 70/30 ratio) for easy use in model training and validation.
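The export-and-split behaviour can be sketched as follows (stdlib only; the actual tool's options and column layout are not documented here):

```python
import csv
import random

def export_split(rows, train_path, test_path, train_frac=0.7, seed=0):
    """Write rows (lists of column values) to two CSV files,
    shuffled and split roughly 70/30 for training and validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * train_frac)
    for path, part in ((train_path, rows[:cut]), (test_path, rows[cut:])):
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(part)
```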
5. Results and discussion
Of the two tested models, the Maximum Entropy model works better (with an f1-score of 0.90). This seems to be due to the unbalanced training set (the dataset is split 0.81/0.19 between positive and negative classes) – Naive Bayes models tend to train poorly on unbalanced sets.
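The effect of the imbalance is visible in Table 4, where the Naive Bayes model nearly collapses to the majority class: simple arithmetic on the support counts shows that a trivial always-positive baseline already reaches about 0.81 accuracy on the test split while recalling no negative reviews at all.

```python
# Support counts from the 30% test split (see Tables 3 and 4).
neg, pos = 212_704, 907_621
total = neg + pos

# An always-positive classifier: high accuracy, zero recall on class -1.
always_positive_accuracy = pos / total
print(round(always_positive_accuracy, 2))  # 0.81
```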
6. Availability and future work
The sentiment part of the described dataset is available online in the form of a CSV file. The full dataset (in the form of an sqlite/mysql database), with all accompanying tools, will be provided at a later date.
In the near future we are going to add more user-related data to the dataset – this should make the dataset more useful for network-related research.
References
Duan, W., Gu, B., and Whinston, A. B. (2008). Do online
reviews matter?—an empirical investigation of panel data.
Decision support systems, 45(4):1007–1016.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng,
A. Y., and Potts, C. (2011). Learning word vectors for
sentiment analysis. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics:
Human Language Technologies, pages 142–150, Portland,
Oregon, USA, June. Association for Computational Lin-
guistics.
McAuley, J., Pandey, R., and Leskovec, J. (2015). Infer-
ring networks of substitutable and complementary prod-
ucts. In Proceedings of the 21th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data
Mining, pages 785–794. ACM.
McCallum, A., Nigam, K., et al. (1998). A comparison of
event models for naive bayes text classification. In AAAI-
98 workshop on learning for text categorization, volume
752, pages 41–48. Citeseer.
Menard, S. (2002). Applied logistic regression analysis,
volume 106. Sage.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words and
phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
(2011). Scikit-learn: Machine learning in Python. Jour-
nal of Machine Learning Research, 12:2825–2830.
Řehůřek, R. and Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.
Sobkowicz, P., Thelwall, M., Buckley, K., Paltoglou, G.,
and Sobkowicz, A. (2013). Lognormal distributions of
user post lengths in internet discussions-a consequence of
the weber-fechner law? EPJ Data Science, 2(1):1–20.
Ye, Q., Law, R., and Gu, B. (2009). The impact of online
user reviews on hotel room sales. International Journal of
Hospitality Management, 28(1):180–182.
Language Resources
Antoni Sobkowicz. (2016). Steam Review Dataset - game related sentiment dataset. Ośrodek Przetwarzania Informacji, 1.0, ISLRN 884-864-189-264-2.