Business Analytics: A Framework
Rafi Ahmad Khan (1, 4), Adnan Nadeem (2), Arshad Ali (3)
(1, 2, 3) Faculty of Computer and Information System, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia
(4) Department of Management Studies, University of Kashmir, India
mca_rafi@yahoo.com
Abstract
The main focus of any organization functioning in today’s competitive marketplace is to gain and sustain competitive advantage. With the huge volumes of data stored in databases, data marts and data warehouses, coupled with advanced data analysis tools, managers are now in a better position to make smart and effective decisions that result in competitive advantage for their organizations. Business Analytics (BA) is an area of advanced data analysis that has emerged as a significant field of study for both researchers and practitioners over the last two decades. BA is the process of transforming huge volumes of data into new knowledge through analysis and using that knowledge for effective decision making and problem solving, which ultimately results in value-creating competitive actions. Keeping in view the importance of Business Analytics, this paper discusses the concept of Business Analytics, its framework and its application.
Keywords: Business Analytics (BA); Data Mining; Dashboard; Scorecard; Data Warehouse; Decision Making
1. Introduction
In today’s competitive marketplace, every organization desires to be on top by gaining a competitive edge over others. The success of an organization depends mainly on the quality and effectiveness of its decisions, which can be improved through proper analysis of data. An analytical approach helps in making better decisions by using fact-based information in a more organized way. Studies by researchers at MIT and Wharton have found that companies adopting data-driven decision making show output and productivity that is 5-6% higher than would be expected given their other investments and information technology usage [1]. Data-driven decision making refers to the systems, practices, applications and technologies that support organizations in analyzing vital business data to gain a better understanding of their business. The net effect is that applying analytics to improve decision making is an objective that organizations in every industry can achieve [2], as it helps them increase the effectiveness of their processes and decisions through continuous monitoring and rapid response.
Although analytics has been used in business since the beginning of the last century, it gained more impetus and attention in the 1960s with the development of decision support systems. Since then, analytics has evolved with the development of enterprise resource planning (ERP) systems, databases, data warehouses, and a large number of other software tools and processes [3].
2. Business Analytics
Business Analytics (BA) refers to software solutions that help users analyze huge volumes of enterprise data to make better and more effective business decisions. It is a collection of skills, technologies, applications and practices for the continuous, iterative investigation of past business performance to gain better understanding and drive business planning [4]. BA is a comprehensive term that includes predictive analytics, business intelligence, human performance management, data visualization, and many other subjects [5]. According to Turban et al. [6], it refers to applications and techniques for gathering, storing, analyzing and providing access to data to help users make better strategic decisions [7].
Rafi Ahmad Khan et al, International Journal of Computer Technology & Applications,Vol 10(2),102-108
IJCTA | Mar-Apr 2019
Available online@www.ijcta.com
102
ISSN:2229-6093
Rouse [8] describes it as the practice of iterative, methodical exploration of an organization’s data with emphasis on statistical analysis. BA includes querying, OLAP, reporting, and alert tools that can answer questions such as why is this happening, what if these trends continue, what will happen next (prediction), and what is the best that can happen (optimization) [3]. It makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling [9], and fact-based management to drive decision making.
The quality of BA has improved significantly in this decade, giving managers a better understanding of the business by using the huge volumes of data stored in operational databases as well as data warehouses [10], [11]. Analytics is now regularly used in multiple areas, including sales, marketing, supply chain optimization, and fraud detection [12], [13]. The BA literature is full of cases of prominent organizations competing better through analytics [14].
Realizing the power of data, organizations now treat it as a valuable business asset for gaining competitive advantage. Tapping this power requires effective BA, which is determined by the quality of the data, the skills and knowledge of business analysts, and the commitment of the organization.
3. Types of Analytics
Raden [15] classifies advanced analytics into three main functions: descriptive, predictive and optimization, which are discussed below:
Descriptive/Reporting Analytics: Descriptive analytics gives information about the past performance or state of a business and its environment by using data stored in databases and data warehouses. It helps companies gain insight from historical data through reporting, scorecards, and clustering [16]. Descriptive analytics provides routine, regular and ad hoc reports that help companies look at facts such as what has happened, where, and how often. It can execute specific queries so that managers can examine the exact problem. Visualization has become an important component of descriptive analytics, as it can develop powerful insights into the actions and operations of a company.
Predictive Analytics: Predictive analytics tools determine the probable future outcome of an event, or the likelihood of a situation occurring, and identify relationships and patterns [17]. According to Raden [15], the objective of predictive models is to understand the causes and relationships in the data so as to make accurate predictions. It is the application of statistical techniques, as well as more recently developed techniques of data mining and sometimes visualization, to detect patterns and anomalies in detailed transactions. Analysts turn these patterns into models that can be applied to new transactions to predict behavior or outcomes [18]. For example, it can identify the customers most likely to shift to a competitor (churn analysis), find the customers that are a credit risk, predict what a buyer is likely to buy and in what quantity, and determine which promotional campaigns customers are likely to respond to.
Various techniques of predictive analytics include:
Classification: Classification techniques such as Logistic Regression (LOG), K-Nearest Neighbors, Artificial Neural Networks (ANN), Decision Trees, Support Vector Machines, Fuzzy Sets, Genetic Algorithms (GAs) and Rough Sets are used for customer identification, which includes target customer analysis and the subsequent classification of potential customer segments, so that organizations can direct their resources and efforts toward attracting the target customer segments [19].
Clustering: Clustering techniques can be used to divide customers into different groups in order to target specific promotional campaigns at them.
Association: Association techniques identify relationships between purchasing behaviors; i.e., if a buyer buys one product, they can predict the other items that the buyer is likely to purchase, thus helping in the promotion of related products.
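As a minimal sketch of the association technique described above, the following Python snippet derives simple one-to-one rules from market-basket data using support and confidence. The transactions, the thresholds `min_support` and `min_confidence`, and the function name `association_rules` are hypothetical and chosen only for illustration; they do not come from the cited works.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer's transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def association_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = association_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "milk -> butter" would suggest promoting butter to buyers of milk. Real deployments usually rely on algorithms such as Apriori to handle larger item sets efficiently.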
Optimization/Prescriptive Analytics: Prescriptive analytics helps to choose the best possible course of action by evaluating a number of possible outcomes. Enterprise-level optimization models join the descriptive and predictive models together with probabilistic and stochastic methods, such as Bayesian models or Monte Carlo simulation, to assist in determining the best course of action based on various ‘what if’ scenario assessments [15].
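The 'what if' assessment mentioned above can be illustrated with a small Monte Carlo simulation. The sketch below compares several candidate order quantities for a product with uncertain demand and picks the one with the highest simulated average profit. The demand distribution, the unit cost and price, and the names `simulate_profit` and `candidates` are assumptions made purely for illustration.

```python
import random
import statistics


def simulate_profit(order_qty, n_trials=5000, seed=42):
    """Average profit of stocking `order_qty` units under random demand."""
    # The same seed is reused for every candidate so that all candidates
    # are compared against the same simulated demand scenarios.
    rng = random.Random(seed)
    unit_cost, unit_price = 6.0, 10.0  # hypothetical economics
    profits = []
    for _ in range(n_trials):
        # Assumed demand model: normal, mean 100, std dev 20, floored at 0.
        demand = max(0.0, rng.gauss(100, 20))
        sold = min(order_qty, demand)
        profits.append(sold * unit_price - order_qty * unit_cost)
    return statistics.mean(profits)


# Evaluate each 'what if' scenario and choose the best course of action.
candidates = [60, 80, 100, 120, 140]
expected = {qty: simulate_profit(qty) for qty in candidates}
best_qty = max(expected, key=expected.get)

for qty in candidates:
    print(f"order {qty:>3}: simulated average profit = {expected[qty]:.1f}")
print("recommended order quantity:", best_qty)
```

In a real prescriptive model, the demand distribution would come from a predictive model fitted to historical data rather than from fixed assumed parameters.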
In addition to the above three types of analytics, some researchers also mention diagnostic analytics. It focuses on past performance to answer questions such as why something is happening or why something happened. It gives companies deep insight into a problem through techniques such as drill-down, data discovery and data mining, which uncover dependencies and patterns in historical data [20]. The outcome of this analysis is mostly an analytic dashboard.
4. Business Analytics Framework
A general framework of BA consists of four layers, viz., the Data layer, Analytics layer, Reporting/Visualization layer and Access layer, as presented in Fig. 1. The first three layers are discussed below:
(i) Data layer: Data is present in many places and in many forms, so it needs to be gathered in one place. Data is collected from internal as well as external sources. Internal sources include the operational databases of an organization, where daily transactions are stored, whereas external sources include suppliers, customers, competitors, government agencies, the internet, etc. The collected data is stored in a data warehouse and then analyzed for decision making. Various tools used in the data layer include:
Data Warehouse: Oracle Corporation defines a data warehouse as a collection of corporate information derived directly from operational systems (internal sources) and some external data sources [21]. A data warehouse is a copy of transaction data specifically structured for query and analysis. It is informational and oriented toward analysis and decision support, not toward operations or transaction processing [22]. Data in a data warehouse is integrated, subject-oriented, non-volatile, and time-variant. Before a data warehouse is populated with data from internal and external sources, the data needs to be transformed. This process is termed Extract, Transform and Load (ETL) and performs the following functions:
Extract: Data is extracted from many sources, both internal and external. It is then consolidated, and non-relevant data is filtered out.
Transform: Extracted data is validated and cleaned up to correct missing, inconsistent, or invalid values. Data is integrated into a standard format, and business rules are applied that map the data to the warehouse schema.
Load: Cleansed data is then loaded into the data warehouse or data mart.
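The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source records, the `sales` table and the function names are hypothetical, an in-memory SQLite database stands in for the data warehouse, and rows with missing or invalid amounts are simply dropped rather than corrected.

```python
import sqlite3

# Hypothetical source records: an internal sales system and an external feed.
internal_rows = [
    {"customer": " Alice ", "amount": "120.50", "region": "north"},
    {"customer": "Bob", "amount": "", "region": "South"},      # missing amount
]
external_rows = [
    {"customer": "Carol", "amount": "75", "region": "EAST"},
    {"customer": "Dan", "amount": "n/a", "region": "west"},    # invalid amount
]


def extract():
    """Extract: consolidate rows from all sources."""
    return internal_rows + external_rows


def transform(rows):
    """Transform: drop unparseable amounts and standardize text formats."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # simplification: skip rows that cannot be repaired
        customer = row["customer"].strip()
        region = row["region"].strip().title()  # e.g. "EAST" -> "East"
        cleaned.append((customer, amount, region))
    return cleaned


def load(rows, conn):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a data warehouse
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT customer, amount, region FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Production ETL pipelines typically add logging of rejected rows and business rules that repair values instead of discarding them.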
Data Mart: Data marts, also known as localized data warehouses, are small data warehouses typically created by individual divisions or departments to support their own decision-making activities [23]. In order to avoid a huge investment and the risk of failure, some organizations invest in data marts for a few functional areas, such as finance or marketing, instead of a complete data warehouse. Other organizations choose both a data warehouse and dedicated data marts, which significantly reduces query complexity and hence query response time.
(ii) Analytics layer: In this layer, data from the data warehouse or data marts is analyzed using descriptive, predictive or prescriptive analytics. Various techniques used in this layer are:
Data Mining: Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of huge quantities of data in order to discover meaningful patterns and rules [24], [25]. Berzal et al. [26] define data mining as the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. Data mining draws on management science, statistical, mathematical and financial models and methods to find vital relationships between variables in historical data, to analyze the data, or to forecast [27]. It is an exploratory data analysis technique employed to discover useful patterns in data that are not obvious to the data user [28], [29].
Data mining searches for relationships and global patterns that exist in large databases but are hidden among the vast amount of data, e.g., a relationship between patient data and medical diagnoses, or between a driver’s age and accident insurance claims. Such relationships represent valuable knowledge about the database and the objects in it [30], [31].
It is vital to choose the appropriate data mining model or algorithm for the task at hand, which may be linear regression, classification, cluster analysis, rule formation, association finding, data summarization, learning classification rules, finding dependency networks, analyzing changes, or detecting anomalies [32], [33].
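To make the idea of discovering a relationship concrete, the sketch below fits a simple linear regression and computes the Pearson correlation between driver age and claim frequency, echoing the insurance example above. The numbers are made up for illustration and are not taken from any cited study; the function name `pearson_and_fit` is likewise hypothetical.

```python
from math import sqrt

# Made-up illustration data: driver age and claims per 100 policies.
ages = [20, 25, 30, 40, 50, 60]
claims = [30, 24, 20, 15, 12, 11]


def pearson_and_fit(xs, ys):
    """Return (correlation, slope, intercept) of a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)              # strength of the linear relationship
    slope = sxy / sxx                      # change in claims per year of age
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


r, slope, intercept = pearson_and_fit(ages, claims)
print(f"correlation = {r:.2f}")
print(f"fitted line: claims = {intercept:.1f} + ({slope:.2f}) * age")
```

A strongly negative correlation in such data would indicate that claim frequency falls with age, which is the kind of pattern a data mining tool would surface for an analyst to examine.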
Multidimensional Data Analysis: Also known as Online Analytical Processing (OLAP), it is part of the wider family of business intelligence software that enables managers, executives and analysts to gain insight into data through rapid, reliable, collaborative access to a wide range of multidimensional views of information. It also allows business analysts to rotate the data, changing the relationships to get more detailed insight into corporate information [34]. Multidimensional analysis enables users to view the same data in different ways using multiple dimensions. Each facet of information, such as region, product, cost, pricing or time period, represents a different dimension.
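Viewing the same data along different dimensions can be sketched with a tiny in-memory cube. The example below is a simplified illustration rather than a real OLAP engine: the fact rows, the dimension names and the helpers `roll_up` and `slice_cube` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
facts = [
    ("North", "Laptop", "Q1", 100),
    ("North", "Phone", "Q1", 80),
    ("South", "Laptop", "Q1", 60),
    ("South", "Phone", "Q2", 90),
    ("North", "Laptop", "Q2", 120),
]
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def roll_up(rows, *dims):
    """Sum sales grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Keep only the rows where dimension `dim` equals `value`."""
    return [row for row in rows if row[DIMENSIONS[dim]] == value]


# The same facts viewed along different dimensions.
by_region = roll_up(facts, "region")
by_product_quarter = roll_up(facts, "product", "quarter")
north_by_product = roll_up(slice_cube(facts, "region", "North"), "product")

print(by_region)
print(by_product_quarter)
print(north_by_product)
```

Changing the arguments passed to `roll_up` corresponds to rotating the data, while `slice_cube` fixes one dimension so that the remaining ones can be examined in more detail.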
(iii) Reporting/Visualization layer: Various tools used in this layer are:
Dashboards: Dashboards are tools for visualizing important business data in the form of graphic indicators, charts and tables [35], [36]. A digital dashboard gives the user a graphical, high-level view of business processes that can be drilled down to find more detail on a particular business process. This level of detail is often buried deep in the enterprise’s data, making it otherwise concealed from a business manager [37]. A dashboard gives users the ability to respond instantly to the information being presented and also provides drill-down and root cause analysis of the situation at hand [38]. It organizes and presents information in the user interface in an easy, interactive and intuitive manner. It can also let users build their own dashboards with features such as tables, reports, pivot tables, graphs, and prompts.
Balanced Scorecards: The Balanced Scorecard (BSC) concept was presented by Kaplan and Norton [39]. A balanced scorecard is a semi-standard structured report, supported by design tools and techniques, that managers can use to keep track of the execution of activities by the staff within their control and to monitor the consequences arising from these actions [40]. It is a performance measurement tool used to identify and improve various internal functions and their resulting external outcomes. BSC enables organizations to put strategy into practice by measuring performance and delivering feedback to the organization.
Reports: These are written documents relating to a situation and can be created by end users by supplying parameter data. Pre-executed report results are cached in order to support interactive and high-performance viewing of these reports.
Ad hoc Reports: Contrary to standard reports, which are predefined and routinely processed, ad hoc reports are generated when the need arises. They enable users to produce their own customized reports without relying on the IT team.
Alerts: An alert is a type of report that is automatically triggered when an event occurs; e.g., an e-mail or SMS message is sent to a customer when a product that he or she had previously tried to purchase becomes available, or an alert instantly notifies the manager if the sales numbers fall below an acceptable level.
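The sales alert described above can be sketched as a simple threshold check. In this illustration the sales figures, the threshold and the function name `check_sales_alerts` are hypothetical, and the `notify` callback stands in for whatever e-mail or SMS service an organization would actually use.

```python
def check_sales_alerts(daily_sales, threshold, notify):
    """Call `notify` for every day whose sales fall below `threshold`."""
    triggered_days = []
    for day, amount in daily_sales.items():
        if amount < threshold:
            notify(f"ALERT: sales on {day} were {amount}, below {threshold}")
            triggered_days.append(day)
    return triggered_days


# Hypothetical data; messages are collected in a list instead of being sent.
sent_messages = []
daily_sales = {"Mon": 520, "Tue": 480, "Wed": 310, "Thu": 505}
triggered = check_sales_alerts(
    daily_sales, threshold=400, notify=sent_messages.append
)

print(triggered)
print(sent_messages)
```

Passing the notification mechanism as a parameter keeps the rule that triggers the alert separate from the channel that delivers it.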
5. Discussion
In the rapidly changing business world, business processes are becoming increasingly complex, making it more demanding for managers to have a broad understanding of the business environment. Technological innovation, globalization, competition, and mergers and acquisitions are forcing organizations to reconsider their business strategies. Consequently, many organizations have turned to BA, which helps them understand and regulate business processes to gain competitive advantage. Keeping in view the importance of BA for business organizations, this paper has presented a general framework of BA and discussed its Data layer, Analytics layer and Reporting/Visualization layer, together with the various tools used in these layers. This paper should be helpful to researchers in this discipline as well as to academicians and practitioners of BA.
6. Conclusion
Business organizations are overflowing with data related to their internal business processes, suppliers, partners, customers, and competitors. It is very difficult for these organizations to leverage this huge volume of data by processing it and converting it into valuable information and knowledge that can increase their sales, enhance profitability and improve the efficiency of their business operations.
BA is the process in which databases collect information from different parts of the organization, and various software utilities and applications then convert that raw data into information and knowledge in the form of reports, tables, visualizations and other analytical outputs to deliver an understanding of the business, its competitors and its customers.
Organizations can benefit from BA, as it supports smooth operations, improved employee productivity and effective decision making, which results in better competitiveness. In fact, the past decade has witnessed an increased focus on BA among organizations, as evidenced by the continuously growing business analytics software market. BA is now adopted by more organizations and has been extended to a broad range of users, from company executives and managers to data analysts and other knowledge workers. In an environment where data volumes are growing fast and where running business operations and making decisions on instinct is no longer an option, BA offers organizations tools both to improve internal operations and to facilitate effective decision making, which results in competitive advantage.
Acknowledgements
This work is part of the research study titled “Use of Learning Analytics on Big Data to Predict the Performance and Attrition Risk of Students in Islamic University of Medina”, supported by the Deanship of Research, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia.
References
[1] E. Brynjolfsson, L. M. Hitt and H. H. Kim, “How Does Data-Driven Decision Making Affect Firm Performance?,” Strength in Numbers, vol. 5, no. 3, pp. 24-30, 2012.
[2] M. Skalak, “The Five Myths of Data Mining,” Data Mining, vol. 15, no. 1, pp. 42-43, 2003.
[3] T. H. Davenport and J. G. Harris, Competing on Analytics: The New Science of Winning, Harvard: Harvard Business Review Press, 2007.
[4] M. J. Beller and A. Barnett, “Next Generation Business Analytics,” 18 06 2009. [Online]. Available: www.docstoc.com/docs/7486045/Next-Generation-Business-Analytics-Presentation. [Accessed 20 06 2009].
[5] B. Stupakevich, “Words at Work: Defining ‘Business Analytics’,” 18 June 2010. [Online]. Available: https://www.smartdatacollective.com/27873/. [Accessed 10 April 2018].
[6] E. Turban, J. E. Aronson, T.-P. Liang and R. Sharda, Decision Support and Business Intelligence Systems, New Jersey: Pearson Prentice Hall, 2007.
[7] S. Zaima, “Data Mining Blunders Exposed,” Data Mining, vol. 6, no. 2, pp. 10-13, 2003.
[8] M. Rouse, “Business Analytics (BA),” April 2017. [Online]. Available: https://searchbusinessanalytics.techtarget.com/definition/business-analytics-BA.
[9] G. Shmueli and O. Koppius, “Predictive vs. Explanatory Modeling in IS Research,” Social Science Research Network, pp. 10-19, 2009.
[10] R. Kohavi and F. Provost, “Applications of Data Mining to Electronic Commerce,” Data Mining and Knowledge Discovery, vol. 5, no. 1-2, pp. 5-10, 2001.
[11] R. Kohavi, C. Brodley, B. Frasca, L. Mason and Z. Zheng, “KDD-Cup 2000 Organizers’ Report: Peeling the Onion,” SIGKDD Explorations, vol. 2, no. 2, 2000.
[12] M. J. A. Berry and G. S. Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
[13] U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI Magazine, vol. 17, no. 3, 1996.
[14] P. Arena, S. Rhody and M. Stavrianos, “The Truth About Facts,” Business Intelligence, vol. 3, no. 1, pp. 33-34, 2006.
[15] N. Raden, “The Foundations of Analytics: Visualization, Interactivity and Utility. The Ten Principles of Enterprise Analytics,” Spotfire, Inc., Somerville, U.S.A., 2010.
[16] R. Saxena and A. Srinivasan, Business Analytics: A Practitioner’s Guide, Springer, 2013.
[17] E. Turban, J. E. Aronson, T.-P. Liang and R. Sharda, Decision Support and Business Intelligence Systems, New Jersey: Pearson Prentice Hall, 2007.
[18] W. W. Eckerson, “Beyond Reporting: Requirements for Large-Scale Analytics,” Renton, Washington, 2008.
[19] R. A. Khan, “Data Mining: Applications in Marketing,” CiiT International Journal of Data Mining and Knowledge Engineering, vol. 6, no. 3, pp. 89-94, 2014.
[20] T. Vlamis, “The Four Realms of Analytics,” 4 June 2015. [Online]. Available: http://www.vlamis.com/blog/2015/6/4/the-four-realms-of-analytics.html.
[21] P. Lane, “Data Warehousing Guide 10g, Release 2 (10.2),” Oracle Corporation, CA, 2005.
[22] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, 1996.
[23] R. A. Khan and S. M. K. Quadri, “Dovetailing of Business Intelligence and Knowledge Management: An Integrative Framework,” Information and Knowledge Management, vol. 2, no. 4, pp. 1-6, 2012.
[24] M. J. A. Berry and G. Linoff, Data Mining Techniques, Wiley Computer Publishing, 1997.
[25] T. Connolly and C. Begg, Database Systems: A Practical Approach to Design, Implementation, and Management, 3rd Edition, London: Addison Wesley, 2002.
[26] F. Berzal, J. C. Cubero, N. Marin and J. M. Serrano, “An Efficient Method for Association Rule Mining in Relational Databases,” Data & Knowledge Engineering, vol. 37, no. 1, pp. 47-64, 2001.
[27] S. Becker, Data Warehousing and Web Engineering, Hershey: Idea Group Publishing, 2002.
[28] D. Mladenic and M. Grobelnik, “Feature Selection on Hierarchy of Web Documents,” Decision Support Systems, pp. 45-87, 2003.
[29] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, San Francisco, CA, U.S.A.: Morgan Kaufmann, 2006.
[30] A. K. Pujari, Data Mining Techniques, Hyderabad: Universities Press, 2002.
[31] M. Holsheimer and A. Siebes, “Data Mining: The Search for Knowledge in Databases,” CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 1994.
[32] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus, “Knowledge Discovery in Databases: An Overview,” AI Magazine, vol. 13, no. 3, 1992.
[33] S. P. Imberman, “Effective Use of the KDD Process and Data Mining for Computer Performance Professionals,” in Proceedings of CMG 2001, Dec 2001.
[34] V. Sauter, “Decision Support Systems,” University of Missouri St. Louis, Missouri, 2002.
[35] A. Kirtland, “Executive Dashboards,” 20 1 2006. [Online]. Available: http://dssresources.com/papers/features/kirtland/kirtland01202006.html.
[36] A. Vasiliu, “Dashboards and Scorecards: Linking Management Reporting to Execution,” 30 4 2006. [Online]. Available: http://dssresources.com/papers/features/vasiliu/vasiliu04302006.html.
[37] Gravic, “The Evolution of Real-Time Business Intelligence and How To Achieve It Using HPE Shadowbase Software,” Gravic, Inc., Malvern, PA, USA, 2017.
[38] D. Chappelle, “Big Data & Analytics Reference Architecture,” Oracle Corporation, CA, USA, 2013.
[39] R. Kaplan and D. Norton, The Balanced Scorecard, Boston: Harvard Business Press, 1996.
[40] C. M. Olszak, “An Overview of Information Tools and Technologies for Competitive Intelligence Building: Theoretical Approach,” Issues in Informing Science and Information Technology, vol. 11, pp. 139-153, 2014.