Book

Data Analytics for Business: AI-ML-PBI-SQL-R

Authors:
... As with many black box models, random forest provides pertinent information for a global understanding of the model alongside its fraud predictions (Figure 3). To evaluate its effectiveness, the F1, recall, and precision metrics help assess its ability to be an accurate predictor and reduce false positives (Garn 2024). Moreover, the confusion matrix (Figure 3) provides more granular information to illustrate its performance. ...
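As a hedged illustration of how these metrics relate, the sketch below computes precision, recall, and the F1 score from a 2x2 confusion matrix; the counts are invented for illustration and are not from the paper.

```python
# Hypothetical confusion matrix for a fraud detector (counts are made up):
# rows = actual class, columns = predicted class.
tp, fp = 40, 10   # flagged as fraud: correctly / incorrectly
fn, tn = 20, 930  # flagged as genuine: incorrectly / correctly

precision = tp / (tp + fp)   # of the cases flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

A high precision keeps false positives down, while a high recall keeps missed frauds down; F1 balances the two.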
Article
Full-text available
This paper demonstrates a design and evaluation approach for delivering real world efficacy of an explainable artificial intelligence (XAI) model. The first of its kind, it leverages three distinct but complementary frameworks to support a user-centric and context-sensitive, post-hoc explanation for fraud detection. Using the principles of scenario-based design, it amalgamates two independent real-world sources to establish a realistic card fraud prediction scenario. The SAGE (Settings, Audience, Goals and Ethics) framework is then used to identify key context-sensitive criteria for model selection and refinement. The application of SAGE reveals gaps in the current XAI model design and provides opportunities for further model development. The paper then employs a functionally-grounded evaluation method to assess its effectiveness. The resulting explanation represents real-world requirements more accurately than established models.
Article
Full-text available
Public transport (PT) is crucial for enhancing the quality of life and enabling sustainable urban development. As part of the UK Transport Investment Strategy, increasing PT usage is critical to achieving efficient and sustainable mobility. This paper introduces Machine Learning Influence Flow Analysis (MIFA), a novel framework for identifying the key influencers of PT usage. Using survey data from bus passengers in Southern England, we evaluate machine learning models. Subsequently, MIFA uncovers that easy payments, e-ticketing, and mobile applications can substantially improve the PT service. MIFA’s implementation demonstrates that strength and importance lead to specific insights into how service characteristics impact user decisions. Practical implications include deploying smart ticketing systems and contactless payments to streamline bus usage. Our results suggest that these strategies can enable bus operators to allocate resources more effectively, leading to increased ridership and enhanced user satisfaction.
Book
Full-text available
Businesses have to cut costs, increase revenue and be profitable. The aim of this book is to introduce Management Science to analyse business challenges and to find solutions analytically. Important topics in modelling, optimisation and probability are covered. These include: linear and integer programming, network flows and transportation; essential statistics, queueing systems and inventory models. The overall objectives are: to enable the reader to increase the efficiency and productivity of businesses; to observe and define challenges in a concise, precise and logical manner; to be familiar with a number of classical and state-of-the art operational research techniques and tools; to devise solutions, algorithms and methods that offer competitive advantage to businesses and organisations; and to provide results to management for decision making and implementation. Numerous examples and problems with solutions are given to demonstrate how these concepts can be applied in a business context.
Article
Full-text available
bnlearn is an R package which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package.
Article
Full-text available
In this age of ever-increasing data set sizes, especially in the natural sciences, visualisation becomes more and more important. Self-organizing maps have many features that make them attractive in this respect: they do not rely on distributional assumptions, can handle huge data sets with ease, and have shown their worth in a large number of applications. In this paper, we highlight the kohonen package for R, which implements self-organizing maps as well as some extensions for supervised pattern recognition and data fusion.
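As a rough sketch of the technique the package implements (my own toy code, not the kohonen package's), the snippet below trains a tiny 1-D self-organizing map with a Gaussian neighbourhood; the grid size, learning rate, radius, and data are all invented.

```python
import math
import random

random.seed(0)

# Toy SOM: a 1-D grid of 5 units, each holding a 2-D weight vector.
grid = [[random.random(), random.random()] for _ in range(5)]

def bmu(x):
    """Index of the best-matching unit (closest weight vector to x)."""
    return min(range(len(grid)),
               key=lambda i: (grid[i][0] - x[0])**2 + (grid[i][1] - x[1])**2)

def train_step(x, lr=0.5, radius=1.0):
    """Move the BMU and its grid neighbours towards the sample x."""
    b = bmu(x)
    for i, w in enumerate(grid):
        h = math.exp(-((i - b) ** 2) / (2 * radius ** 2))  # neighbourhood kernel
        w[0] += lr * h * (x[0] - w[0])
        w[1] += lr * h * (x[1] - w[1])

# Two well-separated clusters; the map ends up dedicating different units to each.
for x in [[0.0, 0.0], [1.0, 1.0]] * 50:
    train_step(x)
```

No distributional assumptions are needed: the grid simply organises itself so that nearby units respond to nearby inputs.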
Book
Full-text available
Professor Jain has written a text on the performance analysis of computer systems which can serve as a reference for both specialists and nonspecialists alike. His text is divided into six parts, each of which consists of a half-dozen or so chapters, and covers diverse topics ranging from Measurement Tools to Experimental Design. Each part of this reference text presumes a minimum of exposure to the field and little or no facility with mathematical technique. The author’s style is light and even entertaining, although it is clear that significant practical experience has informed the overall design of the text and the specific material selected. As the author correctly states in his introduction: “There are many books on computer systems performance. These books discuss only one or two aspects of performance analysis, with a majority of the books being queueing theoretic. Queueing theory is admittedly a helpful tool, but knowledge of simulation, measurement techniques, data analysis and experimental design is invaluable.” His aim was to fill a void by writing a book that integrates these rather diverse aspects of performance analysis, and I believe that he has largely succeeded. Part I of his text, “An Overview of Performance Analysis,” discusses performance techniques and metrics, but only after presenting common mistakes which can be made (consciously and otherwise) in presenting performance data and reaching conclusions. This part of the text sets the style: short chapters with witty headings, each of which emphasizes central ideas and encapsulates significant results in a bold face “box.” Each chapter ends with a list of exercises and each part ends with a list of references for further reading. Part II studies measurement techniques and tools, beginning with a discussion of workload selection and their characterization. 
Software and hardware monitors, capacity planning, and the art of data presentation are surveyed, and this part ends with a discussion of "ratio games." The witty and particularly relevant heading for this last topic is: "If you can't convince them, confuse them." Part III is entitled "Probability Theory and Statistics", with the emphasis, not surprisingly, on the latter. Confidence interval estimation, a brief mention of hypothesis testing, and a discussion of regression models complete this part. Although this part is an attempt to make the text self-contained, one obviously cannot do justice to these concepts in a quick survey, and the author's attempt is no exception. Part IV is a serious attempt to survey experimental design and analysis techniques. Factorial designs with and without replication, fractional factorial designs, and one- and two-factor experiments are considered in succession. These topics are often ignored by the performance analysis community, and they are a welcome addition to a text of this kind. Part V is an introduction to simulation techniques, and Part VI surveys simple and widely used queueing models. The author explains the major pitfalls of a simulation clearly and with emphasis, beginning Part V with sage advice: "The best advice to those about to embark on a very large simulation is often the same as Punch's famous advice to those about to marry: Don't!" Part VI is a brief introduction to queueing theory which only brushes the surface of this important performance analysis tool. Nevertheless, the standard models (from the M/M/1 single-server model through product-form queueing networks) are explained briefly and their essential formulas are recorded. This last part of the text has the form of a "cook book," and the better part of wisdom to this reviewer would have been to forgo the more complex models (queueing networks certainly) and emphasize the more widely used and more easily digested single-server models.
Professor Jain has drawn on his considerable practical experience at the Digital Equipment Corporation in writing this book, and some of the material has been used by him in a graduate seminar on computer systems at the Massachusetts Institute of Technology. Overall, this text draws together much of the material required by the practicing performance analyst in an enjoyable and easy-to-read form.
Article
Full-text available
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
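The core over-sampling idea, creating synthetic minority examples by interpolating towards a nearby minority neighbour, can be sketched as follows. This is a bare-bones illustration with invented points, not the paper's full SMOTE algorithm.

```python
import random

random.seed(42)

def smote(minority, n_new, k=2):
    """Generate n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b)**2 for a, b in zip(x, p)))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # random point on the segment between x and nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Invented minority ("abnormal") examples in 2-D feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new = smote(minority, 6)
```

Because each synthetic point lies on a segment between two real minority examples, the minority region is densified rather than merely duplicated.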
Article
CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, undoubtedly the field has moved on considerably in twenty years, with data science now the leading term being favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven the process model view still largely holds. On the other hand, when data science projects become more exploratory the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.
Book
Data-Driven Science and Engineering, by Steven L. Brunton (Cambridge Core, Computational Science).
Book
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.
Chapter
Knowledge discovery in databases (KDD) is an iterative multi-stage process for extracting useful, non-trivial information from large databases. Each stage of the process presents numerous choices to the user that can significantly change the outcome of the project. This methodology, presented in the form of a roadmap, emphasises the importance of the early stages of the KDD process and shows how careful planning can lead to a successful and well-managed project. The content is the result of expertise acquired through research and a wide range of practical experiences; the work is of value to KDD experts and novices alike. Each stage, from specification to exploitation, is described in detail with suggested approaches, resources and questions that should be considered. The final section describes how the methodology has been successfully used in the design of a commercial KDD toolkit.
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
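The non-linear mapping is typically applied implicitly through a kernel. As a small self-contained check (my own toy numbers, not from the paper), the quadratic kernel K(x, z) = (x . z)^2 on 2-D inputs equals an ordinary dot product under the feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2):

```python
import math

def kernel(x, z):
    """Quadratic polynomial kernel on 2-D inputs: (x . z)^2."""
    return (x[0]*z[0] + x[1]*z[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces the kernel."""
    return (x[0]**2, math.sqrt(2) * x[0] * x[1], x[1]**2)

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = kernel(x, z)                                   # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))     # dot product in feature space
```

This is why a linear decision surface in the feature space can represent a non-linear boundary in the input space without ever computing phi explicitly.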
Chapter
There is no better way to quantize a single vector than to use VQ with a codebook that is optimal for the probability distribution describing the random vector. However, direct use of VQ suffers from a serious complexity barrier that greatly limits its practical use as a complete and self-contained coding technique.
Article
Stochastic Gradient Descent (SGD) has become popular for solving large-scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable or better than those of SGD. This analysis justifies the effectiveness of SDCA for practical applications.
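For context, stochastic (sub)gradient descent on the SVM objective can be sketched in a few lines. The snippet below is a Pegasos-style SGD sketch, not the paper's SDCA method, and the data are invented toy points.

```python
import random

random.seed(1)

# Toy linearly separable data: label +1 if x1 + x2 > 0, else -1.
data = [((1.0, 1.0), 1), ((0.5, 1.5), 1),
        ((-1.0, -0.5), -1), ((-1.5, -1.0), -1)]

w = [0.0, 0.0]
lam = 0.1
for t in range(1, 501):
    x, y = random.choice(data)            # pick one example at random
    eta = 1.0 / (lam * t)                 # decreasing step size
    margin = y * (w[0] * x[0] + w[1] * x[1])
    if margin < 1:
        # subgradient of lam/2 * ||w||^2 + hinge loss at a margin violation
        w = [wi - eta * (lam * wi - y * xi) for wi, xi in zip(w, x)]
    else:
        # only the regularizer contributes when the margin is satisfied
        w = [wi * (1 - eta * lam) for wi in w]
```

SDCA instead maintains one dual variable per example and updates them coordinate-wise, which is what the paper's analysis covers.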
Chapter
Over 50 years of research into how to support managers' decision making, numerous solutions have been proposed under a variety of banners, as discussed in the contributions presented in this book. One of the recent terms to have been proposed is Business Intelligence (BI), which aims at leveraging new technologies for the gathering, presentation, and analysis of up-to-date data about the firm's operations to top management. BI is largely distinguished from previous concepts by its reliance on new platforms and technologies (for instance web technologies) to provide nimbler solutions, more responsive to managerial needs than earlier types of systems. As part of BI, the concept of dashboards of information or digital dashboards has been revisited, notably by software vendors. This chapter explains in detail what dashboards of information are and how to develop them. It considers where business data come from and how to use them to support decision making with a dashboard. Using the concept of cognitive levels, it differentiates between different types of applications of the dashboard concept. Finally, the chapter presents an illustrative case study of a firm seeking to develop a nimble tool for measuring and understanding the key aspects of its activities, and concludes that it is the content of the dashboard and the context in which it is used that are the key elements in the process of developing a dashboard.
Article
This paper outlines the application of case-based reasoning and Bayesian belief networks to critical success factor (CSF) assessment for parsimonious military decision making. An important factor for successful military missions is information superiority (IS). However, IS is not solely about minimising information-related needs to avoid information overload and the reduction of bandwidth; it is also concerned with creating information-related capabilities that are aligned with achieving operational effects and raising operational tempo. Moreover, good military decision making should take into account the uncertainty inherent in operational situations. Herein, we illustrate the development and evaluation of a smart decision support system (SDSS) that dynamically identifies and assesses CSFs in military scenarios and as such de-clutters the decision-making process. The second contribution of this work is an automated configuration of conditional probability tables from hard data generated from simulations of military operational scenarios using a computer generated forces (CGF) synthetic environment.
Article
Knowledge Discovery in Databases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific, and personal information.
Article
Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.
Machine learning is presented in a streamlined way: Burkov, A. (2019). The Hundred-Page Machine Learning Book. Quebec City, Canada: Andriy Burkov.
Statistics are discussed in an entertaining but detailed way: Field, A. P. (2021). Discovering Statistics Using R and RStudio (2nd ed.). London, UK: SAGE.
Related material is introduced from a managerial perspective: Jank, W. (2011). Business Analytics for Managers. New York: Springer.
A comprehensive but gentle introduction to statistical learning: James, G. et al. (2023). An Introduction to Statistical Learning (2nd ed.). New York: Springer.
Chapter 8 contains a brief introduction to linear regression and additional details about the gradient descent algorithm; Chapter 9 gives a short overview of logistic regression: Shah, C. (2020). A Hands-on Introduction to Data Science. Cambridge, UK: Cambridge University Press.
Python code and weighted regression are offered in Chapter 5: Watt, J. et al. (2020). Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge, UK: Cambridge University Press.
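The gradient descent algorithm mentioned above fits in a few lines. As an illustration (my own example, not from either book), the loop below minimises the toy function f(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
# Gradient descent on f(w) = (w - 3)^2: repeatedly step against the gradient.
w = 0.0     # starting point
lr = 0.1    # learning rate (step size)
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of f at the current w
    w -= lr * grad       # move downhill
```

Each step multiplies the error (w - 3) by 0.8, so w converges geometrically to the minimiser w = 3.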
James, G. et al. (2023) is an excellent introduction to statistical learning. The corresponding website includes many resources, such as videos and the book itself: www.statlearning.com.
The ID3 algorithm is covered in: Burkov, A. (2019). The Hundred-Page Machine Learning Book. Quebec City, Canada: Andriy Burkov.
Decision trees are discussed in Section 6.2 using a human resource example: Jank, W. (2011). Business Analytics for Managers. New York: Springer.
Tree-based methods are introduced in Chapter 8: James, G. et al. (2013). An Introduction to Statistical Learning (2nd ed.). New York: Springer.
A detailed technical treatment of trees, boosting trees and random forests is provided:
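The ID3 splitting criterion referenced above picks the attribute with the highest information gain, i.e. the largest entropy reduction. A minimal sketch with an invented human-resource-style example (the attribute name and labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """ID3 criterion: entropy reduction from splitting the rows on attr."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Invented HR-style data: does an employee leave?
rows = [{"overtime": "yes"}, {"overtime": "yes"},
        {"overtime": "no"}, {"overtime": "no"}]
labels = ["leave", "leave", "stay", "stay"]
```

Here splitting on "overtime" separates the classes perfectly, so the gain equals the full entropy of one bit.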
Hastie, T. et al. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Two sections (9.5 and 9.6) provide an idea about tree algorithms: Shah, C. (2020). A Hands-on Introduction to Data Science. Cambridge, UK: Cambridge University Press.
Chapter 14 is dedicated to tree-based learners: Watt, J. et al. (2020). Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge, UK: Cambridge University Press.
On Kaggle, Jeremy's notebook about the Titanic provides code that uses logistic regression: www.kaggle.com/code/jeremyd/titanic-logistic-regression-in-r.
Hastie, T. et al. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
A book dedicated to SOM by its creator: Kohonen, T. (2012). Self-Organizing Maps (3rd ed.). Heidelberg, Germany: Springer.
Includes a simple EM introduction: Shah, C. (2020). A Hands-on Introduction to Data Science. Cambridge, UK: Cambridge University Press.
EM by Fong Chun Chan (Accessed: 30 January 2024) provides a detailed R implementation with some background: tinyheero.github.io/2016/01/03/gmm-em.html.
SOM by Inayatus (Accessed: 30 January 2024) provides an introduction to SOM with R code: rpubs.com/AlgoritmaAcademy/som.
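The EM idea for Gaussian mixtures can be sketched compactly. The snippet below alternates E- and M-steps to re-estimate the two component means of a 1-D mixture, with fixed equal variances and mixing weights for brevity; the data and initial means are invented.

```python
import math

# Toy bimodal sample: one cluster near 0, one near 5.
data = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
mu = [0.5, 4.0]   # initial component means
sigma2 = 1.0      # fixed, shared variance

def normal_pdf(x, m):
    return math.exp(-(x - m) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

for _ in range(20):
    # E-step: responsibility of component 0 for each point
    r0 = []
    for x in data:
        p0, p1 = normal_pdf(x, mu[0]), normal_pdf(x, mu[1])
        r0.append(p0 / (p0 + p1))
    # M-step: re-estimate each mean as a responsibility-weighted average
    mu[0] = sum(r * x for r, x in zip(r0, data)) / sum(r0)
    mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / sum(1 - r for r in r0)
```

With well-separated clusters the responsibilities become nearly hard assignments, and the means converge to the cluster averages.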
Russell, S. and Norvig, P. (2020). Artificial Intelligence: A Modern Approach (Pearson Series in Artificial Intelligence). London: Pearson.
An easy-to-read introduction to Bayesian belief networks: Nielsen, T. D. and Jensen, F. V. (2013). Bayesian Networks and Decision Graphs (Information Science and Statistics). New York: Springer.
Provides a useful chapter about support vector machines, amongst others: James, G. et al. (2023). An Introduction to Statistical Learning, Vol. 112 (2nd ed.). Springer.
One of the better ANN introductions can be found in:
arXiv (2020). Deep learning machine teaches itself chess in 72 hours, plays at international master level. MIT Technology Review.
Baheti, P. Activation Functions in Neural Networks [12 Types & Use Cases] (Accessed 30 January 2024).
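For reference, three of the commonly listed activation functions can be written directly; this is a minimal sketch of the standard definitions, not code from the cited page.

```python
import math

def relu(z):
    """Rectified linear unit: passes positives, clips negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(z)
```

The choice matters in practice: ReLU avoids the vanishing gradients that saturating functions like sigmoid and tanh can cause in deep networks.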
Beaulieu, A. (2020). Learning SQL: Generate, Manipulate, and Retrieve Data (3rd ed.). Farnham: O'Reilly. https://www.amazon.co.uk/Learning-SQL-Generate-Manipulate-Retrieve/dp/1492057614.
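The generate/manipulate/retrieve workflow such SQL texts cover can be tried without any database server using Python's built-in sqlite3 module and an in-memory database; the table and values below are invented.

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk.
con = sqlite3.connect(":memory:")

# Generate: create a table and insert a few rows.
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("North", 80.0), ("South", 50.0)])

# Retrieve: aggregate with GROUP BY and sort the result.
rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
con.close()
```

The same SELECT/GROUP BY pattern carries over unchanged to server databases such as PostgreSQL or MySQL.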
De Boer, M. (2023). Install an On-Premises Data Gateway. [Online].
DeBarros, A. (2022). Practical SQL: A Beginner's Guide to Storytelling with Data (2nd ed.). San Francisco, CA: No Starch Press.
Debuse, J. C. W. et al. (1999). A methodology for knowledge discovery: A KDD roadmap. In: SYS Technical Report (SYS-C99-01). School of Computing Sciences, University of East Anglia.
Garn, W. (2021). Chess by Wolfgang Garn. [Online]. https://www.youtube.com/watch?v=KJ_RMxSvlx8 (Accessed 12 December 2023).
Garn, W. (2023). Chess with "Greedy Edi". [Online].
Garn, W. and Aitken, J. (2015). Splitting Hybrid Make-to-Order and Make-to-Stock Demand Profiles. arXiv: 1504.03594 [stat.ME].
Garn, W. and Louvieris, P. (2015). Conditional Probability Generation Methods for High Reliability Effects-Based Decision Making. arXiv: 1512.08553 [cs.AI].
Hansen, K. D. et al. (2022). Rgraphviz: Provides Plotting Capabilities for R Graph Objects. R package version 2.42.0.
Heaven, W. D. (2021). DeepMind's AI predicts almost exactly when and where it's going to rain. MIT Technology Review.
Kaur, J. (2022). Cognitive Analytics Tools and Its Applications | A Quick Guide. [Online].
Klopfenstein, D. (2023). Power Query M Formula Language. [Online].
Knight, D. et al. (2022). Microsoft Power BI Quick Start Guide. Birmingham, England: Packt Publishing.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE 78 (9), 1464-1480.