Automated Machine Learning for Business [Book Deskcopy]


Provides an accessible introduction to machine learning for business. The examples are built around the DataRobot Automated Machine Learning platform, but the focus is on the principles of machine learning.

In recent years, two communities have grown around a joint interest on how big data can be exploited to benefit education and the science of learning: Educational Data Mining and Learning Analytics. This article discusses the relationship between these two communities, and the key methods and approaches of educational data mining. The article discusses how these methods emerged in the early days of research in this area, which methods have seen particular interest in the EDM and learning analytics communities, and how this has changed as the field matures and has moved to making significant contributions to both educational research and practice.
Data mining and machine learning have become a vital part of crime detection and prevention. In this research, we use WEKA, an open-source data mining software, to conduct a comparative study between the violent crime patterns from the Communities and Crime Unnormalized Dataset provided by the University of California-Irvine repository and actual crime statistical data for the state of Mississippi. We implemented the Linear Regression, Additive Regression, and Decision Stump algorithms, using the same finite set of features, on the Communities and Crime Dataset. Overall, the linear regression algorithm performed the best among the three selected algorithms. The scope of this project is to demonstrate how effective and accurate the machine learning algorithms used in data mining analysis can be at predicting violent crime patterns.
We have entered the big data era. Organizations are capturing, storing, and analyzing data that has high volume, velocity, and variety and comes from a variety of new sources, including social media, machines, log files, video, text, image, RFID, and GPS. These sources have strained the capabilities of traditional relational database management systems and spawned a host of new technologies, approaches, and platforms. The potential value of big data analytics is great and is clearly established by a growing number of studies. The keys to success with big data analytics include a clear business need, strong committed sponsorship, alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and people skilled in the use of analytics. Because of the paradigm shift in the kinds of data being analyzed and how this data is used, big data can be considered to be a new, fourth generation of decision support data management. Though the business value from big data is great, especially for online companies like Google and Facebook, how it is being used is raising significant privacy concerns.
Management of hyperglycemia in hospitalized patients has a significant bearing on outcome, in terms of both morbidity and mortality. However, there are few national assessments of diabetes care during hospitalization which could serve as a baseline for change. This analysis of a large clinical database (74 million unique encounters corresponding to 17 million unique patients) was undertaken to provide such an assessment and to find future directions which might lead to improvements in patient safety. Almost 70,000 inpatient diabetes encounters were identified with sufficient detail for analysis. Multivariable logistic regression was used to fit the relationship between the measurement of HbA1c and early readmission while controlling for covariates such as demographics, severity and type of the disease, and type of admission. Results show that the measurement of HbA1c was performed infrequently (18.4%) in the inpatient setting. The statistical model suggests that the relationship between the probability of readmission and the HbA1c measurement depends on the primary diagnosis. The data suggest further that the greater attention to diabetes reflected in HbA1c determination may improve patient outcomes and lower cost of inpatient care.
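The modeling step described here (multivariable logistic regression of readmission on an HbA1c-measurement indicator plus covariates) can be sketched with a small Newton's-method fit. Everything below is a synthetic illustration: the covariates, sample size, and coefficient values are assumptions for the demo, not values from the study.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Multivariable logistic regression fit by Newton's method (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                      # IRLS weights
        H = X.T @ (X * W[:, None])             # observed information matrix
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

# Synthetic stand-in for the study's data: an HbA1c-measured indicator
# and one covariate, with an assumed negative effect on readmission.
rng = np.random.default_rng(3)
n = 5000
hba1c_measured = rng.integers(0, 2, n).astype(float)
age_c = rng.normal(0.0, 10.0, n)               # age centered at its mean
logit = -1.0 - 0.4 * hba1c_measured + 0.02 * age_c
readmit = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

X = np.column_stack([np.ones(n), hba1c_measured, age_c])
beta = fit_logistic(X, readmit)  # recovers roughly the assumed coefficients
```

With enough observations, the fitted coefficient on the HbA1c indicator lands near the assumed value while the age covariate is controlled for, which is the sense in which the study "controls for covariates."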
Using Latent Semantic Analysis, we quantified the semantic representations of the Facebook status updates of 304 individuals in order to predict self-reported personality. Besides Neuroticism and Extraversion, we focused on the Dark Triad of personality: Psychopathy, Narcissism, and Machiavellianism. The semantic content of Facebook updates predicted Psychopathy and Narcissism; these updates had a more ''odd'' and negatively valenced content. Furthermore, Neuroticism, number of Facebook friends, and frequency of status updates were predictable from the status updates. Given that Facebook allows individuals to have major control over how they present themselves and draw benefits from these interactions, we conclude that the Dark Triad, involving socially malevolent behavior such as self-promotion, emotional coldness, duplicity, and aggressiveness, is manifested in Facebook status updates.
Mobile phone sensors can be used to develop context-aware systems that automatically detect when patients require assistance. Mobile phones can also provide ecological momentary interventions that deliver tailored assistance during problematic situations. However, such approaches have not yet been used to treat major depressive disorder. The purpose of this study was to investigate the technical feasibility, functional reliability, and patient satisfaction with Mobilyze!, a mobile phone- and Internet-based intervention including ecological momentary intervention and context sensing. We developed a mobile phone application and supporting architecture, in which machine learning models (ie, learners) predicted patients' mood, emotions, cognitive/motivational states, activities, environmental context, and social context based on at least 38 concurrent phone sensor values (eg, global positioning system, ambient light, recent calls). The website included feedback graphs illustrating correlations between patients' self-reported states, as well as didactics and tools teaching patients behavioral activation concepts. Brief telephone calls and emails with a clinician were used to promote adherence. We enrolled 8 adults with major depressive disorder in a single-arm pilot study to receive Mobilyze! and complete clinical assessments for 8 weeks. Promising accuracy rates (60% to 91%) were achieved by learners predicting categorical contextual states (eg, location). For states rated on scales (eg, mood), predictive capability was poor. Participants were satisfied with the phone application and improved significantly on self-reported depressive symptoms (beta(week) = -.82, P < .001, per-protocol Cohen d = 3.43) and interview measures of depressive symptoms (beta(week) = -.81, P < .001, per-protocol Cohen d = 3.55). Participants also became less likely to meet criteria for major depressive disorder diagnosis (b(week) = -.65, P = .03, per-protocol remission rate = 85.71%). 
Comorbid anxiety symptoms also decreased (beta(week) = -.71, P < .001, per-protocol Cohen d = 2.58). Mobilyze! is a scalable, feasible intervention with preliminary evidence of efficacy. To our knowledge, it is the first ecological momentary intervention for unipolar depression, as well as one of the first attempts to use context sensing to identify mental health-related states. Several lessons learned regarding technical functionality, data mining, and software development process are discussed. NCT01107041; (Archived by WebCite at
Do investments in your employees actually affect workforce performance? Who are your top performers? How can you empower and motivate other employees to excel? Leading-edge companies such as Google, Best Buy, Procter & Gamble, and Sysco use sophisticated data-collection technology and analysis to answer these questions, leveraging a range of analytics to improve the way they attract and retain talent, connect their employee data to business performance, differentiate themselves from competitors, and more. The authors present the six key ways in which companies track, analyze, and use data about their people-ranging from a simple baseline of metrics to monitor the organization's overall health to custom modeling for predicting future head count depending on various "what if" scenarios. They go on to show that companies competing on talent analytics manage data and technology at an enterprise level, support what analytical leaders do, choose realistic targets for analysis, and hire analysts with strong interpersonal skills as well as broad expertise.
We study intraday market intermediation in an electronic market before and during a period of large and temporary selling pressure. On May 6, 2010, U.S. financial markets experienced a systemic intraday event-the Flash Crash-where a large automated selling program was rapidly executed in the E-mini S&P 500 stock index futures market. Using audit trail transaction-level data for the E-mini on May 6 and the previous three days, we find that the trading pattern of the most active nondesignated intraday intermediaries (classified as High-Frequency Traders) did not change when prices fell during the Flash Crash.
ChaLearn is organizing the Automatic Machine Learning (AutoML) contest for the IJCNN 2015, which challenges participants to solve classification and regression problems without any human intervention. Participants' code is automatically run on the contest servers to train and test learning machines. However, there is no obligation to submit code. Half of the prizes can be won by submitting prediction results only. Datasets of progressive difficulty are introduced throughout six rounds. (Participants can enter the competition in any round.) The rounds alternate phases in which learners are tested on datasets participants have not seen (AutoML), and phases in which participants have limited time to tweak their algorithms on those datasets to improve performance (Tweakathon). This challenge will push the state of the art in fully automatic machine learning on a wide range of real-world problems. The platform will remain available beyond the termination of the challenge.
Reliable data on economic livelihoods remain scarce in the developing world, hampering efforts to study these outcomes and to design policies that improve them. Here we demonstrate an accurate, inexpensive, and scalable method for estimating consumption expenditure and asset wealth from high-resolution satellite imagery. Using survey and satellite data from five African countries - Nigeria, Tanzania, Uganda, Malawi, and Rwanda - we show how a convolutional neural network can be trained to identify image features that can explain up to 75% of the variation in local-level economic outcomes. Our method, which requires only publicly available data, could transform efforts to track and target poverty in developing countries. It also demonstrates how powerful machine learning techniques can be applied in a setting with limited training data, suggesting broad potential application across many scientific domains.
Removal of cloud cover on the satellite remote sensing image can effectively improve the availability of remote sensing images. For thin cloud cover, support vector value contourlet transform is used to achieve multi-scale decomposition of the area of thin cloud cover on remote sensing images. Through enhancing coefficients of high frequency and suppressing coefficients of low frequency, the thin cloud is removed. For thick cloud cover, if the areas of thick cloud cover on multi-source or multi-temporal remote sensing images do not overlap, the multi-output support vector regression learning method is used to remove this kind of thick clouds. If the thick cloud cover areas overlap, by using the multi-output learning of the surrounding areas to predict the surface features of the overlapped thick cloud cover areas, this kind of thick cloud is removed. Experimental results show that the proposed cloud removal method can effectively solve the problems of the cloud overlapping and radiation difference among multi-source images. The cloud removal image is clear and smooth.
This study examines the impact of allowing traders to co-locate their servers near exchange servers on the liquidity of futures contracts traded on the Australian Securities Exchange. It provides evidence of an increase in proxies for high-frequency trading activity following the introduction of co-location. There is strong evidence of a decrease in bid–ask spreads and an increase in market depth after the introduction of co-location. We conclude that the introduction of co-location enhances liquidity. We conjecture that co-location improves the efficiency with which liquidity providers (including market maker high-frequency traders) are able to make markets. © 2013 Wiley Periodicals, Inc. Jrl Fut Mark 34:20–33, 2014
In regression analysis the response variable Y and the predictor variables X1, …, Xp are often replaced by functions θ(Y) and φ1(X1), …, φp(Xp). We discuss a procedure for estimating those functions θ and φ1, …, φp that minimize e² = E{[θ(Y) − Σj φj(Xj)]²}/var[θ(Y)], given only a sample {(yk, xk1, …, xkp), 1 ≤ k ≤ N} and making minimal assumptions concerning the data distribution or the form of the solution functions. For the bivariate case, p = 1, the optimal θ* and φ* satisfy ρ* = ρ(θ*, φ*) = maxθ,φ ρ[θ(Y), φ(X)], where ρ is the product-moment correlation coefficient and ρ* is the maximal correlation between X and Y. Our procedure thus also provides a method for estimating the maximal correlation between two variables.
We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait "Openness," prediction accuracy is close to the test-retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.
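The pipeline described in this abstract (dimensionality reduction of the user-by-Like matrix, then regression on the components) can be mimicked end to end on synthetic data. Everything below - the matrix sizes, the latent-trait generator, and the plain least-squares classifier standing in for logistic regression - is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_likes, k = 400, 600, 10

# Hypothetical stand-in for the Likes data: latent user traits drive
# both a binary attribute and the probability of each Like.
traits = rng.normal(size=(n_users, k))
labels = (traits[:, 0] > 0).astype(float)
loadings = rng.normal(size=(k, n_likes))
like_prob = 1.0 / (1.0 + np.exp(-(traits @ loadings)))
likes = (rng.random((n_users, n_likes)) < like_prob).astype(float)

# Dimensionality reduction of the user-Like matrix via truncated SVD,
# then a linear model on the retained components.
centered = likes - likes.mean(axis=0)
U, s, _ = np.linalg.svd(centered, full_matrices=False)
components = U[:, :k] * s[:k]

train, test = slice(0, 300), slice(300, None)
X = np.column_stack([np.ones(n_users), components])
coef, *_ = np.linalg.lstsq(X[train], labels[train], rcond=None)
accuracy = ((X[test] @ coef > 0.5) == (labels[test] > 0.5)).mean()
```

Because the SVD components span the latent trait space that generated both the Likes and the attribute, a linear model on a handful of components classifies held-out users well above chance, which is the mechanism behind the paper's headline accuracies.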
This paper introduces the alternating conditional expectation (ACE) algorithm of Breiman and Friedman (1985) for estimating the transformations of a response and a set of predictor variables in multiple regression that produce the maximum linear effect between the (transformed) independent variables and the (transformed) response variable. These transformations can give the data analyst insight into the relationships between these variables so that the relationship between them can be best described and non-linear relationships can be uncovered. The power and usefulness of ACE-guided transformation in multivariate analysis are illustrated using a simulated data set as well as a real data set. The results from these examples clearly demonstrate that ACE is able to identify the correct functional forms, to reveal more accurate relationships, and to improve the model fit considerably compared to the conventional linear model.
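The bivariate ACE iteration can be sketched in a few lines: alternately smooth θ(y) against x to get φ(x), then smooth φ(x) against y to get a new θ(y), renormalizing each pass. The equal-count bin smoother and the exponential test link below are illustrative choices, not the supersmoother used in the original algorithm:

```python
import numpy as np

def bin_means(z, x, n_bins=20):
    """Crude conditional-expectation smoother: E[z | x] via equal-count bins."""
    order = np.argsort(x)
    fitted = np.empty(len(z))
    for chunk in np.array_split(order, n_bins):
        fitted[chunk] = z[chunk].mean()
    return fitted

def ace_bivariate(x, y, n_iter=50):
    """Bivariate ACE: alternate phi(x) = E[theta(Y)|X] and
    theta(y) = E[phi(X)|Y], renormalizing theta to unit variance."""
    theta = (y - y.mean()) / y.std()
    phi = np.zeros_like(theta)
    for _ in range(n_iter):
        phi = bin_means(theta, x)
        theta = bin_means(phi, y)
        theta = (theta - theta.mean()) / theta.std()
    return theta, phi

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 3.0, 2000)
y = np.exp(x) + rng.normal(0.0, 0.1, 2000)  # monotone nonlinear link

theta, phi = ace_bivariate(x, y)
raw_r = np.corrcoef(x, y)[0, 1]          # correlation on the raw scale
ace_r = np.corrcoef(theta, phi)[0, 1]    # correlation after transformation
```

On this example the transformed correlation exceeds the raw correlation because the iteration effectively undoes the exponential link, which is how ACE estimates the maximal correlation between the two variables.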
We examine the role of algorithmic traders (AT) in liquidity supply and demand in the 30 DAX stocks on the Deutsche Boerse in January 2008. AT represent 52% of market order volume and 64% of nonmarketable limit order volume. AT more actively monitor market liquidity than human traders. AT consume liquidity when it is cheap, i.e., when the bid-ask quotes are narrow, and supply liquidity when it is expensive. When spreads are narrow AT are less likely to submit new orders, less likely to cancel their orders, and more likely to initiate trades. AT react more quickly to events and even more so when spreads are wide.
A linear regression of y on x can be approximated by a simple difference: the average values of y corresponding to the highest quarter or third of x, minus the average values of y corresponding to the lowest quarter or third of x. A simple theoretical analysis shows this comparison performs reasonably well, with 80%-90% efficiency compared to the linear regression if the predictor is uniformly or normally distributed. Discretizing x into three categories claws back about half the efficiency lost by the commonly-used strategy of dichotomizing the predictor. We illustrate with the example that motivated this research: an analysis of income and voting which we had originally performed for a scholarly journal but then wanted to communicate to a general audience.
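The efficiency claim above can be checked with a small Monte Carlo simulation. This is an illustrative sketch, not the authors' code; here the upper-third/lower-third comparison is rescaled by the gap in x means so that both estimators target the same slope and their variances are directly comparable:

```python
import numpy as np

rng = np.random.default_rng(1)
b_true, n, reps = 2.0, 300, 2000
ols_est, split_est = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    y = 1.0 + b_true * x + rng.normal(size=n)
    # Ordinary least-squares slope.
    c = np.cov(x, y)
    ols_est.append(c[0, 1] / c[0, 0])
    # Upper-third vs. lower-third comparison, rescaled by the x gap.
    lo, hi = np.quantile(x, [1 / 3, 2 / 3])
    dy = y[x > hi].mean() - y[x < lo].mean()
    dx = x[x > hi].mean() - x[x < lo].mean()
    split_est.append(dy / dx)
ols_est, split_est = np.array(ols_est), np.array(split_est)
efficiency = ols_est.var() / split_est.var()  # roughly 0.8 for normal x
```

Both estimators center on the true slope; the variance ratio lands in the 80%-90% band the abstract describes for a normally distributed predictor, confirming that discarding the middle third costs surprisingly little information.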
Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of Occam's razor has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
Detecting buildings from very high resolution (VHR) aerial and satellite images is extremely useful in map making, urban planning, and land use analysis. Although it is possible to manually locate buildings from these VHR images, this operation may not be robust and fast. Therefore, automated systems to detect buildings from VHR aerial and satellite images are needed. Unfortunately, such systems must cope with major problems. First, buildings have diverse characteristics, and their appearance (illumination, viewing angle, etc.) is uncontrolled in these images. Second, buildings in urban areas are generally dense and complex. It is hard to detect separate buildings from them. To overcome these difficulties, we propose a novel building detection method using local feature vectors and a probabilistic framework. We first introduce four different local feature vector extraction methods. Extracted local feature vectors serve as observations of the probability density function (pdf) to be estimated. Using a variable-kernel density estimation method, we estimate the corresponding pdf. In other words, we represent building locations (to be detected) in the image as joint random variables and estimate their pdf. Using the modes of the estimated density, as well as other probabilistic properties, we detect building locations in the image. We also introduce data and decision fusion methods based on our probabilistic framework to detect building locations. We pick certain crops of VHR panchromatic aerial and Ikonos satellite images to test our method. We assume that these crops are detected using our previous urban region detection method. Our test images are acquired by two different sensors, and they have different spatial resolutions. Also, buildings in these images have diverse characteristics. Therefore, we can test our methods on a diverse data set. 
Extensive tests indicate that our method can be used to automatically detect buildings in a robust and fast manner in Ikonos satellite and our aerial images.
In July 2009, Medicare began publicly reporting hospitals' risk-standardized 30-day all-cause readmission rates (RSRRs) among fee-for-service beneficiaries discharged after hospitalization for heart failure from all the US acute care nonfederal hospitals. No recent national trends in RSRRs have been reported, and it is not known whether hospital-specific performance is improving or variation in performance is decreasing. We used 2004-2006 Medicare administrative data to identify all fee-for-service beneficiaries admitted to a US acute care hospital for heart failure and discharged alive. We estimated mean annual RSRRs, a National Quality Forum-endorsed metric for quality, using 2-level hierarchical models that accounted for age, sex, and multiple comorbidities; variation in quality was estimated by the SD of the RSRRs. There were 570 996 distinct hospitalizations for heart failure in which the patient was discharged alive in 4728 hospitals in 2004, 544 550 in 4694 hospitals in 2005, and 501 234 in 4674 hospitals in 2006. Unadjusted 30-day all-cause readmission rates were virtually identical over this period: 23.0% in 2004, 23.3% in 2005, and 22.9% in 2006. The mean and SD of RSRRs were also similar: mean (SD) of 23.7% (1.3) in 2004, 23.9% (1.4) in 2005, and 23.8% (1.4) in 2006, suggesting similar hospital variation throughout the study period. National mean and RSRR distributions among Medicare beneficiaries discharged after hospitalization for heart failure have not changed in recent years, indicating that there was neither improvement in hospital readmission rates nor in hospital variations in rates over this time period.
To assess Sweet Talk, a text-messaging support system designed to enhance self-efficacy, facilitate uptake of intensive insulin therapy and improve glycaemic control in paediatric patients with Type 1 diabetes. One hundred and twenty-six patients fulfilled the eligibility criteria; Type 1 diabetes for > 1 year, on conventional insulin therapy, aged 8-18 years. Ninety-two patients were randomized to conventional insulin therapy (n = 28), conventional therapy and Sweet Talk (n = 33) or intensive insulin therapy and Sweet Talk (n = 31). Goal-setting at clinic visits was reinforced by daily text-messages from the Sweet Talk software system, containing personalized goal-specific prompts and messages tailored to patients' age, sex and insulin regimen. HbA(1c) did not change in patients on conventional therapy without or with Sweet Talk (10.3 +/- 1.7 vs. 10.1 +/- 1.7%), but improved in patients randomized to intensive therapy and Sweet Talk (9.2 +/- 2.2%, 95% CI -1.9, -0.5, P < 0.001). Sweet Talk was associated with improvement in diabetes self-efficacy (conventional therapy 56.0 +/- 13.7, conventional therapy plus Sweet Talk 62.1 +/- 6.6, 95% CI +2.6, +7.5, P = 0.003) and self-reported adherence (conventional therapy 70.4 +/- 20.0, conventional therapy plus Sweet Talk 77.2 +/- 16.1, 95% CI +0.4, +17.4, P = 0.042). When surveyed, 82% of patients felt that Sweet Talk had improved their diabetes self-management and 90% wanted to continue receiving messages. Sweet Talk was associated with improved self-efficacy and adherence; engaging a classically difficult to reach group of young people. While Sweet Talk alone did not improve glycaemic control, it may have had a role in supporting the introduction of intensive insulin therapy. Scheduled, tailored text messaging offers an innovative means of supporting adolescents with diabetes and could be adapted for other health-care settings and chronic diseases.
The most annoying aspect of software development is debugging. We don't mind the kinds of bugs that yield to a few minutes inspection. The bugs we hate are the ones that show up only after hours of successful operation, under unusual circumstances, or whose stack traces lead to dead ends. Fortunately, there's a simple technique that dramatically reduces the number of these bugs in our software. It won't reduce the overall number of bugs, at least not at first, but it'll make most defects much easier to find. The technique is to build our software to "fail fast".
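A minimal fail-fast sketch in Python (the function and its checks are hypothetical, purely for illustration): validate inputs at the boundary so a defect surfaces immediately with a clear message, instead of corrupting state and crashing hours later in unrelated code.

```python
def transfer(balance: float, amount: float) -> float:
    """Withdraw `amount` from `balance`, failing fast on impossible input.

    The checks reject a bad state at the boundary, so the defect is
    reported here rather than propagating as a corrupted balance.
    """
    if amount <= 0:
        raise ValueError(f"transfer amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError(f"amount {amount} exceeds balance {balance}")
    return balance - amount

print(transfer(100.0, 30.0))  # 70.0
```

The point is not fewer bugs but cheaper ones: a `ValueError` raised at the call site is trivial to diagnose, while the same bad value silently subtracted would only show up far from its cause.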
Investing in America's data science and analytics talent: The case for action
  • C Ampil
  • L I Cardenas-Navia
  • K Elzey
  • M Fenlon
  • B K Fitzgerald
  • D Hughes
Ampil, C., Cardenas-Navia, L. I., Elzey, K., Fenlon, M., Fitzgerald, B. K., & Hughes, D. (2017). Investing in America's data science and analytics talent: The case for action. Retrieved from
Can Science Help Runners Break The Marathon's 2-Hour Barrier?
  • C Aschwanden
Aschwanden, C. (2017, April 13). Can Science Help Runners Break The Marathon's 2-Hour Barrier? Retrieved from
Algorithms for hyperparameter optimization. Paper presented at the Advances in Neural Information Processing Systems
  • J S Bergstra
  • R Bardenet
  • Y Bengio
  • B Kégl
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyperparameter optimization. Paper presented at the Advances in Neural Information Processing Systems.
The World's Largest Hedge Fund Is Building an Algorithmic Model From its Employees' Brains
  • R Copeland
  • B Hope
Copeland, R., & Hope, B. (2016, December 22). The World's Largest Hedge Fund Is Building an Algorithmic Model From its Employees' Brains. The Wall Street Journal.
From Chicago to New York and Back in 8.5 Milliseconds. ZeroHedge
  • T Durden
Durden, T. (2012, August 8). From Chicago to New York and Back in 8.5 Milliseconds. ZeroHedge.
Google Photos label black people 'gorillas'
  • J Guynn
Guynn, J. (2015, July 1). Google Photos label black people 'gorillas'. USA Today.
A Brief Review of the ChaLearn AutoML Challenge: Anytime Any-dataset Learning Without Human Intervention
  • M Sebag
Sebag, M. (2016). A Brief Review of the ChaLearn AutoML Challenge: Anytime Any-dataset Learning Without Human Intervention. Paper presented at the Workshop on Automatic Machine Learning.
Automated Machine Learning -- A Paradigm Shift that Accelerates Data Scientist Productivity @ Airbnb
  • H Husain
  • N Handel
Husain, H., & Handel, N. (2017, May 10). Automated Machine Learning -- A Paradigm Shift that Accelerates Data Scientist Productivity @ Airbnb.
Why big-data analysis of police activity is inherently biased
  • W Isaac
  • A Dixon
Isaac, W., & Dixon, A. (2017, May 13). Why big-data analysis of police activity is inherently biased. Salon.
Using Machine Learning to Explore Neural Network Architecture
  • Q Le
  • B Zoph
Le, Q., & Zoph, B. (2017, May 17). Using Machine Learning to Explore Neural Network Architecture. Retrieved from
Kaiser campaign slashes opioid prescriptions. 89.3 KPCC
  • B F Ostrov
Ostrov, B. F. (2017, March 31). Kaiser campaign slashes opioid prescriptions. 89.3 KPCC. Retrieved from
Report reveals Facebook document that could help advertisers target insecure kids
  • D Riley
Riley, D. (2017, April 30). Report reveals Facebook document that could help advertisers target insecure kids. SiliconANGLE.
Microsoft's millennial chatbot pulled offline after Internet teaches her racism
  • J Risley
Risley, J. (2016, March 24). Microsoft's millennial chatbot pulled offline after Internet teaches her racism. GeekWire.
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
  • S Stephens-Davidowitz
Stephens-Davidowitz, S. (2017). Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. HarperCollins.
CRISP-DM: Towards a standard process model for data mining
  • R Wirth
  • J Hipp
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Paper presented at the Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining.
Dataset: LendingClub_2007_2014_Cleaned_Reduced.csv, 11,000 rows (used in this example).
Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
4. What are the two types of data that the model may struggle with?
5. Why is the leadership team not presented with the machine learning code?
Building Trust in Analytics
  • KPMG
KPMG. (2017). Building Trust in Analytics. Retrieved from