Daniel Berrar

The Open University (UK) · School of Mathematics and Statistics

PhD

About

88 Publications
86,110 Reads
1,824 Citations
Citations since 2016: 37 research items, 1,176 citations
Introduction
My research interests are machine learning, data science and the overarching field of artificial intelligence. My theoretical work revolves around statistical learning theory. In my applied research, I use methods from machine learning and statistics for data analysis and knowledge extraction from high-dimensional data sets, with a focus on applications from the life sciences.

Publications

Article
Although the Park Grass Experiment is an important international reference soil for temperate grasslands, it still lacks the direct extraction of its metaproteome. The identification of these proteins can be crucial to our understanding of soil ecology and major biogeochemical processes. However, the extraction of protein from soil is a technically...
Article
Full-text available
The statistical comparison of machine learning classifiers is frequently underpinned by null hypothesis significance testing. Here, we provide a survey and analysis of underrated problems that significance testing entails for classification benchmark studies. The p-value has become deeply entrenched in machine learning, but it is substantially less...
Preprint
Full-text available
The Park Grass Experiment is an international reference soil with an impressive repository of temperate grassland (meta)data; however, it still lacks documentation of its soil metaproteome. The identification of these proteins is crucial to our understanding of soil ecology and their role in major biogeochemical processes. However, protein extract...
Article
Continual learning algorithms can adapt to changes of data distributions, new classes, and even completely new tasks without catastrophically forgetting previously acquired knowledge. Here, we present a novel self-organizing incremental neural network, GSOINN+, for continual supervised learning. GSOINN+ learns a topological mapping of the input dat...
Article
Full-text available
Deep learning is a subfield of machine learning that considers computational models with multiple processing layers [1, 3, 6]. At the core of all deep learning approaches lies ‘representation learning’: the models automatically learn a representation of the input data without the explicit guidance of a domain expert. Low-level features (such as edg...
Article
Full-text available
Schistosomiasis is a neglected tropical disease that currently affects over 250 million individuals worldwide. In the absence of an immunoprophylactic vaccine and the recognition that mono-chemotherapeutic control of schistosomiasis by praziquantel has limitations, new strategies for managing disease burden are urgently needed. A better understandi...
Preprint
Full-text available
Background Schistosomiasis is a neglected tropical disease that currently affects over 250 million individuals worldwide. In the absence of an immunoprophylactic vaccine and the recognition that mono-chemotherapeutic control of schistosomiasis by praziquantel has limitations, new strategies for managing disease burden are urgently needed. A better...
Article
The goal of continuous learning is to acquire and fine-tune knowledge incrementally without erasing already existing knowledge. How to mitigate this erasure, known as catastrophic forgetting, is a grand challenge for machine learning, specifically when systems are trained on evolving data streams. Self-organizing incremental neural networks (SOINN)...
Article
Full-text available
Continual learning systems can adapt to new tasks, changes in data distributions, and new information that becomes incrementally available over time. The key challenge for such systems is how to mitigate catastrophic forgetting, i.e., how to prevent the loss of previously learned knowledge when new tasks need to be solved. In our research, we inves...
Conference Paper
Full-text available
Continual learning systems can adapt to new tasks, changes in data distributions, and new information that becomes incrementally available over time. The key challenge for such systems is how to mitigate catastrophic forgetting, i.e., how to prevent the loss of previously learned knowledge when new tasks need to be solved. In our research, we inves...
Article
Full-text available
Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. He...
Article
Full-text available
How well can machine learning predict the outcome of a soccer game, given the most commonly and freely available match data? To help answer this question and to facilitate machine learning research in soccer, we have developed the Open International Soccer Database. Version v1.0 of the Database contains essential information from 216,743 league soc...
Article
Full-text available
The task of the 2017 Soccer Prediction Challenge was to use machine learning to predict the outcome of future soccer matches based on a data set describing the match outcomes of 216,743 past soccer matches. One of the goals of the Challenge was to gauge where the limits of predictability lie with this type of commonly available data. Another goal w...
Article
Full-text available
Uncontrolled host immunological reactions directed against tissue-trapped eggs precipitate a potentially lethal, pathological cascade responsible for schistosomiasis. Blocking schistosome egg production, therefore, presents a strategy for simultaneously reducing immunopathology as well as limiting disease transmission in endemic or emerging areas....
Chapter
Full-text available
Article
Full-text available
In a recent crowdsourcing project, 29 teams analyzed the same data set to address the following question: “Are football (soccer) referees more likely to give red cards to players with dark skin tone than to players with light skin tone?” The major finding was that the results of the individual teams varied widely, from no effect to highly significa...
Article
Full-text available
Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails. For example, omnibus tests, such as the widely used Friedman test, are not appropriate for the comparison of multiple classifiers...
Research
Full-text available
The Machine Learning journal invites submissions of original contributions to machine learning research for soccer analytics. Soccer is the biggest global sport and is a fast-growing multibillion-dollar industry. The annual revenue of European football clubs alone is estimated at $27bn. Data science and analytics are being more frequently employe...
Conference Paper
Performance measures play a pivotal role in the evaluation and selection of machine learning models for a wide range of applications. Using both synthetic and real-world data sets, we investigated the resilience to noise of various ranking measures. Our experiments revealed that the area under the ROC curve (AUC) and a related measure, the truncate...
Article
In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging t...
Article
Ranking measures play an important role in model evaluation and selection. Using both synthetic and real-world data sets, we investigate how different types and levels of noise affect the area under the ROC curve (AUC), the area under the ROC convex hull, the scored AUC, the Kolmogorov-Smirnov statistic, and the H-measure. In our experiments, the A...
Article
Purpose – The purpose of this paper is to investigate the relevance and the appropriateness of Turing-style tests for computational creativity. Design/methodology/approach – The Turing test is both a milestone and a stumbling block in artificial intelligence (AI). For more than half a century, the “grand goal of passing the test” has taught the au...
Article
Full-text available
Click fraud, the deliberate clicking on advertisements with no real interest in the product or service offered, is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, open...
Article
Many conventional artificial neural network (ANN) models are designed for one application domain only. The work presented in this paper describes ANN models that operate with a higher economy by sharing neurons across domains. The use of two different types of weights, static weights and dynamic weights, is a fundamental feature of the presented mode...
Article
Full-text available
Turing’s landmark paper on computing machinery and intelligence is multifaceted and has an underemphasized ethical dimension. Turing’s notion of “intelligence” and “thinking” was far more encompassing than the common anthropocentric view may suggest. We discuss a number of open and underrated problems that the common interpretation of the Turing te...
Article
Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and H...
Conference Paper
The evaluation of machine learning algorithms is commonly based on statistical significance tests. However, the suitability of such tests is often questionable. We propose null QQ plots as a simple yet powerful graphical alternative to significance testing. Using ten benchmark data sets, we demonstrate that these plots concisely summarize the essen...
Article
Visualization techniques for high-dimensional data sets play a pivotal role in exploratory analysis in a wide range of disciplines. Gene expression data based on microarray technology pose a particularly challenging problem, as the number of features (genes) typically exceeds 20,000, whereas the number of samples is frequently below 200. We...
Article
Computers have evolved from mere number crunchers to systems demonstrating an astonishing degree of sophistication, decision-making ability, and autonomy. Silicon is no longer the only substrate facilitating information processing. Despite this progress, machine intelligence is still far from rivaling human intelligence. Nonetheless, we might be...
Article
Full-text available
The identification of patients who will respond to anti-tumor necrosis factor alpha (anti-TNF-α) therapy will improve the efficacy, safety, and economic impact of these agents. We investigated whether killer cell immunoglobulin-like receptor (KIR) genes are related to response to anti-TNF-α therapy in patients with rheumatoid arthritis (RA). Sixty-...
Article
The receiver operating characteristic (ROC) has emerged as the gold standard for assessing and comparing the performance of classifiers in a wide range of disciplines including the life sciences. ROC curves are frequently summarized in a single scalar, the area under the curve (AUC). This article discusses the caveats and pitfalls of ROC analysis i...
Article
Full-text available
Pseudomonas aeruginosa is considered to grow in a biofilm in cystic fibrosis (CF) chronic lung infections. Bacterial cell motility is one of the main factors that have been connected with P. aeruginosa adherence to both biotic and abiotic surfaces. In this investigation, we employed molecular and microscopic methods to determine the presence or abs...
Article
Full-text available
Since its conception in the mid 1950s, artificial intelligence with its great ambition to understand and emulate intelligence in natural and artificial environments alike is now a truly multidisciplinary field that reaches out and is inspired by a great diversity of other fields. Rapid advances in research and technology in various fields have crea...
Article
The purpose of this study was to survey the attitudes of optometrists and ophthalmologists, located in a number of different countries, towards diagnostic tests and therapies for dry eye disease. A web-based questionnaire was used to survey attitudes using forced-choice questions and Likert scales. Sixty-one respondents (23 ophthalmologists and 38...
Article
Ecological data suggest a long-term diet high in plant material rich in biologically active compounds, such as the lignans, can significantly influence the development of prostate cancer over the lifetime of an individual. The capacity of a pure mammalian lignan, enterolactone (ENL), to influence the proliferation of the LNCaP human prostate cancer...
Book
More than ever before, research and development in genomics and proteomics depends on the analysis and interpretation of large amounts of data generated by high-throughput techniques. With the advance of computational systems biology, this situation will become even more manifest as scientists will generate truly large-scale data sets by simulating...
Conference Paper
Full-text available
One of the central challenges in structural molecular biology today is the protein folding problem, i.e. the acquisition of the 3D structure of a protein from its linear sequence of amino-acids. Different computational approaches to study protein folding and protein unfolding have recently become common tools available to the researcher. However, d...
Conference Paper
Full-text available
This paper presents a novel type of artificial neural network, called neural plasma, which is tailored for classification tasks involving few observations with a large number of variables. Neural plasma learns to adapt its classification confidence by generating artificial training data as a function of its confidence in previous decisions. In cont...
Article
Full-text available
Motivation: Genomic datasets generated by high-throughput technologies are typically characterized by a moderate number of samples and a large number of measurements per sample. As a consequence, classification models are commonly compared based on resampling techniques. This investigation discusses the conceptual difficulties involved in comparat...
Article
Full-text available
Sphingosine 1-phosphate (S1P), a lysophospholipid, is involved in various cellular processes such as migration, proliferation, and survival. To date, the impact of S1P on human glioblastoma is not fully understood. Particularly, the concerted role played by matrix metalloproteinases (MMP) and S1P in aggressive tumor behavior and angiogenesis remain...
Article
Full-text available
Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology...
Chapter
Novel diagnostic tools promise the development of patient-tailored cancer treatment. However, one major step towards individualized therapy is to use a combination of various data sources, e.g. transcriptomic, proteomic, and clinical data. We have integrated clinical data and lung cancer microarray data that were generated on two different oligonuc...
Article
Full-text available
The prediction of protein structure and the precise understanding of protein folding and unfolding processes remains one of the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, t...
Article
We present survival trees as an exploratory tool for revealing new insights into gene expression profiles in combination with clinical patient data. Survival trees partition the patient data studied into groups with similar survival outcomes and identify characteristic genetic profiles within these groups. We demonstrate the application of survival...
Conference Paper
With the increasing awareness of protein folding disorders, the explosion of genomic information, and the need for efficient ways to predict protein structure, protein folding and unfolding has become a central issue in molecular sciences research. Molecular dynamics computer simulations are increasingly employed to understand the folding and unfol...
Article
Arguably, the richest source of knowledge (as opposed to fact and data collections) about biology and biotechnology is captured in natural-language documents such as technical reports, conference proceedings and research articles. The automatic exploitation of this rich knowledge base for decision making, hypothesis management (generation and testi...
Article
Gene expression profiling by microarray technology has been successfully applied to classification and diagnostic prediction of cancers. Various machine learning and data mining methods are currently used for classifying gene expression data. However, these methods have not been developed to address the specific requirements of gene microarray anal...
Chapter
Full-text available
Microarray experiments provide the scientific community with huge amounts of data. Without appropriate methodologies and tools significant information and knowledge hidden in these data may not be discovered. Therefore, there is a need for methods capable of handling and exploring large data sets. The field of data mining and machine learning provi...
Article
Comparative genomic hybridization (CGH) is a molecular cytogenetic analysis method that allows the detection of chromosomal imbalances in entire genomes. The CGH approach is used in cancer research to identify over- and under-representations of chromosomal regions. To search for and analyze tumor-relevant aberration patterns in CGH data, we designe...
Article
Full-text available
The considerable "algorithmic complexity" of biological systems requires a huge amount of detailed information for their complete description. Current high-throughput technology such as microarrays is generating an overwhelming amount of data of biological systems at the molecular and cellular level. To adequately organize, maintain, analyze and in...
Article
Full-text available
Traditionally, classification of complex genetic diseases such as cancer has been performed on the basis of nonmolecular criteria such as tumor tissue type, pathological features, and clinical stage. It has been generally accepted that some patients grouped into a given category will have a certain survival prognosis and response to a particular th...
Article
Full-text available
Background: Classification of human tumors into distinguishable entities is preferentially based on clinical, pathohistological, enzyme-based histochemical, immunohistochemical, and in some cases cytogenetic data. This classification system still provides classes containing tumors that show similarities but differ strongly in important aspects, e.g...
Chapter
Genomics can be broadly defined as the systematic study of genes, their functions, and their interactions. Analogously, proteomics is the study of proteins, protein complexes, their localization, their interactions, and posttranslational modifications. Some years ago, genomics and proteomics studies focused on one gene or one protein at a time. Wit...