Article

The IMP game: Learnability, approximability and adversarial learning beyond $\Sigma^0_1$

Authors:

Abstract

We introduce a problem set-up we call the Iterated Matching Pennies (IMP) game and show that it is a powerful framework for the study of three problems: adversarial learnability, conventional (i.e., non-adversarial) learnability and approximability. Using it, we are able to derive the following theorems. (1) It is possible to learn by example all of $\Sigma^0_1 \cup \Pi^0_1$ as well as some supersets; (2) in adversarial learning (which we describe as a pursuit-evasion game), the pursuer has a winning strategy (in other words, $\Sigma^0_1$ can be learned adversarially, but $\Pi^0_1$ not); (3) some languages in $\Pi^0_1$ cannot be approximated by any language in $\Sigma^0_1$. We show corresponding results also for $\Sigma^0_i$ and $\Pi^0_i$ for arbitrary $i$.
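As a rough illustration of the set-up (a sketch only, not the authors' construction), the following Python snippet plays a few rounds of iterated matching pennies between a pursuer, who wins a round by matching the other player's bit, and an evader, who wins by differing; both players see only the history of earlier moves. The particular strategies (a majority-of-history pursuer against a simple computable evader) are assumptions made for the example.

```python
# Minimal sketch of an Iterated Matching Pennies (IMP) round structure.
# Assumption: the pursuer wins a round by matching the evader's bit,
# the evader wins by differing; both see only the joint history.

def pursuer(history):
    """Guess the evader's next bit by majority vote over its past bits."""
    evader_bits = [e for (_, e) in history]
    if not evader_bits:
        return 0
    return int(sum(evader_bits) * 2 >= len(evader_bits))

def evader(history):
    """A simple computable evader: output the parity of the round index."""
    return len(history) % 2

def play_imp(rounds=20):
    history, pursuer_wins = [], 0
    for _ in range(rounds):
        p, e = pursuer(history), evader(history)
        pursuer_wins += (p == e)
        history.append((p, e))
    return pursuer_wins

if __name__ == "__main__":
    print("pursuer matched", play_imp(), "of 20 rounds")
```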


Chapter
This chapter summarizes the game theoretical strategies for generating adversarial manipulations. The adversarial learning objective for our adversaries is assumed to be to inject small changes into the data distributions, defined over positive and negative class labels, to the extent that deep learning subsequently misclassifies the data distribution. Thus, the theoretical goal of our adversarial deep learning process becomes one of determining whether a manipulation of the input data has reached a learner decision boundary, i.e., where too many positive labels have become negative labels. The adversarial data is generated by solving for optimal attack policies in Stackelberg games where adversaries target the misclassification performance of deep learning. Sequential game theoretical formulations can model the interaction between an intelligent adversary and a deep learning model to generate adversarial manipulations by solving a two-player sequential non-cooperative Stackelberg game where each player’s payoff function increases with interactions to a local optimum. With a stochastic game theoretical formulation, we can then extend the two-player Stackelberg game into a multiplayer Stackelberg game with stochastic payoff functions for the adversaries. Both versions of the game are resolved through the Nash equilibrium, which refers to a pair of strategies in which there is no incentive for either the learner or the adversary to deviate from their optimal strategy. We can then explore adversaries who optimize variational payoff functions via data randomization strategies on deep learning designed for multi-label classification tasks. Similarly, the outcome of these investigations is an algorithm design that solves a variable-sum two-player sequential Stackelberg game with new Nash equilibria. The adversary manipulates variational parameters in the input data to mislead the learning process of the deep learning, so it misclassifies the original class labels as the targeted class labels. The ideal variational adversarial manipulation is the minimum change needed to the adversarial cost function of encoded data that will result in the deep learning incorrectly labeling the decoded data. The optimal manipulations are due to stochastic optima in non-convex best response strategies. The adversarial data generated by this variant of the Stackelberg games simulates continuous interactions with the classifier’s learning processes as opposed to one-time interactions. The learning process of the CNNs can be manipulated by an adversary at the input data level as well as the generated data level. We can then retrain the original deep learning model on the manipulated data to give rise to a secure adversarial deep learning model that is robust to subsequent performance vulnerabilities from game theoretical adversaries. Alternative hypotheses for such adversarial data mining in the game theoretical adversarial deep learning strategies are provided in cybersecurity applications with machine learning that is designed for security requirements. The game theoretical solution concepts lead to a deep neural network that is robust to subsequent data manipulation by a game theoretical adversary. This promising result suggests that learning algorithms based on game theoretical modeling and mathematical optimization are a significantly better approach to building more secure deep learning models.
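As a loose sketch of the kind of leader-follower interaction described above (not the chapter's actual models or networks), the toy below alternates best responses between an adversary who shifts positive-class scores downward past the learner's threshold and a learner who re-fits a one-dimensional threshold classifier; the class distributions, manipulation budget, and loss are illustrative assumptions.

```python
import numpy as np

# Toy sketch of a sequential Stackelberg interaction (illustrative only):
# the adversary (leader) shifts positive-class scores toward the negative
# region, the learner (follower) re-fits a 1-D threshold classifier.
# The class means, shift budget, and loss are invented for this example.

rng = np.random.default_rng(0)
neg = rng.normal(-1.0, 0.5, 200)        # negative-class scores
pos = rng.normal(+1.0, 0.5, 200)        # positive-class scores

def fit_threshold(neg, pos):
    """Follower's best response: threshold minimizing empirical error."""
    candidates = np.sort(np.concatenate([neg, pos]))
    errors = [np.mean(pos < t) + np.mean(neg >= t) for t in candidates]
    return candidates[int(np.argmin(errors))]

shift, budget = 0.0, 1.0
for _ in range(5):                       # alternating best responses
    t = fit_threshold(neg, pos - shift)  # learner reacts to shifted data
    # Leader's best response: push positives just below the threshold,
    # subject to the manipulation budget.
    shift = min(budget, max(0.0, np.mean(pos) - t + 0.1))

print("final threshold", round(float(t), 3), "final shift", round(float(shift), 3))
```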
Article
Full-text available
Original title (in Russian): K. M. Podnieks, "Comparison of various types of limiting synthesis and prediction of functions," Proceedings (Uchenye Zapiski) of the Latvian State University, 1974, vol. 210, pp. 68-81. Prediction: f(m+1) is guessed from given f(0), ..., f(m). Program synthesis: a program computing f is guessed from given f(0), ..., f(m). The hypotheses are required to be correct for all sufficiently large m, or with some positive frequency. These approaches yield a hierarchy of function prediction and program synthesis concepts. The comparison problem for these concepts is solved.
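A drastically simplified sketch of the two notions, assuming the target is drawn from a small fixed hypothesis list rather than from all computable functions: the learner synthesizes the first consistent hypothesis and uses it to predict f(m+1); after finitely many errors both the program guesses and the predictions stabilize.

```python
# Simplified sketch of prediction vs. program synthesis in the limit.
# Assumption: the target is drawn from a small, fixed hypothesis list
# rather than from all computable functions, to keep the sketch finite.

HYPOTHESES = [
    ("constant 3", lambda n: 3),
    ("identity",   lambda n: n),
    ("squares",    lambda n: n * n),
    ("doubling",   lambda n: 2 * n),
]

def synthesize(data):
    """Return the first hypothesis consistent with f(0), ..., f(m)."""
    for name, h in HYPOTHESES:
        if all(h(i) == v for i, v in enumerate(data)):
            return name, h
    return None

def predict_next(data):
    """Guess f(m+1) using the currently synthesized hypothesis."""
    guess = synthesize(data)
    return guess[1](len(data)) if guess else None

target = HYPOTHESES[2][1]                 # unknown to the learner
data = []
for m in range(6):
    data.append(target(m))
    print(m, synthesize(data)[0], predict_next(data))
```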
Article
Full-text available
A natural $\omega + 1$ hierarchy of successively more general criteria of success for inductive inference machines is described, based on the size of sets of anomalies in programs synthesized by such machines. These criteria are compared to others in the literature. Some of our results are interpreted as tradeoff results or as showing the inherent relative computational complexity of certain processes, and others are interpreted from a positivistic, mechanistic philosophical stance as theorems in the philosophy of science. The techniques of recursive function theory are employed, including ordinary and infinitary recursion theorems.
Article
Full-text available
This paper is about Algorithmic Probability (ALP) and Heuristic Programming and how they can be combined to achieve AGI. It is an update of a 2003 report describing a system of this kind (Sol03). We first describe ALP, giving the most common implementation of it, then the features of ALP relevant to its application to AGI. They are: Completeness, Incomputability, Subjectivity and Diversity. We then show how these features enable us to create a very general, very intelligent problem solving machine.
Conference Paper
Full-text available
Many classification tasks, such as spam filtering, intrusion detection, and terrorism detection, are complicated by an adversary who wishes to avoid detection. Previous work on adversarial classification has made the unrealistic assumption that the attacker has perfect knowledge of the classifier [2]. In this paper, we introduce the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks. We present efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features and demonstrate their effectiveness using real data from the domain of spam filtering.
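The following toy, with invented weights and a naive one-feature-at-a-time probe, gestures at the membership-query setting (it is not the ACRE algorithms from the paper): the attacker can only ask the black-box filter for labels and tries to discover which words it must remove to evade detection.

```python
# Rough sketch of probing a linear classifier through membership queries
# only (the classifier's weights are hidden from the attacker).
# The weights, features, and query strategy are illustrative assumptions.

HIDDEN_W = {"viagra": 3, "free": 2, "meeting": -1, "lunch": -1}
THRESHOLD = 2

def classify(features):
    """Black-box spam filter: returns True if the message is blocked."""
    return sum(HIDDEN_W[f] for f in features) >= THRESHOLD

def probe(spam_features):
    """Toggle one feature at a time to see which words trigger the filter."""
    influential = []
    for f in spam_features:
        reduced = [g for g in spam_features if g != f]
        if classify(spam_features) and not classify(reduced):
            influential.append(f)
    return influential

spam = ["viagra", "free", "meeting"]
print("blocked:", classify(spam))
print("words whose removal evades the filter:", probe(spam))
```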
Conference Paper
Full-text available
The present paper focuses on some interesting classes of process-control games, where winning essentially means successfully controlling the process. A master for one of these games is an agent who plays a winning strategy. In this paper we investigate situations in which even a complete model (given by a program) of a particular game does not provide enough information to synthesize - even in the limit - a winning strategy. However, if in addition to getting a program, a machine may also watch masters play winning strategies, then the machine is able to learn in the limit a winning strategy for the given game. Studied are successful learning from arbitrary masters and from pedagogically useful selected masters. It is shown that selected masters are strictly more helpful for learning than are arbitrary masters. Both for learning from arbitrary masters and for learning from selected masters, though, there are cases where one can learn programs for winning strategies from masters but not if one is required to learn a program for the master's strategy itself. Both for learning from arbitrary masters and for learning from selected masters, one can learn strictly more watching m+1 masters than one can learn watching only m. Lastly a simulation result is presented where the presence of a selected master reduces the complexity from infinitely many semantic mind changes to finitely many syntactic ones.
Conference Paper
Full-text available
One insightful view of the notion of intelligence is the ability to perform well in a diverse set of tasks, problems or environments. One of the key issues is therefore the choice of this set, which can be formalised as a 'distribution'. Formalising and properly defining this distribution is an important challenge to understand what intelligence is and to achieve artificial general intelligence (AGI). In this paper, we agree with previous criticisms that a universal distribution using a reference universal Turing machine (UTM) over tasks, environments, etc., is perhaps a much too general distribution, since, e.g., the probability of other agents appearing on the scene or having some social interaction is almost 0 for many reference UTMs. Instead, we propose the notion of Darwin-Wallace distribution for environments, which is inspired by biological evolution, artificial life and evolutionary computation. However, although enlightening about where and how intelligence should excel, this distribution has so many options and is uncomputable in so many ways that we certainly need a more practical alternative. We propose the use of intelligence tests over multi-agent systems, in such a way that agents with a certified level of intelligence at a certain degree are used to construct the tests for the next degree. This constructive methodology can then be used as a more realistic intelligence test and also as a testbed for developing and evaluating AGI systems.
Conference Paper
Full-text available
Compression has been advocated as one of the principles which pervades inductive inference and prediction - and, from there, it has also been recurrent in definitions and tests of intelligence. However, this connection is less explicit in new approaches to intelligence. In this paper, we advocate that the notion of compression can appear again in definitions and tests of intelligence through the concepts of 'mindreading' and 'communication' in the context of multi-agent systems and social environments. Our main position is that two-part Minimum Message Length (MML) compression is not only more natural and effective for agents with limited resources, but it is also much more appropriate for agents in (co-operative) social environments than one-part compression schemes - particularly those using a posterior-weighted mixture of all available models following Solomonoff's theory of prediction. We think that the realisation of these differences is important to avoid a naive view of 'intelligence as compression' in favour of a better understanding of how, why and where (one-part or two-part, lossless or lossy) compression is needed.
Conference Paper
Full-text available
It is now widely accepted that in many situations where classifiers are deployed, adversaries deliberately manipulate data in order to reduce the classifier's accuracy. The most prominent example is email spam, where spammers routinely modify emails to get past classifier-based spam filters. In this paper we model the interaction between the adversary and the data miner as a two-person sequential noncooperative Stackelberg game and analyze the outcomes when there is a natural leader and a follower. We then proceed to model the interaction (both discrete and continuous) as an optimization problem and note that even solving a linear Stackelberg game is NP-hard. Finally we use a real spam email data set and evaluate the performance of a local search algorithm under different strategy spaces.
Article
Full-text available
We consider two players each of whom attempts to predict the behavior of the other, using no more than the history of earlier predictions. Behaviors are limited to a pair of options, conventionally denoted by 0, 1. Such players face the problem of learning to coordinate choices. The present paper formulates their situation recursion-theoretically, and investigates the prospects for success. A pair of players build up a matrix with two rows and infinitely many columns, and are said to "learn" each other if cofinitely many of the columns show the same number in both rows (either 0 or 1). Among other results we prove that there are two collections of players that force all other players to choose their camp. Each collection is composed of players that learn everyone else in the same collection, but no player that learns all members of one collection learns any member of the other.
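A minimal sketch of the two-row matrix picture, with made-up strategies: an accommodating player who copies the other's last bit, paired with a stubborn player who always outputs 1, agree on cofinitely many columns, i.e., they "learn" each other in the sense above.

```python
# Toy sketch of the coordination set-up: each player predicts the other's
# next output from the history of earlier predictions, and the pair
# "learn each other" if the two rows agree from some column onward.
# The specific strategies below are illustrative assumptions.

def player_a(history):
    """Accommodating player: repeat B's most recent bit."""
    return history[-1][1] if history else 0

def player_b(history):
    """Stubborn player: always output 1, regardless of history."""
    return 1

history = []
for _ in range(10):
    a, b = player_a(history), player_b(history)
    history.append((a, b))

print("columns:", history)
print("agree from column 1 on:", all(a == b for a, b in history[1:]))
```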
Book
Systems That Learn presents a mathematical framework for the study of learning in a variety of domains. It provides the basic concepts and techniques of learning theory as well as a comprehensive account of what is currently known about a variety of learning paradigms. Bradford Books imprint.
Book
Formal learning theory is one of several mathematical approaches to the study of intelligent adaptation to the environment. The analysis developed in this book is based on a number-theoretic approach to learning and uses the tools of recursive-function theory to understand how learners come to an accurate view of reality. This revised and expanded edition of a successful text provides a comprehensive, self-contained introduction to the concepts and techniques of the theory. Exercises throughout the text provide experience in the use of computational arguments to prove facts about learning. Bradford Books imprint.
Book
In this book Gary William Flake develops in depth the simple idea that recurrent rules can produce rich and complicated behaviors. Distinguishing "agents" (e.g., molecules, cells, animals, and species) from their interactions (e.g., chemical reactions, immune system responses, sexual reproduction, and evolution), Flake argues that it is the computational properties of interactions that account for much of what we think of as "beautiful" and "interesting." From this basic thesis, Flake explores what he considers to be today's four most interesting computational topics: fractals, chaos, complex systems, and adaptation. Each of the book's parts can be read independently, enabling even the casual reader to understand and work with the basic equations and programs. Yet the parts are bound together by the theme of the computer as a laboratory and a metaphor for understanding the universe. The inspired reader will experiment further with the ideas presented to create fractal landscapes, chaotic systems, artificial life forms, genetic algorithms, and artificial neural networks.
Article
The problem of statistical (or inductive) inference pervades a large number of human activities and a large number of (human and non-human) actions requiring "intelligence." The Minimum Message Length (MML) approach to machine learning (within artificial intelligence) and statistical (or inductive) inference gives a trade-off between simplicity of hypothesis and goodness of fit to the data. There are several different and intuitively appealing ways of thinking of MML. There are many measures of predictive accuracy. The most common form of prediction seems to be a prediction without a probability or anything else to quantify it. MML is also discussed in terms of algorithmic information theory, the shortest input to a (Universal) Turing Machine [(U)TM] or computer program which yields the original data string. This chapter sheds light on information theory, Turing machines and algorithmic information theory, and relates all of these to MML. It then moves on to Ockham's razor and the distinction between inference (or induction, or explanation) and prediction.
Article
We will first describe the discovery of algorithmic probability – its motivation, just how it was discovered, and some of its properties. Section two discusses its completeness – its consummate ability to discover regularities in data and why its incomputability does not hinder its use for practical prediction. Sections three and four are on its subjectivity and diversity – how these features play a critical role in training a system for strong AI. Sections five and six are on the practical aspects of constructing and training such a system. The final section, seven, discusses present progress and open problems.
Article
This piece is an introduction to the proceedings of the Ray Solomonoff 85th memorial conference, paying tribute to the works and life of Ray Solomonoff, and mentioning other papers from the conference.
Conference Paper
In this paper (expanded from an invited talk at AISEC 2010), we discuss an emerging field of study: adversarial machine learning---the study of effective machine learning techniques against an adversarial opponent. In this paper, we: give a taxonomy for classifying attacks against online machine learning algorithms; discuss application-specific factors that limit an adversary's capabilities; introduce two models for modeling an adversary's capabilities; explore the limits of an adversary's knowledge about the algorithm, feature space, training, and input data; explore vulnerabilities in machine learning algorithms; discuss countermeasures against attacks; introduce the evasion challenge; and discuss privacy-preserving learning techniques.
Article
Introduced is a new inductive inference paradigm, dynamic modeling. Within this learning paradigm, for example, function h learns function g iff, in the i-th iteration, h and g both produce output, h gets the sequence of all outputs from g in prior iterations as input, g gets all the outputs from h in prior iterations as input, and, from some iteration on, the sequence of h's outputs will be programs for the output sequence of g. Dynamic modeling provides an idealization of, for example, a social interaction in which h seeks to discover program models of g's behavior it sees in interacting with g, and h openly discloses to g its sequence of candidate program models to see what g says back. Sample results: every g can be so learned by some h; there are g that can only be learned by an h if g can also learn that h back; there are extremely secretive h which cannot be learned back by any g they learn, but which, nonetheless, succeed in learning infinitely many g; quadratic time learnability is strictly more powerful than linear time learnability. This latter result, as well as others, follows immediately from general correspondence theorems obtained from a unified approach to the paradigms within inductive inference. Many proofs, some sophisticated, employ machine self-reference, a.k.a., recursion theorems.
Article
Learning is regarded as the phenomenon of knowledge acquisition in the absence of explicit programming. A precise methodology is given for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learned using it in a reasonable (polynomial) number of steps. Although inherent algorithmic complexity appears to set serious limits to the range of concepts that can be learned, it is shown that there are some important nontrivial classes of propositional concepts that can be learned in a realistic sense.
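As a concrete instance of the kind of class meant here, the sketch below learns a monotone conjunction over Boolean variables from random labeled examples by intersecting the positive examples; the target, distribution, and sample size are assumptions chosen for the example, not taken from the paper.

```python
import random

# Small sketch of learning in Valiant's sense for one classic class:
# monotone conjunctions over Boolean variables, learned from random
# labeled examples. Target, distribution, and sample size are
# illustrative assumptions.

N = 6
TARGET = {0, 3}                       # hidden conjunction: x0 AND x3

def label(x):
    return all(x[i] for i in TARGET)

def learn_conjunction(samples):
    """Keep every variable that is true in all positive examples."""
    kept = set(range(N))
    for x, y in samples:
        if y:
            kept &= {i for i in range(N) if x[i]}
    return kept

random.seed(1)
samples = []
for _ in range(200):                  # polynomially many examples suffice
    x = [random.random() < 0.5 for _ in range(N)]
    samples.append((x, label(x)))

print("learned conjunction over variables:", sorted(learn_conjunction(samples)))
```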
Book
"This is the classic work upon which modern-day game theory is based. What began more than sixty years ago as a modest proposal that a mathematician and an economist write a short paper together blossomed, in 1944, when Princeton University Press published Theory of Games and Economic Behavior. In it, John von Neumann and Oskar Morgenstern conceived a groundbreaking mathematical theory of economic and social organization, based on a theory of games of strategy. Not only would this revolutionize economics, but the entirely new field of scientific inquiry it yielded--game theory--has since been widely used to analyze a host of real-world phenomena from arms races to optimal policy choices of presidential candidates, from vaccination policy to major league baseball salary negotiations. And it is today established throughout both the social sciences and a wide range of other sciences. This sixtieth anniversary edition includes not only the original text but also an introduction by Harold Kuhn, an afterword by Ariel Rubinstein, and reviews and articles on the book that appeared at the time of its original publication in the New York Times, tthe American Economic Review, and a variety of other publications. Together, these writings provide readers a matchless opportunity to more fully appreciate a work whose influence will yet resound for generations to come.
Article
“Evidence” in the form of data collected and analysis thereof is fundamental to medicine, health and science. In this paper, we discuss the “evidence-based” aspect of evidence-based medicine in terms of statistical inference, acknowledging that this latter field of statistical inference often also goes by various near-synonymous names—such as inductive inference (amongst philosophers), econometrics (amongst economists), machine learning (amongst computer scientists) and, in more recent times, data mining (in some circles). Three central issues to this discussion of “evidence-based” are (i) whether or not the statistical analysis can and/or should be objective and/or whether or not (subjective) prior knowledge can and/or should be incorporated, (ii) whether or not the analysis should be invariant to the framing of the problem (e.g. does it matter whether we analyse the ratio of proportions of morbidity to non-morbidity rather than simply the proportion of morbidity?), and (iii) whether or not, as we get more and more data, our analysis should be able to converge arbitrarily closely to the process which is generating our observed data. For many problems of data analysis, it would appear that desiderata (ii) and (iii) above require us to invoke at least some form of subjective (Bayesian) prior knowledge. This sits uncomfortably with the understandable but perhaps impossible desire of many medical publications that at least all the statistical hypothesis testing has to be classical non-Bayesian—i.e. it is not permitted to use any (subjective) prior knowledge.
Chapter
We first define Algorithmic Probability, an extremely powerful method of inductive inference. We discuss its completeness, incomputability, diversity and subjectivity and show that its incomputability in no way inhibits its use for practical prediction. Applications to Bernoulli sequence prediction and grammar discovery are described. We conclude with a note on its employment in a very strong AI system for very general problem solving.
Article
Concept drift means that the concept about which data is obtained may shift from time to time, each time after some minimum permanence. Except for this minimum permanence, the concept shifts may not have to satisfy any further requirements and may occur infinitely often. Within this work is studied to what extent it is still possible to predict or learn values for a data sequence produced by drifting concepts. Various ways to measure the quality of such predictions, including martingale betting strategies and density and frequency of correctness, are introduced and compared with one another. For each of these measures of prediction quality, for some interesting concrete classes, (nearly) optimal bounds on permanence for attaining learnability are established. The concrete classes, from which the drifting concepts are selected, include regular languages accepted by finite automata of bounded size, polynomials of bounded degree, and sequences defined by recurrence relations of bounded size. Some important, restricted cases of drifts are also studied, for example, the case where the intervals of permanence are computable. In the case where the concepts shift only among finitely many possibilities from certain infinite, arguably practical classes, the learning algorithms can be considerably improved.
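A toy version of the "frequency of correctness" measure under drift, with an invented drift schedule: the concept switches between two constants after a minimum permanence, and a learner that simply predicts the previous value is correct at all but the switch points.

```python
# Toy sketch of prediction under concept drift: the data-generating concept
# switches between two constants, with a minimum permanence between switches.
# The "predict the previous value" learner and the drift schedule are
# illustrative assumptions, not the paper's constructions.

PERMANENCE = 10                      # minimum steps before the concept may shift

def drifting_sequence(length):
    value, out = 0, []
    for t in range(length):
        if t % PERMANENCE == 0 and t > 0:
            value = 1 - value        # concept shift
        out.append(value)
    return out

data = drifting_sequence(100)
correct = sum(data[t] == data[t - 1] for t in range(1, len(data)))
print("frequency of correctness:", correct / (len(data) - 1))
```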
Article
Language learnability has been investigated. This refers to the following situation: A class of possible languages is specified, together with a method of presenting information to the learner about an unknown language, which is to be chosen from the class. The question is now asked, "Is the information sufficient to determine which of the possible languages is the unknown language?" Many definitions of learnability are possible, but only the following is considered here: Time is quantized and has a finite starting time. At each time the learner receives a unit of information and is to make a guess as to the identity of the unknown language on the basis of the information received so far. This process continues forever. The class of languages will be considered learnable with respect to the specified method of information presentation if there is an algorithm that the learner can use to make his guesses, the algorithm having the following property: Given any language of the class, there is some finite time after which the guesses will all be the same and they will be correct. In this preliminary investigation, a language is taken to be a set of strings on some finite alphabet. The alphabet is the same for all languages of the class. Several variations of each of the following two basic methods of information presentation are investigated: A text for a language generates the strings of the language in any order such that every string of the language occurs at least once. An informant for a language tells whether a string is in the language, and chooses the strings in some order such that every string occurs at least once. It was found that the class of context-sensitive languages is learnable from an informant, but that not even the class of regular languages is learnable from a text.
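A tiny sketch of identification in the limit from a text, for an artificially small class of finite languages (an assumption for the example; Gold's results concern far richer classes): the learner conservatively guesses the smallest language in the class consistent with the strings seen so far, and its guesses converge to the target.

```python
# Sketch of Gold-style identification in the limit from a text, for a
# deliberately tiny class of finite languages (an illustrative assumption).

CLASS = {
    "L1": {"a"},
    "L2": {"a", "b"},
    "L3": {"a", "b", "c"},
}

def guess(seen):
    """Conservative guess: smallest language in the class covering the text so far."""
    consistent = [name for name, lang in CLASS.items() if seen <= lang]
    return min(consistent, key=lambda name: len(CLASS[name]))

target = "L2"
text = ["a", "b", "a", "b", "b"]       # every string of L2 occurs at least once

seen = set()
for i, s in enumerate(text):
    seen.add(s)
    print("time", i, "guess", guess(seen))
```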
Conference Paper
A mere bounded number of random bits judiciously employed by a probabilistically correct algorithmic coordinator is shown to increase the power of learning to coordinate compared to deterministic algorithmic coordinators. Furthermore, these probabilistic algorithmic coordinators are provably not characterized in power by teams of deterministic ones. An insightful, enumeration-technique-based normal form characterization of the classes that are learnable by total computable coordinators is given. These normal forms are for insight only since it is shown that the complexity of the normal form of a total computable coordinator can be infeasible compared to the original coordinator. Montagna and Osherson showed that the competence class of a total coordinator cannot be strictly improved by another total coordinator. It is shown in the present paper that the competencies of any two total coordinators are the same modulo isomorphism. Furthermore, a completely effective, index set version of this competency isomorphism result is given, where all the coordinators are total computable. We also investigate the competence classes of total coordinators from the points of view of topology and descriptive set theory.
Article
There has been a great deal of theoretical and experimental work in computer science on inductive inference systems, that is, systems that try to infer general rules from examples. However, a complete and applicable theory of such systems is still a distant goal. This survey highlights and explains the main ideas that have been developed in the study of inductive inference, with special emphasis on the relations between the general theory and the specific algorithms and implementations. 154 references.
Article
Inductive inference machines (IIMs) synthesize programs, given examples of their intended input-output behavior. Several different criteria for successful synthesis by IIMs are defined. A given criterion is said to be more general than some other criterion if the class of sets which can be inferred by some IIM with respect to the given criterion is larger than the class of sets which can be inferred by some IIM with respect to the other criterion. A team of IIMs synthesizes programs for a set of functions if and only if for each function in the set, at least one of the IIMs in the team successfully synthesizes the proper program. The trade-offs between the number of IIMs involved in the learning process and the generality of the criteria of success are examined.
Article
A theory of approximate language identification analogous to the existing theory of exact language identification is introduced. In the approximate language identification problem a grammar is sought from a solution space of grammars whose language approximates an unidentified language with a specified degree of accuracy. A model for this problem is given in which a class of metrics on languages is defined, and a series of grammar inference procedures for approximate language identification is presented. A comparison of corresponding results for exact and approximate language identification yields two distinct ways in which the results for approximate language identification are stronger than those for exact language identification.
Article
Central concerns of the book are the related theories of recursively enumerable sets and of degrees of unsolvability, Turing degrees in particular. A second group of topics has to do with generalizations of recursion theory. A third group of topics is subrecursive computability and subrecursive hierarchies.
Article
In 1964 the author proposed as an explication of a priori probability the probability measure induced on output strings by a universal Turing machine with unidirectional output tape and a randomly coded unidirectional input tape. Levin has shown that if $\tilde{P}'_M(x)$ is an unnormalized form of this measure, and $P(x)$ is any computable probability measure on strings $x$, then $\tilde{P}'_M(x) \geq C P(x)$, where $C$ is a constant independent of $x$. The corresponding result for the normalized form of this measure, $P'_M$, is directly derivable from Willis' probability measures on nonuniversal machines. If the conditional probabilities of $P'_M$ are used to approximate those of $P$, then the expected value of the total squared error in these conditional probabilities is bounded by $-(1/2) \ln C$. With this error criterion, and when used as the basis of a universal gambling scheme, $P'_M$ is superior to Cover's measure $b^{\ast}$. When $H^{\ast} \equiv -\log_2 P'_M$ is used to define the entropy of a finite sequence, the equation $H^{\ast}(x,y) = H^{\ast}(x) + H^{\ast}_x(y)$ holds exactly, in contrast to Chaitin's entropy definition, which has a nonvanishing error term in this equation.
  • D L Dowe
D.L. Dowe. Foreword re C. S. Wallace. Computer Journal, 51(5):523-560, September 2008. Christopher Stewart WALLACE (1933-2004) memorial special issue.
Algorithmic probability, heuristic programming and AGI
  • R J Solomonoff
R.J. Solomonoff. Algorithmic probability, heuristic programming and AGI. In Proceedings of the Third Conference on Artificial General Intelligence, AGI 2010, pages 251-257, Lugano, Switzerland, March 2010. IDSIA.