RESEARCH ARTICLE
Autonomous Optimization of Targeted
Stimulation of Neuronal Networks
Sreedhar S. Kumar¹,², Jan Wülfing³, Samora Okujeni¹,², Joschka Boedecker³,⁴, Martin Riedmiller³,¤, Ulrich Egert¹,²,⁴,*
1 Laboratory of Biomicrotechnology, IMTEK - Department of Microsystems Engineering, University of Freiburg, Freiburg, Germany, 2 Bernstein Center Freiburg, University of Freiburg, Freiburg, Germany, 3 Machine Learning Lab, Department of Computer Science, University of Freiburg, Freiburg, Germany, 4 BrainLinks-BrainTools Cluster of Excellence, University of Freiburg, Freiburg, Germany
¤ Current address: Google DeepMind, London, United Kingdom
* egert@imtek.uni-freiburg.de
Abstract
Driven by clinical needs and progress in neurotechnology, targeted interaction with neuro-
nal networks is of increasing importance. Yet, the dynamics of interaction between intrinsic
ongoing activity in neuronal networks and their response to stimulation is unknown. None-
theless, electrical stimulation of the brain is increasingly explored as a therapeutic strategy
and as a means to artificially inject information into neural circuits. Strategies using regular
or event-triggered fixed stimuli discount the influence of ongoing neuronal activity on the
stimulation outcome and are therefore not optimal to induce specific responses reliably.
Yet, without suitable mechanistic models, it is hardly possible to optimize such interactions,
in particular when desired response features are network-dependent and are initially
unknown. In this proof-of-principle study, we present an experimental paradigm using rein-
forcement-learning (RL) to optimize stimulus settings autonomously and evaluate the
learned control strategy using phenomenological models. We asked how to (1) capture the
interaction of ongoing network activity, electrical stimulation and evoked responses in a
quantifiable 'state' to formulate a well-posed control problem, (2) find the optimal state for
stimulation, and (3) evaluate the quality of the solution found. Electrical stimulation of
generic neuronal networks grown from rat cortical tissue in vitro evoked bursts of action
potentials (responses). We show that the dynamic interplay of their magnitudes and the
probability to be intercepted by spontaneous events defines a trade-off scenario with a net-
work-specific unique optimal latency maximizing stimulus efficacy. An RL controller was set
to find this optimum autonomously. Across networks, stimulation efficacy increased in 90%
of the sessions after learning and learned latencies strongly agreed with those predicted
from open-loop experiments. Our results show that autonomous techniques can exploit
quantitative relationships underlying activity-response interaction in biological neuronal net-
works to choose optimal actions. Simple phenomenological models can be useful to vali-
date the quality of the resulting controllers.
Citation: Kumar SS, Wülfing J, Okujeni S,
Boedecker J, Riedmiller M, Egert U (2016)
Autonomous Optimization of Targeted Stimulation of
Neuronal Networks. PLoS Comput Biol 12(8):
e1005054. doi:10.1371/journal.pcbi.1005054
Editor: Saad Jbabdi, Oxford University, UNITED
KINGDOM
Received: April 15, 2016
Accepted: July 9, 2016
Published: August 10, 2016
Copyright: © 2016 Kumar et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any
medium, provided the original author and source are
credited.
Data Availability Statement: Data are deposited in
the University of Freiburg institutional data repository
at https://www.freidok.uni-freiburg.de/data/11107
(DOI: 10.6094/UNIFR/11107).
Funding: This work was supported by the
Bundesministerium für Bildung und Forschung,
Germany (https://www.bmbf.de, FKZ 01GQ0830 -
Bernstein Focus Neurotechnology Freiburg-
Tübingen; UE) and by the German Research
Foundation through the Brain-Links-BrainTools
Cluster of Excellence (DFG, EXC 1086, www.dfg.de;
UE, MR). Article processing charges were funded by
the DFG and the University of Freiburg (Open Access Publishing program). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Author Summary
Electrical stimulation of the brain is increasingly used to alleviate the symptoms of a range
of neurological disorders and as a means to artificially inject information into neural cir-
cuits in neuroprosthetic applications. Machine learning has been proposed to find optimal
stimulation settings autonomously. However, this approach is impeded by the complexity
of the interaction between the stimulus and the activity of the network, which makes it dif-
ficult to test how good the result actually is. We used phenomenological models of the
interaction between stimulus and spontaneous activity in a neuronal network to design a
testable machine learning challenge and evaluate the quality of the solution found by the
algorithm. In this task, the learning algorithm had to find a solution that balances compet-
ing interdependencies of ongoing neuronal activity with opposing effects on the efficacy of
stimulation. We show that machine learning can successfully solve this task and that the
solutions found are close to the optimal settings to maximize the efficacy of stimulation.
Since the paradigm involves several typical problems found in other settings, such con-
cepts could help to formalize machine learning problems in more complex biological net-
works and to test the quality of their performance.
Introduction
Electrical stimulation of the brain is considered an effective strategy to manage the symptoms
of an increasing range of neurological disorders like essential tremor [1,2], dystonia [3] and
Parkinson's disease (PD) [4–7], and as a possible means to artificially inject information into
neural circuits, e.g. towards neurotechnical prostheses with sensory feedback [8]. The response
elicited, however, typically results from interaction of the stimulus with uncontrolled ongoing
neuronal activity. The changes of neuronal activity induced by the stimulus are thus not only
added to the continuing dynamics of neuronal activity but may be modulated by it. Under
these circumstances, using constant stimulation would elicit very different responses in each
trial [9,10] and is therefore unsuitable to induce defined response features. To achieve stable
responses, or to modulate them predictably, stimulation settings would need to adapt to the
dynamics of the brain's activity.
Without suitable models to capture the interaction between stimulus and ongoing activity
and to characterize the underlying network mode, it is not possible to adjust stimulation such
that a desired response is consistently achieved. Biologically mechanistic and analytically trac-
table models are, however, challenging to develop for a variety of reasons: Interactions may be
non-linear, and may depend on network modes, on the dynamics of individual neurons and
other factors [11–13]. Experiments in in vivo model systems show that the measured response
of a network is strongly modulated by its state at the time of stimulation [14,15]. Network
states are, however, problematic to define explicitly because of the non-stationary nature of
activity dynamics, uncertainty about the relevant spatial and temporal scales of influence and
partial observability of the system. They are typically identified in retrospect. To illustrate this,
consider UP and DOWN states observed in the neocortex [16–18] as network modes. The UP
(resp. DOWN) state can be quite clearly identified based on intracellular recordings of the
membrane potential and spike activity. Based on extracellular recordings of spikes alone, how-
ever, it would not be known during an inter-spike interval if the network had already transi-
tioned to the DOWN (resp. UP) state at the time of stimulation. The momentary state of the
network would be invisible, thus making it difficult to adjust stimulus settings online. Repeti-
tive stimulation may even lead to interaction between responses. Furthermore, as studies in
vitro suggest, this influence may affect responses across several scales of organization, from
individual neurons [19] to networks [20,21].
Despite the challenges, promising results from recent studies even with simple feedback
strategies make a compelling case for closed-loop neurostimulation devices. Experiments on a
primate model of PD indicated that even simple adaptive methods were superior to conven-
tional open-loop Deep Brain Stimulation (DBS) [22]. Furthermore, the first report of event-
driven DBS on human patients with PD [23] demonstrated an improvement in symptoms
compared to standard DBS, with a simultaneous reduction in stimulation time. Such event-
driven paradigms monitor a pre-determined indicator function in the spontaneous activity. In
the simplest version, these indicators trigger fixed stimuli [22,23] but do not modify the
stimulus parameters as such. Where the quantitative input-output relations are known, prede-
fined controllers could be successful. Keren and Marom [21], for example, achieved stable
response probabilities with a PI controller to adjust the stimulus based on responses elicited by
previous stimuli since, under the conditions selected, the input-output dependence was nearly
linear.
Because of a lack of quantitative models that could be used to predict the ideal stimulus set-
tings for the systems studied, the notion of optimality does not exist in these stimulation para-
digms. Further, the nature of the problem in these examples was such that it was possible to
define a singular target value a priori, i.e. to stop oscillatory activity [22] or to achieve a quanti-
tatively predefined response feature, here, a fixed probability for a response [21]. When the
quantitative value of the target response cannot be clearly defined, is intrinsically variable, or
where multiple interacting objectives have to be balanced, e.g. a cost function exists, these
approaches cannot be applied. To address such problems, we propose a reinforcement learning
(RL) based closed-loop paradigm to autonomously choose optimal control policies without the
need for a model of the system and its interaction with electrical stimulation.
The objective of this paper is to demonstrate in a proof-of-principle study how an RL based
controller may be used to autonomously adjust stimulus settings to maximize stimulus efficacy
in interaction with a generic network of neurons. The approach poses the following questions:
(1) How to represent the interaction of network activity, stimulus and response as a quantifi-
able 'state' so that a well-posed control problem may be formulated for variable conditions
without predefining a singular target value, (2) how to find the optimal state for stimulation
autonomously, (3) how to evaluate the quality of the solution found.
To develop the concept and techniques, we employed generic neuronal networks in vitro on
substrate integrated microelectrode arrays as a model system. Previous studies offer a partial
understanding of the dynamics in such networks and of the rules governing their interaction
with electrical stimuli [20], which allowed us to test for the quality of the solutions found by an
RL-based controller.
Neuronal networks in cell culture exhibit activity characterized by intermittent network-
wide spontaneous bursts (SB), separated by periods of reduced activity. Electrical stimulation
of the network also evokes bursts of action potentials (responses). Response strengths depend
on the stimulus latency relative to the previous SB, and can be described by a saturating expo-
nential model [20]. For this system we thus posed the following optimization problem: find the
stimulus latency that would maximize the stimulation efficacy measured as the number of
spikes evoked per SB in the face of ongoing network activity. The achievable efficacy is thus not
known a priori and further properties of the network may be relevant to its value, i.e. depen-
dencies may be incompletely captured in the model. According to [20], choosing longer latencies to stimulate ensures that longer responses are evoked, but such trials are more prone to interruption by the next SB, which adversely affects stimulation efficacy by losing out on opportunities
to stimulate. Choosing shorter latencies, on the other hand, ensures that stimuli are delivered
more often without interruption by SBs, but at the cost of evoking weaker responses. To maxi-
mize stimulation efficacy in this context, we thus need to balance the trade-off between these
opposing strategies. In this study, we asked if an RL based controller can autonomously find
the ideal balance in this trade-off and identify the optimal stimulus latency.
The control problem thus formulated has several interesting features that make it pertinent
to the problem at hand. To find the optimal time, a controller would have to reconcile the
dynamic interplay of multiple biological processes, namely: a) the initiation of synchronous
SBs in the network, b) recovery of the network excitability after SB termination, and c) the
overall level of excitability of the network.
The controller also has to account for the modulation of system dynamics over a broad range of time-scales. Furthermore, though these networks are similar w.r.t. statistical proper-
ties, every network has distinct dynamic features and unique connectivity. The controller thus
needs to be able to operate robustly over a range of activity modes. Out of a high dimensional
spatial and temporal feature space available in the recording, a relevant low dimensional quan-
titative state feature has to be abstracted and a strategy to converge toward optimal perfor-
mance worked out. For these reasons, we argue that this toy problem captures many of the
relevant challenges faced by closed-loop paradigms in biological frameworks and by RL based
controllers in a complex adaptive environment. Finally, drawing on simple quantitative notions
from previous studies, we computed network-specific optimal stimulation latencies [20] from
open-loop data to independently validate the optimality of the learned controller. We observed
that the learned stimulation latencies and achieved stimulation efficacies correlate strongly
with the offline optimal values estimated for these networks.
Our results demonstrate the capacity of autonomous techniques to exploit underlying quan-
titative relationships in neurotechnical interaction with neuronal networks to choose optimal
actions and illustrate how phenomenological models can be used to help formulate the RL
problem and validate the performance of the resulting controllers.
Materials and Methods
Model system
The dynamics of neuronal activity in vivo depends on a multitude of factors, including but not limited to cross-structural influences and the specific anatomy and connectivity of the region
of interest. Biological complexity in this scale makes it difficult to glean a consistent under-
standing of signal relationships between the network and an external stimulus, a crucial step in
developing feedback control techniques [24]. In order to develop our concepts and algorithms,
we used a model that, while being generic and independent of specific functions and/or modalities, preserves the biophysical complexity of the neuronal ensemble and relevant challenges an
autonomous controller would face in a more complex context.
Neuronal networks grown on substrate integrated microelectrode arrays are a suitable
model in that they are easily accessible generic neuronal networks that can be maintained in a
controlled environment, exhibit spontaneous activity known to influence the network's inter-
action with external stimuli, and are known to operate in distinct network modes across a
wide range of time scales. Furthermore, controlling such biological neuronal networks poses
many interesting challenges for research in RL such as potentially high dimensional state
spaces, continuous action spaces and non-stationary dynamics.
Frontal cortical tissue was dissected from newborn Wistar rats (obtained from the breeding
facilities of the University of Freiburg) after decapitation, enzymatically dissociated, and cul-
tured on polyethyleneimine (Sigma-Aldrich, St. Louis, MO)-coated microelectrode arrays
(MEAs; Multi Channel Systems, Reutlingen, Germany). The culture medium (1 ml) consisted
of minimal essential medium supplemented with 5% heat-inactivated horse serum, 0.5 mM
L-glutamine, and 20 mM glucose (all compounds from Gibco Invitrogen, Life Technologies,
Grand Island, NY). Cultures were stored in a humidified atmosphere at 37°C and 5% CO
2
95% air. Medium was partially replaced twice per week. Neuronal density after the first day in
vitro (DIV) ranged between 1500 and 4000 neurons/mm
2
. The final density after 21 DIV set-
tled at 15002000 neurons/mm
2
, independent of the initial density. At the time of recording,
network size thus amounted to 5 6×10
5
neurons. Animal treatment was according to the
Freiburg University (Freiburg, Germany) and German guidelines on the use of animals in
research. The protocol was approved by the Regierungspräsidium Freiburg and the BioMed
Zentrum, University Clinic Freiburg (permit nos. X-12/08D and X-15/01H).
Electrophysiology
Neuronal activity was recorded inside a dry incubator with MEAs with 59 titanium nitride
(TiN) electrodes of 30 μm diameter and 500 μm pitch (rectangular 6x10 grid). One larger elec-
trode served as reference. The primary signal was amplified (gain 1100, 1–3500 Hz) and sam-
pled at 25 kHz/12 bit (MEA 1060-BC; Multi Channel Systems). Online spike detection was
done with MEABench (version 1.1.4) [25] at six to eightfold root mean square noise level for
spike threshold.
Such networks of dissociated neurons in vitro exhibit spontaneous activity characterized by
intermittent network-wide synchronous bursts separated by periods of reduced activity. Inter-
burst intervals (IBI) in these networks fit an approximate lognormal distribution. Stimulating
the network also evokes bursts of action potentials (response). The length of these responses at
a chosen recording electrode can be modulated by the latency of the stimuli relative to the SB
at that channel. Their relationship was shown by [20] to fit a saturating exponential model.
Trade-off problem
The optimization problem was defined as the following: what is the optimal stimulus latency
relative to the end of the previous SB at a selected recording electrode (RE) that would maxi-
mize the number of spikes evoked at that site per SB? To illustrate the problem, consider the
following opposing strategies: A) choosing a long latency: Based on the saturating recovery
model, longer latencies would elicit longer responses (Fig 1). However, such a strategy
would prove futile in the long run; long latencies are prone to interruptions by succeeding SBs
and opportunities to stimulate will be forfeited. This would lower the count of evoked spikes
per SB. B) choosing short latencies: this strategy would ensure that stimuli are delivered more
often, but at the cost of evoking shorter responses. Optimization involves finding the trade-off
between these opposing strategies. We asked whether an RL based controller could autonomously find the optimal time for stimulation to balance this trade-off for individual biological networks, based only on the activity at the RE.
Experimental procedure
Experiments were performed on 20 networks between day in vitro (DIV) 19 and 35 (network
denotes a culture at a specific point in time). Each experiment began with one hour of record-
ing spontaneous activity, from which bursts were detected offline. A statistical model of SB
occurrence was estimated by fitting a lognormal function to the IBI distribution to extract the
location and scale parameters (μ and σ, respectively).
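A minimal sketch of this estimation step, assuming IBIs have already been extracted from the detected SB boundaries (function and variable names are illustrative, not from the study's software):

```python
import numpy as np

def fit_ibi_lognormal(burst_starts, burst_ends):
    """Estimate the location (mu) and scale (sigma) of a lognormal IBI model.

    burst_starts / burst_ends: sorted arrays of SB start and end times (s);
    an IBI is the gap between one SB's end and the next SB's start.
    """
    ibis = np.asarray(burst_starts)[1:] - np.asarray(burst_ends)[:-1]
    ibis = ibis[ibis > 0]            # guard against overlapping detections
    log_ibi = np.log(ibis)
    mu = log_ibi.mean()              # location parameter (mean of ln IBI)
    sigma = log_ibi.std(ddof=1)      # scale parameter (SD of ln IBI)
    return mu, sigma
```

Equivalently, scipy.stats.lognorm.fit(ibis, floc=0) yields the same σ (as the shape parameter) and exp(μ) (as the scale), up to the degrees-of-freedom convention.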
Selection of stimulation and evaluation sites. Spontaneous data were further analyzed
to identify MEA electrodes that would serve as sites of stimulation and of evaluation of the
responses. As candidate stimulation electrodes (SE) we selected sites that were more likely to
participate early in SBs [20]. This procedure identified the so-called 'major burst leaders' [26,
27]. For open-loop stimulation, monophasic, negative voltage pulses of 400 μs width and 0.7 V
amplitude were delivered at candidate SEs at 0.1 Hz. Final SE and RE pairs were typically
selected based on peri-stimulus time histograms (PSTH) from positions with responses con-
sisting of both early (<25 ms) and late (>50 ms) components.
Response strength. Following the choice of SE and RE we identified the dependence of
response strengths on the periods of inactivity preceding stimuli for a given network. The num-
ber of spikes at the recording channel in a 500 ms window following a stimulus was typically
defined as the response strength (RS). This data was used to estimate the parameters A,Band λ
of the recovery model by least square fitting.
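A minimal sketch of this fitting step using standard non-linear least squares (the initial guesses are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def recovery_model(t, A, lam, B):
    """Saturating exponential recovery, R(t) = A(1 - exp(-lambda*t)) + B (Eq 3)."""
    return A * (1.0 - np.exp(-lam * t)) + B

def fit_recovery(latencies, response_strengths):
    """latencies: periods of inactivity preceding each stimulus (s);
    response_strengths: spike counts in the 500 ms post-stimulus window.
    Returns the fitted (A, lambda, B)."""
    p0 = (np.max(response_strengths), 1.0, 0.0)   # rough starting point
    popt, _ = curve_fit(recovery_model, latencies, response_strengths, p0=p0)
    return tuple(popt)
```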
Closed-loop stimulation. Closed-loop episodic learning sessions were performed using
RE and SE positions identified as above. The controller was designed to learn in episodes that
commenced at the termination of each SB (Fig 2A and 2B). The closed-loop architecture was
realized by interfacing the data acquisition software (MEABench) with the closed-loop control
software, CLS² (Closed-loop Simulation System, Fig 2C). Learning sessions proceeded in alter-
nating training and testing rounds. During training, the controller was free to explore the state-
action space and learn a control law using the RL algorithm described in the following section,
while during testing it always behaved optimally based on the knowledge hitherto acquired.
Subsequent to the closed-loop session, spontaneous activity was recorded for one hour to check for non-stationarity in the IBI distribution. The stimulus properties were as during open-loop stimulation.
Fig 1. Stimulating the network at an electrode evokes a burst of activity. Response strengths were dependent on the period of
inactivity preceding the stimulus. (A) Raster shows responses at one chosen recording channel in a network to 50 stimuli at the same
electrode. Stimuli were delivered periodically, and thus at random latencies relative to the previous SB. Stimulation cycled through five pre-
selected electrodes at 10 s intervals. Stimulus properties: -0.7 V, 0.4 ms, monophasic against common ground. Trials were aligned to the
time of stimulation (red line) and sorted by the count of spikes within the designated response window (see magenta overlay). A response
window of 2 s was chosen for this network. The diagram exposes the relationship of response strengths to the period of prior inactivity. The first 200 ms post-stimulus is zoomed in panel (B). Responses typically consisted of an early (<20 ms post-stimulus) and late (>50 ms post-
stimulus) component. (C) The relationship between response strengths and periods of prior inactivity can be captured in a saturating
exponential model similar to the dependency of response length [20].
doi:10.1371/journal.pcbi.1005054.g001
Reinforcement learning
Learning a controller for a given task with RL requires formalizing it as a Markov Decision Process (MDP). An MDP is defined as a five-tuple (S, A, T, R, P), where S is a set of states, A a set of actions and T a finite set of discrete time points (finite horizon). The reward function $R: S \times A \times S \to \mathbb{R}$ defines the reward an RL controller receives when it applies action $a_t \in A$ in state $s_t \in S$ and transitions into $s_{t+1} \in S$ at time $t \in T$. The probabilistic transition model $P(s_{t+1} \mid s_t, a_t)$ defines the conditional probability of transitioning from state $s_t$ to state $s_{t+1}$ under the action $a_t$. The goal of RL is now to find a control law (policy) $\pi: S \to A$ which maximizes the expected accumulated reward

$$V^{\pi}(s) = E\left\{ \sum_{t=0}^{T} \gamma^{t} R(s_t, \pi(s_t), s_{t+1}) \;\middle|\; s_0 = s \right\},$$

where γ is a discounting factor on future rewards. Value Iteration is commonly used to find $V^{*}$ if the transition model P is available. Although in this proof-of-concept study a model is available and used to verify the solution found by the RL controller a posteriori, for biological systems in general we cannot assume that a model is known. We therefore consider a model-free setting and use Q-learning [28] to learn an action-value function Q(s, a) ($Q: S \times A \to \mathbb{R}$) which represents the value of choosing action a in state s. The greedy policy π can then be derived as $\pi(s) = \arg\max_a Q(s, a)$. To apply Q-learning we first have to define the state and action space as well as a suitable reward function.
State space definition. Our definition of S is motivated by the following considerations.
Solving the trade-off problem involves reconciling the dynamic interplay of the initiation of
synchronous SBs in the network and the recovery of network excitability after SB termination.
A simple statistical model of the initiation of synchronous SBs is a lognormal function of the
period of inactivity between SBs. The cumulative of this distribution indicates the probability
of SB initiation as a function of time after the preceding SB (Eq 4). At the same time, recovery can equally be modeled by an exponential function of the time after the end of an SB (Eq 3).
Stimulation at a certain latency thus effectively probes the level of recovery at that time. This
latency was defined as the quantitative state variable accessible to the learned controller, pro-
viding information on the dynamics of both processes. Therefore the time after SB termination
is a simple and intuitive choice of a low dimensional state feature. We discretized this latency
in 0.5 s steps, corresponding to states 1, ..., N. These make up the set of states S together with terminal states that reflect the outcome of the stimulation, T_i (i indicating the response strength), or an 'interruption' state F.
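For illustration, the mapping from elapsed post-SB latency to a discrete state index can be as simple as the following sketch (the 0.5 s step matches the discretization above; the state count is an assumption):

```python
def latency_to_state(latency_s, step=0.5, n_states=20):
    """Map the time elapsed since SB termination to a discrete state index.
    Terminal outcomes (T_i after a stimulus, F after an interruption) are
    handled separately by the environment."""
    return min(int(latency_s / step), n_states - 1)
```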
Fig 2. Stimulation trials and the closed-loop architecture. (A) A trial started with the end of a spontaneous burst (SB). The trial was terminated
either by the next SB (dotted box) or a stimulation. In our paradigm, reward was defined as the number of spikes in the response. Interruptions by
SBs led to neutral rewards (punishment). (B) The time within each trial was discretized into 0.5 s steps, corresponding to states 1, ..., N. At each state, the controller could choose between two actions: to 'wait' or to 'stimulate'. A 'stimulate' action led to one of the terminal states T_i, with i indicating the strength of the response. Terminal state F was reached if the trial was interrupted by ongoing activity. (C) Schematic visualization of
the closed-loop architecture.
doi:10.1371/journal.pcbi.1005054.g002
Reward function. In order to learn the optimal stimulus latency, the controller needs to
be appropriately rewarded/punished (Eq 1). In Fig 2B, within an episode, at each state the con-
troller could choose between two actions: to 'wait' or to 'stimulate', which make up the action
set A. An episode terminated either when a preset maximum number of states (i.e. maximal
latency) was reached, an SB occurred or when a 'stimulate' action was chosen. After each epi-
sode, the controller received a terminal reward proportional to the strength of the evoked
response. Alternatively, if an SB had occurred or the maximum number of cycles was reached
it received a neutral reward (punishment):
$$R(s_t, a_t, s_{t+1}) = \begin{cases} i, & \text{if } s_{t+1} = T_i,\; i \in \{1, \ldots, n\} \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
Q-Learning. As a learning algorithm, we used online Q-learning with a tabular represen-
tation of the Q-function. To guarantee full exploration of the state and action space the control-
ler follows a random policy $\pi_{\text{explore}}$ during training that uniformly chooses the state of stimulation. The Q-function is iteratively updated during training sessions as:
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ R(s_t, a_t, s_{t+1}) + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right],$$
where we set the learning rate to α = 0.5 and use no discounting (γ = 1) since we consider a finite-horizon problem. During testing sessions the controller follows a greedy policy (Eq 2)
without exploration:
$$\pi(s) = \arg\max_{a} Q(s, a) \qquad (2)$$
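The scheme above can be condensed into a short sketch. The following is a minimal, assumed implementation of the tabular update for the 'wait'/'stimulate' task (the run_trial environment interface, the state count and the data layout are hypothetical, not the paper's CLS² software):

```python
import numpy as np

N_STATES = 20                 # 0.5 s latency steps; the count is illustrative
WAIT, STIM = 0, 1             # the two available actions
ALPHA, GAMMA = 0.5, 1.0       # learning rate and (absent) discounting
Q = np.zeros((N_STATES, 2))   # tabular Q-function

def train_episode(run_trial, rng):
    """One exploratory training trial.

    run_trial(stim_state) is a hypothetical environment call: it waits through
    the 0.5 s states after SB termination, stimulates when stim_state is
    reached, and returns (visited_states, terminal_reward), where the reward
    is the response strength i for terminal state T_i and 0 for an
    interruption F or a timed-out trial.
    """
    target = int(rng.integers(N_STATES))   # uniform choice of stimulation state
    states, reward = run_trial(stim_state=target)
    for s in states:
        a = STIM if s == target else WAIT
        if a == STIM or s == states[-1]:   # transition into T_i or F
            Q[s, a] += ALPHA * (reward - Q[s, a])
        else:                              # 'wait' moved the trial to state s + 1
            Q[s, a] += ALPHA * (GAMMA * Q[s + 1].max() - Q[s, a])

def greedy_latency():
    """Testing policy (Eq 2): stimulate at the first state where Q prefers it."""
    for s in range(N_STATES):
        if Q[s, STIM] >= Q[s, WAIT]:
            return s
    return N_STATES - 1
```

A full session would alternate blocks of such exploratory trials with testing rounds that always stimulate at greedy_latency(), e.g. with rng = np.random.default_rng().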
Data analysis
Offline burst detection was performed for spontaneous data using the following algorithm: For
spikes recorded from each electrode: a) the interspike interval (ISI) had to be ≤100 ms, b) an interval ≤200 ms was allowed at the end of a burst and defined the minimal IBI, and c) the minimum number of spikes in a burst was set to three. Furthermore, at least three recording sites had to have burst onsets within 100 ms, and only one larger onset interval ≤200 ms was
allowed [20].
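A sketch of the per-electrode stage of this detector is given below; the ≤200 ms tolerance at the burst end and the multi-electrode onset-alignment stage are omitted for brevity, and the defaults mirror the thresholds above:

```python
import numpy as np

def detect_bursts_single_channel(spike_times, isi_max=0.1, min_spikes=3):
    """Group consecutive spikes whose ISIs stay <= isi_max (100 ms) into bursts
    of at least min_spikes spikes. Returns a list of (start, end) times."""
    bursts, start, prev, n = [], None, None, 0
    for t in np.sort(spike_times):
        if start is None or t - prev > isi_max:
            if start is not None and n >= min_spikes:
                bursts.append((start, prev))    # close the previous burst
            start, n = t, 0
        prev = t
        n += 1
    if start is not None and n >= min_spikes:
        bursts.append((start, prev))            # close the final burst
    return bursts
```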
For online burst detection at a single chosen channel, an individual ISI threshold was
defined for each network based on spontaneous activity at the channel of interest prior to the
closed-loop session. The ISI distribution of spontaneous activity was typically bimodal, with a
strong first peak corresponding to ISI within SBs and a second peak for the intervals between
bursts. The minimum between the intra- and inter-burst intervals was chosen as the threshold.
The minimum number of spikes in a burst was set to three.
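A sketch of this threshold selection, assuming a log-spaced ISI histogram and the bimodality described above (binning choices are illustrative):

```python
import numpy as np

def online_isi_threshold(spike_times, n_bins=50):
    """Return the ISI at the trough between the intra-burst peak and the later
    inter-burst peak of the ISI histogram (assumes a bimodal distribution)."""
    isis = np.diff(np.sort(spike_times))
    isis = isis[isis > 0]
    edges = np.logspace(np.log10(isis.min()), np.log10(isis.max()), n_bins + 1)
    counts, edges = np.histogram(isis, bins=edges)
    first = counts.argmax()                            # intra-burst peak
    second = first + 1 + counts[first + 1:].argmax()   # later, inter-burst peak
    trough = first + counts[first:second + 1].argmin()
    return edges[trough]
```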
Parameters extracted from the fitting procedures were used to compute t*, the open-loop parametric estimate of the optimal latency (Eqs 3–7). To compare the predicted and realized
improvement in stimulation efficacy after learning we estimated the stimulation efficacy of a
strategy using random stimulation latencies taken from the objective function as the baseline
model. The efficacy of this strategy corresponds to the mean of the objective function of each
network.
Results
To analyze the performance of closed-loop autonomous control systems for neurotechnical
interaction with neuronal networks we designed a reduced model system that captures several
general aspects of the problem setting. Networks of cortical neurons in culture develop robust
spontaneous activity that influences the outcome of stimulation in non-trivial ways. In addition,
this activity exhibits non-stationarities that any control system needs to cope with. We further
defined a target function whose value was not known to the control system a priori but
needed to be identified autonomously. State and action space were restricted to keep the
dimensionality of the paradigm low and allow quantitative validation of the optimization
problem.
Properties of spontaneous network activity and response to electrical
stimulation
Neuronal networks cultured on MEAs display spontaneous activity that consists of synchro-
nized network-wide spontaneous bursts (SB) separated by periods of inactivity. Burst lengths ranged from hundreds of milliseconds to a few seconds. SBs were detected using an algorithm
that combined an inter-spike-interval threshold and the number of simultaneously active sites
(Fig 3A). Inter-burst-intervals (IBIs) were approximately lognormally distributed (Fig 3B). Fitting algorithms yielded the location and scale parameters (μ and σ) of the corresponding log-
normal distribution. The cumulative of this distribution was used to estimate the probability of
another SB occurring given the period of inactivity that elapsed, or what we term the 'probability of interruption', following an SB (Fig 3B, red line).
Fig 3. Identification of network specific objective functions. (A) Networks of dissociated neurons in vitro exhibit activity characterized by
intermittent network-wide spontaneous bursts (SB) separated by periods of reduced activity (raster plot for 60 channels in a DIV 27 network). The
shading marks the limits of individual SBs as detected by the burst-detection algorithm. (B) The distribution of Inter-Burst Intervals (IBIs) is
approximately lognormal. The histogram shows the IBI distribution for the network in (A). The cumulative of this distribution (red) is predictive of the
probability of being interrupted by ongoing activity given the elapsed period of inactivity, i.e. the current state s_t. (C) Such a distribution was used to weight response strengths so that each dot represents the mean response strengths that can be evoked over a set of trials, including those that did
not lead to stimulation, for a given stimulation latency. The fit predicts the objective function of the optimization problem. The example shows the
data for the network shown in Fig 1C. The curve reveals a quasiconcave dependency, a unique global maximum and an optimal latency of ≈2.5 s
in this network. (D) Fits to the probability of avoiding an interruption (blue), response strengths prediction (orange), and the resulting weighted
response curve (orange, dotted) shown for another network. An optimal latency of ≈1.5 s emerges in this case. (E) All predicted objective functions
for each of the 20 networks studied were quasiconcave and unique choices of optimal stimulus latencies were available. The objective functions
were normalized to peak magnitude.
doi:10.1371/journal.pcbi.1005054.g003
Stimulating a network at a channel evoked a burst of activity at others. For our experiments,
we selected one stimulating and recording channel each. Weihberger et al. [20] showed that the
greater the duration of network inactivity, the longer the responses at a chosen site will be,
according to a saturating exponential model. In order to verify this relationship and extract the
parameters of the corresponding model, stimuli were delivered at random latencies relative to
the previous SB (open-loop stimulation). Fig 1A shows responses at the recording channel to
50 such trials in an example network. Responses typically consisted of an early (<20 ms post-
stimulus) and late (>50 ms post-stimulus) component. The early component, presumably
reflecting responses to antidromic stimulation, was characterized by temporally precise and
reliable responses while the late component, presumably reflecting responses to orthodromic,
transsynaptic activation, was both variable and unreliable (Fig 1B).
A least-squares fit of the response strengths to a saturating exponential model with stimulus
latency as the independent variable was carried out. The fitting function was of the form
$A(1 - e^{-\lambda t}) + B$ (in red in Fig 1C). We then weighted all response strengths with the probability
of being able to deliver a stimulus at the corresponding latencies, without being interrupted by
ongoing activity. The weighted response strength curve (objective function) thus provides an
estimate of the average number of response spikes that can be evoked for each SB (Fig 3C and
3D). A solution that maximizes this estimate is therefore the optimal solution to the proposed
trade-off problem, namely, to find the stimulus latency that maximizes the number of response
spikes per SB.
We observed that a unique optimal stimulus latency exists for each of the 20 networks we
studied (Fig 3E). The optimal latency emerges as the result of interaction of processes underly-
ing ongoing and stimulus evoked activity dynamics of the network. Quantitative insights from
previous studies [20] allowed us to extract relevant parameters from recorded data. We then
constructed a simple parametric model to compute the network-specific optimal latency off-
line, before we let the RL controller explore the problem in a closed-loop.
Dependency of optimal stimulus latencies on properties of network
activity
To understand the emergence of the optimal stimulus latencies from interacting biological pro-
cesses and visualize the nature of the input–output relations and their relationship with the
underlying parameter space, we considered simplified phenomenological models of each of the
major contributing processes. 'Input', in the context of this problem, refers to the period of inactivity/latency after which a stimulus is delivered, and 'output' to the average number of response spikes evoked for every SB, the response feature of interest. The recovery of post-burst net-
work excitability was modeled as a saturating exponential function (Eq 3). A statistical model
of the temporal occurrence of SB events was considered (Eq 4). The corresponding model
parameters were extracted from spontaneous and evoked activity recorded from each network.
The model equations were:
$$R(t) = A\left(1 - e^{-\lambda t}\right) + B, \qquad (3)$$

$$\mathrm{IBI}(t) = \frac{1}{t \sigma \sqrt{2\pi}}\, e^{-\frac{(\ln t - \mu)^2}{2\sigma^2}}, \qquad (4)$$

$$I(t) = 1 - \Phi\!\left(\frac{\ln t - \mu}{\sigma}\right), \qquad (5)$$

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$,

$$f(t) = I(t) \cdot R(t), \qquad (6)$$

$$t^{*} = \arg\max_{t} f(t). \qquad (7)$$
R(t) and IBI(t) are the response strengths and the IBI, respectively, modeled as a function of the period of inactivity, t (the input). I(t) is the computed probability of avoiding an interruption, given a period of inactivity, t, and f(t) the appropriately weighted response strength model, i.e. the objective function (the input–output relationship). $f(t)|_{t}$ then gives the stimulus efficacy for repeated stimulation at latency t. The optimal latency $t^{*}$ is the maximizer of this function.
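For illustration, the objective function and its maximizer can be evaluated numerically from the fitted parameters. The following sketch implements Eqs 3–7 directly (grid range, resolution and the example parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def objective(t, A, lam, B, mu, sigma):
    """f(t) = I(t) * R(t), combining Eqs 3, 5 and 6."""
    R = A * (1.0 - np.exp(-lam * t)) + B            # recovery model (Eq 3)
    I = 1.0 - norm.cdf((np.log(t) - mu) / sigma)    # P(no interruption) (Eq 5)
    return I * R

def optimal_latency(A, lam, B, mu, sigma, t_max=20.0, dt=0.01):
    """Grid-search maximizer of the objective, i.e. t* of Eq 7."""
    t = np.arange(dt, t_max, dt)
    f = objective(t, A, lam, B, mu, sigma)
    return t[f.argmax()], f.max()

# e.g. t_star, f_star = optimal_latency(A=20, lam=1.0, B=6.67, mu=0.6, sigma=1.0)
```

Sweeping one parameter at a time through such a function while holding the others fixed is all that is needed to reproduce the dependencies explored below.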
In order to visualize the dependence of the input-output relations on the contributing
parameters, we numerically computed objective functions and the corresponding $t^{*}$, while varying one or more parameters and holding the remaining constant. Initially, A was allowed to vary while parameters B, λ, μ, σ were held constant. Fig 4A and 4B show the family of recov-
ery functions considered and the corresponding family of objective functions. In general, all
objective functions shared the property of being quasiconcave and permitted a unique maxi-
mum. These maxima (marked as dots) were the desired outputs and the corresponding stimu-
lus latencies $t^{*}$, the desired optimal latencies. The desired output, or equivalently the desired latency, increased non-linearly with A (Fig 4B).
Within the parameter range observed for A (mean and standard deviation 15.5 ± 9.3) and B
(mean and standard deviation 4 ± 5.8) in our networks, the nature of the objective function
family was preserved; a unique optimal latency existed, and monotonically increased or
decreased non-linearly with A, depending on the value of B (Fig 4C). Fig 5A summarizes the dependence of $t^{*}$ on the A–B plane. Each color coded plane corresponds to a different value of the time constant λ. λ was allowed to vary in the range observed experimentally (0.2 ≤ λ ≤ 1.2).
Fig 4. Dependence of optimal latency on parameters that capture the network's response to stimuli. Dependence of the objective function on parameters that capture the network's response to stimuli. In all panels the parameters B, λ, μ and σ were set to 6.67, 1, 0.6 and 1, respectively. (A) Changes of response strength with the gain A of the response strength model within the range observed experimentally (5 ≤ A ≤ 40, B = 6.67; t: stimulus latency). (B) The optimal latencies t* (dots), i.e. the maxima of the objective function f(t), increased non-linearly with the gain parameter A (dashed line). Color code as in panel A (B = 6.67). (C) Changes of optimal timing t* as a function of gain A and y-intercept B within the range observed experimentally (-10 ≤ B ≤ 20). B influences the relationship of t* with A and was trivial at B = 0. Black dots and dashed line indicate the case B = 6.67 shown in panel B. Note that A + B > 0 was imposed to ensure that the maximal responses were strictly
positive.
doi:10.1371/journal.pcbi.1005054.g004
Next, we varied the location and scale parameters, μ and σ respectively (see Eq 4), of the IBI distribution. The corresponding input–output relations were still quasiconcave, thus ensuring the existence of a unique maximum. The optimal latency depended almost linearly on μ (Fig 5D). Fig 5E illustrates how the optimal latency is modulated in the A–B–μ space for λ = 1.
The scale parameter σ, however, had no significant effects on the shapes of the objective func-
tions and hence the corresponding optimal times (S1B Fig). The model thus allowed us to
predict the optimal stimulus latency based on the individual properties of spontaneous and
evoked activity of each network.
Fig 5. Dependence of the optimal latency on properties of the network's activity dynamics. (A) Dependence of the optimal stimulus latency t* on the A–B plane. Each plane corresponds to a different value of the time constant λ of the recovery function within the range observed experimentally (0.2 ≤ λ ≤ 1.2). (inset) Zoom-in to 2 ≤ B ≤ 6.67 to reveal the monotonic rise of t* (dots and dashed line) that corresponds to the case described in Fig 4B (λ = 1). (B) Dependence of the gain in stimulation efficacy by using t* over random stimulation latencies on the time constant λ of the recovery function. μ, A, B, and σ were set to 0.6, 20, 6.67, and 1, respectively. (C) IBI distributions for the range of values observed experimentally of the location parameter μ (0.6 ≤ μ ≤ 2) for A, B, λ, σ set to 20, 6.67, 1 and 1, respectively. (D) The family of objective functions corresponding to the IBI distributions in (C) shows the near linear relationship of the optimal latencies with μ (dots and dashed line) (A, B, λ, σ were 20, 6.67, 1 and 1, respectively; colors as in (C)). (E) Summary of the dependence of the optimal stimulus latency on the A–B–μ space for λ = 1. Each plane corresponds to a different value of the location parameter μ of the IBI distribution. (inset) Zoom-in to 2 ≤ B ≤ 6.67 to reveal the rise of t* (dots and dashed line) that corresponds to the case described in Fig 4B (λ = 1, μ = 0.6).
doi:10.1371/journal.pcbi.1005054.g005
RL based strategy to learn optimal latencies
We then proceeded with the closed-loop learning session. The session proceeded in alternating
pairs of training and testing rounds. During training rounds, the controller was free to explore
the state-action space and update its action-value function estimates, while in a testing round,
it always chose an optimal policy based on the knowledge hitherto acquired. The time taken to
run through the experiment varied across networks, but was typically around 3–5 hours,
typically covering 1000 SBs. This variability was due to differences in the average burst rate
between networks. The latency chosen by the algorithm during the final testing session was
considered the learned latency. To test the stability of the learned latency some of the sessions
were run with up to 3000 SB in further training and testing rounds. Fig 6 illustrates a typical
session in an example network. In this case learning proceeded in three pairs of 200 training
and 50 testing trials. Note that a trial in our paradigm refers to the period between SBs where
Fig 6. A closed-loop learning session in an example network. A closed-loop learning session in an example network. The session consisted of 1000
trials (200 training (T_i, red), 50 testing (X_i, green) trials and 4 such pairs). (A) Raster diagram showing the activity at the recording channel around the time
of stimulation. Trials interrupted by ongoing activity are left empty at t>0 s. The spikes of the interrupting SB were removed in (A) and (B) for clarity.
Successful stimuli evoked responses at t>0 s. Blue lines mark the period of latency prior to the stimulus at t = 0 s. Magenta triangles indicate stimuli
delivered in preceding trials. Within training rounds, the controller was free to explore the state space. Note that these rounds are in closed-loop mode but
with a random sequence of stimulation latencies. The strategy in this example was non-greedy. During testing rounds the hitherto best policy was chosen.
After the final round, a latency of 1.4 s was learned. Stimulus properties were as in Fig 1. (B) Zoom-in on responses evoked throughout the session.
Interrupted trials appear as empty rows; in this example all stimuli elicited responses. (C) Stimulus efficacy estimated as the response strength per SB (RS/SB) computed over each of the training/testing rounds. RS/SB improved considerably during testing compared to the training rounds. The fraction of
trials interrupted in each round is shown as red circles and numerically. The dashed line was added for clarity.
doi:10.1371/journal.pcbi.1005054.g006
stimulation can potentially be delivered. Each trial is therefore initiated by ongoing activity (SB
termination) and not by stimuli. Some of the trials were interrupted by ongoing activity, result-
ing in stimulus counts less than the planned number.
To analyze the closed-loop sessions, we first looked at the model parameters of the recovery
function, A, B and λ, and compared values predicted from open-loop sessions with those recov-
ered from fits to the closed-loop data. Note that in this paradigm responses are available only
at fixed latencies corresponding to the state definition (Fig 7A). The gain of the network, A,
showed a strong positive correlation to the open-loop ones (r = 0.91, p < 10⁻⁵, n = 15 networks,
Fig 7A and 7B), indicating relative stationarity of the quantitative relationship and its accessi-
bility for the controller. Parameter B, which can be interpreted as the excitability threshold for
SB termination, likewise showed a positive correlation (r = 0.66, p = 0.003, n = 18 networks, Fig 7C),
but weaker than model parameter A, suggesting that SB termination may depend on additional
factors not captured by the model. Parameter λ showed a still weaker correlation (S1C Fig).
Across networks, closed-loop estimates of the recovery model were thus mostly consistent with
open-loop estimates.
Fig 7. Comparison of open-loop predictions with autonomously learned strategies. (A) Dependence of response strengths on pre-
stimulus inactivities in data during a closed-loop session in an example network. Each box shows the statistics of response strengths recorded
at one discrete state. The central marks are the medians and the box edges the 25th and 75th percentiles. Whiskers extend to the most extreme data
points not considered outliers, and outliers are plotted individually. The fit (red) was made to the medians. The minimal latency for burst
termination was 0.4 s in this example, which was thus the earliest state available for stimulation. (B) Across networks, closed-loop estimates of
the gain A correlated strongly with open-loop estimates (r = 0.91, p < 10⁻⁵, n = 15 networks), indicating that A was mostly stable during the
experiments. (C) Similarly, closed-loop estimates of B were in agreement with open-loop ones (r = 0.66, p = 0.003, n = 18 networks), although to
a lesser degree. (D) Across networks, learned stimulus latencies show a positive correlation with predicted optimal values (r = 0.94, p < 10⁻⁸,
n = 17 networks). (E) In spite of some variability in Panels B-D the magnitudes of the modeled objective functions for predicted and learned
latencies matched closely (green dots), indicating that the network/stimulator system was performing at a near optimal regime, regardless of
slight discrepancies in the latencies. Exact optima were likely unreachable owing to the coarse discretization (0.5 s) of states. Red dots denote
the corresponding magnitudes at t_rand for a strategy delivering stimuli at random latencies, estimated as the mean of the objective function. (F)
The distribution of errors between learned and predicted latencies is centered around the predicted optimum and confined to within 2 discrete
steps from it.
doi:10.1371/journal.pcbi.1005054.g007
We then compared the learned stimulus latencies with those predicted from open-loop ses-
sions. Overall, stimulus latencies learned by the controller showed a strong positive correlation
with the optimal latencies estimated from open-loop experiments (r = 0.94, p < 10⁻⁸, n = 17 net-
works, Fig 7D). Nevertheless, in some networks learned latencies differed from predicted ones,
as visible in their distances to the diagonal in Fig 7D. Next, we compared the measure being
maximized, stimulation efficacy estimated as the response strength per SB, corresponding to learned and estimated latencies. The network-specific model of the objective function, f(t) (Eq 6), was used to estimate the maximal stimulation efficacy $f(t)|_{t^{*}}$ achievable with the predicted
optimal latency vs. the one learned for a given network. Values of this measure were in strong
agreement (Fig 7E), indicating that the control goal was achieved despite errors in predicted
latencies (Fig 7D). One possible source of errors could be the discretization of the controller's state space into 0.5 s steps. Indeed, the error distribution showed that 74% of the networks
studied fell within ±0.5 s around the optimum (Fig 7F).
Finally, the performance of the controller was evaluated with respect to the defined goal: to
maximize stimulation efficacy measured as the total number of response spikes evoked for
every detected SB in the network. A session-by-session analysis showed that in 94.2% of the
sessions (n = 52 sessions with non-greedy training, 11 networks), the percentage of interrupted
events per session diminished post learning (Fig 8A). IBI distributions of spontaneous activity
prior and subsequent to closed-loop sessions showed small changes in only a few networks
(p<0.001 in 6/20 networks, two-sample Kolmogorov-Smirnov test; S4 Fig) and in these, the fre-
quency of IBIs less than 5 s could change in both directions.
While the number of spikes in a response did not significantly change across sessions (Fig 8B), the standard deviation across stimuli in a session decreased (Fig 8C) (p = 0.01, two-sample
t-test). Concurrently, in 90% of the cases (n = 52 sessions, 11 networks), the stimulus efficacy
had increased after learning, supporting the effectiveness of the learning algorithm.
The models used to estimate the objective functions were derived from fits to spontaneous
activity and noisy samples of responses in open-loop experiments. The quality of the predic-
tions thus depended on the reliability of these fits. Comparison of the optimal stimulus effica-
cies predicted from our models with the efficacies achieved during the final closed-loop testing
sessions showed that achieved efficacies were within the 99% confidence interval for the models
fitted for each network (Fig 8D). Achieved stimulus efficacies fall within the interval in 8 of 11
networks studied.
Learning clearly improved performance in each network (p<0.001, two-sample Kolmogo-
rov-Smirnov test). The amount of improvement, however, varied across networks (Fig 8E). To
compare performance across networks, we captured each network on a normalized response-
per-stimulus vs. interruption probability plane (Fig 8F). Each network is shown before and
after learning. Only the last pairs of sessions were used for this plot (n = 11 networks). The dis-
tribution shows a clear separation of the mixed-mode performance before and after learning,
indicating the improvement of stimulation efficacy. The improvement was almost exclusively
due to a reduction in interruption probability (S2 and S3 Figs). This, however, also shows that the controller learns to avoid losing response magnitude by not further reducing the inter-
ruption probability, i.e. it balances the trade-off.
Discussion
Closed-loop stimulation has been proposed as a promising strategy to intervene in the dynam-
ics of pathological networks while adapting to ongoing activity. The selection of signal features
to close such a loop, and strategies to identify optimal stimulus settings given a desired network response, remain open problems. We propose methods of reinforcement learning to
autonomously choose optimal control policies given a pre-defined goal. In this study, we pres-
ent a proof of principle of such a controller interacting in a goal-directed manner with generic
neuronal networks in vitro, boundary conditions for such interactions and an analysis of the
optimality of the learned controller.
Our results demonstrate the capacity of RL-based techniques to autonomously exploit quantitative relationships underlying a complex network of neurons to find optimal actions. Drawing on previous studies, we identified a simple, verifiable trade-off scenario in which the
dynamics of spontaneous activity in these networks interact with external electrical stimuli
[20]. The temporal relationship of SB events in these networks may be approximated by a log-
normal function. Moreover, response strengths to electrical stimuli have been shown to quanti-
tatively fit a saturating exponential model, dependent on the stimulus latency subsequent to an
SB event. Interaction of these underlying processes gives rise to an abstract objective function
predicting a network-specific and unique stimulus latency relative to an SB that maximizes the
number of response spikes evoked per SB across repeated stimulation. The goal set for the RL
controller was to autonomously identify this optimal stimulus latency.
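To make this trade-off concrete, the objective function described here can be sketched in a few lines of Python. The lognormal IBI model and the saturating response recovery follow the paper's description, but the exact functional form of the recovery and the parameter values (taken from S1 Fig) are our assumptions:

```python
import numpy as np
from scipy.stats import lognorm

# Assumed saturating-exponential recovery of response strength (RS) with
# latency t after a spontaneous burst; A, B, lambda as in S1 Fig.
def response_strength(t, A=20.0, B=6.67, lam=1.0):
    return A - B * np.exp(-lam * t)

# Probability that no spontaneous burst occurs before latency t, i.e. the
# survival function of the lognormal IBI distribution (location mu, scale sigma).
def p_uninterrupted(t, mu=0.6, sigma=1.0):
    return lognorm.sf(t, s=sigma, scale=np.exp(mu))

# Expected response spikes per detected SB when stimulating at latency t.
def objective(t):
    return response_strength(t) * p_uninterrupted(t)

t = np.linspace(0.0, 10.0, 1001)   # domain of interest, 0-10 s
f = objective(t)
print(f"predicted optimal latency: {t[np.argmax(f)]:.2f} s")
```

The product of a monotonically rising recovery curve and a monotonically falling survival function is what yields the quasiconcave shape with a unique interior maximum.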
Fig 8. Performance evaluation of the controller. (A) The percentage of interrupted trials during training (x-axis) and testing (y-axis) sessions
(n = 52 pairs across 11 networks). This percentage decreased sharply after learning in 94.2% of the recorded sessions. (B) The mean RS evoked
per stimulus was, however, preserved in both sessions. (C) The variability in RS per stimulus decreased significantly (p = 0.01, two-sample t-test).
(D) Comparison of the optimal stimulus efficacies predicted from our models with the efficacies achieved during the final closed-loop testing
sessions. Vertical bars represent 99% confidence intervals corresponding to the models fitted for each network. Achieved values fall within the
interval in 8/11 networks studied. (E) Mean rewards were calculated over trials in the final training and testing rounds to compare the controller's performance. After learning, mean rewards increased in each network, indicating the improvement in stimulation efficacy. The rewards
across the sequence of trials in each round were drawn from distinct distributions in every network (p<0.002, two-sample Kolmogorov-Smirnov
test). The individual distributions are shown in S2 and S3 Figs. (F) Summary of learning across networks on a normalized RS/stimulus vs.
interruption probability plane (11 networks). Only final training and testing rounds were considered. Normalization for interruptions was performed
relative to the model-based estimate of interruption probabilities, corresponding to stimulation at random latencies for each network. The RS/stimulus measure was similarly normalized to the model-based estimates of the efficacy assuming a random stimulation strategy. The
improvement in performance clearly separates the data points in the plane. Of the two modalities that contribute to stimulus efficacy, the
improvement was dominated by reduction of interruption probabilities.
doi:10.1371/journal.pcbi.1005054.g008
Response strength was defined as the number of spikes detected in a predefined temporal
window (typically 500 ms) from stimulus onset. Note that while [20] define response length as the time to the last spike of the detected response, our data showed that the spike count in a temporal window post stimulus is proportional to the response length measured in time and hence can be used as an alternative variable.
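A minimal sketch of this measure (the function and array names are ours; spike times are assumed to be pooled timestamps in seconds):

```python
import numpy as np

def response_strength(spike_times, stim_onset, window=0.5):
    """Count spikes within `window` seconds after stimulus onset.

    spike_times: 1-D array of spike timestamps (s), e.g. pooled across
    the recording channels of one network.
    """
    return np.count_nonzero(
        (spike_times >= stim_onset) & (spike_times < stim_onset + window)
    )

# Example: 8 spikes fall inside the 500 ms window after a stimulus at t = 12 s.
spikes = np.array([11.9, 12.01, 12.05, 12.1, 12.2, 12.3,
                   12.31, 12.4, 12.45, 12.7])
print(response_strength(spikes, stim_onset=12.0))  # -> 8
```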
This toy problem for the model system captures some of the major challenges that closed-
loop paradigms would face in a biomedical application, i.e. in a very complex, adaptive envi-
ronment. Balancing the trade-off of response magnitude and interruptions involves finding the
dependence of response magnitudes on stimulus latencies and adjusting at the same time to
the distribution of ongoing activity. With every network being distinct in the properties of its
spontaneous dynamics and response to stimuli, the paradigm was tested for robust operation
over a range of parameters. Furthermore, the ongoing activity is highly variable and subject to
unpredictable modulation at a wide range of time scales. Plasticity of synaptic coupling could
lead to further challenges. Prior studies on such networks enable us to model this interplay
using parametric models under the assumption of system stationarity. This provided us with
the convenient situation where an open-loop prediction of the network-specific optimal stimu-
lus latency could be calculated to evaluate the quality of the controller's learned strategy. Fur-
thermore, both main processes involved could be modeled as a function of only the latency
following an SB event, which thus became an informative and quantifiable low-dimensional
state feature for the RL controller. The recent history of spontaneous activity could have been another contributing factor; offline analyses, however, did not show an influence of this history on the response properties.
We used numerical models of the system to identify the interactions of variations in the
model parameters. The numerical approach revealed the multi-modal nature of the control problem, in that two separate measurable modalities are simultaneously involved: the exponential recovery function and the statistical model of ongoing event occurrence. Combining them enabled us to visualize the non-linear, quasiconcave input-output dependence f(t). The desired metric f(t) was unique, distinct for each network and necessarily attainable, because the corresponding optimal latency t* always belonged to the domain of interest (0-10 s) throughout the span of parameter values observed experimentally. Each network, being a static parameter combination, would therefore map to a single non-linear input-output curve that belonged to the set of objective functions described earlier. In other words, an optimal solution was possible for all parameter sets within the observed range. Open-loop estimates of the saturating exponential model parameters A and B correlated positively with fits to data from closed-loop sessions.
This indicates that the quantitative relationship is stable and accessible to the controller during
the learning phase. Note that the temporal stability of the model is a precondition to the con-
troller being able to converge to the same optimal stimulus latencies as the open-loop estimates.
Indeed, we observed across networks that the learned optimal stimulus latencies agreed with
those predicted from open-loop studies. The controller apparently robustly and autonomously
exploited the underlying stimulus-response relationships in the network by interacting with it
and adapting appropriately to ongoing activity dynamics.
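A numerical exploration of this kind can be sketched as a grid search over the latency domain while sweeping model parameters (illustrative parameter ranges, reusing the assumed model forms from the sketch above):

```python
import numpy as np
from scipy.stats import lognorm

def optimal_latency(A, B, lam, mu, sigma, t_max=10.0, dt=0.01):
    """Grid-search the latency that maximizes expected spikes per SB."""
    t = np.arange(0.0, t_max, dt)
    f = (A - B * np.exp(-lam * t)) * lognorm.sf(t, s=sigma, scale=np.exp(mu))
    return t[np.argmax(f)]

# Sweep the recovery slope and the IBI location parameter to see how the
# predicted optimum shifts (cf. Figs 4, 5 and S1 Fig).
for lam in (0.5, 1.0, 2.0):
    for mu in (0.3, 0.6, 1.2):
        t_opt = optimal_latency(A=20.0, B=6.67, lam=lam, mu=mu, sigma=1.0)
        print(f"lambda={lam:.1f}, mu={mu:.1f} -> t* = {t_opt:.2f} s")
```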
In this study, stationarity of system dynamics was a necessary assumption in order to be
able to compare optimal stimulus latencies predicted from open-loop data at one point in time,
to those learned later in closed-loop sessions. However, this assumption might not necessarily hold.
Neuronal networks in vitro are known to undergo activity dependent plastic changes [29]. The
model parameters of the exponential recovery function are likely time dependent as well. Dif-
ferences in the magnitudes of correlations between parameters A (strongly correlated), B (less positively correlated) and λ (no correlation) in open- vs. closed-loop data fits are possible indicators that some parameters are perhaps more strongly modulated over time than others. For
slow fluctuations, the RL paradigm could easily be modified to update the controller, provided
the temporal resolution of the state-space is high enough to sample the impact of fluctuations
and that the update intervals are adjusted accordingly. It would, however, not be possible to
monitor these modulations of network properties during testing in the current paradigm. We
would therefore not be able to validate the optimality of the resulting controller.
In spite of such sources of variability, the correlation of the learned latencies of the control-
ler with the preceding, temporally distant, open-loop predictions is strong. One reason for this
could be the resolution of the controller: the state-space discretization chosen for the controller was relatively coarse at 0.5 s. From our parametric model of the trade-off situation (Figs 4
and 5), it can be seen that the impact of parameter fluctuations on optimal latencies would be
small relative to this resolution of the state-space. The actual optimum, thus, could fall in the
neighborhood of the learned latencies. Such a tendency is indeed visible in the error between
the learned and optimal times (Fig 7F), which are centered around the optimum. Therefore, by
coarse-graining the state-space discretization we compensate to some degree for non-stationa-
rities in the system. An additional factor contributing to the strong correlation could be rapid
parameter fluctuations, which could average out within the duration of the experiment.
In all networks the achieved stimulation efficacy increased after learning, but individual networks could fall below or above the predicted levels (Fig 8D). This was probably due to the quality of fits
of the recovery function and to non-stationarities in network activity that built up between the
time of the original estimate of the objective function and the training/testing sessions eventu-
ally available for evaluation. Further contributions to variability could come from the choice of
the range of latencies available to the controller for exploration. Longer latencies would lead to
increasing probabilities for interruptions during training, biasing the relative success of the
learned controller. In this study we set maximal latencies to 10 s, which ensured that recovery
would saturate in most networks. Moreover, as Figs 5B and S1A illustrate, the time-constants
of these functions also influence the achieved gain in stimulus efficacies.
Although our control goal was drawn from previous studies on the model system, this insight
was not implemented into the controller and was used only to validate its performance. How-
ever, it must be conceded that our understanding of the underlying relationship did inform the
definition of the low-dimensional state-space for the controller, i.e. that the delay of the stimu-
lus to the preceding burst is relevant for the magnitude of the response. On the one hand, this was essential to validate the controller's quality; on the other, it reduced the number of trials needed
for the controller to converge. In turn, our model cannot capture processes depending on the
serial structure of the stimulation sequence, e.g. the rate of stimulation, activity dependent plas-
ticity or even damage to neurons.
Our choice of Q-learning with a tabular representation of the Q-function was motivated as
follows. For one, Q-learning allows us to learn a Q-function without having a model of the sys-
tem dynamics, which in general is not available when dealing with biological systems. Secondly,
since the state space for the control task at hand could be defined as a single discrete variable, a
tabular representation of the Q-function was applicable, which guarantees convergence [28]
(we note that convergence can also be achieved in some cases where the Q-function is approxi-
mated, see e.g. Szepesvári [30] for an overview). A tabular representation of the Q-function is a
suitable choice as long as the biological system can be described by low-dimensional discretized
states. If a different control problem or biological system demands a moderately finer discreti-
zation of the state space, tabular Q-learning could still be applied. However, if the control
problem requires a state-action space that is high-dimensional or continuous, a tabular repre-
sentation of the Q-function is not advisable due to the so-called curse of dimensionality (expo-
nentially growing memory demand). For our long-term goal of general control of neuronal
networks using RL controllers this would clearly be the case.
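For illustration, a minimal tabular Q-learning loop of the kind described here might look as follows. This is a sketch under strong simplifications: the environment is simulated from the parametric models sketched above rather than a real network, the state is the discretized latency since the last SB, and all names and the action set are ours:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(1)

DT = 0.5                       # state-space discretization (s)
N_STATES = 20                  # latency bins covering 0-10 s
WAIT, STIM = 0, 1              # actions: wait one step, or stimulate now
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
NEUTRAL = 1e-3                 # neutral reward for interrupted trials (cf. S2 Fig)

def spikes(t, A=20.0, B=6.67, lam=1.0):
    return A - B * np.exp(-lam * t)          # assumed recovery function

def hazard(t, mu=0.6, sigma=1.0):
    # P(an SB occurs in [t, t + DT) | none before t), from the lognormal IBI model
    sf = lognorm.sf([t, t + DT], s=sigma, scale=np.exp(mu))
    return 1.0 - sf[1] / max(sf[0], 1e-12)

Q = np.zeros((N_STATES, 2))

for episode in range(5000):
    s = 0                                     # an SB resets the latency clock
    while True:
        greedy = int(np.argmax(Q[s]))
        a = int(rng.integers(2)) if rng.random() < EPSILON else greedy
        t = s * DT
        if a == STIM:                         # stimulate: trial ends
            r, done = spikes(t), True
        elif rng.random() < hazard(t):        # waiting, but an SB interrupts
            r, done = NEUTRAL, True
        else:                                 # waiting succeeded, latency grows
            r, done = 0.0, s == N_STATES - 1
        s_next = min(s + 1, N_STATES - 1)
        target = r if done else r + GAMMA * np.max(Q[s_next])
        Q[s, a] += ALPHA * (target - Q[s, a])
        if done:
            break
        s = s_next

policy = np.argmax(Q, axis=1)                 # 1 where stimulating is preferred
print(f"learned stimulation latency: {DT * int(np.argmax(policy)):.1f} s")
```

With the 0.5 s discretization this table has only 20 x 2 entries, which is why the tabular representation remains tractable here.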
To this end, future work should further investigate the possibility of feature-based approxi-
mate RL using e.g. artificial neural networks (ANNs) [31,32] or Random Forests [33,34]. As
features, such approaches could utilize statistics summarizing network activity in terms of pre-
vious burst and response characteristics, adapt features learned from offline data sets from
multiple networks [34], or use a machine learning approach to automatically find descriptive
features.
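As a pointer only, a skeleton of batch-mode fitted Q-iteration with a Random Forest regressor in the spirit of Ernst et al. [33] might look as follows. Feature construction and data collection are placeholders, terminal-state handling is omitted, and this is not the implementation used in any of the cited works:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(transitions, n_actions, n_iter=20, gamma=0.95):
    """Batch-mode fitted Q-iteration, sketch only.

    transitions: list of (state_features, action, reward, next_state_features)
    tuples collected from recorded stimulation sessions.
    """
    X = np.array([np.append(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    next_states = np.array([s2 for _, _, _, s2 in transitions])

    model = None
    for _ in range(n_iter):
        if model is None:
            y = rewards                  # first iteration: Q ~ immediate reward
        else:
            q_next = np.column_stack([
                model.predict(np.column_stack(
                    [next_states, np.full(len(next_states), a)]))
                for a in range(n_actions)
            ])
            y = rewards + gamma * q_next.max(axis=1)
        model = RandomForestRegressor(n_estimators=50).fit(X, y)
    return model
```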
In this proof-of-principle study, we show that an RL-based autonomous strategy can find optimal stimulation settings in the context of a dynamic neuronal system. Extending beyond special applications in which the aim of stimulation is to abolish some type of event, such as epileptic events or oscillations in Parkinson's disease, the system studied here represents a more general
situation in that the optimal response is initially numerically undefined, i.e. it can take different
values depending on the properties of a network that are unknown to the experimenter. It also
extends beyond the response clamp paradigm [21] in that it takes into account not only the probability of inducing a response but also its magnitude and the probability of being interrupted by spontaneous activity. Our paradigm is thus related to the idea of inducing desired
network activity, e.g. towards sensory feedback by stimulation from neuroprosthetic devices, or
adjusting the activity dynamics of a network to a desired working mode. In the context of our
study, we focused on this specific multi-modal trade-off problem to maximize a derived feature
of the response (response strength per SB event). We could show that a unique optimal strategy
exists for each network and thus verify that the controller autonomously found the optimum of
the objective function given the limitations of our data. Where RL paradigms are applied to
more general situations, phenomenological models of their interaction with biological neuronal
networks could nonetheless help to estimate the quality of the controllers when full mechanis-
tic descriptions of the system are not available.
Supporting Information
S1 Fig. Dependence of the optimal stimulation latency on the slope of the recovery function and the location and scale parameters of the IBI distributions. (A) t* depends on the shape of the recovery function. t* shifts to later times with increasing recovery slope (λ increases) when average inter-burst intervals μ are short, i.e. spontaneous activity is high and the probability for interruption is high. In low-activity regimes, however, the probability of interruption is low, hence t* is late and increasing the slope will lead to a decrease of the stimulus efficacy with increasing latencies, since the increasing interruption probability then outweighs the gain in spikes/stimulus. Because of the saturation of recovery, changes in the probability for interruptions have a dominating influence on t*. (inset) t* shifts to later latencies with increasing μ for a given λ (boxed). A, B and σ were set to 20, 6.67 and 1, respectively. (B) The scale parameter σ of the IBI distribution had little impact on the optimal stimulation latency. A, B, λ and μ were set to 20, 6.67, 1 and 0.6, respectively. (C) Across networks, values of λ recovered from fits to closed-loop data were uncorrelated with open-loop estimates.
(TIF)
S2 Fig. Reward probability distributions for all networks. In each training trial the controller received a reward according to the number of spikes elicited by the stimulus. In trials interrupted by SBs this resulted in a neutral reward (10⁻³), pooled with trials eliciting 0 spikes in the histograms. After learning, the probability of very high rewards was reduced, but this was outweighed by the lower frequency of 0 and neutral rewards.
(TIF)
S3 Fig. Empirical cumulative distribution of rewards for all networks. The empirical cumulative distribution function (ECDF) of the rewards clearly shows that the improvements by learning were dominated by reduced probabilities of receiving 0 or neutral rewards.
(TIF)
S4 Fig. Distributions of IBIs before and after closed-loop sessions. Distributions of IBIs in
spontaneous activity recorded before (blue) and after (yellow) closed-loop sessions. A two-
sample Kolmogorov-Smirnov test showed that the IBIs were drawn from distinct distributions
in 6/20 networks (p<0.001, bold axes).
(TIF)
Acknowledgments
The authors thank Ute Riede for technical assistance in cell culturing and Arvind Kumar for
suggestions on the manuscript.
Author Contributions
Conceived and designed the experiments: SK UE MR JW JB.
Performed the experiments: SK JW SO.
Analyzed the data: SK JW.
Contributed reagents/materials/analysis tools: SO MR.
Wrote the paper: SK UE JW JB.
Designed the software used in analysis: SK JW.
References
1. Koller WC, Lyons KE, Wilkinson SB, Pahwa R. Efficacy of unilateral deep brain stimulation of the vim nucleus of the thalamus for essential head tremor. Movement Disorders. 1999; 14(5):847–850. doi: 10.1002/1531-8257(199909)14:5%3C847::AID-MDS1021%3E3.0.CO;2-G PMID: 10495050
2. Rehncrona S, Johnels B, Widner H, Törnqvist AL, Hariz M, Sydow O. Long-term efficacy of thalamic deep brain stimulation for tremor: Double-blind assessments. Movement Disorders. 2003; 18(2):163–170. doi: 10.1002/mds.10309 PMID: 12539209
3. Vidailhet M, Vercueil L, Houeto JL, Krystkowiak P, Benabid AL, Cornu P, et al. Bilateral Deep-Brain Stimulation of the Globus Pallidus in Primary Generalized Dystonia. New England Journal of Medicine. 2005; 352(5):459–467. doi: 10.1056/NEJMoa042187 PMID: 15689584
4. Krack P, Batir A, Van Blercom N, Chabardes S, Fraix V, Ardouin C, et al. Five-Year Follow-up of Bilateral Stimulation of the Subthalamic Nucleus in Advanced Parkinson's Disease. New England Journal of Medicine. 2003; 349(20):1925–1934. doi: 10.1056/NEJMoa035275 PMID: 14614167
5. Bittar RG, Burn SC, Bain PG, Owen SL, Joint C, Shlugman D, et al. Deep brain stimulation for movement disorders and pain. Journal of Clinical Neuroscience. 2005; 12(4):457–463. doi: 10.1016/j.jocn.2004.09.001 PMID: 15925782
6. Sarem-Aslani A, Mullett K. Industrial perspective on deep brain stimulation: history, current state, and future developments. Frontiers in Integrative Neuroscience. 2011; doi: 10.3389/fnint.2011.00046 PMID: 21991248
7. Kringelbach ML, Jenkinson N, Owen SLF, Aziz TZ. Translational principles of deep brain stimulation. Nat Rev Neurosci. 2007; 8(8):623–635. doi: 10.1038/nrn2196 PMID: 17637800
8. Raspopovic S, Capogrosso M, Petrini FM, Bonizzato M, Rigosa J, Di Pino G, et al. Restoring Natural Sensory Feedback in Real-Time Bidirectional Hand Prostheses. Science Translational Medicine. 2014; 6(222):222ra19. doi: 10.1126/scitranslmed.3006820 PMID: 24500407
9. Azouz R, Gray CM. Cellular Mechanisms Contributing to Response Variability of Cortical Neurons In Vivo. The Journal of Neuroscience. 1999; 19(6):2209–2223. PMID: 10066274
10. Jones LM, Fontanini A, Sadacca BF, Miller P, Katz DB. Natural stimuli evoke dynamic sequences of states in sensory cortical ensembles. Proceedings of the National Academy of Sciences. 2007; 104(47):18772–18777. doi: 10.1073/pnas.0705546104
11. He BJ. Spontaneous and Task-Evoked Brain Activity Negatively Interact. The Journal of Neuroscience. 2013; 33(11):4672–4682. doi: 10.1523/JNEUROSCI.2922-12.2013 PMID: 23486941
12. Petersen CCH, Hahn TTG, Mehta M, Grinvald A, Sakmann B. Interaction of sensory responses with spontaneous depolarization in layer 2/3 barrel cortex. Proceedings of the National Academy of Sciences. 2003; 100(23):13638–13643. doi: 10.1073/pnas.2235811100
13. Kisley MA, Gerstein GL. Trial-to-Trial Variability and State-Dependent Modulation of Auditory-Evoked Responses in Cortex. The Journal of Neuroscience. 1999; 19(23):10451–10460. PMID: 10575042
14. Arieli A, Sterkin A, Grinvald A, Aertsen A. Dynamics of Ongoing Activity: Explanation of the Large Variability in Evoked Cortical Responses. Science. 1996; 273(5283):1868–1871. doi: 10.1126/science.273.5283.1868 PMID: 8791593
15. Hasenstaub A, Sachdev RNS, McCormick DA. State Changes Rapidly Modulate Cortical Neuronal Responsiveness. The Journal of Neuroscience. 2007; 27(36):9607–9622. doi: 10.1523/JNEUROSCI.2184-07.2007 PMID: 17804621
16. Cossart R, Aronov D, Yuste R. Attractor dynamics of network UP states in the neocortex. Nature. 2003; 419:283–287. doi: 10.1038/nature01614
17. Shu Y, Hasenstaub A, McCormick DA. Turning on and off recurrent balanced cortical activity. Nature. 2003; 419:288. doi: 10.1038/nature01616
18. Holcman D, Tsodyks M. The Emergence of Up and Down States in Cortical Networks. PLoS Comput Biol. 2006; 2(3):1–8. doi: 10.1371/journal.pcbi.0020023
19. Gal A, Eytan D, Wallach A, Sandler M, Schiller J, Marom S. Dynamics of Excitability over Extended Timescales in Cultured Cortical Neurons. The Journal of Neuroscience. 2010; 30(48):16332–16342. doi: 10.1523/JNEUROSCI.4859-10.2010 PMID: 21123579
20. Weihberger O, Okujeni S, Mikkonen JE, Egert U. Quantitative examination of stimulus-response relations in cortical networks in vitro. Journal of Neurophysiology. 2013; 109(7):1764–1774. doi: 10.1152/jn.00481.2012 PMID: 23274313
21. Keren H, Marom S. Controlling neural network responsiveness: tradeoffs and constraints. Frontiers in Neuroengineering. 2014; doi: 10.3389/fneng.2014.00011 PMID: 24808860
22. Rosin B, Slovik M, Mitelman R, Michal R, Haber SN, Israel Z, et al. Closed-loop deep brain stimulation is superior in ameliorating parkinsonism. Neuron. 2011; 72(2):370–384. doi: 10.1016/j.neuron.2011.08.023 PMID: 22017994
23. Little S, Pogosyan A, Neal S, Zavala B, Zrinzo L, Hariz M, et al. Adaptive deep brain stimulation in advanced Parkinson disease. Annals of Neurology. 2013; 74(3):449–457. doi: 10.1002/ana.23951 PMID: 23852650
24. Kermany E, Gal A, Lyakhov V, Meir R, Marom S, Eytan D. Tradeoffs and Constraints on Neural Representation in Networks of Cortical Neurons. The Journal of Neuroscience. 2010; 30(28):9588–9596. doi: 10.1523/JNEUROSCI.0661-10.2010 PMID: 20631187
25. Wagenaar D, DeMarse TB, Potter SM. MeaBench: A toolset for multi-electrode data acquisition and on-line analysis. In: Conference Proceedings. 2nd International IEEE EMBS Conference on Neural Engineering. 2005. p. 518–521.
26. Eytan D, Marom S. Dynamics and Effective Topology Underlying Synchronization in Networks of Cortical Neurons. The Journal of Neuroscience. 2006; 26(33):8465–8476. doi: 10.1523/JNEUROSCI.1627-06.2006 PMID: 16914671
27. Ham M, Bettencourt L, McDaniel F, Gross G. Spontaneous coordinated activity in cultured networks: Analysis of multiple ignition sites, primary circuits, and burst phase delay distributions. Journal of Computational Neuroscience. 2008; 24(3):346–357. doi: 10.1007/s10827-007-0059-1 PMID: 18066657
28. Watkins CJ, Dayan P. Q-learning. Machine Learning. 1992; 8(3):279–292. doi: 10.1023/A:1022676722315
29. Minerbi A, Kahana R, Goldfeld L, Kaufman M, Marom S, Ziv NE. Long-Term Relationships between Synaptic Tenacity, Synaptic Remodeling, and Network Activity. PLoS Biol. 2009; 7(6):e1000136. doi: 10.1371/journal.pbio.1000136 PMID: 19554080
30. Szepesvári C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. 2010; 4(1):1–103. doi: 10.2200/S00268ED1V01Y201005AIM009
31. Riedmiller M. Neural Fitted Q Iteration – First experiences with a data efficient neural Reinforcement Learning Method. In: Lecture Notes in Computer Science: Proc. of the European Conference on Machine Learning, ECML 2005. Porto, Portugal; 2005. p. 317–328.
32. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015; 518(7540):529–533. doi: 10.1038/nature14236 PMID: 25719670
33. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research. 2005; 6(Apr):503–556.
34. Guez A, Vincent RD, Avoli M, Pineau J. Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning. In: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13–17; 2008. p. 1671–1678. Available from: http://www.aaai.org/Library/IAAI/2008/iaai08-008.php
... Desired response features of the neural network can be also achieved with reinforcement learning using phenomenological model based on Markov decision process. The group of Egert (Kumar et al. 2016) developed a controller which autonomously optimized low-frequency stimulation settings and evaluated control strategy in real time. Statistics of the burst magnitudes and spontaneous events were used to predict and to optimize an optimal inter-stimulus intervals maximizing the response efficacy for each individual network. ...
... In order to optimize the performance of such system, advanced signal processing techniques need to be used, able to rapidly and reliably compute and extract useful information from the recorded signals. In this context, machine learning algorithms and information theoretic quantities are rapidly taking ground respectively for autonomously adjusting system parameters (Kumar et al. 2016) and for extracting relevant features from neural signals (Panzeri et al. 2017). ...
... Kumar et al [173] used an in-vitro neural network to find, with RL, the best stimulus latency that would provide the highest response in terms of bursts of action potential. ...
Article
Full-text available
The brain is a highly complex physical system made of assemblies of neurons that work together to accomplish elaborate tasks such as motor control, memory and perception. How these parts work together has been studied for decades by neuroscientists using neuroimaging, psychological manipulations, and neurostimulation. Neurostimulation has gained particular interest, given the possibility to perturb the brain and elicit a specific response. This response depends on different parameters such as the intensity, the location and the timing of the stimulation. However, most of the studies performed so far used previously established protocols without considering the ongoing brain activity and, thus, without adaptively targeting the stimulation. In control theory, this approach is called open-loop control, and it is always paired with a different form of control called closed-loop, in which the current activity of the brain is used to establish the next stimulation. Recently, neuroscientists are beginning to shift from classical fixed neuromodulation studies to closed-loop experiments. This new approach allows the control of brain activity based on responses to stimulation and thus to personalize individual treatment in clinical conditions. Here, we review this new approach by introducing control theory and focusing on how these aspects are applied in brain studies. We also present the different stimulation techniques and the control approaches used to steer the brain. Finally, we explore how the closed-loop framework will revolutionize the way the human brain can be studied, including a discussion on open questions and an outlook on future advances.
... MEAs enable full control of the neuronal activity, providing concurrent recording and stimulation at multiple points of the population with high spatial and temporal resolution. This versatility fostered several studies aiming to control specific features of neuronal activity with closed-loop electrical stimulation in MEAs, such as the mean population firing rate using a proportional controller 23 , the network response probability and latency to stimulation using a proportional-integral controller 24 and response strength to stimulation using reinforcement learning 25,26 . ...
Preprint
Full-text available
Adaptive neuronal stimulation has a strong therapeutic potential for neurological disorders such as Parkinson's disease and epilepsy. However, standard stimulation protocols mostly rely on continuous open-loop stimulation. We implement here, for the first time in neuronal populations, two different Delayed Feedback Control (DFC) algorithms and assess their efficacy in disrupting unwanted neuronal oscillations. DFC is a well-established closed-loop control technique but its use in neuromodulation has been limited so far to models and computational studies. Leveraging on the high spatiotemporal monitoring capabilities of specialized in vitro platforms, we show that standard DFC in fact worsens the neuronal population oscillatory behaviour and promotes faster bursting, which was never reported in silico. Alternatively, we present adaptive DFC (aDFC) that monitors ongoing oscillation periodicity and self-tunes accordingly. aDFC disrupts collective neuronal oscillations and decreases network synchrony. Furthermore, we show that the intrinsic population dynamics have a strong impact in the susceptibility of networks to neuromodulation. Experimental data was complemented with computer simulations to show how this network controllability might be determined by specific network properties. Overall, these results support aDFC as a better candidate for therapeutic neurostimulation and provide new insights regarding the controllability of neuronal systems.
... A possible approach to facilitate the fitting procedures could be to develop bidirectional intracortical devices able to record the neuronal activity in response to electrical stimulation and use the recorded neural activity to optimize the stimulation parameters (Rotermund et al., 2019). Another possibility could be to use machine learning to find optimal stimulation settings (Kumar et al., 2016). In any case, more studies are still needed. ...
Article
Full-text available
The restoration of a useful visual sense in a profoundly blind person by direct electrical stimulation of the visual cortex has been a subject of study for many years. However, the field of cortically based sight restoration has made few advances in the last few decades, and many problems remain. In this context, the scientific and technological problems associated with safe and effective communication with the brain are very complex, and there are still many unresolved issues delaying its development. In this work, we review some of the biological and technical issues that still remain to be solved, including long-term biotolerability, the number of electrodes required to provide useful vision, and the delivery of information to the implants. Furthermore, we emphasize the possible role of the neuroplastic changes that follow vision loss in the success of this approach. We propose that increased collaborations among clinicians, basic researchers, and neural engineers will enhance our ability to send meaningful information to the brain and restore a limited but useful sense of vision to many blind individuals.
... In an in vitro study by Kumar et al., Q-learning learning was used to learn the optimal stimulation time for maximizing the intensity of induced spikes in cultures of cortical neurons. [4] Another approach developed to learn new, energy efficient, stimulation patterns for Parkinson's disease used a biophysical surrogate model. By testing different atypical patterns of stimulation against the model, they were able to identify a novel pattern that was more energy efficient than the standard ~130Hz stimulation of the subthalamic nucleus. ...
Conference Paper
Neural modulation is becoming a fundamental tool for understanding and treating neurological diseases and their implicated neural circuits. Given that neural modulation interventions have high dimensional parameter spaces, one of the challenges is selecting the stimulation parameters that induce the desired effect. Moreover, the effect of a given set of stimulation parameters may change depending on the underlying neural state. In this study, we investigate and address the state-dependent effect of medial septum optogenetic stimulation on the hippocampus. We found that pre-stimulation hippocampal gamma (33-50Hz) power influences the effect of medial septum optogenetic stimulation on during-stimulation hippocampal gamma power. We then construct a simulation platform that models this phenomenon for testing optimization approaches. We then compare the performance of a standard implementation of Bayesian optimization, along with an extension to the algorithm that incorporates pre-stimulation state to learn a state-dependent policy. The state-dependent algorithm outperformed the standard approach, suggesting that incorporating pre-stimulation can improve neural modulation interventions.
... Yet, this would be sufficient to initiate SBEs only if the output of this local network is well connected to recruit large parts of the network. Conversely, recurrent input from highly excitable regions to the BIZ must not be too strong to avoid lasting depression of excitability in the BIZ by SBEs (Weihberger et al., 2013;Kumar et al., 2016). A moderately connected position with locally recurrent connectivity would fulfill these prerequisites. ...
Article
Full-text available
The mesoscale architecture of neuronal networks strongly influences the initiation of spontaneous activity and its pathways of propagation. Spontaneous activity has been studied extensively in networks of cultured cortical neurons that generate complex yet reproducible patterns of synchronous bursting events that resemble the activity dynamics in developing neuronal networks in vivo. Synchronous bursts are mostly thought to be triggered at burst initiation sites due to build-up of noise or by highly active neurons, or to reflect reverberating activity that circulates within larger networks, although neither of these has been observed directly. Inferring such collective dynamics in neuronal populations from electrophysiological recordings crucially depends on the spatial resolution and sampling ratio relative to the size of the networks assessed. Using large-scale microelectrode arrays with 1024 electrodes at 0.3 mm pitch that covered the full extent of in vitro networks on about 1 cm², we investigated where bursts of spontaneous activity arise and how their propagation patterns relate to the regions of origin, the network’s structure, and to the overall distribution of activity. A set of alternating burst initiation zones (BIZ) dominated the initiation of distinct bursting events and triggered specific propagation patterns. Moreover, BIZs were typically located in areas with moderate activity levels, i.e., at transitions between hot and cold spots. The activity-dependent alternation between these zones suggests that the local networks forming the dominating BIZ enter a transient depressed state after several cycles (similar to Eytan et al., 2003), allowing other BIZs to take over temporarily. We propose that inhomogeneities in the network structure define such BIZs and that the depletion of local synaptic resources limit repetitive burst initiation.
Article
Full-text available
Intracortical microstimulation (ICMS) is commonly used in many experimental and clinical paradigms; however, its effects on the activation of neurons are still not completely understood. To document the responses of cortical neurons in awake nonhuman primates to stimulation, we recorded single-unit activity while delivering single-pulse stimulation via Utah arrays implanted in primary motor cortex (M1) of three macaque monkeys. Stimuli between 5 and 50 μA delivered to single channels reliably evoked spikes in neurons recorded throughout the array with delays of up to 12 ms. ICMS pulses also induced a period of inhibition lasting up to 150 ms that typically followed the initial excitatory response. Higher current amplitudes led to a greater probability of evoking a spike and extended the duration of inhibition. The likelihood of evoking a spike in a neuron was dependent on the spontaneous firing rate as well as the delay between its most recent spike time and stimulus onset. Tonic repetitive stimulation between 2 and 20 Hz often modulated both the probability of evoking spikes and the duration of inhibition; high-frequency stimulation was more likely to change both responses. On a trial-by-trial basis, whether a stimulus evoked a spike did not affect the subsequent inhibitory response; however, their changes over time were often positively or negatively correlated. Our results document the complex dynamics of cortical neural responses to electrical stimulation that need to be considered when using ICMS for scientific and clinical applications.
Article
Full-text available
The identification of oscillatory neural markers of Parkinson’s disease (PD) can contribute not only to the understanding of functional mechanisms of the disorder, but may also serve in adaptive deep brain stimulation (DBS) systems. These systems seek online adaptation of stimulation parameters in closed-loop as a function of neural markers, aiming at improving treatment’s efficacy and reducing side effects. Typically, the identification of PD neural markers is based on group-level studies. Due to the heterogeneity of symptoms across patients, however, such group-level neural markers, like the beta band power of the subthalamic nucleus, are not present in every patient or not informative about every patient’s motor state. Instead, individual neural markers may be preferable for providing a personalized solution for the adaptation of stimulation parameters. Fortunately, data-driven bottom-up approaches based on machine learning may be utilized. These approaches have been developed and applied successfully in the field of brain-computer interfaces with the goal of providing individuals with means of communication and control. In our contribution, we present results obtained with a novel supervised data-driven identification of neural markers of hand motor performance based on a supervised machine learning model. Data of 16 experimental sessions obtained from seven PD patients undergoing DBS therapy show that the supervised patient-specific neural markers provide improved decoding accuracy of hand motor performance, compared to group-level neural markers reported in the literature. We observed that the individual markers are sensitive to DBS therapy and thus, may represent controllable variables in an adaptive DBS system.
Thesis
Reinforcement Learning is no new discipline in the realm of machine learning, but has seen a surge in popularity and interest from researchers in the last years. Driven by the impact of Deep Learning and impressive success stories such as learning to play Atari on human level or solving the game of Go, one family of Reinforcement Learning methods is in the forefront of said trend: Deep Reinforcement Learning. This term usually refers to the combination of two powerful machine learning methods, namely Q-learning and (possibly deep) artificial neural networks resulting in the popular DQN and NFQ algorithms. Without wanting to belittle the power of said combination, for practitioners there are still many open questions and problems when applying these to a learning task. In this thesis we will focus mainly on two properties of Deep Reinforcement Learning that are especially important when dealing with real world applications, namely the stability and sample efficiency of Deep Reinforcement Learning training procedures. First, we will show by example that Deep Reinforcement Learning can suffer from unstable learning dynamics and propose an algorithm that improves stability as well as sample efficiency on several benchmarks. Second, we will introduce a novel application of Reinforcement Learning to a biological system, namely Biological Neural Networks and we will show that it is possible to learn to control certain activity features of these networks. This application will underline the importance of having stable and sample efficient reinforcement learning procedures.
Conference Paper
In recent years, closed-loop adaptive deep brain stimulation (aDBS) for Parkinson's disease (PD) has gained focus in the research community, due to promising proof-of-concept studies showing its suitability for improving DBS therapy and ameliorating related side effects.The main challenges faced in the aDBS control problem is the presence of non-stationary/non-linear dynamics and the heterogeneity of PD's phenotype, making the exploration of data-driven dynamics-aware control algorithms a promising research direction. However, due to the severe safety constraints related to working with patients, aDBS is a sensitive research field that requires surrogate development platforms with growing complexity, as novel control algorithms are validated.With our current contribution, we propose the characterization and categorization of non-stationary dynamics found in the aDBS problem. We show how knowledge about these dynamics can be embedded in a surrogate simulation environment, which has been designed to support early development stages of aDBS control strategies, specifically those based on reinforcement learning (RL) algorithms. Finally, we present a comparison of representative RL methods designed to cope with the type of non-stationary dynamics found in aDBS.To allow reproducibility and encourage adoption of our approach, the source code of the developed methods and simulation environment are made available online.
Article
Full-text available
In recent years much effort is invested in means to control neural population responses at the whole brain level, within the context of developing advanced medical applications. The tradeoffs and constraints involved, however, remain elusive due to obvious complications entailed by studying whole brain dynamics. Here, we present effective control of response features (probability and latency) of cortical networks in vitro over many hours, and offer this approach as an experimental toy for studying controllability of neural networks in the wider context. Exercising this approach we show that enforcement of stable high activity rates by means of closed loop control may enhance alteration of underlying global input-output relations and activity dependent dispersion of neuronal pair-wise correlations across the network.
Article
Full-text available
Brain-computer interfaces (BCIs) could potentially be used to interact with pathological brain signals to intervene and ameliorate their effects in disease states. Here, we provide proof-of-principle of this approach by using a BCI to interpret pathological brain activity in patients with advanced Parkinson disease (PD) and to use this feedback to control when therapeutic deep brain stimulation (DBS) is delivered. Our goal was to demonstrate that by personalizing and optimizing stimulation in real time, we could improve on both the efficacy and efficiency of conventional continuous DBS. We tested BCI-controlled adaptive DBS (aDBS) of the subthalamic nucleus in 8 PD patients. Feedback was provided by processing of the local field potentials recorded directly from the stimulation electrodes. The results were compared to no stimulation, conventional continuous stimulation (cDBS), and random intermittent stimulation. Both unblinded and blinded clinical assessments of motor effect were performed using the Unified Parkinson's Disease Rating Scale. Motor scores improved by 66% (unblinded) and 50% (blinded) during aDBS, which were 29% (p = 0.03) and 27% (p = 0.005) better than cDBS, respectively. These improvements were achieved with a 56% reduction in stimulation time compared to cDBS, and a corresponding reduction in energy requirements (p < 0.001). aDBS was also more effective than no stimulation and random intermittent stimulation. BCI-controlled DBS is tractable and can be more efficient and efficacious than conventional continuous neuromodulation for PD. Ann Neurol 2013.
Article
Full-text available
A widely held assumption is that spontaneous and task-evoked brain activity sum linearly, such that the recorded brain response in each single trial is the algebraic sum of the constantly changing ongoing activity and the stereotypical evoked activity. Using functional magnetic resonance imaging signals acquired from normal humans, we show that this assumption is invalid. Across widespread cortices, evoked activity interacts negatively with ongoing activity, such that higher prestimulus baseline results in less activation or more deactivation. As a consequence of this negative interaction, trial-to-trial variability of cortical activity decreases following stimulus onset. We further show that variability reduction follows overlapping but distinct spatial pattern from that of task-activation/deactivation and it contains behaviorally relevant information. These results favor an alternative perspective to the traditional dichotomous framework of ongoing and evoked activity. That is, to view the brain as a nonlinear dynamical system whose trajectory is tighter when performing a task. Further, incoming sensory stimuli modulate the brain's activity in a manner that depends on its initial state. We propose that across-trial variability may provide a new approach to brain mapping in the context of cognitive experiments.
Article
Full-text available
Variable responses of neuronal networks to repeated sensory or electrical stimuli reflect the interaction of the stimulus' response with ongoing activity in the brain and its modulation by adaptive mechanisms such as cognitive context, network state or cellular excitability and synaptic transmission capability. Here, we focus on reliability, length, delays and variability of evoked responses with respect to their spatial distribution, interaction with spontaneous activity in the networks and the contribution of GABA-ergic inhibition. We identified network-intrinsic principles that underlie the formation and modulation of spontaneous activity and stimulus-response relations using state-dependent stimulation in generic neuronal networks in vitro. The duration of spontaneously recurring network-wide bursts of spikes was best predicted by the length of the preceding interval. Length, delay and structure of responses to identical stimuli systematically depended on stimulus timing and distance to the stimulation site, which was described by a set of simple functions of spontaneous activity. In addition, the speed of propagation was determined by the overall state of the network at the moment of stimulation. Disinhibition increased the number of spikes per network burst and inter-burst interval length at unchanged gross firing rate, while the response modulation by the duration of pre-stimulus inactivity was preserved. Our data suggest a process of network depression during bursts and subsequent recovery that limit evoked responses following distinct rules. The seemingly unreliable patterns spontaneous activity and stimulus-response relations thus follow a predictable structure determined by the interdependencies of networks structure and activity states.
Book
Full-text available
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book we focus on those algorithms of reinforcement learning which build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.
Article
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Article
Hand loss is a highly disabling event that markedly affects the quality of life. To achieve a close to natural replacement for the lost hand, the user should be provided with the rich sensations that we naturally perceive when grasping or manipulating an object. Ideal bidirectional hand prostheses should involve both a reliable decoding of the user's intentions and the delivery of nearly "natural" sensory feedback through remnant afferent pathways, simultaneously and in real time. However, current hand prostheses fail to achieve these requirements, particularly because they lack any sensory feedback. We show that by stimulating the median and ulnar nerve fascicles using transversal multichannel intrafascicular electrodes, according to the information provided by the artificial sensors from a hand prosthesis, physiologically appropriate (near-natural) sensory information can be provided to an amputee during the real-time decoding of different grasping tasks to control a dexterous hand prosthesis. This feedback enabled the participant to effectively modulate the grasping force of the prosthesis with no visual or auditory feedback. Three different force levels were distinguished and consistently used by the subject. The results also demonstrate that a high complexity of perception can be obtained, allowing the subject to identify the stiffness and shape of three different objects by exploiting different characteristics of the elicited sensations. This approach could improve the efficacy and "life-like" quality of hand prostheses, resulting in a keystone strategy for the near-natural replacement of missing hands.
Article
Thalamic deep brain stimulation (DBS) is proven to suppress tremor in Parkinson's disease (PD) and essential tremor (ET). However, there are few reports on its long-term efficacy. We studied the efficacy of DBS at 2 years and 6–7 years after electrode implantations in the ventrointermediate nucleus of the thalamus in 39 patients (20 PD, 19 ET) with severe tremor. Twenty-five of the patients completed the study. Evaluations were done in a double-blind manner with the Unified Parkinson's Disease Rating Scale (UPDRS) and Essential Tremor Rating Scale (ETRS). DBS decreased tremor sum scores in PD (P < 0.025) compared to the preoperative baseline (median, 7; Q25–75, 6–9) both at 2 years (median, 2; Q25–75, 2–3.5; n = 16) and at 6 to 7 years (median, 2.5; Q25–75, 0.5–3; n = 12). Stimulation on improved tremor sum as well as sub scores (P < 0.025) compared to stimulation off conditions. In ET, thalamic stimulation improved (P < 0.025) kinetic and positional tremor at both follow-up periods (n = 18 and n = 13, respectively) with significant improvements (P < 0.025) in hand-function tests. PD but not ET patients showed a general disease progression. Stimulation parameters were remarkably stable over time. We conclude that high-frequency electric thalamic stimulation can efficiently suppress severe tremor in PD and ET more than 6 years after permanent implantation of brain electrodes. © 2002 Movement Disorder Society
Conference Paper
This paper highlights the crucial role that modern machine learning techniques can play in the optimization of treatment strategies for patients with chronic disorders. In particular, we focus on the task of optimizing a deep-brain stimulation strategy for the treatment of epilepsy. The challenge is to choose which stimulation action to apply, as a function of the observed EEG signal, so as to minimize the frequency and du- ration of seizures. We apply recent techniques from the rein- forcement learning literature—namely fitted Q-iteration and extremely randomized trees—to learn an optimal stimulation policy using labeled training data from animal brain tissues. Our results show that these methods are an effective means of reducing the incidence of seizures, while also minimizing the amount of stimulation applied. If these results carry over to the human model of epilepsy, the impact for patients will be substantial.
Conference Paper
This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically, that reasonably few interactions with the plant are needed to generate control policies of high quality.