Migrating from Classical Machine Learning to Quantum Machine Learning: An overview and case study on Drug Discovery
Yash Patel
Manipal University Jaipur
yashpatel.dev@gmail.com
Carlos Toxtli
West Virginia University
carlos.toxtli@mail.wvu.edu
Abstract
Recent advances in Quantum Computing have opened up new dimensions for its use in various fields. This encourages teams in different fields to explore how their existing models could be ported to a quantum approach. Getting started in this field is not a straightforward task and requires domain knowledge in quantum computing. In this work, we explore how to migrate an existing model to a quantum approach. To demonstrate the process, we focus on the drug discovery field, where the capabilities of quantum computers can be used to the fullest. Drug discovery is the process of discovering and designing new drugs and finding their potential protein targets. Several Machine Learning (ML) approaches have been used for drug discovery, but Quantum Machine Learning can be used to find atypical patterns in the data that classical machine learning algorithms fail to find. Current approaches for drug discovery usually take a long time and require huge resources; quantum machine learning algorithms could lead to quicker discoveries. Here we discuss some methods used in the drug discovery pipeline, mainly focusing on drug-target identification with machine learning, and their migration to equivalent quantum machine learning approaches.
1. Introduction
Drug discovery is the process through which new drugs are discovered or created. It mainly involves chemistry, pharmacology, and clinical services. The cost of drug development is the total cost incurred from drug discovery through clinical trials and approval; some costs, such as out-of-pocket expenses, remain unaccounted for in the total. The time needed for drug development is also usually long, generally more than ten years for a single drug, which carries a huge cost. Drug development has four main steps: Target Identification and Validation, Compound Screening and Lead Discovery, Preclinical Development, and Clinical Development [2]. Drug discovery comprises target identification and validation along with compound screening, and it consumes the largest share of these resources. The process can involve scientists determining the germs, viruses, and bacteria that cause a specific disease or illness. The time frame can range from 3 to 20 years, and costs can range from several million to billions of dollars [7].
Target identification is the initial and most important step in the drug discovery pipeline. It involves characterizing the effects of a candidate drug on a particular biological entity. The biological entity affected by a drug is termed a target, and is generally a protein sequence. Identifying suitable targets and validating them increases the chances of success during the discovery process and also allows one to foresee the effects and side effects associated with modulating the target.
Compound screening and lead discovery comprise high-throughput screening (HTS), an experimental method for screening compounds for activity against a biological target. Over the years this process has been automated using applications with scoring functions that rank compounds by their level of activity.
Recent advances in Artificial Intelligence (AI) have led to an increase in its application to new problems. The application of AI to drug discovery is an actively explored topic, and a substantial amount of work has been done in the past few years, leading to wider adoption of AI in pharmaceutical companies. Thanks to this adoption, many datasets have been curated in recent years, such as DrugBank [14], PubChem [9], ZINC [6], ChEMBL [4], and many more. These datasets represent compounds as SMILES strings [13], a popular line notation for chemical structures. SMILES are then converted to molecular descriptors, numerical vectors (feature vectors) that capture certain characteristics of a compound; they are generated from various geometrical, structural, and physicochemical properties. A few libraries have been developed for building feature vectors from SMILES. Machine learning algorithms are then applied to the extracted feature vectors.
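As a minimal sketch of this step (the molecule and the chosen descriptors are arbitrary examples), molecular descriptors can be computed from a SMILES string with RDKit:

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    # Parse an example SMILES string (aspirin) into a molecule object.
    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

    # A small, illustrative descriptor vector for this molecule.
    features = [
        Descriptors.MolWt(mol),       # molecular weight
        Descriptors.MolLogP(mol),     # octanol-water partition coefficient
        Descriptors.TPSA(mol),        # topological polar surface area
        Descriptors.NumHDonors(mol),  # hydrogen-bond donors
    ]
    print(features)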
Advancements in quantum computing have made it possible to implement certain machine learning techniques on quantum computers. Quantum computers from IBM, D-Wave, and Rigetti are available to developers and researchers, though access is restricted. Toolkits such as PennyLane, Strawberry Fields, and qiskit can be used to develop quantum computer algorithms. These toolkits have enabled developers to implement machine learning techniques on quantum computers and to find new applications for them.
2. Machine Learning to Quantum Machine Learning transition theory
2.1. Classical Machine Learning approaches
Machine learning lets a computer find patterns in data without explicit instructions. It is essentially the use of statistical models that form a mathematical function with a set of coefficients unique to the particular data. Machine learning algorithms identify these coefficients by traversing the data and minimizing the cost of predicting a value, that is, by reducing the difference between the predicted value and the true value found in the dataset. Hence, enlarging the dataset with dissimilar data can help increase the accuracy of the model, because the coefficients are tuned on a larger and more diverse dataset. This does not hold in all cases, however: accuracy can still decrease even as the dataset grows. Machine learning approaches are broadly classified into two types, supervised and unsupervised learning. For this case study, we relied on supervised learning algorithms, mainly Support Vector Machines (SVMs) [12] and Random Forest [1]. Many researchers have found that SVMs and Random Forest perform best for target identification when using the feature-based approach.
The Support Vector Machine is a supervised machine learning algorithm widely used for both regression and classification tasks. SVMs plot the features in an n-dimensional space, with each feature value being a particular coordinate; the points are then separated into classes by a hyperplane. Random Forest takes a different approach to classification: it is a collection of many Decision Trees, graph-like models of decisions and their possible outcomes, resource costs, and utility. Both algorithms are well suited to classification and regression, and various case studies and research outcomes have shown them to be the methods of choice for drug target identification studies.
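As a minimal sketch of these two baselines with scikit-learn, using a synthetic stand-in for the drug-target feature matrix (the data and hyperparameters are illustrative, not the paper's exact setup):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic stand-in for a drug-target feature matrix and binary labels.
    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit the two classical baselines on the same training split.
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

    print("SVM accuracy:", svm.score(X_test, y_test))
    print("Random Forest accuracy:", forest.score(X_test, y_test))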
Deep learning algorithms have been widely used in recent years, demonstrating good results in different fields of study. Deep Neural Networks (DNNs) are artificial neural networks with multiple hidden layers. The basic building block of a neural network is the neuron, essentially an activation function whose coefficients are determined during training. A feed-forward DNN, having multiple hidden layers, can be considered the most basic DNN architecture. DNNs are widely used for drug target identification because of their ability to predict multiple tasks with a single model. We used a simple feed-forward DNN to compare its results with the QNN counterpart.
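A hedged sketch of such a network, written in Keras as an assumption since the paper does not name its classical deep learning framework; the input width of 1821 matches the feature vector size reported in Section 3.4, while the hidden layer widths are illustrative:

    import tensorflow as tf

    # Feed-forward binary classifier; 1821 = drug-target feature vector size
    # (Section 3.4). Layer widths are assumptions, not the authors' exact net.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1821,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # interaction prob.
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, epochs=10, batch_size=32)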
2.2. Quantum Machine Learning approaches
Classical computers manipulate bits using classical gates such as XOR, NOT, and AND. Quantum computers use a different set of gates called quantum gates. A quantum gate is a basic quantum circuit that operates on a small number of qubits rather than on classical bits. Quantum gates possess the property of reversibility, which classical gates lack, enabling them to retain information throughout the computation. It is possible to perform classical computing using reversible quantum gates, whereas the reverse is not true. Quantum physics offers two further phenomena that quantum computers leverage and that allow them to surpass classical computers: superposition and entanglement. Superposition allows a qubit to be in two states simultaneously until its value is explicitly measured. If two or more qubits are entangled, they remain entangled after exiting a quantum gate, keeping their joint information intact through the transition. Quantum gates are conventionally represented as unitary matrices; consequently, the number of input and output qubits of a gate must be equal.
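A short numerical check of the unitarity property, using the Hadamard gate as an example:

    import numpy as np

    # The Hadamard gate as a unitary matrix.
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    # Unitarity (H @ H^dagger = I) is what makes the gate reversible:
    # applying the inverse recovers the input state exactly.
    print(np.allclose(H @ H.conj().T, np.eye(2)))  # True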
Quantum computing is the manipulation of qubits to process information. The ability of qubits to exist in superposition leads to a substantial speedup in the amount of information that can be processed. Quantum machine learning is the domain in which the principles of quantum computing are leveraged to develop algorithms for typical machine learning problems. This can be achieved by adapting classical algorithms to run on a quantum computer. In theory, we can describe quantum machine learning as a method of quantum information processing that learns input-output relations from data provided during the training phase, using a quantum function or quantum approach to learn the data and find patterns. There are still open questions about the efficiency of such algorithms on different types of data, and methods remain to be discovered to perfect the use of quantum computers for machine learning problems. Quantum machine learning can also be used to analyze quantum data. It typically requires encoding the given classical dataset into a quantum computer to make it accessible for quantum information processing; the encoded data can then be processed and analyzed by a quantum circuit.
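As a sketch of this encoding step (the feature values and the choice of qiskit's ZZFeatureMap are illustrative assumptions):

    from qiskit.circuit.library import ZZFeatureMap

    # Two classical feature values, encoded with one qubit per feature.
    features = [0.3, -0.7]
    feature_map = ZZFeatureMap(feature_dimension=2, reps=2)

    # Bind the data to the circuit's parameters and inspect the result.
    circuit = feature_map.assign_parameters(features)
    print(circuit.decompose().draw())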
Certain quantum machine learning approaches have already been developed, and many more are under development. One such algorithm is the Quantum Support Vector Machine (QSVM), available in the qiskit suite for direct use from Python. Machine learning represents data in matrix form, which is also the basis of quantum information processing. Quantum machine learning algorithms take two forms: either a classical machine learning algorithm is converted into its quantum counterpart to be implemented on a quantum computer, or a computationally expensive classical machine learning algorithm is run on a quantum computer without classical-to-quantum conversion. In a QSVM, the classical data is converted to a quantum state for each training example, followed by hyperplane parameter optimization.
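A hedged sketch of this pipeline: the paper used qiskit's QSVM class from the since-retired Aqua module, so the example below uses its successor, QSVC with a fidelity quantum kernel from the qiskit-machine-learning package:

    from qiskit.circuit.library import ZZFeatureMap
    from qiskit_machine_learning.algorithms import QSVC
    from qiskit_machine_learning.kernels import FidelityQuantumKernel

    # Encode each 2-feature sample into a quantum state, then use the
    # state overlaps as the SVM kernel.
    feature_map = ZZFeatureMap(feature_dimension=2, reps=2)
    kernel = FidelityQuantumKernel(feature_map=feature_map)
    qsvc = QSVC(quantum_kernel=kernel)
    # qsvc.fit(X_train, y_train)     # X_train: shape (n_samples, 2)
    # qsvc.score(X_test, y_test)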
Neural networks are a series of algorithms that try to recognize underlying relationships in a set of data through a process that mimics the human brain. They are multi-layer networks of neurons used to classify inputs and make predictions. Recent advances in quantum computing have led to implementations of neural networks based on the principles of quantum mechanics and deployed to run on quantum computers. Quantum Neural Networks (QNNs) use a quantum perceptron as the basic unit of the network. A quantum perceptron must satisfy both linear and non-linear properties, because quantum computing uses a combination of linear and non-linear properties of quantum mechanics to function. Many similarities can be found between the operation of a QNN and a classical ANN. Both are composed of several layers of perceptrons: an input layer, one or more hidden layers, and an output layer, all fully connected. A weighted sum of the outputs of the previous layer is calculated; if the sum is greater than a threshold, the node is set, else it is reset. The output layers of a QNN and an ANN differ in one respect, however: the QNN's output layer checks its accuracy against the target output of the whole network. The network computes a function by checking which output is high as a whole, though there are no checks on exactly which outputs are high.
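A minimal sketch of such a variational quantum model in PennyLane, the library used later in this work; the particular embedding and entangling layer are illustrative assumptions, not the authors' exact circuit:

    import pennylane as qml
    from pennylane import numpy as np

    dev = qml.device("default.qubit", wires=2)

    @qml.qnode(dev)
    def qnn(inputs, weights):
        qml.AngleEmbedding(inputs, wires=[0, 1])             # encode features
        qml.StronglyEntanglingLayers(weights, wires=[0, 1])  # trainable layer
        return qml.expval(qml.PauliZ(0))                     # scalar readout

    # Random initial weights of the shape the layer template expects.
    shape = qml.StronglyEntanglingLayers.shape(n_layers=1, n_wires=2)
    weights = np.random.random(shape)
    print(qnn([0.1, 0.4], weights))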
3. Case Study: Drug Target Identification
3.1. What is drug discovery?
Drug discovery is, at its simplest, the discovery of new drugs. Drug target identification is the process of identifying the entity to which a drug compound binds, usually a protein or gene. Drug research and development starts with the identification of biomolecular targets; this is the key preliminary choice, since modulating the chosen target is what modulates the disease. After a target is identified, it is validated using physiological models. Such validation is nowadays performed using computer software, where each validated drug is awarded a score. The final validation, however, comes only after clinical tests on subjects have been performed.
Drug Target Identification (DTI) is both a significant step and an open challenge in the drug discovery pipeline, and it has attracted the attention of researchers in the past few years. Traditional DTI methods are labor- and resource-intensive; hence, the focus is shifting to computational methods. Machine learning methods are being explored for DTI because of their ability to learn from datasets and to find patterns in the underlying data.
3.2. Dataset
For this case study, we used the DrugBank [14] dataset. DrugBank is a freely available dataset containing drug and drug target information; it comprises detailed drug data and target data. The latest release of DrugBank (version 5.1.4) comprises 13,363 drug entries, including 2,606 approved small molecule drugs, 1,299 approved biotech (protein/peptide) drugs, 130 nutraceuticals, and over 6,311 experimental drugs. Additionally, 5,164 non-redundant protein (i.e., drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 200 data fields, with half of the information devoted to drug/chemical data and the other half to drug target or protein data.
Machine learning methods for drug target identification can be classified into three types: feature-based approaches, similarity-based approaches, and other approaches. In a feature-based approach, the input to the machine learning model is a feature vector in which each value represents an essential chemical or biological property of the drug or the protein. Feature vectors are generated by concatenating the molecular descriptors of the drug compound with the protein descriptors of the target sequence, and can be passed as input to a standard machine learning model such as a Support Vector Machine or Random Forest. In a similarity-based approach, a similarity matrix is generated in which the (x, y)-th element is the similarity between drug x and drug y; a target similarity matrix is generated analogously by protein sequence alignment. Other approaches use different information about the drug, such as pharmacological or biomedical information, from which drug-protein relations are extracted; in some cases, however, the extracted relations may not reflect real drug-target interactions.
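As an illustration of the drug side of the similarity-based approach, a common choice (an assumption here, not necessarily what the cited methods use) is the Tanimoto similarity between RDKit Morgan fingerprints:

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # Two illustrative drug molecules given as SMILES.
    mols = [Chem.MolFromSmiles(s) for s in ("CCO", "CC(=O)O")]

    # Morgan (circular) fingerprints, then their Tanimoto similarity,
    # which would fill one (x, y) cell of the drug similarity matrix.
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
           for m in mols]
    print(DataStructs.TanimotoSimilarity(fps[0], fps[1]))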
Table 1. Python toolkits for drug discovery.

Name           Description
rdkit [10]     A collection of cheminformatics and machine-learning
               algorithms written in C++ and Python. It provides 2D and
               3D molecular operations and can calculate molecular
               descriptors and fingerprints for machine learning.
PyBioMed [3]   A Python package for effectively representing molecules
               under investigation (small molecules, proteins, DNA, and
               even complex interactions) as descriptors, so that
               prediction models can be built with machine learning
               libraries such as scikit-learn.
mordred [11]   A Python package to calculate descriptors for various
               molecules.
To implement the classical machine learning solution, we used the feature-based approach, since this data can also be used directly with the quantum machine learning algorithms. The bioinformatics libraries we used are listed in Table 1. These libraries generate the molecular and protein descriptors of the drugs and targets, respectively; each generated descriptor provides information about a certain property of the drug or protein compound. The molecular descriptors and protein descriptors are then appended to form a feature vector: if the molecular descriptor vector has size 1 x n and the protein descriptor vector has size 1 x m, the feature vector has dimensionality 1 x (n + m). This method is applied to positive as well as negative drug-target pairs. A positive drug-target pair is one where the drug binds to the particular protein sequence. For example, if drug A binds to 4 proteins (targets) in the dataset, we obtain 4 positive drug-target pairs, and the remaining proteins form negative drug-target pairs with drug A. Positive pairs are labeled one and negative pairs are labeled zero. This collection of feature vectors is then used as the input to the SVM and Random Forest methods; we also implemented a feed-forward DNN for the same classification task.
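A minimal sketch of assembling and labeling one such pair in NumPy; the n and m below are invented for illustration, and only their total of 1821 matches the feature size reported in Section 3.4:

    import numpy as np

    # Illustrative descriptor lengths (the n/m split is an assumption).
    drug_descriptors = np.random.rand(1333)    # 1 x n molecular descriptors
    protein_descriptors = np.random.rand(488)  # 1 x m protein descriptors

    # Concatenate into a single drug-target feature vector.
    pair_vector = np.concatenate([drug_descriptors, protein_descriptors])
    print(pair_vector.shape)                   # (1821,), i.e. 1 x (n + m)

    label = 1  # 1 for a known (positive) pair, 0 for a sampled negative pair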
3.3. Motivation
Recent advances in quantum computing and machine learning have led many developers and researchers to try solving real-world problems with these technologies. Our motivation was similar: to try a quantum machine learning algorithm in the domain of drug discovery. There are four different approaches to combining quantum computing and machine learning, shown in Figure 1.

Figure 1. Quantum and Classical algorithm approaches.
We use the classical-quantum approach, where the data is classical in nature and the algorithm is quantum. In this approach, the classical data is converted to quantum data, which enables the implementation of quantum algorithms on real-world quantum computers. The ability of quantum computers to leverage quantum mechanical phenomena such as superposition and entanglement also motivated us to implement drug target identification with machine learning on quantum computers. Moreover, because quantum gates are reversible, the previous state of the machine is always known and all information is retained, which makes data processing fast.
Hypothesis. After comparing the way classical and quantum computers process information, we hypothesize that machine learning on quantum computers will lead to results competitive (within 10%) with the classical approaches.
3.4. Limitations
We encountered various problems while using the quantum computing toolkits. In shifting from the classical approach to the quantum approach, we had to reduce the number of features used to train the algorithms, mainly due to the limited quantum computing power available for commercial use. On the IBM Q Experience service, we could get at most 32 qubits for simulation and at most 16 qubits to run an algorithm on a real quantum device. In both the QSVM [5] and QNN [8] approaches, the number of qubits must equal the number of features to be trained. Our initial feature vector for a drug-target pair, used in the classical machine learning approaches, was of size 1 x 1821; it therefore had to be reduced to the number of qubits we wanted to use.
To reduce the number of features, we had to select the highest-ranking features from the given feature vector. This was implemented using Principal Component Analysis (PCA) [15]. PCA is a technique for dimensionality reduction, that is, for reducing the dimensions of the feature space. Feature elimination by dimensionality reduction helps retain the simplicity of the data and maintains the interpretability of the variables. One drawback of PCA is that we also lose whatever benefits the eliminated original features would have brought compared to the reduced set of dimensions. The first principal component accounts for the maximum variability in the data, and each succeeding component accounts for the remaining variability. PCA is sensitive to the relative scaling of the original variables; scaling the data to the range [-1, 1] provided the best results. This rescaling, however, changes some important descriptor values, which in their original form might have revealed different patterns of correlation.
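A minimal sketch of this reduction step with scikit-learn, assuming a random stand-in feature matrix and a 16-qubit budget:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler

    X = np.random.rand(200, 1821)  # illustrative drug-target feature matrix
    n_qubits = 16                  # qubit budget on a real IBM Q device

    # Scale to [-1, 1] (PCA is scale-sensitive), then keep one principal
    # component per available qubit.
    X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    X_reduced = PCA(n_components=n_qubits).fit_transform(X_scaled)
    print(X_reduced.shape)         # (200, 16)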
3.5. Simulators
Quantum simulators are controllable quantum systems that can be used to simulate other quantum systems. They let us tackle quantum computing problems on a classical computer through an interface where quantum data is converted to classical data and vice versa, forming a bridge between classical-classical algorithms and classical-quantum or quantum-classical algorithms. Many toolkits provide quantum simulators, such as qiskit by IBM and PennyLane by Xanadu. Quantum simulators behave like real-world quantum computers, but they run on classical machines.
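For instance, a small entangling circuit can be executed on qiskit's local Aer simulator instead of real hardware; this sketch uses the current qiskit-aer API, which postdates the paper:

    from qiskit import QuantumCircuit
    from qiskit_aer import AerSimulator

    qc = QuantumCircuit(2, 2)
    qc.h(0)                      # put qubit 0 in superposition
    qc.cx(0, 1)                  # entangle qubits 0 and 1
    qc.measure([0, 1], [0, 1])

    # Run the circuit on the local simulator and tally the outcomes.
    counts = AerSimulator().run(qc, shots=1024).result().get_counts()
    print(counts)                # roughly half '00' and half '11'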
3.6. Results
In our initial hypothesis, we expected competitive results for the QSVM and QNN implementations compared with their classical counterparts, i.e., SVM and DNN. But the results were not competitive at all; the QSVM and QNN did not perform anywhere close to their classical counterparts. Table 2 shows the accuracy under each condition. This is likely due to the limit on the number of features one can select in a quantum computing approach, which is bounded by the number of usable qubits. Quantum supremacy is still quite far away, but we can hope that once it is achieved, we will see better results for drug target identification using quantum computing algorithms.
4. Discussion
In this work, we focus on a type of application that promises impactful results through the use of quantum computing. It is important to mention that we focus only on a single subarea of drug discovery, Drug-Target Identification; other subareas such as hit search or lead search could also have a great impact. Future work can explore the algorithm migration process for other drug discovery techniques and for other fields of study.

Table 2. Experimental results of classical and quantum approaches.

Approach    Machine          Algorithm   Accuracy
Classical   Google Colab     SVM         55.0%
Classical   Google Colab     DNN         73.0%
Quantum     Quantum Local    QSVM        52.5%
Quantum     Quantum Remote   QSVM        47.5%
Quantum     Quantum Local    QNN         49.8%
Our hypothesis that the quantum results would be competitive was not confirmed. This was mainly due to the limited data the algorithms could process, given the limited number of qubits. The implemented QSVM algorithm used one qubit per feature, so the number of features was limited by the device that executed it, ranging from 5 to 32 qubits depending on the IBM Q device. The same limitation affected the neural network implementation, which used the Xanadu PennyLane library on a Rigetti Forest environment with ten qubits. A fairer experiment would use the same number of features as the classical approach, but the drug discovery subareas are, in general, highly demanding in features.
Although the results were not competitive, some lessons were learned that could reduce the learning curve of future implementations. Techniques for selecting and reducing features well are indispensable in these limited environments. The quantum machines available in the cloud use task queues, so executions can take a long time, not because of processing time but because of the time these devices need to prepare and dispatch queued jobs. Note also that simulators consume large amounts of memory, depending on the input data; it is worth testing with different data volumes to prevent memory saturation.
It is not yet clear which types of implementations benefit most from quantum machine learning. Future work can compare techniques across different areas and find which front is closest to quantum supremacy. This can lead future research toward a simple baseline for experimentation, of the kind well known in other areas, such as Iris for machine learning, MNIST for character recognition, or CIFAR for object classification in images.
5. Conclusions
Quantum computing is an area that promises to solve the great problems of humanity by achieving performance far superior to today's computers. Although no case has yet been demonstrated in which quantum computing exceeds classical computing at a specific task (quantum supremacy), scientists in the area assure us that we are close. In this work, we set out to understand how these types of algorithms work and how conventional algorithms can be converted to quantum ones. To demonstrate the process of transforming a solution, we focused on a case study that promises impactful results, drug discovery, and within it the subarea of Drug-Target Identification. This type of problem requires classification algorithms, which in classical computing are usually machine learning algorithms. To demonstrate how to port machine learning algorithms to quantum machine learning, we experimented with a conventional machine learning method (Support Vector Machine) and a deep learning method (Multi-Layer Perceptron). In this paper, we show the process for both the conventional and quantum solutions step by step, from the preprocessing of information to training and evaluation. The quantum algorithms were designed to run in variational circuits, and the experiments were performed on quantum computing simulators and on quantum cloud services. As expected, the classical solutions produced superior results. The contribution of this work is documenting the migration process, its requirements, and the limitations of current quantum systems.
6. Acknowledgement
We thank the School of AI for their support. This work was supported by the School of AI Research Fellowship Program.
References
[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Y. Cao, J. Romero, and A. Aspuru-Guzik. Potential of quantum computing for drug discovery. IBM Journal of Research and Development, 62(6):6–1, 2018.
[3] J. Dong, Z.-J. Yao, L. Zhang, F. Luo, Q. Lin, A.-P. Lu, A. F. Chen, and D.-S. Cao. PyBioMed: a Python library for various molecular representations of chemicals, proteins and DNAs and their interactions. Journal of Cheminformatics, 10(1):16, 2018.
[4] A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100–D1107, 2011.
[5] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209, 2019.
[6] J. J. Irwin and B. K. Shoichet. ZINC: a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 45(1):177–182, 2005.
[7] K. I. Kaitin. Deconstructing the drug development process: the new face of innovation. Clinical Pharmacology & Therapeutics, 87(3):356–361, 2010.
[8] N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd. Continuous-variable quantum neural networks. arXiv preprint arXiv:1806.06871, 2018.
[9] S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2015.
[10] G. Landrum. RDKit documentation. Release, 1:1–79, 2013.
[11] H. Moriwaki, Y.-S. Tian, N. Kawashita, and T. Takagi. Mordred: a molecular descriptor calculator. Journal of Cheminformatics, 10(1):4, 2018.
[12] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[13] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
[14] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, and J. Woolsey. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34(suppl 1):D668–D672, 2006.
[15] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.