Data Analytics for Protein Crystallization
Abstract
This unique text/reference presents an overview of the computational aspects of protein crystallization, describing how to build robotic high-throughput and crystallization analysis systems. The coverage encompasses the complete data analysis cycle, including the set-up of screens by analyzing prior crystallization trials, the classification of crystallization trial images by effective feature extraction, the analysis of crystal growth in time series images, the segmentation of crystal regions in images, the application of focal stacking methods for crystallization images, and the visualization of trials.
Topics and features:
• Describes the fundamentals of protein crystallization, and the scoring and categorization of crystallization image trials
• Introduces a selection of computational methods for protein crystallization screening, and the hardware and software architecture for a basic high-throughput system
• Presents an overview of the image features used in protein crystallization classification, and a spatio-temporal analysis of protein crystal growth
• Examines focal stacking techniques to avoid blurred crystallization images, and different thresholding methods for binarization or segmentation
• Discusses visualization methods and software for protein crystallization analysis, and reviews alternative methods to X-ray diffraction for obtaining structural information
• Provides an overview of the current challenges and potential future trends in protein crystallization
This interdisciplinary work serves as an essential reference on the computational and data analytics components of protein crystallization for the structural biology community, in addition to computer scientists wishing to enter the field of protein crystallization.
Dr. Marc L. Pusey is a Research Scientist at iXpressGenes, Inc., Huntsville, AL, USA. Dr. Ramazan Savaş Aygün is an Associate Professor in the Computer Science Department of the University of Alabama in Huntsville, USA.
Chapters (11)
This chapter reviews the basics of the protein crystallization process. As amply proven by the protein structure initiative, protein crystallization can be carried out without any basic knowledge about the specific protein or how it behaves in solution. However, when the goal is not just processing as many proteins as can be produced, but is directed toward a better understanding of a specific biological moiety, a better understanding of what is being done, what one is observing, and how they all relate to the crystal nucleation and growth process is an invaluable aid in translating the observed screening results to a successful outcome. Informed observation is a key component to increased success. Similarly, there are a plethora of approaches that can be taken to screening for crystals, and knowing the strengths and weaknesses of each is key to matching them to the immediate goals to be achieved.
The practice of scoring of protein crystallization screening results is more honored in the breach than in the observance. However, as we hope to show in the balance of this treatise, it can lead to a means for extracting more information than immediately apparent from a crystallization experiment. Scoring has advantages beyond simple good scientific note-keeping practice; the act of objectively examining one’s results, with some thought added, can lead to a deeper appreciation of what led to those results, be it at the protein, screening solution, or mechanics of setting up the plate level. The first goal is to have a system which reflects an increase in the desirability of the results obtained with the numerical score. The scoring scale does not have to be complex or extensive; a 10-point scale is elaborated on herein. However, the scale should clearly distinguish between classes of desirable outcomes.
The goal of protein crystallization screening is to determine the main factors that matter for crystallizing the protein under investigation. Screening is often expanded to many hundreds or thousands of conditions to maximize coverage of the combinatorial chemical space and thereby the chances of a successful (crystalline) outcome. Available commercial screens may not generate crystalline conditions for some proteins that are difficult to crystallize. Nevertheless, previous crystallization trials can be analyzed to recommend screens with crystalline conditions. This chapter presents computational methods for protein crystallization screening.
Protein crystallization is a complex phenomenon requiring thousands of experiments corresponding to different crystallization conditions for successful crystallization. In recent years, high-throughput robotic setups have been developed to automate the protein crystallization experiments, and imaging techniques are used to monitor the crystallization progress. Having an automated system to classify the images according to the crystallization phases can be very useful to crystallographers. This chapter describes the design and implementation of a stand-alone, low-cost, and real-time system for protein crystallization image acquisition and classification with a goal to assist crystallographers in scoring crystallization trials.
A large number of features are extracted from protein crystallization trial images to improve the accuracy of classifiers that predict the presence of crystals or the phases of the crystallization process. The excessive number of features, and the computationally intensive image processing methods needed to extract them, make automated classification tools inconvenient on stand-alone computing systems because of the time required to complete the classification tasks. In this chapter, we provide an analysis of combinations of image feature sets, feature reduction, and classification techniques for crystallization images benefiting from trace fluorescent labeling. Features are categorized into intensity, graph, histogram, texture, shape-adaptive, and region features (using binarized images generated by Otsu’s, green percentile, and morphological thresholding). The effects of normalization, feature reduction with principal components analysis (PCA), and feature selection using a random forest classifier are also investigated. Moreover, the time required to extract each feature category is computed, and an estimated extraction time is provided for combinations of feature categories. The analysis in this chapter shows that research groups can select features according to their hardware setups for real-time analysis.
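As an illustration of how a region feature could be derived from a binarized image, the following is a minimal sketch of Otsu's thresholding in plain Python. It is not the chapter's implementation: the function names, the flat list of 8-bit gray levels, and the single "foreground fraction" feature are assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the background class, the rest the foreground.
    `pixels` is assumed to be a flat list of integers in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the background class so far
    sum0 = 0  # sum of gray levels in the background class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no background pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no foreground pixels left
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def foreground_fraction(pixels, t):
    """Hypothetical region feature: share of pixels brighter than t."""
    return sum(1 for p in pixels if p > t) / len(pixels)


# Toy image: half dark background, half bright (fluorescent) foreground.
toy = [10] * 50 + [200] * 50
t = otsu_threshold(toy)
fraction = foreground_fraction(toy, t)
```

A real pipeline would compute many such features per image and time each category, which is the trade-off the chapter analyzes.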
In recent years, high-throughput robotic setups have been developed to automate protein crystallization experiments, and imaging techniques are used to monitor the crystallization progress. Images are collected multiple times during the course of an experiment. The huge number of collected images makes manual review tedious and discouraging. In this chapter, utilizing trace fluorescent labeling, we describe an automated system for monitoring protein crystal growth in crystallization trial images by analyzing time sequence images. Given the sets of image sequences, the objective is to develop an efficient and reliable system to detect crystal growth changes such as new crystal formation and an increase in crystal size. This system consists of three major steps: identification of crystallization trials suitable for spatiotemporal analysis, spatiotemporal analysis of the identified trials, and crystal growth analysis.
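The two growth changes named above (new crystal formation and increase in crystal size) can be illustrated with a small sketch that compares binary masks from two time points. This is an assumed simplification, not the chapter's method: it presumes the images are already binarized and aligned, and the function names are hypothetical.

```python
def count_regions(mask):
    """Return (number of 4-connected foreground regions, foreground area)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions, area = 0, 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                regions += 1
                seen[r][c] = True
                stack = [(r, c)]
                # Iterative flood fill over the 4-neighbourhood.
                while stack:
                    y, x = stack.pop()
                    area += 1
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return regions, area


def describe_growth(prev_mask, next_mask):
    """Summarize the change between two aligned binary masks."""
    n0, a0 = count_regions(prev_mask)
    n1, a1 = count_regions(next_mask)
    return {"new_crystals": max(n1 - n0, 0), "area_change": a1 - a0}


# Toy sequence: one crystal grows by a pixel and a second one appears.
t0 = [[0, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 0]]
t1 = [[0, 0, 0, 1],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
change = describe_growth(t0, t1)
```

Counting regions alone cannot tell a new crystal from two crystals merging, so a fuller analysis would also track region overlap between frames.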
Automated analysis of protein crystallization images is an important research area. For proper analysis of microscopic images, it is necessary to have all objects in good focus. If objects in a scene (or specimen) appear at different depths with respect to the camera’s focal point, objects outside the depth of field usually appear blurred. Therefore, scientists capture a collection of images with different depths of field. Each of these images can have different objects in focus. Focal stacking is a technique for creating a single focused image from a stack of images collected with different depths of field. In this chapter, we analyze focal stacking techniques that are suitable for trace fluorescently labeled protein crystallization images but are also applicable to images captured under white light.
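One simple family of focal stacking methods picks, for every pixel, the value from the slice that is locally sharpest. The sketch below uses the absolute 4-neighbour Laplacian as the focus measure; this choice, the function names, and the nested-list image representation are assumptions for illustration and are not claimed to be the techniques analyzed in the chapter.

```python
def laplacian_abs(img, y, x):
    """Absolute 4-neighbour Laplacian at (y, x); borders are clamped."""
    h, w = len(img), len(img[0])
    up = img[max(y - 1, 0)][x]
    down = img[min(y + 1, h - 1)][x]
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    return abs(up + down + left + right - 4 * img[y][x])


def focal_stack(stack):
    """Fuse same-sized grayscale images by per-pixel maximum sharpness."""
    h, w = len(stack[0]), len(stack[0][0])
    fused = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pick the slice with the strongest local contrast at this pixel.
            sharpest = max(stack, key=lambda img: laplacian_abs(img, y, x))
            fused[y][x] = sharpest[y][x]
    return fused


# Toy stack: slice_a has detail on the left, slice_b on the right;
# the flat value 5 stands in for a blurred (out-of-focus) area.
slice_a = [[9, 0, 5, 5],
           [0, 9, 5, 5]]
slice_b = [[5, 5, 9, 0],
           [5, 5, 0, 9]]
fused = focal_stack([slice_a, slice_b])
```

Per-pixel selection is sensitive to noise; practical variants usually smooth the focus measure over a window before choosing a slice.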
In general, a single thresholding technique is developed or enhanced to separate foreground objects from the background for a domain of images. This approach may not generate satisfactory results for all images in a dataset, since different images may require different types of thresholding methods for proper binarization or segmentation. To overcome this problem, this chapter explains the “super-thresholding” method, which utilizes a supervised classifier to decide on an appropriate thresholding method for a specific image. This method provides a generic framework that allows selection of the best thresholding method among different thresholding techniques that are beneficial for the problem domain. A classifier model is built using features extracted either a priori, from the original image only, or a posteriori, by analyzing the outputs of the thresholding methods together with the original image. This model is applied to identify the thresholding method for new images of the domain.
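The idea of letting a supervised classifier choose the thresholding method can be sketched as follows. Everything concrete here is an assumption made for the example: the two candidate methods, the a priori features (mean and standard deviation), the 1-nearest-neighbour classifier, and the two hand-made training examples. The chapter's actual features and classifier are not specified in this abstract.

```python
import math


def image_features(pixels):
    """A priori features computed from the original image only."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    return (mean, std)


def mean_threshold(pixels):
    return sum(pixels) / len(pixels)


def percentile_threshold(pixels, percent=90):
    ordered = sorted(pixels)
    index = min((percent * len(ordered)) // 100, len(ordered) - 1)
    return ordered[index]


# Candidate thresholding methods the classifier can choose from.
METHODS = {"mean": mean_threshold, "percentile": percentile_threshold}


def choose_method(pixels, training):
    """1-nearest-neighbour choice; training is [(features, method_name), ...]."""
    f = image_features(pixels)
    _, name = min(training, key=lambda item: math.dist(item[0], f))
    return name


def super_threshold(pixels, training):
    """Pick a method for this image, then binarize with it."""
    name = choose_method(pixels, training)
    t = METHODS[name](pixels)
    return name, [1 if p >= t else 0 for p in pixels]


# Hypothetical training set: feature vectors labeled with the method an
# expert judged best for images that look like that.
training = [((50.0, 10.0), "mean"), ((20.0, 60.0), "percentile")]
name, mask = super_threshold([0] * 9 + [200], training)
```

An a posteriori variant would instead run every candidate method first and feed features of each output mask to the classifier.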
As high-throughput crystallization screening and analysis systems automate the processes from setting up plates to scoring, thousands of experiments can be conducted in a short time. In the past, analysis of crystallization trial experiments has been cumbersome because of the physical environment: an expert needs to look at crystallization trial images one by one using a microscope, with the majority of experiments likely to yield unsuccessful outcomes. Visualizing crystallization experiments on a display, with some information highlighted and with annotation capability, can provide experts with a user-friendly and shared environment for collaborative analysis. In this chapter, we summarize the methods used and the information displayed by various visualization software for protein crystallization analysis.
There are more ways of gaining insight into macromolecular structure than X-ray diffraction. Like X-ray diffraction, some of these are based on the generation of ordered arrays of the molecule to be studied. For many reasons, based on either the protein or its function, this is not always possible. Others, some of which are currently enjoying a marked increase in popularity, do not require crystals. Many of these come with the added advantage that they can be used to capture reaction intermediates and/or enable the experimenter to observe changes in specific amino acids, which is often not possible with X-ray diffraction methods. This chapter divides into two sections; those methods that can be used to obtain a 3D structure (neutron diffraction, cryogenic electron microscopy, nuclear magnetic resonance, and X-ray free electron laser diffraction) and those that are suitable for more general structural information (chemical cross linking, fluorescence resonance energy transfer, circular dichroism). Virtually all of the methods discussed below can be expanded for the study of other aspects of macromolecular structure-function relationships, and some, such as fluorescence and chemical cross linking, are a subset of a rich methodology for the study of macromolecules.
This book covers the lifecycle of data analytics for protein crystallization. A wide range of topics, from setting up screens to identifying macromolecular structure, has been covered. In earlier chapters, the state of the art and effective low-cost, real-time techniques for protein crystallization analysis have been provided. This chapter presents some of the challenges and future directions for protein crystallization.
... This process implies the manual scan of all data by the expert and makes the classifier useless. In our previous research on protein crystallization analysis, our crystallographer recommended having one more class between the crystal and non-crystal categories, named likely-leads [5][6][7][8]. In this circumstance, the missed crystals hopefully could be classified as likely-leads rather than non-crystals, and the expert could manually review the experiments labeled as likely-leads to avoid any missed crystals without needing to review non-crystals. ...
... We created different sets of handwritten digits from the MNIST data. Here, we used 10 different pairs of digits to create our sub-data sets: MNIST(0-1), MNIST(1-7), MNIST(1-9), MNIST(2-3), MNIST(2-7), MNIST(3-5), MNIST(4-9), MNIST(6-9), MNIST(7-8), and MNIST(8-9). While some selected pairs may contain data easy to classify (e.g., 0-1), they also have a few confusing cases. ...
... The rate of reject samples is shown in Fig. 9. On average, the WisdomNet rejects less than 4% of the test data. For the data sets MNIST(2-3), MNIST(2-7), MNIST(3-5), and MNIST(4-9), the reject percentage is above 4%. The MNIST(2-3) data set has the most rejected samples with less than 10%. ...
Misclassification is a critical problem in many machine learning applications. Since even classifier models with high accuracy (e.g., > 95%) still introduce some misclassification error, it may not be possible to rely on the output of a classifier. In this paper, we introduce trustable learning, which prompts the learning model to yield only the true output, thus avoiding misclassifications. Whenever the model cannot decide the output accurately, it should indicate that there could be a misclassification error if it is forced to classify, and hence it should decline to make a decision or defer it to a human expert. Therefore, we develop a methodology for trustable learning and apply it to artificial neural networks and show that it is possible to develop a classifier with 0% misclassification error. We propose a novel neural network architecture named WisdomNet that could provide zero prediction error by introducing an additional neuron, named the conjugate neuron, that indicates whether the network is able to classify the data correctly or not. The WisdomNet architecture can be applied to any previously built model, and we have evaluated WisdomNet with several network architectures, such as multilayer perceptron, convolutional neural network, and deep network, on different data sets. The results show that the WisdomNet is able to reduce the classification error rate to 0%, while labeling the data that is difficult to classify as ‘reject’ at a low rate of within around 10%.
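The trade-off between error rate and reject rate can be illustrated with a generic reject option based on a confidence threshold. This is deliberately not the WisdomNet conjugate neuron, whose training procedure is not described in this abstract; it is a simpler, commonly used stand-in, and the function names, the threshold of 0.9, and the toy probabilities are assumptions.

```python
REJECT = "reject"


def predict_with_reject(probabilities, labels, min_confidence=0.9):
    """Return the most probable label, or REJECT if confidence is too low."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < min_confidence:
        return REJECT
    return labels[best]


def evaluate(predictions, truths):
    """Return (error rate among accepted samples, reject rate)."""
    accepted = [(p, t) for p, t in zip(predictions, truths) if p != REJECT]
    errors = sum(1 for p, t in accepted if p != t)
    error_rate = errors / len(accepted) if accepted else 0.0
    reject_rate = 1 - len(accepted) / len(predictions)
    return error_rate, reject_rate


# Toy class probabilities for a hypothetical 4-versus-9 digit classifier.
labels = ["4", "9"]
outputs = [[0.98, 0.02], [0.55, 0.45], [0.05, 0.95]]
truths = ["4", "9", "9"]
predictions = [predict_with_reject(p, labels) for p in outputs]
error_rate, reject_rate = evaluate(predictions, truths)
```

Unlike the architecture described above, a fixed confidence threshold gives no guarantee of 0% error: a confidently wrong prediction is still accepted.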
... Even in such facilities, however, pictures of crystallization drops are evaluated manually. Several attempts have been made so far to automate the evaluation of crystallization drops by an image recognition method (Pusey and Aygün 2017). Given that each facility uses a different imager, crystallization tray, lighting system, and so on (Figure 1), and because each of these parameters severely affects image recognition, it is difficult to recognize images with high accuracy and versatility. ...
Recently, deep convolutional neural networks (CNNs) have shown good results for image recognition. In this paper, we use convolutional neural networks with a finder module, which discovers the region that is important for recognition and extracts it. We propose applying our method to the recognition of protein crystals for X-ray structural analysis. In this analysis, it is necessary to recognize the states of protein crystallization from a large number of images. There are several methods that perform protein crystallization recognition using convolutional neural networks. Each of these methods requires a large-scale data set to achieve high accuracy. In our data set, the number of images is not sufficient for training a CNN. The amount of data available for training CNNs is a serious issue in various fields. Our method achieves high-accuracy recognition with few images by discovering the region where the crystallization drop exists. We compared our crystallization image recognition method with a high-precision method using Inception-V3. We demonstrate in several experiments that our method is effective for crystallization images. Our method achieved an AUC value about 5% higher than that of the compared method.
The crystallization of biological macromolecules like proteins is an important process for studying their molecular structures. The quality of the crystals is critical for determining their structures using methods such as X-ray crystallography. Therefore, many wet-lab experiments are conducted using numerous screening plates to obtain successful crystal growth. High-throughput microscopy is useful for quickly collecting images from the screening plates. Since automated imaging systems require high-end instrumentation, they are costly. This study investigates a small-scale mobile fluorescence imaging system and application. Our system is composed of a mobile imaging system, a mobile app to capture images from plates, and a machine learning model to recognize the presence of crystals in images. For fluorescence imaging, we present an assembly of a smartphone or tablet integrated with a macro lens tube and illumination LEDs. The system presented in this study has a magnification range from 20x to 250x macro. For the recognition of crystals, a convolutional neural network model was trained on a computer and then deployed on the mobile app. A data set of 1000 trace fluorescently labeled images was used to train and evaluate the model. The accuracy on the hold-out testing images was about 95%. The mobile app for imaging and protein recognition was developed to run on Apple iOS devices. To evaluate the system further, the recombinant inorganic pyrophosphatase protein from Klebsiella pneumoniae, which was expressed in E. coli, was crystallized using the trace fluorescent labeling method. Our system can capture quality images of protein crystals under both white light and fluorescence illumination. The overall accuracy of recognizing crystal or non-crystal outcomes on the pilot test is about 93%. This mobile imaging system can be useful for small research groups and students.
In high-throughput systems, the crystallization experiments require the inspection and analysis of a large number of trial images. The visualization and analysis tools are needed to view and analyze the experimental results, and recommend novel crystalline conditions by analyzing prior results. It is essential to integrate all these components into a single system. Therefore, we developed Visual-X2, an interactive visualization software developed to aid the user for quick and efficient visualization and analysis of the results of the experiments. Visual-X2 has a number of useful features for visualization and analysis: dual plate view (thumbnail and symbolic), detailed well view with scoring option, multiple-scan and time-course views, support for screening analysis based on multiple screens, three novel screen analysis methods (associative experimental design, GenScreen, and novelty methods), and generating pipetting file with a family of conditions varying concentrations based on stock concentration.
Protein crystallization screening helps determine factors (e.g., salts, pH of buffers, ionic strengths, temperature, and type of precipitants) that are favorable for the formation of a large protein crystal suitable for X-ray crystallography. While existing commercial screens may not generate crystalline outcomes for difficult proteins, their outcomes could be used for recommending novel screens. Current methods for protein crystallization screening, such as associative experimental design (AED), process only cocktails having one chemical per reagent while ignoring cocktails with multiple chemicals per reagent. To analyze cocktails having multiple chemicals per reagent, we propose enhanced associative experimental design (AED⁺), which recommends novel crystallization conditions by analyzing the content of successful preliminary crystallization conditions. In wet lab experiments, AED⁺ yielded ten new crystalline conditions for Tt189 (nucleoside diphosphate kinase) in addition to 20 crystalline conditions generated by AED. Moreover, AED⁺ allows pairing of a crystalline or likely-lead outcome with a non-crystalline outcome to generate novel crystalline conditions, overcoming the limitation of AED, which requires at least two good cocktails having at least one common reagent.
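The requirement that two good cocktails share at least one reagent suggests a recombination step along the following lines. This is a loose sketch under stated assumptions, not the published AED or AED⁺ algorithm: cocktails are reduced to one chemical per role, the three roles and the chemical names are hypothetical examples, and concentration and pH handling is omitted entirely.

```python
from itertools import combinations, product

ROLES = ("precipitant", "salt", "buffer")  # assumed reagent roles


def as_tuple(cocktail):
    return tuple(cocktail[role] for role in ROLES)


def recombine(good_cocktails, tried):
    """Suggest untried conditions by mixing reagents of pairs of good
    cocktails that share at least one reagent."""
    candidates = set()
    for a, b in combinations(good_cocktails, 2):
        if not any(a[role] == b[role] for role in ROLES):
            continue  # this pair has no common reagent, so it is skipped
        # Every way of taking each role's chemical from either cocktail.
        for combo in product(*[{a[role], b[role]} for role in ROLES]):
            if combo not in tried:
                candidates.add(combo)
    return candidates


# Hypothetical crystalline conditions sharing the same precipitant.
good = [
    {"precipitant": "PEG 3350", "salt": "NaCl", "buffer": "Tris pH 8"},
    {"precipitant": "PEG 3350", "salt": "MgCl2", "buffer": "HEPES pH 7"},
]
tried = {as_tuple(c) for c in good}
suggestions = recombine(good, tried)
```

Here the two suggestions swap the salt and buffer between the parents; handling several chemicals per reagent, as AED⁺ does, would need a richer cocktail representation.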
Misclassification has a high cost in biological research studies such as protein crystallization. For drug development, the 3D structure of a protein is obtained by first crystallizing the protein. Hence, missing a crystalline condition may hinder the development of a drug. It is important to develop classification algorithms that avoid or minimize misclassifications. Traditional decision tree classifiers are based on an impurity measure that identifies the most informative attribute to be selected at the early levels of a decision tree. The class labels are chosen based on the majority of class labels at a leaf node. We introduce a novel decision tree classifier, else-tree, which analyzes pure regions or ranges of an attribute per class. After identifying the longest or most populated contiguous range per class, the rest of the ranges are fed into the else branch of the decision tree. Only conflicting or doubtful samples are passed to the lower levels of the decision tree. The else-tree does not necessarily assign a class to samples that are difficult to classify. We have used our protein crystallization trials data and three other publicly available datasets to evaluate else-tree. The experiments show that the else-tree may reduce the misclassification rate to 0% by labeling difficult samples as undecided when the training set is a good representation of the dataset.
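A single level of the pure-range idea can be sketched for one numeric attribute as follows. This is an assumed simplification rather than the published else-tree: it keeps only the most populated pure run per class, labels everything else "undecided" instead of passing it to lower tree levels, and assumes attribute values are distinct. The function names and toy data are hypothetical.

```python
def pure_ranges(values, labels):
    """For each class, find the most populated contiguous run of same-class
    samples along the sorted attribute and return it as (low, high)."""
    ordered = sorted(zip(values, labels))
    best = {}  # class -> (count, low, high)
    i = 0
    while i < len(ordered):
        j = i
        # Extend the run while the next sample has the same class.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        cls, count = ordered[i][1], j - i + 1
        if cls not in best or count > best[cls][0]:
            best[cls] = (count, ordered[i][0], ordered[j][0])
        i = j + 1
    return {cls: (low, high) for cls, (_, low, high) in best.items()}


def classify(value, ranges):
    """Assign a class only inside a pure range; otherwise stay undecided."""
    for cls, (low, high) in ranges.items():
        if low <= value <= high:
            return cls
    return "undecided"  # the else branch in this one-level sketch


# Toy attribute (e.g. a normalized image feature) with a mixed middle zone.
values = [0.1, 0.2, 0.3, 0.5, 0.55, 0.6, 0.8, 0.9]
labels = ["non", "non", "non", "crystal", "non", "crystal", "crystal", "crystal"]
ranges = pure_ranges(values, labels)
```

In this toy data the mixed zone between 0.3 and 0.6 stays undecided, which mirrors the behavior of deferring doubtful samples rather than forcing a label.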
A protein crystallization well plate is a rectangular platform that contains wells usually organized as a grid. The crystallization conditions are studied through a screening process by setting up the trial conditions in the well plate. Traditionally, the expert evaluates the trial wells for the growth of crystals either by manually viewing the plate under a microscope or by using a high-throughput plate imaging and analysis system. While the first method is tedious and cumbersome, the second requires financial investment. Recently, a few approaches have been developed that collect images using smartphones, thus enabling low-cost automatic scoring (classification) of well images. Nevertheless, these recent methods do not detect which well on the plate is captured. With a smartphone, the user may capture or scan any well by simply moving the smartphone to the corresponding well. In this paper, we propose a mobile scanner that identifies the well by using a coded template under the well plate. The mobile scanner provides two modes: image and video. Image mode is used for single-well analysis, whereas video mode is used to scan the complete plate. In video mode, the mobile scanner app generates a tilemap of the plate.