ABSTRACT: Key message:
By using the genotyping-by-sequencing method, it is feasible to characterize genomic relationships directly at the level of family pools and to estimate genomic heritabilities from phenotypes scored on family pools in outbreeding species. Genotyping-by-sequencing (GBS) has recently become a promising approach for characterizing plant genetic diversity on a genome-wide scale. We use GBS to extend the concept of heritability beyond individuals by genotyping family-pool samples and computing genomic relationship matrices (GRMs) and genomic heritabilities directly at the level of family pools from pool frequencies obtained by sequencing. The concept is of interest for species where breeding and phenotyping are not done at the individual level but operate uniquely at the level of (multi-parent) families. As an example, we demonstrate the approach using a set of 990 two-parent F2 families of perennial ryegrass (Lolium perenne). The families were phenotyped as family units in field plots for heading date and crown rust resistance. A total of 728 K single nucleotide polymorphism (SNP) variants were available and were divided into groups of different sequencing depths. GRMs based on GBS data showed diagonal values biased upwards at low sequencing depth, while off-diagonals were little affected by sequencing depth. Using variants with high sequencing depth, genomic heritability was 0.33 for crown rust resistance and 0.22 for heading date; these genomic heritabilities were biased downwards when using variants with lower sequencing depth. Broad-sense heritabilities were 0.61 and 0.66, respectively. Underestimation of genomic heritability at lower sequencing depth was confirmed with simulated data. We conclude that it is feasible to use GBS to describe relationships between family pools and to estimate genomic heritability directly at the level of F2 family-pool samples, but estimates are biased at low sequencing depth.
Theoretical and Applied Genetics 09/2015; DOI:10.1007/s00122-015-2607-9 · 3.79 Impact Factor
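As an illustration of the core computation described in the abstract, a genomic relationship matrix can be built directly from family-pool allele frequencies. This is a minimal VanRaden-style sketch assuming simple mean-frequency centering; it does not reproduce the paper's sequencing-depth corrections, and the function name and scaling choice are illustrative assumptions.

```python
import numpy as np

def pool_grm(freq):
    """Genomic relationship matrix (GRM) from family-pool allele frequencies.

    freq: (n_families, n_loci) array of pool allele frequencies in [0, 1].
    Centers each locus by its mean frequency and scales by the summed
    expected variance, in the spirit of a VanRaden-type GRM.
    """
    p = freq.mean(axis=0)          # mean allele frequency per locus
    z = freq - p                   # centered frequencies
    denom = np.sum(p * (1.0 - p))  # expected-variance scaling factor
    return z @ z.T / denom

# toy example: 4 family pools, 6 SNP loci
rng = np.random.default_rng(0)
f = rng.uniform(0.1, 0.9, size=(4, 6))
G = pool_grm(f)
print(G.shape)                     # (4, 4)
```

Because each locus is centered by its mean over families, the rows and columns of the resulting matrix sum to approximately zero; off-diagonals describe between-family relationships and diagonals describe within-family variability.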
ABSTRACT: Eco-routing is a simple yet effective approach to substantially reducing the environmental impact, e.g., fuel consumption and greenhouse gas (GHG) emissions, of vehicular transportation. Eco-routing relies on the ability to reliably quantify the environmental impact of vehicles as they travel in a spatial network. The procedure of quantifying such vehicular impact for the road segments of a spatial network is called eco-weight assignment. EcoMark 2.0 proposes a general framework for eco-weight assignment to enable eco-routing. It studies the abilities of six instantaneous and five aggregated models to estimate vehicular environmental impact. In doing so, it utilizes travel information derived from GPS trajectories (i.e., velocities and accelerations) and actual fuel consumption data obtained from vehicles. The framework covers analyses of actual fuel consumption, impact model calibration, and experiments for assessing the utility of the impact models in assigning eco-weights. The application of EcoMark 2.0 indicates that the instantaneous model EMIT and the aggregated model SIDRA-Running are suitable for assigning eco-weights under varying circumstances. In contrast, the other instantaneous models should not be used for assigning eco-weights, and the other aggregated models can be used for assigning eco-weights under certain circumstances.
ABSTRACT: The papers in this special section were presented at the 29th International Conference on Data Engineering, held in Brisbane, QLD, Australia, on April 8-11, 2013.
IEEE Transactions on Knowledge and Data Engineering 07/2015; 27(7):1739-1740. DOI:10.1109/TKDE.2015.2419315 · 2.07 Impact Factor
ABSTRACT: A moving top-$k$ spatial keyword (M$k$SK) query, which takes into account a continuously moving query location, enables a mobile client to be continuously aware of the top-$k$ spatial web objects that best match a query with respect to location and text relevance. The increasing mobile use of the web and the proliferation of geo-positioning make it relevant to consider a scenario where spatial keyword search is outsourced to a separate service provider capable of handling the voluminous spatial web objects available from various sources. A key challenge is that the service provider may return inaccurate or incorrect query results (intentionally or not), e.g., due to cost considerations or intrusion by hackers. Therefore, it is attractive to be able to authenticate the query results at the client side. Existing authentication techniques are either inefficient or inapplicable to the kind of query we consider. We propose new authentication data structures, the MIR-tree and MIR$^*$-tree, that enable the authentication of M$k$SK queries at low computation and communication costs. We design a verification object for authenticating M$k$SK queries, and we provide algorithms for constructing verification objects and using these for verifying query results. A thorough experimental study on real data shows that the proposed techniques are capable of outperforming two baseline algorithms by orders of magnitude.
IEEE Transactions on Knowledge and Data Engineering 04/2015; 27(4):922-935. DOI:10.1109/TKDE.2014.2350252 · 2.07 Impact Factor
ABSTRACT: A driver's choice of a route to a destination may depend on the route's length and travel time, but a multitude of other, possibly hard-to-formalize, aspects may also factor into the driver's decision. There is evidence that a driver's choice of route is context dependent, e.g., varies across time, and that route choice also varies from driver to driver. In contrast, conventional routing services support little in the way of context dependence, and they deliver the same routes to all drivers. We study how to identify context-aware driving preferences for individual drivers from historical trajectories, and thus how to provide foundations for personalized navigation, but also professional driver education and traffic planning. We provide techniques that are able to capture time-dependent and uncertain properties of dynamic travel costs, such as travel time and fuel consumption, from trajectories, and we provide techniques capable of capturing the driving behaviors of different drivers in terms of multiple dynamic travel costs. Further, we propose techniques that are able to identify a driver's contexts and then to identify driving preferences for each context using historical trajectories from the driver. Empirical studies with a large trajectory data set offer insight into the design properties of the proposed techniques and suggest that they are effective.
The VLDB Journal 02/2015; 24(2). DOI:10.1007/s00778-015-0378-1 · 1.57 Impact Factor
ABSTRACT: Near-infrared spectroscopy (NIRS) is a simple, fast, and low-cost method of high-dimensional phenotyping compared with standard chemical techniques, and it requires no special sample preparation. The aim of this study is to use NIRS data to predict plant traits (e.g., dry matter, protein content, etc.) for the next generation. In total, 1984 NIRS measurements from 995 ryegrass families (first cut) were used. Absorption of radiation in the region of 960-1690 nm, recorded at 2 nm intervals, produced 366 bins representing the NIRS spectrum. The amount of genetic variance and the heritability for each bin were estimated using a mixed model. To use all the information for prediction with 366 bins, a reduction in the number of parameters is necessary. The usual method is to combine principal component analysis (PCA) and partial least squares (PLS). Another method is random regression, which has the advantage that dimension reduction can be applied to different elements in the model. From initial results, heritabilities were between 0.15 and 0.24 for the 366 bins. Phenotypic correlations between bins ranged from 0.85 to 0.99. These correlations indicate that we can reduce the parameter dimension with random regression or PCA.
Plant and Animal Genome XXIII (PAG XXIII), San Diego, CA, USA; 01/2015
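The dimension reduction step mentioned above can be sketched with a minimal SVD-based PCA on a samples-by-bins spectral matrix. This is an illustrative sketch only; the PLS and random-regression models from the abstract are not reproduced, and the function name and toy data are assumptions.

```python
import numpy as np

def pca_scores(spectra, n_components):
    """Project centered NIRS spectra onto their leading principal components.

    spectra: (n_samples, n_bins) absorbance matrix. Returns component
    scores of shape (n_samples, n_components) via an SVD of the
    column-centered data.
    """
    x = spectra - spectra.mean(axis=0)             # center each bin
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T                 # scores on top components

rng = np.random.default_rng(1)
spectra = rng.normal(size=(20, 366))               # 20 samples, 366 bins
scores = pca_scores(spectra, 5)
print(scores.shape)                                # (20, 5)
```

The resulting score columns are mutually orthogonal, so downstream models can treat the handful of components as uncorrelated predictors in place of the 366 highly correlated bins.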
ABSTRACT: Genotyping by sequencing (GBS) allows generating up to millions of molecular markers at a cost per sample that is proportional to the level of multiplexing.
Increasing the sample multiplexing decreases the genotyping price but also reduces the number of reads per marker. In this work we investigated how this reduction of the coverage depth affects the genomic relationship matrices used to estimate breeding values of F2 family pools in perennial ryegrass.
A total of 995 families were genotyped via GBS, providing more than 1.8M allele frequency estimates for each family with an average coverage depth of 12.6 per marker. Simulated datasets with progressively reduced depth showed an increasing level of missing values together with an overestimated genetic variance caused by inflated diagonals in the genomic relationship matrix.
To address these drawbacks, we first showed how to correct the diagonal elements by estimating the amount of genetic variance caused by the reduction of the coverage depth. Second, we developed a method to scale the relationship matrix by taking into account the overall number of pairwise non-missing loci between all families.
Seed yield and heading date were chosen as example traits to show that these two procedures can considerably mitigate the loss of accuracy in predicted breeding values associated with decreasing coverage depth. These findings will allow increased sample multiplexing in GBS assays, reducing the genotyping cost or increasing the size of the population under analysis for the same cost.
Plant and Animal Genome XXIII (PAG XXIII), CA, USA; 01/2015
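The idea of scaling the relationship matrix by pairwise non-missing loci can be sketched as follows. This is a hypothetical illustration, not the authors' exact correction: missing calls are encoded as NaN, each family pair's cross-product is normalized by its count of jointly observed loci, and the result is rescaled to a full-locus basis.

```python
import numpy as np

def grm_pairwise_scaled(freq):
    """GRM from pool allele frequencies with missing values (NaN),
    normalizing each family pair by its count of jointly observed loci.

    freq: (n_families, n_loci) array; missing allele-frequency calls
    are NaN. Hypothetical sketch of pairwise non-missing scaling.
    """
    p = np.nanmean(freq, axis=0)                  # per-locus mean frequency
    z = np.where(np.isnan(freq), 0.0, freq - p)   # centered; missing -> 0
    obs = (~np.isnan(freq)).astype(float)
    cross = z @ z.T                               # sums over shared loci only
    counts = obs @ obs.T                          # pairwise non-missing counts
    m = freq.shape[1]
    denom = np.sum(p * (1.0 - p))
    return (cross / counts) * m / denom           # rescale to full-locus basis

rng = np.random.default_rng(2)
f = rng.uniform(0.1, 0.9, size=(5, 40))
f[rng.random(f.shape) < 0.2] = np.nan             # ~20% missing calls
G = grm_pairwise_scaled(f)
print(G.shape)                                    # (5, 5)
```

Setting missing entries to zero after centering means they contribute nothing to the cross-products, and dividing by the pairwise counts removes the downward bias that differing amounts of missingness would otherwise introduce between pairs.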
ABSTRACT: Implementation of genomic selection (GS) for the genetic improvement of forage crops, such as perennial ryegrass, requires the establishment of sufficiently large training populations with high-quality phenotype and genotype data. This paper presents estimates of genetic and environmental variance and covariance components, obtained in a training population of 1453 F2 families. These families were produced in 2001, 2003, 2005, and 2007, and they were tested at seven locations throughout Europe. Families were cultivated together with commercial varieties that were used as controls. Analyses focused on forage yield (green and dry matter) and six traits scored by visual inspection (i.e., rust resistance, aftermath heading, spring growth, density, winter hardiness, and heading date). Data were analyzed with linear mixed models, including fixed effects (trial and control varieties, within year and location) and random effects (breeding values, pedigree or parents, repeated effects of family or parents within location, and within-trial environmental effects, to recover interblock information). Results showed very significant genetic variances for all traits, which provide good opportunities for future GS-based breeding programs. Forage yield showed family heritabilities of up to 0.30 across locations and up to 0.60 within a location. Similar or moderately lower values were found for the other traits. In particular, the heritabilities of rust resistance and aftermath heading were very promising. Genetic correlations between traits were generally low but positive, which increases the potential for multitrait selection.
ABSTRACT: The ability to process significant amounts of continuously updated spatial data in a timely manner is mandatory for an increasing number of applications. Parallelism enables such applications to face this data-intensive challenge and allows the devised systems to feature low latency and high scalability. In this paper we focus on a specific data-intensive problem, concerning the repeated processing of huge amounts of range queries over massive sets of moving objects, where the spatial extents of queries and objects are continuously modified over time. To tackle this problem and significantly accelerate query processing, we devise a hybrid CPU/GPU pipeline that compresses the data output and saves query-processing work. The devised system relies on an ad hoc spatial index leading to a problem decomposition that results in a set of independent data-parallel tasks. The index is based on a point-region quadtree space decomposition and makes it possible to effectively handle a broad range of spatial object distributions, even very skewed ones. Also, to deal with the architectural peculiarities and limitations of GPUs, we adopt non-trivial GPU data structures that avoid the need for locked memory accesses and favor coalesced memory accesses, thus enhancing the overall memory throughput. To the best of our knowledge, this is the first work that exploits GPUs to efficiently solve repeated range queries over massive sets of continuously moving objects characterized by highly skewed spatial distributions. In comparison with state-of-the-art CPU-based implementations, our method achieves significant speedups on the order of 14x-20x, depending on the dataset, even on very cheap GPUs.
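The point-region quadtree decomposition named in the abstract can be sketched on the CPU as follows: space is recursively split into four equal quadrants until each leaf holds at most a fixed number of points, and range queries prune whole quadrants that miss the query rectangle. This is a minimal illustrative structure, not the paper's GPU-resident index; the class name and capacity parameter are assumptions.

```python
class PRQuadtree:
    """Point-region quadtree over the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, capacity=4):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []
        self.children = None                 # four equal sub-squares when split

    def insert(self, x, y):
        if self.children is not None:        # internal node: route downward
            self._child(x, y).insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        h = self.size / 2
        self.children = [PRQuadtree(self.x0 + dx * h, self.y0 + dy * h, h,
                                    self.capacity)
                         for dy in (0, 1) for dx in (0, 1)]
        pts, self.points = self.points, []
        for x, y in pts:                     # redistribute into quadrants
            self._child(x, y).insert(x, y)

    def _child(self, x, y):
        h = self.size / 2
        ix = 1 if x >= self.x0 + h else 0
        iy = 1 if y >= self.y0 + h else 0
        return self.children[iy * 2 + ix]

    def range_query(self, qx0, qy0, qx1, qy1):
        """All points inside the axis-aligned rectangle [qx0,qx1]x[qy0,qy1]."""
        if qx1 < self.x0 or qx0 > self.x0 + self.size or \
           qy1 < self.y0 or qy0 > self.y0 + self.size:
            return []                        # quadrant misses the query: prune
        out = [(x, y) for (x, y) in self.points
               if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children is not None:
            for c in self.children:
                out.extend(c.range_query(qx0, qy0, qx1, qy1))
        return out

t = PRQuadtree(0.0, 0.0, 100.0, capacity=2)
for x, y in [(10, 10), (20, 20), (80, 80), (85, 90), (50, 50), (60, 40)]:
    t.insert(x, y)
print(sorted(t.range_query(0, 0, 30, 30)))   # [(10, 10), (20, 20)]
```

Because splits adapt to where points actually fall, dense clusters get finer subdivisions than sparse areas, which is what makes this decomposition robust to the skewed distributions the abstract mentions.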
ABSTRACT: Background: Reed canary grass (Phalaris arundinacea) is an economically important forage and bioenergy grass of the temperate regions of the world. Despite its economic importance, it lacks public genomic data. We explore comparative exomics of the grass cultivars in the context of response to salt exposure. The limited data set poses challenges to the computational pipeline.
Methods: As a prerequisite for the comparative study, we generate the Phalaris reference transcriptome sequence, one of the first steps in addressing the paucity of processed genomic data in this species. In addition, the differentially expressed (DE) and active-but-stable genes under salt stress conditions were analyzed by a novel method that was experimentally verified on human RNA-seq data. For the comparative exomics, we focus on the genome's DE and stable genic regions with respect to salt stress.
Results and conclusions: In our comparative study, we find that the phylogenies of the DE and stable genic regions of the Phalaris cultivars are distinct. At the same time, we find that the phylogeny of the entire expressed reference transcriptome matches the phylogeny of only the stable genes. Thus the behavior of the different cultivars is distinguished by the salt stress response. This is also reflected in the genomic distinctions in the DE genic regions. These observations have important implications for the choice of cultivars, and their breeding, for bioenergy fuels. Further, we identified genes that are representative of DE under salt stress and could provide vital clues to our understanding of stress handling mechanisms in general.
ABSTRACT: The efficient processing of workloads that interleave moving-object updates and queries is challenging. In addition to the conflicting needs for update-efficient versus query-efficient data structures, the increasing parallel capabilities of multi-core processors pose challenges. To prevent concurrency anomalies and to ensure correct system behavior, conflicting update and query operations must be serialized. In this setting, it is a key concern to avoid blocking operations, which would leave processing cores idle. To enable efficient processing, we first examine concurrency degrees from traditional transaction processing in the context of our target domain and propose new semantics that enable a high degree of parallelism and ensure up-to-date query results. We define the new semantics for range and \(k\)-nearest neighbor queries. Then, we present a main-memory indexing technique called parallel grid that implements the proposed semantics as well as two other variants supporting different semantics. This enables us to quantify the effects that different degrees of consistency have on performance. We also present an alternative time-partitioning approach. Empirical studies with the above and three existing proposals conducted on modern processors show that our proposals scale near-linearly with the number of hardware threads and thus are able to benefit from increasing on-chip parallelism.
The VLDB Journal 10/2014; 23(5):817-841. DOI:10.1007/s00778-014-0353-2 · 1.57 Impact Factor
ABSTRACT: Increasing volumes of geo-referenced data are becoming available. This data includes so-called points of interest that describe businesses, tourist attractions, etc. by means of a geo-location and properties such as a textual description or ratings. We propose and study the efficient implementation of a new kind of query on points of interest that takes into account both the locations and properties of the points of interest. The query takes a result cardinality, a spatial range, and property-related preferences as parameters, and it returns a compact set of points of interest with the given cardinality and in the given range that satisfies the preferences. Specifically, the points of interest in the result set cover so-called allying preferences and are located far from points of interest that possess so-called alienating preferences. A unified result rating function integrates the two kinds of preferences with spatial distance to achieve this functionality. We provide efficient exact algorithms for this kind of query. To enable queries on large datasets, we also provide an approximate algorithm that utilizes a nearest-neighbor property to achieve scalable performance. We develop and apply lower and upper bounds that enable search-space pruning and thus improve performance. Finally, we provide a generalization of the above query and also extend the algorithms to support the generalization. We report on an experimental evaluation of the proposed algorithms using real point of interest data from Google Places for Business that offers insight into the performance of the proposed solutions.
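The idea of a unified rating function that combines allying preferences, alienating preferences, and spatial distance can be illustrated with a toy scorer. Everything here is hypothetical: the weights, the formula, and the data layout are assumptions for illustration, not the paper's actual rating function.

```python
import math

def rate_poi(poi, query, all_pois, w_ally=1.0, w_alien=1.0, w_dist=0.1):
    """Toy unified rating for a point of interest.

    poi / all_pois: dicts with 'loc' = (x, y) and 'tags' = set of properties.
    query: dict with 'loc', 'ally' (allying preferences), and 'alien'
    (alienating preferences). Rewards allying-tag matches, penalizes
    closeness to POIs carrying alienating tags, and discounts by distance
    to the query location.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    ally = len(poi['tags'] & query['ally'])              # allying matches
    penalty = sum(1.0 / (1.0 + dist(poi['loc'], o['loc']))
                  for o in all_pois
                  if o is not poi and o['tags'] & query['alien'])
    return (w_ally * ally
            - w_alien * penalty
            - w_dist * dist(poi['loc'], query['loc']))

pois = [
    {'loc': (0, 0), 'tags': {'cafe', 'wifi'}},
    {'loc': (5, 5), 'tags': {'cafe'}},
    {'loc': (5, 4), 'tags': {'construction'}},
]
query = {'loc': (2.5, 2.5), 'ally': {'cafe'}, 'alien': {'construction'}}
best = max(pois, key=lambda p: rate_poi(p, query, pois))
print(best['loc'])   # (0, 0): of two equidistant cafes, the one far from construction
```

With the two cafes equidistant from the query location, only the alienating-proximity penalty differs, so the cafe far from the construction site scores higher.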
ABSTRACT: With the increasing availability of moving-object tracking data, trajectory search and matching is increasingly important. We propose and investigate a novel problem called personalized trajectory matching (PTM). In contrast to conventional trajectory similarity search by spatial distance only, PTM takes into account the significance of each sample point in a query trajectory. A PTM query takes a trajectory with user-specified weights for each sample point in the trajectory as its argument. It returns the trajectory in an argument data set with the highest similarity to the query trajectory. We believe that this type of query may bring significant benefits to users in many popular applications such as route planning, carpooling, friend recommendation, traffic analysis, urban computing, and location-based services in general. PTM query processing faces two challenges: how to prune the search space during the query processing and how to schedule multiple so-called expansion centers effectively. To address these challenges, a novel two-phase search algorithm is proposed that carefully selects a set of expansion centers from the query trajectory and exploits upper and lower bounds to prune the search space in the spatial and temporal domains. An efficiency study reveals that the algorithm explores the minimum search space in both domains. Second, a heuristic search strategy based on priority ranking is developed to schedule the multiple expansion centers, which can further prune the search space and enhance the query efficiency. The performance of the PTM query is studied in extensive experiments based on real and synthetic trajectory data sets.
The VLDB Journal 06/2014; 23(3). DOI:10.1007/s00778-013-0331-0 · 1.57 Impact Factor
ABSTRACT: We consider an application scenario where points of interest (PoIs) each have a web presence and where a web user wants to identify a region that contains PoIs relevant to a set of keywords, e.g., in preparation for deciding where to go to conveniently explore the PoIs. Motivated by this, we propose the length-constrained maximum-sum region (LCMSR) query that returns a spatial-network region that is located within a general region of interest, that does not exceed a given size constraint, and that best matches the query keywords. Such a query maximizes the total weight of the PoIs in it w.r.t. the query keywords. We show that it is NP-hard to answer this query. We develop an approximation algorithm with a (5 + ε) approximation ratio utilizing a technique that scales node weights into integers. We also propose a more efficient heuristic algorithm and a greedy algorithm. Empirical studies on real data offer detailed insight into the accuracy of the proposed algorithms and show that the proposed algorithms are capable of computing results efficiently and effectively.
Proceedings of the VLDB Endowment 05/2014; 7(9):733-744. DOI:10.14778/2732939.2732946
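The weight-scaling step behind approximation ratios like the (5 + ε) above is typically a rounding trick: divide every weight by a small granularity and floor to an integer, bounding the total rounding error. The sketch below is the textbook scheme under that assumption; the paper's exact scaling procedure may differ.

```python
def scale_weights(weights, eps):
    """Round positive real node weights down to small integers.

    With delta = eps * max(weights) / n, flooring each weight to an
    integer multiple of delta loses less than delta per node, so the
    total rounding error stays below eps * max(weights), while the
    integer weights are bounded by n / eps.
    """
    n = len(weights)
    delta = eps * max(weights) / n
    return [int(w / delta) for w in weights], delta

ints, delta = scale_weights([3.7, 1.2, 9.9, 0.4], 0.5)
print(ints, delta)   # small integers; each weight is roughly ints[i] * delta
```

Working with small integers lets dynamic-programming subroutines run in pseudo-polynomial time that becomes polynomial in n and 1/ε, which is where the ε in the approximation ratio comes from.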
ABSTRACT: With the rapidly increasing deployment of Internet-connected, location-aware mobile devices, very large and increasing amounts of geo-tagged and timestamped user-generated content, such as microblog posts, are being generated. We present indexing, update, and query processing techniques that are capable of providing the top-k terms seen in posts in a user-specified spatio-temporal range. The techniques enable interactive response times in the millisecond range in a realistic setting where the arrival rate of posts exceeds today's average tweet arrival rate by a factor of 4-10. The techniques adaptively maintain the most frequent items at various spatial and temporal granularities. They extend existing frequent item counting techniques to maintain exact counts rather than approximations. An extensive empirical study with a large collection of geo-tagged tweets shows that the proposed techniques enable online aggregation and query processing at scale in realistic settings.
2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
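The query semantics above can be stated as a brute-force reference implementation: filter posts by the spatio-temporal range and count terms. This sketch defines only the semantics; the indexing and adaptive per-granularity counting techniques that make it interactive at scale are not reproduced, and the data layout is an assumption.

```python
from collections import Counter

def topk_terms(posts, region, interval, k):
    """Top-k terms among posts in a spatio-temporal range, by brute force.

    posts: iterable of (x, y, t, terms) tuples; region = (x0, y0, x1, y1);
    interval = (t0, t1). Returns [(term, count), ...] in decreasing count.
    """
    x0, y0, x1, y1 = region
    t0, t1 = interval
    counts = Counter()
    for (x, y, t, terms) in posts:
        if x0 <= x <= x1 and y0 <= y <= y1 and t0 <= t <= t1:
            counts.update(terms)
    return counts.most_common(k)

posts = [
    (1, 1, 10, ['storm', 'rain']),
    (2, 2, 11, ['storm']),
    (9, 9, 12, ['sun']),          # outside the query region below
    (1, 2, 99, ['storm']),        # outside the query interval below
]
print(topk_terms(posts, (0, 0, 5, 5), (0, 20), 1))   # [('storm', 2)]
```

An indexed solution would keep a counter like this per spatial cell and time bucket and merge the counters covering the query range, rather than scanning all posts.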
ABSTRACT: Different uses of a road network call for the consideration of different travel costs: in route planning, travel time and distance are typically considered, and greenhouse gas (GHG) emissions are increasingly being considered. Further, travel costs such as travel time and GHG emissions are time-dependent and uncertain. To support such uses, we propose techniques that enable the construction of a multi-cost, time-dependent, uncertain graph (MTUG) model of a road network based on GPS data from vehicles that traversed the road network. Based on the MTUG, we define stochastic skyline routes that consider multiple costs and time-dependent uncertainty, and we propose efficient algorithms to retrieve stochastic skyline routes for a given source-destination pair and a start time. Empirical studies with three road networks in Denmark and a substantial GPS data set offer insight into the design properties of the MTUG and the efficiency of the stochastic skyline routing algorithms.
2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
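The skyline notion underlying the routing above is Pareto dominance over cost vectors. The deterministic toy filter below shows the core idea with fixed (travel_time, emissions) costs; the stochastic, time-dependent aspects that the paper actually handles are not modeled here.

```python
def skyline(routes):
    """Return the routes not dominated in every cost dimension.

    routes: list of cost tuples, e.g. (travel_time, ghg_emissions), where
    smaller is better. Route a dominates route b if a is no worse in all
    costs and strictly better in at least one.
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and \
               any(x < y for x, y in zip(a, b))
    return [r for r in routes
            if not any(dominates(o, r) for o in routes if o != r)]

routes = [(10, 5.0), (12, 4.0), (11, 6.0), (15, 7.0)]
print(skyline(routes))   # [(10, 5.0), (12, 4.0)]: the other two are dominated
```

Here (11, 6.0) and (15, 7.0) are both dominated by (10, 5.0), while the two survivors trade travel time against emissions, which is exactly the set a multi-cost router would present to the driver.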
ABSTRACT: In this paper we investigate the use of GPUs to solve a data-intensive problem that involves huge amounts of moving objects. The scenario we focus on concerns objects that continuously move in a 2D space, where a large percentage of them also issue range queries. The processing of these queries entails returning the large quantity of objects falling within the range queries. In order to solve this problem while maintaining a suitable throughput, we partition time into ticks and defer the parallel processing of all object events (location updates and range queries) occurring in a given tick to the next tick, thus slightly delaying the overall computation. We process all the events of each tick in parallel by adopting a hybrid approach based on the combined use of CPU and GPU, and show the suitability of the method by discussing performance results. The exploitation of a GPU allows us to achieve a speedup of more than 20× over the best sequential algorithm solving the same problem on several datasets. More importantly, we show that the adoption of the new bitmap-based intermediate data structure we propose to avoid memory access contention yields a 10× speedup with respect to naive GPU-based solutions.
Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 02/2014
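The deferred, batch-per-tick scheme described above amounts to grouping timestamped events by tick and handing each tick's batch to the processor in the following tick. The sequential sketch below shows only this grouping step, under the assumption of a fixed tick length; the paper's hybrid CPU/GPU pipeline and bitmap structures are not reproduced.

```python
def batch_by_tick(events, tick_len):
    """Group timestamped events into per-tick batches for deferred processing.

    events: iterable of (timestamp, event) pairs; tick_len: tick duration.
    All events whose timestamps fall in the same tick are collected into
    one batch, and batches are returned in tick order, ready to be handed
    to a (here sequential) batch processor one tick later.
    """
    batches = {}
    for (t, ev) in events:
        batches.setdefault(int(t // tick_len), []).append(ev)
    return [batches[i] for i in sorted(batches)]

events = [(0.1, 'u1'), (0.4, 'q1'), (1.2, 'u2'), (2.7, 'q2')]
print(batch_by_tick(events, 1.0))   # [['u1', 'q1'], ['u2'], ['q2']]
```

Batching trades a one-tick latency for throughput: within a batch, all updates and queries can be processed as independent data-parallel tasks instead of one interleaved stream.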