Alexander Hinneburg

Martin Luther University of Halle-Wittenberg, Halle-on-the-Saale, Saxony-Anhalt, Germany


Publications (51) · 10.83 total impact

  • Ricardo Usbeck, Ivo Hedtke, Alexander Hinneburg
    Informatik Spektrum 01/2014; submitted(TBA):TBA.
  • ABSTRACT: We study in a quantitative way whether the most popular tags in a collaborative tagging system are distinctive features of the underlying content. For any set of annotations to be helpful in searching, this property must hold to a strong degree. Our initial experiments show that the most frequent tags in CiteULike are distinctive features, even though the annotation process is not centrally coordinated and no correction mechanisms, such as those of a wiki system, are used.
    Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS 2011, Sogndal, Norway, May 25 - 27, 2011; 01/2011
  • Source
    Kristin Mittag, Alexander Hinneburg
    Informatik 2010: Service Science - Neue Perspektiven für die Informatik, Beiträge der 40. Jahrestagung der Gesellschaft für Informatik e.V. (GI), Band 2, 27.09. - 1.10.2010, Leipzig; 01/2010
  • Source
    André Gohr, Myra Spiliopoulou, Alexander Hinneburg
    KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Valencia, Spain, October 25-28, 2010; 01/2010
  • Source
    André Gohr, Alexander Hinneburg, Rene Schult, Myra Spiliopoulou
    Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA; 01/2009
  • ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on these data, leveling off the ultimate value of these projects far below their potential. A prominent reason why published proteomics data are seldom reanalyzed lies in the heterogeneous nature of the original sample collection and the subsequent data recording and processing. To illustrate that at least part of this heterogeneity can be compensated for, we here apply a latent semantic analysis to the data contributed by the Human Proteome Organization's Plasma Proteome Project (HUPO PPP). Interestingly, despite the broad spectrum of instruments and methodologies applied in the HUPO PPP, our analysis reveals several obvious patterns that can be used to formulate concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in future experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data by noise-tolerant algorithms such as latent semantic analysis holds great promise and is currently underexploited.
    Journal of Proteome Research 02/2008; 7(1):182-91. · 5.06 Impact Factor
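At its core, the latent semantic analysis applied above is a truncated singular value decomposition of an experiment-by-identification count matrix; the low-rank projection is what gives the method its noise tolerance. A minimal sketch (the matrix layout and function name are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np

def lsa_embedding(matrix, k):
    """Project a (samples x features) count matrix onto its top-k
    latent factors via truncated SVD -- the core operation of latent
    semantic analysis, with experiments playing the role of
    'documents' and peptide identifications the role of 'terms'."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]  # one k-dimensional embedding per sample
```

Samples that share identification patterns end up close together in the embedding, which is how instrument- and protocol-level structure becomes visible.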
  • Source
    A. Hinneburg, H.-H. Gabriel, A. Gohr
    ABSTRACT: Probabilistic latent semantic indexing (PLSI) represents the documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored, which may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as a prior distribution over the topic simplex using a kernel density estimate with Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on; 11/2007
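Standard PLSI folding-in, which the paper improves on, fixes the trained topic-word distributions and runs EM only over the new document's topic mixture. A minimal sketch of that baseline; the `alpha` pseudo-count argument is a crude Dirichlet-smoothing stand-in for a prior over the topic simplex, not the paper's kernel-density prior:

```python
import numpy as np

def fold_in(word_counts, p_w_given_z, n_iter=50, alpha=None):
    """Fold a new document into a trained PLSI model.

    word_counts : (V,) term counts of the new document
    p_w_given_z : (K, V) fixed topic-word distributions from training
    alpha       : optional (K,) Dirichlet pseudo-counts, a simplified
                  stand-in for a prior over the topic mixture
    """
    K = p_w_given_z.shape[0]
    p_z = np.full(K, 1.0 / K)  # initial topic mixture p(z|d)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) proportional to p(w|z) p(z|d)
        resp = p_w_given_z * p_z[:, None]
        resp /= resp.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate p(z|d) from expected counts (+ prior mass)
        counts = resp @ word_counts
        if alpha is not None:
            counts = counts + alpha
        p_z = counts / counts.sum()
    return p_z
```

With a uniform `alpha` the estimate is pulled toward the simplex center, illustrating how prior knowledge about plausible mixtures tempers the maximum-likelihood fold-in.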
  • LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings; 01/2007
  • Alexander Hinneburg, Hans-Henning Gabriel
    LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings; 01/2007
  • Alexander Hinneburg, Andrea Porzel, Karina Wolfram
    ABSTRACT: Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task in the investigation of new, potentially useful chemical compounds. Multi-dimensional NMR spectra are relational objects like documents, but consist of continuous multi-dimensional points, called peaks, instead of words. We develop several mappings from continuous NMR spectra to discrete, text-like data. With the help of those mappings, any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which in the case of 2D-NMR data is interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function, which can detect duplicates of NMR spectra. Our experiments show that both the vector space model and PLSI, which are designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI is able to find meaningful "topics" in the NMR data.
    Bioinformatics Research and Development, First International Conference, BIRD 2007, Berlin, Germany, March 12-14, 2007, Proceedings; 01/2007
  • Source
    Alexander Hinneburg, Björn Egert, Andrea Porzel
    ABSTRACT: 2D nuclear magnetic resonance (NMR) spectra are used in the (structural) analysis of small molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C at the same time. A spectrum consists of several peaks in a two-dimensional space. The most important information of a peak is the location of its center, which captures the bonding relationships of hydrogen and carbon atoms. A spectrum contains much information about the chemical structure of a product, but in most cases the structure cannot be read off in a simple and straightforward manner; structure elucidation involves a considerable amount of (manual) effort. Using high-field NMR spectrometers, many 2D-NMR spectra can be recorded in a short time. So the common situation is that a lab or company has a repository of 2D-NMR spectra, partially annotated with structural information, while for the remaining spectra the structure is unknown. When two research labs collaborate, the repositories must be merged and annotations shared. We reduce that problem to the task of finding duplicates in a given set of 2D-NMR spectra. Therefore, we propose a simple but robust definition of 2D-NMR duplicates which allows for small measurement errors. We give a quadratic algorithm for the problem, which can be implemented in SQL. Further, we analyze a more abstract class of heuristics based on selecting particular peaks. Such a heuristic works as a filter step on the pairs of possible duplicates and allows false positives. We compare all methods with respect to their run time. Finally, we discuss the effectiveness of the duplicate definition on real data.
    http://journal.imbio.de/index.php?paper_id=53; 01/2007
  • Source
    Björn Egert, Steffen Neumann, Alexander Hinneburg
    ABSTRACT: 2D nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical method for elucidating the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C simultaneously. To curate or merge large spectra libraries, a robust (and fast) duplicate detection is needed. We propose a definition of duplicates with the robustness properties mandatory for 2D-NMR experiments. A major gain in runtime performance with respect to previously proposed heuristics is achieved by mapping the spectra to simple discrete objects. We propose several appropriate data transformations for this task. In order to compensate for slight variations of the mapped spectra, we use appropriate hashing functions according to the locality-sensitive hashing scheme and identify duplicates by hash collisions.
    Data Integration in the Life Sciences, 4th International Workshop, DILS 2007, Philadelphia, PA, USA, June 27-29, 2007, Proceedings; 01/2007
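The scheme described above, mapping continuous peaks to discrete objects and detecting duplicates by hash collisions, can be illustrated with a toy sketch. The grid cell sizes, the two offset grids, and the use of MD5 over the cell set are illustrative assumptions, not the paper's actual transformations or hash family:

```python
import hashlib

def spectrum_keys(peaks, cell=(0.05, 0.5), shifts=((0.0, 0.0), (0.5, 0.5))):
    """Map a 2D-NMR peak list to discrete hash keys.

    peaks : iterable of (h_shift, c_shift) peak centres in ppm
    cell  : grid resolution for the 1H and 13C axes (assumed values)
    Peaks are snapped to a grid; two offset grids are hashed so that a
    peak jittered across one grid's cell boundary still lands in the
    same cell on the other grid, tolerating small measurement errors.
    """
    keys = []
    for dx, dy in shifts:
        cells = frozenset(
            (int(h / cell[0] + dx), int(c / cell[1] + dy)) for h, c in peaks
        )
        keys.append(hashlib.md5(repr(sorted(cells)).encode()).hexdigest())
    return keys

def probably_duplicates(a, b):
    """Spectra are candidate duplicates if any grid key collides."""
    return bool(set(spectrum_keys(a)) & set(spectrum_keys(b)))
```

Indexing all spectra by their keys turns pairwise comparison into hash-table lookups, which is the runtime gain over quadratic peak matching.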
  • Source
    Alexander Hinneburg, Hans-Henning Gabriel
    ABSTRACT: The Denclue algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e., points going to the same local maximum are put into the same cluster. A disadvantage of Denclue 1.0 is that its hill climbing may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it just comes close. We introduce a new hill climbing procedure for Gaussian kernels which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling-based methods while sacrificing only a small amount of accuracy.
    Advances in Intelligent Data Analysis VII, 7th International Symposium on Intelligent Data Analysis, IDA 2007, Ljubljana, Slovenia, September 6-8, 2007, Proceedings; 01/2007
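For Gaussian kernels, the step-size-free hill climbing the abstract describes takes the form of a fixed-point iteration: the next position is the kernel-weighted mean of the data, so the step length adapts automatically. A minimal sketch under that reading (names and defaults are illustrative):

```python
import numpy as np

def gaussian_hill_climb(x0, data, h=1.0, tol=1e-6, max_iter=200):
    """Hill-climb the Gaussian kernel density estimate from x0.

    Each update moves to the kernel-weighted mean of the data points,
    the fixed-point iteration that Denclue 2.0 shows is a special case
    of EM, so no step-size parameter is needed.
    """
    x = np.asarray(x0, dtype=float)
    X = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weights of every data point at the current x
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h * h))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Points whose climbs end at the same local maximum would then be assigned to the same cluster.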
  • Source
    ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on these data, leveling off the ultimate value of these projects far below their potential. In order to illustrate that these repositories should be considered sources of detailed knowledge instead of data graveyards, we here present a novel way of analyzing the information contained in proteomics experiments with a 'latent semantic analysis'. We apply this information retrieval approach to the peptide identification data contributed by the Plasma Proteome Project. Interestingly, this analysis is able to overcome the fundamental difficulties of analyzing such divergent and heterogeneous data emerging from large-scale proteomics studies employing a vast spectrum of different sample treatment and mass spectrometry technologies. Moreover, it yields several concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in the experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data holds great promise and is currently underexploited.
    Proceedings of the German Conference on Bioinformatics, GCB 2007, September 26-28, 2007, Potsdam, Germany.; 01/2007
  • Stefan Brass, Alexander Hinneburg
    01/2006;
  • Source
    Karina Wolfram, Andrea Porzel, Alexander Hinneburg
    ABSTRACT: Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring products is an important task in the investigation of new, potentially useful chemical compounds. We develop a set-based similarity function, which, however, does not sufficiently capture more abstract aspects of similarity. NMR spectra are like documents, but consist of continuous multi-dimensional points instead of words. Probabilistic latent semantic indexing (PLSI) is a retrieval method which learns hidden topics. We develop several mappings from continuous NMR spectra to discrete, text-like data. The new mappings introduce redundancies into the discrete data, which proves helpful for the PLSI model used afterwards. Our experiments show that PLSI, which is designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI combined with the new mappings is able to find meaningful "topics" in the NMR data.
    Knowledge Discovery in Databases: PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings; 01/2006
  • Source
    Dirk Habich, Wolfgang Lehner, Alexander Hinneburg
    ABSTRACT: Advanced data mining applications require more and more support from relational database engines. Clustering applications in high-dimensional feature spaces in particular demand proper support for multiple top-k queries in order to perform projected clustering. Although some research tackles the problem of optimizing restricted ranking (top-k) queries, there is no solution that considers more than one single ranking criterion. This deficit - optimizing multiple top-k queries over joins - is targeted by this paper from two perspectives. On the one hand, we propose a minimal but quite handy extension of SQL to express multiple top-k queries. On the other hand, we propose an optimized hash join strategy to efficiently execute this type of query. Extensive experiments conducted in this context show the feasibility of our proposal.
    17th International Conference on Scientific and Statistical Database Management, SSDBM 2005, 27-29 June 2005, University of California, Santa Barbara, CA, USA, Proceedings; 01/2005
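The benefit of evaluating several top-k criteria together rather than as separate queries can be sketched with a shared scan that keeps one bounded heap per ranking criterion. This is an illustrative simplification, not the paper's hash-join strategy:

```python
import heapq

def multi_topk(rows, criteria, k):
    """Compute several top-k rankings in a single pass over a join
    result, maintaining one size-k min-heap per ranking criterion.
    Each row is scored against every criterion as it streams by, so
    the (possibly expensive) join output is consumed only once.
    """
    heaps = [[] for _ in criteria]
    for row in rows:
        for heap, score in zip(heaps, criteria):
            item = (score(row), row)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:  # better than the current k-th best
                heapq.heapreplace(heap, item)
    # best-first order per criterion
    return [sorted(h, reverse=True) for h in heaps]
```

Running the rankings over separate queries would instead re-execute the join once per criterion.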
  • Source
    Conference Proceeding: Dimension induced clustering.
    ABSTRACT: It is commonly assumed that high-dimensional datasets contain points most of which are located in low-dimensional manifolds. Detecting low-dimensional clusters is an extremely useful task for operations such as clustering and classification; however, it is a challenging computational problem. In this paper we study the problem of finding subsets of points with low intrinsic dimensionality. Our main contribution is to extend the definition of the fractal correlation dimension, which measures the average volume growth rate, in order to estimate the intrinsic dimensionality of the data in local neighborhoods. We provide a careful analysis of several key examples to demonstrate the properties of our measure. Based on the proposed measure, we introduce a novel approach to discover clusters of low dimensionality. The resulting algorithms extend previous density-based measures, which have been successfully used for clustering. We demonstrate the effectiveness of our algorithms for discovering low-dimensional m-flats embedded in high-dimensional spaces and for detecting low-rank sub-matrices.
    Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, August 21-24, 2005; 01/2005
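The correlation dimension the abstract builds on can be estimated from how the fraction of close point pairs grows with the radius: for data on a d-dimensional manifold, C(r) scales roughly like r^d. A minimal global two-radius estimator (the paper's contribution is a local-neighborhood extension of this idea; this sketch shows only the baseline quantity):

```python
import numpy as np

def correlation_dimension(X, r_small, r_large):
    """Estimate the correlation dimension from pair-count growth.

    C(r) is the fraction of point pairs closer than r; since
    C(r) ~ r^d for intrinsic dimension d, the log-log slope between
    two radii gives a dimension estimate.
    """
    X = np.asarray(X, dtype=float)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    pair = dists[np.triu_indices(len(X), k=1)]  # each pair once
    c_small = (pair < r_small).mean()
    c_large = (pair < r_large).mean()
    return np.log(c_large / c_small) / np.log(r_large / r_small)
```

Points sampled from a line embedded in 3-D, for instance, yield an estimate near 1 even though the ambient dimension is 3.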
  • Source
    Stefan Brass, Christian Goldberg, Alexander Hinneburg
    ABSTRACT: We investigate classes of SQL queries which are syntactically correct but certainly not intended, no matter for which task the query was written. For instance, queries that are contradictory, i.e. always return the empty set, are quite often written in exams of database courses. Current database management systems, e.g. Oracle, execute such queries without any warning. In this paper, we explain several classes of such errors and give algorithms for detecting them. Of course, questions like satisfiability are in general undecidable, but our algorithm can treat a significant subset of SQL queries. We believe that future database management systems will perform such checks and that the generated warnings will help to develop code with fewer bugs in less time.
    01/2004;
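The simplest of the error classes above, a contradictory conjunction of comparisons on one numeric column (e.g. `x > 5 AND x < 3`), can be decided by intersecting the allowed value ranges. A toy sketch of that one check; the paper's analysis covers a much larger subset of SQL:

```python
def satisfiable(conds):
    """Decide whether a conjunction of comparisons on a single numeric
    column can ever be true, by intersecting the allowed intervals.

    conds : list of (op, value) with op in {'<', '<=', '>', '>=', '='}
    """
    lo, hi = float('-inf'), float('inf')
    lo_strict, hi_strict = False, False
    for op, value in conds:
        # treat equality as the pair of bounds >= value and <= value
        bounds = [('>=', value), ('<=', value)] if op == '=' else [(op, value)]
        for o, v in bounds:
            if o in ('>', '>='):
                if v > lo or (v == lo and o == '>'):
                    lo, lo_strict = v, (o == '>')
            else:
                if v < hi or (v == hi and o == '<'):
                    hi, hi_strict = v, (o == '<')
    if lo < hi:
        return True
    return lo == hi and not lo_strict and not hi_strict
```

A query whose WHERE clause fails this test always returns the empty set, so a warning can be raised at compile time.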

Publication Stats

1k Citations
432 Downloads
2k Views
10.83 Total Impact Points

Institutions

  • 1999–2010
    • Martin Luther University of Halle-Wittenberg
      • Institute of Computer Science
      Halle-on-the-Saale, Saxony-Anhalt, Germany
  • 2002–2003
    • Universitätsklinikum Halle (Saale)
      Halle-on-the-Saale, Saxony-Anhalt, Germany