Alexander Hinneburg

Martin Luther University of Halle-Wittenberg, Halle-on-the-Saale, Saxony-Anhalt, Germany

Publications (52) · 10.83 Total Impact

  • Source
    ABSTRACT: Topic models extract representative word sets - called topics - from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable; therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.
    03/2014;
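    The pairwise measures this paper generalizes can be sketched as a normalized PMI average over a topic's top words. A minimal sketch, assuming simple document-frequency statistics as input (names and data layout are illustrative, not taken from the paper):

        import math
        from itertools import combinations

        def pairwise_coherence(top_words, doc_count, co_doc_count, n_docs, eps=1e-12):
            """Average normalized PMI over all pairs of a topic's top words.

            doc_count[w]           -- number of documents containing word w
            co_doc_count[(w1,w2)]  -- number of documents containing both words
            """
            scores = []
            for w1, w2 in combinations(top_words, 2):
                p1 = doc_count[w1] / n_docs
                p2 = doc_count[w2] / n_docs
                p12 = co_doc_count.get((w1, w2), 0) / n_docs
                pmi = math.log((p12 + eps) / (p1 * p2))
                scores.append(pmi / -math.log(p12 + eps))  # normalize to [-1, 1]
            return sum(scores) / len(scores) if scores else 0.0

    The measures imported from scientific philosophy replace the word pairs with pairs of word subsets; the scoring skeleton stays the same.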
  • Ricardo Usbeck, Ivo Hedtke, Alexander Hinneburg
    Informatik Spektrum 01/2014;
  • Alexander Hinneburg, Rico Preiss, René Schröder
    ABSTRACT: The demo presents a prototype --- called TopicExplorer --- that combines topic modeling, keyword search and visualization techniques to explore a large collection of Wikipedia documents. Topics derived by Latent Dirichlet Allocation are presented by their top words. In addition, topics are accompanied by image thumbnails extracted from related Wikipedia documents to aid sense-making of derived topics during browsing. Topics are shown in a linear order such that similar topics are close. Topics are mapped to colors using that order. The auto-completion of search terms suggests words together with their color-coded topics, which allows users to explore the relation between search terms and topics. Retrieved documents are shown with color-coded topics as well. Relevant documents and topics found during browsing can be put onto a shortlist. The tool can recommend further documents with respect to the average topic mixture of the shortlist.
    Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II; 09/2012
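    The linear topic order makes the color coding straightforward: neighboring (similar) topics receive similar hues. A minimal sketch of such an order-to-color mapping via an HSV hue sweep; this palette is an assumption for illustration, not necessarily the prototype's actual scheme:

        import colorsys

        def topic_colors(n_topics):
            """Hex colors for linearly ordered topics; neighbors get similar hues."""
            colors = []
            for k in range(n_topics):
                hue = 0.8 * k / max(1, n_topics - 1)  # partial sweep keeps the ends distinct
                r, g, b = colorsys.hsv_to_rgb(hue, 0.6, 0.9)
                colors.append('#%02x%02x%02x' % (int(r * 255), int(g * 255), int(b * 255)))
            return colors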
  • ABSTRACT: Finding correlated words in large document collections is an important ingredient for text analytics. The naïve approach computes the correlations of each word against all other words and filters for highly correlated word pairs. Clearly, this quadratic method cannot be applied to real-world scenarios with millions of documents and words. Our main contribution is to transform the task of finding highly correlated word pairs into a word clustering problem that is efficiently solved by locality sensitive hashing (LSH). A key insight of our new method is that the empirical Pearson correlation between two words is the cosine of the angle between the centered versions of their word vectors. The angle can be approximated by an LSH scheme. Although centered word vectors are not sparse, the computation of the LSH hash functions can exploit the inherent sparsity of the word data. This leads to an efficient way to detect collisions between centered word vectors having a small angle and therefore provides a fast algorithm to sample highly correlated word pairs. Our new LSH-based method improves on the run-time complexity of the enhanced naïve algorithm, which reduces the dimensionality of the word vectors using random projection and approximates correlations by computing cosine similarity on the reduced and centered word vectors, but still has quadratic run time. Our new method replaces the filtering for high correlations in the naïve algorithm with finding hash collisions, which can be done by sorting the hash values of the word vectors. We evaluate the scalability of our new algorithm on large text collections.
    05/2012;
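    The key identity is that the empirical Pearson correlation of two word vectors equals the cosine of the angle between their centered versions, so random-hyperplane (SimHash-style) LSH applies: each random hyperplane contributes one signature bit, and two vectors agree on a bit with probability 1 - theta/pi. A minimal dense sketch; the paper's method exploits sparsity and avoids materializing the centered vectors, and names here are illustrative:

        import numpy as np

        def simhash_signatures(X, n_bits=64, seed=0):
            """Random-hyperplane bit signatures for centered word vectors.

            X: (n_words, n_docs) count matrix; each row is one word's vector.
            """
            rng = np.random.default_rng(seed)
            Xc = X - X.mean(axis=1, keepdims=True)    # centering (dense here)
            H = rng.standard_normal((X.shape[1], n_bits))
            return Xc @ H > 0                          # boolean (n_words, n_bits)

        def candidate_pairs(sig):
            """Bucket words by their full signature; exact-match buckets shown
            here, while practical variants hash shorter bands to raise recall."""
            buckets = {}
            for i, bits in enumerate(sig):
                buckets.setdefault(bits.tobytes(), []).append(i)
            return [(i, j) for group in buckets.values()
                    for i in group for j in group if i < j]

    Sorting the signatures groups colliding words without any pairwise pass, which is where the sub-quadratic behavior comes from.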
  • ABSTRACT: We study in a quantitative way whether the most popular tags in a collaborative tagging system are distinctive features of the underlying content. For a set of annotations to be helpful in searching, this property must necessarily hold to a strong degree. Our initial experiments show that the most frequent tags in CiteULike are distinctive features, even though the process of annotating documents is neither centrally coordinated nor supported by correction mechanisms like those of a wiki system.
    Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS 2011, Sogndal, Norway, May 25 - 27, 2011; 01/2011
  • Source
    Kristin Mittag, Alexander Hinneburg
    Informatik 2010: Service Science - Neue Perspektiven für die Informatik, Beiträge der 40. Jahrestagung der Gesellschaft für Informatik e.V. (GI), Band 2, 27.09. - 1.10.2010, Leipzig; 01/2010
  • Source
    André Gohr, Myra Spiliopoulou, Alexander Hinneburg
    KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Valencia, Spain, October 25-28, 2010; 01/2010
  • Source
    André Gohr, Alexander Hinneburg, Rene Schult, Myra Spiliopoulou
    Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA; 01/2009
  • ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on this data, leveling off the ultimate value of these projects far below their potential. A prominent reason published proteomics data is seldom reanalyzed lies in the heterogeneous nature of the original sample collection and the subsequent data recording and processing. To illustrate that at least part of this heterogeneity can be compensated for, we here apply a latent semantic analysis to the data contributed by the Human Proteome Organization's Plasma Proteome Project (HUPO PPP). Interestingly, despite the broad spectrum of instruments and methodologies applied in the HUPO PPP, our analysis reveals several obvious patterns that can be used to formulate concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in future experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data by noise-tolerant algorithms such as latent semantic analysis holds great promise and is currently underexploited.
    Journal of Proteome Research 02/2008; 7(1):182-91. · 5.06 Impact Factor
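    Latent semantic analysis in this classical sense is a truncated SVD of the identification count matrix. A minimal sketch, assuming a (samples x peptide identifications) layout; the layout and names are assumptions, not the paper's exact pipeline:

        import numpy as np

        def latent_semantic_analysis(counts, k=10):
            """Rank-k truncated SVD; rows of U*S are low-dimensional sample
            coordinates in which patterns shared across labs and instruments
            survive while much of the noise is truncated away."""
            U, s, Vt = np.linalg.svd(counts, full_matrices=False)
            return U[:, :k] * s[:k], Vt[:k]   # sample coordinates, latent concepts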
  • Source
    A. Hinneburg, H.-H. Gabriel, A. Gohr
    ABSTRACT: Probabilistic latent semantic indexing (PLSI) represents the documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored. This may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as a prior distribution over the topic simplex using a kernel density estimate built from Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
    Seventh IEEE International Conference on Data Mining (ICDM 2007); 11/2007
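    Standard folding-in fixes the learned topic-word distributions and runs EM only over the new document's topic mixture; the paper's contribution is a prior over that mixture built from the known documents' mixtures. A minimal sketch of the standard step being improved, with the prior term omitted (names are illustrative):

        import numpy as np

        def fold_in(word_counts, p_w_given_z, n_iter=50):
            """PLSI folding-in: re-estimate only p(z|d) for a new document.

            word_counts: (V,) term counts of the new document
            p_w_given_z: (K, V) fixed topic-word distributions
            """
            K = p_w_given_z.shape[0]
            p_z = np.full(K, 1.0 / K)                  # uniform start
            for _ in range(n_iter):
                # E-step: responsibilities p(z | w, d) for every vocabulary word
                joint = p_z[:, None] * p_w_given_z     # (K, V)
                resp = joint / joint.sum(axis=0, keepdims=True)
                # M-step: expected topic counts, renormalized to a mixture
                p_z = (resp * word_counts).sum(axis=1)
                p_z /= p_z.sum()
            return p_z

    The Bayesian variant would add the Dirichlet-kernel density prior's contribution to the M-step instead of using the plain maximum-likelihood update.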
  • LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings; 01/2007
  • Source
    ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on this data, leveling off the ultimate value of these projects far below their potential. In order to illustrate that these repositories should be considered sources of detailed knowledge instead of data graveyards, we here present a novel way of analyzing the information contained in proteomics experiments with a 'latent semantic analysis'. We apply this information retrieval approach to the peptide identification data contributed by the Plasma Proteome Project. Interestingly, this analysis is able to overcome the fundamental difficulties of analyzing such divergent and heterogeneous data emerging from large-scale proteomics studies employing a vast spectrum of different sample treatment and mass-spectrometry technologies. Moreover, it yields several concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in the experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data holds great promise and is currently underexploited.
    Proceedings of the German Conference on Bioinformatics, GCB 2007, September 26-28, 2007, Potsdam, Germany.; 01/2007
  • Source
    Björn Egert, Steffen Neumann, Alexander Hinneburg
    ABSTRACT: 2D-nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical method to elucidate the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of 1H and 13C simultaneously. To curate or merge large spectra libraries, a robust (and fast) duplicate detection is needed. We propose a definition of duplicates with the desired robustness properties mandatory for 2D-NMR experiments. A major gain in runtime performance with respect to previously proposed heuristics is achieved by mapping the spectra to simple discrete objects. We propose several appropriate data transformations for this task. In order to compensate for slight variations of the mapped spectra, we use appropriate hashing functions according to the locality sensitive hashing scheme, and identify duplicates by hash collisions.
    Data Integration in the Life Sciences, 4th International Workshop, DILS 2007, Philadelphia, PA, USA, June 27-29, 2007, Proceedings; 01/2007
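    The discretization idea can be sketched by snapping peak centers to a coarse grid, so that spectra with slightly shifted peaks map to the same discrete object; hashing the result then finds duplicates by collision. Step sizes and the exact-match bucketing below are illustrative assumptions (the paper uses locality sensitive hashing to also tolerate peaks falling on cell boundaries):

        def spectrum_to_grid(peaks, h_step=0.05, c_step=0.5):
            """Map continuous (1H, 13C) peak centers in ppm to grid cells."""
            return frozenset((round(h / h_step), round(c / c_step))
                             for h, c in peaks)

        def find_duplicate_groups(spectra):
            """Bucket spectra by their discretized peak set; frozensets are
            hashable, so collisions fall out of a dictionary lookup."""
            buckets = {}
            for name, peaks in spectra.items():
                buckets.setdefault(spectrum_to_grid(peaks), []).append(name)
            return [group for group in buckets.values() if len(group) > 1]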
  • Source
    Alexander Hinneburg, Hans-Henning Gabriel
    ABSTRACT: The DENCLUE algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e., points going to the same local maximum are put into the same cluster. A disadvantage of DENCLUE 1.0 is that the hill climbing used may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it just comes close. We introduce a new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling-based methods while sacrificing only a small amount of accuracy.
    Advances in Intelligent Data Analysis VII, 7th International Symposium on Intelligent Data Analysis, IDA 2007, Ljubljana, Slovenia, September 6-8, 2007, Proceedings; 01/2007
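    The new hill climbing is a fixed-point update that moves the current point to the kernel-weighted mean of the data, which is what makes the step size self-adjusting. A minimal sketch for Gaussian kernels (variable names are illustrative):

        import numpy as np

        def denclue_hill_climb(x, data, h=1.0, n_iter=100, tol=1e-6):
            """Climb to a local maximum of a Gaussian kernel density estimate.

            data: (n, d) array of points; x: (d,) start point; h: bandwidth.
            Each step is a kernel-weighted mean, which the paper shows is a
            special case of expectation maximization and hence converges.
            """
            for _ in range(n_iter):
                w = np.exp(-0.5 * np.sum(((data - x) / h) ** 2, axis=1))
                x_new = (w[:, None] * data).sum(axis=0) / w.sum()
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            return x_new

    Points whose climbs end at the same maximum form one cluster.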
  • Alexander Hinneburg, Andrea Porzel, Karina Wolfram
    ABSTRACT: Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task to investigate new potentially useful chemical compounds. Multi-dimensional NMR spectra are relational objects like documents, but consist of continuous multi-dimensional points called peaks instead of words. We develop several mappings from continuous NMR spectra to discrete text-like data. With the help of those mappings, any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which is, in the case of 2D-NMR data, interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function, which can detect duplicates of NMR spectra. Our experiments show that the vector space model as well as PLSI, which are both designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI is able to find meaningful "topics" in the NMR data.
    Bioinformatics Research and Development, First International Conference, BIRD 2007, Berlin, Germany, March 12-14, 2007, Proceedings; 01/2007
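    The mappings can be sketched as turning each peak into a discrete token, after which a spectrum is an ordinary bag of words for the vector space model or PLSI. Token construction and the cosine scoring below are illustrative assumptions, not the paper's exact mappings:

        import math
        from collections import Counter

        def spectrum_to_words(peaks, h_step=0.05, c_step=0.5):
            """Tokenize continuous (1H, 13C) peaks into text-like 'words'."""
            return Counter(f"h{round(h / h_step)}c{round(c / c_step)}"
                           for h, c in peaks)

        def cosine_similarity(a, b):
            """Vector space model score between two tokenized spectra."""
            dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
            norm = (math.sqrt(sum(v * v for v in a.values()))
                    * math.sqrt(sum(v * v for v in b.values())))
            return dot / norm if norm else 0.0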
  • Source
    Alexander Hinneburg, Björn Egert, Andrea Porzel
    ABSTRACT: 2D-nuclear magnetic resonance (NMR) spectra are used in the (structural) analysis of small molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C at the same time. A spectrum consists of several peaks in a two-dimensional space. The most important information of a peak is the location of its center, which captures the bonding relationships of hydrogen and carbon atoms. A spectrum contains much information about the chemical structure of a product, but in most cases the structure cannot be read off in a simple and straightforward manner. Structure elucidation involves a considerable amount of (manual) effort. Using high-field NMR spectrometers, many 2D-NMR spectra can be recorded in a short time. So the common situation is that a lab or company has a repository of 2D-NMR spectra, partially annotated with structural information. For the remaining spectra the structure is unknown. When two research labs collaborate, their repositories are merged and annotations shared. We reduce that problem to the task of finding duplicates in a given set of 2D-NMR spectra. Therefore, we propose a simple but robust definition of 2D-NMR duplicates, which allows for small measurement errors. We give a quadratic algorithm for the problem, which can be implemented in SQL. Further, we analyze a more abstract class of heuristics, which are based on selecting particular peaks. Such a heuristic works as a filter step on the pairs of possible duplicates and allows false positives. We compare all methods with respect to their run time. Finally, we discuss the effectiveness of the duplicate definition on real data.
    http://journal.imbio.de/index.php?paper_id=53; 01/2007
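    The duplicate definition tolerates small measurement errors by matching peaks within a shift tolerance in both dimensions; the quadratic algorithm then checks that the match covers both spectra. A minimal sketch with illustrative tolerances (the paper's exact definition and its SQL formulation may differ):

        def peaks_match(p, q, h_tol=0.05, c_tol=0.5):
            """True if two (1H, 13C) peak centers agree within tolerance."""
            return abs(p[0] - q[0]) <= h_tol and abs(p[1] - q[1]) <= c_tol

        def is_duplicate(spec_a, spec_b):
            """Every peak of each spectrum must match some peak of the other."""
            return (all(any(peaks_match(p, q) for q in spec_b) for p in spec_a)
                    and all(any(peaks_match(q, p) for p in spec_a) for q in spec_b))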
  • Alexander Hinneburg, Hans-Henning Gabriel
    LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings; 01/2007
  • Stefan Brass, Alexander Hinneburg
    01/2006;
  • Source
    Karina Wolfram, Andrea Porzel, Alexander Hinneburg
    ABSTRACT: Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring products is an important task to investigate new potentially useful chemical compounds. We develop a set-based similarity function, which, however, does not sufficiently capture more abstract aspects of similarity. NMR spectra are like documents, but consist of continuous multi-dimensional points instead of words. Probabilistic latent semantic indexing (PLSI) is a retrieval method which learns hidden topics. We develop several mappings from continuous NMR spectra to discrete text-like data. The new mappings introduce redundancies into the discrete data, which proves helpful for the PLSI model used afterwards. Our experiments show that PLSI, which is designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI combined with the new mappings is able to find meaningful "topics" in the NMR data.
    Knowledge Discovery in Databases: PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings; 01/2006

Publication Stats

2k Citations
10.83 Total Impact Points

Institutions

  • 1999–2012
    • Martin Luther University of Halle-Wittenberg
      • Institute of Computer Science
      Halle-on-the-Saale, Saxony-Anhalt, Germany
  • 2002–2003
    • Universitätsklinikum Halle (Saale)
      Halle-on-the-Saale, Saxony-Anhalt, Germany