Alexander Hinneburg

Martin Luther University of Halle-Wittenberg, Halle-on-the-Saale, Saxony-Anhalt, Germany

Publications (59) · Total impact: 19.55

  • Michael Röder · Andreas Both · Alexander Hinneburg
    ABSTRACT: Quantifying the coherence of a set of statements is a long-standing problem with many potential applications that has attracted researchers from different sciences. The special case of measuring the coherence of topics has recently been studied to remedy the problem that topic models give no guarantee on the interpretability of their output. Several benchmark datasets have been produced that record human judgements of the interpretability of topics. We are the first to propose a framework that allows existing word-based coherence measures, as well as new ones, to be constructed by combining elementary components. We conduct a systematic search of the space of coherence measures using all publicly available topic relevance data for the evaluation. Our results show that new combinations of components outperform existing measures with respect to correlation with human ratings. Finally, we outline how our results can be transferred to further applications in the context of text mining, information retrieval and the World Wide Web.
    Article · Feb 2015 (see the sketch below)
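    A hedged, minimal illustration (not the paper's actual framework): one simple point in the space of word-based coherence measures scores a topic's top words by average pairwise NPMI, with probabilities estimated from boolean document co-occurrence. Function names and the toy data below are invented.

      import math
      from itertools import combinations

      def npmi_coherence(top_words, documents, eps=1e-12):
          # Average pairwise NPMI of a topic's top words; probabilities are
          # estimated from boolean (co-)occurrence counts over the documents.
          n_docs = len(documents)

          def p(*words):
              return sum(1 for d in documents if all(w in d for w in words)) / n_docs

          scores = []
          for w1, w2 in combinations(top_words, 2):
              p1, p2, p12 = p(w1), p(w2), p(w1, w2)
              if p12 == 0:
                  scores.append(-1.0)        # words never co-occur: minimal NPMI
                  continue
              pmi = math.log(p12 / (p1 * p2 + eps))
              scores.append(pmi / (-math.log(p12) + eps))
          return sum(scores) / len(scores)   # aggregation: arithmetic mean

      # toy usage with documents given as sets of tokens
      docs = [{"topic", "model", "word"}, {"topic", "coherence"}, {"model", "word"}]
      print(npmi_coherence(["topic", "model", "word"], docs))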
  • Ricardo Usbeck · Ivo Hedtke · Alexander Hinneburg

    Article · Jan 2014 · Informatik Spektrum
  • Frank Rosner · Alexander Hinneburg · Michael Röder · Martin Nettling · Andreas Both
    ABSTRACT: Topic models extract representative word sets, called topics, from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable; therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far have been limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.
    Conference Paper · Dec 2013
  • A. Gohr · M. Spiliopoulou · A. Hinneburg
    ABSTRACT: We propose a visualization technique for summarizing the contents of document streams, such as news or scientific archives. The content of streaming documents changes over time, and so do the themes the documents are about. Topic evolution is a relatively new research subject that encompasses the unsupervised discovery of thematic subjects in a document collection and the adaptation of these subjects as new documents arrive. While many powerful topic evolution methods exist, the combination of learning and visualization of the evolving topics has been less explored, although it is indispensable for understanding a dynamic document collection. We propose Topic Table, a visualization technique that builds upon topic modeling to derive a condensed representation of a document collection. Topic Table captures important and intuitively comprehensible aspects of a topic over time: the importance of the topic within the collection, the words characterizing the topic, and the semantic changes of the topic from one time point to the next. As an example, we visualize the content of the NIPS proceedings from 1987 to 1999.
    Article · Jan 2013 · Communications in Computer and Information Science
  • Alexander Hinneburg · Rico Preiss · René Schröder
    ABSTRACT: The demo presents a prototype, called TopicExplorer, that combines topic modeling, keyword search and visualization techniques to explore a large collection of Wikipedia documents. Topics derived by Latent Dirichlet Allocation are presented by their top words. In addition, topics are accompanied by image thumbnails extracted from related Wikipedia documents to aid sense-making of derived topics during browsing. Topics are shown in a linear order such that similar topics are close, and they are mapped to colors using that order. The auto-completion of search terms suggests words together with their color-coded topics, which allows users to explore the relation between search terms and topics. Retrieved documents are shown with color-coded topics as well. Relevant documents and topics found during browsing can be put onto a shortlist. The tool can recommend further documents with respect to the average topic mixture of the shortlist.
    Conference Paper · Sep 2012
  • Frank Rosner · Alexander Hinneburg · Martin Gleditzsch · Mathias Priebe · Andreas Both
    ABSTRACT: Finding correlated words in large document collections is an important ingredient for text analytics. The naïve approach computes the correlations of each word against all other words and filters for highly correlated word pairs. Clearly, this quadratic method cannot be applied to real-world scenarios with millions of documents and words. Our main contribution is to transform the task of finding highly correlated word pairs into a word clustering problem that is efficiently solved by locality-sensitive hashing (LSH). A key insight of our new method is that the empirical Pearson correlation between two words is the cosine of the angle between the centered versions of their word vectors. The angle can be approximated by an LSH scheme. Although centered word vectors are not sparse, the computation of the LSH hash functions can exploit the inherent sparsity of the word data. This leads to an efficient way to detect collisions between centered word vectors having a small angle and therefore provides a fast algorithm to sample highly correlated word pairs. Our new method based on LSH improves on the run-time complexity of the enhanced naïve algorithm, which reduces the dimensionality of the word vectors using random projection and approximates correlations by computing cosine similarity on the reduced and centered word vectors; however, that method still has quadratic run time. Our new method replaces the filtering for high correlations in the naïve algorithm with finding hash collisions, which can be done by sorting the hash values of the word vectors. We evaluate the scalability of our new algorithm to large text collections.
    Article · May 2012 (see the sketch below)
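    Two observations in the abstract lend themselves to a short sketch: the Pearson correlation of two word vectors equals the cosine of their centered versions, and a sign-random-projection hash of a centered vector can still be computed from the sparse raw counts, because r·(x − mean(x)·1) = r·x − mean(x)·sum(r). The code below (hypothetical names, not the paper's implementation) hashes words this way and reports signature collisions as candidate highly correlated pairs; bucketing or sorting by signature replaces the quadratic filter.

      import numpy as np
      from collections import defaultdict

      def srp_signatures(word_vectors, n_bits=16, seed=0):
          # Sign-random-projection signatures of *centered* word vectors, computed
          # without densifying: r.(x - mean(x)) = r.x - mean(x) * sum(r).
          # word_vectors: dict word -> {doc_id: count}
          rng = np.random.default_rng(seed)
          n_docs = 1 + max(d for v in word_vectors.values() for d in v)
          R = rng.standard_normal((n_bits, n_docs))
          row_sums = R.sum(axis=1)
          sigs = {}
          for word, vec in word_vectors.items():
              mean = sum(vec.values()) / n_docs
              proj = np.zeros(n_bits)
              for doc, cnt in vec.items():       # touch only the non-zero entries
                  proj += cnt * R[:, doc]
              proj -= mean * row_sums            # centering correction
              sigs[word] = tuple(proj > 0)
          return sigs

      def candidate_pairs(sigs):
          # Words with identical signatures are collision candidates for high correlation.
          buckets = defaultdict(list)
          for word, sig in sigs.items():
              buckets[sig].append(word)
          return [(a, b) for ws in buckets.values()
                  for i, a in enumerate(ws) for b in ws[i + 1:]]

      # toy usage: 'a' and 'b' have identical document counts, 'c' does not
      vectors = {"a": {0: 2, 1: 1}, "b": {0: 2, 1: 1}, "c": {2: 3}}
      print(candidate_pairs(srp_signatures(vectors)))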
  • André Gohr · Alexander Hinneburg · Myra Spiliopoulou · Ricardo Usbeck
    ABSTRACT: We study in a quantitative way whether the most popular tags in a collaborative tagging system are distinctive features when looking at the underlying content. For any set of annotations to be helpful in searching, this property must necessarily hold to a strong degree. Our initial experiments show that the most frequent tags in CiteULike are distinctive features, even though the process of annotating documents is neither centrally coordinated nor supported by correction mechanisms like those of a wiki system.
    Conference Paper · Jan 2011
  • Alexander Hinneburg · Dirk Habich · Marcel Karnstedt
    ABSTRACT: Sensor data have grown very large, and individual measurements are produced at high rates, resulting in streaming sensor data. In this paper, we present a new mining tool called Online DFT, which is particularly powerful for estimating the spectrum of a data stream. Unique features of our new method include low update complexity with high-accuracy estimations for very long periods, and the ability to perform long-range forecasting based on our Online DFT. Furthermore, we describe some applications of our Online DFT.
    Article · Dec 2010 (see the sketch below)
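    The Online DFT itself is not spelled out in the abstract; as a hedged illustration of the general idea of estimating a stream's spectrum with cheap incremental updates, the sketch below implements the classical sliding DFT, which updates every frequency bin of a length-N window in constant time per bin when a new sample arrives, rather than recomputing the whole transform.

      import cmath
      import math

      class SlidingDFT:
          # Classical sliding DFT over the most recent n samples: each new sample
          # updates all n bins via X_k <- (X_k - x_old + x_new) * e^(2*pi*i*k/n).
          def __init__(self, n):
              self.n = n
              self.window = [0.0] * n
              self.pos = 0
              self.bins = [0j] * n
              self.twiddle = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

          def update(self, x_new):
              x_old = self.window[self.pos]
              self.window[self.pos] = x_new
              self.pos = (self.pos + 1) % self.n
              for k in range(self.n):
                  self.bins[k] = (self.bins[k] - x_old + x_new) * self.twiddle[k]
              return self.bins

      # toy usage: feed a pure sine; the energy concentrates in the matching bins
      sdft = SlidingDFT(32)
      for t in range(128):
          spectrum = sdft.update(math.sin(2 * math.pi * 4 * t / 32))
      print(max(range(32), key=lambda k: abs(spectrum[k])))  # expect bin 4 (or its mirror, 28)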
  • Kristin Mittag · Alexander Hinneburg
    ABSTRACT: Information needs that require high recall rates, such as searching scientific literature, are difficult to satisfy with ad-hoc keyword search. We propose to state queries implicitly by specifying a set of query documents. The result of such a query is a set of answer documents that are ranked within the answer set. We describe efficient techniques to process such queries. Preliminary experiments using data from the TREC Genomics track 2005 are reported.
    Conference Paper · Jan 2010 (see the sketch below)
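    The abstract does not detail the ranking technique; the sketch below is only one plausible reading of querying by a set of documents: embed everything with TF-IDF, average the query documents into a centroid, and rank candidate answer documents by cosine similarity to that centroid. The scikit-learn calls are real, but the approach and names are illustrative rather than the paper's method.

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      def rank_by_query_documents(query_docs, candidate_docs):
          # Query stated implicitly by example documents: rank candidates by
          # cosine similarity to the centroid of the query documents.
          vectorizer = TfidfVectorizer(stop_words="english")
          matrix = vectorizer.fit_transform(query_docs + candidate_docs)
          centroid = np.asarray(matrix[:len(query_docs)].mean(axis=0))  # dense 1 x V
          scores = cosine_similarity(centroid, matrix[len(query_docs):])[0]
          order = np.argsort(scores)[::-1]
          return [(candidate_docs[i], float(scores[i])) for i in order]

      # toy usage
      queries = ["gene expression in yeast", "yeast genome regulation"]
      candidates = ["regulation of gene expression", "stock market prices", "yeast protein folding"]
      for doc, score in rank_by_query_documents(queries, candidates):
          print(f"{score:.2f}  {doc}")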
  • André Gohr · Myra Spiliopoulou · Alexander Hinneburg

    Conference Paper · Jan 2010
  • Christine Staiger · Alexander Hinneburg · Ralf Bernd Klösgen
    ABSTRACT: Most mitochondrial proteins are synthesized in the cytosol of eukaryotic cells as precursor proteins carrying N-terminal extensions called transit peptides or presequences, which mediate their specific transport into mitochondria. However, plant cells possess a second potential target organelle for such transit peptides, the chloroplast. It can therefore be assumed that mitochondrial transit peptides in plants are exposed to an increased demand for specificity, which in turn leads to reduced degrees of freedom in these transit peptides compared with those of nonplant organisms. Our study investigates this hypothesis using fractal dimension. Statistical analysis of sequence data shows that the fractal dimension of mitochondrial transit peptides in plants is indeed significantly lower than that of nonplant organisms.
    Article · May 2009 · Molecular Biology and Evolution
  • André Gohr · Alexander Hinneburg · Rene Schult · Myra Spiliopoulou
    ABSTRACT: Document collections evolve over time, new topics emerge and old ones decline. At the same time, the terminology evolves as well. Much literature is devoted to topic evolution in finite document sequences assuming a fixed vocabulary. In this study, we propose "Topic Monitor" for the monitoring and understanding of topic and vocabulary evolution over an infinite document sequence, i.e. a stream. We use Probabilistic Latent Semantic Analysis (PLSA) for topic modeling and propose new folding-in techniques for topic adaptation under an evolving vocabulary. We extract a series of models, on which we detect index-based topic threads as human-interpretable descriptions of topic evolution.
    Conference Paper · Apr 2009

  • Chapter · Jan 2009
  • ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on this data, leveling off the ultimate value of these projects far below their potential. A prominent reason why published proteomics data is seldom reanalyzed lies in the heterogeneous nature of the original sample collection and the subsequent data recording and processing. To illustrate that at least part of this heterogeneity can be compensated for, we here apply a latent semantic analysis to the data contributed by the Human Proteome Organization's Plasma Proteome Project (HUPO PPP). Interestingly, despite the broad spectrum of instruments and methodologies applied in the HUPO PPP, our analysis reveals several obvious patterns that can be used to formulate concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in future experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data by noise-tolerant algorithms such as the latent semantic analysis holds great promise and is currently underexploited.
    Article · Feb 2008 · Journal of Proteome Research (see the sketch below)
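    The abstract does not give the exact matrix construction or weighting used on the HUPO PPP identifications; as a hedged, generic sketch, latent semantic analysis here amounts to a truncated SVD of a protein-by-experiment matrix, after which experiments that identified similar protein sets lie close together in the latent space. The toy matrix below is invented.

      import numpy as np

      # hypothetical protein-by-experiment identification counts
      # (rows: proteins, columns: experiments/labs)
      counts = np.array([[5, 4, 0, 0],
                         [3, 5, 1, 0],
                         [0, 1, 6, 5],
                         [0, 0, 4, 6]], dtype=float)

      # latent semantic analysis = truncated SVD of the (optionally weighted) matrix
      U, s, Vt = np.linalg.svd(counts, full_matrices=False)
      k = 2
      experiments_latent = (np.diag(s[:k]) @ Vt[:k]).T   # each experiment as a k-dim vector

      def cos(a, b):
          return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

      # experiments with similar identification profiles end up close in latent space
      print(cos(experiments_latent[0], experiments_latent[1]))  # high: similar experiments
      print(cos(experiments_latent[0], experiments_latent[3]))  # low: dissimilar experiments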
  • Alexander Hinneburg · H.-H. Gabriel · André Gohr
    ABSTRACT: Probabilistic latent semantic indexing (PLSI) represents the documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored. This may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as a prior distribution over the topic simplex using a kernel density estimate of Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
    Conference Paper · Nov 2007 (see the sketch below)
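    For context, a hedged sketch of the standard PLSI folding-in that the paper improves on: with the trained topic-word distributions p(w|z) held fixed, EM re-estimates only the new document's topic mixture. The Bayesian variant proposed above would additionally weight this estimate with a prior built from the known documents' topic mixtures (a kernel density estimate of Dirichlet kernels); that part is only marked by a comment, not implemented.

      import numpy as np

      def fold_in(word_counts, p_w_given_z, n_iter=50):
          # Standard PLSI folding-in: estimate p(z | d_new) for a new document by EM,
          # keeping the trained topic-word distributions p(w|z) fixed.
          # word_counts: length-V count vector, p_w_given_z: K x V matrix.
          n_topics = p_w_given_z.shape[0]
          p_z = np.full(n_topics, 1.0 / n_topics)          # uniform initialisation
          for _ in range(n_iter):
              # E-step: responsibilities p(z | d_new, w) for every word
              joint = p_z[:, None] * p_w_given_z           # K x V
              resp = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
              # M-step: re-estimate only the document's topic mixture
              p_z = (resp * word_counts).sum(axis=1)
              p_z /= p_z.sum()
              # The Bayesian folding-in of the paper would add prior mass here,
              # derived from the known documents' topic mixtures.
          return p_z

      # toy usage: two topics over a four-word vocabulary
      p_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],
                              [0.1, 0.1, 0.4, 0.4]])
      print(fold_in(np.array([5, 4, 1, 0]), p_w_given_z))  # mostly topic 0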
  • Alexander Hinneburg · Hans-Henning Gabriel
    ABSTRACT: The Denclue algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. A disadvantage of Denclue 1.0 is that the hill climbing used may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it just comes close. We introduce a new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling-based methods while sacrificing only a small amount of accuracy.
    Conference Paper · Sep 2007 (see the sketch below)
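    A hedged sketch of the kind of step-size-free hill climbing described above for Gaussian kernels: each iteration jumps to the kernel-weighted mean of the data, which needs no step-size parameter and converges towards a local maximum of the kernel density estimate. Names, parameters and the toy data are illustrative, not the original implementation.

      import numpy as np

      def hill_climb(start, data, h=1.0, tol=1e-6, max_iter=200):
          # Hill climbing on a Gaussian kernel density estimate: move to the
          # kernel-weighted mean of the data until the position stops changing.
          x = np.asarray(start, dtype=float)
          for _ in range(max_iter):
              sq_dist = ((data - x) ** 2).sum(axis=1)
              weights = np.exp(-sq_dist / (2 * h ** 2))
              x_new = (weights[:, None] * data).sum(axis=0) / weights.sum()
              if np.linalg.norm(x_new - x) < tol:
                  break
              x = x_new
          return x

      # toy usage: points that climb to the same local maximum form one cluster
      rng = np.random.default_rng(0)
      data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
      print(np.round(hill_climb(data[0], data, h=0.5), 2))   # near the mode at (0, 0)
      print(np.round(hill_climb(data[-1], data, h=0.5), 2))  # near the mode at (5, 5)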
  • ABSTRACT: Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on this data, levelling off the ultimate value of these projects far below their potential. In order to illustrate that these repositories should be considered sources of detailed knowledge instead of data graveyards, we here present a novel way of analyzing the information contained in proteomics experiments with a 'latent semantic analysis'. We apply this information retrieval approach to the peptide identification data contributed by the Plasma Proteome Project. Interestingly, this analysis is able to overcome the fundamental difficulties of analyzing such divergent and heterogeneous data emerging from large-scale proteomics studies employing a vast spectrum of different sample treatment and mass-spectrometry technologies. Moreover, it yields several concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in the experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data holds great promise and is currently underexploited.
    Conference Paper · Jan 2007
  • Alexander Hinneburg · Björn Egert · Andrea Porzel
    ABSTRACT: 2D nuclear magnetic resonance (NMR) spectra are used in the (structural) analysis of small molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C at the same time. A spectrum consists of several peaks in a two-dimensional space. The most important information of a peak is the location of its center, which captures the bonding relationships of hydrogen and carbon atoms. A spectrum contains much information about the chemical structure of a product, but in most cases the structure cannot be read off in a simple and straightforward manner. Structure elucidation involves a considerable amount of (manual) effort. Using high-field NMR spectrometers, many 2D-NMR spectra can be recorded in a short time. So the common situation is that a lab or company has a repository of 2D-NMR spectra, partially annotated with structural information. For the remaining spectra the structure is unknown. When two research labs collaborate, their repositories are merged and annotations shared. We reduce that problem to the task of finding duplicates in a given set of 2D-NMR spectra. To this end, we propose a simple but robust definition of 2D-NMR duplicates, which allows for small measurement errors. We give a quadratic algorithm for the problem, which can be implemented in SQL. Further, we analyze a more abstract class of heuristics, which are based on selecting particular peaks. Such a heuristic works as a filter step on the pairs of possible duplicates and allows false positives. We compare all methods with respect to their run time. Finally, we discuss the effectiveness of the duplicate definition on real data.
    Article · Jan 2007 (see the sketch below)
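    The paper's precise duplicate definition is not reproduced in the abstract; the sketch below illustrates one simple, error-tolerant variant under the assumption that two spectra count as duplicates when every peak center of each spectrum has a counterpart in the other within small ¹H and ¹³C shift tolerances. Tolerance values and names are invented.

      def peaks_match(p, q, tol_h=0.02, tol_c=0.2):
          # Two peak centers match if both chemical-shift coordinates agree within
          # tolerance (ppm); the 1H and 13C axes use differently scaled tolerances.
          return abs(p[0] - q[0]) <= tol_h and abs(p[1] - q[1]) <= tol_c

      def covered(spectrum_a, spectrum_b):
          # Every peak of spectrum_a has at least one matching peak in spectrum_b.
          return all(any(peaks_match(p, q) for q in spectrum_b) for p in spectrum_a)

      def is_duplicate(spectrum_a, spectrum_b):
          # Symmetric, error-tolerant duplicate test on lists of peak centers.
          return covered(spectrum_a, spectrum_b) and covered(spectrum_b, spectrum_a)

      # toy usage: peak centers given as (1H shift, 13C shift) in ppm
      s1 = [(7.26, 128.0), (3.71, 52.3)]
      s2 = [(7.25, 128.1), (3.72, 52.2)]
      print(is_duplicate(s1, s2))  # True: all peaks agree within tolerance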
  • Alexander Hinneburg · Ralf Klinkenberg · Ingo Mierswa · Stefan Posch · Steffen Neumann

    Conference Paper · Jan 2007