Stefan Siersdorfer

Forschungszentrum L3S

PhD

About

65 Publications
16,418 Reads
1,624 Citations
Citations since 2016: 908 (5 Research Items)

Publications (65)
Article
In recent years, the blogosphere has become a vital part of the web, covering a variety of points of view and opinions on political and event-related topics such as immigration, election campaigns, or economic developments. Public opinion is usually tracked by conducting surveys, resulting in significant costs both for intervie...
Conference Paper
Full-text available
Trying to comprehend the structure and content of large text corpora can be a daunting and often time consuming task. In this paper, we introduce a novel tool that exploits the structural properties for extracting and visualizing the underlying topics in a given dataset. To this end, we make use of a combination of latent topic analysis, discrimina...
Article
Full-text available
Social network analysis is leveraged in a variety of applications such as identifying influential entities, detecting communities with special interests, and determining the flow of information and innovations. However, existing approaches for extracting social networks from unstructured Web content do not scale well and are only feasible for small...
Conference Paper
Full-text available
Social graph construction from various sources has been of interest to researchers due to its application potential and the broad range of technical challenges involved. The World Wide Web provides a huge amount of continuously updated data and information on a wide range of topics created by a variety of content providers, and makes the study of e...
Conference Paper
Many modern data analytics applications in areas such as crisis management, stock trading, and healthcare rely on components capable of near real-time processing of streaming data produced at varying rates. In addition to automatic processing methods, many tasks involved in those applications require further human assessment and analysis. Howeve...
Conference Paper
Full-text available
Many data processing tasks such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models require human input, and, on a large scale, can only be accurately solved using crowd based online work. Recent work shows that frameworks where crowd workers compete against each oth...
Conference Paper
Full-text available
Crowd based online work is leveraged in a variety of applications such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models. However, annotating large amounts of data through crowdsourcing can be slow and costly. In order to improve both cost and time efficiency of cr...
Conference Paper
Full-text available
A large number of images are continuously uploaded to popular photo sharing websites and online social communities. In this demonstration we show a novel application which automatically classifies images in a live photo stream according to their attractiveness for the community, based on a number of visual and textual features. The system effective...
Article
Full-text available
An analysis of the social video sharing platform YouTube and the news aggregator Yahoo! News reveals the presence of vast amounts of community feedback through comments for published videos and news stories, as well as through meta ratings for these comments. This paper presents an in-depth study of commenting and comment rating behavior on a sampl...
Article
Full-text available
We propose a novel collaborative approach for document classification, combining the knowledge of multiple users for improved organization of data such as individual document repositories or emails. To this end, we distribute locally built classification models in a network of participating users, and combine the shared classifiers into more powerf...
Article
Full-text available
The Web contains an increasing amount of biased and opinionated documents on politics, products, and polarizing events. In this article, we present an in-depth analysis of Web search queries for controversial topics, focusing on query sentiment. To this end, we conduct extensive user assessments and discriminative term analyses, as well as a sentime...
Conference Paper
Full-text available
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously kept up to date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we pre...
Conference Paper
Full-text available
Sentiment lexica are useful for analyzing opinions in Web collections, for domain-dependent sentiment classification, and as sub-components of recommender systems. In this paper, we present a strategy for automatically generating topic-dependent lexica from large corpora of review articles by exploiting accompanying user ratings. Our approach combi...
Conference Paper
Full-text available
Wikipedia is widely considered the largest and most up-to-date online encyclopedia, with its content being continuously maintained by a supporting community. In many cases, real-life events like new scientific findings, resignations, deaths, or catastrophes serve as triggers for collaborative editing of articles about affected entities such as pers...
Conference Paper
Full-text available
We propose two efficient algorithms for exploring topic diversity in large document corpora such as user generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating di...
Article
Full-text available
Photo publishing in Social Networks and other Web2.0 applications has become very popular due to the pervasive availability of cheap digital cameras, powerful batch upload tools and a huge amount of storage space. A portion of uploaded images are of a highly sensitive nature, disclosing many details of the users' private life. We have developed a w...
Conference Paper
Photo publishing in Social Networks and other Web2.0 applications has become very popular due to the pervasive availability of cheap digital cameras, powerful batch upload tools and a huge amount of storage space. A portion of uploaded images are of a highly sensitive nature, disclosing many details of the users’ private life. We have developed a w...
Article
We propose a novel approach to multimedia information understanding based on the analysis of contextual data generated by users in collaborative multimedia databases such as YouTube. Our novel approach exploits low level content analysis to address two difficult problems in social media, namely tag sparsity and summarization. We use content duplica...
Conference Paper
Full-text available
Modern content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users' private sphere. In order to support users in making privacy decisions in the cont...
Article
Full-text available
Modern content sharing environments such as Flickr or YouTube contain a large amount of private resources such as photos showing weddings, family holidays, and private parties. These resources can be of a highly sensitive nature, disclosing many details of the users' private sphere. In order to support users in making privacy decisions in the cont...
Conference Paper
In this paper, we present an in-depth analysis of Web search queries for controversial topics, focusing on query sentiment. To this end, we conduct extensive user assessments as well as an automatic sentiment analysis using the SentiWordNet thesaurus.
Article
Full-text available
The emergence of large-scale social Web communities has enabled users to share vast amounts of multimedia content online. An analysis of YouTube reveals a high amount of redundancy, in the form of videos with overlapping or duplicated content. We use robust content-based video analysis techniques to detect overlapping sequences between videos. Base...
Conference Paper
Full-text available
We propose a novel collaborative approach for distributed document classification, combining the knowledge of multiple users for improved organization of data such as individual document repositories or emails. The approach builds on top of a P2P network and outperforms state-of-the-art approaches in collaborative classification.
Article
Flickr, the large-scale online photo sharing website, is often viewed as one of the ‘classic’ examples of Web2.0 applications through which researchers are able to observe the social behavior of online communities. One of the main features of Flickr is groups. These provide a means to organize, share and discuss photos of potential interest to grou...
Conference Paper
In this paper we study the connection between sentiment of images expressed in metadata and their visual content in the social photo sharing environment Flickr. To this end, we consider the bag-of-visual words representation as well as the color distribution of images, and make use of the SentiWordNet thesaurus to extract numerical values for their...
Article
Full-text available
In this paper we study the connection between sentiment of images expressed in metadata and their visual content in the social photo sharing environment Flickr. To this end, we consider the bag-of-visual words representation as well as the color distribution of images, and make use of the SentiWordNet thesaurus to extract numerical values for their...
Conference Paper
Full-text available
In this paper, we propose a probabilistic algorithm for detecting near duplicate text, audio, and video resources efficiently and effectively in large-scale P2P systems. To this end, we present a thorough cost and probabilistic analysis that allows the algorithm to adapt to network and data collection characteristics for minimizing network cost. In...
Conference Paper
We present a brief overview of the way in which image analysis, coupled with associated collateral text, is being used for auto-annotation and sentiment analysis. In particular, we describe our approach to auto-annotation using the graph-theoretic dominant set clustering algorithm and the annotation of images with sentiment scores from SentiWordNe...
Article
Search engines have become the main entry point to Web content, and a large part of the "visible" Web consists of what they present as top retrieved results. Therefore, it would be desirable if the first few results were a representative sample of the entire result set. This paper provides a preliminary study about opinions contained in sea...
Conference Paper
Full-text available
An analysis of the social video sharing platform YouTube reveals a high amount of community feedback through comments for published videos as well as through meta ratings for these comments. In this paper, we present an in-depth study of commenting and comment rating behavior on a sample of more than 6 million comments on 67,000 YouTube videos for...
Article
Full-text available
We present a brief overview of the way in which image analysis, coupled with associated collateral text, is being used for auto-annotation and sentiment analysis. In particular, we describe our approach to auto-annotation using the graph-theoretic dominant set clustering algorithm and the annotation of images with sentiment scores from SentiWordNe...
Conference Paper
Full-text available
The analysis of the leading social video sharing platform YouTube reveals a high amount of redundancy, in the form of videos with overlapping or duplicated content. In this paper, we show that this redundancy can provide useful information about connections between videos. We reveal these links using robust content-based video analysis techniques a...
Conference Paper
Full-text available
The rapidly increasing popularity of Web 2.0 knowledge and content sharing systems and growing amount of shared data make discovering relevant content and finding contacts a difficult enterprise. Typically, folksonomies provide a rich set of structures and social relationships that can be mined for a variety of recommendation purposes. In this pa...
Conference Paper
Full-text available
Web 2.0 applications like Flickr, YouTube, or Del.icio.us are increasingly popular online communities for creating, editing and sharing content. The growing size of these folksonomies poses new challenges in terms of search and data mining. In this paper we introduce a novel methodology for automatically ranking and classifying photos according t...
Article
Full-text available
This article introduces a methodology for automatically organizing document collections into thematic categories for Personal Information Management (PIM) through collaborative sharing of machine learning models in an efficient and privacy-preserving way. Our objective is to combine multiple independently learned models from several users to constr...
Conference Paper
The International Workshop on Recommendation and Collaboration (ReColl 2008) aims to identify emerging trends in recommendation technology and collaborative environments in the context of intelligent user interfaces. We explore these two topics separately and the synergies between them.
Article
This chapter addresses the problem of automatically organizing heterogeneous collections of Web documents for the generation of thematically-focused expert search engines and portals. As a possible application scenario for our techniques, we consider a focused Web crawler that aims to populate topics of interest by automatically categorizing newly-...
Conference Paper
Collaborative data/knowledge management methods aim to achieve improved result quality through combination or merging of results and models obtained from multiple users and sites. Typical application scenarios in the domain of Web information systems include collaborative methods and meta methods for information acquisition (e.g. collaborative Web...
Conference Paper
Full-text available
This paper describes initial work on developing an information system to gather, process and visualise various multimedia data sources related to the South Yorkshire (UK) floods of 2007. The work is part of the Memoir project which aims to investigate how technology can help people create and manage long-term personal memories. We are using maps to...
Conference Paper
Full-text available
In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples...
Conference Paper
Full-text available
Web 2.0 applications like Flickr, YouTube, or Del.icio.us are increasingly popular online communities for creating, editing and sharing content. However, the rapid increase in size of online communities and the availability of large amounts of shared data make discovering relevant content and finding related users a difficult task. Web 2.0...
Conference Paper
This paper describes an efficient method to construct reliable machine learning applications in peer-to-peer (P2P) networks by building ensemble based meta methods. We consider this problem in the context of distributed Web exploration applications like focused crawling. Typical applications are user-specific classification of retrieved Web content...
Conference Paper
This paper describes an efficient method to construct reliable machine learning applications in peer-to-peer (P2P) networks by building ensemble based meta methods. We consider this problem in the context of distributed Web exploration applications like focused crawling. Typical applications are user-specific classification of retrieved Web conten...
Article
Full-text available
This paper addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our appr...
Conference Paper
Full-text available
This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages...
Conference Paper
Full-text available
This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to the topic categories (classes) we are interested in. This type of document typically cannot be covered by the training set; nevertheless in many real world a...
Conference Paper
This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages...
Article
Full-text available
This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These t...
Article
Full-text available
Automatic text classification methods come with various calibration parameters such as thresholds for probabilities in Bayesian classifiers or for hyperplane distances in SVM classifiers. In a given application context these parameters should be set so as to meet the relative importance of various result quality metrics such as precision versus rec...
Article
Full-text available
... above (i.e., viewing the query terms as an initial training document). According to [CBD99a] the key components of a focused crawler are a document classifier to test whether a visited document fits into one of the specified topics of interest, and a distiller to identify the best URLs for the crawl frontier (i.e., those hyperlinks in already visite...
Conference Paper
Focused (thematic) crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It involves the automatic classification of visited documents into a user- or community-specific topic hierarchy (ontology). The quality of training data for the classifier is the most critical issue and a potential bottleneck fo...
Conference Paper
Full-text available
This paper presents the BINGO! focused crawler, an advanced tool for information portal generation and expert Web search. In contrast to standard search engines such as Google which are solely based on precomputed index structures, a focused crawler interleaves crawling, automatic classification, link analysis and assessment, and text filtering...
Article
Full-text available
This paper deals with the automatic classification of Web documents into a given taxonomy. We consider vector-based machine learning methods, using SVMs (Support Vector Machines) as an example. In this paper we describe ways of generating feature vectors that take into account the particular...
Article
Full-text available
This paper presents the BINGO! focused crawler, an advanced tool for information portal generation and expert Web search. In contrast to standard search engines such as Google which are solely based on precomputed index structures, a focused crawler interleaves crawling, automatic classification, link analysis and assessment, and text filtering. A...
Conference Paper
Full-text available
The BINGO! system implements an approach to focused crawling that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic "archetypes" and uses them for periodically re-training the classifier; this way the crawler is dynamically...
Article
Full-text available
In this paper, we provide several alternatives to the classical Bag-Of-Words model for automatic authorship attribution. To this end, we consider linguistic and writing style information such as grammatical structures to construct different document representations. Furthermore we describe two techniques to combine the obtained representations:...
