Jean-Yves Delort’s research while affiliated with Google Inc. and other places


Publications (22)


WHAD: Wikipedia historical attributes data: Historical structured data extraction and vandalism detection from the Wikipedia edit history
  • Article

December 2013 · 80 Reads · 16 Citations

Language Resources and Evaluation

Enrique Alfonseca · [...] · Jean-Yves Delort · [...]

This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
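As a rough illustration of the (attribute, value) mining described above, here is a minimal Python sketch. It assumes revisions are available as (timestamp, wikitext) pairs and that infobox fields follow the usual `| key = value` wikitext convention; the function names and the regular expression are illustrative simplifications, not the paper's pipeline.

```python
import re

# Hypothetical minimal parser for one revision's wikitext. The paper's actual
# extraction over the full English Wikipedia edit history is far more involved
# (template normalization, value cleanup, temporal anchoring, vandalism filtering).
INFOBOX_ATTR = re.compile(r"^\s*\|\s*([\w ]+?)\s*=\s*(.+?)\s*$", re.MULTILINE)

def extract_attributes(wikitext: str) -> dict:
    """Collect (attribute, value) pairs from infobox-style '| key = value' lines."""
    return {k.strip().lower(): v for k, v in INFOBOX_ATTR.findall(wikitext)}

def attribute_timeline(revisions):
    """revisions: iterable of (timestamp, wikitext), oldest first.
    Yields (timestamp, attribute, new_value) whenever a value changes,
    giving a temporally anchored history of each attribute."""
    previous = {}
    for ts, text in revisions:
        current = extract_attributes(text)
        for attr, value in current.items():
            if previous.get(attr) != value:
                yield ts, attr, value
        previous = current
```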


Pattern learning for relation extraction with a hierarchical topic model
  • Conference Paper
  • Full-text available

Figure 2: Plate diagram of the generative model used.
Figure 3: Evaluation of the extractions. X-axis: threshold for p(r|w); Y-axis: precision of the extractions (%).

July 2012 · 216 Reads · 76 Citations

We describe the use of a hierarchical topic model for automatically identifying syntactic and lexical patterns that explicitly state ontological relations. We leverage distant supervision using relations from the knowledge base FreeBase, but do not require any manual heuristic nor manual seed list selections. Results show that the learned patterns can be used to extract new relations with good precision.
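The distant-supervision step mentioned in the abstract can be conveyed with a toy sketch: for each known knowledge-base relation instance, harvest the tokens between the two entities whenever both appear in a sentence. The paper then clusters such patterns with a hierarchical topic model; the sketch below shows only the candidate-pattern harvesting, with made-up data structures and the assumption that entities are single tokens.

```python
from collections import Counter

def mine_patterns(sentences, kb_pairs):
    """sentences: list of token lists.
    kb_pairs: dict mapping a relation name to a set of (subject, object)
    entity pairs, e.g. {"place_of_birth": {("Einstein", "Ulm")}}, as would
    be taken from a knowledge base such as FreeBase.
    Returns, per relation, a Counter over candidate lexical patterns."""
    patterns = {rel: Counter() for rel in kb_pairs}
    for tokens in sentences:
        for rel, pairs in kb_pairs.items():
            for subj, obj in pairs:
                if subj in tokens and obj in tokens:
                    i, j = tokens.index(subj), tokens.index(obj)
                    lo, hi = min(i, j), max(i, j)
                    # The token span between the two entities is one
                    # candidate pattern for this relation.
                    patterns[rel][" ".join(tokens[lo + 1:hi])] += 1
    return patterns
```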


DualSum: A topic-model based approach for update summarization

April 2012 · 52 Reads · 57 Citations

Update summarization is a recent challenge in multi-document summarization that focuses on summarizing a set of recent documents relative to another set of earlier documents. We present an unsupervised probabilistic approach to model novelty in a document collection and apply it to the generation of update summaries. The new model, called DualSum, ranks second or third on the ROUGE metrics when tuned on previous TAC competitions and tested on TAC-2011, and is statistically indistinguishable from the winning system. A manual evaluation of the generated summaries shows state-of-the-art results for DualSum with respect to focus, coherence and overall responsiveness.
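DualSum itself is a generative topic model, but the novelty intuition can be conveyed with a much simpler stand-in: rank sentences from the update collection by how much more likely their words are under the update collection than under the background collection. The data layout and smoothing constant below are assumptions for illustration only.

```python
import math
from collections import Counter

def novelty_ranking(background_docs, update_docs, alpha=0.1):
    """background_docs/update_docs: lists of documents, each a list of
    sentences, each a list of word tokens. Returns update sentences ranked
    by average per-word log-likelihood ratio (update vs. background)."""
    words = lambda docs: [w for d in docs for s in d for w in s]
    vocab = set(words(background_docs)) | set(words(update_docs))

    def unigram(docs):
        counts, total = Counter(words(docs)), len(words(docs))
        # Additive smoothing so unseen words get non-zero probability.
        return lambda w: (counts[w] + alpha) / (total + alpha * len(vocab))

    p_bg, p_up = unigram(background_docs), unigram(update_docs)
    ranked = []
    for doc in update_docs:
        for sent in doc:
            if sent:
                score = sum(math.log(p_up(w) / p_bg(w)) for w in sent) / len(sent)
                ranked.append((score, " ".join(sent)))
    return sorted(ranked, reverse=True)
```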


Automatic Moderation of Online Discussion Sites

April 2011 · 127 Reads · 23 Citations

International Journal of Electronic Commerce

Online discussion sites are plagued with various types of unwanted content, such as spam and obscene or malicious messages. Prevention- and detection-based techniques have been proposed to filter inappropriate content from online discussion sites. Although prevention techniques have been widely adopted, detection of inappropriate content remains mostly a manual task. Existing detection techniques, which divide into rule-based and statistical techniques, suffer from various limitations. Rule-based techniques usually consist of manually crafted rules or blacklists of keywords. Both are time-consuming to create and tend to generate many false positives and false negatives. Statistical techniques typically use corpora of labeled examples to train a classifier to tell "good" and "bad" messages apart. Although statistical techniques are generally more robust than rule-based techniques, they are difficult to deploy because of the prohibitive cost of manually labeling examples. In this paper we describe a novel classification technique to train a classifier from a partially labeled corpus and use it to moderate inappropriate content on online discussion sites. Partially labeled corpora are much easier to produce than completely labeled corpora, as they consist only of unlabeled examples and examples labeled with a single class (e.g., "bad"). We implemented and tested this technique on a corpus of messages posted on a stock message board and compared it with two baseline techniques. Results show that our method outperforms the two baselines and that it can be used to significantly reduce the number of messages that need to be reviewed by human moderators.
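Learning from a corpus with only "bad" labels plus unlabeled messages is the classic positive-unlabeled (PU) setting. The sketch below uses the Elkan and Noto (2008) calibration trick with scikit-learn rather than the authors' exact technique; treat it as one plausible instantiation, with all names and constants illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_pu(bad_texts, unlabeled_texts, holdout_frac=0.2):
    """Train a 'labeled vs unlabeled' classifier, then rescale its scores
    into an estimate of P(message is bad) via Elkan-Noto calibration."""
    vec = TfidfVectorizer(min_df=2)
    X = vec.fit_transform(bad_texts + unlabeled_texts)
    y = np.array([1] * len(bad_texts) + [0] * len(unlabeled_texts))

    # Hold out a slice of labeled positives to estimate the calibration constant.
    n_hold = max(1, int(holdout_frac * len(bad_texts)))
    train_idx = np.r_[np.arange(n_hold, len(bad_texts)),
                      np.arange(len(bad_texts), len(y))]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])

    # c = average classifier score on held-out known positives;
    # P(bad | x) is approximated by score(x) / c.
    c = clf.predict_proba(X[:n_hold])[:, 1].mean()

    def score(texts):
        s = clf.predict_proba(vec.transform(texts))[:, 1]
        return np.clip(s / c, 0, 1)
    return score
```

Messages scoring above a chosen threshold would be queued for human moderators, which is how such a classifier reduces the volume needing manual review.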


Hierarchical cluster visualization in web mapping systems

April 2010 · 29 Reads · 15 Citations

This paper presents a technique for visualizing large spatial data sets in Web Mapping Systems (WMS). The technique creates a hierarchical clustering tree, which is subsequently used to extract clusters that can be displayed at a given scale without cluttering the map. Voronoi polygons are used as aggregation symbols to represent the clusters. This technique retains hierarchical relationships between data items at different scales. In addition, aggregation symbols do not overlap, and their sizes and the number of points that they cover are controlled by the same parameter. A prototype has been implemented and tested, showing the effectiveness of the method for visualizing large data sets in WMS.
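The scale-dependent cut of a hierarchical clustering tree can be sketched briefly with SciPy. The sketch assumes points are projected map coordinates in metres; the threshold couples cluster extent to the current zoom so that symbols stay readable. Parameter names and the 40-pixel symbol size are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clusters_for_scale(points_m: np.ndarray, metres_per_pixel: float,
                       min_symbol_px: float = 40.0) -> np.ndarray:
    """Return a cluster label per point for the given map scale."""
    # In a deployed WMS the tree would be built once and cached,
    # then cut repeatedly as the user zooms.
    tree = linkage(points_m, method="complete")
    # Merge distance that is still smaller than one symbol at this zoom;
    # complete linkage bounds the diameter of each extracted cluster.
    cut = min_symbol_px * metres_per_pixel
    return fcluster(tree, t=cut, criterion="distance")
```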


Vizualizing Large Spatial Datasets in Interactive Maps

February 2010 · 32 Reads · 20 Citations

This paper addresses the problem of reducing clutter in interactive maps. It presents a new technique for visualizing large spatial datasets using hierarchical aggregation. The technique creates a hierarchical clustering tree, which is subsequently used to extract clusters that can be displayed at a given scale without cluttering the map. Voronoi polygons are used as aggregation symbols to represent the clusters. This technique retains hierarchical relationships between data items at different scales. In addition, aggregation symbols do not overlap, and their sizes and the number of points that they cover are controlled by the same parameter. The scalability analysis shows that the method can effectively be used with datasets of up to 1000 items.
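Continuing the clustering sketch above, Voronoi-based aggregation symbols can be derived from cluster centroids; because Voronoi cells partition the plane, the symbols cannot overlap by construction. scipy.spatial is used here as an assumed stand-in for whatever the prototype implemented.

```python
import numpy as np
from scipy.spatial import Voronoi

def aggregation_cells(points: np.ndarray, labels: np.ndarray):
    """points: (n, 2) map coordinates; labels: cluster label per point.
    Returns the Voronoi diagram over cluster centroids plus, per cluster,
    its centroid and the number of points it aggregates."""
    ks = np.unique(labels)
    centroids = np.array([points[labels == k].mean(axis=0) for k in ks])
    sizes = np.array([(labels == k).sum() for k in ks])
    # Qhull needs at least four non-degenerate seed points in 2-D.
    vor = Voronoi(centroids)
    return vor, centroids, sizes
```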


Automating Financial Surveillance

December 2009 · 195 Reads · 11 Citations

Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering

Jean-Yves Delort · [...] · James R. Curran

Financial surveillance technology alerts analysts to suspicious trading events. Our aim is to identify explainable false positives (e.g., caused by price-sensitive information in company news) and explainable true positives (e.g., caused by ramping in forums) by aligning these alerts with publicly available information. Our system aligns 99% of alerts, which will speed the analysts' task by helping them eliminate false positives and gather evidence for true positives more rapidly.
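One way to picture the alignment step is as a join between alerts and public information items on ticker and time proximity. The field names and the one-hour window below are assumptions for illustration, not the system's actual parameters.

```python
from datetime import timedelta

def align_alerts(alerts, news, window=timedelta(hours=1)):
    """alerts/news: lists of dicts with 'ticker' and 'time' (datetime) keys.
    Pairs each alert with the public items (announcements, news stories,
    forum posts) on the same ticker inside the time window."""
    aligned = []
    for alert in alerts:
        evidence = [n for n in news
                    if n["ticker"] == alert["ticker"]
                    and abs(n["time"] - alert["time"]) <= window]
        # An empty evidence list marks an alert the system cannot explain.
        aligned.append((alert, evidence))
    return aligned
```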


The Impact of Manipulation in Internet Stock Message Boards

November 2009 · 3,592 Reads · 32 Citations

International Journal of Banking, Accounting and Finance

Internet message boards are often used to spread information in order to manipulate financial markets. Although this hypothesis is supported by many cases reported in the literature and in the media, the real impact of manipulation in online forums on financial markets remains an open question. This work analyses the effect of manipulation in internet stock message boards on financial markets. Message board administrators use moderation to restrict market manipulation, and we employ a unique corpus of moderated messages to investigate it. We find that manual supervision of stock message boards by moderators does not effectively protect users against manipulation. Furthermore, by focusing on messages that have been moderated as manipulative due to ramping, we show that ramping is positively related to market returns, volatility and volume. We also demonstrate that stocks with higher turnover, lower price level, lower market capitalization and higher volatility are more common targets of ramping.
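The reported relationship (ramping positively related to returns, volatility and volume) is the kind of result one would test by regressing daily market outcomes on a ramping indicator. The sketch below, using statsmodels with made-up column names, shows the shape of such a test, not the paper's specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

def ramping_effect(df: pd.DataFrame, outcome: str = "abnormal_return"):
    """df columns assumed (illustrative): outcome, 'ramped' (0/1 flag for
    days with messages moderated as ramping), 'turnover', 'log_mcap'."""
    model = smf.ols(f"{outcome} ~ ramped + turnover + log_mcap", data=df)
    # Heteroskedasticity-robust standard errors, as is usual for daily
    # stock-level outcomes.
    return model.fit(cov_type="HC1")
```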


Automatically Characterizing Salience Using Readers' Feedback.

January 2009 · 32 Reads · 2 Citations

Salience is an important characteristic of information, influencing users' cognitive and emotional states. For example, salient parts of a document are those that readers will find moving or provoking. This article analyzes the main characteristics of salience and the different meanings of the concept in information retrieval and linguistics. It also presents a generic approach for identifying linguistically salient segments in a text using readers' textual feedback. The method supports any kind of text and textual feedback. We evaluated its effectiveness on a corpus of blog posts and readers' comments. Our preliminary experiments show promising results, with an F-score of 0.65. The method was also applicable to 90% of commented posts, which shows that it can be used on a large scale.
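A minimal stand-in for the feedback-based idea: treat a post segment as salient when many reader comments are lexically close to it. TF-IDF and the 0.2 similarity threshold are illustrative choices, not the article's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def salient_segments(segments, comments, threshold=0.2):
    """segments: list of post-segment strings; comments: list of reader
    comment strings. Returns segments ranked by how many comments
    'point at' them."""
    vec = TfidfVectorizer(stop_words="english")
    S = vec.fit_transform(segments)
    C = vec.transform(comments)
    sim = cosine_similarity(C, S)            # comments x segments
    # A comment votes for every segment it is sufficiently similar to.
    votes = (sim >= threshold).sum(axis=0)
    return sorted(zip(votes.tolist(), segments), reverse=True)
```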


Amélioration de la navigation dans les hypercartes [Improving navigation in hypermaps]

Figure 1: A map that is hard to read and only weakly interactive.
Figure 2: Example of navigating the interface.

April 2008 · 58 Reads

The Web contains more and more hypermaps, which let users visualize and explore geographic information represented graphically on a map. In current hypermap interfaces, users often have trouble navigating and frequently experience a feeling of being "lost in hyperspace". This article analyzes the main factors behind getting lost in hyperspace and presents an interface suited to navigating hypermaps.


Citations (18)


... Real-life data is often dirty and needs thorough cleaning [16,17,42] prior to further analysis. Particularly in crowd-sourced datasets, such as Wikidata, change exploration can reveal problems in the data values, such as vandalism [7] or sweeping, and thus likely accidental, deletes. We show examples of vandalism discovery in Scenario 1. ...

Reference:

Exploring Change-A New Dimension of Data Analytics
WHAD: Wikipedia historical attributes data: Historical structured data extraction and vandalism detection from the Wikipedia edit history
  • Citing Article
  • December 2013

Language Resources and Evaluation

... Scientific document understanding poses a persistent challenge, primarily attributable to its structure, diverse content modalities (such as tables and figures), and the incorporation of citations within the text. Recently emerged large-scale scientific document summarization and question-answering datasets [2,22,8] were automatically collected from public repositories [4,9]. Several works on scientific documents encompass tasks such as abstract generation [15], delving into the contributions outlined in a paper [20,12], scientific paper summarization [3], and formulating multi-perspective summaries by leveraging reviews of the research papers [5,29]. ...

DualSum: A topic-model based approach for update summarization
  • Citing Conference Paper
  • April 2012

... Data-analytic techniques have the potential to detect false information as it is being disseminated (Delort et al., 2011; Owda et al., 2017). Natural language analytics can detect posts in social media that are intended to pump particular stocks, providing a real-time warning to potential investors. ...

Automatic Moderation of Online Discussion Sites
  • Citing Article
  • April 2011

International Journal of Electronic Commerce

... Various time-based processes for detecting session boundaries from user interactions have been proposed [11, 6]. For instance, in [8] user sessions are extracted with respect to a maximal time span between consecutively accessed pages. Catledge and Pitkow [5] discovered a relation between the length of successively accessed pages and their frequencies. ...

Link Recommender Systems: The Suggestions by Cumulative Evidence Approach
  • Citing Article
  • January 2002

... A clue is a pair (D, W) where W corresponds to the largest set of common pieces of information extracted from the textual content of the set of documents D. Relevant clues are clues that have been accessed during the same search activity. During the demonstration, the concept of a clue about a user's information needs, first introduced in [1], will be presented in more detail and illustrated with an example. Then VISS, our clue visualization software, will be put forward. ...

CEA: A Content-Based Algorithm to Detect Users' Shifts of Focus on the Web

... For instance, social media is a notable vehicle for market manipulation. Several studies document that participants post misleading news to temporarily inflate stock prices and profit from herding behavior (Delort et al., 2009;Sabherwal et al., 2011;Al Nasseri et al., 2015;Renault, 2017). In addition, bots and fake accounts are also increasingly emerging to spread false narratives and generate artificial activities (Davis et al., 2016;Fan et al., 2020). ...

The Impact of Manipulation in Internet Stock Message Boards

International Journal of Banking, Accounting and Finance

... Although text mining has been used in assessing market information's impact on the market (Tetlock 2007) and in automated financial trading (Mittermayer and Knolmayer 2006), studies on textual analysis in an MSS context are preliminary. Existing MSSs typically provide news search capabilities, allowing surveillance specialists to inspect the timeline of events (Milosavljevic et al. 2010). They fail to effectively exploit textual market information and its relations with suspicious transactions. ...

Automating Financial Surveillance

Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering