
Djamel Abdelkader Zighed
University of Lyon, Lyon, France · Human Science Institute
PhD
About
188
Publications
23,782
Reads
1,965
Citations
Introduction
Additional affiliations
January 2011 - December 2015
January 1984 - present
Publications (188)
In scientific digital libraries, papers from different research communities can be described by community-dependent keywords even if they share a semantically similar topic. Articles that are not tagged with enough keyword variations are poorly indexed in any information retrieval system, which limits potentially fruitful exchanges between...
In this article we address the problem of expanding
the set of papers that researchers encounter when conducting
bibliographic research on their scientific work. Using classical
search engines or recommender systems in digital libraries, some
interesting and relevant articles could be missed if they do not
contain the same search key-phrases that t...
This paper proposes a novel approach that incorporates several kinds of metadata, such as citations, co-authorship, titles, and keywords, to identify real authors in the author disambiguation task. Classification schemes make use of these variables to identify authorship. The methodology followed in this paper is: (1) coarse grouping of articles by the use of focu...
This paper focuses on the detection of likely mislabeled instances in a learning dataset. In order to detect potentially mislabeled samples, two solutions are considered which are both based on the same framework of topological graphs. The first is a statistical approach based on Cut Edges Weighted statistics (CEW) in the neighborhood graph. The se...
This book is a collection of representative and novel works done in Data Mining, Knowledge Discovery, Clustering and Classification that were originally presented in French at the EGC'2012 Conference held in Bordeaux, France, in January 2012. This conference was the 12th edition of this event, which takes place each year and which is now successful...
This paper describes SONDY, a tool for analysis of trends and dynamics in online social network data. SONDY addresses two audiences: (i) end-users who want to explore social activity and (ii) researchers who want to experiment and compare mining techniques on social data. SONDY helps end-users like media analysts or journalists understand social ne...
Online social networks play a major role in the spread of information at a very large scale. Much effort has been made to understand this phenomenon, ranging from popular topic detection to information diffusion modeling, including the identification of influential spreaders. In this article, we present a survey of representative methods deali...
Statistical implicative analysis (SIA) is a non-symmetric data analysis method devised by Régis Gras more than thirty years ago. Through theses, journal articles, books, and conferences, it has been developed, and continues to be, by him, by doctoral students, or in collaboration with university research teams in France and abroad...
The recent and novel research contributions collected in this book are extended
and reworked versions of a selection of the best papers that were originally presented
in French at the EGC’2011 Conference held in Brest, France, in January 2011. EGC
stands for "Extraction et Gestion des connaissances" in French, and means
"Knowledge Discovery and Man...
In many application domains, the choice of a proximity measure directly affects the result of classification, comparison or the structuring of a set of objects. For any given problem, the user is obliged to choose one proximity measure among the many existing ones. However, this choice depends on many characteristics. Indeed, according to the notion of...
Online discussions have become increasingly widespread with Web 2.0: no matter the distance, whether you know the person or not, you can discuss and exchange ideas with people all over the world through forums, blogs, and newsgroups. News websites have extensively used forums in order to encourage the reader to be a real participant in the infor...
The expansion of web user roles is, nowadays, a fact due to the ability of users to interact, discuss, exchange ideas and opinions, and form social networks through the web. The interaction level among users leads to the appearance of several social roles which can be characterized as positions, behaviors, or virtual identities. These roles may be...
During the last decade, Knowledge Discovery and Management (KDM or, in French, EGC for Extraction et Gestion des connaissances) has been an intensive and fruitful research topic in the French-speaking scientific community. In 2003, this enthusiasm for KDM led to the foundation of a specific French-speaking association, called EGC, dedicated to supp...
Analyzing social roles inside online communities has become a major challenge. The online communities formed around exchange platforms (e.g., forums) are a growing source of data for analyzing user behavior. This paper proposes an exploratory analysis of communities in a news
website based on its sub-communities. Actually, we assume...
Web forums are a huge data source. They allow people to interact with unknown individuals. Studying forums shows that interaction is apparent not only through the structure but also through the content of the posts. Taking this observation into account, we extract a social network with different kinds of relationships, i.e., the structural relation...
Forums on the Internet are an overwhelming source of knowledge considering the number of topics treated and the users who participate in these discussions. This volume of data is difficult for a person to comprehend given the large number of posts. Our work proposes a new formal framework for synthesizing information contained in these forum...
BELMANDT is the collective pseudonym under which a group of mathematicians and computer scientists have decided to publish their joint work on Pretopology, as a nod to BOURBAKI. This work would never have been realized without the dynamic leadership of Professor Marcel Brissaud, now retired, who is of course associated with the publica...
The choice of a proximity measure between objects has a direct
impact on the results of any operation of supervised or unsupervised
classification, comparison, evaluation or structuring a set of
objects. For a given problem, the user is prompted to choose one
among the many existing proximity measures. However, according to
the notion of topologica...
Ensembles of randomized trees such as Random Forests are among the most popular tools used in machine learning and data mining. Such algorithms work by introducing randomness in the induction of several decision trees before employing a voting scheme to give a prediction for unseen instances. In this paper, randomized trees ensembles are studied in...
Many machine learning algorithms use an entropy measure as an optimization criterion. Among the widely used entropy measures,
Shannon’s is one of the most popular. In some real-world applications, the use of such entropy measures without precautions
could lead to inconsistent results. Indeed, the measures of entropy are built upon some assumptions...
During the last decade, the French-speaking scientific community developed a very strong research activity in the field of Knowledge Discovery and Management (KDM or EGC for “Extraction et Gestion des Connaissances” in French), which is concerned with, among others, Data Mining, Knowledge Discovery, Business Intelligence, Knowledge Engineering and...
This is an edited book, not an article! See
https://www.researchgate.net/publication/231315510_Advances_in_Knowledge_Discovery_and_Management
Decision trees generate classifiers from training data through a process of recursively splitting the data space. In the case of training on continuous-valued data, the associated attributes must be discretized into several intervals using a set of crisp cut points. One drawback of decision trees is their instability, i.e., small data deviations ma...
We extend the framework of spatial autocorrelation analysis to Reproducing Kernel Hilbert Spaces (RKHS). Our results are based on the fact that some geometrical neighborhood structures vary when samples are mapped into an RKHS, while other neighborhood structures do not. These results allow us to design a new measure of the goodness of a k...
Many supervised induction algorithms require discrete data; however, real data often comes in both discrete and continuous formats. Quality discretization of continuous attributes is an important problem that has effects on accuracy, complexity, variance and understandability of the induction model. Usually, discretization and other types of statist...
Real data often comes in a mixed format (discrete or continuous); however, many supervised induction algorithms require discrete data. Quality discretization of continuous attributes is an important problem that has effects on accuracy, complexity, variance and understandability of the induction models. Most of the existing discretizatio...
Many supervised induction algorithms require discrete data, even though real data often comes in both discrete and continuous formats. Quality discretization of continuous attributes is an important problem that has effects on speed, accuracy and understandability of the induction models. Usually, discretization and other types of statistical processes...
The healthcare industry produces a constant flow of data, creating a need for deep analysis of databases through data mining tools and techniques resulting in expanded medical research, diagnosis, and treatment. Data Mining and Medical Knowledge Management: Cases and Applications presents case studies on applications of various modern data mining m...
A multimedia index makes it possible to group data according to similarity criteria. Traditional index structures are based on trees and use the k-Nearest Neighbors (k-NN) approach to retrieve databases. Due to some disadvantages of such an approach, the use of neighborhood graphs was proposed. This approach is interesting, but it has some disadvan...
Ontology learning from text is considered an appealing and challenging approach to address the shortcomings of hand-crafted
ontologies. In this paper, we present OLEA, a new framework for ontology learning from text. The proposal is a hybrid approach
combining the pattern-based and the distributional approaches. It addresses key issues in...
The goal of any clustering algorithm producing flat partitions of data is to find both the optimal clustering solution and
the optimal number of clusters. One natural way to reach this goal without the need for parameters is to involve a validity
index in a clustering process, which can lead to an objective selection of the optimal number of clus...
The goal of any clustering algorithm producing flat partitions of data is to find the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without the need for parameters is to involve a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters...
We propose to evaluate the quality of decision trees grown on imbalanced datasets with a splitting criterion based on an asymmetric entropy measure. To deal with the class imbalance problem in machine learning, especially with decision trees, different authors proposed such asymmetric splitting criteria. After the tree is grown a decision rule has...
The goal of any clustering algorithm producing flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without the need for parameters is to involve a validity index in a clustering process, which can lead to an objective selection of the optimal number of clus...
The goal of any clustering algorithm is to find the optimal clustering solution with the optimal number of clusters. In order to evaluate a clustering solution, a number of validity indices are used during or at the end of a clustering process. They can be internal, external or relative. In this paper, we provide two main contributions: First, we pres...
Decision tree induction has been widely used to generate classifiers from training data through a process of recursively splitting the data space. In the case of training on continuous-valued data, the associated attributes must be discretized in advance or during the learning process. We generate discretization points by performing resampling on...
Implicative statistics criteria have proven to be valuable interestingness
measures for association rules. Here we highlight their interest
for classification trees. We start by showing how Gras's implication
index may be defined for rules derived from an induced decision tree.
This index is especially helpful when the aim is not classification
itse...
This book constitutes the refereed proceedings of the Third International Workshop on Mining Complex Data, MCD 2007, held in Warsaw, Poland, in September 2007, co-located with ECML and PKDD 2007.
The 20 revised full papers presented were carefully reviewed and selected; they present original results on knowledge discovery from complex data. In cont...
This paper makes two contributions to the semantic retrieval of
heterogeneous information (documents composed of text and images):
(1) a new context-based semantic distance measure for
textual data, and (2) an IR system providing a conceptual and an
automatic indexing of documents by considering their heterogeneous
content using a doma...
A major shortcoming of existing semantic similarity methods is that none takes into account the context or the considered domain. However, two concepts that are similar in one context may appear completely unrelated in another. In this paper, our first-level approach is context-dependent. We present a new method that computes semantic similarity in t...
Having a reliable semantic similarity measure between words/concepts can have a major effect in many fields like information retrieval and information integration. A major shortcoming of existing semantic similarity measures is that none takes into account the actual context or the considered domain. However, two concepts that are similar in one context may ap...
Entropy measures, the best known of which is Shannon's, were proposed in the context of information coding and transmission. Nevertheless, since the mid-1960s, they have been used in other fields such as machine learning, and more particularly to build induction graphs and decision trees...
In this paper, we address two problems in classical semantic similarity measures. The first is the context-dependency problem in knowledge-based measures, since none takes into account the context of the target domain; to address it, a multisource context-dependent approach is presented. The second is the coverage problem with these measures, since similarities can...
Induction graphs, which are a generalization of decision trees, have a special place among the methods of Data Mining. Indeed,
they generate lattice graphs instead of trees. They perform well, are capable of handling data in large volumes, are relatively
easy for a non-specialist to interpret, and are applicable without restriction on data of any t...
This paper explains how text mining was used within the context of a research project on social dialogue regimes, jointly undertaken by the University of Geneva, the University of Lyon 2 and the International Institute of Labour Studies of the International Labour Organisation (ILO). The research project, which was made possible through the generou...
This paper highlights the interest of implicative statistics for classification
trees. We start by showing how Gras's implication index may be defined
for the rules derived from an induced decision tree. Then, we show
that residuals used in the modeling of contingency tables provide
interesting alternatives to Gras's index. We then consider two main...
Retrieving hidden information in image databases
is a difficult task because of their complex structure and the
subjectivity related to their interpretation. In this situation the
use of an index is essential. We propose an effective method for
locally updating neighborhood graphs which constitute our index.
This method is based on an intelli...
Text mining is a major field of automatic data processing. A wide variety of conferences, such as TREC, are devoted to it. In this study, we focus on the mining of legal texts, with the aim of automatically classifying these texts. We use...
Discovering hidden information in multimedia databases is a difficult task because of their complex structure and the subjectivity related to their interpretation. In this situation, the use of an index is essential. A multimedia index makes it possible to group the data...
This paper presents a step in a long process of
analyzing, structuring, and retrieving multimedia databases. Indeed,
we propose an improvement to an existing content-based
image retrieval approach. We propose an effective method
for locally updating neighborhood graphs which constitute our
multimedia index. This method is based on an...
In this paper we present a new entropy measure to grow decision trees. This measure is asymmetric, allowing users to grow trees that better correspond to their expectations in terms of recall and precision on each class. We then propose decision rules adapted to such trees. Experiments have been carried out on real medical...
We propose a new statistical approach for characterizing the class separability degree in ℝ^p. This approach is based on a nonparametric statistic called “the Cut Edge Weight”. We show in this paper the principle and the experimental applications of this statistic. First, we build a geometrical connected graph like Toussaint’s Relative Ne...
Decision tree methods generally suppose that the number of categories of the attribute to be predicted is fixed. Breiman et al., with their Twoing criterion in CART, considered gathering the categories of the predicted attribute into two supermodalities. In this paper, we propose an extension of this method. We try to merge the categories in an...
Decision tree methods generally suppose that the number of categories of the attribute to be predicted is fixed. Breiman et al., with their Twoing criterion in CART, considered gathering the categories of the predicted attribute into two supermodalities. In this article, we propose an extension of this method. We try to merge the categories in an o...
We propose a new statistical approach for characterizing the class separability degree in ℝ^p. This approach is based on a non-parametric statistic called ‘the cut edge weight’. We show in this paper the principle and the experimental applications of this statistic. First, we build a geometrical connected graph like Toussaint's Relative Neighbo...
Search algorithms in image databases usually return k nearest neighbours
(kNN) of an image according to a similarity measure. This approach
presents some anomalies and is based on assumptions that are not always
satisfied. We have examined the causes of these anomalies and we have
concluded that image query models have to exploit topological pr...
Neighborhood graphs are an effective and very widespread technique in several fields. However, despite their appeal, their construction algorithms suffer from very high complexity, which prevents their use in applications that process large data volumes. With this high complexity, the update task is also affected....
This paper is concerned with the neighbourhood-based supervised learning of a continuous class. It deals with identifying
and handling outliers. We first explain why and how to use the neighbourhood graph derived from predictors in the prediction
of a continuous class. Global quality of the representation is evaluated by a neighbourhood autocorrelat...
The purpose of this study was to determine whether reading performance is equivalent between the initial mammogram on a viewbox and the digitalized screen display. We randomly selected forty-nine mammograms revealing cancer and the same number of normal or benign mammograms. The benign diagnosis was confirmed after a two-year follow-up and a second...
In supervised learning, all the instances of the learning sample must be preclassified. In some cases, the labelling is done subjectively by an expert. For instance, in text categorisation, each text is assigned one or more labels or categories. In such cases, the labelling is inconsistent because it may differ from one expert to another an...
Data mining and knowledge discovery aim at producing useful and reliable models from the data. Unfortunately some databases contain noisy data which perturb the generalization of the models. An important source of noise consists of mislabelled training instances. We offer a new approach which deals with improving classification accuracies by using...
In the context of complex data mining, we propose in this article an image database representation using topological graphs. Each image is represented as a point in a multidimensional space ℝ^p, using numerical features automatically extracted from the image. These points are gathered in a topological graph. Graph exploration may be compared to database...
In the context of complex data mining, we propose in this article to represent an image database using topological graphs. Each image is represented as a point in the multidimensional space ℝ^p using numerical features automatically extracted from...
This article discusses ways of measuring the goodness-of-fit of induction trees to data, as is classically done for statistical models. We show how to adapt chi-square statistics to induction trees, in particular the likelihood-ratio statistic used in the modeling of...
In this paper we propose a topological model for image database query using neighborhood graphs. A related neighborhood graph is built from automatically extracted low-level features, which represent images as points of
space. Graph exploration corresponds to database browsing; the neighbors of a node represent similar images. In order to perform...
This paper is concerned with the goodness-of-fit of induced decision trees. Namely, we explore the possibility to measure the goodness-of-fit as it is classically done in statistical modeling. We show how Chi-square statistics and especially the Log-likelihood Ratio statistic that is abundantly used in the modeling of cross tables, can be adapted f...
This paper is concerned with the determination, in a crosstable, of the simultaneous merging of rows and columns that maximizes the association between the row and column variables. We present a heuristic, first introduced in [21], and discuss its complexity and reliability. The heuristic drastically reduces the complexity of the exhaustive sc...
Decision tree methods generally suppose that the number of categories of the attribute to be predicted is fixed. Breiman et al., with their Twoing criterion in CART, considered gathering the categories of the predicted attribute into two superclasses. In this paper, we propose an extension of this method. We try to merge the categories in an optima...
This article is devoted to the statistical evaluation of the descriptions of contingency tables provided by induction trees. We restrict ourselves to the particular case of categorical data. Three aspects are addressed in turn. (i) The nature of fit in supervised learning, where we emphasize the distinctio...
In this paper, we propose a technique for detection and segmentation of skin color areas. This method is based on data mining and image analysis techniques for skin model definition. This model is able to classify skin-color and non-skin-color pixels using different color spaces. Our method uses data mining techniques in order to produce classi...
In this paper, we build predictors which are able to detect executive curricula vitae (CVs). The corpus used is composed of executive and non-executive CVs and is very unbalanced: more than 90% of it consists of non-executive CVs. Low structure, scattered information, and strongly symbolic representation are some of the characteristics that...
Our research deals with representation quality and outlier detection in supervised learning. The prediction of a continuous value is referred to as regression learning. In this case, the neighbourhood graph is first constructed from the predictors. In recent work, we proposed to evaluate the representation quality using neig...
Interactive data exploration and knowledge visualization are too often neglected in Data Mining tools. The ongoing work presented in this paper aims at filling this gap. Making the hypothesis that too much data kills data, we propose to build graphically displayed interactive contingency. Manipulation primitives, inspired from dynamic queries and OLAP,...