Conference Paper

Reducing metadata complexity for faster table summarization

DOI: 10.1145/1739041.1739072 Conference: EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings
Source: DBLP


Because screen real estate puts stringent constraints on how much data can be presented to users at once, table summarization is an essential tool for helping users quickly explore large data sets. An effective summary needs to minimize the information loss caused by the reduction in detail. Summarization algorithms leverage the redundancy in the data to identify value and tuple clustering strategies that represent (almost) the same amount of information with a smaller number of data representatives. It has been shown that, when available, metadata, such as value hierarchies associated with the attributes of the tables, can greatly reduce the resulting information loss. However, table summarization, whether carried out through data analysis performed on the table from scratch or supported by already available metadata, is an expensive operation. We note that the table summarization process can be significantly sped up when the metadata used to support the summarization is itself pre-processed to remove unnecessary details. The pre-processing of the metadata, however, needs to be performed carefully to ensure that it does not add a significant amount of additional loss to the table summarization process. In this paper, we propose the tRedux algorithm for value hierarchy pre-processing and reduction. Experimental evaluations show that, depending on the table and taxonomy complexity, metadata summarization can provide gains in table summarization time that range (in absolute values) from seconds to tens of thousands of seconds. Consequently, at the cost of only about 20% additional reduction in table quality, tRedux can provide roughly 2x speedups in table summarization time. Experiments also show that tRedux performs better than alternative metadata reduction strategies in supporting table summarization and that, as the taxonomy complexity increases, the absolute gains of tRedux also increase.
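To make the idea of hierarchy-supported summarization concrete, the following is a minimal sketch, not the tRedux algorithm itself. It assumes a toy value hierarchy (the `PARENT` mapping and its food values are hypothetical), generalizes a column by climbing the hierarchy to a chosen depth, and uses the total number of levels climbed as a crude stand-in for information loss. The `cut_hierarchy` helper is a naive depth cut, included only to illustrate that a smaller hierarchy leaves fewer candidate representatives to consider.

```python
from collections import Counter

# Hypothetical value hierarchy for one attribute, stored as child -> parent.
PARENT = {
    "salmon": "fish", "tuna": "fish",
    "beef": "meat", "pork": "meat",
    "fish": "food", "meat": "food",
    "food": None,
}

def depth(node, parent=PARENT):
    """Number of edges between node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

def generalize(value, max_depth, parent=PARENT):
    """Climb the hierarchy until the value sits at max_depth or above."""
    steps = 0
    while depth(value, parent) > max_depth:
        value = parent[value]
        steps += 1
    return value, steps

def summarize(column, max_depth, parent=PARENT):
    """Return (representative -> count, total levels climbed as a loss proxy)."""
    summary = Counter()
    loss = 0
    for v in column:
        g, steps = generalize(v, max_depth, parent)
        summary[g] += 1
        loss += steps
    return summary, loss

def cut_hierarchy(max_depth, parent=PARENT):
    """Naive reduction (an assumption, not tRedux): drop nodes deeper than max_depth."""
    return {n: p for n, p in parent.items() if depth(n, parent) <= max_depth}

column = ["salmon", "tuna", "beef", "pork", "salmon"]
summary, loss = summarize(column, max_depth=1)
reduced = cut_hierarchy(max_depth=1)
```

Here five tuples collapse into two representatives (`fish` and `meat`), and the reduced hierarchy keeps three of the original seven nodes. A real reduction strategy would choose which nodes to keep based on the loss they induce rather than on depth alone.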



Available from: K. Selcuk Candan, Aug 25, 2014
  • Source
    • "Allowed values are: {meat, fish}. The problem to solve is the summarization of the dataset using Attribute Value Taxonomies [3]. The goal is to produce a good and informative summary of the data for the user using the metadata supplied by the taxonomies. "
    ABSTRACT: In many domains (e.g., data mining, data management, data warehousing), a hierarchical organization of attribute values can help the data analysis process. Nevertheless, such hierarchical knowledge is not always available, and may be inadequate or useless even when it exists. Starting from this consideration, in this paper we tackle the problem of automatically defining data-driven taxonomies. To do this, we combine techniques from information theory and clustering to obtain a structured representation of the attribute values: the Contextual Attribute-Value Taxonomy (CAVT). The two main advantages of our method are that it is fully unsupervised (i.e., it requires no knowledge provided by an expert) and parameter-free. We evaluate the benefit of using CAVTs in the following two tasks: (i) the multilevel multidimensional sequential pattern mining problem, in which hierarchies are used to exploit abstraction over the data, and (ii) the table summarization problem, in which the hierarchies are used to aggregate the data and supply a sketch of the original information to the user. To validate our approach, we use real-world datasets, on which we obtain appreciable results in both quantitative and qualitative evaluation.
    Full-text · Article · Mar 2012
  • Source
    ABSTRACT: The Semantic Web (SW) deployment is now a realization and the amount of semantic annotations is ever increasing thanks to several initiatives that promote a change in the current Web towards the Web of Data, where the semantics of data become explicit through data representation formats and standards such as RDF/(S) and OWL. However, such initiatives have not yet been accompanied by efficient intelligent applications that can exploit the implicit semantics and thus, provide more insightful analysis. In this paper, we provide the means for efficiently analyzing and exploring large amounts of semantic data by combining the inference power from the annotation semantics with the analysis capabilities provided by OLAP-style aggregations, navigation, and reporting. We formally present how semantic data should be organized in a well-defined conceptual MD schema, so that sophisticated queries can be expressed and evaluated. Our proposal has been evaluated over a real biomedical scenario, which demonstrates the scalability and applicability of the proposed approach.
    Full-text · Conference Paper · Mar 2010
  • Source
    ABSTRACT: Considering relational tables as the object of analysis, methods to summarize them can give the analyst a starting point for exploring the data. Typically, table summarization aims at producing an informative data summary through the use of metadata supplied by attribute taxonomies. Nevertheless, such hierarchical knowledge is not always available, or may be inadequate even when it exists. To overcome these limitations, we propose a new framework, named cTabSum, to automatically generate attribute value taxonomies and directly perform table summarization based on the table's own content. Our approach takes a relational table as input and proceeds in two steps. First, a taxonomy for each attribute is extracted. Second, a new table summarization algorithm exploits the automatically generated taxonomies. An information theory measure is used to guide the summarization process. Alongside the new algorithm, we also develop a prototype. Interestingly, our prototype incorporates some additional features to help the user become familiar with the data: (i) the summarized table produced by cTabSum can be used as a recommended starting point to browse the data; (ii) some easy-to-understand charts make it possible to visualize how the taxonomies have been built; (iii) finally, standard OLAP operators, i.e., drill-down and roll-up, have been implemented to easily navigate within the data set. In addition, we supply an objective evaluation of our table summarization strategy over real data.
    Full-text · Conference Paper · Aug 2013