Christos Faloutsos’s research while affiliated with Santa Clara University and other places


Publications (898)


Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization
  • Preprint
  • File available

December 2024

Luca Masserano · Abdul Fatir Ansari · Boran Han · [...]
How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) provides better accuracy than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and performs on par or better than modern deep learning models trained specifically on each dataset; and ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics. In addition, we show that our method can easily capture complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time.
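The pipeline described in the abstract (scale, decompose, threshold, quantize) can be illustrated with a minimal sketch. This is not the WaveToken implementation: it hard-codes a Haar wavelet, z-normalization, and a uniform quantizer, and the threshold and clipping values are illustrative assumptions.

```python
import numpy as np

def haar_decompose(x, levels=3):
    """Simple Haar DWT: returns detail coefficients per level, then the final approximation."""
    coeffs = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        evens, odds = approx[0::2], approx[1::2]
        coeffs.append((evens - odds) / np.sqrt(2))  # detail band (fine structure)
        approx = (evens + odds) / np.sqrt(2)        # approximation (coarse structure)
    coeffs.append(approx)
    return coeffs

def tokenize(series, levels=3, vocab_size=1024, clip=3.0):
    # 1) scale the input (z-normalization here; the paper's exact scaling may differ)
    s = (series - series.mean()) / (series.std() + 1e-8)
    # 2) wavelet decomposition into time-localized frequency bands
    coeffs = np.concatenate(haar_decompose(s, levels))
    # 3) threshold small coefficients to zero (sparsification)
    coeffs[np.abs(coeffs) < 0.1] = 0.0
    # 4) uniform quantization of clipped coefficients onto a finite vocabulary
    bins = np.linspace(-clip, clip, vocab_size - 1)
    return np.digitize(np.clip(coeffs, -clip, clip), bins)

series = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * np.arange(64)
tokens = tokenize(series)  # one discrete token per wavelet coefficient
```

An autoregressive model would then be pre-trained to predict the token sequence for the forecast horizon, and the forecast recovered by dequantizing and inverting the wavelet transform.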


A Flexible Forecasting Stack

November 2024 · 9 Reads

Proceedings of the VLDB Endowment

Forecasting extrapolates the values of a time series into the future, and is crucial to optimizing core operations for many businesses and organizations. Building machine learning (ML)-based forecasting applications is challenging, though, due to non-stationary data and large numbers of time series. As there is no single dominating approach to forecasting, forecasting systems have to support a wide variety of approaches, ranging from deep learning-based methods to classical methods built on probabilistic modelling. We revisit our earlier work on a monolithic platform for forecasting from VLDB 2017, and describe how we evolved it into a modern forecasting stack consisting of several layers that support a wide range of forecasting needs and automate common tasks like model selection. This stack leverages our open-source forecasting libraries GluonTS and AutoGluon-TimeSeries and the scalable ML platform SageMaker, and forms the basis of the no-code forecasting solutions SageMaker Canvas and Amazon Forecast, available in the Amazon Web Services cloud. We give insights into the predictive performance of our stack and discuss learnings from using it to provision resources for the cloud database services DynamoDB, Redshift and Athena.






FeatNavigator: Automatic Feature Augmentation on Tabular Data

June 2024 · 7 Reads

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, which aims to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most have limited ability to exploit all useful features, since many such features reside in candidate tables that are not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.
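The two scoring signals the abstract describes (feature importance and integration quality of a join path) can be sketched with a toy greedy planner. The class, field names, scoring rule, and budget model below are all illustrative assumptions, not FeatNavigator's actual search algorithm, which trains learned estimators for both signals.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    feature: str
    join_path: tuple      # chain of tables from the base table to the feature
    importance: float     # estimated value of the feature for the ML task
    join_quality: float   # estimated efficacy (e.g. key coverage) of the path

def plan_augmentation(candidates, budget):
    """Greedy sketch: rank candidates by importance x integration quality,
    then keep the best ones whose join hops fit within the compute budget."""
    ranked = sorted(candidates,
                    key=lambda c: c.importance * c.join_quality,
                    reverse=True)
    plan, cost = [], 0
    for c in ranked:
        path_cost = len(c.join_path)  # assume each hop costs one join
        if cost + path_cost <= budget:
            plan.append(c)
            cost += path_cost
    return plan

candidates = [
    Candidate("avg_spend",  ("base", "orders"),         0.9, 0.8),
    Candidate("region_pop", ("base", "addr", "region"), 0.7, 0.9),
    Candidate("noise_col",  ("base", "logs"),           0.1, 0.5),
]
plan = plan_augmentation(candidates, budget=5)
# selects avg_spend (score 0.72) and region_pop (score 0.63); noise_col no longer fits
```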


[Figure 3 caption: HiCom always gains. Relative performance improvement for HiCom-OPT over the second-best method on nodes in dense regions vs. all regions.]
[Table captions: dataset statistics ("Avg. D" = average node degree; "Avg. T" = average number of tokens per node, estimated at 1 token ≈ 3/4 words); run-time comparison showing per-epoch training time and final F1 score for each setting.]
Hierarchical Compression of Text-Rich Graphs via Large Language Models

June 2024 · 15 Reads

Text-rich graphs, prevalent in data mining contexts like e-commerce and academic graphs, consist of nodes with textual features linked by various relations. Traditional graph machine learning models, such as Graph Neural Networks (GNNs), excel at encoding graph structural information but have limited capability in handling rich text on graph nodes. Large Language Models (LLMs), noted for their superior text understanding abilities, offer a solution for processing the text in graphs, but face integration challenges due to their limited ability to encode graph structure and their computational cost when dealing with the extensive text in large neighborhoods of interconnected nodes. This paper introduces ``Hierarchical Compression'' (HiCom), a novel method to align the capabilities of LLMs with the structure of text-rich graphs. HiCom processes text in a node's neighborhood in a structured manner by organizing the extensive textual information into a more manageable hierarchy and compressing node text step by step. HiCom thus preserves the contextual richness of the text while addressing the computational challenges of LLMs, advancing the integration of the text-processing power of LLMs with the structural complexity of text-rich graphs. Empirical results show that HiCom can outperform both GNNs and LLM backbones for node classification on e-commerce and citation graphs. HiCom is especially effective for nodes in dense regions of a graph, where it achieves a 3.48% average performance improvement on five datasets while being more efficient than LLM backbones.
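The step-by-step compression the abstract describes can be sketched as a level-by-level reduction of neighbor texts. This is only an illustrative scaffold: the `compress` stub below truncates concatenated strings, whereas HiCom would apply a learned LLM compression step at each level; the group size and length limits are arbitrary assumptions.

```python
def compress(texts, max_len=40):
    """Stand-in for an LLM compression step: concatenate and truncate.
    HiCom would instead produce a learned compressed representation."""
    return " | ".join(texts)[:max_len]

def hierarchical_compress(node_text, neighbor_texts, group_size=2):
    """Compress a node's neighborhood level by level, instead of feeding
    all neighbor text to the LLM at once."""
    level = neighbor_texts
    while len(level) > 1:
        # merge fixed-size groups of texts into shorter summaries
        level = [compress(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    context = level[0] if level else ""
    # final step: combine the node's own text with the compressed neighborhood
    return compress([node_text, context], max_len=80)

summary = hierarchical_compress(
    "GPU with 24GB memory",
    ["fast shipping", "great for gaming", "runs hot", "good value"],
)
```

The key property is that each compression call sees only a bounded amount of text, so the cost stays manageable even for nodes with large, dense neighborhoods.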


GraphStorm: all-in-one graph machine learning framework for industry applications

June 2024 · 38 Reads

Graph machine learning (GML) is effective in many business applications. However, making GML easy to use and applicable to industry applications with massive datasets remains challenging. We developed GraphStorm, which provides an end-to-end solution for scalable graph construction, graph model training and inference. GraphStorm has the following desirable properties: (a) Easy to use: it can perform graph construction and model training and inference with just a single command; (b) Expert-friendly: GraphStorm contains many advanced GML modeling techniques to handle complex graph data and improve model performance; (c) Scalable: every component in GraphStorm can operate on graphs with billions of nodes and can scale model training and inference to different hardware without changing any code. GraphStorm has been used and deployed for over a dozen billion-scale industry applications since its release in May 2023. It is open-sourced on GitHub: https://github.com/awslabs/graphstorm.



Citations (48)


... LLMs [33,46] have been utilized in various tasks in multiple domains [25,29,31,38,41,57]. One emerging use case of LLMs is to use them to instruct and control other models [13,14,19,24,41,50], due to their ability to capture context from a user's instructions (prompts) [28]. ...

Reference:

Khattat: Enhancing Readability and Concept Representation of Semantic Typography
SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM
  • Citing Conference Paper
  • January 2024

... We anticipate that graph reordering will also be beneficial for other systems: many systems such as DistDGL [72], GraphStorm [73], PaGraph [30], P3 [13], DistGNN [38], BGL [33], GNNLab [60], and DiskGNN [32] are built on top of DGL, or on Salient [23], which is built on top of PyG; therefore, these systems may also benefit from graph reordering. Further, many research papers [20,30,59] mention that poor data locality is a significant challenge for efficient GNN training. ...

GraphStorm: All-in-one Graph Machine Learning Framework for Industry Applications
  • Citing Conference Paper
  • August 2024

... With the prevalence of language modeling, there has been an outbreak in the effective integration of text and graph topology on text-attributed graphs, leading to a comprehensive benchmark on text-attributed graphs [33,35,15,23]. While textual information undoubtedly plays a crucial role in understanding the data and making predictions for user-centric tasks, it is essential to recognize that other modalities, particularly visual information, can be helpful and provide complementary information that text alone cannot capture. ...

TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning
  • Citing Conference Paper
  • July 2024

... The increasing prevalence of the Web of Things (WoT) has facilitated the continuous collection of spatiotemporal data from sensors deployed across various geographic locations. Spatiotemporal modeling [35,51,57,61] has been widely utilized in web applications, such as social networks [36,64] and Web of Things applications [10,14,15]. Radiation forecasting is a typical spatiotemporal task within the domains of web mining and knowledge discovery, playing a crucial role in aggregating and analyzing real-time radiation data. ...

NETEVOLVE: Social Network Forecasting using Multi-Agent Reinforcement Learning with Interpretable Features
  • Citing Conference Paper
  • May 2024

... A novel dynamic-graph approach is proposed in [60], where the graph is constructed with active buses in the grid as vertices and nodes representing active grid devices at each time step, making the graph dynamic. In this approach, the Line Outage Distribution Factor (LODF) is first computed, temporal weighting based on distance and the graph's edge weights is then applied, and finally anomaly detection is performed. ...

Dynamic Graph-Based Anomaly Detection in the Electrical Grid
  • Citing Conference Paper
  • July 2023

... Graph mining is the field of extracting non-trivial knowledge from graphs and linked data in general [29,30]. Major problems in the area include efficiently finding subgraphs of interest [31,32,33], lightweight [34] or heuristic community discovery [35], and fairness [36]. Applications include parallel graph mining under memory considerations [37], multilayered ontologies [38], discovering food-drug interaction [39], graph correlation coefficient based on topological properties [40], fake news discovery [41], multicriterial community discovery on Twitter [42], recommender systems [43], compressing social graphs using higher order patterns [44], and self organizing maps (SOMs) for clustering the user base of a cultural portal [45]. ...

Less is More: SlimG for Accurate, Robust, and Interpretable Graph Mining
  • Citing Conference Paper
  • August 2023

... The other works consider embedding of RDFS ontologies with KGs [22,73,76]. JOIE [73] transforms the RDFS ontology into an ontology view graph, with each concept modeled as a node, and each relation's domain concepts and range concepts connected by a meta relation. ...

Concept2Box: Joint Geometric Embeddings for Learning Two-View Knowledge Graphs
  • Citing Conference Paper
  • January 2023

... Such modeling takes advantage of standardized values by adopting a CDM, combining them with other attributes, and providing insights regarding the observed patterns. Once modeled, features can be extracted from the graph structure and the obtained information is explored through visualization metaphors [Cazzolato et al. 2023]. The unsupervised approach allows users to perform exploratory data analysis (EDA) over EHRs. ...

Exploratory Data Analysis in Electronic Health Records Graphs: Intuitive Features and Visualization Tools
  • Citing Conference Paper
  • June 2023

... The latter results in ordered sequences of real-valued data points commonly referred to as time series. Analytical tasks over time series data are becoming increasingly important in virtually every domain [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], including astronomy [20], energy [21], environmental [22], and social [23] sciences. ...

Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances
  • Citing Article
  • June 2023

Proceedings of the VLDB Endowment