Lingfei Wu's research while affiliated with IBM Research and other places

Publications (15)

Article
Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we...
Chapter
Natural language processing (NLP) and understanding aim to read unformatted text to accomplish different tasks. While word embeddings learned by deep neural networks are widely used, the underlying linguistic and semantic structures of text pieces cannot be fully exploited in these representations. A graph is a natural way to capture the connect...
Conference Paper
Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code and neglect the creation of the documentation in a notebook. In this work, we present a human-centered automation system, Themisto, that can support users to easily creat...
Preprint
Many data scientists use Jupyter notebooks to experiment with code, visualize results, and document rationales or interpretations. The code documentation generation (CDG) task in notebooks is related to but different from the code summarization task in software engineering, as one documentation unit (markdown cell) may consist of a text (informative summary or ind...
Preprint
Full-text available
Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations, which leads to challenges in sharing their notebooks with others and future selves. Inspire...
Preprint
Prior work on automated question generation has almost exclusively focused on generating simple questions whose answers can be extracted from a single document. However, there is an increasing interest in developing systems that are capable of more complex multi-hop question generation, where answering the questions requires reasoning over multiple...
Preprint
Sequence-to-sequence models for abstractive summarization have been studied extensively, yet the generated summaries commonly suffer from fabricated content, and are often found to be near-extractive. We argue that, to address these issues, the summarizer should acquire semantic interpretation over input, e.g., via structured representation, to all...
Preprint
Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source cod...
Preprint
Graph Neural Networks (GNNs) have boosted the performance of many graph-related tasks such as node classification and graph classification. Recent research shows that graph neural networks are vulnerable to adversarial attacks, which deliberately add carefully crafted, unnoticeable perturbations to the graph structure. The perturbation is usually cr...

Citations

... Based on our investigation, this toolkit has not received general adoption for model documentation on GitHub. Wang et al. proposed a tool called Themisto that combines deep-learning and information-retrieval approaches to generate documentation for code cells in the notebook [44]. Themisto targets the same population as our tool, i.e., data scientists using notebooks, but with a different objective. ...
... Ji et al. [74] also use GCN to encode AST for the code clone task. Liu et al. [112] propose a task of code documentation generation for Jupyter notebooks. When generating documentation, the model HAConvGNN considers the relevant code cells and code token information. ...
... Based on this result, we integrated our approach into a user-facing downstream application (Wang et al., 2021c) to further explore the human-AI collaboration opportunity in the code documentation scenario. In the follow-up user study (reported separately (Wang et al., 2021b)), users found that the automatically generated documentation reminded them to document code they would have ignored, and improved their satisfaction with their computational notebooks. ...
... Graph Neural Networks (GNNs), a class of neural networks for learning on graph-structured data, have been successfully applied in many areas to solve real-world problems, such as link prediction in social networks [1], pattern recognition in autonomous driving [2], product recommendation and personalized search in e-commerce [3], fraud detection in financial services [4], power estimation and tier design in the semiconductor industry [5,6], traffic forecasting [7], and natural language processing [8,9,10]. Among many different graph representation learning approaches, the class of spatial graph-convolution-based models, which adopts a message-passing scheme to update node features, has gained particular attention due to its simplicity and good performance. ...
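The message-passing scheme mentioned in the snippet above can be sketched in a few lines: each node aggregates its neighbors' features and then applies a learned transformation. This is a minimal, illustrative sketch only, not the implementation from any cited paper; the names (`adj`, `feats`, `weight`) and the choice of mean aggregation with a ReLU are assumptions for the example.

```python
import numpy as np

def message_passing_layer(adj, feats, weight):
    """One round of neighbor aggregation followed by a linear transform.

    adj    : (n, n) adjacency matrix with self-loops added
    feats  : (n, d_in) node feature matrix
    weight : (d_in, d_out) learned projection (toy values here)
    """
    deg = adj.sum(axis=1, keepdims=True)   # node degrees
    agg = (adj @ feats) / deg              # mean over neighbors: the "message passing"
    return np.maximum(agg @ weight, 0.0)   # ReLU nonlinearity

# Tiny 3-node path graph 0 - 1 - 2, with self-loops on the diagonal.
adj = np.array([[1., 1., 0.],
                [1., 1., 1.],
                [0., 1., 1.]])
feats = np.eye(3)            # one-hot node features
weight = np.ones((3, 2))     # toy weight matrix
out = message_passing_layer(adj, feats, weight)
print(out.shape)  # (3, 2): one d_out-dimensional embedding per node
```

Stacking several such layers lets information propagate over multi-hop neighborhoods, which is the basic idea behind the spatial graph-convolution models discussed above.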
... Hallucination in downstream NLG tasks There are active efforts to reduce the unfaithfulness or factual errors of task-specific LMs fine-tuned for various downstream natural language generation (NLG) tasks such as summarization [42][43][44][45][46][47][48], data-to-text [49,50,20,[51][52][53]] and dialogue systems [54][55][56][57][58]. In contrast to these works, we focus on a general-purpose LM for the open-ended text generation task. ...
... Document-level Event Argument Extraction aims at identifying arguments and their roles for multiple events in a document. It is a practically more useful but more challenging task than sentence-level Argument Extraction (Nguyen et al., 2016; Wadden et al., 2019; Lin et al., 2020), because in a typical long input document, events are usually scattered across multiple sentences and are inherently connected. ...