Conference Paper

Don't Hold My UDFs Hostage - Exporting UDFs For Debugging Purposes


Abstract

User-defined functions (UDFs) are an integral part of performing in-database analytics. Executing data analysis inside a database provides significant improvements over traditional methods, such as close-to-the-data execution, low conversion overhead and automatic parallelization. However, UDFs have poor support for debugging. Since they are executed from within the database process, traditional debugging tools such as Integrated Development Environments (IDEs) and Read-Eval-Print Loops (REPLs) cannot be used during development. As a result, writing functional UDFs is challenging. In this paper, we present an extension to the open-source database system MonetDB that allows developers to debug their UDFs using modern debugging techniques.
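The general idea of debugging a UDF outside the database can be sketched as follows. This is a minimal illustration only, not the extension described in the paper: the function names `mean_deviation` and `run_locally` and the sample data are hypothetical, and a plain Python list stands in for the column data that the database would pass to the UDF.

```python
# Hypothetical UDF body, written as an ordinary Python function so that it
# can be executed outside the database process.
def mean_deviation(column):
    """Return each value's absolute distance from the column mean."""
    mean = sum(column) / len(column)
    return [abs(value - mean) for value in column]


def run_locally(udf, sample):
    """Call the UDF on sample data assumed to be exported from the database."""
    # Because this runs in a normal Python process, a breakpoint() call or
    # an IDE breakpoint placed here behaves as it would for any other script.
    return udf(sample)


if __name__ == "__main__":
    sample_column = [1.0, 2.0, 3.0, 6.0]  # made-up sample data
    print(run_locally(mean_deviation, sample_column))
```

Once the function body runs as a regular script, the usual IDE and REPL tooling applies, which is the kind of workflow the abstract says is unavailable while the code executes inside the database process.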


... These user-defined functions also introduce new issues. They force users to rewrite code so the code fits within the query workflow, are difficult to debug [3] and introduce safety issues as arbitrary code can now run within the database kernel. ...
... Firstly, they force users to rewrite code so the code fits within the query work flow. Secondly, because these UDFs run within the database process, they are difficult to create and debug [10]. Users cannot use the IDEs/REPLs/debugging tools that they are familiar with (such as RStudio or PyCharm) while writing the user-defined functions. ...
Preprint
While traditional RDBMSes offer a lot of advantages, they require significant effort to set up and to use. Because of these challenges, many data scientists and analysts have switched to using alternative data management solutions. These alternatives, however, lack features that are standard for RDBMSes, e.g. out-of-core query execution. In this paper, we introduce the embedded analytical database MonetDBLite. MonetDBLite is designed to be both highly efficient and easy to use in conjunction with standard analytical tools. It can be installed using standard package managers, and requires no configuration or server management. It is designed for OLAP scenarios, and offers near-instantaneous data transfer between the database and analytical tools, all the while maintaining the transactional guarantees and ACID properties of a standard relational system. These properties make MonetDBLite highly suitable as a storage engine for data used in analytics, machine learning and classification tasks.
Article
With the tremendous growth in data science and machine learning, it has become increasingly clear that traditional relational database management systems (RDBMS) are lacking appropriate support for the programming paradigms required by such applications, whose developers prefer tools that perform the computation outside the database system. While the database community has attempted to integrate some of these tools in the RDBMS, this has not swayed the trend as existing solutions are often not convenient for the incremental, iterative development approach used in these fields. In this paper, we propose AIDA - an abstraction for advanced in-database analytics. AIDA emulates the syntax and semantics of popular data science packages but transparently executes the required transformations and computations inside the RDBMS. In particular, AIDA works with a regular Python interpreter as a client to connect to the database. Furthermore, it supports the seamless use of both relational and linear algebra operations using a unified abstraction. AIDA relies on the RDBMS engine to efficiently execute relational operations and on an embedded Python interpreter and NumPy to perform linear algebra operations. Data reformatting is done transparently and avoids data copy whenever possible. AIDA does not require changes to statistical packages or the RDBMS facilitating portability.
Article
In commercial software development organizations, increased complexity of products, shortened development cycles, and higher customer expectations of quality have placed a major responsibility on the areas of software debugging, testing, and verification. As this issue of the IBM Systems Journal illustrates, there are exciting improvements in the underlying technology on all three fronts. However, we observe that due to the informal nature of software development as a whole, the prevalent practices in the industry are still immature, even in areas where improved technology exists. In addition, tools that incorporate the more advanced aspects of this technology are not ready for large-scale commercial use. Hence there is reason to hope for significant improvements in this area over the next several years.
Conference Paper
Data scientists rely on vector-based scripting languages such as R, Python and MATLAB to perform ad-hoc data analysis on potentially large data sets. When facing large data sets, they are only efficient when data is processed using vectorized or bulk operations. At the same time, overwhelming volume and variety of data as well as parsing overhead suggest that the use of specialized analytical data management systems would be beneficial. Data might also already be stored in a database. Efficient execution of data analysis programs such as data mining directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated in the processing model of operator-at-a-time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.
Article
The IPython project provides an enhanced interactive environment for scientific computing, with features including support for data visualization and facilities for distributed and parallel computation. The most important characteristic of scientific computing is a collection of high-performance code written in FORTRAN, C, and C++ that runs in batch mode on large systems, clusters, and supercomputers. The IPython project aims to provide a greatly enhanced Python shell, facilities for interactive distributed and parallel computing, and a comprehensive set of tools for building special-purpose interactive environments for scientific computing. This project has been providing tools to extend Python's interactive capabilities and continues to be developed as a base layer for new interactive environments. It offers a set of control commands designed to improve Python's usability in an interactive environment.
Raasveldt, M. and Mühleisen, H. (2017). Don't hold my data hostage - a case for client protocol redesign. Proceedings of the VLDB Endowment, 10(10):1022-1033.
McConnell, S. (2004). Code complete. Pearson Education.