Karl Czajkowski’s research while affiliated with University of Southern California and other places


Publications (51)


Figure 2: CFDE submission process. A DCC initiates the submission process by creating a new set of TSVs that meet the C2M2 requirements, running a CFDE tool to build term tables, and submitting that entire datapackage. The cfde-submit CLI then performs a lightweight validation of the submission data, starts the data upload to CFDE's servers (step 1), and then initiates processing in the cloud (step 2). The system that manages the cloud processing is called Globus Flows. Globus Flows is Globus software-as-a-service (SaaS) running in the AWS cloud. CFDE's submission process is one of many "flows" that the flows service manages, and the final action of cfde-submit is to start a run of the CFDE submission flow. The CFDE submission flow moves the submitted data to a permanent location (step 3), sets access permissions (not shown), and executes code on a CFDE server (step 4) that ingests the submitted data into the CFDE portal's database service, Deriva. While processing is happening in the cloud (steps 2-3), status can be checked using cfde-submit, but it does not appear in the CFDE portal until step 4. At this point, the DCC uses the CFDE portal to review and approve (or reject) the datapackage (step 5). Deriva then merges the new datapackage into a test catalog before finally publishing it to the public catalog (step 6), making it searchable by anyone at the CFDE portal.
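The "lightweight validation" that cfde-submit performs before uploading a datapackage can be sketched as a presence-and-header check over the submitted TSVs. The table names and required columns below are a hypothetical subset of C2M2 chosen for illustration, not the real manifest.

```python
import csv
import io

# Hypothetical subset of C2M2 tables and their required columns.
REQUIRED_TABLES = {
    "file.tsv": ["id_namespace", "local_id", "size_in_bytes"],
    "biosample.tsv": ["id_namespace", "local_id"],
}

def validate_datapackage(tsvs):
    """Lightweight pre-upload check in the spirit of cfde-submit:
    every required table is present and carries its required columns.
    `tsvs` maps filename -> TSV text. Returns a list of error strings."""
    errors = []
    for name, required_cols in REQUIRED_TABLES.items():
        if name not in tsvs:
            errors.append(f"missing table: {name}")
            continue
        reader = csv.reader(io.StringIO(tsvs[name]), delimiter="\t")
        header = next(reader, [])
        for col in required_cols:
            if col not in header:
                errors.append(f"{name}: missing column {col}")
    return errors
```

A datapackage that fails this kind of check is rejected before any upload begins, which is what makes the first stage of the flow cheap for the DCC.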
Figure 3: Summary page of a submitted data package with interactive chart and summary statistics.
Figure 4: Core data available for search at the CFDE portal over time. The sharp decrease in biosamples in October 2021 is due to replicate cell line data being more appropriately modeled as from a single biosample. Note that the y-axis is exponential, and therefore the increases are quite large: the January 2022 release, for example, contains nearly half a million (430,405) more files than the October 2021 release.
An example Google search for Common Fund Data.
Making Common Fund data more findable: catalyzing a data ecosystem
  • Article
  • Full-text available

November 2022 · 105 Reads · 12 Citations · GigaScience

Amanda L Charbonneau · Arthur Brady · Karl Czajkowski · [...]

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biomedical datasets from individual Common Fund Programs’ Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by individual DCCs and can readily describe nearly all forms of biomedical research data. We detail its use to ingest and index data from 11 DCCs.
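The core harmonization move in a uniform metadata model like C2M2, renaming program-specific fields onto shared terms while namespacing identifiers so records from different DCCs cannot collide, can be sketched minimally. The DCC names and field mappings here are invented for illustration.

```python
# Hypothetical mapping of two DCCs' field names onto shared, C2M2-style terms.
FIELD_MAPS = {
    "dcc_a": {"sample_id": "local_id", "tissue_name": "anatomy"},
    "dcc_b": {"specimen": "local_id", "site": "anatomy"},
}

def to_shared_model(dcc, record):
    """Rename a DCC-specific record's fields into the shared vocabulary,
    tagging each row with its issuing namespace so local identifiers
    stay globally unique after integration."""
    mapped = {FIELD_MAPS[dcc].get(k, k): v for k, v in record.items()}
    mapped["id_namespace"] = dcc
    return mapped
```

Once every DCC's records land in the same vocabulary, a single portal index can search across all of them without any program re-hosting its data.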


Regional synapse gain and loss accompany memory formation in larval zebrafish

January 2022 · 751 Reads · 27 Citations · Proceedings of the National Academy of Sciences

Significance: Imaging of labeled excitatory synapses in the intact brain before and after classical conditioning permits a longitudinal analysis of changes that accompany associative memory formation. When applied to midlarval stage zebrafish, this approach reveals adjacent regions of synapse gain and loss in the lateral and medial pallium, respectively. Such major structural changes could account for the robust nature of memory formation from classical conditioning.


Model-Adaptive Interface Generation for Data-Driven Discovery

October 2021 · 82 Reads

Discovery of new knowledge is increasingly data-driven, predicated on a team's ability to collaboratively create, find, analyze, retrieve, and share pertinent datasets over the duration of an investigation. This is especially true in the domain of scientific discovery where generation, analysis, and interpretation of data are the fundamental mechanisms by which research teams collaborate to achieve their shared scientific goal. Data-driven discovery in general, and scientific discovery in particular, is distinguished by complex and diverse data models and formats that evolve over the lifetime of an investigation. While databases and related information systems have the potential to be valuable tools in the discovery process, developing effective interfaces for data-driven discovery remains a roadblock to the application of database technology as an essential tool in scientific investigations. In this paper, we present a model-adaptive approach to creating interaction environments for data-driven discovery of scientific data that automatically generates interactive user interfaces for editing, searching, and viewing scientific data based entirely on introspection of an extended relational data model. We have applied model-adaptive interface generation to many active scientific investigations spanning domains of proteomics, bioinformatics, neuroscience, occupational therapy, stem cells, genitourinary, craniofacial development, and others. We present the approach, its implementation, and its evaluation through analysis of its usage in diverse scientific settings.
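The model-adaptive idea, deriving the user interface entirely from introspection of the data model rather than hand-coding it, can be illustrated with a toy generator. The widget names and the shape of the table description below are assumptions made for this sketch, not the paper's actual representation.

```python
# Hypothetical mapping from column type to input widget, the kind of rule a
# model-adaptive UI generator might apply after introspecting a schema.
WIDGETS = {"text": "text-input", "int": "number-input",
           "date": "date-picker", "fk": "dropdown"}

def generate_form(table):
    """Build a form description purely from the table's declared model.
    `table` is {"name": ..., "columns": [{"name": ..., "type": ...}, ...]}.
    Because nothing here is hand-coded per table, the form tracks the
    model automatically as the schema evolves."""
    return {
        "title": f"Edit {table['name']}",
        "fields": [
            {"label": col["name"],
             "widget": WIDGETS.get(col["type"], "text-input")}
            for col in table["columns"]
        ],
    }
```

The payoff of this style is exactly what the abstract claims: when an investigation adds or renames columns mid-study, the editing, search, and viewing interfaces need no code changes.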


Towards Co-Evolution of Data-Centric Ecosystems

July 2020 · 16 Reads · 10 Citations

Database evolution is a notoriously difficult task, and it is exacerbated by the necessity to evolve database-dependent applications. As science becomes increasingly dependent on sophisticated data management, the need to evolve an array of database-driven systems will only intensify. In this paper, we present an architecture for data-centric ecosystems that allows the components to seamlessly co-evolve by centralizing the models and mappings at the data service and pushing model-adaptive interactions to the database clients. Boundary objects fill the gap where applications are unable to adapt and need a stable interface to interact with the components of the ecosystem. Finally, evolution of the ecosystem is enabled via integrated schema modification and model management operations. We present use cases from actual experiences that demonstrate the utility of our approach.
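One way to picture "integrated schema modification and model management operations" is an operator that changes the central model and records the mapping in the same step, so model-adaptive clients and boundary objects pinned to the old name can still resolve it. The data structures here are invented for the sketch.

```python
def rename_column(model, table, old, new):
    """A single integrated schema-modification operator: rename a column in
    the centralized model AND record the old->new mapping, so that clients
    (or boundary objects) still using the old name can be translated rather
    than broken by the change."""
    cols = model["tables"][table]["columns"]
    cols[cols.index(old)] = new
    model["mappings"].setdefault(table, {})[old] = new
    return model
```

Centralizing both the model and the mapping at the data service is what lets the rest of the ecosystem co-evolve instead of being patched component by component.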


ERMrest: a web service for collaborative data management

July 2018 · 30 Reads · 7 Citations

The foundation of data oriented scientific collaboration is the ability for participants to find, access and reuse data created during the course of an investigation, which has been referred to as the FAIR principles. In this paper, we describe ERMrest, a collaborative data management service that promotes data oriented collaboration by enabling FAIR data management throughout the data life cycle. ERMrest is a RESTful web service that promotes discovery and reuse by organizing diverse data assets into a dynamic entity relationship model. We present details on the design and implementation of ERMrest, data on its performance, and its use by a range of collaborations to accelerate and enhance their scientific output.
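ERMrest exposes catalog entities as RESTful resources with URL paths of the general form /ermrest/catalog/&lt;id&gt;/entity/&lt;schema&gt;:&lt;table&gt;/&lt;column&gt;=&lt;value&gt;. A small builder for such paths is sketched below; the exact percent-encoding rules and the example schema and table names are assumptions, not taken from the ERMrest specification.

```python
from urllib.parse import quote

def entity_url(catalog, schema, table, **filters):
    """Compose an ERMrest-style entity resource path:
    /ermrest/catalog/<id>/entity/<schema>:<table>/<col>=<value>&...
    Filter predicates select rows; with no filters, the path names
    the whole entity set."""
    path = f"/ermrest/catalog/{catalog}/entity/{quote(schema)}:{quote(table)}"
    if filters:
        preds = "&".join(f"{quote(k)}={quote(str(v))}"
                         for k, v in filters.items())
        path += "/" + preds
    return path
```

Because the model itself is addressable the same way, generic clients can discover tables and columns at runtime, which is what enables the ecosystem of model-adaptive tools built on the service.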


Fig. 3. FaceBase ERM. Metadata are organized broadly as investigation, biosample, bioassay, and asset entities with relationships indicated by arrows.
Fig. 4. FaceBase Data Curation Pipeline. Shaded boxes indicate Spoke responsibilities versus clear boxes for the Hub's activities.
Fig. 5. Select elements of GPCR catalog model. From top to bottom, four tiers of entities and relationships have been added in phases: core protein concepts; core assets including alignment and expression data; experiment metadata; and experiment assets capturing experimental results.
Fig. 8. Dynamically generated display of GPCR Target including metadata, activity tracking graph, and alignment.
Experiences with DERIVA: An Asset Management Platform for Accelerating eScience

October 2017 · 140 Reads · 25 Citations

The pace of discovery in eScience is increasingly dependent on a scientist's ability to acquire, curate, integrate, analyze, and share large and diverse collections of data. It is all too common for investigators to spend inordinate amounts of time developing ad hoc procedures to manage their data. In previous work, we presented Deriva, a Scientific Asset Management System, designed to accelerate data driven discovery. In this paper, we report on the use of Deriva in a number of substantial and diverse eScience applications. We describe the lessons we have learned, both from the perspective of the Deriva technology, as well as the ability and willingness of scientists to incorporate Scientific Asset Management into their daily workflows.


ERMrest: A Collaborative Data Catalog with Fine Grain Access Control

October 2017 · 13 Reads · 4 Citations

Creating and maintaining an accurate description of data assets and the relationships between assets is a critical aspect of making data findable, accessible, interoperable, and reusable (FAIR). Typically, such metadata are created and maintained in a data catalog by a curator as part of data publication. However, allowing metadata to be created and maintained by data producers as the data is generated rather than waiting for publication can have significant advantages in terms of productivity and repeatability. The responsibilities for metadata management need not fall on any one individual, but rather may be delegated to appropriate members of a collaboration, enabling participants to edit or maintain specific attributes, to describe relationships between data elements, or to correct errors. To support such collaborative data editing, we have created ERMrest, a relational data service for the Web that enables the creation, evolution and navigation of complex models used to describe and structure diverse file or relational data objects. A key capability of ERMrest is its ability to control operations down to the level of individual data elements, i.e. fine-grained access control, so that many different modes of data-oriented collaboration can be supported. In this paper, we introduce ERMrest and describe its fine-grained access control capabilities that support collaborative editing. ERMrest is in daily use in many data driven collaborations and we describe a sample policy that is based on a common biocuration pattern.
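The biocuration pattern the abstract alludes to, curators may edit anything while producers may edit only their own rows, can be sketched as a row-level policy check. The policy shape and attribute names here are invented for illustration and do not reflect ERMrest's actual ACL syntax.

```python
def can_edit(user, roles, row):
    """Hypothetical fine-grained access decision in the biocuration style:
    a table-wide grant for curators, plus a per-row grant that depends on
    the row's own attributes (here, its recorded owner). Evaluating policy
    against row data is what makes the control 'fine-grained'."""
    if "curator" in roles:            # table-wide grant
        return True
    return row.get("owner") == user   # per-row grant: producers edit their own rows
```

Because the decision consults the row itself, many producers can safely share one table without a curator mediating every write.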


ERMrest: an entity-relationship data storage service for web-based, data-oriented collaboration

October 2016 · 28 Reads · 1 Citation

Scientific discovery is increasingly dependent on a scientist's ability to acquire, curate, integrate, analyze, and share large and diverse collections of data. While the details vary from domain to domain, these data often consist of diverse digital assets (e.g. image files, sequence data, or simulation outputs) that are organized with complex relationships and context which may evolve over the course of an investigation. In addition, discovery is often collaborative, such that sharing of the data and its organizational context is highly desirable. Common systems for managing file or asset metadata hide their inherent relational structures, while traditional relational database systems do not extend to the distributed collaborative environment often seen in scientific investigations. To address these issues, we introduce ERMrest, a collaborative data management service which allows general entity-relationship modeling of metadata manipulated by RESTful access methods. We present the design criteria, architecture, and service implementation, as well as describe an ecosystem of tools and services that we have created to integrate metadata into an end-to-end scientific data life cycle. ERMrest has been deployed to hundreds of users across multiple scientific research communities and projects. We present two representative use cases: an international consortium and an early-phase, multidisciplinary research project.


Fig. 1. Scientific Asset Management Architecture.
Fig. 2. Chaise layered architecture.
Fig. 6. WebAuthN layered architecture.
Accelerating data-driven discovery with scientific asset management

October 2016 · 110 Reads · 14 Citations

The overhead and burden of managing data in complex discovery processes involving experimental protocols with numerous data-producing and computational steps has become the gating factor that determines the pace of discovery. The lack of comprehensive systems to capture, manage, organize and retrieve data throughout the discovery life cycle leads to significant overheads on scientists' time and effort, reduced productivity, lack of reproducibility, and an absence of data sharing. In “creative fields” like digital photography and music, digital asset management (DAM) systems for capturing, managing, curating and consuming digital assets like photos and audio recordings, have fundamentally transformed how these data are used. While asset management has not taken hold in eScience applications, we believe that transformation similar to that observed in the creative space could be achieved in scientific domains if appropriate ecosystems of asset management tools existed to capture, manage, and curate data throughout the scientific discovery process. In this paper, we introduce DERIVA, a framework and infrastructure for asset management in eScience and present initial results from its usage in active research use cases.
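A core habit of asset management systems like the one described here is recording fixity, a digest of the asset's bytes, at registration time so later retrievals can be verified. The sketch below uses a plain dict as a stand-in for the metadata catalog; the record fields are assumptions for illustration.

```python
import hashlib

def register_asset(catalog, name, data):
    """Register an asset with fixity: store the byte length and SHA-256
    digest alongside the metadata record. `catalog` is just a dict standing
    in for a metadata service."""
    digest = hashlib.sha256(data).hexdigest()
    catalog[name] = {"length": len(data), "sha256": digest}
    return digest

def verify_asset(catalog, name, data):
    """Re-compute the digest of retrieved bytes and compare it against the
    recorded fixity value; a mismatch means corruption or substitution."""
    return hashlib.sha256(data).hexdigest() == catalog[name]["sha256"]
```

Verification on every download is cheap relative to the cost of silently propagating a corrupted instrument file through an analysis pipeline.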


Figure 1: Primitives of Data-Oriented Architecture.
Figure 2: Reference architecture for Digital Asset Management based on Data-Oriented Architecture principles depicting a deployment of its primary components (white) and complementary components (gray).
Figure 3: Screenshot of the Chaise interface to the asset management system deployed for FaceBase.
Data Centric Discovery with a Data-Oriented Architecture

June 2015 · 1,815 Reads · 9 Citations

Increasingly, scientific discovery is driven by the analysis, manipulation, organization, annotation, sharing, and reuse of high-value scientific data. While great attention has been given to the specifics of analyzing and mining data, we find that there are almost no tools or systematic infrastructure to facilitate the process of discovery from data. We argue that a more systematic perspective is required, and in particular, propose a data-centric approach in which discovery stands on a foundation of data and data collections, rather than on fleeting transformations and operations. To address the challenges of data-centric discovery, we introduce a Data-Oriented Architecture and contrast it with the prevalent Service-Oriented Architecture. We describe an instance of the Data-Oriented Architecture and show how it has been used in a variety of use cases.


Citations (47)


... To address this challenge, the NIH established the Common Fund Data Ecosystem (CFDE) consortium (https://cfde.info). In its first phase, the CFDE consortium established a data model that standardizes cross-program data elements such as genes, tissues, drugs, and diseases [43]. These harmonized identifiers can be used to find data files produced by multiple CF programs, but such a data model fails to directly enable cross-program hypothesis generation. ...

Reference:

Playbook workflow builder: Interactive construction of bioinformatics workflows
Making Common Fund data more findable: catalyzing a data ecosystem

GigaScience

... However, studies of synapse, ECM, and microglial dynamics in the intact developing brain have been limited by imaging constraints in developing mammals. In contrast, zebrafish (Danio rerio) are a vertebrate model system that develops ex utero and is transparent up to 14 days post fertilization (dpf), enabling dynamic imaging of core neurodevelopmental processes including synapse formation [37-39]. We recently characterized a population of microglia in the zebrafish hindbrain that interact with synapses and recapitulate core morphological and molecular features of mammalian synapse-associated microglia [40], facilitating studies of microglia-synapse interactions during development. ...

Regional synapse gain and loss accompany memory formation in larval zebrafish

Proceedings of the National Academy of Sciences

... [8] Automated database schema evolution is nontrivial [53,54], and CI/CD for relational database applications is still rarely applied [34]; however, it remains one of the most challenging aspects of database development [55]. Database design [20], testing [15,25], data quality [56], and schema evolution [57] are preconditions for successful CI/CD adoption. Currently, however, in most cases, only the schema migration part is automated, and other CI/CD practices like automated testing or static code analysis are rarely included. ...

Towards Co-Evolution of Data-Centric Ecosystems
  • Citing Conference Paper
  • July 2020

... FaceBase uses the DERIVA (Discovery Environment for Relational Information and Versioned Assets) platform for data-intensive sciences built with FAIR principles at its core (Schuler et al. 2016; Bugacov et al. 2017). DERIVA provides a unique platform for building online data resources consisting of a metadata catalog with rich data modeling facilities and expressive ad hoc query support (Czajkowski et al. 2018), file storage with versioning and fixity, persistent identifiers for all data records, user-friendly desktop applications for bulk data transfer and validation, semantic file containers (Chard et al. 2016), and intuitive user interfaces for searching, displaying, and data entry that adapt to any data model implemented in the system (Tangmunarunkit et al. 2021). Building on this foundation, we devised and employed the following strategy for self-serve data curation. ...

ERMrest: a web service for collaborative data management
  • Citing Conference Paper
  • July 2018

... Once a data package is uploaded, a DERIVA [44] database instance automatically begins ingesting it, performing further validation using a custom validation script. Users are notified by email when the ingest process has completed and are provided with a link to preview the data in a secure section of the CFDE portal (or to access a description of any ingest errors). ...

Experiences with DERIVA: An Asset Management Platform for Accelerating eScience

... This can include automated workflows for approval processes and user authentication mechanisms [8,15,25,40]. Such functionality ensures that the visibility of catalog content needs to be unlocked by access requests and the assignment of appropriate access keys [5,41]. As a more recent development, Artificial Intelligence (AI) can be used to identify sensitive or secret data by assigning attributes or to display data sets that are not accessible to the user [15,23,24]. ...

ERMrest: A Collaborative Data Catalog with Fine Grain Access Control
  • Citing Conference Paper
  • October 2017

... If the goal is simply to facilitate enhanced annotation of a singular, very specific kind of study data, then one does not need the sophistication of the technology that CEDAR offers. For example, developers associated with the NIH Biomedical Informatics Research Network (BIRN) created an attractive, hardcoded interface for annotating their particular kinds of neuroimaging data as part of their Digital Asset Management System (DAMS) [54]. The Stanford Microarray Database developed a similarly well-crafted tool, known as Annotare [55], for entering metadata about gene-expression datasets in accordance with hardcoded MAGE-Tab descriptions [50]. ...

Accelerating data-driven discovery with scientific asset management

... The Entity-relationship model, put forward by P. P. Chen in 1976 [28], was an efficient method used to design database schema in different subjects, such as internal control construction of a transaction system [29], data storage service for web-based, data-oriented collaboration [30] and a system of adult extended education [31]. Although improved entity-relationship models have been put forward [32-35], it is more convenient to directly modify the rules of the basic entity-relationship model in this article. ...

ERMrest: an entity-relationship data storage service for web-based, data-oriented collaboration
  • Citing Article
  • October 2016

... Thus, these tools cannot be used to enable fine-grained data-driven rule engines, such as R. Other data-driven policy engines, such as IOBox [12], also require individual data events. IOBox is an extract, transform, and load (ETL) system, designed to crawl and monitor local file systems to detect file events, apply pattern matching, and invoke actions. ...

Data Centric Discovery with a Data-Oriented Architecture

... In previous work [2], we introduced the concept of a BDAM for biomedical research and described our experiences with several prototypical user studies which had informed the early design of the system. In this paper, we expand that discussion to describe in detail the architecture, design and implementation of the BDAM catalog service as well as the use of BDAM in a major research center and microscopy core for stem cell-based kidney research. ...

An Asset Management Approach to Continuous Integration of Heterogeneous Biomedical Data

Lecture Notes in Computer Science