
Michael J. Klaiber
- Dr. rer. nat.
- AI Runtime Compiler Lead at EnCharge AI
About
25 Publications
7,374 Reads
441 Citations
Introduction
Current institution
EnCharge AI
Current position
- AI Runtime Compiler Lead
Additional affiliations
October 2018 - present
May 2016 - September 2018
September 2011 - April 2016
Publications (25)
A resource-efficient hardware architecture for connected components analysis (CCA) of streamed video data is presented which reduces the required hardware resources especially for larger image widths. On-chip memory requirements increase with image width and dominate the resources of state-of-the-art CCA single-pass hardware architectures. A reduc...
In this paper, an adaptive architecture for dynamic management and allocation of on-chip FPGA Block Random Access Memory (BRAM) resources is presented. This facilitates the dynamic sharing of valuable and scarce on-chip memory among several processing elements (PEs), according to their dynamic run-time memory requirements. Different real-time appli...
A memory efficient architecture for single-pass connected components analysis suited for high throughput embedded image processing systems is proposed which achieves a high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows reuse of labels associated with the image obj...
In classical connected component labeling algorithms the image has to be scanned two times. The amount of memory required for these algorithms is at least as high as for storing a full image. By using single pass connected component labeling algorithms, the memory requirement can be reduced by one order of magnitude to only a single image row. This...
The Union-Retire CCA (UR-CCA) algorithm started a new paradigm for connected components analysis. Instead of using directed tree structures, UR-CCA focuses on connectivity. This algorithmic change leads to a reduction in required memory, with no end-of-row processing overhead. In this paper we describe a hardware architecture based on UR-CCA and it...
A key issue in system design is the lack of communication between hardware, software and domain expert. Recent research work shows progress in automatic HW/SW co-design flows of neural accelerators that seems to make this kind of communication obsolete. Most real-world systems, however, are a composition of multiple processing units, communication...
Most connected component labelling and analysis algorithms are based on some variant of Union-Find. In this paper, it is shown the Find operation is unnecessary for single-pass algorithms, leading to the Union-Retire approach where the focus is on connectivity rather than labelling. The computational complexity of the resulting algorithm is linear...
Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond tradi...
End-to-end performance estimation and measurement of deep neural network (DNN) systems become more important with increasing complexity of DNN systems consisting of hardware and software components. The methodology proposed in this paper aims at a reduced turn-around time for evaluating different design choices of hardware and software components o...
Union-find algorithms form the basis of managing sets of equivalent labels within most connected components labelling algorithms. The new class of single-pass connected components analysis (CCA) algorithms (where a feature vector of each component is extracted during processing) are analysed and compared within this context. Such algorithms have be...
In this paper, a memory-efficient architecture for single-pass connected components analysis suited for high-throughput embedded image processing systems is proposed which achieves a speedup by partitioning the image into slices. Although global data dependencies of image segments spanning several image slices exist, a temporal and spatial local al...
Single-pass connected components analysis (CCA) algorithms suffer from a time overhead to resolve labels at the end of each image row. This work demonstrates how this overhead can be eliminated by replacing the conventional raster scan by a zig-zag scan. This enables chains of labels to be correctly resolved while processing the next image row. The...
This paper presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the...
The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the...
Connected components analysis (CCA) is an essential step in image processing to extract features such as the area or size of arbitrarily-shaped objects from binary images. In this dissertation two dedicated hardware architectures performing CCA tailored for reconfigurable hardware are presented: the first to process a single pixel per clock cycle, s...
In this paper, a Real-Time Process Analysis System for the characterization and measurement of spray and atomization processes is presented. Contrary to indirect measure methods such as phase Doppler interferometry (PDI) or laser diffraction, the proposed imaging system provides reliable measurement results for properties not only of (almost) spher...
Calculation of mean, variance and standard deviation are often required for segmentation or feature extraction. In image processing, often an integer approximation is adequate. Conventional methods require division and square root operations, which are expensive to realize in hardware in terms of both the amount of required resources and latency. A...
Spray drying processes combine drying under mild conditions, formation of morphology, and shape forming in one process. These spray drying processes can be supplemented by another unit operation aiming at the production of the dry material, e.g. by polymerization. Since single droplets inside a spray are difficult to study, acoustic levitation is...
JPEG-LS has a large number of different and independent context sets that provide the opportunity for parallelism. As JPEG-LS, many of the lossless image compression standards have “adaptive” error modeling as the core part. This, however, leads to data dependency loops of the compression scheme such that a parallel compression of neighboring pixe...
Questions
Question (1)
Is anyone aware of a benchmark database for Connected Component Labeling/Analysis algorithms similar to what the Berkeley database (https://www.eecs.berkeley.edu/) is for segmentation?