ABSTRACT: This paper addresses the problem of interactive image segmentation with a user-supplied object bounding box. The underlying problem is the classification of pixels into foreground and background, where sample pixels are provided only for the background. Many approaches treat the appearance models as unknown variables and optimize the segmentation and appearance alternately, in an expectation-maximization manner. In this paper, we describe a novel approach to this problem: the objective function is expressed purely in terms of the unknown segmentation and can be optimized with a single minimum cut calculation. We aim to optimize the trade-off between making the foreground layer as large as possible and keeping the similarity between the foreground and background layers as small as possible. This similarity is formulated using the similarities of distant pixel pairs. We evaluated our algorithm on the GrabCut dataset and demonstrated that high-quality segmentations are attained at high speed.
Pattern Recognition (ICPR), 2010 20th International Conference on; 09/2010
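The trade-off described above can be written compactly. A minimal sketch, assuming binary labels x_p (1 for foreground), a nonnegative similarity s(p,q), and a distance threshold tau, none of which are taken verbatim from the paper:

\[
E(x) \;=\; -\sum_{p} x_p \;+\; \lambda \sum_{\|p-q\| > \tau} s(p,q)\, x_p (1 - x_q), \qquad x_p \in \{0,1\}.
\]

The first term favors a large foreground layer and the second penalizes similarity between distant foreground-background pixel pairs; since the pairwise term is submodular for s(p,q) >= 0, the energy is minimized exactly by a single min-cut/max-flow computation, matching the paper's claim of one minimum cut calculation.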
ABSTRACT: We propose a content-preserving stereo image editing technique based on a stereo extension of seam carving. Seam carving is the process of deleting or duplicating connected paths of less important pixels, or seams, in order to resize an image while preserving its content as much as possible. The problem targeted in this paper is how to apply the seam carving method to a pair of stereo images, where the consistency between the left and right images must also be preserved through the seam carving process. For this consistency, we introduce new energy terms. Based on these energy terms, seam pairs between the input images are classified into two types: corresponding seams, which maintain consistency, and occluded seams, which change the consistency intentionally. The novelty of this paper is that stereo matching results are fused into the seam carving framework. We demonstrate that by selecting seams in an appropriate way we can virtually manipulate the depths of objects in the scene.
3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2010; 07/2010
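A minimal sketch of how a stereo-consistency term might enter the per-pixel seam energy, assuming a precomputed left-to-right disparity map; the weight lam and the mismatch measure are illustrative assumptions, not the paper's actual energy terms:

    import numpy as np

    def seam_energy(left, right, disparity, lam=0.5):
        # Standard seam-carving term: gradient magnitude of the left image.
        gy, gx = np.gradient(left)
        grad = np.abs(gx) + np.abs(gy)

        # Stereo-consistency term: how well each left pixel matches its
        # disparity-shifted counterpart in the right image; removing seams
        # at well-matched pixels keeps the pair consistent, while seams
        # through badly matched (occluded) pixels change it intentionally.
        h, w = left.shape
        cols = np.clip(np.arange(w)[None, :] - disparity.astype(int), 0, w - 1)
        rows = np.arange(h)[:, None].repeat(w, axis=1)
        mismatch = np.abs(left - right[rows, cols])

        return grad + lam * mismatch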
ABSTRACT: This paper presents a method that jointly performs synthesis of free-viewpoint images and object segmentation from that viewpoint. The method works efficiently and online by sharing a calculation process between the rendering and segmentation steps. Since the segmentation is performed directly for arbitrary viewpoints, the extracted object can be superimposed onto another 3-D scene with geometric consistency. Experimental results using a 25-camera array show the effectiveness of our method.
Image Processing (ICIP), 2009 16th IEEE International Conference on; 12/2009
ABSTRACT: This paper presents a method that jointly performs synthesis and object segmentation of free-viewpoint video using multi-view video as the input. The method works efficiently and online by sharing a calculation process between the synthesis and segmentation steps. The matching costs calculated during the synthesis step are adaptively fused with other cues, depending on their reliability, in the segmentation step. Since the segmentation is performed directly for arbitrary viewpoints, the extracted object can be superimposed onto another 3-D scene with geometric consistency. We can observe that the object and the new background move naturally with the viewpoint change, as if they existed together in the same space. Experimental results using a 25-camera array show the effectiveness of our method.
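For the two papers above, the adaptive fusion of cues could look roughly like this sketch: the photo-consistency cost reused from the synthesis step is blended with a second segmentation cue by a per-pixel reliability weight. The linear blending rule is an illustrative assumption, not the papers' exact formula:

    import numpy as np

    def fuse_costs(matching_cost, color_cost, reliability):
        # All inputs are (H, W) float arrays; `reliability` in [0, 1]
        # says how much to trust the matching cost at each pixel.
        r = np.clip(reliability, 0.0, 1.0)
        return r * matching_cost + (1.0 - r) * color_cost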
ABSTRACT: In this paper, we consider the rate-distortion performance of a multi-view image set with subsampling of the viewpoints. As a basic analysis, we compare two scenarios: (i) all images are coded and transmitted at uniform quality, and (ii) half of the images are discarded by subsampling at the sender side, and the remaining half are coded and transmitted. In the second scenario, the discarded images are reconstructed at the receiver side using a view interpolation technique. We first introduce a theoretical model describing the rate-distortion performance of these scenarios. Then, we present numerical simulations and experiments showing that which scenario yields better performance depends on the bit rate, the properties of the image set, and the accuracy of the view interpolation.
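One way to make the comparison concrete is the textbook rate-distortion model D(R) = sigma^2 * 2^(-2R); under that model (an assumption here, not the paper's), scenario (ii) spends twice the per-image rate on half the images but adds an interpolation error floor for the reconstructed half:

    import numpy as np

    def scenario_distortions(total_rate, n_images, var=1.0, interp_err=0.005):
        # (i) every image coded at rate R = total_rate / n_images
        d1 = var * 2.0 ** (-2.0 * total_rate / n_images)
        # (ii) half the images coded at twice the per-image rate; the
        #      discarded half is reconstructed with a fixed error floor
        #      standing in for the view-interpolation accuracy.
        d_coded = var * 2.0 ** (-2.0 * total_rate / (n_images / 2))
        d2 = 0.5 * d_coded + 0.5 * interp_err
        return d1, d2

    # At low bit rates scenario (ii) tends to win; at high rates its
    # interpolation error floor dominates and scenario (i) overtakes it.
    for rate in (2.0, 32.0, 256.0):
        print(rate, scenario_distortions(rate, n_images=16))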
ABSTRACT: In order to increase the ways in which users can intuitively interact with a tabletop display, we developed UlteriorScape. The system integrates the two major functions of its predecessors. As with Tablescape Plus, UlteriorScape uses tabletop objects as both projection screens and input interfaces. As with the Lumisight Table, the tabletop is physically single but visually multiple, thanks to a view-dependent display that can show different images to each user around the table. We developed several applications that demonstrate the advantages of UlteriorScape. In one, a single user can hold a simple screen-object over the tabletop and see additional images projected onto its surface. Those images change interactively based on the position of the object, which is tracked by its ID. In another application, multiple users can simultaneously work with a single mountain-shaped object that displays separate images on each side. In this paper, we describe the system design of UlteriorScape and its applications.
Horizontal Interactive Human Computer Systems, 2008. TABLETOP 2008. 3rd IEEE International Workshop on; 11/2008
ABSTRACT: We present a real-time video-based rendering system using a network camera array. Our system consists of 64 commodity network cameras connected to a single PC through Gigabit Ethernet. To render a high-quality novel view, we estimate a view-dependent per-pixel depth map in real time using a layered representation. The rendering algorithm is fully implemented on a GPU, which allows our system to use the CPU and GPU efficiently, independently, and in parallel. With QVGA input video resolution, our system renders free-viewpoint video at up to 30 fps, depending on the rendering parameters. Experimental results show high-quality images synthesized from various scenes.
3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2008; 06/2008
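The view-dependent, per-pixel depth estimation is in the spirit of a plane sweep: warp the input views onto each candidate depth layer of the target viewpoint and keep, per pixel, the layer where the views agree best. The sketch below assumes a hypothetical helper warp_to_layer and is not the paper's GPU implementation:

    import numpy as np

    def per_pixel_depth(views, depths, warp_to_layer):
        # views: list of (H, W, 3) images; warp_to_layer(i, d) resamples
        # view i onto the plane at depth d in the target viewpoint.
        costs = []
        for d in depths:
            warped = np.stack([warp_to_layer(i, d) for i in range(len(views))])
            # Photo-consistency: variance across views, summed over channels.
            costs.append(warped.var(axis=0).sum(axis=-1))
        best = np.argmin(np.stack(costs), axis=0)   # (H, W) layer indices
        return np.asarray(depths)[best]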
ABSTRACT: This paper proposes a basic technology that utilizes natural objects in the environment as simple interfaces. We focus on an interface that uses a small wooden balance beam. Its input is the movement of the user's center of gravity and the balance on the beam. The user controls the interface by hopping onto the beam, walking back and forth, jumping on it, or leaning in a direction. Strain sensors are attached to the beam, and from the sensor data, the approximate position of the user's center of gravity and changes in balance can be estimated. The interface was applied to an interactive system that combines the balance beam with computer graphics.
3D User Interfaces, 2008 IEEE Symposium on; 04/2008
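For a simply supported beam, the lever rule recovers the load position from the two support reactions, which is presumably what the strain readings approximate. A minimal sketch, assuming each strain reading is proportional (gain k) to the reaction force at its support:

    def center_of_gravity(strain_left, strain_right, beam_length, k=1.0):
        # Lever rule: a load W at distance x from the left support gives
        # reactions R_left = W*(L-x)/L and R_right = W*x/L, so
        # x = L * R_right / (R_left + R_right).
        r_left, r_right = k * strain_left, k * strain_right
        total = r_left + r_right
        if total == 0:
            return None  # nobody on the beam
        return beam_length * r_right / total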
ABSTRACT: This paper proposes a novel paradigm, human-centered tabletop computing, which enhances the role of an ordinary table by projecting interactive images onto tabletop objects and the table surface at the same time. The advantage of this approach is that it utilizes tabletop objects as projection screens as well as input tools. As a result, we can easily change the appearance and role of each tabletop object and fulfill two important requirements of tabletop tangible interfaces, identifiability and versatility, which have proven difficult to satisfy simultaneously in previous systems. Our prototype, Tablescape Plus, achieves these functions by using two projectors and a special tabletop screen system that diffuses or transmits images selectively according to the projection orientation. This paper presents the design principle, optical design, and implementation of Tablescape Plus. Furthermore, we introduce several interactive applications.
Horizontal Interactive Human-Computer Systems, 2007. TABLETOP '07. Second Annual IEEE International Workshop on; 11/2007
ABSTRACT: This paper proposes a novel tabletop display named "EmiTable" that can emit imperceptible metadata along with the tabletop image. Our system displays a visual image on the tabletop whose pixels contain metadata as bit patterns for dedicated receivers. Since the bit patterns are embedded as high-speed flickers, users do not perceive the hidden signal behind the image; however, the metadata can be read out by placing stand-alone receivers on the tabletop. Since the hidden signal is embedded independently in each pixel, different metadata can be delivered depending on the position on the tabletop. The advantage of our system is that it can imperceptibly superimpose metadata that are strongly related to the image content. This paper presents the detailed design and several applications.
Horizontal Interactive Human-Computer Systems, 2007. TABLETOP '07. Second Annual IEEE International Workshop on; 11/2007
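A toy version of the high-speed flicker embedding: nudge each pixel's intensity by a small plus/minus delta per frame according to its bit sequence, fast enough that viewers see only the temporal average while a receiver on the tabletop can demodulate it. The amplitude, frame layout, and coding are assumptions for illustration, not EmiTable's actual signal design:

    import numpy as np

    def embed_metadata(image, bits, delta=2.0):
        # image: (H, W) float array (the visible tabletop image)
        # bits:  (H, W, T) array of 0/1 metadata bits per pixel and frame
        # With a balanced code (e.g. Manchester), the temporal average of
        # the frames stays close to the original image, hiding the signal.
        for t in range(bits.shape[2]):
            yield np.clip(image + delta * (2 * bits[:, :, t] - 1), 0, 255)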
ABSTRACT: This paper discusses a system in which multi-view images are captured and encoded in a distributed fashion and a viewer synthesizes a novel view from these data. We developed an efficient method for such a system that combines the decoding and rendering processes to directly synthesize the novel image without reconstructing all the input images. Our method jointly performs disparity compensation in the decoding process and geometry estimation in the rendering process, because the two are essentially equivalent when the camera parameters of the input images are known. The method achieves low complexity for both the encoder and the decoder in a distributed multi-view coding system. Experimental results show the superior coding performance of our method compared to a conventional intra-coding method, especially at low bit rates.
Image Processing, 2007. ICIP 2007. IEEE International Conference on; 01/2007
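The equivalence that the method exploits can be stated in one line: for calibrated parallel cameras with focal length f and baseline B, the disparity d searched during compensation and the depth z searched during rendering determine each other,

\[
d = \frac{f\,B}{z},
\]

so a single per-pixel search can serve both the decoding and the rendering step.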
ABSTRACT: This paper proposes a view-dependent light field coding scheme that applies image-based rendering techniques prior to coding. The proposed coder first synthesizes an image at a given viewpoint, called the representative viewpoint, and then predicts all input images using the synthesized image as a reference. It produces a view-dependent scalable bitstream: the quality of synthesized views around the representative viewpoint is kept high even at extremely low bit rates, and the quality of views away from it improves as the bit rate increases. Our experimental results show that this coding scheme also achieves good coding efficiency for both multi-camera images and integral photography, which are common light field representations.
Image Processing, 2006 IEEE International Conference on; 11/2006
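Structurally, the coder described above might be sketched as follows: code the synthesized representative view first, then predict every input image from it and code only the residuals, so that bits spent early mostly improve views near the representative viewpoint. Every name below is a hypothetical placeholder, not the paper's implementation:

    def encode_light_field(inputs, reference, render_from, code):
        # reference: image synthesized at the representative viewpoint
        # render_from(reference, view_id): predicts input view `view_id`
        # code(data): codes an image or residual (assumed helpers)
        stream = [code(reference)]                 # base layer
        for view_id, img in enumerate(inputs):
            prediction = render_from(reference, view_id)
            stream.append(code(img - prediction))  # enhancement residuals
        return stream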
ABSTRACT: A light field is a 4-D function that characterizes the flow of light rays from a target scene and is used for image-based rendering. This paper presents a novel theoretical framework that addresses the aliasing problem in dealing with discrete light field data. We introduce a new scheme, called aliasing separation, to isolate the additional aliasing component caused by subsampling, and give a new perspective on how to optimize the reconstruction filter so as to interpolate light field data without aliasing artifacts. This optimization is closely related to depth estimation: both the focus measure for light field rendering and multiple-baseline stereo can be derived from our theoretical framework.
Image Processing, 2006 IEEE International Conference on; 11/2006
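In the standard two-plane analysis this framework builds on (the notation follows the plenoptic-sampling literature and is not necessarily the paper's), a Lambertian scene at constant depth z concentrates its spectral energy on a line, and subsampling the camera axis t with period Delta t replicates the spectrum:

\[
\hat L(\Omega_t, \Omega_v) \propto \delta\!\left(\Omega_t - \frac{f}{z}\,\Omega_v\right),
\qquad
\hat L_s(\Omega_t, \Omega_v) = \frac{1}{\Delta t} \sum_{k \in \mathbb{Z}} \hat L\!\left(\Omega_t - \frac{2\pi k}{\Delta t},\, \Omega_v\right).
\]

The k = 0 term is the true spectrum, and the k != 0 replicas are the aliasing component to be separated. Since the passband of an alias-free reconstruction filter must follow the slope f/z, optimizing the filter is inherently a depth-estimation problem, which is how both the focus measure and multiple-baseline stereo can emerge from one framework.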
ABSTRACT: This paper proposes real-time interactive systems, which we name "graphic shadow," that draw graphics on the shadows users cast on the floor. They offer an engaging interactive space in which illusions are performed on your own shadows.
The Second IEEE and ACM International Symposium on Mixed and Augmented Reality, 2003. Proceedings; 01/2006
ABSTRACT: Light field rendering (LFR) is an image-based rendering method for synthesizing free-viewpoint images from a set of multi-view images. In LFR, little or no knowledge of geometry is required, since scene objects are assumed to lie at a constant depth. This assumption causes focus-like effects in the synthetic images, in which only the objects near the assumed depth are rendered clearly and sharply. In order to generate all-in-focus synthetic images, we previously proposed a focus measurement method that uses several kinds of reconstruction filters. Using this method, we can detect the in-focus parts of multiple images synthesized with different assumed depths and integrate them into one final image. Although this algorithm was shown to be effective for some real scenes, the optimal combination of reconstruction filters has not yet been discussed. In this paper, we introduce a detailed spatial-domain analysis of this focus measurement method and compare three combinations through theoretical analysis and an experiment.
Image Processing, 2005. ICIP 2005. IEEE International Conference on; 10/2005
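The core idea, in sketch form: render the same view at the same assumed depth with two different reconstruction filters; where the assumed depth is correct, the two renderings agree, so their local discrepancy serves as an inverse focus measure. The particular filter pair and smoothing window below are assumptions, standing in for one of the combinations the paper compares:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def focus_measure(img_filter_a, img_filter_b, window=7):
        # Two renderings of the same view made with different
        # reconstruction filters; small local discrepancy means the
        # assumed depth is locally correct, i.e. in focus.
        diff = (img_filter_a - img_filter_b) ** 2
        return -uniform_filter(diff, size=window)  # higher = more in focus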
ABSTRACT: This paper introduces a novel image-based rendering method that takes input from unstructured cameras and synthesizes high-quality free-viewpoint images. Our method uses a set of depth layers in order to handle scenes with large depth ranges. The optimal depth layer is automatically assigned to each pixel of the synthesized image based on the on-the-fly focus measurement algorithm that we propose. We implemented this method efficiently on a PC and achieved nearly interactive frame rates.
Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on; 08/2005
ABSTRACT: A novel tabletop display provides different images to different users surrounding the system. It can also capture users' gestures and physical objects on the tabletop. The Lumisight Table approach is based on the optical design of a special screen system composed of a building material called Lumisty and a Fresnel lens. The system combines these films and a lens with four projectors to display four different images, one for each user's view. In addition, we need appropriate input methods for this display medium. In the current state of the project, we can control computers by placing physical objects on the display or placing our hands over it. The screen system also makes it possible to use a camera to capture the appearance of the tabletop from inside the system. Our other main goal is to develop attractive and specialized applications on the Lumisight Table, including games and applications for computer-supported cooperative work (CSCW) environments. The projected images can be completely different from each other, or partially identical and partially different. Users can share the identical parts as public information, because all users can see them. This article is available with a short video documentary on CD-ROM.
ABSTRACT: Light field rendering is a fundamental method for synthesizing free-viewpoint images from a set of multi-viewpoint images. In the simplest case, the scene structure is approximated by a single plane, the focal plane. This approximation leads to focus-like effects in the synthetic images, where the focused depth is determined by the focal plane. A serious problem is that the range of the focused depth is too small in most practical cases. In this paper, we propose a focus measure that is specialized for images synthesized by light field rendering. When a set of differently focused images is generated at a given viewpoint, the proposed focus measure enables us to obtain a depth map and an all-in-focus image. Our approach differs in notable ways from related techniques such as depth-from-stereo and depth-from-focus methods. Experimental results show that the proposed method effectively enhances the PSNR of the final synthetic images.
Image Processing, 2004. ICIP '04. 2004 International Conference on; 11/2004
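Given a stack of images synthesized at different focal planes and a per-pixel focus measure, the depth map and the all-in-focus image follow by a per-pixel argmax. This sketch leaves the focus measure abstract rather than reproducing the paper's:

    import numpy as np

    def all_in_focus(stack, depths, measure):
        # stack:   (D, H, W) images rendered with different focal planes
        # depths:  length-D sequence of the assumed focal depths
        # measure: measure(stack) -> (D, H, W) per-pixel focus scores
        scores = measure(stack)
        best = scores.argmax(axis=0)                      # (H, W)
        depth_map = np.asarray(depths)[best]
        composite = np.take_along_axis(stack, best[None, :, :], axis=0)[0]
        return depth_map, composite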
ABSTRACT: This paper proposes a general method of glare generation based on wave optics. A glare image is regarded as the result of Fraunhofer diffraction, which is equivalent to a 2D Fourier transform of the image of the given apertures or obstacles. In conventional methods, the shapes of glare images are categorized according to their source apertures, such as pupils and eyelashes, and their basic shapes (e.g., halos, coronas, or radial streaks) are manually generated as templates, mainly based on statistical observation. Realistic variations of these basic shapes often depend on the use of random numbers. Our proposed method computes glare images fully automatically from aperture images and can be applied universally to all kinds of apertures, including camera diaphragms. It can handle dynamic changes in the position of the aperture relative to the light source, which enables subtle movement or rotation of glare streaks. Spectra can also be simulated in the glare, since the intensity of diffraction depends on the wavelength of light. The resulting glare image is superimposed onto a given computer-generated image containing high-intensity light sources or reflections, aligning the center of the glare image with the high-intensity areas. Our method is implemented as multipass rendering software. By precomputing the dynamic glare image set and placing it in texture memory, the software runs at an interactive rate.
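The central computation reduces to a 2D FFT of the aperture image. A minimal monochromatic sketch; the per-wavelength rescaling of the pattern and the multipass compositing described above are omitted:

    import numpy as np

    def glare_pattern(aperture):
        # Fraunhofer diffraction: the far-field intensity is
        # |FFT(aperture)|^2, shifted so the DC term (the light source)
        # sits at the image center.
        field = np.fft.fftshift(np.fft.fft2(aperture))
        intensity = np.abs(field) ** 2
        return intensity / intensity.max()  # normalize for display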
ABSTRACT: In this paper, we develop a system that animates 3D facial agents based on real-time facial expression analysis techniques and on research into synthesizing facial expressions and text-to-speech capabilities. The system combines visual, auditory, and primary interfaces into one coherent multimodal chat experience. Users represent themselves using agents selected from a predefined group. When a user shows a particular expression while typing a text message, the 3D agent at the receiving end speaks the message aloud while replaying the recognized facial expression sequence, augmenting the synthesized voice with appropriate emotional content. Because the visual data exchange is based on the MPEG-4 high-level Facial Animation Parameter for facial expressions (FAP 2), rather than on real-time video, the method requires very low bandwidth.