
Enhanced visualisation for minimally invasive surgery

DOI 10.1007/s11548-011-0631-z
Johannes Totz · Kenko Fujii · Peter Mountney · Guang-Zhong Yang
Received: 11 January 2011 / Accepted: 30 May 2011
© CARS 2011
Purpose Endoscopes used in minimally invasive surgery provide a limited field of view, thus requiring a high degree of spatial awareness and orientation. Researchers have attempted to expand this small, restricted view with previously observed imagery, an approach generally known as image mosaicing or dynamic view expansion. For minimally invasive endoscopy, SLAM-based methods have been shown to have potential value but have yet to address effective visualisation techniques.
Methods The live endoscopic video feed is expanded with previously observed footage. To this end, a method that highlights the difference between actual camera image and historic data observed earlier is proposed. Old video data is faded out to grey scale to mimic human peripheral vision. Specular highlights are removed with the help of texture synthesis to avoid distracting visual cues. The method is further evaluated on in vivo and phantom sequences by a detailed user study to examine the ability of the user in discerning temporal motion trajectories while visualising the expanded field of view, a feature that is of practical value for enhancing spatial awareness and orientation.
Results The difference between historic data and live video is integrated effectively. The use of a single texture domain generated by planar parameterisation is demonstrated for view expansion. Specular highlights can be removed through texture synthesis without introducing noticeable artefacts. The implicit encoding of motion trajectory of the endoscopic camera visualised by the proposed method facilitates both global awareness and temporal evolution of the scene.
J. Totz (corresponding author) · K. Fujii · P. Mountney · G.-Z. Yang
The Hamlyn Centre for Robotic Surgery,
Imperial College London, SW7 2AZ, London, UK
Conclusions Dynamic view expansion provides more context for navigation and orientation by establishing reference points beyond the camera's field of view. Effective integration of visual cues is paramount for concise visualisation.
Keywords View expansion · Minimally invasive surgery · Visualisation · Surgical navigation
Minimally invasive surgery (MIS) requires a high degree of dexterity and spatial awareness to navigate safely in vivo. Endoscopic imaging provides only a limited field of view (FOV) with off-axis visualisation. Furthermore, the camera view is often rotated, complicating control of tools and camera, which adds to disorientation. To alleviate these challenges, the current view provided by the endoscopic camera can be expanded with previously observed footage, as shown in Fig. 1. This has been shown to be useful for spatial orientation and anatomical referencing.
View expansion is closely related to image mosaicing, which has been used for panorama construction to survey larger areas of organs for bladder [1], kidney [2] or liver surgery. However, in traditional techniques, camera motion is often confined to one particular plane, which is not representative of surgical scenarios.
View expansion for out-of-plane camera motion has been
approached by Lerotic et al. [3]. The authors model the
expansion as an optical flow process that reprojects pix-
els based on an estimated homography. This results in
expanded images where the video flows out of the orig-
inal frame bounds. Remaining seam artefacts caused by
changing illumination conditions are removed by Poisson
Fig. 1 Example of dynamic view expansion. The live video image is
expanded with previously observed footage of areas that are now out of
The feasibility of another technique has been shown by Mountney and Yang [4]. Stereoscopic simultaneous localisation and mapping (SLAM) is used in an endoscopic imaging setup to predict and estimate the position of landmarks on organ surfaces and the relative camera motion. This yields a sparse point cloud in 3D-space known as the SLAM map that can be triangulated to approximate the tissue surface. By selecting parts of the video frames as texture maps, one can create a sufficiently realistic virtual scene for rendering in the peripheral area of the video; that is, the narrow-FOV endoscopic view is expanded with historic video footage textured onto a 3D model.
In principle, what is attempted by SLAM-based methods amounts to surgical scene reconstruction. Grasa et al. [5] describe a method in the context of augmented reality for taking measurements on the estimated organ surface. While SLAM is often sparse due to performance concerns, dense reconstruction is also possible, as, for example, performed by Stoyanov et al. [6]. While the authors target motion compensation for beating heart surgery, the idea can be adapted to a navigational aid by providing a 3D rendering of the scene from alternative angles, as done recently by Moll et al. [7].
View expansion methods, however, have yet to address an
important technical issue related to clear and concise visual-
isation. The authors of [4] take great care that the resulting
expanded view is free of artefacts and seams between camera
and expansion are reduced. Unfortunately, this makes distin-
guishing between endoscopic view and expansion difficult.
But clear distinction is necessary to ensure surgeons make
clinical decisions based on live data.
In this paper, a new visualisation method is proposed to clearly distinguish between live view and expansion assembled from previously observed video footage. The proposed expansion scheme implicitly encodes the temporal trajectory by slowly fading colour to grey-scale along its path. This provides an important cue for camera motion trajectory estimation for the expanded scene. Removal of imaging artefacts, such as specular highlights caused by intense light reflection on the mucus layer, is well studied [8,9] as these tend to cause problems with vision-based algorithms. While they carry subtle yet important cues for shape recognition [10], their presence in the expanded areas is misleading. For this method, specular highlights are removed only on the expanded regions via texture synthesis. The actual camera images are not modified in any way and are presented as-is to the operator to ensure high fidelity of the operating field of view.
An overview of the method and its individual steps, from video to view expansion visualisation, is depicted in Fig. 2. "Stereo SLAM" introduces the framework used in previous work, followed by "Texture pipeline" describing the differences of the proposed method. Details on each step of the proposed algorithm are provided.
Stereo SLAM
The method presented in this paper is based on [4]. A detailed
description of the SLAM system used is available in [11].
The stereo camera was calibrated at the beginning of the procedure using a method described in [12], estimating the focal length f_x, f_y, the principal point c_x, c_y, width and height w, h of the camera image and the distortion
The Shi-Tomasi [13] corner detector is used to identify suitable landmarks in the video image by estimating eigenvalues of the image covariance matrix. Each feature y is tracked in both the left and the right camera images and triangulated to determine its 3D-space position y_p. As the video progresses, new features are identified and features that come back into view are updated. The collection of all features is called the SLAM map Y = {y_0, y_1, ..., y_n} for n features. It is a sparse point cloud approximating the tissue surface. An actual surface representation is obtained by converting Y into a triangle mesh. This is achieved by projecting all y_p into the global xy-plane, dropping the z-coordinate and performing a 2D Delaunay triangulation.
Using the relative motion of features estimated by the SLAM framework, it is possible to estimate the camera position and orientation. Video frames are then selected to serve as texture maps for each triangle of Y. Finally, camera images are embedded into a rendering of Y by projecting the 3D position of features into camera space:
\[ y_{pc} = P_C \cdot M_C \cdot y_p \tag{1} \]
where y_p is the 3D position of a feature, y_pc is the resulting clip-space [14] position, and M_C is the modelview matrix derived from the estimated camera pose. P_C is the projection matrix derived from the camera's intrinsic parameters as follows (following OpenGL's notation [14]):
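\[
\begin{aligned}
\mathit{left} &= -\mathit{near} \cdot \frac{c_x}{f_x}, &
\mathit{right} &= \mathit{near} \cdot \frac{w - c_x}{f_x}, \\
\mathit{top} &= \mathit{near} \cdot \frac{c_y}{f_y}, &
\mathit{bottom} &= \mathit{near} \cdot \frac{c_y - h}{f_y}
\end{aligned}
\]
\[
P_C = \begin{pmatrix}
\frac{2 \cdot \mathit{near}}{\mathit{right} - \mathit{left}} & 0 & \frac{\mathit{right} + \mathit{left}}{\mathit{right} - \mathit{left}} & 0 \\
0 & \frac{2 \cdot \mathit{near}}{\mathit{top} - \mathit{bottom}} & \frac{\mathit{top} + \mathit{bottom}}{\mathit{top} - \mathit{bottom}} & 0 \\
0 & 0 & -\frac{\mathit{far} + \mathit{near}}{\mathit{far} - \mathit{near}} & -\frac{2 \cdot \mathit{far} \cdot \mathit{near}}{\mathit{far} - \mathit{near}} \\
0 & 0 & -1 & 0
\end{pmatrix} \tag{2}
\]
with near and far set to 0.15 and 10, respectively.
Fig. 2 Texture processing pipeline. Endoscopic video is processed by SLAM to estimate tissue surface. This surface is parameterised into a planar texture domain into which video frames are projected. Artefacts are removed by masking and subsequent texture synthesis. The final visualisation renders an expanded view around the video images by fading historic footage to grey scale. Components in the dashed box are addressed in this paper which differ significantly from previous work
Fig. 3 Illustration of artefact falling onto an edge of the SLAM map triangulation. a Single texture domain set-up; b Per-triangle texture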
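As an illustration of how the projection matrix P_C could be assembled from calibrated intrinsics, the following minimal Python sketch builds an OpenGL-style frustum matrix. The function name and the exact conversion from intrinsics to frustum bounds (image origin at the top-left corner, y pointing down) are assumptions made for this sketch and are not taken from the original implementation; only the near and far values come from the text.

```python
def frustum_from_intrinsics(fx, fy, cx, cy, w, h, near=0.15, far=10.0):
    """Build a 4x4 OpenGL-style projection matrix (row-major nested lists).

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels;
    w, h: image size in pixels. The bound formulas are the usual
    pinhole-to-frustum conversion and are an assumption of this sketch.
    """
    left = -near * cx / fx
    right = near * (w - cx) / fx
    top = near * cy / fy
    bottom = near * (cy - h) / fy
    return [
        [2.0 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2.0 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


# Hypothetical intrinsics for a 640x480 image with a centred principal point.
P_C = frustum_from_intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0, w=640, h=480)
```

With a centred principal point the frustum is symmetric, so the third-column entries of the first two rows vanish; an off-centre principal point shifts the frustum accordingly.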
Texture pipeline
The main difference of the proposed method to its predecessor [4] lies in how texture maps are assembled. Here, a strategy is employed that generates one global texture image T covering every part of Y to simplify its processing. As illustrated by Fig. 3, using a single texture map solves two important problems elegantly:
1. Removal of artefacts caused by specular highlights when such highlights cross an edge of a triangle in Y.
2. Updating the texture map when Y needs to be retriangulated due to newly tracked or deleted SLAM map features.
Removing an artefact is relatively straightforward if
surrounding information from adjacent triangles is readily
available in a planar form (Fig. 3a), so that image filtering
operations can be applied directly. Otherwise, one (trivial)
texture domain per surface element (Fig. 3b) requires tedious
assembly of adjacent image information.
SLAM is dynamic, constantly adding new features and removing old untrackable ones. Thus, any method needs to be able to cope with retriangulations of Y while preserving already observed video footage used for the view expansion. In [4], all video frames used for texture maps had to be kept in memory to recreate textures in case of retriangulations. Using a single texture domain, this task becomes trivial because in the event of a retriangulation, the new updated texture domain is simply a reprojection of the old one using new coordinates.
Use of a single texture domain requires a planar parameterisation of the surface of Y that bijectively maps every point to a pixel in the texture image T by assigning a unique uv-coordinate in T to every vertex of Y. As it stands, the current xy-plane Delaunay triangulation already provides such a parameterisation. Unfortunately, using it directly leads to excessive distortion and undersampling of video content because the actual size of Y's surface triangles is not taken into account.
Therefore, Mean Value Coordinates [15] (MVC) are used to optimise the position of Y's vertices inside the texture map to reduce distortion. MVC falls into the category of conformal parameterisations that minimise angular distortions. It requires the convex hull of Y in T as input to compute the remaining interior positions. For any given vertex v_0 and its k incident neighbours v_1 ... v_k, v_0 can be expressed as a linear combination of its neighbours with weights λ_i derived from the angles α_i at v_0 between v_i and v_{i+1} (more details in [15]):
\[
v_0 = \sum_{i=1}^{k} \lambda_i v_i, \qquad
\lambda_i = \frac{w_i}{\sum_{j=1}^{k} w_j}, \qquad
w_i = \frac{\tan(\alpha_{i-1}/2) + \tan(\alpha_i/2)}{\lVert v_i - v_0 \rVert}
\]
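To make the mean value weights concrete, the sketch below evaluates them for a single vertex and its ordered ring of neighbours in the plane. It is only an illustration of the weight formula from [15]: the function name is hypothetical, the neighbours are assumed to be given in consistent angular order around v_0, and the full parameterisation (which solves a linear system for all interior vertices using angles measured on the 3D mesh) is not shown.

```python
import math


def mean_value_weights(v0, neighbours):
    """Return normalised mean value weights lambda_i for a planar vertex v0.

    neighbours: list of (x, y) tuples ordered around v0.
    alpha_i is the angle at v0 between neighbour i and neighbour i+1 (cyclic).
    """
    k = len(neighbours)

    def angle_between(a, b):
        ax, ay = a[0] - v0[0], a[1] - v0[1]
        bx, by = b[0] - v0[0], b[1] - v0[1]
        # Unsigned angle in [0, pi] between the two edge vectors.
        return math.atan2(abs(ax * by - ay * bx), ax * bx + ay * by)

    alphas = [angle_between(neighbours[i], neighbours[(i + 1) % k]) for i in range(k)]
    weights = []
    for i in range(k):
        dist = math.hypot(neighbours[i][0] - v0[0], neighbours[i][1] - v0[1])
        # alphas[i - 1] wraps around to the last angle for i == 0.
        weights.append((math.tan(alphas[i - 1] / 2.0) + math.tan(alphas[i] / 2.0)) / dist)
    total = sum(weights)
    return [w / total for w in weights]


ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
lam = mean_value_weights((0.2, -0.1), ring)
# The weighted neighbour average reproduces v0.
reconstructed = (sum(l * p[0] for l, p in zip(lam, ring)),
                 sum(l * p[1] for l, p in zip(lam, ring)))
```

The weights are positive and sum to one, which is what allows interior vertices to be placed as convex combinations of their neighbours once the boundary is fixed.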
Interestingly, the Delaunay triangulation computes the convex hull in the global xy-plane and could be used directly. Unfortunately, while this would keep distortion of surface triangles low, it is not robust enough in a SLAM context because of a high range of values for y_p. Instead, the boundary of Y is arranged uniformly on a circle, ignoring relative sizes and angles, but still optimising the interior positions. Figure 4 illustrates this problem: for this particular sequence, respecting the actual boundary size of Y leads to heavy undersampling in some areas as can be seen in the upper right of Fig. 4a. The circular boundary performs more robustly in the general case (cf. Fig. 4b).
When a new video frame is processed, sometimes newly tracked features require a retriangulation of Y. This changes the triangle mesh topology and often the 3D space positions of features. Therefore, it is necessary to recompute the parameterisation. This requires a reprojection of the contents of T corresponding to the previous frame because the new uv-coordinates point to invalid locations. Reprojection is simple and fast due to the use of a single texture map: T is used as a render-target with the new uv-coordinates as vertex positions and a copy of the previous T as a texture mapped onto the previous uv-coordinates. This transfers all previously accumulated video footage. But reprojection can lead to excessive blurring if done too often because of repeated texture filtering. Therefore, Y is reparameterised only if new SLAM features are tracked or old ones are deleted, i.e. the texture coordinates are not updated if features simply move slightly, thus avoiding reprojection.
Fig. 4 Illustration of SLAM map boundary parameterisation. This shows video content in the planar texture domain with the arrows highlighting corresponding points and their vastly different location relative to adjacent points inside the texture map. a Using intrinsic convex boundary leads to excessive distortion in some cases. b Circular boundary is more robust
Frame projection
The current video frame is projected into the texture map by rendering Y into T with the uv-coordinates as positions and the SLAM-estimated 3D-positions y_p projected into camera space yielding y_pc as texture coordinates into the current video frame. This projection has to be perspective correct, i.e. the actual coordinate into the video frame is y_pc.xy / y_pc.w, with .xy signifying the first two components of the 4-dimensional vector y_pc and .w the last.
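The perspective division described above can be sketched in a few lines of Python. The helper names are hypothetical, matrices are plain row-major nested lists, and the final remapping from normalised device coordinates in [-1, 1] to texture coordinates in [0, 1] is an assumption of this sketch rather than a detail stated in the text.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix with a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def frame_coordinate(P_C, M_C, y_p):
    """Project a 3D feature position y_p = (x, y, z) into the video frame.

    Returns (u, v) texture coordinates after the perspective divide
    y_pc.xy / y_pc.w.
    """
    eye = mat_vec(M_C, [y_p[0], y_p[1], y_p[2], 1.0])
    y_pc = mat_vec(P_C, eye)  # clip-space position, as in Eq. (1)
    ndc_x = y_pc[0] / y_pc[3]
    ndc_y = y_pc[1] / y_pc[3]
    # Assumed mapping from [-1, 1] to [0, 1] texture space.
    return (0.5 * ndc_x + 0.5, 0.5 * ndc_y + 0.5)


# Hypothetical symmetric projection matrix, camera at the origin looking down -z.
P_example = [
    [1.5625, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, -1.03, -0.3],
    [0.0, 0.0, -1.0, 0.0],
]
u, v = frame_coordinate(P_example, IDENTITY, (0.1, 0.0, -1.0))
```

Doubling the depth of the same point halves its offset from the image centre, which is the behaviour the division by .w is there to provide.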
In addition, when rendering into T, hardware alpha-blending [14] is enabled so that new video frame pixels overwrite existing pixels in T and masked pixels (due to artefacts) do not contribute anything and are discarded:
\[ \theta'_{RGB} = \xi_{RGB} \cdot \xi_A + \theta_{RGB} \cdot (1 - \xi_A) \tag{3} \]
with ξ being the video frame colour of a pixel in RGB and A its alpha channel set to 1.0 for valid pixel data and 0.0 for masked-off pixels, θ the current colour of the target pixel onto which the video frame pixel is being projected in T and θ' the new colour for that pixel.
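The blending rule of Eq. (3) amounts to a per-channel linear interpolation controlled by the alpha value of the incoming pixel. The following sketch is a CPU-side illustration only (the paper relies on hardware alpha-blending); the function name is hypothetical.

```python
def blend_into_texture(xi_rgb, xi_a, theta_rgb):
    """Return the new texture colour for one pixel.

    xi_rgb:    colour of the incoming video frame pixel (linear RGB tuple)
    xi_a:      its alpha, 1.0 for valid data and 0.0 for masked-off pixels
    theta_rgb: colour currently stored in the texture T
    """
    return tuple(x * xi_a + t * (1.0 - xi_a) for x, t in zip(xi_rgb, theta_rgb))


old = (0.2, 0.1, 0.1)
new = (0.8, 0.3, 0.3)
overwritten = blend_into_texture(new, 1.0, old)  # valid pixel replaces the old colour
kept = blend_into_texture(new, 0.0, old)         # masked pixel leaves T untouched
```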
Every pixel colour ξ coming from the video frames is treated as being prepared for display in sRGB and as such undergoes gamma correction prior to any operation that modifies colour, e.g. blending in Eq. (3), grey-scale fade in Eq. (5) and artefact removal (see further below). The purpose of this correction is to transform colours from non-linear sRGB into linear RGB, so that linear equations make sense [16]. After a final pixel colour has been determined, it is converted back to sRGB for display.
Fig. 5 Evolution of texture domain and image over time. Parameterisations of the tissue surface approximating SLAM map are shown in the top row, with exemplar video frames projected at each point in time below. The right-most image shows the bare parameterisation at the end of the sequence
Fig. 6 Example of stable and transitory specular highlights
As time passes and more and more video frames have
been projected, the texture image fills up with content, as
illustrated by Fig. 5.
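The conversion between non-linear sRGB and linear RGB mentioned in this section can be sketched as follows. The text does not state which transfer curve was used, so the standard piecewise sRGB transfer function is assumed here; both function names are hypothetical.

```python
def srgb_to_linear(c):
    """Convert one sRGB channel value in [0, 1] to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l):
    """Convert one linear-light channel value in [0, 1] back to sRGB."""
    if l <= 0.0031308:
        return l * 12.92
    return 1.055 * l ** (1.0 / 2.4) - 0.055


# Colour operations (blending, fading, artefact removal) would be applied
# between these two conversions.
pixel_srgb = (0.5, 0.25, 0.25)
pixel_linear = tuple(srgb_to_linear(c) for c in pixel_srgb)
round_trip = tuple(linear_to_srgb(l) for l in pixel_linear)
```

Working in linear RGB matters because averaging or interpolating gamma-encoded values darkens the result compared with mixing the corresponding light intensities.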
Artefact removal
In [4], Poisson Image Editing [17] was used to remove seams in the visualisation caused by changing illumination. This paper, however, will mainly focus on false specular highlights.
Specular highlights are caused by a thin layer of mucus on the surface of tissue and intense and direct lighting, as seen, for example, in Fig. 6. While highlights carry subtle but important cues on distance and shape of an object [10], these are counterproductive when shown on the expanded view as they originate from a very different camera position than the current one. That is, the perspective cue carried by historic highlights is wrong. Therefore, specular highlights are considered artefacts that need to be removed during frame projection.
To address this problem, highlights are masked off by thresholding pixels based on intensity and saturation during frame projection by setting the corresponding alpha channel to zero. The threshold for intensity is above μ + 2·σ and for saturation below μ − σ, with μ being the mean and σ the standard deviation for each, respectively, calculated over the whole frame. Thus, the blending mode explained above ignores them, which will leave a number of holes in the texture map T where no information is available. Fortunately, many specular highlights are unstable and appear and disappear over time, which will cause the mentioned holes to be filled in automatically. There are, however, a few stable highlights for which this approach is unsatisfactory, leaving large holes even over extended periods of time.
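The thresholding step can be illustrated with a short sketch that derives an alpha value per pixel from frame-wide statistics. Several details are assumptions of this sketch and not taken from the text: intensity and saturation are taken as the V and S channels of HSV, a pixel is masked only when both conditions hold, and the population standard deviation is used. The function name is hypothetical.

```python
import colorsys
import statistics


def highlight_alpha(pixels):
    """Return one alpha value per pixel: 0.0 for suspected highlights, else 1.0.

    pixels: list of (r, g, b) tuples with channels in [0, 1].
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    saturation = [s for _, s, _ in hsv]
    intensity = [v for _, _, v in hsv]

    # Thresholds from the whole frame: mean + 2*sigma and mean - sigma.
    intensity_threshold = statistics.fmean(intensity) + 2.0 * statistics.pstdev(intensity)
    saturation_threshold = statistics.fmean(saturation) - statistics.pstdev(saturation)

    return [
        0.0 if (v > intensity_threshold and s < saturation_threshold) else 1.0
        for s, v in zip(saturation, intensity)
    ]


# Toy frame: reddish tissue with a single white, saturated-out highlight pixel.
frame = [(0.6, 0.2, 0.2)] * 19 + [(1.0, 1.0, 1.0)]
alpha = highlight_alpha(frame)
```

In this toy frame only the white pixel is both unusually bright and unusually desaturated, so it alone receives an alpha of zero and would be skipped by the blending step.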
Therefore, texture synthesis is employed to fill in these holes. An MRF-based multi-scale method [18] is used to synthesise patterns based on existing pixel neighbourhoods. That is, the current texture image T is marked with the location of specular highlight holes and then decomposed into 5×5 pixel neighbourhoods. The neighbourhoods not having any holes form a set of 75-dimensional vectors F_i describing the local appearance of the tissue surface. Feature vectors are inserted into a kd-tree for efficient Approximate Nearest Neighbour search. To fill in believable patterns that already exist in T for each hole j, feature vectors are matched to the holes on a greedy best-fit basis:
arg min
The F_j that minimises the difference to the feature vectors is filled back into T. Starting from random locations in T, a priority queue is set up that contains locations of 5×5 pixel neighbourhoods which are adjacent to the holes, sorted according to how many of the 25 pixels are valid, i.e. the location with the fewest hole-pixels has the highest priority. The top of the queue is popped, a suitable F_j is found and written back into T, and the queue is then updated
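The greedy best-fit idea can be illustrated with a heavily simplified sketch. It is not the MRF-based multi-scale method of [18]: it works on a single-channel image, uses 3×3 instead of 5×5 neighbourhoods, fills one pixel at a time, and replaces the kd-tree search with a brute-force scan. All names are hypothetical.

```python
def fill_holes(img, half=1):
    """Fill hole pixels (marked None) in a 2D list of grey values.

    Holes closer than `half` pixels to the image border are left untouched.
    At least one hole-free neighbourhood must exist to serve as an exemplar.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    def patch(y, x):
        return [out[y + dy][x + dx]
                for dy in range(-half, half + 1)
                for dx in range(-half, half + 1)]

    interior = [(y, x) for y in range(half, h - half) for x in range(half, w - half)]
    # Exemplars: neighbourhoods without any hole, taken from the input image.
    exemplars = [p for p in (patch(y, x) for y, x in interior) if None not in p]
    holes = {(y, x) for y, x in interior if out[y][x] is None}

    while holes:
        # Highest priority: the hole whose neighbourhood has the most valid pixels.
        y, x = max(holes, key=lambda p: sum(v is not None for v in patch(*p)))
        target = patch(y, x)

        def cost(candidate):
            # Sum of squared differences over the valid pixels of the target only.
            return sum((a - b) ** 2 for a, b in zip(candidate, target) if b is not None)

        best = min(exemplars, key=cost)
        out[y][x] = best[len(best) // 2]  # copy the centre pixel of the best match
        holes.remove((y, x))
    return out


# Vertical stripes with one missing pixel in the middle.
stripes = [[x % 2 for x in range(7)] for _ in range(7)]
stripes[3][3] = None
filled = fill_holes(stripes)
```

In the striped example the missing pixel is restored to the value of its column, because only exemplars centred on a matching stripe agree with the surviving neighbours.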
Fig. 7 The colour mapping scheme used for fading. The colour of
video pixels in the expanded view are faded to grey with increasing
age. The gradient bar at the top shows the corresponding colour
To address the issue of highlighting the difference between
live camera image and expanded view, historic data is
“demoted” by desaturating it. This keeps colour informa-
tion in the centre of the image (the “fovea”), while the
periphery appears grey scale. This mimics human peripheral
vision[19]. In addition, by slowly fading to grey scale, a trail
of historical camera motion trajectory is implicitly encoded
so as to further enhance spatio-temporal orientation.
Every pixel in T has an associated age that monotonically increases every time the texture map is updated. Currently, this rate is such that video footage is fully grey scale after 2 s. The age of each pixel is reset to zero when fresh pixels from the current video frame are written into T. On presenting the expanded view on screen, the colour γ for every pixel of T mapped onto Y is faded to grey scale using:
\[ \gamma = (1 - \mathit{age}) \cdot \begin{pmatrix} r \\ g \\ b \end{pmatrix} + \mathit{age} \cdot \begin{pmatrix} i \\ i \\ i \end{pmatrix} \tag{5} \]
with r, g and b being the red, green and blue components of the original colour (coming from the previously projected video frames), i the grey scale intensity and age clamped to the range [0,1]. Figure 7 illustrates how a pixel's RGB values fade to grey. Despite the linear Eq. (5), the graph in Fig. 7 is non-linear due to gamma correction of video images.
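The ageing and fading of texture pixels could be sketched as below. The linear interpolation between colour and grey follows the description above; how the grey intensity i is computed is not stated in the text, so Rec. 709 luma weights on linear RGB are assumed here. The function names and the per-update time step are likewise hypothetical.

```python
def update_age(age, dt, fade_seconds=2.0, refreshed=False):
    """Advance a pixel's normalised age; reset it when fresh video data arrives.

    dt: time since the last texture update in seconds.
    fade_seconds: time after which footage is fully grey (2 s in the text).
    """
    if refreshed:
        return 0.0
    return min(1.0, age + dt / fade_seconds)


def fade_to_grey(rgb_linear, age):
    """Interpolate a linear RGB colour towards its grey intensity by age in [0, 1]."""
    age = min(max(age, 0.0), 1.0)
    r, g, b = rgb_linear
    i = 0.2126 * r + 0.7152 * g + 0.0722 * b  # assumed grey-scale intensity
    return tuple((1.0 - age) * c + age * i for c in (r, g, b))


colour = (0.8, 0.2, 0.1)
fresh = fade_to_grey(colour, update_age(0.5, 0.04, refreshed=True))  # unchanged colour
half_faded = fade_to_grey(colour, update_age(0.0, 1.0))              # age 0.5
fully_grey = fade_to_grey(colour, 1.0)                               # r == g == b
```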
The data used in this study are from the publicly available VIP Laparoscopic/Endoscopic Video Dataset [20] and are a recording of an exploration of the abdominal cavity. Figure 8 shows a number of frames during the procedures. The left column for each sequence shows the original video frame, with the corresponding view expansion rendering to the right. The short sequence on the left of Fig. 8 has the camera rise at first and then pan to the right. The much longer sequence on the right of Fig. 8 depicts the camera panning to all sides, exploring the abdominal cavity.
All visualisations are performed with a black background,
the video frame at its original size and the expanded view
to either side of it, depending on camera motion. As time
progresses, new features are tracked by SLAM and its map
expands providing more and more context as the camera
moves around. The colour of the expansion fades off to grey
while newly arriving video footage resets the associated pixel
age and produces coloured output. The actual camera images
always match the format supplied by the imaging device, in
this case rectangular.
Figures 9 and 10 show the effect of specular highlight removal and the subsequent inpainting of holes. Transitory highlights require no inpainting at all because they will not be visible: by the time they appear on the periphery of the video frame, they will have been filled in by other video frame projections (Fig. 9). A few highlights are stable though and require proper texture synthesis and inpainting. This is illustrated in Fig. 10.
Both video sequences were processed on a dual-core Intel CPU running at 2.5 GHz, with 4 GB RAM and an NVIDIA GeForce 8600M GT with 256 MB dedicated video RAM. Table 1 lists the processing time for each step of the pipeline. Texture synthesis is computationally expensive and is thus excluded for real-time implementation.
Figure 11 compares the effect of fading to grey for
Sequence 1. In a dynamic scene with the camera constantly
moving, it could be possible for the clinician to confuse what
is live video and what is historic data. Fade to grey makes this
distinction very clear while keeping the structural appearance
of tissue. In addition, the colour-trail left behind provides a
navigational cue as to which area has been observed recently.
User study
To assess the effectiveness of the proposed visualisation
method, a user study was conducted by comparing the pro-
posed method against a full-colour visualisation scheme with
a rectangle highlighting the current live video frame bound-
aries in the expanded FOV.
Fig. 8 Expanded views for two laparoscopic video sequences
In addition to the two porcine in vivo sequences, a silicone phantom [21] sequence was also recorded. Eighteen participants with normal or corrected-to-normal vision were given the task to identify as quickly and as accurately as possible the recent camera trajectory covered using image examples as shown in Fig. 11. Slides with static images of DVE were shown to the participants, five for each visualisation method: with the proposed fade-to-grey approach and without. The order of slides was randomised and each viewpoint was used only once to avoid potential learning effects. Gaze of participants was recorded using a Tobii Eye-Tracker 1750. Participants were given the chance to familiarise themselves with the experimental setting and with drawing the trajectory with the mouse prior to the experiment itself. The results of that training session are not used in the analysis.
During the experiment, all mouse movements were recorded, with the average time it takes from slide-start to first mouse click reported in Table 2. On average, it took 1.5 s longer for subjects to start drawing out the camera trajectory with the mouse when visualising the full-colour vs. fade-to-grey images. In addition, the standard deviation of the time from image onset to first mouse click is also lower, implying greater consistency and confidence among subjects to draw the camera trajectory.
Fig. 9 Transitory highlights are filled in automatically by newly projected video frames and therefore require no inpainting. Note that these would never be visible in the visualisation because they are hidden behind the camera image and the corresponding image regions only appear on the periphery after they disappeared
Fig. 10 Stable highlights persist over a long time until they are visible around the live video frame and require inpainting. a Shows how texture synthesis inpaints one such highlight; b Shows its effect on the
Figure 12a shows two examples of the trajectories drawn by all subjects on two example DVE images, with the SLAM-estimated camera trajectory superimposed. It is evident that for the proposed fading method, the trajectories derived are much more consistent, whereas for the plain colour visualisation method, participants reported that they were guessing randomly or resorting to the boundary of the expansion to infer the underlying temporal trajectory.
Table 1 Average processing times for various steps of the pipeline
Step                Processing time μ ± σ (ms)
SLAM                18 ± 31.1
Reprojection        10.5 ± 7.6
Frame projection    4 ± 0.8
Visualisation       1.9 ± 2.8
Fig. 11 Comparison of visualisation, with and without fade-to-grey
Table 2 Average time until decision for study participant, in seconds
                      Full colour    Proposed fade-to-grey
Average               5.15           3.48
Standard deviation    0.98           0.58
Preliminary analysis of the eye-tracking data gives further
insight into participants’ visual search behaviour. Only the
visual search pattern before the first mouse click is used for
analysis because mouse cursor movement and eye gaze can
be potentially correlated [22]. Figure 12b shows the gaze-
hotspot images corresponding to the exemplary trajectory.
These reveal that participants focus on the peripheral grey-
scale area before making their decision. Without the use of the
proposed visualisation method, the gaze is scattered across
the entire image without a clear visual search pattern.
In a post-study interview, experienced clinicians com-
mented on the potential clinical utility of the method, espe-
cially for avoiding drift and disorientation in examination
loops, i.e. revisiting previously observed organs.
Quantitative evaluation
To provide quantitative validation of the proposed system, its ability for loop closure was determined. Loop closure is important to avoid visible cracks and seams between historic data and the new video frame. Sequence 2 incorporates a loop with the camera panning to the left, right, up, down and back to the original position again, partly depicted in Fig. 8. The very first video frame was projected into T and ten landmarks were manually identified that were not chosen as SLAM features. This first projection was used as a reference against which the following video frames were compared: the ten landmarks were identified again and their position in the video compared with their position in the visualisation of Y with the reference T. Ideally, this difference in position should be zero. Figure 13 depicts this difference over time. In this sequence, the camera pans to the left causing the error to increase minimally due to SLAM's noise modelling. It returns to the start position after about 10 s, pans to the right and returns. After 25 s the camera pans upwards and the error increases because some features are lost, making the estimated camera position uncertain. Returning to the start position also drops the error back to previous levels.
Fig. 12 Exemplary camera trajectories identified by study participants (thin green lines) for plain colour and fade-to-grey on the phantom sequence with superimposed SLAM-estimated trajectory (orange dashed lines) and the corresponding hotspot images of fixation points for all
Fig. 13 Average projection error for loop closing in sequence 2
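The error measure used for the loop-closure evaluation can be written down compactly. The sketch below assumes the error is the mean Euclidean distance, in pixels, between each landmark's position in the live video and its position in the rendered reference; the function name and the sample coordinates are hypothetical.

```python
import math


def mean_projection_error(video_points, rendered_points):
    """Average Euclidean distance between corresponding 2D landmark positions.

    video_points:    landmark positions identified in the current video frame
    rendered_points: positions of the same landmarks in the visualisation of Y
                     textured with the reference T
    """
    if len(video_points) != len(rendered_points) or not video_points:
        raise ValueError("need two equally sized, non-empty landmark lists")
    distances = [math.dist(a, b) for a, b in zip(video_points, rendered_points)]
    return sum(distances) / len(distances)


# Two hypothetical landmarks: one displaced by 5 pixels, one matching exactly.
error = mean_projection_error([(0.0, 0.0), (1.0, 1.0)], [(3.0, 4.0), (1.0, 1.0)])
```

Evaluating this per frame yields an error curve over time of the kind shown in Fig. 13.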
In this paper, an image mosaicing scheme based on SLAM
has been proposed for dynamic view expansion. The method
is based on fading historic footage to grey scale such
that potentially critical issues arising from poor distinction
between current live video and out-of-date contextual infor-
mation are avoided. For dynamic view expansion, view-
dependent specular highlights are removed effectively using
texture synthesis. Preliminary results derived from the paper
in terms of quantitative validation of SLAM loop closing
within an in vivo environment and detailed user study demon-
strate the potential clinical value of the technique. The advan-
tage of dynamic view expansion is its ability to provide refer-
ence context of the surgical view, thus alleviating the problem
of disorientation during endoscopic interventions. For prac-
tical clinical applications, the method can potentially reduce
the cognitive burden of the clinician, enabling a safer and
more consistent procedure. For future work, integration of
dense surface reconstruction techniques will be investigated
to provide a better approximation of the tissue surface. This
will allow integration of higher-fidelity perspective clues
and potentially scene relighting to enhance the perception
of depth and occlusion by reintroducing specular highlights
that are co-aligned with the navigation pathways.
Conflict of interest None.
References
1. Behrens A, Bommes M, Stehle T, Gross S, Leonhardt S, Aach T (2011) Real-time image composition of bladder mosaics in fluorescence endoscopy. Comput Sci—Res Dev 26(1):51–64
2. Atasoy S, Noonan DP, Benhimane S, Navab N, Yang G-Z (2008) A global approach for automatic fibroscopic video mosaicing in minimally invasive diagnosis. In: Metaxas D, Axel L, Fichtinger G, Székely G (eds) Medical image computing and computer-assisted intervention—MICCAI 2008. Lecture notes in computer science, vol 5241/2008. Springer, Heidelberg, pp 850–857
3. Lerotic M, Chung AJ, Clark J, Valibeik S, Yang G-Z (2008) Dynamic view expansion for enhanced navigation in natural orifice transluminal endoscopic surgery. In: Metaxas D, Axel L, Fichtinger G, Székely G (eds) Medical image computing and computer-assisted intervention—MICCAI 2008. Lecture notes in computer science, vol 5242/2008. Springer, Heidelberg, pp 467–475
4. Mountney P, Yang G-Z (2009) Dynamic view expansion for minimally invasive surgery using simultaneous localization and mapping. In: 31st Annual international conference of the IEEE Engineering in Medicine and Biology Society, pp 1184–1187
5. Grasa OG, Civera J, Guemes A, Munoz V, Montiel J (2009) EKF monocular SLAM 3D modeling, measuring and augmented reality from endoscope image sequences. In: Proceedings of AMI-ARCS: 5th workshop on augmented environments for medical imaging including augmented reality in computer-aided surgery, pp 102–
6. Stoyanov D, Darzi A, Yang G-Z (2004) Dense 3D depth recovery for soft tissue deformation during robotically assisted laparoscopic surgery. In: Barillot C, Haynor DR, Hellier P (eds) Medical image computing and computer-assisted intervention—MICCAI 2004. Lecture notes in computer science, vol 3217/2004. Springer, Heidelberg, pp 41–48
7. Moll M, Tang H-W, Gool LV (2010) GPU-accelerated robotic intra-operative laparoscopic 3D reconstruction. In: Navab N, Jannin P (eds) Information processing in computer-assisted interventions. Lecture notes in computer science, vol 6135. Springer, Heidelberg, pp 91–101
8. Stoyanov D, Yang G-Z (2005) Removing specular reflection components for robotic assisted laparoscopic surgery. In: IEEE International conference on image processing 3
9. Arnold M, Ghosh A, Ameling S, Lacey G (2010) Automatic segmentation and inpainting of specular highlights for endoscopic imaging. EURASIP J Image Video Process 2010, pp 1–12
10. Norman JF, Todd JT, Orban GA (2004) Perception of three-dimensional shape from specular highlights, deformations of shading, and other types of visual information. Psychol Sci 15:565–570
11. Mountney P, Stoyanov D, Davison A, Yang G-Z (2006) Simultaneous stereoscope localization and soft-tissue mapping for minimal invasive surgery. In: Larsen R, Nielsen M, Sporring J (eds) Medical image computing and computer-assisted intervention—MICCAI 2006. Lecture notes in computer science, vol 4190/2006. Springer, Heidelberg, pp 347–354
12. Zhang P, Milios EE, Gu J (2005) Vision data registration for robot self-localization in 3D. In: IEEE/RSJ International conference on intelligent robots and systems (IROS), pp 2315–2320
13. Shi J, Tomasi C (1994) Good features to track. In: IEEE Computer society conference on computer vision and pattern recognition proceedings CVPR’94, pp 593–600
14. Segal M, Akeley K (2006) The OpenGL graphics system: a specification (version 2.1)
15. Floater MS (2003) Mean value coordinates. Comput Aided Geom Des 20(1):19–27
16. Gritz L, d’Eon E (2007) GPU Gems 3, ch. The importance of being linear. Addison-Wesley Professional
17. Pérez P, Gangnet M, Blake A (2003) Poisson image editing. ACM Trans Graph 22:313–318
18. Wei L-Y, Lefebvre S, Kwatra V, Turk G (2009) State of the art in example-based texture synthesis. In: Eurographics 2009, State of the Art Report, EG-STAR
19. Chiras D (2010) Human biology, ch. The visual sense: the eye. Jones & Bartlett Publishers Incorporated, p 220
20. Mountney P, Stoyanov D, Yang G-Z (2010) Three-dimensional tissue deformation recovery and tracking: introducing techniques based on laparoscopic or endoscopic images. IEEE Signal Process Mag 27:14–24
21. Clark J, Sodergren M, Noonan D, Darzi A, Yang G-Z (2009) The natural orifice simulated surgical environment (NOSsE™): Exploring the challenges of NOTES without the animal model. J Laparoendosc Adv Surg Tech 19:211–214
22. Chen MC, Anderson JR, Sohn MH (2001) What can a mouse cursor tell us more? Correlation of eye/mouse movements on web browsing. In: CHI’01 extended abstracts on human factors in computing systems, CHI EA ’01. ACM, New York, NY, pp 281–282