Multimodal Transportation 2 (2023) 100049
Full Length Article
Investigating pedestrian behaviour in urban environments: A
Wi-Fi tracking and machine learning approach
Avgousta Stanitsa *, Stephen H Hallett, Simon Jude
Cranfield Environment Centre, School of Water, Energy and Environment, Cranfield University, Cranfield, UK, Bedfordshire, MK43 0AL
Article info
Keywords:
Pedestrian movement
Machine-learning
Urban environment
Wi-Fi tracking
Human behaviour
Abstract
Urban geometry plays a critical role in determining paths for pedestrian flow in urban areas. To
improve the urban planning processes and to enhance quality of life for end-users in urban spaces,
a better understanding of the factors influencing pedestrian movement is required by decision-makers
within the urban design and planning industry. The aim of this study is to present a novel
means to assess pedestrian routing in urban environments. As a unique contribution to knowledge
and practice, this study: (a) enhances the body of knowledge by developing a conceptual model
to assess and classify pedestrian movement behaviours, utilising machine learning algorithms and
location data in conjunction with spatial attributes, and (b) extends previous research by revealing
spatial visibility as a driver for pedestrian movement in urban environments. The importance of
the findings lies in revealing novel insights concerning individual preferences
and behaviours of end-users and the utilisation of urban spaces. The approaches developed can be
utilised for observations in large-scale contexts, as an addition to traditional methods. Application
of the model in a retail area of London with dense pedestrian traffic reveals clear and
consistent relationships amongst spatial visibility, individuals' motivation, and knowledge of the
area. Key behaviours established in the study area are grouped into two activity categories: (i)
Utilitarian walking (with motivation - expert and novice striders) and (ii) Leisure walking (no
motivation - expert and novice strollers). The approach offers an insightful and automated means
to understand pedestrian flow in urban contexts and informs wider wayfinding, walkability, and
transportation knowledge.
Abbreviations: ML, Machine Learning; BDAs, Big Data Approaches; TFL, Transport for London; AI, Artificial Intelligence; UWB, Ultra-wideband
transceivers; CCTV, Closed-circuit television; MAC, Media Access Control; VGA, Visibility Graph Analysis; GIS, Geospatial Information Systems;
HPC, High-Performance Computer; GDPR, General Data Protection Regulation; NaN, Not a Number; EM, Elbow Method; SA, Silhouette Analysis;
WCSS, Within-Cluster Sum-of-Squares; CH, Calinski-Harabasz coefficient.
* Corresponding author.
E-mail addresses: avgousta.stanitsa@cranfield.ac.uk, augusta.stanitsa@outlook.com (A. Stanitsa), s.hallett@cranfield.ac.uk (S.H. Hallett),
s.jude@cranfield.ac.uk (S. Jude).
https://doi.org/10.1016/j.multra.2022.100049
Received 27 March 2022; Received in revised form 26 May 2022; Accepted 28 June 2022
2772-5863/© 2022 The Authors. Published by Elsevier Ltd on behalf of Southeast University. This is an open access article under the CC BY
license (http://creativecommons.org/licenses/by/4.0/)
Introduction
Open urban spaces represent an important asset within cities, providing opportunities for users to engage with their communities
and enhance their quality of life (Mouratidis, 2021). Nevertheless, urban growth and development have pressured public urban spaces
and, subsequently, their design. The planning and design of a city are influenced by several factors, with mobility being one of the most
influential (Mendiola and González, 2021). Movements within street networks and the act of walking help planners implement road
development, public transport, and placement of amenities, and create designs promoting physical activity, healthier communities,
and well-being (Järv et al., 2012; O'Sullivan et al., 2000; Brown et al., 2014). Although diversity in walking activities is generally
recognised in wayfinding and walkability studies (Lynch, 1960; Cornell et al., 2003; Bitgood, 2010), there is not yet a systematic way
to best categorise pedestrian behaviours, leaving a gap in knowledge. While there is a large body of research on the effect of the built
environment on walking patterns from various perspectives (Gehl, 2011; Zacharias, 2001; Clifton et al., 2007; Mehta, 2009), less is
known about how and to what extent it influences pedestrian behaviour.
Information is sought by urban planners and designers to aid an understanding of the impact of the built environment on pedestrian
behaviour. However, there remains a lack of objective data in the study of pedestrian movement patterns, and consequently,
there is a limited understanding of the role and value of implementing novel technologies in design. Increased urban data availability
has renewed the interest in urban mobility for better understanding pedestrian activity and the effects of diverse physical
factors on behaviour. Nevertheless, to date, studies exploring human experience in urban spaces are mainly based on traditional
data sources and analysis methods, such as observational data and statistical techniques (Hillier et al., 1993; Krizek et al., 2009).
Although traditional qualitative methodologies have several advantages, including data richness and validity, they present several
limitations in terms of controllability, data quality, representativeness, and associated costs (Feng et al., 2021). Such limitations include
the difficulty of recording crowd movements in public spaces via observational data and experiments containing bias (Feng et al.,
2021).
Existing studies have illustrated the necessity of both contemporary data collection and analysis methods, such as objective
walking patterns from large-scale monitoring and machine learning analysis techniques, to enable the study of new types of pedestrian
movement ( Feng et al., 2021 ; Lee, 2020 ). These methods hold the prospect for the collection of new types of pedestrian movement
data due to their several advantages, such as increased experimental control and lower implementation costs. They subsequently
help to overcome some of the limitations found in traditional approaches (e.g., surveys or observational data collection). Although
such approaches highlight the potential of data-driven methodologies in supporting more informed design decisions, it is yet unclear
in what way and how novel sources of information and approaches could be realised to provide insights on pedestrian movement
behaviours in urban spaces, leaving an additional gap.
These studies make substantial progress in translating large amounts of data and spatial information from cities into specific
pedestrian knowledge (e.g., Zhang et al., 2020; Karbovskii et al., 2019). However, most studies fail to investigate behaviour in large
contexts because they focus on small-scale data samples; on bivariate analysis, which is not multifaceted; on manually annotated
training data for supervised learning, which can prove costly; or on individual detection, which is not systematic (Koh et al., 2020).
As a result, evidence from unsupervised analysis of pedestrian movement data, such as relationships between pedestrians and other
urban objects, is lacking. Employment of large-scale monitoring via smartphones and sensor networks, machine learning and
evolutionary computation are becoming prominent in the domain of pedestrian mobility to complement traditional methods for
extracting and improving the semantics of human movement behaviour (Wirz et al., 2013).
The aim of this study is to present a novel means to assess pedestrian routing in urban environments. A conceptual model is
presented to classify behaviours and spatial configuration interactions, utilising machine learning (ML) algorithms and location
data derived from Wi-Fi tracking techniques. The proposed model offers several differences and advantages compared to
other methods in the existing literature. The conceptual model developed utilises large-scale data samples, while existing literature
in the study of urban space recognition and the role of sensorial experience on movement patterns is limited to small-scale data
samples. To overcome the limitation of lacking labelled data, unsupervised ML clustering is employed utilising data collected from
Wi-Fi tracking techniques combined with urban design attributes, and more specifically, spatial visibility as extracted from space
syntax methodologies. Thus, the results reveal novel insights concerning individual preferences and behaviours of end-users and the
utilisation of urban spaces, which in existing literature are achieved by the collection of qualitative data, such as on-site observations.
Finally, the proposed model provides a systematic way to assess pedestrian behaviour utilising novel sources of information
and approaches in large-scale contexts, covering the existing literature gaps.
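The unsupervised clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-device features (mean walking speed, dwell time, mean visibility of visited locations), the sample values, and the function names `min_max_scale` and `kmeans` are all assumptions made for illustration, and plain k-means stands in for whichever clustering algorithm the study applied. The within-cluster sum-of-squares (WCSS) returned by the function is the quantity inspected by the Elbow Method listed in the abbreviations.

```python
import random
from math import dist


def min_max_scale(rows):
    """Rescale every feature column to [0, 1] so no single unit dominates distances."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means. Returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data

    def nearest(p):
        return min(range(k), key=lambda c: dist(p, centroids[c]))

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated

    labels = [nearest(p) for p in points]
    wcss = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Hypothetical per-device features: (mean speed in m/s, dwell time in min, mean visibility 0-1).
devices = [
    (1.5, 4.0, 0.80), (1.4, 5.0, 0.75), (1.6, 3.5, 0.85),     # fast, short stay
    (0.6, 30.0, 0.40), (0.5, 35.0, 0.35), (0.7, 28.0, 0.45),  # slow, long stay
]
scaled = min_max_scale(devices)
labels, centroids, wcss = kmeans(scaled, k=2)
print(labels)

# Elbow-style inspection: WCSS for a few candidate values of k.
for k in (1, 2, 3):
    print(k, round(kmeans(scaled, k)[2], 3))
```

Scaling the features first matters here because dwell time in minutes would otherwise dominate the Euclidean distance; the choice of min-max scaling is itself an assumption of this sketch.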
As a unique contribution to knowledge and practice, this study: (a) enhances the body of knowledge by developing a conceptual
model to assess and classify pedestrian movement behaviours, utilising machine learning algorithms and location data in conjunction
with spatial attributes, and (b) extends previous research by revealing the spatial visibility aspect as a driver for pedestrian movement
in urban environments. The importance of the findings lies in revealing novel insights concerning individual
preferences and behaviours of end-users and the utilisation of urban spaces.
Pedestrian behaviour and spatial production
Several studies have been conducted in the field of built environment-pedestrian behaviour relationships from various disciplines
over the last three decades (Lynch, 1960; Dridi, 2015; Gehl, 2011; Gibson, 1988). Due to the complexity of pedestrian movement,
the approaches suggested to explain it in urban space and the focus of research vary among the fields. For example, in the fields
of health and urban design, the emphasis is placed on the qualities and attributes of urban design, treated with reference to the
immediate condition of individual streets. Such studies have documented relations among street-level design, pedestrian activity,
and environmental correlates of walking (Loukaitou-Sideris, 2020). Research in transportation and planning, though, has turned its
attention to urban form aspects of walkability (i.e., proximity and distance) and connectivity to reveal their relations with pedestrian
movement behaviour (Frank, 2000). A thorough literature review on pedestrian behaviour in these fields reveals several important
observations. The following outcomes, methods and technologies used in the research to achieve their objectives are presented
below:
Findings from observations
In 1970, sociologist William H. Whyte founded The Street Life Project, a small research group that observed many plazas and
small parks in New York City to determine the factors that explain how some city spaces respond well to people's needs while
others do not, and then documented what might be the basic elements of a successful small urban space (Whyte, 1980). Whyte
observed how people seek more than mere physiological comfort, and how pedestrians will consequently undergo a certain degree of
physical discomfort to satisfy psychological needs. Distractions and pleasures divert the pedestrian's attention away from intentions to
reduce distance and effort, even if reducing effort is related to carrying heavy goods (Al-Widyan et al., 2017; Gärling and Gärling,
1988).
Research to date has highlighted the significance of urban space recognition and the role of sensorial experience on movement patterns
(Whyte, 1980; Choi, 2012). However, the understanding of the way they are related to spatial behaviour remains limited, with the
greater sensory field receiving little attention (Zacharias, 2001). Examples investigating these aspects are recent studies around
streetscape features relating to comfort and pleasurability (Capitanio, 2019), emotions against the background of environmental
information (Resch et al., 2020) or familiar and unfamiliar spaces (Phillips et al., 2013). These studies highlight that pedestrian
behaviour differs based on diverse parameters relating to the physiological characteristics of the individuals. Emotional
responses are not only part of the individual or collective subjective experiences but constitute a motivational factor for behaviour
and choice.
Research investigating the significance of preference and emotional qualities to pedestrian walking patterns mainly utilises manual
gathering of observational and qualitative data. Whyte's work (Whyte, 1980) was seminal, being one of the first attempts to quantify
human activity in open spaces using data-driven approaches, via interviews, observations, and the use of a time lapse camera with
a digital clock overlooking the plazas to record daily patterns. Employment of traditional qualitative approaches for data collection
and analysis presents several advantages, such as data richness and validity (Feng et al., 2021). Field observations and surveys can be
conducted over long periods of time, collecting specific characteristics of pedestrians, such as sex, direction, personal items, clothing
information, psychological insights (preferences, motivations etc.) and others, resulting in rich and detailed information considering
fundamental concepts of influence of human behaviour. In addition, pedestrians are unaware of being tracked; hence,
they respond to urban settings in a more natural fashion. The analysis techniques of traditional data sources rely mainly on
statistical models, and designers and urban planners have traditionally been trained to undertake such tasks (statistics, survey
research and estimations) (French et al., 2015). Therefore, such approaches can ensure their practical application in academia, without
minimising their research potential.
However, several disadvantages around controllability, data quality, representativeness and associated costs of these methods
exist. These approaches remain to date time consuming and labour-intensive, often limiting the scope of research (Feng et al., 2021).
In addition, the accuracy of behavioural data relies on the setup or the techniques used, such as granularity of the data, mechanisms
of recording the data and distribution of their densities, or on respondents' internal characteristics, such as past experiences or personal
views, often resulting in unreliable datasets that are not suitable for detailed analyses. Existing studies have illustrated the necessity of
contemporary data collection and analysis methods, whilst highlighting a lack of novel techniques employed in the fields of sensory
and urban design, demonstrating the current limitations and disadvantages concerning the types of pedestrian behaviour that can be
studied with traditional approaches (Feng et al., 2021). For example, recording concurrent crowd movements in public spaces via
observational data is difficult, and information collected via experimental setups can be biased.
Pedestrian activities categorisation
Several researchers have attempted to categorise pedestrian activities and their influencing factors. Gehl (Gehl, 2011) simplified
outdoor activities in urban spaces to three categories: necessary, optional, and social. Gehl argued that when the quality of the urban
environment is good, optional activities increase in frequency. As those activities rise, the number of social activities also increases
(Gehl and Gemzoe, 1996). Other researchers have divided the types of activities into two categories, driven by the pedestrian's
motivations. For example, Ki and Lee (Ki and Lee, 2021) divided activities into utilitarian and leisure, whilst
others have followed similar categorisations, further explained in Table 1. The contribution of Table 1 to the existing body
of literature is to reveal the lack of systematic knowledge about how best to categorise pedestrian behaviours.
Table 1
Summary of walking activities categorisation in research.
Walking activities categories | Source | Title | Date | Method | Data source
--- | --- | --- | --- | --- | ---
Utilitarian walking; Leisure walking | Ki, D. and Lee, S. (Ki and Lee, 2021) | Analyzing the effects of Green View Index of neighborhood streets on walking time using Google Street View and deep learning | 2021 | Green View Index (GVI), semantic segmentation, deep neural network model, fully convolutional network | Google Street View (GSV) images
Walking for transport; Walking for recreation | Zhang X., Melbourne S., Sarkar C., Chiaradia A., Webster C. (Zhang et al., 2020) | Effects of green space on walking: Does size, shape, and density matter? | 2020 | Regression model and statistical analysis | Green spaces from UKMap; pedestrian data from London Travel Demand Survey 2009/2010
Essential trips for commuting; Optional trips for recreational activities | Lee, J.M. (Lee, 2020) | Exploring Walking Behavior in the Streets of New York City Using Hourly Pedestrian Count Data | 2020 | Placemeter utilising computer vision algorithms. Produced video feeds from streets and creates automated reports tracking the number of pedestrians. | Pedestrian count and relative speed data captured by video, weather data from Central Park weather station (KNYC) and other private weather stations in New York City, and sunlight and wind simulation results from massing models of the corresponding locations.
Stationary activities; Passer-by activities; Cultural and social activities | Istrate et al. (Istrate et al., 2020) | How Attractive for Walking Are the Main Streets of a Shrinking City | 2020 | Case study approach | Observational data
Optional; Necessary; Social (Resultant activities) | Jan Gehl & Birgitte Svarre (Gehl and Svarre, 2013) | How to Study Public Life | 2013 | Systematic study | Qualitative data, including observations, interviews, mapping existing infrastructure
Walking for exercise; Walking for sports/exercise; Walking for transport | Tudor-Locke, Bittman, Merom, D. (Tudor-Locke et al., 2005) | Patterns of walking for transport and exercise: a novel application of time use data | 2005 | Nesting 3-digit code classification and statistical analysis | Time use diaries
Goal-oriented; Non-goal-directed or exploratory | Gibson, E.J. (Gibson, 1988) | Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge | 1988 | Experiments | Observational data
Wayfinding research findings
The extensive literature on wayfinding typically supports the notion that the complexity of spatial design is connected to success
in reaching a destination (Dridi, 2015; Weisman, 1981; Gibson, 1988). Literature on wayfinding identifies how within complex built
environments there are two key types of journeys: (i) goal-oriented and (ii) non-goal-directed or exploratory (Gibson, 1988). The first
refers to the pedestrian's motivation in moving towards specific points within a space, e.g., residential buildings or transit terminals. The
second is stimulated by visually attractive objects encountered along the path to the goal, e.g., window displays or street performances.
According to Transport for London's (TFL) research, there are four different and distinct types of journeys, each with specific travel
characteristics, thus: Novice strider, Expert strider, Novice stroller and Expert stroller (Davies, 2007). Within this categorisation,
knowledge of the area is incorporated for all types of activities. These studies highlight that built environment design can either help
or hinder individual wayfinding in a variety of ways depending on environmental factors. Two of the key factors that affect pedestrian
movement as identified in previous literature are the spatial characteristics of a given setting, e.g., visibility, layout, or diversity, and
the wayfinding support system, such as signage and information boards (Weisman, 1981).
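The four TFL journey types can be read as the combination of two attributes: whether the journey is motivated by a goal (strider) or not (stroller), and whether the pedestrian knows the area (expert) or not (novice). The sketch below encodes that reading as a small lookup. The names `JourneyType`, `journey_type` and `activity_category` are hypothetical, and treating "expert" and "novice" as a simple yes/no on knowledge of the area is an assumption of this illustration, not a definition taken from the cited report.

```python
from enum import Enum


class JourneyType(Enum):
    EXPERT_STRIDER = "Expert strider"
    NOVICE_STRIDER = "Novice strider"
    EXPERT_STROLLER = "Expert stroller"
    NOVICE_STROLLER = "Novice stroller"


def journey_type(goal_oriented: bool, knows_area: bool) -> JourneyType:
    """Combine motivation and knowledge of the area into one of the four types."""
    if goal_oriented:
        return JourneyType.EXPERT_STRIDER if knows_area else JourneyType.NOVICE_STRIDER
    return JourneyType.EXPERT_STROLLER if knows_area else JourneyType.NOVICE_STROLLER


def activity_category(kind: JourneyType) -> str:
    """Striders map to utilitarian walking, strollers to leisure walking."""
    striders = (JourneyType.EXPERT_STRIDER, JourneyType.NOVICE_STRIDER)
    return "Utilitarian walking" if kind in striders else "Leisure walking"


commuter = journey_type(goal_oriented=True, knows_area=True)
tourist = journey_type(goal_oriented=False, knows_area=False)
print(commuter.value, "->", activity_category(commuter))
print(tourist.value, "->", activity_category(tourist))
```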
Researchers aiming to observe wayfinding behaviour have developed operational models of individual behaviour using a symbolic
artificial intelligence (AI) approach (Dridi, 2015). The goal in these models is to simulate human decision-making and problem-solving
processes. An AI model's detailed behaviour, though, is not explicitly described by its algorithm; rather, its behaviour is influenced by the
specific problem presented to it (Moore, 2017). Although these new types of approaches are employed in the study of pedestrian
movement, evolving prediction methodologies, they fall short as a comprehensive theory of environmental psychology (Aschwanden et al.,
2019; Van Dijk, 2018; Ye et al., 2019). Therefore, additional sources of information are an inevitable requirement of successful
modelling in the field of pedestrian walking behaviour (Angelelli et al., 2018).
Space syntax
The Social Logic of Space by Hillier and Hanson (Hillier and Hanson, 1984) investigated historic cities and discovered that their
organic development resulted in remarkably similar street patterns. They developed 'space syntax', a set of techniques for describing
and analysing spatial configurations in the context of human socioeconomics. Although the complex question of societal behaviour
and spatial production was developed in the fields of architecture and urban planning in the early 1980s, leading to the investigation of
various spatial models, such as space syntax research, one of its disadvantages is its limitation to measuring space in a static way (e.g.,
geometry or topology) (Hillier and Hanson, 1984; Till, 2007). The majority of these studies researching urban design and walkability rely
on the assumption that street configuration is the most important factor influencing pedestrian movement (Wang and Huang, 2019;
Hillier and Hanson, 1984). A wide variety of researchers utilise configuration analysis to better estimate pedestrian flow volumes and
route choices (Capitanio, 2019; Mansouri and Ujang, 2017; Özer and Kubat, 2015). This approach translates complex street networks
into behavioural principles of the individuals' preference for high street network legibility (Boumezoued et al., 2020). Nevertheless,
this approach lacks qualitative information, such as aesthetics of chosen routes, safety feelings, light conditions, and others, and impacts due
to increased time spent in the area or changes in route direction.
Such simulation approaches do not reflect well the theory of environmental psychology. Although they present advantages relating
to the high controllability of the study, such approaches require manual recording of information to verify simulation results, and
although there are opportunities to capture unbiased behavioural data, this can result in relatively small sample sizes that are not
representative of the population and that lack temporal information (Feng et al., 2021). In addition, movements of individuals
are flexible; hence they cannot be considered as continuous over space, as individuals are free to revisit places or
change their movement decisions continuously in time, adding complexity to the modelling exercise. Therefore, the outcomes of
these approaches are hypothetical and do not always represent reality, as they can be biased and lacking in accuracy.
Research using novel data sources and methodologies
The increasing availability of data sources offered opportunities to researchers to renew the concepts and methods currently used
in urban space design. A wide variety of new methodologies, often referred to as 'Big Data Approaches' (BDAs), have become apparent,
including ML, network analysis and visualisation techniques, used to better capture and analyse a range of complex urban space
problems (Aschwanden et al., 2019; Kontokosta et al., 2018; Díaz-Álvarez et al., 2018). BDAs have been employed to manage increasing
urban data complexity in cities, with ML techniques employed in transportation and environmental studies (Aschwanden et al.,
2019; Kontokosta et al., 2018; Díaz-Álvarez et al., 2018; Yue et al., 2022; Ali et al., 2021; Pollard et al., 2018). Unsupervised ML, and
more specifically, clustering algorithms have been used in retail to reveal customer behaviours, while other researchers have used
similar principles to cluster transit information based on temporal and spatial characteristics (Mauri, 2003; Chang and Chen, 2009;
Ma et al., 2013).
Introduction of new sources and types of data therefore promise opportunities to better understand end-user needs, capturing
'panoptic' data which is not easy to observe in the real world and addressing problems at both the city and neighbourhood scale,
whilst reducing traditional approach limitations, such as cost and scale implications from traditional data collection techniques. For
example, employment of large-scale monitoring via smartphones and sensor networks (Wirz et al., 2013) enables researchers to study
crowds in large settings. Key to this is the mobile phone, which has been used widely in urban planning and transportation sector
applications (Shi and Abdel-Aty, 2015; Moreira and Ferreira, 2017; Martín et al., 2019), phones now being an integral part of human
life. The device itself is transformed into a complex gadget that includes multimedia technologies that can reveal user preferences
in terms of commercialism, daily routines, and cultural choices (Lee, 2011). These devices also include Wi-Fi and Bluetooth radio
communication and connections of these to triangulated base station nodes can be used for precise geolocation.
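As an illustration of how connections to several fixed nodes can yield a position, the following sketch solves the idealised two-dimensional case in which a device's distance to three nodes at known coordinates is available. Strictly, this is trilateration (distance-based) and not triangulation (angle-based), and it is not a description of any provider's positioning method; the function name `trilaterate`, the node layout and the noise-free distances are assumptions. Subtracting the first circle equation from the other two turns the problem into a 2x2 linear system.

```python
from math import hypot


def trilaterate(nodes, distances):
    """Estimate (x, y) from three node positions and the distance to each node."""
    (x1, y1), (x2, y2), (x3, y3) = nodes
    d1, d2, d3 = distances

    # (x - xi)^2 + (y - yi)^2 = di^2; subtracting the first equation from the
    # second and third removes the quadratic terms and leaves A [x, y]^T = b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2

    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("nodes are collinear; the position is not uniquely determined")

    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Hypothetical node layout (metres) and a device placed at (3, 4).
nodes = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_position = (3.0, 4.0)
distances = [hypot(true_position[0] - nx, true_position[1] - ny) for nx, ny in nodes]
print(trilaterate(nodes, distances))
```

In practice, distances inferred from signal strength are noisy, so more than three nodes and a least-squares fit would normally be used; the exact solution above only shows the geometry.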
Various studies, from indoor environments to transportation hubs and mass events, have applied Wi-Fi tracking techniques, video
footage or traffic cameras to collect large-scale movements or to achieve real-time crowd monitoring (Duives et al., 2020; Peftitsi et al.,
2020). For example, Duives et al. (Duives et al., 2020) combined video systems and computer vision algorithms to study pedestrian
movement in mass events, while Li et al. (Li et al., 2020) used process imaging techniques to analyse pedestrian behaviour in zigzag
corridors in the context of safety. Sensor-based approaches, such as Wi-Fi tracking, present several advantages compared to other
data sources, such as closed-circuit television (CCTV) video, Bluetooth, ultra-wideband transceivers (UWB) and others. Wi-Fi
location tracking is promising compared to other solutions for the following reasons: it is cost-effective; it has reasonable
bandwidth, which leads to a high range resolution; it offers wide coverage, as Wi-Fi networks are extensively used in commercial,
public and private sectors, overcoming some of the limitations of other approaches (e.g., cameras record a small area, which is then
difficult to stitch together, and UWB can support only specific types of devices); and it has reasonable transmitted
power, which gives the Wi-Fi signal an advantage over short-range sensing technology such as UWB (Colone et al., 2011; Wang et al.,
2018). However, a major disadvantage of a Wi-Fi location system is its positioning accuracy, which depends on the experiment setup;
hence its suitability should be considered in relation to the research question.
The analysis of large amounts of data from large-scale wireless networks, though, can be challenging due to the inherent
characteristics of the wireless environment, such as user mobility, noise, and data redundancy (Koh et al., 2020; Medeiros et al., 2020). In
addition, although such techniques offer solutions to the manual collection of information compared to observational studies, there
are many restrictions related to installation permissions in the public domain and ethical considerations. Such techniques offer
opportunities for studying population patterns in larger contexts and, as pedestrians have limited awareness of being tracked, results
have a high degree of validity. However, the factors influencing pedestrian behaviour cannot be controlled, and the conditions under
which the data were recorded are not controllable by the researcher (Feng et al., 2021).
Study area
A high-street area in London, UK was selected as a case study site for the assessment of pedestrian walking patterns (Fig. 1).
More specifically, Oxford Circus in London is a road junction and one of the busiest pedestrian crossings in the city, connecting
two of the most prominent retail streets, Oxford Street and Regent Street, located in London's West End. Oxford Street is a key
transportation corridor, used extensively by taxis and cyclists and providing east-west routes for bus services and tube stations. The
Oxford Street district includes residential and retail areas, with commercial/office use. Regent Street is a major shopping street,
containing flagship retail stores. Regent Street is c. 1.3 km long, while Oxford Street is 1.9 km. Their intersection was transformed in
2011 from a segregated junction with barriers with limited overflow of pedestrians to an open diagonal crossing allowing pedestrians
to follow their desired route. This change reflects a shift in street design towards the concept of integration and space sharing as a way
of improving quality of environments, further enhanced by removal of street furniture, and with as many shared 'single' surfaces as
possible (Mercieca et al., 2011).
Fig. 1. Study area and the distribution of Wi-Fi nodes.
The study area location was chosen based on the availability of datasets, the mix of uses, and street networks connecting to wider
residential areas. This therefore comprises an urban context with significantly different conditions and potentials, while it presents
environments similar to those found in many other cities and urban areas (Carmona, 2015). The study area extent is defined by the
location of the Wi-Fi nodes and their coverage. This includes all the streets located within the coverage area of the nodes (blue line
in Fig. 1). This is further explained in the following sections.
Materials and methods
Data preparation
This study utilises multiple data sources, allowing a number of spatial attributes to be explored. These data include pedestrian
movement, urban geometry, and weather information data. All datasets selected, except the Wi-Fi location data, are publicly
available and are described in Table 2. Data information reflects the category of information captured in each dataset, while geometry
type and date describe the shape of the captured information and the timeframe in which it was captured.
Pedestrian movement data, obtained by Wi-Fi tracking, was provided by The Crown Estate (The Crown Estate 2021), with data
deriving from an earlier study aiming to inform a base year model used to simulate existing pedestrian flow conditions and the predicted
impact of changes in demand or spatial layouts (Angelelli et al., 2018). Accuracy of the data has been tested in that same study, based
on comparison of the Wi-Fi obtained data and CCTV image data. Data was collected by capturing signals from Wi-Fi devices across
19 Open Mesh nodes (OM2P-HS) attached to floodlights on building cornices (Fig. 1, node location). The nodes were installed at
a height of 3 to 5 m, to be clear from obstructions that could affect signal reception. Data was transmitted in real-time via the 3G/4G
mobile network for storage in a cloud-based server.
The mobile data used covers the period August to October 2017 (Fig. 2), in total 22 days, incorporating 3,240,361 unique mobile
users. The August period presents the biggest continuous data sample and is the only period including weekend data collection. Data
was pre-processed by the technology provider (Accuware Inc, 2017), eliminating all privacy-related information and providing outputs as
a multi-field .csv file per day, capturing unique Media Access Control (MAC) addresses, signal strengths, location in X, Y, Z coordinates
and a timestamp. MAC addresses are unique identifiers assigned to a network interface controller for use as a unique network address,
Table 2
Types of data collected for this study with source.
| Data information | Geometry type | Date | Source |
| --- | --- | --- | --- |
| Pedestrian data | Point | 2017 | Wi-Fi tracking |
| Important buildings & infrastructure | Point | 2018 | Ordnance Survey (https://osmaps.ordnancesurvey.co.uk/); OpenStreetMap (https://www.geofabrik.de/geofabrik/); POIS (https://osmaps.ordnancesurvey.co.uk/); Registered EPC non-domestic (www.gov.uk) |
| Park areas | Polygon | 2018 | Ordnance Survey (https://osmaps.ordnancesurvey.co.uk/); OpenStreetMap (https://www.geofabrik.de/geofabrik/); POIS (https://osmaps.ordnancesurvey.co.uk/); Registered EPC non-domestic (www.gov.uk) |
| Transportation access (bus stops/tube entrances) location | Point | 2018 | Ordnance Survey (https://osmaps.ordnancesurvey.co.uk/); OpenStreetMap (https://www.geofabrik.de/geofabrik/); POIS (https://osmaps.ordnancesurvey.co.uk/); Registered EPC non-domestic (www.gov.uk) |
| Street geometry | Polygon | 2018 | Ordnance Survey (https://osmaps.ordnancesurvey.co.uk/); OpenStreetMap (https://www.geofabrik.de/geofabrik/); POIS (https://osmaps.ordnancesurvey.co.uk/); Registered EPC non-domestic (www.gov.uk) |
| Amenities | Point | 2018 | Ordnance Survey (https://osmaps.ordnancesurvey.co.uk/); OpenStreetMap (https://www.geofabrik.de/geofabrik/); POIS (https://osmaps.ordnancesurvey.co.uk/); Registered EPC non-domestic (www.gov.uk) |
| Hourly temperature, humidity & weather events | Point (temporal) | 2017 | Weather Underground / private weather stations |
common in technologies such as Ethernet, Wi-Fi, or Bluetooth. Signal strength was used by the provider for triangulation
to derive location, indicating the nodes in greatest proximity to the recorded devices. Data were captured at intervals varying
from 1 to 60 seconds, dependent upon the type of handset device, manufacturer, or activity level. The Wi-Fi tracking system used in
this study recorded signals with an accuracy of ±3 m. The Wi-Fi nodes can capture the presence of a device at a distance of up to
100 m, while the signal strength from devices decreases exponentially with distance (Accuware Inc, 2017).
All the data handling, storage, processing, and presentation observe the data security and privacy requirements as specified in the
General Data Protection Regulation (GDPR) on handling personal data and the protection of privacy (European Parliament and
of the Council, 2016). Personal information was truncated via the system and converted to non-personal form, thus permitting
collection of information without consent (Fuxjaeger and Ruehrup, 22 February 2018).
Conceptual model
Location data were assessed using a data analysis conceptual model developed based on the data collected (Fig. 3). Analysis was
performed at a daily resolution utilising the following steps:
• Data pre-processing: Data pre-processing utilising bespoke algorithms and space syntax methodologies was used to extract valuable
information and to enrich existing datasets. Walking pedestrian characteristics, spatial visibility and weather information were
mapped against individual points recorded via the Wi-Fi tracking technique, described in detail in the following sections.
• Data preparation for K-means clustering analysis and model development.
• Data cleaning and normalisation using outlier removal (Interquartile Range Method) (Vinutha et al., 2018) and the Min-Max scaler
method (Han et al., 2012).
• Multiple factor analysis to reduce the dimensionality of the data and to select key variables for the model.
• Analysis & Results: Cluster analysis (unsupervised machine learning) to extract key behavioural patterns and identify classes of
homogeneous profiles.
Finally, based on the similarities returned by the clustering analysis, a data mapping exercise against these categories was undertaken
to investigate three key hypotheses: (i) recorded speed and purpose present a clear and consistent relationship, (ii) visibility
acts as a driver for movement, and (iii) knowledge of the area can be inferred from the number of unique recorded devices, representing repeated visits.
The level of experience within the area was estimated by identifying and counting the devices recorded on only one of the days
in the dataset, on the assumption that a single recorded visit implied non-regular use of the area.
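The single-visit count described above can be illustrated with a minimal sketch. It assumes each record has already been reduced to a (device ID, date) pair; the function name `count_single_visit_devices` and the sample records are hypothetical and are not taken from the study's code.

```python
from collections import defaultdict
from datetime import date


def count_single_visit_devices(records):
    """Return (single_day_devices, total_devices) for (device_id, day) pairs."""
    days_seen = defaultdict(set)
    for device_id, day in records:
        days_seen[device_id].add(day)  # repeated pings on one day count once
    single = sum(1 for days in days_seen.values() if len(days) == 1)
    return single, len(days_seen)


# Hypothetical records: device "a" appears on two days, "b" and "c" on one day each.
sample = [
    ("a", date(2017, 8, 1)),
    ("a", date(2017, 8, 2)),
    ("b", date(2017, 8, 1)),
    ("b", date(2017, 8, 1)),
    ("c", date(2017, 8, 3)),
]
single, total = count_single_visit_devices(sample)
```

Under the stated assumption, the ratio `single / total` would give the share of devices treated as non-regular users of the area.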
Data pre-processing
Following the conceptual model development, a bespoke algorithm written in Python was used to extract the variables calculated
on a point-by-point basis. The point-by-point method takes the difference between each pair of consecutive points (n and n + 1) and keeps the
absolute value, where n starts at the first recorded location of the device within the study area. The variables extracted are: (i) Duration in
seconds, (ii) Distance in metres, (iii) Speed in m/s, (iv) Bearing in degrees and (v) Day period (Table 3). To remove false recordings, a
threshold for outlier distances was set at 5,000 m, removing all points with such recorded distances. Bearing was calculated to provide
Fig. 2. Sample size: number of devices recorded for the 22 days (all data per day breakdown).
an indication of direction. The transformation to geographical degrees takes geographical north as 0/360°, so the Python code
calculates bearing as N = 0/360; E = 90; S = 180; W = 270. Day period classification was undertaken as follows, based on the
sun cycle of 1 August in London, UK, which was used for all study days for consistency:
• "First light": 05:24 to 08:40
• "Morning": 08:40 to 12:10
• "Lunchtime": 12:10 to 14:00
• "Afternoon": 14:00 to 20:47
• "Last light": 20:47 to 21:28
• "Nighttime": 21:28 to 05:24
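The point-by-point extraction and day-period classification described above could look roughly like the following sketch. The source does not state how distance was computed, so the haversine formula, the assumption that coordinates are latitude/longitude in degrees, the start-inclusive period boundaries, the rejection of non-positive durations, and all function names are assumptions made for illustration only.

```python
import math
from datetime import datetime, time

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (assumption of this sketch)

# Day periods from the list above; start inclusive, end exclusive (assumed).
PERIODS = [
    ("First light", time(5, 24), time(8, 40)),
    ("Morning", time(8, 40), time(12, 10)),
    ("Lunchtime", time(12, 10), time(14, 0)),
    ("Afternoon", time(14, 0), time(20, 47)),
    ("Last light", time(20, 47), time(21, 28)),
]


def day_period(t):
    """Classify a time of day; anything outside the listed spans is night."""
    for name, start, end in PERIODS:
        if start <= t < end:
            return name
    return "Nighttime"  # 21:28 to 05:24 wraps past midnight


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing clockwise from north (N=0, E=90, S=180, W=270)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360


def segment_variables(prev, curr, max_distance_m=5000.0):
    """Derive segment variables from two consecutive (timestamp, lat, lon) fixes.

    Returns None when the distance reaches the outlier threshold or when time
    does not advance (the latter rule is an assumption of this sketch).
    """
    (t0, lat0, lon0), (t1, lat1, lon1) = prev, curr
    duration = (t1 - t0).total_seconds()
    distance = haversine_m(lat0, lon0, lat1, lon1)
    if duration <= 0 or distance >= max_distance_m:
        return None
    return {
        "duration_segment": duration,
        "distance_segment": distance,
        "speed_segment": distance / duration,
        "bearing_segment": bearing_deg(lat0, lon0, lat1, lon1),
        "Period": day_period(t1.time()),
    }


# Hypothetical pair of fixes ten seconds apart, heading due north.
prev_fix = (datetime(2017, 8, 1, 13, 0, 0), 51.5100, -0.1400)
curr_fix = (datetime(2017, 8, 1, 13, 0, 10), 51.5101, -0.1400)
seg = segment_variables(prev_fix, curr_fix)
```

The dictionary keys mirror the variable names in Table 3; in this illustrative pair the device moves about 11 m in 10 s, which is a plausible walking speed.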
A key limitation with the use of such big datasets is that the processing of information required polynomial run time and
the researchers had to utilise the High-Performance Computer (HPC) facility of Cranfield University to overcome this issue
(Cranfield University, 16 August 2017). Sixteen CPU cores were needed, with a three-hour simulation time required per input .csv
file.
To better understand pedestrian movement in relation to spatial attributes, further information was acquired using the space
syntax methodology and, more specifically, the visibility graph analysis (VGA), serving as a constant for spatial visibility (Turner et al.,
Fig. 3. Overview of the proposed algorithm flow of the conceptual model (blue boxes represent inputs and the orange box represents the outcome).
Table 3
Variable set used.
| Variable name | Data type | Description | Year collected | Source |
| --- | --- | --- | --- | --- |
| ID | Categorical | MAC address | 2017 | Wi-Fi tracking |
| end_lon | float | X coordinate | 2017 | Wi-Fi tracking |
| end_lat | float | Y coordinate | 2017 | Wi-Fi tracking |
| Period | Categorical | Name of the assigned period | 2017 | Wi-Fi tracking |
| end_time | Categorical | Date and time | 2017 | Wi-Fi tracking |
| bearing_segment | float | Bearing in degrees | 2017 | Wi-Fi tracking |
| duration_segment | Integer | Time spent in seconds | 2017 | Wi-Fi tracking |
| distance_segment | float | Distance travelled in metres | 2017 | Wi-Fi tracking |
| speed_segment | float | Walking speed in m/s | 2017 | Wi-Fi tracking |
| rvalue_1 | float | Spatial visibility | 2018 (building polygons used for VGA simulation) | https://osmaps.ordnancesurvey.co.uk/, https://www.openstreetmap.org, https://www.geofabrik.de/geofabrik/ |
| Humidity_% | float | Humidity in percentages | 2017 | Weather Underground. Weather Station ID: ILONDON636 |
| Speed_mph | float | Wind speed in mph | 2017 | Weather Underground. Weather Station ID: ILONDON636 |
| Precip. Rate _in | float | Precipitation rate in inches | 2017 | Weather Underground. Weather Station ID: ILONDON636 |
| Solar_w/m2 | float | Solar radiation in W/m² | 2017 | Weather Underground. Weather Station ID: ILONDON636 |
| Temperature_C | float | Temperature in Celsius | 2017 | Weather Underground. Weather Station ID: ILONDON636 |
| hours | float | Hour of the day | 2017 | Wi-Fi tracking |
2001). For the purposes of this paper, space syntax theories were employed using VGA assessments via the open-source software
DepthmapX_net_035 (Varoudis, 2021). An area of 2 km diameter was adopted as the distance threshold, preventing result distortion
due to the small scale. Minimum distance thresholds reported in the literature range from a 300 m to a 1 km radius (Ahrné et al., 2009),
hence the 2 km diameter scale was chosen by the authors. Following the VGA, the "Extract Values to Points" function of a Geospatial
Information Systems (GIS) platform was used to map the visibility values against each point, serving as an additional
analysis parameter.
Finally, weather data were exported from a private weather station located in the area for each date. Weather information was
recorded every five minutes and the most appropriate fields were selected to reflect the microclimate conditions in the study area. The
parameters comprised humidity, wind speed, precipitation, solar radiation, and temperature. All weather information was mapped
against the location dataset, using the bisect method (array bisection algorithm) in Python. The full set of variable inputs compiled
is displayed in Table 3.
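The bisect-based weather mapping mentioned above might be sketched as follows, using Python's standard `bisect` module. The source does not say whether each location point was matched to the previous or the nearest weather reading; this sketch assumes the most recent reading at or before the point's timestamp. The function name `latest_reading_at` and the sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def latest_reading_at(weather_times, weather_rows, when):
    """Return the most recent weather row recorded at or before `when`.

    `weather_times` must be sorted ascending and aligned with `weather_rows`.
    Returns None if `when` precedes the first reading.
    """
    idx = bisect_right(weather_times, when) - 1
    return weather_rows[idx] if idx >= 0 else None


# Hypothetical five-minute readings at 13:00, 13:05 and 13:10.
start = datetime(2017, 8, 1, 13, 0)
weather_times = [start + timedelta(minutes=5 * i) for i in range(3)]
weather_rows = [
    {"Temperature_C": 21.0, "Humidity_%": 60.0},
    {"Temperature_C": 21.4, "Humidity_%": 58.0},
    {"Temperature_C": 21.9, "Humidity_%": 57.0},
]

# A location point recorded at 13:07 picks up the 13:05 reading.
matched = latest_reading_at(weather_times, weather_rows, datetime(2017, 8, 1, 13, 7))
```

Because bisection on a sorted list takes logarithmic time per lookup, this kind of matching stays cheap even when millions of location points are joined to a comparatively short weather series.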
Choosing ML clustering method: k-means
Clustering algorithms are classified into several types based on their partitioning, density, and model (Zhai et al., 2014). A
clustering algorithm divides a set of physical or abstract objects into groups of similar objects (Yuan and Yang, 2019). A partition-based
clustering algorithm was required in this study, as the goal was to exclusively segregate the input observations so that each point belongs
to one group only. The K-means algorithm has numerous advantages in comparison to other recognised methods, including simple
mathematical concepts, rapid convergence, better scaling to large datasets, efficient handling of high dimensional datasets and ease
of implementation (Li et al., 2017). Additionally, this method can be applied in a broad range of fields, and it can be easily adapted
to new examples.
Data preparation for K-means analysis and model development
Cluster analysis was utilised to reveal walking behaviours and to identify key groups in the case study area. ML pattern-mining
techniques are generally used to identify unknown patterns within normalised datasets (Abu-Bakar et al., 2021). Cluster analysis
labels observations (data points) within assigned groups, or "clusters", extracting key patterns and identifying classes of homogeneous
profiles. K-means algorithms partition data into clusters by minimising the within-cluster sum of squares (Yuan and Yang, 2019;
Wang et al., 2012) (Eq. 1).
$$ d = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - u_j)^2 \qquad (1) $$
where d is the sum of squared errors, k is the number of clusters, $x_i$ is observation i (of n observations in total), $C_j$ is the set of
observations assigned to cluster j and $u_j$ is the centroid formed for that cluster. The mean of the recorded data is constantly updated,
and each observation is placed within the cluster having the nearest centre until no more observations can be reassigned (Forgy, 1965).
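To make the assignment and update cycle behind Eq. 1 concrete, the following is a minimal pure-Python sketch of a Lloyd-style K-means loop. It is an illustration only and is not the study's implementation; the random initialisation from the data, the handling of empty clusters, the function names and the toy points are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centres drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squares, i.e. the quantity d in Eq. 1.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Toy data: two well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centroids, labels, sse = kmeans(points, k=2)
```

On the toy data the loop separates the two groups, and the returned `sse` is the objective value that the algorithm is trying to minimise.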
This method was chosen due to the nature of the input observations and the manner by which it exclusively segregates
clusters, so that each point belongs to one group only, each partition is represented by one cluster only and k ≤ n (Han et al., 2012;
Zhu et al., 2010). As traditional K-means also presents several limitations, improvements to the method and to data preparation have
been suggested by previous research to obtain better results when solving practical problems (Wagstaff et al., 2001; Huang, 1998;
Narayanan et al., 2016; Narayanan et al., 2019). To overcome such limitations, and building upon previous literature, the following
steps were performed to ensure data suitability for the unsupervised ML model (Celebi et al., 2013; Zhang and Leung, 2003;
Namratha Reddy and Supreethi, 2017), namely: (1) Input variables limited to numerical only, (2) Noise & outlier removal, (3) Data
normalisation, (4) Reduction of the number of variables, (5) Collinearity, (6) Determining the optimal number of clusters. Each is
discussed below.
1) Input variables limited to numerical only
K-means uses distance-based measurements to determine similarity between data points, therefore numerical variables are the
only input that can be processed. Additionally, removal of undefined (NaN) values was performed following the first .csv output
from the raw data analysis and the weather mapping values. NaN values cannot be treated as 0, because 0 is a meaningful value in
this type of analysis. Nevertheless, NaN values made up less than 10% of the data, an acceptable percentage when dealing with missing values
(Bennett, 2001). Listwise deletion (LD) was used to remove all the NaN values (Peng et al., 2006).
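Listwise deletion can be sketched in a few lines: any record with at least one NaN field is dropped in full, while genuine zeros are kept. The row layout, the function name `listwise_delete` and the sample values below are hypothetical.

```python
import math


def listwise_delete(rows):
    """Drop every row (dict of fields) that contains at least one NaN value."""
    return [
        row for row in rows
        if not any(isinstance(v, float) and math.isnan(v) for v in row.values())
    ]


# Hypothetical rows: the second has a missing weather value, the third a real zero.
rows = [
    {"speed_segment": 1.2, "Precip. Rate _in": 0.1},
    {"speed_segment": 1.4, "Precip. Rate _in": float("nan")},
    {"speed_segment": 0.9, "Precip. Rate _in": 0.0},
]
clean = listwise_delete(rows)
missing_share = 1 - len(clean) / len(rows)  # fraction of rows removed
```

The third row survives because a precipitation rate of 0.0 is a valid observation rather than a missing one, which is the distinction the paragraph above draws.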
2) Noise & outlier removal
K-means is sensitive to outliers and "noisy" data (Jin and Han, 2011). If data is not pre-processed to remove noise and outliers,
then K-means can return false results, driven by the strongest set of information. Outlier values were therefore removed using the
interquartile range method for each individual variable (Vinutha et al., 2018).
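A minimal sketch of per-variable interquartile range filtering follows. The source does not state the fence multiplier or the quartile method, so the conventional 1.5 × IQR fences and the default behaviour of Python's `statistics.quantiles` are assumptions, as are the function names and the sample speeds.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, columns, k=1.5):
    """Keep rows whose value lies inside the IQR fences for every listed column."""
    bounds = {c: iqr_bounds([r[c] for r in rows], k) for c in columns}
    return [
        r for r in rows
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)
    ]


# Hypothetical walking speeds in m/s with one implausible value.
speeds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 9.0]
rows = [{"speed_segment": s} for s in speeds]
filtered = remove_outliers(rows, ["speed_segment"])
```

The fences are computed separately for each column from the unfiltered data, which matches the idea of treating each variable individually.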
3) Data normalisation
For the ML algorithm to consider all attributes as equal, they must all have the same scale, hence the Min-Max Scaler method was
used, implemented in Python using the scikit-learn package (Han et al., 2012; Thorndike, 1953). This method was chosen as it
transforms each value in the columns proportionally, within the bounded intervals, achieving a linear transformation of the original
data (Abu-Bakar et al., 2021). The Min-Max Scaler method is considered ideal for revealing patterns by highlighting any peaks or
falls in a consistent manner (Abu-Bakar et al., 2021).
Each variable was then transformed by scaling to the range 0 to 1 (Eq. 2).
$$ z = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (2) $$
where z is the normalised value, $x_i$ is the original value, min(x) is the minimum value of the attribute and max(x) is the maximum
value of the attribute. This step ensured that different scales would not skew the results and that all variables would contribute equally to model fitting.
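Eq. 2 can be checked by hand with a small standard-library sketch. The study itself reports using the scikit-learn Min-Max Scaler; the function below is only an illustrative equivalent of that formula for a single column, and the treatment of a constant column is an assumption of this sketch.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly to the range 0 to 1 (Eq. 2)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so map everything to 0 (assumed).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical temperatures in Celsius.
scaled = min_max_scale([2.0, 4.0, 6.0])
```

For the sample column, the minimum maps to 0, the maximum to 1 and the midpoint to 0.5, so the relative spacing of the original values is preserved.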