Content uploaded by Raja Jurdak

All content in this area was uploaded by Raja Jurdak on Aug 04, 2015

Understanding Human Mobility

•Significance

–Urban planning

–Transport

–Disease spread

•Issues with current data sources

–Census: coarse-grained

–GPS / call data records / Wi-Fi: proprietary or private

•Geo-tagged tweets as a proxy

–500 Million users

–340 Million tweets/day

–Up to 10 m accuracy

Open Issues with Geo-Tagged Tweets as a Mobility Proxy

•Potential sampling bias

–Demographic

–Geographic

•Content limit on tweets

–140 characters/tweet

–Unknown effect of this limit on tweet locations

•Tweet location preference

–Unclear how it can affect observed movement patterns

Overview of this work

•Determine how representative Twitter-based mobility patterns are of population-level and individual-level movement

•Analyse a large dataset of 7,811,004 tweets from 156,607 Twitter users

•Compare the mobility patterns observed through Twitter with the patterns observed through other technologies, such as call data records

Displacement Distribution

The displacement distribution is the spatial dispersal kernel P(d), where d is the distance between two consecutive reported locations of a user.
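
As a rough illustration of how the displacements d could be obtained from geo-tags, the sketch below computes great-circle distances between consecutive (latitude, longitude) points of one user. The haversine formula, the Earth radius constant and the function names are assumptions of this sketch, not details taken from the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def displacements(trajectory):
    """Distances d between each pair of consecutive locations of one user."""
    return [haversine_km(a, b) for a, b in zip(trajectory, trajectory[1:])]
```

Pooling the values returned by `displacements` over all users and building a histogram on logarithmic bins would give an empirical estimate of P(d).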

Previously observed distributions:

•Power-law (banknotes)

•Truncated power-law (mobile phones & travel surveys)

•Exponential and log-normal (GPS from cars/taxis)

Mixed exponential-stretched exponential fitting

•May stem from multiplicative processes, i.e. the displacement d is determined by the product of k random variables

•These random variables can be transportation cost, lifestyle aspects such as the preferred commute distance, or socio-economic factors such as personal income

•The number of these variables k, namely the number of levels in the multiplicative cascade, is indicated by the exponent β of the stretched-exponential term in the fit

•When k is small, P(d) converges asymptotically to a stretched exponential, while k → +∞ leads to the classic log-normal distribution. In particular, if these random variables are Gaussian distributed, k ≈ 2/β, so k is around 3 or 4 for our data (β ≈ 0.55 gives k ≈ 3.6)
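
The arithmetic behind the last bullet can be checked with a short sketch. It evaluates k ≈ 2/β for the reported β ≈ 0.55 and defines an unnormalised stretched-exponential shape. The function names and the rate parameter `lam` are hypothetical, and the functional form exp(-lam · d^β) is the generic stretched exponential, assumed here rather than copied from the slides.

```python
import math


def cascade_levels(beta):
    """Approximate number of cascade levels, k ≈ 2/β (Gaussian-variable case)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 2.0 / beta


def stretched_exponential(d, lam, beta):
    """Unnormalised stretched-exponential shape exp(-lam * d**beta)."""
    return math.exp(-lam * d ** beta)


k = cascade_levels(0.55)
print(round(k, 2))  # 3.64
```

A value of about 3.6 is consistent with the statement that k is around 3 or 4.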

What could be different here?

•2 separate power laws

•Differences between short and long distance travel patterns

Another fitting possibility

Radius of Gyration

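•Quantifies the spatial stretch of an individual trajectory or the traveling scale of an individual:

$r_g = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mathbf{r}_i - \mathbf{r}_{cm})^2}$

where $\mathbf{r}_i$ is the individual’s i-th location, $\mathbf{r}_{cm} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{r}_i$ is the geometric center of the trajectory and n is the number of locations in the trajectory.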
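
A minimal sketch of the computation, assuming the standard definition of the radius of gyration as the root-mean-square distance of the n locations from their geometric center, and assuming locations already projected to planar (x, y) coordinates in a common unit such as km. The function name is illustrative.

```python
import math


def radius_of_gyration(points):
    """r_g = sqrt((1/n) * sum |r_i - r_cm|^2) for planar (x, y) points."""
    n = len(points)
    if n == 0:
        raise ValueError("trajectory must contain at least one location")
    # Geometric center of the trajectory.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Mean squared distance of the locations from that center.
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)
```

For example, a user who alternates between two places 2 km apart has a center halfway between them, so `radius_of_gyration([(0, 0), (2, 0)])` is 1 km.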

Rg Distribution

[Figure: four log-log panels. Horizontal axes span 10^1 to 10^7; vertical axes span 10^-9 to 10^0 in panels a and b, and 10^-7 to 10^0 in panels c and d.]

Panel a, P(d) versus d. Legend: data; $y = y_1 + y_2$; $y_1 \sim e^{-0.073x}$; $y_2 \sim x^{-0.45} e^{-0.011x^{0.55}}$; $y_3 \sim x^{-1.32}$

Panel b, P(d) versus d. Legend: data; $y_1 \sim x^{-0.77}$; $y_2 \sim x^{-2.07}$

Panel c, P(r_g) versus r_g. Legend: data; $y = y_1 + y_2$; $y_1 \sim e^{-0.12x}$; $y_2 \sim x^{-0.23} e^{-0.0015x^{0.77}}$; $y_3 \sim x^{-1.11}$

Panel d, P(r_g) versus r_g. Legend: data; $y_1 \sim x^{-0.40}$; $y_2 \sim x^{-1.60}$

First passage time

•Fpt(t): the probability of finding a user at the same location after a period t

•Similar to trends in call data records (CDR)

•Periodic behavior with daily cycles

•Content limit does not appear to affect Fpt
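
One possible way to estimate such a return probability for a single user is sketched below. It takes the user's first observed location as the reference and records the time lags at which the user is seen there again. The choice of reference location, the hourly binning and the function name are assumptions of this sketch; the slides do not specify the estimator.

```python
from collections import Counter


def return_probability(observations, bin_hours=1):
    """Empirical distribution of lags at which a user is seen again at the
    location where they were first observed.

    observations: list of (time_in_hours, location_id), sorted by time.
    Returns {lag_bin: probability}; the probabilities sum to 1 when at
    least one return is observed, otherwise the result is empty.
    """
    if not observations:
        return {}
    t0, origin = observations[0]
    lags = Counter(
        int((t - t0) // bin_hours)
        for t, loc in observations[1:]
        if loc == origin
    )
    total = sum(lags.values())
    return {lag: count / total for lag, count in sorted(lags.items())}
```

Averaging these per-user distributions over the population would be expected to show peaks at multiples of 24 hours if the daily cycles mentioned above are present.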

Preferential return to visited locations

•P(L): the probability of finding an individual at his/her L-th most visited location

•Sort visited locations and perform spatial clustering (250 m)

•Generally follows a Zipf law of preferential return

•People are 50% likely to tweet from their most visited location, higher than in other data sources
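
The rank-frequency function P(L) can be sketched as follows for one user. The input is assumed to be a sequence of location labels that have already been spatially clustered (for example at 250 m); the clustering step itself is not shown, and the function name is illustrative.

```python
from collections import Counter


def rank_frequency(location_ids):
    """P(L): share of a user's visits made to their L-th most visited location.

    Index 0 of the returned list corresponds to L = 1.
    """
    counts = Counter(location_ids)
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    return [c / total for c in ranked]
```

For a user with visits `["home", "home", "work", "home", "cafe", "work"]`, the first entry is 0.5, meaning half of the tweets come from the most visited location. Under a Zipf law the entries would decay roughly as a power of L.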

Predictability of Tweet Locations

•Study the randomness (entropy) and predictability of the sequence of tweeting locations for each user

•Bi-modal distribution for users with >20 locations suggests 2 types of users

[Figure panels: Entropy; Predictability]
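
Two simple entropy measures of a location sequence are sketched below: the random entropy, which depends only on the number of distinct locations, and the Shannon entropy of visit frequencies. Both ignore the order of visits, so they are only a simplified stand-in for whatever entropy estimator the study used, which the slides do not specify. Function names are illustrative.

```python
import math
from collections import Counter


def random_entropy(locations):
    """log2 of the number of distinct locations (all assumed equally likely)."""
    if not locations:
        raise ValueError("need at least one location")
    return math.log2(len(set(locations)))


def uncorrelated_entropy(locations):
    """Shannon entropy (bits) of the visit-frequency distribution."""
    if not locations:
        raise ValueError("need at least one location")
    counts = Counter(locations)
    n = len(locations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A user who always tweets from one place has zero entropy under both measures, while a user who splits tweets evenly between two places has 1 bit. Lower entropy corresponds to a more predictable user.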

Probabilistic Modeling of Human Movement

[Figure panels: Rg 1-10 km; Rg 10-100 km; Rg 100-500 km; Rg 500-1000 km]

Isotropy of Motion Patterns

•Isotropy ratio σ = δy/δx characterises the orbit of each rg group, where δy is the standard deviation of P(x, y) along the y-axis and δx is the standard deviation of P(x, y) along the x-axis

•Second peak at ~1000 km differentiates these results from previous studies
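
The isotropy ratio can be illustrated with a short sketch that takes the pooled (x, y) positions of one rg group and returns σ = δy/δx. It assumes the positions are already expressed in a common reference frame and uses population standard deviations. The function name is illustrative.

```python
import statistics


def isotropy_ratio(points):
    """sigma = (std of y) / (std of x) for a collection of (x, y) positions."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    std_x = statistics.pstdev(xs)
    std_y = statistics.pstdev(ys)
    if std_x == 0:
        raise ValueError("no spread along the x-axis; ratio is undefined")
    return std_y / std_x
```

A ratio near 1 indicates movement spread equally in both directions, while a ratio well below 1 indicates an elongated orbit stretched along the x-axis.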

Preferential return decreases with larger orbits

•Likelihood to tweet from home location drops with increasing rg

•Exponent α also decreases with increasing rg

[Figure: four map panels (a-d), each labelled with Melbourne, Sydney and Brisbane]

Twitter-based Mobility Patterns

[Figure panels: Rg 1-10 km; Rg 10-100 km; Rg 100-500 km; Rg 500-1000 km]

Discussion

•Three observed modes of mobility

–Intrasite

–Metropolitan

–Intercity

•Two apparent groups of tweeters

–Highly predictable group where geo-tags are not highly useful for mobility prediction

–Less predictable group where geo-tagged tweets can be representative of movement patterns

Discussion (continued)

•Long distance movers are more diffusive in their movement than intermediate distance movers, most likely reflecting a switch in transportation mode towards air travel and local circulation around destination cities

•Preferential return depends strongly on a person’s orbit of movement, with long distance movers less likely to return to previously visited locations

•Population-level mobility patterns are well represented by geo-tagged tweets, while individual-level patterns are more sensitive to contextual factors

Implications and Future Work

•Develop an agent-based model for disease spread based on rg group features

•Use tweets to better understand drivers for movement

•Use location inference algorithms on tweet content to enlarge the data sample, at the cost of higher uncertainty per location