
Content-Aware Detection of Temporal Metadata Manipulation
Supplementary Material
Rafael Padilha, Tawfiq Salem, Scott Workman, Fernanda A. Andaló, Anderson Rocha, and Nathan Jacobs
We highlight aspects of our method with qualitative examples that show the learned temporal patterns, and we further explore its robustness to changes in scene appearance, location coordinates, and degree of tampering, among other factors.
I. SENSITIVITY ANALYSIS: SCENE APPEARANCE
We present in Figure 10 an additional example of how the confidence of our method varies for a fixed alleged timestamp as it is presented with a scene captured at different moments of the day and year (similar to Figure 4 in Section IV-D of the main manuscript). In Figure 10(a), our network analyzes a scene captured throughout the day in January, whereas in Figure 10(b), the same scene is analyzed at 6AM across different months. As in the main experiments, the network correctly captures temporal patterns such as the day-night cycle and seasonal changes.
Even though the method identified with high confidence the cases in which the alleged timestamp matched the time-of-capture, it wrongly assigned high probabilities to neighboring timestamps, especially in the 9AM and 6PM curves of Figure 10(a). We hypothesize that this happens due to the combination of clouds and city lights, which gives the night sky a yellowish aspect similar to daylight hours.
Additionally, we extend the sensitivity analysis by aggregating the results of 10 random AMOS cameras, allowing us to evaluate the method with considerably more images, which decreases the influence of possible location-specific noise. In this experiment, we fix the location coordinates, satellite imagery, and an alleged timestamp for each camera, feeding the model with ground-level images with different ground-truth timestamps. As each camera might be in a different time zone, we aggregate the results considering the local time of each image. We present in Figure 11 the consistency probabilities yielded by our approach when prompted with photos from different moments, considering three alleged hours (8AM, Noon, and 8PM) and three alleged months (January, July, and November). Similar to our evaluation in the main manuscript, images whose ground-truth timestamp matches the alleged hour/month are considered consistent with high probability. As the alleged timestamp changes, the method shifts accordingly, decreasing the consistency probability.
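For concreteness, the per-hour aggregation can be sketched as below. This is a minimal illustration, not our exact evaluation code: it assumes a hypothetical `model` callable that returns a consistency probability, and that each image comes paired with its precomputed local hour of capture.

import numpy as np
from collections import defaultdict

def aggregate_by_local_hour(model, camera_images, alleged_timestamp,
                            location, satellite_image):
    """Average consistency probability per local hour of capture.

    `camera_images` is assumed to be a list of (image, local_hour) pairs
    whose ground-truth timestamps differ, while the alleged timestamp,
    location, and satellite image stay fixed (hypothetical interface).
    """
    probs = defaultdict(list)
    for image, local_hour in camera_images:
        p = model(image, alleged_timestamp, location, satellite_image)
        probs[local_hour].append(p)
    # One averaged curve point per local hour (0-23).
    return {hour: float(np.mean(v)) for hour, v in sorted(probs.items())}

Repeating this for each alleged timestamp yields one curve per alleged hour/month, as plotted in Figure 11.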
II. SENSITIVITY ANALYSIS: GEOGRAPHIC LOCATION
We extend the evaluation made in Section IV-F, considering additional degrees of location noise and discriminating the results per subset of the Cross-View Time (CVT) dataset. We present the accuracy per class (Acc_consistent and Acc_tampered) for our approach in Table III and for its version trained with Location Augmentation in Table IV.
(a) Same location recorded in January at different hours of the day.
(b) Same location recorded at 6AM in different months of the year.
Fig. 10. Consistency probability for a scene recorded at varied moments in time. Each curve represents a fixed alleged timestamp being verified against ground-level images captured at different (a) hours of the day in January and (b) months of the year at 6AM.
As discussed in the main manuscript, there is a drop in
verification performance when we assume pristine geographi-
cal information at training time, but evaluate the model under
a noisy scenario. This decrease in accuracy reflects how the
CVT dataset is organized in the shared-camera protocol and
how the noisy locations influence the model.
The performance on AMOS remains unaffected when we consider small degrees of location noise (below 15°). With shared cameras between training and testing, the model can partially recognize where a scene was captured given an input image, without entirely resorting to its coordinates. However, this is not true for the Flickr subset, whose accuracy drops considerably under even the slightest input noise. This highlights the sensitivity of the model when we assume correct locations but perform inference in an unreliable scenario. In the Flickr subset, each location/device might have only a single picture; thus, perturbations in the coordinates are seen by the model as manipulations, hindering the performance on consistent samples, while the model still detects tampered data successfully.
(a) Hour evaluation. (b) Month evaluation.
Fig. 11. Sensitivity analysis of our method to changes in the appearance of the scene, considering different alleged (a) hours (8AM, Noon, and 8PM) and (b) months (January, July, and November). We evaluate how the output consistency probability changes as the model is fed with ground-level images taken at different timestamps.
TABLE III
Verification performance of the best ablation model (DenseNet G, t, l, S (TA)) for several degrees of location noise. Longitudinal movements of 15° are equivalent to one-hour shifts. The results are discriminated between each subset of CVT (AMOS and Flickr) and each class (Acc_consistent and Acc_tampered).

Noise          AMOS                      Flickr                    CVT (AMOS + Flickr)
               Acc   Acc_cons Acc_tamp   Acc   Acc_cons Acc_tamp   Acc   Acc_cons Acc_tamp
Without noise  92.7  96.8     88.5       76.9  78.7     75.1       81.1  83.2     78.9
0.1°           92.8  96.8     88.7       58.5  41.9     75.1       65.2  52.6     77.9
1°             92.8  96.8     88.7       58.5  41.9     75.1       66.3  53.9     78.6
5°             92.7  96.8     88.6       58.4  41.9     75.0       66.3  53.9     78.8
15°            91.7  95.6     87.9       58.1  42.4     73.8       65.9  53.7     78.0
30°            88.7  89.4     87.9       57.5  41.2     73.8       64.3  53.5     75.0
45°            83.0  76.6     89.4       56.8  38.2     75.3       62.2  46.2     78.1
60°            78.4  66.6     90.1       56.6  36.8     76.4       61.1  43.0     79.2
75°            76.6  62.9     90.3       56.3  35.3     77.3       60.5  40.9     80.0
TABLE IV
Verification performance of the best ablation model (DenseNet G, t, l, S (TA)) trained with Location Augmentation for several degrees of location noise. Longitudinal movements of 15° are equivalent to one-hour shifts. The results are discriminated between each subset of CVT (AMOS and Flickr) and each class (Acc_consistent and Acc_tampered).

Noise          AMOS                      Flickr                    CVT (AMOS + Flickr)
               Acc   Acc_cons Acc_tamp   Acc   Acc_cons Acc_tamp   Acc   Acc_cons Acc_tamp
Without noise  93.3  96.1     90.5       71.6  66.8     76.4       75.9  72.6     79.3
0.1°           93.3  96.1     90.5       71.6  66.8     76.4       75.9  72.6     79.3
1°             93.3  96.1     90.5       71.6  66.8     76.4       75.9  72.6     79.3
5°             93.3  96.1     90.5       71.7  66.8     76.5       76.0  72.7     79.3
15°            93.3  96.3     90.3       71.5  66.4     76.6       75.8  72.3     79.3
30°            92.9  96.2     89.6       70.4  64.3     76.4       74.8  70.6     79.1
45°            92.4  95.8     88.9       68.9  63.1     74.7       73.5  69.4     77.7
60°            91.8  95.0     88.5       67.7  62.9     72.5       72.5  69.1     75.9
75°            91.0  94.2     87.8       66.9  63.3     70.4       71.7  69.3     74.1
The performance degradation seen on the Flickr subset is a sign that the model is over-reliant on the paired ground-level picture and location coordinates. This might be reasonable when we consider a shared-camera scenario, in which images from the same device are present during training and inference of the model.
For all subsets, the performance is stable when considering perturbations smaller than 15°. As pointed out in the manuscript, a longitudinal movement of 15° represents a one-hour shift, which is the smallest manipulation captured by our approach. On the other hand, when the noisy coordinates deviate considerably from the ground-truth location, the model interprets these differences as signs of manipulation. Consequently, it assigns real samples as inconsistent, reducing the accuracy in this class (Acc_consistent). Additionally, the accuracy in the tampered class (Acc_tampered) increases, as more samples are considered manipulations.
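As a quick sanity check of the one-hour equivalence (our own arithmetic, following the 15°-per-hour relation above): Earth completes a 360° rotation in 24 hours, so a longitudinal offset maps to a solar-time shift of

\Delta t = \frac{\Delta \mathrm{lon}}{360^{\circ}} \times 24\,\mathrm{h}, \qquad \Delta t(15^{\circ}) = \frac{15}{360} \times 24 = 1\,\mathrm{h},

which is why perturbations below 15° stay under the smallest temporal granularity the verifier is trained to flag.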
When we include noisy locations during training (Table IV), the model learns to account for possible errors in the coordinates and not to over-rely on the coupled ground-level image and geographic information. We notice a considerable accuracy improvement in the Flickr subset, with a slight increase in the AMOS subset. More importantly, by augmenting the location, the model becomes more robust, and its verification performance is not hindered even when presented with higher degrees of noise (above 30°). Additionally, the previous behavior of assigning consistent samples with noisy locations as manipulations does not occur, as the model learns that geographic errors might not be signs of tampering.
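A minimal sketch of such a coordinate augmentation is shown below. The uniform distribution and the noise range are placeholders for illustration; the exact augmentation used during training is described in the main manuscript.

import random

def augment_location(lat, lon, max_offset_deg=1.0):
    """Jitter ground-truth coordinates during training (illustrative only).

    Adding noise to the coordinates discourages the model from
    over-relying on the exact pairing of ground-level image and location.
    """
    noisy_lat = lat + random.uniform(-max_offset_deg, max_offset_deg)
    noisy_lon = lon + random.uniform(-max_offset_deg, max_offset_deg)
    # Clamp latitude and wrap longitude to valid ranges.
    noisy_lat = max(-90.0, min(90.0, noisy_lat))
    noisy_lon = ((noisy_lon + 180.0) % 360.0) - 180.0
    return noisy_lat, noisy_lon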
III. SATELLITE IMAGERY QUALITATIVE EXPLORATION
We perform the following qualitative experiment to assess what information is drawn from the satellite imagery. Considering a few testing samples, we estimated the sets â_S for the ground-truth timestamp, location coordinates, and their original satellite images. Then, we estimated a new set of attributes a_S for each aerial picture in the test set with the fixed ground-truth timestamp and location coordinates. We ranked all satellite pictures by the similarity of a_S to â_S (using Euclidean distance). Our rationale is that images that share characteristics with the ground-truth satellite or ground-level images will yield more similar sets of attributes. We present in Figure 12 the top and bottom six images for a few testing examples.
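This ranking reduces to a nearest-neighbor search in attribute space. A minimal sketch, assuming each attribute set is a 40-dimensional NumPy vector already produced by the attribute-estimation branch (the `candidate_attrs` mapping is a hypothetical interface):

import numpy as np

def rank_satellite_images(a_hat_s, candidate_attrs):
    """Rank candidate aerial images by Euclidean distance to â_S.

    `candidate_attrs` maps an image id to its estimated attribute
    vector a_S (40-dim), computed with the ground-truth timestamp
    and coordinates held fixed.
    """
    dists = {img_id: float(np.linalg.norm(a_s - a_hat_s))
             for img_id, a_s in candidate_attrs.items()}
    ranked = sorted(dists, key=dists.get)
    return ranked[:6], ranked[-6:]  # top six and bottom six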
Figures 12(a) and 12(b) depict areas with large open spaces, very few buildings, and some amount of vegetation, while 12(c) shows a scene by the sea and 12(d) depicts an urban area with some tree-covered hills far back in the scene. As previously stated, the most similar aerial pictures capture general characteristics of the surroundings of the scene, while bottom-ranked aerial shots depict regions with characteristics that contrast with those present in the ground-truth pictures.
However, when analyzing the top-ranking images, we notice that some are considerably different from the original satellite picture. This can be seen in Figure 12(d), for example, in which the ground-truth satellite frame shows an open space with vegetation, while the top-ranked footage depicts urban areas with multiple buildings. This happens because the model is optimized with respect to the transient attributes. As the ground-truth set â is extracted solely from the ground-level image and used as a supervisory signal during the optimization of a_G and a_S, the model learns to estimate how the appearance of the scene might look for a particular timestamp given its coordinates and aerial footage. In this sense, when we rank by similarity of the attributes, we are retrieving satellite images that yield a set of attributes closely related to the ground-level image, and not necessarily images that are similar in appearance to the original aerial picture. Even though we expect both sets of images to intersect to some degree, this explains some of the inconsistencies seen in Figure 12.
In 12(d), the ground-level picture shows a crossing in the city of Oberkirch (Switzerland), with buildings covering most of the photograph. As a result, the retrieved aerial images depict several dense urban regions, while the bottom-ranked ones show zones covered by vegetation and low-density urban areas. Likewise, the top-ranked images in Figure 12(c) depict open spaces with low-density vegetation and bodies of water.
IV. TIME ESTIMATION: A QUALITATIVE EXPLORATION
As an additional example of the analysis made in Section IV-G, we employed our method to infer, for three scenes from our test set, the heatmap of consistent time-of-capture, assuming a scenario in which their timestamps are not available.
We present in Figure 13 the selected examples, along with their satellite images and heatmaps of consistency probability over local hour and month of capture. We also present the heatmap generated by our model fine-tuned with subtler timestamp manipulations, as well as by the method from Salem et al. [9]. Our method correctly predicted consistent pairs of hour and month close to the ground-truth timestamp, while producing fewer false-consistent answers in comparison to [9].
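Constructing such a heatmap amounts to an exhaustive sweep over candidate timestamps. The loop below is an illustrative sketch, assuming a hypothetical `model` callable that scores an (image, timestamp, location, satellite) tuple; the fixed year and the mid-month day are placeholder choices.

import numpy as np

def consistency_heatmap(model, image, location, satellite_image, year=2016):
    """Score every (month, hour) pair as a candidate time-of-capture."""
    heatmap = np.zeros((12, 24))
    for month in range(12):
        for hour in range(24):
            candidate = (year, month + 1, 15, hour)  # placeholder day 15
            heatmap[month, hour] = model(image, candidate,
                                         location, satellite_image)
    return heatmap  # rows: months, columns: local hours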
Additionally, we extend in Figure 14 the time estimation evaluation by aggregating results for the whole test set. We built the consistency heatmap for each sample using models trained with random and subtle manipulations, normalizing all samples by placing the ground-truth month and hour of capture at the center position of the heatmap. Similar to the previous examples, the model trained with subtle manipulations outputs high-consistency regions closer to the ground-truth moment-of-capture than the random-tampering model. Besides that, as seen in the sensitivity analysis of timestamp manipulation (Section IV-E), jointly tampering with the hour and month of a picture increases the chance that our method detects that a manipulation has occurred.
Fig. 12. Examples (a)-(d) from the test set (leftmost column), along with satellite images ranked by the similarity between the estimated transient attributes. The top row (green) depicts aerial pictures that yield very similar sets of attributes with respect to the test sample, while those that yield dissimilar attributes are depicted in the bottom row (red).
Fig. 13. Examples of ground-level picture, satellite image, and consistent time-of-capture distribution over all possible pairs of month and hour of capture, for our proposed method (trained with random and close-to-ground-truth timestamp tampering) and for Salem et al. [9]. Yellow areas represent high consistency, and blue indicates inconsistent moments in time. The red dot marks the ground-truth moment-of-capture.
Fig. 14. Consistency heatmaps for test samples, centered by the ground-truth month and hour of capture. The model trained with subtle manipulations yields
high-consistency timestamps that are closer to the ground-truth moment-of-capture in comparison to the model optimized with random tampering.
Fig. 15. Comparison between subsets of transient attributes a_G and a_S for each scene, considering consistent (left) and inconsistent (right) timestamps. Inconsistent timestamps often lead to considerable differences between several attributes of the scene.
V. EXPLAINING THROUGH TRANSIENT ATTRIBUTES
As an additional example of our analysis made in Figure 8 of Section IV-H, we estimated the sets of transient attributes, a_G and a_S, under distinct timestamps and present them in Figure 15. For each scene, we selected a subset of the five most-divergent attributes between a_G and a_S considering the inconsistent timestamp. For comparison, we also show the difference between both sets for a consistent timestamp.
As seen in our previous evaluations, under an inconsistent timestamp, the transient attributes a_G and a_S diverge considerably. These differences among attributes point to discrepancies found by our model between the scene depicted in the ground-level image and what the model expected considering the alleged timestamp.
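Selecting the most-divergent attributes is a simple arg-sort over per-attribute absolute differences. A minimal sketch, where `attribute_names` is assumed to list the 40 names from Table V in the same order as the attribute vectors:

import numpy as np

def most_divergent_attributes(a_g, a_s, attribute_names, k=5):
    """Return the k attributes where a_G and a_S disagree the most."""
    diff = np.abs(np.asarray(a_g) - np.asarray(a_s))
    top_idx = np.argsort(diff)[::-1][:k]  # indices sorted by decreasing gap
    return [(attribute_names[i], float(diff[i])) for i in top_idx]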
VI. EXPLAINING THROUGH OCCLUSION
To further understand the decision-making process of our
model, we generated occlusion activation maps for different
timestamps. In Figure 16, we present the occlusion maps for
the same images considering the ground-truth time-of-capture
of each image as well as inconsistent timestamps. The network
often activates sky regions and reflective elements, such as
the pyramids from the top example in Figure 16, indicating
important regions for its decision. In nighttime scenes, the
same happens for regions with light sources, especially when
considering a daytime timestamp.
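A sliding-window occlusion analysis in the spirit of this experiment can be sketched as follows. Patch size, stride, and fill value are illustrative choices, not our exact settings, and `model` is again a hypothetical consistency-scoring callable.

import numpy as np

def occlusion_map(model, image, timestamp, location, satellite_image,
                  patch=32, stride=16, fill=0.5):
    """Measure how masking each region changes the consistency score.

    `image` is assumed to be an HxWxC float array; higher map values
    mean occluding that region perturbs the model output more.
    """
    base = model(image, timestamp, location, satellite_image)
    h, w = image.shape[:2]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            drop = abs(base - model(occluded, timestamp,
                                    location, satellite_image))
            heat[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(counts, 1)  # average overlapping windows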
VII. TRANSIENT ATTRIBUTES
For the transient attribute estimation task, we adopt the 40 attributes presented by Laffont et al. [8]. They encode the presence of some aspect of the scene appearance (e.g., illumination, season, or weather) into the interval [0, 1]. Table V presents the complete list of attributes organized into broad categories.
Fig. 16. Occlusion activation maps for selected examples under different timestamps. Yellow regions represent areas of the image whose occlusion has a high impact on the network decision, while occluding blue regions has a low impact.
For a detailed description of each attribute, we kindly refer the reader to [8].
Even though several attributes are directly influenced by the passage of time (e.g., those in the Illumination, Period of the Day, and Weather categories), others seem to be unrelated to the task at hand (e.g., the Subjective category). To evaluate the usefulness of each transient attribute in providing clues about timestamp inconsistency, we selected all manipulated samples from the test set that were correctly classified by the model with more than 90% confidence. For each attribute i (with 1 ≤ i ≤ 40), we calculated the mean absolute difference between a_G^i and a_S^i and sorted the attributes in decreasing order. We present in Table VI the top 15 attributes with the highest mean absolute difference. Besides that, we calculated a ranking among the categories of attributes from Table V. We counted the ranking position p_i of each attribute in the sorted mean absolute difference rank computed previously (p = 1 for the attribute with the highest mean absolute difference, p = 2 for the second highest, and so on). For each category, we summed the weighted ranking position w_cat · p_i of each of its attributes, with w_cat being 1/|cat| (i.e., the inverse of the number of attributes in category cat), to give more weight to categories with fewer attributes. We sorted the categories in Table V by their rank (WRP).
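Both rankings can be reproduced in a few lines. A minimal sketch, assuming `a_g` and `a_s` are arrays of shape (num_samples, 40) collected from the high-confidence manipulated samples, and `categories` maps each attribute index to its category name:

import numpy as np
from collections import defaultdict

def rank_attributes_and_categories(a_g, a_s, categories):
    """Rank attributes by mean absolute difference, categories by WRP."""
    mad = np.abs(a_g - a_s).mean(axis=0)   # per-attribute MAD
    order = np.argsort(mad)[::-1]          # decreasing MAD
    position = {attr: rank + 1 for rank, attr in enumerate(order)}

    # WRP: sum of p_i / |cat| per category, i.e., the mean rank position.
    members = defaultdict(list)
    for idx, cat in categories.items():
        members[cat].append(position[idx])
    wrp = {cat: sum(p) / len(p) for cat, p in members.items()}
    return order, sorted(wrp.items(), key=lambda kv: kv[1])

Note that weighting each position by 1/|cat| makes the WRP the average rank position of a category's attributes, so categories of different sizes are directly comparable.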
Most attributes with a high mean absolute difference, which represents a divergence between the set estimated from the ground-level image and the set estimated from satellite, location, and alleged timestamp information, fall into the Illumination, Period of the Day, and Weather categories, matching our observations from Section IV-I. When we analyze the category ranking, we also identify that Subjective attributes seem to be uninformative for this task. To assess whether removing attributes from low-ranking categories would positively influence the model performance, we retrained the DenseNet G, t, l, S (TA) model without the Subjective and Appearance attributes.
Fig. 17. Mean Absolute Difference (MAD) between a_G and a_S (|a_G - a_S|) for Consistent (blue bars) and Inconsistent (orange bars) testing samples that were correctly classified by the model with varying degrees of confidence. While the left plot considers all 40 transient attributes, the rightmost one only assesses those related to Illumination, Period of the Day, and Colors. Manipulated samples have a high divergence between both sets of attributes, while consistent samples are more similar. The gap between sets becomes more apparent as the model gets more confident in its classification. The gray dashed line represents the average MAD between both classes for samples correctly classified with more than 80% confidence in the training set.
This decreases the number of attributes (the dimensionality of a_G and a_S) from 40 to 28. However, the model achieved an accuracy of 75.65% and an AUC of 0.845, a few percentage points below the method trained to estimate all attributes. This might indicate that, even though both categories are not informative enough to point out manipulations on their own, they aid the detection in conjunction with the remaining attributes.
In a similar manner, during inference on an individual example, both sets of estimated attributes can be compared by calculating the absolute difference between each attribute (|a_G - a_S|) and averaging them. As discussed in Section IV-I of the main manuscript, the sets will present clear differences as the alleged timestamp diverges from the ground-truth time of capture. To formalize this, we present in Figure 17 the mean absolute difference (MAD) for each class, considering testing samples that were correctly classified by the model with varying degrees of confidence. In our analysis, we consider all 40 attributes but also the subset of attributes related to Illumination, Period of the Day, and Colors. At the decision boundary of the model (i.e., samples classified with 50% to 60% confidence), the two classes have similar MADs, whereas at high confidence intervals, Inconsistent samples have a higher MAD than the Consistent class. These findings highlight that samples with considerably divergent a_G and a_S will probably be confidently classified by the model, while very similar sets of attributes might indicate harder-to-verify examples, about which the model would be less certain.
There is no direct decision of whether or not the sample has been manipulated based on |a_G - a_S|; however, for illustration purposes only, we calculated a cutoff point on the training set by averaging the MAD of each class for samples classified with more than 80% confidence. We present these cutoff points as the dashed gray lines in Figure 17. These values represent thresholds for both classes in the test set for most intervals when analyzing Illumination, Period of the Day, and Colors attributes (rightmost plot), and for high confidence intervals when we analyze all attributes. They could be used to indicate possible inconsistencies, ratifying the decision of the model.
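For completeness, computing these illustrative cutoffs could look like the sketch below, assuming per-sample attribute sets, binary labels, and model confidences from the training split (all array shapes are assumptions of this sketch).

import numpy as np

def mad_cutoffs(a_g, a_s, labels, confidence, min_conf=0.8):
    """Average per-class MAD over confidently classified training samples.

    `a_g` and `a_s` have shape (num_samples, num_attributes); `labels`
    is a NumPy array with 0 (consistent) / 1 (inconsistent); `confidence`
    holds the model's classification confidence per sample.
    """
    mad = np.abs(a_g - a_s).mean(axis=1)   # one MAD value per sample
    keep = confidence > min_conf
    # The two class means correspond to the dashed lines in Figure 17.
    return {"consistent": float(mad[keep & (labels == 0)].mean()),
            "inconsistent": float(mad[keep & (labels == 1)].mean())}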
TABLE V
Transient attribute categories ranked by the weighted ranking position (WRP) of their attributes.

  Category           WRP   Transient Attributes
1 Illumination       9.7   dark, bright/luminous, glowing/radiant
2 Colors             14.0  colorful/vivid, dull/faded
3 Period of the day  14.6  sunrise/sunset, daylight, noon/midday, dawn/dusk, night
4 Season             16.0  spring, summer, winter, autumn
5 Vegetation         18.0  lush vegetation, flowers/blossoms
6 Weather            18.9  sunny/direct sun, clouds/overcast, fog/haze, storm, snow, warm/hot, cold, dry, moist/muddy, windy, rain, ice/frost
7 Appearance         23.5  dirty/polluted, active/busy, rugged, cluttered
8 Subjective         28.6  beautiful, soothing/calm, stressful, exciting/stimulating, mysterious, boring, sentimental/intimate, gloomy/depressing
TABLE VI
Top-15 attributes ranked by their mean absolute difference, considering manipulated samples that were correctly classified with more than 90% confidence.

   Attribute  Category           Mean Absolute Difference
 1 night      Period of the Day  0.301
 2 daylight   Period of the Day  0.301
 3 dark       Illumination       0.275
 4 lush       Vegetation         0.240
 5 sunny      Weather            0.231
 6 snow       Weather            0.224
 7 winter     Season             0.224
 8 ice        Weather            0.220
 9 cold       Weather            0.213
10 summer     Season             0.208
11 dry        Weather            0.204
12 glowing    Illumination       0.184
13 moist      Weather            0.184
14 colorful   Colors             0.165
15 spring     Season             0.161
