This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2965257, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier ...
The not yet exploited goldmine of
OSINT: Opportunities, open challenges
and future trends
JAVIER PASTOR-GALINDO1, PANTALEONE NESPOLI1, FÉLIX GÓMEZ MÁRMOL1, AND
GREGORIO MARTÍNEZ PÉREZ1
1Department of Information and Communications Engineering, University of Murcia, 30100 Murcia, Spain
Corresponding author: Javier Pastor-Galindo (e-mail: javierpg@um.es)
This work has been partially supported by an FPU predoctoral contract (FPU18/00304) granted by the Spanish Ministry of Science,
Innovation and Universities, by an FPU predoctoral contract granted by the University of Murcia, by a Ramón y Cajal research contract
(RYC-2015-18210) granted by the MINECO (Spain) and co-funded by the European Social Fund, and by the project SAFEMAN (A
unified management framework for cybersecurity and safety in the manufacturing industry) with code RTI2018-095855-B-I00.
ABSTRACT The amount of data generated by the current interconnected world is immeasurable, and
a large part of such data is publicly available, which means that it is accessible by any user, at any
time, from anywhere on the Internet. In this respect, Open Source Intelligence (OSINT) is a type of
intelligence that actually benefits from that open nature by collecting, processing and correlating data points
from the whole cyberspace to generate knowledge. In fact, recent advances in technology are causing OSINT
to currently evolve at a dizzying rate, providing innovative data-driven and AI-powered applications for
politics, economy or society, but also offering new lines of action against cyberthreats and cybercrime. The
paper at hand describes the current state of OSINT and makes a comprehensive review of the paradigm,
focusing on the services and techniques enhancing the cybersecurity field. On the one hand, we analyze the
strong points of this methodology and propose numerous ways to apply it to cybersecurity. On the other
hand, we cover the limitations when adopting it. Considering there is a lot left to explore in this ample field,
we also enumerate some open challenges to be addressed in the future. Additionally, we study the role of
OSINT in the public sphere of governments, which constitute an ideal landscape to exploit open data.
INDEX TERMS OSINT, cyberintelligence, cybersecurity, cyberdefence, challenges, national security,
computer crime, computational intelligence, knowledge acquisition, social network services, software tools,
data privacy, Internet
I. INTRODUCTION
Open Source Intelligence (OSINT) consists of the collection,
processing and correlation of public information from
open data sources such as the mass media, social networks,
forums and blogs, public government data, publications, or
commercial data. Given some input data, together with the
application of advanced collection and analysis techniques,
OSINT continuously expands the knowledge about the target.
In this way, the information found feeds the gathering process
again to get closer to the final goal [1].
Nowadays, OSINT is widely adopted by governments and
intelligence services to conduct their investigations and fight
against cybercrime [2]. Nevertheless, it is not only utilised
for state affairs, but is also applied to several different goals.
Indeed, current research is focused on (but not limited to)
three main applications which are represented in FIGURE 1
and are described next:
• Social opinion and sentiment analysis: Along with the
boom of online social networks, it is possible to collect
users’ interactions, messages, interests and preferences
to extract non-explicit knowledge. The evidence accumulated
from social media is far-reaching and widely
advantageous [3]. Such collection and analysis could be
applied, for instance, to marketing, political campaigns
or disaster management [4].
• Cybercrime and organized crime: Open data is continuously
analyzed and matched by OSINT processes
in order to spot criminal intentions at an early stage.
Taking into account adversaries’ patterns and relationships
between felonies, OSINT is able to provide security
forces with an opportunity to promptly detect
illegal actions [5]. In this direction, by exploiting
open data, it would be possible to track the activity of
terrorist organizations, which are increasingly active on
the Internet [6], [7].
• Cybersecurity and cyberdefence: ICT (Information and
Communication Technology) systems are continuously
attacked by criminals aiming at disrupting the availability
of the provided services [8]. Research hence becomes
crucial to defend those systems from cyberattackers,
specifically by facing the challenges that are still
open in the field of cybersecurity [9]. In this sense, data
science is not only being applied to footprinting
in penetration tests, but also to the preventive protection of
organizations and companies. Specifically, data mining
techniques may help by analyzing daily attacks,
correlating them and supporting decision making
processes for an effective defense, but also for a prompt
reaction [10]. In the same way, OSINT can also be
considered in this context as a source of information for
tracebacks and investigations. Digital forensic analysis
[11] can incorporate OSINT to complement the digital
evidence left by an incident.
FIGURE 1: OSINT principal use cases. [Diagram: OSINT branches into three use cases. Cybercrime and Organized Crime: spot illegal actions, retrieve suspicious traces, monitor malicious groups. Cybersecurity and Cyberdefence: footprinting, forensic analysis, cyberattack attribution, social engineering / phishing attack prevention. Social Opinion and Sentiment Analysis: marketing, political campaigns, disaster management, HR recruiting, journalism.]
In addition to those, OSINT can be applied to other con-
texts. In particular, one may extract relevant information by
performing social engineering attacks. Ill-motivated entities
leverage publicly-available information released online (e.g.,
on social networks) in order to create appealing hooks to
capture the target [12]. Moreover, it is possible to perform
automatic veracity assessment on the open data aiming at
disclosing fake news and deepfakes, among others [13].
Nonetheless, it is important to note that the utilization of
public data also raises compromising issues. On the one hand,
the EU General Data Protection Regulation (GDPR) limits
the processing of personal data related to individuals in the
EU zone [14]. On the other hand, there is a strong ethical
component which is linked to the users’ privacy. In particular,
the profiling of people [15] could reveal personal details such
as their political preference, sexual orientation or religious
beliefs, amongst others. Additionally, the exploitation of such a
vast amount of information may lead to abuse, resulting
in harm to innocent people through cyberbullying, cybergossip or
cyberaggressions [16].
The paper at hand, which is an extension of the work pro-
posed in [17], encompasses the present and future of OSINT
by analyzing its positive and negative points, describing ways
of applying this type of intelligence, and enunciating future
directions for the evolution of this paradigm. In addition, a
more detailed description of different techniques, tools and
open challenges is presented in this work. Furthermore, we
propose the integration of OSINT within the DML (Detection
Maturity Level) model to address the attribution problem
from a different perspective in the context of cyberattack
investigations. We also introduce sample workflows to facilitate
the understanding and use of OSINT to gather valuable
information starting from basic inputs.
In addition, our purpose is to stimulate research and
advances in the OSINT ecosystem. The scope of such an ecosystem
is quite wide, spanning from psychology and social science
to counterintelligence and marketing. As we have seen so far,
OSINT is a promising mechanism that concretely improves
the traditional cyberintelligence, cyberdefence and digital
forensic fields [18]. The impact that this methodology could
have on society thanks to current technology and the large
number of open sources is still unexploited. There is still
a long way ahead to explore in this topic, and this article
presents some future appealing research lines.
The remainder of this paper is organized as follows. SEC-
TION II offers a review of recent research works in the field
of OSINT. SECTION III discusses the motivation, pros and
cons of the development of OSINT. SECTION IV explains
the principal OSINT steps and practical workflows to carry
them out. Then, SECTION V includes an in-depth descrip-
tion of OSINT-based collection techniques and services.
SECTION VI analyzes and compares some OSINT tools that
automate the collection and analysis of information.
SECTION VII proposes the integration of OSINT in the
investigation of cyberattacks. SECTION VIII focuses on the
impact of OSINT within a nation, not only for the sake of
its internal cyberdefence operations, but also as a beneficiary
of transparency policies. Spain is specifically taken as a
reference for affinity and contextualized with the rest of the
world. SECTION IX poses some open challenges regarding
research in OSINT. Finally, SECTION X concludes with
some key remarks, as well as future research directions.
II. STATE OF THE ART
In recent years, with the advances of big data and data
mining techniques, the research community has noticed that
open data represents a powerful source for analyzing social
behaviors and obtaining relevant information [19]. Next we
describe some remarkable works pivoting around each of the
three aforementioned principal use cases for OSINT.
With regard to the use of OSINT for extracting social
opinion and emotions, Santarcangelo et al. [20] proposed a
model for determining user opinions about a given keyword
through social networks, specifically studying the adjectives,
intensifiers and negations used in tweets. Unfortunately, it is
a simple keyword-based solution designed only for the Italian
language, not taking into account semantic issues. On the
other hand, Kandias et al. [21] related people’s usage
of social networks (in particular, Facebook) to their stress
level. However, the experiments were carried out with only
405 users, while nowadays it is possible to process
much larger amounts of data. Another interesting study is
conducted in [22], where the authors applied Natural Language
Processing (NLP) to WhatsApp messages in order to possibly
prevent the occurrence of mass violence in South Africa.
Unfortunately, the investigation is limited to text messages,
thus excluding vital information which can be disclosed
through multimedia material.
In the context of cybercrime and organized crime, there
are several works that explore the application of OSINT
for criminal investigations [23]. For example, OSINT could
increase the accuracy of prosecutions and arrests of culprits
with frameworks like the one proposed by Quick et al. in [11].
Specifically, the authors apply OSINT to digital forensic data
from a variety of devices to enhance criminal intelligence
analysis. In this field, another opportunity that OSINT yields
is the detection of illegal actions as well as the prevention
of future crimes such as terrorist attacks, murders or rapes. In
fact, the European projects ePOOLICE [24] and CAPER [25]
were designed to develop effective models for scanning open
data automatically in order to analyze society and detect
emerging organized crime. In contrast to the previously mentioned
projects, whose proposals were not practically used
in real cases, Delavallade et al. [26] describe a model based
on social network data that is able to extract future crime
indicators. Such a model is then applied to the copper theft and
jihadist propaganda use cases.
From the point of view of cybersecurity and cyberde-
fence, OSINT represents a valuable tool for improving our
protection mechanisms against cyberattacks. Pinto et al. [27]
propose the use of OSINT in the Colombian context to
prevent attacks and to allow strategic anticipation. Their proposal includes
not only plugins for collecting information, but also machine
learning models to perform sentiment analysis. Moreover, the
DiSIEM European project [28] has as its first goal the
integration of diverse OSINT data sources in current SIEM
(Security Information and Event Management) systems to
help react to recently discovered vulnerabilities in the
infrastructure or even predict possible emerging threats.
In addition, Lee et al. [29] also designed an OSINT-based
framework to inspect cybersecurity threats of critical infrastructure
networks. However, none of these approaches has
been applied to real world scenarios, thus their effectiveness
remains questionable.
Extending the discussion to other application fields,
in [30] the authors demonstrate how to passively collect significant
information on organizational employees in an automated
fashion. Such information is then related to the
analysis of the so-called social engineering attack surface,
showing the feasibility of the proposed approach.
Then, the authors propose a set of potential countermeasures,
including a publicly-available social engineering vulnerability
scanner which companies may leverage in order to reduce
the exposure of their employees.
Furthermore, a systematic review of approaches, methodologies
and tools proposed by academia to
conduct automatic veracity assessment of publicly-available
data is performed in [31]. Specifically, the authors studied
107 research items from 2013 to 2017 to discuss the
state of the art of veracity assessment, which has become a
great concern during the last decade due to the spread of
fake news and deepfakes. In this direction, the authors outline
the relative immaturity of this field, identifying several
challenges which will characterize future research trends.
III. OSINT ADVANTAGES AND SHORTCOMINGS
The fields of application of OSINT are numerous and the so-
lutions being developed under this paradigm are increasing.
However, behind this methodology there is a trade-off that
developers and engineers have to deal with. From a technical
point of view, as we can see in TABLE 1, OSINT offers a
number of benefits, but it also has to deal with some restrictions,
which are detailed next.
A. OSINT BENEFITS
1) Huge amount of available information
There is currently a large volume of worthwhile open source
data to be analyzed, correlated and linked [32]. This includes
social networks, public government documents and reports,
online multimedia content, newspapers and even the Deep
Web and the Dark Web [33], among others. Actually, both
the Deep Web and the Dark Web (the latter circumscribed
within the former) contain even more information than the
Surface Web (i.e., the Internet known by most users) [34].
In order to be able to access these networks, it is necessary
to use specific tools since their contents are not indexed by
traditional search engines.
Unlike the Surface Web and most of the Deep Web, the
Dark Web offers anonymity and privacy to the users who utilize
it. This property makes it easier for criminals to employ this network
to surf, conduct their searches and publish with illegitimate
purposes while hiding their identity. Therefore, the Dark Web
is an ideal source to apply OSINT and fight against cybercrime,
organized crime or cyberthreats. On the other hand,
the pursuit and de-anonymization of these people remain
non-trivial challenges for OSINT to work properly [35].
Pros ✓                                 | Cons ✗
Huge amount of available information   | Complexity of data management
High capacity of computing             | Unstructured information
Big data and machine learning          | Misinformation
Complementary types of data            | Data sources reliability
Flexible purpose and wide scope        | Strong ethical/legal considerations
TABLE 1: OSINT pros and cons in a nutshell
2) High computing capacity
Advances in computer architecture, processors and GPUs
(graphics processing units) make it possible to carry out labor-intensive
operations in terms of collection, processing, analysis and
storage [36]. Thanks to this capacity, we have the opportunity
to apply OSINT considering large amounts of public infor-
mation and mixing a high number of data sets, relationships
and patterns from different types of open sources, while
applying advanced processing and analysis techniques.
3) Big data and machine learning
The emerging proliferation of data analysis and data mining
techniques, as well as machine learning algorithms,
can automate investigation and decision making
processes and make them more intelligent and efficient [36]. It allows spotting
complex correlations that are naturally unpredictable to
humans. This point will be key in future OSINT activities,
as it will mark the difference between human-driven and
artificial intelligence-led research. By incorporating those
techniques, the process of collection and analysis will definitely
improve, thus resulting in accurate investigations close
to our goal. Additionally, government counterintelligence
agencies can leverage such paradigm to further enhance the
quality of managed information and, consequently, the battle
against terrorist organizations [37].
4) Complementary types of data
OSINT can be fed with other types of information [38].
The inherent structure of the system is open enough
to include data that has not actually been obtained from open
sources. This fact means that OSINT can be even more effective
if we are able to add external pieces of information to
complement investigations. For example, Law Enforcement
Agencies could take advantage of citizens’ collaboration to
feed OSINT searches, intelligence services could leverage
classified information about cybercriminals or incidents to
enrich OSINT investigations, or even common users could
combine OSINT with social engineering to profile their target.
5) Flexible purpose and wide scope
Due to the nature of OSINT, investigations can be extended
to lots of problems and can collect pieces of information all
over the cyberspace. This paradigm could be used for eco-
nomic, psychological, strategic, journalistic, labor or security
aspects, among others. In particular, we could highlight the
benefits in the field of crime and cybersecurity, where OSINT
could monitor suspicious people or dangerous groups, detect
influencing profiles related to radicalization, study worrying
trends of the society, support the attribution of cyberattacks
and crimes, enhance digital forensic analysis, etc. [5], [18].
B. OSINT LIMITATIONS
1) Complexity of data management
The quantity of data is huge and, consequently, it is challeng-
ing to handle it efficiently and effectively [39]. It is beneficial
for OSINT to consider as much information as possible, but
also to have advanced techniques and significant resources to
ensure high quality collection, processing and analysis.
2) Unstructured information
The public information available on the Internet is inherently
massively disorganized. This means that the data collected
by OSINT is so heterogeneous that it is tough to classify,
link and examine such data in order to extract relevant relationships
and knowledge [4]. In this sense, OSINT requires
mechanisms such as data mining, Natural Language Process-
ing (NLP), or text analytics to homogenize the unstructured
information in order to be able to exploit it.
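As a minimal sketch of what such homogenization could look like, the following Python snippet maps records from two hypothetical sources (a social network export and a forum dump, both with invented field names) onto one common schema. The `Record` type, the `normalize` helper and every field name are assumptions made for illustration only and do not correspond to any concrete OSINT service.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Common schema shared by all sources (hypothetical)."""
    source: str
    author: str
    text: str
    timestamp: datetime


def normalize(raw: dict, source: str) -> Record:
    """Map a source-specific dictionary onto the common Record schema."""
    if source == "social":
        # Assumed social network export: epoch seconds and '@' handles.
        return Record(
            source=source,
            author=raw["user"].lstrip("@").lower(),
            text=" ".join(raw["message"].split()),  # collapse whitespace
            timestamp=datetime.fromtimestamp(raw["created"], tz=timezone.utc),
        )
    if source == "forum":
        # Assumed forum dump: ISO 8601 date strings with an explicit offset.
        return Record(
            source=source,
            author=raw["nickname"].lower(),
            text=" ".join(raw["body"].split()),
            timestamp=datetime.fromisoformat(raw["posted_at"]).astimezone(timezone.utc),
        )
    raise ValueError(f"unknown source: {source}")


records = [
    normalize({"user": "@Alice", "message": "Hello   world\n", "created": 0}, "social"),
    normalize({"nickname": "Bob", "body": " hi  there ",
               "posted_at": "2020-01-02T03:04:05+00:00"}, "forum"),
]
```

Once every record shares the same fields, the later linking and analysis steps can treat all sources uniformly.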
3) Misinformation
Social networks and communication media are flooded with
subjective opinions, fake news and canards [4]. For this rea-
son, the existence of inaccurate information has to be taken
into account in the implementation of OSINT mechanisms
and should not drive the propagation of the search. OSINT
activities should always deal with reliable information and
follow trusted exploration lines to ensure positive and con-
vincing outcomes [40].
4) Data sources reliability
The trustworthiness and authority of the information are
indeed the key for successful OSINT investigations [41].
Ideally, the collected data should come from authoritative,
reviewed and trusted sources (official documents, scientific
reports, reliable communication media) [39]. In practice,
OSINT will also coexist with subjective or non-authoritative
sources, such as the content of social networks or manipu-
lated media [42]. Even though this type of sources is more
prone to misinformation, it is actually where more knowledge
can be extracted to investigate people, groups or companies.
While the credibility of the open sources of information is already
a limitation, it becomes even more challenging
considering the possible ambiguity of users’ queries to retrieve
the desired information [43].
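One simple way to let source reliability influence an investigation is to weight conflicting claims by how much each source is trusted. The sketch below is plain weighted voting written for illustration; it is not one of the methods from the cited literature, and the source names, reliability weights and the `weighted_vote` helper are all hypothetical values that an analyst would have to choose.

```python
from collections import defaultdict


def weighted_vote(claims, reliability, default=0.5):
    """Pick, per subject, the value backed by the highest total reliability.

    claims: iterable of (source, subject, value) tuples.
    reliability: dict mapping source -> weight in [0, 1] (analyst-assigned).
    default: weight used for sources missing from the reliability dict.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, subject, value in claims:
        scores[subject][value] += reliability.get(source, default)
    return {subject: max(values, key=values.get)
            for subject, values in scores.items()}


# Hypothetical example: three sources disagree about one fact.
claims = [
    ("official_report", "hq_city", "Madrid"),
    ("anonymous_blog", "hq_city", "Lisbon"),
    ("social_post", "hq_city", "Lisbon"),
]
reliability = {"official_report": 0.9, "anonymous_blog": 0.2, "social_post": 0.3}
result = weighted_vote(claims, reliability)
```

In this example the single authoritative source (weight 0.9) outweighs the two less reliable ones (0.2 + 0.3), so a plain majority count would have given a different answer.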
5) Strong ethical/legal considerations
Numerous concerns about privacy, respect and personal in-
tegrity emerge with the development of OSINT [44]. In this
direction, it has to be noted that the question of whether OS-
INT constitutes an ethical issue is generally situated within
the area of the ethics of intelligence collection [45]. On the
one hand, although it relies on publicly accessible data, OSINT has the power
to disclose information that is not explicitly posted on the
web. Uncovered results should respect users’ privacy and not
reveal intimate and personal issues [15], while taking into
account current related regulations (such as GDPR [14]).
To this extent, aspects such as sexual orientation, religious
beliefs, political inclination or compromising behaviours can
be inferred from the Internet, and this disclosure process can
be problematic in many countries today. On the other hand,
the scope of OSINT-based searches should be, by definition,
limited to open data sources. Under no circumstances can access
controls or authentication methods be bypassed to extract
knowledge.
IV. OSINT WORKFLOWS
OSINT, like any other type of intelligence, has a well-defined
and precise methodology. From our scientific-technical point
of view, we are particularly interested in three steps.
Firstly, in the collection phase, publicly available data is
retrieved from relevant open sources according to the target
or objective. In particular, the Internet is the resource par
excellence due to the volume of existing material and easy
accessibility. The collection process is particularly relevant
because from this stage onwards the whole process of intelli-
gence generation is triggered.
Then, in the analysis phase, the collected raw material is
treated to generate valuable and comprehensible information.
The data by itself is not useful, so it has to be interpreted to
obtain the first facts derived from an in-depth analysis.
Finally, in the knowledge extraction process, the information
refined previously is taken as input for more sophisticated
inference algorithms. Thanks to the computational
advances of the current era, it is possible to detect patterns,
profile behaviours, predict values or correlate events.
It is worth mentioning that the second and third steps com-
prise technologies widely used and known in the context of
data mining. However, the OSINT collection approach differs
from current data-driven services. Nowadays, common data
analysis applications gather as much information as possible
from pre-defined data sources and implement clear gathering
processes. On the contrary, OSINT solutions should collect
specific facts from the sea of all possible and reachable open
resources.
In order to face this latter challenging uncertainty and
go one step further, we propose in FIGURE 2 a practical
framework to carry out OSINT-based investigations. We have
included those exploration paths which are worthwhile to follow
for optimizing the analysis of collection results and maximizing
the extraction of knowledge. This high-level
scheme includes the clearest transactions, representative
elements and outstanding operations.
A. OSINT COLLECTION
Before the analysis and intelligence extraction steps, the in-
vestigator has to expand the dataset about the target. With this
aim, we propose some OSINT techniques to represent dif-
ferent collection strategies. In particular, we have considered
search engines, social networks, email address, username,
real name, location, IP address and domain name OSINT
techniques (as we will further describe in SECTION V).
Under each one, there will be innumerable OSINT services
with similar ways of collecting data.
In this phase, it is assumed that, at least, an atomic piece of
data about the target is available (e.g., real name, username,
email address, etc.). From that initial seed and according to
its nature, the investigator applies the most suitable OSINT
techniques to derive more data. In this sense, the results
obtained with a specific technique constitute a data transfer
that can be used by another type of technique. The transactions
represented in FIGURE 2 illustrate possible ways of propagating the investigation,
where the output of the technique of origin becomes
the input to feed the technique of destination.
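The propagation described above can be sketched as a breadth-first expansion from the initial seed, where each technique consumes one piece of data and returns new pieces for other techniques. The snippet below is only an illustrative skeleton: the technique functions are offline stubs with hypothetical names, whereas a real implementation would query actual OSINT services at those points.

```python
from collections import deque


# Hypothetical technique stubs: each takes one value and returns a list of
# (data_type, value) pairs that other techniques can consume.
def from_email(value):
    user, _, domain = value.partition("@")
    return [("username", user), ("domain", domain)]


def from_username(value):
    return []  # placeholder, e.g. a search across social networks


def from_domain(value):
    return []  # placeholder, e.g. a lookup of registration info


TECHNIQUES = {"email": from_email, "username": from_username, "domain": from_domain}


def collect(seed, techniques=TECHNIQUES, max_items=100):
    """Expand the dataset breadth-first, starting from one (data_type, value) seed."""
    seen = {seed}
    queue = deque([seed])
    # Stop expanding once roughly max_items pieces of data have been gathered.
    while queue and len(seen) < max_items:
        data_type, value = queue.popleft()
        technique = techniques.get(data_type)
        if technique is None:
            continue  # no technique available for this kind of data
        for item in technique(value):
            if item not in seen:  # avoid looping over already known data
                seen.add(item)
                queue.append(item)
    return seen


dataset = collect(("email", "jdoe@example.org"))
```

The `seen` set prevents the investigation from cycling between techniques that feed each other, and `max_items` is an assumed safeguard against unbounded expansion.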
B. OSINT ANALYSIS
The results of the continuous iterations through the different OSINT techniques
should be analyzed and understood to generate valuable
information. There is an increasing number of analysis
techniques in the literature for this task [46]; below we highlight
those appealing procedures which are applicable in our
scenario:
• Lexical analysis: Raw data should be examined to extract
entities and relations from text. It is essential to
apply translation processes to the language used in the
OSINT investigation [47] and to filter out noise that does not
add value.
• Semantic analysis: Having a bag of words is not useful
if the meaning is not extracted [48]. For the purpose
of understanding data, natural language processing algorithms
are being used nowadays [49]. In addition,
sentiment analysis techniques permit the contextualization
of subjective posts or opinions to classify the
emotional status of the author (e.g., positive, negative or
neutral). Finally, truth discovery procedures address the
challenging task of resolving conflicts in multi-source
data which holds opposing positions on the same subject
[50].
• Geospatial analysis: Data collected from social networks,
events, sensors or IP addresses is worth
analyzing from a location-based perspective. In
this sense, the usage of maps or graphs facilitates the
representation and comprehension of data [51], as well
as the extraction of meaningful connections between incidents
or persons.
• Social media analysis: The features brought by modern
social media allow researchers to carry out in-depth
analysis of users [52]. In such a scenario, the analysis of
social data allows the creation of a network of contacts,
interactions, places, behaviours and tastes around the
subject.
FIGURE 2: Principal OSINT workflows and derived intelligence. [Diagram with three stages. COLLECTION: open sources (media, government data, Internet, commercial data, network infrastructure) feed the OSINT techniques (search engines, social networks, email address, user name, real name, location, IP address, domain name), which are linked by data transfers. ANALYSIS: lexical, semantic, geospatial and social media analysis produce output info grouped as personal, organizational and network information, with labels including telephone, education, professional career, CV, age, relatives, files and images, city and country, GPS coordinates, website, company, subdomains, registration info, hostnames, operating system, DNS records and network topology. KNOWLEDGE EXTRACTION: tracking patterns, classification, correlation, outlier detection, clustering and regression lead to potential findings, with labels including political, sexual or religious preferences, tendency to crime, economic situation, cached life, places visited, activity on the web, crime attribution and de-anonymization.]
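To make the lexical and sentiment steps from the list above more concrete, the following sketch extracts a few entity types from raw text with regular expressions and assigns a coarse polarity using a tiny word list. It is a deliberately naive illustration: the patterns are simplified (for instance, IPv4 octet ranges are not validated), and the `POSITIVE` and `NEGATIVE` lexicons are invented placeholders rather than a real sentiment resource.

```python
import re

# Simplified patterns, adequate only for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")  # '@name' not preceded by a word character

# Hypothetical toy lexicons; real systems rely on much richer resources.
POSITIVE = {"good", "great", "safe", "happy"}
NEGATIVE = {"bad", "attack", "angry", "threat"}


def extract_entities(text):
    """Lexical step: pull e-mail addresses, IPv4 addresses and mentions from text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ips": IPV4_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }


def sentiment(text):
    """Semantic step (very coarse): classify text as positive, negative or neutral."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


entities = extract_entities("Contact jdoe@example.org or ping @alice from 192.0.2.10")
```

The extracted entities are exactly the kind of values (email address, user name, IP address) that can be fed back into the collection techniques.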
The results of launching the aforementioned techniques
are considered as output info and are categorized into three
main groups:
• The personal information gathers the personal identity
details which are mainly obtained from the real name,
email address, user name, social networks and search
engines techniques.
• The organizational information is formed by aspects
of a team or company composed of individuals. It
is essentially collected by means of social networks,
search engines, location, domain name and IP address
techniques.
• The network information covers technical data of systems
and communication topologies which is usually
obtained through location, domain name and IP address
techniques.
Logically, these three blocks of information can be ex-
panded with more elements. Moreover, a single investigation
may have different types of output info that complement each
other.
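The three groups of output info and the techniques that mainly feed each of them can be captured in a small data structure. The sketch below encodes the mapping stated in the list above; the `OutputInfo` container, its `add` method and the `candidate_groups` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field

# Techniques that mainly feed each group, as listed in the text.
GROUP_TECHNIQUES = {
    "personal": {"real name", "email address", "user name",
                 "social networks", "search engines"},
    "organizational": {"social networks", "search engines", "location",
                       "domain name", "IP address"},
    "network": {"location", "domain name", "IP address"},
}


@dataclass
class OutputInfo:
    """Container for the findings of one investigation (hypothetical)."""
    personal: dict = field(default_factory=dict)
    organizational: dict = field(default_factory=dict)
    network: dict = field(default_factory=dict)

    def add(self, group, key, value):
        """Store a finding under one of the three groups."""
        if group not in GROUP_TECHNIQUES:
            raise ValueError(f"unknown group: {group}")
        getattr(self, group).setdefault(key, set()).add(value)


def candidate_groups(technique):
    """Return the groups that a given technique mainly contributes to."""
    return sorted(g for g, techs in GROUP_TECHNIQUES.items() if technique in techs)


info = OutputInfo()
info.add("network", "hostnames", "mail.example.org")
info.add("personal", "telephone", "+00 000 000 000")
```

Because a technique such as domain name contributes to more than one group, a single investigation naturally ends up with complementary types of output info.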
C. OSINT KNOWLEDGE EXTRACTION
The value of the information collected so far is unquestionable.
However, it is the extraction of intelligence from those findings
that actually provides a valuable understanding
of the target [53]. To this end, we consider knowledge
elicitation as the treatment of the analysis results (output info)
making use of data mining and artificial intelligence techniques.
In the following we mention some really promising
technologies at this stage:
• Correlation: Detection of relationships between people, events or pieces of data in general [54]. Strongly related features are especially valuable for revealing non-explicit associations existing in the dataset.
• Classification: The data can be divided into groups according to predefined categories (supervised learning) [55]. This technique permits the organization of large amounts of information for more effective knowledge extraction [56].
• Outlier detection: This procedure analyzes the dataset and detects anomalies in it [57]. Such anomalies are particularly interesting for the observation of malicious agents, whose behaviour or actions differ from those of the general population.
• Clustering: It assigns pieces of data to clusters and is able to take into account a large number of conditions or heuristics [58]. This could reveal, for example, different ways of behaving in the network, various types of online profiles or distinct ways of attacking individuals, organizations or infrastructures [59], without knowing the existence of that diversity beforehand (unsupervised
learning).
• Regression: The main objective of this technique is to forecast or predict numeric values or facts [60]. For example, a linear regression returns a value according to a linear function, a neural network maps complex combinations of inputs to an output, and a deep learning model is made up of several layers that combine and operate on the input.
• Tracking patterns: Unlike anomaly detection, pattern recognition is a process for detecting regularities in data [61]. The methods mentioned above can be included in this broad knowledge-discovery concept. In fact, any artificial intelligence technique is suitable for open data knowledge extraction.
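As an illustration of the outlier detection item in the list above, the following minimal sketch flags values that deviate strongly from the rest of a dataset using the modified z-score (based on the median absolute deviation). The function name, the threshold of 3.5 and the sample activity figures are assumptions made for this example and do not come from the surveyed works.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        return []  # no spread in the data, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical number of posts per day for eight monitored accounts
posts_per_day = [12, 15, 11, 14, 13, 240, 12, 16]
suspicious = mad_outliers(posts_per_day)
print(suspicious)  # index of the account whose activity departs from the rest
```

A median-based score is used here instead of the mean and standard deviation because a single extreme account would otherwise inflate the very statistics used to detect it.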
These intelligent techniques allow inferring abstract, complex and valuable facts about the target that are not explicitly published on the Internet [62]. However, this process poses several challenges, mainly residing in researching and developing this knowledge extraction process to identify, profile or monitor criminals, recognize and explore malicious organizations, or uncover and attribute cyber incidents. In addition, several privacy considerations arise due to the powerful inferences that are potentially achievable. The extracted knowledge about a person, company or organization may be especially sensitive, and its manipulation indirectly leads to ethical and legal problems (specifically addressed in SUBSECTION IX-F). Indeed, we should never lose sight of the fact that these techniques could even be misused to directly harm people or groups (deeper analysis in SUBSECTION IX-G).
V. OSINT COLLECTION TECHNIQUES AND SERVICES
As has been shown, OSINT is quite promising and powerful, but its implementation is also challenging. In fact, the first consideration is that it requires data as a starting point. Fortunately, the volume of raw data is not a problem nowadays due to the existence of the Internet. In addition, there is also an increasing number of applications, known in this context as OSINT services, that precisely facilitate the gathering of data on the web.
In the following, a summary of the most common OSINT techniques is presented. Within each technique, the most outstanding associated OSINT services at the time of writing are shown, giving hints on how to effectively exploit their potential. It is worth mentioning that OSINT services are ephemeral, and their number can grow or shrink over time. In contrast, an OSINT technique is a broader concept that will endure over time.
A. SEARCH ENGINES
Search engines such as Google, Bing or Yahoo, among others, are well-known and widely used tools. Their traditional use is the simplest way of applying OSINT. Given a textual query, these engines search the World Wide Web for information that matches the input, generally returning valuable information to the user.
Nevertheless, the number of results can be so overwhelming that it can even be counterproductive for the user. For that reason, a good investigator should know how to specify the requests within a search engine according to the desired outcome. Services like Google or Bing support filters to refine searches¹ and retrieve exactly the type of information we are interested in. For instance, “ ” forces exact matches, OR and AND act as logical operators, and * acts as a wildcard. They also allow conditions like filetype to specify a certain file type, site to limit results to those from a specific website, or intitle to find pages with certain keywords within their title. TABLE 2 contains some operators that can be used to refine Google and Bing searches. Yahoo, in turn, does not permit specific filters, but we can restrict the date, language or country of the results. The case of the DuckDuckGo search engine is especially interesting because it does not track the user, nor does it rely on the IP address or the search history. This privacy-preserving approach makes the findings homogeneous for all users, regardless of habits, preferences, location, or search history.
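The search operators described above can also be composed programmatically, which helps when the same refined query has to be repeated for many targets. The sketch below is only illustrative: the helper names build_query and search_url are hypothetical, and the base address https://www.google.com/search with a q parameter is assumed to be the public query URL of the engine.

```python
from urllib.parse import urlencode


def build_query(terms="", *, exact=None, exclude=(), site=None,
                filetype=None, intitle=None):
    """Compose a query string from the operators listed in TABLE 2."""
    parts = []
    if terms:
        parts.append(terms)
    if exact:
        parts.append(f'"{exact}"')              # exact-match search
    parts.extend(f"-{term}" for term in exclude)  # excluded terms
    if site:
        parts.append(f"site:{site}")            # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")    # restrict to one file type
    if intitle:
        parts.append(f"intitle:{intitle}")      # keyword in the page title
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Return a URL that can be opened in a browser (assumed URL format)."""
    return f"{base}?{urlencode({'q': query})}"


query = build_query(exact="University of Murcia", exclude=["catholic"],
                    site="um.es", filetype="pdf")
url = search_url(query)
print(query)
print(url)
```

Keeping the query construction separate from the URL construction makes it straightforward to reuse the same query with another engine that supports the same operators.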
Moreover, some search engines have been designed for specific territories. Yandex is well known in Russia and Eastern Europe, and implements search operators² to restrict the search by URL, file type, language, date, and so on. Baidu is another specific search service widely used in Asia. It includes not only the typical keyword search bar, but also additional worthy resources for OSINT such as a social network, a section of questions and answers, a virtual library or an encyclopedia, among others. There are also search engines for the Arabic community such as Yamli or Eiktub, but they are much less used. This type of service is particularly interesting in investigations about people, groups and companies belonging to specific communities.
Finally, it is essential to know specific search engines to browse the Dark Web. OSINT investigations against drug trafficking, child pornography, weapon sales or terrorism benefit greatly from exploring these not-so-popular resources. To this end, Ahmia and Torch are search engines available for use within the Tor anonymous network [63]. However, the researcher will have to deal with the anonymity of this network and its sites.
B. SOCIAL NETWORKS
Nowadays, the exposure of the daily life of individuals and organizations in social networks is evident. Any curious person has realized that a lot of personal information can be found on these platforms with no advanced knowledge needed. As shown in TABLE 3, these applications offer precise search possibilities in the context of OSINT. Next, we describe some of the best known and most used social networks worldwide.
¹ https://support.google.com/websearch/answer/2466433
² https://yandex.com/support/search/query-language/search-operators.html
Google/Bing filter                               Search operator  Example of use
Force an exact-match search                      “ ”              “University of Murcia”
Exclude a term or phrase                         -                university murcia -catholic
Search for X or Y                                OR, |            university murcia|cartagena
Search for X and Y (used by default)             AND              university AND of AND murcia
Use of a wildcard                                *                university of *
Search for a range of numbers                    ..               university murcia 2010..2019
Group terms or search operators                  ()               “university of (murcia|cartagena)”
Search within a given domain                     site:            university murcia site:um.es
Search for a certain file type                   filetype:        university murcia filetype:pdf
Search in page titles                            intitle:         university intitle:umu
Search in URLs                                   inurl:           university inurl:um
Search in the text of the pages                  intext:          university intext:murcia
Search the most recent cached version of a page  cache:           cache:um.es
TABLE 2: Some Google/Bing filters for advanced search
Social Network  Type                Scope                 Main potential for OSINT
4chan           Online community    Worldwide             Users interested in illicit activities
Badoo           Dating              Worldwide             Intimate and personal details
Cloob           Social connections  Iran                  Personal profile, posting and community membership
Draugiem        Social connections  Latvia                Personal profile, publications in blogs, group membership
Facebook        Social connections  Worldwide             Personal profile, preferences and places visited
Facenama        Social connections  Iran                  Personal profile, publications, photos and videos
Flickr          Photo-sharing       Worldwide             Activities, hobbies, places and personal relationships
Instagram       Social connections  Worldwide             Habits, locations and personal relationships
LinkedIn        Business            Worldwide             Professional profile, education, skills and languages
Mixi            Social connections  Japan                 Personal profile, interests and opinions
Odnoklassniki   Social connections  Mainly Russia         Personal profile of adults, past and present friendships
Qzone           Social connections  Mainly China          Personal profile, preferences, habits
Reddit          Online community    Worldwide             User trends, behaviors, and publications
Renren          Social connections  Mainly China          Personal profile of students, friendships and discussions
Taringa!        Social connections  Mainly Latin America  Personal profile, publications and community membership
Tinder          Dating              Worldwide             Intimate and personal details
Tumblr          Photo-sharing       Worldwide             Activities, hobbies, places and personal relationships
Twitter         Social connections  Worldwide             Personal profile, opinions and publications
VKontakte (VK)  Social connections  Mainly Russia         Personal profile, preferences and publications
Weibo           Social connections  Mainly China          Personal profile, opinions and publications
YouTube         Video-sharing       Worldwide             Video content, opinions and comments of subscribers
TABLE 3: Potential of various social networks
Facebook is a social network spread all over the world with millions of users. It could be considered a diary of society, where one can find very valuable personal information for OSINT investigations. The profile of our target can reveal his/her employment, education, age, location, visited places or liked groups, among others. The photos and publications may also help us contextualize the company or person we are investigating, the areas it frequents or the type of activities carried out. In addition, it is also possible to search by location when the real name is not known, being able to ultimately find the profile of our target.
YouTube is a video-based platform where big communities are formed around shared interests. Not only is the content uploaded by a specific user valuable (themes, images, scenes, places, and people appearing in videos), but so are the opinions and comments of subscribers.
Twitter is mainly used for live communication, where it is common to find personal publications along an ordered timeline. Apart from the personal information revealed by the profile, it is particularly interesting to extract the opinions from published tweets, the relationships with followed and follower users or the likes on certain publications. From this type of interaction, an OSINT investigator can infer the orientation of the target on certain issues, the interests and preferences of an organization, or how dangerous a person might be. Additionally, a user-friendly interface³ is available
³ twitter.com/search-advanced
where it is possible to search the whole platform by keywords, exact phrases, hashtags, language, date and so on. We can even restrict searches by users, mentions or replies.
Instagram is also widespread in modern society as a means of sharing photos. The places, persons and activities shown in pictures can also assist us in profiling our target. Location is a quite sensitive piece of data that is frequently shared on this platform. In this sense, we can also mention more specific photo-sharing services like Tumblr or Flickr.
LinkedIn is the most popular site in the context of business-related social networking. It permits searching by real name, company, organization, title or location. In this case, the professional profiles can reveal full contact data, including email addresses and mobile phone numbers. In addition, we can also extract information about employment, education, skills, languages and business relationships.
It is also worth considering dating websites, which are used to contact people in search of a partner. Unlike other social networks, where many users restrict their personal details, more intimate aspects are usually revealed here. For this reason, services like Tinder or Badoo are useful for investigating the background information, personal character, interests, preferences or behaviour of the target.
Finally, it is possible to browse online communities, which are very similar to social networks. The posts and topics of these forums generate interesting interactions to be analyzed by OSINT [64]. Reddit or 4chan are big communities which host countless threads of discussion and opinion where really personal and private information about the target can be identified. However, users of these websites are commonly anonymous. Additionally, it is not rare to find illicit content involving bullying, pornography or threats.
On the other hand, there are also some social networks which are typically used within specific regions. The following services are especially important in some countries.
Qzone, Weibo and Renren are some of the most used social networks in China. The first one is a very customizable platform where users publish blogs, diaries, photos or music which reveal details about the person. The second one has similar features to Twitter, but also includes polls, file sharing and stories (temporary photo and video sharing). The last one is widespread among college students. OSINT investigations whose target is a Chinese person can benefit greatly from these sites.
There are also social networks that interconnect Russian compatriots and Eastern European citizens. In this regard, VKontakte, also known as VK, is very popular. Its functionalities, and even its appearance, are quite similar to those of Facebook. Users are able to stay in touch with friends, participate in online communities, post messages, photos, and videos on private or public pages, and even share files. Another Russian site to highlight is Odnoklassniki, mainly used by adults. In fact, the main purpose of its users is to have an online profile, keep in touch with real-life friends and search for former companions or past friends. In this sense, OSINT can be conducted to discover people-to-people connections from the past to the present.
In Japan, Mixi is a very common social networking site. Apart from typical functionalities, we could highlight the possibility to review products, create personal blogs within the platform, participate in communities or manage music preferences and listening habits.
For Spanish-speaking countries, especially in Latin America, Taringa! is a well-known social platform for sharing photos, videos and news with friends. In addition, users are able to create communities, play online games or share music.
Finally, due to the existing censorship of external services, the most popular local social networks in Iran are Facenama and Cloob. The first is mainly used for sharing posts, photos and videos, whereas the second includes community discussions, photo sharing, posting or chat rooms. A local network is likewise widely used in Latvia, where Draugiem serves to share content and communicate online.
C. EMAIL ADDRESS TECHNIQUE
Searching by a person’s real name can be frustrating due to potentially duplicated names, so it is sometimes worth starting from an email address, which is unique and yields much better results at a faster pace. There are some interesting OSINT services, as shown in TABLE 4, that take an email address as input.
First of all, Hunter can be used to determine whether an email address is valid or not. Then, Have I Been Pwned informs whether a given email address is contained in public breaches (and has therefore been compromised at some point). In particular, it is worth mentioning that the investigator can browse the list of sites where the email address was compromised. These sites are potential sources for finding public information about the owner. Another worthwhile page is Pipl, which works really well to find information about the owner of an email address, such as the real name, usernames, address, telephone number, education, professional career, etc.
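The breach lookup described above can also be automated. The following sketch only builds the HTTP request and does not send it. It assumes the version 3 REST endpoint of Have I Been Pwned (breachedaccount), which expects an API key in the hibp-api-key header and a user agent; the function name, the placeholder key and the user agent string are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed endpoint of the Have I Been Pwned API (version 3)
HIBP_ENDPOINT = "https://haveibeenpwned.com/api/v3/breachedaccount/"


def build_breach_request(email, api_key):
    """Prepare (but do not send) a breach lookup for an email address."""
    account = quote(email.strip(), safe="")  # percent-encode '@' and others
    return Request(
        HIBP_ENDPOINT + account + "?truncateResponse=false",
        headers={
            "hibp-api-key": api_key,          # placeholder, a real key is needed
            "user-agent": "osint-sketch",     # hypothetical client name
        },
    )


request = build_breach_request(" jane.doe@example.org ", api_key="YOUR-API-KEY")
print(request.get_method(), request.full_url)
```

Passing the prepared object to urllib.request.urlopen would perform the actual query, subject to the terms of use and rate limits of the service.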
D. USERNAME TECHNIQUE
The nicknames used for online services are also a good way to collect information regarding a person, as shown in TABLE 5. These services allow an investigator to automatically check a username on several websites at the same time in order to identify more sources of information.
The services KnowEm, Name Chk, Name Checkr and User Search verify the presence of a given username on the most popular social networks and domains.
NameVine, in turn, provides an interesting feature that helps when trying to guess an exact username. Concretely, it suggests profiles on the top ten social networks which partially match the given username. This real-time solution offers a fast verification of username variants (for instance, changing the final number of the nickname) instead of repeatedly launching time-consuming queries with other services.
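The idea of checking username variants by changing the final number of a nickname can be sketched as follows. This is not how NameVine works internally, which is not documented here; the function name, the span parameter and the sample nickname are assumptions for illustration.

```python
import re


def username_variants(username, span=2):
    """Return variants of a nickname obtained by shifting its trailing number."""
    match = re.fullmatch(r"(.*?)(\d+)", username)
    if not match:
        return [username]  # no trailing number, nothing to vary
    stem, digits = match.groups()
    number, width = int(digits), len(digits)
    lowest = max(0, number - span)
    # Keep the original zero padding, e.g. "07" stays two digits wide
    return [f"{stem}{n:0{width}d}" for n in range(lowest, number + span + 1)]


variants = username_variants("jdoe87")
print(variants)
```

Each generated variant can then be submitted to any of the username services listed in TABLE 5.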
Email address OSINT service  URL                 Main output
Hunter                       hunter.io           Validity and availability
Have I Been Pwned            haveibeenpwned.com  Appearance in public data breaches
Pipl                         pipl.com            Personal information about the owner
TABLE 4: Utility of the OSINT services belonging to the email address technique
Username OSINT service  URL             Main output
KnowEm                  knowem.com      Presence in social networks, domains and online communities
Name Chk                namechk.com     Presence in social networks, domains and online communities
Name Checkr             namecheckr.com  Presence in social networks, domains and online communities
User Search             usersearch.org  Presence in social networks, domains and online communities
NameVine                namevine.com    Suggestions of alternative similar usernames
Lullar                  com.lullar.com  Availability in social networks
TABLE 5: Utility of the OSINT services belonging to the username technique
The website Lullar uses a different approach. It automatically generates URLs to visit the username profile in different social networks without checking whether they exist. If a link works, then the profile exists on that social network, whereas a broken link means the opposite. In addition to speeding up manual checking, the most useful application would be to explore possible usernames when the one we have is questionable or partial. When the initial URL fails, similar or alternative users are often listed by the social networks, which can be used to identify the complete existing username.
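The approach of generating profile URLs without checking their existence can be sketched in a few lines. The URL patterns below are assumptions about the public profile addresses of each platform at the time of writing and may change at any moment; the function and variable names are hypothetical and do not reflect the implementation of Lullar.

```python
from urllib.parse import quote

# Assumed public profile URL patterns (may change at any time)
PROFILE_PATTERNS = {
    "Twitter": "https://twitter.com/{username}",
    "Instagram": "https://www.instagram.com/{username}/",
    "Reddit": "https://www.reddit.com/user/{username}",
    "Flickr": "https://www.flickr.com/people/{username}",
}


def candidate_profile_urls(username):
    """Build one candidate profile URL per network, without visiting any of them."""
    handle = quote(username.strip(), safe="")
    return {
        network: pattern.format(username=handle)
        for network, pattern in PROFILE_PATTERNS.items()
    }


urls = candidate_profile_urls("jdoe87")
for network, url in urls.items():
    print(f"{network}: {url}")
```

Whether each candidate actually resolves to an existing profile still has to be verified by the investigator, as described above.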
E. REAL NAME TECHNIQUE
Searching for a target’s real name could also yield good results, as shown in TABLE 6. Apart from social networks, particular services are capable of revealing home addresses, telephone numbers, email accounts and usernames, among others.
We could highlight Pipl as the website that returns the most information given a first and last name. Since the same real name may yield multiple results, it is possible to refine the search by including additional aspects of the person such as email, phone, country, state, city, username or age.
That’s Them also offers a remarkable output containing phone number, email address, residence, associated IP address, economic situation, education, occupation or language. Another well-known service is Spokeo, whose free version is limited to showing full name, gender, age, previous cities and states of residency, and relatives. More detailed information about the target requires paying for a premium subscription, which is out of our scope. Similar services would be Fast People Search, Nuwber, Cubib or Peek You.
The aforementioned services work correctly for the United States, but if we want to apply OSINT to a target that lives in another country, the use of Yasni is more appropriate. However, the results obtained are only links related to social networks, addresses and personal contacts, education, and miscellany.
Genealogy services like Family Search, Family Tree Now, GENi, or True People Search cover another point of view in searches by providing kinship information. Discovering the family links of our target broadens the amount of information we can unveil, in this case indirectly.
F. LOCATION TECHNIQUE
Researching the locations that our target frequents can give us indications of his/her habits and context. It is also interesting to know the geographic location of a company or the place where an event occurred. In this sense, images, addresses and GPS coordinates are worthwhile data to obtain. TABLE 7 shows some services which are particularly designed for these purposes.
Google Maps, Wikimapia or Bing Maps are well-known sites to find out locations from GPS coordinates. Conversely, it is also possible to obtain the GPS coordinates of a location name at GPS Coordinates.
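Coordinates found in documents or image metadata are often expressed in degrees, minutes and seconds, whereas map services are usually queried with decimal degrees. The sketch below converts between both notations and builds a lookup address; the helper names and the sample coordinates are hypothetical, and the URL format (a search path with api and query parameters) is assumed to follow the publicly documented Google Maps URL scheme.

```python
from urllib.parse import urlencode


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention
    return -value if hemisphere in ("S", "W") else value


def maps_url(lat, lon):
    """Build a map lookup URL for a coordinate pair (assumed URL format)."""
    params = {"api": 1, "query": f"{lat:.6f},{lon:.6f}"}
    return "https://www.google.com/maps/search/?" + urlencode(params)


# Hypothetical coordinates used only for illustration
lat = dms_to_decimal(37, 59, 30, "N")
lon = dms_to_decimal(1, 7, 48, "W")
location_url = maps_url(lat, lon)
print(location_url)
```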
Note that the images offered by the aforementioned services are continuously updated. However, we could be interested in retrieving old images of past situations. Historic Aerials, Terra Servers or Land Viewer incorporate historic imagery functionalities precisely to discover past and outdated views of locations.
G. IP ADDRESS TECHNIQUE
IP addresses are obtained from cyberattack investigations, email addresses or connections over the Internet. They are also crucial for digital forensic analysis in order to collect as much information as possible from an incident. TABLE 8 summarizes some services which facilitate these tasks.
The service IP Location obtains, from a given IP address, high-level aspects such as location (latitude and longitude), country, region, city, domain name or ISP (Internet Service Provider). If we are interested in specific facts, the website ViewDNS provides more technical information apart from the IP location. In particular, it includes services for displaying registration information about the associated domain name, showing additional domains hosted on the IP address, discovering common ports that may be open and the services running on them, or seeing the network path from ViewDNS to the
Real name OSINT service  URL                   Main output
Pipl                     pipl.com              Personal information
That’s Them              thatsthem.com         Personal details, education, professional career, skills, locations, and relatives
Spokeo                   spokeo.com            Personal details, education, professional career, skills, locations, and relatives
Fast People Search       fastpeoplesearch.com  Personal details, education, professional career, skills, locations, and relatives
Nuwber                   nuwber.com            Personal details, education, professional career, skills, locations, and relatives
Cubib                    cubib.com             Personal details, education, professional career, skills, locations, and relatives
Peek You                 peekyou.com           Personal details, education, professional career, skills, locations, and relatives
Yasni                    yasni.com             Social network profiles
Family Search            familysearch.org      Kinship information, relatives
GENi                     geni.com              Kinship information, relatives
Family Tree Now          familytreenow.com     Kinship information, relatives
True People Search       truepeoplesearch.com  Kinship information, relatives
TABLE 6: Utility of the OSINT services belonging to the real name technique
Location OSINT service  URL                  Main output
Google Maps             google.com/maps      Locations from GPS coordinates
Wikimapia               wikimapia.org        Locations from GPS coordinates
Bing Maps               bing.com/maps        Locations from GPS coordinates
GPS Coordinates         gps-coordinates.net  GPS coordinates from location
Historic Aerials        historicaerials.com  Historic images of the past
Terra Servers           terraserver.com      Historic images of the past
Land Viewer             eos.com              Historic images of the past
TABLE 7: Utility of the OSINT services belonging to the location technique
IP address OSINT service  URL                              Main output
IP Location               iplocation.net                   Location, domain and ISP
ViewDNS                   viewdns.info                     Technical network-based information
That’s Them               thatsthem.com/reverse-ip-lookup  Individual or company information
I Know What You Download  iknowwhatyoudownload.com         Torrent files
TABLE 8: Utility of the OSINT services belonging to the IP address technique
target IP address and analyzing the associated networks, routers, and servers.
Nevertheless, the previous resources provide data that is not sensitive or personal in nature. In contrast, That’s Them does offer interesting information about people, home addresses, companies, or email addresses related to the given IP address.
Another powerful service providing personal information is I Know What You Download. This service monitors online torrents and discloses the files associated with any collected IP addresses. The files downloaded by our target could reveal really sensitive information about his/her behaviour or interests.
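Before submitting an address to any of the services above, it can be useful to check locally whether it is publicly routable, since private or reserved addresses cannot be geolocated or attributed by external services. The following sketch relies only on the ipaddress module of the Python standard library; the function name and the sample addresses are chosen for illustration.

```python
import ipaddress


def triage_ip(text):
    """Summarize basic properties of an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(text.strip())
    return {
        "version": ip.version,
        "is_global": ip.is_global,    # publicly routable on the Internet
        "is_private": ip.is_private,  # e.g. internal networks such as 192.168.0.0/16
        "reverse_pointer": ip.reverse_pointer,  # name used for reverse DNS (PTR) queries
    }


public_info = triage_ip("8.8.8.8")
internal_info = triage_ip("192.168.1.10")
print(public_info)
print(internal_info)
```

Only addresses reported as global are worth forwarding to geolocation or reverse lookup services; the reverse pointer name can be used for a PTR query with any DNS client.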
H. DOMAIN NAME TECHNIQUE
A typical point of interest in OSINT investigations is web pages. They can reveal interesting information about our target, whether we are dealing with a person or a company. It is worth noting that the majority of the techniques explained for IP addresses are also suitable in this context. In addition to them, we can highlight some other services, as presented in TABLE 9.
DNS Trails extracts DNS records, but also identifies the number of additional domains that are related to the encountered results. To this extent, it is a very helpful way to find relationships and connections. Whoisology also shows a cross-reference view based on the owner name, address, telephone number or email address.
Another powerful service is the Wayback Machine, which periodically archives many websites from the whole Internet. This allows an investigator to analyze the evolution and changes of a website by viewing snapshots captured on particular dates.
Furthermore, it is possible to visualize domain connections through Visual Site Mapper or Threat Crowd. Checking DNS and mail servers is also possible by visiting Whois, which also offers a ping functionality for checking connectivity and a traceroute functionality to study the data path to the given domain. There are also services like Alexa and SimilarWeb, which calculate traffic statistics, and others like FindSubdomains, which search for subdomains.
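Archived snapshots of a website can also be located programmatically. The sketch below assumes the availability endpoint of the Internet Archive (archive.org/wayback/available), which returns a JSON document describing the closest snapshot to a requested date. No request is sent here: the sample payload is invented for illustration and only mirrors the expected structure, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed availability endpoint of the Wayback Machine
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(domain, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": domain}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return (timestamp, url) of the closest archived snapshot, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None


# Illustrative payload, not a real response from the service
sample = json.loads("""
{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20100102030405/http://um.es/",
    "timestamp": "20100102030405",
    "status": "200"}}}
""")

lookup = availability_url("um.es", "20100101")
snapshot = closest_snapshot(sample)
print(lookup)
print(snapshot)
```

Repeating the lookup for several dates gives a coarse timeline of the snapshots that are available for a domain.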
Domain name OSINT service  URL                            Main output
DNS Trails                 securitytrails.com/dns-trails  DNS records and related domains
Whoisology                 whoisology.com                 Personal or company information
Wayback Machine            web.archive.org/web            Backups of websites
Visual Site Mapper         visualsitemapper.com           Map of subdomains
Threat Crowd               threatcrowd.org                Map of subdomains
Whois                      who.is                         Registration info and DNS records
Alexa                      alexa.com                      Traffic statistics
SimilarWeb                 similarweb.com                 Traffic statistics
FindSubdomains             findsubdomains.com             Subdomains
TABLE 9: Utility of the OSINT services belonging to the domain name technique
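OSINT tool: FOCA
  Input (identity data): ✗
  Input (network data): Domain
  Input (file data): File name, Folder
  Input (selectable data source): Google, Bing, DuckDuckGo
  Output: Identity info, Network info, File info
  Extensibility: ✗
  Interface: Stand-alone program
  Platform: Windows
  Other features: Server discovery module
OSINT tool: Maltego
  Input (identity data): Personal information, company, community
  Input (network data): Domain
  Input (file data): File URL
  Input (selectable data source): ✗
  Output: Identity info, Network info, File info
  Extensibility: Custom transforms
  Interface: Stand-alone program
  Platform: Linux, Windows, MAC
  Other features: Location, Auto input/output refeed, Results in oriented graph
OSINT tool: Metagoofil
  Input (identity data): ✗
  Input (network data): Domain
  Input (file data): File type
  Input (selectable data source): ✗
  Output: Network info, File info
  Extensibility: ✗
  Interface: Command line
  Platform: Linux, Windows
  Other features: Option to narrow results
OSINT tool: Recon-NG
  Input (identity data): Personal information
  Input (network data): Domain
  Input (file data): ✗
  Input (selectable data source): Several
  Output: Identity info, Network info, File info
  Extensibility: ✗
  Interface: Command line
  Platform: Linux
  Other features: Location, Modules for discovery and exploitation
OSINT tool: Shodan
  Input (identity data): Country, City, Keyword
  Input (network data): Operating system, IP Address, Port, Host name
  Input (file data): ✗
  Input (selectable data source): ✗
  Output: Network info
  Extensibility: ✗
  Interface: Web interface
  Platform: Online
  Other features: Location, Webcam captures
OSINT tool: Spiderfoot
  Input (identity data): Email, Real name, Phone Number
  Input (network data): Domain, IP Address, Subnet, Host name
  Input (file data): ✗
  Input (selectable data source): Several
  Output: Network info
  Extensibility: Custom modules
  Interface: Web interface
  Platform: Linux, Windows, MAC
  Other features: Different types of scan, Results in oriented graph
OSINT tool: The Harvester
  Input (identity data): Company
  Input (network data): Domain, DNS server
  Input (file data): ✗
  Input (selectable data source): Several
  Output: Identity info, Network info
  Extensibility: ✗
  Interface: Command line
  Platform: Linux, Windows, MAC
  Other features: Results in reports, Option to narrow files and results
OSINT tool: IntelTechniques
  Input (identity data): Personal information, company, community
  Input (network data): Domain, IP Address
  Input (file data): File name, File type, File URL
  Input (selectable data source): Several
  Output: Identity info, Network info
  Extensibility: ✗
  Interface: Web interface
  Platform: Online
  Other features: Location, Public records, OSINT virtual machine
TABLE 10: Main features of the selected OSINT tools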
VI. OSINT TOOLS
A manual use of some techniques would be enough for basic searches. Unfortunately, using a few services might not be effective for challenging investigations. In this sense, the potential of OSINT lies in using as many services as possible in a concatenated fashion. Following the workflows repeatedly will extend the available information to put all the pieces of the puzzle together. However, it is not practical for the end user to manually combine several OSINT techniques and their associated services. Such a tedious task would entail lengthy research processes.
For this purpose, researchers and developers have implemented more precise tools that apply OSINT techniques automatically and gather better-quality information from many different sources, implementing several workflows internally and, as a consequence, obtaining more rewarding information and better inferences.
TABLE 10 presents the main features of the most popular and relevant OSINT tools today. We indicate the type of inputs and outputs they allow, the capability of including custom functionalities, the type of user interface, the platform on which they run and other interesting miscellaneous features. Nevertheless, there are many other OSINT applications, which can be accessed at OSINT framework⁴.
A. FOCA
The main contribution of FOCA⁵ (Fingerprinting Organizations with Collected Archives), designed by ElevenPaths, is the extraction and analysis of the metadata present in electronic documents. This application can be used both for local files stored on our computer and for external documents that are downloaded from a specified webpage using three different search engines (Google, Bing, and DuckDuckGo).
⁴ osintframework.com
⁵ https://www.elevenpaths.com/es/labstools/foca-2
FOCA supports a wide variety of formats, such as Microsoft Office, PDF, Open Office, Adobe InDesign and SVG files. The application extracts the hidden information from the files and processes it to show the user the relevant aspects. The details discovered in this way include the names of computers related to the documents, the location where the documents were created, the operating systems used, real names and email addresses of related users, data about the servers, the creation dates of the documents and the IP address ranges of internal networks. As a result, a network map can be drawn from the extracted metadata to perform reconnaissance on the target.
FOCA additionally includes a server discovery module to complement the metadata analysis of documents. The techniques used in this module include: (i) Web Search, which searches for hosts and domain names through URLs associated with the given domain; (ii) DNS Search, which discovers new hosts and domain names through the NS, MX and SPF servers; (iii) IP Resolution, which obtains the IP addresses of the hosts found through the DNS; (iv) PTR Scanning, which finds more servers in a discovered network segment; and (v) Bing IP, which extracts new domain names associated with the IP addresses found.
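As an illustration of technique (iv), a PTR scan can be sketched in a few lines of Python. This is a minimal sketch and not FOCA's implementation: the function names `reverse_lookup` and `ptr_scan` are assumptions introduced here, and the default lookup simply relies on the system resolver through the standard `socket` module.

```python
import ipaddress
import socket
from typing import Callable, Dict, Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the PTR host name of an IP address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def ptr_scan(segment: str,
             lookup: Callable[[str], Optional[str]] = reverse_lookup) -> Dict[str, str]:
    """Query the PTR record of every host address in a network segment."""
    found = {}
    for address in ipaddress.ip_network(segment).hosts():
        name = lookup(str(address))
        if name is not None:
            found[str(address)] = name
    return found
```

The lookup function is a parameter so that the scan can be exercised against a fixed table instead of a live DNS server; calling `ptr_scan` with only a segment in CIDR notation uses the system resolver.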
This tool is commonly used in the security sector, as it supports penetration tests against a company. It often produces very good results because companies do not usually clean the metadata of the files they publish online.
B. MALTEGO
Maltego⁶ is a well-known application that automatically finds public information about a given target across different sources (DNS records, Whois records, search engines, social networks, various online APIs, file metadata, etc.). The relationships between the items of interest found are represented as a directed graph for analysis. This tool defines four main concepts:
• Entity: a node of the graph that represents a discovered piece of information. Some default entities are real name, email address, username, social network profile, company, organization, website, document, affiliation, domain, DNS name and IP address. Custom entities can also be defined for a specific investigation.
• Transform: a piece of code that is applied to an entity to discover a new linked entity. For example, the transform “To IP Address”, which resolves a DNS name to an IP address, could be applied to the domain name entity “um.es” to create a new IP address entity “155.54.212.103”. Applying further transforms recursively propagates the search. Apart from the default transforms, it is also possible to implement and include custom ones for more specific purposes.
• Machine: a set of transforms that are defined together and executed in order, to automate and chain long search processes.
• Hub Item: a group of transforms and entity types packaged so that users of the community can reuse them. By default, Maltego ships the hub item called “Paterva CTAS”, which contains the entities, transforms and machines maintained by the official developers. In addition, it is possible to create and install third-party hub items.
⁶ https://www.paterva.com/web7/buy/maltego-clients.php
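The entity, transform and machine concepts can be pictured with a small graph-building sketch. It does not use Maltego's API: the names `Entity`, `run_machine` and `to_ip_address` are assumptions made for illustration, the "machine" is simplified to applying every transform until no new entity appears, and the toy transform is backed by a fixed table holding the example from the text rather than a real DNS query.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Entity:
    kind: str    # e.g. "domain" or "ip_address"
    value: str


# A transform maps one entity to zero or more newly discovered entities.
Transform = Callable[[Entity], Iterable[Entity]]


def run_machine(seed: Entity, transforms: List[Transform]
                ) -> Tuple[Set[Entity], Set[Tuple[Entity, Entity]]]:
    """Apply every transform to every discovered entity until nothing new appears."""
    nodes, edges, pending = {seed}, set(), [seed]
    while pending:
        current = pending.pop()
        for transform in transforms:
            for found in transform(current):
                edges.add((current, found))   # directed edge: source -> discovery
                if found not in nodes:
                    nodes.add(found)
                    pending.append(found)
    return nodes, edges


def to_ip_address(entity: Entity) -> Iterable[Entity]:
    """Toy stand-in for a "To IP Address" transform, backed by a fixed table."""
    table = {"um.es": "155.54.212.103"}
    if entity.kind == "domain" and entity.value in table:
        yield Entity("ip_address", table[entity.value])
```

Running `run_machine(Entity("domain", "um.es"), [to_ip_address])` returns the two nodes and the single directed edge that link the domain to its IP address.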
C. METAGOOFIL
Metagoofil⁷ works similarly to FOCA. It is an information-gathering tool that downloads the public files found in a target domain or URL and extracts their metadata. It generates a report useful for pentesters, with usernames, real names, software versions and server or machine names. It can also find further documents that could contain resource names.
Although it is a command-line tool, it offers several options that are useful for OSINT investigations. Apart from specifying the target domain or the local folder to analyze, Metagoofil allows filtering by file type (pdf, doc, xls, ppt, odp, ods, docx, xlsx, pptx), limiting the number of results to search and the number of documents to download, setting the working directory where downloaded files are saved, and selecting the file to which the output is written.
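To give an idea of what metadata extraction involves, the following sketch reads the author-related fields of an Office Open XML document (docx, xlsx, pptx), which is a ZIP archive whose core properties are stored in `docProps/core.xml`. It is not the code of Metagoofil or FOCA; the function name `docx_core_metadata` and the choice of fields are assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def docx_core_metadata(data: bytes) -> dict:
    """Read author-related fields from the core properties of an OOXML file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    fields = {"creator": "dc:creator", "last_modified_by": "cp:lastModifiedBy"}
    result = {}
    for key, path in fields.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            result[key] = node.text
    return result
```

The function takes the raw bytes of a downloaded file, so it can be applied to every document collected for a target domain and the resulting user names aggregated into a report.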
D. RECON-NG
Recon-NG⁸ is a web reconnaissance framework similar to Metasploit⁹. It presents a command-line interface in which the user selects a module, which is essentially an OSINT resource, sets its parameters if necessary and launches the process. The results of the searches are continuously saved in a workspace, which in turn feeds the next rounds of the process.
This tool includes several independent modules that implement different functionalities. For example, the modules Bing Domain Web and Google Site Web search the Bing and Google search engines, respectively, for hosts connected to the domains of the workspace; PGP Search scans the stored domains to find email addresses associated with public PGP keys; Full Contact gathers users and their social network profiles from its database based on the stored contacts; and Profiler searches for additional online services that hold accounts with the same usernames as those in the workspace.
Recon-NG continuously aggregates all the information obtained in a local database. In this way, the user directs the research by selecting the appropriate module, and the tool automates the generation of knowledge from there. The system scales remarkably well for complex investigations.
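The interplay between modules and the workspace can be sketched as follows. This is not Recon-NG's module API: the workspace is modelled as a plain dictionary of sets, and `run_modules` and the `hosts_from_domains` module are hypothetical names introduced only to show how the output of one round becomes the input of the next.

```python
from typing import Callable, Dict, List, Set

Workspace = Dict[str, Set[str]]             # table name -> stored values
Module = Callable[[Workspace], Workspace]   # reads the workspace, returns findings


def run_modules(workspace: Workspace, modules: List[Module]) -> int:
    """Run each selected module once and merge its findings into the workspace."""
    added = 0
    for module in modules:
        for table, values in module(workspace).items():
            new = set(values) - workspace.setdefault(table, set())
            workspace[table] |= new
            added += len(new)
    return added


def hosts_from_domains(workspace: Workspace) -> Workspace:
    """Hypothetical module: guess a "www" host for every stored domain."""
    return {"hosts": {"www." + d for d in workspace.get("domains", set())}}
```

`run_modules` returns the number of new records, so an investigator (or a driver loop) can keep launching rounds until a round adds nothing.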
⁷ https://github.com/laramies/metagoofil
⁸ https://bitbucket.org/LaNMaSteR53/recon-ng/wiki/browse
⁹ https://www.metasploit.com/
E. SHODAN
Shodan¹⁰ is a search engine that provides public information about Internet-connected nodes, including IoT devices. These nodes include servers, routers, online storage devices, surveillance cameras, webcams and VoIP systems, amongst others. The data is collected through protocols such as HTTP or SSH, and the user can search by IP address, organization, country name or city.
This tool is mainly used for network security (to find devices exposed to the outside or to detect vulnerabilities in publicly available services), the Internet of Things (to monitor the growing usage of smart devices and their location around the world) and ransomware tracking (to measure the spread of infections caused by this type of attack). It allows downloading the results in JSON, CSV or XML format, as well as generating user-friendly reports.
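Once results have been downloaded in JSON, they can be post-processed offline. The sketch below filters a file with one JSON record per line by country and port. The field names `ip_str`, `port` and `location.country_name` are assumptions about the export layout and should be checked against an actual download; `filter_banners` is a name introduced here for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


def filter_banners(lines: Iterable[str], country: Optional[str] = None,
                   port: Optional[int] = None) -> Iterator[dict]:
    """Yield the records of a JSON-lines export that match the given criteria."""
    for line in lines:
        if not line.strip():
            continue                       # skip blank lines
        record = json.loads(line)
        location = record.get("location") or {}
        if country is not None and location.get("country_name") != country:
            continue
        if port is not None and record.get("port") != port:
            continue
        yield record
```

Because the function accepts any iterable of lines, it can be fed directly with an open file object.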
In addition to the mentioned functionality, there are two
premium services, namely: Shodan Maps (maps.shodan.io),
permitting investigations based on locations, and Shodan Images
(images.shodan.io) displaying collected images from
public devices.
F. SPIDERFOOT
Spiderfoot¹¹ is another reconnaissance tool that automatically queries a large number of public data sources to compile information. The input can be an IP address, subnet, domain name, e-mail address, host name, real name or phone number. The results are represented as a graph of nodes with all the entities and relationships found.
Depending on the type of input, this tool autonomously selects the modules (equivalent to Maltego transforms) to activate for a more effective reconnaissance. It also takes into account the level of search selected by the user. Spiderfoot offers four types of scan: (i) Passive, which collects as much information as possible without touching the target site, so that the target does not notice the scan; (ii) Investigate, which conducts a basic scan to assess whether the target is malicious; (iii) Footprint, which identifies the network topology of the target and gathers information from the web and search engines, sufficient for standard investigations; and (iv) All, which is advisable for detailed investigations despite taking a long time to complete, as it queries every available resource related to the target.
This tool can be used in penetration tests to reveal data leaks and vulnerabilities, in red team exercises, or to support threat intelligence. In addition, it is possible to program custom Spiderfoot modules.
G. THE HARVESTER
The Harvester¹² collects public information related to a domain or company name through search engines. In particular, it can list the emails and host names of the company, as well as the subdomains, IP addresses and URLs related to the domain. It can also export the results as user-friendly HTML or XML. This resource is used in the early stages of a penetration test.
This tool is operated from the console and offers two options for scanning the target website. On the one hand, The Harvester is the original script, which provides the list of related email addresses; on the other hand, EmailHarvester improves the procedure by digging deeper for better results.
¹⁰ https://www.shodan.io
¹¹ https://www.spiderfoot.net
¹² https://github.com/laramies/theharvester
H. INTELTECHNIQUES
IntelTechniques¹³ is a tool, created by Michael Bazzell, which offers hundreds of online search utilities grouped by technique.
The investigator selects the services to be used and the tool automatically creates the associated query links, which the user then opens in the browser to launch the queries. However, the visualization and collection of the information remain manual.
Although it does not integrate the services automatically, we consider IntelTechniques an OSINT tool because it facilitates launching searches against a wide range of services from a centralized platform.
Unfortunately, this tool ceased to be free and closed its open access in July 2019 due to constant cyberattacks.
I. OSINT TOOLS COMPARISON
Depending on the user's needs (see TABLE 10), some tools will be more suitable than others for a given task.
If the intention is to extract hidden information from files, FOCA and Metagoofil are specifically designed for this purpose. The former appears to be more complete, mature and powerful than the latter: apart from the metadata analysis of files, FOCA offers additional functionalities that complement the hidden information and, as a result, it is able to infer more knowledge about the target.
If we are looking for network information, Shodan, Spiderfoot and The Harvester are the recommended options. On the one hand, we would suggest Spiderfoot to analyze the topology of the target and retrieve internal (but public) information about the target organization. On the other hand, we would complete the results with Shodan to include specific information about IoT devices, surveillance cameras, webcams, VoIP systems, or smart services in general.
Last but not least, if the aim is to gather as much information as possible from a given input, Recon-NG and Maltego are the most complete resources and will return diverse data and relationships. The former contains many modules and interacts with a local database that grows during the investigation, which makes it an ideal framework for penetration tests, the prevention of phishing and social engineering attacks, or even the profiling of a person. Alternatively, if we want to avoid the command line and opt for a more user-friendly interface, Maltego is a good choice for OSINT activities. It implements automated inference processes with transforms that broaden the scope of the original search. Moreover, it is extensible with custom discovery procedures.
¹³ https://inteltechniques.com
Although the comparison above is organized by desired output, in practice the user will be restricted by the available input and the data types accepted by the chosen OSINT tools. Finally, note that these tools are complementary and not mutually exclusive, meaning that a deep and thorough OSINT investigation could profit from several of them at the same time. Although some of them may produce similar results for a given search, a particular tool can always find details that the others miss.
VII. INTEGRATION OF OSINT IN CYBERATTACK INVESTIGATIONS
Implementing mechanisms for detecting and responding to cyberincidents is a necessity today. Companies and organizations, which are increasingly exposed on the Internet, invest in cybersecurity to protect their assets against criminals. It is therefore essential to manage threats and incidents against information systems effectively.
Cyberdefence is not only the deployment of technical solutions such as firewalls, IDSs (Intrusion Detection Systems), IPSs (Intrusion Prevention Systems), SIEMs (Security Information and Event Management) or anti-virus software to block known threats, but also the adoption of cyberintelligence to extract and analyze traces, patterns and conclusions from the incidents. In fact, the continuous cycle of extracting and sharing the evidence, relationships and consequences of incidents is known as threat intelligence [65]. It complements the traditional defence mechanisms with up-to-date information and greatly improves the protection of the infrastructures, the management of the hazards and the effectiveness of the responses [41].
Moreover, the information typically used in forensics and investigations is purely technical. However, the traces left by a cyberattack contain valuable information that should be cross-checked not only against repositories of incidents [66], but also against social networks, forums, media, technical and governmental documents and other public digital sources. These open sources contribute semantic information to the analysis, which helps to compute and reason about more complex and far-reaching inferences. Note that cyberattackers use the Internet for their illegal actions (hacking, phishing, denial of service attacks, botnets, identity theft, intrusions, etc.), but also for personal reasons. In this sense, OSINT can be used to connect all those points.
Several works applying OSINT to cybersecurity focus on proposing defensive improvements against threats. By contrast, they very seldom seek to identify the cyberattackers. OSINT is a source of knowledge that could support the investigation of a cyberattack by going from the smallest details of the malicious action to the root of the problem. This challenge is not new; it is traditionally known as the attribution problem [67]. Concretely, OSINT would allow us to understand the motivation of the cyberattack, to infer the procedure and, ultimately, to profile the perpetrator.
The suggested application of OSINT is illustrated in FIGURE 3. Several methodologies and models have been proposed to define the detection maturity of an organization, which is crucial for extracting evidence from a suffered cyberattack. Nonetheless, there is a lack of standards to represent taxonomies and ontologies in this field [68]. We therefore adopt a modified version of Ryan Stillions’ DML (Detection Maturity Level) model [69] to exemplify this section, although another cyberthreat detection scheme could be used to show the application of OSINT in a similar way.
The DML model represents hierarchically the different levels of abstraction in the detection of cyberattacks. A company that does not invest in cybersecurity will only be able to reach the lowest steps of the stack. By contrast, an organization that is technically skilled in cyberdefence can interpret more complex facts, that is, ascend to levels of greater abstraction.
While the lower levels can be easily covered, the challenge
lies in reaching the higher layers. To this end, we suggest
applying OSINT as a source of intelligence that feeds on the
most basic evidence to arrive at more robust facts:
1) Firstly, we assume that it is possible to cover levels DML-1 and DML-2. The first one, Atomic indicators of compromise (IOC), is composed of details as simple as a string in a modified file, the value of a memory cell or a byte transmitted through the network, which have very low value on their own, but together form the next level. The Host and Network Artifacts layer is built upon the indicators observed during or after the cyberattack, such as IP addresses, domain names, logs, transactions, hash values, or file manipulation details. As this type of data resides in the affected information systems, our framework treats it as the input for collecting associated information from open sources (see SECTION V for more details about OSINT collection). Therefore, the extraction of these traces is the starting point of an OSINT process.
2) Next come levels DML-3 to DML-6. The third level, Tools, consists of detecting the transfer, presence and functionality of the tools used by the attacker. The fourth level, Procedures, is covered if one is able to enumerate the steps performed during the incident. The fifth level, Techniques, captures how the attacker specifically performed the various phases of the attack. The last level of this group, Tactics, is a more abstract concept that takes into account the levels discussed above and derives knowledge by analyzing a set of activities in time and context.
[Figure 3 content: the DML model stack, from bottom to top (0. None or Unknown; 1. Atomic Indicators; 2. Host and Network Artifacts; 3. Tools; 4. Procedures; 5. Techniques; 6. Tactics; 7. Strategy; 8. Goals; 9. Identity), shown alongside the OSINT phases (Collection, Processing, Analysis & Correlation, Intelligence) and the labels "Traces of cyberattack or crime", "Attack execution plan and methods" and "Attacker intention and profile".]
FIGURE 3: OSINT integration with DML model to address the attribution problem
In this case, the information reveals details about the execution of the cyberattack. Such data greatly enriches the analysis phase of the OSINT cycle. The patterns derived from this data, as well as the correlation with other cases already stored, enable a more intelligent and comprehensive analysis. These conclusions should be combined with the results obtained in the collection phase. In this way, the exploration of the network is refined, narrowing the investigation towards the final objective.
3) Finally, the continuous gathering and analysis process of OSINT generates valuable information to which knowledge-extraction techniques are applied. The knowledge extracted with OSINT from levels DML-1 to DML-6 would allow us to reach the highest levels, that is, DML-7 to DML-9. The seventh level, Strategy, refers to a high-level description of the attack planned by the cybercriminal to achieve his/her purposes. The eighth level, Goals, comprises the specific objectives of the attacker and expresses the real motivation of the action. At the top we find the Identity level, which is essentially the name of the person, organisation or even country responsible for the malicious actions. As it is extremely difficult to find such detailed information, the connection with other cyberattacks and the similarity with other events can support a relative attribution [67]. That is, completing the investigation of the current case with additional information about other incidents apparently caused by the same actor brings us closer to the absolute identification of the cyberattacker.
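The three steps above can be summarised in a small data structure. The sketch below encodes the ten DML levels as an enumeration and maps each level to the OSINT phase that the text associates with it. The names `DML` and `osint_phase` and the phase labels are assumptions chosen for illustration, not part of the DML model itself.

```python
from enum import IntEnum


class DML(IntEnum):
    NONE_OR_UNKNOWN = 0
    ATOMIC_INDICATORS = 1
    HOST_AND_NETWORK_ARTIFACTS = 2
    TOOLS = 3
    PROCEDURES = 4
    TECHNIQUES = 5
    TACTICS = 6
    STRATEGY = 7
    GOALS = 8
    IDENTITY = 9


def osint_phase(level: DML) -> str:
    """Map a DML level to the OSINT phase described in the three steps above."""
    if level == DML.NONE_OR_UNKNOWN:
        return "none"
    if level <= DML.HOST_AND_NETWORK_ARTIFACTS:
        return "collection"            # step 1: traces used as OSINT input
    if level <= DML.TACTICS:
        return "analysis"              # step 2: execution details enrich the analysis
    return "knowledge extraction"      # step 3: strategy, goals and identity
```

Using an `IntEnum` keeps the levels ordered, so the maturity reached by an organization can be compared directly, for example `DML.TOOLS < DML.STRATEGY`.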
This application of OSINT represents an innovative line of action in the fight against cyberthreats. The challenge resides in implementing effective collection mechanisms and intelligent analysis procedures to extract the high-level details that cannot be obtained directly from the malicious actions. Such details are the hardest pieces of information to obtain, as they have a very high degree of abstraction and are far removed from the technical details. That is why it is worth looking to open sources for any relationship or pattern that reveals more about the context and originators of an incident. OSINT is the missing piece needed to profile cyberattackers and to improve the detection of sophisticated attacks [70], thanks to the consideration of high-level behavioural aspects from DML-3 to DML-9.
VIII. OSINT IN COUNTRIES AND STATES
OSINT is not only beneficial in the private sector, but is also a resource of public interest for governments. In this regard, SUBSECTION VIII-A argues that OSINT is not a paradigm designed for paranoid analysts or computer geeks, but is of enormous benefit to the national cyberdefence system [71]. Likewise, SUBSECTION VIII-B shows that official authorities not only profit from OSINT results for internal tasks, but also indirectly make the application of OSINT easier for third parties, since they generate large amounts of data accessible to everyone. In this sense, governments are a double-edged sword: they benefit from OSINT but, at the same time, they feed the Internet with really valuable, and sometimes even sensitive, information.
A. INTERNAL STATE AFFAIRS OPERATIONS
Intelligence agencies have traditionally been associated with the work of Law Enforcement Agencies (LEAs) and military bodies. In the same way, OSINT is nowadays considered a key element of classified investigations and secret operations in state affairs [5]. One could safely argue that the exploitation of OSINT can provide critical capabilities for LEAs to complement and enhance their counterintelligence departments in the investigation and strategic planning of the fight against crime [72].
As far as we were able to explore in official websites, reports and documentation, government organizations seem to implement internal mechanisms that essentially consist of gathering raw information and transforming it into useful knowledge by leveraging OSINT [73]. Representative examples include the U.S. Federal Bureau of Investigation (FBI, fbi.gov), the U.S. Central Intelligence Agency (CIA, cia.gov), the Canadian Security Intelligence Service (CSIS, canada.ca/en/security-intelligence-service), the European Union Agency for Law Enforcement Cooperation (EUROPOL, europol.europa.eu), the North Atlantic Treaty Organization (NATO, nato.int), the United States Department of the Army (DA, army.mil), the U.S. Department of Defense (DoD, defense.gov), the U.S. National Security Agency (NSA, nsa.gov) and the European Defence Agency (EDA, eda.europa.eu), amongst others.
Given this uncertainty, we have investigated the particular case of Spanish LEAs, for affinity, to show that official bodies do apply OSINT internally. After a thorough inspection, we can confirm that it is not easy to find clear evidence of the application of OSINT by the state forces. The confidentiality of this type of agency makes it difficult to discover their internal operating mode and the impact of OSINT on their current investigations. Nevertheless, the search yielded some subtle findings which indicate that OSINT is currently used by Spanish LEAs:
• Back in 2007, the director of the CNI (i.e., the Spanish National Intelligence Agency) said¹⁴ that open sources were “fundamental to the elaboration and work of Intelligence”.
• CIFAS (i.e., the Spanish Military Intelligence Agency) also seems to use OSINT as a way of obtaining information. We have found some slides that confirm this, dated as early as 2008, which are available on the Spanish Defense Staff website¹⁵.
¹⁴ https://www.elconfidencialdigital.com/articulo/vivir/CNI-califica-fundamental-abiertas-contradice/20071023000000049386.html
¹⁵ http://www.emad.mde.es/Galerias/EMAD/novemad/fichero/EMD-CIFAS-esp.pdf
• In 2010, when the director of the CNI announced¹⁶ the creation of an ethical code for special agents, he also insisted that modern intelligence was not based only on physical presence, as today “you might get more information sitting on a computer, exploring messages from the bad guys”.
• More recently, in 2017, the Spanish Ministry of Defense opened a public call¹⁷ for the contract called “Development of OSINT tool based on IDOL HAVEN platform”.
• At present, the Spanish Army is designing a new model called Brigade 2035, which incorporates innovative technological advances to enhance operations. In this project¹⁸, one of the defined combat functions is Intelligence, which explicitly names OSINT as a key responsibility: “Other facilities of growing importance will be open source obtainment (including social networking)”.
• The Spanish Ministry of the Interior has published in the Annual Recruitment Plan for 2019¹⁹ some investments in “systems for obtaining OSINT in the cyberspace”.
Bearing in mind all these facts, OSINT currently seems to be relevant in the internal affairs of Spain. Analogously, European Union member states are also highly developed in OSINT [74].
B. OPEN DATA POLICIES AND TRANSPARENCY
OSINT depends on the public data available on the Internet,
among other sources, to be effective. In this regard, apart
from social networks and other open data sources, there
are also authoritative and official sites maintained by state
institutions around the world where public information is
published and, therefore, openly available.
The Open Data Barometer (ODB)²⁰ is a global ranking system designed by the World Wide Web Foundation that measures the readiness, implementation and impact of countries’ open data policies. FIGURE 4 shows the scores of the latest full edition²¹.
As in the previous subsection, we study the specific case of Spain for affinity. In the aforementioned ODB report, Spain is ranked in 11th position. Besides, according to the European Data Portal and its official reports²² on Open Data maturity across Europe, Spain is one of the most advanced countries in transparency and open data: it has been in first or second position in the Open Data Maturity ranking in the last four years. According to these reports, the Spanish Government has promoted more than 160 open data initiatives and has over 23,800 public information
¹⁶ https://www.lavanguardia.com/politica/20100624/53951898847/el-director-del-cni-anuncia-un-codigo-etico-para-los-agentes-secretos.html
¹⁷ https://contrataciondelestado.es/wps/wcm/connect/ff96fa82-7fd6-40bd-be5b-36ef3fd4e65b/DOC_CN2017-498874.pdf?MOD=AJPERES
¹⁸ www.ejercito.mde.es/en/estructura/briex_2035/principal.html
¹⁹ http://www.defensa.gob.es/Galerias/gabinete/ficheros_docs/2019/PACDEF_2019_Documento_Pxblico.pdf
²⁰ https://opendatabarometer.org
²¹ https://opendatabarometer.org/4thedition
²² https://www.europeandataportal.eu/en/dashboard#2018
FIGURE 4: Transparency scores by the 4th edition of Open Data Barometer
catalogues. For example, the Open Data Initiative of the Government of Spain²³ is clear proof of how Spain encourages transparency. OSINT could benefit from that, but it would have to deal with aggregated and statistical information, linking it and inferring new knowledge.
There are also anonymized databases that, a priori, would not be useful for OSINT because they seem to lack the value needed to produce intelligence. However, these so-called anonymous datasets apparently do not break the link between the data and its owner. A recently published algorithm [75] allows 99.98% of Americans to be unequivocally identified from public data. In particular, it is enough to have 15 parameters related to medical, behavioral and socio-demographic information, such as marital status, sex or the zip code of their home. Therefore, OSINT could again be used to re-identify the people recorded in anonymized databases.
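The intuition behind this kind of re-identification can be shown with a simple uniqueness count: the more attributes are combined, the more records become unique and therefore linkable to a single person. The sketch below is not the algorithm of [75]; it only measures, on a given dataset, the fraction of records whose combination of attribute values is unique. The function name `unique_fraction` and the attribute names used with it are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def unique_fraction(records: List[Dict[str, str]], attributes: Sequence[str]) -> float:
    """Fraction of records whose combination of attribute values is unique."""
    if not records:
        return 0.0
    keys = [tuple(r.get(a) for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)
```

Applied to a dataset with columns such as sex, zip code and marital status, the value typically grows as more attributes are passed in, which is why removing names alone does not guarantee anonymity.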
Conversely, there are also governmental platforms which are actually not anonymized. For instance, the Spanish Ministry of the Treasury, the Spanish Ministry of the Interior and the Spanish Ministry of Defense usually publish documents containing personal information (they can be found, for example, with the query “site:hacienda.gob.es filetype:pdf intext:dni”). The same applies to the websites of the Spanish Autonomous Communities.
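Queries of this kind are plain strings built from search operators, so they are easy to generate programmatically for several sites or file types. The helper below is a minimal sketch; `build_query` is a hypothetical name, and it only covers the three operators that appear in the example above.

```python
from typing import Optional


def build_query(site: Optional[str] = None, filetype: Optional[str] = None,
                intext: Optional[str] = None) -> str:
    """Compose a search-engine query from the site, filetype and intext operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intext:
        parts.append(f"intext:{intext}")
    return " ".join(parts)
```

Calling `build_query("hacienda.gob.es", "pdf", "dni")` reproduces the query quoted in the text.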
Moreover, Europe has a public data platform²⁴ too, where a lot of public information can be found. For instance, in the context of foreign policy and security, an updated list of financial sanctions is presented in the “European Union Consolidated Financial Sanctions List” document, which reveals personal information about individuals, groups and entities.
²³ https://datos.gob.es/es
²⁴ http://data.europa.eu/euodp/en/data
All the aforementioned facts show that governments worldwide are adopting strong Open Data policies. As a direct consequence, the amount of objective data available on the Internet is rapidly increasing. OSINT should, in addition to other open sources of information, take advantage of this powerful opportunity to collect, analyze, link and infer knowledge from reliable and official sources. In this scenario, and according to the ODB, countries such as the United Kingdom, Canada, France, the United States, Korea, Australia, New Zealand, Japan, the Netherlands, Norway and Brazil are real OSINT goldmines with characteristics very similar to those discussed for Spain.
IX. OPEN CHALLENGES AND FUTURE TRENDS
The review carried out on OSINT shows that there is already
a substantial amount of work on the topic. Numerous
techniques and tools have been developed up to now. However,
some gaps and limitations in this field must be addressed to
continue exploiting the offered opportunities. It is necessary to
make more sophisticated solutions applicable to the uncontrolled
scenarios of the real world. We have spotted some challenges
that, as far as we know, remain open and should be
faced by the research community in the near future.
A. AUTOMATION OF THE GATHERING PROCESS
The greater the amount of information collected, the more
likely it is to create inferences and relationships. However,
the quantity of public data available today is enormous and
cannot be collected manually [76]. Although OSINT
techniques (Section V) and tools (Section VI) are already
a big step forward in this direction, most of them are still
largely dependent on the end user. In this sense, it would be
appealing to incorporate more sophisticated techniques. We
highlight current big data techniques such as Web crawling
or Web scraping [77] as potential paradigms to automate and
improve the OSINT exploration of high volumes of open
data.
An important aspect of the collection process is the
propagation of the search. The results obtained from searches
should feed the following rounds of gathering. In OSINT it
is really powerful to extract pivots that allow outputs to be
chained as new inputs for propagation. This recursive
method increases the scope of the research and is closely related
to the analysis process that we will discuss next.
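The recursive propagation described above can be sketched as a bounded
breadth-first traversal in which every finding becomes a pivot for the
next round. This is an illustrative sketch only: propagate, fake_lookup
and FAKE_INDEX are hypothetical names, and the in-memory index stands in
for whatever real collectors a tool would call.

```python
from collections import deque


def propagate(seeds, lookup, max_rounds=2):
    """Breadth-first gathering: each finding is reused as a new pivot."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    edges = []  # (pivot, finding) pairs, useful for later analysis
    while queue:
        pivot, depth = queue.popleft()
        if depth >= max_rounds:
            continue  # bound the search so it cannot grow without limit
        for found in lookup(pivot):
            edges.append((pivot, found))
            if found not in seen:
                seen.add(found)
                queue.append((found, depth + 1))
    return seen, edges


# Hypothetical stand-in for real collectors (fabricated example values).
FAKE_INDEX = {
    "alice@example.org": ["alice_dev"],
    "alice_dev": ["example.org", "alice@example.org"],
    "example.org": ["203.0.113.7"],
    "203.0.113.7": ["other.example"],
}


def fake_lookup(pivot):
    return FAKE_INDEX.get(pivot, [])


found, links = propagate(["alice@example.org"], fake_lookup, max_rounds=2)
```

The max_rounds bound and the seen set are design choices of this sketch:
the former limits the scope of each investigation, and the latter prevents
cycles between pivots that reference each other.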
B. ENHANCEMENT OF THE ANALYSIS AND
KNOWLEDGE EXTRACTION PROCESSES
The interpretation of the collected open data is a key point
in the OSINT procedure. Extracting the essence of the scraping
results, establishing relationships between separate pieces of
information, or inferring conclusions that are not explicitly
stated increases the quality of the results. Indeed, better
analysis results provide better inputs for the recursive
propagation into further rounds of investigation.
However, as far as we know, OSINT analysis does not implement
intelligent mechanisms today. The existing tools are
limited to dumping all the information found and its explicit
relationships. Instead, the analysis process should
incorporate semantic analysis, the study of patterns, and correlation
with other events, occurrences or datasets.
Fortunately, modern data mining techniques [78] such
as Natural Language Processing, Social Network Analysis,
Machine Learning or Deep Learning are actually designed
to solve this type of challenge. A proper selection of
algorithms in this field of knowledge will make the difference
between the current static analysis and the future reasoned
processing [79].
Ideally, the OSINT of the future should be able to provide
the end user with the specific piece of information he/she is
searching for, as well as to return convincing answers in
investigations. The original search would also yield not only direct
inferences, but also indirect and non-explicit relationships.
This challenge builds the path between the Second Generation
and the Third Generation of OSINT. As presented
in [1], the Second Generation started with the rise of the Internet
and Social Media, and the challenges were “technical
expertise, virtual accessibility and constant acquisition”. In
contrast, the evolution to the Third Generation is expected
to take place at present and will have to include “direct and
indirect machine processing of data, machine learning, and
automated reasoning”.
C. INTEGRATION OF SEVERAL OPEN DATA SOURCES
OSINT activities should consult as many sources as possible
in order to cover the widest possible spectrum. It is not a
good idea to focus our research on a single social network or
a specific forum. In this sense, success lies in combining data
sources to obtain the best possible results. This means that
the system has to normalize the available information, which
is typically unstructured, in order to perform an effective
analysis and correlation. In doing so, it is important to discard
repeated items. In fact, the different OSINT techniques and
tools explained in this paper actually apply such a setting
to gather the knowledge related to the target.
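The normalization and de-duplication step mentioned above can be sketched
as follows. This is a minimal illustration under stated assumptions: the
functions normalize and merge_sources, the source labels and the sample
items are hypothetical, and real tools would need far richer
normalization than case, accent and whitespace folding.

```python
import unicodedata


def normalize(value):
    """Lower-case, strip accents and collapse whitespace."""
    text = unicodedata.normalize("NFKD", value)
    # Drop the combining marks left behind by the NFKD decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())


def merge_sources(*sources):
    """Merge items from several sources, keeping one entry per normalized item."""
    merged = {}
    for source_name, items in sources:
        for item in items:
            # Record which sources reported each normalized item.
            merged.setdefault(normalize(item), set()).add(source_name)
    return merged


# Hypothetical sources with overlapping, differently formatted items.
forum = ("forum", ["Universidad de Murcia", "OSINT  tools"])
social = ("social", ["universidad de murcia", "Open Data"])
merged = merge_sources(forum, social)
```

Keeping the set of sources per item, rather than only the item, preserves
provenance, which is useful when the same fact has to be corroborated or
when sources later turn out to disagree.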
On the other hand, the real challenge is to incorporate not
only several data sources, but also different types of data sources
[80]. Apart from data extracted from the Internet, Dark Web
and Deep Web, the OSINT workflow should also consider
information collected face to face, through social engineering, or
through citizen collaboration. Any piece of information which
is interesting to our investigation has to be used in order to
achieve the next milestone of the search. Additionally, the
implementation of truth discovery processes is a must for
those cases in which information from different data sources is
contradictory [81].
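As an illustration of the simplest form such conflict resolution could
take, the sketch below chooses, for each attribute, the value backed by
the highest total source reliability. It is not the method of [81]:
truth discovery algorithms typically estimate source reliability
iteratively, whereas here the weights are fixed and assumed. The names
resolve_conflicts, claims and reliability, as well as all values, are
hypothetical.

```python
from collections import defaultdict


def resolve_conflicts(claims, reliability, default_weight=0.5):
    """Per attribute, pick the value with the highest summed source weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for source, attribute, value in claims:
        scores[attribute][value] += reliability.get(source, default_weight)
    return {
        attribute: max(values, key=values.get)
        for attribute, values in scores.items()
    }


# Fabricated claims as (source, attribute, value) triples.
claims = [
    ("registry", "city", "Murcia"),
    ("forum", "city", "Madrid"),
    ("blog", "city", "Madrid"),
    ("forum", "employer", "ACME"),
    ("blog", "employer", "ACME"),
]
# Assumed, fixed reliability weights (not learned from data).
reliability = {"registry": 0.9, "forum": 0.3, "blog": 0.4}

resolved = resolve_conflicts(claims, reliability)
```

In this toy case a single official source outweighs two less reliable
ones for the city attribute, which shows why a plain majority vote is
not always an adequate resolution rule.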
D. FILTERING OUT IRRELEVANT DATA AND
MISINFORMATION
Due to the huge amount of data publicly available, an OSINT
process needs to be capable of distinguishing the relevance
of each piece of information, discarding data which do not
add quality to the investigation [82]. A researcher cannot
focus on exploring the details of an entire website, reading
a multi-page news item or analyzing a complex government
document. Instead, OSINT research needs to extract
keywords which actually provide value and reveal knowledge
about our target. The piece of information we are interested
in may not be explicitly posted, and the challenge would be
to extract the essence of the data source we are scrutinizing.
At the same time, the precise terms extracted serve as pivots
to create new paths of exploration.
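One common way to approximate the keyword extraction described above is
TF-IDF weighting, which favors terms that are frequent in one document
but rare in the rest of the corpus. The sketch below uses only the
standard library; tokenize, top_keywords and the sample documents are
hypothetical, and the paper does not prescribe this particular technique.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(documents, index, k=3):
    """Rank the terms of documents[index] by TF-IDF against the corpus."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    target = tokenized[index]
    term_freq = Counter(target)
    n_docs = len(documents)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Fabricated mini-corpus.
documents = [
    "the ministry published the sanctions list and the sanctions report",
    "the forum thread discussed the weather and the match",
    "the report on the weather",
]
keywords = top_keywords(documents, 0, k=3)
```

Terms that occur in every document receive a score of zero and therefore
never rank above terms specific to the target document, and the returned
keywords could in turn serve as pivots for further rounds of gathering.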
Furthermore, it is crucial to detect misinformation that
would corrupt the results [83]. By nature, the Internet is
subjective and the majority of the content has no guarantee
of being reliable and official. The OSINT community has
to determine whether the increasing reliance on open source
data is still combined with source validation, which
represents a primary requirement and priority [84]. Such
untrue information can divert our search, leading us to erroneous
results or away from our real objective. For that reason, it would
be interesting to analyze not only the objective information,
but also the false information with the aim of extracting
intelligence.
This problem will be present in real-life research. The data
sources where we will find the most valuable information about
suspects will be forums and social networks. On these sites,
the investigator has to deal with opinions, subjective publi-
cations, and personal preferences whose veracity is question-
able [85]. Profiling of persons who in reality do not represent
a threat (false positives) could provoke discriminatory and
unfair attitudes that could affect the victims.
E. EXTENSION ACROSS THE WHOLE WORLD
One of the main drawbacks of many of the existing OSINT
resources is that they only function for specific countries,
reducing their profiling capability to a constrained group of
people belonging to a few nationalities. However, OSINT
should be a universal technique able to reach all corners of
the Earth instantly without discriminating between zones of
cyberspace. Thus, interoperability is a desirable property to be
considered in OSINT design, as it would increase not only
the scope of the searches, but also its usage by end users.
Ideally, a good OSINT service or tool should not distinguish
between countries and should treat each investigation as a global
task, without borders. The OSINT workflow should combine
points of information across the world and correlate those
distributed data sources. In fact, although the relationships
between search zones could be established by hand, the real
challenge lies in OSINT applications implementing these jumps.
In addition, the globalization of the process would not
leave aside appealing open data sources from different
territories which could actually fill the gaps we need to address
in our investigation. In Spain, for instance, we use tools
that are designed in (and for) foreign countries. However,
there are no OSINT solutions which include Spanish public
repositories in the gathering phase (such as government open data
platforms). In this sense, we are not yet fully benefiting
from the goldmine of being one of the most
transparent countries in Europe.
A generic and flexible implementation is especially useful
for nomadic targets for whom mobility is part of their daily
lives. Consider a person who has spent
stages of his life in several countries, companies which
have headquarters on several continents, or even criminals
who change their location to make it more difficult to pursue
them. In these cases, a static search in a particular country
would leave a lot of information uncollected and a lot of clues
unanalyzed.
F. AWARENESS OF PRIVACY, ETHICAL AND LEGAL
CONSIDERATIONS
From an ethical point of view, OSINT must respect the user’s
privacy so as not to harm his private life, as well as the
privacy of his family, friends and co-workers. The fact that
the information is publicly accessible does not mean that it
is not sensitive. Knowing the personal preferences and tastes
of the target can intrude on his privacy. Revealing political
thoughts can have fatal consequences in certain places.
Communicating a sexual orientation can be potentially life
threatening in certain countries. Knowing religious beliefs
can lead to criminal convictions in specific territories. Thus,
open source information has to be handled carefully, for
legitimate purposes, in the interests of society.
From the legal point of view, OSINT should be used on the
basis of a law and respecting data protection policies. With
the advent of the EU GDPR, the regulation concerning
personal data has changed [86]. In this sense, personal data
comprise any information which can relate to any citizen.
Moreover, different pieces of information which, collected
together, can lead to the identification of an individual also
constitute personal data, even if the information is encrypted
or anonymized [14]. A possible solution to address such a
challenge is to adapt the design of OSINT tools to embed normative
constraints, especially GDPR legal requirements [87].
By definition, OSINT is completely legal due to the public
nature of the data sources it uses. Nevertheless, investigators
must not publish the gathered personal information, even if it
is posted on the web. In addition, the user who applies OSINT
must not make the mistake of trying to impersonate the target
in order to find more information. It should also be noted that
authentication barriers must not be broken in order to access
the information we are looking for.
In short, the use of OSINT should be restricted to legal
activities and non-malicious purposes. In principle, OSINT
does not (and should not) violate human freedom and
rights; therefore, its previously mentioned techniques and
services are legal to this extent [88]. It is a really powerful
methodology, but it is also dangerous if misused. Thanks
to OSINT, journalists can provide up-to-date, objective and
quality news. Human resources managers can get to know
job applicants better. Countries’ authorities can
investigate criminal and terrorist groups. A company can
audit its external exposure to cyberthreats. However, opening
the use of OSINT techniques to specific
categories of users should always be properly justified [89].
On the downside, the OSINT end-user could be a delin-
quent trying to commit a crime. A cracker could profile the
target to increase the likelihood of success. A thief could
analyze family members to steal from home at the best
time. An extortionist could publish the private and personal
information of the victim if a ransom is not paid.
Developers have to consider the aforementioned aspects
when implementing OSINT tools. In any case, for our sake,
the most powerful tools should only be available to LEAs and
Intelligence Agencies.
G. BATTLE AGAINST OSINT MISUSE
As already mentioned throughout the previous Sections, the
potential of the OSINT paradigm is quite broad. In fact,
it is possible to take advantage of open data for
cybersecurity and cyberdefence purposes, thus investigating
attackers and/or terrorist groups [90]. Nevertheless, the
exploitation of publicly available data is prone to abuse.
That is, ill-motivated actors may leverage the huge amount
of information in order to commit cyber-aggressions, such
as cyberbullying, cybergossip and cyber-victimization [91].
Unfortunately, those phenomena are alarmingly and
increasingly frequent on the Web, leading the victims to
distress, loneliness, depression, and even to commit suicide in
the worst case [16]. In particular, cybergossip is performed by
a group of people making evaluative comments via digital
devices about somebody who is not present. This cyberbehavior
affects the social group in which it occurs and can hinder peer
relationships, damaging the victim of such a process [92].
To this extent, it is important to ensure that OSINT
techniques and services are used in the correct manner,
without harming others’ rights and freedom [93]. More
specifically, one could consider granting different privileges based on
end-user category, thus avoiding granting full access to the
entire spectrum of information. For example, employees may
have access to basic information in order to enhance their
tasks (e.g., for HR recruitment duties), while government and
police forces