The Use of the Google Search Engine for Accessing Private Information on the World Wide Web

Ş. Ahmet Gürel, Erhan Basri, and Yıltan Bitirim
Department of Computer Engineering, Eastern Mediterranean University,
Famagusta, TRNC
{ahmet.gurel, erhan.basri, yiltan.bitirim}@emu.edu.tr
Abstract. This paper discusses how the search engine Google can be used to access
private information on the World Wide Web. Google hacking, the use of the search
operators of the Google search engine, enables hackers to perform complicated
queries to retrieve information located on misconfigured web servers. In this paper,
private information that might be sensitive is classified, and Google hacking
techniques that can be used to retrieve this information are investigated. It is found
that Google hacking enables access to private information even if it is not
published publicly and is kept unlinked.
Keywords: privacy, private information, Google hacking, Google search operators,
World Wide Web
1 Introduction
Nowadays, the Internet is used as a source of information. Various types of information can
be accessed through the Internet. Some examples are documents, government information,
advertising, and commercial applications. Even though the Internet provides advantages
for information access, such as scientific articles referencing web pages and online
government reports [1], private information such as contact information can also be
accessed on the Internet. The accessibility of private information creates privacy
risks for Internet users. These risks occur because security flaws in operating systems or
web applications, or configuration mistakes on servers connected to the Internet, can
expose the private information of users to the public. In 2002, 9.9 million
users suffered from personal information theft in the USA [2]. Furthermore, the University
of California at Los Angeles (UCLA) notified 800,000 people that their personal records
had been put at risk by a database hack [3].
Since there is a huge number of documents on the web, there is a need to index
and filter these documents according to the information needs of Internet users
[4]. Therefore, different search engines (such as Google, Yahoo, and MSN) have been developed
to index web documents and help users retrieve the most relevant information on the
web [5].
The search engine Google is the most popular search tool and is one of the most
visited web sites [1]. Google, when first established in 1998, had indexed 25 million
web pages, which increased to 8 billion by June 2005, together with 17 million images and 6600
printed catalogs. Nowadays, Google has 25 billion web pages and 1.3 billion images
indexed in its database. Google services handle more than 200 million queries every
day. In 2006, Google held up to 54% of the world search engine market, which
provided Google a net income of 1.47 billion dollars [6]. Today, the search engine is not
the only service that Google provides. Google also has advertisement services, e-mail
services, some mobile services, and web applications and software tools which are used in
the business world [7].
However, the success of Google as a search engine has resulted in certain security
risks for web sites and Internet users. The term “Google hack” refers to the use of Google
by hackers to retrieve certain information [12]. Since Google provides an extensive list of
search options through basic and advanced search operators, an experienced hacker can
make use of these search operators to retrieve sensitive information such as the type of
server being used, administrator passwords, and the server-side programming language [8].
Although research has been done on Google hacking, there is no known study that
investigates the effect of Google hacking techniques on the private information of users or
companies. In this article, Google hacking is investigated for retrieving private
information over the Internet, and Google hacking techniques for this purpose are
revealed.
This paper is organized as follows. In Section 2, the architecture of the Google
search engine is briefly discussed. In Section 3, query formulation for the Google search
engine is presented. In Section 4, the methodology for the experiments to obtain private
information and a classification of private information are outlined. In Section 5,
experiments that were performed to obtain private information using search queries are
presented. Finally, the effect of Google hacking on private information is
discussed with concluding remarks in the last section.
2 Architecture of Google
The architecture of the Google search engine consists of different components, as shown in
figure 1.
Google services use more than one server farm (a collection of servers), running
Linux as the operating system and using the C, C++, and Python programming languages for
server-side coding [9, 10]. In 2006, it was estimated that Google had more than 450,000
servers around the world [6].
Fig. 1. High Level Google Architecture¹
Google collects web pages by using a program called a web crawler (also web spider or web
robot), which surfs the Internet and expands the list of web pages. A web page can also be
added directly through Google’s site. The web crawler used by Google is called
“Googlebot”. A list of URLs (Uniform Resource Locators) is sent to Googlebot by a
“URL Server”, together with a unique ID tag, called a “docID”, assigned to each URL. The
fetched web pages are then sent to the “Store Server”, which compresses them
and stores them in a “Repository”. For indexing, the “Indexer” and the “Sorter” work together.
The “Indexer” takes the compressed documents from the “Repository”,
uncompresses them, parses them, and converts them into a set of word
occurrences called hits. The “Indexer” classifies these hits and sends them to
storage units called “Barrels”. The “Indexer” also defines the keywords in the web pages. The
component called “Doc Index” is responsible for storing the documents’ IDs, locations in
the “Repository”, statuses, contents, and URL titles. The “Lexicon” component stores a huge
collection of words and pointers. “PageRank” is an algorithm used by Google to rank
pages according to the link structure of the web. When ordering search results, Google
combines PageRank with the hits in the pages and the positions of these hits (such as hits
in the title or hits in the body) [9].
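The conversion of parsed documents into hits can be illustrated with a minimal sketch. It is not Google's implementation: the function name build_index, the sample docIDs, and the page contents are hypothetical, and compression, barrels, and the lexicon are left out. The sketch only records, for each word, the documents and positions in which it occurs.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each word to a list of (doc_id, position) hits."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Parse the document into lower-case words and record each occurrence.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


# Hypothetical docIDs and page contents, for illustration only.
documents = {
    1: "Eastern Mediterranean University",
    2: "University of the Mediterranean region",
}
index = build_index(documents)
```

Looking up index["university"] in this sketch yields one hit per document, each with the position of the word, which is the kind of information a ranking step can later weigh.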
3 Query Formulation
When using search engines, well-formed search queries are essential for obtaining the most
relevant results. For queries, Google provides basic and advanced search operators. To obtain
results faster and more efficiently, a search query is formulated by using one or more
specific operators. For general searches, Google’s basic search operators can be used,
whereas more specific searches call for Google’s advanced search operators.
¹ The figure is compiled from the study of Brin and Page [9].
3.1 Basic Search Operators
A basic search query contains basic search operators. These operators are ‘+’, ‘-’, ‘~’, ‘.’,
‘“ ”’, ‘|’, and wildcards. Google’s basic queries are not case sensitive, and they can include
up to 10 terms. Google eliminates stopwords (such as “the”, “of”, “a”, “i”, “to”, “and”)
from the query [11].
To retrieve pages which include a specific word, the plus operator (+) can be used.
In contrast to the plus operator, the minus operator (-) is used to retrieve pages which do not
include a specific word. Google search queries can also include wildcard operators. The
asterisk operator (*) is used as a wildcard that will be replaced by zero or more characters,
while the dot operator (.) is replaced by one character that separates two words. To retrieve
pages which include words related to a specific word, the tilde operator (~) is used. To
retrieve pages which include a specific phrase, quotation marks (“ ”) are placed around
the phrase. Finally, Google search queries also support a logical OR operator (|),
which retrieves pages that satisfy at least one of the conditions separated by the operator.
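The way these basic operators combine into one query string can be sketched as follows. The helper basic_query and its parameter names are hypothetical and serve only to illustrate the plus, minus, quotation, and OR operators described above; the wildcard and tilde operators are omitted, and the sample words are invented for the example.

```python
def basic_query(include=(), exclude=(), phrase=None, any_of=()):
    """Compose a basic query string from the operators +, -, quotes, and |."""
    parts = [f"+{word}" for word in include]    # words that must appear
    parts += [f"-{word}" for word in exclude]   # words that must not appear
    if phrase:
        parts.append(f'"{phrase}"')             # exact phrase
    if any_of:
        parts.append(" | ".join(any_of))        # logical OR between alternatives
    return " ".join(parts)


query = basic_query(
    include=["university"],
    exclude=["sports"],
    phrase="computer engineering",
    any_of=["EMU", "TRNC"],
)
```

Here query holds the string +university -sports "computer engineering" EMU | TRNC, which can be typed into the search box as it is.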
3.2 Advanced Search Operators
Google’s queries can contain special terms called “advanced operators”, which are
used to create advanced filters that obtain more accurate results. To create even more
advanced filters, most of these advanced operators can be used in combination with special
characters, boolean operators, and other advanced operators. The syntax of an advanced
operator is strictly of this form: operator:search-term. According to the syntax, there should
be no space before or after the colon (:), operators should be written in lower case
characters, and if the search-term includes more than one word, it can be enclosed in
double quotation marks [11, 12].
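The syntax rules above can be captured in a small formatting helper. This is a sketch under stated assumptions: the function advanced_term and the set SINGLE_TERM_OPERATORS are hypothetical names, and the set lists only single-term operators covered in this section. Operators that take several unquoted words, such as allintitle, are deliberately left out.

```python
# Single-term advanced operators covered in this section (an illustrative list).
SINGLE_TERM_OPERATORS = {
    "intitle", "inurl", "site", "link", "intext",
    "filetype", "numrange", "cache", "define",
}


def advanced_term(operator, term):
    """Format operator:search-term with no spaces around the colon."""
    op = operator.lower()                      # operators are written in lower case
    if op not in SINGLE_TERM_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    term = term.strip()
    if not term:
        raise ValueError("search term must not be empty")
    if any(ch.isspace() for ch in term):
        term = f'"{term}"'                     # quote multi-word search terms
    return f"{op}:{term}"
```

For instance, advanced_term("intext", "computer engineering") yields intext:"computer engineering", while a single word is passed through unquoted.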
To create a filter according to the titles of pages, the advanced operator intitle
can be used. For example, the query intitle:EMU returns the pages with EMU in their web
page titles. To find pages which include two specific words in their titles, the operator
allintitle can be used instead of using the intitle operator twice. For example,
the query intitle:EMU intitle:TRNC returns the pages which include both EMU and TRNC
in their titles, which can also be expressed by the query allintitle:EMU TRNC.
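The equivalence between the two forms can be made explicit with a short sketch. The function expand_allintitle is a hypothetical helper that rewrites an allintitle query into the repeated intitle form; it assumes the words are separated by whitespace and does not handle quoted phrases.

```python
def expand_allintitle(query):
    """Rewrite 'allintitle:w1 w2' as 'intitle:w1 intitle:w2'."""
    prefix = "allintitle:"
    if not query.startswith(prefix):
        raise ValueError("expected a query of the form allintitle:word1 word2 ...")
    words = query[len(prefix):].split()
    return " ".join(f"intitle:{word}" for word in words)


expanded = expand_allintitle("allintitle:EMU TRNC")
```

In this example expanded holds intitle:EMU intitle:TRNC, the longer form given in the text.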
To create a filter according to the URLs of pages, the inurl and site operators can
be used. The operator inurl returns the pages which include a specific word in
their URLs. For example, the query inurl:emu returns the pages with emu in their web
addresses, like http://www.emu.edu.tr and http://cmpe.emu.edu.tr. Moreover, the operator
site can be used to return the pages which are located at a specific URL. For example, the
query site:cmpe.emu.edu.tr returns the pages which are located at the web address
http://cmpe.emu.edu.tr. Pages which include links to a specific URL can also be listed by
using the operator link. For example, the query link:www.emu.edu.tr lists the pages which
link to the web address http://www.emu.edu.tr.
Pages which include a specific word in their body text are searched by using
the operator intext. For example, the query intext:academic lists the pages which
include the word academic in their body text. For pages which include more than one
specific word in their body text, instead of using the operator intext for each specific
word, the operator allintext can be used. For example, although the query intext:EMU
intext:academic returns the pages which include both of the words EMU and academic in
their body text, the query allintext:EMU academic can be used for the same purpose.
To list documents on the Internet according to their file types, the
operator filetype can be used. However, this operator must be used with other advanced
operators or basic search queries. For example, the following query lists the Microsoft
Excel files (with extension .xls) which include the term “TRNC”: filetype:xls TRNC. As
another example, the following query lists all the Microsoft Word files on
http://www.emu.edu.tr: filetype:doc site:www.emu.edu.tr.
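A combined query such as the one above is usually sent to the search engine as a URL parameter. The sketch below shows one way to do this with the Python standard library. The function name search_url is hypothetical, and the address https://www.google.com/search with the parameter q is an assumption about the public search form rather than something specified in this paper. The check at the top reflects the rule that filetype cannot be used on its own.

```python
from urllib.parse import urlencode


def search_url(*terms):
    """Join query terms and encode them into a search URL (assumed endpoint)."""
    if not terms:
        raise ValueError("at least one query term is required")
    # filetype must be combined with another operator or keyword.
    if all(term.startswith("filetype:") for term in terms):
        raise ValueError("filetype must be combined with another term")
    query = " ".join(terms)
    return "https://www.google.com/search?" + urlencode({"q": query})


url = search_url("filetype:doc", "site:www.emu.edu.tr")
```

The colon and the space are percent-encoded by urlencode, so the operators survive the trip through the URL unchanged.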
To return the pages which include numbers in a specific range, the advanced
operator numrange can be used. The range is specified in this format: rangebegin-rangeend.
For example, the following query returns the pages which include numbers
between 995 and 1005: numrange:995-1005.
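The range format can be checked and applied locally with a short sketch. The functions parse_numrange and numbers_in_range are hypothetical helpers, and the sketch assumes that both ends of the range are inclusive, which the text does not state explicitly. It mimics the filter on a piece of text and does not contact the search engine.

```python
import re


def parse_numrange(term):
    """Parse 'numrange:begin-end' into a (begin, end) pair of integers."""
    match = re.fullmatch(r"numrange:(\d+)-(\d+)", term)
    if match is None:
        raise ValueError("expected numrange:rangebegin-rangeend without spaces")
    begin, end = int(match.group(1)), int(match.group(2))
    if begin > end:
        raise ValueError("range begin must not exceed range end")
    return begin, end


def numbers_in_range(text, term):
    """Return the numbers in text that fall inside the range (assumed inclusive)."""
    begin, end = parse_numrange(term)
    return [int(n) for n in re.findall(r"\d+", text) if begin <= int(n) <= end]
```

A term written with a space after the colon is rejected by parse_numrange, in line with the syntax rule for advanced operators.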
To retrieve the stored copy of a web page from the cache of the Google servers,
the operator cache can be used. The following example returns the last cached home page
of the Eastern Mediterranean University’s (EMU) web site from the Google cache:
cache:www.emu.edu.tr.
For retrieving the definitions of a specific word from the Internet, the operator
define can be used. For example, the following query brings the definitions of the word
“university”: define:university.
4 Experiments
The search engine Google can also be used to search for unguarded data which should be
hidden, like private information or technical information on a server computer. Google
hacking is a term which describes the use of the Google search engine as an
aggressive hacking tool [13]. In his study, Long has published Google search queries that
can be used to reveal the security flaws of a server computer [11].
Our research, however, deals more with the security of personal and
organizational information and how Google hacking enables access to private information.
Personal information includes ID numbers, contact information, chat logs, e-mail
accounts, passwords, credit card numbers, etc., and organizational information includes
employee lists, salary lists, financial reports, private communication logs of a company, etc.
To analyze the effect of Google hacking on access to private information, first of all,
personal and organizational information is categorized into groups: (1) IDs, (2) contact
information, (3) confidential documents, (4) personal passwords, and (5) private
communication data. These groups are discussed below with some sample hack
scenarios.
IDs are unique numbers like citizen identity numbers, social security numbers,
taxpayer numbers, driver license numbers, etc. When revealed, IDs can be used to perform
personal operations in the name of the victim, without his or her consent. For example,
when a hacker obtains a person’s social security number, the hacker can get the victim’s
personal details. By using those details, the hacker can take out loans from a bank,
leaving the victim in debt [15].
Contact information includes phone numbers, e-mail addresses, home addresses,
etc. When a hacker obtains contact information of a person, the hacker can use that
information for unsolicited advertisements. If an e-mail address is obtained, it can be used
for spam mails without consent of the user.
Confidential information is information that is used internally by an
organization. It can include an organization’s secrets, strategies,
long-term or short-term plans, financial status, employee list, salaries, etc. When the
strategies of an organization are obtained, other organizations can illegally benefit from
the secrets they contain. When the obtained information is a salary list, another
organization can poach employees on a large scale by offering higher salaries.
Personal passwords can be described as passwords for e-mail accounts, instant
messaging accounts, portal accounts, server administration accounts, etc. When a hacker
obtains a password, that password can be used to harm a person’s account, or an
organization, or a server disk. If a hacker obtains a personal password like an e-mail
password, the hacker can read private information or send malicious e-mails. If the
password is the administrator password of a portal, then the hacker can access all private
information of portal members. If the password belongs to an online-banking account,
then the hacker can perform illegal operations on the victim’s bank account.
Private communication data include chat logs, chat contact lists, e-mail address
books, etc. These kinds of information can reveal the private daily life or business secrets
of a victim. If the communication data include passwords, the hacker who obtained the
data can use those passwords to access the victim’s personal accounts.
5 Experimental Result Samples
Based on the classification provided in Section 4, this section presents some examples of
successful queries that retrieve private information using the Google search engine. Each
type of query is presented with the advanced search operators that might be used for
the purpose. Furthermore, screenshots of the results of the queries are presented.
Sensitive information in these pictures is blurred out on purpose to preserve
confidentiality.
Microsoft Excel files (with extension .xls), Microsoft Word Documents (with
extension .doc), and text files (with extension .txt), can be used to store ID information. If
so, the following queries can be used to search for ID information: filetype:xls “TC kimlik
no” (as shown in figure 2), filetype:xls “ssk sicil no”, filetype:txt “my social security
number is”.
Contact information can likewise be found in popular document types like “.xls”,
“.doc”, and “.txt”. As shown in figure 3, the following query displays the Excel documents
which include the phrase “gsm no”: filetype:xls “gsm no”.
Fig. 2. Result of the query filetype:xls “TC kimlik no” (“T.C. Kimlik no”: Republic of Turkey ID
No, “Soyadı”: Surname, “Adı”: Name, “Doğum Tarihi”: Date of Birth, “SSK/Emekli Sicil no”:
Health System Number)
Fig. 3. Result of the query filetype:xls “gsm no” (“Adı Soyadı”: Name Surname, “Bulunduğu
Şehir”: City, “İşe Giriş Tarihi”: Date of Employment, “GSM No”: Mobile Phone number)
Furthermore, Outlook Express e-mail folder files (with extension .dbx), Outlook
Personal Folder files (with extension .pst), MSN contact list files (with extension .ctt),
contact list files of the instant messaging program Trillian (with the file name
mystuff.xml), and contact list files of the instant messaging program AIM (with the file name
buddylist.blt) can be used to store contact information. Moreover, tabular file formats like
comma-separated values (with extension .csv) and Microsoft Excel files (with extension
.xls) can store contact information. An example of how this type of contact information
can be revealed is shown in figure 4.
Fig. 4. Result of the query filetype:ctt “msn”
Confidential information can also be found in document types like Microsoft
Excel files (with extension .xls), Microsoft PowerPoint files (with extension .ppt), and
Adobe Acrobat files (with extension .pdf). For example, the following query can be used
to search for internal confidential presentations: filetype:ppt confidential “for internal use
only”. The result of this query is shown in figure 5.
Fig. 5. Result of the query filetype:ppt confidential “for internal use only”
Server log files contain information such as passwords, usernames, and access
times for a web site, and a hacker can search them to obtain these types of information.
For example, HKEY_CURRENT_USER nodes of Windows Registry files (with
extension .reg) can be checked to find passwords, and log files that store passwords can be
located by name, as seen in figure 6.
Fig. 6. Result of the query filetype:log inurl:“password.log”
Chat log files and Outlook Personal Folder files (with extension .pst) can include
private communication information. Chat log files store the logs of dialogs between two
peers and Outlook Personal Folder files store the inbox of an e-mail user, including the
contents of sent and received e-mails. An example of how chat log files can be obtained is
shown in figure 7.
Fig. 7. Result of the query “has joined” filetype:txt
6 Conclusion
Private documents on a server disk can be accessed through Google search if those documents
are configured to be publicly accessible. Moreover, Google can access and index the file and
directory structure of a server disk if the server is configured to allow directory listing,
which makes it easier to hack into that server by using Google search [14].
With successful query formulation, a Google hacker can obtain private
information. These exploits result both from misconfigured web servers and from users
knowingly publishing their private information on the Internet. For example, contact
information kept on a web site can be retrieved by simply searching for document files that
include the word “contact”. Furthermore, the ID of a person can be retrieved, enabling
identity theft [11].
Hackers can also use Google for stealth purposes. To stay unidentified, hackers
can use Google’s cache and work on the cached copy of the target site, thus
never visiting the target site itself or leaving any trace there [14, 16]. Furthermore, when
private information has been retrieved in this way, it might not be possible to trace the activity.
As can be seen in the sample experiments, private information of a person or a
company can be revealed with search techniques using the Google search engine.
Therefore, private information such as ID numbers, contacts, etc. can only be protected
by eliminating misconfigurations on servers and by controlling what information
is published. Google hacking enables access to private information even if it is
not published publicly and is kept unlinked. Since some hacking techniques exploit
misconfigured web servers, it is important that administrators apply patches frequently.
There are some tools that help prevent Google hacking by locating the security flaws in a
web page. These tools use the Google Hack Database (GHDB) [17] or their own database to
locate these security flaws. Some known tools are Gooscan [18], Sitedigger [19],
Goolink [20], and Athena [21]. Although these tools can be used for scanning security
flaws, they are not specialized in checking the security of private information. Therefore,
future work on this topic includes the development of a tool that detects security flaws
specifically for private information and that will be available online for public use.
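As a rough illustration of what such a check might look like on the administrator's side, the sketch below walks a local copy of a web root and lists files whose extensions match the file types discussed in Section 5. It is not one of the tools cited above and not the tool proposed as future work. The function find_exposed_files and the extension list are hypothetical, and a match only means that the file deserves a manual review, since many .pdf or .txt files are meant to be public.

```python
from pathlib import Path

# File types named in Section 5 as possible carriers of private information.
REVIEW_EXTENSIONS = {
    ".xls", ".doc", ".txt", ".csv", ".ppt", ".pdf",
    ".dbx", ".pst", ".ctt", ".log", ".reg",
}


def find_exposed_files(web_root):
    """List files under web_root whose extension suggests a manual review."""
    root = Path(web_root)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in REVIEW_EXTENSIONS
    )
```

Running find_exposed_files on the directory that the web server publishes gives the administrator a list of candidate files to move, protect, or remove before a crawler indexes them.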
References
1. J. Bar-Ilan. Comparing rankings of search results on the web. Information Processing
and Management: an International Journal, 41(6):1511-1519, December 2005.
2. Federal Trade Commission. Identity theft survey report, pp. 6-7, September 2003.
http://www.consumer.gov/idtheft/pdf/synovate_report.pdf
3. US University notifies 800.000 of database hack. Computer Fraud & Security, pp. 3,
January 2007.
4. R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. Addison Wesley, 1999.
5. Y. Shang and L. Li. Precision evaluation of search engines. World Wide Web: Internet
and Web Information Systems, 5(2):159-173, 2002.
6. http://en.wikipedia.org/wiki/Google (visited on November 8, 2006)
7. F. Mantar. Google desktop. December 2005.
http://ftp.cs.hacettepe.edu.tr/pub/dersler/BIL4XX/BIL447_YML/belgeler/GoogleDesktop_20221815.pdf
(visited on January 19, 2007)
8. E.I. Tatlı. Google reveals cryptographic secrets, July 2006.
http://th.informatik.uni-mannheim.de/people/tatli/pub/ghack_crypto.pdf (visited on
January 6, 2007)
9. S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine,
Computer Networks and ISDN Systems, 30(1):107-117, 1998.
10. http://www.python.org/about/quotes/ (visited on January 27, 2007)
11. J. Long, Google hacking for penetration testers. Syngress Publishing Inc. Rockland,
MA, 2005.
12. T. Calishain and R. Dornfest. Google hacks, O'Reilly, 2003.
13. E. I. Tatlı. Google ile güvenlik açıkları tarama, 2006.
http://th.informatik.uni-mannheim.de/people/tatli/resources/pdf/googlehacking.pdf
(visited on February 13, 2007)
14. S. A. Mathieson. Google – Swiss army knife for hackers? InfoSecurity Today,
November/December 2005.
15. http://www.ssa.gov/pubs/10064.html (visited on December 9, 2006)
16. E. Rabinovitch, Protect your users against the latest web-based threat: malicious code
on caching servers, IEEE Communications Magazine, 45(7):20-22, March 2007.
17. http://johnny.ihackstuff.com/ (visited on April 17, 2007)
18. Gooscan, http://johnny.ihackstuff.com/downloads/task,doc_details/gid,28 (visited on
April 17, 2007)
19. Sitedigger,
http://www.foundstone.com/index.htm?subnav=resources/navigation.htm&subcontent=/
resources/proddesc/sitedigger.htm (visited on April 17, 2007)
20. Goolink, www.ghacks.net/2005/11/28/goolink-scanner-beta-preview-2/ (visited on
April 17, 2007)
21. Athena, http://www.buyukada.co.uk/projects/athena (visited on April 17, 2007)