
Architecture for Checking Trustworthiness of Websites

Authors:
Sana Ansari, Asst. Professor, Don Bosco Institute of Technology, Mumbai
Jayant Gadge, Asst. Professor, Thadomal Shahani Engineering College, Mumbai

International Journal of Computer Applications (0975-8887), Volume 44, No. 14, April 2012
ABSTRACT
Today, retrieving information from the Internet is a commonplace activity, since
information is readily available and accessible to everyone. Whenever a user
types a query into a search engine, answers come back within a fraction of a
second. However, the results may or may not be accurate, because different
websites may give different information about the same entity. The biggest
question, then, is which website the user should trust.
There are many characteristics by which users can determine the trustworthiness
of content provided by Web information sources. In the proposed system,
websites are filtered for trustworthiness based on five major areas: Authority,
Related Resources, Popularity, Age, and Recommendation. The proposed system
defines eighteen factors, categorized under these five major areas. The
trustworthiness of each URL is calculated from the eighteen factors and stored,
thereby increasing the performance of retrieving trustworthy websites. The
objective of the proposed system is to return more trustworthy websites as the
top results, which would save a considerable amount of searching time.
General Terms
Trustworthy, URLs, Information Retrieval, Links, Factors,
Webpage, Algorithm
Keywords
Authority, Popularity, Recommendation, Page Rank, Inbound
Link, Alexa Rank, WOT, Dmoz Listing
1. INTRODUCTION
Every day, people retrieve all kinds of information from the Web. However,
there is no guarantee of the correctness of this information, as it comes from
different sources that vary in quality, and at times people may get conflicting
information from different websites [1]. In the current scenario, the Internet
is the most popular as well as the most important source of information.
Someone searching for any kind of information writes a query into a search
engine (such as google.com or ask.com) and gets an answer within a fraction of
a second. However, the answer which the search engine provides may or may not
be trustworthy, because various websites may provide different results for the
same query.
For example, consider the following query typed into Google (the query was sent
on 15 August 2010): "What is the depth of the Indian Ocean?" The following
results are found:
1. www.eoearth.org gives "3900m"
2. en.wikipedia.org/wiki/Indian_Ocean gives "3890m"
3. www.infoplease.com gives "3400m", and so on.
So, which website should the user rely on? It is clear that the search results
are not always consistent and credible [3]. To obtain the desired data on the
Web, users need to go through each website manually, a process that is not only
time-consuming but also inefficient.
2. LITERATURE REVIEW
The usefulness of a search engine depends on the relevance of
the result set it gives back. While there may be millions of
web pages that include a particular word or phrase, some
pages may be more relevant, popular, or authoritative than
others. Most search engines employ methods to rank the
results to provide the "best" results first. How a search engine
decides which pages are the best matches, and what order the
results should be shown in, varies widely from one engine to
another. Google, a search engine with a full text and hyperlink
database, is designed to crawl and index the Web efficiently
and return much more satisfying search results than existing
systems. It makes use of the link structure of the Web to
calculate a quality ranking for each Web page. The ranking algorithm used by
Google is PageRank. PageRank extends the idea that the importance of an
academic publication can be evaluated by its citations: the importance of a
page on the Web can similarly be evaluated by counting its backlinks. In
Google's PageRank algorithm, the ranking value PR of a page A is measured
using the following formula:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))        (1)
where T1...Tn are the pages pointing to page A, hence representing its
backlinks; the parameter d is a damping factor scaled between 0 and 1; and
C(Ti) is the number of links leaving page Ti, hence representing its outgoing
links.
The rank of page A, PR(A), can be calculated using a simple iterative
algorithm. As shown by formula (1), the ranking process recursively defines the
relevance of page A as the weighted sum of the PageRank of its backlinks [26].
The PageRank algorithm is thus used to find the pages with high authority; a
minimal sketch of the iteration is given below.
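To make the iteration concrete, the following minimal Python sketch applies formula (1) to a small, hypothetical link graph; the graph, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not values taken from the paper.

```python
# Minimal PageRank iteration following formula (1).
# links[p] lists the pages that p links to, so len(links[p]) is C(p).
links = {
    "A": ["B", "C"],   # hypothetical three-page link graph
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                             # damping factor, scaled between 0 and 1
pr = {page: 1.0 for page in links}   # initial rank for every page

for _ in range(50):                  # fixed number of iterations for simplicity
    new_pr = {}
    for page in links:
        # Backlinks of `page`: every page T that links to it.
        backlinks = [t for t, outs in links.items() if page in outs]
        # PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the backlinks T.
        new_pr[page] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in backlinks)
    pr = new_pr

print(pr)   # pages with more (and better-ranked) backlinks end up with higher PR
```

In this non-normalized form the ranks settle around 1 rather than forming a probability distribution, which matches the formulation of equation (1); a production implementation would stop iterating once the ranks change by less than a small tolerance.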
In the Hyperlink-Induced Topic Search (HITS) algorithm, the first
step is to retrieve the set of results to the search query. The
computation is performed only on this result set, not across all
Web pages. Authority and hub values are defined in terms of
one another in a mutual recursion. An authority value is
computed as the sum of the scaled hub values that point to
that page. A hub value is the sum of the scaled authority
values of the pages it points to [8]. Both approaches identify the most
important Web pages, but the popularity of a Web page does not necessarily
imply the accuracy of its information; even the most popular website may
contain many errors [6, 7]. Related work has also been done with the
TRUTHFINDER algorithm, which is used to find trustworthy websites, but it has
certain limitations. First, the initial trustworthiness of a website is assumed
to be 0.9 in all cases, whether the website is popular, authoritative, or
untrustworthy. Second, the trustworthiness of websites is recalculated for each
query given by the user, which reduces the performance of the system [1].
3. FACTORS AFFECTING TRUST
While determining trust, it is not good practice to assume that a Web page's
host site is the only factor. Apart from this, there are many parameters that
affect how a user determines trust in the content provided by Web information
sources [3]. Therefore, the proposed system filters website trustworthiness
based on the five major areas mentioned below [4]:
1. Authority - domain specific
2. Related Resources - links from trusted websites
3. Popularity - most visited websites
4. Age - lifespan of time-dependent information
5. Recommendation - referrals from other users
The proposed system defines eighteen factors, which are categorized under the
above-mentioned five major areas. The trustworthiness of a website is
calculated based on these eighteen factors and stored, thereby increasing the
performance in retrieving trustworthy websites. The eighteen factors and the
five areas they fall under are shown in Figure 1 below.
The Authority parameter is calculated by determining a high co-occurrence of
keywords or semantic phrases across multiple pages of a website. This is done
by analyzing the URL; different weights are assigned depending on the Page
Title, Meta Keyword, and Meta Description values gained for each URL.
The Age parameter consists of two aspects: the last-modified date and the age
of the domain. From an informational point of view, the newer the content, the
better; to capture this, the current date is compared with the last-modified
date. For the domain age, on the other hand, a long-running website is
considered strongly recommendable.
The Popularity parameter is calculated using the number of "in" links, that is,
the number of times a particular website is referenced from other trustworthy
websites, as captured by measures such as Google PageRank. It also reflects the
fact that the more visitors a particular website has, the more popular it is.
Essentially, the more good-quality links a website has, the better.
The Related Links parameter adds an appropriate weight to each URL's
trustworthiness when the URL is listed by a highly trustworthy source, such as
Google inbound links, Yahoo inbound links, Bing inbound links, and Alexa
inbound links.
A recommendation is given when one party commends another as being worthy or
desirable. In the same way, when a trustworthy website recommends another
website, the value of the recommended website increases. This parameter is
calculated based on factors such as Alexa Rank, WOT Rating, Site Advisor
Rating, and Dmoz Listing.
In the proposed system, the filtering of trustworthiness is based on all of the
above-mentioned eighteen factors, which together are used to return trustworthy
websites as the top results.
4. PROPOSED SYSTEM
The proposed architecture for the Trustworthy System is
given in Figure 2. The system provides an interface where the
user can write his/her query. Once the search button is clicked, the system
shows the list of URLs on the same page and, in the meantime, saves all the
URLs in the database. For each URL, all of the eighteen factors mentioned above
are calculated and saved as total_score in the database. Based on the
total_score value of each URL, the URLs are then rearranged in descending
order: the URL with the highest total_score value appears as the topmost
result, and the remaining URLs follow in order of their scores. A minimal
sketch of this flow is given after Figure 2.
Figure 1: Tree diagram representing the 18 factors, grouped under the five
major areas that affect how users determine trust in content provided by the Web:
- Authority: Page Title, Meta Keyword, Meta Description
- Age: Last-Modified Date, Domain Age
- Popularity: Google Page Rank
- Related Links: Google Inbound Links, Yahoo Inbound Links, Bing Inbound Links, Alexa Inbound Links, Google Indexed Pages, Yahoo Indexed Pages, Bing Indexed Pages
- Recommendation: Alexa Rank, WOT Rating, Site Advisor Rating, Dmoz Listing
Figure 2: Proposed System Architecture (the user's search query is sent to the
search engine; the list of result URLs is stored in the database; for each URL
from the database, the 18 factors are calculated and stored as total_score; all
URLs are then rearranged based on their total_score)
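As an illustration of this flow (a minimal sketch under stated assumptions, not the authors' implementation), the code below assumes a hypothetical search_engine() call that returns result URLs and a hypothetical compute_factors() helper that returns the 18 factor scores for a URL, and stores the weighted total_score values in a local SQLite table:

```python
import sqlite3

def search_engine(query):
    """Hypothetical placeholder: return the list of result URLs for a query."""
    raise NotImplementedError

def compute_factors(url):
    """Hypothetical placeholder: return a dict of the 18 factor scores for a URL."""
    raise NotImplementedError

def rank_urls(query, weights):
    # 1. Get the result URLs from the search engine and prepare storage.
    urls = search_engine(query)
    db = sqlite3.connect("trust.db")
    db.execute("CREATE TABLE IF NOT EXISTS scores (url TEXT PRIMARY KEY, total_score REAL)")

    # 2. For each URL, calculate the 18 factors and save the weighted total_score.
    for url in urls:
        factors = compute_factors(url)           # factor name -> score
        total = sum(weights[name] * factors[name] for name in weights)
        db.execute("INSERT OR REPLACE INTO scores VALUES (?, ?)", (url, total))
    db.commit()

    # 3. Rearrange the URLs in descending order of total_score.
    rows = db.execute("SELECT url FROM scores ORDER BY total_score DESC").fetchall()
    return [row[0] for row in rows]
```

Because the total_score of each URL is persisted, a repeated query can reuse the stored scores instead of recomputing all eighteen factors, which is where the performance gain described above comes from.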
5. ANALYSIS OF EIGHTEEN IMPACT
FACTORS
Once the user has entered his/her query, the list of URLs is shown on the same
screen. Each URL is then processed to retrieve and calculate the values of the
eighteen factors, and all of the calculated factors are stored in the database.
The final step is to show the new result (URLs) in descending order of their
total score: the record with the highest score comes to the top, and the rest
follow according to their total score values. The calculations of all 18
factors are as follows:
1. Page Title:
Every html document must have a TITLE Element in the
head section. From a search engine's point of view, page
title is the first indication of the contents of the
page. Additionally, page title is the key information
returned when search engines list results to a keyword
search [9, 10]. If the searched keyword appears in the title tag, the page
is considered a strong match.
2. Meta Keyword:
A Meta keywords tag is supposed to be a brief and concise list of the most
important themes of the webpage. Long ago in Internet time, the Meta
keywords tag was very useful in helping pages win on search engines, but so
many unscrupulous webmasters have abused it that search engines have had to
de-emphasize its importance. Though Meta keywords tags are not a major
factor that search engines consider when ranking sites, they should not be
left off the page [11]. If the searched keyword appears in the Meta
keywords tag, the page is considered a strong match.
3. Meta Description:
The Meta description tag is intended to be a brief and concise summary of a
web page's content. It is designed to provide a brief description of the
website which can be used by search engines or directories. If the searched
keyword appears in the Meta description, the page is considered a strong
match [12]. A combined sketch of factors 1-3 is given below.
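A combined sketch of factors 1-3 (Page Title, Meta Keyword, Meta Description) follows. It uses Python's standard html.parser module to extract the title and the two meta tags from a fetched page and checks whether the searched keyword appears in each; the binary 0/1 scores are an assumption made purely for illustration.

```python
from html.parser import HTMLParser

class HeadParser(HTMLParser):
    """Collects the <title>, Meta keywords and Meta description of a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() in ("keywords", "description"):
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def head_factor_scores(html, keyword):
    """Return assumed 0/1 scores for Page Title, Meta Keyword and Meta Description."""
    parser = HeadParser()
    parser.feed(html)
    kw = keyword.lower()
    return {
        "page_title": 1.0 if kw in parser.title.lower() else 0.0,
        "meta_keyword": 1.0 if kw in parser.meta.get("keywords", "").lower() else 0.0,
        "meta_description": 1.0 if kw in parser.meta.get("description", "").lower() else 0.0,
    }
```

A graded score (for example, rewarding partial or multi-keyword matches) could replace the binary values; the 0/1 form is used here only for brevity.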
4. Alexa Rank:
Alexa is a widely used tool for ranking website traffic and one of the most
accurate freely available tools for finding out how well a site ranks
against millions of other sites on the Web. The lower the Alexa ranking
number, the more heavily visited the site [13].
5. Google Page Rank:
Google PageRank is one of the methods Google uses to determine the relevance
or importance of a page. It matters because it is one of the factors that
determine a page's ranking in the search results. The closer the PageRank of
a particular page is to 5, the more popular the page is considered [15]; one
possible mapping is sketched below.
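The paper does not specify how a PageRank value is converted into a factor score; under the "closer to 5" rule above, one possible (assumed) mapping is:

```python
def pagerank_score(toolbar_pr):
    """Map a Google toolbar PageRank (0-10) to a popularity score in [0, 1].

    Assumption: values of 5 or more are treated as fully popular, following
    the "closer to 5" rule; the exact mapping is not given in the paper.
    """
    return min(max(toolbar_pr, 0) / 5.0, 1.0)
```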
6. Google Inbound Links:
Google Inbound Links defines the number of third-party websites that link to
a particular website, as identified by Google. Backlinks are important for
SEO because some search engines, especially Google, give more credit to
websites that have a good number of quality backlinks and consider those
websites more relevant than others in their results pages for a search
query. The greater the number of Google inbound links, the more weightage is
given to this factor [16].
7. Google Indexed Pages:
Google Indexed Pages gives an indication of the number of pages indexed by
Google and available on Google's servers. Various on-page SEO factors help
achieve higher search engine rankings, including the number of web pages
indexed by Google and other search engines (indexation). It plays an
important role in the SEO score of a particular site; in many cases, the
winning factor of a site compared to its competitor is the number of its
pages indexed by Google or other search engines [17].
8. Last Modified Date:
The Last Modified Date is the date on which a particular page was created or
last modified by the website owner. From an informational point of view,
newer content is usually better than older content within a web page:
whenever users search for information, they generally anticipate recent
information rather than information that is a few years old [18].
9. Domain Age:
Domain Age defines how old a particular website is. A long-lasting and
well-kept domain name reflects positively on search engine optimization
[18]. A sketch covering factors 8 and 9 is given below.
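A minimal sketch covering factors 8 and 9 is shown below. It assumes the server exposes an HTTP Last-Modified header (many pages do not) and that the domain's creation date has already been obtained elsewhere, for example from a WHOIS lookup; the scoring thresholds are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

def last_modified_score(url):
    """Newer content scores higher; pages without the header get a neutral score."""
    with urlopen(url) as resp:                      # assumes the URL is reachable
        header = resp.headers.get("Last-Modified")
    if header is None:
        return 0.5                                  # assumed neutral value
    age_days = (datetime.now(timezone.utc) - parsedate_to_datetime(header)).days
    return 1.0 if age_days < 365 else 0.5 if age_days < 5 * 365 else 0.0

def domain_age_score(creation_date):
    """Older domains score higher; creation_date is a timezone-aware datetime
    assumed to come from a WHOIS lookup performed elsewhere."""
    years = (datetime.now(timezone.utc) - creation_date).days / 365.0
    return min(years / 10.0, 1.0)                   # assumed cap at ten years
```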
10. Yahoo Inbound Links:
Yahoo Inbound Links defines the number of third-party websites that link to
a particular website, as identified by Yahoo [16]. The greater the number of
Yahoo inbound links, the more weightage is given to this factor.
11. Web Of Trust (WOT) Rating:
WOT ratings are powered by a global community of millions of trustworthy
users who have rated millions of websites based on their experiences. A
website's reputation rating is based on ratings from the WOT community and
tells
how much other users trust the site. It indicates whether a particular site
is safe using categories such as "Trustworthy", "Mostly", "Suspicious",
"Untrustworthy", "Dangerous", and "Unknown", and weightage is assigned
depending on this category [19]. One possible mapping is sketched below.
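The exact weightages per category are not published in the paper, so the sketch below simply shows one way such a categorical rating could be mapped to a numeric score; the values are assumptions, and the same pattern would apply to the Site Advisor colors described in factor 15.

```python
# Assumed mapping from WOT categories to scores in [0, 1].
WOT_SCORES = {
    "Trustworthy": 1.0,
    "Mostly": 0.8,
    "Suspicious": 0.4,
    "Untrustworthy": 0.2,
    "Dangerous": 0.0,
    "Unknown": 0.5,
}

def wot_score(category):
    """Return the numeric score for a WOT category, defaulting to 'Unknown'."""
    return WOT_SCORES.get(category, WOT_SCORES["Unknown"])
```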
12. Yahoo Indexed Pages:
Yahoo Indexed Pages gives an indication of the number of pages indexed by
Yahoo and available on Yahoo's servers [20]. The more pages indexed by
Yahoo, the better.
13. Alexa Inbound Links:
Alexa Inbound Links defines the number of third-party websites that link to
a particular website, as identified by Alexa. Quality inbound links are an
essential element of website marketing and search engine optimization
programs aimed at increasing traffic and online sales. The greater the
number of relevant and authoritative links to a web page, the greater the
potential for higher search engine rankings and qualified traffic [13, 14].
The greater the number of Alexa inbound links, the more weightage is given
to this factor.
14. Dmoz Listing:
Dmoz, the Open Directory Project (ODP), is a multilingual open-content
directory of WWW links; a Dmoz listing means the site appears under a
particular category. DMOZ is one of the most respected online directories,
and major search engines such as Google, Yahoo, and Bing give a lot of
importance to websites having links from DMOZ. Inclusion in DMOZ can
dramatically increase a website's ranking and traffic [21]. If the site
appears under an ODP category, the factor is given full marks (100%).
15. Site Advisor Rating:
Ratings from SiteAdvisor.com are based on a variety of measures. It claims
to protect the user by labeling websites green, yellow, or red to indicate
that they are safe, questionable, or dangerous [22]. The Site Advisor Rating
thus classifies a site as good (Green), compromised (Yellow), bad (Red), or
unrated (Grey), and weightage is assigned depending on the color.
16. Bing Indexed Pages:
Bing Indexed Pages gives an indication of the number of pages indexed by
Bing and available on Bing's servers [24]. The more pages indexed by Bing,
the better.
17. Bing Inbound Links:
Bing Inbound Links defines the number of third-party websites that link to a
particular website, as identified by Bing [24]. The greater the number of
Bing inbound links, the more weightage is given to this factor.
18. Ask Indexed Pages:
Ask Indexed Pages gives an indication of the number of pages indexed by Ask
and available on Ask's servers [25]. The more pages indexed by Ask, the
better.
For each of the eighteen factors mentioned above, the administrator provides a
weightage based on the importance of the factor, and the weightages of the 18
factors are normalized to sum to 100%. For each saved URL, all 18 factors are
calculated and their weighted values are summed to give the total score, as
sketched below.
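The following sketch shows one plausible way to combine the factor scores with administrator-assigned weightages that are normalized to 100%; the individual weight values and factor names are illustrative assumptions, not the weightages actually used by the authors.

```python
# Administrator-assigned weightages (illustrative values only), normalized so
# that they sum to 100% and then combined with per-factor scores in [0, 1].
RAW_WEIGHTS = {
    "page_title": 8, "meta_keyword": 4, "meta_description": 4,
    "alexa_rank": 8, "google_page_rank": 10, "google_inbound_links": 8,
    "google_indexed_pages": 5, "last_modified_date": 5, "domain_age": 5,
    "yahoo_inbound_links": 5, "wot_rating": 8, "yahoo_indexed_pages": 4,
    "alexa_inbound_links": 5, "dmoz_listing": 5, "site_advisor_rating": 6,
    "bing_indexed_pages": 3, "bing_inbound_links": 4, "ask_indexed_pages": 3,
}

def normalized_weights(raw):
    """Scale the raw weightages so that they sum to 100%."""
    total = sum(raw.values())
    return {name: 100.0 * value / total for name, value in raw.items()}

def total_score(factor_scores, raw_weights=RAW_WEIGHTS):
    """Weighted sum of the 18 factor scores, on a 0-100 scale."""
    weights = normalized_weights(raw_weights)
    return sum(weights[name] * factor_scores.get(name, 0.0) for name in weights)
```

A URL whose eighteen factor scores are all 1.0 would therefore receive a total_score of 100, and the stored total_score values are what the system later sorts on.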
6. CONCLUSION
The WWW is the most important source of information. However, there is no
guarantee of information correctness: search engines retrieve a lot of
conflicting information, and the quality of the provided information varies
from low to high.
The proposed system provides trustworthy websites for queries in web searching
by filtering website trustworthiness based on the eighteen factors described
above; the computed trustworthiness is stored, thereby increasing the
performance of retrieving trustworthy websites. Since the proposed system
provides more trustworthy websites as the top results, it saves a considerable
amount of searching time.
7. REFERENCES
[1] Xiaoxin Yin, Jiawei Han, and Philip S. Yu, "Truth Discovery with
Multiple Conflicting Information Providers on the Web", IEEE Transactions
on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008, pp. 796-808.
[2] Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava, "Truth
Discovery and Copying Detection in a Dynamic World", VLDB '09,
August 24-28, 2009.
[3] Sumalatha Ramachandran, Sujaya Paulraj, Sharon
Joseph and Vetriselvi Ramaraj, “Enhanced Trustworthy
and High-Quality Information Retrieval System for
Web Search Engines”, IJCSI International Journal of
Computer Science Issues, Vol. 5, October 2009, pp38-
42.
[4] Gil, Y. and Artz, D., "Towards content trust of web resources",
Edinburgh, Scotland, May 23-26, 2006, NY, pp. 565-574.
DOI: http://doi.acm.org/10.1145/1135777.1135861
[5] Soo Young Rieh,”Judgment of Information
Quality and Cognitive Authority in the Web”,
citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.
8991
[6] “Page rank algorithm”, www-
personal.ksu.edu/~eddery/Google.pdf
[7] “Page rank algorithm”,
http://www.markhorrell.com/seo/pagerank.html
[8] “HITS”,
http://en.wikipedia.org/wiki/HITS_algorithm#Algorith
m
[9] “Page Title”, http://www.seologic.com/faq/title-tags
[10] “Significance of page title”,
http://www.knowthis.com/principles-of-marketing-
tutorials/internet-marketing/page-title-for-sem/
[11] "Meta keywords", http://www.seologic.com/faq/meta-keywords
[12] “Meta description”, http://www.seologic.com/faq/meta-
descriptions
[13] “Alexa rank”, http://developers.evrsoft.com/find-traffic-
rank.shtml
[14] “Alexa rank”, http://kisswebmaster.com/importance-of-
alexa-ranking-why-should-we- increase-it/
[15] “Google Page rank”,
http://www.webworkshop.net/pagerank.html
[16] “Google inbound links”,
http://googlewebmastercentral.blogspot.com/2008/10/g
ood-times-with-inbound-links.html
[17] “Google indexed pages”,
http://www.googleguide.com/google_works.html
[18] “Last modified date and Domain age”,
http://www.raidenhttpd.com/en/manual/seo.html
[19] “WOT Rating”, http://www.mywot.com/
[20] “Yahoo indexed pages”,
http://www.webmastertools.info/tools/yahoo-indexed-
pages-checker/
[21] “Dmoz Listing”,
http://www.submitedge.com/dmoz_listing.html
[22] “Site adviser rating”, http://windowssecrets.com/top-
story/siteadvisor-ratings-may-be-1-year-out-of-date/
[23] “Significance of SEO Factors”,
http://www.dotsandcoms.us/data2/analysis_report.pdf
[24] “Bing indexed pages”,
http://www.x10tools.com/tools/bing-indexed-pages-
checker/
[25] “Ask indexed pages”,
http://www.ask.com/wiki/Category:Indexed_pages
[26] Haider A. Ramadhan and Khalil Shihab, "A Heuristic Based Approach for
Increasing the Page Ranking Relevancy in Hyperlink Oriented Search Engines:
Experimental Evaluation", International Journal of Theoretical and Applied
Computer Sciences, Volume 1, Number 1 (2006), pp. 49-62.