Measuring Peculiarity of Text Using Relation between Words on the Web.
ABSTRACT: We define the peculiarity of text as a metric of information credibility. Higher peculiarity means lower credibility. We extract
the theme word and the characteristic words from text and check whether there is a subject-description relation between them.
The peculiarity is defined using the ratio of the subject-description relation between a theme word and characteristic words.
We evaluate the extent to which peculiarity can be used to judge credibility by classifying text from Wikipedia and Uncyclopedia in terms
of its peculiarity.
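The abstract does not give the exact formula, so the following is only a minimal sketch of one plausible reading: peculiarity is taken as the fraction of characteristic words that have no subject-description relation with the theme word. The function name `peculiarity`, the `has_relation` callback, and the toy relation set are all assumptions made for illustration; in the paper the relation is checked against the Web, not a hard-coded set.

```python
from typing import Callable, Iterable, Set, Tuple


def peculiarity(
    theme: str,
    characteristic_words: Iterable[str],
    has_relation: Callable[[str, str], bool],
) -> float:
    """Fraction of characteristic words NOT related to the theme word.

    Assumption: peculiarity = 1 - (related words / all characteristic words).
    The source only says the metric is "defined using the ratio".
    """
    words = list(characteristic_words)
    if not words:
        # No characteristic words: nothing peculiar can be observed.
        return 0.0
    related = sum(1 for w in words if has_relation(theme, w))
    return 1.0 - related / len(words)


# Hypothetical stand-in for the Web-based subject-description check.
KNOWN_RELATIONS: Set[Tuple[str, str]] = {
    ("penguin", "bird"),
    ("penguin", "antarctica"),
}


def toy_relation(theme: str, word: str) -> bool:
    """Return True if (theme, word) is in the hard-coded toy relation set."""
    return (theme, word) in KNOWN_RELATIONS


score = peculiarity("penguin", ["bird", "antarctica", "banker", "volcano"], toy_relation)
print(score)
```

Under this reading, a text whose characteristic words are mostly unrelated to its theme scores close to 1 and would be treated as less credible.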
- Source: available from citeseerx.ist.psu.edu
Conference Proceeding: Query Modification by Discovering Topics from Web Page Structures.
ABSTRACT: We propose a method that identifies, from Web pages, pairs of keywords in which one word describes the other, and uses these relations to modify the query. It takes into account the positions of the words in the page structures when counting their occurrences and applies statistical tests to examine the differences between word co-occurrence rates. It finds related keywords more robustly, regardless of word type, than conventional methods, which do not consider page structures. It can also identify subject and description keywords in the user's input and find additional keywords for detailing the query. By considering the document structures, our method can construct queries that are more focused on the user's topic of interest.
Advanced Web Technologies and Applications, 6th Asia-Pacific Web Conference, APWeb 2004, Hangzhou, China, April 14-17, 2004, Proceedings; 01/2004
Conference Proceeding: Combating Web Spam with TrustRank
ABSTRACT: Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
08/2004
ABSTRACT: Users often encounter unreliable information on the Web, but there is no system for checking its credibility easily and efficiently. In this paper, we propose a system that searches for information useful in checking the credibility of uncertain facts. The objective of our system is to help users judge credibility efficiently by comparing the input uncertain fact with other related facts, without having to check a large number of Web pages. For this purpose, the system collects from the Web comparative facts for the input fact and important aspects for comparing them, and estimates the validity of each fact.
10/2009: pages 291-305