Automatically mining review records from forum Web sites
ABSTRACT The rapid development of Web 2.0 has brought a flourishing of web reviews. Web reviews are usually published in the form of structured records. As an important information source for many popular applications (e.g., monitoring and analysis of public opinion), review records need to be extracted accurately from web pages. To the best of our knowledge, little work in the literature has systematically investigated this problem. Besides the variety of web page templates, user-generated review content raises a new challenge. The inconsistency of review content in both the DOM tree and visual appearance impairs the similarity among review records, which seriously degrades the performance of existing solutions for web data record extraction. To tackle this challenge, we propose a novel approach that performs automatic extraction of review records by employing sophisticated techniques. Our experimental results over 20 forum web sites indicate that the proposed approach can achieve high extraction accuracy.