Conference Paper

Integration of Semantic, Metadata and Image Search Engines with a Text Search Engine for Patent Retrieval.

Abstract

The combination of different search techniques can improve the results given by each one. In the ongoing R&D project PATExpert, four different search techniques are combined to perform a patent search. These techniques are: metadata search, keyword-based search, semantic search and image search. In this paper we propose a general architecture based on web services where each tool works in its own domain and provides a set of basic functionalities to perform the retrieval. To be able to combine the results from the four search engines, these must be fuzzy (using a membership function or similarity grade). We focus on how the fuzzy results can be obtained from each technique, and how they can then be combined. This combination must take into account the query, the similarity of the patent to each part of the query, and the confidence in the technique.
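To make the fusion step concrete, below is a minimal sketch of how fuzzy scores from the four engines could be combined. It is an illustration under our own assumptions (the engine names, confidence weights, and the weighted-average fusion rule are not taken from the paper):

```python
# Hedged sketch: fusing fuzzy membership scores from several search engines.
# Engine names, confidence weights, and the weighted-average fusion rule are
# illustrative assumptions, not the paper's actual formulas.

def combine_fuzzy_results(results_per_engine, confidence):
    """results_per_engine: {engine: {patent_id: score in [0, 1]}}
    confidence: {engine: weight in [0, 1]} expressing trust in each technique.
    Returns {patent_id: fused score in [0, 1]}."""
    fused = {}
    total_weight = sum(confidence.values())
    for engine, results in results_per_engine.items():
        w = confidence[engine] / total_weight
        for patent_id, score in results.items():
            fused[patent_id] = fused.get(patent_id, 0.0) + w * score
    return fused

if __name__ == "__main__":
    results = {
        "metadata": {"EP1234567": 1.0, "EP7654321": 0.4},
        "keyword":  {"EP1234567": 0.8},
        "semantic": {"EP7654321": 0.9},
        "image":    {"EP1234567": 0.5, "EP7654321": 0.7},
    }
    confidence = {"metadata": 1.0, "keyword": 0.8, "semantic": 0.6, "image": 0.5}
    ranked = sorted(combine_fuzzy_results(results, confidence).items(),
                    key=lambda kv: kv[1], reverse=True)
    print(ranked)
```

In this sketch a patent missing from an engine's result set simply contributes a membership grade of zero for that engine; a real merger would also have to respect the Boolean structure of the query.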
... Furthermore, there seem to be discrepancies between the material used by professional searchers and what is used for the development of new techniques. When reviewing the literature, one quickly notices a great number of publications [14], [15], [16], [17], [18], [19], [20], [21], [22], [8], [26] (Lupu, 2011). Here the problems start, e.g. the total lack of uniformity and sometimes the difficulty of accessing such documents. ...
... The Optical Structure Recognition Software (OSRA) was applied to recover chemical information [20] and Content-Based Image Retrieval (CBIR) for drawing retrieval [30] (Lupu, 2013). In addition, some rare attempts have been made to combine several techniques, including image search, in the IP domain [26]. This is obviously not a trivial task, but some other domains seem to have done better in this matter. ...
... Recently, both the Intellectual Property and the Information Retrieval communities have shown great interest in patent image search, expressed through research activities and publications in the area (e.g. [3], [4]) as well as prototype systems and demos (e.g. [5], [6]). ...
Chapter
Nowadays most patent search systems still rely upon text to provide retrieval functionalities. Recently, the intellectual property and information retrieval communities have shown great interest in patent image retrieval, which could augment the current practices of patent search. In this chapter, we present a patent image extraction and retrieval framework, which deals with patent image extraction and multimodal (textual and visual) metadata generation from patent images with a view to providing content-based search and concept-based retrieval functionalities. Patent image extraction builds upon page orientation detection and segmentation, while metadata extraction from images is based on the generation of low-level visual and textual features. The content-based retrieval functionality is based on visual low-level features, which have been devised to deal with complex black and white drawings. Extraction of concepts builds upon a supervised machine learning framework realised with Support Vector Machines and a combination of visual and textual features. We evaluate the different retrieval parts of the framework using a dataset from the footwear and lithography domains.
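As a hedged illustration of the concept extraction step, the sketch below trains a Support Vector Machine on concatenated visual and textual feature vectors. The feature dimensions, toy data, and scikit-learn pipeline are illustrative assumptions, not the chapter's actual implementation:

```python
# Hedged sketch: concept classification from combined visual and textual
# features with an SVM, in the spirit of the framework described above.
# Feature construction and data are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: 40 patent images, each with a 16-dim visual descriptor and an
# 8-dim textual descriptor (e.g. TF-IDF of the figure caption).
visual = rng.random((40, 16))
textual = rng.random((40, 8))
labels = rng.integers(0, 2, size=40)   # e.g. 1 = "shoe sole", 0 = other

# Early fusion: concatenate the two modalities into one feature vector.
features = np.hstack([visual, textual])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(features, labels)

query = np.hstack([rng.random(16), rng.random(8)]).reshape(1, -1)
print("concept probability:", clf.predict_proba(query)[0])
```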
... Searching patent drawings is currently a labour-intensive, error-prone task which would be facilitated by automatic indexing methods. Generic methods for automatically indexing patent drawings have been reported (Vrochidis et al., 2010; Codina et al., 2009), but the heterogeneity of patent drawings means that it is difficult for a one-size-fits-all approach to reach high levels of performance on all image classes. Flowcharts represent a large and useful subclass of patent images for which specially-adapted methods will improve indexing performance. ...
Article
Full-text available
The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented: Text-graphic segmentation based on connected-component clustering; Line segment bridging with an adaptive, oriented filter; Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry; Text object recognition using a noisy channel model to enhance the results of a commercial OCR package. Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available, so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification. The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
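For readers unfamiliar with the evaluation measure, here is a minimal sketch of an F-score averaged over node classification and edge detection. The counting conventions are our own assumptions; this is not the official CLEF-IP scorer:

```python
# Hedged sketch: an F-score that averages node-classification and
# edge-detection performance, ignoring edge directivity as described above.
# The counting conventions are illustrative assumptions.

def f1(tp, fp, fn):
    """Balanced F-score from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def flowchart_f_score(node_counts, edge_counts):
    """Average of the node-classification F-score and the edge-detection
    F-score (edges compared as unordered pairs, i.e. directivity ignored)."""
    return 0.5 * (f1(*node_counts) + f1(*edge_counts))

# Example: 9/10 node shapes classified correctly with 1 spurious node;
# 7/9 edges found with 2 spurious edges.
print(flowchart_f_score((9, 1, 1), (7, 2, 2)))
```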
... The actual merging of the results is performed by a merger service, so that the orchestrator only has to communicate with a single patent search service. The PATExpert search facility is described in more detail in [100]. For illustration, figure 9.6 shows a query that contains four query fragments for full-text, metadata, semantic and image search. ...
Thesis
Patents are of great importance for national and international economies, since they have a high impact on the development of products, trade, and research. Therefore, patents are analyzed by multiple interest groups for different purposes, for example for getting technical details, for observing competitors, or for detecting technological trends. Patent analysis typically includes the following steps: formulating a patent query, submitting it to one or more patent data services, merging and analyzing the results for getting an overview, selecting relevant patents for detailed analysis, and refining the query based on the gained insights until the results are satisfying. In general, patent analysis is a complex, time- and knowledge-intensive process, because of multiple application domains, data sources, and legal systems, but also because of the intentionally abstract expressions used in patents and the huge amount of patent data. Therefore, this thesis investigates the research question of how patent analysis can be supported by the field of visual analytics, which combines visualization, human-computer interaction, data analysis, and data management. A contribution of this thesis is a general architecture for visual patent analysis which is based on a semantic representation model. The model can be accessed from analysis and presentation components, enriched with analysis results, published, and reused. Analysis and presentation components are loosely coupled and exchangeable based on a shared terminology defined by ontologies. A second contribution is an ontology for patent metadata. It provides an integrated semantic representation of the major patent metadata aspects and allows metadata restrictions to be added to a semantic search. As a third contribution, this thesis describes new visualization techniques, for example a treemap technique for visualizing patent portfolios and co-classification relations and a graph-overlay-based technique for visualizing dependencies between text parts based on semantic relations. Additionally, this thesis contributes to the field of data security by describing a new encryption-based method for secure sharing of semantic models with a known group of recipients. The main idea of this method is to encrypt only sensitive information while keeping all non-sensitive information publicly readable (partial encryption). This method allows the usage of open infrastructures for the exchange of sensitive semantic models. The adequacy of the proposed architecture has been validated by three prototypes for different use cases. It has been shown that the architecture is flexible and extensible. A first evaluation with patent experts suggested that visual analytics is a promising approach for improving patent analysis.
... Due to the predictability and controllability of Boolean search queries, they are predominant in the patent domain and thus familiar to the targeted user group. They also facilitate the integration of the various search engines by performing simple set operations on their individual result sets [Codina et al., 2008]. The query editor provides two views on the same query model. ...
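As a hedged sketch of this set-operation integration (the engine names, identifiers and query shape are our own illustrative assumptions), a Boolean query over several engines reduces to intersections and unions of their result-ID sets:

```python
# Hedged sketch: merging result sets from several search engines with the
# set operations of a Boolean query, e.g.
#   (keyword AND metadata) OR image
# Engine names and result IDs are illustrative assumptions.

keyword_hits  = {"EP001", "EP002", "EP003"}
metadata_hits = {"EP002", "EP003", "EP004"}
image_hits    = {"EP005"}

merged = (keyword_hits & metadata_hits) | image_hits
print(sorted(merged))   # ['EP002', 'EP003', 'EP005']
```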
Thesis
The often-cited information explosion is not limited to volatile network traffic and massive multimedia capture data. Structured and high-quality data from diverse fields of study are becoming easily and freely available, too. This is due to crowd-sourced data collections, better sharing infrastructure, or, more generally speaking, the user-generated content of the Web 2.0 and the popular transparency and open data movements. At the same time as data generation is shifting to everyday casual users, data analysis is often still reserved to large companies specialized in content analysis and distribution, such as today's internet giants Amazon, Google, and Facebook. Here, fully automatic algorithms analyze metadata and content to infer interests and beliefs of their users and present only matching navigation suggestions and advertisements. Besides the problem of creating a filter bubble, in which users never see conflicting information due to the reinforcement nature of history-based navigation suggestions, the use of fully automatic approaches has inherent problems, e.g. being unable to find the unexpected and adapt to changes, which led to the introduction of the Visual Analytics (VA) agenda. If users intend to perform their own analysis on the available data, they are often faced with either generic toolkits that cover a broad range of applicable domains and features or specialized VA systems that focus on one domain. Neither is suited to support casual users in their analysis, as they don't match the users' goals and capabilities. The former tend to be complex and targeted at analysis professionals due to the large range of supported features and programmable visualization techniques. The latter trade general flexibility for improved ease of use and optimized interaction for a specific domain requirement. This work describes two approaches building on interactive visualization to reduce this gap between generic toolkits and domain-specific systems. The first one builds upon the idea that most data relevant for casual users are collections of entities with attributes. This least common denominator is commonly employed in faceted browsing scenarios and filter/flow environments. Thinking in sets of entities is natural and allows for a very direct visual interaction with the analysis subject, and it provides a common ground for adding analysis functionality to domain-specific visualization software. Encapsulating the interaction with sets of entities into a filter/flow graph component can be used to record analysis steps and intermediate results in an explicit structure to support collaboration, reporting, and reuse of filters and result sets. This generic analysis functionality is provided as a plug-in component and was integrated into several domain-specific data visualization and analysis prototypes. This way, the plug-in benefits from the implicit domain knowledge of the host system (e.g. selection semantics and domain-specific visualization) while being used to structure and record the user's analysis process. The second approach directly exploits encoded domain knowledge in order to help casual users interact with very specific domain data. By observing the interrelations in the ontology, the user interface can automatically be adjusted to indicate problems with invalid user input and transform the system's output to explain its relation to the user. Here, the domain-related visualizations are personalized and orchestrated for each user based on user profiles and ontology information.
In conclusion, this thesis introduces novel approaches at the boundary of generic analysis tools and their domain-specific context to extend the usage of visual analytics to casual users by exploiting domain knowledge for supporting analysis tasks, input validation, and personalized information visualization.
Chapter
In this chapter, we will analyse the current technologies available that deal with graphical information in patent retrieval applications and, in particular, with the problem of recognising and understanding the information carried by flowcharts. We will review some of the state-of-the-art techniques that have arisen from the graphics recognition community and their application in the intellectual property domain. We will present an overview of the different steps that make up a flowchart recognition system, looking also at the achievements and remaining challenges in such a domain.
Article
The ability to access patents and relevant patent-related information pertaining to a patented technology can fundamentally transform the patent system and its functioning, as well as patent institutions such as the USPTO and the federal courts. This paper describes an ontology-based computational framework that can resolve some of the difficult issues in retrieving patents and patent-related information for the legal and justice system.
Thesis
Today’s society generates and stores digital information in enormous amounts and at rapidly increasing rates. This trend affects all parts of modern society, such as commerce and economy, politics and governments, health and medicine, science in general, media and entertainment, the private sector, etc. The stored information comprises text documents, images, audio files, videos, structured data from a variety of sources, as well as multimodal combinations of them, and is available in a multitude of electronic formats and flavors. As a consequence, the need for automated and interactive tools supporting tasks, such as searching, exploring, monitoring, sorting, and making sense of this information at different levels of abstraction and within different but steadily converging domains, increases at the same pace as the data is generated and represents one of the biggest challenges for current computer science. A relatively young approach to tackle these tasks by exploiting human analytic power in synergetic combination with advanced computerized techniques has emerged with the research field of visual analytics. Visual analytics aims at combining automated methods, visualization techniques, and approaches from the field of human-computer interaction in order to equip analysts with more powerful tools, tailored to domains where large amounts of data must be analyzed. In this work, visual analytics methods and concepts play a central role. They are used to search and analyze texts or multimodal documents containing a considerable amount of textual content. The presented approaches are primarily employed for analyzing a very special type of document from the intellectual property domain, namely patents. Since the retrieval and analysis tasks carried out in the patent domain differ greatly from standard search and analysis tasks in terms of rigorous requirements, high costs, and the involved risks, new, more effective, efficient, and reliable methods need to be developed. Accordingly, this thesis focuses on researching the combination of automatic methods and information visualization by using advanced interaction techniques in order to improve upon the state of the art in patent literature retrieval. Such integration is achieved and exemplified through different visual analytics prototypes, aiming at creating support for real-world tasks and processes. The main contributions presented in this thesis encompass enhancements for all stages of patent literature analysis processes. This includes improving patent search by presenting techniques for interactive visual query building, which helps analysts to formulate complex information needs, the development of a technique that allows users to build their own precise search mechanism in the form of binary classifiers, and advanced approaches for making sense of a retrieved result set through visual analysis. The latter builds the basis for letting users generate the insights needed for judging and improving previous query formulations. Interaction methods facilitating forward analysis as well as feedback loops, which constitute a critical part of visual analytics approaches, are discussed afterwards. These methods are the key to integrating all stages of the patent analysis process in a seamless visual manner. Another contribution is the discussion of scalability issues in the context of the described visual analytics approaches.
In particular, interaction scalability, the recording of analytic provenance, insight management, the visualization of analytic reporting, and collaborative approaches are addressed. Although the described approaches are exemplified by applying them to the field of intellectual property analysis, the developments regarding search and analysis have the potential to be adapted to complicated text document retrieval and analysis tasks in various domains. The general ideas regarding the facilitation of low-level feedback loops, user-steered machine classification, and technical solutions for diminishing negative scalability effects can be directly transferred to other visual analytics scenarios.
Chapter
Ontologies have become a prominent topic in Computer Science where they serve as explicit conceptual knowledge models that make domain knowledge available to information systems. They play a key role in the vision of the Semantic Web where they provide the semantic vocabulary used to annotate websites in a way meaningful for machine interpretation. As studied in the context of information systems, ontologies borrow from the fields of symbolic knowledge representation in Artificial Intelligence, from formal logic and automated reasoning and from conceptual modeling in Software Engineering, while also building on Web-enabling features and standards. Although in Computer Science ontologies are a rather new field of study, certain accomplishments can already be reported from the current situation in ontology research. Web-compliant ontology languages based on a thoroughly understood theory of underlying knowledge representation formalisms have been and are being standardized for their widespread use across the Web. Methodological aspects about the engineering of ontologies are being studied, concerning both their manual construction and (semi)automated generation. Initiatives on “linked open data” for collaborative maintenance and evolution of community knowledge based on ontologies emerge, and the first semantic applications of Web-based ontology technology are successfully positioned in areas like semantic search, information integration, or Web community portals. This chapter will present ontologies as one of the major cornerstones of Semantic Web technology. It will first explain the notion of formal ontologies in Computer Science and will discuss the range of concrete knowledge models usually subsumed under this label. Next, the chapter surveys ontology engineering methods and tools, both for manual ontology construction and for the automated learning of ontologies from text. Finally, different kinds of usage of ontologies are presented and their benefits in various application scenarios illustrated.
Article
Relatively little research has been done on the topic of patent image retrieval, and in most approaches the retrieval is performed in terms of a similarity measure between the query image and the images in the corpus. However, systems aimed at overcoming the semantic gap between the visual description of patent images and their conveyed concepts would be very helpful for patent professionals. In this paper we present a flowchart recognition method aimed at achieving a structured representation of flowchart images that can be further queried semantically. The proposed method was submitted to the CLEF-IP 2012 flowchart recognition task. We report the obtained results on this dataset.
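A hedged sketch of what such a structured, queryable representation could look like (a toy encoding of ours, not the paper's actual output format):

```python
# Hedged sketch: a flowchart reduced to a structure that can be queried,
# e.g. "which decision nodes lead to 'reject'?". This toy encoding is an
# illustrative assumption, not the paper's actual output format.
flowchart = {
    "nodes": {
        "n1": {"shape": "oval",    "text": "start"},
        "n2": {"shape": "diamond", "text": "valid input?"},
        "n3": {"shape": "box",     "text": "reject"},
    },
    "edges": [("n1", "n2", "plain"), ("n2", "n3", "arrow")],
}

decisions_to_reject = [
    src for src, dst, _ in flowchart["edges"]
    if flowchart["nodes"][src]["shape"] == "diamond"
    and flowchart["nodes"][dst]["text"] == "reject"
]
print(decisions_to_reject)   # ['n2']
```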
Conference Paper
Full-text available
Knowledge articulation costs are the bottleneck for efficient Personal Knowledge Management (PKM). Current tools allow too few structures and hence have to rely only on keyword searches in plain text, allow no associative browsing, and cannot infer new knowledge. Semantic modelling tools, on the other hand, are too cumbersome to use and force the user to formalise everything all the time; this is too costly in PKM usage. Conceptual Data Structures (CDS) are what we have found to be the largest common denominator of information structures used in common knowledge artefacts. CDS allow step-wise and gradual formalisation, representing the spectrum from informal notes up to formal ontologies. This paper describes the CDS data model and ontology in detail and shows how CDS can largely be implemented with existing semantic web technologies. This research was supported by the European Commission under the Nepomuk project FP6-027705.
Article
Full-text available
iMapping is a technique for visually structuring information objects. It supports the full range from informal note-taking over semi-structured personal information management to formal knowledge models. With iMaps, users can easily go from overview to fine-grained structures while browsing, editing or refining the knowledge base in one comprehensive view. An iMap is comparable to a large whiteboard where information items can be positioned like post-its but also nested into each other. Spatial browsing and zooming as well as graphical editing facilities make it easy to structure content in an intuitive way. iMapping builds on a zooming user interface approach to facilitate navigation and to help users maintain an overview in the knowledge space. While a first implementation is being developed, iMapping is still in a conceptual stage. In this paper we describe the iMapping approach and how it tries to combine and extend the advantages of other approaches.
Article
Full-text available
Word Sense Disambiguation (WSD) is traditionally considered an AI-hard problem. A breakthrough in this field would have a significant impact on many relevant web-based applications, such as information retrieval and information extraction. This paper describes JIGSAW, a knowledge-based WSD system that attempts to disambiguate all words in a text by exploiting WordNet senses. The main assumption is that a specific strategy for each Part-Of-Speech (POS) is better than a single strategy. We evaluated the accuracy of JIGSAW on the SemEval-2007 task 1 competition. This task is an application-driven one, where the application is a fixed cross-lingual information retrieval system. Participants disambiguate text by assigning WordNet synsets; the system then has to perform the expansion to other languages, index the expanded documents and run the retrieval for all the languages in batch. The retrieval results are taken as a measure of the effectiveness of the disambiguation.
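As a hedged sketch of the one-strategy-per-POS idea (not JIGSAW's actual algorithm), the snippet below dispatches to NLTK's Lesk implementation with a POS-specific restriction; the POS mapping and example are illustrative assumptions:

```python
# Hedged sketch: POS-specific word sense disambiguation against WordNet,
# in the spirit of "one strategy per part of speech". This dispatches a
# single Lesk call per POS and is NOT JIGSAW's actual algorithm.
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')
from nltk.wsd import lesk

def disambiguate(context_tokens, word, pos_tag):
    """Map a coarse POS tag to a WordNet POS and disambiguate with Lesk.
    A full per-POS system would use a different strategy in each branch."""
    wn_pos = {"NOUN": wn.NOUN, "VERB": wn.VERB,
              "ADJ": wn.ADJ, "ADV": wn.ADV}.get(pos_tag)
    if wn_pos is None:
        return None                      # skip closed-class words
    return lesk(context_tokens, word, pos=wn_pos)

tokens = "the bank approved the loan application".split()
print(disambiguate(tokens, "bank", "NOUN"))   # prints one WordNet synset
```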
Article
Full-text available
The introduction of semantics on the web will lead to a new generation of services based on content rather than on syntax. Search engines will provide topic-based searches, retrieving resources conceptually related to the user's informational need. Queries will be expressed in several ways, and will be mapped onto the semantic level to define the topics that must be retrieved from the web. Moving towards this new Web era, effective semantic search engines will provide the means for successful searches, avoiding the heavy burden experienced by users in a classical query-string-based search task. In this paper we propose a search engine based on web resource semantics. Resources to be retrieved are semantically annotated using an existing open semantic elaboration platform, and an ontology is used to describe the knowledge domain in which to perform queries. Ontology navigation provides semantic-level reasoning in order to retrieve meaningful resources with respect to a given information request.
Article
Full-text available
By analysing the current structure and the usage patterns of collaborative tagging systems, we can identify many important aspects which still need to be improved. Problems related to synonymy, polysemy, different lexical forms, misspelling errors or alternate spellings, different levels of precision and different kinds of tag-to-resource association cause inconsistencies and reduce the efficiency of content search and the effectiveness of the tag space structuring and organization. They are mainly caused by the lack of semantic information in the tagging process. We propose a new way to describe resources: semantic tagging. It allows users to state semantic assertions: each of them expresses a defined characteristic of a resource by associating it with a concept. We present SemKey, a semantic collaborative tagging system, describing its global architecture and functioning along with the most relevant organizational issues faced. We explore the adequacy of the support offered by the entries of Wikipedia and WordNet for accessing and referencing concepts.
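A minimal sketch of what such a semantic assertion could look like as a data structure (field names and URIs are our own illustrative assumptions, not SemKey's schema):

```python
# Hedged sketch: a semantic tag as an assertion linking a resource to a
# concept via a typed characteristic, instead of a free-form keyword.
# Field names and URIs are illustrative assumptions, not SemKey's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticAssertion:
    resource: str        # what is being tagged
    characteristic: str  # the kind of tag-to-resource association
    concept: str         # a disambiguated concept, not a raw string

tag = SemanticAssertion(
    resource="http://example.org/photos/42.jpg",
    characteristic="depicts",
    concept="http://en.wikipedia.org/wiki/Jaguar",  # the animal, not the car
)
print(tag)
```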
Conference Paper
Full-text available
Classifications have been used for centuries with the goal of cataloguing and searching large sets of objects. In the early days it was mainly books; lately it has also become Web pages, pictures and any kind of electronic information item. Classifications describe their contents using natural language labels, which has proved very effective in manual classification. However, natural language labels show their limitations when one tries to automate the process, as they make it very hard to reason about classifications and their contents. In this paper we introduce the novel notion of Formal Classification, as a graph structure where labels are written in a propositional concept language. Formal Classifications turn out to be a form of lightweight ontology. This, in turn, allows us to reason about them, to associate with each node a normal-form formula which univocally describes its contents, and to reduce document classification to reasoning about subsumption.
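To illustrate the kind of reasoning involved, here is a toy encoding of ours (not the paper's full propositional language) in which each label is a conjunction of atomic concepts, so that subsumption reduces to set containment:

```python
# Hedged sketch: subsumption over labels written as conjunctions of atomic
# concepts. Under this toy encoding (ours, not the paper's full concept
# language), A is subsumed by B iff B's atoms are a subset of A's.

def subsumed_by(concept_a, concept_b):
    """True if every document matching concept_a also matches concept_b."""
    return set(concept_b) <= set(concept_a)

node = {"wine", "italy", "tuscany"}       # a "Tuscan wines" node
query = {"wine", "italy"}                 # an "Italian wines" query concept

# The node's contents fall under the query concept, so documents classified
# there answer the query by subsumption reasoning alone.
print(subsumed_by(node, query))   # True
```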
Conference Paper
Full-text available
An information-seeking system is described which combines traditional keyword querying of WWW resources with the ability to browse and query against RDF annotations of those resources. RDF(S) and RDF are used to specify and populate an ontology and the resultant RDF annotations are then indexed along with the full text of the annotated resources. The resultant index allows both keyword querying against the full text of the document and the literal values occurring in the RDF annotations, along with the ability to browse and query the ontology. We motivate our approach as a key enabler for fully exploiting the semantic Web in the area of knowledge management and argue that the ability to combine searching and browsing behaviours more fully supports a typical information-seeking task. The approach is characterised as "low threshold, high ceiling" in the sense that where RDF annotations exist they are exploited for an improved information-seeking experience but where they do not yet exist, a search capability is still available.
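A minimal sketch of the combined indexing idea, assuming a toy inverted index and made-up annotation literals rather than the system's real index:

```python
# Hedged sketch: one inverted index over both document full text and the
# literal values of RDF annotations, so a keyword query can hit either.
# The index structure and example values are illustrative assumptions.
from collections import defaultdict

index = defaultdict(set)   # term -> set of document URIs

def index_document(uri, full_text, rdf_literals):
    for term in full_text.lower().split():
        index[term].add(uri)
    for literal in rdf_literals:         # e.g. objects of annotation triples
        for term in literal.lower().split():
            index[term].add(uri)

index_document(
    "http://example.org/doc1",
    "minutes of the project kickoff meeting",
    ["Alice Smith", "Knowledge Management"],   # rdfs:label style literals
)
print(index["smith"])    # doc1 is found via its annotation, not its text
```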
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
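As a hedged illustration of the "measure of syllable length" mentioned above, here is a simplified re-implementation of the Porter measure m, one ingredient of the algorithm (the full algorithm also handles 'y' and the suffix rules themselves):

```python
# Hedged sketch: the Porter "measure" m of a stem, i.e. the number of
# vowel-to-consonant transitions in its [C](VC)^m[V] form. Simplified:
# the special handling of 'y' is omitted here, unlike the full algorithm.

def measure(stem):
    vowels = set("aeiou")
    m, prev_is_vowel = 0, False
    for ch in stem.lower():
        is_vowel = ch in vowels
        if prev_is_vowel and not is_vowel:
            m += 1                      # close one VC sequence
        prev_is_vowel = is_vowel
    return m

# Rules like "(m > 1) EMENT -> " strip a suffix only if the remaining stem
# is long enough, e.g. 'replacement' -> 'replac' (measure 2 > 1).
for stem in ["tr", "tree", "trouble", "oats", "private", "replac"]:
    print(stem, measure(stem))
```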
Conference Paper
Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from hardly redundant sources like corporate intranets, encyclopedic works or scientific databases. We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic algorithm for pattern induction on seven different datasets. We show that the lack of redundancy leads to the need for a large amount of training data, but that integrating Web extraction into the process leads to a significant reduction of the required training data while maintaining the accuracy of Wikipedia. In particular we show that, though the use of the Web can have effects similar to those produced by increasing the number of seeds, it leads overall to better results. Our approach thus allows us to combine the advantages of two sources: the high reliability of a closed corpus and the high redundancy of the Web.
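As a hedged sketch of the basic pattern induction loop (a minimal toy with made-up seed pairs and sentences, not the paper's weakly supervised algorithm):

```python
# Hedged sketch: one iteration of seed-based textual pattern induction.
# Seeds and sentences are made up; the real algorithm adds filtering,
# scoring, and iterative bootstrapping over Wikipedia and the Web.
import re

seeds = [("Paris", "France"), ("Rome", "Italy")]   # capital_of relation
corpus = [
    "Paris is the capital of France.",
    "Rome is the capital of Italy.",
    "Berlin is the capital of Germany.",
]

# Induce patterns: replace the seed pair with slots in sentences mentioning both.
patterns = set()
for x, y in seeds:
    for sentence in corpus:
        if x in sentence and y in sentence:
            patterns.add(sentence.replace(x, "(\\w+)").replace(y, "(\\w+)"))

# Apply the induced patterns to extract new instances of the relation.
for pattern in patterns:
    for sentence in corpus:
        match = re.fullmatch(pattern, sentence)
        if match:
            print(match.groups())   # includes the new pair ('Berlin', 'Germany')
```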