ABSTRACT: The concept of culturomics was born out of the availability of massive amounts of textual data and the interest to make sense of cultural and language phenomena over time. Thus far, however, culturomics has only made use of, and shown the great potential of, statistical methods. In this paper, we present a vision for a knowledge-based culturomics that complements traditional culturomics. We discuss the possibilities and challenges of combining knowledge-based methods with statistical methods and address major challenges that arise from the nature of the data: diversity of sources, changes in language over time, and temporal dynamics of information in general. We address all layers needed for knowledge-based culturomics, from natural language processing and relations to summaries and opinions.
Full-text · Article · Apr 2015 · International Journal on Digital Libraries
ABSTRACT: In this paper, we investigate the problem of segmenting images using the information in text annotations. In contrast to the general image understanding problem, this type of annotation-guided segmentation is less ill-posed in the sense that there is a higher consensus among human annotators about the expected output. In the paper, we present a system based on a combined visual and semantic pipeline. In the visual pipeline, a list of tentative figure-ground segmentations is first proposed. Each such segmentation is classified into a set of visual categories. In the natural language processing pipeline, the text is parsed and chunked into objects. Each chunk is then compared with the visual categories, and the relative distance is computed using the WordNet structure. The final choice of segments and their correspondence to the chunked objects are then obtained using combinatorial optimization. The output is compared to manually annotated ground-truth images. The results are promising, and there are several interesting avenues for continued research.
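The chunk-to-category matching step can be sketched with a WordNet-style hypernym path similarity. This is a minimal illustration on a toy taxonomy, not the paper's actual WordNet-based distance; all names and the taxonomy are assumptions.

```python
# Toy sketch of matching text chunks to visual categories via a
# WordNet-style hypernym path similarity. The taxonomy is illustrative only.

TOY_HYPERNYMS = {
    "horse": "ungulate", "cow": "ungulate", "ungulate": "animal",
    "person": "animal", "animal": "entity",
    "car": "vehicle", "vehicle": "entity",
}

def hypernym_path(word):
    """Return the chain from a word up to the taxonomy root."""
    path = [word]
    while path[-1] in TOY_HYPERNYMS:
        path.append(TOY_HYPERNYMS[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + length of the shortest path via the lowest common ancestor)."""
    pa, pb = hypernym_path(a), hypernym_path(b)
    for i, node in enumerate(pa):
        if node in pb:
            return 1.0 / (1 + i + pb.index(node))
    return 0.0

def best_category(chunk, categories):
    """Assign a text chunk to the semantically closest visual category."""
    return max(categories, key=lambda c: path_similarity(chunk, c))
```

For example, `best_category("cow", ["horse", "car"])` prefers `"horse"`, since both share the close ancestor "ungulate".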
ABSTRACT: We present a comparative exploratory analysis of two proximity networks of mobile phone users: the Proximates network and the Reality Mining network. Data for both networks were collected from mobile phones carried by two groups of users. Periodic Bluetooth scans were performed to detect the proximity of other mobile phones. The Reality Mining project took place in 2004-2005 at MIT, while Proximates took place in Sweden in 2012-2013. We show that the differences in sampling strategy between the two networks have effects on both static and dynamic metrics. We also find that fundamental metrics of the static Proximates network capture social interaction characteristics better than those of the static Reality Mining network.
ABSTRACT: A smartphone is a personal device and as such usually hosts multiple public user identities, such as a phone number, an email address, and a Facebook account. As each smartphone has a unique Bluetooth MAC address, Bluetooth discovery can be used in combination with the user's registration of a Facebook account to a server. This makes it possible to identify a nearby smartphone belonging to a given Facebook contact. Using this capability, we designed Memorit, an application that handles reminders, alerting the user when a given contact is nearby. This makes it possible to trigger a user-defined reminder, for example to give back a book, when a registered contact comes into proximity. Data collected from Memorit will allow the study of pervasive social context. This paper gives an overview of Memorit, its features and implementation, and evaluates its applicability through a user study.
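The trigger logic described above can be sketched as follows. All names and data structures here are assumptions for illustration, not Memorit's actual API: a Bluetooth scan yields MAC addresses, a registry maps MACs to contacts, and a matching reminder fires once.

```python
# Minimal sketch of a proximity-triggered reminder check (hypothetical data).

mac_to_contact = {"00:1A:7D:DA:71:13": "alice"}   # server-side registration table
reminders = {"alice": "Give back the book"}        # user-defined reminders

def on_scan(discovered_macs):
    """Return reminders triggered by the contacts found in a Bluetooth scan."""
    triggered = []
    for mac in discovered_macs:
        contact = mac_to_contact.get(mac)
        if contact in reminders:
            # fire at most once per reminder
            triggered.append((contact, reminders.pop(contact)))
    return triggered
```

A scan that discovers the registered MAC returns the reminder; a second scan returns nothing, since the reminder has already fired.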
ABSTRACT: In this paper, we describe a novel system that identifies relations between the objects extracted from an image. We started from the idea that, in addition to the geometric and visual properties of the image objects, we could exploit lexical and semantic information from the text accompanying the image. As an experimental setup, we gathered a corpus of images from Wikipedia along with their associated articles. We extracted two types of objects, human beings and horses, and considered three relations that could hold between them: Ride, Lead, or None. We used geometric features as a baseline to identify the relations between the entities, and we describe the improvements brought by the addition of bag-of-words features and predicate-argument structures derived from the text. The best semantic model resulted in a relative error reduction of more than 18% over the baseline.
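The combination of geometric and textual features can be illustrated with a small feature-construction sketch. The specific features and vocabulary below are assumptions, not the paper's actual feature set; the point is only how bounding-box geometry and bag-of-words counts concatenate into one vector for the relation classifier.

```python
# Illustrative feature construction for a person-horse relation classifier:
# geometric cues from two bounding boxes plus bag-of-words counts from the
# accompanying text. Feature choices and vocabulary are hypothetical.

def geometric_features(person_box, horse_box):
    """Boxes are (x, y, w, h). Returns simple spatial cues."""
    px, py, pw, ph = person_box
    hx, hy, hw, hh = horse_box
    return [
        (px + pw / 2) - (hx + hw / 2),   # horizontal centre offset
        (py + ph / 2) - (hy + hh / 2),   # vertical centre offset (a rider sits above)
        (pw * ph) / (hw * hh),           # relative area of the two objects
    ]

VOCAB = ["ride", "rides", "riding", "lead", "leads", "leading"]

def bow_features(text):
    """Bag-of-words counts over a small relation-bearing vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in VOCAB]

def feature_vector(person_box, horse_box, caption):
    return geometric_features(person_box, horse_box) + bow_features(caption)
```

A caption mentioning "rides" lights up the corresponding bag-of-words dimension, giving the classifier lexical evidence beyond geometry alone.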
ABSTRACT: In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
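The layered annotation model can be sketched in a few lines. The field names below are assumptions for illustration, not KOSHIK's actual Avro schema; the point is that each processing step adds a layer while the original text stays untouched.

```python
# Sketch of a standoff annotation model: layers of annotation are added
# incrementally, and the original document text is never modified.
# Field names are hypothetical, not KOSHIK's Avro schema.

def new_document(text):
    return {"text": text, "layers": {}}

def add_layer(doc, name, annotations):
    """Add one annotation layer (e.g. token spans) without touching the text."""
    doc["layers"][name] = annotations
    return doc

doc = new_document("KOSHIK parses Wikipedia.")
add_layer(doc, "tokens", [(0, 6), (7, 13), (14, 23)])   # character spans
add_layer(doc, "ner", [("ORG", 0, 6)])                  # label + span
```

Standoff spans like these also serialize naturally to a binary record format such as Avro, which is one reason the design suits a Hadoop pipeline.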
ABSTRACT: The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts, and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, grows at an accelerating rate. These volumes of digitally available text have grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information they contain. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for the semantic processing of big Swedish text, focusing on the theoretical and methodological advancement of extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.
ABSTRACT: In this paper, we introduce a method to use written natural language instructions to program assembly tasks for industrial robots. In our application, we used a state-of-the-art semantic and syntactic parser together with semantically rich world and skill descriptions to create high-level symbolic task sequences. From these sequences, we generated executable code for both virtual and physical robot systems. Our focus lies on the applicability of these methods in an industrial setting with real-time constraints.
ABSTRACT: Several studies have shown the value of using proximity data to understand the social context of users. To simplify the use of social context in application development, we have developed Proximates, a social context engine for mobile phones. It scans nearby Bluetooth peers to determine what devices are in proximity. We map Bluetooth MAC ids to user identities on existing social networks, which then allows Proximates to infer the social context of the user. The main contribution of Proximates is its use of link attributes retrieved from Facebook for granular relationship classification. We also show that Proximates can bridge the gap between physical and digital social interactions, by showing that it can be used to measure how much time a user spends in physical proximity with his Facebook friends. In this paper, we present the architecture and initial experimental results on deployment and usability aspects from users of an example application. We also discuss using location for proximity detection versus direct sensing using Bluetooth.
ABSTRACT: Using semantic parsing or related techniques, it is possible to extract knowledge from text in the form of predicate-argument structures. Such structures are often called propositions. With the advent of massive corpora such as Wikipedia, it has become possible to apply a systematic analysis to a wide range of documents covering a significant part of human knowledge and build large proposition databases from them. While most approaches focus on shallow syntactic analysis and do not capture the full meaning of a sentence, semantic parsing goes deeper and discovers more information from text with higher accuracy. This deeper analysis can be applied to discover temporal and location-based propositions from documents. Medical researchers could, for instance, discover articles regarding the interaction of bacteria in a specific body part. Christensen et al. (2010) showed that using a semantic parser in information extraction can yield extractions with higher precision and recall in areas where shallow syntactic approaches have failed. This accuracy comes at the cost of increased parsing time.
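A proposition database of the kind described above can be sketched with a minimal schema: each proposition is a predicate with labeled arguments, optionally carrying time and location, so that temporal and location-based queries become possible. The schema and role labels below are assumptions, not the paper's actual database layout.

```python
# Minimal sketch of a queryable proposition store. The schema (PropBank-style
# role labels, optional time/location slots) is a hypothetical illustration.

from dataclasses import dataclass, field

@dataclass
class Proposition:
    predicate: str
    args: dict = field(default_factory=dict)  # role -> filler, e.g. {"A0": "bacteria"}
    time: str = None
    location: str = None

def query(props, predicate=None, location=None):
    """Filter a proposition database on predicate and location."""
    return [p for p in props
            if (predicate is None or p.predicate == predicate)
            and (location is None or p.location == location)]
```

This supports exactly the medical example in the abstract: filtering propositions about bacterial interactions down to those anchored in a specific body part.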
ABSTRACT: This paper describes an attempt to provide more intelligence to industrial robotics and automation systems. We develop an architecture to integrate disparate knowledge representations used in different places in robotics and automation. This knowledge integration framework (KIF), a possibly distributed entity, abstracts the components used in design or production as data sources and provides uniform access to them via standard interfaces. Representation is based on the ontology formalizing the process, product, and resource triangle, where skills are considered the common element of the three. Production knowledge is currently being collected, and a preliminary version of the KIF is undergoing verification.
ABSTRACT: From metropolitan areas to tiny villages, there is a wide variety of organizers of cultural, business, entertainment, and social events. These organizers publish such information to an equally wide variety of sources. Every source of published events uses its own document structure and provides different sets of information. This raises significant customization issues. This paper explores the possibilities of extracting future events from a wide range of web sources, to determine if the document structure and content can be exploited for time-efficient hyperlocal event scraping. We report on two experimental knowledge-driven, pattern-based programs that scrape events from web pages using both their content and structure.
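The idea of exploiting both structure and content can be sketched with a tiny scraper: a markup pattern (a repeated element class) locates candidate events, and a content pattern (a date regex) confirms them. The markup convention and regex below are assumptions for illustration, not patterns from the paper's two programs.

```python
# Sketch of knowledge-driven, pattern-based event scraping: page structure
# (a hypothetical <li class="event"> convention) plus content patterns
# (an ISO-date regex) jointly identify event entries.

import re
from html.parser import HTMLParser

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # dates signal a real event entry

class EventScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_event = False
        self.events = []

    def handle_starttag(self, tag, attrs):
        # structural pattern: list items marked as events
        if tag == "li" and ("class", "event") in attrs:
            self.in_event = True

    def handle_data(self, data):
        # content pattern: keep only entries that contain a date
        if self.in_event and DATE_RE.search(data):
            self.events.append(data.strip())
            self.in_event = False

html = '<ul><li class="event">Concert 2025-06-01</li><li>Other text</li></ul>'
scraper = EventScraper()
scraper.feed(html)
```

Entries that match the structural pattern but lack a date, or vice versa, are discarded, which is what keeps this kind of scraping time-efficient per source.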
ABSTRACT: This paper describes the structure of the LTH coreference solver used in the closed track of the CoNLL 2012 shared task (Pradhan et al., 2012). The solver core is a mention classifier that uses Soon et al. (2001)'s algorithm and features extracted from the dependency graphs of the sentences. This system builds on Björkelund and Nugues (2011)'s solver, which we extended so that it can be applied to the three languages of the task: English, Chinese, and Arabic. We designed a new mention detection module that removes pleonastic pronouns, prunes constituents, and recovers mentions when they do not exactly match a noun phrase. We carefully redesigned the features so that they reflect more complex linguistic phenomena as well as discourse properties. Finally, we introduced a minimal cluster model grounded in the first mention of an entity. We optimized the feature sets for the three languages: we carried out an extensive evaluation of pairs of features, and we complemented the single features with associations that improved the CoNLL score. We obtained the respective scores of 59.57, 56.62, and 48.25 on English, Chinese, and Arabic on the development set, 59.36, 56.85, and 49.43 on the test set, and the combined official score of 55.21.
ABSTRACT: In this paper, we describe an end-to-end system that automatically extracts RDF triples describing entity relations and properties from unstructured text. This system is based on a pipeline of text processing modules that includes a semantic parser and a coreference solver. By using coreference chains, we group entity actions and properties described in different sentences and convert them into entity triples. We applied our system to over 114,000 Wikipedia articles and we could extract more than 1,000,000 triples. Using an ontology-mapping system that we bootstrapped using existing DBpedia triples, we mapped 189,000 extracted triples onto the DBpedia namespace. These extracted entities are available online in the N-Triple format.
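The final serialization step can be sketched as follows. The namespace URIs and facts below are hypothetical placeholders, not the paper's DBpedia mapping; the sketch only shows how facts grouped by a coreference chain become N-Triples lines about a single entity.

```python
# Sketch of serializing coreference-grouped entity facts as N-Triples.
# The example namespace and facts are illustrative, not real DBpedia data.

def to_ntriple(subj, pred, obj):
    """Serialize one entity relation as an N-Triples line."""
    base = "http://example.org/resource/"
    prop = "http://example.org/property/"
    return f"<{base}{subj}> <{prop}{pred}> <{base}{obj}> ."

# One coreference chain ("Alan Turing", "he", ...) yields several triples
# about the same subject, even though the facts came from different sentences:
facts = [("Alan_Turing", "bornIn", "London"),
         ("Alan_Turing", "fieldOf", "Computer_science")]
triples = [to_ntriple(*f) for f in facts]
```

Because every coreferent mention resolves to the same subject URI, properties scattered across sentences end up attached to one entity in the output.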
ABSTRACT: This paper describes a named entity recognition (NER) system for short text messages (SMS) running on a mobile platform. Most NER systems deal with text that is structured, formal, well written, with a good grammatical structure and few spelling errors. SMS text messages lack these qualities and instead use a shorthand, mixed language studded with emoticons, which makes NER a challenge on this kind of material. We implemented a system that recognizes named entities from SMSes written in Swedish and that runs on an Android cellular telephone. The entities extracted are locations, names, dates, times, and telephone numbers, with the idea that extraction of these entities could be utilized by other applications running on the telephone. We started from a regular expression implementation that we complemented with classifiers using logistic regression. We optimized the recognition so that the incoming text messages could be processed on the telephone with a fast response time. We reached an F-score of 86 for strict matches and 89 for partial matches.
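The regular-expression stage of such a system can be sketched as below. The patterns are simplified illustrations of the entity types mentioned in the abstract (the classifier stage and the actual Swedish patterns are omitted), so treat the regexes as assumptions rather than the system's rules.

```python
# Sketch of a regex-based first stage for SMS entity extraction.
# Patterns are simplified placeholders for times, dates, and Swedish
# mobile phone numbers.

import re

PATTERNS = {
    "TIME": re.compile(r"\b\d{1,2}[:.]\d{2}\b"),        # 18.30 or 18:30
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}\b"),         # 3/6
    "PHONE": re.compile(r"\b07\d[-\s]?\d{7}\b"),        # 070-1234567
}

def extract_entities(sms):
    """Return (label, matched text) pairs found in an SMS."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(sms):
            found.append((label, m.group()))
    return found
```

In the full system described above, matches like these would then be confirmed or re-labeled by the logistic-regression classifiers.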
Full-text · Article · Dec 2011 · Procedia - Social and Behavioral Sciences
ABSTRACT: Model-based systems in control are a means to utilize human knowledge efficiently and achieve high performance. While models consisting of formalized knowledge are used during the engineering step, running systems usually do not contain a high-level, symbolic representation of the control and most of its properties, which typically remain only as numerical parameters. On a system level, and beyond the plant data, there is also a need to represent the meaning of the data so that deployment and fault analysis can be augmented with partly automated inference based on the semantics of the data. To that end, we extended the formalized knowledge traditionally used in control to include the control purpose, engineering assumptions, quality, involved state machines, and so on. We then represented the control semantics in a format that allows easier extraction of information using querying and reasoning. It aims at making knowledge in control engineering reusable so that it can be shipped together with the control systems. We implemented prototypes that include the automatic conversion of plant data from AutomationML into RDF triples, as well as the automated extraction of control properties, the conversion of parameters, and their storage in the same triple store. Although these techniques are standard within the semantic web community, we believe that our robotic prototypes for semantic control represent a novel approach.
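The AutomationML-to-RDF conversion can be sketched with standard XML parsing. AutomationML is XML-based (CAEX), so the sketch below walks `InternalElement` nodes and emits triples; note that it simplifies the real CAEX schema (attribute values actually live in child elements), and the namespace is hypothetical.

```python
# Minimal sketch of converting AutomationML plant data into RDF triples.
# Simplified CAEX: real AutomationML stores attribute values in child
# <Value> elements; the namespace below is a placeholder.

import xml.etree.ElementTree as ET

AML = """<InstanceHierarchy>
  <InternalElement Name="Robot1">
    <Attribute Name="MaxPayload" Value="6"/>
  </InternalElement>
</InstanceHierarchy>"""

def aml_to_triples(xml_text, ns="http://example.org/plant#"):
    """Each InternalElement becomes an RDF subject; its attributes become triples."""
    triples = []
    root = ET.fromstring(xml_text)
    for elem in root.iter("InternalElement"):
        subj = ns + elem.get("Name")
        for attr in elem.iter("Attribute"):
            triples.append((subj, ns + attr.get("Name"), attr.get("Value")))
    return triples
```

Once in a triple store, such plant data can be queried and combined with the extracted control properties mentioned above.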
ABSTRACT: Robots used in manufacturing today are tailored to their tasks by system integration based on expert knowledge of both production and machine control. For upcoming generations of even more flexible robot solutions, in applications such as dexterous assembly, robot setup and programming become even more challenging. Reuse of solutions, in terms of parameters, controls, process tuning, and software modules in general, then becomes increasingly important. There has been valuable progress in the reuse of automation solutions when machines comply with standards and behave according to nominal models. However, more flexible robots with sensor-based manipulation skills and cognitive functions for human interaction are far too complex to manage, and solutions are rarely reusable, since knowledge is either implicit in imperative software or not captured in machine-readable form. We propose techniques that build on existing knowledge by converting structured data into an RDF-based knowledge base. Through enhancements of industrial control systems and available engineering tools, such knowledge can be gradually extended as part of the interaction during the definition of the robot task.
ABSTRACT: In this paper, we describe a coreference solver based on the extensive use of lexical features and features extracted from dependency graphs of the sentences. The solver uses Soon et al. (2001)'s classical resolution algorithm based on a pairwise classification of the mentions. We applied this solver to the closed track of the CoNLL 2011 shared task (Pradhan et al., 2011). We carried out a systematic optimization of the feature set using cross-validation that led us to retain 24 features. Using this set, we reached a MUC score of 58.61 on the test set of the shared task. We analyzed the impact of the features on the development set and we show the importance of lexicalization as well as of properties related to dependency links in coreference resolution.
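The closest-first structure of Soon et al. (2001)'s resolution algorithm can be sketched in a few lines. The pairwise classifier below is a stand-in rule (plain string match), not the feature-based model from the paper; only the right-to-left linking loop mirrors the algorithm.

```python
# Sketch of Soon et al. (2001)-style closest-first resolution: for each
# mention, candidate antecedents are tested right to left, and the first
# accepted pair forms a coreference link. classify_pair is a toy stand-in.

def classify_pair(antecedent, mention):
    """Stand-in for the pairwise classifier: exact string match."""
    return antecedent.lower() == mention.lower()

def resolve(mentions):
    """Return, for each mention, the index of its antecedent (or None)."""
    links = []
    for j, mention in enumerate(mentions):
        antecedent = None
        for i in range(j - 1, -1, -1):          # scan candidates right to left
            if classify_pair(mentions[i], mention):
                antecedent = i                  # closest accepted antecedent wins
                break
        links.append(antecedent)
    return links
```

In the actual solver, `classify_pair` would be the trained classifier over the 24 lexical and dependency-graph features described above.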