Adapting Associative Classification to Text Categorization

Associative classification, which originates from numerical data mining, has recently been applied to text data. Text data is first digitized into a database of transactions, and training and prediction are then conducted on the derived numerical dataset. This intuitive strategy has demonstrated quite good performance. However, it does not fully take into account the inherent characteristics of text data, although it has to deal with some text-specific problems such as lemmatizing and stemming during digitization. In this paper, we propose a bottom-up strategy to adapt associative classification to text categorization, in which we take into account the structure information of text. Experiments on the Reuters-21578 dataset show that the proposed strategy can make use of text structure information and achieve better performance.
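As a rough illustration of the baseline strategy the abstract describes (not the paper's bottom-up method), the sketch below turns each document into a transaction of word items and mines class association rules that meet a minimum support and confidence. The toy documents, the thresholds, and the names `to_transaction`, `mine_rules`, and `classify` are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def to_transaction(text):
    """Digitize a document as a set of lowercase word items."""
    return frozenset(text.lower().split())


def mine_rules(docs, labels, min_support=2, min_conf=0.8, max_len=2):
    """Return (itemset, label, confidence) rules meeting both thresholds."""
    itemset_counts = Counter()  # how often each itemset occurs at all
    rule_counts = Counter()     # how often it occurs together with a label
    for text, label in zip(docs, labels):
        items = sorted(to_transaction(text))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1
    rules = []
    for (combo, label), n in rule_counts.items():
        conf = n / itemset_counts[combo]
        if n >= min_support and conf >= min_conf:
            rules.append((combo, label, conf))
    # Prefer higher confidence, then longer (more specific) itemsets.
    rules.sort(key=lambda r: (-r[2], -len(r[0]), r[0]))
    return rules


def classify(text, rules, default=None):
    """Predict with the first (best-ranked) rule whose itemset matches."""
    items = to_transaction(text)
    for combo, label, _ in rules:
        if items.issuperset(combo):
            return label
    return default


# Toy training data (hypothetical, not taken from Reuters-21578).
docs = ["oil prices rise", "oil exports fall",
        "wheat harvest rise", "wheat exports grow"]
labels = ["crude", "crude", "grain", "grain"]
rules = mine_rules(docs, labels)
print(rules)
print(classify("wheat prices rise", rules))
```

Treating a document as an unordered set of words is exactly the simplification the abstract criticizes: all structure information is discarded before the rules are mined.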

Read the paper:

Adapting Associative Classification to Text Categorization

by Baoli Li, Neha Sugandh, Ernie Garcia, Ashwin Ram

ACM Conference on Document Engineering (ACM-DocEng-07), Winnipeg, Canada, August 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-13.pdf

Machine Learning Based Semantic Inference: Experiments and Observations at RTE-3

Textual Entailment Recognition is a semantic inference task that is required in many natural language processing (NLP) applications. In this paper, we present our system for the third PASCAL recognizing textual entailment (RTE-3) challenge. The system is built on a machine learning framework with the following features derived by state-of-the-art NLP techniques: lexical semantic similarity (LSS), named entities (NE), dependent content word pairs (DEP), average distance (DIST), negation (NG), task (TK), and text length (LEN).

On the RTE-3 test dataset, our system achieves accuracies of 0.64 and 0.6488 for the two official submissions, respectively. Experimental results show that LSS and NE are the most effective features. Further analyses indicate that a baseline dummy system can achieve an accuracy of 0.545 on the RTE-3 test dataset, which makes RTE-3 relatively easier than RTE-2 and RTE-1. In addition, we demonstrate with examples that the current Average Precision measure and its evaluation process need to be changed.

Read the paper:

Machine Learning Based Semantic Inference: Experiments and Observations at RTE-3

by Baoli Li, Joseph Irwin, Ernie Garcia, Ashwin Ram

Association for Computational Linguistics (ACL) Challenge Workshop on Textual Entailment and Paraphrase (WTEP-07), Prague, Czech Republic, June 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-12.pdf

Domain Ontology Construction from Biomedical Text

NLM’s Unified Medical Language System (UMLS) is a very large ontology of biomedical and health data. In order to be used effectively for knowledge processing, it needs to be customized to a specific domain. In this paper, we present techniques to automatically discover domain-specific concepts, discover relationships between these concepts, build a context map from these relationships, link these domain concepts with the best-matching concept identifiers in UMLS using our context map and UMLS concept trees, and finally assign categories to the discovered relationships. This specific domain ontology of terms and relationships using evidential information can serve as a basis for applications in analysis, reasoning and discovery of new relationships. We have automatically built an ontology for the Nuclear Cardiology domain as a testbed for our techniques.

Read the paper:

Domain Ontology Construction from Biomedical Text

by Saurav Sahay, Baoli Li, Ernie Garcia, Eugene Agichtein, Ashwin Ram

International Conference on Artificial Intelligence (ICAI-07), Las Vegas, NV, June 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-10.pdf

Emotionally Driven Natural Language Generation for Personality Rich Characters in Interactive Games

Natural Language Generation for personality rich characters represents one of the important directions for believable agents research. The typical approach to interactive NLG is to hand-author textual responses to different situations. In this paper we address NLG for interactive games. Specifically, we present a novel template-based system that provides two distinct advantages over existing systems. First, our system not only works for dialogue, but enables a character’s personality and emotional state to influence the feel of the utterance. Second, our templates are reusable across characters, thus decreasing the burden on the game author. We briefly describe our system and present results of a preliminary evaluation study.

Read the paper:

Emotionally Driven Natural Language Generation for Personality Rich Characters in Interactive Games

by Christina Strong, Kinshuk Mishra, Manish Mehta, Alistair Jones, Ashwin Ram

Third Conference on Artificial Intelligence for Interactive Digital Entertainment (AIIDE-07), Stanford, CA, June 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-09.pdf

Evaluating Player Modeling for a Drama Manager Based Interactive Fiction

A growing research community is working towards employing drama management components in story-based games that guide the story towards specific narrative arcs depending on a particular player’s playing patterns. Intuitively, player modeling should be a key component for Drama Manager (DM) based approaches to succeed with human players.

In this paper, we report a particular implementation of the DM component connected to an interactive story game, Anchorhead, while specifically focusing on the player modeling component. We analyze results from our evaluation study and show that similarity in the trace of DM decisions in previous games can be used to predict interestingness of game events for the current player. Results from our current analysis indicate that the average time spent in performing player actions provides a strong distinction between players with varying degrees of gaming experience, thereby helping the DM to adapt its strategy based on this information.

Read the paper:

Evaluating Player Modeling for a Drama Manager Based Interactive Fiction

by Manu Sharma, Manish Mehta, Santi Ontañón, Ashwin Ram

Third Conference on Artificial Intelligence for Interactive Digital Entertainment (AIIDE-07), Workshop on Player Satisfaction, Stanford, CA, June 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-08.pdf

Towards Player Preference Modeling for Drama Management in Interactive Stories

There is a growing interest in producing story-based game experiences that do not follow fixed scripts pre-defined by the author, but change the experience based on actions performed by the player during the interaction. In order to achieve this objective, previous approaches have employed a drama management component that produces a narratively pleasing arc based on an author-specified aesthetic value of a story, ignoring a player’s personal preference for that story path. Furthermore, previous approaches have used a simulated player model to assess their approach, ignoring real human players interacting with the story-based game.

This paper presents an approach that uses a case-based player preference modeling component that predicts an interestingness value for a particular plot point within the story. These interestingness values are based on real human players’ interactions with the story. We also present a drama manager that uses a search process (based on the expectimax algorithm) and combines the author-specified aesthetic values with the player model.
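To make the search idea concrete, here is a minimal expectimax sketch in which leaf stories are scored by blending an author aesthetic value with a predicted player interestingness value. The tree encoding, the linear blend, the `weight` parameter, and all numbers are assumptions for illustration; the paper's actual evaluation function may differ.

```python
def blended_value(author_value, player_interest, weight=0.5):
    """Linear blend of author aesthetics and predicted player interest."""
    return weight * author_value + (1 - weight) * player_interest


def expectimax(node, weight=0.5):
    """Evaluate a tree of 'max' (drama manager), 'chance' (player), and leaf nodes."""
    kind = node["kind"]
    if kind == "leaf":
        return blended_value(node["author"], node["interest"], weight)
    if kind == "max":
        # The drama manager picks the action with the best expected value.
        return max(expectimax(child, weight) for child in node["children"])
    if kind == "chance":
        # Player moves are modeled as a probability distribution.
        return sum(p * expectimax(child, weight) for p, child in node["children"])
    raise ValueError(f"unknown node kind: {kind!r}")


def best_action(root, weight=0.5):
    """Index of the drama manager action with the highest expectimax value."""
    values = [expectimax(child, weight) for child in root["children"]]
    return max(range(len(values)), key=values.__getitem__)


def leaf(author, interest):
    return {"kind": "leaf", "author": author, "interest": interest}


# Hypothetical two-action decision with uncertain player responses.
tree = {"kind": "max", "children": [
    {"kind": "chance", "children": [(0.5, leaf(0.8, 0.4)), (0.5, leaf(0.2, 0.6))]},
    {"kind": "chance", "children": [(0.25, leaf(1.0, 1.0)), (0.75, leaf(0.4, 0.2))]},
]}

print(best_action(tree, weight=0.5))  # blend of author and player values
print(best_action(tree, weight=1.0))  # author values only
```

In this toy tree the preferred action changes once the player model is ignored (`weight=1.0`), which is the kind of effect that combining the two sources of value is meant to capture.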

Read the paper:

Towards Player Preference Modeling for Drama Management in Interactive Stories

by Manu Sharma, Santi Ontañón, Christina Strong, Manish Mehta, Ashwin Ram

20th International FLAIRS Conference on Artificial Intelligence (FLAIRS-07), Key West, FL, May 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-03.pdf

Detecting Medical Rule Sentences with Semi-Automatically Derived Patterns: A Pilot Study

We propose a semi-supervised method to extract rule sentences from medical abstracts. Medical rules are sentences that describe interesting and non-trivial relationships between medical entities. Mining such medical rules is important since the extracted rules can be used as inputs to an expert system or in many other ways. The technique we suggest is based on paraphrasing a set of seed sentences and populating a pattern dictionary of paraphrases of rules. We match the patterns against a new abstract and rank the sentences.
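The final matching and ranking step can be sketched as follows. This is only an illustration: the three regular-expression patterns are hypothetical stand-ins for a pattern dictionary, the scoring (number of patterns matched) is an assumption, and the paraphrasing step that would populate the dictionary is not shown.

```python
import re

# Hypothetical rule patterns; a real dictionary would be derived from
# paraphrases of seed sentences.
PATTERNS = [
    r"\bis associated with\b",
    r"\bincreases? the risk of\b",
    r"\bshould be (?:treated|considered|avoided)\b",
]


def rank_sentences(abstract, patterns=PATTERNS):
    """Score each sentence by the number of patterns it matches, best first."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
    scored = [(sum(1 for p in compiled if p.search(s)), s) for s in sentences]
    # sorted() is stable, so equally scored sentences keep document order.
    return sorted(scored, key=lambda pair: -pair[0])


abstract = (
    "We studied 120 patients. "
    "Smoking is associated with heart disease and increases the risk of stroke. "
    "Beta blockers should be considered after infarction."
)
ranked = rank_sentences(abstract)
for score, sentence in ranked:
    print(score, sentence)
```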

Read the paper:

Detecting Medical Rule Sentences with Semi-Automatically Derived Patterns: A Pilot Study

by Shreekanth Karvaje, Bharat Ravisekar, Baoli Li, Ernie Garcia, Ashwin Ram

International Symposium on Bioinformatics Research and Applications (ISBRA-07), Atlanta, GA, May 2007
www.cc.gatech.edu/faculty/ashwin/papers/er-07-07.pdf

Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships

Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations.

The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering and the self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and the self-organizing map. Whereas BEA-PARTITION and hierarchical clustering produced clusters of similar quality, BEA-PARTITION provides clearer cluster boundaries than hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
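For readers unfamiliar with the cluster quality measures mentioned above, the sketch below computes purity and entropy using common textbook definitions; the paper's exact formulation may differ. The function name and the toy cluster assignments are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def cluster_purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) of a clustering against known group labels.

    Purity is the fraction of items that belong to the majority label of
    their cluster; entropy is the size-weighted average label entropy (in
    bits) within each cluster. Higher purity and lower entropy are better.
    """
    clusters = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        clusters[cluster].append(label)
    n = len(true_labels)
    purity = 0.0
    entropy = 0.0
    for members in clusters.values():
        counts = Counter(members)
        size = len(members)
        purity += max(counts.values()) / n
        within = -sum((k / size) * log2(k / size) for k in counts.values())
        entropy += (size / n) * within
    return purity, entropy


# Toy example: eight genes, two clusters, one gene placed in the wrong cluster.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
true_labels = ["A", "A", "A", "B", "B", "B", "B", "B"]
purity, entropy = cluster_purity_and_entropy(cluster_ids, true_labels)
print(purity, round(entropy, 4))
```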

Read the paper:

Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships

by Ying Liu, Sham Navathe, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian Ciliax, Ray Dingledine

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4):380-384, Oct-Dec 2005
www.cc.gatech.edu/faculty/ashwin/papers/er-05-01.pdf

Preventing Failures by Mining Maintenance Logs with Case-Based Reasoning

The project integrates work in natural language processing, machine learning, and the semantic web, bringing together these diverse disciplines in a novel way to address a real problem. The objective is to extract and categorize machine components and subsystems and their associated failures using a novel approach that combines text analysis, unsupervised text clustering, and domain models. Through industrial partnerships, this project will demonstrate the effectiveness of the proposed approach with actual industry data.

Read the paper:

Preventing Failures by Mining Maintenance Logs with Case-Based Reasoning

by Mark Devaney, Ashwin Ram, Hai Qui, Jay Lee

59th Meeting of the Society for Machinery Failure Prevention Technology (MFPT-59), Virginia Beach, VA, April 2005
www.cc.gatech.edu/faculty/ashwin/papers/er-05-04.pdf

Interactive Case-Based Reasoning for Precise Information Retrieval

The knowledge explosion has continued to outpace technological innovation in search engines and knowledge management systems. It is increasingly difficult to find relevant information, not just on the World Wide Web at large but even in domain-specific medium-sized knowledge bases: online helpdesks, maintenance records, technical repositories, travel databases, e-commerce sites, and many others. Despite advances in search and database technology, the average user still spends inordinate amounts of time looking for specific information needed for a given task.

This paper describes an adaptive system for the precise, rapid retrieval and synthesis of information from medium-sized knowledge bases in response to problem-solving queries from a diverse user population. We advocate a shift in perspective from “search” to “answers.” Instead of returning dozens or hundreds of hits to a user, the system should attempt to find answers that may or may not match the query directly but are relevant to the user’s problem or task.

This problem has been largely overlooked as research has tended to concentrate on techniques for broad searches of large databases over the Internet (as exemplified by Google) and structured queries of well-defined databases (as exemplified by SQL). However, the problem discussed in this chapter is sufficiently different from these extremes both to present a novel set of challenges and to provide a unique opportunity to apply techniques not traditionally found in the information retrieval literature. Specifically, we discuss an innovative combination of techniques, case-based reasoning coupled with text analytics, to solve the problem in a practical, real-world context.

We are interested in applications in which users must quickly retrieve answers to specific questions or problems from a complex information database with a minimum of effort and interaction. Examples include internal helpdesk support, web-based self-help for consumer products, decision-aiding systems for support personnel, and repositories for specialized documents such as patents, technical documents, or scientific literature. These applications are characterized by the fact that a diverse user population accesses highly focused knowledge bases in order to find precise answers to specific questions or problems. Despite the growing popularity of on-line service and support facilities for internal use by employees and for external use by customers, most such sites rely on traditional search engine technologies and are not very effective in reducing the time, expertise, and complexity required on the user’s part.

Read the paper:

Interactive Case-Based Reasoning for Precise Information Retrieval

by Ashwin Ram, Mark Devaney

In Case-Based Reasoning in Knowledge Discovery and Data Mining, David Aha and Sankar Pal (editors).
www.cc.gatech.edu/faculty/ashwin/papers/er-05-02.pdf