We present an application of a clustering technique to a large original dataset of SCI publications that is capable of disentangling the different research lines followed by a scientist, their duration over time, and the intensity of effort devoted to each of them. Information is obtained by means of software-assisted content analysis, based on the co-occurrence of words in the full abstract and title of a set of SCI publications authored by 650 American star physicists across 17 years. We estimated that, over the time span, scientists in our dataset contributed on average to 16 different research lines, each lasting on average 3.5 years, and published nearly 5 publications in each line of research. The technique is potentially useful for scholars studying science and the research community, for research agencies evaluating whether a scientist is new to a topic, and for librarians collecting timely biographic information.
Read the paper:
Using Content Analysis to Investigate The Research Paths Chosen by Scientists over Time
by Chiara Franzoni, Chris Simpkins, Baoli Li, Ashwin Ram
Scientometrics journal 83(1):321-335, April 2010. (Earlier version in 11th International Conference on Scientometrics and Informetrics (ISSI-07), Madrid, Spain, June 2007.)
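The abstract of this paper groups publications into research lines by the words their titles and abstracts share. The following is a minimal sketch of that general idea, not the authors' method: it swaps in plain Jaccard word overlap and single-link grouping, and the stopword list, the threshold value, and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "for", "to", "with"}


def tokens(text):
    """Lowercased word set for one title+abstract, minus stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def jaccard(a, b):
    """Share of words that two publications have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Single-link grouping: publications whose word overlap reaches the
    threshold are merged into the same group (union-find)."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    bags = [tokens(d) for d in docs]
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    "Quantum dots in semiconductor lasers",
    "Semiconductor lasers with quantum wells",
    "Dark matter halo simulations",
    "Simulations of dark matter halo mergers",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```

With publication dates attached to each document, the duration of a research line could be read off as the span between the earliest and latest publication in a group.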
Effective encoding of information is one of the keys to qualitative problem solving. Our aim is to explore Knowledge Representation techniques that capture meaningful word associations occurring in documents. We have developed iReMedI, a TCBR-based problem solving system, as a prototype to demonstrate our idea. For representation we have used a combination of NLP and graph-based techniques, which we call Shallow Syntactic Triples, Dependency Parses and Semantic Word Chains. To test their effectiveness we have developed retrieval techniques based on PageRank, Shortest Distance and Spreading Activation methods. The various algorithms discussed in the paper and the comparative analysis of their results provide us with useful insight for creating an effective problem solving and reasoning system.
Read the paper:
iReMedI – Intelligent Retrieval from Medical Information
by Saurav Sahay, Bharat Ravisekar, Anu Venkatesh, Sundaresan Venkatasubramanian, Priyanka Prabhu, Ashwin Ram
9th European Conference on Case-Based Reasoning (ECCBR-08), Trier, Germany
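Spreading activation, one of the retrieval methods named in the iReMedI abstract, can be illustrated on a small word graph. This is a generic sketch rather than the system's implementation: the graph, the decay factor, the number of steps, and the even split of energy among neighbours are all assumptions.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes through a directed graph.

    graph: dict mapping a node to a list of its neighbours.
    seeds: dict mapping a node to its initial activation.
    At each step every active frontier node passes `decay` times its
    energy to its neighbours, split evenly among them.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = energy * decay / len(neighbours)
            for n in neighbours:
                next_frontier[n] = next_frontier.get(n, 0.0) + share
        for n, e in next_frontier.items():
            activation[n] = activation.get(n, 0.0) + e
        frontier = next_frontier
    return activation


# Hypothetical word associations, for illustration only.
graph = {
    "fever": ["infection", "aspirin"],
    "infection": ["antibiotic"],
    "aspirin": [],
    "antibiotic": [],
}
scores = spread_activation(graph, {"fever": 1.0})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Nodes that end up with high activation are candidates for retrieval even when they do not appear in the query itself.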
In this paper we investigate how to automatically determine the subjectivity orientation of questions posted by real users in community question answering (CQA) portals. Subjective questions seek answers containing private states, such as personal opinion and experience. In contrast, objective questions request objective, verifiable information, often with support from reliable sources. Knowing the question orientation would be helpful not only for evaluating answers provided by users, but also for guiding the CQA engine to process questions more intelligently. Our experiments on Yahoo! Answers data show that our method exhibits promising performance.
Read the paper:
Subjectivity Analysis for Questions in QA Communities
by Baoli Li, Yandong Liu, Ashwin Ram, Ernie Garcia, Eugene Agichtein
31st Annual International ACM SIGIR Conference (ACM-SIGIR-08), Singapore, July 2008
To realize the vision of a Semantic Web for Life Sciences, discovering relations between resources is essential. It is very difficult to automatically extract relations from Web pages expressed in natural language formats. On the other hand, because of the explosive growth of information, it is difficult to extract the relations manually. In this paper we present techniques to automatically discover relations between biomedical resources from the Web. For this purpose we retrieve relevant information from Web search engines and the PubMed database, using various lexico-syntactic patterns as queries over SOAP web services. The patterns are initially handcrafted but can be progressively learnt. The extracted relations can be used to construct and augment ontologies and knowledge bases. Experiments are presented for general biomedical relation discovery and domain-specific search to show the usefulness of our technique.
Read the paper:
Discovering Semantic Biomedical Relations utilizing the Web
by Saurav Sahay, Sougata Mukherjea, Eugene Agichtein, Ernie Garcia, Sham Navathe, Ashwin Ram
ACM Transactions on Knowledge Discovery from Data, 2(1):3, 2008
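The lexico-syntactic patterns mentioned in this abstract can be pictured as templates that serve both as search queries and as extractors over the returned text. The sketch below is hypothetical throughout: the two patterns, the query templates, and the function names are invented for illustration, and no search engine or PubMed service is actually called.

```python
import re

# Hypothetical handcrafted patterns; `a` and `b` capture the two resources.
PATTERNS = {
    "inhibits": r"(?P<a>[\w-]+) inhibits (?P<b>[\w-]+)",
    "is_a": r"(?P<a>[\w-]+) is an? (?P<b>[\w-]+)",
}

# Hypothetical phrase queries that could be sent to a search service.
QUERY_TEMPLATES = ['"{term} inhibits"', '"{term} is a"']


def build_queries(term):
    """Turn one resource name into quoted phrase queries."""
    return [t.format(term=term) for t in QUERY_TEMPLATES]


def extract_relations(snippet):
    """Return (subject, relation, object) triples matched in a text snippet."""
    found = []
    for relation, pattern in PATTERNS.items():
        for m in re.finditer(pattern, snippet, flags=re.IGNORECASE):
            found.append((m.group("a"), relation, m.group("b")))
    return found


print(build_queries("aspirin"))
print(extract_relations("Aspirin inhibits COX-1. Imatinib is a kinase inhibitor."))
```

Single-token captures keep the sketch short; real resource names often span several words and would need a noun-phrase chunker or a term dictionary.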
Associative classification, which originates from numerical data mining, has recently been applied to text data. Text data is first digitized into a database of transactions, and then training and prediction are conducted on the derived numerical dataset. This intuitive strategy has demonstrated quite good performance. However, it does not fully take into consideration the inherent characteristics of text data, although it has to deal with some text-specific problems such as lemmatizing and stemming during digitization. In this paper, we propose a bottom-up strategy to adapt associative classification to text categorization, in which we take into account the structure information of text. Experiments on the Reuters-21578 dataset show that the proposed strategy can make use of text structure information and achieve better performance.
Read the paper:
Adapting Associative Classification to Text Categorization
by Baoli Li, Neha Sugandh, Ernie Garcia, Ashwin Ram
ACM Conference on Document Engineering (ACM-DocEng-07), Winnipeg, Canada, August 2007
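For readers unfamiliar with associative classification, the baseline that this abstract starts from can be sketched as mining "word implies class" rules by support and confidence, then letting matching rules vote. This is only the generic baseline with single-word rules, not the bottom-up, structure-aware strategy the paper proposes; the thresholds, the toy documents, and the function names are assumptions.

```python
from collections import Counter


def mine_rules(docs, min_support=0.4, min_confidence=0.8):
    """Mine single-word class association rules.

    docs: list of (set_of_words, label) pairs, one per document.
    Returns {word: (label, confidence)}. Assumes min_confidence > 0.5,
    so at most one label can qualify for any given word.
    """
    n = len(docs)
    word_count = Counter()
    pair_count = Counter()
    for words, label in docs:
        for w in words:
            word_count[w] += 1
            pair_count[(w, label)] += 1

    rules = {}
    for (w, label), c in pair_count.items():
        support = c / n                # fraction of all documents
        confidence = c / word_count[w]  # fraction of documents containing w
        if support >= min_support and confidence >= min_confidence:
            rules[w] = (label, confidence)
    return rules


def classify(words, rules, default=None):
    """Sum the confidence of every matching rule and pick the top label."""
    votes = Counter()
    for w in words:
        if w in rules:
            label, confidence = rules[w]
            votes[label] += confidence
    return votes.most_common(1)[0][0] if votes else default


# Toy training set, invented for illustration.
train = [
    ({"oil", "price"}, "crude"),
    ({"oil", "barrel"}, "crude"),
    ({"oil", "export"}, "crude"),
    ({"wheat", "price"}, "grain"),
    ({"wheat", "harvest"}, "grain"),
]
rules = mine_rules(train)
print(rules)
print(classify({"oil", "tanker"}, rules))
```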
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations.
The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering and the self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and the self-organizing map. Whereas BEA-PARTITION and hierarchical clustering produced clusters of similar quality, BEA-PARTITION provides clearer cluster boundaries than hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes, or to any clustering problem where starting matrices are available from experimental observations.
Read the paper:
Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships
by Ying Liu, Sham Navathe, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian Ciliax, Ray Dingledine
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4):380-384, Oct-Dec 2005
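Purity and entropy, two of the cluster-quality measures named in this abstract, can be computed from the known group label of each clustered item. The sketch below uses one common textbook definition (size-weighted entropy in bits, not normalised), which may differ in detail from the paper's; the example clusters are invented and every cluster is assumed to be non-empty.

```python
from collections import Counter
from math import log2


def purity(clusters):
    """Fraction of items that carry the majority label of their cluster.

    clusters: list of non-empty lists, each holding the true group labels
    of the items assigned to that cluster.
    """
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of the label entropy (in bits) per cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        h = -sum((k / len(c)) * log2(k / len(c)) for k in Counter(c).values())
        result += len(c) / total * h
    return result


# Invented example: one gene from group B ends up in the first cluster.
clusters = [["A", "A", "A", "B"], ["B", "B", "B", "B"]]
print(purity(clusters))   # 0.875
print(entropy(clusters))
```

A perfect clustering gives purity 1 and entropy 0, so higher purity and lower entropy both indicate closer agreement with the known groups.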
The knowledge explosion has continued to outpace technological innovation in search engines and knowledge management systems. It is increasingly difficult to find relevant information, not just on the World Wide Web at large but even in domain-specific medium-sized knowledge bases: online helpdesks, maintenance records, technical repositories, travel databases, e-commerce sites, and many others. Despite advances in search and database technology, the average user still spends inordinate amounts of time looking for specific information needed for a given task.
This paper describes an adaptive system for the precise, rapid retrieval and synthesis of information from medium-sized knowledge bases in response to problem-solving queries from a diverse user population. We advocate a shift in perspective from “search” to “answers.” Instead of returning dozens or hundreds of hits to a user, the system should attempt to find answers that may or may not match the query directly but are relevant to the user’s problem or task.
This problem has been largely overlooked, as research has tended to concentrate on techniques for broad searches of large databases over the Internet (as exemplified by Google) and structured queries of well-defined databases (as exemplified by SQL). However, the problem discussed in this chapter is sufficiently different from these extremes both to present a novel set of challenges and to provide a unique opportunity to apply techniques not traditionally found in the information retrieval literature. Specifically, we discuss an innovative combination of techniques, case-based reasoning coupled with text analytics, to solve the problem in a practical, real-world context.
We are interested in applications in which users must quickly retrieve answers to specific questions or problems from a complex information database with a minimum of effort and interaction. Examples include internal helpdesk support, web-based self-help for consumer products, decision-aiding systems for support personnel, and repositories for specialized documents such as patents, technical documents, or scientific literature. These applications are characterized by the fact that a diverse user population accesses highly focused knowledge bases in order to find precise answers to specific questions or problems. Despite the growing popularity of online service and support facilities, both for internal use by employees and for external use by customers, most such sites rely on traditional search engine technologies and are not very effective in reducing the time, expertise, and complexity required on the user’s part.
Read the paper:
Interactive Case-Based Reasoning for Precise Information Retrieval
by Ashwin Ram, Mark Devaney
In Case-Based Reasoning in Knowledge Discovery and Data Mining, David Aha and Sankar Pal (editors).