Text Mining Day 2009
Ben-Gurion University

Abstracts and Bios
Enhancing Digital Libraries Using Missing Content Analysis
David Carmel, IBM Israel

This work shows how the content of a digital library can be enhanced to better satisfy its users' needs. Missing content is identified by finding missing content topics in the system's query log or in a pre-defined taxonomy of required knowledge. The collection is then enhanced with new relevant knowledge, extracted from external sources, that covers those missing topics. Our experiments measure the precision of the system before and after content enhancement. The results demonstrate a significant improvement in system effectiveness as a result of content enhancement, and the superiority of the missing-content enhancement policy over several alternative policies. Joint work with Elad Yom-Tov and Haggai Roitman.
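
As a rough illustration of the query-log side of this idea, the sketch below flags logged queries that no document in a collection covers. The function names, the "every query term appears in the document" criterion, and the sample data are all assumptions made for this example; they are not the method or the measures used in the work described above.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real text analyzer."""
    return text.lower().split()


def missing_content_queries(documents, query_log, min_matches=1):
    """Return (query, frequency) pairs for queries covered by too few documents.

    A document "covers" a query here if it contains every query term.
    This criterion is an illustrative assumption only.
    """
    doc_terms = [set(tokenize(d)) for d in documents]
    missing = []
    # Visit the most frequent queries first, so popular gaps come first.
    for query, freq in Counter(query_log).most_common():
        terms = set(tokenize(query))
        matches = sum(1 for d in doc_terms if terms <= d)
        if matches < min_matches:
            missing.append((query, freq))
    return missing


# Hypothetical collection and query log.
documents = [
    "text mining methods for digital libraries",
    "query performance prediction in information retrieval",
]
query_log = [
    "text mining",
    "hebrew morphology",
    "hebrew morphology",
    "query prediction",
]
print(missing_content_queries(documents, query_log))
```

The queries returned this way would be candidates for topics on which external content could be sought.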

David Carmel is a research staff member at the IBM Research Lab in Haifa, Israel. David earned his PhD in Computer Science from the Technion, Israel Institute of Technology in 1997. David's research interests include multi-agent systems and information retrieval. In the latter field, David has been working on issues such as performance and scalability, IR over XML data, evaluation methodology, social search, and query performance prediction. At IBM, David is a co-founder of the Juru search engine, which provides integrated search capabilities to several IBM products. David has published numerous papers in Information Retrieval and Web conferences, serves on the program committees of several IR conferences (SIGIR, CIKM, WWW, WSDM, ECIR), and has organized several SIGIR workshops, among them workshops on predicting query difficulty and on XML and IR.

It's time for a semantic engine
Ido Dagan, Bar-Ilan University

Abstract: A common computational goal is to encapsulate the modeling of a target phenomenon within a unified and comprehensive “engine”, which addresses a broad range of the required processing tasks. This goal is followed in common modeling of the morphological and syntactic levels of natural language, where most processing tasks are encapsulated within morphological analyzers and syntactic parsers. In this talk I suggest that computational modeling of the semantic level should also focus on encapsulating the various processing tasks within a unified module (engine). The input/output specification of such an engine (its API) can be based on the textual entailment paradigm, which will be described briefly and proposed as an attractive framework for applied semantic inference. The talk will illustrate an initial proposal for the engine’s API, designed to be embedded within prominent language processing applications. Finally, I will sketch the entailment formalism and the efficient inference algorithm developed at Bar-Ilan University, which illustrate a principled transformational (rather than interpretational) approach to developing a comprehensive semantic engine.
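
To make the input/output shape of the textual entailment paradigm concrete, here is a minimal sketch of what a call might look like: a text and a hypothesis go in, a decision and a score come out. The names `EntailmentResult` and `recognize_entailment`, the threshold, and the word-overlap scoring are assumptions for illustration; this is not the API proposed in the talk, nor the Bar-Ilan inference algorithm.

```python
from dataclasses import dataclass


@dataclass
class EntailmentResult:
    entails: bool      # the decision: does the text entail the hypothesis?
    confidence: float  # here, the fraction of hypothesis words covered


def recognize_entailment(text, hypothesis, threshold=0.75):
    """Toy lexical-overlap baseline (an assumption, not a semantic engine).

    The hypothesis is judged entailed when at least `threshold` of its
    words also occur in the text.
    """
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        # An empty hypothesis is trivially entailed.
        return EntailmentResult(True, 1.0)
    coverage = sum(w in text_words for w in hyp_words) / len(hyp_words)
    return EntailmentResult(coverage >= threshold, coverage)


text = "ibm acquired the search company in 2008"
print(recognize_entailment(text, "ibm acquired the company"))
print(recognize_entailment(text, "google bought ibm"))
```

Word overlap ignores word order and meaning, so it would wrongly accept a hypothesis that merely reuses the text's vocabulary; the point of the sketch is only the text/hypothesis interface.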

Ido Dagan is a Senior Lecturer at the Department of Computer Science at Bar-Ilan University, Israel. His research area is Natural Language Processing (NLP). He has been elected president of the Association for Computational Linguistics (ACL) for 2010 and currently serves as its Vice-President. He received the Wolf Foundation Krill Prize in 2008, and the IBM Faculty Award in 2007. His areas of interest are largely within empirical NLP, particularly empirical approaches for applied semantic processing and their applications. In the last few years Ido and his colleagues introduced textual entailment as a generic framework for applied semantic inference and organized the first three rounds of the PASCAL Recognizing Textual Entailment Challenges. Starting with the fourth challenge in 2008, the challenge has been adopted by the U.S. National Institute of Standards and Technology (NIST), under the Text Analysis Conference (TAC) scientific benchmark framework.

Text Mining in Hebrew - Impact of Morphology Analysis on Topic Analysis and on Search Quality
Michael Elhadad, Ben-Gurion University

We present preliminary results on text mining in Hebrew. We specifically investigate the impact of morphological analysis on more advanced text processing tasks. The Hebrew language is characterized by rich morphology and very high morphological ambiguity. We review recent results in unsupervised morphological analysis in Hebrew, including an Expectation-Maximization (EM) based algorithm for the treatment of unknown words. We then analyze the impact of morphological analysis on unsupervised topic analysis. The objective of topic analysis is to detect a collection of recurring important semantic topics in a text collection, and to index unseen documents according to these topics. A popular method for topic analysis is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng and Jordan in 2003. We measure how morphological analysis in Hebrew improves LDA analysis on a corpus of the Rambam's Mishne Torah. We then present results on how such topic analysis improves search quality on this corpus.
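
For readers unfamiliar with LDA, the following is a minimal sketch of topic inference by collapsed Gibbs sampling over already-tokenized documents. The function name `lda_gibbs`, the hyperparameter values, and the toy English tokens are assumptions for illustration; the inference procedure and settings used in the talk are not stated. In the setting described above, each token list would hold the output of the morphological analyzer (for example, lemmas) rather than raw surface forms.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word, vocab), where the first two are
    count matrices stored as nested lists.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic for this token,
                # given the assignments of all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


# Hypothetical toy corpus; real input would be analyzed Hebrew text.
docs = [
    ["law", "court", "judge", "law"],
    ["prayer", "blessing", "prayer"],
    ["court", "judge", "witness"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)
```

Because the sampler counts word types, merging inflected forms of the same lemma before sampling changes the counts it sees, which is one way morphological analysis can affect the resulting topics.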

Michael Elhadad is a Senior Lecturer of Computer Science at Ben-Gurion University of the Negev, Israel. His research interests include computational linguistics, automatic text summarization and natural language generation. His current research focuses on analysis of the Hebrew language. Elhadad received a B.Sc. and M.Sc. in Applied Mathematics from Ecole Centrale de Paris in France in 1985. He received a Ph.D. in Computer Science from Columbia University, New York, in 1993.

Graph-Based Methods for Multilingual Text and Web Mining
Mark Last, Ben-Gurion University

Most text mining and web content mining methods are based on the vector-space model of information retrieval. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrences or the location of a word within the document. It also makes no use of the mark-up information that is available from the HTML tags of web documents. A recently developed graph-based representation of text and web documents can preserve this structural information regardless of the original language of the analyzed document. The talk will start with a brief overview of several alternative graph-based document models. We will then describe the application of these models to common text mining tasks such as document categorization, document clustering, and keyword extraction.
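
The contrast with the vector-space model can be sketched in a few lines: a bag of words keeps only term counts, while a word graph also keeps which word follows which. The function `build_word_graph`, its `window` parameter, and the whitespace tokenization are assumptions for this example, and the sketch is only one simple variant, not necessarily any of the models covered in the talk.

```python
from collections import defaultdict


def build_word_graph(text, window=1):
    """Build a directed word graph from a text.

    Nodes are the unique words. An edge (u, v) exists if v follows u
    within `window` positions; the edge weight counts how often.
    """
    words = text.lower().split()
    edges = defaultdict(int)
    for i, u in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            if u != v:  # skip self-loops from immediate repetitions
                edges[(u, v)] += 1
    return set(words), dict(edges)


nodes, edges = build_word_graph("text mining and web mining")
print(sorted(nodes))
for (u, v), weight in sorted(edges.items()):
    print(f"{u} -> {v} (weight {weight})")
```

A vector-space representation of the same sentence would record only that "mining" occurs twice; the graph additionally records that it follows both "text" and "web". Since the construction relies only on token order, it does not depend on the language of the document beyond tokenization.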

Mark Last is currently an Associate Professor at the Department of Information Systems Engineering, Ben-Gurion University of the Negev, Israel and the Head of the Data Mining and Software Quality Engineering Group. Prior to that, he was a Visiting Research Scholar at the US National Institute for Applied Computational Intelligence (Summer 2001, 2002, and 2003) and a Visiting Assistant Professor at the Department of Computer Science and Engineering, University of South Florida, USA (1999 – 2001). Mark obtained his Ph.D. degree from Tel Aviv University, Israel in 2000. He has published over 130 papers and chapters in scientific journals, books, and refereed conferences. He is a co-author of two monographs and a co-editor of seven edited volumes. Prof. Last serves as an Associate Editor of IEEE Transactions on Systems, Man, and Cybernetics, where he has received the Best Associate Editor Award for 2006, and Pattern Analysis and Applications (PAA). His main research interests are focused on data mining, cross-lingual text mining, software testing, and security informatics.

NHECD -- Automatic Understanding of Scientific Contents
Oded Maimon, Tel-Aviv University

Abstract: We are developing a system for understanding a new scientific area from the corpus of research available in the field. The project is supported by a research grant under the European Commission's 7th Framework Programme (FP7); TAU is the coordinator, with participating groups from Italy, Germany and Holland. The mission of the Nano-Health and Environment Commented Database (NHECD) is to generate knowledge about potential toxicity hazards of nanoparticles for use by scientists, regulators and the public. The system accepts scientific papers and reports as input, gathered by crawlers. Each paper enters a workflow in which it is ranked and classified according to a taxonomy. The text, tables and graphs are automatically understood, and the results are stored as metadata. We develop methods for ranking, for generating taxonomies (and ontologies), for table and graph understanding, and for information extraction. The accumulated results are also used to model toxicity from the extracted attributes.

Bio: Professor Oded Maimon is a full professor in the Department of Industrial Engineering at Tel Aviv University. Before that he was at MIT in EE/CS (Laboratory for Information and Decision Systems). He is a chaired professor and heads the Oracle research fund. He is the coordinator of the NHECD project (European Commission 7th Framework Programme research, to be presented in the seminar). Professor Maimon has written over a hundred scientific papers, has participated in numerous conferences, and is the author of seven books on data mining and control theory. His current research interests are knowledge discovery from data, including text and voice analysis.

Extracting the Features from fMRI Data for Cognitive Classification:
A Recent Success Story for One-Class Training Classification

Larry Manevitz, University of Haifa

We show how to classify cognitive classes from pure physiological data using one-class training. Five years ago, reaching 60% accuracy was considered a breakthrough, since it showed that the information was in principle available. Very recently, results on the same data have reached approximately 90% accuracy. This is the result of an aggressive search for appropriate features. The tools used are a combination of compression neural networks, a wrapper approach and a genetic algorithm.
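
To illustrate what "one-class training" means, the sketch below fits a classifier using examples of the target class only, then accepts or rejects new points. It deliberately swaps in a much simpler technique, a distance threshold around the training centroid, in place of the compression neural networks, wrapper approach and genetic algorithm named above. The class name `CentroidOneClass`, the quantile rule, and the two-dimensional sample data are assumptions for this example and have nothing to do with real fMRI features.

```python
import math


class CentroidOneClass:
    """Accepts points that lie close enough to the training centroid.

    Only positive (target-class) examples are seen during training.
    """

    def fit(self, samples, quantile=0.95):
        n, dim = len(samples), len(samples[0])
        self.centroid = [sum(s[j] for s in samples) / n for j in range(dim)]
        distances = sorted(self._distance(s) for s in samples)
        # Use the training distance at the given quantile as the threshold.
        index = min(n - 1, int(quantile * n))
        self.threshold = distances[index]
        return self

    def _distance(self, sample):
        return math.dist(sample, self.centroid)

    def predict(self, sample):
        """Return True if the sample is judged to belong to the class."""
        return self._distance(sample) <= self.threshold


# Hypothetical feature vectors for the single training class.
training = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (0.9, 1.0)]
model = CentroidOneClass().fit(training)
print(model.predict((1.05, 1.0)))  # a point near the training data
print(model.predict((3.0, 3.0)))   # a point far from the training data
```

With any one-class method, the quality of the decision depends heavily on which features describe each sample, which is why feature search matters so much in the work described above.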

Larry Manevitz is on the faculty of the Department of Computer Science, University of Haifa. He is the head of the HIACS Research Center and the Neurocomputation Laboratory. An old term for his current interests could be "theoretical and applied cybernetics". This means he works in cognitive and psychological computational modeling; physiological brain modeling; applications of neurocomputation and machine learning; and the development of new methods of neurocomputation.
He has a B.Sc. from Brooklyn College, CUNY, and an M.Phil. and Ph.D. in mathematical logic (advisor Abraham Robinson) from Yale University. He has held visiting positions at Oxford University, Courant Institute (NYU), NASA (Ames), U. Texas (Austin), U. Wisc (Madison), Stanford U., U. Maryland (College Park), and CUNY. In Israel, he has held positions at Hebrew University, Bar Ilan University and the University of Haifa.