Keyword Clustering in Biomedical Information Retrieval Using Evolutionary Algorithms


V. Dorfer, S. M. Winkler, T. Kern, S. Blank, G. Petz, P. Faschang - Keyword Clustering in Biomedical Information Retrieval Using Evolutionary Algorithms - Proceedings of the 19th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 10th European Conference on Computational Biology (ECCB), Vienna, Austria, 2011


As the amount of available data in the life sciences grows exponentially, intelligent search strategies are needed to support information retrieval. We describe a new keyword clustering method: given a set of documents D, keyword clusters are optimized so that each cluster groups keywords that frequently occur in combination in D. In the near future, the resulting keyword clusters are intended to serve as a solid basis for a new PubMed search tool based on query extension, which will also use user feedback to optimize the search process. We have defined several important characteristics for clustering candidates, including the data set coverage, the cluster confidence (the ratio of clustered keywords that are found in the same documents), and the document confidence (the amount of keywords shared by the documents assigned to a cluster through their keywords). Evolutionary algorithms were applied to solve this optimization task, among them evolution strategies (ES) and a multi-objective genetic algorithm (NSGA-II, used because the optimization objectives are partially contradictory). To test this approach we used data published for the TREC-9 conference, containing 36,890 entries, from which we extracted the most significant keywords for clustering using tf-idf weighting. In first optimization results, the best solution obtained with a 10+1 ES provides 23.5% data set coverage, 45.2% cluster confidence, and 23.4% document confidence; with NSGA-II we obtained, for example, solutions with respective values of 71%, 56%, and 37%.
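
To make the three quality characteristics more concrete, the following Python sketch shows one possible way to compute them. The abstract does not give exact formulas, so everything here is an assumption made for illustration: a document is taken to be assigned to a cluster if it shares at least one keyword with it, data set coverage is the fraction of documents assigned to any cluster, cluster confidence is the average share of a cluster's keywords found in each assigned document, and document confidence is the average pairwise Jaccard overlap between the keyword sets of the assigned documents. The function names are hypothetical and are not taken from the authors' implementation.

```python
from itertools import combinations


def assigned_docs(cluster, docs):
    """Indices of documents sharing at least one keyword with the cluster
    (assumed assignment rule; the abstract does not specify one)."""
    return [i for i, keywords in enumerate(docs) if keywords & cluster]


def coverage(clusters, docs):
    """Assumed definition: fraction of documents assigned to any cluster."""
    if not docs:
        return 0.0
    covered = set()
    for cluster in clusters:
        covered.update(assigned_docs(cluster, docs))
    return len(covered) / len(docs)


def cluster_confidence(cluster, docs):
    """Assumed definition: mean share of the cluster's keywords that
    appear together in each assigned document."""
    hits = assigned_docs(cluster, docs)
    if not hits:
        return 0.0
    return sum(len(docs[i] & cluster) / len(cluster) for i in hits) / len(hits)


def document_confidence(cluster, docs):
    """Assumed definition: mean pairwise Jaccard overlap of the keyword
    sets of the documents assigned to the cluster."""
    hits = assigned_docs(cluster, docs)
    if len(hits) < 2:
        return 0.0
    pairs = list(combinations(hits, 2))
    overlap = sum(len(docs[a] & docs[b]) / len(docs[a] | docs[b]) for a, b in pairs)
    return overlap / len(pairs)


# Toy data: each document is represented by its set of extracted keywords.
docs = [
    {"gene", "protein"},
    {"gene", "protein", "cell"},
    {"diet"},
    {"cell"},
]
clusters = [{"gene", "protein"}]

print(coverage(clusters, docs))
print(cluster_confidence(clusters[0], docs))
print(document_confidence(clusters[0], docs))
```

Under these assumed definitions the objectives can pull in different directions: adding keywords to a cluster tends to raise coverage but may lower both confidence values, which is consistent with the abstract's remark that the objectives are partially contradictory.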