RapTAT is a Java-based tool designed to identify and optimize machine-learning methods for accelerating and/or automating free-text annotation. In the initial version of the tool, the user will train the machine-learning system by loading it with document phrases annotated with particular concepts. Once the system has processed the phrases, the tool saves a "solution" file, which represents the most probable mapping of words in a phrase to a concept. The tool will be supplied as a series of plug-in modules for use within the GATE framework.
Once created and saved, the solution file can be applied to two potential tasks. First, a user can apply the solution to map phrases from novel documents to particular concepts. For example, a user might select phrases of interest within the free text. The tool can then use the solution file to label those phrases with concepts included in the training set.
A second potential use is to evaluate training efficacy. RapTAT will evaluate how accurately a trained solution maps phrases to concepts within a novel set of documents. To do this, the user provides annotated phrases from a novel document set along with the mapped concepts. The system then uses the solution file to carry out its own mapping of the phrases to potential concepts. Finally, the tool provides the user with a summary comparing the concept mappings provided by the user to those determined by the machine-learning system. It can also statistically evaluate the accuracy of the mappings via automated cross-validation and bootstrap analysis.
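The comparison step described above can be illustrated with a minimal Java sketch. It is not RapTAT's actual code: the class name MappingEvaluation, its method names, and the idea of representing each mapping as a plain concept string are assumptions made only for illustration. The sketch computes the fraction of phrases on which the user-provided and system-generated concepts agree, plus a tally of each (user concept, system concept) pair. Cross-validation and bootstrap analysis are not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of RapTAT's API.
class MappingEvaluation {

    // Fraction of phrases whose system-assigned concept matches the
    // user-provided (reference) concept. Both lists are in phrase order.
    static double accuracy(List<String> reference, List<String> predicted) {
        if (reference.size() != predicted.size()) {
            throw new IllegalArgumentException("Lists must have equal length");
        }
        if (reference.isEmpty()) {
            return 0.0; // nothing to compare
        }
        int matches = 0;
        for (int i = 0; i < reference.size(); i++) {
            if (reference.get(i).equals(predicted.get(i))) {
                matches++;
            }
        }
        return matches / (double) reference.size();
    }

    // Counts of each "reference -> predicted" pair, sorted by key, giving a
    // simple summary of where the two mappings agree and disagree.
    static Map<String, Integer> pairCounts(List<String> reference, List<String> predicted) {
        if (reference.size() != predicted.size()) {
            throw new IllegalArgumentException("Lists must have equal length");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < reference.size(); i++) {
            String key = reference.get(i) + " -> " + predicted.get(i);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Given four phrases with user-provided concepts and the corresponding system output, accuracy returns the proportion of agreement and pairCounts shows which concepts were confused with which. The concept labels used with it are whatever the training set defines.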
Training and evaluation of training efficacy are the intended primary uses of the tool. The optimal machine-learning method and settings may vary depending on the nature of the documents to be annotated. Because of this, RapTAT provides a way to determine the optimal machine-learning method and settings for training and testing. The current version provides two machine-learning algorithms. The first is based on a naive Bayes network, and the second is based on a non-naive, sequential Bayes network.
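To make the naive Bayes idea concrete, the following Java sketch shows one generic way a phrase could be mapped to its most probable concept from word counts. It is a textbook multinomial naive Bayes classifier with add-one smoothing, not RapTAT's implementation or its solution-file format. The class name NaiveBayesPhraseMapper, its methods, and the smoothing choice are all assumptions introduced for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of naive Bayes phrase-to-concept mapping;
// not RapTAT's actual classes or solution format.
class NaiveBayesPhraseMapper {
    private final Map<String, Integer> conceptCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int phraseCount = 0;

    // Record one annotated phrase (already split into words) and its concept.
    void train(List<String> phraseWords, String concept) {
        conceptCounts.merge(concept, 1, Integer::sum);
        phraseCount++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(concept, k -> new HashMap<>());
        for (String w : phraseWords) {
            String word = w.toLowerCase();
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(concept, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Return the concept with the highest log posterior for the phrase,
    // or null if no training data has been seen.
    String map(List<String> phraseWords) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String concept : conceptCounts.keySet()) {
            // Log prior: share of training phrases labeled with this concept.
            double score = Math.log(conceptCounts.get(concept) / (double) phraseCount);
            Map<String, Integer> counts = wordCounts.get(concept);
            int total = totalWords.getOrDefault(concept, 0);
            // Add-one smoothing; the extra +1 reserves mass for unseen words.
            double denominator = total + vocabulary.size() + 1;
            for (String w : phraseWords) {
                int c = counts.getOrDefault(w.toLowerCase(), 0);
                score += Math.log((c + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = concept;
            }
        }
        return best;
    }
}
```

In this sketch each word contributes independently to the score, which is the "naive" assumption. A sequential model, such as the second algorithm mentioned above, would instead take word order into account, and is not shown here.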
Glenn Gobbel, Michael Matheny, Ruth Reeves, Shrimalini Jayaramaraja
Tennessee Valley Health System VA; Vanderbilt University