Word Sense Disambiguation (WSD) Test Collection

Word sense ambiguity is a pervasive characteristic of natural language. For example, the word "cold" has several senses and may refer to a disease, a temperature sensation, or an environmental condition. The specific sense intended is determined by the textual context in which an instance of the ambiguous word appears. In "I am taking aspirin for my cold" the disease sense is intended; in "Let's go inside, I'm cold" the temperature sensation sense is meant; and "It's cold today, only 2 degrees" implies the environmental condition sense.

It is convenient to refer to an ambiguous word along with all of its individual senses as an ambiguity case. Further, we call each textual occurrence of the ambiguity an instance. In the UMLS Metathesaurus, a large number of ambiguity cases are represented by separate concepts, each of which refers to one of the individual senses.
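The terminology above can be pictured as a small data model. The following is a minimal sketch only: the class names, field names, and the concept identifiers are hypothetical placeholders chosen for illustration, not the collection's actual file format or real Metathesaurus identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    """One individual sense, represented by a single concept."""
    concept_id: str  # hypothetical placeholder, not a real UMLS identifier
    label: str


@dataclass
class Instance:
    """One textual occurrence of the ambiguous word."""
    text: str
    sense: Optional[Sense] = None  # None when no concept fits the occurrence


@dataclass
class AmbiguityCase:
    """An ambiguous word together with all of its individual senses."""
    word: str
    senses: List[Sense]
    instances: List[Instance] = field(default_factory=list)


# The "cold" example from the text, with made-up concept identifiers.
disease = Sense("C-DISEASE", "disease")
sensation = Sense("C-SENSATION", "temperature sensation")
environment = Sense("C-ENVIRONMENT", "environmental condition")

cold = AmbiguityCase("cold", [disease, sensation, environment])
cold.instances.append(Instance("I am taking aspirin for my cold", disease))
cold.instances.append(Instance("Let's go inside, I'm cold", sensation))
cold.instances.append(Instance("It's cold today, only 2 degrees", environment))
```

Keeping the sense optional on each instance leaves room for occurrences that none of the listed concepts describes.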

In order to support research investigating the automatic resolution of word sense ambiguity using natural language processing techniques, we have constructed this test collection of medical text in which the ambiguities were resolved by hand. Evaluators were asked to examine instances of an ambiguous word and determine the sense intended by selecting the Metathesaurus concept (if any) that best represents the meaning of that sense.

The test collection consists of 50 highly frequent ambiguous UMLS concepts from 1998 MEDLINE. Each of the 50 ambiguity cases has 100 instances randomly selected from the 1998 MEDLINE citations, for a total of 5,000 instances. There were 11 evaluators in total: 8 completed 100% of the 5,000 instances, 1 completed 56%, 1 completed 44%, and the final evaluator completed 12%. An evaluator's judgments for a given ambiguity were used only if that evaluator completed all 100 of its instances.
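The completeness rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' tooling: the tuple layout `(evaluator, case, instance_id, sense)`, the function name, and the evaluator names are assumptions made for the example.

```python
from collections import defaultdict

INSTANCES_PER_CASE = 100  # from the text: 100 instances per ambiguity case


def usable_evaluations(judgments):
    """Keep judgments only where the evaluator covered the whole case.

    `judgments` is an iterable of (evaluator, case, instance_id, sense)
    tuples; this layout is assumed for illustration.
    """
    judgments = list(judgments)  # allow a second pass over the data

    # Collect the distinct instances each evaluator judged per case.
    judged = defaultdict(set)
    for evaluator, case, instance_id, _sense in judgments:
        judged[(evaluator, case)].add(instance_id)

    # An (evaluator, case) pair counts only if all 100 instances are present.
    complete = {
        key for key, ids in judged.items() if len(ids) == INSTANCES_PER_CASE
    }
    return [j for j in judgments if (j[0], j[1]) in complete]


# Hypothetical evaluators: one finished the case, the other judged 56 instances.
full = [("eval_a", "cold", i, "disease") for i in range(100)]
partial = [("eval_b", "cold", i, "disease") for i in range(56)]

kept = usable_evaluations(full + partial)
print(len(kept))  # 100
```

Counting distinct instance identifiers rather than rows means a duplicated judgment cannot make an incomplete case look complete.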

Please Note: The 5,000 MEDLINE citations included at this site are for exclusive use with the Test Collection and cannot be redistributed. In addition, the citations were retrieved in late 1999 and represent a static view of MEDLINE at that time.


Jim Mork
Alan Aronson

Associated Institutions

U. S. National Library of Medicine

Resource URL
Application Domains
  • Clinical
  • Clinical records
  • Domain independent
  • Literature
Dataset Subtype
  • Human annotated
  • Structured data
Data Model Subtype
  • Flat file
Knowledge Base Subtype
  • Controlled vocabulary
Intended User Types
  • Informatics researcher
  • NLP researcher or developer
  • Software developer

Weeber M, Mork JG, Aronson AR. Developing a Test Collection for Biomedical Word Sense Disambiguation. AMIA Annu Symp Proc. 2001:746-50.

Available Documentation
  • Web page/HTML documentation
Licensing Notes
Users are responsible for compliance with the UMLS Metathesaurus License Agreement.
Date of Latest Version