DrugNerAr Corpus: a corpus annotated with drug anaphoras. Texts were collected from the DrugBank database.
No existing corpus was dedicated to resolving the anaphoric expressions that occur in drug interaction descriptions in pharmacological documents, so the first challenge was to build one for research purposes.
DrugBank is an annotated database with about 4900 drug entries. Each entry contains more than 100 data fields that gather detailed chemical and pharmacological information (type, category, brand names, chemical formula, drug interactions, etc.).
A collection of 49 unstructured plain-text documents was randomly selected from the 'interactions' field of the DrugBank database. The documents contain on average 40 sentences and 716 words. They were downloaded with an automatic robot developed with the free tool openKapow.
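As an illustration of how a random sample and per-document averages like those above could be computed, here is a minimal Python sketch. It does not reproduce the openKapow robot or the original counting method: the function names `sample_documents` and `corpus_stats` are hypothetical, the sentence splitter is a naive punctuation rule, and words are simply whitespace-separated tokens, so the figures it yields may differ from the reported ones.

```python
import random
import re


def sample_documents(texts, k, seed=0):
    """Draw k documents at random, without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(list(texts), k)


def corpus_stats(documents):
    """Return (average sentences, average words) per document."""
    if not documents:
        raise ValueError("at least one document is required")
    sentence_counts = []
    word_counts = []
    for doc in documents:
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        sentence_counts.append(len(sentences))
        # Assumption: a word is any whitespace-separated token.
        word_counts.append(len(doc.split()))
    n = len(documents)
    return sum(sentence_counts) / n, sum(word_counts) / n
```

With the 'interactions' texts loaded into a list, `corpus_stats(sample_documents(texts, 49))` would give the two averages under these simplified definitions.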
Anaphora is a linguistic device for referring to entities that have come up in recent discourse (antecedents). Two kinds of anaphors are prevalent in this kind of literature:
• Pronominal anaphora. In this case an entity is referred to by a pronoun: personal (it, they), reflexive (itself, themselves), relative (which, that) or distributive (both, each, either, neither). First- and second-person forms (I, me, you, your) and the pronoun who were disregarded because they do not refer to drugs.
• Nominal (phrase) anaphora. In this case an entity is referred to by a nominal phrase. These phrases consist of a definite article (the), a possessive (its, their), a demonstrative (this, these, those) or a distributive (both, such, each, either, neither), followed by a generic term for drugs (such as antibiotic, medicine, medication, etc.) or by a drug property or effect, e.g., these anticoagulants, its pharmacological effects.
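The two categories above can be approximated with simple word lists. The following Python sketch is only an illustration of that idea, not the procedure used to build the corpus, which was annotated manually. The pronoun and determiner lists come from the description above; the head-noun list `HEAD_NOUNS`, the function `find_candidates` and the `max_modifiers` window are assumptions introduced for the example.

```python
import re

# Word lists taken from the two categories described above.
PRONOUNS = {
    "it", "they",                          # personal
    "itself", "themselves",                # reflexive
    "which", "that",                       # relative
    "both", "each", "either", "neither",   # distributive
}
DETERMINERS = {
    "the",                                          # definite article
    "its", "their",                                 # possessive
    "this", "these", "those",                       # demonstrative
    "both", "such", "each", "either", "neither",    # distributive
}
# Assumption: a tiny illustrative list of head nouns (generic drug terms,
# properties, effects). A real system would need a much larger lexicon.
HEAD_NOUNS = {"antibiotic", "medicine", "medication", "drug",
              "anticoagulant", "agent", "effect"}


def _singular(word):
    # Naive plural stripping, adequate for this sketch only.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def find_candidates(sentence, max_modifiers=2):
    """Return (kind, text) pairs for candidate anaphoric expressions."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    lowered = [t.lower() for t in tokens]
    candidates = []
    i = 0
    while i < len(tokens):
        matched = False
        if lowered[i] in DETERMINERS:
            # Look for a head noun after the determiner, allowing a few
            # intervening modifiers (e.g. "its pharmacological effects").
            for j in range(i + 1, min(i + 2 + max_modifiers, len(tokens))):
                if _singular(lowered[j]) in HEAD_NOUNS:
                    candidates.append(("nominal", " ".join(tokens[i:j + 1])))
                    i = j + 1
                    matched = True
                    break
        if not matched:
            if lowered[i] in PRONOUNS:
                candidates.append(("pronominal", tokens[i]))
            i += 1
    return candidates
```

For example, `find_candidates("These anticoagulants may enhance its pharmacological effects, which increases the risk.")` yields two nominal candidates and one pronominal candidate. A lexical filter like this overgenerates (for instance, it flags every "that", including non-anaphoric uses), so its output would still need manual review.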
The corpus was annotated manually by a linguist, with the assistance of a pharmaceutical expert, working from the output of MMTx and DrugNer. The corpus contains a total of 331 anaphoric expressions. A more detailed description of the corpus can be found at http://www.biomedcentral.com/1471-2105/11/S2/S1.