"DrugDDI: an annotated corpus for drug-drug interactions" submitted for publication.
The DrugDDI corpus is part of a larger study about automatic Drug-Drug
Interaction Extraction. The corpus provides data for the development and
automatic evaluation of systems that annotate and extract drug-drug interactions.
It can be downloaded from http://basesdatos.uc3m.es/DrugDDI/DrugDDI.html.
The unstructured textual information on drugs and their interactions has been
obtained from the DrugBank database (Wishart et al., 2007)
(available at http://www.drugbank.ca/). This database is a rich resource that
combines chemical and pharmaceutical information on approximately 4,900
pharmacological substances. The corpus contains a total of 579 documents
extracted from the unstructured Interactions field in DrugBank. Note that the
Drug Interactions and Food Interactions fields, by contrast, are structured.
A total of 3,160 pairs of drugs were manually annotated at the sentence
level as expressing a drug-drug interaction with the assistance of an expert
pharmacist. In total, the corpus contains 5,806 sentences and 30,779 candidate
drug pairs.
In addition to the manual annotation of DDIs, the corpus has been automatically
processed for linguistic annotation using the UMLS MetaMap Transfer Tool (MMTx)
(Aronson, 2001a). MMTx performs sentence splitting, tokenization, POS-tagging,
shallow syntactic parsing, and linking of phrases with UMLS concepts.
=========== BRIEF FORMAT DESCRIPTION ==========================================
The corpus is formatted in XML and uses stand-off annotation for the different
layers of annotation. Each file is named after the drug from which the text was
extracted, followed by the "_ddi.xml" suffix. A local DTD is provided as the
file DTD.dtd.
The complete text of each sentence is always available from the TEXT attribute
of the SENTENCE element. The DDIs element is the root of all the annotated DDIs
in the sentence. Each DDI element identifies the phrases containing the drugs
that take part in the interaction, using their phrase IDs and text. Interactions
are assumed to be symmetrical, and so far, certainty and severity have not been
annotated.
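The layout described above can be read with a standard XML parser. The following
is a minimal Python sketch, not part of the corpus distribution. Only the
SENTENCE element, its TEXT attribute, and the DDIs and DDI elements are taken
from the description; the DOCUMENT wrapper and the DDI attribute names in the
sample are hypothetical, so the code returns whatever attributes each DDI
element carries instead of assuming their names. Consult DTD.dtd for the actual
schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Only SENTENCE, TEXT, DDIs and DDI come from the format
# description; the DOCUMENT wrapper and the DDI attributes are invented here.
SAMPLE = """<DOCUMENT>
  <SENTENCE TEXT="Aspirin may increase the effect of warfarin.">
    <DDIs>
      <DDI A="p1" B="p2"/>
    </DDIs>
  </SENTENCE>
  <SENTENCE TEXT="Take with food."/>
</DOCUMENT>"""


def sentences_with_ddis(root):
    """Yield (sentence text, list of DDI attribute dicts) per SENTENCE."""
    for sentence in root.iter("SENTENCE"):
        # Tag matching is exact, so the DDIs container is not picked up here.
        ddis = [dict(ddi.attrib) for ddi in sentence.iter("DDI")]
        yield sentence.get("TEXT"), ddis


root = ET.fromstring(SAMPLE)
for text, ddis in sentences_with_ddis(root):
    print(len(ddis), text)
```

For a real corpus file, ET.parse(path).getroot() can be passed to the same
function in place of the parsed sample.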
Documents in the folder GOLD_DDI form the complete annotated set (579 documents).
TestFiles.lst defines the test documents that we are using in our current
experiments; using the same split is recommended to allow direct comparison
of results.
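One possible way to reproduce the split is sketched below in Python. It assumes
that TestFiles.lst contains one document file name per line, which is not
stated in this README, so check the file before relying on it. The function
name split_corpus is hypothetical.

```python
from pathlib import Path


def split_corpus(gold_dir, test_list):
    """Return (train_files, test_files) as sorted lists of file names."""
    # Assumption: TestFiles.lst holds one document file name per line.
    test_names = {
        Path(line.strip()).name
        for line in Path(test_list).read_text().splitlines()
        if line.strip()
    }
    # Corpus documents carry the "_ddi.xml" suffix described above.
    all_names = sorted(p.name for p in Path(gold_dir).glob("*_ddi.xml"))
    test = [name for name in all_names if name in test_names]
    train = [name for name in all_names if name not in test_names]
    return train, test
```

A call such as split_corpus("GOLD_DDI", "TestFiles.lst") would then return the
documents outside the test list as training data and the listed ones as test
data.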
====================== CONTACT =================================================
====================== PUBLICATION =============================================
Please acknowledge the following work if you use this corpus:
I. Segura-Bedmar, P. Martinez and C. de Pablo-Sanchez.
Extracting drug-drug interactions from biomedical texts.
Workshop on Advances in Bio Text Mining, May 10-11, 2010, Ghent, Belgium.
To appear in BMC Bioinformatics, 2010.