This corpus is derived from the BioCreAtIvE task 1A data set for named entity recognition of gene/protein names. We randomly selected 1000 sentences from that set and added annotations for interactions between genes/proteins. Of these sentences, 173 contain at least one interaction and 589 contain at least one gene/protein. There are 255 interactions in total, some of which include more than two partners (e.g., when one partner occurs both with its full name and as an abbreviation).
This is not one of the corpora used in the actual BioCreAtIvE community challenges; please see http://www.biocreative.org for information on those.
Humboldt-Universität zu Berlin
Conrad Plake, Jörg Hakenberg, Ulf Leser: Optimizing Syntax Patterns for Discovering Protein-Protein Interactions. In: Proc ACM Symposium on Applied Computing, Bioinformatics track, 1:195-201. Santa Fe, USA, March 2005.