BioCreative-PPI corpus

This corpus is derived from the BioCreAtIvE task 1A data set for named entity recognition of gene/protein names. We randomly selected 1000 sentences from this set and annotated them for interactions between genes/proteins. Of these, 173 sentences contain at least one interaction and 589 contain at least one gene/protein. There are 255 interactions in total, some of which list more than two partners (e.g., when one partner occurs both with its full name and as an abbreviation).

This is not one of the corpora used in the actual BioCreAtIvE community challenges; for information on those, please refer to the BioCreAtIvE challenge's own resources.


Jörg Hakenberg
Conrad Plake
Ulf Leser

Associated Institutions

Humboldt-Universität zu Berlin

Application Domains
  • Biology
  • Genomics
  • Proteomics
Dataset Subtype
  • Human annotated
Intended User Types
  • Informatics researcher
  • NLP researcher or developer
  • Software developer

Conrad Plake, Jörg Hakenberg, Ulf Leser: Optimizing Syntax Patterns for Discovering Protein-Protein Interactions. In: Proc ACM Symposium on Applied Computing, Bioinformatics track, 1:195-201. Santa Fe, USA, March 2005.

Licensing Type
Open source