GENIA Corpus

Corpus annotation is now a key topic in every area of natural language processing (NLP) and information extraction (IE) that employs supervised learning. With the explosion of results in molecular biology, there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this, we are building a corpus of annotated abstracts taken from the National Library of Medicine's MEDLINE database. In the GENIA Corpus we annotate a subset of the substances and biological locations involved in protein reactions. The annotation is based on a data model of the biological domain (the GENIA ontology) and is encoded in an XML format (GPML). GENIA Corpus Version 3.0x consists of 2000 abstracts. The base abstracts were selected from the search results for the keywords (MeSH terms) Human, Blood Cells and Transcription Factors.
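
As a rough illustration of how inline XML annotation of this kind might be consumed, the sketch below collects annotated terms and their ontology classes from a small fragment. The element name `cons`, the attribute `sem`, the `G#` class prefix, and the class names are assumptions about the GPML markup that this page does not state, and the sample sentence is invented for the example; check them against the corpus files before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment in the assumed style of GPML: each annotated term is
# wrapped in a <cons> element whose "sem" attribute names an ontology class.
SAMPLE = (
    '<sentence>'
    '<cons sem="G#protein_molecule">NF-kappa B</cons> activation in '
    '<cons sem="G#cell_type">T cells</cons> requires '
    '<cons sem="G#protein_molecule">IL-2</cons>.'
    '</sentence>'
)


def extract_terms(xml_text):
    """Return (term text, semantic class) pairs for every annotated term."""
    root = ET.fromstring(xml_text)
    terms = []
    for cons in root.iter("cons"):
        sem = cons.get("sem")
        if sem is None:
            # Skip elements that carry no class label.
            continue
        # itertext() also gathers text from any nested elements.
        terms.append(("".join(cons.itertext()), sem))
    return terms


terms = extract_terms(SAMPLE)
counts = Counter(sem for _, sem in terms)

for text, sem in terms:
    print(f"{sem}\t{text}")
```

Counting classes with `Counter` gives a quick view of how the annotations are distributed, which is often the first check when preparing training data for a supervised term recognizer.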

Associated Institutions

Tsujii Laboratory

Graduate School of Information Science and Technology

The University of Tokyo

Application Domains
  • Biology
  • Clinical
Dataset Subtype
  • Human annotated
Intended User Types
  • Clinical researcher
  • Informatics researcher
  • NLP researcher or developer
  • Software developer
Available Documentation
  • Web page/HTML documentation