E. coli Corpora
Bio-Event Linguistically Annotated Corpus (BELA)
The Bio-Event Linguistically Annotated (BELA) corpus is a corpus of MEDLINE abstracts on the subject of E. coli. The corpus has been annotated with bio-events relating to gene regulation, with a specific view to the acquisition of semantic frames. Annotation includes event structure annotation (semantic arguments of verbs and nominalised verbs of interest are identified and marked with appropriate semantic role labels) and named entity categorisation (semantic role fillers that correspond to named entities are marked according to their semantic type). Within the BOOTStrep project, the corpus was used to acquire the semantic frames for verbs within the BioLexicon.
The final BELA resource, which is the result of a joint effort by UoM and CNR-ILC, includes:
GREC Corpus
The GREC corpus is a semantically annotated corpus of MEDLINE abstracts, developed during the BOOTStrep project. It is intended for training IE systems and/or resources used to extract events from biomedical literature.
The corpus has been manually annotated by biologists with event instances relating to gene regulation. Each event is centred on either a verb (e.g. transcribe) or a nominalised verb (e.g. transcription), and annotation consists of identifying, as exhaustively as possible, the structurally related arguments of the verb or nominalised verb within the same sentence. Each event argument is then assigned the following information:
- A semantic role from a fixed set of 13 roles which are tailored to the biomedical domain.
- A biomedical concept type (where appropriate).
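The annotation structure described above can be pictured as a small data model. The following is only a minimal sketch and not the GREC distribution format: the class names, field names, example sentence and the role labels "Agent" and "Theme" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventArgument:
    """One argument of an event (hypothetical representation)."""
    text: str                           # argument span within the sentence
    role: str                           # one of the 13 semantic roles (labels here are illustrative)
    concept_type: Optional[str] = None  # biomedical concept type, where appropriate


@dataclass
class Event:
    """An event centred on a verb or nominalised verb."""
    trigger: str
    arguments: List[EventArgument] = field(default_factory=list)

    def roles(self) -> List[str]:
        """Return the role labels of all arguments, in annotation order."""
        return [arg.role for arg in self.arguments]


# Invented sentence: "Protein X activates transcription of gene Y."
event = Event(
    trigger="activates",
    arguments=[
        EventArgument(text="Protein X", role="Agent", concept_type="Protein"),
        EventArgument(text="transcription of gene Y", role="Theme"),
    ],
)
print(event.roles())
```

The concept type is optional because, as stated above, it is assigned only where appropriate.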
The corpus consists of 240 MEDLINE abstracts on the subjects of E. coli and human. In total, 3067 events have been annotated. Evaluation of various facets of the annotation task showed that average inter-annotator agreement rates fall within the range 66% - 90%.
The corpus and guidelines can be found at the following location, together with further information:
http://www.nactem.ac.uk/GREC/
Gene Regulation Modality Corpus
Modality is concerned with the opinion and attitude of authors of literature. It is beneficial for bio-text mining systems to identify modality of biological event descriptions because literature describes not only biological facts but also speculations and opinions on biological events.
To investigate the nature of modality in gene regulation literature, we annotated 202 MEDLINE abstracts, containing a total of 1469 gene regulation events, with modal information. 249 of these events (i.e. 16.95%) were annotated with modality information.
Our categorisation scheme for modality in biomedical texts consists of the following 3 "dimensions" of information:
- Knowledge Type, encoding the type of "knowledge" that underlies a statement, encapsulating both whether the statement is a speculation or based on evidence and how the evidence is to be interpreted.
- Level of Certainty, indicating how certain the author (or cited author) is about the statement.
- Point of View, indicating whether the statement is based on the author's own or a cited point of view or experimental findings.
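The three dimensions above can be sketched as a record attached to each event. This is a minimal illustration under stated assumptions: the class, the field names and the value "speculation" are hypothetical, since the actual value sets of each dimension are not given here. The helper function only reproduces the proportion of modality-bearing events quoted above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModalityAnnotation:
    """Modality of one event along the three dimensions (hypothetical fields)."""
    knowledge_type: Optional[str] = None
    certainty_level: Optional[str] = None
    point_of_view: Optional[str] = None

    def is_annotated(self) -> bool:
        """True if at least one dimension carries a value."""
        return any(
            value is not None
            for value in (self.knowledge_type, self.certainty_level, self.point_of_view)
        )


def modality_coverage(annotations: Sequence[ModalityAnnotation]) -> float:
    """Percentage of events that carry any modality information."""
    if not annotations:
        return 0.0
    marked = sum(1 for ann in annotations if ann.is_annotated())
    return 100.0 * marked / len(annotations)


# 249 events with modality information out of 1469 events in total.
events = [ModalityAnnotation(knowledge_type="speculation")] * 249
events += [ModalityAnnotation()] * (1469 - 249)
print(round(modality_coverage(events), 2))  # 16.95
```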
Contact details:
Sophia Ananiadou (Sophia.Ananiadou@manchester.ac.uk), School of Computer Science and UK National Centre for Text Mining, University of Manchester, and Simonetta Montemagni (Istituto di Linguistica Computazionale del CNR)
Gene Regulation Modality Corpus distribution: to be publicly available from the NaCTeM web site and CNR.
GeneReg Corpus
The GeneReg corpus consists of 314 MEDLINE abstracts dealing with gene regulation in E. coli. It provides three types of semantic annotations:
- named entities involved in gene regulatory processes, such as TFs (transcription factors, cofactors and regulators) and genes,
- pairwise relations between TFs and genes,
- triggers (e.g., clue verbs) essential for the description of gene regulation relations.
In quantitative terms, this amounts to approximately 6,700 named entity annotations (for genes and transcription factors, cofactors and regulators), approximately 1,200 core conceptual relations (where the agent of the regulation is a transcription factor, cofactor or regulator) and about 3,200 core trigger annotations (for gene expression, transcription, and unspecified, positive and negative regulation of gene expression). We further enhanced these annotations with so-called “additional constraints”, i.e., qualifying conditions which accompany or constrain the observed relations. Among the many choices of additional constraints, we focused here on ligands, i.e., chemical entities of interest, as well as experimental interventions. For all three annotation levels and the additional constraints, the annotation vocabulary was taken from the Gene Regulation Ontology (GRO).
GeneReg (version 1.1) contains in total 1,770 annotated conceptual relations (core relations and additional constraint relations).
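As an illustration of how the pairwise relations between regulators and genes might be used once loaded, the sketch below groups regulated genes by regulator. It assumes nothing about the actual GeneReg file format: the `Relation` class, its fields and the placeholder names ("TF_A", "gene1", and so on) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Relation:
    """A pairwise regulation relation (hypothetical representation)."""
    regulator: str  # transcription factor, cofactor or regulator mention
    gene: str       # regulated gene mention
    trigger: str    # clue word signalling the relation


def targets_by_regulator(relations: List[Relation]) -> Dict[str, Set[str]]:
    """Collect the set of regulated genes for each regulator."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for rel in relations:
        grouped[rel.regulator].add(rel.gene)
    return dict(grouped)


# Placeholder data, not taken from the corpus.
relations = [
    Relation(regulator="TF_A", gene="gene1", trigger="activates"),
    Relation(regulator="TF_A", gene="gene2", trigger="represses"),
    Relation(regulator="TF_B", gene="gene1", trigger="regulates"),
]
print(targets_by_regulator(relations))
```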
Stage of development: A first release of the GeneReg Corpus is available on demand.
Contact details:
Udo Hahn (udo.hahn@uni-jena.de), Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany