BioLexicon

The BioLexiconDB is an integrated resource: it was initially populated with data collected from available biomedical sources (e.g. UniProtKB/Swiss-Prot, ChEBI, BioThesaurus, NCBI Taxonomy and other biomedical resources) and has since been augmented with entries and lexical properties for terms and events automatically extracted from the biomedical literature. Because the BioLexicon is designed specifically for information extraction, integrating the lexical information typically contained in a computational lexicon is crucial, and it is one of the innovations with respect to related work. The BioLexicon is thus, by its very nature, a resource that combines features of both terminologies and lexicons. Some of the BioLexicon terms, together with their syntactic and semantic information, are aligned with concepts of the BioOntology, the ontological resource of the project.

Term Repository

The BOOTStrep consortium is developing a lexical resource called the BioLexicon. In its current state, the core content, called the Term Repository, has been generated and exchanged with the partners in order to augment it with terms from the literature (NACTEM/UOM) and to feed the results into a database schema that fulfils the standard requirements of a lexical resource (CNR, Pisa). The content of the Term Repository has been assessed against the corpus of the BIOCREATIVE II / Task 1b challenge (gene name normalisation). http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html

Pezik, P., Jimeno, A., Lee, V., Rebholz-Schuhmann, D. (2008). Static Dictionary Features for Term Polysemy Identification. Proceedings of the Language Resources and Evaluation Conference (LREC-2008), workshop on "Building and evaluating resources for biomedical text mining", Marrakech (Morocco), 28-30 May 2008.

Subcategorization Extractor

This web service automatically extracts subcategorization frames (SCFs) for verbs from texts that have been linguistically pre-processed with the Enju syntactic parser for English (version 2.2, http://www-tsujii.is.s.u-tokyo.ac.jp/enju/). Unlike many approaches to subcategorization acquisition, the extraction tool does not presuppose a predefined inventory of SCFs. Although SCF repertoires exist for English, an "SCF discovery" approach was preferred, so that the tool can acquire not only subcategorized arguments but also strongly selected modifiers (e.g. location, manner and timing), which play a crucial role in the interpretation of biomedical texts.

Subcategorization extraction can be carried out either for a given list of verbs (e.g. verbs considered biologically relevant) or for the whole set of verbs attested in the acquisition corpus. The output is an XIF-compliant XML file containing the list of acquired subcategorization frames and the verb-SCF/SLOT associations, ready for upload into the BioLexicon. The service has been developed by CNR-ILC within the BOOTStrep project (WP03: Population of the Bio-Lexicon). A complete description of the SCF extraction tool and of the acquired results is provided in Deliverable 3.3 "Augmented version (2) of the Bio-Lexicon including linguistic information".
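The idea of frame discovery, as opposed to matching against a fixed SCF inventory, can be illustrated with a minimal Python sketch. It is not the CNR-ILC tool and does not read Enju output or write XIF: the input format (verb lemma plus a list of slot labels), the slot labels themselves and the function names discover_scfs and relative_frequencies are all assumptions made for illustration. Each distinct slot combination observed with a verb is simply treated as a candidate frame, and the optional target_verbs argument mirrors the choice between a verb list and the whole corpus.

```python
from collections import Counter, defaultdict


def discover_scfs(observations, target_verbs=None):
    """Count the slot combinations observed with each verb.

    observations: iterable of (verb_lemma, slots) pairs, where slots is a
        list of slot labels. This input format is hypothetical.
    target_verbs: optional set of verb lemmas; if None, every verb attested
        in the observations is processed.
    """
    counts = defaultdict(Counter)
    for verb, slots in observations:
        if target_verbs is not None and verb not in target_verbs:
            continue
        # No predefined inventory: a frame is whatever slot combination is
        # seen. Sorting makes the frame independent of surface slot order.
        frame = tuple(sorted(slots))
        counts[verb][frame] += 1
    return counts


def relative_frequencies(counts):
    """Turn raw frame counts into per-verb relative frequencies."""
    result = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        result[verb] = {frame: n / total for frame, n in frames.items()}
    return result


# Toy observations with made-up slot labels (not actual parser output).
observations = [
    ("bind", ["subj", "obj"]),
    ("bind", ["subj", "pp-to"]),
    ("bind", ["subj", "obj"]),
    ("express", ["subj", "obj", "pp-in"]),
    ("occur", ["subj", "pp-in"]),
]

counts = discover_scfs(observations, target_verbs={"bind", "express"})
freqs = relative_frequencies(counts)
```

A real acquisition system would normally also filter out rare or noisy frames before accepting them; that step is omitted here.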

The service address is http://poesix1.ilc.cnr.it/bootstrep/subcatextractor.cgi. The web service can be accessed with a client (SCF_Extractor_client.pl), which requires a Perl interpreter on the client machine.