|
Recent advances in genomics and DNA microarray technology enable investigators to
simultaneously analyze the expression of thousands of genes under different experimental
conditions. However, understanding the functional relationships between co-regulated
genes is a formidable task, requiring firsthand knowledge
of the biological characteristics of each gene. There are a variety of public electronic
resources from which investigators may assemble gene information. For instance,
there are over 10,000 annotated human genes in LocusLink and nearly 13 million citations
archived in MEDLINE. However, better automated tools are needed to aid in extraction
and utilization of gene information from these databases. My lab has been collaborating
with Dr. Michael Berry to develop a new software environment called Semantic Gene
Organizer (SGO) to automatically extract gene relationships from titles and abstracts
in MEDLINE citations. SGO utilizes a variant of the vector-space model of information
retrieval called Latent Semantic Indexing (LSI). LSI implements a classical factorization
method from linear algebra (singular value decomposition) to identify conceptual
relationships between documents. Our studies have provided proof-of-principle that
LSI is a robust automated method for identification of gene-to-keyword and gene-to-gene
relationships from the biological literature. Future aims of this project include:
- expanding the gene-document collection to include all genes in the LocusLink
database;
- using SGO to expand gene ontology terms and functional gene annotation.
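
As a rough illustration of the LSI approach described above (not SGO's actual implementation), the sketch below builds a small term-by-document matrix, truncates its singular value decomposition, and compares genes and keyword queries in the reduced space. The gene names, document text, choice of k, and all function names are invented for illustration; a real system would build each gene-document from MEDLINE titles and abstracts and would typically apply term weighting first.

```python
import numpy as np

# Toy "gene-documents": invented text, NOT real MEDLINE or LocusLink data.
docs = {
    "geneA": "kinase signaling phosphorylation kinase",
    "geneB": "phosphorylation signaling receptor",
    "geneC": "ribosome translation protein synthesis",
    "geneD": "translation ribosome elongation",
}
genes = list(docs)


def build_term_document_matrix(docs):
    """Return (vocab, A) where A[i, j] counts term i in document j."""
    vocab = sorted({w for text in docs.values() for w in text.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, text in enumerate(docs.values()):
        for w in text.split():
            A[index[w], j] += 1.0
    return vocab, A


def truncated_svd(A, k):
    """Keep the k largest singular triplets of A (the LSI factor space)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


vocab, A = build_term_document_matrix(docs)
U_k, s_k, Vt_k = truncated_svd(A, k=2)  # k=2 is an arbitrary choice for the toy data

# One row per gene: document coordinates in the k-dimensional space.
doc_vecs = (np.diag(s_k) @ Vt_k).T


def gene_similarity(g1, g2):
    """Gene-to-gene relationship: cosine between reduced document vectors."""
    return cosine(doc_vecs[genes.index(g1)], doc_vecs[genes.index(g2)])


def query_genes(keyword_text):
    """Gene-to-keyword relationship: rank genes against a keyword query."""
    q = np.zeros(len(vocab))
    for w in keyword_text.split():
        if w in vocab:
            q[vocab.index(w)] += 1.0
    q_k = q @ U_k  # project the query into the same reduced space
    scores = [(g, cosine(q_k, doc_vecs[j])) for j, g in enumerate(genes)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(gene_similarity("geneA", "geneB"))
    print(query_genes("receptor"))
```

In this toy corpus the query "receptor" ranks geneA alongside geneB even though the word never appears in geneA's document, because the two documents share other terms and end up close together in the reduced space. That kind of indirect association is what lets LSI surface relationships that exact keyword matching would miss.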
|