InterJournal Complex Systems, 924
Submission Date: 2004
Biological information networks of genetic loci and the scientific literature
Graph theory provides a formal framework for specifying and modelling the relationships amongst a set of objects. In the real world, undirected and directed graphs involving many thousands of nodes and edges have been employed to represent and investigate social, information, and technological networks. In biology, extant research has focussed on metabolic, gene regulatory, and protein interaction networks in which nodes are equated with molecules and edges denote (bio)physical associations. In contrast, this work considers an under-explored and unexploited large biological network: an undirected graph in which a node corresponds to a genetic locus and an edge signifies that the two connected loci are discussed in the same academic paper. Such a network of inter-gene relationships as delineated in the scientific literature has both theoretical and practical applications. First, it provides a novel real-world example for use in examining the spatial and dynamic properties of large, complex networks. Second, it acts as a resource to enhance the ability of biologists to explore gene relationships, and to synthesize new knowledge in order to generate novel predictions for subsequent experimental validation. Here, a strategy for estimating the topology of the aforementioned type of network from a text-based biomedical corpus is described and applied to NCBI's well-known and integrative LocusLink database. The result is a graph that makes explicit assertions about the relationships between thousands of genes according to human-curated literature records. Various properties commonly used to characterize graphs are computed, and the values for a LocusLink-based gene-paper information network are compared with those for other known networks.
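The construction described above can be illustrated with a minimal sketch. It assumes the corpus has already been reduced to (locus, paper) pairs; the function name `build_gene_paper_graph`, the record layout, and the toy identifiers are hypothetical and are not taken from LocusLink or from the method used in this work.

```python
from collections import defaultdict
from itertools import combinations


def build_gene_paper_graph(records):
    """Build an undirected co-occurrence graph from (locus, paper) pairs.

    Returns a dict mapping frozenset({locus_a, locus_b}) to the set of
    papers in which both loci are discussed.
    """
    # Group loci by the paper that mentions them.
    loci_by_paper = defaultdict(set)
    for locus, paper in records:
        loci_by_paper[paper].add(locus)

    # Every pair of loci sharing a paper becomes an edge labelled by that paper.
    edges = defaultdict(set)
    for paper, loci in loci_by_paper.items():
        for a, b in combinations(sorted(loci), 2):
            edges[frozenset((a, b))].add(paper)
    return dict(edges)


# Hypothetical toy records (assumption: invented identifiers, not real data).
records = [
    ("GENE_A", "paper1"), ("GENE_B", "paper1"),
    ("GENE_B", "paper2"), ("GENE_C", "paper2"),
    ("GENE_A", "paper3"), ("GENE_B", "paper3"),
    ("GENE_D", "paper4"),  # mentioned alone, so it contributes no edge
]
edges = build_gene_paper_graph(records)

# Node degree, one of the simplest properties used to characterize a graph.
degree = defaultdict(int)
for pair in edges:
    for locus in pair:
        degree[locus] += 1
```

Keeping the set of supporting papers on each edge, rather than a bare adjacency matrix, means that any relationship asserted by the graph can be traced back to the literature records behind it.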
The practical utility of the network in the scientific discovery process is illustrated by employing it to address questions such as ``Which scientific papers should I read to understand the relationship(s), if any, between gastrin-releasing peptide (GRP) and the estrogen receptor 1 (ESR1)?''. Such a query could be the outcome of a microarray study indicating that GRP and ESR1 can distinguish node-negative breast carcinoma samples that were estrogen receptor alpha-positive from those that were estrogen receptor alpha-negative. Given a network, this question can be rephrased as ``Find one or more paths connecting the nodes representing GRP and ESR1''. This is a combinatorial optimization problem, soluble using a standard shortest-path algorithm. Its solution contains a set of papers directly relevant to the enquiry at hand. Each discrete release of a biomedical corpus defines a graph with a particular topology. Over time, the curation process may add and/or remove nodes and/or edges. Thus, an evolving corpus such as LocusLink provides not only the foundation for analyzing the growth and behavior of large complex networks, but also the substrate for gaining insights into the sociology of biological research.
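The path query can be sketched as follows, using breadth-first search as one standard shortest-path algorithm for unweighted graphs. This is an illustrative assumption rather than the algorithm used in this work, and the intermediate loci (`LOCUS_X`, `LOCUS_Y`, `LOCUS_Z`) and paper identifiers are invented placeholders, not connections drawn from LocusLink.

```python
from collections import deque


def shortest_path_with_papers(edges, source, target):
    """Return (path, papers_per_hop) for a shortest path, or None.

    `edges` maps frozenset({locus_a, locus_b}) to a set of paper ids.
    """
    # Derive an adjacency map from the edge dictionary.
    adjacency = {}
    for pair in edges:
        a, b = tuple(pair)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search, recording each node's predecessor.
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    if target not in parent:
        return None

    # Walk the predecessors back from the target to recover the path.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()

    # Collect the papers supporting each hop: the reading list for the query.
    papers = [sorted(edges[frozenset(hop)]) for hop in zip(path, path[1:])]
    return path, papers


# Hypothetical toy network (assumption: invented links and paper ids).
toy_edges = {
    frozenset(("GRP", "LOCUS_X")): {"paper_1"},
    frozenset(("LOCUS_X", "ESR1")): {"paper_2", "paper_3"},
    frozenset(("GRP", "LOCUS_Y")): {"paper_4"},
    frozenset(("LOCUS_Y", "LOCUS_Z")): {"paper_5"},
    frozenset(("LOCUS_Z", "ESR1")): {"paper_6"},
}
result = shortest_path_with_papers(toy_edges, "GRP", "ESR1")
```

In this toy network the search returns the two-hop route through `LOCUS_X` in preference to the three-hop route through `LOCUS_Y` and `LOCUS_Z`, together with the papers attached to each hop.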