Bioinformatics is an exciting area of research in which computer science and mathematics are used as tools in genetics and molecular biology to study the genes of a cell and to understand how they function, interact and control the processes in an organism. The sequencing of the human genome and a range of other genomes has resulted in enormous databases of DNA and protein sequences. Research and development of new and better tools to analyse these data has become essential to "crack the code" and extract information of scientific importance. Because of the huge amounts of data available, the opportunities to make scientific discoveries are great, but the computational problems are also challenging.
Sequencing centres around the world have now determined the complete genome sequences of more than 600 organisms. These efforts have resulted in huge amounts of sequence data that are still growing rapidly. The challenges are to find out in detail what genes and other signals these sequences contain, and what the form and function of the gene products are. Computational analyses of the sequences can often answer many of these questions, and are a great help for later experimental biochemical work. The group is therefore working closely with other groups that study genes using advanced molecular biology methods.
We are using computers to analyse genome sequences to find new genes and determine their function. Advanced statistical and computational tools are used and developed to find patterns and particular sequences that indicate the presence of genes and regulatory elements. In order to identify new relationships between genes, improved methods are being developed to compare sequences and to search sequence databases. The group is also creating databases with information about genes of particular interest, e.g. genes involved in DNA repair.
We are working together with other research groups, in particular the groups in CMBN working on DNA repair. Usually we can only make predictions about the genes, so experimental molecular biology work in the lab is necessary to verify or refute our hypotheses.
For a long time, we have been working with rapid and sensitive methods to perform homology searches, but now we are interested in a range of other problems as well, as can be seen from the project list below.
Methods for rapid and sensitive sequence similarity searches
Novel tools for rapid and sensitive sequence database similarity searches have been developed in the group and are now available at www.paralign.org. Parallelisation and advanced hardware features are exploited to get the highest performance. We have developed a very rapid parallel implementation of the Smith-Waterman algorithm and the new ParAlign algorithm - a combined rapid and sensitive sequence similarity search tool. Both methods exploit the parallel processor technology known as SIMD (single instruction, multiple data), also called multimedia technology. The PARALIGN software is distributed by Sencel Bioinformatics. We have also contributed to the sequence homology networks that are a new feature of the PubGene system.
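For readers unfamiliar with the Smith-Waterman algorithm, the following is a minimal, unoptimised Python sketch of its core recurrence. It is not the group's SIMD implementation and not the ParAlign algorithm; the function name, the scoring values and the linear gap penalty are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: the match/mismatch/gap values are
    assumptions, and a simple linear gap penalty is used.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0 floor
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "CCACGCC"))
```

Each cell depends only on its left, upper and upper-left neighbours, which is the regularity that parallel implementations can exploit.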
Predicting the function of genes by comparing genomes
The availability of a large number of different completely sequenced genomes makes it possible to predict the function of genes by comparing the genomes. By looking at the patterns of which genes are present and absent in the different genomes, it is possible to predict which genes are related. One can also look at which genes are consistently located close to each other on the genome. Another challenge is to identify which genes are orthologs and paralogs of each other across organisms.
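The idea of comparing presence and absence patterns can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the gene and genome names are hypothetical, and the Jaccard index is one possible similarity measure, not necessarily the one used by the group.

```python
def profile_similarity(p, q):
    """Jaccard similarity between two presence profiles.

    Each profile is the set of genomes in which a gene is present.
    """
    union = p | q
    if not union:
        return 0.0
    return len(p & q) / len(union)


# Hypothetical data: gene name -> genomes where the gene is present
profiles = {
    "geneA": {"genome1", "genome2", "genome4"},
    "geneB": {"genome1", "genome2", "genome4"},
    "geneC": {"genome3"},
}

if __name__ == "__main__":
    for other in ("geneB", "geneC"):
        score = profile_similarity(profiles["geneA"], profiles[other])
        print("geneA", other, score)
```

Genes with very similar profiles, such as the hypothetical geneA and geneB here, are candidates for being functionally related; such predictions still need experimental verification.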
Identification, classification, and databasing of DNA repair genes
General sequence analysis and computational identification of new DNA repair genes are important topics where the group collaborates closely with other CMBN groups. We are also creating a web portal with an underlying database containing information on genes involved in DNA repair. We are developing and using methods for identifying novel repair genes, both by using advanced homology approaches and by exploiting the information derived from complete genome sequences from a large number of different organisms. We have recently published a classification of bacterial AlkB proteins (Res. Microbiol. 2003).
The image above illustrates the distribution within human cells of a protein (APEX2) similar in sequence to DNA repair endonucleases, visualised by fusing the protein with an Enhanced Green Fluorescent Protein (EGFP). The protein was initially identified by computational methods. (Image by Luisa Luna)
Modelling the 3D structure of DNA repair proteins
A protein 3D structure visualisation and modelling lab is being established. The aim is to model DNA repair enzymes using homology modelling, in order to understand the molecular mechanisms involved. This work is carried out in close collaboration with Magnar Bjørås' structural biology group.
Identification of non-coding RNA genes
The group is developing computational methods to identify new members of an interesting class of genes that do not encode proteins but instead produce stable, functional non-coding RNAs (ncRNAs). Apart from tRNAs and rRNAs, these include microRNAs, tmRNAs and many other very important genes. In addition to identifying new groups of ncRNAs, we are also improving the systematic annotation of tRNAs and rRNAs in genomes (Microbiology 2004), as can be seen in the CBS GenomeAtlas.
A CBS GenomeAtlas showing the DNA Atlas from Listeria monocytogenes strain 4b. Note the skewed distribution of the bases, possibly corresponding to the direction of replication. Predictions of tRNAs and rRNAs are also available.
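One common way to quantify a skewed base distribution is the GC skew, (G - C) / (G + C), computed in windows along the sequence. The Python sketch below illustrates that measure only; it is not the method used to produce the atlas, and the function name and window size are assumptions.

```python
def gc_skew(seq, window=4):
    """Return the GC skew (G - C) / (G + C) for consecutive windows.

    Windows without any G or C get a skew of 0.0. The tiny default
    window is for illustration; real analyses use much larger windows.
    """
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


if __name__ == "__main__":
    print(gc_skew("GGGGCCCC", window=4))
```

A change of sign in the skew along a bacterial chromosome is often taken as a hint about where replication starts or ends.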
An RNA secondary structure model consists of loops and of stems formed by pairing bases. (Note that uracil is represented by a T in this figure, as opposed to the usual U.)
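As a small illustration of how a stem is formed, the Python sketch below checks whether two sequence segments can pair base by base when read in opposite directions. It follows the figure's convention of writing uracil as T, considers only Watson-Crick pairs (G-U wobble pairs are deliberately left out), and the function name is hypothetical.

```python
# Watson-Crick pairs, with uracil written as T as in the figure
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def can_form_stem(five_prime_arm, three_prime_arm):
    """Return True if the two arms pair perfectly in antiparallel orientation."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    # The first base of one arm pairs with the last base of the other
    return all(
        (x, y) in WATSON_CRICK
        for x, y in zip(five_prime_arm, reversed(three_prime_arm))
    )


if __name__ == "__main__":
    print(can_form_stem("GGAC", "GTCC"))
```

Real secondary structure prediction also has to choose among many competing stems and loops, which this check does not attempt.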