Nikolaj Blom
Center for Biological Sequence Analysis BioCentrum-DTU
Technical University of Denmark
nikob@cbs.dtu.dk
"Resources of Biomolecular Data: Sequences, Structures and Functionality"
PhD course #27803
Outline
Magnitudes and Scales
Resources: Data Sources & Tools
Primary DNA sources
Sequence Repositories
Structure Repositories
Functional Categorization
Integration of Databases
The Human Genome
Genome Browsers
Prediction Tools
Evaluation of Prediction Servers
Starting points
Link collections
Resources: Sources & Tools
There are A LOT OF biomolecular databases/sources
A LOT OF overlap of information/redundancy
A LOT OF TOOLS
Personal picks/preferences
User-friendliness
Update intervals
Curation efforts / error correction
Linkage to other DBs
Faster than Moore’s law...
Human Genome Published
International Human Genome Sequencing Consortium (public project): Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001
Magnitudes and Scales
Human genome 3,200,000,000 bp
Single base pair to full genome: ~9 orders of magnitude
Genome = Football field: ~3 billion leaves of grass
Single base A T G C (or SNP) = 1 leaf of grass
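The scale claim above can be checked with a few lines of arithmetic. This is a minimal sketch; the only input is the genome size quoted on the slide, and the names `GENOME_BP`, `orders` and `zoom_steps` are illustrative.

```python
import math

# Genome size as quoted on the slide (base pairs).
GENOME_BP = 3_200_000_000

# Orders of magnitude separating a single base pair from the whole genome.
orders = math.log10(GENOME_BP)

# Number of tenfold zoom-out steps a browser view needs to go from
# a 1 bp window to a window that covers the whole genome.
zoom_steps = math.ceil(orders)

print(f"{orders:.2f} orders of magnitude")
print(f"{zoom_steps} tenfold zoom steps")
```

The result sits between 9 and 10, which is why the slide rounds it to "9 orders of magnitude".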
Genome browsing
Zooming from whole stadium to single leaf
How we got the sequence
Sanger chain termination method
Primary DNA sources
Trace files repositories
Single read: 500-1000 bp (~golf ball size / jigsaw puzzle piece)
Variable quality
WashU-Merck Human EST Project / Trace files
"Base-calling" is non-trivial
Assembly is non-trivial!
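To give a feel for why assembly is hard, the sketch below counts how many reads are needed just to tile the genome once, and merges two toy reads by their longest suffix/prefix overlap. It is an illustration only, not the method used by any real assembler: the read length of 750 bp is an assumed midpoint of the 500-1000 bp range above, the two reads are invented, and sequencing errors and repeats are ignored.

```python
import math
from typing import Optional

GENOME_BP = 3_200_000_000   # genome size from the slide
READ_LEN = 750              # assumption: midpoint of the 500-1000 bp range

# Minimum number of reads to cover the genome once, with no overlap at all.
min_reads = math.ceil(GENOME_BP / READ_LEN)


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Join two reads on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# Two hypothetical reads sharing the four bases "GTAC".
read1 = "ATGCGTAC"
read2 = "GTACCTGA"
contig = merge(read1, read2)
print(min_reads, contig)
```

Real projects need several-fold coverage, so the number of puzzle pieces is a multiple of `min_reads`, and every piece has to be compared against many others.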
Sequence repositories - GenBank et al.
GenBank / EMBL / DDBJ
Highly redundant (many versions of same gene)
Cross-updated daily
Version history is recorded
Previous sequence records can be retrieved
Contigs/HTGS (100-200 kb) at different stages of finishing: from Draft to Finished
Includes genomic DNA, cDNA, ESTs, translated peptides
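Version history is visible in the identifiers themselves: sequence records carry an `accession.version` identifier, where the number after the dot increases each time the sequence changes. The sketch below splits such identifiers and picks the newest one. The regular expression is a simplification that covers common accession shapes, not a full specification, and `parse_accession` is a hypothetical helper name.

```python
import re
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


# Simplified pattern: letters/underscore, digits, a dot, then the version number.
_PATTERN = re.compile(r"^([A-Z_]+\d+)\.(\d+)$")


def parse_accession(identifier: str) -> VersionedAccession:
    """Split an 'accession.version' string into its two parts."""
    match = _PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(match.group(1), int(match.group(2)))


# Several versions of the same record; the highest version is the current one.
history = [parse_accession(s) for s in ("NM_000546.4", "NM_000546.6", "NM_000546.5")]
latest = max(history, key=lambda record: record.version)
print(latest)
```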
Non-redundant and Curated databases
Non-redundant
Manual or automatic curation
DNA
RefSeq (NCBI; semi-automated)
Ensembl gene index (automated)
Protein
RefSeq (NCBI; semi-automated)
TrEMBL (EMBL; automated)
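The simplest form of automatic redundancy removal is to collapse records whose sequences are identical. The sketch below shows only that idea; it is not how RefSeq, Ensembl or TrEMBL are built, which also involve merging near-identical entries and curation. The record identifiers are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collapse_identical(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group record identifiers by their exact sequence.

    Whitespace and letter case are ignored when comparing sequences.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for identifier, sequence in records:
        key = "".join(sequence.split()).upper()
        groups[key].append(identifier)
    return dict(groups)


# Hypothetical redundant entries: two of the three carry the same sequence.
entries = [
    ("entry_A", "ATGGCC TTAA"),
    ("entry_B", "atggccttaa"),
    ("entry_C", "ATGGCCTTAG"),
]
non_redundant = collapse_identical(entries)
for sequence, identifiers in non_redundant.items():
    print(sequence, identifiers)
```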
Curated database: UniProt/SwissProt
SIB - Swiss Institute of Bioinformatics
Protein Knowledgebase / Sequence Database
Highly curated
Experimental evidence evaluated (e.g. modifications)
All 80,000 entries checked by Amos Bairoch himself ;-)
ExPASy - Expert Protein Analysis System
Proteomics tools: links + local servers
Structure databases / Protein Data Bank (PDB)
X-ray, NMR biomolecular structures
Protein Data Bank (PDB)
>22,000 structures (April 2003)
http://www.rcsb.org/pdb/
Gene Ontology (GO) http://www.geneontology.org/
Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
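The three GO categories can be pictured as a simple lookup from term to category. The sketch below uses only the example terms named above; the `Aspect` enum and `terms_in` helper are illustrative names, and the real ontology is a graph of terms with identifiers and parent/child relations, which this flat table does not model.

```python
from enum import Enum
from typing import List


class Aspect(Enum):
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"


# Example terms taken from the three definitions above.
EXAMPLE_TERMS = {
    "transcription factor": Aspect.MOLECULAR_FUNCTION,
    "DNA helicase": Aspect.MOLECULAR_FUNCTION,
    "mitosis": Aspect.BIOLOGICAL_PROCESS,
    "purine metabolism": Aspect.BIOLOGICAL_PROCESS,
    "nucleus": Aspect.CELLULAR_COMPONENT,
    "telomere": Aspect.CELLULAR_COMPONENT,
    "origin recognition complex": Aspect.CELLULAR_COMPONENT,
}


def terms_in(aspect: Aspect) -> List[str]:
    """Return the example terms that belong to one GO category, sorted by name."""
    return sorted(term for term, a in EXAMPLE_TERMS.items() if a is aspect)


for aspect in Aspect:
    print(aspect.value, "->", terms_in(aspect))
```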
Integration of databases - Webs of web-sites
http://srs.ebi.ac.uk/ Links, links, links...
SRS = Sequence Retrieval System
Powerful, complex query language
BioDAS – Distributed Annotation System
For "my gene", how do I:
Get an overview of the sequence information known? (GeneCards)
Examine the "Genome Neighbourhood"? (Genome Browsers)
Predict protein post-translational modifications (PTMs)? (Prediction servers)
(Evaluate the value of predicted features)