Databases in Bioinformatics
Before addressing this topic, several questions come to mind:
1. Biological Databases: Why?
2. The different types of Databases.
3. Accession codes vs. identifiers
4. Nucleotide Databases.
5. Protein sequence Databases.
6. Sequence motif Databases.
7. Macromolecular 3D structure Databases.
8. Other relevant Databases.
9. System of searching, indexing and cross-referencing.
Biological Databases: Why?
There are two main functions of biological databases:
Making biological data available to scientists: As much information as possible should be available in one single place (book, site, database). Public data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article.
Making biological data available in computer-readable form: Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than in print or on paper) is a necessary first step.
One of the first biological sequence databases was probably the book “Atlas of Protein Sequence and Structure” by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s.
The computer became the storage medium of choice as soon as it came within reach of ordinary scientists. Databases were distributed on tapes, and later on various kinds of discs. Once universities and research institutions were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the medium of choice. It is even easier to see why the World Wide Web (WWW), based on HTTP (Hypertext Transfer Protocol), has been the standard method of communication and access for nearly all biological databases since the beginning of the 1990s.
As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics. Other types of data that are or will soon be available in databases are metabolic pathways (KEGG), gene expression data (microarrays), protein-protein interactions and other types of data related to biological function and processes.
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena from the structure of biomolecules and their interaction, to the whole metabolism of organisms and to understanding the evolution of species. This knowledge helps facilitate the fight against diseases, assists in the development of medications and in discovering basic relationships amongst species in the history of life.
The biological knowledge is distributed amongst many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together.
An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics.
Most important public databases for molecular biology
Primary Sequence DBs (collaborative project)
· DDBJ (DNA DataBase of Japan)
· EMBL Nucleotide DB (European Molecular Biology Laboratory )
· GenBank (National Center for Biotechnology Information)
· Entrez Gene Unified retrieval of gene-centred information (NCBI)
· euGenes Assembled information on eukaryotic genomes (Univ. of Indiana)
· GeneCards (Weizmann Inst.)
· GenLoc / UDB (Weizmann Inst.)
· SOURCE (Univ. of Stanford)
· LocusLink (National Center for Biotechnology Information)
Genome Annotation Systems
· Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst.)
· UniGene Automatic partitioning of GenBank sequences (NCBI)
· Golden Path / UCSC (Univ. of California, Santa Cruz)
· CGAP Cancer Genes (National Cancer Institute)
· Clone Registry Clone Collections (National Center for Biotechnology Information)
· I.M.A.G.E Clone Collections (Image Consortium)
· DBGET H.sapiens, retrieval system (Univ. of Kyoto)
· DIP Interacting Proteins (Univ. of California)
· GDB (Human Genome Organization)
· KEGG Functional Db (Univ. of Kyoto)
· MGI Mouse Genome (Jackson Lab.)
· OMIM Inherited Diseases (National Center for Biotechnology Information)
· SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)
· PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)
· List with SNP-Databases
· ArrayExpress (European Bioinformatics Institute)
· Gene Expression Omnibus (National Center for Biotechnology Information)
· maxd (Univ. of Manchester)
· SMD (Univ. of Stanford)
Accession codes vs. identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:
- Identifier
- Accession code (or number)
The question of how to deal with changed, updated and deleted entries in databases is a very tricky one, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time within a single database.
The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.
Identifier
An identifier (“locus” in GenBank, “entry name” in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
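The two-part structure of such an entry name can be illustrated with a minimal sketch. It assumes only that the two parts are separated by a single underscore, as in KRAF_HUMAN; the names `EntryName` and `parse_entry_name` are hypothetical and not part of any SWISS-PROT tool.

```python
from typing import NamedTuple


class EntryName(NamedTuple):
    """Hypothetical container for the two parts of an entry name."""
    protein: str  # part that denotes the protein, e.g. "KRAF"
    species: str  # part that denotes the species, e.g. "HUMAN"


def parse_entry_name(entry_name: str) -> EntryName:
    """Split an entry name at its first underscore into protein and species parts."""
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return EntryName(protein, species)


parsed = parse_entry_name("KRAF_HUMAN")
print(parsed.protein, parsed.species)
```

Because the entry name is meant for human readers and may change, a parser like this is useful for display or grouping by species, not as a stable key.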
An identifier can usually change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it’s not really a big problem.
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.
When two entries are merged into a single entry, the new entry will have both accession codes: one becomes the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also keep the old accession code as a secondary code.
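The merge and split rules above can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: the class `AccessionRegistry`, its methods, and the accession codes `A00001` to `A00005` are all hypothetical, and real databases implement their own, not fully consistent, policies.

```python
def _unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


class AccessionRegistry:
    """Toy model in which every accession code ever issued keeps resolving."""

    def __init__(self):
        self.entries = {}  # primary accession -> list of secondary accessions
        self.index = {}    # any issued accession -> current primary accessions

    def add(self, accession):
        self.entries[accession] = []
        self.index[accession] = [accession]

    def _retire(self, old, successors):
        """Point `old`, and every code that resolved to it, at `successors`."""
        inherited = [old] + self.entries.pop(old)
        for code in inherited:
            targets = []
            for target in self.index[code]:
                targets.extend(successors if target == old else [target])
            self.index[code] = _unique(targets)
        for successor in successors:
            # The retired codes live on as secondary codes of the successors.
            self.entries[successor] = _unique(self.entries[successor] + inherited)

    def merge(self, primary, other):
        """Merge `other` into `primary`; `other` becomes a secondary code."""
        self._retire(other, [primary])

    def split(self, old, new_a, new_b):
        """Split `old` into two new entries that keep `old` as a secondary code."""
        self.add(new_a)
        self.add(new_b)
        self._retire(old, [new_a, new_b])

    def resolve(self, accession):
        """Return the current primary accession(s) for any issued code."""
        return list(self.index[accession])


registry = AccessionRegistry()
for code in ("A00001", "A00002", "A00003"):  # hypothetical accession codes
    registry.add(code)

registry.merge("A00001", "A00002")
registry.split("A00003", "A00004", "A00005")

print(registry.resolve("A00002"))  # code of the merged-away entry
print(registry.resolve("A00003"))  # code of the split entry
```

In this sketch a lookup by a retired code returns one entry after a merge and two entries after a split, which is why a citation by accession code remains traceable even after the entry has changed.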
NCBI’s sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.
GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences.
EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.
SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support the data-gathering efforts.
STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.
UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.
DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases
Major Sequence Repositories:
Human Chromosome Information:
Organelle Genome Databases:
Comparative & Phylogenetic Databases:
COG: Phylogenetic classification of proteins
DHMHD: Human-mouse homology database
HomoloGene: Gene homologies across species
Homophila: Human disease to Drosophila gene database
HOVERGEN: Database of homologous vertebrate genes
TreeBase: A database of phylogenetic knowledge
XREF: Cross-referencing with model organisms
SNPs, Mutations & Variations Databases:
Alternative Splicing Databases:
ASAP: Alternate splicing analysis tool at UCLA
ASG: Alternate splicing gallery
HASDB: Human alternative splicing database at UCLA
AsMamDB: alternatively spliced genes in human, mouse and rat
ASD: Alternative splicing database at CSHL
Specialized Databases:
ABIM: Links to several genomics databases
ACUTS: Ancient conserved untranslated sequences
AGSD: Animal genome size database
AmiGO: The Gene Ontology database
ARGH: The acronym database
ASDB: Database of alternatively spliced genes
BACPAC: BAC and PAC genomic DNA library info
BBID: Biological Biochemical image database
Cardiac gene database:
CHLC: Genetic markers on chromosomes
COGENT: Complete genome tracking database
COMPEL: Composite regulatory elements in eukaryotes
CUTG: Codon usage database
dbEST: Database of expressed sequences or mRNA
dbGSS: Genome survey sequence database
dbSTS: Sequence tagged sites (STS)
DBTSS: Database of transcriptional start sites
DOGS: Database of genome sizes
EID: The exon-intron database – Harvard
Exon-Intron: Exon-Intron database – Singapore
EPD: Eukaryotic promoter database
FlyTrap: HTML based gene expression database
GDB: The genome database
GenLink: Resources for human genetic and telomere research
GeneKnockouts: Gene knockout information
GENOTK: Human cDNA database
GEO: Gene expression omnibus NCBI
GOLD: Information on genome projects around the world
GSDB: The Genome Sequence DataBase
HGI: TIGR human gene index
HTGS: High-throughput genomic sequence at NCBI
IMAGE: The largest collection of DNA sequence clones
IMGT: The international ImMunoGeneTics information system
IPCN: Index to Plant Chromosome Numbers database
LocusLink: Single query interface to sequence and genetic loci
TelDB: The telomere database
MitoDat: Mitochondrial nuclear genes
Mouse EST: NIA mouse cDNA project
MPSS: Searchable databases of several species
NDB: Nucleic acid database
NEDO: Human cDNA sequence database
NPD: Nuclear protein database
Oomycetes DB: Oomycetes database at Virginia Bioinformatics Institute
PLACE: Database of plant cis-acting regulatory DNA elements
RDP: Ribosomal database project
RDB: Receptor database at NIHS, Japan
Refseq: The NCBI reference sequence project
RHdb: Radiation hybrid physical map of chromosomes
SHIGAN: SHared Information of GENetic resources, Japan
SpliceDB: Canonical and non-canonical splice site sequences
STACK: Consensus human EST database
TAED: The adaptive evolution database
TIGR: Curated databases of microbes, plants and humans
TRANSFAC: The Transcription Factor Database
TRRD: Transcription Regulatory region database
UniGene: Cluster of sequences for unique genes at NCBI
UniSTS: Non-redundant collection of STS
Protein Databases
Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others
Protein Sequence Databases:
Antibodies: Sequence and Structure
BRENDA: Enzyme database
CD Antigens: Database of CD antigens
dbCFC: Cytokine family database
Histones: Histone sequence database
HPRD: Human protein reference database
InterPro: Integrated documentation resources for protein families
iProClass: An integrated protein classification database
KIND: A non-redundant protein sequence database
MHCPEP: Database of MHC binding peptides
MIPS: Munich information centre for protein sequences
PIR: Annotated, and non-redundant protein sequence database
PIR-ALN: Curated database of protein sequence alignments
PIR-NREF: PIR non-redundant reference protein database
PMD: Protein mutant database
PRF: Protein research foundation, Japan
ProClass: Non-redundant protein database
ProtoMap: Hierarchical classification of SWISS-PROT proteins
REBASE: Restriction enzyme database
RefSeq: Reference sequence database at NCBI
SwissProt: Curated protein sequence database
SPTR: Comprehensive protein sequence database
Transfac: Transcription factor database
TrEMBL: Annotated translations of EMBL nucleotide sequences
Tumor gene database: Genes with cancer-causing mutations
WD repeats: WD-repeat family of proteins
Protein Structure Databases:
Cath: Protein structure classification
HIV Protease: HIV protease database 3D structure
PDB: 3-D macromolecular structure data
PSI: Protein structure initiative
S2F: Structure to function project
Scop: Structural Classification of Proteins
Protein Domains, Motifs and Signatures:
BLOCKS: Multiple aligned segments of conserved protein regions
CDD: Conserved domain database and search service
DOMO: Homologous protein domain families
Pfam: Database of protein domains and HMMs
ProDom: Protein domain database
Prints: Protein motif fingerprint database
Prosite: Database of protein families and domains
SMART: Simple modular architecture research tool
TIGRFAM: Protein families based on HMMs
Model Organism Databases and Resources
Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish
ABRC: Arabidopsis biological resource center
AGI: Arabidopsis genome initiative
AREX: Arabidopsis gene expression database
Arabinet: Arabidopsis information on the www
AtGDB: An Arabidopsis thaliana plant genome database
AtGI: TIGR Arabidopsis thaliana gene index
ATGC: Genome sequencing at ATGC
ATIDB: Arabidopsis insertion database
CSHL: Arabidopsis genome analysis at Cold Spring Harbor
ESSA: Arabidopsis thaliana project at MIPS
Genoscope: AGI in France
Kazusa: Arabidopsis thaliana genome info Japan
MPSS: Massively parallel signature sequencing
NASC: Nottingham Arabidopsis stock center
Stanford: Sequencing of the Arabidopsis genome at Stanford
TAIR: Arabidopsis information resource
TIGR: TIGR Arabidopsis genome annotation database
Wustl: Arabidopsis genome at Washington university
Trees: A forest tree genome database
B. subtilis: Bacillus subtilis database
Chlamydomonas: Chlamydomonas genetics center
E. coli: E.coli genome project
MGD: Microbial germ plasm database
Microbial: Microbial Genome Gateway
Microbial: Microbial genomes
Micado: Genetics maps of B. subtilis and E. coli
MycDB: An integrated Mycobacterial database
Neisseria: Neisseria meningitidis genome
Neurospora: Neurospora crassa database
OralGen: Oral pathogen database
Salmonella: Salmonella information
STDGen: Sexually transmitted disease database
Bass: Sea Bass Mapping project
Cat ArkDB: Cat mapping database
ARK: Farm animals
BoLA: Bovine MHC information
Bovin: Bovine genome database
BovMap: Mapping the bovine genome
CaDBase: Genetic diversity in cattle
ComRad: Comparative radiation hybrid mapping
Cow ArkDB: Bovine ArkDB
GemQual: Genetics of meat quality
Cotton: Cotton data collection site
Cyano Bacteria: Anabaena genome
Daphnia pulex: Daphnia genomics consortium
Deer ArkDB: Deer mapping database
ENSEMBL: Drosophila Genome Browser at ENSEMBL
Fruitfly: Drosophila genome project at Berkeley
FlyBase: A Database of the Drosophila Genome
FlyMove: A Drosophila multimedia database
FlyView: A Drosophila image database
Goat: GoatMap, mapping the caprine genome
Horse ArkDB: Horse mapping database
Medaka: Medaka fish home page
Maize: Maize genome database
Mosquito: Mosquito genome web server
ENSEMBL: Mouse genome server at ENSEMBL
Jackson Lab: Mouse Resources
MRC: Mouse genome center at MRC, UK
MGI: Mouse genome informatics at Jackson Labs
MGD: Mouse genome database
MGS: Mouse genome sequencing at NIH
MIT: Genetic and physical maps of the mouse genome
Mouse SNP: Mouse SNP database
NCI: Mouse repository
NIH: NIH mouse initiative
ORNL: Mutant mouse database
RIKEN: Mouse resources
Rodentia: The whole mouse catalog
PlantGDB: Resources for plant comparative genomics
Protozoa: Protozoan genomes
RicBase: Rickettsia genome database
Salmon ArkDB: Salmon mapping database
Soy: Soybeans database
Sorghum: Sorghum Genomics
Turkey ArkDB: Turkey mapping database
C. elegans: C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The Genome and biology of C. elegans
ACEDB: A C. elegans database
WWW Server: C. elegans web server
SCPD: The promoter database of Saccharomyces cerevisiae
SGD: Saccharomyces genome database
S. pombe: Schizosaccharomyces pombe genome project
TRIPLES: Functional analysis of Yeast genome at Yale
Yeast Intron database: Spliceosomal introns of the yeast