Database in Bioinformatics


Before we tackle this topic, several questions come to mind:

1.      Biological Databases: Why?

2.      The different types of Databases.

3.      Accession codes Vs identifiers

4.      Nucleotide Databases.

5.      Protein sequence Databases.

6.      Sequence motif Databases.

7.      Macromolecular 3D structure Databases.

8.      Other relevant Databases.

9.      System of searching, indexing and cross-referencing.


Biological Databases: Why?

There are two main functions of Biological Databases:

Making biological data available to scientists: As much information as possible should be available in one single place (book, site, database). Public data may be difficult to find or access, and collecting it from the literature is very time consuming. Moreover, not all data is actually published explicitly in an article.

Making biological data available in computer-readable form: Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.

One of the first biological sequence databases was probably the book “Atlas of Protein Sequence and Structure” by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s.

Computers became the storage medium of choice as soon as they came within the reach of ordinary scientists. Databases were distributed on tapes, and later on various kinds of discs. Once universities and research institutions were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the distribution medium of choice. It is even easier to see why the World Wide Web (WWW), based on HTTP (Hypertext Transfer Protocol), has since the beginning of the 1990s been the standard method of communication and access for nearly all biological databases.

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are nucleotide sequences, protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics. Other types of data that are, or soon will be, available in databases include metabolic pathways (KEGG), gene expression data (microarrays), protein-protein interactions and other types of data related to biological function and processes.

Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena from the structure of biomolecules and their interaction, to the whole metabolism of organisms and to understanding the evolution of species. This knowledge helps facilitate the fight against diseases, assists in the development of medications and in discovering basic relationships amongst species in the history of life.

The biological knowledge is distributed amongst many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together.
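
The cross-referencing idea can be illustrated with a minimal Python sketch. The two in-memory dictionaries, the database names used as keys, and the nucleotide accession NUC00001 are placeholders invented for illustration; only the pair P04049 / KRAF_HUMAN is taken from the SWISS-PROT example used later in this article. Real databases expose their cross-references in their own formats.

```python
# Toy model of two databases that link to each other through accession codes.
# All entries are illustrative placeholders, not real database records.
PROTEIN_DB = {
    "P04049": {
        "name": "KRAF_HUMAN",
        "xrefs": [("nucleotide", "NUC00001")],  # hypothetical accession
    },
}
NUCLEOTIDE_DB = {
    "NUC00001": {
        "description": "placeholder nucleotide entry",
        "xrefs": [("protein", "P04049")],
    },
}
DATABASES = {"protein": PROTEIN_DB, "nucleotide": NUCLEOTIDE_DB}


def follow_xrefs(db_name, accession):
    """Return (database, accession, entry) for every entry the given entry links to.

    References that point at a missing entry are skipped; such dangling links
    are one way inconsistencies between databases become visible.
    """
    entry = DATABASES[db_name][accession]
    linked = []
    for target_db, target_acc in entry["xrefs"]:
        target = DATABASES.get(target_db, {}).get(target_acc)
        if target is not None:
            linked.append((target_db, target_acc, target))
    return linked


if __name__ == "__main__":
    for db, acc, _ in follow_xrefs("protein", "P04049"):
        print(db, acc)
```

Storing only the accession code of the linked entry, rather than a copy of its content, keeps each database responsible for its own data.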

An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics.

Most important public databases for molecular biology


Primary Sequence DBs
(collaborative project)

·         DDBJ (DNA DataBase of Japan)

·         EMBL Nucleotide DB (European Molecular Biology Laboratory )

·         GenBank (National Center for Biotechnology Information)


·         Entrez Gene Unified retrieval of gene-centred information (NCBI)

·         euGenes Assembled information on eukaryotic genomes (Univ. of Indiana)

·         GeneCards (Weizmann Inst.)

·         GenLoc / UDB (Weizmann Inst.)

·         SOURCE (Univ. of Stanford)

·         LocusLink (National Center for Biotechnology Information)

Genome Annotation Systems

·         Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst.)

·         UniGene Automatic partitioning of GenBank sequences (NCBI)

·         Golden Path / UCSC (Univ. of California, Santa Cruz)

Specialized DBs

·         CGAP Cancer Genes (National Cancer Institute)

·         Clone Registry Clone Collections (National Center for Biotechnology Information)

·         I.M.A.G.E Clone Collections (Image Consortium)

·         DBGET H.sapiens, retrieval system (Univ. of Kyoto)

·         DIP Interacting Proteins (Univ. of California)

·         GDB (Human Genome Organization)

·         KEGG Functional Db (Univ. of Kyoto)

·         MGI Mouse Genome (Jackson Lab.)

·         OMIM Inherited Diseases (National Center for Biotechnology Information)

·         SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)

·         PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)

·         List with SNP-Databases

·         Reactome, The Genome Knowledgebase (EBI)


·         ArrayExpress (European Bioinformatics Institute)

·         Gene Expression Omnibus (National Center for Biotechnology Information)

·         maxd (Univ. of Manchester)

·         SMD (Univ. of Stanford)


Accession codes Vs identifiers


Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

  • Identifier
  • Accession code (or number)

How to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases, or even over time within a single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.


Identifier

An identifier (“locus” in GenBank, “entry name” in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
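
A minimal Python sketch of splitting such a two-part entry name is shown here. The function name split_entry_name is made up for this example, and the only assumption about the format is the one stated above: a protein part and a species part separated by an underscore.

```python
def split_entry_name(entry_name):
    """Split a SWISS-PROT style entry name into (protein, species) parts."""
    # partition() splits on the first underscore only.
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


if __name__ == "__main__":
    protein, species = split_entry_name("KRAF_HUMAN")
    print(protein, species)
```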

An identifier can usually change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this does not happen very often. In fact, it happens so rarely that it’s not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.
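
The difference between the two names can be sketched in Python as follows. This is only an illustration under stated assumptions: the class EntryIndex, its method names, and the new identifier RENAMED_HUMAN are hypothetical and do not describe how any real database is implemented.

```python
class EntryIndex:
    """Entries keyed by a stable accession code, with a changeable identifier."""

    def __init__(self):
        self._by_accession = {}   # accession -> entry (acts as the primary key)
        self._by_identifier = {}  # identifier -> accession (secondary index)

    def add(self, accession, identifier, data):
        self._by_accession[accession] = {"identifier": identifier, "data": data}
        self._by_identifier[identifier] = accession

    def rename(self, accession, new_identifier):
        """Change the identifier; the accession code stays the same."""
        entry = self._by_accession[accession]
        del self._by_identifier[entry["identifier"]]
        entry["identifier"] = new_identifier
        self._by_identifier[new_identifier] = accession

    def get_by_accession(self, accession):
        return self._by_accession[accession]

    def get_by_identifier(self, identifier):
        return self._by_accession[self._by_identifier[identifier]]


if __name__ == "__main__":
    index = EntryIndex()
    index.add("P04049", "KRAF_HUMAN", {"protein": "Raf-1"})
    index.rename("P04049", "RENAMED_HUMAN")  # hypothetical new identifier
    # The accession code still finds the entry after the rename.
    print(index.get_by_accession("P04049")["identifier"])
```

A lookup by the old identifier fails after the rename, while the accession code keeps working, which is why citing the accession code is the safer choice.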

When two entries are merged into a single entry, the new entry will have both accession codes: one becomes the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also keep the old accession code as a secondary code.
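
The merge and split rules can be written down as a small Python sketch. The dictionary layout, the function names, and the accession codes A00001 to A00004 are invented for illustration. Which of the two merged codes becomes the primary one is a curatorial decision; keeping the first entry's code is simply an assumption made here.

```python
def merge_entries(first, second):
    """Merge two entries; the first entry's accession stays primary (assumption)."""
    secondary = first["secondary"] + [second["primary"]] + second["secondary"]
    return {"primary": first["primary"], "secondary": secondary}


def split_entry(entry, new_accessions):
    """Split one entry into several new ones.

    Each new entry gets a new primary accession code and keeps all codes of
    the old entry as secondary accession codes.
    """
    old_codes = [entry["primary"]] + entry["secondary"]
    return [{"primary": acc, "secondary": list(old_codes)} for acc in new_accessions]


def find(entries, accession):
    """Return every entry reachable through a primary or secondary code."""
    return [e for e in entries
            if accession == e["primary"] or accession in e["secondary"]]


if __name__ == "__main__":
    a = {"primary": "A00001", "secondary": []}  # hypothetical codes
    b = {"primary": "A00002", "secondary": []}
    merged = merge_entries(a, b)
    parts = split_entry(merged, ["A00003", "A00004"])
    # Both new entries are still reachable through the old code A00002.
    print([e["primary"] for e in find(parts, "A00002")])
```

In both operations no accession code is ever discarded, so a code cited in an older article still leads to the current entry or entries.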



NCBI’s sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.


GenBank:  An annotated collection of all publicly available nucleotide and amino acid sequences.

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.

HomoloGene:  A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support NCBI's data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.



DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases 


Major Sequence Repositories:

DNA databank of Japan
EMBL: Maintained by EMBL
GenBank: Maintained by NCBI


Human Chromosome Information:


Organelle Genome Databases:

OGMP:  Organelle genome megasequencing program
GOBASE:  An organelle genome database
MitoMap:  Human mitochondrial genome database


RNA Databases:

Rfam: RNA family database
RNA base:  Database of RNA structures
tRNA database:  Database of  tRNAs
tRNA:  tRNA sequences and genes
sRNA:  Small RNA database


Comparative & Phylogenetic Databases:

COG:  Phylogenetic classification of proteins 
DHMHD:  Human-mouse homology database
HomoloGene:  Gene homologies across species
Homophila:  Human disease to Drosophila gene database
HOVERGEN:  Database of homologous vertebrate genes 
TreeBase:  A database of phylogenetic knowledge
XREF:  Cross-referencing with model organisms


SNPs, Mutations & Variations Databases:

ALPSbase: Database of mutations causing human ALPS
dbSNP:  Single nucleotide polymorphism database at NCBI
HGVbase:   Human Genome Variation database


Alternative Splicing Databases:

ASAP: Alternate splicing analysis tool at UCLA
ASG: Alternate splicing gallery
HASDB:  Human alternative splicing database at UCLA
AsMamDB:  alternatively spliced genes in human, mouse and rat
ASD: Alternative splicing database at CSHL


Specialized Databases:

ABIM:  Links to several genomics databases
ACUTS:  Ancient conserved untranslated sequences
AGSD: Animal genome size database
AmiGO: The Gene Ontology database
ARGH: The acronym database
ASDB:  Database of alternatively spliced genes
BACPAC:  BAC and PAC genomic DNA library info
BBID: Biological Biochemical image database
Cardiac gene database:
CHLC:  Genetic markers on chromosomes
COGENT: Complete genome tracking database
COMPEL: Composite regulatory elements in eukaryotes 
CUTG:  Codon usage database 
dbEST:  Database of expressed sequences or mRNA
dbGSS: Genome survey sequence database
dbSTS: Sequence tagged sites (STS)
DBTSS: Database of transcriptional start sites
DOGS:  Database of genome sizes
EID:  The exon-intron database – Harvard
Exon-Intron: Exon-Intron database – Singapore
EPD:  Eukaryotic promoter database
FlyTrap: HTML based gene expression database
GDB:  The genome database
GenLink:  Resources for human genetic and telomere research 
GeneKnockouts:  Gene knockout information
GENOTK:  Human cDNA database
GEO:  Gene expression omnibus NCBI
GOLD:  Information on genome projects around the world
GSDB: The Genome Sequence DataBase
HGI:  TIGR human gene index
HTGS: High-throughput genomic sequences at NCBI
IMAGE:  The largest collection of DNA sequence clones
IMGT:  The international ImMunoGeneTics information system
IPCN:  Index to Plant Chromosome Numbers database
LocusLink:  Single query interface to sequence and genetic loci
TelDB:  The telomere database
MitoDat: Mitochondrial nuclear genes
Mouse EST:  NIA mouse cDNA project
MPSS: Searchable databases of several species
NDB:  Nucleic acid database
NEDO:  Human cDNA sequence database
NPD: Nuclear protein database
Oomycetes DB: Oomycetes database at Virginia Bioinformatics Institute
PLACE: Database of plant cis-acting regulatory DNA elements
RDP:  Ribosomal database project
RDB:  Receptor database at NIHS, Japan
Refseq: The NCBI reference sequence project
RHdb: Radiation hybrid physical map of chromosomes
SHIGAN: SHared Information of GENetic resources, Japan
SpliceDB:  Canonical and non-canonical splice site sequences
STACK:  Consensus human EST database
TAED:  The adaptive evolution database
TIGR:  Curated databases of microbes, plants and humans
TRANSFAC: The Transcription Factor Database
TRRD:  Transcription Regulatory region database
UniGene: Cluster of sequences for unique genes at NCBI
UniSTS:  Non-redundant collection of STSs



Protein Databases

Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others


Protein Sequence Databases:


Antibodies: Sequence and Structure
BRENDA: Enzyme database
CD Antigens:  Database of CD antigens
dbCFC:  Cytokine family database
Histones:  Histone sequence database
HPRD:  Human protein reference database
InterPro: Integrated documentation resources for protein families
iProClass: An integrated protein classification database
KIND:  A non-redundant protein sequence database
MHCPEP:  Database of MHC binding peptides
MIPS:  Munich information centre for protein sequences
PIR:  Annotated, and non-redundant protein sequence database
PIR-ALN: Curated database of protein sequence alignments
PIR-NREF: PIR non-redundant reference protein database
PMD:  Protein mutant database
PRF: Protein research foundation, Japan
ProClass: Non-redundant protein database
ProtoMap:  Hierarchical classification of swissprot proteins
REBASE:  Restriction enzyme database
RefSeq:   Reference sequence database at NCBI
SwissProt:  Curated protein sequence database
Comprehensive protein sequence database
Transfac:  Transcription factor database
TrEMBL:  Annotated translations of EMBL nucleotide sequences
Tumor gene database:  Genes with cancer-causing mutations
WD repeats:  WD-repeat family of proteins


Protein Structure Databases:

Cath:  Protein structure classification
HIV Protease:  HIV protease database 3D structure
PDB:  3-D macromolecular structure data
PSI: Protein structure initiative
S2F: Structure to function project
Scop:  Structural Classification of Proteins


Protein Domains, Motifs & Signatures:

BLOCKS:  Multiply aligned segments of conserved protein regions
CDD:  Conserved domain database and search service
DOMO:  Homologous protein domain families
Pfam:  Database of protein domains and HMMs
ProDom:  Protein domain database
Prints:  Protein motif fingerprint database
Prosite:   Database of protein families and domains
SMART:  Simple modular architecture research tool
TIGRFAM: Protein families based on HMMs



Others:

Phospho Site:  Database of phosphorylation sites
PROW:  Protein reviews on the web
Protein Lounge: Complete systems biology



Other Databases:


Carbohydrate Databases:

Carb DB: Carbohydrate Sequence and Structure Database
GlycoWord: Glycoscience related information
SPECARB: Raman Spectra of carbohydrates


Other Databases:

AlzGene: Alzheimer’s disease
Polygenic pathways: Alzheimer’s disease, Bipolar disorder or Schizophrenia



Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – Cyano Bacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish


General Information:

GMOD: Generic Model Organism Database
Model Organisms: The WWW virtual library of model organisms


Arabidopsis thaliana:

ABRC: Arabidopsis biological resource center
AGI: Arabidopsis genome initiative 
AREX: Arabidopsis gene expression database
Arabinet:  Arabidopsis information on the www
AtGDB: An Arabidopsis thaliana plant genome database
AtGI: TIGR Arabidopsis thaliana gene index
ATGC:  Genome sequencing at ATGC
ATIDB: Arabidopsis insertion database
CSHL: Arabidopsis genome analysis at Cold Spring
ESSA: Arabidopsis thaliana project at MIPS
Genoscope:  AGI in France
Kazusa:  Arabidopsis thaliana genome info Japan
MPSS: Massively parallel signature sequencing
NASC: Nottingham Arabidopsis stock center
Stanford:  Sequencing of the Arabidopsis genome at Stanford
TAIR:  Arabidopsis information resource
TIGR:  TIGR Arabidopsis genome annotation database
Wustl:  Arabidopsis genome at Washington university
Trees:  A forest tree genome database


Bacterial genomes:

B. subtilis:  Bacillus subtilis database
Chlamydomonas:  Chlamydomonas genetics center
E. coli:  E.coli genome project
MGD:  Microbial germ plasm database
Microbial:  Microbial Genome Gateway
Microbial:  Microbial genomes
Micado:  Genetics maps of B. subtilis and E. coli
MycDB:  An integrated Mycobacterial database
Neisseria:  Neisseria meningitidis genome
Neurospora:  Neurospora crassa database
OralGen:  Oral pathogen database
Salmonella: Salmonella information
STDGen:  Sexually transmitted disease database



Sea Bass:

Bass:  Sea Bass Mapping project


Cat (Felis catus):

Cat ArkDB: Cat mapping database


Cattle (Bos taurus):

ARK: Farm animals
BoLA:  Bovine MHC information
Bovin:  Bovine genome database
BovMap:  Mapping the bovine genome
CaDBase:  Genetic diversity in cattle
ComRad:  Comparative radiation hybrid mapping
Cow ArkDB: Bovine ArkDB
GemQual:  Genetics of meat quality


Chicken (Gallus gallus):

Chicken:  Poultry gene mapping project
ChickMap:  Chicken genome project
Chicken ArkDB: Chicken database
ChickEST:  Chick EST database
Poultry:  Poultry genome project



Cotton:

Cotton:  Cotton data collection site


Cyano Bacteria (Blue green algae):

Cyano Bacteria:  Anabaena genome


Daphnia (Crustacea):

Daphnia pulex: Daphnia genomics consortium



Deer:

Deer ArkDB:  Deer mapping database


Dictyostelium discoideum:

Dicty_cDB:  Dictyostelium discoideum cDNA project
DGP:  Dictyostelium discoideum genome project
Dictybase:  Online informatics resources for Dictyostelium


Dog (Canis familiaris):

Dog: Dog genome project
Dog genome project:


Frog (Xenopus):

Xenbase:  A Xenopus web resource
Xenopus:  Xenopus tropicalis genome


Fruit fly (Drosophila melanogaster):

ENSEMBL:  Drosophila Genome Browser at ENSEMBL
Fruitfly:  Drosophila genome project at Berkeley
FlyBase: A Database of the Drosophila Genome
FlyMove:  A Drosophila multimedia database
FlyView:  A Drosophila image database



Fungus:

Aspergillus:  Aspergillus Genomics
Candida:  Candida albicans information page
FungalWeb: Fungi database
FGSC:  Fungal genetic stocks center


Goat (Capra hircus):

Goat:  GoatMap, mapping the caprine genome


Horse (Equus caballus):

Horse ArkDB:  Horse mapping database


Medaka Fish:

Medaka:  Medaka fish home page



Maize:

Maize:  Maize genome database


Malaria (Plasmodium spp):

Malaria:  Malaria genetics and genomics
PlasmoDB:  Plasmodium falciparum genome database
Parasites:  Parasite databases of clustered ESTs
Parasite Genome:  Parasite genome databases



Mosquito:

Mosquito:  Mosquito genome web server


Mouse (Mus musculus):

ENSEMBL:  Mouse genome server at ENSEMBL
Jackson Lab:  Mouse Resources
MRC: Mouse genome center at MRC, UK
MGI: Mouse genome informatics at Jackson Labs
MGD:  Mouse genome database
MGS:  Mouse genome sequencing at NIH
MIT:  Genetic and physical maps of the mouse genome
Mouse SNP:  Mouse SNP database
NCI: Mouse repository
NIH:  NIH mouse initiative
ORNL: Mutant mouse database
RIKEN: Mouse resources
Rodentia:  The whole mouse catalog  


Pig (Sus scrofa):

INCO: Pig trait gene mapping
Pig:   Pig EST database
Pig:  Pig gene mapping project
PiGBase:  Pig genome mapping
Pig ArkDB: Pig Ark DB



Plants:

PlantGDB:  Resources for plant comparative genomics



Protozoa:

Protozoa:  Protozoan genomes



Puffer Fish:

Fugu:  Puffer fish project, UK site
Fugu: Fugu genome project, Singapore
Fugu: Puffer fish project, USA


Rat (Rattus norvegicus):

MIT:  Genetic maps of the Rat genome
Rat genomics and genetics
Rat: RatMap
RGD: Rat genome database


Rice (Oryza sativa):

MPSS: Massively parallel signature sequencing
Rice-research:  Rice genome sequence database
Rice:  Rice genome project



Rickettsia:

RicBase: Rickettsia genome database



Salmon:

Salmon ArkDB:  Salmon mapping database


Sheep (Ovis aries):

Sheep:  Sheep gene mapping
SheepBase:  Sheep gene mapping
Sheep ArkDB:  Sheep mapping database



Soy:

Soy:  Soybeans database



Sorghum:

Sorghum:  Sorghum Genomics



Tetraodon:

Tetraodon: Tetraodon nigroviridis genome
Tetraodon: Tetraodon nigroviridis genome at Whitehead



Tilapia:

HCGS:  Tilapia genome
Tilapia ArkDB:  Tilapia mapping database



Turkey:

Turkey ArkDB:  Turkey mapping database



Viruses:

HIV:  HIV sequence database
Herpes: Human herpes virus 5 database


Worm (Caenorhabditis elegans):

C. elegans:  C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The Genome and biology of C. elegans
ACEDB:  A C. elegans database
WWW Server:  C. elegans web server



Yeast:

SCPD:  The promoter database of Saccharomyces cerevisiae
SGD:  Saccharomyces genome database
S. pombe:  Schizosaccharomyces pombe genome project
TRIPLES:  Functional analysis of Yeast genome at Yale
Yeast Intron database:  Spliceosomal introns of the yeast


Zebra fish (Danio rerio):

ZFIN: Zebrafish information network
ZGR:  Zebrafish genome resources
ZIS:  Zebrafish information server
Zebrafish:  Zebrafish webserver









