Database in Bioinformatics

 

Before turning to this topic, several questions come to mind:

1.      Biological Databases: Why?

2.      The different types of Databases.

3.      Accession codes Vs identifiers

4.      Nucleotide Databases.

5.      Protein sequence Databases.

6.      Sequence motif Databases.

7.      Macromolecular 3D structure Databases.

8.      Other relevant Databases.

9.      System of searching, indexing and cross-referencing.

 

Biological Databases: Why?

There are two main functions of Biological Databases:

Making Biological Data available to Scientists: As much information as possible should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article.

Making Biological Data available in Computer-readable form: Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than in print or on paper) is a necessary first step.

One of the first biological sequence databases was probably the book “Atlas of Protein Sequence and Structure” by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s.

Computers became the storage medium of choice as soon as they came within the reach of ordinary scientists. Databases were distributed on tapes, and later on various kinds of discs. Once universities and research institutions were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the distribution medium of choice. It is even easier to see why the World Wide Web (WWW), based on HTTP (Hypertext Transfer Protocol), has been the standard method of communication and access for nearly all biological databases since the beginning of the 1990s.

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are nucleotide sequences, protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics. Other types of data that are or will soon be available in databases include metabolic pathways (KEGG), gene expression data (microarrays), protein-protein interactions and other data related to biological function and processes.

Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena from the structure of biomolecules and their interaction, to the whole metabolism of organisms and to understanding the evolution of species. This knowledge helps facilitate the fight against diseases, assists in the development of medications and in discovering basic relationships amongst species in the history of life.

The biological knowledge is distributed amongst many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together.

An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics.

Most important public databases for molecular biology

 

Primary Sequence DBs
(collaborative project)

·         DDBJ (DNA DataBase of Japan)

·         EMBL Nucleotide DB (European Molecular Biology Laboratory )

·         GenBank (National Center for Biotechnology Information)

Meta-DBs

·         Entrez Gene Unified retrieval of gene-centred information (NCBI)

·         euGenes Assembled information on eukaryotic genomes (Univ. of Indiana)

·         GeneCards (Weizmann Inst.)

·         GenLoc / UDB (Weizmann Inst.)

·         SOURCE (Univ. of Stanford)

·         LocusLink (National Center for Biotechnology Information)

Genome Annotation Systems

·         Ensembl Genome Browser Automatically Annotated Genomes (EMBL-EBI and Wellcome Trust Sanger Inst.)

·         UniGene Automatic partitioning of GenBank sequences (NCBI)

·         Golden Path / UCSC (Univ. of California, Santa Cruz)

Specialized DBs

·         CGAP Cancer Genes (National Cancer Institute)

·         Clone Registry Clone Collections (National Center for Biotechnology Information)

·         I.M.A.G.E Clone Collections (Image Consortium)

·         DBGET H.sapiens, retrieval system (Univ. of Kyoto)

·         DIP Interacting Proteins (Univ. of California)

·         GDB (Human Genome Organization)

·         KEGG Functional Db (Univ. of Kyoto)

·         MGI Mouse Genome (Jackson Lab.)

·         OMIM Inherited Diseases (National Center for Biotechnology Information)

·         SWISS-PROT Protein Db (Swiss Institute of Bioinformatics)

·         PEDANT Protein Db (Forschungszentrum f. Umwelt & Gesundheit)

·         List with SNP-Databases

·         Reactome, The Genome Knowledgebase (EBI)

Microarray-DBs

·         ArrayExpress (European Bioinformatics Institute)

·         Gene Expression Omnibus (National Center for Biotechnology Information)

·         maxd (Univ. of Manchester)

·         SMD (Univ. of Stanford)

  

 Accession codes Vs identifiers

 

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

  • Identifier
  • Accession code (or number)

How to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for changing or keeping accession codes and identifiers are not completely consistent between databases, or even over time within a single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.

Identifier

An identifier (“locus” in GenBank, “entry name” in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.

An identifier can usually change. For example, the database curators may decide that the identifier for an entry no longer is appropriate. However, this does not happen very often. In fact, it happens so rarely that it’s not really a big problem.
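The two-part structure of a SWISS-PROT entry name can be illustrated with a small sketch. The function name split_entry_name is hypothetical, and the code assumes only the convention described above (a protein part, an underscore, a species part); it is not an official parser and does not check the name against the real database.

```python
from typing import Tuple


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split a SWISS-PROT style entry name into (protein part, species part).

    Assumes the two parts are separated by the first underscore.
    """
    protein, separator, species = entry_name.partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a two-part entry name: {entry_name!r}")
    return protein, species


# The example entry name used in the text above.
print(split_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
```

Because an identifier may be changed by the curators, a split like this is useful for display or quick filtering by species, but it should not be used as a stable key.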

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

When two entries are merged into a single one, the new entry will have both accession codes: one becomes the primary and the other a secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also keep the old accession code as a secondary code.
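The merge and split rules can be sketched with a toy in-memory registry. Everything in this example is an assumption made for illustration: the class names Entry and Registry, the method names, and the accession codes and identifiers (X00001, DEMOA_HUMAN and so on) are hypothetical and do not correspond to real database records or to any real database's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    identifier: str  # human-readable name; may change over time
    primary: str     # primary accession code; stable once issued
    secondary: List[str] = field(default_factory=list)

    def codes(self) -> List[str]:
        return [self.primary, *self.secondary]


class Registry:
    """Toy registry that resolves any accession code to its current entries."""

    def __init__(self) -> None:
        self._by_code: Dict[str, List[Entry]] = {}

    def add(self, entry: Entry) -> Entry:
        for code in entry.codes():
            self._by_code.setdefault(code, []).append(entry)
        return entry

    def _remove(self, entry: Entry) -> None:
        for code in entry.codes():
            self._by_code[code].remove(entry)

    def lookup(self, code: str) -> List[Entry]:
        return list(self._by_code.get(code, []))

    def merge(self, keep: Entry, absorb: Entry, identifier: str) -> Entry:
        """Merge two entries: the absorbed entry's codes become secondary."""
        self._remove(keep)
        self._remove(absorb)
        merged = Entry(identifier, keep.primary, keep.secondary + absorb.codes())
        return self.add(merged)

    def split(self, old: Entry, identifier_a: str, code_a: str,
              identifier_b: str, code_b: str) -> Tuple[Entry, Entry]:
        """Split an entry: both halves get new primaries and inherit the old codes."""
        self._remove(old)
        first = self.add(Entry(identifier_a, code_a, old.codes()))
        second = self.add(Entry(identifier_b, code_b, old.codes()))
        return first, second


# Hypothetical codes, for illustration only.
registry = Registry()
a = registry.add(Entry("DEMOA_HUMAN", "X00001"))
b = registry.add(Entry("DEMOB_HUMAN", "X00002"))
merged = registry.merge(a, b, "DEMO_HUMAN")
print(registry.lookup("X00002")[0].identifier)  # DEMO_HUMAN
```

The point of the sketch is that a lookup by any code that was ever issued still succeeds after a merge or a split, which is why an accession code rather than an identifier should be cited when referring to an entry.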

 

NUCLEOTIDE DATABASES

NCBI’s sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.

 

GenBank:  An annotated collection of all publicly available nucleotide and amino acid sequences.

EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.

HomoloGene:  A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support NCBI's data-gathering efforts.

STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.

UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).

UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.

 

 

DNA & RNA Databases
Major Sequence Repositories – Human Chromosome Information – Organelle Genome Databases – RNA Databases – Comparative & Phylogenetic Databases – SNPs, Mutations and Variations Databases – Alternative Splicing Databases – Specialized Databases 

 

Major Sequence Repositories:


DDBJ: DNA Data Bank of Japan
EMBL: Maintained by EMBL
GenBank: Maintained by NCBI

 

Human Chromosome Information:

Chromosome-specific information is available for each human chromosome: 1 to 22, X and Y.

 

Organelle Genome Databases:

OGMP: Organelle genome megasequencing program
GOBASE:  An organelle genome database
MitoMap:  Human mitochondrial genome database

 

RNA Databases:

Rfam: RNA family database
RNA base:  Database of RNA structures
tRNA database:  Database of  tRNAs
tRNA:  tRNA sequences and genes
sRNA:  Small RNA database

 

Comparative & Phylogenetic Databases:

COG:  Phylogenetic classification of proteins 
DHMHD:  Human-mouse homology database
HomoloGene:  Gene homologies across species
Homophila:  Human disease to Drosophila gene database
HOVERGEN:  Database of homologous vertebrate genes 
TreeBase:  A database of phylogenetic knowledge
XREF:  Cross-referencing with model organisms

 

SNPs, Mutations & Variations Databases:

ALPSbase: Database of mutations causing human ALPS
dbSNP:  Single nucleotide polymorphism database at NCBI
HGVbase:   Human Genome Variation database

 

Alternative Splicing Databases:

ASAP: Alternate splicing analysis tool at UCLA
ASG: Alternate splicing gallery
HASDB:  Human alternative splicing database at UCLA
AsMamDB:  alternatively spliced genes in human, mouse and rat
ASD: Alternative splicing database at CSHL

 

Specialised Databases:

ABIM: Links to several genomics databases
ACUTS:  Ancient conserved untranslated sequences
AGSD: Animal genome size database
AmiGO: The Gene Ontology database
ARGH: The acronym database
ASDB:  Database of alternatively spliced genes
BACPAC:  BAC and PAC genomic DNA library info
BBID: Biological Biochemical image database
Cardiac gene database:
CHLC:  Genetic markers on chromosomes
COGENT: Complete genome tracking database
COMPEL: Composite regulatory elements in eukaryotes 
CUTG:  Codon usage database 
dbEST:  Database of expressed sequences or mRNA
dbGSS: Genome survey sequence database
dbSTS: Sequence tagged sites (STS)
DBTSS: Database of transcriptional start sites
DOGS:  Database of genome sizes
EID:  The exon-intron database – Harvard
Exon-Intron: Exon-Intron database – Singapore
EPD: Eukaryotic promoter database
FlyTrap: HTML based gene expression database
GDB:  The genome database
GenLink:  Resources for human genetic and telomere research 
GeneKnockouts:  Gene knockout information
GENOTK:  Human cDNA database
GEO:  Gene expression omnibus NCBI
GOLD:  Information on genome projects around the world
GSDB: The Genome Sequence DataBase
HGI:  TIGR human gene index
HTGS: High-throughput genomic sequences at NCBI
IMAGE:  The largest collection of DNA sequences clones
IMGT:  The international ImMunoGeneTics information system
IPCN:  Index to Plant Chromosome Numbers database
LocusLink:  Single query interface to sequence and genetic loci
TelDB:  The telomere database
MitoDat: Mitochondrial nuclear genes
Mouse EST:  NIA mouse cDNA project
MPSS: Searchable databases of several species
NDB:  Nucleic acid database
NEDO:  Human cDNA sequence database
NPD: Nuclear protein database
Oomycetes DB: Oomycetes database at Virginia Bioinformatics Institute
PLACE: Database of plant cis-acting regulatory DNA elements
RDP:  Ribosomal database project
RDB:  Receptor database at NIHS, Japan
Refseq: The NCBI reference sequence project
RHdb: Radiation hybrid physical map of chromosomes
SHIGAN: SHared Information of GENetic resources, Japan
SpliceDB:  Canonical and non-canonical splice site sequences
STACK:  Consensus human EST database
TAED:  The adaptive evolution database
TIGR:  Curated databases of microbes, plants and humans
TRANSFAC: The Transcription Factor Database
TRRD:  Transcription Regulatory region database
UniGene: Cluster of sequences for unique genes at NCBI
UniSTS: Non-redundant collection of STSs

 

 

Protein Databases

Protein Sequence Databases – Protein Structure Databases – Protein Domains, Motifs and Signatures – Others

 

Protein Sequence Databases:

 

Antibodies: Sequence and Structure
BRENDA: Enzyme database
CD Antigens:  Database of CD antigens
dbCFC:  Cytokine family database
Histones: Histone sequence database
HPRD:  Human protein reference database
InterPro: Integrated documentation resources for protein families
iProClass: An integrated protein classification database
KIND:  A non-redundant protein sequence database
MHCPEP:  Database of MHC binding peptides
MIPS:  Munich information centre for protein sequences
PIR:  Annotated, and non-redundant protein sequence database
PIR-ALN: Curated database of protein sequence alignments
PIR-NREF: PIR non-redundant reference protein database
PMD:  Protein mutant database
PRF: Protein research foundation, Japan
ProClass: Non-redundant protein database
ProtoMap:  Hierarchical classification of swissprot proteins
REBASE:  Restriction enzyme database
RefSeq:   Reference sequence database at NCBI
SwissProt:  Curated protein sequence database
SPTR: Comprehensive protein sequence database
Transfac:  Transcription factor database
TrEMBL:  Annotated translations of EMBL nucleotide sequences
Tumor gene database:  Genes with cancer-causing mutations
WD repeats:  WD-repeat family of proteins

 

Protein Structure Databases:

Cath:  Protein structure classification
HIV Protease:  HIV protease database 3D structure
PDB:  3-D macromolecular structure data
PSI: Protein structure initiative
S2F: Structure to function project
Scop:  Structural Classification of Proteins

 

Protein Domains, Motifs & Signatures:

BLOCKS: Multiple aligned segments of conserved protein regions
CDD: Conserved domain database and search service
DOMO:  Homologous protein domain families
Pfam:  Database of protein domains and HMMs
ProDom:  Protein domain database
Prints:  Protein motif fingerprint database
Prosite:   Database of protein families and domains
SMART:  Simple modular architecture research tool
TIGRFAM: Protein families based on HMMs

 

Others:

Phospho Site:  Database of phosphorylation sites
PROW:  Protein reviews on the web
Protein Lounge: Complete systems biology

 

 

Other Databases:

 

Carbohydrate Databases:


Carb DB: Carbohydrate Sequence and Structure Database
GlycoWord: Glycoscience related information
SPECARB: Raman Spectra of carbohydrates

 

Other Databases:

AlzGene: Alzheimer’s disease
Polygenic pathways: Alzheimer’s disease, Bipolar disorder or Schizophrenia

 

 

Model Organism Databases and Resources

Arabidopsis – Bacteria – Sea Bass – Cat – Cattle – Chicken – Cotton – CyanoBacteria – Daphnia – Deer – Dictyostelium – Dog – Frog – Fruit Fly – Fungus – Goat – Horse – Medaka Fish – Maize – Malaria – Mosquito – Mouse – Pig – Plants – Protozoa – Puffer Fish – Rat – Rice – Rickettsia – Salmon – Sheep – Soy – Sorghum – Tetraodon – Tilapia – Turkey – Viruses – Worm – Yeast – Zebra Fish

 

General Information:

GMOD: Generic Model Organism Database
Model Organisms: The WWW virtual library of model organisms

 

Arabidopsis thaliana:

ABRC: Arabidopsis biological resource center
AGI: Arabidopsis genome initiative 
AREX: Arabidopsis gene expression database
Arabinet:  Arabidopsis information on the www
AtGDB: An Arabidopsis thaliana plant genome database
AtGI: TIGR Arabidopsis thaliana gene index
ATGC:  Genome sequencing at ATGC
ATIDB: Arabidopsis insertion database
CSHL: Arabidopsis genome analysis at Cold Spring
ESSA: Arabidopsis thaliana project at MIPS
Genoscope:  AGI in France
Kazusa:  Arabidopsis thaliana genome info Japan
MPSS: Massively parallel signature sequencing
NASC: Nottingham Arabidopsis stock center
Stanford:  Sequencing of the Arabidopsis genome at Stanford
TAIR:  Arabidopsis information resource
TIGR:  TIGR Arabidopsis genome annotation database
Wustl:  Arabidopsis genome at Washington university
Trees:  A forest tree genome database

 

Bacterial genomes:

B. subtilis: Bacillus subtilis database
Chlamydomonas:  Chlamydomonas genetics center
E. coli:  E.coli genome project
MGD:  Microbial germ plasm database
Microbial:  Microbial Genome Gateway
Microbial:  Microbial genomes
Micado:  Genetics maps of B. subtilis and E. coli
MycDB: An integrated Mycobacterial database
Neisseria:  Neisseria meningitidis genome
Neurospora:  Neurospora crassa database
OralGen:  Oral pathogen database
Salmonella: Salmonella information
STDGen: Sexually transmitted disease database

 

Bass:

Bass:  Sea Bass Mapping project

 

Cat (Felis catus):

Cat ArkDB: Cat mapping database

 

Cattle (Bos taurus):

ARK: Farm animals
BoLA:  Bovine MHC information
Bovin:  Bovine genome database
BovMap:  Mapping the bovine genome
CaDBase: Genetic diversity in cattle
ComRad:  Comparative radiation hybrid mapping
Cow ArkDB: Bovine ArkDB
GemQual:  Genetics of meat quality

 

Chicken (Gallus gallus):

Chicken:  Poultry gene mapping project
ChickMap:  Chicken genome project
Chicken ArkDB: Chicken database
ChickEST:  Chick EST database
Poultry:  Poultry genome project

 

Cotton:

Cotton:  Cotton data collection site

 

Cyano Bacteria (Blue green algae):

Cyano Bacteria:  Anabaena genome

 

Daphnia (Crustacea):

Daphnia pulex: Daphnia genomics consortium

 

Deer:

Deer ArkDB:  Deer mapping database

 

Dictyostelium discoideum:

Dicty_cDB:  Dictyostelium discoideum cDNA project
DGP:  Dictyostelium discoideum genome project
Dictybase:  Online informatics resources for Dictyostelium

 

Dog (Canis familiaris):

Dog: Dog genome project
Dog genome project:

 

Frog (Xenopus):

Xenbase:  A Xenopus web resource
Xenopus:  Xenopus tropicalis genome

 

Fruit fly (Drosophila melanogaster):

ENSEMBL:  Drosophila Genome Browser at ENSEMBL
Fruitfly:  Drosophila genome project at Berkeley
FlyBase: A Database of the Drosophila Genome
FlyMove:  A Drosophila multimedia database
FlyView:  A Drosophila image database

 

Fungus:

Aspergillus:  Aspergillus Genomics
Candida:  Candida albicans information page
FungalWeb: Fungi database
FGSC:  Fungal genetic stocks center

 

Goat (Capra hircus):

Goat:  GoatMap, mapping the caprine genome

 

Horse (Equus caballus):

Horse ArkDB:  Horse mapping database

 

Medaka Fish:

Medaka:  Medaka fish home page

 

Maize:

Maize:  Maize genome database

 

Malaria (Plasmodium spp):

Malaria:  Malaria genetics and genomics
PlasmoDB:  Plasmodium falciparum genome database
Parasites:  Parasite databases of clustered ESTs
Parasite Genome:  Parasite genome databases

 

Mosquito:

Mosquito:  Mosquito genome web server

 

Mouse (Mus musculus):

ENSEMBL:  Mouse genome server at ENSEMBL
Jackson Lab:  Mouse Resources
MRC: Mouse genome center at MRC, UK
MGI: Mouse genome informatics at Jackson Labs
MGD:  Mouse genome database
MGS:  Mouse genome sequencing at NIH
MIT:  Genetic and physical maps of the mouse genome
Mouse SNP:  Mouse SNP database
NCI: Mouse repository
NIH:  NIH mouse initiative
ORNL: Mutant mouse database
RIKEN: Mouse resources
Rodentia:  The whole mouse catalog  

 

Pig (Sus scrofa):

INCO: Pig trait gene mapping
Pig:   Pig EST database
Pig:  Pig gene mapping project
PiGBase:  Pig genome mapping
Pig ArkDB: Pig Ark DB

 

Plants:

PlantGDB:  Resources for plant comparative genomics

 

Protozoa:

Protozoa:  Protozoan genomes

 

Pufferfish:

Fugu:  Puffer fish project, UK site
Fugu: Fugu genome project, Singapore
Fugu: Puffer fish project, USA

 

Rat (Rattus norvegicus):

MIT:  Genetic maps of the Rat genome
NIH: Rat genomics and genetics
Rat: RatMap
RGD: Rat genome database

 

Rice (Oryza sativa):

MPSS: Massively parallel signature sequencing
Rice-research:  Rice genome sequence database
Rice:  Rice genome project

 

Rickettsia:

RicBase: Rickettsia genome database

 

Salmon:

Salmon ArkDB:  Salmon mapping database

 

Sheep (Ovis aries):

Sheep:  Sheep gene mapping
SheepBase:  Sheep gene mapping
Sheep ArkDB:  Sheep mapping database

 

Soy:

Soy:  Soybeans database

 

Sorghum:

Sorghum:  Sorghum Genomics

 

Tetraodon:

Tetraodon: Tetraodon nigroviridis genome
Tetraodon: Tetraodon nigroviridis genome at Whitehead

 

Tilapia:

HCGS:  Tilapia genome
Tilapia ArkDB:  Tilapia mapping database

 

Turkey:

Turkey ArkDB:  Turkey mapping database

 

Viruses:

HIV:  HIV sequence database
Herpes: Human herpes virus 5 database

 

Worm (Caenorhabditis elegans):

C. elegans:  C. elegans genome sequencing project
NemBase: Resource for nematode sequence and functional data
WormAtlas: Anatomy of C. elegans
WormBase: The Genome and biology of C. elegans
ACEDB:  A C. elegans database
WWW Server:  C. elegans web server

 

Yeast:

SCPD:  The promoter database of Saccharomyces cerevisiae
SGD:  Saccharomyces genome database
S. pombe: Schizosaccharomyces pombe genome project
TRIPLES:  Functional analysis of Yeast genome at Yale
Yeast Intron database:  Spliceosomal introns of the yeast

 

Zebra fish (Danio rerio):

ZFIN: Zebrafish information network
ZGR:  Zebrafish genome resources
ZIS:  Zebrafish information server
Zebrafish:  Zebrafish webserver
