OrthoDB[1][2][3][4] presents a catalog of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria. Orthology is defined relative to the last common ancestor of the species under consideration, so OrthoDB explicitly delineates orthologs at each major radiation along the species phylogeny. The database presents available protein descriptors together with Gene Ontology and InterPro attributes, which provide general descriptive annotations of the orthologous groups and support comprehensive querying of the database. OrthoDB also provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and gene intron-exon architectures.

OrthoDB
Content
Description: Catalog of orthologs
Contact
Research center: Swiss Institute of Bioinformatics
Laboratory: Computational Evolutionary Genomics Group
Authors: Evgenia V. Kriventseva
Primary citation: Kriventseva et al. (2015)[1]
Release date: 2007
Access
Website: www.orthodb.org
Download URL: https://www.orthodb.org/?page=filelist
SPARQL endpoint: sparql.orthodb.org/sparql
Miscellaneous
License: CC-BY-3.0

In comparative genomics, the importance of scale cannot be overestimated. Gene orthology delineation requires specific expertise and considerable computational resources, so individual non-specialist research groups cannot readily achieve such scale on their own. OrthoDB takes on this task with comprehensive sets of species and several distinctive features, including extensive functional and evolutionary annotations of orthologous groups and links to other major databases that capture information about gene function. A genome is of limited use as a data source without extensive comparative analyses with other genomes. OrthoDB therefore provides an important comparative genomics resource for researchers ranging from those interested in broad evolutionary questions to those focused on the specific biological functions of individual genes.

Methodology


Orthology is defined relative to the last common ancestor of the species being considered, thereby determining the hierarchical nature of orthologous classifications. This is explicitly addressed in OrthoDB by application of the orthology delineation procedure at each major radiation point of the considered phylogeny. The OrthoDB implementation employs a Best-Reciprocal-Hit (BRH) clustering algorithm based on all-against-all Smith–Waterman protein sequence comparisons. Gene set pre-processing selects the longest protein-coding transcript of alternatively spliced genes and of very similar gene copies. The procedure triangulates BRHs to progressively build the clusters and requires an overall minimum sequence alignment overlap to avoid domain walking. These core clusters are further expanded to include all more closely related within-species in-paralogs, and the previously identified very similar gene copies.
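The core of the procedure described above can be illustrated with a minimal Python sketch. It is not OrthoDB's implementation: the function names, the input format (a dictionary of precomputed pairwise similarity scores standing in for the Smith–Waterman results), and the use of union-find to merge triangles are all assumptions made for illustration. The alignment-overlap filter and the later expansion with in-paralogs and very similar gene copies are deliberately left out.

```python
from collections import defaultdict
from itertools import combinations


def best_reciprocal_hits(scores, species_of):
    """Return the set of best-reciprocal-hit (BRH) gene pairs.

    scores: dict mapping (gene_a, gene_b) -> similarity score. Scores are
            assumed symmetric, with each pair listed once (hypothetical format).
    species_of: dict mapping gene -> species name.
    """
    # best[(gene, other_species)] = (score, best-scoring gene in that species)
    best = {}
    for (a, b), score in scores.items():
        if species_of[a] == species_of[b]:
            continue  # within-species pairs are not considered in this sketch
        for query, target in ((a, b), (b, a)):
            key = (query, species_of[target])
            if key not in best or score > best[key][0]:
                best[key] = (score, target)

    brh = set()
    for (query, _), (_, target) in best.items():
        back = best.get((target, species_of[query]))
        if back is not None and back[1] == query:
            brh.add(frozenset((query, target)))
    return brh


def triangulate(brh):
    """Build clusters from BRH triangles, merging triangles that share genes."""
    neighbours = defaultdict(set)
    for pair in brh:
        a, b = tuple(pair)
        neighbours[a].add(b)
        neighbours[b].add(a)

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a in neighbours:
        for b, c in combinations(sorted(neighbours[a]), 2):
            if c in neighbours[b]:  # a-b, a-c and b-c are all BRHs
                union(a, b)
                union(a, c)

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: gene names, species, and scores are invented for illustration.
species_of = {"h1": "human", "m1": "mouse", "f1": "fish",
              "h2": "human", "m2": "mouse"}
scores = {("h1", "m1"): 100, ("h1", "f1"): 80, ("m1", "f1"): 85,
          ("h1", "m2"): 30, ("h2", "m2"): 90, ("h2", "m1"): 20,
          ("h2", "f1"): 10, ("m2", "f1"): 15}

brh_pairs = best_reciprocal_hits(scores, species_of)
print(triangulate(brh_pairs))  # prints [['f1', 'h1', 'm1']]
```

In this toy example h2 and m2 form a BRH pair but take part in no triangle, so the sketch leaves them unclustered. How a real pipeline treats such pairs, for instance when only two species are compared, is outside what this sketch covers.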

Data content


The database contains some 600 eukaryotic species and more than 3600 bacteria[1] sourced from Ensembl, UniProt, NCBI, FlyBase, and several other databases. The ever-increasing sampling of sequenced genomes brings a clearer account of the majority of gene genealogies that will facilitate informed hypotheses of gene function in newly sequenced genomes.

Examples of studies that have employed data from OrthoDB include comparative analyses of gene repertoire evolution,[5][6] comparisons of fruit fly and mosquito developmental genes,[7] analyses of bloodmeal- or infection-induced changes in gene expression in mosquitoes,[8][9][10] analysis of the evolution of mammalian milk production,[11] and mosquito gene and genome evolution.[12] Other studies citing OrthoDB can be found at PubMed and Google Scholar.

Performance


OrthoDB has performed consistently well in benchmarking assessments alongside other orthology delineation procedures. Results were compared to reference trees for three well-conserved protein families,[13] and to a larger set of curated protein families.[14]

BUSCO


Benchmarking sets of Universal Single-Copy Orthologs (BUSCOs)[15] are orthologous groups selected from OrthoDB for the root-level classifications of arthropods, vertebrates, metazoans, fungi, and other major clades. Groups are required to contain single-copy orthologs in at least 90% of the species (in the others they may be lost or duplicated), and the missing species cannot all be from the same clade. Species with frequent losses or duplications are removed from the selection unless they hold a key position in the phylogeny. BUSCOs are therefore expected to be found as single-copy orthologs in any newly sequenced genome from the appropriate phylogenetic clade, and can be used to assess the relative completeness of such genomes. The BUSCO assessment tool and datasets are widely used in genomics projects, with most journal editors now requiring such quality assessments before accepting new genome publications.
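The selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BUSCO pipeline: the function name, the input dictionaries, and the percentage parameter are hypothetical. In particular, the sketch assumes that a group with only one non-single-copy species passes the clade check, which the description above does not spell out.

```python
def is_busco_candidate(copy_counts, clade_of, min_percent=90):
    """Decide whether an orthologous group meets the single-copy criterion.

    copy_counts: dict mapping species -> number of gene copies in the group
                 (0 means the gene is lost in that species).
    clade_of:    dict mapping species -> clade label (hypothetical input).
    min_percent: required percentage of species with exactly one copy.
    """
    total = len(copy_counts)
    if total == 0:
        return False

    # Species where the gene is lost (0 copies) or duplicated (>1 copies).
    missing = [sp for sp, n in copy_counts.items() if n != 1]
    single = total - len(missing)

    # Integer arithmetic avoids floating-point surprises at the threshold.
    if single * 100 < min_percent * total:
        return False

    # Reject groups whose missing species all come from one clade.
    # Assumption: a single missing species is tolerated.
    if len(missing) > 1 and len({clade_of[sp] for sp in missing}) == 1:
        return False
    return True


# Toy data: 20 invented species split evenly between two invented clades.
species = [f"sp{i}" for i in range(20)]
clade_of = {sp: ("cladeA" if i < 10 else "cladeB")
            for i, sp in enumerate(species)}

counts = {sp: 1 for sp in species}
counts["sp0"] = 0    # lost in a cladeA species
counts["sp15"] = 2   # duplicated in a cladeB species
print(is_busco_candidate(counts, clade_of))  # prints True
```

Setting both exceptions in the same clade, or adding a third exception so that fewer than 90% of the species are single-copy, makes the function return False.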

Notes and references

  1. ^ a b c Kriventseva EV, Tegenfeldt F, Petty TJ, Waterhouse RM, Simão FA, Pozdnyakov IA, Ioannidis P, Zdobnov EM (January 2015). "OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software". Nucleic Acids Res. 43 (Database issue): D250–6. doi:10.1093/nar/gku1220. PMC 4383991. PMID 25428351.
  2. ^ Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV (January 2013). "OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs". Nucleic Acids Res. 41 (Database issue): D358–65. doi:10.1093/nar/gks1116. PMC 3531149. PMID 23180791.
  3. ^ Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV (January 2011). "OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011". Nucleic Acids Res. 39 (Database issue): D283–8. doi:10.1093/nar/gkq930. PMC 3013786. PMID 20972218.
  4. ^ Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM (Jan 2008). "OrthoDB: the hierarchical catalog of eukaryotic orthologs". Nucleic Acids Res. 36 (Database issue): D271–5. doi:10.1093/nar/gkm845. PMC 2238902. PMID 17947323.
  5. ^ Waterhouse RM, Zdobnov EM, Kriventseva EV (January 2011). "Correlating traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi". Genome Biol. Evol. 3: 75–86. doi:10.1093/gbe/evq083. PMC 3030422. PMID 21148284.
  6. ^ Hase T, Niimura Y, Tanaka H (2010). "Difference in gene duplicability may explain the difference in overall structure of protein-protein interaction networks among eukaryotes". BMC Evol. Biol. 10: 358. doi:10.1186/1471-2148-10-358. PMC 2994879. PMID 21087510.
  7. ^ Behura SK, Haugen M, Flannery E, Sarro J, Tessier CR, Severson DW, Duman-Scheel M (2011). "Comparative Genomic Analysis of Drosophila melanogaster and Vector Mosquito Developmental Genes". PLOS ONE. 6 (7): e21504. Bibcode:2011PLoSO...621504B. doi:10.1371/journal.pone.0021504. PMC 3130749. PMID 21754989.
  8. ^ Bonizzoni M, Dunn WA, Campbell CL, Olson KE, Dimon MT, Marinotti O, James AA (2011). "RNA-seq analyses of blood-induced changes in gene expression in the mosquito vector species, Aedes aegypti". BMC Genomics. 12: 82. doi:10.1186/1471-2164-12-82. PMC 3042412. PMID 21276245.
  9. ^ Pinto SB, Lombardo F, Koutsos AC, Waterhouse RM, McKay K, An C, Ramakrishnan C, Kafatos FC, Michel K (2009). "Discovery of Plasmodium modulators by genome-wide analysis of circulating hemocytes in Anopheles gambiae". Proc Natl Acad Sci U S A. 106 (50): 21270–5. Bibcode:2009PNAS..10621270P. doi:10.1073/pnas.0909463106. PMC 2783009. PMID 19940242.
  10. ^ Bartholomay LC, Waterhouse RM, Mayhew GF, Campbell CL, Michel K, Zou Z, Ramirez JL, Das S, Alvarez K, Arensburger P, Bryant B, Chapman SB, Dong Y, Erickson SM, Karunaratne SH, Kokoza V, Kodira CD, Pignatelli P, Shin SW, Vanlandingham DL, Atkinson PW, Birren B, Christophides GK, Clem RJ, Hemingway J, Higgs S, Megy K, Ranson H, Zdobnov EM, Raikhel AS, Christensen BM, Dimopoulos G, Muskavitch MA (2010). "Pathogenomics of Culex quinquefasciatus and meta-analysis of infection responses to diverse pathogens". Science. 330 (6000): 88–90. Bibcode:2010Sci...330...88B. doi:10.1126/science.1193162. PMC 3104938. PMID 20929811.
  11. ^ Lemay DG, Lynn DJ, Martin WF, Neville MC, Casey TM, Rincon G, Kriventseva EV, Barris WC, Hinrichs AS, Molenaar AJ, Pollard KS, Maqbool NJ, Singh K, Murney R, Zdobnov EM, Tellam RL, Medrano JF, German JB, Rijnkels M (2009). "The bovine lactation genome: insights into the evolution of mammalian milk". Genome Biol. 10 (4): R43. doi:10.1186/gb-2009-10-4-r43. PMC 2688934. PMID 19393040.
  12. ^ Neafsey DE, Waterhouse RM, Abai MR, Aganezov SS, Alekseyev MA, Allen JE, Amon J, Arcà B, Arensburger P, Artemov G, Assour LA, Basseri H, Berlin A, Birren BW, Blandin SA, Brockman AI, Burkot TR, Burt A, Chan CS, Chauve C, Chiu JC, Christensen M, Costantini C, Davidson VL, Deligianni E, Dottorini T, Dritsou V, Gabriel SB, Guelbeogo WM, Hall AB, Han MV, Hlaing T, Hughes DS, Jenkins AM, Jiang X, Jungreis I, Kakani EG, Kamali M, Kemppainen P, Kennedy RC, Kirmitzoglou IK, Koekemoer LL, Laban N, Langridge N, Lawniczak MK, Lirakis M, Lobo NF, Lowy E, MacCallum RM, Mao C, Maslen G, Mbogo C, McCarthy J, Michel K, Mitchell SN, Moore W, Murphy KA, Naumenko AN, Nolan T, Novoa EM, O'Loughlin S, Oringanje C, Oshaghi MA, Pakpour N, Papathanos PA, Peery AN, Povelones M, Prakash A, Price DP, Rajaraman A, Reimer LJ, Rinker DC, Rokas A, Russell TL, Sagnon N, Sharakhova MV, Shea T, Simão FA, Simard F, Slotman MA, Somboon P, Stegniy V, Struchiner CJ, Thomas GW, Tojo M, Topalis P, Tubio JM, Unger MF, Vontas J, Walton C, Wilding CS, Willis JH, Wu YC, Yan G, Zdobnov EM, Zhou X, Catteruccia F, Christophides GK, Collins FH, Cornman RS, Crisanti A, Donnelly MJ, Emrich SJ, Fontaine MC, Gelbart W, Hahn MW, Hansen IA, Howell PI, Kafatos FC, Kellis M, Lawson D, Louis C, Luckhart S, Muskavitch MA, Ribeiro JM, Riehle MA, Sharakhov IV, Tu Z, Zwiebel LJ, Besansky NJ (January 2015). "Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes". Science. 347 (6217): 62176. Bibcode:2015Sci...347...43N. doi:10.1126/science.1258522. PMC 4380271. PMID 25554792.
  13. ^ Boeckmann B, Robinson-Rechavi M, Xenarios I, Dessimoz C (September 2011). "Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees". Brief. Bioinform. 12 (5): 423–35. doi:10.1093/bib/bbr034. PMC 3178055. PMID 21737420.
  14. ^ OrthoBench (http://eggnog.embl.de/orthobench). Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, Muller J, Bork P (October 2011). "Orthology prediction methods: a quality assessment using curated protein families". BioEssays. 33 (10): 769–80. doi:10.1002/bies.201100062. PMC 3193375. PMID 21853451.
  15. ^ Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (June 2015). "BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs". Bioinformatics. 31 (19): 3210–2. doi:10.1093/bioinformatics/btv351. PMID 26059717.
