Methods

Evolutionary

Gene age is inferred by finding homologous sequences across the tree of life. The age of the gene is based on the distribution of species with a detectable homologous sequence.

For more details, see the original source, ProteinHistorian, and the associated publication, "ProteinHistorian: tools for the comparative analysis of eukaryote protein origin."

Note: Phylostratigraphy reliably identifies only the minimum age of a gene, since rapid sequence divergence can obscure distant homologs. For a review, see Capra et al. (2013). For potential limitations, see Moyers and Zhang (2015).

dN is the rate of non-synonymous substitutions (i.e., changes that alter the encoded amino acid) in a protein-coding sequence. dS is the rate of synonymous substitutions (i.e., changes that leave the amino acid unchanged). When non-synonymous changes accumulate faster than synonymous changes (i.e., dN/dS > 1), positive selection is inferred to be acting on the gene. Most genes are under purifying selection, so dN/dS is typically < 1.

For details about how dN/dS was calculated please see the two original studies reported in GEneSTATION, Gayà-Vidal & Albà (2014) and Plunkett et al. (2011).

Note: dN/dS is subject to a number of caveats, particularly when the synonymous substitution rate is very low. For a quick overview see the Wikipedia page.

Summary of methods from Plunkett et al. (2011):
"A likelihood ratio test to identify lineage specific constraints. For each gene of interest, we use the ten upstream and downstream genes to estimate a regional synonymous rate (dSr) and the expected lineage-specific constraint scaling factors (a). These scaling factors take into account that the constraint on each lineage will vary due to the effective population size and other species-specific parameters. Using these regional parameters, a gene-specific dN/dS ratio (w) is estimated. In the null model, the nonsynonymous substitution rate is estimated as a × w × dSr. This is compared to the alternative model, where nonsynonymous branch length is set to a free parameter (R)."

Summary of methods from Gayà-Vidal & Albà (2014):
"Detection of positive selection using the branch-site test: We ran branch-site model A implemented in codeml on the sequence alignment data. This test can detect positive selection even if it is only acting on a few sites in a specific lineage (the foreground branch) compared with the rest of the lineages (the background branches). In this study, the alternative hypothesis (positive selection) was compared to the null hypothesis (no positively selected sites) by means of a likelihood ratio test (LRT), because the two models are nested. The statistical significance was obtained with a chi-squared test using the R statistical package (http://www.r-project.org/)."
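The likelihood ratio test described above can be sketched in a few lines. This is a minimal illustration, not the authors' script (they used R): the function name and the log-likelihood values are hypothetical, and the use of one degree of freedom is an assumption, since the quoted summary states only that a chi-squared test was used.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> tuple[float, float]:
    """Likelihood ratio test for two nested models.

    Returns (test statistic, p-value), assuming the statistic follows a
    chi-squared distribution with 1 degree of freedom (an assumption).
    """
    # Test statistic: twice the gain in log-likelihood of the alternative model.
    stat = 2.0 * (lnl_alt - lnl_null)
    # Numerical noise can make the statistic slightly negative; clamp at zero.
    stat = max(stat, 0.0)
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, as would be read from two codeml runs.
stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-996.5)
print(f"2*dlnL = {stat:.3f}, p = {p:.4f}")
```

A gene would be reported as a candidate for positive selection on the foreground branch only if the p-value falls below the chosen significance threshold, usually after correcting for the number of genes tested.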

TreeFam is the source of information for gene families. The original data is available here.

To provide consistent genome-wide analyses of human genetic variation, variant call format (VCF) files, with coordinates lifted over to genome build GRCh38, were downloaded from the 1000 Genomes Project, representing variants for all 2,504 unrelated individuals in the 1000 Genomes Project Phase 3 cohort. Variants were filtered to exclude non-SNP variants, fixed sites, and sites with uncalled or unphased genotypes. Variants were validated against dbSNP build 144 using the ValidateVariants tool in the Genome Analysis Toolkit. VCFtools was used to calculate pairwise FST statistics.
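The filtering criteria can be illustrated with a small standard-library sketch that works on one VCF data line at a time. It is not the pipeline used for GEneSTATION, whose filtering tool is not stated here. The function name is hypothetical, and "fixed site" is interpreted, as an assumption, as a site where every sample carries the same allele.

```python
BASES = {"A", "C", "G", "T"}


def keep_variant(vcf_line: str) -> bool:
    """Return True if a VCF data line passes the three filters described above."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]

    # 1. SNPs only: REF and every ALT allele must be a single base.
    if ref not in BASES or any(a not in BASES for a in alt.split(",")):
        return False

    # The genotype (GT) is the first colon-separated entry of each sample column.
    genotypes = [sample.split(":")[0] for sample in fields[9:]]

    observed_alleles = set()
    for gt in genotypes:
        # 2. Reject sites with any uncalled ('.') or unphased ('/') genotype.
        if "." in gt or "/" in gt:
            return False
        observed_alleles.update(gt.split("|"))

    # 3. Reject fixed sites: at least two distinct alleles must be observed.
    return len(observed_alleles) > 1


# Hypothetical two-sample record: a phased, polymorphic SNP.
example = "1\t100\trs0\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0"
print(keep_variant(example))
```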

Organismal

Currently, GEneSTATION reports human phenotypes only from the Online Mendelian Inheritance in Man (OMIM) database. OMIM is manually curated with an emphasis on clinical information.
We use PubMed to find pregnancy-related gene expression studies in gestational tissues using the following MeSH search strategy: “Pregnancy”[mh] AND “Humans”[mh] AND (“Gene Expression Profiling”[mh] OR “Gene Expression Regulation”[mh]) AND (“Placenta”[mh] OR “Decidua”[mh] OR “Myometrium”[mh] OR “Cervix Uteri”[mh] OR “Extraembryonic Membranes”[mh] OR “Blood”[mh] OR “Plasma”[mh] OR “Umbilical Cord”[mh]) NOT Review[Publication Type]. Studies that fit this search are added to GEneSTATION after review for quality and availability in NCBI’s Gene Expression Omnibus (GEO). See Eidem et al. for details.
We downloaded all microarray datasets from NCBI’s Gene Expression Omnibus (GEO) using the R package GEOquery and reanalyzed them. Ambiguous probes that map to multiple genes were discarded. When multiple probes mapped to a single gene, the probe with the median significance value was reported. Pairwise differential expression statistics were computed using the eBayes algorithm in the limma package.
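The probe-handling rules can be sketched as follows. The actual analysis was done in R with GEOquery and limma, so this Python version only illustrates the selection logic. The function name and input layout are hypothetical, and the tie-breaking rule for an even number of probes (taking the lower of the two middle probes) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def select_probes(probe_to_genes: dict[str, list[str]],
                  probe_pvalues: dict[str, float]) -> dict[str, str]:
    """Pick one representative probe per gene.

    probe_to_genes: probe ID -> gene symbols the probe maps to.
    probe_pvalues:  probe ID -> significance value for that probe.
    Returns gene symbol -> chosen probe ID.
    """
    probes_by_gene = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        unique_genes = set(genes)
        # Discard ambiguous probes (more than one gene) and unmapped probes.
        if len(unique_genes) != 1 or probe not in probe_pvalues:
            continue
        probes_by_gene[next(iter(unique_genes))].append(probe)

    chosen = {}
    for gene, probes in probes_by_gene.items():
        # Rank by significance; the probe ID breaks ties for reproducibility.
        ranked = sorted(probes, key=lambda p: (probe_pvalues[p], p))
        # Median probe; for an even count this takes the lower middle (assumption).
        chosen[gene] = ranked[(len(ranked) - 1) // 2]
    return chosen


# Hypothetical probes: p4 maps to two genes and is therefore dropped.
mapping = {"p1": ["A"], "p2": ["A"], "p3": ["A"], "p4": ["A", "B"], "p5": ["B"]}
pvals = {"p1": 0.5, "p2": 0.01, "p3": 0.2, "p4": 0.001, "p5": 0.3}
print(select_probes(mapping, pvals))
```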
Protein Atlas provides protein and RNA expression data across multiple tissues. RNA expression is measured by RNA-Seq and protein expression by immunohistochemistry. An extensive description of their methods is available here and the original data here.

Molecular

The molecular functional annotations are reported using a controlled vocabulary produced by the Gene Ontology Consortium. The annotations themselves are from Bioconductor (release 3.1).

More information about the annotation for each species can be found here:
Bos taurus
Canis familiaris
Homo sapiens
Macaca mulatta
Mus musculus
Pan troglodytes
Rattus norvegicus
Sus scrofa
The enzymatic annotations are reported using Enzyme Commission numbers from the International Union of Biochemistry and Molecular Biology. The annotations themselves are from Bioconductor (release 3.1).

More information about the annotation for each species can be found here:
Bos taurus
Canis familiaris
Homo sapiens
Macaca mulatta
Mus musculus
Pan troglodytes
Rattus norvegicus
Sus scrofa
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) provides protein and gene interactions for GEneSTATION. STRING uses a variety of data sources to infer interactions, including high-throughput experiments, co-expression, and literature mining. The original data is available here.

Analytical Tools

The p-value for the gene set analysis is calculated using the cumulative hypergeometric probability, which is closely related to Fisher's exact test. The calculation uses four counts: the number of background genes evaluated in the analysis set (e.g., only genes with a reported dN/dS value in a study, or only the genes present on a microarray), the number of genes in the user's submitted set, the number of genes in each gene set stored in GEneSTATION, and the number of genes in the overlap of the submitted set and the test set.
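A minimal sketch of this calculation from the four counts is shown here. It assumes the upper tail is used (the probability of an overlap at least as large as the one observed), which the text does not state explicitly. The function and parameter names are illustrative and not taken from GEneSTATION.

```python
from math import comb


def enrichment_pvalue(background: int, set_size: int,
                      submitted: int, overlap: int) -> float:
    """Upper-tail cumulative hypergeometric probability P(X >= overlap).

    background: genes evaluated in the analysis set
    set_size:   genes in the stored gene set being tested
    submitted:  genes in the user's submitted set
    overlap:    genes shared by the submitted set and the test set
    """
    largest_possible = min(set_size, submitted)
    # Number of ways to draw the submitted set from the background.
    total = comb(background, submitted)
    # Sum the probabilities of every overlap from the observed one upward.
    tail = sum(
        comb(set_size, k) * comb(background - set_size, submitted - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total


# Toy example: 10 background genes, a test set of 4, 3 submitted genes, overlap 2.
print(enrichment_pvalue(background=10, set_size=4, submitted=3, overlap=2))
```

In the toy example, the tail sums the cases of overlap 2 and overlap 3: (6 × 6 + 4 × 1) / 120 = 1/3.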

Each type of analysis set has its own definition:
  • Category-based annotations are straightforward (e.g., gene ontology and gene age), since each set includes all genes with the annotation.
  • Gene expression sets contain the significantly differentially expressed genes, defined using a genome-wide p-value correction with a 0.05 cutoff.
  • Selection sets use the top 5% and bottom 5% of genes for dN and dS. dN/dS has a raw set and a filtered set. Very small dS values can produce very large dN/dS values, so the filtered set omits genes in the bottom 5% of dS.
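The difference between the raw and filtered dN/dS sets can be illustrated with a short sketch. The rank-based cutoff and the rounding down of the number of dropped genes are assumptions, since the text gives only the 5% figure. The function name and the gene values are hypothetical.

```python
def filtered_dnds(genes: dict[str, tuple[float, float]],
                  fraction: float = 0.05) -> dict[str, float]:
    """Return dN/dS per gene after dropping the lowest `fraction` of genes by dS.

    genes: gene name -> (dN, dS).
    """
    # Rank genes from smallest to largest dS (gene name breaks ties).
    ranked = sorted(genes, key=lambda g: (genes[g][1], g))
    # Number of genes to drop, rounded down (assumption).
    n_drop = int(len(ranked) * fraction)
    kept = ranked[n_drop:]
    # Guard against division by zero for any remaining gene with dS == 0.
    return {g: genes[g][0] / genes[g][1] for g in kept if genes[g][1] > 0}


# Hypothetical data: 19 ordinary genes plus one with a near-zero dS.
data = {f"g{i:02d}": (0.02, 0.10 + 0.01 * i) for i in range(1, 20)}
data["g00"] = (0.01, 0.0001)  # raw dN/dS would be about 100

filtered = filtered_dnds(data)
print("g00" in filtered, len(filtered))
```

In this toy data, the gene with the near-zero dS would dominate the raw ranking despite having the smallest dN, and the filtered set removes it.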