CoGTEx: Unscaled coexpression estimation from GTEx data forecast novel functional gene partners
CoGTEx is a resource for viewing and analyzing coexpression from GTEx v8 data.
Gene coexpression is helpful for the analysis of pathways, cofactors, regulators, targets, and human health and disease. Coexpression estimates available today are computed at the “tissue level”, which relies on formulations that are standardized or scaled by cell type. CoGTEx additionally presents coexpression estimated without scaling, termed the “system level”. To provide comparable and robust results for both scenarios, we first filtered the GTEx samples to obtain clear, unambiguous tissue clusters, then sub-sampled all tissue clusters (n=20 times) at the same number of samples per cluster (n=70) to calculate three metrics of coexpression at the system and tissue levels. We show that our calculations at the tissue level are similar to the estimations available in other databases.
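The balanced sub-sampling described above can be sketched as follows. This is a minimal illustration, not the CoGTEx implementation: the function name `balanced_subsamples`, its parameters, and the choice to sample without replacement are assumptions, since the text only gives the number of repeats (20) and the samples per cluster (70).

```python
import numpy as np

def balanced_subsamples(cluster_labels, n_per_cluster=70, n_repeats=20, seed=0):
    """Yield index arrays that draw the same number of samples from every cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(cluster_labels)
    clusters = np.unique(labels)
    for _ in range(n_repeats):
        picks = []
        for cluster in clusters:
            members = np.flatnonzero(labels == cluster)
            # Assumption: draw without replacement within each cluster.
            # rng.choice raises ValueError if a cluster has too few members.
            picks.append(rng.choice(members, size=n_per_cluster, replace=False))
        yield np.concatenate(picks)
```

Each yielded array indexes one balanced subset of samples, so a coexpression metric could be computed once per subset and then summarized across the 20 repeats.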
Data collection
Version 8 of the gene-level TPM expression data and free-access metadata was downloaded from the GTEx website on May 23, 2021 (https://gtexportal.org/home/datasets; files: “GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz”, “GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt”, and “GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt”).
The collected data originally comprised 17,382 RNA-Seq samples from 54 tissues (2 of which are duplicates of 2 other tissues that differ in preservation method and ischemic time; https://gtexportal.org/home/faq).
After data collection, we excluded tissues with fewer than 50 samples (21 bladder, 9 ectocervix, 10 endocervix, 9 fallopian tube, 4 kidney medulla). EBV-transformed lymphocytes (174) and cultured fibroblasts (504) were also excluded (The GTEx Consortium, 2020), leaving 16,651 samples from 47 tissues.
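The sample and tissue counts reported here can be checked with simple arithmetic, and the tissue filter itself is easy to sketch. The snippet below is only an illustration: `filter_tissues`, its parameters, and the toy tissue labels in the usage note are hypothetical names, not identifiers from GTEx or from the CoGTEx pipeline.

```python
from collections import Counter

# Counts reported in the text.
TOTAL_SAMPLES = 17_382
TOTAL_TISSUES = 54
SMALL_TISSUES = {"bladder": 21, "ectocervix": 9, "endocervix": 10,
                 "fallopian tube": 9, "kidney medulla": 4}
CELL_LINES = {"EBV-transformed lymphocytes": 174, "cultured fibroblasts": 504}

remaining_samples = (TOTAL_SAMPLES
                     - sum(SMALL_TISSUES.values())
                     - sum(CELL_LINES.values()))          # 16,651
remaining_tissues = TOTAL_TISSUES - len(SMALL_TISSUES) - len(CELL_LINES)  # 47

def filter_tissues(sample_tissues, min_samples=50, excluded=()):
    """Return (indices of kept samples, set of kept tissues).

    A tissue is kept if it has at least `min_samples` samples and is not
    listed in `excluded` (for example, cell-line derived samples).
    """
    counts = Counter(sample_tissues)
    kept = {t for t, n in counts.items()
            if n >= min_samples and t not in excluded}
    indices = [i for i, t in enumerate(sample_tissues) if t in kept]
    return indices, kept
```

For example, calling `filter_tissues(labels, excluded={"cell line"})` on a list with 60 "liver", 21 "bladder", and 80 "cell line" labels keeps only the 60 liver samples.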
Gene selection
Low-expressed genes were filtered out as they can represent noisy measurements (Sha et al, 2015). For this, we used an assessment employed by GTEx (The GTEx Consortium, 2020) that tests whether each gene is expressed at 0.1 TPM or more in at least 20% of samples (3,331 samples). 24,720 genes failed this test and were initially considered low-expressed (31,480 passed). However, some of these genes were expected to fail only because of tissue specificity, as the procedure naturally retains only genes expressed across several tissues. Therefore, for every gene filtered out by this test, the mean expression and the percentage of samples expressed at 1 TPM or more were calculated per tissue. A maximum tissue mean expression greater than 5 TPM, combined with a maximum per-tissue percentage of expressed samples greater than 66%, was considered evidence that a gene should be retained, since it indicates that the gene is clearly not low-expressed in at least one tissue. With this procedure, 1,955 genes of the 24,720 initially filtered out were retained, leaving a final count of 33,445 genes for downstream analyses.
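The two-stage gene filter can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' code: the function name `select_genes` and its parameter names are hypothetical, the input is assumed to be a genes-by-samples TPM matrix, and the two maxima in the rescue step are taken independently over tissues, following the wording of the text.

```python
import numpy as np

def select_genes(tpm, tissues, global_tpm=0.1, global_frac=0.20,
                 tissue_tpm=1.0, min_tissue_mean=5.0, min_tissue_frac=0.66):
    """Return a boolean mask over genes (rows of `tpm`) to retain.

    tpm     : genes x samples array of TPM values
    tissues : one tissue label per sample (column)
    """
    tpm = np.asarray(tpm, dtype=float)
    tissues = np.asarray(tissues)

    # Stage 1: expressed at >= 0.1 TPM in at least 20% of all samples.
    passed = (tpm >= global_tpm).mean(axis=1) >= global_frac

    # Stage 2: per-tissue mean expression and fraction of samples at >= 1 TPM.
    names = np.unique(tissues)
    means = np.stack([tpm[:, tissues == t].mean(axis=1) for t in names], axis=1)
    fracs = np.stack([(tpm[:, tissues == t] >= tissue_tpm).mean(axis=1)
                      for t in names], axis=1)

    # Assumption: the maximum mean and the maximum fraction are each taken
    # over tissues independently, so they may come from different tissues.
    rescued = ((means.max(axis=1) > min_tissue_mean)
               & (fracs.max(axis=1) > min_tissue_frac))

    # Keep genes that pass stage 1, plus failed genes rescued by stage 2.
    return passed | (~passed & rescued)
```

A gene expressed only in a small tissue fails stage 1 but is rescued by stage 2, whereas a gene whose high tissue mean comes from a single outlier sample is not, because its fraction of expressed samples stays below the 66% threshold.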
Data normalization and correction
Data were variance-stabilized by adding a pseudocount of 1 to all TPM values and taking the base-10 logarithm of these quantities (as in the GTEx portal). Quantile normalization was then applied (Bolstad et al, 2003) using the preprocessCore R package. ComBat, a Bayesian framework for batch correction, was employed after normalization (Johnson et al, 2007). ComBat was run sequentially, first for the extraction batch and then for the sequencing batch. Samples with incomplete batch information, or belonging to a batch with fewer than two samples in total, were assigned placeholder batches. Default parameters of the ComBat implementation in the R sva package were used, and the tissue, sex, and age of each sample were specified as variables to preserve during batch correction.
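The first two steps, the log10(TPM + 1) transform and quantile normalization, can be illustrated with a small NumPy sketch. It is a simplified stand-in for the preprocessCore routine named in the text, not a port of it: the function names are hypothetical, the matrix is assumed to be genes by samples, and tied values are ranked in order of appearance rather than receiving special treatment. The ComBat step is not reproduced here.

```python
import numpy as np

def log10_pseudocount(tpm, pseudocount=1.0):
    """Variance-stabilize TPM values as log10(TPM + pseudocount)."""
    return np.log10(np.asarray(tpm, dtype=float) + pseudocount)

def quantile_normalize(x):
    """Give every sample (column) of a genes x samples matrix the same distribution.

    The k-th smallest value of each column is replaced by the mean of the
    k-th smallest values across all columns.
    """
    x = np.asarray(x, dtype=float)
    # order[k, j] is the row holding the k-th smallest value of column j.
    order = np.argsort(x, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x)
    # Write reference[k] at the position of the k-th smallest value per column.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

After normalization, sorting any column yields the same reference vector, which is the defining property of quantile normalization; within each sample the ranking of genes is unchanged.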