(NON SOLL’S
ELSEVIER
01010
0010101010010
COMPUTATIONAL
001-01010010201010101011
10
101001
0101016
41
01
0101001
1010101
10
ANDSTRUCTURAL
11
0101001
1010101
10
10 0101001 0101010
11
BIOTECHNOLOGY
00
101001
5101010
11
01010101001101010 10
110101010010010000010010
JOURNAL
journal homepage: www.elsevier.com/locate/csbj
Violations of proportional hazard assumption in Cox regression model of transcriptomic data in TCGA pan-cancer cohorts
Check for updates
Zihang Zenga,1, Yanping Gao ª,1, Jiali Lia, Gong Zhangª, Shaoxing Sun ª, Qiuji Wuª, Yan Gong b,c,*, Conghua Xie a,d,e,*
a Department of Radiation and Medical Oncology, Zhongnan Hospital of Wuhan University, Wuhan, China
b Department of Biological Repositories, Zhongnan Hospital of Wuhan University, Wuhan, China
“Tumor Precision Diagnosis and Treatment Technology and Translational Medicine, Hubei Engineering Research Center, Zhongnan Hospital of Wuhan University, Wuhan, China d Hubei Key Laboratory of Tumor Biological Behaviors, Zhongnan Hospital of Wuhan University, Wuhan, China
e Hubei Cancer Clinical Study Center, Zhongnan Hospital of Wuhan University, Wuhan, China
ARTICLE INFO
Article history: Received 12 July 2021
Received in revised form 3 January 2022 Accepted 3 January 2022
Available online 7 January 2022
Keywords: Proportional hazard assumption Cox regression Transcriptome TCGA Pan-cancer
ABSTRACT
Background: Cox proportional hazard regression (CPH) model relies on the proportional hazard (PH) assumption: the hazard of variables is independent of time. CPH has been widely used to identify prog- nostic markers of the transcriptome. However, the comprehensive investigation on PH assumption in transcriptomic data has lacked.
Results: The whole transcriptomic data of the 9,056 patients from 32 cohorts of The Cancer Genome Atlas and the 3 lung cancer cohorts from Gene Expression Omnibus were collected to construct CPH model for each gene separately for fitting the overall survival. An average of 8.5% gene CPH models violated the PH assumption in TCGA pan-cancer cohorts. In the gene interaction networks, both hub and non-hub genes in CPH models were likely to have non-proportional hazards. Violations of PH assumption for the same gene models were not consistent in 5 non-small cell lung cancer datasets (all kappa coefficients < 0.2), indicating that the non-proportionality of gene CPH models depended on the datasets. Furthermore, the introduction of log(t) or sqrt(t) time-functions into CPH improved the performance of gene models on overall survival fitting in most tumors. The time-dependent CPH changed the significance of log haz- ard ratio of the 31.9% gene variables.
Conclusions: Our analysis resulted that non-proportional hazards should not be ignored in transcriptomic data. Introducing time interaction term ameliorated performance and interpretability of non- proportional hazards of transcriptome data in CPH.
@ 2022 The Author(s). Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creative- commons.org/licenses/by-nc-nd/4.0/).
Abbreviations: CPH, Cox proportional hazard regression; PH, proportional hazard; TCGA, The Cancer Genome Atlas; OS, overall survival; GEO, Gene Expression Omnibus; AIC, Akaike information criterion; GO, Gene Ontology; CON, Concordance regression; TCGA, tumor abbreviations; ACC, Adrenocortical carcinoma; BLCA, Bladder Urothelial Carcinoma; BRCA, Breast invasive carcinoma; CESC, Cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, Cholangiocarcinoma; COAD, Colon adenocarcinoma; DLBC, Lymphoid Neoplasm Diffuse Large B-cell Lymphoma; ESCA, Esophageal carcinoma; GBM, Glioblastoma multiforme; HNSC, Head and Neck squamous cell carcinoma; KICH, Kidney Chromophobe; KIRC, Kidney renal clear cell carcinoma; KIRP, Kidney renal papillary cell carcinoma; LGG, Brain Lower Grade Glioma; LIHC, Liver hepatocellular carcinoma; LUAD, Lung adenocarcinoma; LUSC, Lung squamous cell carcinoma; MESO, Mesothelioma; OV, Ovarian serous cystadenocarcinoma; PAAD, Pancreatic adenocarcinoma; PCPG, Pheochromocytoma and Paraganglioma; PRAD, Prostate adenocarcinoma; READ, Rectum adenocarcinoma; SARC, Sarcoma; SKCM, Skin Cutaneous Melanoma; STAD, Stomach adenocarcinoma; TGCT, Testicular Germ Cell Tumors; THCA, Thyroid carcinoma; THYM, Thymoma; UCEC, Uterine Corpus Endometrial Carcinoma; UCS, Uterine Carcinosarcoma; UVM, Uveal Melanoma.
* Corresponding authors at: Zhongnan Hospital of Wuhan University, 169 Donghu Road, Wuhan, Hubei 430071, China. E-mail addresses: yan.gong@whu.edu.cn (Y. Gong), chxie_65@whu.edu.cn (C. Xie).
1 Zihang Zeng and Yanping Gao contributed equally to this study.
| TCGA ID | PH-Nhub | PH-hub | NPH-Nhub | NPH-hub | hub (NPH%) | Nhub (NPH%) | Chi-Square P |
|---|---|---|---|---|---|---|---|
| ACC | 9161 | 5787 | 534 | 195 | 3.26% | 5.51% | <0.0001 |
| BLCA | 7787 | 5137 | 1302 | 649 | 11.22% | 14.33% | <0.0001 |
| BRCA | 7294 | 4889 | 1160 | 714 | 12.74% | 13.72% | 0.1 |
| CESC | 8535 | 5325 | 699 | 485 | 8.35% | 7.57% | 0.09 |
| CHOL | 9655 | 5845 | 575 | 344 | 5.56% | 5.62% | 0.8936 |
| COAD | 8686 | 5538 | 532 | 346 | 5.88% | 5.77% | 0.8075 |
| DLBC | 9235 | 5730 | 547 | 294 | 4.88% | 5.59% | 0.05758 |
| ESCA | 9423 | 5810 | 533 | 270 | 4.44% | 5.35% | 0.01128 |
| GBM | 8762 | 5074 | 1059 | 946 | 15.71% | 10.78% | <0.0001 |
| HNSC | 8211 | 5443 | 617 | 270 | 4.73% | 6.99% | <0.0001 |
| KICH | 8630 | 5264 | 926 | 663 | 11.19% | 9.69% | 0.003137 |
| KIRC | 8241 | 5114 | 547 | 601 | 10.52% | 6.22% | <0.0001 |
| KIRP | 8410 | 5346 | 679 | 433 | 7.49% | 7.47% | 0.9857 |
| LGG | 6209 | 3705 | 2763 | 2029 | 35.39% | 30.80% | <0.0001 |
| LIHC | 6622 | 3752 | 1955 | 1928 | 33.94% | 22.79% | <0.0001 |
| LUAD | 8410 | 5326 | 859 | 526 | 8.99% | 9.27% | 0.582 |
| LUSC | 8183 | 5197 | 1074 | 631 | 10.83% | 11.60% | 0.1506 |
| MESO | 9134 | 5705 | 753 | 351 | 5.80% | 7.62% | <0.0001 |
| OV | 8163 | 5239 | 1038 | 581 | 9.98% | 11.28% | 0.01339 |
| PAAD | 8732 | 5416 | 1054 | 652 | 10.74% | 10.77% | 0.9807 |
| PCPG | 8787 | 5635 | 566 | 274 | 4.64% | 6.05% | 0.000219 |
| PRAD | 8528 | 5506 | 257 | 182 | 3.20% | 2.93% | 0.3734 |
| READ | 7500 | 4505 | 2203 | 1529 | 25.34% | 22.70% | 0.0001699 |
| SARC | 8471 | 5429 | 949 | 487 | 8.23% | 10.07% | 0.0001544 |
| SKCM | 9010 | 5691 | 703 | 280 | 4.69% | 7.24% | <0.0001 |
| STAD | 8220 | 5326 | 1000 | 578 | 9.79% | 10.85% | 0.04084 |
| TGCT | 10,221 | 6113 | 136 | 72 | 1.16% | 1.31% | 0.4472 |
| THCA | 8390 | 5475 | 191 | 123 | 2.20% | 2.23% | 0.9562 |
| THYM | 9489 | 5840 | 294 | 145 | 2.42% | 3.01% | 0.03506 |
| UCEC | 9327 | 5712 | 528 | 310 | 5.15% | 5.36% | 0.591 |
| UCS | 10,226 | 6088 | 307 | 162 | 2.59% | 2.91% | 0.2389 |
| UVM | 8876 | 5651 | 345 | 186 | 3.19% | 3.74% | 0.07955 |
Note: PH, proportional hazards; NPH, non-proportional hazards; Nhub, non-hub; CPH, COX proportional hazards.
| TCGA ID | Patient number | Proportions of gene CPH model violated PH (%) | Maxima Log likelihood (Proportions %) | Minima AIC (Proportions %) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-t | t | log(t) | t^2 | sqrt(t) | exp^t | Non-t | t | log(t) | t^2 | sqrt(t) | exp^t | |||
| ACC | 77 | 5.34% | 0.00% | 15.12% | 36.36% | 13.57% | 14.09% | 20.87% | 0.00% | 15.12% | 36.36% | 13.57% | 14.09% | 20.87% |
| BLCA | 405 | 12.44% | 0.00% | 18.18% | 40.87% | 2.35% | 36.48% | 2.12% | 0.19% | 18.18% | 40.76% | 2.35% | 36.40% | 2.12% |
| BRCA | 1077 | 13.33% | 0.00% | 12.06% | 53.27% | 6.70% | 26.90% | 1.06% | 0.00% | 12.06% | 53.27% | 6.70% | 26.90% | 1.06% |
| CESC | 291 | 7.91% | 0.00% | 16.27% | 37.30% | 5.39% | 36.28% | 4.76% | 0.06% | 16.27% | 37.30% | 5.39% | 36.22% | 4.76% |
| CHOL | 36 | 5.64% | 0.00% | 22.61% | 24.32% | 15.56% | 19.21% | 18.30% | 0.00% | 22.55% | 24.32% | 15.44% | 19.21% | 18.48% |
| COAD | 282 | 5.87% | 0.00% | 27.59% | 20.91% | 23.61% | 23.46% | 4.42% | 0.00% | 27.59% | 20.91% | 23.61% | 23.46% | 4.42% |
| DLBC | 46 | 5.59% | 0.00% | 20.55% | 38.55% | 20.82% | 18.54% | 1.54% | 0.13% | 20.28% | 38.35% | 21.36% | 18.33% | 1.54% |
| ESCA | 181 | 4.82% | 0.00% | 37.12% | 10.55% | 16.83% | 20.22% | 15.28% | 0.00% | 37.12% | 10.55% | 16.83% | 20.22% | 15.28% |
| GBM | 152 | 9.67% | 0.00% | 15.33% | 25.33% | 3.51% | 45.51% | 10.32% | 0.04% | 15.29% | 25.33% | 3.47% | 45.27% | 10.60% |
| HNSC | 517 | 7.48% | 0.00% | 30.29% | 15.55% | 3.55% | 49.86% | 0.75% | 0.07% | 30.15% | 15.55% | 3.55% | 49.80% | 0.89% |
| KICH | 65 | 10.92% | 0.00% | 4.46% | 80.29% | 3.16% | 4.03% | 8.06% | 0.00% | 4.43% | 80.29% | 3.16% | 4.07% | 8.06% |
| KIRC | 528 | 7.61% | 0.00% | 16.00% | 26.13% | 14.15% | 36.84% | 6.88% | 0.00% | 16.00% | 26.13% | 14.15% | 36.84% | 6.88% |
| KIRP | 285 | 8.68% | 0.00% | 16.50% | 45.82% | 12.71% | 18.71% | 6.25% | 0.00% | 16.50% | 45.82% | 12.71% | 18.71% | 6.25% |
| LGG | 504 | 29.31% | 0.00% | 43.85% | 5.43% | 16.60% | 27.45% | 6.67% | 0.00% | 43.85% | 5.43% | 16.60% | 27.45% | 6.67% |
| LIHC | 363 | 23.01% | 0.00% | 40.65% | 8.29% | 7.54% | 41.69% | 1.83% | 0.00% | 40.65% | 8.29% | 7.54% | 41.69% | 1.83% |
| LUAD | 500 | 9.09% | 0.00% | 36.78% | 10.04% | 20.79% | 29.14% | 3.25% | 0.15% | 36.58% | 10.04% | 20.44% | 29.04% | 3.75% |
| LUSC | 491 | 11.89% | 0.00% | 34.87% | 10.56% | 7.08% | 45.43% | 2.07% | 0.00% | 34.87% | 10.56% | 7.08% | 45.39% | 2.11% |
| MESO | 85 | 7.79% | 0.00% | 28.74% | 27.92% | 16.19% | 20.47% | 6.68% | 0.05% | 28.69% | 27.92% | 16.05% | 20.47% | 6.82% |
| OV | 418 | 11.31% | 0.00% | 49.54% | 6.33% | 17.55% | 22.24% | 4.34% | 0.16% | 49.42% | 6.33% | 17.55% | 22.20% | 4.34% |
| PAAD | 177 | 10.75% | 0.00% | 30.10% | 22.21% | 15.03% | 24.14% | 8.52% | 0.15% | 30.10% | 22.21% | 15.03% | 24.03% | 8.48% |
| PCPG | 177 | 5.56% | 0.00% | 10.65% | 49.66% | 27.22% | 9.89% | 2.59% | 0.15% | 10.27% | 49.20% | 27.83% | 9.73% | 2.81% |
| PRAD | 495 | 3.66% | 0.00% | 6.29% | 54.75% | 5.89% | 10.31% | 22.76% | 5.49% | 5.76% | 53.28% | 5.22% | 9.91% | 20.35% |
| READ | 91 | 19.22% | 0.00% | 26.04% | 23.51% | 10.34% | 27.34% | 12.76% | 0.00% | 26.04% | 23.51% | 10.34% | 27.34% | 12.76% |
| SARC | 258 | 9.16% | 0.00% | 16.87% | 36.01% | 14.78% | 20.16% | 12.19% | 0.00% | 16.87% | 36.01% | 14.78% | 20.16% | 12.19% |
| SKCM | 101 | 5.54% | 0.00% | 7.46% | 50.71% | 5.97% | 5.40% | 30.47% | 0.00% | 7.46% | 50.71% | 5.97% | 5.40% | 30.47% |
| STAD | 388 | 9.32% | 0.00% | 22.51% | 29.49% | 0.90% | 43.92% | 3.18% | 0.05% | 22.51% | 29.49% | 0.90% | 43.87% | 3.18% |
| TGCT | 132 | 1.41% | 0.00% | 17.26% | 28.17% | 43.91% | 9.39% | 1.27% | 7.61% | 14.21% | 21.83% | 36.29% | 6.35% | 13.71% |
| THCA | 503 | 2.35% | 0.00% | 17.46% | 17.67% | 24.35% | 11.64% | 28.88% | 2.16% | 16.59% | 17.46% | 23.92% | 11.21% | 28.66% |
| THYM | 118 | 3.14% | 0.00% | 5.05% | 12.76% | 19.01% | 3.25% | 59.93% | 0.00% | 5.05% | 12.76% | 18.77% | 3.25% | 60.17% |
| UCEC | 178 | 6.15% | 0.00% | 19.62% | 30.29% | 14.25% | 22.62% | 13.23% | 0.64% | 19.55% | 30.16% | 13.74% | 21.98% | 13.93% |
| UCS | 56 | 3.24% | 0.00% | 23.82% | 33.89% | 12.07% | 22.04% | 8.18% | 0.10% | 23.82% | 33.89% | 11.86% | 21.93% | 8.39% |
| UVM | 79 | 4.26% | 0.00% | 19.62% | 22.40% | 19.43% | 12.39% | 26.16% | 0.00% | 19.62% | 22.40% | 19.43% | 12.39% | 26.16% |
1. Introduction
As sequencing costs decreased in recent years, transcriptomic sequencing is an efficient and increasingly popular approach to characterize complex biological pathways of tumors [1]. Numerous studies have demonstrated the possibility of transcriptomic signa- tures as prognostic markers, which is often achieved based on Cox proportional hazard regression (CPH) [2-4].
The main premise of CPH is the proportional hazard (PH) assumption [5,6]. Specifically, the model assumes that the hazard of each covariate does not change over time. Violation of the PH assumption may lead to misleading and erroneous scientific find- ings [7,8]. If the hazard of variable increases or decreases over time, the usual CPH model will ignore these time-dependent changes. Bellera et al. found that the negative status of hormone receptors increased the risk of metastasis at early stages but was protective thereafter in breast cancer [9]. Moreover, Shintani et al. demon- strated that time-fixed CPH could have a considerable bias in ana- lyzing the relationship between delirium and ICU length of stay [10].
However, verification of the PH assumption has not received sufficient attention. Only a small proportion of studies relying on Cox models have validated the PH assumption [8,11]. Unlike clini- cal data, the transcriptomic data consisted of high-throughput RNA signatures. Compliance with the PH assumption in transcriptomic data is to be investigated. On the other hand, the CPH with the time-dependent term can deal with the non-proportional hazards of variables [5]. The effects of CPH with the time-dependent term for transcriptomic data are still uncertain.
The Cancer Genome Atlas (TCGA) is one of the most comprehen- sive tumor omics projects containing over 20,000 samples span- ning 33 cancer types [12]. The high-quality sequencing data of TCGA provide an opportunity to systematically characterize the non-proportional hazards of each transcriptomic gene. This study aimed to construct a CPH model for overall survival (OS) prediction at the RNA level for each gene and answer the following 5 ques- tions using TCGA cohorts: 1) What proportions of single-gene CPH models violate the PH assumption in different tumors; 2) How important are these genes with non-proportional hazards in CPH; 3) Are these genes very different across cohorts of the same tumors; 4) The CPH with the time-dependent term is an approach to explain the non-proportional hazards of covariates. Which form of the time-interaction term works better in time-dependent CPH; 5) Does the time-dependent CPH change the significance of the log hazard ratio (HR) of genes compared with the usual CPH.
2. Material and methods
Assessing the association of RNA levels of genes in cancer patients with survival times had important implications for the prognostic markers’ identification. In this study, we aimed to fit cancer patients’ OS (outcome variable) using univariate CPH. Mod- els included only gene RNA level as an independent variable unless otherwise specified.
2.1. Data collection and standardization
We collected 9,056 patients from 32 cohorts with complete RNA sequencing (RNA-seq) of primary tumors and OS annotations (pa- tients with 0 survival time were excluded) from TCGA [12]. The RNA-seq of TCGA was normalized by TPM & log2(x + 1) and down- loaded by Xena database [13]. We removed RNAs that had the same expression values in more than 80% of the patients. Finally, we included 38,154 RNAs in TCGA cohorts.
Three independent transcriptomic datasets of NSCLC were col- lected from the Gene Expression Omnibus (GEO) database: GSE68465 (lung adenocarcinoma, LUAD, n = 442) [14], GSE37745 (non-small cell lung cancer, NSCLC, n = 196) [15], GSE50081 (NSCLC, n = 172) [16]. Their microarray data were normalized by log2 RMA [17]. The basic information of included cohorts was shown in Tab. S1.
2.2. Schoenfeld residual test
Schoenfeld residual test was an effective and widely used method for the diagnosis of PH assumption [6,7]. The Schoenfeld residuals were defined as the following functions [18]:
M(, tk) = >Zs * e B+Zs | > @ B+Zs SERS SERS
Tk(B) = Zk - M(B, tk)
where Rs denotes the risk set at time tk, Zs is the covariate of the patient with an event at time tk, ß is the coefficient of Zs, M(B, tk) is the conditional weighted mean of covariate at time tk. Schoenfeld residuals were scaled using the average variance matrix [19]. Under the assumption of PH, the scaled Schoenfeld residuals were a mean of 0 and independent of time. We used R ‘cox.zph’ function to performed Schoenfeld residual test [19].
2.3. Time-dependent CPH model
Introducing a time-dependent variable in CPH provided a flexi- ble method to evaluate non-proportionality [20]. In this study, we constructed an interaction term between gene expression and time function (t, log(t), t2, sqrt(t), or et) in time-dependent CPH:
hx (t) = ho(t) * e(Bx+õ*X=f(t)
Where hx(t) is the hazard of gene x, ho(t) represents baseline hazard, and f(t) is time function, ß is the coefficient of x, and ô is the coefficient of time-interaction term. The CPH was performed by R ‘coxph’ function, and interaction term was realized by the time-transform functionality of ‘coxph’ [21].
2.4. Performance evaluation of CPH model
Log-likelihood and Akaike information criterion (AIC) were used to assess the fitting capability of models. Log-likelihood was posi- tively associated with favourable fitting performance, while AIC was negatively linked to fitting performance. We used R ‘logLik’ and ‘AIC’ functions to calculate the Log-likelihood and AIC of CPH models [22,23].
2.5. Protein-protein interaction network and node importance assessment
STRING was a protein-protein interaction (PPI) database that contained 7 kinds of interaction evidence (fusion evidence, neigh- borhood evidence, co-occurrence evidence, experimental evidence, textmining evidence, database evidence, and co-expression evi- dence) [24]. We constructed PPI networks consisting of 17,399 genes overlapped with TCGA RNA signatures and 496,557 edges with confidence greater than 0.6 from STRING.
Betweenness, closeness, and degree were indexes reflecting node importance of PPI networks and calculated by R ‘igraph’ pack- age (https://igraph.org/). Betweenness was the proportion of the shortest path through a certain node [25]. Closeness measured the reciprocal distances to all other nodes. Degree measured the number of directly connected nodes. Genes with all betweenness,
closeness and degree above the median were identified as hubs, and others were non-hubs.
2.6. Exploring potential factors associated with non-proportionality in CPH
For genes with non-proportional hazards in a CPH model, since each patient corresponded to one survival time, there were 2 patient sets (corresponded to the 2-time sets) in which genes were inversely related to prognosis. We first constructed the risk scores for each patient using multivariable Cox regressions for fitting OS from a random set of 2000 genes in each cohort. Next, we divided the patients into two groups for each gene: group A, [rank(gene expressions) - median(rank(gene expressions))] and [rank(risk scores) - median(rank(risk scores))] have the same sign; group B, [rank(gene expressions) - median(rank(gene expressions))] and [rank(risk scores) - median(rank(risk scores))] have the different sign. Hierarchical clustering of the matrix [genes, patient groups] resulted in non-proportional related patient clusters. Hierarchical clustering was realized by R ‘hclust’ function [26].
MCPcounter was an algorithm for calculating microenviron- mental cell infiltration scores based on ssGSEA [27]. We performed statistical analysis to explore infiltration score related and clinical factors associated with the above clusters. In order to identify cluster-related genes, we performed differential expressed gene analysis for each cluster vs. others via R ‘limma’ package [28].
2.7. Concordance regression
Concordance regression (CON) was an algorithm that used all available patient pairs (omit censored pairs) to model C’ [29]. The C’ was developed from the c-index to be generalized to contin- uous data: c’ = P(Ti < Tj|xi = Xj + 1)
Where X is gene expression value, T is survival time, and i, j were patients. We used R ‘concreg’ package to construct CON model [29].
2.8. Generation of simulated datasets
Simulated expression matrices for 10,000 genes *100 samples were derived from a standard normal distribution via R ‘rnorm’ function [30]. Generalized gamma distributions are flexible distri- butions fitted to survival times that incorporate commonly used distributions (exponential, Weibull, log normal, and gamma) [31]. The mu, sigma, Q parameters were used to characterize the survival curves, which were realized by R ‘flexsurv’ package [32]. We performed 40 experiments with 8 typically shaped survival curves using different parameters of the generalized gamma distri- bution and 5 event rates (10%, 25%, 50%, 75%, and 90%).
2.9. Statistical analysis
All statistical analysis was performed by R 3.6.1. Survival anal- ysis was performed by R ‘survival’ package [21]. Randomly dis- tributed P values were generated for each gene with the following formula: rank(P)/gene number. The rank(P) is descend- ing ranking of Schoenfeld residuals test P value. Kolmogorov- Smirnov test was performed by R ‘ks.test’ function [33]. The R ‘pracma’ package was used to calculate the area under the curves [34]. Fleiss’ kappa was calculated by R ‘irr’ package [35]. R ‘stats’ package was used to perform chi-square and analysis of variance (ANOVA) tests. Parallel computing was performed by R ‘parallel’ package [36]. Multiple tests were adjusted by Benjamini and Hoch- berg method in gene enrichment analysis [37]. P < 0.05 was consid-
ered as statistical significance in hypothesis tests. All the P values were 2-sided.
3. Results
3.1. Landscapes of violations of the PH assumption for CPH with transcriptome genes in TCGA pan-cancer cohorts
The study design was shown in Fig. 1. A total of 9,056 samples with complete RNA-seq (TPM) and OS information from 32 cancers were collected from TCGA. Overall, the mean follow-up time in TCGA was 2.75 years, with death events observed in a total of 29.5% patients (Tab. S1). The verification of PH assumption was based on Schoenfeld residual test, which identified the non- proportional characteristic of variables in CPH models fitted OS. Through the parallel strategy of R ‘parallel’ package, Schoenfeld residual test of univariate CPH for each gene was performed in dif- ferent tumors, respectively. The distributions of Schoenfeld test P values were left-skewed and light-tail in most tumors (Fig. 2A & Fig. S1). In all tumors, an average of 8.5% gene CPH models violated the PH assumption. The above proportions were the highest in LGG (29.3%), LIHC (23.0%), and READ (19.2%) and the lowest in TGCT (1.4%), THCA (2.4%), and THYM (3.1%). Furthermore, 66.2% of the gene CPH models violated the PH assumption in at least one tumor, but only 7% of the gene CPH models violated PH in more than 5 tumors (Fig. 2B). The Quantile-Quantile (QQ) plot revealed that the p values from the Schoenfeld residual test were significantly better than that from a random distribution in the 81.25% (26/32) tumors (Fig. 2C), except PRAD, SKCM, TGCT, THCA, THYM, and UCS. To explore potential factors, we analyzed the shape of the tumor survival curve, median follow-up time, and number of events as independent variables. The mu, sigma, Q parameters of generalized gamma distributions were used to characterize the survival curves. We included 5 variables (mu, sigma, Q, median follow-up time and number of events) for multivariable linear regression. Only median follow-up time and number of events showed significant positive coefficients in linear models fitted the area under the curve of the QQ plot and the proportion of p value better than random distribution (log10(proportion*30000)), respectively (Fig. 2D & E).
Scaled Schoenfeld residual plot (Fig. 2F-H) showed that the prognostic effect of RPL7A varied significantly over time in KIRC, KIRP, and LGG. For LGG, the residual plot showed a strong effect of RPL7A expression in the first 2 years, and then the effect tends to weaken.
3.2. Both hub and non-hub genes in the interaction networks possibly had non-proportional hazards in CPH models
We used network analysis to assess the importance of genes using PPI networks identified by STRING database. PPI networks consisted of 17,399 genes overlapped with TCGA RNA signatures and 496,557 edges with confidence greater than 0.6. In GBM, KIRC, LGG, LIHC, and READ, genes with non-proportional hazards in CPH models had higher betweenness, closeness, and degree, which were linked to hub positions (Fig. 3A-C). However, the above rele- vance was reversed in ACC, BLCA, HNSC, MESO, and SKCM. We next defined hub genes as possessing betweenness, closeness, and degree values above the median, and a total of 6357 hub genes were identified. We similarly found that hub genes had a higher proportion of non-proportional hazards in ACC, BLCA, ESCA, OV, HNSC, PCPG, MESO, SARC, STAD, SKCM and THYM, whereas more non-hub genes had non-proportional hazards in GBM, KICH, KIRC, LGG, LIHC, and READ (Table 1).
TCGA Pan-cancer (n=9,164, 32 transcriptomic cohorts)
Five main issues
What proportions of single-gene CPH models violate the PH assumption in different
How important are these non- proportional genes?
Are these genes very different across cohorts of the same tumors?
Which form of the time-interaction term works better in time-dependent CPH?
Does the time- dependent CPH change the significance of gene compared with the usual CPH?
Schoenfeld residual test of univariate CPH for each gene in the 32 cohorts
Gene interaction networks of all genes identified by STRING database
Five NSCLC transcriptomic datasets: TCGA LUAD (n = 513), TCGA LUSC (n = 498), GSE68465 (n = 442), GSE37745 (n = 196) and GSE50081 (n = 172)
Five functions of time (f(t) = t, log(t), t2, sqrt(t), et)
Time-dependent CPH vs. CPH
Network analysis: betweenness, closeness and degree
Time-dependent CPH for each gene
Significance change
Gene non- proportional hazard information
Schoenfeld residual test of univariate CPH for each gene in the 5 cohorts
Compare the node importance of proportional and non- proportional genes
Log likelihood and AIC
Comparison with previous clinical studies
Proportions of genes with non-proportional hazard in the 5 GO terms
Binary matrix (1, significant; 0 non- significant)
Genes with non-proportional hazard in at least 3 of 5 datasets
Five GO terms with important biological processes: DNA damage checkpoint, apoptotic process, T cell activation, DNA-dependent DNA replication and cell cycle process
Kappa coefficient
Enrichment analysis
The specific PPI networks in GBM and LGG were presented in Fig. 3D & E. Next, we investigated genes from 5 Gene Ontology (GO) terms[38] with important biological processes: DNA damage checkpoint, apoptotic process, T cell activation, DNA-dependent DNA replication, and cell cycle process (Fig. 3F). The 15/32 cancers had a proportion exceeding 10% of gene CPH models that violated the PH assumption in at least one important process. The above results indicated that both hub or non-hub genes had the potential to have non-proportional hazards.
3.3. The non-proportional hazards of genes varied widely across cohorts of the same tumors
To investigate the consistency of non-proportional hazards of genes in different datasets, we collected five NSCLC transcriptomic datasets containing TCGA LUAD (n = 500), TCGA LUSC (n = 491), GSE68465 (n = 442) [14], GSE37745 (n = 196) [15] and GSE50081 (n = 172) [16]. Violations of the PH assumption of CPH models fit- ted survival times in these datasets ranged from 3.3% to 22% (Fig. 4A). The matrix of Schoenfeld test P value was next converted to a binary matrix (1, significant; 0 non-significant). However,
overlapping patterns of the binary matrix in 5 datasets were scarce (Fig. 4B). Violations of the PH assumption across the 5 datasets were also highly inconsistent (Fig. 4C).
We next explored potential factors associated with non- proportionality. Using our unique strategy (see methods, Fig. 4D), patients were divided into the 2 groups for each non- proportional gene in CPH (patient A and patient B, see methods) and further clustered into different clusters. In TCGA LUAD, all genes had positive significant beta value of univariate Cox regres- sion in patient A group and negative value in patient B group (Fig. 4E). Moreover, the 4 patient clusters had differentiated pat- terns (Fig. 4F). Through the MCPcounter algorithm, we constructed the cellular scores of microenvironment. However, the non- proportionality related clusters were not related to microenviron- ment scores (Fig. 4G). Furthermore, these clusters also were not associated with clinical characters (gender, age, pathologic stage, treatment outcome of first course, Tab. S2). At the transcriptome level, the clusters also showed scarce differential genes (Fig. 4H). In the other 4 NSCLC cohorts (TCGA LUSC, GSE68465, GSE37745 and GSE50081), the non-proportionality clusters were also inde- pendent of clinical characters and microenvironment scores
A
Skewness Kurtosis
B
Genes
ACC
0.09
-1.19
-0009
0,0001
C
4
ACC
4
ESCA
LIHC
·
6
THYM
< 0.0001
PRAD
4
0
3.
3
0.00042
4
< 0.0001
< 0.0001
BLCA
0.34
-1.17
Number of tumors in which genes had non-proportional hazard
0
-log10(P)
-log10(P)
-log 10(P)
-log10(P)
-log10(P)
< 0.0001
.
A
9
BRCA
0.4
-1.12
N
1
1
N
1
CESC
0.1
-1.24
CO
0
0
0
0-
0-
+
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
CHOL
0.08
-1.23
-log 10(expected P)
-log10(expected P)
-log 10(expected P)
-log 10(expected P)
2
3
4
0
-log10(expected P)
1
2
3
4
Un
00
5
BLCA
4
GBM
6
LUAD
READ
·
4
UCEC
COAD
0.07
-1.21
V
I
-log10(P)
4
< 0.000
-log10(P)
< 0.0001
-log10(P)
< 0.0001
-log10(P)
< 0.0001
-log10(P)
< 0.0001
X
CO
4
3
3.
DLBC
0.42
-1.03
Y
00
2
A
ESCA
0.06
-1.19
à
1
1
2.
1
=
0-
0
0
GBM
0.22
-1.22
0
2
3
4
0
1
0
-log10(expected P)
2
3
0
0
1
-log 10(expected P)
-log 10(expected P)
2
1
-log 10(expected P)
2
3
0
4
0
-log10(expected P)
2
3
HNSC
0.13
-1.2
F
BRCA
.
Schoenfeld test of RPL7A in KIRC
*
5
LUSC
SARC
4
UCS
·
6
-log10(P)
< 0.0001
-log10(P)
HNSC < 0.0001
-log10(P)
4
KICH
0.33
-1.15
< 0.0001,
-log10(P)
< 0.0001
-log10(P)
< 0.0001 .**
3
6
.
P = 0.0017
A
3
N
KIRC
0.1
-1.24
Beta(t) of RPL7A
1
2
3
1
1
KIRP
0.16
-1.22
0
0
0
0
0
0
0
V
0
-log 10(expected P)
2
3
LGG
0.73
0.79
-log 10(expected P)
Y
2
“log 10(expected P)
0
-log 10(expected P)
2
3
0
-log10(expected P)
2
3
4
CESC
.
6
KICH
·
MESO < 0.0001
SKCM
·
UVM
·
LIHC
0.63
-0.92
3
3
-3
3
4
LUAD
0.18
-1.23
-log10(P)
< 0.0001
-log10(P)
< 0.0001
-log10(P)
-log10(P)
0.0077
-log10(P)
< 0.000
2’
2
2.
·
V
LUSC
0.34
-1.16
0.3 1.2 2 3 4.3 5.2 6.4 Time (years)
8.5
1
1.
1
MESO
0.15
-1.23
G
0
0
0
0-
-1
0
Schoenfeld test of RPL7A in KIRP
0
1
2
3
4
0
1
2
3
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
-log 10(expected P)
-log10(expected P)
-log 10(expected P)
-log 10(expected P)
-log10(expected P)
OV
0.28
-1.2
.
P = 0.0474
5
CHOL
.
KIRC
6
OV
.
STAD
D
4
4
PAAD
0.23
-1.24
Beta(t) of RPL7A
5.0
-log10(P)
< 0.0001
-log10(P)
< 0.0001
-log10(P)
< 0.0001
-log10(P)
< 0.0001
P.value
0.000306
0.001773
4
0.598391
0.532801
0.011691
3
4
3
PCPG
0.11
-1.19
2.5
2
2
.
2
0.05
-1.13
1
1
AUC
Coef
2.271156
-1.008588
-0.374214
PRAD
1.10791
0.032904
0.0
0
0-
0
0
READ
0.64
-0.87
0
-log 10(expected P)
2
3
0
-log10(expected P)
1
2
3
4
0
-log 10(expected P)
2
3
0
-log 10(expected P)
2
3
-2.5
-1.2
COAD
·
KIRP < 0.0001
·
0.21
5
5
PAAD
.
2.5
TGCȚ
Variables
Median years
Event number
SARC
..
4
< 0.0001
4
4
2.0
SKCM
0.05
-1.18
0.81 1.4 2.8 4 4.2 6.1 7.2 7.6
< 0.0001
< 0.0001
Time (years)
-log10(P)
3
-log10(P)
-log10(P)
3
-log10(P)
1.5.
mu
sigma
Q
A
S
STAD
0.18
-1.23
H
Schoenfeld test of RPL7A in LGG
2-
N
1
1
0.5
E
TGCT
0.13
-1.11
P = 0.0026
0
0
0
0.0
P.value
0.46922
0.34037
0.71117
0.00205 0.01450
2
0
-log 10(expected P)
2
3
4
0
0
1
THCA
0.25
-1.05
Beta(t) of RPL7A
-log10(expected P)
1
2
3
4
-log 10(expected P)
1
2
3
4
0
-log 10(expected P)
2
3
A
4
DLBC
ª
8
LGG
4
PCPG
THCA
proportion
THYM
0.21
-1.13
0
-log10(P)
< 0.0001
-
< 0.0001
< 0.0001
4
1
< 0.0001
Coef
0.169659
0.910744
0.109997
0.681126
Event number 0.01208
3
-log10(P)
6
-log10(P)
3
-log10(P)
3-
UCEC
0.08
-1.21
2
2
2.
R
0.06
-1.16
Variables
Median years
UCS
1
2
1
1
-4
8
0
0
0
0
UVM
0.25
-1.14
sigma
0.96 2.2
2 3.4 5 6.7 9.2 11
12
0
1
-log 10(expected P)
2
3
4
0
-log10(expected P)
1
2
3
4
0
1
2
3
4
0
1
2
4
Time (years)
-log 10(expected P)
-log 10(expected P)
mu
Q
(Fig. S2-5 & Tab. S3-6), which suggested that non-proportional hazards may be due to other unknown causes.
3.4. CPH with the time-dependent variables allowed for a better fit of OS
Introducing time-dependent variables was an effective method to analyze non-proportional hazards in CPH. Introducing one of the 5 time functions (f(t) = t, log(t), t2, sqrt(t), et) of the time-dependent variables increased the Log-likelihood value and reduced AIC of CPH in all tumors (Fig. 5A). The et function performed worse than the other 3 time functions. For gene CPH violated PH assumption,
the CPH with log(t) function had the highest proportions of maxi- mum Log-likelihood value in the 14/32 tumors (Table 2, Fig. 5B). For instance, among the 5,086*6 CPH models composed of 5,086 genes with non-proportional hazards in BRCA, 53.3% of the genes in log(t) function model had the largest Log-likelihood. Moreover, the proportions of maximum Log-likelihood and minimum AIC in different datasets were similar. However, CPH of the 7.6% gene CPH models that violated the PH assumption fitted better without time-dependent variable in TGCT. In general, for gene CPH violated the PH assumption, introducing log(t), t, and sqrt(t) functions had a superior performance in most tumors. In particular, for AIC, 60.2% non-proportional CPH models in THYM benefited from et.
A
Betweenness of genes in TCGA
25-
D
Gene interaction network in GBM
20
15
PH
PH
10
· Conform
Conform
· No
No
5
0
ACC
BLCA
BRCA
CESC
CHOL
COAD
DLBC
ESCA
GBM
HNSC
KICH
KIRC
KIRP
LGG
LIHC
LUAD
LUSC
MESO
OV
PAAD
PCPG
PRAD
READ
SARC
SKCM
STAD
TGCT
THCA
THYM
UCEC
UCS
UVM
E Gene interaction network in LGG
B
Closeness of genes in TCGA
1.20e-06
*
%
PH
1.18e-06
Conform
No
PH
1.16e-06
· Conform
· No
1.14e-06
F
Proportion of gene CPH model that violated PH
ACC
BLCA
0.04
0.04
0.04
0.08
0.04
ACC
BLCA
BRCA
CESC
CHOL
COAD
DLBC
ESCA
GBM
HNSC
KICH
KIRC
KIRP
LGG
LIHC
LUAD
LUSC
MESO
OV
PAAD
PCPG
PRAD
READ
SARC
SKCM
STAD
TGCT
THCA
THYM
UCEC
UCS
UVM
BRCA
0.13
0.12
0.16
0.11
CHOL
0.07
0.11
0.12
0.1
0.13
1
CESC
0.1
0.14
COAD
0.03
0.1
0.04
0.22 0.06
0.05
0.03
0.07
0.09
0.04
DLBC
0.06
0.05 0.05 0.12
0.04
0.06
0.03
0.07
ESCA
0.03
C
Degree of genes in TCGA
GBM
0.06
0.04 0.06
0.03
0.03
0.05
0.04
HNSC
0.29
0.03
0.06
0.42
0.24
0.5
0.1
0.05
0.1
0.04
0.08
0.04
12
KICH KIRC KIRP LGG
0.13 0.05
0.08
0.05
0.03
0.17 0.04
0.44
0.07
0.35
0.04 0.34 0.16
0.1 0.1
0.46 0.48
0.07
LIHC
LUAD
0.48
0.07
0.28
0.4
0.1
0.1 0.07
0.07
0.43
0.07
0
4
**
LUSC
MESO
0.16
0.04
0.08
0.07
0.03
0.24
0.08
0.15
8
OV
PAAD PCPG
0.11
0.11
0.07
0.1
0.2
0.07
0.05
0.05 0.07
0.09 0.09 0.06
0.1
PRAD
0.04 0.19
0.05
0.03
0.03
0.08
· Conform
SARC
0.29
0.01
0.06
PH
READ
0.03 0.22
0.1
0.44
-0.5
·
No
SKCM
STAD
TGCT THCA THYM
0.02 0.09
0.1
0.06
0.1
0.13 0.03
0.06
0.07
4
0.1
0.01
0.07
0.01
0.03
0.01
0.1
0.12
0
0
0.03
0.01
0
0.01
0.02
0.07
0.01
0.01
0.03
0.01
UCEC
0.02
UCS
0.07
0.02
0.04
0.02
0.04
0.03
0.07
0.03
-1
UVM
0.03
0.03
0.07
0.01
0
0.02
0.03
0
T cell
ACC BLCA
LUSC MESO
checkpoint
Apoptotic
process
activation
DNA-dependent
SKCM
THYM
DNA damage
DNA replication
Cell cycle process
BRCA
CESC
CHOL
COAD
DLBC
ESCA
GBM
HNSC
KICH
KIRC
KIRP
LGG
LIHC
LUAD
OV
PAAD
PCPG
PRAD
READ
SARC
STAD
TGCT
THCA
UCEC
UCS
UVM
Concordance regression (CON) provided by Dunkler et al. [29] was an alternative approach to survival analysis which was not constrained by PH assumption. We constructed CPH with time interaction term and CON models for each gene in the TCGA tumor datasets, and used C-index as the evaluation index. We selected the best performing time function in Table 2 for each tumor. More gene CON models were inferior to CPH models with time- interaction term, which may be influenced by high censoring rate of real datasets (Fig. S6).
The et and t2 performed better in TGCT, THCA, THYM and UVM, and these datasets were highly censored: 3% events in TGCT, 3.2% events in THCA, 6.8% events in THYM, 27.8% events in UVM. In order to explore the factors that influence the performance of the time function, we performed 40 simulation experiments (8 sur- vival curve shapes*5 event rates for 10,000 genes*100 patients, see methods, Fig. S7). We found that highly censored datasets
had unusually lower proportions of the log(t) and sqrt(t) function with minimum AIC (Fig. S8 & Tab. S7).
3.5. CPH with the time-dependent variables changed the significance of log HR of the genes with non-proportional hazards
Whether the introduction of time-dependent variables changed the significance of gene variables in CPH was next examined. There were 2 parameters (ß and 8) in time-dependent CPH model (see methods). For ß parameter (log HR), the changes of significance in non-proportional hazard genes with different time-dependent functions in BRCA were presented in Fig. 6A. The CPH with sqrt (t) function had the highest proportion (74.5%) of significant changes of log HR. The significant changes of gene log HR with a time-dependent variable of CPH in different tumors were shown in Tab. S8-12. For CPH model with t function, overall 39.8% of
A
Proportion of non-proportional in 5
B
Proportionality
C
Kappa matrix
0.25
Non-proportionality
0.22
0.19
TCGA LUAD
0.04
0.02
0.02
0.01
1
0.20
NSCLC datasets
TCGA LUSC
0.8
0.15-
0.04
0.01
0
0.03
0.12
0.6
0.10
0.091
GSE68465
0.02
0.01
0.01
0.11
0.4
0.05-
0.033
GSE37745
0.02
0
0.01
0.07
0.2
0.00
TCGA LUAD
TCGA LUSC
GSE68465
GSE37745
GSE50081
GSE50081
0.01
0.03
0.11
0.07
0
GSE68465
GSE37745
TCGA LUSC
TCGA LUAD
GSE50081
TCGA LUAD
TCGA LUSC
GSE68465
GSE37745
GSE50081
D
Patients: Risk score Gene exp
F
G
Microenvironment score in TCGA LUAD
1,2,3, .. ,n
$
All, NS
group
Matrix[Gene Patient]
1
if (Risk score-median)*(Gene exp-median) ⇐ 0, value=0 (A)
1,2,3,.
2
7.5
T cells
3
CD8 T cells
Cytotoxic lymphocytes
if (Risk score-median)*(Gene
4
Value
5.0
B lineage
exp-median)>0, value=1 (B)
NK cells
Monocytic lineage
Myeloid dendritic cells
Hierarchical clustering
2.5
Neutrophils
Endothelial cells
Fibroblasts
Clinical and microenviron- ment analysis
0.0
#
1
2
3
4
group
Beta of Cox regression in patientB
E
TCGA LUAD
0
All, P < 0.05
H
Differentially expressed genes of clusters in TCGA LUAD
Cluster 1 vs. others
Cluster 2 vs. others
Cluster 3 vs. others
Cluster 4 vs. others
..
-5
1.5
P = 0.05
1.5
1.5
P = 0.05
P = 0.05
P = 0.05
1.0-
-log10(adj.P)
-log10(adj.P)
-log10(adj.P)
-log10(adj.P)
-10-
1.0
1.0
1.0
0.5
0.5
0.5
0.5
-15
0.0
0.0
0.0
0.0
1
2
4
Beta of Cox regression in patientA
3
-0.5
0.0
0.5
-1.0
-0.5
0.0
0.5
-0.5
0.0
0.5
1.0
-0.5
0.0
0.5
log2(FC)
log2(FC)
log2(FC)
log2(FC)
CPH models changed the significance of log HR, 46.4% of the non- prognostic genes became significantly prognostic genes after intro- ducing the time-interaction term in CPH, and 21.2% of the signifi- cantly prognostic genes became non-prognostic genes. In terms of other time functions (t, log(t), t2, sqrt(t), and et), there were sig- nificant changes log HR in 15.3%, 22.4%, 52.3%, and 10.8% CPH mod- els violated PH assumption, respectively. Furthermore, pooling results across all time functions, a total of 31.9% genes’ log HR changed significance, including 27.7% non-significant genes became significant, while 4.2% genes were opposite. These results suggested that non-time-dependent CPH may underestimate the
number of prognosis related transcriptome genes. For ô parameter, overall 26.5%, 70.7%, 56.8%, 42.3%, 76.5% and 8.8% of CPH models with non-proportional genes had significant time interaction term with t, ln(t), t^2, sqrt(t) and e^t functions, respectively (Tab. S13).
Next, we tentatively illustrated with 2 specific examples that these significant changes in time-dependent CPH might be com- patible with factual evidence. BCL2 was a BCL family member that regulated apoptosis. A Meta-analysis [39] included 17 studies with over 100 cases of breast cancer that showed that positive BCL2 was significantly linked to a favourable prognosis. In this study, after the introduction of the time-gene interaction term, BCL2 expres-
A
ACC
ACC
BLCA
BLCA
BRCA
BRCA
CESC
CESC
CHOL
CHOL
40
T
6
30
6
I
Relative AIC
Relative Loglik
6
Relative AIC
Relative Loglik
30
J
20
4
20
Relative AIC
Relative Loglik
Relative AIC
Relative Loglik
Relative AIC
1
Relative Loglik
S
4
20
A
20
4
15
w
8
N
N
10
N
10-
7
10
N
10
10
5
-
O
0
L
0
0
0
0
1
Na-
to
A
Ct
9
1
Cs
1
E
6
0
%
N
Ca
4.
Cm
6
1
2
3
4
5
6
0
N
Co-
4.
1
2
6
1
5
6
0
Cm
Cy
S
N
co
.
6
1
2
3
4
5
6
0
L
V
CU
A
5
6
N
1
2
3
4
5
6
0
COAD
DLBC
DLBC
ESCA
ESCA
GBM
GBM
HNSC
HNSC
1
20
5-
16
25
M
A
4
J
4
6
Relative AIC
20
4
Relative Loglik
Relative AIC
1
Relative Loglik
4
Relative AIC
Relative Loglik
Relative AIC
20
Relative Loglik
4
Relative Loglik
15
12
Relative AIC
€
15
G
20
00
G
-
N
10
N
8
2
S
10
A
10
₪
5
1
4
y
5
-
a
0
0
1
0
0
L
L
0
L
1
2
3
A
D
6
2
0
K
Un
8
0
O
1
2
3
A
M
6
1
2
3
4
ES
6
0
2
y
4
E
O
1
2
Co
0
B
N
Ga
A
8
1
2
6
A
Ch
8
N
00
A
5
3
1
3
4
5
8
KICH
KICH
KIRC
KIRC
KIRP
KIRP
G
LGG
LIHC
LIHC
30
10.0
.
6
10.0
50
Relative AIC
Relative Loglik
Relative AIC
40
40
Relative Loglik
7.5
Relative AIC
40
Relative Loglik
7.5
Relative AIC
90
Relative Loglik
20
Relative AIC
Relative Loglik
20
4
30
5.0
30
5.0
15
30
7.5
60
20
20
10
20
5.0
10
N
2.5
2.5
10
0.0
10
30
5
10
2.5
0
0
L
1
2
A
3
5
1
0
0
0.0
L
0
0
L
0
0.0
0
V
4
2
6
1
2
A
5
6
1
2
e
2
V
Ca
6
1
V N
4
CA
C
V
a
4
5
3
1
M
A
5
3
1
V
6.
6
1
4
0
W
6
LUAD
LUAD
LUSC
LUSC
MESO
MESO
OV
OV
PAAD
PAAD
40
A
6
20
4
25
Y
5-
30
Relative AIC
Relative AIC
S .
30
Relative AIC
Relative Loglik
Relative Loglik
Relative AIC
Relative Loglik
20
Relative Loglik
8
4
Relative Loglik
30
15
€
Relative AIC
20
15
3
20
A
4
20
0
N
N
N
N
10
N
10
5
V
10
10
0
0
5
4
L
L
0
L
0
0
M
2
4
5
6
1
2
A
5
6
0
1
2
Q
00
1
V
U
e
5
6
0
L
2
A
6
V
4
4
9
3
0
2
n
3
1
U
4
5
6
0
2
V
5
1
5
Q
S
6
2
6
PCPG
PCPG
PRAD
PRAD
READ
READ
SARC
SARC
SKCM
SKCM
5
20
6
30
I
0
6
20
Relative AIC
15
I
Relative Loglik
4
30
Relative AIC
15
Relative Loglik
Relative AIC
Relative Loglik
Relative AIC
Relative Loglik
Relative AIC
15
Relative Loglik
4
-
Ca
CA
4%
20
A
10
20
2
0
10
₪
N
N
5
-
2
6
5
10
5
-
0
0
0
0
-
0
O
1
2
4
&
5
6
1
2
A
6
6
0
2
S
8
1
N
A
4.
S
2
O
E
%
6
1
0
~
R
4
9
N
da
S,
6
6
1
4
S
9
B
A
3.
4
A
5
6
1
V
2
3
6
STAD
STAD
TGCT
TGCT
THCA
THCA
THYM
THYM
UCEC
UCEC
4
1
20
I
20
5
4
I
Y
4
Relative AIC
Relative Loglik
Relative AIC
10
Relative Loglik
4
Relative AIC
15
Relative Loglik
20
15
Relative AIC
Relative Loglik
15
Relative Loglik 9
4
3
Relative AIC
NJ
3
15
C
10
0
10
NO
N
2
0
N
5
-
C
-
G
-
5
5
0
0
O
1
0
0
0
1
O
0
0
V
4
4
5
B
1
V
u
a
6
6
V
Co-
A
CA
6
1
4
1
6
4
00
Un
00
A
CA
C
4
CD
S NJ
-
Go
4.
00
1
2
És
K
Gh
6
U
4
4
Cn
6
1
0
E
A
5
6
UCS
UCS
UVM
UVM
Loglik
1.00-
0.75-
No t
0.50-
20
4
0.25-
t
Relative Loglik
30
Relative Loglik
O
0.00
Relative AIC
CA
Relative AIC
log(t)
1.00 -
15
t
20
+
AIC
0.75
0,50
10
2
25
sqrt(t
V
10
2
0.00
5
ACC
BLCA
BRCA
CESC
CHOL
COAD
DLBC
ESCA
GBM
HNSC
KICH
KIRC
KIRP
LGG
LIHC
LUAD
LUSC
MESO
PAAD
PCPG
PRAD
READ
SARC
SKCM
THCA
THYM
UCEC
0
AO
STAD
TGCT
-
0
C
UCS
UVM
exp
0
E
N
w
A
6
6
1
2
G
Un
3
S
V
w.
A
VI
00
1
A
0
+
0
GP
sion became a significant prognostic factor, and its positive prog- nostic effects decreased in the first 4 years (Fig. 6B & C, Fig. S9), which was consistent with the previous study [40]. Another exam- ple was VEGFA, an important gene inducing angiogenesis and lym- phangiogenesis, which was identified as an unfavourable prognostic factor in BRCA [41]. In this study, VEGFA also had a non-proportional effect in BRCA (Fig. 6D, Fig. S10), it was a signif- icant prognostic factor in time-dependent CPH rather than general CPH (Fig. 6E).
4. Discussion
Using the 32 TCGA cohorts, we investigated non-proportional hazards of transcriptomic data. An average of 8.5% of single-gene CPH models violated the PH assumption, and 66.2% of the models
violated the PH assumption in at least one tumor, suggesting that non-proportional hazards of transcriptome should not be over- looked. Furthermore, according to the network analysis of PPI, both hub and non-hub genes potentially had non-proportional hazards. In at least one of the 5 important biological processes, more than 10% of the genes in 46.9% of cancers had non-proportional hazards in CPH models.
Although the comprehensive investigation of non- proportionality of gene-based CPH was lacking, some clinical stud- ies provided examples of genes with non-proportional hazards. Werdyani et al. found that GUSBP1 and PDLIM3 copy number vari- ations were significantly associated with prognosis only within the first 3 years after diagnosis in COADREAD [42]. Moreover, Candido- dos-Reis et al. found that patients with BRCA1 mutation had better short-term survival, this prognostic advantage was lost over time and converted to the opposite direction after 5 years [43].
A
t
log(1)
t2
sqrt(1)
et
2500
2500
2500
2500
2500
14.7%
NS
2000
43.3%
NS
2000
34.6%
NS
2000
2000
2000
1500
NS
1500
NS
1500
NS
71.6%
NS
1500
NS
72.3%
1500
NS
86.2%
NS
43.7%
52.4%
Sig
1000
1000
1000
1000
1000
Sig
Sig
2.2%
500
2.8%
500
2.7%
500
15.4%
2.7%
500
500
0.8%
0.9%
Sig
Sig
10.2%
Sig
10.3%
Sig
10.3%
Sig
10.8%
Sig
12.1%
Sig
0
0
0
0
CPH
Time-depen- dent CPH
Time-depen- dent CPH
Time-depen- dent CPH
0
CPH
CPH
CPH
Time-depen-
CPH
Time-depen- dent CPH
dent CPH
B
Schoenfeld test
C
geneval
(N=1077)
0.91
(0.81- 1)
0.104
2
P = 0.0001
# Events: 151; Global p-value (Log-Ranky) -0-1054.
Beta(t) for BCL2
BCL2 CPH
Time-fixed
AIC: 1708.33; Concordance Index: 0.61
0.9
0.95
1
1
··
Time-dependent
geneval
(N=1077)
0.66
(0.55 - 0.8)
<0.001 **
0
-1
tt(geneval)
(N=1077)
1.34
(1.16 - 1.5)
<0.001 **
3.8
9.4 10
12 18
# Events: 151; Global p-value (Log-Rank): 5.1577e-05 AIC: 1693.21; Concordance Index: 0.63 0.6
1.8
5.9
7.6
0.8
1
1.2 1.4 1.6
years
D
Schoenfeld test
E
geneval
(N=1077)(0.93 -1.3}
0.261
P = 0.026
Beta(t) for VEGFA
VEGFA CPH
Time-fixed
# Events: 151; Global p-value (Log-Rank): d, 2634
2
AIC: 1709.7; Concordance Index: 0.57
,95
1.05
1.15 1.2 1.25
.35
Time-dependent
geneval
(N=1077)
1.4
(1.08 - 1.81)
0.01 **
0
-2
tt(geneval)
(N=1077)
0.8
(0.67 - 0.96)
0.019 *
1.8
3.8
5.9
7.6
9.4 10
12 18
# Events: 151; Global p-value (Log-Rank): 0.034938
0.8
1
1.2 1.4 1.6 1.8
years
AIC: 1706.24; Concordance Index: 0.6
Time-dependent CPH contributed to assessing non-proportional hazards and quantifying hazards of the gene over time. In this study, we compared the 5-time functions (t, log(t), t2, sqrt(t), et) in time-dependent CPH. The log(t), and sqrt(t) functions performed well in most tumors. Other strategies also could account for non- proportionality. For instance, perform CPH by stratifying time- dependent variables [44], however, gene expression was not the stratified categorical variable. Moreover, the stratified CPH was linked to lower power than the general CPH. Another strategy was to use intervals of time to code time-dependent variables. Quantin et al. proved that this piecewise model depended on the number of time intervals and the specific parametric function [45]. Furthermore, some non-Cox models could also consider time-varying effects. Accelerated failure time model was a para- metric model without considering the PH assumption. Some stud- ies proved the effectiveness of accelerated failure time model in survival analysis [46,47], but the model needed to specify the prob- ability distribution of survival times. Moreover, additive model and regression splines model were also the flexible and interpretable prognostic models [48,49]. Still, CPH remained the most commonly used prognostic model.
The non-proportional risk of genes meant that the survival curve may cross under a certain threshold. Therefore, the results of Cox regression with time-interaction terms should be inter- preted carefully. The optimal grouping threshold of variables under nonproportional risk needs further studies.
In this study, we found that the genes’ CPH model that violated the PH assumption in 5 NSCLC datasets had a very low kappa coef- ficient (all < 0.2), suggesting that the non-proportionality of genes in CPH models was dataset-dependent. The different time-varying effects of genes might be explained by the heterogeneity of data- sets. TCGA was next-generation sequencing, while GEO data were microarrays based on different platforms (GSE68465, Affymetrix Human Genome U133A Array; GSE37745, Affymetrix Human Gen- ome U133 Plus 2.0 Array; GSE50081, Affymetrix Human Genome U133 Plus 2.0 Array). Moreover, the clinical baseline might be unbalanced in the 5 datasets. According to our results, the PH assumption should be tested for gene expression data, even if this gene had been tested in previous studies.
This study had some limitations. The main results of this study were based on TCGA cohorts, and there was still no investigation of other large cohorts. We did not investigate the non-proportionality
of genes under multivariable CPH. In addition, the patient clusters of non-proportional risk were not associated with clinical features and microenvironment infiltration. The reason for non- proportionality of genes in CPH models needed further study.
In this study, we didn’t perform multiple testing in Schoenfeld residual test. Transcriptomic data were ultrahigh dimensional data characterized by the dimension being much larger than the sample size. The RNAs included in this study were greater than 30000, and significant genes obtained using BH correction were very rare (zero in 22/32 tumors, Tab. S14). There were extensive regulatory rela- tionships and collinearity among genes, and perhaps not all tests were independent. In addition, this paper was an exploratory study, strict adjustment will reduce the sensitivity.
In summary, non-proportional hazards were widespread across the transcriptome, and the tests of PH assumption in the transcrip- tome were not only the statistical premise of CPH but also facili- tated the exploration of specific temporal patterns of gene hazards.
5. Conclusions
This study was original to investigate the non-proportionality of genes in CPH models fitted survival times using transcriptomic data in TCGA pan-cancer cohorts. Non-proportional hazards were widespread across transcriptomic genes. Introducing the time- gene interaction term improved model performance and changed the significance of gene variables in CPH model.
Data availability
The data underlying this article are available in GEO Database at https://www.ncbi.nlm.nih.gov/geo/, and can be accessed with accession number: GSE68465, GSE37745, GSE50081, and Xena, at https://xenabrowser.net/datapages/?cohort = TCGA%20Pan-Can- cer%20(PANCAN).
Funding
This work was supported by National Natural Science Founda- tion of China (81773236, 81800429 and 81972852), Key Research & Development Project of Hubei Province (2020BCA069), Nature Science Foundation of Hubei Province (2020CFB612), Health Com- mission of Hubei Province Medical Leading Talent Project, Young and Middle-Aged Medical Backbone Talents of Wuhan (WHQG201902), Application Foundation Frontier Project of Wuhan (2020020601012221), Zhongnan Hospital of Wuhan University Science, Technology and Innovation Seed Fund (znpy2019001, znpy2019048, and ZNJC201922), Medical Sci-Tech Innovation Plat- form of Zhongnan Hospital, Wuhan University (PTXM2019026), and Chinese Society of Clinical Oncology TopAlliance Tumor Immune Research Fund (Y-JS2019-036).
Authors’ contributions
ZZ, YGong and CX designed this study. ZZ, YGao, JL and GZ col- lected TCGA and GEO data. ZZ, YGao, SS and QW analyzed all the data. ZZ and YGao drafted the manuscript. YGong and CX revised the manuscript and supervised this study. All authors read and approved the final manuscript.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.csbj.2022.01.004.
References
[1] R. Stark M. Grzelak J. Hadfield RNA sequencing: the teenage years. 20 11 2019 631 656
[2] Chen W, Ou M, Tang D, Dai Y (2020) Identification and Validation of Immune- Related Gene Prognostic Signature for Hepatocellular Carcinoma. 2020: 5494858.
[3] Teng H, Mao F, Liang J, Xue M, Wei W, Li X, et al. Transcriptomic signature associated with carcinogenesis and aggressiveness of papillary thyroid carcinoma. Theranostics 2018;8(16):4345-58.
[4] Zhou J-G, Zhao H-T, Jin S-H, Tian Xu, Ma Hu. Identification of a RNA-seq-based signature to improve prognostics for uterine sarcoma. Gynecol Oncol 2019;155 (3):499-507.
[5] Cox D. Regression Models and Life Table. J Roy Stat Soc B 1972;34.
[6] Hess KR. Graphical methods for assessing violations of the proportional hazards assumption in Cox regression. Stat Med 1995;14(15):1707-23.
[7] Xue X, Xie X, Gunter M, Rohan TE, Wassertheil-Smoller S, Ho GYF, et al. Testing the proportional hazards assumption in case-cohort analysis. BMC Med Res Methodol 2013;13(1). https://doi.org/10.1186/1471-2288-13-88.
[8] Mathoulin-Pelissier S, Gourgou-Bourgade S, Bonnetain F, Kramar A. Survival end point reporting in randomized cancer clinical trials: a review of major journals. J Clin Oncol 2008;26(22):3721-6.
[9] Bellera CA, MacGrogan G, Debled M, de Lara CT, Brouste V, Mathoulin-Pélissier S. Variables with time-varying effects and the Cox model: some statistical concepts illustrated with a prognostic factor study in breast cancer. BMC Med Res Methodol 2010;10(1). https://doi.org/10.1186/1471-2288-10-20.
[10] Shintani AK, Girard TD, Eden SK, Arbogast PG, Moons KGM, Ely EW. Immortal time bias in critical care research: application of time-varying Cox regression for observational cohort studies. Crit Care Med 2009;37(11):2939-45.
[11] Altman DG, De Stavola BL, Love SB, Stepniewska KA. Review of survival analyses published in cancer journals. Br J Cancer 1995;72(2):511-8.
[12] Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45 (10):1113-20.
[13] Goldman MJ, Craft B, Hastie M, Repečka K, McDade F, Kamath A, et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol 2020;38(6):675-8.
[14] Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008;14:822-7.
[15] Botling J, Edlund K, Lohr M, Hellwig B, Holmberg L, Lambe M, et al. Biomarker discovery in non-small cell lung cancer: integrating gene expression profiling, meta-analysis, and tissue microarray validation. Clin Cancer Res 2013;19 (1):194-204.
[16] Der SD, Sykes J, Pintilie M, Zhu C-Q, Strumpf D, Liu Ni, et al. Validation of a histology-independent prognostic gene signature for early-stage, non-small- cell lung cancer including stage IA patients. J Thorac Oncol 2014;9(1):59-64.
[17] Gautier L, Cope L, Bolstad BM, Irizarry RA. affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004;20(3):307-15.
[18] DAVID SCHOENFELD Partial residuals for the proportional hazards regression model Biometrika 69 1 1982 239 241
[19] Grambsch PM, Therneau TM (1994) Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81: 515-526
[20] Fisher LD, Lin DY. Time-dependent covariates in the Cox proportional-hazards regression model. Annu Rev Public Health 1999;20(1):145-57.
[21] Andersen PK, Gill RD. Cox’s Regression Model for Counting Processes: A Large Sample Study. The Annals of Statistics 1982;10:1100-20.
[22] COX DR. Partial likelihood. Biometrika 1975;62(2):269-76.
[23] Sakamoto Y, Kitagawa G (1986) Akaike information criterion statistics.
[24] Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, et al. (2017) The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. 45: D362-d368.
[25] Freeman LC. Centrality in social networks: Conceptual clarification. Social Networks 1978;1(3):215-39.
[26] Murtagh F, Legendre P. Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion? J Classif 2014;31(3):274-95.
[27] Becht E, Giraldo N, Lacroix L, Buttard B, Elarouci N, et al. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biol 2016;17.
[28] M.E. Ritchie B. Phipson D.i. Wu Y. Hu C.W. Law W. Shi et al. limma powers differential expression analyses for RNA-sequencing and microarray studies 43 7 2015 2015 e47 e47
[29] D. Dunkler M. Schemper G. Heinze Gene selection in microarray survival studies under possibly non-proportional hazards 26 6 2010 2010 784 790
[30] Johnson NL, Kotz S. Distributions In Statistics Continuous Univariate Distributions - 2. Advances in Mathematics 1974;26:327.
[31] Cox C, Chu H, Schneider MF, Muñoz A. Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. Stat Med 2007;26(23):4352-74.
[32] PRENTICE RL. A log gamma model and its maximum likelihood estimation. Biometrika 1974;61(3):539-44.
[33] Marsaglia G, Tsang WW, Wang J. Evaluating Kolmogorov’s Distribution. J Stat Softw 2003;8:1-4.
[34] Abramowitz M. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications Inc .; 1974.
[35] Conger AJ. Integration and generalization of kappas for multiple raters. Psychol Bull 1980;88(2):322-8.
[36] Rossini A, Tierney L, Li N (2012) Simple Parallel Statistical Computing in R. Journal of Computational and Graphical Statistics 16: 399-420.
[37] Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc: Ser B (Methodol) 1995;57:289-300.
[38] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25-9.
[39] Hwang K-T, Han W, Kim J, Moon H-G, Oh S, Song YS, et al. Prognostic Influence of BCL2 on Molecular Subtypes of Breast Cancer. J Breast Cancer 2017;20 (1):54. https://doi.org/10.4048/ibc.2017.20.1.54.
[40] Callagy GM, Pharoah PD, Pinder SE, Hsu FD, Nielsen TO, Ragaz J, et al. Bcl-2 Is a Prognostic Marker in Breast Cancer Independently of the Nottingham Prognostic Index. Clin Cancer Res 2006;12(8):2468-75.
[41] Mohammed RAA, Green A, El-Shikh S, Paish EC, Ellis IO, Martin SG. Prognostic significance of vascular endothelial cell growth factors -A, -C and -D in breast cancer and their relationship with angio- and lymphangiogenesis. Br J Cancer 2007;96(7):1092-100.
[42] Werdyani S, Yu Y, Skardasi G, Xu J, Shestopaloff K, et al. (2017) Germline INDELs and CNVs in a cohort of colorectal cancer patients: their characteristics, associations with relapse-free survival time, and potential time-varying effects on the risk of relapse. 6: 1220-1232.
[43] Candido-dos-Reis FJ, Song H, Goode EL, Cunningham JM, Fridley BL, Larson MC, et al. Germline mutation in BRCA1 or BRCA2 and ten-year survival for women diagnosed with epithelial ovarian cancer. Clin Cancer Res 2015;21(3):652-7.
[44] Therneau T, Grambsch P (2013) Modeling Survival Data: Extending the Cox Model.
[45] Quantin C, Abrahamowicz M, Moreau T, Bartlett G, Mackenzie T, Adnane Tazi M, et al. Variation Over Time of the Effects of Prognostic Factors in a Population-based Study of Colon Cancer: Comparison of Statistical Models. Am J Epidemiol 1999;150(11):1188-200.
[46] Zare A, Hosseini M, Mahmoodi M, Mohammad K, Zeraati H, et al. A Comparison between Accelerated Failure-time and Cox Proportional Hazard Models in Analyzing the Survival of Gastric Cancer Patients. Iran J Public Health 2015;44:1095-102.
[47] Iraji Z, Koshki TohidJafari, Dolatkhah R, Jafarabadi MohammadAsghari. Parametric survival model to identify the predictors of breast cancer mortality: An accelerated failure time approach. J Res Med Sci 2020;25 (1):38. https://doi.org/10.4103/irms.IRMS 743 19.
[48] Pang M, Platt RW, Schuster T, Abrahamowicz M (2021) Spline-based accelerated failure time model. 40: 481-497
[49] Su P-F. Power and sample size calculation for the additive hazard model. J Biopharm Stat 2017;27(4):571-83.