A comparative study of clustering methods on gene expression data for lung cancer prognosis

Jason Z. Zhang, Chi Wang

Research output: Contribution to journal › Article › peer-review


Lung cancer subtyping based on gene expression data is important for identifying patient subgroups with differing survival prognosis to facilitate customized treatment strategies for each subtype of patients. Unsupervised clustering methods are the traditional approach for clustering patients into subtypes. However, since those methods cluster patients based only on gene expression data, the resulting clusters may not always be relevant to the survival outcome of interest. In recent years, semi-supervised and supervised methods have been proposed, which leverage the survival outcome data to identify clusters more relevant to survival prognosis. This paper aims to compare the performance of different clustering methods for identifying clinically prognostic lung cancer subtypes based on two lung adenocarcinoma datasets. For each method, we clustered patients into two clusters and assessed the difference in patient survival time between clusters. Unsupervised methods were found to have large logrank p-values and no significant results in most cases. Semi-supervised and supervised methods had improved performance over unsupervised methods and very significant p-values. These results indicate that unsupervised methods are not capable of identifying clusters with significant differences in survival prognosis in most cases, while supervised and semi-supervised methods can better cluster patients into clinically useful subtypes.
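The evaluation step described in the abstract (split patients into two clusters, then compare survival between them with a log-rank test) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `logrank_two_groups` and the toy follow-up data are assumptions made for this example, and the cluster labels are taken as given rather than produced by any of the clustering methods compared in the paper.

```python
import math


def logrank_two_groups(times, events, labels):
    """Two-group log-rank test (hypothetical helper, standard library only).

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    labels : cluster label per patient, 0 or 1
    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    # Distinct times at which at least one event was observed.
    event_times = sorted({t for t, e in zip(times, events) if e == 1})

    obs_minus_exp = 0.0  # observed minus expected events in cluster 1
    variance = 0.0       # hypergeometric variance summed over event times

    for t in event_times:
        # Patients still at risk just before time t (overall and in cluster 1).
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, g in zip(times, labels) if ti >= t and g == 1)
        # Events occurring exactly at time t (overall and in cluster 1).
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(
            1 for ti, e, g in zip(times, events, labels)
            if ti == t and e == 1 and g == 1
        )
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups

    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data invented for illustration: cluster 0 has short survival times,
# cluster 1 has long ones, and every event is observed.
toy_times = [5, 6, 8, 9, 12, 20, 22, 25, 30, 31]
toy_events = [1] * 10
toy_labels = [0] * 5 + [1] * 5

chi2, p_value = logrank_two_groups(toy_times, toy_events, toy_labels)
print(f"chi-square = {chi2:.2f}, log-rank p = {p_value:.4f}")
```

A small p-value here means the two clusters differ in survival, which is the criterion the abstract uses to judge whether a clustering is clinically prognostic. In practice a maintained survival-analysis library would normally be preferred over a hand-written test.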

Original language: English
Article number: 319
Journal: BMC Research Notes
Issue number: 1
State: Published - Dec 2023

Bibliographical note

Publisher Copyright:
© 2023, The Author(s).


Keywords

  • Clustering
  • Comparison
  • Gene expression
  • Prognosis

ASJC Scopus subject areas

  • General Biochemistry, Genetics and Molecular Biology

