Deriving the Distribution and Exploring the Utility of Partial R² in the Era of Big Data

Research output: Contribution to journal › Article › peer-review

Abstract

A central goal in the world of statistics and data science is the construction of linear regression models for continuous variables of interest. Often, our objective is to examine the impact of one or more explanatory variables, after adjusting for demographic covariates or other known or relevant factors. While the traditional approach is to use hypothesis testing to determine statistical significance, the p-values obtained are heavily dependent on sample size. This is particularly problematic for large datasets or “overpowered” studies, where even the tiniest of effects will appear to be highly significant. Computing capabilities and cloud-enhanced data sharing have revolutionized the way we use data worldwide, from healthcare and investments to manufacturing and retail. While machine learning and artificial intelligence are improving predictive analytics, we need better statistical inference to help understand and translate our models into meaningful and actionable insights. The coefficient of partial determination (or partial R²) is widely used in applied science to supplement hypothesis testing, but little work has been done to understand its statistical properties. In this work, we derive the complete distribution of partial R² and perform simulated and real-world data analyses to show the advantages of adding it to your next analysis of Big Data.
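To make the abstract's contrast concrete, the sketch below computes partial R² using the standard textbook definition, (SSE_reduced - SSE_full) / SSE_reduced, alongside the usual nested-model F statistic. It is an illustration under assumptions, not the paper's derivation or its analysis: the helper names (`sse`, `partial_r2`, `f_statistic`, `simulate`), the simulated data, and the effect size of 0.05 are all hypothetical choices made here.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X (X includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_r2(X_reduced, X_full, y):
    """Share of the reduced model's residual variation explained by the added terms."""
    sse_r = sse(X_reduced, y)
    sse_f = sse(X_full, y)
    return (sse_r - sse_f) / sse_r

def f_statistic(X_reduced, X_full, y):
    """Nested-model F statistic for the terms added to the reduced model."""
    n = len(y)
    q = X_full.shape[1] - X_reduced.shape[1]   # number of added terms
    df = n - X_full.shape[1]                   # residual degrees of freedom
    sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
    return ((sse_r - sse_f) / q) / (sse_f / df)

def simulate(n, seed=0):
    """Hypothetical data: one covariate z, one weak explanatory variable x."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * z + 0.05 * x + rng.normal(size=n)  # assumed tiny effect of x
    ones = np.ones(n)
    X_reduced = np.column_stack([ones, z])
    X_full = np.column_stack([ones, z, x])
    return partial_r2(X_reduced, X_full, y), f_statistic(X_reduced, X_full, y)

if __name__ == "__main__":
    for n in (1_000, 100_000):
        pr2, f = simulate(n)
        print(f"n={n}: partial R^2 = {pr2:.5f}, F = {f:.1f}")
```

In this setup the F statistic grows roughly in proportion to the sample size, so the test becomes ever more significant, while partial R² stays near the same small value. The two are linked by F = [partial R² / (1 - partial R²)] × (df / q), which is why a fixed, tiny partial R² can still yield a very small p-value when n is large.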

Original language: English
Pages (from-to): 115-128
Number of pages: 14
Journal: Journal of Statistical Theory and Applications
Volume: 23
Issue number: 2
State: Published - Jun 2024

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.

Keywords

  • Big data
  • Coefficient of partial determination
  • Linear regression
  • Partial R²
  • R²

ASJC Scopus subject areas

  • Statistics and Probability
  • Computer Science Applications
  • Applied Mathematics
