Exploring topics in the field of data science by analyzing Wikipedia documents: A preliminary result

Yanyan Wang, Soohyung Joo, Kun Lu

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

This poster explores topics in the field of Data Science in Wikipedia documents using clustering, principal component analysis (PCA), and topic modeling. As a pilot study, we analyzed part of the Wikipedia document dataset to make an initial identification of the topics discussed in Data Science. Hierarchical clustering yielded six clusters of topics, while PCA identified eleven dimensions in the Data Science field. In addition, topic modeling based on latent Dirichlet allocation (LDA) produced fifty topics related to Data Science. We plan to further examine the hierarchical and structural relationships between topics using structural equation modeling and social network analysis. The findings will help clarify which topics are currently discussed in the area of Data Science.
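
The abstract does not describe how the clustering was implemented, so the following is only a minimal illustrative sketch of hierarchical (agglomerative) clustering over document text, not the authors' procedure. The toy documents, the bag-of-words representation, the cosine distance, the average-linkage rule, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerative(docs, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns a list of clusters, each a sorted list of document indices.
    """
    vectors = [tf_vector(d) for d in docs]
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per document
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], dist)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical toy documents standing in for Wikipedia article text.
docs = [
    "machine learning model training data",
    "deep learning neural network model",
    "database query index storage",
    "sql database storage engine",
]
clusters = agglomerative(docs, 2)
print(clusters)
```

On a real corpus, a library implementation with TF-IDF weighting and a dendrogram for choosing the number of clusters would be a more practical choice than this quadratic toy loop.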

Original language: English
Journal: Proceedings of the ASIST Annual Meeting
Volume: 51
Issue number: 1
State: Published - 2014

Keywords

  • Data science
  • Hierarchical clustering
  • Latent Dirichlet allocation
  • Principal component analysis
  • Structural equation modeling
  • Topic modeling
  • Wikipedia

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
