Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis

Jinze Liu, Susan Paulsen, Xing Sun, Wei Wang, Andrew Nobel, Jan Prins

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

36 Scopus citations

Abstract

Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional "exact" model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms have been addressed primarily through simulation studies, and there has been limited attention to the development of noise tolerant algorithms. In this paper we .propose a noise tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset has a minimum number of supporting transactions. However, the AFI model tolerates a controlled fraction of errors in each item and each supporting transaction. Motivating this model are theoretical results (and a supporting simulation study presented here) which state that, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size; thus the itemsets cannot be recovered by a routine application of frequent itemset mining. By contrast, we provide theoretical results showing that the AFI criterion is well suited to recovery of block structures subject to noise. We developed and implemented an algorithm to mine AFIs that generalizes the level-wise enumeration of frequent itemsets by allowing noise. We propose the noise-tolerant support threshold, a relaxed version of support, which varies with the length of the itemset and the noise threshold. We exhibit an Apriori property that permits the pruning of an itemset if any of its sub-itemset is not sufficiently supported. Several experiments presented demonstrate that the AFI algorithm enables better recoverability of frequent patterns under noisy conditions than existing frequent itemset mining approaches. Noise-tolerant support pruning also renders an order of magnitude performance gain over existing methods.

Original languageEnglish
Title of host publicationProceedings of the Sixth SIAM International Conference on Data Mining
Pages407-418
Number of pages12
DOIs
StatePublished - 2006
EventSixth SIAM International Conference on Data Mining - Bethesda, MD, United States
Duration: Apr 20 2006Apr 22 2006

Publication series

NameProceedings of the Sixth SIAM International Conference on Data Mining
Volume2006

Conference

ConferenceSixth SIAM International Conference on Data Mining
Country/TerritoryUnited States
CityBethesda, MD
Period4/20/064/22/06

ASJC Scopus subject areas

  • General Engineering

Fingerprint

Dive into the research topics of 'Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis'. Together they form a unique fingerprint.

Cite this