TY - GEN
T1 - Mining approximate frequent itemsets in the presence of noise
T2 - Sixth SIAM International Conference on Data Mining
AU - Liu, Jinze
AU - Paulsen, Susan
AU - Sun, Xing
AU - Wang, Wei
AU - Nobel, Andrew
AU - Prins, Jan
PY - 2006
Y1 - 2006
N2 - Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional "exact" model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms. In this paper we propose a noise-tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset have a minimum number of supporting transactions. However, the AFI model tolerates a controlled fraction of errors in each item and each supporting transaction. Motivating this model are theoretical results (and a supporting simulation study presented here) which state that, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size; thus the itemsets cannot be recovered by a routine application of frequent itemset mining. By contrast, we provide theoretical results showing that the AFI criterion is well suited to the recovery of block structures subject to noise. We developed and implemented an algorithm to mine AFIs that generalizes the level-wise enumeration of frequent itemsets by allowing noise. We propose the noise-tolerant support threshold, a relaxed version of support, which varies with the length of the itemset and the noise threshold. We exhibit an Apriori property that permits the pruning of an itemset if any of its sub-itemsets is not sufficiently supported. Several experiments presented demonstrate that the AFI algorithm enables better recoverability of frequent patterns under noisy conditions than existing frequent itemset mining approaches. Noise-tolerant support pruning also yields an order-of-magnitude performance gain over existing methods.
AB - Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional "exact" model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms. In this paper we propose a noise-tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset have a minimum number of supporting transactions. However, the AFI model tolerates a controlled fraction of errors in each item and each supporting transaction. Motivating this model are theoretical results (and a supporting simulation study presented here) which state that, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size; thus the itemsets cannot be recovered by a routine application of frequent itemset mining. By contrast, we provide theoretical results showing that the AFI criterion is well suited to the recovery of block structures subject to noise. We developed and implemented an algorithm to mine AFIs that generalizes the level-wise enumeration of frequent itemsets by allowing noise. We propose the noise-tolerant support threshold, a relaxed version of support, which varies with the length of the itemset and the noise threshold. We exhibit an Apriori property that permits the pruning of an itemset if any of its sub-itemsets is not sufficiently supported. Several experiments presented demonstrate that the AFI algorithm enables better recoverability of frequent patterns under noisy conditions than existing frequent itemset mining approaches. Noise-tolerant support pruning also yields an order-of-magnitude performance gain over existing methods.
UR - http://www.scopus.com/inward/record.url?scp=33745443433&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33745443433&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972764.36
DO - 10.1137/1.9781611972764.36
M3 - Conference contribution
AN - SCOPUS:33745443433
SN - 089871611X
SN - 9780898716115
T3 - Proceedings of the Sixth SIAM International Conference on Data Mining
SP - 407
EP - 418
BT - Proceedings of the Sixth SIAM International Conference on Data Mining
Y2 - 20 April 2006 through 22 April 2006
ER -