Semi-Random Forest Classification Based on Closed Frequent Pattern for Data Streams
Abstract: To address noise and concept drift in data streams, a semi-random forest classification algorithm for data streams, Semi-Random Forest based on Closed Frequent Pattern (SRFCFP), is proposed. SRFCFP represents the incoming data stream with closed frequent patterns, which removes redundant information and noise and highlights the characteristics of the data. A semi-random forest is then built on this representation as the classification model, and a pattern set update mechanism based on a time decay model is used to cope with the unbounded nature of the stream. To detect concept drift and adapt to it promptly, a difference measure between pattern sets is introduced, which uses the mined patterns to measure changes in the data distribution. Experiments were run on the MOA platform using both real-world and synthetic datasets. The results show that SRFCFP achieves higher average accuracy than the compared algorithms and handles concept drift and noise in data streams effectively.
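The three pattern-level ideas named in the abstract (closed frequent pattern mining, a time-decayed pattern set update, and a difference measure between pattern sets) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption for illustration only: the brute-force miner is suitable only for tiny windows, the multiplicative `decay` factor stands in for the time decay model, and Jaccard distance stands in for the pattern set difference measure. All function names are hypothetical and do not come from the paper or from MOA.

```python
from collections import defaultdict
from itertools import combinations


def closed_frequent_patterns(transactions, min_support):
    """Brute-force closed frequent itemset mining (tiny windows only)."""
    counts = defaultdict(int)
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every non-empty subset of the transaction once.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    closed = {}
    for pattern, support in frequent.items():
        # A pattern is closed if no proper superset has the same support.
        has_equal_superset = any(
            pattern < other and support == other_support
            for other, other_support in frequent.items()
        )
        if not has_equal_superset:
            closed[pattern] = support
    return closed


def decay_update(pattern_weights, new_patterns, decay=0.9):
    """Assumed time decay: shrink old weights, then add new supports."""
    updated = {p: w * decay for p, w in pattern_weights.items()}
    for pattern, support in new_patterns.items():
        updated[pattern] = updated.get(pattern, 0.0) + support
    return updated


def pattern_set_difference(patterns_a, patterns_b):
    """Assumed difference measure: Jaccard distance between pattern sets."""
    keys_a, keys_b = set(patterns_a), set(patterns_b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return 1.0 - len(keys_a & keys_b) / len(union)


# Two small windows drawn from visibly different distributions.
window_1 = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
window_2 = [{"c", "d"}, {"c", "d"}, {"c"}]
patterns_1 = closed_frequent_patterns(window_1, min_support=2)
patterns_2 = closed_frequent_patterns(window_2, min_support=2)
drift_score = pattern_set_difference(patterns_1, patterns_2)
```

In this sketch a drift detector would compare `drift_score` against a threshold (also an assumption) and rebuild or update the classifier when it is exceeded; the two windows above share no closed patterns, so the score takes its maximum value.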