A Web Pages Near-replicas Detection Algorithm Based on Noise Reduction

  • Abstract: This paper proposes a near-replica (near-duplicate) web page detection algorithm based on noise removal. The algorithm rests on two ideas. First, each web page is represented in the vector space model using only (feature term, weight) pairs, which reduces the time and space complexity of near-replica detection. Second, a set of heuristic rules automatically removes the "noise" contained in a web page, so that irrelevant information does not interfere with the page's core content. Experimental results show that the algorithm is more effective than the existing near-replica detection algorithm used in search engines.
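The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the heuristic rules, the term weighting, or the similarity measure, so the short-block filter (`strip_noise`), the normalized term-frequency weights (`to_vector`), cosine similarity, and the values of `min_words`, `top_k`, and `threshold` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def strip_noise(blocks, min_words=5):
    """Hypothetical heuristic: drop text blocks with fewer than min_words words.

    Short blocks (navigation bars, copyright lines) are treated as noise.
    The paper's actual heuristic rules are not given in the abstract.
    """
    return [b for b in blocks if len(re.findall(r"\w+", b)) >= min_words]


def to_vector(blocks, top_k=20):
    """Represent a page as {feature term: weight} pairs.

    Assumption: the weight is the normalized term frequency, and only the
    top_k most frequent terms are kept as feature terms.
    """
    words = re.findall(r"[a-z0-9]+", " ".join(blocks).lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.most_common(top_k)}


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_near_replica(blocks_a, blocks_b, threshold=0.9):
    """Compare two pages after noise removal; threshold is an assumed value."""
    vec_a = to_vector(strip_noise(blocks_a))
    vec_b = to_vector(strip_noise(blocks_b))
    return cosine(vec_a, vec_b) >= threshold


core = "The search engine crawls web pages and builds an inverted index of terms"
page_a = ["Home | News | Contact", core, "Copyright 2005"]
page_b = ["Login | Register", core, "All rights reserved"]
page_c = [
    "Home | News | Contact",
    "Fresh tomatoes and basil make a simple sauce for summer pasta dishes",
    "Copyright 2005",
]

print(is_near_replica(page_a, page_b))  # same core text, different boilerplate
print(is_near_replica(page_a, page_c))  # same boilerplate, different core text
```

In this toy data, `page_a` and `page_b` share their core text but carry different navigation and footer blocks. Comparing the raw pages yields a similarity below the assumed threshold, while comparing them after `strip_noise` identifies them as near-replicas, which is the effect the abstract attributes to noise removal.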

     
