Abstract

IR-777: (2011) Huston, S., Moffat, A. and Croft, W. B., "Efficient Indexing of Repeated n-Grams," Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 127-136.

The identification of repeated $n$-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated $n$-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers a useful tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of $n$, and provides a uniform workload balance across the set of available processors.