MM-807: (2011) Yalniz, I., Can, E. and Manmatha, R., "Partial Duplicate Detection for Large Book Collections," Proceedings of the 20th Conference on Information and Knowledge Management (CIKM 2011), pp. 469-474.
A framework is presented for discovering partial duplicates in large
collections of scanned books with optical character recognition (OCR)
errors. Each book in the collection is represented by the sequence of
words (in the order they appear in the text) that appear only once in
the book. These words are referred to as "unique words" and constitute
a small percentage of all the words in a typical book. Together with
the order information, the set of unique words provides a compact
representation that is highly descriptive of the content and the flow
of ideas in the book. By aligning the sequences of unique words from
two books using the longest common subsequence (LCS), one can discover
whether the two books are duplicates.
Experiments on several datasets show that the proposed method, DUPNIQ,
is more accurate than traditional duplicate detection methods such as
shingling, and is fast. On a collection of 100K scanned English books,
DUPNIQ detects partial duplicates in 30 minutes using 350 cores, with
precision 0.996 and recall 0.833, compared to shingling with precision
0.992 and recall 0.720. The technique works on other languages as well
and is demonstrated on a French dataset.
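To make the idea concrete, the following is a minimal sketch of the unique-word representation and LCS alignment described in the abstract. It is not the authors' DUPNIQ implementation; the function names, the whitespace tokenization, and the normalization of the score by the shorter sequence length are illustrative assumptions.

```python
from collections import Counter

def unique_word_sequence(text):
    """Words occurring exactly once in the text, kept in reading order."""
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

def lcs_length(a, b):
    """Length of the longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def duplicate_score(text_a, text_b):
    """Fraction of the shorter unique-word sequence matched by the LCS;
    scores near 1.0 suggest the two books are (partial) duplicates."""
    ua, ub = unique_word_sequence(text_a), unique_word_sequence(text_b)
    if not ua or not ub:
        return 0.0
    return lcs_length(ua, ub) / min(len(ua), len(ub))
```

Because unique words are a small fraction of each book, the sequences being aligned are short relative to the full texts, which is what keeps the pairwise LCS comparison cheap even for large collections.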