Duplicate Document Detection
If you have ever worked with text documents, there is a high probability that you’ve had to deal with duplicate content at some point. Manually scanning every document to detect duplicates is tedious and often ineffective.
This presentation focuses on designing a machine learning model that not only identifies similar documents based on their text similarity but also uses the documents’ metadata to confirm that a pair of similar documents is indeed a pair of duplicates, reducing the need for manual intervention in detecting duplicate documents. The solution is a two-step process: calculate the text similarity between two documents, then pass the pair to a classifier for a verdict.
A sneak-peek into the solution:
Text Similarity - Candidate Duplicates
When a new document enters the workflow, a cosine similarity score is calculated between the new document and every other document in the corpus. Document pairs whose similarity scores exceed a set threshold are kept as ‘Candidate Duplicates’. This filtering rests on an assumption: every pair that scores below the threshold is treated as not a duplicate.
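The filtering step above can be sketched as follows. This is only a minimal illustration, not the presenter's implementation: it assumes plain term-count vectors built with a whitespace tokenizer, and the function names, the sample corpus, and the 0.8 threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def candidate_duplicates(new_doc, corpus, threshold=0.8):
    """Return (doc_id, score) for corpus documents scoring above the threshold."""
    new_vec = term_frequencies(new_doc)
    results = []
    for doc_id, text in corpus.items():
        score = cosine_similarity(new_vec, term_frequencies(text))
        if score > threshold:  # pairs at or below the threshold are dropped
            results.append((doc_id, score))
    return results


# Hypothetical corpus: one near-identical document and one unrelated document.
corpus = {
    "doc-1": "quarterly earnings report for the first quarter",
    "doc-2": "annual shareholder meeting notice",
}
candidates = candidate_duplicates("Quarterly earnings report for the first quarter", corpus)
```

In practice the vectors might be TF-IDF weighted rather than raw counts, and the threshold would need tuning against labeled data; neither detail is specified in the abstract.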
These candidate duplicates are then passed to a trained classifier, which decides whether each pair is a duplicate based on its metadata features. Our classifier is a decision tree model trained on labeled duplicate and non-duplicate document pairs, along with their similarity scores and metadata features such as document lengths, filing dates, and document types.
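As a rough sketch of what the classifier's input might look like, the snippet below turns one candidate pair into a feature row. The field names (`length`, `filing_date`, `doc_type`), the choice of difference-based features, and the sample values are assumptions made for illustration; the abstract only names the kinds of metadata used.

```python
from datetime import date


def pair_features(doc_a, doc_b, similarity):
    """Build one feature row for a candidate pair from its score and metadata."""
    return [
        similarity,                                                # text similarity score
        abs(doc_a["length"] - doc_b["length"]),                    # difference in document length
        abs((doc_a["filing_date"] - doc_b["filing_date"]).days),   # days between filing dates
        1 if doc_a["doc_type"] == doc_b["doc_type"] else 0,        # same document type?
    ]


# Hypothetical metadata for a new document and an existing one.
new_doc = {"length": 1200, "filing_date": date(2020, 3, 2), "doc_type": "10-K"}
old_doc = {"length": 1180, "filing_date": date(2020, 3, 1), "doc_type": "10-K"}
row = pair_features(new_doc, old_doc, similarity=0.93)
```

Rows like this, paired with duplicate or non-duplicate labels, could then be used to fit a decision tree, for example with scikit-learn's `DecisionTreeClassifier`.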
Manasa Bulusu is a data scientist at S&P Global Market Intelligence, with a background in both business management and statistics. She has a master’s degree in data science & business analytics, and an MBA in Marketing & Finance. Manasa is a research enthusiast with a strong interest in the life sciences, and enjoys exploring machine learning, predictive analytics, and artificial intelligence, along with biostatistics, genomics, and computational neuroscience. Outside of work, she also enjoys photography, poetry, music, and travel.