Duplicate Document Detection w/ Manasa Bulusu

Sep 19, 2017 · Richmond, United States of America

Duplicate Document Detection

If you have ever worked with text documents, there is a high probability that you’ve had to deal with duplicate content at some point. Manually scanning every document to detect duplicates is tedious and often ineffective.

This presentation focuses on designing a machine learning model that not only identifies similar documents based on their text similarity, but also uses the documents’ metadata to confirm that a pair of similar documents are indeed duplicates, thus reducing the need for manual intervention in detecting duplicate documents. The solution is a two-step process: calculating the text similarity between two documents, then passing the pair to a classifier for a verdict.

A sneak-peek into the solution:

Text Similarity - Candidate Duplicates
When a new document enters the workflow, a cosine similarity score is calculated between it and every other document in the corpus. Document pairs whose similarity scores exceed a set threshold are kept as ‘Candidate Duplicates’. This filtering means we accept that all pairs with similarity scores below the threshold are not duplicates.
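
As a rough illustration of the candidate-filtering step, here is a minimal Python sketch. It is not the speaker's implementation: the word-count vectors, the tokenizer, the function names, and the 0.8 threshold are all assumptions made for the example, since the abstract does not say how documents are vectorized or which threshold is used.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a simple bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def candidate_duplicates(new_doc, corpus, threshold=0.8):
    """Return (doc_id, score) for corpus documents scoring above the threshold.

    `corpus` maps a document id to its text; `threshold` is a hypothetical value.
    """
    new_vec = term_counts(new_doc)
    scored = (
        (doc_id, cosine_similarity(new_vec, term_counts(text)))
        for doc_id, text in corpus.items()
    )
    return [(doc_id, score) for doc_id, score in scored if score > threshold]
```

Calling `candidate_duplicates(new_text, {"doc-1": "...", "doc-2": "..."})` returns only the pairs worth sending on to the classifier; everything at or below the threshold is dropped at this stage.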

These candidate duplicates are then passed to a trained classifier, which classifies pairs of documents as duplicates based on their metadata features. Our classifier is a decision tree model trained on labeled duplicate and non-duplicate document pairs, along with their corresponding similarity scores and metadata features such as document lengths, filing dates, and document types.
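
To make the classifier's input more concrete, the sketch below builds one feature row for a candidate pair. The `Document` fields and the derived features (length ratio, days between filings, matching type) are hypothetical choices for illustration; the abstract names only the raw metadata (lengths, filing dates, document types) and does not say how it is encoded.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Hypothetical record layout; the field names are assumptions.
    text: str
    filing_date: date
    doc_type: str


def pair_features(a, b, similarity):
    """Build one feature row for a candidate pair of documents."""
    len_a, len_b = len(a.text), len(b.text)
    longest = max(len_a, len_b)
    return {
        "similarity": similarity,  # cosine score from the first step
        "length_ratio": min(len_a, len_b) / longest if longest else 1.0,
        "days_between_filings": abs((a.filing_date - b.filing_date).days),
        "same_doc_type": int(a.doc_type == b.doc_type),
    }
```

Rows like these, paired with duplicate / non-duplicate labels, could then be used to fit a decision tree, for example with scikit-learn's `DecisionTreeClassifier`. The abstract does not state which library the speaker used.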

About Manasa 

Manasa Bulusu is a data scientist at S&P Global Market Intelligence, with a background in both business management and statistics. She has a master’s degree in data science & business analytics, and an MBA in Marketing & Finance. Manasa is a research enthusiast with a strong attraction to the life sciences, and enjoys exploring machine learning, predictive analytics, and artificial intelligence, along with biostatistics, genomics, and computational neuroscience. Outside of work, she enjoys photography, poetry, music, and travel.

Event organizers
  • RVA Data Hackers

    RVA Data Hackers is a community of data professionals and enthusiasts who meet regularly to develop skills and learn about tools and techniques for working with data. We discuss how to find, organize, understand and serve data sets large and small. We'll cover anything related to 'big data' -- machine learning, artificial intelligence and architectures to scale data processing for the Internet. If you're interested in machine learning algorithms, data analysis, natural language processing, or managing big
