Come join us for our Boston NLP Meetup!
Our special guest is Sean Crist.
Pizza and beer will be provided :)
Processing Bilingual Dictionaries with Conditional Random Fields
Conditional Random Fields (CRFs) are a type of sequence classifier that considers both the neighboring tokens (as HMMs do) and the attributes of an individual sample (as logistic regression models do). In the present work, CRFs are used to process human-readable bilingual dictionaries (e.g., a German-English dictionary) as a strategy for producing machine-readable lexicons, particularly for lower-resourced languages where large corpora are not available. Human-readable dictionaries lend themselves well to processing with CRFs because these documents are semi-structured and because many attributes of the dictionary text provide useful (but often ambiguous) clues to the structure of an entry.
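To make the idea of "attributes plus neighboring tokens" concrete, here is a minimal sketch of the kind of per-token features a CRF might be given for one dictionary entry. Everything in it is an assumption for illustration: the sample entry, the feature names, and the helper functions `token_features` and `entry_features` are hypothetical and are not taken from the talk.

```python
import re

def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    tok = tokens[i]
    feats = {
        # Attributes of the individual token
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_punct": bool(re.fullmatch(r"\W+", tok)),
        "ends_with_period": tok.endswith("."),  # often marks an abbreviation such as "n."
    }
    # Context from neighboring tokens, with sentinels at the entry boundaries
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

def entry_features(tokens):
    """Return one feature dict per token in a dictionary entry."""
    return [token_features(tokens, i) for i in range(len(tokens))]

# A made-up, pre-tokenized German-English entry (assumption, for illustration only)
entry = ["Haus", "n.", ";", "house", ",", "home"]
features = entry_features(entry)
```

A sequence of feature dicts like this, paired with a label per token (for example headword, part of speech, or translation), is the sort of input a CRF would be trained on. The clues are individually ambiguous: a capitalized token may be a German noun or simply the start of an entry, which is why the surrounding context matters.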
The talk is primarily concerned with the practicalities of this approach. Two case studies are considered, one involving a relatively simple dictionary and one involving a very complex one.
Sean Crist is a linguist by academic training. He taught linguistics and computational linguistics before moving to industry. He has worked in various areas of natural language processing over the past 20 years, including lexicon and corpus development, voice recognition, and semantic extraction from text.