Adventures in Understanding Documents


My name is Scott Tottenham. I work for a small merchant bank out of New York; we do alternative lending. I won't get into that specifically today, but it gives you some context for where I'm coming from. I've been developing in Clojure for four or five years, and I've been a software developer for ten-odd years.

The agenda I'd like to get through today, and I hope you can take some useful information from it, is some high-level talk on the heels of Rich's keynote. I've shifted the focus away from what I was originally going to present, which was a deep dive into the specific semantics of how we extract information out of documents, toward the framework on which to build a thing that can extract from documents: more about the middle of the sandwich, if you catch that. I'll walk through what we do, how we do it, and how we build up a body of knowledge to gain understanding about documents and how that helps our business, and then I'd like to leave some time for questions. This is an experience report. The way I chose to do things isn't "right"; it's just the way I thought, processed, and iterated toward a solution that our company needed, and something I've been working on tangentially outside the company. If you can take some patterns from it, great. I'll also highlight some of the things that didn't work, so you know what not to do.

What we really try to do is have constant, iterable deliverables: here's a small piece of functionality, and you can go use it now, even non-technical people. That's always been my purview: don't build the crazy, convoluted, large, complex system that's half-baked and doesn't work. Instead, get one thing out the door that does its job and does it well, with a contract that says "yes, this works, it's good, you can use it, here's some documentation around it, and it's not going to change, because it only does the one thing it's ever going to do."

We're working with documents all day, every day: business documents, arcane legal documents, financial statements, all kinds of documents. A piece of paper takes its form based on what's written on it. Sometimes we know exactly what's coming over the door; sometimes we don't. So it becomes very important to understand everything that document contains, and then to take what we've learned from it and apply it back to itself and to other disciplines.

Quick show of hands: who, in the course of business or daily life, deals with documents, pieces of paper with stuff on them, as a primary thing or a subset of what your business handles? Some of you. Web-services companies less so; the more old-school the company, the more things are still done with pen and paper. There are still contracts, and we get a lot of those as PDFs, so we're ripping apart PDFs to get at the information buried within them. And who has APIs that they either consume or expose? Keep them up. Right, this guy specifically. It's been my experience, integrating tons and tons of APIs, that it's often a pain, and people who claim they have an API endpoint that does a certain thing may or may not actually do it, and you won't know until you start banging on it and learn from what you get back. Even well-established companies with well-documented APIs still lie to you. In integrating lots of APIs I've developed a pattern, which hopefully we can use here, that basically provides a contract: here's what this is, here are the specs around it, it's good to go, you can expose it. So when you integrate it, you know you're going to get back what you actually asked for.

Quick background on me: I was not a programmer to begin with. I wasn't even considering it; I was going to go to law school. I minored in comp sci on a whim, but my degree was in philosophy and political science. I was working at a law firm, a mass tort firm, dealing with millions of documents. We literally had trucks come in to shred documents. We were trying to build enormous mass tort cases around information that may or may not be buried in document discovery, and often, if a defendant (some major firm) doesn't want something found, they flood the firm with everything, hoping the red herring stays buried in the details and you miss it, because at that sheer volume it's too cost-prohibitive to actually go through everything. There were people on staff whose job it was to "code," and by coding I mean go through a document, find instances of things, and tag them. The cases were interesting, but that job is something I wouldn't wish on anyone.

I left that and got interested in a startup, a social enterprise that said: here's a website you're interested in. It went and ripped the information out of that website, found other, similar websites, and fed those back to you as recommendations; a social recommendation engine. That worked until the end of 2008, when, as everyone remembers, the financial crisis hit.
The crisis got me deeply interested in how financial markets work and in understanding how things become correlated, and I'll speak a bit later to why things become correlated, and to finding correlations between things even when they don't actually "exist." I use air quotes for a reason: yes, there may be a statistical correlation between A and B, but it doesn't really mean anything. As we hand things over to machines, a word of caution about what that implies: when you say something in this document is related to something else, does it actually mean anything, other than that the two happen to occur close to each other?

When we try to extract the knowledge from a document, we're really dealing with three branches of philosophy, and, to steal something from Rich, let's unpack the words we're using. The ontology of a thing is the metaphysics, the nature of its being: what is it that it is? The epistemology is the theory of the knowledge it contains. A document has knowledge embedded within it, and not just in the text that's there: an expert looking at that document knows more about what's in it than someone else does, and that knowledge needs to be captured in some way. That's really the value-add. Take one simple thing, like a financial statement, as boring as it is; apply the viewpoint of a skilled, trained financial analyst, and there's knowledge there that can be gleaned, understood, and processed. That's the game: take a thing, rip stuff out of it, apply more to it based on the people who have looked at it, and take what they've looked at previously and apply it further, so that the next time it's even smarter. That gets us to pedagogy: you're dealing not only with the thing itself but with the method by which you've trained it so that it's smarter next time, the conveyance of the knowledge that was gained. That's another important facet of what we're doing.

The techniques I used to get this job done are fairly rudimentary, and I'll get through some of the nuts and bolts, but I'll be the first to admit that if you're looking for the leading edge of NLP and machine learning, there's an interesting talk from EuroClojure just a few weeks ago that basically provides a language definition for understanding Polish, which is an interesting language in itself, and there are other talks that get into the deep learning and machine learning ways of doing this. We have the crude version that gets the job done, and what I'm doing here will be supplanted by those techniques in the future; that's kind of a no-brainer, it's the way things are going. So what's important isn't the specific techniques we're using but the framework by which we did it.

Has everyone watched Rich's Hammock Driven Development talk? It struck me in a way I didn't really process until I let it digest for a while; sat in a hammock with it, I guess. Extending it one step further, what I've developed internally for our company is basically readme-driven development: put a contract front and center that defines exactly what the thing is, and design it from the viewpoint of whoever is going to consume it. That does a number of things for us. It grounds and centers what we're actually focused on delivering, and it forces a design where you're genuinely thinking about the end user and caring for them.
If consumers can't understand it, they're not going to use it. If you're trying to build a library, and we all lean on these great open source libraries, the readme is the means by which to say: here's the thing that it is, it will be no more and no less, and you can reference it as the contract for using this thing, whether it's a library, a framework, or whatever. We found that taking that approach, even ad hoc (let's write the readme, then make the readme true: say it can do a thing, then actually build it so it does that thing), also gave me the discipline to produce only the code we could actually deliver, not some spurious body of the system that might be useful later down the line. It's a real thing, it does what it says it's going to do, and it's good.

I mentioned giving thanks: we're basically composing a bunch of different libraries together, so if anyone's not familiar with these, I'll quickly run through them. Liberator is our bread and butter for exposing everything as a RESTful resource. bidi is a great library that does routing as data; I'll have more slides on that. buddy: we're in the financial realm, so it's very important to be secure. We need bank-level security, and buddy handles tokenization, auth, encryption, and hashing of all the passwords and such, plus encryption of the documents themselves. http-kit serves it all up; there's a blog post about it handling 600,000 concurrent requests, so it's more than good enough for our purposes, and it just works. byte-streams: we're streaming data out and pulling it in, mostly as multipart form params, raw data that arrives over the wire in a number of different formats, and we use byte-streams to ingest it. Cheshire, because we live in a world of consuming and producing RESTful JSON APIs: our language du jour is obviously JSON, so I'm handed documents and I spit back JSON. Give me unstructured documents and I'll rip out the text and give you back knowledge, plus the actual text as data, formed up as a JSON document, so you can take it, process over it, compose over it. We consume it internally and pass it off to another system that I also wrote.

clj-fuzzy is a great library for fuzzy text pattern matching; that's basically how we pull stuff out, and I'll get to it. core.match we use in its crudest sense: if it matches certain conditions, then the answer is here. We have loose, flexible rules because we're not quite sure what we're going to find, but if we find certain things, then they have to mean this. There are definitely more interesting use cases for core.match; we basically just match on conditionals.

Neocons is our client for Neo4j, the graph database we use as a back store, if you're familiar with it, and that gives us a lot of nice properties. It's really easy to semantically define relationships between things, and we weight those relationships: here's a piece of information, here's another piece, they're somehow associated, somehow related, not quite sure how, but if you weight it and define the relationship, then any future time it comes up you can say: give me that node and everything associated with it in any way, and you're provided more contextual knowledge about the thing you found. Each time it finds something, it forms more associativity, and, a bit like the way the brain works, it forms connections you're not fully sure how it formed, but the fact that they're there has some meaning, so surface that. That's all it really is.
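The crude core.match usage just described, loose conditional rules where "if it matches this, then the answer is here," can be sketched roughly like this. The sections, terms, and answers are hypothetical, not the actual production rule set:

```clojure
(require '[clojure.core.match :refer [match]])

;; Map a loose finding from a document to a conclusion. A keyword
;; pattern matches literally; _ matches anything; :else is the
;; catch-all when no rule applies.
(defn classify [{:keys [section term]}]
  (match [section term]
    [:definitions "borrower"] :party-definition
    [:covenants   _]          :covenant-clause
    [_ "guarantor"]           :guaranty-reference
    :else                     :unknown))

(classify {:section :covenants :term "net leverage"})  ; => :covenant-clause
```

Because the rules are just clauses in a function, adding a new "if we find certain things, then they have to mean this" rule is a one-line change.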
Meltdown gives us an efficient message bus; everyone's got one of those. I chose it because it comes out of the finance industry, sitting on LMAX's really performant messaging work. It produces and consumes messages, and it's rock solid. I use it, so there it is.

So, quickly: I wanted an architectural definition that is really a general way of composing any type of RESTful API. Ours happens to be a means of ripping apart documents and getting into them, but it's also something that says: I can be handed anything and provide utility for it. The way we handle things is to take small composable units and string them together in some way. I come from JavaScript land, where frameworks are the name of the game, and you have to buy into the framework's beliefs about how things work; it always works until you want to do something slightly different from how the authors intended you to use it. One of the things I absolutely love about Clojure is the notion of small composable libraries that you build up however you see fit, as you need.

I like an analogy here. When I'm not coding, I climb, and when you rock climb you build anchor systems out of small independent pieces of protection, strung together in such a way that the system is distributed, equalized, redundant, and solid. You want to check those things before you place your weight on it, because if you don't, the consequences are genuinely dire: you can fall to your death. I used to think the worst thing that could happen if you messed up the composability of some functions was a NullPointerException, but hearing these talks, people monitoring diastolic blood pressure in an ER, code being sent into deep space, this code genuinely has to work, and work perfectly. I love it; it's crazy.

One thing we like to use is bidi, and the reason (I grabbed this comparison from the site; there are multiple routing frameworks you could use to process requests) is that third column: the syntax is data. It is really, really nice to have all of your endpoints not hard-coded into the application. If you've used Django in Python, or Node.js, or even Express, you end up hard-coding all these endpoints and conditionally matching on things. But when all of your endpoints exist as a data structure, all of your handlers can be functionally composed over. So we rip out all of our API endpoints for handling different types of documents, different semantic understandings, different pattern-matching things on our internal APIs, and we expose that: literally, here's the data structure that is all of our endpoints. We can push that outward to the readme to say "here's how to consume everything I've produced," and we also use it referentially internally. It's the same source of truth, both externally for how to consume it and internally for how we match on whatever comes across the wire and decide what to do with it.

It also plays nicely with HATEOAS, hypertext as the engine of application state, for anyone familiar with building up HTTP requests that way: I give you a response, and in it I send back links for the other requests you can now make. If you authenticate, the response tells you which other endpoints you're now allowed to hit. It gets you to a contract: I'm going to offer you something, I'm going to require something from you, and it's all programmatically defined, so it's not ambiguous. You're not going to all of a sudden break something.
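A minimal sketch of the routes-as-data idea just described, with illustrative handler names, could look like this:

```clojure
(require '[bidi.bidi :as bidi])

;; The whole route table is a plain data structure: no hard-coded
;; dispatch, just data you can print, share, or push into a readme.
(def routes
  ["/" {"documents"        {:post :upload-document}
        ["documents/" :id] {:get :get-document}}])

;; Matching an incoming request against the data:
(bidi/match-route routes "/documents/42" :request-method :get)
;; :handler => :get-document, :route-params => {:id "42"}

;; Because routes are data, you can also go the other way, e.g. to
;; generate links for a HATEOAS-style response:
(bidi/path-for routes :get-document :id "42")
;; => "/documents/42"
```

The round trip (match a path to a handler, generate a path from a handler) is what makes one data structure serve as the single source of truth for both consuming and serving the API.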
It's just there, and "very, very nice" is the only way I can describe it: a loose, flexible way to deal with things, rooted in a data structure, which everyone can understand very well.

The way we handle all the documents is the conveyor-belt analogy, well worn in Clojure, of jamming things onto a message bus. The system tries to do as much as it can with a document and then feeds back as much as it can. Initially it was not very smart; it got smarter over time. A request comes over, we rip the document apart, apply whatever knowledge we can, pull the text out, structure it in some way if we know what we're looking at, and provide back not only the text as structured data but all the extra information we could glean about it and all the associated stuff that comes with it. The first pass was literally: can you give me a random document as a PDF, and can I give you back the actual information buried within it?

We take Liberator and apply it pretty much everywhere. The notion is: there's some functionality we need, so expose it as an API endpoint. I've done this in other languages and it's ridiculous; Liberator, and Clojure specifically, make it really, really nice. I don't want to say easy, but it's almost trivial to take anything you want to expose, even some other binary application your team has, and expose it as a resource, with very fine-grained control over not only who can access it but how they can access it, what's required of them to access it, what they're provided when they get it, which methods they can hit on it, and what it's going to give back. And that's all defined right there in one little defresource call. So we say: here, POST to this endpoint, that's documents, and it goes to town and tries to figure out what's going on.
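As a sketch of that Liberator approach, a resource that accepts a POSTed document and answers with JSON might look like the following. `extract-document` is a hypothetical stand-in for the real pipeline, and the request shape is illustrative:

```clojure
(require '[liberator.core :refer [defresource]]
         '[cheshire.core :as json])

;; Hypothetical stand-in for the real extraction pipeline.
(defn extract-document [file]
  {:text "..." :knowledge []})

;; One defresource call gives fine-grained control over methods,
;; media types, and what callers get back.
(defresource documents
  :allowed-methods [:post]
  :available-media-types ["application/json"]
  :post! (fn [ctx]
           (let [file (get-in ctx [:request :multipart-params "file"])]
             {::result (extract-document file)}))
  :handle-created (fn [ctx]
                    (json/generate-string (::result ctx))))

;; `documents` is now an ordinary Ring handler:
;; (documents {:request-method :post
;;             :headers {}
;;             :multipart-params {"file" some-upload}})
```

The decision points (`:allowed-methods`, `:post!`, `:handle-created`) are where Liberator's "who, how, and what you get back" control lives.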
One little gotcha to remember: we're basically using third-party binaries and some lower-level Linux programs, so it's nice to be able to target what you're running on. We had the issue of "well, it works here, so why doesn't it work over there?" You've compiled for one architecture and it doesn't work on the other. Following Rich's steps, the system tries to do the right thing, and one aspect of that is: are you even on the right architecture, with the right program for that architecture? If you've got a dev-ops team, and you built on your MacBook but production runs Linux, those issues do come up.

What we're essentially doing is shelling out. There's the built-in Clojure function sh: it spawns a subprocess, and you provide it the binary you're calling and all of its arguments. We leverage this pretty heavily. We first tried implementing a lot of third-party, enterprise-y, expensive solutions, where you hand documents over and get results back by email and the like, and what we actually found is that the lowest-level Linux primitives were more efficient and better at extracting the information. In that vein: (a) it's free, which is nice; (b) it's faster, since we're not relying on some third-party service with a contract; and (c) we can mix, match, and compose exactly how we want. There's some learning curve, but it's not that crazy; you read the docs and you can see how to use it.

We're basically embedding binaries, in our case a PDF-to-text extractor and an image-to-text converter, but whatever logic or program or business thing your company has, if somebody on your team built a Go binary or a Python script or whatever, it can live underneath this grand Clojure architecture, which is kind of nice as well. You have your thing, and it does its thing; as long as it provides piped output in the grand Unix style, that's all I need, and we can embed it in our application. So we have this living as a meta layer on top of some other programs: if you expose a resource, it's actually exposing an under-the-hood binary application, piping input in, grabbing the result, and feeding it back.

For taking text out of a PDF, pdftotext is literally the best and most efficient means. Sometimes it doesn't get all of it, and then we have to break the PDF up into images, and for that there's a program called Tesseract. It originally came out of HP Labs, first built in the mid-1980s, so it's been around for a while; HP open-sourced it in 2005, Google took it over in 2006, and it has since emerged as the de facto best OCR engine. If anyone's not familiar with optical character recognition: it looks at an image of something that looks like a nine and decides whether it is, in fact, a nine. Now imagine that over hundreds of pages of dense, arcane documents. I give you something; can you give me back what it actually is and not screw up? There are some games to be played with it, but it does the job, and much better than Adobe Acrobat or any other full-fledged program that claims to convert from one format to another. Basically: whatever the document is, just get it to me as text, and then I'm going to compose over it a lot.

What's important for how we define the knowledge is a number of different pieces. What we have is a built-up corpus, or multiple corpora, of knowledge, and that knowledge lives separately from the actual document. There's no magic here, no grand ontological understanding of the document.
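The shelling-out pipeline described above, pdftotext first with a fall-back to rasterizing pages and OCRing them with Tesseract, might be sketched like this. The file paths, flags, and fall-back strategy are illustrative, and all three binaries (`pdftotext`, `pdftoppm`, `tesseract`) are assumed to be on the PATH of the target architecture:

```clojure
(require '[clojure.java.shell :refer [sh]]
         '[clojure.string :as str])

(defn pdf->text
  "Extract text from a PDF: try pdftotext, and if the PDF yields no
  text (e.g. it is a scan), render pages as images and OCR each one."
  [pdf-path]
  (let [{:keys [exit out]} (sh "pdftotext" "-layout" pdf-path "-")]
    (if (and (zero? exit) (not (str/blank? out)))
      out
      (do
        ;; Rasterize each page to /tmp/page-N.png ...
        (sh "pdftoppm" "-png" pdf-path "/tmp/page")
        ;; ... then OCR each page image and join the results.
        (->> (.listFiles (java.io.File. "/tmp"))
             (filter #(re-matches #"page-\d+\.png" (.getName %)))
             (sort-by #(.getName %))
             (map #(:out (sh "tesseract" (.getPath %) "stdout")))
             (str/join "\n"))))))
```

Because `sh` returns a plain map (`:exit`, `:out`, `:err`), any binary that writes piped output in the Unix style can be embedded behind a resource this way.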
It's just: is there a node with a term that has a relationship to some other term, and does that occur frequently? Recognize that. Define some basic patterns you can find in any clausal contract document: "whereas," "the parties hereto," "in the state of," "the sum total of." If you know the loose structure of a document, you can start to apply patterns, regexing through it to grab what you want out of it, and once you have that, you can slap on all the knowledge pieces you've gained and say: okay, here it is. So when you have all this built-up corpora, you apply those corpora to the data; we're literally just mapping and composing over it, building up inferences from what's embedded in the documents. The knowledge types are nothing fancy: relationships, patterns, groupings, and rule sets. Simple rules: if you find this, it means this; if you know it starts here and ends there, it has some meaning. You want to capture specific financial clauses, bank statements, that kind of stuff.

Then there's what we've defined as the "frequently accessed qualifiers," things that don't quite fit into any of the categories but are interesting information we want to get. It's a bit hacky, but: here's a thing, a term, a section, a pattern, whatever, and when you encounter it, run some extra conditional logic. In our case, when we encounter a business name and the owner, we go out and pull that information from another API that we have. So: when you're handed something I can recognize, I'm going to go grab this other thing, pull it in, and give both back to you. That's all that's happening. Now, it has to be flexible, because it doesn't always know what it's going to get.
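A minimal sketch of such a qualifier registry, with illustrative patterns and a stubbed-out enrichment function standing in for the call to the other API:

```clojure
;; Each qualifier pairs a trigger pattern with extra conditional
;; logic to run on a hit. Patterns and :on-hit fns are illustrative;
;; in production :on-hit might call out to another service.
(def qualifiers
  [{:name    :business-name
    :pattern #"(?i)d/b/a\s+([A-Z][\w& ]+)"
    :on-hit  (fn [m] {:business (second m)})}
   {:name    :total-due
    :pattern #"(?i)total\s+due[:\s]+\$?([\d,]+\.\d{2})"
    :on-hit  (fn [m] {:total-due (second m)})}])

(defn apply-qualifiers
  "Run every qualifier over the text; return one map per hit."
  [text]
  (for [{:keys [name pattern on-hit]} qualifiers
        m (re-seq pattern text)]
    (merge {:qualifier name} (on-hit m))))

(apply-qualifiers "Acme LLC d/b/a Acme Widgets. Total due: $1,250.00")
```

Adding a new qualifier is just conj-ing another map onto the registry, which is what the non-technical UI described below effectively does.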
So we built it in a way that it doesn't break if it doesn't find something. You're never going to get a NullPointerException; it's just going to say it couldn't find it. The first time around there were things it should have found, but it was just too stupid to know what it was doing; it couldn't correctly identify certain things. Through iteration (okay, the first pass didn't work, but it should have) it now finds the generic cases most of the time, and it can build up to more complex, multi-varied scenarios where it's genuinely handling complex legal documents with variable numbers of clauses and terms that need to be captured correctly. Again, it just composes over things: apply this; if you find it, give it back.

Further, our non-technical folks have a simple UI we built so they can feed it more information, and so the corpus can grow into a genuine knowledge base: "I'm encountering this thing a lot, this term, this clause, this phrase, this number, and when you find it, it means the following." That's a simple form they can submit, and it adds to the registry of information that can then be applied to all future passes. Each time it's used it gets a little bit smarter. And I don't say smart, because it's not intelligent in the AI sense; it's just less stupid.

There's a book on this, a bit of a treatise on how to use Incanter, which does have useful information if you're interested in this kind of stuff. We're pulling out information, doing statistical analysis on the terms it encounters, drawing histograms: regular data-science stuff, not the most fascinating thing in the world, but I wanted to do it in one language. I come from a background of doing a lot of analysis in R, and R is really good for historical analysis but can't really do real time, which is more interesting. That's actually how I got into Clojure in the first place; interesting tangent.

All right: presuming you've been handed a document and gotten everything back, it's now just one big, complex string, and we try as much as possible to use the functions built into Clojure to go over it. Who here knows regular expressions to the point where you can actually build up complex regexes for very nuanced cases? This is the crowd to do it with. I'd always just grabbed a regex whenever I needed one, gotten the job done, and moved on, but at some point I sat down and said, let me genuinely understand what's going on, so I can compose over these things and build up complex scenarios. So we use a combination of built-in primitives: seek over something, find the pattern, build up the pattern, concatenate patterns, return what you find as a capture group, and when you find it, give it some meaning.

One of the most important things is how often something happens, and you can quickly discern what's happening here: given a regular expression and some threshold, I want to be alerted if something occurs more than some number of times. Set the threshold, say which regular expression to match on. We have all these built-up utility functions that, in concert, all compose over the same large body of text, namely this huge document parsed as text, applying, applying, applying, then merging it all together: here's what's interesting about this thing, here's what I found, here's some facet of it you should be alerted to. We're not trying to replace anybody; we're trying to give people something that makes their job better.
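The frequency-threshold utility just described might be sketched like this, with an illustrative pattern and text:

```clojure
(defn frequency-alert
  "Given a regex and a threshold, return a summary map when the
  pattern occurs at least `threshold` times in `text`, else nil."
  [re threshold text]
  (let [hits (re-seq re text)]
    (when (>= (count hits) threshold)
      {:pattern (str re)
       :count   (count hits)
       :hits    (distinct hits)})))

(frequency-alert #"(?i)indemnif\w+" 2
                 "The Borrower shall indemnify the Lender; such indemnification survives termination.")
```

Several of these small utilities can then be run over the same parsed document and merged into one "here's what's interesting" summary.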
anybody but give them something that makes their job better basically you're just building up regular expression patterns from simple all right just find this simple thing all right find the simple thing next is other thing that's the other thing that's the other thing combine that means that it's actually this complex clause or this complex financial statement independently it's kind of stupid but when you start to flexibly compose them it becomes rather powerful to be able to say all right I can go and grab all these different facets of a document and give them back to you as the known entity that they are at which point you can do all the calculations and interpretive analysis that you see fit to do so really kind of interesting if you're you know if you like to geek out on this kinda stuff our underwriters they just in our financial analysts they just want to say here's terms here's terms here's terms tell me when they find it if I don't find it exactly then what we really need to do is to say all right we have all these terms we have a body of knowledge we have all these texts that's cool not really but when it doesn't find it exactly correctly we can't just not use it we have to make sure that we're capturing it even if it's not fully there so the flexibility we kind of have to relax the hard constraints on finding exactly what we're looking for and say well if it's close enough we have all this fuzzy matching stuff and so this is a library that it's got fun cool hats off to them for creating it I don't surmise to fully even know exactly some of the algorithms that are using I've just kind of read through what it's doing and how it does it and it gives us a good baseline for saying all right if word a and word be or somehow related or if they're not too far apart like literally the text then it's probably the same word it's just that the OCR messed up and couldn't find it correctly so we're just getting a dice coefficient the lewd distance the Hamming 
and a Jaccard distance. These all do slightly different things, but if you're familiar with text processing, each one essentially answers: how different is one word from another, or how similar, or how many replacement characters would I need to get from A to B? We found an essentially arbitrary threshold, a golden mean of sixty-eight to seventy-five percent, roughly in that range, for saying a word is close enough. Then you can either flag it, saying no, that's not right, I'm looking at the actual text, or infer that it is that word and the OCR just didn't read it exactly correctly. We run through all of those metrics for every single word, basically take their mean, and say: it didn't match exactly, but it's pretty close, so it's probably what you're looking for. Here it is, here's some math on it presuming it was correct, and here's some math presuming it was not, so you can look at both and choose which one makes more sense.

Internally we represent everything as a weighted graph. Some documents are confidential, and that's kind of our bread and butter as a company. The attribute that says something is confidential isn't intrinsic to the document; it's information about it, and to describe it as metadata is kind of misleading. It's just something about the document that should be noted, and you can gain information by virtue of it having that attribute. That's another way of saying: okay, it's confidential; what is it about it that's actually important? Combine that with the confidential flag, weight things accordingly, and then build up this modeling algorithm to say: based on this document and the weighted graph of what was interpreted, do you have what you need to make a decision?
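The mean-of-metrics decision described above might look like this. This is a hedged Python sketch: the 0.68 to 0.75 band comes from the talk, but everything else, including using `difflib.SequenceMatcher` and a crude character-overlap score as stand-ins for the real metric set, is my own invention.

```python
from difflib import SequenceMatcher

def char_overlap(a: str, b: str) -> float:
    """Crude similarity: shared characters over the larger character set."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa), len(sb)) if sa or sb else 1.0

def close_enough(a: str, b: str, threshold: float = 0.68) -> bool:
    """Average several similarity metrics for a word pair and compare the
    mean against the ~68-75% band the talk describes. Only two stand-in
    metrics are used here; the real system averages more of them."""
    scores = [SequenceMatcher(None, a, b).ratio(), char_overlap(a, b)]
    return sum(scores) / len(scores) >= threshold

print(close_enough("confidential", "conf1dential"))  # OCR-style misread
print(close_enough("confidential", "terms"))         # genuinely different word
```

A word that clears the threshold gets carried forward with both interpretations (math assuming it was the expected term, math assuming it was not) so a human can pick the one that makes sense.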
This is an architectural style of separating the knowledge we have about each type of document from the application logic that implements that knowledge, and it gives us a number of things. One, the knowledge becomes persistent, because it's literally stored out in a graph. Two, it separates concerns: you've built up this body of knowledge that our experts have looked at, used, and added to, so it keeps getting smarter. We can take that knowledge base, comprised of what I mentioned before, the delimiters, the patterns, the cues, the relationships, and tack it on top of a document. You have this historical knowledge base about documents in general, in our case a very domain-specific one, that builds up and can now pull a lot out of what was otherwise just a simple, straightforward document. It's separation of concerns; it keeps things clean, and it means you can swap pieces out.

What I'd like to see, as a hope for the future, is that as more general-purpose knowledge sets are built in some pattern roughly analogous to this, and definitely some more interesting ones around machine learning, they get open-sourced. You've got this knowledge base about documents; somebody else did research into a knowledge set about certain types of information; somehow combine those and pull from both, so there's a genuine shared collective knowledge that lives ethereally out on GitHub. Let's pull the machine-learning results from applying something somewhere, pluck that out, apply it to a different set, and see what happens. It's a different pattern of thinking, and I've tried to keep these things clean in that way: let's repurpose what we're using. Can I pluck something from both a financial document and a legal document?
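One way to picture the separation being described: the domain knowledge lives as plain data that can be persisted, versioned, or shared, while a small generic engine applies it to any document. Everything concrete below (the schema, the field names, the sample patterns) is my invention for illustration, not the talk's actual representation.

```python
import re

# Knowledge base: pure data, kept separate from the engine, so it can be
# stored in a graph, versioned, or shared independently of the application.
KNOWLEDGE = [
    {"name": "interest-rate", "pattern": r"\b\d+(?:\.\d+)?\s*%", "weight": 0.9},
    {"name": "confidential",  "pattern": r"(?i)\bconfidential\b", "weight": 0.7},
]

def apply_knowledge(knowledge, text):
    """Generic engine: run every pattern in the knowledge base over the
    document and report each facet found, with its weight."""
    return [
        {"name": k["name"], "weight": k["weight"], "matches": found}
        for k in knowledge
        if (found := re.findall(k["pattern"], text))
    ]

doc = "CONFIDENTIAL: the loan bears interest at 4.5 % per annum."
for facet in apply_knowledge(KNOWLEDGE, doc):
    print(facet)
```

Because the engine knows nothing about finance, swapping in a legal-document knowledge set, or one somebody else published, requires no code changes, which is the repurposing the talk is arguing for.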
Those are two different, disparate knowledge sets in very different formats, but they have some similarities. So can we pull from knowledge of both, knowledge that maybe some other company or entity or person created, and apply it to what we're working with? I think that may be a way forward: here's a contract for understanding what that information is.

I'll wrap it up right there. I want to say a huge thanks to the community; you guys make my life fun. You write these libraries, and I don't think I would know how to do this in any other language. For a while I actually, genuinely gave up programming, because I didn't want to spend my life fixing CSS stylesheets and writing crappy code that was only ever going to be looked at by a handful of people. But the kind of things we're doing here: this was just a side project within our company that I personally did in a couple of days, maybe a couple of weeks, to build up this thing we needed to get done, and having looked at alternative ways of approaching the problem, I couldn't think of any other way to do it than in Clojure. So, like all good documents, you know, that's them. Thank you. [Applause]