MongoDB and Gene Databases


Hello, good morning everybody, and thank you for attending this talk. As I've already been introduced, I'm going to be talking about MongoDB and gene databases. First off, before I go on with the talk, I'd like to thank the organizers; I'm really excited to be at this first PyCon. I'd also like to thank Google for sponsoring the diversity program, which has enabled me to come here and speak to each and every one of you. Thank you.

Moving on, I'd like to introduce a little bit of what I've done. I've been a free software volunteer since 2005, which is when I started my journey with Debian: I wrote a Sanskrit locale for Debian, which is in the belocs locales package. On the women-in-free-software side, I volunteer with a number of women's organizations like LinuxChix, Systers and KDE Women, and I've been a list administrator. I've enjoyed understanding how diversity actually works and why it is so hard for women to be in the community, the free software community; when I started out there were very few women, but now it has changed a lot. As part of a Google program I worked with Systers on some Mailman code: we worked on the 2.1.10 version, which was pretty outdated since Mailman is on 3.x now, and I did the patches, testing and release project for Systers' Mailman. I've also volunteered with the Scripps Research Institute in California, which is who I've done the MongoDB project for, so if you're interested you can just check out all the work that I've done there.

The Scripps Research Institute is a research organization in California; they work a lot on gene databases and bioinformatics, and they do a lot of research. I worked with the Su Lab, which is a part of the Scripps Research Institute, under Dr. Andrew Su. They do a lot of, how do you say, gene wiki profiling: they run the Gene Wiki, a gene annotation wiki in which researchers, biologists and practically anybody who's interested in bioinformatics can go and update pages and get data; if you're a student you can probably just read about it. They have some twenty or thirty thousand pages on the Gene Wiki. They also run two gene annotation database sites, BioGPS and MyGene.info, and those last two are the ones that I've worked on.

If you check out both these sites, you'll find that BioGPS is basically a site that allows you to do gene annotation: it allows researchers and biologists to go and get data on gene annotation, protein function and gene ontology, and you can run wildcard queries to get that information. A lot of biologists and researchers use BioGPS in their everyday work, and MyGene.info is the service that powers BioGPS, so that is the one that I worked on. As I've already gone through, it's basically a customizable REST interface: it gives you gene ontology, gene annotation, gene symbols, protein functions, wildcard queries, gene IDs, mRNA sequences, protein sequences, alignment sequences; basically just anything that a biologist or scientist requires.
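To make that concrete, here is a rough sketch of querying MyGene.info's REST interface with the requests library; the v3 endpoints shown are the service's present-day public API, so treat the exact paths and fields as illustrative rather than the version described in the talk.

```python
# A rough sketch of querying MyGene.info's REST interface with `requests`.
# The v3 endpoints are today's public API; exact paths and fields may
# differ from the version of the service discussed in the talk.
import requests

# Full annotation document for Entrez gene ID 1017 (CDK2):
gene = requests.get("https://mygene.info/v3/gene/1017").json()
print(gene["symbol"], "-", gene["name"])

# A wildcard query over gene symbols:
resp = requests.get("https://mygene.info/v3/query",
                    params={"q": "symbol:cdk*"}).json()
for hit in resp["hits"]:
    print(hit["_id"], hit.get("symbol"))
```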
Moving on, the server stack that we used is Ubuntu 12.04, with Mercurial (because it's Python) for version control, and we run the full site on AWS. Currently the site runs on CouchDB, Tornado and Python, and the libraries that we have are Biopython, NumPy and pyes for Elasticsearch. I'm porting it from CouchDB to MongoDB; that is what my project was all about. The MongoDB stack that we are using runs on Python: we use PyMongo, which is the driver, and MongoKit, the object-document mapper.

Moving on, this is basically what the project is all about. As you can see, the whole idea is that we collect external gene annotation data from different sources. When I say sources, we have multiple databases from which we pull data: it could be Ensembl, it could be Entrez, it could be homology data from HomoloGene, it could be from a rat genome database. What happens is that, for example, a rat genome site would have data on gene ontology, and it could have data on gene sequencing or gene annotation, but it won't have any data about phenotypes; that I need to get from a different database. So the idea was that we collect the data from multiple databases and put it in a single MongoDB document, because Mongo allows us to store data which does not have the rigid, uniform structure that you have in an RDBMS: we can store records which have different fields and different formats. For example, in certain databases the data comes in a text format, in some places it comes as a CSV file, and in some places they have stored it as a MySQL dump. We pull that data, convert everything into JSON and store it in Mongo.

So the idea is that we collect all the data, and we have source scripts which we use to collect the annotation data and convert it into JSON documents. For example, if I have 1017, that being a gene ID number in Mongo, I can check what symbol it has, what the name of the particular protein is, and what the corresponding symbol for 1017 is in each source. In a rat genome database it could be cdk2, but in an Entrez database it could be something entirely different.
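Here is a minimal sketch of that merge idea with PyMongo; the database, collection and field names are made up for illustration, not the project's actual ones.

```python
# A minimal sketch, under assumed names, of merging per-source annotation
# records into one MongoDB document per gene ID with PyMongo.
from pymongo import MongoClient

genes = MongoClient()["genedoc"]["gene_src"]  # hypothetical db/collection

# The same Entrez gene ID (1017) as seen by two sources, each with its
# own field structure:
sources = {
    "entrez": {"symbol": "CDK2", "name": "cyclin-dependent kinase 2"},
    "ratgenome": {"symbol": "Cdk2"},
}

for source, fields in sources.items():
    # $set merges each source's fields under its own key inside the one
    # document for _id 1017; upsert=True creates it on the first load.
    genes.update_one({"_id": 1017}, {"$set": {source: fields}}, upsert=True)
```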
So the idea is that we bring all the data that corresponds to the gene ID number 1017 together, merge it into one single document, and then put it on BioGPS, so that the researchers can go and check what the latest data is. Every time Entrez updates their database, a rat genome database might not do it, or Reactome might not do it, so otherwise the researcher has to go to three or four different databases, do the research, come back, merge that data and look for whatever it is that they're searching for. I'm new to biology; I have learned biology only through this volunteering, through this project, so I'm not exactly sure what a biologist does or what they're searching for, but I do know that when you have multiple databases with different types of data for the same gene, they have a problem, and that's the problem that we're trying to solve. So we take each of these IDs, merge the documents, collect them and store them in MongoDB. That's the collection part.

Once we've done that, we have daemons that run at a periodic interval, and this depends entirely on the researcher and what they're trying to do. Sometimes they do it on a daily basis: if Entrez updates their database every day, then the researcher runs the cron job that I've written, he or she gets the data on a daily basis, it gets dumped into MongoDB, and they're able to continue with whatever they run. Sometimes the source scripts are run on a weekly basis, sometimes monthly; there are some databases which don't update for a few months altogether. So that's what we look at. In the structure you can see on the right-hand side of the screen, we have source scripts that run on a periodic basis, and we can add source scripts later so that they can have more data: mRNA data, protein sequences, phenotype data, depending on what the researcher requires. This is actually a work in progress, so I'll keep working on it; right now I'm at Hacker School, so I'll be going back to work on this in January. Right now we have scripts for gene annotation and gene ontology; the scripts for the other things I'll probably be writing much later.

The second part is the architecture. As I mentioned earlier, that's the URI for the Su Lab; you can just go there and check out all the projects that they have. It's pretty interesting: they have migrated a lot of their proprietary software to free software, and they were part of Google Summer of Code this year with a bunch of Gene Wiki projects; about six or eight projects were selected in Google Summer of Code.

The old architecture, as I mentioned earlier, was on CouchDB, and everything (the data loading, the indexing, the searching) was done as a single file. So we broke the architecture down; I've separated it into daemons, data building, data indexing and data loading. The goal is that we collect external gene annotation data from different sources and merge it into the gene source collection in MongoDB, and then we use Elasticsearch, which we run on Amazon. That's the indexing part, where we allow the researcher to search based on wildcards, on gene ontology, or whatever it is that they require. The backend MongoDB framework then allows me to index all these documents periodically, as and when the biological sources are updated.

The daemons are for running all the scripts that the researcher requires. It could be on a daily basis, it could be weekly; they decide. If they want a particular Entrez database and they want it next week, on the fifteenth of November, they can just go and run that particular daemon. It's a single command that they have to run from the command line, and it will automatically build the source, do the indexing and load it for them, and it will be served for the researcher on the BioGPS site.
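Conceptually a daemon like this boils down to the sketch below; the real ones are driven from cron, and the source names, intervals and update function here are made up.

```python
# A conceptual sketch of a daemon that refreshes each data source on its
# own interval. The real project schedules this via cron; source names,
# intervals and the update step are illustrative only.
import time
from datetime import datetime

INTERVALS = {"entrez": 24 * 3600, "refseq": 7 * 24 * 3600}  # seconds

def update_source(name):
    # the download + build + index + load steps would be chained here
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] refreshing '{name}'")

def run():
    last_run = {name: 0.0 for name in INTERVALS}
    while True:
        now = time.time()
        for name, interval in INTERVALS.items():
            if now - last_run[name] >= interval:
                update_source(name)
                last_run[name] = now
        time.sleep(60)  # re-check the schedule once a minute
```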
If you check out the code over here, in the data loading part, the register_sources module is what allows the plug-in architecture: it gives me metadata like the collection name, it gives me the name of the source that is being processed and the document structure, and it outputs all of this into each of the collections that are there. We have around nine collections in Mongo so far; you'll see them as we go along.

Now we come down to the building part. Again, when I'm building this particular database I'm combining the individual annotation data into one single document. Each source may have different fields and different data; the only thing common between them is the gene ID number, which is 1017 here, but it could be anything. Now, this becomes a big challenge, because certain databases have a different field structure. As you can see over there, some of them have just the name and the symbol, some of them have a UniGene entry for the same gene ID number, and some of them have a name, a symbol and a UniGene entry. A regular RDBMS does not let us combine these kinds of documents, whereas MongoDB does; that's the advantage of having a NoSQL database. I can take collections which have different structures and different fields, put the data together in one single document, and serve the whole thing so that the researcher can get all the data that they require. This is not possible in a regular RDBMS, and it's one of the biggest advantages.

Each of these combined databases sometimes runs to 30 or 40 GB, and sometimes it's even bigger than that. Just running this on a local machine was really hard for me; I had a tough time, and I had to use the Scripps Research Institute's servers to get this work done, because my machine does not have the resources and capability to process 40 gigs for just a single database, and I had around nine collections where I was updating data. So that really made it very, very difficult for me.

Now, if you see this figure (I'm not sure if you're able to see this; okay), this is the architecture of the whole project and how it works: how we pull data from the NCBI Entrez and Ensembl databases. Here you can see the gene cron script. The cron job keeps a check on the scheduled time, executes the dl_entrez download script, and downloads all the gzip files into the data dump folder which I showed you earlier. It dumps them according to the date on which the cron job was run, and it brings in all the data that's available there. Everything needs to have a timestamp; that is one thing it does, and it allows the researcher to go back and forth between two of the database dumps that they have pulled.

Once all the files are downloaded, there are parse files for the GBFF files, for RefSeq, and it could be for annotation; we have scripts that parse all this data and again put it in the data dump folder. The dl_entrez file then transfers control of the program, for execution, to the parser, and all five RefSeq GBFF files are processed to create new .txt files as per the records in GenBank. As I mentioned earlier we use Biopython, so there is a library called Bio.SeqIO which reads the GenBank data: it parses and locates each gene ID, gene annotation and gene ontology entry, and lists them according to the gene ID in the record in our collection.
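As a rough sketch of that parsing step (the file name is made up, and the real parsers pull out many more fields), reading a RefSeq GBFF file with Bio.SeqIO looks something like this:

```python
# A rough sketch of parsing a RefSeq GBFF (GenBank flat file) with
# Biopython's Bio.SeqIO. The file name is hypothetical, and the real
# parsers extract far more annotation than shown here.
from Bio import SeqIO

for record in SeqIO.parse("refseq_rna.gbff", "genbank"):
    organism = record.annotations.get("organism", "?")  # per-record metadata
    for feature in record.features:
        if feature.type == "gene":
            # Entrez gene IDs appear in db_xref qualifiers like "GeneID:1017"
            xrefs = feature.qualifiers.get("db_xref", [])
            gene_ids = [x.split(":", 1)[1] for x in xrefs
                        if x.startswith("GeneID:")]
            symbols = feature.qualifiers.get("gene", [])
            print(organism, gene_ids, symbols)
```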
After that, once we've created the genedoc source database, I run the data load script, and that allows me to index anything that the researcher asks for. The pyes library, a Python client for the Elasticsearch that we run on Amazon, is what provides the searching; we use a lot of it, and it allows me to combine and serve this data and support all the rich queries that are required.

Then, the data indexing part; just a moment, okay. If you see some of the code, you will find that we have highly nested dictionaries and structures, so the data indexing files flatten the dictionaries, because they are so deeply nested that sometimes pulling out the data is very difficult. It's a huge file, but I've just taken this part to show you that we can flatten the structure, and then we are able to do the search for the indexing part of it. We get the metadata, we can get the mapping, and we can get it on the basis of the keys, on the ontology, depending on the structure; if it's a protein kinase, then we can define which fields have been used for that particular collection in Entrez, or in Reactome, or in any other database that we have.

After that, this is a list of the collections that we have. As you can see (if I go back over here), there might be some data which is there in the Entrez Gene database but is not available in HomoloGene. So when I'm searching and indexing the data I need to flatten the structure, because the structure that each of these databases uses is very different: a HomoloGene database has a completely different file structure, and when I drop that into my collection it does not necessarily match what is there in a UniGene database, and it does not match what is there in a gene info database.

Oh, I'm sorry, there's quite a bit more to go on, but okay, I'll just quickly end this talk by saying that you can go and pull the code from the URI. Did everybody get the URI? It's bitbucket.org/sulab, so you can just go and check out the code that we have; genedoc-hub is the code repo, and my code is in my fork of genedoc-hub under my own Bitbucket account. That's my branch; it still needs to be pulled into the core repo. I'm so sorry, there is quite a bit more that I could discuss and talk about, but we've run out of time, so I'll just say thank you very much for listening. If anybody has questions, please talk to me after this is over; I'm around. So thank you very much, and I'm very sorry.