Do Angry People Have Poor Grammar?


Hi, thanks very much. Can everybody hear me? Yeah? It's okay? Good. So I'm Ben, and I'm from London, despite my accent. I'm going to talk to you a little bit about trying to work out if sentiment has any tie to the quality of writing, which is the more generic way to state that question. What we're going to do is use this idea of angry people on the one hand, and grammar or stylistic notions on the other, as a way to poke at what we can do with understanding language in Python, and also doing some statistical dependency analysis in Python. Although I should warn you, there's probably going to be only slightly more Python in the actual slides than in the Julia talk next door. So we'll probably beat that one, but only because they're not even talking about Python.

Okay, so with that in mind, let's talk a little bit about what the point of this is and where it comes from. I don't know how many people in the room use Twitter on a semi-regular basis, daily, weekly, whatever? Okay. So have you ever noticed, on something like Twitter or really any sort of short-form social media, that people who are loud, and loud is a sort of metaphorical usage here, caps lock is screaming, exclamation points,
et cetera, they don't necessarily put their words together very well? Like that. Well, sure, yeah. And I will say that my country, the United States, we do our part here, right? So people yell about things, they argue, but do they really get angrier and also get less well considered? Because I have this idea, everyone sees these things of all caps and poor grammar, using the wrong form of "to", splitting their infinitives and whatever other particular grammar rules, using "literally" incorrectly, all these other sorts of things. But the problem is that there's this notion in psychology of confirmation bias. Does anyone know confirmation bias? It's my favorite. Basically, you're more likely to remember a thing if it confirms what you already thought was going on, right? It happens all over the place, and it's one of many little psychological things that are reasons why we, in this sort of community, do our best when we go and poke at the data in as neutral a way as possible.

So can we do that with this question? I think we can, and maybe it can look like this. First we have to find a pile of comments, right? We'll get back to where we can do that later. Then somehow we need to measure the style in a quantitative way, and also measure the sentiment in some way that gives us some numbers. And then, I don't know, we're going to do something else, and then somehow we'll profit, and that'll be great. So what do I mean? We need to do statistical dependence, and to do statistical dependence we are simply going to work out if that first side, the measured style, has anything whatsoever to do with that second side, the sentiment. And then, I don't know what profit means. I guess it means I'll get more Twitter followers out of this. That's what this all is, it's circular, right? So we talk about social media, then we do better on social media, then we talk
about social media. It's great.

Okay, so let's find a pile of comments. The pile of comments is from Reddit. I don't know if anyone's familiar, but last year there was a really good community-driven effort to package up everything that's ever been uttered on Reddit, wrap it in some JSON, and seed out some torrents of it. It's a really great resource for doing lots of the kind of work we're going to talk about for the next half hour or so, but there's a tremendous amount of other possibilities with this data set. I highly encourage you to go pick it up and have a play with it. It's also a really good way to flex your muscles with highly parallelized work because, as I'll get to in a minute, you can't simply run this volume of data on a laptop. So if you really want to get through all of the data, you need to use some more big-data approaches. So here's the magnet link. I'm sharing some torrents, but they're legal.

That data set in fact is 1.7 billion Reddit comments. I think it was last updated a couple of months ago, though there is an idea that it will steadily grow with monthly updates, so if you use it you do have to be careful to be clear about which time version you use if you want to compare results with other people. Also, handily, there's a separate torrent that has the exact same formatted data but strictly for the month of January from last year. Just that month is 59 million Reddit comments. This is what we're going to be working with here, because most of these methods I've optimized for clarity and understanding and exploration, which means that they aren't the fastest things in the world, and so getting through 1.7 billion comments can take weeks. Rather than doing that, we're just going to use the subset, and actually what I've done is sample about 1% of that data so we can just burn through things and do some visualization
and whatever, and it doesn't take too long. In particular, I'm running that through this filter. Here's the first bit of Python. We're using tokenize from the Natural Language Toolkit, which is not making its last appearance in this slide deck, and this is a simple but not necessarily fast way to break up large chunks of text into, nominally, words, but really small pieces of meaning, and then we can count them. What we're doing is skipping all the comments that are sufficiently short that we don't trust the results from either the style and grammar or from the sentiment. So anything that's less than five words, we're assuming is not enough to have reliable analysis on both sides, and we're throwing it out. That gets us to 319,000 comments from January of last year, which is quite a bit smaller than our original set but is still a very large amount of data, and will get us some nice p-values at the end to tease out.

Okay, so how do we do this? Step two: measuring style and sentiment. The first thing is grammar, which is in the title, so you might think that's what we're going to do. So let's look at grammar. If you think grammar and computers and computer science, maybe you think context-free grammar. In the room, how many of you have ever done any sort of context-free grammar parsing, this sort of linguistic analysis? Anybody? Hands? Okay, so a few. The idea with a context-free grammar is that you can take a series of texts and analyze it without having to have a complete mapping of the words, using the relationships of the words based on their type. You can also use context-free grammars to analyze programming languages, not just spoken ones. So it's great for analyzing ambiguous situations and complexity and things like this, and there's a really good example that I have shamelessly stolen from the Natural Language Toolkit book on these topics. Here's: one morning I shot an elephant in
my pajamas; how he got in my pajamas, I don't know. Okay, so that's this text. I'm going to thoroughly ruin this joke, and I apologize to any Marx fans. So: one morning I shot an elephant in my pajamas; how he got in my pajamas, I don't know. The critical thing that's fun here is this part right here, and the joke, if it isn't clear, is that it's not exactly obvious who is in the pajamas, right? What we can do is analyze this using a context-free grammar. If we don't care about the infinite possibilities of English but just these particular bounds, we can do it with quite a simple grammar.

We've defined the grammar this way. To quickly run through things: S is a sentence, that's our largest nominal unit of information, and every sentence in this grammar has a noun phrase and a verb phrase, that's NP and then VP. We can then define what each of these things means recursively, as we do down here. I apologize for not spending too much time on this, but we're going to have to push through for time. So we have these various things where one unit of information represents a particular grammar rule, and those rules build on top of each other, and then at the bottom we define the different types. So a determiner is "an" or "my", a noun is "elephant" or "pajamas" (and I misspelled pajamas on the slide, sorry), and such.

We can then use that to build a parse of our confused sentence. Again, this is not too hard with a little bit of NLTK magic, and this is a simple parser. Parsing context-free grammars, I should say, is a topic deserving its own series of talks. There are many ways to do it, and they have different pros and cons. This particular method, the chart parser, is useful because when it's done you get these, which are charts of the relationships based on the grammar rules. But, I mean, maybe there are some people who grew up on Lisp and are good at these levels of parentheses. I sort of think
this is easier to get through if we draw a tree, and this is something that, if you start looking at grammars and grammar analysis, you'll see quite a lot of. The basic idea here is that we have two different ways of understanding this sentence: one which understands that the elephant is in fact in the pajamas, and the other that the man who is doing the shooting is in the pajamas. Specifically, in the one over on the left from your view, both "shot an elephant" and "in my pajamas" refer to the "I", so here Groucho Marx is wearing pajamas and shooting an elephant. On the other hand, on the right, we have that "I", that's Groucho Marx, has shot an elephant, and the elephant is wearing pajamas. The ambiguity, of course, is where the joke comes from, and again I apologize for thoroughly ruining Groucho Marx jokes.

So that's all from this grammar, but that grammar obviously is not all the rules of English, right? I hope that everyone in the room can see that. English in fact has many, many more complicated rules and relationships between nouns and verbs and adjectives, and other ways to form sentences and such. If we extend that forever and always, thousands of times, we can at least in theory define English, except that really it's a static simplification even at best, because language is full of nuance. You saw it in the example: there were two different ways to tree that out. And language is also always moving, so if we define a context-free grammar and then take a breath, it probably won't fully describe the ways that people speak to each other and write English sentences. So there's this idea of a PCFG, which is a probabilistic context-free grammar, and that helps a little bit, because then we can say: all right, in this particular case we have a 30-percent likelihood that this is the correct rule and a 70-percent likelihood that that's the correct rule, this sort of
thing. So you can effectively assign weights to ambiguities, and that's better. But the thing, as this lovely quote summarizes, is that there's not really a holistic agreement on what is grammatically correct, right? As an American who is now living in the UK, I see this on a regular basis. There's also no real notion of uniformity across speakers of English as a second language. So you have a core idea of what well-spoken English looks like, but there are so many edges and so much fuzz that actually maybe thinking about grammar is not quite what I mean. Maybe we want something a little different. Maybe we want to test for style.

The nice thing about style is we can arbitrarily say: these are the things that we like, these are the things that are concise, that are easy to read, that are clear, that are professional, whatever. Is anyone familiar with linters? Does anyone lint their code? Pylint is the common one for this language. Okay. So Pylint doesn't say whether or not your code is correct, and it doesn't say whether or not your code will run. It says whether your code conforms to the norms of our community, which handily in Python are quite well defined in public. You see this for all sorts of programming languages, and indeed there are projects to do this with English. proselint is a reasonably young project that I am not directly affiliated with, and this is not an advertisement that I was paid for, but it's a lovely tool. In fact you can test it, although not here all at once or the Wi-Fi may come down. This is a nice little live piece of JavaScript, so you can just type in a box and hover over the rules when you violate things. It's quite lovely.

The core of it is this piece of a config file that defines the rules that we care about. proselint has a specific definition of various stylistic considerations from all sorts of canonical places: lots of
writers, various style guidelines from places like the AP, and other canonical sources, and then you can go through and say which things you care about. For instance, you have consistency.spacing and consistency.spelling. consistency.spacing does not mind if you use one space after punctuation or two spaces after punctuation, which is a great way to start a fight on an English-language comment board, but it wants to make sure that whichever one you use, you use the same throughout your writing. Similarly, it doesn't take a position on whether British English or American English is the correct way to spell things, only that you're consistently applying one throughout. So if you use a "u" when you spell "colour", that's fine, as long as you do the same the next time. So this is quite useful.

We can look at, for instance, a definition here. This is checking for repeated exclamations, and all it's really doing is running a regex to see if you've used lots and lots of exclamation points over and over again instead of just one or occasionally two, and then it prints out an informative message associated with the rule, which here is: stop yelling, keep your exclamation points under control.

Okay, so we can run this prose linter over our entire data set of comments, and if we do that we get some results. Everyone likes results. This is a count of lint issues brought up by the linter per comment across our whole data set, and then the second row is the same counts but normalized against the length of the comment expressed in tokens, so it's the count of tokens as the divisor. The idea here is that it's reasonable to assume that, for the same quality of comment, you would expect a longer comment to have more issues in absolute terms. One thing to note, actually, is that surprisingly, on Reddit you don't have a lot of stylistic trouble. So this distribution is actually quite flat for most of its
time, and we'll cycle back to this a little later as well when we start comparing it to sentiment. But the mean is not very high, with a reasonably tight standard deviation, so it takes all the way to the 75th percentile for raw counts to get to one per comment. Now, the comments are not massively long. We're talking about a dozen tokens, maybe 15, this kind of thing. So when you divide them out, that does bear out, but you still of course aren't affecting the distribution. Fine.

Okay, so let's talk about sentiment analysis. How many of you have done any sentiment analysis previously? We have a scattering, about the same. Okay. So sentiment analysis is where we're trying to go from a block of text to an understanding of conveyed meaning. Positive, neutral, negative is the typical grouping, and in fact the grouping that we'll be using here. There's also another idea of intensity: are you gently expressing your point, or are you really trying to beat it home? There are many approaches for doing sentiment analysis. Many of them are considered as supervised learning classification problems, where you take some features from your writing and then you try and stick it into a class of positive, or a class of negative, or a class of neutral, with some types of gradations between them, these sorts of things. The problem with most of these is that they tend to fall over with shorter texts, and even with our threshold and filtration we don't have the luxury of working with paragraphs and paragraphs of text per item. This has thankfully been quite a common case for this kind of work for a while, and some very nice people have sorted out a pretty good methodology for dealing with short, especially social-media, text, called VADER, which stands for the Valence Aware Dictionary for sEntiment Reasoning. I should say, a bunch of the slides have URLs down here, and I'll put all of
them up so you can see all of my sources for this. This is a very readable paper, in this case from the designers of this system. VADER is entirely rule-based, which is useful because it's reasonably quick and you don't have to train a set. It does have built-in biases and assumptions, but those biases and assumptions are roughly the ones we want, because we're dealing with Reddit comments. VADER was designed and tested around Twitter, but it's reasonable to assume, and this is the assumption we'll make, that the behavior on Twitter is roughly analogous to the behavior that we see on Reddit, even if Reddit doesn't quite have the strict character limits that you see on Twitter.

To give you just an example of what VADER looks like and does, this is from their paper. We have some tests and explanations of what's going on. Here we see eight different conditions that are broken apart. What it's doing is looking at and breaking up the text based on things that it sees going on. So you have a basic piece of text that, flat, says "Yay. Another good phone interview." Then there's the addition of punctuation, an exclamation point, in the second one. Then there's punctuation and also a degree modifier, so that third one uses the word "extremely", and that use of "extremely" flags it up as pushing the intensity, et cetera, all the way down, getting more yell-y by way of caps lock, perhaps expressing themselves via more exclamation points. They go through this in the paper, and there's some very good research that backs this particular set of rules to show that it lines up with human intuition about the meaning of the sentences. It's also open source, and their data set is open source. It's quite easy to use. It's been integrated into the Natural Language Toolkit and a few other pieces, so it's pretty easy to just run through. And so indeed we can run it on our data set, and when we do so, we get these
fine results. Is that legible from the back? Can you guys see my tables of numbers? Yeah? Okay, great. The basic point here, actually, is that things are not as negative as we might have feared, and actually it's quite neutral. This bears out if you just pick a random Reddit comment out and have a read. Most of them are just flat. Most people are having sensible and reasonable conversations. So that's nice, and that makes you feel better about humanity, although it doesn't help if you wonder about that hypothesis we started out with, but we'll get to that in a second. It's also just quickly worth pointing out that these are all normalized so that they vary from 0 to 1. So you have three independent data fields, positivity, negativity, and neutrality, and then the intensity of each of those three ideas, which is a bit different from how the results get expressed in other kinds of sentiment analysis approaches.

Okay, so third item: statistical dependence. Statistical dependence, in the most generic way you can think about it, is the idea that you can define some set Y as a function of some set X. In the simplest case that's one series of numbers defining another series of numbers through a function, and in a more complex case you can have multivariate statistical dependence, where you have many data sets that are all going to inform the larger one. Does anyone do statistical dependence in the room? Anything? No? Okay, a little bit. Regression, anyone? Any practitioners of regression? Okay. So regression is a particular way that you can look at statistical dependence. It's used quite a bit in our field. I think it's a bit of a crutch, in fact, but we'll do it anyway. This is just a rigorous definition here of least squares, which is the easiest way to do regression. And so if we run some linear regression on our variables, we make this
lovely little chart. Some of you may recognize it from Seaborn. What we have is the normed lints on the top row, then positivity, neutrality, and negativity, and then the same across. What I think really matters most is that top row. What we're comparing there is the normalized lints to each of the three different variables, and at least to my eye, and we'll go through this with a little bit more math in a second, there's not actually a lot of correspondence there, right? They're basically just blobs of dots, which isn't what you'd expect if there was a clean dependence. You would expect a nice line, and dots that adhere pretty closely to that line.

So let's poke at this a little more. Let's do some stricter correlation testing just to confirm what our eyes are telling us. Pearson's is a nice way of testing for linear correlations. It doesn't, of course, test for more exciting kinds of correlation; if there was a very exotic polynomial that led to correlations, this would miss it, but the data doesn't really imply that. So if we run a Pearson's test on both the lints against negativity, we get some very, very low values. Now, when you do correlation testing, a value of zero means that the data is unrelated, that there is no function, aside from an exact enumeration of the data, that will map from one set to the other. When you get to one, it means that there's a nice positive linear relationship, a positive sloping line, and when you get a correlation of negative one, it means there's a nice negative relationship that is exact and complete. And then there's our p-value. What this tells us is basically that there's no there there: our correlation values are effectively zero for both the lint and normalized lint, and our p-values are very, very small, which means these are actually pretty reliable results. So it turns out the answer is no, to go
back to our question. I mean, that's maybe not that exciting, but it's a strong no. We have a lot of data, and it seems to reinforce the idea that there isn't really much of a correlation there. So that's something, isn't it?

So what are our conclusions? The first thing, and this to me actually was the most surprising when I started burrowing into this data, is that it seems like people on Reddit are actually pretty reasonable, at least at large. I do wonder, going through this as someone who is an occasional redditor, as opposed to someone who sits on lots and lots of communities all the time, if there's a distortion effect in what bubbles out to the broader populace from Reddit that informs our ideas of what goes on across all of Reddit. Because actually most of the conversation is neutral in sentiment and reasonably well constructed according to our style analysis. Beyond that, when it's not reasonable, and by not reasonable here I mean negative in sentiment, high-intensity negative, it doesn't appear that, broadly speaking, it's stylistically any worse. Now of course there are fine examples that we can pick out. We can go through and find the egregious ones. But looking at it writ large, there's not much of a dependency between the sentiment and the stylistic problems, at least as we define our stylistic problems in our linter.

Really, I'm not that excited about generalizing this. So a task left to the reader, or to me in a while, is to take these processes that we've done and set them up to parallelize a bit better, run them through some big-data tools, Hadoop and such, and run it on the whole data set. Then we can actually make some statements about the whole of Reddit. My suspicion, given how strong these results are, is that it will generalize, but we don't know that until we try. And I think there I'm going to leave it. So I will happily
take some questions, and thanks very much for your attention.

Would you think the results would be different if you broke the data down by, say, category of the topic? So people talking about IT, or just some random news?

Yeah, maybe. Having now done the global analysis, all of the comments in the distributed data set have a subreddit attached, the defined community on Reddit, for each comment. So it would be quite easy to label the data out with subreddits. You obviously lose the total comment counts, right? For most Reddit communities the participation is quite low. Reddit, being a place on the internet, follows a pretty strong power-law distribution, so you have a few subreddits that have seventy, eighty percent of the activity across the whole site, which are things like AskReddit and the meta community policy subreddit and IAmA, and beyond those you get a very steep drop-off of activity. So I think it's worth exploring. My suspicion is that there will be some variance, so there might be some communities where you would see actual correlations between stylistic use and positive or negative sentiment. But what this result tells us is that in most communities you wouldn't expect that. So while there might be exceptions at a community level, in general, especially with very large communities, you'd expect them to correspond to this, because this is a pretty strong result, at least at the general level.

Hi. One thing I notice about Reddit is that it tends to have certain memetic behavior, and sometimes people write a comment in the exact same way, for hundreds of comments which look exactly the same, because there's a meme. Do you think that, with your results taken just from Reddit, when you extrapolate to the entire internet it might actually turn out different? So could it skew them?

Yes. Do I
think it skewed this sample? Probably not, and the reason why is because those sorts of behaviors, at least in that particular example, and there might be other behaviors that clump a little differently, tend to be very time-based. Reddit is maybe not as much of a moving target through time as the particular nether regions of 4chan, but it does tend to fluctuate quite quickly on these sorts of behaviors. The sample I took was spread across January, so it was a random pull of those comments, and my suspicion is that that's probably enough to blur those sorts of effects. And also, just generally, the size, right? As I mentioned a second ago, a surprisingly large portion of Reddit is from the general comment communities, things like AskReddit, where you actually don't see as many of those kinds of behaviors. I mean, one of the other very large communities is the sort of thing that Reddit's famous for, this thing called Advice Animals, and certainly in that subreddit, which is just full of pictures of animals and their construction into image memes, you do see things like that. But some of them are structural, right? They're not necessarily parroting each other word for word. They're just using the same structural form, in a way that you could almost think of as a kind of poetics, where you have a top-of-the-meme, bottom-of-the-meme kind of construction. So, yeah.

Hi. How would you detect sarcastic comments? I mean, is it possible to detect sarcasm in sentiment analysis? When you give a questionnaire, people tend to come up with some sarcasm. How is it dealt with?

So first of all, it's certainly the case that sarcasm is hard, but VADER does a reasonably good job of it within the societal norms of Twitter. There are rules within VADER's specification, and you can read them all, in fact, which is one of
the things that's nice about VADER: it's very easy to see transparently what they're using to make determinations. I think VADER has something on the order of eight or nine thousand specific rules, and a good chunk of them are for understanding trickier situations in sentiment analysis. They aren't going to work if what you're analyzing is a three-paragraph customer review, right? They might work on any individual sentence taken in isolation from those things, but it's working from the assumption that you are restricted to 140 characters, so you've got maybe one, maybe two short coherent sentences, and that's what it's analyzing. So your sarcasm can't be a whole sentence that, in retrospect, makes the last three sentences sarcastic, which is something that you can see in larger prose. It's not going to help you there. But VADER's approach does work quite well in understanding tone through social media indicators. It uses a whole bunch of emoji vocabulary to understand these things, stuff like "LOL" and these kinds of things. That's with the caveat that I don't know how long it's going to be correct, right? Language moves, and VADER's only been around for about two years, so it's not totally clear that it will be updated as usage changes. But I think it's quite a good approach for short-form text, and you can definitely modify it so that it's more helpful if you have a couple of paragraphs, things like this, which might be a good approach if you have a lot of sarcasm you're dealing with, because classification can be hard in that situation.

I was thinking about how you threw out all the short comments. I was thinking maybe angry people write shorter comments, stuff like "yeah, this sucks".

Yeah, no, it's totally valid. The problem with that is the stylistic rules that we're using don't make any sense for a comment that's a single emoji, or someone
bashing their keyboard. Some of what may be going on here, actually, if you want to think about places where this data might be missing the story, is that what we're doing is looking for counts that the linter flags up, against sentiment. Well, if the linter doesn't flag any style things up, then it's great text, and that great text might be "You are the worst human ever". That makes it six, right? And that's a sentence that's fine. If we throw 800,000 exclamation points in, the linter will get mad, but if we don't, then it's still expressing some negative sentiment, but it's doing it in a way that the linter is not going to flag up. So in an effort to deal with some of these issues, I threw out the shorter comments, because I think the linting approach is not going to do a fair assessment for very short stuff.

One thing about the data: I don't read the whole of Reddit, I really read just part of it, and it's not a random sample. I read the top-rated articles and perhaps their top-rated comments. So perhaps if you sampled following that kind of rule, we would see a different Reddit. Do you think that also skews your data?

Yeah. Sampling the way I did absolutely will skew it, but it skews in the way that I wanted, because what I was trying to do was to see, if you examine all of the comments, what would you expect? A random sample is going to be the best way to do that. And actually, if we find the results surprising, then perhaps it's because they're different from the part of Reddit we commonly see, and the part of Reddit we commonly see is the stuff that is notionally popular, that bubbles up through Reddit's scoring system. So one of the ways that you could do this, as a second piece of analysis, is to basically run everything I did but focus on highly scored items, perhaps
the sort of top ten from each subreddit, or the most highly scored across the whole community, this sort of thing, because that would look at this idea. And again the scores, and actually the score separated into upvotes and downvotes, are one of the pieces of metadata that accompanies the comments in this data. So if one was to grab all of this data, you could easily look just at the highly scored items, or top-N-percentile stuff based on voting, this kind of thing.

And you can also look specifically at threads, because one of the things that I'm not looking at here is argumentative behavior. I think when we think about trolling comments on social media, it's frequently not just one person yelling into the wind, but it's a dispute, right? Maybe it's not a dispute that has any substance, but people are yelling back and forth, and there are ways that you could capture that as well. You have parent IDs in each of these comments, so you can easily build the comment tree, and then, for instance, you could examine only the comments that were involved in threads that were, I don't know, three or five long, or involved multiple participants, or things like this, which would get you into discussions that have some length, which might also skew the results in a way that's potentially narratively interesting.

Actually, I would be very curious if you would do a similar analysis but more specifically tied to maybe political subjects on Reddit or any other. But I have a question related to VADER: is it also applicable to other languages? For example, if you would use the Dutch language and just look at...

Yeah, yeah. So certainly the general idea of VADER is absolutely not language-specific, but because it's rule-based, one of the disadvantages of rule-based approaches is that if a language has different norms and grammatical features and sentence structures
then you need some different rules, and that's especially true in the language, so you have to make sure that the rules make sense for whichever language community you're dealing with. Some of them certainly generalize. One of the nice things about emojis is that they are in many ways their own language, but you do have to make sure that they're used the same amongst different speakers. There are some efforts to translate the rules of VADER to other languages. I don't off the top of my head know what the Dutch support is like, but the rules for VADER are on GitHub, so it is fully open source, under I think an MIT license. And so certainly things like translation of the rule set so that they're coherent to other languages, and doing testing on other languages, is something that I have been led to believe the contributors love. So if there isn't Dutch, there's room for Dutch, and I think in principle there's no reason why that approach doesn't work in other languages.

Okay, let's thank Ben again.
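
The thread-based follow-up suggested in the Q&A (build the comment tree from parent IDs, then keep only comments from threads of some length or with multiple participants) could look roughly like the sketch below. It is not code from the talk. The dict keys `id`, `parent_id` and `author`, the function name, and the default thresholds are all assumptions for illustration; real Reddit dumps may name or prefix these fields differently. It also assumes the parent links form a tree, with no cycles.

```python
from collections import defaultdict


def comments_in_long_threads(comments, min_depth=3, min_participants=2):
    """Keep comments from threads that are deep enough and have enough authors.

    Assumes (hypothetically) that each comment is a dict with the keys
    "id", "parent_id" and "author", and that a parent_id matching no
    comment id marks a top-level comment.
    """
    by_id = {c["id"]: c for c in comments}
    cache = {}  # comment id -> (root comment id, depth); top-level depth is 1

    def locate(cid):
        # Walk up the parent links until we hit a known comment or a
        # top-level comment, then fill in depths on the way back down.
        path = []
        while cid not in cache:
            parent = by_id[cid]["parent_id"]
            if parent not in by_id:
                cache[cid] = (cid, 1)
                break
            path.append(cid)
            cid = parent
        root, depth = cache[cid]
        for child in reversed(path):
            depth += 1
            cache[child] = (root, depth)
        return cache[path[0]] if path else cache[cid]

    max_depth = defaultdict(int)   # root id -> longest reply chain
    authors = defaultdict(set)     # root id -> distinct participants
    for c in comments:
        root, depth = locate(c["id"])
        max_depth[root] = max(max_depth[root], depth)
        authors[root].add(c["author"])

    return [
        c for c in comments
        if max_depth[cache[c["id"]][0]] >= min_depth
        and len(authors[cache[c["id"]][0]]) >= min_participants
    ]


# Made-up sample data: one three-deep argument and one short self-reply.
sample = [
    {"id": "a", "parent_id": "post1", "author": "x"},
    {"id": "b", "parent_id": "a", "author": "y"},
    {"id": "c", "parent_id": "b", "author": "x"},
    {"id": "d", "parent_id": "post1", "author": "z"},
    {"id": "e", "parent_id": "d", "author": "z"},
]
kept = comments_in_long_threads(sample)
```

Here a thread is everything under one top-level comment, and its length is the longest reply chain in it. The style and sentiment measurements would then be run on `kept` rather than on the full random sample.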