If you’re interested in machine learning and Ruby, you’ll probably like this talk I gave at Pivotal Labs in San Francisco:
Overall, it was a great experience. The Pivotal crew was awesome. They had great food. And I got to fly Virgin America for the first time.
Here’s the transcript:
Starr Horne: My name is Starr Horne. I am with Honey Badger. You may or may not know us. We monitor Ruby apps in production and let you know when bad things happen, downtime, exceptions we just added performance monitoring stuff and launched that. We have this problem that I imagine a lot of you guys have as well. We have a ton of data. We have all this data. We have data about our own processes, we have data that we collect on behalf of our customers, and for the most part, what we do with this data isn't really that exciting, right? We just grab it for errors. We maybe count things, and if they go over a certain threshold, we email somebody. I started thinking, "Wouldn't it be cool if we could do some awesome stuff with this data? Let's do some more interesting stuff." I started doing a little research. I came across this company. It's actually located in California. They suck in all of this garbage information from the web, all these LOLcat blogs and Justin Bieber fan sites. They turn it into a product that can answer any question that I have. How does it do that? That's like magic to me. How does it do that? How does Pandora know that because I like Otis Redding, I'm probably going to like James Brown? Actually, right before getting up here, I got a call from my credit card company, saying somebody had bought some stuff at Walmart in Georgia and would I like my credit card cancelled. I was like, "Yes, please." How do they know that? I suspect the answer is math. This is a problem for me because I'm not a mathematician. I'm just a workaday programmer. I consider myself closer to being an electrician than a mathematician. A lot of Rubyists fall into this category. These subjects don't really get a lot of play in Ruby-land, these subjects of data mining, machine learning, all this stuff, but they're important. In case you guys have just been working and haven't stuck your heads up in the past five years, this stuff has pretty much taken over the world. I want to be on the side of the conquerors, not the side of the conquered. We're going to do something a little bit different. I've only got 30 minutes. Most talks that I've been to, the speaker selects a topic. Then they try and go as deep into it as possible in the time that they have. I'm not going to do that. We're not going to go deep here. We're going to go shallow. We're going to cover a lot of machine learning topics. I'm not going to go into detail in any of them, which may seem not very useful, but personally, I find this stuff useful. If I know that for a certain type of problem, there exists a certain type of solution and I know the name of that solution, then I can go look it up. If I don't know that that solution exists, then I'm just screwed. I recently had this happen to me. I was working on some internal metrics. When you have a software-as-a-service app, you need to know certain key metrics, like "Am I making money? Am I having more customers convert than I'm having cancelled"? You think, "That's all fine. That's easy to pull out of the database." You do that. You plot it on a graph. You get a really, really crappy graph. In case you're wondering, blue is money. I don't even really need to know that much information. I don't even need to do that much analysis here. I just want to know "Is my money going up, or is my money going down"? and I can't even do it from the raw data. Given my relatively unsophisticated mathematical background, I'm thinking, "This is easy. I'll just do a rolling average." You do a rolling average. That graph also sucks. In the first graph, we had too much data. We had too much detail. In the second graph, we don't have enough detail. There are actual inflection points and trends that are getting hidden by the averaging. Maybe if I change my parameters here and I tweak things, maybe I could make this moving average thing work, but I really don't have time for that. I don't want to do that for every single line in my graph. I'm just too busy for that. I was talking with Thomas Fuchs. He suggested that I check out this thing called "LOWESS." LOWESS is a graph-fitting algorithm. It's different from the graph-fitting exercises you may have done in stats class where you have to look at the data and say, "This data looks like a line, so I'm going to go and do a least squares regression on this." It's going to be horribly painful. With LOWESS, you just give it the data and it plots a graph on it. It fits a graph on it, which is amazing. I applied this to my data. We get some useful lines. You can see inflection point, like when I went and deployed an update to our billing system that made it easier for people to sign up, we saw an inflection point. This was amazing, the algorithm's cool and all that, but the thing I thought was amazing was that I didn't have to do any work on this. I had to do 20 minutes of work to make this happen. The thing that I needed was this password, LOWESS. I had to type this password in to Google, and then suddenly I was able to solve this problem. That peaked my curiosity. I started to wonder, "What else are they hiding from us"? It turns out they're hiding a lot. There are lots of pretty well-known solutions to pretty well-defined data mining and computer learning problems that are complex mathematically, but are pretty simple to implement once you know about them and once you download the right gems. We're going to go through some of these. The first of these is one of my favorites, because I can understand it and it's simple. This is houses on a graph. I don't know if you've ever used Zillow, but this is a screenshot of Zillow. I didn't put the prices up there, because this is Seattle and I didn't want to make you guys depressed. See how cheap things were. I had to make that joke, because this is one of five cities where I can make that joke and it will be true. If you've ever used Zillow, one thing that stands out to me and annoys the hell out of me is that when you zoom out, the land gets smaller, but the house icons stay the same size. Those houses are two miles wide now. Maybe you could scale down the houses and make them really small, but that would give you a crappy UI. A better solution is to do what Trulia does, which is to cluster them. What they're doing here is they're using an algorithm called k-means clustering. It's one of the most common clustering algorithms you'll find. You may have heard of it. It's so easy that I can explain it to you right here in about a minute. What you do, in order to do k-means, is you say, "I'm going to have five clusters." You have to determine the number of clusters out front by fiat, then you say, "I'm going to throw these cluster points randomly up on the graph." It doesn't matter where, then I'm going to say, "All you houses, find the cluster point that you're nearest, and you now belong to that cluster." Now, in your program you've got five arrays full of houses. You iterate over those. You average the XY points. Now you've got new cluster points. You've got five new cluster points. You say, "OK houses, forget all that other stuff. Now, using these new points, go find the nearest cluster. You belong to that." You keep doing that until the houses stop jumping around, basically. If any of you guys studied this in school, I'm sure I completely butchered the algorithm, but this is how I see things. That's all well and good for houses on a graph. This is a pretty one-off thing. It's a pretty unique use case, but it turns out you can do this with any type of data you can put on a graph, which is neat. Take images, for example. We're used to thinking of images as these 2D matrices of pixels, but you can also represent an image as a graph of colors in color space. If you run a k-means clustering algorithm against those, you can find, I forget how many colors are actually up here, but, say, 16 colors that you can use to represent all of these hundreds of other colors up there. I don't know if any of you guys spent the '90s playing around with Photoshop filters and making band posters like I did, but you know where this is going. This is going to Posterize. In Posterize, you have a dialog box and you say, "Make this image in to five colors," or whatever. It makes it all psychedelic and awesome. Maybe you're not in to psychedelic stuff, and that's fine. This whole Posterize operation, it's reducing the number of colors, is one step removed from downsampling an image from, say, 24-bit color to 8-bit color. It also works with music. [music] Sorry, I like this. I want to listen to it. What happens is that when you downsample an image like this, when you downsample music like this, you're reducing the file size. You wind up with a Lossy compression. Once you have all that crap together, you have the perfect recipe for audio and video codecs. [music] That makes me want to dance. That's cool. Fun times over. What the hell does this have to do with what we actually do with what we actually do as Web developers? This is all cool and everything, but what's that Einstein? Einstein's saying we should ask what a cluster is. What is a cluster in the abstract? A cluster is a bunch of objects that you put in to a geometric space in such a way that objects that are close together in the space are similar in some way. I showed you houses. That's probably the most obvious solution. I showed you colors, which is a little bit more abstract. You also do this with ideas. You can do it with people and characters. Here we've got the standard character class graph, and we have all the Star Trek characters grouped on it. Up in the left-hand corner you've got Picard who's lawful good. He always follows the rules and he's good, but Kirk, he's a good guy, but he plays by his own set of rules. He's chaotic good. Once you have things on a graph like this you can find relationships that you may not have been able to see before. This is the basis of a lot of machine learning. It's silly, but it is. For example, we might not have noticed that, for example, Kirk probably would enjoy having lunch with Q more than he would enjoy having lunch with Picard. I didn't know that before. You can use this to predict things in the future. Who would I enjoy having lunch with more? I'm up there by Bones. Me and Kirk, that would be an awesome lunch, if you're watching this, William Shatner. [laughs] Borg queen, though? Right out, don't want anything to do with her. [laughs] This is really stupid. Obviously, this is put together in Photoshop. How do you do this with a computer? How do you get from things that a computer can understand, like strings of text, and it doesn't even understand the strings of text, it understands that it's a string of data. How do you get from that to being able to plot meaning on a graph? That's pretty crazy. That's pretty out there. It turns out there's math to do it. I'm about to show you guys the most magical thing that I've ever done on my computer. In the next few examples, we're going to talk about a bookshop. A quaint little bookshop happened to be bought by an eccentric European billionaire, wants to take on Amazon. His first decree is that the people have to have recommendations. They want to be able to recommend similar books to people, so if you get a book about farming you might also like a book about chickens or about sheep. This is a tough nut to crack, because before I started looking in to this my first approach might be to say, "We've got these categories and got this metadata. Let's try and go and do some sort of search inside of..." Einstein's saying we should use Latent Semantic Indexing. He knows a lot. We're going to do that, similar items with Latent Semantic Indexing and primary component analysis. This is what I meant about the magic words. It turns out with these techniques you can take a list of books. These are nerdy books, of course on integral equations, attractors for semigroups and evolution equations. I copied these out of a scholarly article, so that's why. I was too lazy to make my own table. I'm walking you through a Latent Semantic Indexing process. You take this list of books, then you index it by words by book, which books contain which words. It's a matrix of ones and zeroes. It should be obvious how that works. I'm going to do some complex math that is simple code. It's about five lines of code. You have your data. You take your data. You run it through some linear algebra stuff. You wind up with two, wait, that's four, two sets of XY coordinates. One of these sets of XY coordinates relates to the words, and one of them relates to the books. If you plot these on a graph you get that. I'm going to give you a second to look at this and let it sink in, because when I read this I was so excited. I went around for an hour. I was giddy. I called my wife. She's not even a computer person. I was calling her, and being like, "Oh my God"! What we have here is we have books, and we have the words that describe those books clustered according to which ones are more similar. We're able to solve the problem that the eccentric billionaire posed for us. We're able to say, "If the user likes B8, they're probably going to like B13, and B4, and B15, because those are similar." Not only that, you can solve a lot of problems that dude didn't even ask for. You can use this for search. You can take a search query, map this into this graph, and if it winds up near a certain cluster of books, then there's a good chance that's what the person is actually searching for. It's all fuzzy, but some people speculate Google does a lot of stuff like this. It's how they're able to figure out that the Amazon River is different from [amazon.com](http://amazon.com/), because you get these clusters of meaning, which is amazing. We're moving on from that. We're done. We implemented that. I told you we were going fast. This guy, he's impressed, client's impressed. He says, "I'm impressed," but now they've got this other problem. They implemented this comment system for books, this review system for books. The system has one huge problem, in that people are actually leaving bad reviews for books. When people leave bad reviews for books, fewer people buy those books which seem like a problem for a "businessey" guy. The guy wants us to be able to automatically classify good reviews and bad reviews so they can hide the bad reviews down at the bottom. Bear with me, because I know that no online retailer would ever tamper with online reviews. This is just an example. This is an example of sentiment analysis. We're trying to figure out is the review good, is it bad, is the person who wrote it happy, were they sad? Bayesian classifiers is a good approach for this type of problem. You guys probably all know about Bayesian classifiers. They're used a lot in classifying spam. We love them as developers, because they're fast, they're really easy to implement, and to understand, and they're useful for a wide variety of problems. Even though they're built on shaky fundamentals, for some reason they work. It's this weird magic truth. The first thing you do, if you're going to tackle this problem, is you hire some people on Mechanical Turk, or whatever, to go through and rate each review as good or bad. This one's a bad one. This one's probably a good one, and then you train your classifier. When you do that it's very simple. You're making a bunch of counts. You guys probably already know all this stuff. I'm going to go through it anyway. For each word, you make a list of counts of the number of good reviews it was in, the number of negative reviews it was in. When a new review comes in, all you have to do is split it up in to words, figure out how positive and negative each word is, and then do some math to glom those all together in to two probabilities, the probability that it's a positive review and the probability that it's a negative review. I should point out that these probabilities. First of all, they don't add up to 100. Second of all, they're complete crap. Essentially you can't trust a, and I use Bayesian classifier to give you real probabilities. The reason is because of that naive part. We made several assumptions. Actually, you make an assumption, to make the math go faster, that all of these words are independent, that the probability of one word being in a negative review doesn't affect the probability of other words being in a negative review. Your probabilities are wrong, but more often than not you can make this comparison return the correct result. It's cool. This guy, he's feeling his oats. He's getting to be really obnoxious. They have found that people get really in to these book series, so if you email somebody right when the new Game of Thrones book comes out, they're much more likely to buy that Game of Thrones book from you, but they don't want to email every single person on their mailing list, because some people don't like Game of Thrones. This is my favorite slide out of this whole presentation. I was so happy about this. [laughs] This is a good example of something that is a good use case for a, it's called a decision tree. Decision trees are really neat for problems like market analysis, because essentially you take a list of attributes about, say, a person, so you have age, education, income, marital status, and then the last one is the one that we're trying to predict, whether or not they like Game of Thrones. You run it through a magical decision tree maker, and you get a graph. You get an actual graph that you can use to figure this stuff out yourself. You're not 100 percent dependent on the computer, which is neat, because if you're doing a marketing analysis you can give this to the marketing guy, which is cool. They might be interested in the fact that for some reason everybody who's over 55 years old really likes Game of Thrones. I don't know why that's the case, but I would spam the hell out of them with all the sci-fi fantasy stuff. This may seem trivial to you, you go from a table to a graph, but it's not. There's some interesting math going on in the background, because in order to make this graph make sense, in order to be able to predict the outcome using the fewest number of steps, you have to do some analysis on the table and figure out which columns provide the most predictive information about the outcome. To do that, you use these formulas that were developed for signal analysis in the '50s, and it's really glasses and pocket protector stuff. You deliver this. The marketing guys are thrilled. They're ecstatic, because they can actually understand something you gave them for once. Now the dude's off his rocker. He's wanting you to build him an OCR, optical character recognition. You have to do it from scratch because the dude had a bad experience with Java one time and all OCR packages are written in Java. I don't know what that pancakes thing is about. I stayed up too late last night making these slides. Oh, great. Now even Einstein's not helping us. Beware the curse. The curse is a curse of dimensionality, which is probably my favorite mathematical term ever, because it's so dramatic, even though it describes something extraordinarily simple and mundane. The curse of dimensionality basically means that if you have a dataset, and that dataset is along one dimension, say, a line, as you increase the number of dimensions in that dataset, the data gets farther and farther apart. The points get farther and farther apart. It's easy to think about if you think of two points on a line. This is one dimension. I've just got width. I'm going to add height. For any value of height, except for zero, the distance between the two points gets bigger. If you wanted to add a third dimension, it's sort of the same thing. The reason that Einstein is being really creepy about this is because as you add dimensions, as this data gets more and more spread apart, a lot of machine learning models tend to get less effective. This is my really long and awkward segue in to text recognition with support vector machines. Support vector machines are one type of classifier. They're good with highly dimensional data, with matrices with a lot of dimensions on them. Here's what a support vector machine looks like. I'm going to give you the really hand-wavy explanation for why they foil the curse of dimensionality, and that is because they're not actually looking for clumps of data. They're actually trying to find lines or planes that give you the greatest separation between data, so you're actually looking for a space in between things. In two-dimensional spaces this looks like this. In three-dimensional space it gets crazy. One other interesting thing about support vector machines is that they can actually handle nonlinear boundaries. In a lot of these machine learning techniques, these classifiers, there's a line in the sand, and if you're on this side then you're classified as one thing, if you're on this side you're classified as another thing. Support vector machines, through this clever mathematical trick, are able to draw really wavy lines and planes. I don't know what a plane is called when you make it 500 dimension. The reason for all this dimension stuff, I'm not trying to sound smart, because I usually fail at that, the reason for all this dimension stuff is that when you deal with subjects such as OCR, optical character recognition, handwriting recognition, you're dealing with images. Here I've got some bitmaps. These are 32 by 32 bitmaps. That's what we tend to think of these as, but when you go to do an analysis on these, the way that you do this analysis is you turn this in to a 1,024-dimension matrix. Once you do that, you can do all this crazy linear algebra stuff on it. You can run all these classifiers on it, and it works. You can make handwriting analysis using these support vector machines. I didn't write this program, but I found a screenshot of it online. I consider that proof. [laughs] To summarize, we have smoothing, clustering, houses, images, audio downsampling, compression, market segmentation modeling, spam detection, personal recommendation, similar items, full-text search, and handwriting recognition. I'm sorry I couldn't get more in. I can't talk that fast. If you would like to learn more about this, there are two really good video courses online, Stats 202, it's a Google lecture series, and there's also a Stanford Machine Learning course. It's about 20 hours of graduate level machine learning stuff that is amazing and mind-blowing to watch. I need to credit Iliar Gregoric's blog, because he's got a lot of simple machine learning examples. I stole, all my code examples from his blog. I need to credit him. If you want to learn more about what I'm up to, I blog at [honeybadger.io/blog](http://honeybadger.io/blog), I'm on the tweeters at @starhorn. That's it.