Posted by Danny Tarlow
This is a guest post by Dr. Scott Turner, who won the Machine March Madness prediction contest last year, and who was the co-winner of the Sweet 16 contest from two years ago. If you like this post, check out his great blog all about algorithmic prediction of NCAA basketball: http://netprophetblog.blogspot.com/.
Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the Aerospace Corporation, where he advises the nation's space programs on software and systems engineering issues.
Danny & Lee asked me to contribute a guest post as part of the Machine Madness contest. I started writing a post about using RapidMiner as part of a prediction workflow, but unfortunately I became overwhelmed with other tasks and wasn't able to finish it. I had given up on it when I realized that anyone entering the Machine Madness contest at this late date might well appreciate a tool that makes the routine parts of building a predictive model very fast. So I quickly finished it up and hope it will prove helpful to someone. Readers who are expert data miners won't find much here, but I hope that it might be useful to the interested amateur who knows more about basketball (football, baseball, etc.) than about statistics and data mining and wants to put in a quick entry.
I will assume that you have some program or method for generating the statistics or ratings you want to use to predict games and that you've saved those results as an Excel file. (These might just be season averages of the statistics Danny & Lee are providing.) As a tool RapidMiner is not well-suited for this part of the problem; its strengths are in pulling the predictive value out of those statistics rather than generating them. (Or perhaps I should say that it's not well-suited as I understand it. I wouldn't be surprised to learn that it has useful features in this area that I don't know about.) The Excel file should have one line for each game, with columns for the team names, statistics, ratings, and scores.
The next step is to download and install RapidMiner. You can do that here. The "community edition" of RapidMiner is completely free. (I like free.) There's a user forum here where questions usually get a fairly quick response.
Once you've installed, start up RapidMiner. You'll see this:
RapidMiner has three default perspectives: Design, Results, and Welcome. It starts up in Welcome. Switch to Design by clicking on the icon that looks like a pencil writing in a notebook, by choosing it from the View menu, or by hitting F8. The Design view looks like this:
The blank central area is the canvas where you'll graphically build your RapidMiner process. The left side has a menu of Operators as well as Repositories (where processes are stored). The right side has details about the current operator (just a blank "Process" in this case, because we haven't added anything yet).
To start, let's read in our Excel file of game data. In the list of Operators on the left-side of the RapidMiner window, you'll see a folder labeled "Import". Clicking on that reveals sub-folders labeled "Data," "Models", and so on. Click on the Data folder and you'll see a list of operators. "Read Excel" should be near the top. Click and drag that operator onto the blank area in the middle of the screen and release. You'll see this:
To fix this, look to the right side. You'll see that it is now showing the details for the highlighted "Read Excel" operator. Just below there you'll see a button for an "Import Configuration Wizard" and then some input boxes for the various parameters for this operator, including the "excel file" parameter being complained about. There's also a description/help box for the operator below the parameters section.
Use the "Import Configuration Wizard" to find your Excel file and prepare it to be read in. The wizard does some basic data checking, so you may discover a problem in your file at this point. Here's what the final step of the wizard looks like for my sample data:
There are 8 columns to my data: name, score, TrueSkill mean, and home winning percentage. (The TrueSkill mean is a rating system. You can read more about it here.) These will be the inputs to my prediction model.
To run a process in RapidMiner, you click the right-facing blue triangle button near the top of the window. Right now our process isn't very interesting -- it just reads in the Excel file and sends it to the Results -- but let's run it and see what happens. You may be asked to save your model and whether you want to switch to the Results view. For both questions you can save a default answer, which is handy. When you switch to the Results view you'll see something like this:
The data you read in creates an "Example Set" and this window is showing you the Meta Data View for the data set. In my case, the data set has 3699 examples (games), and for each attribute in the examples, the window shows the Role, Name, Type, Statistics, Range and Missings. There's some interesting stuff here -- for example, home teams scored between 28 and 124 points in this season. A home team scored only 28 points?! That's pretty intriguing.
Let's follow up. Click on the "Data View" checkbutton and then on the Hscore column to look at the actual data sorted by home team's score:
Apparently that 28 point performance was put in by SMU against UAB. That had to be fun to watch! You can do some interesting data analysis with the Plot View and Advanced Chart options here, but let's continue on with building a process.
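If you'd rather see the idea in code, here is a minimal Python sketch of what the Meta Data View and the sorted Data View are doing: finding each column's range, counting missing values, and sorting by the home score. This is not RapidMiner code. The rows are made up for illustration, and apart from Hscore the column names (Hname, Aname, Ascore) are my assumptions.

```python
# Made-up rows for illustration only. Hscore is the column name used in
# the post; Hname, Aname and Ascore are assumed names.
games = [
    {"Hname": "Team A", "Hscore": 71, "Aname": "Team B", "Ascore": 64},
    {"Hname": "Team C", "Hscore": 28, "Aname": "Team D", "Ascore": 41},
    {"Hname": "Team E", "Hscore": None, "Aname": "Team F", "Ascore": 80},
    {"Hname": "Team G", "Hscore": 124, "Aname": "Team H", "Ascore": 99},
]


def summarize(rows, column):
    """Return (minimum, maximum, number of missing values) for a column."""
    present = [r[column] for r in rows if r[column] is not None]
    missing = len(rows) - len(present)
    return min(present), max(present), missing


def sort_by(rows, column):
    """Return the rows sorted ascending by column, missing values last."""
    return sorted(
        rows,
        key=lambda r: (r[column] is None, r[column] if r[column] is not None else 0),
    )


low, high, missing = summarize(games, "Hscore")
print("Hscore range:", low, "to", high, "with", missing, "missing")
for game in sort_by(games, "Hscore"):
    print(game["Hname"], game["Hscore"])
```

Sorting is how an oddity like a very low home score floats to the top where you can see which game it was.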
Switch back to the Design view and let's work on conditioning the data. In many cases, there are problems in the input data -- such as missing values -- that will corrupt your prediction models. RapidMiner provides a number of operators for fixing these sorts of problems. Let's work on fixing missing values. In the Design View on the Operators tab on the left part of the screen you'll see a search box. This is handy for finding operators by name. Type "missing" into the Search box and you should see this:
Click on the "Replace Missing Values" operator, drag it onto the canvas in the middle of the screen and drop it. You'll now have this:
You'll see that RapidMiner is complaining of an error in our process: we don't have an input to the Replace Missing Values operator. We want to connect the output of our Excel file to the input of this operator. To do this, we left-click on the output of the Read Excel operator, drag the resulting orange line to the input of the Replace Missing Values operator, and release. This causes a pop-up box asking if we really want to disconnect the current output connection or not. Allow RapidMiner to disconnect the port and you should have this:
And that's all you need to do: add operators and hook them together into a process. By default, the Replace Missing Values operator replaces all missing values with the average value for that attribute. That's fine for now, so we'll leave it as is.
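To make the default behavior concrete, here is a small Python sketch of replacing each missing value with the average of its attribute. It only illustrates the idea and says nothing about how RapidMiner implements it. The column names (Hskill, Askill) and the numbers are made up.

```python
from statistics import mean


def replace_missing_with_mean(rows, columns):
    """Return a copy of rows with None replaced by the column average."""
    filled = [dict(r) for r in rows]  # copy so the input stays untouched
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        if not present:
            continue  # nothing to average, so leave this column alone
        avg = mean(present)
        for r in filled:
            if r[col] is None:
                r[col] = avg
    return filled


# Made-up ratings; Hskill and Askill are assumed column names.
rows = [
    {"Hskill": 25.0, "Askill": 20.0},
    {"Hskill": None, "Askill": 30.0},
    {"Hskill": 35.0, "Askill": None},
]
filled = replace_missing_with_mean(rows, ["Hskill", "Askill"])
print(filled)
```

The average is computed from the known values only, so one missing rating doesn't drag the replacement value toward zero.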
One very important step we need to take is to create a "label". The label is the attribute that we're trying to predict. In our case, we'll be trying to predict the winner of the game: "Home" or "Away". We don't actually have that in our input data, so we'll need to create a new attribute and set it to be our label.
To do this, find the "Generate Attributes" operator and the "Set Role" operator and modify your process to look like this:
And that's it for creating the new attribute. Now we need to set the Role of this attribute to "label" so that our models will know what we're trying to predict. To do this, click on the Set Role operator and in the right-side pane, select our new attribute from the drop-down box next to Name, and "label" from the drop-down box next to "target role":
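In code terms, creating the label amounts to deriving a new column from the two scores. The Python sketch below shows one plausible way to do it, assuming the score columns are named Hscore and Ascore and that no game ends in a tie. The new column name "Winner" is hypothetical, and this is not the expression used in the RapidMiner process.

```python
def add_winner_label(rows, home_col="Hscore", away_col="Ascore", label_col="Winner"):
    """Return a copy of rows with a "Home"/"Away" label column added.

    Assumes every game has both scores and that there are no ties.
    """
    labeled = []
    for r in rows:
        new_row = dict(r)
        new_row[label_col] = "Home" if r[home_col] > r[away_col] else "Away"
        labeled.append(new_row)
    return labeled


# Made-up scores for illustration.
games = [
    {"Hscore": 71, "Ascore": 64},
    {"Hscore": 28, "Ascore": 41},
]
labeled = add_winner_label(games)
print([g["Winner"] for g in labeled])
```

Setting the role to "label" has no direct equivalent here; in a script you would simply pass this column to the learner as the thing to predict.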
Switch back to the Design View, find the "Apply Model" and "Performance (Classification)" operators, and add them to your process after the k-NN operator like so:
Run this, and you should get a Results View that looks something like this:
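For readers who think better in code, the train, apply, and measure steps can be sketched in plain Python. This is a bare-bones k-nearest-neighbors classifier written for illustration, not RapidMiner's implementation. The ratings, the choice of k=3, and the split into training and test games are all my assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs, like a confusion matrix."""
    return Counter(zip(actual, predicted))


# Made-up (home rating, away rating) pairs. The scores are deliberately
# left out of the inputs because they determine the label directly.
train_X = [(30, 20), (28, 22), (32, 25), (20, 30), (22, 29), (21, 33)]
train_y = ["Home", "Home", "Home", "Away", "Away", "Away"]

test_X = [(29, 21), (21, 31), (31, 24)]
test_y = ["Home", "Away", "Away"]  # the last game is an upset

# "Apply Model": predict a label for each test game.
predicted = [knn_predict(train_X, train_y, x) for x in test_X]

# "Performance": compare the predictions against what really happened.
print("accuracy:", accuracy(test_y, predicted))
print(confusion_counts(test_y, predicted))
```

Measuring on games the model has not seen gives a more honest picture of how it will do on future games than measuring on the games it learned from.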
This illustrates the basics of using RapidMiner for prediction. RapidMiner has a wealth of features and options, and there are many improvements you can make to the simple process flow I've illustrated above. But hopefully this has given you enough guidance to get started, and good luck!