Sunday, March 11, 2012

Using RapidMiner to Predict March Madness

Posted by Danny Tarlow
This is a guest post by Dr. Scott Turner, who won the Machine March Madness prediction contest last year, and who was the co-winner of the Sweet 16 contest from two years ago. If you like this post, check out his great blog all about algorithmic prediction of NCAA basketball:

Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the Aerospace Corporation, where he advises the nation's space programs on software and systems engineering issues.

Danny & Lee asked me to contribute a guest post as part of the Machine Madness contest. I started writing a posting about using RapidMiner as part of a prediction workflow, but unfortunately I became overwhelmed with other tasks and wasn't able to finish it.  I had given up on finishing it when I realized that anyone entering the Machine Madness contest at this late date might well appreciate a tool that could make creating the routine parts of building a predictive model very fast.  So I quickly finished it up and hope it will prove helpful to someone.  Readers who are expert data miners won't find much here, but I hope that it might be useful to the interested amateur who knows more about basketball (football, baseball, etc.) than about statistics and data mining and wants to put in a quick entry.

I will assume that you have some program or method for generating the statistics or ratings you want to use to predict games and that you've saved those results as an Excel file.  (These might just be season averages of the statistics Danny & Lee are providing.)  As a tool RapidMiner is not well-suited for this part of the problem; it's strengths are in pulling the predictive value out of those statistics rather than generating them.  (Or perhaps I should say that it's not well-suited as I understand it.  I wouldn't be surprised to learn that it has useful features in this area that I don't know about.)  The Excel file should have one line for each game, with columns for the team names, statistics, ratings, and scores.

The next step is to download and install RapidMiner.  You can do that here.  The "community edition" of RapidMiner is completely free.  (I like free.)  There's a user forum here where questions usually get a fairly quick response.

Once you've installed, start up RapidMiner.  You'll see this: 

RapidMiner has three default perspectives: Design, Results, and Welcome.  It starts up in Welcome.  Switch to Design by clicking on the icon that looks like a pencil writing in a notebook, from the View menu, or by hitting F8.  The Design view looks like this:

The blank central area is the canvas where you'll graphically build your RapidMiner process.  The left-side has a menu of Operators as well as Repositories (where processes are stored).  The right-side has details about the current operator (Just a blank "Process" in this case because we haven't added anything yet.)

To start, let's read in our Excel file of game data.  In the list of Operators on the left-side of the RapidMiner window, you'll see a folder labeled "Import".  Clicking on that reveals sub-folders labeled "Data," "Models", and so on.  Click on the Data folder and you'll see a list of operators.  "Read Excel" should be near the top.  Click and drag that operator onto the blank area in the middle of the screen and release.  You'll see this:

There are a couple of things to note.  First, RapidMiner has automatically drawn a connection from the output of this process (the little semi-circle node on the right of the box) to the right edge of the workspace.  Anything going out to that edge will show up in the Results view when the process is executed.  Second, the message window at the bottom of the workspace shows an error.  It is complaining "The mandatory parameter "excel file" is undefined."

To fix this, look to the right-side.  You'll see that is now showing the details for the highlighted "Read Excel" operator.  Just below there you'll see a button for an "Import Configuration Wizard" and then some input boxes for the various parameters for this operator, including the "excel file" parameter being complained about.  There's also a description/help box for the operator below the parameters section.

Use the "Import Configuration Wizard" to find your Excel file and prepare it to be read in.  The wizard does some basic data checking, so you may discover a problem in your file at this point.  Here's what the final step of the wizard looks like for my sample data:

There are 8 columns to my data:  name, score, TrueSkill mean, and home winning percentage.  (The TrueSkill mean is a rating system.  You can read more about it here.)  These will be the inputs to my prediction model.

To run a process in RapidMiner, you click the right-facing blue triangle button near the top of the window.  Right now our process isn't very interesting -- it just reads in the Excel file and sends it to the Results -- but let's run it and see what happens.  You may be asked to save your model and whether you want to switch to the Results view.  For both questions you can save a default answer, which is handy.  When you switch to the Results view you'll see something like this:

The data you read in creates an "Example Set" and this window is showing you the Meta Data View for the data set.  In my case, the data set has 3699 examples (games), and for each attribute in the examples, the window shows the Role, Name, Type, Statistics, Range and Missings.  There's some interesting stuff here -- for example, home teams scored between 28 and 124 points in this season.  A home team scored only 28 points?!  That's pretty intriguing.

Let's follow up.  Click on the "Data View" checkbutton and then on the Hscore column to look at the actual data sorted by home team's score:

Apparently that 28 point performance was put in by SMU against UAB.  That had to be fun to watch! You can do some interesting data analysis with the Plot View and Advanced Chart options here, but let's continue on with building a process.

Switch back to the Design view  and let's work on conditioning the data.  In many cases, there are problems in the input data -- such as missing values -- that will corrupt your prediction models.  RapidMiner provides a number of operators for fixing these sorts of problems.  Let's work on fixing missing values.  In the Design View on the Operators tab on the right part of the screen you'll see a search box.  This is handy for finding operators by name.  Type "missing" into the Search box and you should see this:

Click on the "Missing Values"operator, drag it onto the canvas in the middle of the screen and drop it.  You'll now have this:

You'll see that RapidMiner is complaining of an error in our process: we don't have an input to the Replace Missing Values operator.  We want to connect the output of our Excel file to the input of this operator.  To do this, we left click on the output of the Read Excel operator, and drag the resulting orange line to the input of the Replace Missing Values operator and release.  This causes a pop-up box asking if we really want to disconnect the current output connection or not.  Allow RapidMiner to disconnect the port and you should have this:

And that's all you need do:  Add operators and hook them together into a process.  By default, the Replace Missing Values operators replaces all missing values with the average value for that attribute.  That's fine for now, so we'll leave it as is.

One very important step we need to take is to create a "label".  The label is the attribute that we're trying to predict.  In our case, we'll be trying to predict the winner of the game: "Home" or "Away".  We don't actually have that in our input data, so we'll need to create a new attribute and set it to be our label.

To do this, find the "Generate Attributes" operator and the "Set Role" operator and modify your process to look like this:

Now click on the "Generate Attributes" operator.  On the right you'll see a button labeled "function descriptions" and "Edit List(0)".  Click on this to bring up a view that will let us define a new attribute in our data set.

This is fairly simple to use.  We type in a name for our new attribute in the left-hand column and then an expression for calculating it in the right hand column.  We can use any existing attribute in our expression, and if you click on the calculator icon, it will bring up a tool to help create expressions.  In our case, we want to create a new attribute called "winner" that has the value "Home" if the home team scored more than the Away team, and "Away" otherwise.  The expression to do this is 'if(Hscore>Ascore,"Home","Away")':

And that's it for creating the new attribute.  Now we need to set the Role of this attribute to "label" so that our models will know what we're trying to predict.  To do this, click on the Set Role operator and in the right-side pane, select our new attribute from the drop-down box next to Name, and "label" from the drop-down box next to "target role":
We're almost ready to start modeling, but let's check to make sure we've added the "winner" attribute correctly.  Hit the run button to run the process and let's look at the output in the Results view:
At the top of the results (colored light yellow because of its role as "label") we see the new attribute "winner".  In this data set, the Home team won almost twice as often as the Away team.  If you click on the Data View button, you can check a few games to make sure the calculation is correct:
Looks good, so let's go back to Design View and train a model.  Switch back to the Design View and find the k-NN model, drag it into the process and connect it up to look like this:
  Along the right-side you can see the parameters for the k-NN operator.  Change "k" to 3.  We're almost ready to create a model, but we need to add one last step.  Right now the input data to our model includes the scores of both teams.  It isn't very hard to predict who will win the game if we know who scored the most points :-) so we'll need to remove that information from our examples.  To do this, we need an operator called "Select Attributes".  Drop this into our process between "Set Role" and "k-NN".
Highlight the new operator, and on the right-side, set the "attribute filter type" to subset and then click on "Select Attributes".  That will bring up this dialog:
Now we simply select attributes we want to include from the left side and use the green arrow to move them to the right side.  We want to leave out the Hscore, Ascore and Date attributes.
Save this and we're now ready to run the process to create a model.   Hit the Run button and you should see results that look like this:
Great, we created a model!  But how good is it?  We don't have any idea.  To figure that out, we need to apply the model and then measure its performance.  Let's do that.

Switch back to the Design View, and find the "Apply Model" and the "Performance (Classification)" and add them to your process after the k-NN operator like so:
Note that the model output of the K-NN operator goes into the model input for the Apply Model operator, and the example set output goes into the unlabeled input.  The labeled output of Apply Model goes into the labeled input of the Performance operator, and the performance output of that operator goes out the right-hand side of our process.

Run this, and you should get a Results View that looks something like this:
Wow, 83% accuracy predicting the winner of the game -- pretty good!  Good enough to win the Machine Madness contest?  Who can say? :-)

This illustrates the basics of using RapidMiner for prediction.  RapidMiner has a wealth of features and options, and there are many improvements you can make to the simple process flow I've illustrated above.  But hopefully this has given you enough guidance to get started, and good luck!


wintron said...

How do I apply my model to the bracket data. I think I must have missed a step

Ami said...

Do you have link to mentioned excel file?

Kerri said...

I cannot thank you enough for this tutorial, I have been pulling my hair all week and this has just sorted my problem!!!

Anonymous said...

You also want to do an X-Validation. Since this model is overfitted. It may not predict well for data it has never seen. X-Validation will keep a portion of the data out of the model and test the model against the held out data and make improvements over and over again.