Thursday, November 1, 2012

Neural network successes

Posted by Danny Tarlow
The learning group at the University of Toronto has had two great recent successes, both powered by neural networks. Many of you have probably seen the ImageNet results by Alex Krizhevsky and collaborators (team SuperVision). More recently, George Dahl, a favored guest poster on this blog, led a team that won the Merck Molecular Activity Challenge over at Kaggle. It's impressive stuff all around. Great job, guys.

Friday, April 6, 2012

Final 2012 Full-Bracket Results

Posted by Lee

Hopefully everyone had a chance to watch the exciting game between Kentucky and Kansas this past Monday. This post only covers the results of the full tournament bracket and not the second chance Sweet Sixteen bracket.

Here are the full standings, including ESPN analysts (E) and my own picks.

Entry                          Points
TheMatrixFactorizer            127
Jay Bilas (E)                  126
Lee's picks                    124
The Pain Machine               122
Baseline                       120
Danny's Dangerous Picks        117
By The Numbers                 104
Dick Vitale (E)                102
Obama                          102
Predict the Madness            99
Ryan Boesch                    98
TheSentinel                    86
AJsMadness                     73
machine_learning_first_try     45

Great contest this year and congratulations to this year's winner, TheMatrixFactorizer! It not only won the full-bracket contest, it also squeezed past ESPN analyst Jay Bilas by a point. Once again, machines triumph over humans in our contest. I, for one, welcome our new March Madness predicting robot overlords.

Wednesday, March 21, 2012

Round 2 Update + Upset Analysis

Posted by Danny Tarlow
Here's another great guest post from Scott Turner, our #1 Machine March Madness guest poster. Great analysis -- thanks Scott! If you want more where this came from, check out his blog.

On my blog here I took a closer look at how the Pain Machine predicts upsets in the tournament and how effective it was this year.  I thought it might be interesting to look at how the top competitors in the Machine Madness contest predicted upsets.  I put together the following table with the competitors across the top and an X in every cell where they predicted an upset.  Boxes are green for correct predictions and red for incorrect predictions.  The final rows in the table show the scores and possible scores for each competitor.

Game Pain Machine Predict the Madness Sentinel Danny's Conservative Picks AJ's Madness Matrix Factorizer
Texas over Cincy X X X X X
Texas over FSU X X
WVU over Gonzaga X X X
Purdue over St. Mary's X X X X X
NC State over SDSU X
South Florida over Temple X X
New Mexico over Louisville X X
Virginia over Florida X
Colorado State over Murray State X
Vandy over Wisconsin X
Wichita State over Indiana X
Murray State over Marquette X X
Upset Prediction Rate 43% 25% 33% 0% 25% 29%
Current Score 42 43 42 41 41 39
Possible Points 166 155 166 161 137 163


(I'm not counting #9 over #8 as an upset. That's why Danny has only 41 points; he predicted a #9 over #8 upset that did not happen.)

So what do you think?

One thing that jumps out immediately is that the competitors predicted many more upsets this year than in past years.  Historically we'd expect around 7-8 upsets in the first two rounds.  Last year the average number of upsets was about 2 (discounting the Pain Machine and LMRC).  The Pain Machine is forced to predict this many, but this year the Matrix Factorizer also predicts 7, and Predict the Madness and AJ's Madness predict 4.  From what I can glean from the model descriptions, none of these models (other than the Pain Machine) force a certain level of upsets. 

Monte's model ("Predict the Madness") seems to use only statistical inputs, and not any strength measures, or strength of competition measures.  This sort of model will value statistics over strength of schedule, and so you might see it making upset picks that would not agree with the team strengths (as proxied by seeds).

The Sentinel uses a Monte Carlo type method to predict games, so rather than always producing the most likely result, it is only most likely to produce the most likely result.  (If that makes sense :-)  The model can be tweaked by choosing how long to run the Monte Carlo simulation.  With a setting of 50 it seems to produce about half the expected number of upsets.

Danny's Dangerous Picks are anything but; it is by far the most conservative of the competitors.  The pick of Murray State over Marquette suggests that Danny's asymmetric loss function component might have led to his model undervaluing strength of schedule.

AJ's Madness model seems to employ a number of hand-tuned weights for different components of the prediction formula.  That may account for its predicted upsets, including the somewhat surprising CSU over Murray State prediction.

The Matrix Factorizer has two features that might lead to a high upset rate.  First, there's an asymmetric reward for getting a correct pick, which might skew towards upsets.  Secondly, Jasper optimized his model parameters based upon the results of previous tournaments, so that presumably built in a bias towards making some upset picks.

What's interesting about the actual upsets?

First, Texas over Cincy and Purdue over St. Mary's were consensus picks (excepting Danny's Conservative Picks).   This suggests that these teams really were mis-seeded.  Purdue vs. St. Mary's is the classic trap seeding problem for humans -- St. Mary's has a much better record, but faced much weaker competition.  Texas came very close to beating Cincinnati -- they shot 16% in the first half and still tied the game up late -- which would have made the predictors 2-0 on consensus picks.

Second, the predictors agreed on few of the other picks.  Three predictors liked WVU over Gonzaga, and the Pain Machine and the Matrix Factorizer agreed on two other games.  Murray State over Marquette is an interesting pick -- another classic trap pick for a predictor that undervalues strength of schedule -- and both Danny's predictor and the Matrix Factorizer "fell" for this pick.

So how did the predictors do?

The Pain Machine was by far the best, getting 43% of its upset predictions correct.  Sentinel was next at 33%.  Perhaps not coincidentally, these two predictors have the most possible points remaining.

In terms of scoring, the Baseline is ahead of all the predictors, so none came out ahead (so far) due to their predictions.  The PM and Sentinel do have a slight edge in possible points remaining over the Baseline.

So who will win?

The contest winner will probably come down to predicting the final game correctly.  There's a more interesting spread of champion predictions than I expected -- particularly given the statistical dominance of Kentucky. 

If Kentucky wins, the likely winner will be the Baseline or Danny.  If Kansas wins, the Pain Machine will likely win unless Wisconsin makes it to the Final Four, in which case AJ should win.  If Michigan State wins, then the Sentinel will likely win.  And finally, if Ohio State wins, then Predict the Madness should win.

Monday, March 19, 2012

Second Chance Competition Announcement

Posted by Danny Tarlow
For all of you who didn't get your algorithms finished in time, and for all of the original competitors who'd like a fresh start, we're pleased to announce this year's "second chance" Sweet 16 contest.

This one will be run a little bit differently. For machines, the rules are all still the same. The difference is that there will now be a pool of human competitors in the mix -- Facebook friends and fans of our sponsor, a knee doctor who likes robots.

The prize pool for the second chance tournament will be $50 and $25 gift certificates for first and second place, respectively, and they will go to the top two entrants, whether they be human or computer.

If you want to participate as a human, you need to add Doctor Tarlow on Facebook and look for his announcement there. For those who wish to enter an algorithm, here are the instructions: That's it! Good luck to all the algorithmic competitors out there. I hope we can pull out a victory over those pesky humans.

"Predict the Madness" by Monte McNair

Posted by Danny Tarlow
This is a guest post by Monte McNair, the man behind team "Predict the Madness," which is the leader of the machine competitors after the second round.

Developing a system to fill out the best NCAA Tournament bracket is composed of two parts: matchup prediction and bracket optimization.

MATCHUP PREDICTION
The first thing to do is come up with a method to predict the likelihood of one team beating another. Since we only care about advancement, I want a system that produces a percentage as opposed to a point spread or something else. Therefore, I use a logistic regression with the outcome of games being the dependent variable. For the variables, I use the location of the game, metrics for the team's offense and defense, and metrics of the team's opponents' averages for both offense and defense. The NCAA Tournament is played entirely at neutral sites, but since I'm training on all games, I want to know how important playing at home is so that I can strip this out for neutral site games. The reason to use components of a team's offense and defense as opposed to simply points is that the different components that contribute to points have varying levels of reliability. As KenPom figured out this year, for example, defensive 3P% is extremely unreliable. My model takes this into account and weights it less than it would be weighted if we used its influence on points against. By breaking it down, we let the model determine which factors are most reliable in predicting future performance.
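As a rough illustration of the idea (not Monte's actual model), here is a minimal logistic regression in plain Python. Everything in it is an assumption made for the sketch: the two features (a location code of +1 for home, -1 for away, 0 for neutral, and a single "strength gap" number standing in for the offense/defense metrics), the synthetic games, and the names `fit_logistic` and `win_prob`. The point is only to show how a fitted location weight can be "stripped out" by setting the location feature to 0 for a neutral-site game.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def win_prob(w, x):
    """Predicted probability that team 1 wins, given feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# Synthetic regular-season games (made up for this sketch).
rng = random.Random(0)
X, y = [], []
for _ in range(1000):
    location = rng.choice([1, -1])        # +1: team 1 at home, -1: away
    strength_gap = rng.uniform(-2, 2)     # hypothetical team-quality difference
    p = sigmoid(0.6 * location + 1.0 * strength_gap)
    X.append([location, strength_gap])
    y.append(1 if rng.random() < p else 0)

w = fit_logistic(X, y)
home = win_prob(w, [1, 0.5])      # same matchup played at team 1's home court
neutral = win_prob(w, [0, 0.5])   # tournament setting: location term removed
```

Because the sketch has no intercept, two evenly matched teams at a neutral site come out at exactly 50%, which is the symmetry you would want from a matchup model.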

The main thing we care about is that the model does a good job of predicting future games. Instead of waiting for future games, however, we can just use out of sample games. I took about 1/3 of our games and made them training games and left the other 2/3 as testing games. One thing I did that may be different from most is that I used all of a team's games for the season except for the game in question to create their profile. For example, say North Carolina played Duke on January 7th in one of my training games. For North Carolina's profile, I used stats from all of their games before AND after January 7th. I'm not sure what other systems do but I think they might use all games (without excluding the game in question) or perhaps just games PRIOR to the game in question. In any case, after training the model, I can test it against the out of sample games I set aside for testing. I divided up all the test games into 100 buckets ordered by their predicted win percentage and compared it to the actual win percentage in those games. As we can see, the buckets are closely aligned, meaning the predictions are fairly accurate.
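The bucket comparison described above can be sketched in a few lines. This is one plausible reading, assuming equal-sized buckets after sorting by predicted win probability; the function name `calibration_buckets` and the toy inputs are made up for illustration.

```python
def calibration_buckets(predicted, outcomes, n_buckets=100):
    """Sort games by predicted win probability, split them into equal-sized
    buckets, and return (mean prediction, actual win rate) for each bucket."""
    pairs = sorted(zip(predicted, outcomes))
    n_buckets = min(n_buckets, len(pairs))
    buckets = []
    for b in range(n_buckets):
        lo = b * len(pairs) // n_buckets
        hi = (b + 1) * len(pairs) // n_buckets
        chunk = pairs[lo:hi]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        win_rate = sum(o for _, o in chunk) / len(chunk)
        buckets.append((mean_pred, win_rate))
    return buckets

# Toy example: four test games, two buckets.
toy = calibration_buckets([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], n_buckets=2)
```

A well-calibrated model has the two numbers in each pair close to each other, which is what "closely aligned" buckets means here.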



BRACKET OPTIMIZATION
The next thing to do is to take our matchup predictions and maximize our expected points based on the scoring system we are presented with. While this is most beneficial when scoring systems provide bonuses for picking upsets or some other unique scoring, it can still be helpful in basic scoring systems and is better than simply advancing winners round by round.

As an example, take Louisville and New Mexico, the 4 and 5 seeds in the West region. My model predicts New Mexico as the favorite in a game against Louisville, projected to win 51.2% of the time. Both are favored in their 1st round matchups as well, so if we were to simply advance them both, we'd then choose New Mexico to advance over Louisville in the 2nd round. However, New Mexico has a tougher 1st round opponent in Long Beach State than Louisville does against Davidson. In the table below, we see that New Mexico wins just 65% against LBSU while Louisville wins 75% of the time against Davidson. This is enough to make it more likely that Louisville advances to the Sweet 16 than New Mexico, despite UNM being the better team.

1st 2nd
New Mexico 64.9% 37.2%
Louisville 75.3% 40.7%

New Mexico over Louisville: 51.2%

In a basic scoring system, this rarely comes into play and when it does, it provides little benefit. But it still is best to be accurate if you can.
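To make the arithmetic concrete, here is a small sketch of how a "reach the Sweet 16" probability combines the first-round and second-round matchups. The 64.9%, 75.3%, and 51.2% figures come from the post; the two remaining head-to-head probabilities (New Mexico over Davidson, Louisville over Long Beach State) are not given, so the values below are assumptions picked only so that the results land near the 37.2% and 40.7% in the table. The function name is hypothetical.

```python
def reach_sweet_16(p_r1, p_opp_r1, p_beats_opp, p_beats_other):
    """Probability that a team wins its first two games.

    p_r1          : P(team wins its 1st-round game)
    p_opp_r1      : P(the expected 2nd-round opponent wins its 1st-round game)
    p_beats_opp   : P(team beats that opponent head to head)
    p_beats_other : P(team beats the opponent's 1st-round foe instead)
    """
    second_round = p_opp_r1 * p_beats_opp + (1 - p_opp_r1) * p_beats_other
    return p_r1 * second_round

# 0.76 and 0.637 are illustrative assumptions, not numbers from the model.
new_mexico = reach_sweet_16(0.649, 0.753, 0.512, 0.76)
louisville = reach_sweet_16(0.753, 0.649, 1 - 0.512, 0.637)
```

Under these assumed inputs Louisville comes out ahead even though New Mexico is favored head to head, which is the effect that round-by-round advancement misses.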

Saturday, March 17, 2012

Machine March Madness: Round 1 Update

Posted by Danny Tarlow
As usual, the first round was full of upsets, with two of the #2 ranked teams falling. None of our competitors predicted either of those upsets, but they are still putting on a respectable performance. Here are details of each competitor's entry, along with the current performance.

The favorites at this point look like "The Matrix Factorizer" and "The Pain Machine". Both did quite well in the first round, and both have 7/8 elite eight teams still surviving, along with all 4/4 final four teams still alive.

The Matrix Factorizer

Jasper

I modified Danny's starter code in two ways: First, I added an asymmetric component to the loss function, so the model is rewarded for getting the prediction correct even if the absolute predicted scores are wrong. Second, I changed the regularization so that latent vectors are penalized for deviating from the global average over latent vectors, rather than being penalized for being far from 0. This can be interpreted as imposing a basic hierarchical prior.

I then ran a search over model parameters (e.g., latent dimension, regularization strength, parameter that trades off the two parts of the loss function) to find the setting that did best on number of correct predictions made in the past 5 years' tournaments.
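The description above does not spell out the exact form of either modification, so the following is only a guess at what they could look like. It assumes a hinge-style penalty that fires only when the predicted winner is wrong, a squared-error term on the scores, and an L2 penalty on each latent vector's distance from the mean latent vector. The names `game_loss`, `mean_regularizer`, and `tradeoff` are hypothetical.

```python
def game_loss(pred_a, pred_b, score_a, score_b, tradeoff=0.5):
    """Squared error on both scores plus a penalty for picking the wrong winner."""
    squared = (pred_a - score_a) ** 2 + (pred_b - score_b) ** 2
    true_sign = 1 if score_a > score_b else -1
    # Zero when the predicted margin has the right sign, linear otherwise.
    wrong_winner = max(0.0, -true_sign * (pred_a - pred_b))
    return (1 - tradeoff) * squared + tradeoff * wrong_winner

def mean_regularizer(latent_vectors, strength=0.1):
    """Penalize each latent vector for deviating from the global average vector."""
    n = len(latent_vectors)
    dim = len(latent_vectors[0])
    mean = [sum(v[k] for v in latent_vectors) / n for k in range(dim)]
    return strength * sum(
        (v[k] - mean[k]) ** 2 for v in latent_vectors for k in range(dim)
    )

right_pick = game_loss(70, 60, 72, 61)   # correct winner, small score error
wrong_pick = game_loss(60, 70, 72, 61)   # wrong winner
```

Shrinking toward the mean rather than toward zero means a team with few games is pulled toward "an average team" instead of toward a team with no ability at all.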

24 of 33 Correct, 25 Pts, 171 Pts Possible

The Pain Machine

Scott Turner

Methodology: Linear regression on a number of statistics, including strength ratings to predict MOV (Margin of Victory). Some modifications for tournament use, particularly to force a likely number of upsets.

23 of 33 Correct, 24 Pts, 170 Pts Possible

TheSentinel

Chuck Dickens

Methodology: Using Ken Pomeroy's Pythag formula to rate teams, then calculated the actual game probabilities with the log5 formula. Used a random number generator to determine outcome of games. This provided some randomness which created a few interesting upsets. Simulate the tournament 50 times and record each team's probability to reach subsequent rounds. Step through each round of the bracket choosing winners based on the team that had a higher probability to win that round.

I found that running the simulation 50 times gave me the most variability in the final four, running the simulation more than 100 times gave me a bracket that had almost no upsets and most all of the higher seeded teams progressed through the tournament.
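Here is a rough sketch of that pipeline, not Chuck's actual code. The log5 formula is the standard one; the Pythagorean exponent of 10.25 is an assumption (the post does not say which exponent was used), the four team ratings are made up, and the function names are hypothetical. The bracket is a list in bracket order whose length is a power of two.

```python
import random

def pythag(points_for, points_against, exponent=10.25):
    """Pythagorean expectation: expected winning percentage from scoring."""
    pf = points_for ** exponent
    return pf / (pf + points_against ** exponent)

def log5(p_a, p_b):
    """Probability that a team rated p_a beats a team rated p_b."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

def simulate_once(ratings, rng):
    """Play one random single-elimination bracket; return rounds won per team."""
    wins = {team: 0 for team in ratings}
    alive = list(ratings)                     # neighbours in the list meet first
    while len(alive) > 1:
        survivors = []
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            winner = a if rng.random() < log5(ratings[a], ratings[b]) else b
            wins[winner] += 1
            survivors.append(winner)
        alive = survivors
    return wins

def advancement_probs(ratings, n_sims=50, seed=0):
    """Estimate, per team, the probability of winning each round."""
    rng = random.Random(seed)
    n_rounds = len(ratings).bit_length() - 1  # assumes a power-of-two field
    counts = {team: [0] * n_rounds for team in ratings}
    for _ in range(n_sims):
        for team, won in simulate_once(ratings, rng).items():
            for r in range(won):
                counts[team][r] += 1
    return {team: [c / n_sims for c in cs] for team, cs in counts.items()}

def pick_bracket(ratings, probs):
    """In each round, advance whichever team was more likely to win that round."""
    alive, picks, rnd = list(ratings), [], 0
    while len(alive) > 1:
        alive = [max(alive[i:i + 2], key=lambda t: probs[t][rnd])
                 for i in range(0, len(alive), 2)]
        picks.append(alive)
        rnd += 1
    return picks

ratings = {"A": 0.95, "B": 0.60, "C": 0.80, "D": 0.75}   # made-up ratings
probs = advancement_probs(ratings, n_sims=50)
picks = pick_bracket(ratings, probs)
```

With only 50 simulations the estimated probabilities are noisy, so close matchups can flip; with many more runs the estimates settle and the picks drift toward the favorites, which matches the behavior described above.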

23 of 33 Correct, 24 Pts, 172 Pts Possible

Baseline

Always pick the higher seed.

23 of 33 Correct, 24 Pts, 168 Pts Possible

Ryan's Picks

Ryan

For each season (e.g. 2006-2007) I have enumerated the teams and compiled the scores of the games into a matrix S. For example, if team 1 beat team 2 with a score of 82-72 then S12=82 and S21=72. Ideally, each team would play every other team at least once, but this is obviously not the case so the matrix S is sparse. Using the method proposed by George Dahl, I define vectors o and d which correspond to each team's offensive and defensive ability. The approximation to the matrix S is then just the outer product od' (for example (od')_12=o1d2=S12est). This is a simple rank one approximation for the matrix. If each team played each other at least once then the matrix S would be dense and the vectors o and d could be found by finding the SVD of S (see http://www.stanford.edu/~boyd/ee263/notes/low_rank_approx.pdf). Because this is not the case, we instead define a matrix P that represents which teams played that season. For example, P12=P21=1 if teams 1 and 2 played a game. Now the problem stated by George can be expressed compactly as, "minimize ||P.*(o*d')-S||_F". Here, '.*' represents the Hadamard product and ||.||_F is the Frobenius norm. In this form, it is easy to see that, for constant vector o and variable vector d, this is a convex problem. Also, for constant vector d and variable vector o this is a convex problem. Therefore, by solving a series of convex problems, alternating the vector variable between o and d, the problem converges rapidly in about 5 to 10 steps (see "Nonnegative Matrix Factorizations" code here http://cvxr.com/cvx/examples/).

See this post for more details.
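For readers who want to see the alternating scheme spelled out, here is a minimal plain-Python sketch. It is not Ryan's CVX code: it assumes the unconstrained least-squares version of the problem, where each sub-problem has a closed-form solution, and the three-team score matrix is made up. The function names are hypothetical.

```python
import math

def masked_error(S, P, o, d):
    """Frobenius norm of P .* (o d' - S), i.e. error on played games only."""
    n = len(S)
    return math.sqrt(sum(P[i][j] * (o[i] * d[j] - S[i][j]) ** 2
                         for i in range(n) for j in range(n)))

def als_rank_one(S, P, n_iters=10):
    """Alternate closed-form least-squares updates for d (o fixed) and o (d fixed)."""
    n = len(S)
    o, d = [1.0] * n, [1.0] * n
    for _ in range(n_iters):
        for j in range(n):                    # update d with o held constant
            num = sum(P[i][j] * o[i] * S[i][j] for i in range(n))
            den = sum(P[i][j] * o[i] ** 2 for i in range(n))
            if den > 0:
                d[j] = num / den
        for i in range(n):                    # update o with d held constant
            num = sum(P[i][j] * d[j] * S[i][j] for j in range(n))
            den = sum(P[i][j] * d[j] ** 2 for j in range(n))
            if den > 0:
                o[i] = num / den
    return o, d

# Made-up season: S[i][j] is the points team i scored against team j.
S = [[0.0, 65.0, 75.0],
     [77.0, 0.0, 82.5],
     [63.0, 58.5, 0.0]]
P = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]                               # every pair played once
start_error = masked_error(S, P, [1.0] * 3, [1.0] * 3)
o, d = als_rank_one(S, P)
final_error = masked_error(S, P, o, d)
```

Each half-step solves its sub-problem exactly, so the masked error never increases from one update to the next. The predicted score for team i against team j is then simply o[i] * d[j].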

23 of 33 Correct, 24 Pts


Danny's Dangerous Picks

I started with the basic matrix factorization approach from my starter code, then I added small neural networks that applied a transformation to the base latent vectors based on whether the team was playing at home, away, or in the tournament. These transformation vectors were learned based on season and tournament performance of teams from other years. I split the data into 5 cross-validation sets, and looked for hyperparameter settings that did best on tournament prediction in past years. Like Jon, I also added an asymmetric component to the loss function.

Interestingly (disappointingly), after finding the setting of parameters that did best on past data, my method made some pretty conservative predictions for this year, predicting only 3 upsets.

22 of 33 Correct, 23 Pts, 165 Pts Possible

Predict the Madness

Monte McNair

Methodology: To determine the probability of any matchup (Team 1 beating Team 2), I use a logistic regression on statistics for the offense and defense of each team and of its opponents, plus location; the dependent variable is the outcome of the game. To select the bracket, I use a program that calculates the best possible bracket by maximizing the number of points based on the scoring system. This correctly accounts for situations where simply advancing favored teams round by round would fail.

22 of 33 Correct, 23 Pts, 157 Pts Possible


AJ's Madness

AJ Diliberto

The methodology is that I selected various stats and gave weight to those that I feel are important, such as points for and against, offensive rebounds, and turnover margin. I also factored in whether they were from one of the big conferences, the level of experience and success the coach has had, and then overlaid the formula with a strength of schedule formula that would reduce certain teams scores based on how good or bad the competition was that they played to get those stats.

22 of 33 Correct, 23 Pts, 139 Pts Possible

Machine Learning First Try

Joe Gilbert

My methodology is as follows:
1. Develop a matrix that contains only 2011 scores (done using your data)
2. Develop a matrix that contains all of your teams and generate columns for averages over all players in 2011: minutes played, FT attempted/made, 3P attempted/made (done), rebounds, turnovers, fouls (again using your data)
3. Use machine learning, specifically a traditional Forest algorithm to predict each team's score for each game based on the 2011 data only
4. Select the winner for each round and repeat step 3 for the next round to determine the next winners
Currently, the algorithm predicted the first round modeling each team's score as an "Away" team since they are all technically on the road. I think I may change it so that the scores are based on a mean value of the model for an Away team and Home team because currently it is predicting LIU Brooklyn over MSU in the 1st round...if it comes true then so be it.

20 of 33 Correct, 21 Pts, 91 Pts Possible

By The Numbers

Tim Jacobs

Methodology:
I took the data so generously provided, trained a couple of neural networks on the past performance, then used average away performance for each team to predict performance in the tourney. The networks are training as I type.

17 of 33 Correct, 18 Pts, 166 Pts Possible

Wednesday, March 14, 2012

Data Usage Clarification

Posted by Lee

I just realized that the data rules and usage discussion happened on the Google Group and not everyone may have read it. The same goes for the clarification on hand-tweaking.

Basically, no human judgment data should enter your model except for your decisions on how to build the model and hyper-parameters for that model. Also, if you do use data that we did not provide, please let us know and please make it available to all the other competitors so that they might have the opportunity to use it as well.

Tuesday, March 13, 2012

Fast Company Article

Posted by Danny Tarlow
David Holmes over at Fast Company wrote a nice article about our Machine March Madness contest:
http://www.fastcompany.com/1824382/march-madness-ncaa-tournament-predictions-algorithms

Thanks David!

To everybody else: I hope you're hard at work on your algorithm.

Prizes and deadline reminder

Posted by Danny Tarlow
Now is the time to make a final push for getting your Machine March Madness algorithms tuned and running smoothly. Remember, submissions are due before tip-off of the first game on Thursday, but you probably want to get them in a little early, just to be safe.

I'm also pleased to announce the prizes: for the main competition, the winning algorithm's owner will get a $50 Amazon or Apple gift certificate, while second place will get a $25 one.

Also, for the "second chance" Sweet 16 contest, we will be hosting a humans versus computers contest, with our field of computers competing against Facebook friends and fans of our sponsor, a knee doctor who is into robotic-assisted surgery. The prize pool for the second chance tournament will also be $50/$25 gift certificates, but the prizes could go either to a human or computer.

If you want to participate as a human, you need to add Doctor Tarlow on Facebook, but if you're reading this blog, hopefully you'll enter an algorithm and participate on our team instead.

The human team has chosen the name, "Dr. T's Robot Powers". We'll need to come up with something better for our computer team. Ideas are welcome in the comments.

Monday, March 12, 2012

How to pick upsets?

Posted by Danny Tarlow
Scott Turner writes...
Doing well in a tournament picking contest probably comes down to picking the right upsets. Anyone can pick the higher seeds to win.

Define an upset as a lower seed beating a higher seed, and ignore upsets where there's only 1 step differential (i.e., a #9 beating a #8). If my math from last year is correct, the upset rate in the tournament is around 22%. Half those upsets happen in the first round, about 7.

Some recent thoughts about upsets:

http://harvardsportsanalysis.wordpress.com/2012/03/12/predicting-ncaa-tournament-upsets-the-importance-of-turnovers-and-rebounding/
http://courtsideanalyst.wordpress.com/2012/03/12/two-potential-ncaa-upset-picks-with-supporting-math/
http://www.teamrankings.com/blog/ncaa-basketball/why-you-should-ignore-the-seeds-when-filling-out-your-2012-ncaa-brackets

I leave it to Danny / Lee to turn this into a blog posting :-)
My response...

From a machine learning perspective, I think Scott raises an interesting issue here. Let me rephrase the problem a little more abstractly, to more clearly get at the crux of the issue. Suppose that some oracle were to come down and tell us that exactly 15 of the games in this year's March Madness tournament will be upsets. How should this affect our prediction strategy?

There are probably two natural answers:
  • Don't change anything. I have my prediction for each game, and I think it's going to lead to the greatest number of correct predictions.
  • Make my base predictions, but go back and find the games that I'm most uncertain about, and flip predictions until I am predicting exactly 15 upsets.
Actually, these both are reasonable strategies, but they say something different about the objective function that we are optimizing with our picks. If the goal is to just get as many games right as possible, and we believe our model captures all of the information we have about the outcome of the games (and we believe the game outcomes are statistically independent), then the first strategy will still maximize the expected number of games that we will get correct. However, by making this choice, assuming our model isn't predicting 15 upsets already, then we've eliminated ourselves from contention for the $5 million prize that Yahoo offers to anybody who picks the perfect bracket.

So if the goal is to win the $5 million prize and you believe the oracle, then the right strategy is to pick the 15 upsets that the model thinks are most likely.
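The second strategy is easy to sketch if we simplify a little. The version below assumes a flat list of games with a fixed favorite, underdog, and model-estimated upset probability for each, and it ignores the fact that in a real bracket later matchups depend on earlier picks. The matchups are taken from elsewhere on this blog, but the probabilities are made up, and `force_upsets` is a hypothetical name.

```python
def force_upsets(games, k):
    """Pick winners so that exactly k games are upsets.

    games: list of (favorite, underdog, p_upset) tuples, where p_upset is the
    model's probability that the underdog wins.
    Returns the predicted winner of each game, in the same order.
    """
    # Rank games from most to least likely upset and flip the top k.
    ranked = sorted(range(len(games)), key=lambda i: games[i][2], reverse=True)
    flipped = set(ranked[:k])
    return [underdog if i in flipped else favorite
            for i, (favorite, underdog, _) in enumerate(games)]

games = [
    ("Temple", "South Florida", 0.45),      # illustrative probabilities only
    ("Wisconsin", "Vanderbilt", 0.40),
    ("Florida", "Virginia", 0.10),
]
no_upsets = force_upsets(games, 0)
one_upset = force_upsets(games, 1)
```

With k set to the oracle's number, this gives the most probable set of picks that contains exactly that many upsets, under the independence assumption stated above.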

However, while both of these strategies make some sense, they both seem too extreme. Perhaps the more natural objective should be to ensure that we win this year's Machine March Madness prediction contest. If that's our goal, what's the best strategy? What if we had the predictions from all of the competitors for past years, and I told you that this year's field was going to be drawn from a similar set of competitors?

See Scott's picks for most likely upsets over at his blog.

Sunday, March 11, 2012

2012 Contest Registration

Posted by Lee

To facilitate the contest, we will be using Yahoo! again for you to enter your bracket entries. Please do the following to register your team and participate in the contest:

  1. Send an e-mail to "leezen+marchmadness" at gmail to provide your: team name, team member names, and a brief description of your methodology.
  2. Enter your picks in the Yahoo! tournament group with the entry name being your team name.
  3. Watch the tournament with your friends and have fun!

Data for 2012

Posted by Lee

Selection Sunday! What a day! First we have a great post by Scott Turner on using RapidMiner. Then, the Selection Committee has set the seeding. Now, it's YOUR turn to predict who will win the 2012 NCAA Tournament.

There are two files you can download:

The data includes everything from the beginning of the 2006 season up to and including the March 11, 2012 games. Please let us know if you find any issues with the data. One known issue is that some scores in the first file do not match the scores if you were to add up all the player scores from the player-level data. This is because the data we crawled is occasionally inconsistent in this regard and might be off by a few points.

The data format is as before for both files, except that the aggregate game data is now tab-separated. Please see aggregate game data schema and player-level data schema for details. Good luck!

Using RapidMiner to Predict March Madness

Posted by Danny Tarlow
This is a guest post by Dr. Scott Turner, who won the Machine March Madness prediction contest last year, and who was the co-winner of the Sweet 16 contest from two years ago. If you like this post, check out his great blog all about algorithmic prediction of NCAA basketball: http://netprophetblog.blogspot.com/.

Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the Aerospace Corporation, where he advises the nation's space programs on software and systems engineering issues.


Danny & Lee asked me to contribute a guest post as part of the Machine Madness contest. I started writing a posting about using RapidMiner as part of a prediction workflow, but unfortunately I became overwhelmed with other tasks and wasn't able to finish it.  I had given up on finishing it when I realized that anyone entering the Machine Madness contest at this late date might well appreciate a tool that could make creating the routine parts of building a predictive model very fast.  So I quickly finished it up and hope it will prove helpful to someone.  Readers who are expert data miners won't find much here, but I hope that it might be useful to the interested amateur who knows more about basketball (football, baseball, etc.) than about statistics and data mining and wants to put in a quick entry.

I will assume that you have some program or method for generating the statistics or ratings you want to use to predict games and that you've saved those results as an Excel file.  (These might just be season averages of the statistics Danny & Lee are providing.)  As a tool RapidMiner is not well-suited for this part of the problem; its strengths are in pulling the predictive value out of those statistics rather than generating them.  (Or perhaps I should say that it's not well-suited as I understand it.  I wouldn't be surprised to learn that it has useful features in this area that I don't know about.)  The Excel file should have one line for each game, with columns for the team names, statistics, ratings, and scores.

The next step is to download and install RapidMiner.  You can do that here.  The "community edition" of RapidMiner is completely free.  (I like free.)  There's a user forum here where questions usually get a fairly quick response.

Once you've installed, start up RapidMiner.  You'll see this: 


RapidMiner has three default perspectives: Design, Results, and Welcome.  It starts up in Welcome.  Switch to Design by clicking on the icon that looks like a pencil writing in a notebook, from the View menu, or by hitting F8.  The Design view looks like this:



The blank central area is the canvas where you'll graphically build your RapidMiner process.  The left-side has a menu of Operators as well as Repositories (where processes are stored).  The right-side has details about the current operator (Just a blank "Process" in this case because we haven't added anything yet.)

To start, let's read in our Excel file of game data.  In the list of Operators on the left-side of the RapidMiner window, you'll see a folder labeled "Import".  Clicking on that reveals sub-folders labeled "Data," "Models", and so on.  Click on the Data folder and you'll see a list of operators.  "Read Excel" should be near the top.  Click and drag that operator onto the blank area in the middle of the screen and release.  You'll see this:

There are a couple of things to note.  First, RapidMiner has automatically drawn a connection from the output of this process (the little semi-circle node on the right of the box) to the right edge of the workspace.  Anything going out to that edge will show up in the Results view when the process is executed.  Second, the message window at the bottom of the workspace shows an error.  It is complaining "The mandatory parameter "excel file" is undefined."

To fix this, look to the right-side.  You'll see that it is now showing the details for the highlighted "Read Excel" operator.  Just below there you'll see a button for an "Import Configuration Wizard" and then some input boxes for the various parameters for this operator, including the "excel file" parameter being complained about.  There's also a description/help box for the operator below the parameters section.

Use the "Import Configuration Wizard" to find your Excel file and prepare it to be read in.  The wizard does some basic data checking, so you may discover a problem in your file at this point.  Here's what the final step of the wizard looks like for my sample data:



There are 8 columns to my data:  name, score, TrueSkill mean, and home winning percentage.  (The TrueSkill mean is a rating system.  You can read more about it here.)  These will be the inputs to my prediction model.

To run a process in RapidMiner, you click the right-facing blue triangle button near the top of the window.  Right now our process isn't very interesting -- it just reads in the Excel file and sends it to the Results -- but let's run it and see what happens.  You may be asked to save your model and whether you want to switch to the Results view.  For both questions you can save a default answer, which is handy.  When you switch to the Results view you'll see something like this:



The data you read in creates an "Example Set" and this window is showing you the Meta Data View for the data set.  In my case, the data set has 3699 examples (games), and for each attribute in the examples, the window shows the Role, Name, Type, Statistics, Range and Missings.  There's some interesting stuff here -- for example, home teams scored between 28 and 124 points in this season.  A home team scored only 28 points?!  That's pretty intriguing.

Let's follow up.  Click on the "Data View" checkbutton and then on the Hscore column to look at the actual data sorted by home team's score:



Apparently that 28 point performance was put in by SMU against UAB.  That had to be fun to watch! You can do some interesting data analysis with the Plot View and Advanced Chart options here, but let's continue on with building a process.

Switch back to the Design View and let's work on conditioning the data.  In many cases, there are problems in the input data -- such as missing values -- that will corrupt your prediction models.  RapidMiner provides a number of operators for fixing these sorts of problems.  Let's work on fixing missing values.  In the Design View on the Operators tab on the right part of the screen you'll see a search box.  This is handy for finding operators by name.  Type "missing" into the search box and you should see this:



Click on the "Replace Missing Values" operator, drag it onto the canvas in the middle of the screen and drop it.  You'll now have this:



You'll see that RapidMiner is complaining of an error in our process: we don't have an input to the Replace Missing Values operator.  We want to connect the output of our Excel file to the input of this operator.  To do this, we left click on the output of the Read Excel operator, and drag the resulting orange line to the input of the Replace Missing Values operator and release.  This causes a pop-up box asking if we really want to disconnect the current output connection or not.  Allow RapidMiner to disconnect the port and you should have this:



And that's all you need to do:  Add operators and hook them together into a process.  By default, the Replace Missing Values operator replaces all missing values with the average value for that attribute.  That's fine for now, so we'll leave it as is.
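If it helps to see that default behavior written out, here is a minimal Python sketch of replacing missing values with the attribute's average.  This is only an illustration of the idea, not RapidMiner's implementation; the function name, the "Hwin_pct" column name and the sample rows are all made up.

```python
from statistics import mean


def replace_missing_with_mean(rows, attribute):
    """Return copies of rows with None in `attribute` replaced by the column mean."""
    present = [row[attribute] for row in rows if row[attribute] is not None]
    if not present:
        # Nothing to average over, so leave the data unchanged.
        return [dict(row) for row in rows]
    fill = mean(present)
    return [
        {**row, attribute: fill if row[attribute] is None else row[attribute]}
        for row in rows
    ]


# Invented example rows; "Hwin_pct" is a hypothetical column name.
games = [
    {"Hscore": 70, "Hwin_pct": 0.60},
    {"Hscore": 65, "Hwin_pct": None},
    {"Hscore": 80, "Hwin_pct": 0.80},
]
filled = replace_missing_with_mean(games, "Hwin_pct")
print(filled[1]["Hwin_pct"])  # the average of the two known values
```

The original rows are left untouched, which makes it easy to compare the data before and after the fix.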

One very important step we need to take is to create a "label".  The label is the attribute that we're trying to predict.  In our case, we'll be trying to predict the winner of the game: "Home" or "Away".  We don't actually have that in our input data, so we'll need to create a new attribute and set it to be our label.

To do this, find the "Generate Attributes" operator and the "Set Role" operator and modify your process to look like this:

Now click on the "Generate Attributes" operator.  On the right you'll see a parameter labeled "function descriptions" with a button labeled "Edit List (0)".  Click on this button to bring up a view that will let us define a new attribute in our data set.

This is fairly simple to use.  We type in a name for our new attribute in the left-hand column and then an expression for calculating it in the right hand column.  We can use any existing attribute in our expression, and if you click on the calculator icon, it will bring up a tool to help create expressions.  In our case, we want to create a new attribute called "winner" that has the value "Home" if the home team scored more than the Away team, and "Away" otherwise.  The expression to do this is 'if(Hscore>Ascore,"Home","Away")':
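For readers who think in code, the expression is roughly equivalent to the Python sketch below.  The Hscore and Ascore names come from the data set; the function name and the sample scores are invented for illustration.

```python
def winner(hscore, ascore):
    """Mirror if(Hscore>Ascore,"Home","Away"): equal scores would fall through to "Away"."""
    return "Home" if hscore > ascore else "Away"


# Invented scores, just to exercise both branches.
games = [
    {"Hscore": 72, "Ascore": 65},
    {"Hscore": 60, "Ascore": 61},
]

# Add the new attribute to every example, as Generate Attributes does.
for game in games:
    game["winner"] = winner(game["Hscore"], game["Ascore"])

print([game["winner"] for game in games])
```

Basketball games don't end in ties, so the "otherwise" branch only ever covers an Away win.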



And that's it for creating the new attribute.  Now we need to set the Role of this attribute to "label" so that our models will know what we're trying to predict.  To do this, click on the Set Role operator and in the right-side pane, select our new attribute from the drop-down box next to Name, and "label" from the drop-down box next to "target role":
We're almost ready to start modeling, but let's check to make sure we've added the "winner" attribute correctly.  Hit the run button to run the process and let's look at the output in the Results view:
At the top of the results (colored light yellow because of its role as "label") we see the new attribute "winner".  In this data set, the Home team won almost twice as often as the Away team.  If you click on the Data View button, you can check a few games to make sure the calculation is correct:
Looks good, so let's train a model.  Switch back to the Design View, find the k-NN operator, drag it into the process and connect it up to look like this:
Along the right side you can see the parameters for the k-NN operator.  Change "k" to 3.  We're almost ready to create a model, but we need to add one last step.  Right now the input data to our model includes the scores of both teams.  It isn't very hard to predict who will win the game if we know who scored the most points :-) so we'll need to remove that information from our examples.  To do this, we need an operator called "Select Attributes".  Drop this into our process between "Set Role" and "k-NN".
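As a rough picture of what a k-NN model with k = 3 does, here is a small Python sketch.  It assumes plain Euclidean distance and an unweighted majority vote, which may not match the operator's actual settings, and the feature values (a home rating and an away rating per game) are invented.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented examples: (home rating, away rating) -> winner.
train_features = [(30, 20), (28, 22), (31, 25), (20, 30), (22, 29), (21, 31)]
train_labels = ["Home", "Home", "Home", "Away", "Away", "Away"]

print(knn_predict(train_features, train_labels, (29, 21)))
print(knn_predict(train_features, train_labels, (21, 30)))
```

An odd k with two classes means the vote can never end in a tie, which is one reason 3 is a convenient starting value.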
Highlight the new operator, and on the right-side, set the "attribute filter type" to subset and then click on "Select Attributes".  That will bring up this dialog:
Now we simply select attributes we want to include from the left side and use the green arrow to move them to the right side.  We want to leave out the Hscore, Ascore and Date attributes.
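In code terms, this step just drops the columns that would give the answer away.  A minimal Python sketch follows; Hscore, Ascore and Date are the attribute names used above, while the function name, the other column names and the sample values are hypothetical.

```python
# Attributes to leave out: the two scores leak the result, and the date is not a useful input here.
EXCLUDED = {"Hscore", "Ascore", "Date"}


def select_attributes(example, excluded=EXCLUDED):
    """Return a copy of the example without the excluded attributes."""
    return {name: value for name, value in example.items() if name not in excluded}


# Invented example; "Hrating" and "Arating" are hypothetical column names.
example = {
    "Date": "2012-01-15",
    "Hscore": 72,
    "Ascore": 65,
    "Hrating": 30.1,
    "Arating": 24.7,
    "winner": "Home",
}
print(select_attributes(example))
```

The label stays in the example so the model still has something to learn from; only the inputs that reveal it are removed.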
Save this and we're now ready to run the process to create a model.   Hit the Run button and you should see results that look like this:
Great, we created a model!  But how good is it?  We don't have any idea.  To figure that out, we need to apply the model and then measure its performance.  Let's do that.

Switch back to the Design View, find the "Apply Model" and "Performance (Classification)" operators, and add them to your process after the k-NN operator like so:
Note that the model output of the k-NN operator goes into the model input for the Apply Model operator, and the example set output goes into the unlabeled input.  The labeled output of Apply Model goes into the labeled input of the Performance operator, and the performance output of that operator goes out the right-hand side of our process.

Run this, and you should get a Results View that looks something like this:
Wow, 83% accuracy predicting the winner of the game -- pretty good!  Good enough to win the Machine Madness contest?  Who can say? :-)
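One thing worth keeping in mind: the process above applies the model to the same examples it was trained on, so this figure is training accuracy.  Below is a rough Python sketch of the accuracy calculation, plus a simple way to hold some games out for testing.  The function names are made up for illustration, the labels are invented, and this is not RapidMiner's implementation.

```python
import random


def accuracy(true_labels, predicted_labels):
    """Fraction of examples where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def holdout_split(examples, n_test, seed=0):
    """Shuffle a copy of the examples and set the last n_test aside for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Invented labels: three of the four predictions are right.
true_labels = ["Home", "Home", "Away", "Home"]
predicted = ["Home", "Away", "Away", "Home"]
print(accuracy(true_labels, predicted))

train, test = holdout_split(range(10), n_test=3)
print(len(train), len(test))
```

Training on one part and measuring accuracy on the held-out part usually gives a more honest estimate of how the model will do on games it has not seen.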

This illustrates the basics of using RapidMiner for prediction.  RapidMiner has a wealth of features and options, and there are many improvements you can make to the simple process flow I've illustrated above.  But hopefully this has given you enough guidance to get started, and good luck!

Tuesday, February 28, 2012

Preliminary Aggregate Data

Posted by Lee

For those of you who want to play with just aggregate game result data, I've posted an updated version that you can play with. The format is the same as described in a previous post: date, home team, away team, home score, away score, and whether or not the home team won.

This data covers the 2006 season through 2/26/2012 and, as with the player-level data, will be updated on Selection Sunday to reflect the most up to date information.
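If you want to load this kind of file in Python, a sketch along these lines may be a useful starting point.  It assumes comma-separated rows with no header line, in the column order described above, and a 1/0 flag for whether the home team won; check the actual file before relying on any of that.  The field names and sample rows are invented.

```python
import csv
import io

# Hypothetical field names, in the column order described in the post.
FIELDS = ["date", "home_team", "away_team", "home_score", "away_score", "home_won"]

# Invented sample rows standing in for the real file.
SAMPLE = """2012-02-25,Team A,Team B,70,65,1
2012-02-26,Team C,Team D,58,61,0
"""


def read_games(handle):
    """Parse game rows into dictionaries with numeric scores and a boolean flag."""
    games = []
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        games.append({
            "date": row["date"],  # kept as text; the real date format is not assumed
            "home_team": row["home_team"],
            "away_team": row["away_team"],
            "home_score": int(row["home_score"]),
            "away_score": int(row["away_score"]),
            "home_won": row["home_won"] == "1",
        })
    return games


games = read_games(io.StringIO(SAMPLE))
print(len(games), games[0]["home_team"], games[0]["home_won"])
```

For a real file, pass an open file handle to read_games in place of the StringIO object.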

Monday, February 27, 2012

Preliminary 2011 Season Data

Posted by Lee

In addition to the data from the 2006-2010 seasons shared publicly via Google Docs, we've published some preliminary data for the 2011 season. This uses the same format as past seasons' data and spans the beginning of the 2011 season through 2/26.

After Selection Sunday (March 11th), we will publish an updated set of data for the 2011 season. Please let us know if you find any problems with the preliminary data.

Machine March Madness 2012: Starter Code

Posted by Danny Tarlow
I've started a GitHub repository for the 2012 March Madness competition, to which I've committed some Python code that I worked on over the weekend:
https://github.com/dtarlow/Machine-March-Madness

Here, you can find code that parses data from previous seasons, constructs the past brackets, and learns a few different models based on past data. More details are in the README.

I will post in more detail about the models once I get them working a bit better, but I encourage you to take a look at the high level structure in learn_synthetic.py and model.py.

I've brainstormed a bunch of TODOs at the bottom of the README, so if you'd like to jump in and work on some of those, please do. Or feel free to go off in your own direction.

For detailed discussions of the code, questions, or bug reports/fixes, head on over to the official Google group.

Saturday, February 25, 2012

Google group for March Madness competition...

Posted by Danny Tarlow
... here.

We'll use the Google group for discussion of issues related to rules, but other posts are fair game: maybe you're looking for somebody to team up with, or maybe you want to brainstorm modeling ideas, etc.

Thursday, February 23, 2012

Machine March Madness 2012

Posted by Danny Tarlow
Every year, the NCAA College Basketball season ends with a tournament of 64 teams. Humans around the US (but also elsewhere in the world) fill in brackets with predictions of the outcome, enter pools, and wait excitedly for the results.

College basketball is a streaky and fairly high variance game, so there are many chances for an underdog to make a run deep into the tournament. We see this often -- for example, last year's tournament featured a final four made up of 3, 4, 8, and 11 seeds -- leading to the colloquial tournament name, "March Madness".

So without further ado, it is my pleasure to announce that this year, this blog, in conjunction with commissioner Lee, will host another "Machine March Madness" contest. The big idea is simple: using data from this season and from past seasons (which we will provide -- e.g., past data here: full and simple), build a computer system that fills out a bracket, then pit yourself against the field of silicon competition. You can see posts from last season's tournament here, and some press coverage here.
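To make "a computer system that fills out a bracket" a little more concrete, here is a minimal Python sketch. It is not the contest's starter code: the function names, team names and ratings are invented, and picking the higher-rated team is just a stand-in for whatever model you build.

```python
def fill_bracket(teams, pick_winner):
    """Fill a single-elimination bracket.

    `teams` is ordered so that adjacent entries meet in the first round, and its
    length is assumed to be a power of two. Returns the winners of each round.
    """
    rounds = []
    current = list(teams)
    while len(current) > 1:
        current = [
            pick_winner(current[i], current[i + 1])
            for i in range(0, len(current), 2)
        ]
        rounds.append(current)
    return rounds


# Invented ratings for a four-team toy bracket.
ratings = {"A": 25.0, "B": 30.0, "C": 28.0, "D": 22.0}


def pick_by_rating(team1, team2):
    """Stand-in model: pick whichever team has the higher rating."""
    return team1 if ratings[team1] >= ratings[team2] else team2


print(fill_bracket(["A", "B", "C", "D"], pick_by_rating))
```

Swapping pick_by_rating for a learned model is where the actual work of the contest lies.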

We'll have more details coming soon, including details about prizes. For now, you can do a few things.
  1. Download the past data (full and simple), and start thinking about how you'd model the tournament. To get some starter ideas, I recommend this timeless post by George Dahl.
  2. Let us know in the comments if there is any other data that you would like to use. The rule we have is that all systems must be built using the same data, but we're open to suggestions about what this data is.
  3. Get started!


Update: Here's a question about additional data to use, posted on Quora.