Sunday, July 4, 2010

Choosing a First Machine Learning Project: Start by Reading or by Doing?

Posted by Danny Tarlow
Sarath writes about doing a project during his final year of university related to machine learning:
I am writing this email to ask for some advice. The thing is, I haven't decided on my project yet, as I decided it would be better if I took some time to just strengthen my fundamentals and maybe work on something small. I came across this great blog called Measuring Measures, where they had put up a reading list for machine learning, and it was, may I say, a bit overwhelming.

My present goal is to do a graduate course at a good university with good machine learning research, and one of the reasons I wanted to do a great project is that I have heard it would be a great way of getting into a good university.

So my question is: should my first priority be getting a really good and deep understanding of the subject, or should I be more concerned with doing a good project with respect to admissions?
There are others who are likely more qualified than I am to answer this one, but here are my two cents:

That post certainly has things that would be nice to learn, but you don't need to know all of it to be a successful researcher. Depending on what area you go into, you might need different subsets of those references, or you might need something different altogether. (For example, a reference I go back to time and time again is Schrijver's Combinatorial Optimization, but it's not on that list.)

I think you should pick a project in an area that you find interesting, then just dive in. At first, I'd be less concerned with doing something new. Instead, focus on understanding a couple of different existing approaches to the specific problem you've chosen, and pick up the necessary background as you go: try to implement the algorithms and replicate published results, follow references when you get confused, look up terms, and so on. Perhaps most importantly, work on your research skills. Important things:
  • Clearly write up exactly what you are doing and why you are doing it. Keep it as short as possible while still having all the important information.
  • Set up a framework so that you stay organized when running experiments.
  • Even if the results are not state of the art or terribly surprising, keep track of all the outputs of all your different executions with different data sets as inputs, different parameter settings, etc.
  • Visualize everything interesting about the data you are using, the execution of your algorithms, and your results. Look for patterns, and try to understand why you are getting the results that you are.
All the while, be on the lookout for specific cases where an algorithm doesn't work very well, assumptions that seem strange, or connections between the approach you're working on and other algorithms or problems that you've run across before. Any of these can be the seed of a good research project.
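As one possible way to stay organized, here is a minimal sketch of an experiment log in Python. Everything in it is an illustrative assumption rather than a prescribed setup: `run_trial` is a hypothetical stand-in for your own algorithm (its "error" is a made-up formula so the sketch runs on its own), and the `learning_rate` parameter and one-JSON-object-per-line log format are just one reasonable choice.

```python
import itertools
import json


def run_trial(dataset, learning_rate):
    """Hypothetical stand-in for one execution of your algorithm.

    Returns a dict of metrics. The "error" here is a placeholder
    formula, not a real result.
    """
    return {"error": round(len(dataset) * learning_rate, 6)}


def run_grid(datasets, learning_rates, log_path):
    """Run every (dataset, parameter) combination and log each one."""
    records = []
    with open(log_path, "a") as log:
        for name, rate in itertools.product(datasets, learning_rates):
            # Store the inputs next to the outputs so every run is traceable.
            record = {"dataset": name, "learning_rate": rate}
            record.update(run_trial(name, rate))
            log.write(json.dumps(record) + "\n")  # one JSON object per line
            records.append(record)
    return records


def load_runs(log_path):
    """Read the log back as a list of dicts for later analysis."""
    with open(log_path) as log:
        return [json.loads(line) for line in log if line.strip()]
```

Because the log is append-only, results from older runs are never overwritten, even the unimpressive ones, and `load_runs` gives you everything back in one list when it is time to compare settings.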

I'd think graduate schools would be more impressed by a relevant, carefully done project, even if it's not terribly novel, than by a statement on your application that you have read a lot of books.

If you're looking for project ideas, check out recent projects that have been done by students of Andrew Ng's machine learning course at Stanford.

Perhaps some readers who have experience on graduate committees can correct or add to anything that I said that was wrong or incomplete.


. said...

Thanks a lot for the advice.

Could you comment on how we can "Visualize everything"? How do you, for example, visualize the execution of your algorithms and your results?

Phoenix said...

It's usually good to learn a little, implement it, and check your implementation by matching it against standard results. It is necessary to understand both the implementation and the theory of why it works; then you can extend this small piece of work along many dimensions and to a grand scale.

Danny Tarlow said...

Regarding visualizations: any algorithm will have some internal state, and often it's interesting to look at how it progresses through time. When you're coding up an algorithm, don't just record the final output; record these intermediate states as well.
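To make that concrete, here is a small sketch in Python. The algorithm is a toy chosen purely for illustration (plain gradient descent on f(x) = (x - 3)^2); the point is only the `history` list, which captures the internal state at every iteration instead of just the final answer.

```python
def gradient_descent(start, step_size, num_steps):
    """Minimize f(x) = (x - 3)^2, recording the state at every step."""
    x = start
    history = []  # one entry per iteration, not just the final output
    for step in range(num_steps):
        loss = (x - 3.0) ** 2
        grad = 2.0 * (x - 3.0)
        # Record the state *before* the update so step 0 is the start point.
        history.append({"step": step, "x": x, "loss": loss, "grad": grad})
        x -= step_size * grad
    return x, history
```

With the trace in hand, you can later plot `loss` against `step`, or look at how `x` moves, without re-running anything.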

Once you have intermediate and final outputs stored in a nice way, write simple scripts to pull out different dimensions of the data, and make lots of plots (e.g., how does validation error change with respect to parameter setting X? How does time to convergence change with respect to the data set?). I particularly like Python and matplotlib for these purposes, but there are plenty of other plotting frameworks that would work equally well.
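A sketch of what such a script might look like, assuming your stored outputs are a list of dicts (one per run). The helper names `mean_by` and `plot_series` are hypothetical, and so are any keys you pass to them; the plotting part uses matplotlib's standard `pyplot` interface and only needs matplotlib installed when you actually call it.

```python
from collections import defaultdict


def mean_by(records, key, value):
    """Average `value` over all records that share the same `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}


def plot_series(series, xlabel, ylabel, out_path):
    """Save a line plot of an {x: y} mapping to a file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    xs, ys = zip(*series.items())
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.savefig(out_path)
    plt.close(fig)
```

For example, if each run record has a `"param_x"` and a `"val_error"` entry, `plot_series(mean_by(runs, "param_x", "val_error"), "X", "validation error", "error_vs_x.png")` would answer the first question above with one line per plot.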

Anthony said...

Another option is to compete in a machine learning competition. This way you get to work on a problem that others are working on as well. You can learn by reading and posting questions in the competition forum. Sometimes, participants even post code in the forum. E.g. there is a quickstart package for this competition

Rydium Gc said...

Thanks for the article, very interesting.

Algorithms are the way to proceed, check and test.

Jaya Kumaran said...

Thanks for the article.

