Wednesday, February 3, 2010

Hot Button Issues Part 2: ClimateGate

I have been having a conversation in the comments of another post with Derek, a friend of mine from high school and a guy I have a lot of respect for, and the topic has turned to "ClimateGate". It's interesting, because we have quite different viewpoints (I, for example, am pretty uninformed but will usually take the side of scientists when all else is equal), but the bigger issue is really how we as non-experts in climate science can make sense of so many conflicting stories, where bias and politically charged agendas loom at every turn.

There are a lot of issues at play here (which I find very interesting), and I don't think I can properly address them all today (or maybe ever). This is good fodder for several full posts, though, I think. It also ties in with the post I wrote a while back about scientific controversies and hot-button issues.

On to ClimateGate: Derek points me to Conservapedia as a more reliable source than Wikipedia:
The reason I pointed you to Conservapedia is because they do allow primary sources and original work to be included in their articles.
This is surprising to me, but I'm happy going with it--the more primary sources the better.

The first thing to note is that there is a lot going on here. The first subsection is about Data Manipulation, so it seems reasonable to start reading there. The main issue seems to be a piece of source code shown in the cited article, titled "The Proof Behind the CRU ClimateGate Debacle". After some Googling, the directory that this code was taken from looks to be here:
http://www.di2.nu/foia/osborn-tree6/

Specifically, a comment in the file briffa_sep98_d.pro says:
;
; Apply a VERY ARTIFICAL correction for decline!!
;
This seems to be one of a set of four files, labeled "briffa_sep98_a.pro" through "briffa_sep98_d.pro". The other files (a-c) don't seem to have this comment.

One thing I can say from personal experience is that it's not uncommon to make up data at some point in the research process. It is actually good practice, because it lets you verify that your code is working as expected--for example, if you feed in some crazy or random data and you're still getting good results, you should really start to question your methods, because you're doing something wrong. A good illustration is the study whose authors "discovered" (haha) that a dead salmon can perceive human emotions:
http://www.wired.com/wiredscience/2009/09/fmrisalmon/
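
To make the sanity check above concrete, here's a minimal sketch in Python. Everything in it is made up for illustration (the toy nearest-mean classifier, the `make_data` helper, the sample sizes), and it has nothing to do with the CRU code. The idea is just to run the same pipeline once on data that contains a real signal and once on pure noise, and to get suspicious if the noise run scores well.

```python
import random


def nearest_mean_accuracy(train, test):
    """Label each test point with the class whose training mean is closer."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    correct = sum(
        1 for x, y in test
        if min(means, key=lambda c: abs(x - means[c])) == y
    )
    return correct / len(test)


def make_data(n, rng, informative):
    """Hypothetical data generator: one feature x and a binary label y.

    If informative is True, x is centered at 0 or 2 depending on y.
    If False, x is pure noise and carries no information about y.
    """
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 2.0 * y if informative else 0.0
        data.append((rng.gauss(center, 1.0), y))
    return data


rng = random.Random(0)  # fixed seed so the run is repeatable

# Same pipeline, two kinds of input.
real_acc = nearest_mean_accuracy(
    make_data(2000, rng, True), make_data(2000, rng, True)
)
noise_acc = nearest_mean_accuracy(
    make_data(2000, rng, False), make_data(2000, rng, False)
)

print("accuracy on data with signal:", real_acc)
print("accuracy on pure noise:", noise_acc)
```

If the accuracy on the noise data came out far from chance (0.5 here, since the labels are coin flips), that would point to a bug somewhere in the pipeline rather than to a discovery.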

Anyhow, I'm not sure what the point of applying this artificial adjustment would be in the climate case (I won't pretend to begin to understand how all that code fits together), but then again, if you were trying to hide something devious, it also doesn't seem like a good idea to make it stand out with a big three-line comment with caps to emphasize how artificial it is.

So in and of itself, one comment about artificial changes to an array in a huge directory of code doesn't seem like that big of a deal to me. This is a change in the final plotting of the results (not something buried deep in a model), so the important questions are which graphs this code produced, where they were published, and what claims they were used to support. If this code could be shown to have produced a figure in a published paper or influential presentation (and it was not explained as being artificial), it would be a very big deal in my eyes.

I haven't read the other sections or main issues, so I won't comment on them now.

At a broader level, I absolutely agree with the criticisms of the scientists for failing to release data and code. Especially for these controversial issues, I think it's important to let anybody who wants to run your code and reproduce every figure and table in your results (I try to do this on my blog, but I admit I could do better in my research. It's something I am working on, but it does take work). Not releasing code and data doesn't mean their conclusions are wrong, but I don't think they're upholding the spirit of science.

However, if somebody finds an error in the code (which is 100% plausible) and wants to dispute the results, I do think peer review is the proper venue--not blogs or popular media. You can't expect a scientist to defend himself or herself against every blog post or news article out there. It would be a full-time job and an extremely frustrating battle, which I wrote more about here. Most scientific journals that I know of will publish notes that point out errors in papers that they've published (see e.g., the discussion here). If you find an error, send it to the editorial board; they will verify it and ask the scientist to respond. If you come up with a better way of doing things, write a paper and publish it.

Now, you can further question the foundations of peer review or the bias of a scientific group, but that is a much bigger topic that will have to wait for another day.

Finally, this quote did resonate with me:
Climate researchers know their prescriptions don't carry the certainty laymen assume from that which is labeled "science," yet most shy from a straightforward account of this uncertainty.

"Methods certainly need to be continually refined and improved. I doubt that anyone in the paleoclimate community would disagree with that," says Rob Wilson of the University of St. Andrews's School of Geography and Geosciences. "However, can the nuances of methodological developments be communicated to the laymen—and would they want to know?"
Wilson goes on to say that he doesn't think people would want to know. I disagree, but I also don't know how to communicate the nuances effectively. Much of science takes decades for very smart people to really learn, and the conclusions are often of the form, "we think this, but we're not totally sure". To add to that, scientists are often not the best communicators in the world. It takes a rare and special person to figure out how to distill these complex ideas, nuances, and uncertainties into explanations that people can understand. I think it's absolutely something that scientists should continually be thinking about, and I do think scientists should be open to audit by the public, so long as that doesn't require them to spend all their time responding to unfounded criticisms.

Tuesday, January 19, 2010

Machine Learning in One Sentence

We've all had this experience: you're out at some party or social event, doing your best to blend in and have a good time. Inevitably, though, somebody wants to talk to you. One thing leads to another, and it happens: they ask, "so... what do you do?"

My default answer is
  • "I'm a Ph.D. student in Computer Science." (optional addition: "I do something called 'machine learning' ")
It's not the most exciting answer in the world, but it's surprisingly effective--from here, it's fairly easy to read whether the person opposite me just perked up at the idea of finding a kindred soul or whether their eyes just glazed over as they stumble around looking for the nearest exit. But seriously, it gives somebody the opportunity to press further, and more often it provides a nice opening to change the subject tactfully--it's an improvement over the more extreme tactics that may leave somebody thinking I have some deep dark secret or that I'm a criminal, etc.

This leads me to wonder, though. What, then, is the most exciting answer in the world for a PhD student in machine learning to give? A few rules:
  1. One sentence limit.
  2. It has to be more-or-less true.
  3. (this is the hard one) It needs to be a conversation starter rather than a conversation ender. So while "you don't want to know" satisfies (2), it is disallowed by (3).
I can think of a few not-terribly-creative possibilities:
  1. "I teach computers how to think."
  2. (serious face, matter-of-fact tone) "I develop algorithms for MAP inference in graphical models, typically looking at classes of energy functions where standard techniques like quadratic pseudo boolean optimization or max-product belief propagation are inefficient or don't work well." (Update: this one is meant to be a joke, for the record)
  3. "I pretend my computer is a baby, and I try to teach it about the world."
  4. "I design parts of robot brains."
Help me out here. Do you have better ideas? No need to be a PhD to contribute. As a reward, I promise to take the best suggestion, try it out "in the wild", and report back.

Sunday, January 17, 2010

LaTeX on Blogger

I am playing around with the (very simple) instructions provided here for getting LaTeX ($\LaTeX$) to work on Blogger: http://watchmath.com/vlog/?p=1244

This is a test:
\[ E(\mathbf{x}) = \sum_{i \in \mathcal{V}} \theta_i(x_i) + \sum_{ij \in \mathcal{E}} \theta_{ij}(x_i, x_j) \]

Update: seems to work properly on the post-only page, but not on the blog home page. More to come...

Update 2: I think I was wrong. It looks like there's a bit of randomness to when it works and when it doesn't. I was fooled into thinking it was due to which page I was on. Also, for further reading, Terence Tao has some good discussion--including some of the shortcomings--regarding displaying math on the web.
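
For anyone wondering what the test equation above actually computes, here's a minimal sketch in Python. The function name `energy`, the dictionary layout for the pairwise terms, and the three-variable chain are all my own assumptions for illustration; the equation itself only says that the energy of a labeling is a sum of per-variable terms plus a sum of per-edge terms.

```python
from itertools import product


def energy(x, unary, pairwise, edges):
    """E(x) = sum_i theta_i(x_i) + sum_{ij in edges} theta_ij(x_i, x_j)."""
    unary_term = sum(unary[i][x[i]] for i in range(len(x)))
    pairwise_term = sum(pairwise[(i, j)][x[i]][x[j]] for (i, j) in edges)
    return unary_term + pairwise_term


# Hypothetical toy problem: three binary variables on a chain 0 - 1 - 2.
# unary[i][k] is theta_i(k), the cost of giving variable i the label k.
unary = [[0.0, 2.0], [0.5, 0.0], [1.0, 0.0]]
edges = [(0, 1), (1, 2)]
# theta_ij charges a cost of 1 whenever two neighbors disagree.
disagree_cost = [[0.0, 1.0], [1.0, 0.0]]
pairwise = {edge: disagree_cost for edge in edges}

# Energy of one particular labeling.
example_energy = energy((0, 0, 0), unary, pairwise, edges)

# Brute-force minimization over all 2^3 labelings. This is only feasible
# because the example is tiny.
best = min(
    product([0, 1], repeat=len(unary)),
    key=lambda x: energy(x, unary, pairwise, edges),
)
best_energy = energy(best, unary, pairwise, edges)

print("E(0, 0, 0) =", example_energy)
print("lowest-energy labeling:", best, "with energy", best_energy)
```

Enumerating every labeling stops being practical very quickly as the number of variables grows, so treat this strictly as a way to read the formula, not as a way to minimize it.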

Wednesday, January 13, 2010

Sam

I'm in shock reading the sad and confusing news about former Toronto (and more recently, NYU) professor, Sam Roweis.

The Toronto Machine Learning group has been missing Sam's presence at our meetings and around the lab for a while now; he was on sabbatical at Google and more recently took a professorship at NYU. I have many memories from my first couple years in Toronto, though. You could always count on Sam to spark an interesting conversation after a seminar or tea talk by asking the penetrating question that cut straight to the heart of the issue. I learned a lot about how to critically think about new ideas by watching Sam in action.

As a teacher, Sam was phenomenal. He had a unique ability to present complex material in a well-motivated, clear way. My favorite part of his lectures was his "street fighting tips" where he'd teach us the tips and tricks that it takes to translate the "book form" of an idea into a program or algorithm that actually works. This ability to connect the elegant theoretical understanding to the practical understanding is something I will always admire.

Finally, Sam was just a really good guy. At one of the earlier conferences that I went to, I knew very few people. Sam recognized this and made the effort on several occasions to come talk to me, explaining parts of talks that I didn't understand, giving his thoughts on some poster, and even going so far as to find other professors to introduce me to. There were so many people there more deserving of his attention, but you'd never have known it from my perspective.

The world is unquestionably worse off without such a brilliant researcher, teacher, and all-round great guy. I think I speak for an enormous number of people when I say we'll miss him greatly.

Update: Posts by people who knew him better than I did: Fernando Pereira, John Langford, Maneesh Sahani, Jennifer Linden, and many others.

Tuesday, January 12, 2010

Google's China policy

I know I'm just repeating the news that is now several hours old, but I find this one very interesting: http://googleblog.blogspot.com/2010/01/new-approach-to-china.html

Sunday, December 27, 2009

Scientific Controversies and Hot-button Issues

I haven't done much work over the holiday, but I've had a chance to do quite a bit of reading. The book next to my bed is Schrijver's three-volume series on Combinatorial Optimization: Polyhedra and Efficiency (Algorithms and Combinatorics), which lives up to the hype.

On the (arguably) lighter side, I've discovered the not-so-secret underworld of what I'll call "math 2.0" websites and blogs. Maybe I'm just out of the loop, but from my perspective, it seems that math and theoretical computer science have a more active internet community than the more applied practitioners like us in machine learning, applied statistics, and algorithms (there are of course many notable exceptions).

[Aside] I'd love it if machine learning people started using Math Overflow (MO). It seems like a nice way to get a bit of crossover between the fields, which in many cases aren't as different as they appear on the surface. Here are some example posts that have a fair number of upvotes (i.e., they are good questions according to the MO community) and that I find interesting as a machine learning researcher: There are plenty more examples in topics like combinatorics, statistics, probability, graph theory, and optimization. [/Aside]

Anyhow, what I really wanted to write about comes from reading the blogs. The observation is how difficult it is for computer scientists and mathematicians to get their points across to the general public. Two examples particularly illustrate my point: In both cases, the overarching story is one of a mathematician against the press. As I understand it, in the first case, it's an information theorist arguing against an intelligent design argument based on information; in the second, it's a computational complexity theorist trying to dispel some of the possibly exaggerated claims made about a company's supposed quantum computer. The comment threads under both posts are quite interesting (and sometimes sadly comical) reads.

It's not just the stereotypical story of a mathematician being unable to communicate with normal people. Instead, the common theme in both cases is that the mathematician has spent a lot of time carefully putting down their position in a public and/or peer-reviewed form. The problem is that their opinion--though they claim it is still valid--is several years old. In the interim, their adversaries have come up with more to say and claim to have refuted the criticisms in more recent citations. The mathematicians disagree, saying that their old arguments still hold, citing the original papers, saying they can't be bothered to repeat their arguments fresh every time a new, unrelated argument that doesn't address their original concern gets made.

Now this all may sound reasonably well and good if we were in an academic setting: get a few well-respected members of the field and ask their opinion. In academia, we have two advantages, though:
  • We are used to trusting the opinions of our peers expressed via blind peer review schemes, and
  • It is difficult enough to write a reasonably good conference paper that we aren't completely (some may disagree) overwhelmed with crackpot ideas to evaluate.
Unfortunately, in blogs and the popular press, neither holds. Everybody assumes everybody else has a hidden agenda, and there is absolutely no way to get a qualified person to review every possibly crazy idea out there, much less fight a prolonged battle over it. As we often see in politics, the side with the most money, cleverest spin, and loudest voice tends to be heard clearest.

So then the question becomes where to go from here. The mathematician making the original argument can't be expected to spend time fighting every little battle, but very few people have the time, inclination, and ability to credibly pick up the battle in their stead (in the two example cases, the issue is over different definitions of information, and subtleties about how "quantum" a quantum computer is. See here and comments 23, 24, and 26 here for examples). Even in ridiculous cases where an argument really has no merit, there is enough jargon and a long enough comment thread to make a casual observer think that the issue is "complicated," and therefore either side could be right with equal plausibility.

Yet, for lack of a better idea, we as academics by and large still use the same strategy for making an argument: write the paper, then move on. If we need to make a point, refer the reader to the paper. By putting the ideas in the record and possibly presenting the ideas to our peers, we've done our part.

I don't think this is good enough, but there's not an easy answer. Going back to the politics setting, there are difficult questions about when to put up a fight and when not to even dignify some crazy assertion with a response. Minor wording issues can be blown out of proportion. It is often considered harmful in a political debate to give a long answer. And yet to fight these battles on the public stage, we as academics are woefully inept, which is no surprise, since public relations is a tricky game and most of us have zero training.

So here's one idea: academic conference and journal bodies are quite good at deciding whether an idea deserves their stamp of approval. Peer review isn't perfect, but it's pretty good. Unfortunately, once a paper is accepted or rejected, the responsibility of the reviewing body ends. They will passively make the material accessible (either for free or behind a pay wall) and not give it further thought.

I'd like to see these bodies take a more active position in the public eye. The content published in the conference or journal becomes the agenda of the reviewing body. If somebody puts out an idea in the public eye that contradicts something published in the proceedings, the review body's public relations (PR) wing decides how to address it. In some cases, it may be proper to ignore it. In others, a real, prolonged fight may be needed.

I don't know all the exact details of how it would work, but there are several benefits to having a conference-specific PR body fight the battle:
  • It's harder to undermine the motivations of an entire organization than an individual scientist. It's also harder for one scientist to co-opt the opinion of the full organization.
  • By having the same organization's name come up repeatedly, it will begin to build a reputation for the body in the public eye.
  • Professional public relations people would be in the loop to help scientists make their point effectively.
There are other tangential benefits, such as increasing the exposure of the field to the public and making the practical implications of the research clearer.

The downside is obviously that it would cost money and require additional organization to maintain a permanent PR body for each major conference. I can't help but think that a concerted PR effort would be a good investment for many of these hard-to-understand fields, though.

Edit: I didn't find a place to put it, but this story came up in one of the comment threads.

Edit 2: Another related example. This time an academic against possibly faulty sex ratio statistics (how often parents have boys vs girls) that got picked up by the popular press. The academic was then "refuted" by Wikipedia.

Thursday, November 26, 2009

Data analysis in a presidential campaign

Here's a good video by Dan Siroker about how the Obama campaign used data analysis and web analytics in the 2008 campaign. It might be a stretch to say that data won the campaign, but it sure didn't hurt to be doing lots of interesting stuff with it.
http://websiteoptimizer.blogspot.com/2009/11/new-video-how-we-used-data-to-win.html

He does a good job driving home how valuable it is to always be defining and measuring relevant performance indicators and to do it automatically and quantitatively. It's easy to say you're going to measure and keep on top of the indicators, but it's much harder to pull it off in a clean, focused way, especially as you get more and more data from more and more sources. The campaign did a great job of it, though, and in my (not unique) opinion, this gave them a huge leg up. His points also apply to his startup, CarrotSticks.com (which you should go try!).

My favorite part of the video? 43:30, because I think that almost counts as a "shout out" (I spent a little time working with the analytics team and fall somewhere between "friends from college" and "PhDs in computer science that happened to walk through the door").