Posted by Danny Tarlow
I like the term "evidence-based medicine." Hopefully we'll be hearing more about it in the future: http://www.nytimes.com/2008/12/27/business/27record.html
Saturday, December 27, 2008
Friday, December 26, 2008
Posted by Danny Tarlow
I've been playing around with sympy, a symbolic math library for Python: http://docs.sympy.org/index.html A lot of the work I do involves writing down hairy likelihood functions, taking the partial derivatives, and solving for some update that increases the likelihood. Up until now, I've always done the calculus and algebra by hand. I'm wondering now why I put myself through the tedium. I was never one to use a TI-89 back in high school calculus, but at least then I could justify it to myself by saying I was learning more. But now, sympy seems like it has all the answers. It even has a print feature that will export directly to LaTeX. Is there any downside?
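To make that workflow concrete, here is a minimal sketch of the kind of calculation described above: write down a log-likelihood, differentiate it, solve for the stationary point, and export the result to LaTeX. The model (three i.i.d. Gaussian observations with unknown mean) and all symbol names are assumptions chosen for illustration, not something taken from my actual work.

```python
import sympy as sp

# Hypothetical model: three i.i.d. observations x1, x2, x3 from N(mu, sigma^2).
mu = sp.Symbol("mu", real=True)
sigma = sp.Symbol("sigma", positive=True)
xs = sp.symbols("x1:4", real=True)  # the tuple (x1, x2, x3)

# Gaussian log-likelihood, dropping the additive constant.
log_lik = sum(-(x - mu) ** 2 / (2 * sigma ** 2) - sp.log(sigma) for x in xs)

# Partial derivative with respect to mu, then solve d/dmu = 0.
d_mu = sp.diff(log_lik, mu)
mu_hat = sp.solve(sp.Eq(d_mu, 0), mu)[0]

print(sp.simplify(mu_hat))  # the sample mean of x1, x2, x3
print(sp.latex(mu_hat))     # the same expression as a LaTeX string
```

For a problem this small the algebra is easy by hand; the payoff should be larger when the likelihood has many terms and parameters.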
Thursday, December 18, 2008
Posted by Danny Tarlow
Fun little piece on two schools of interpreting probabilities: http://www.johndcook.com/blog/2008/02/26/what-a-probability-means/
Posted by Danny Tarlow
From David Paterson via CNN:
For example, a study by Harvard researchers found that each additional 12-ounce soft drink consumed per day increases the risk of a child becoming obese by 60 percent. For adults, the association is similar.
I think this is really cool -- I'm all for using taxes to combat externalities, and obesity seems like a major one. However, my first thought (without knowing anything about the actual study) is that I'm not convinced soft drinks play a causal role in obesity. It seems to me that there are an awful lot of unhealthy lifestyles that include drinking more soft drinks. An unhealthy lifestyle could be a common cause of both higher rates of obesity and more consumption of soft drinks. I haven't been able to find a full list of the new proposed taxes. Is the soft drink tax just a representative example of a larger obesity-fighting program?
Monday, December 15, 2008
Posted by Danny Tarlow
First, let's assume that success in some endeavor is distributed as the product of talent in several relevant characteristics (e.g., musical performance success can be modeled as the product of talent in musical thinking, physical coordination, and dedication to practice). Then, I want to ask how we can explain different tails of distributions that could correspond to success in high-powered jobs. We know that the top-level success variable will not be normally distributed, even if the underlying traits are. How would Larry Summers's arguments change by assuming a multiplicative model instead of the simple Gaussian model that his calculations are based on?
Friday, December 12, 2008
Wednesday, December 10, 2008
Saturday, December 6, 2008
Posted by Danny Tarlow
I've been working a lot with Python and MySQL lately, and I've been very happy with them as the basis for doing research and other data analysis. In particular, MySQLdb, matplotlib, networkx, and CVXOPT have all helped to make the experience very pleasant. So I'm more or less ready to abandon Matlab as my primary research work environment. One of the things I'll miss about Matlab is the save command, which can store the entire environment to a file. It probably encourages some bad habits, but it's also quite convenient and has served me well in the past. To do things right, though, I want to store everything in my database, including data sets, algorithm parameter settings, internal behavior of the algorithm, and results. To do this, I need a schema. I might change my mind on this later, but I think I want to do something general rather than come up with a bunch of project-dependent schemas. It will likely be a bit more work, but hopefully it will also lead to me writing more general code on top of it, which may make future projects go quicker. Here's an early draft of a MySQL script to create the tables I'm thinking of. I'll edit this and add more as I make progress.
CREATE TABLE project (
  project_id INT(11) NOT NULL AUTO_INCREMENT,
  project_title VARCHAR(128),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (project_id)
);

CREATE TABLE data_set (
  data_set_id INT(11) NOT NULL AUTO_INCREMENT,
  project_id INT(11),
  data_source VARCHAR(64),  # Could be "synthetic", "netflix", etc.
  time_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (data_set_id)
);

CREATE TABLE data_instance (
  data_instance_id INT(11) NOT NULL AUTO_INCREMENT,  # was variable_instance_id, which did not match the primary key
  data_set_id INT(11),
  variable_name VARCHAR(32),
  variable_real_value DOUBLE,  # I'm not sure how to deal with different variable types
  PRIMARY KEY (data_instance_id)
);

CREATE TABLE execution (
  execution_id INT(11) NOT NULL AUTO_INCREMENT,
  data_set_id INT(11),
  algorithm_id INT(11),
  code_version INT(11),
  PRIMARY KEY (execution_id)
);

CREATE TABLE iteration (
  iteration_id INT(11) NOT NULL AUTO_INCREMENT,
  execution_id INT(11),
  step_number INT(11),
  is_final_iteration INT(1),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  time_finished TIMESTAMP NULL DEFAULT NULL,  # set when the iteration ends; some MySQL versions also reject two CURRENT_TIMESTAMP defaults in one table
  PRIMARY KEY (iteration_id)
);

CREATE TABLE parameter_value (
  parameter_value_id INT(11) NOT NULL AUTO_INCREMENT,
  iteration_id INT(11),
  parameter_name VARCHAR(32),
  parameter_value DOUBLE,
  PRIMARY KEY (parameter_value_id)
);
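As a rough sketch of how code might sit on top of this schema, the example below logs each iteration of a toy algorithm and then reads the parameter trajectory back. Several things here are assumptions made so the sketch runs anywhere: an in-memory SQLite database stands in for MySQL, only three of the tables are created (with simplified column types), and `log_iteration` is a hypothetical helper, not part of any library.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; column types are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE execution (
    execution_id INTEGER PRIMARY KEY AUTOINCREMENT,
    data_set_id INTEGER, algorithm_id INTEGER, code_version INTEGER
);
CREATE TABLE iteration (
    iteration_id INTEGER PRIMARY KEY AUTOINCREMENT,
    execution_id INTEGER, step_number INTEGER, is_final_iteration INTEGER
);
CREATE TABLE parameter_value (
    parameter_value_id INTEGER PRIMARY KEY AUTOINCREMENT,
    iteration_id INTEGER, parameter_name TEXT, parameter_value REAL
);
""")


def log_iteration(conn, execution_id, step, params, is_final=False):
    """Hypothetical helper: store one iteration and its parameter values."""
    cur = conn.execute(
        "INSERT INTO iteration (execution_id, step_number, is_final_iteration) "
        "VALUES (?, ?, ?)",
        (execution_id, step, int(is_final)),
    )
    iteration_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO parameter_value (iteration_id, parameter_name, parameter_value) "
        "VALUES (?, ?, ?)",
        [(iteration_id, name, float(value)) for name, value in params.items()],
    )
    return iteration_id


# Register one run (the ids are placeholders).
cur = conn.execute(
    "INSERT INTO execution (data_set_id, algorithm_id, code_version) VALUES (1, 1, 1)"
)
execution_id = cur.lastrowid

# Toy algorithm: gradient descent on (w - 3)^2 with step size 0.25.
w = 0.0
n_steps = 5
for step in range(n_steps):
    w -= 0.25 * 2.0 * (w - 3.0)
    log_iteration(conn, execution_id, step, {"w": w}, is_final=(step == n_steps - 1))
conn.commit()

# Recover the trajectory of w for this execution.
rows = conn.execute(
    "SELECT i.step_number, p.parameter_value "
    "FROM iteration i JOIN parameter_value p ON p.iteration_id = i.iteration_id "
    "WHERE i.execution_id = ? AND p.parameter_name = 'w' "
    "ORDER BY i.step_number",
    (execution_id,),
).fetchall()
print(rows)
```

Storing one row per parameter per iteration is what keeps the schema project-independent, at the cost of a join whenever a trajectory is read back.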
Friday, December 5, 2008
Posted by Danny Tarlow
I like Gelman's philosophy better: http://www.stat.columbia.edu/~cook/movabletype/archives/2008/12/greg-mankiws-wo.html
Thursday, December 4, 2008
Posted by Danny Tarlow
It seems like the low-hanging fruit with visualizing data is to have built-in ways of viewing things in time and space. If you know something has a location and a time associated with it, you can make some really cool visualizations. I like this one, for example: http://www.spatialkey.com/
Tuesday, December 2, 2008
Posted by Danny Tarlow
It's always fun to pull together several different posts. Here are the ones I want to tie together. Can you figure out how?
http://randomcrunching.blogspot.com/2008/11/cocktail-party-conversation-starter.html
http://randomcrunching.blogspot.com/2008/11/multiplying-for-success.html
http://randomcrunching.blogspot.com/2008/11/summers-time.html
Here's a start:
[Figure: histograms of the additive model and the multiplicative model]
And here is the simple Python script that generated the figure, so you can play along at home (you need numpy and pylab installed):
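import numpy as np
import matplotlib.pyplot as plt

# Three independent Gaussian "traits" with the same mean and increasing spread.
a = 50 + 10 * np.random.randn(10000)
b = 50 + 20 * np.random.randn(10000)
c = 50 + 30 * np.random.randn(10000)

# Additive model: the sum of independent Gaussians is still Gaussian.
e = a + b + c
plt.subplot(311)
plt.hist(e, 500)
plt.xlim(e.mean() - 5 * e.std(), e.mean() + 5 * e.std())
plt.axis('off')
plt.title('Additive Model')

# Multiplicative model: the product of the same draws is no longer Gaussian.
d = a * b * c
plt.subplot(313)
plt.hist(d, 500)
plt.xlim(d.mean() - 5 * d.std(), d.mean() + 5 * d.std())
plt.title('Multiplicative Model')
plt.axis('off')

plt.show()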
Sunday, November 30, 2008
Posted by Danny Tarlow
I don't want to sound too pedantic here, so I came up with a witty title. Please don't randomly bring this up in any sort of normal conversation (ahah, get it?). The (normalized) product of two Gaussian distributions is itself a Gaussian distribution. If you're not afraid of a little algebra, you can prove it yourself by writing down the expression for the probability density function of a Gaussian random variable twice, doing the multiplication, combining the exponent terms, rearranging terms, and then completing the square (yes, I know that's too fast if you don't know what I'm talking about). If you take the (normalized) sum of two Gaussian distributions, you get a mixture of distributions that can have two modes, so in general it's not Gaussian. Now here's the tricky part. If you have two independent Gaussian random variables
A ~ N(mu_A, sigma_A^2)
B ~ N(mu_B, sigma_B^2)
and you define random variable C to take on the value of the sum A + B, then C will be distributed according to a Gaussian distribution:
C ~ N(mu_A + mu_B, sigma_A^2 + sigma_B^2)
If instead you define random variable D to take on the value of the product A * B, then D will not be distributed normally. As an example, if A = B (the same draw used twice, so no longer independent) and mu_A = mu_B = 0 and sigma_A = sigma_B = 1, then D = A^2 is distributed according to a chi-square distribution with 1 degree of freedom. The "trick" (if you want to call it that) comes from the loose wording people use when they say things like "the product of two Gaussians." In the first case, you are actually multiplying probability distributions. In the second case, you are multiplying the values of draws from probability distributions -- it's kind of subtle. Unfortunately, both interpretations are reasonable and used in practice. The first one comes up most for me, because if you have two independent beliefs about the value of a variable, then the right thing to do to combine the evidence is to multiply the distributions. The second comes up in places like multiplicative models.
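The two readings of "product of two Gaussians" can be checked numerically. The sketch below first multiplies two Gaussian densities on a grid and compares the renormalized result with the closed form you get from completing the square (precisions add, and the mean is the precision-weighted average). It then multiplies draws for the A = B standard normal case and looks at the sample moments. The particular means, standard deviations, grid, and seed are arbitrary choices for illustration.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Arbitrary example parameters.
mu_a, sigma_a = 1.0, 2.0
mu_b, sigma_b = 4.0, 1.0

# Interpretation 1: multiply the densities pointwise, then renormalize.
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
prod = gaussian_pdf(x, mu_a, sigma_a) * gaussian_pdf(x, mu_b, sigma_b)
prod /= prod.sum() * dx

# Closed form from completing the square: precisions (1 / variance) add.
precision = 1 / sigma_a ** 2 + 1 / sigma_b ** 2
sigma_p = np.sqrt(1 / precision)
mu_p = (mu_a / sigma_a ** 2 + mu_b / sigma_b ** 2) / precision
max_err = np.max(np.abs(prod - gaussian_pdf(x, mu_p, sigma_p)))
print("largest gap between grid product and closed form:", max_err)

# Interpretation 2: multiply draws. With A = B ~ N(0, 1), D = A * A is
# chi-square with 1 degree of freedom (mean 1, variance 2, never negative).
rng = np.random.default_rng(0)
a = rng.standard_normal(200000)
d = a * a
print("sample mean and variance of D:", d.mean(), d.var())
```

Note how the combined density ends up narrower than either input: multiplying two independent beliefs sharpens the estimate, which is why the first interpretation is the one used for combining evidence.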
Friday, November 28, 2008
Posted by Danny Tarlow
This sounds like a really interesting data set. It shows how the social (Facebook) connections between a class of students at Harvard evolved over a four-year period. http://cyber.law.harvard.edu/node/4682 I'll add checking it out to the queue with the 17 other mini projects on my todo list.
Wednesday, November 26, 2008
Posted by Danny Tarlow
I'll add this to my to-read list, but some subtle aspects of the abstract wording bother me. For example, play-calling is often done by the head coach or offensive coordinator. The quarterback usually only has the option to make small changes to a given play (e.g. choose whether to run it right or left), or the ability to call an audible. I guess I need to read it to fully understand the play-calling scenario they're addressing.
S. D. Patek and D. P. Bertsekas, "Play Selection in American Football: a Case Study in Neuro-Dynamic Programming", Chapter 7 in Advances in Computational and Stochastic Optimization, Logic Programming, and Heuristic Search: Interfaces in Computer Science and Operations Research, David L. Woodruff, editor. Kluwer Academic Publishers, Boston, 1997.
Abstract: We present a computational case study of neuro-dynamic programming, a recent class of reinforcement learning methods. We cast the problem of play selection in American football as a stochastic shortest path Markov Decision Problem (MDP). In particular, we consider the problem faced by a quarterback in attempting to maximize the net score of an offensive drive. The resulting optimization problem serves as a medium-scale testbed for numerical algorithms based on policy iteration. The algorithms we consider evolve as a sequence of approximate policy evaluations and policy updates. An (exact) evaluation amounts to the computation of the reward-to-go function associated with the policy in question. Approximations of reward-to-go are obtained either as the solution or as a step toward the solution of a training problem involving simulated state/reward data pairs. Within this methodological framework there is a great deal of flexibility. In specifying a particular algorithm, one must select a parametric form for estimating the reward-to-go function as well as a training algorithm for tuning the approximation. One example we consider, among many others, is the use of a multilayer perceptron (i.e. neural network) which is trained by backpropagation. The objective of this paper is to illustrate the application of neuro-dynamic programming methods in solving a well-defined optimization problem. We will contrast and compare various algorithms mainly in terms of performance, although we will also consider complexity of implementation. Because our version of football leads to a medium-scale Markov decision problem, it is possible to compute the optimal solution numerically, providing a yardstick for meaningful comparison of the approximate methods.
Wednesday, November 19, 2008
Posted by Danny Tarlow
Yikes! http://smart-machines.blogspot.com/2008/11/jaguar-supercomputer-for-scientific.html And for those of us who dream of having supercomputers in our home office: http://www.nvidia.com/object/io_1227008280995.html
Posted by Danny Tarlow
From the article:
Now how does this elucidate the elusive X-Factor? My esteemed colleague Dean Keith Simonton offers a nuanced genetic model of talent that I think is relevant. Simonton has argued that additive models of talent are too simplistic (see last post for an additive model of music talent). It's too simple to say that practice + music ability + high IQ equals musical ability. No, Simonton says that talent, especially in complex domains, is better represented by a multidimensional and multiplicative model.
http://blogs.psychologytoday.com/blog/beautiful-minds/200806/the-nature-genius-i-the-genetics-the-x-factor
When you hear the term multiplicative model, you should think of AND, and when you hear the term additive model, you should think OR. Essentially, ability in any given task is better modeled by saying that you have to be strong in every relevant trait than by saying that you can make up for a lack of strength in one trait with more strength in another. It is very hard to make up for a lack of musical ability with a high IQ and lots of practice (if your goal is overall musical achievement), for example.
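A tiny numerical sketch of the AND/OR intuition above: score two hypothetical people on three traits in [0, 1] and compare the additive and multiplicative models. The trait names follow the quoted example; the numbers are made up purely for illustration.

```python
import math

# Hypothetical trait scores in [0, 1]; all numbers are invented.
lopsided = {"music_ability": 0.1, "iq": 0.9, "practice": 0.9}
balanced = {"music_ability": 0.6, "iq": 0.6, "practice": 0.6}


def additive(traits):
    """OR-like: strength in one trait can stand in for weakness in another."""
    return sum(traits.values())


def multiplicative(traits):
    """AND-like: one weak trait drags the whole score down."""
    return math.prod(traits.values())


for name, person in (("lopsided", lopsided), ("balanced", balanced)):
    print(name, additive(person), multiplicative(person))
```

Under the additive model the lopsided profile edges out the balanced one, while under the multiplicative model the single weak trait leaves it far behind.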
Tuesday, November 18, 2008
Posted by Danny Tarlow
I was looking at Intrade's market on potential Secretary of State nominees this morning, and I felt that Hillary's odds were somewhat overstated at $84 for a $100 payoff if she wins (see here and here). Across the board, I felt that was contributing to a low estimate for Bill Richardson ($9 for a $100 contract), so I was thinking about spending $9 to put my money where my mouth is (if Richardson were chosen, I would get a payoff of $100 for that bet). EDIT: I'm glad I didn't make that bet. My credit card company doesn't let me make payments to Intrade, so I gave up shortly after, but I did notice that the spread between Bid and Ask prices was quite large in some of these low-volume markets. I did a bit of Googling, and it led me to some tangentially related, interesting articles about Intrade: http://www.overcomingbias.com/2008/07/intrades-condit.html http://www.bayesianinvestor.com/amm/
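The arithmetic behind a bet like this is short enough to sketch. A contract priced at $9 that pays $100 implies roughly a 9% chance, ignoring fees, so buying only makes sense if you believe the true chance is higher than the price you actually pay. The 15% belief and the bid/ask quote below are hypothetical numbers, not real Intrade data.

```python
PAYOFF = 100.0  # each contract pays $100 if the event happens


def implied_probability(price, payoff=PAYOFF):
    """Market-implied probability of the event, ignoring fees."""
    return price / payoff


def expected_profit(price, believed_prob, payoff=PAYOFF):
    """Expected profit per contract for a buyer who believes believed_prob."""
    return believed_prob * payoff - price


clinton_implied = implied_probability(84.0)
richardson_implied = implied_probability(9.0)

# Hypothetical belief: Richardson's true chance is 15%.
believed = 0.15
edge_at_last_price = expected_profit(9.0, believed)

# Hypothetical wide quote in a low-volume market: a buyer pays the ask.
bid, ask = 6.0, 12.0
edge_at_ask = expected_profit(ask, believed)
break_even_at_ask = implied_probability(ask)

print(clinton_implied, richardson_implied)
print(edge_at_last_price, edge_at_ask, break_even_at_ask)
```

With a wide spread the last traded price overstates the edge, because the price you can actually buy at is the ask.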
Posted by Danny Tarlow
The excerpt from Super Crunchers in this article is an interesting look into the deeper details of the comments Larry Summers made about the differences between men and women in science and mathematics: http://freakonomics.blogs.nytimes.com/2008/11/18/larry-summers-for-treasury-secretary/ I think it's relevant to keep in mind that Summers made this claim at a conference on "Diversifying the Science & Engineering Workforce." You can also see the caveats he lays out by looking at the full text of his speech:
I asked Richard, when he invited me to come here and speak, whether he wanted an institutional talk about Harvard's policies toward diversity or whether he wanted some questions asked and some attempts at provocation, because I was willing to do the second and didn't feel like doing the first. And so we have agreed that I am speaking unofficially and not using this as an occasion to lay out the many things we're doing at Harvard to promote the crucial objective of diversity. There are many aspects of the problems you're discussing and it seems to me they're all very important from a national point of view. I'm going to confine myself to addressing one portion of the problem, or of the challenge we're discussing, which is the issue of women's representation in tenured positions in science and engineering at top universities and research institutions, not because that's necessarily the most important problem or the most interesting problem, but because it's the only one of these problems that I've made an effort to think in a very serious way about. The other prefatory comment that I would make is that I am going to, until most of the way through, attempt to adopt an entirely positive, rather than normative approach, and just try to think about and offer some hypotheses as to why we observe what we observe without seeing this through the kind of judgmental tendency that inevitably is connected with all our common goals of equality. It is after all not the case that the role of women in science is the only example of a group that is significantly underrepresented in an important activity and whose underrepresentation contributes to a shortage of role models for others who are considering being in that group. 
To take a set of diverse examples, the data will, I am confident, reveal that Catholics are substantially underrepresented in investment banking, which is an enormously high-paying profession in our society; that white men are very substantially underrepresented in the National Basketball Association; and that Jews are very substantially underrepresented in farming and in agriculture. These are all phenomena in which one observes underrepresentation, and I think it's important to try to think systematically and clinically about the reasons for underrepresentation.
There is a pretty overwhelming consensus that the arguments he goes on to make are flawed in "twenty different ways," but I think the rough idea of using an order-statistics-type approach is interesting -- rather than explaining differences we see in ultra-competitive positions as evidence of different means, we can equally explain them as evidence of different standard deviations. Now, we can debate whether nurture or nature can better explain different standard deviations in characteristics that lead one to high-powered science and engineering jobs, and Summers goes on to present some arguments that it may not be all nature, which is probably the source of most of his troubles with the media. Regardless, Summers's speech is interesting and intellectually provocative. One of the things that I admire about Barack Obama is that he generally speaks in an intelligent, more nuanced manner than most politicians I've seen. I felt that same sort of appreciation reading Summers's speech. Now, I won't go so far as to argue whether this speech should have gotten Summers dismissed as Harvard's president, but I agree that the reason his dismissal might be justified is the drastic oversimplification of his arguments as presented to the broader public. At least in my reading, I saw a person looking at a complex piece of data and trying to come up with some plausible hypotheses that explain it.
Nowhere did I see any evidence of latent sexist beliefs held by Summers. I think Summers is an excellent fit for an Obama presidency that is serious about taking an open-minded, pragmatic approach to tackling the problems our country is facing without worrying about the media response and political implication of every decision.
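The purely statistical part of the order-statistics idea above, that a difference in standard deviations alone changes representation far out in the tail, can be illustrated with two abstract Gaussian populations that share a mean. This is only a sketch of the arithmetic: the standard deviations 1.0 and 1.2 are made-up numbers, and nothing here models any real group or says anything about causes.

```python
import math


def upper_tail(threshold, mean, sd):
    """P(X > threshold) for X ~ N(mean, sd^2)."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical populations: identical means, slightly different spreads.
mean = 0.0
sd_narrow, sd_wide = 1.0, 1.2

ratios = []
for threshold in (1.0, 2.0, 3.0, 4.0):
    ratio = upper_tail(threshold, mean, sd_wide) / upper_tail(threshold, mean, sd_narrow)
    ratios.append(ratio)
    print(f"threshold {threshold}: wide/narrow tail ratio = {ratio:.2f}")
```

The ratio starts near 1 for modest thresholds and grows quickly as the threshold moves out, which is why tail comparisons are so sensitive to the distributional assumptions, including the choice between additive and multiplicative models.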
Monday, November 17, 2008
Posted by Danny Tarlow
Here's an example of a data-driven approach to decision-making in football strategy. http://www.sciencedaily.com/videos/2006/1101-football_frenzy_picking_the_perfect_play.htm This is from 2006. I wonder why we haven't heard more about it. The article gives an attempt at an explanation:
The NFL hasn't embraced the technology just yet. The league is known for its conservative decisions and its trust in the highly paid coaches -- and not necessarily computers. Zeus's makers say the program could also be tailored to college football.
I don't completely buy it. If it works well and can help a coach, there should be enough incentive to win games to break the inertia. Somehow this product isn't quite the right way to do it. Or maybe the developers need to talk to the Florida coach from my other football statistics post. More at NY Times: http://fifthdown.blogs.nytimes.com/2008/11/05/coaching-flaw-the-computer-sees-it/
Posted by Danny Tarlow
Neat video about how the US is using unmanned aircraft to collect data over enemy territory. http://www.good.is/?p=13262 Side note: The website should have more descriptive URLs. Something like http://www.good.is/?video=unmanned-aircraft-in-the-us-military should drive more organic search engine traffic.
Sunday, November 16, 2008
Posted by Danny Tarlow
I'll be the first to admit that I don't understand why people make the decisions that they do about money, but I'm not sure I'd ever turn to fMRI to help. These guys are, though. I like the high level notion of mixing psychology and neuroscience with economics: http://www.newyorker.com/archive/2006/09/18/060918fa_fact
Saturday, November 15, 2008
Posted by Danny Tarlow
This just seems like the right way to run a football team. Has anybody worked on fancy ways of using statistics for choosing football strategies? I'd be interested in hearing about it. http://www.orlandosentinel.com/sports/orl-fbcuf0508nov05,0,7267488.story
Posted by Danny Tarlow
http://www.npr.org/templates/story/story.php?storyId=97027872 I'm glad this isn't how ties are resolved in the case of a presidential election. Can you imagine John McCain and Barack Obama standing in the middle of a football stadium, waiting for the result of a coin toss to decide who becomes the next president?
Wednesday, September 24, 2008
Posted by Danny Tarlow
Apparently nobody told John McCain that he won't be able to make it to the debate on Friday. Edit: This was taken from his website the night that he announced that he would not be attending the debates due to the economic crisis.