In God We Trust. The Rest Bring Data!
Sherlock Holmes (in the short story “The Adventure of the Copper Beeches” by Sir Arthur Conan Doyle) summarizes nicely what I want to say about the importance of data: “Data! Data! Data! I can't make bricks without clay!”1
I am sure anyone with a serious interest in the area of data-driven predictive modeling has heard that a strong positive correlation between two variables does not imply a causal relationship. Very often, this is used as an argument against statistical modeling. It is a widely held opinion that domain expertise is much more important than what the data shows. This argument certainly has some merit. Just because there is a relationship between two quantities, we cannot conclude that one is the cause of the other.
Here is an example that I have occasionally used to illustrate this point: If we compare the monetary damage caused by fires in a certain city with the number of fire engines sent to each fire, we are likely to find a significant positive correlation. In other words, the sites with higher monetary damage would probably also have had more fire engines working to contain the fire. A naïve modeler could build a model to predict the monetary damage caused by a fire (the dependent variable) that uses the number of fire engines as an independent variable. Based on this, are we right to conclude that next time there is a fire, we can reduce the monetary ...
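The fire-engine example can be sketched as a small simulation. This is only an illustration under invented assumptions: the hidden variable `severity`, every coefficient, and the function names `simulate_fires` and `engine_damage_correlation` are hypothetical and not taken from any real fire data. In the sketch, severity drives both the number of engines dispatched and the damage, while each additional engine is assumed to reduce damage.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_fires(n, seed=0):
    """Return a list of (severity, engines, damage) tuples.

    All coefficients below are invented for illustration only.
    """
    rng = random.Random(seed)
    fires = []
    for _ in range(n):
        severity = rng.random()  # hidden driver: 0 = trivial, 1 = severe
        # Assumption: dispatchers send more engines to more severe fires.
        engines = max(1, round(2 + 6 * severity + rng.gauss(0, 1)))
        # Assumption: severity raises damage, each engine lowers it.
        damage = 30_000 + 100_000 * severity - 3_000 * engines + rng.gauss(0, 2_000)
        fires.append((severity, engines, max(0.0, damage)))
    return fires


def engine_damage_correlation(fires, low=0.0, high=1.0):
    """Correlation of engines vs. damage for fires with low <= severity <= high."""
    subset = [f for f in fires if low <= f[0] <= high]
    return pearson([f[1] for f in subset], [f[2] for f in subset])


if __name__ == "__main__":
    fires = simulate_fires(5_000)
    print("all fires:", round(engine_damage_correlation(fires), 2))
    print("similar severity only:", round(engine_damage_correlation(fires, 0.45, 0.55), 2))
```

Across all simulated fires, engines and damage are positively correlated, even though every engine reduces damage by construction. Restricting the comparison to fires of similar severity reverses the sign, which is what we would expect if the raw correlation reflects the severity of the fire and not the effect of the engines.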