Some people see multiple regression as a food processor: just keep throwing variables into the model and it will kick out an answer. Some variables you can ‘throw in there’ and get a sensible answer. Other variables, not so much, and I will tell you why. Some prognostic factors are meaningless as stand-alone numbers.

Let’s step back from regression for a minute and I will give you an analogy. If I tell you that a variable is categorical, the average of its values will not mean anything to you. That is why factors like gender are never averaged; rather, you look at the proportion in each group. In the same spirit, a frequency count of every distinct value is meaningless for a continuous variable. If you have 5000 numbers, you won’t summarize the data by listing each observation. You want the average.

Now, every now and then there is a variable “impostor”. Some variables look like regular numeric variables but in reality are not. Zip code is a fine example. It looks continuous on the surface, and it may tempt you to take its average or put it in a regression as is. Don’t do it. Zip code values are meaningless unless you link them to a location: the digits are labels, not quantities. And in some instances, the area a zip code covers gets reassigned. I strongly suggest you use some form of cluster analysis, or map the zip codes to general locations. I am just saying...
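To make the zip code point concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the five records, the `ZIP_TO_REGION` lookup, and the `dummy_code` helper are all hypothetical, and a real analysis would need a proper zip-to-location source. It only shows the idea of treating zip code as a label that gets mapped to a general location, rather than as a number.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (zip code, some outcome). Values invented for illustration.
records = [
    ("10001", 12.0),
    ("10002", 15.0),
    ("60601", 30.0),
    ("60602", 34.0),
    ("94105", 21.0),
]

# Hypothetical lookup from zip code to a general location (an assumption,
# not a real reference table).
ZIP_TO_REGION = {
    "10001": "Northeast",
    "10002": "Northeast",
    "60601": "Midwest",
    "60602": "Midwest",
    "94105": "West",
}

# The tempting mistake: averaging the zip codes as if they were quantities.
# The result is a number that need not correspond to any place at all.
naive_mean = mean(int(zip_code) for zip_code, _ in records)

# The categorical treatment: map each zip code to a region, then look at
# the proportion of records in each group.
region_counts = Counter(ZIP_TO_REGION[zip_code] for zip_code, _ in records)
region_share = {region: n / len(records) for region, n in region_counts.items()}


def dummy_code(region, levels, reference):
    """Return 0/1 indicators for every level except the reference level."""
    return [1 if region == level else 0 for level in levels if level != reference]


# Indicator (dummy) columns are one common way to put a categorical
# variable into a regression; "Midwest" is an arbitrary reference level here.
levels = sorted(region_counts)
design_rows = [
    dummy_code(ZIP_TO_REGION[zip_code], levels, reference="Midwest")
    for zip_code, _ in records
]

print("Average zip code (meaningless):", naive_mean)
print("Share of records per region:", region_share)
print("Indicator rows (Northeast, West):", design_rows)
```

The region shares are something you can actually interpret, and the indicator rows are what would go into the model in place of the raw zip code. The average zip code is just arithmetic on labels.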
If you enjoyed this blog, there is more to follow. -Amy