Monday, December 12, 2011

Categorical Variables for Regression Analysis?

I'm doing a data analysis for a particular tennis player. I have a mix of independent variables (1st serve %, break points, length of match, opponent ATP rank, etc.), including a categorical one (court surface: e.g. clay, grass, hardcourt, carpet). Since it's not a binary variable, how do I convert this data for analysis? My best guess is to use multiple binary variables:





Is Clay?
Is Hardcourt?
Is Grass?

and a 0 on all three would denote carpet.





As for the dependent variable, should I use Win/Lose? I'm not sure, since it's only a binary variable. I was thinking of maybe using % of total points won, which is a scalable indicator of how thorough the win was, but not in all cases... Thanks guys!

You are correct in your guess regarding how to handle the court surface variable: turn it into three binary variables, with the fourth surface as the default (reference) category. Each surface coefficient then measures the effect of that surface relative to the default one.
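As a rough illustration of that coding scheme, here is a minimal Python sketch. The helper name `encode_surface`, the column names `is_clay` / `is_hardcourt` / `is_grass`, and the sample match list are all made up for the example; carpet is assumed to be the omitted default, as in the question.

```python
# Hypothetical helper: dummy-code the court surface, with carpet as the
# omitted reference category (all three indicators equal to 0).
SURFACES = ("clay", "hardcourt", "grass")


def encode_surface(surface):
    """Return the three 0/1 indicator variables for one match."""
    surface = surface.strip().lower()
    if surface != "carpet" and surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface!r}")
    return {f"is_{name}": int(surface == name) for name in SURFACES}


# Made-up sample data, one surface per match.
matches = ["Clay", "Carpet", "Grass", "Hardcourt"]
encoded = [encode_surface(s) for s in matches]

for surface, row in zip(matches, encoded):
    print(surface, row)
```

Only three indicators are needed for four surfaces; adding a fourth "Is Carpet?" column alongside an intercept would make the columns perfectly collinear.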





If you want to use the percent of total points won as your dependent variable, that's fine, but it seems as though what you're really interested in is whether or not the player will win a match; in other words, a binary dependent variable. The standard techniques for estimating models like this are logit and probit regression, which are available in most statistical packages. They estimate a nonlinear function for the probability that a case will fall into one category as opposed to another. In the case of logit, what is modeled as a linear function of the independent variables is the natural log of the odds of being in the group coded 1 (win) as opposed to the group coded 0 (loss).
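To make the log-odds idea concrete, here is a small Python sketch of how a logit model turns predictors into a win probability. The coefficient values are invented purely for illustration (they are not estimated from any data), and the function and variable names are hypothetical; in practice a statistical package would estimate the coefficients for you.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map a log-odds value back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Invented coefficients, for illustration only (not fitted to real matches).
COEF = {"first_serve_pct": 0.04, "is_clay": 0.6, "is_hardcourt": 0.2, "is_grass": -0.1}
INTERCEPT = -2.5


def win_probability(features):
    """Linear predictor on the log-odds scale, then converted to a probability."""
    z = INTERCEPT + sum(COEF[name] * value for name, value in features.items())
    return inv_logit(z)


on_clay = {"first_serve_pct": 65, "is_clay": 1, "is_hardcourt": 0, "is_grass": 0}
on_carpet = {"first_serve_pct": 65, "is_clay": 0, "is_hardcourt": 0, "is_grass": 0}

print("clay:  ", round(win_probability(on_clay), 3))
print("carpet:", round(win_probability(on_carpet), 3))
```

With these made-up numbers, the `is_clay` coefficient is the difference in log-odds of winning between clay and carpet, holding the first serve percentage fixed.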
