In addition to a graphic examination of the data, you can also statistically examine the data's normality. Specifically, statistical programs such as SPSS will calculate the skewness and kurtosis for each variable; an extreme value for either one would tell you that the data are not normally distributed. If any variable is not normally distributed, then you will probably want to transform it, as discussed in a later section.
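As a rough sketch of such a check outside of SPSS (the happiness variable and the rule-of-thumb cut-off of 2 below are purely illustrative assumptions, not part of the original discussion):

```python
import numpy as np
from scipy import stats

# Hypothetical variable: happiness scores for 200 respondents
rng = np.random.default_rng(0)
happiness = rng.normal(loc=5, scale=1.5, size=200)

# Skewness and excess kurtosis are both near 0 for normally distributed data
skew = stats.skew(happiness)
kurt = stats.kurtosis(happiness)  # Fisher definition: normal distribution gives 0

print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")

# Illustrative rule of thumb: flag clearly extreme values
if abs(skew) > 2 or abs(kurt) > 2:
    print("Variable deviates noticeably from normality; consider transforming it.")
```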
Checking for outliers will also help with the normality problem.

Linearity

Regression analysis also has an assumption of linearity.
Linearity means that there is a straight-line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV; any nonlinear relationship between an IV and the DV is ignored. You can test for linearity between an IV and the DV by looking at a bivariate scatterplot (i.e., a graph with the IV on one axis and the DV on the other). If the two variables are linearly related, the scatterplot will be oval.
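A scatterplot of this kind might be produced with a sketch like the following; the friends and happiness values here are simulated stand-ins, since the original data and figure are not reproduced in this text:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with a roughly linear friends-happiness relationship
rng = np.random.default_rng(1)
friends = rng.integers(0, 30, size=200)
happiness = 2 + 0.3 * friends + rng.normal(0, 1.5, size=200)

plt.scatter(friends, happiness)
plt.xlabel("Number of friends")
plt.ylabel("Happiness")
plt.show()  # a roughly oval cloud of points suggests a linear relationship
```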
Looking at the above bivariate scatterplot, you can see that friends is linearly related to happiness. Specifically, the more friends you have, the greater your level of happiness.
However, you could also imagine that there could be a curvilinear relationship between friends and happiness, such that happiness increases with the number of friends to a point. Beyond that point, however, happiness declines with a larger number of friends.
This is demonstrated by the graph below.

You can also test for linearity by using the residual plots described previously. If the IVs and the DV are linearly related, the residuals will show no systematic pattern when plotted against the predicted DV scores; the plot will be a roughly horizontal band centered on zero.
Nonlinearity is demonstrated when most of the residuals are above the zero line on the plot at some predicted values, and below the zero line at other predicted values. In other words, the overall shape of the plot will be curved, instead of rectangular. The following is a residuals plot produced when happiness was predicted from number of friends and age.
As you can see, the data are not linear.

The following is an example of a residuals plot, again predicting happiness from friends and age. But in this case, the data are linear.

If your data are not linear, then you can usually make them linear by transforming the IVs or the DV so that there is a linear relationship between them.
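In practice, a residuals plot like the ones just described might be produced as in the following sketch; the friends, age, and happiness data are again simulated stand-ins:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data: predict happiness from number of friends and age
rng = np.random.default_rng(2)
friends = rng.poisson(10, size=200)
age = rng.uniform(18, 65, size=200)
happiness = 2 + 0.3 * friends + 0.02 * age + rng.normal(0, 1, size=200)

X = sm.add_constant(np.column_stack([friends, age]))
model = sm.OLS(happiness, X).fit()

# Residuals vs. predicted values: a roughly rectangular band around zero
# suggests that the linearity assumption is reasonable
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted happiness")
plt.ylabel("Residual")
plt.show()
```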
Sometimes transforming one variable won't work; the IV and DV are just not linearly related. If there is a curvilinear relationship between the DV and IV, you might want to dichotomize the IV, because a dichotomous variable can only have a linear relationship with another variable, if it has any relationship at all. Alternatively, if there is a curvilinear relationship between the IV and the DV, then you might need to include the square of the IV in the regression; this is also known as a quadratic regression.
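A quadratic regression of this kind might look like the following sketch, where the curvilinear friends-happiness data are simulated for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical curvilinear data: happiness rises with friends, then falls
rng = np.random.default_rng(3)
friends = rng.uniform(0, 30, size=200)
happiness = 1 + 0.8 * friends - 0.02 * friends**2 + rng.normal(0, 1, size=200)

# Quadratic regression: enter both the IV and its square as predictors
X = sm.add_constant(np.column_stack([friends, friends**2]))
quadratic = sm.OLS(happiness, X).fit()

print(quadratic.params)      # intercept, linear term, quadratic term
print(quadratic.pvalues[2])  # a small p-value on the squared term indicates curvature
```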
The failure of linearity in regression will not invalidate your analysis so much as weaken it; the linear regression coefficient cannot fully capture the extent of a curvilinear relationship. If there is both a curvilinear and a linear relationship between the IV and DV, then the regression will at least capture the linear relationship.

Homoscedasticity

The assumption of homoscedasticity is that the spread of the residuals is approximately equal for all predicted DV scores.
Another way of thinking of this is that the variability in DV scores is roughly the same at all values of the IVs. You can check homoscedasticity by looking at the same residuals plot talked about in the linearity and normality sections.
Data are homoscedastic if the residuals plot is the same width for all values of the predicted DV. Heteroscedasticity is usually shown by a cluster of points that is wider as the values for the predicted DV get larger. Alternatively, you can check for homoscedasticity by looking at a scatterplot between each IV and the DV.
As with the residuals plot, you want the cluster of points to be approximately the same width all over. The following residuals plot shows data that are fairly homoscedastic.
In fact, this residuals plot shows data that meet the assumptions of homoscedasticity, linearity, and normality, because the plot is rectangular, with a concentration of points along the center.
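One informal way to put the "same width all over" idea into numbers (this binning approach is an illustrative suggestion, not a formal test, and the data are again simulated) is to compare the residual spread across bins of the predicted values:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data, as in the residuals-plot sketch above
rng = np.random.default_rng(2)
friends = rng.poisson(10, size=200)
age = rng.uniform(18, 65, size=200)
happiness = 2 + 0.3 * friends + 0.02 * age + rng.normal(0, 1, size=200)
model = sm.OLS(happiness, sm.add_constant(np.column_stack([friends, age]))).fit()

# Compare the residual spread across quartile bins of the predicted values;
# roughly equal spreads are consistent with homoscedasticity
predicted, residuals = model.fittedvalues, model.resid
edges = np.quantile(predicted, [0, 0.25, 0.5, 0.75, 1.0])
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (predicted >= lo) & (predicted <= hi)
    print(f"predicted {lo:5.2f} to {hi:5.2f}: residual SD = {residuals[in_bin].std():.2f}")
```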
Heteroscedasticity may occur when some variables are skewed and others are not. Thus, checking that your data are normally distributed should cut down on the problem of heteroscedasticity. Like the assumption of linearity, violation of the assumption of homoscedasticity does not invalidate your regression so much as weaken it.

Multicollinearity and Singularity

Multicollinearity is a condition in which the IVs are very highly correlated with one another; singularity is the more extreme condition in which one IV is a perfect linear combination of the others.
Multicollinearity and singularity can be caused by high bivariate correlations among the IVs (correlations of roughly .90 or above are often cited as problematic). High bivariate correlations are easy to spot by simply running correlations among your IVs. If you do have high bivariate correlations, the problem is easily solved by deleting one of the two variables, but you should check your programming first; such a correlation is often the result of a mistake made when the variables were created. It is harder to spot high multivariate correlations, where one IV is nearly a linear combination of several of the other IVs.
Tolerance, a related concept, is calculated as 1 - SMC, where SMC is the squared multiple correlation of an IV when it is predicted from all of the other IVs. Tolerance is thus the proportion of a variable's variance that is not accounted for by the other IVs in the equation. You don't need to worry too much about tolerance in that most programs will not allow a variable to enter the regression model if its tolerance is too low.
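As an illustration, the following sketch computes both the bivariate correlations among a set of hypothetical IVs and each IV's tolerance, using the variance inflation factor (VIF), which is simply 1/tolerance:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical IVs; 'exercise' is deliberately near-redundant with 'friends'
rng = np.random.default_rng(4)
friends = rng.normal(10, 3, size=200)
age = rng.normal(40, 12, size=200)
exercise = 0.9 * friends + rng.normal(0, 0.5, size=200)
ivs = pd.DataFrame({"friends": friends, "age": age, "exercise": exercise})

# Bivariate correlations among the IVs: very high values flag redundancy
print(ivs.corr().round(2))

# Tolerance = 1 - SMC = 1 / VIF, computed with an intercept included
X = sm.add_constant(ivs).to_numpy()
for i, name in enumerate(ivs.columns, start=1):
    print(f"{name}: tolerance = {1 / variance_inflation_factor(X, i):.3f}")
```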
Statistically, you do not want singularity or multicollinearity because the regression coefficients are calculated through matrix inversion: if singularity exists, the inversion is impossible, and if multicollinearity exists, the inversion is unstable. Logically, you don't want multicollinearity or singularity because if they exist, then your IVs are redundant with one another.
In such a case, one IV doesn't add any predictive value over another IV, but you do lose a degree of freedom. In general, you probably wouldn't want to include two IVs that correlate very highly with one another (a commonly cited rule of thumb is a correlation of about .70 or higher).

Transformations

As mentioned in the section above, when one or more variables are not normally distributed, you might want to transform them.
You could also use transformations to correct for heteroscedasticity, nonlinearity, and outliers. Some people do not like to use transformations because they make the analysis harder to interpret. Thus, if your variables are measured in "meaningful" units, such as days, you might not want to use transformations. If, however, your data are just arbitrary values on a scale, then transformations don't really make it more difficult to interpret the results.
Since the goal of transformations is to normalize your data, you want to re-check for normality after you have performed them. Deciding which transformation is best is often an exercise in trial and error, where you try several transformations and see which one gives the best results.
The specific transformation used depends on the extent of the deviation from normality. If the distribution differs moderately from normality, a square root transformation is often the best. A log transformation is usually best if the data are more substantially non-normal. An inverse transformation should be tried for severely non-normal data. If nothing can be done to "normalize" the variable, then you might want to dichotomize the variable as was explained in the linearity section.
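The following sketch illustrates this trial-and-error approach on a hypothetical positively skewed variable, applying each of the transformations just mentioned and re-checking skewness:

```python
import numpy as np
from scipy import stats

# Hypothetical positively skewed variable (e.g., number of sick days),
# shifted so that all values are >= 1 before taking logs or inverses
rng = np.random.default_rng(5)
days = rng.exponential(scale=5, size=200) + 1

candidates = {
    "original": days,
    "square root": np.sqrt(days),
    "log": np.log(days),
    "inverse": 1 / days,  # note: the inverse reverses the ordering of scores
}

# Re-check skewness after each transformation and keep the least skewed version
for name, values in candidates.items():
    print(f"{name:12s} skewness = {stats.skew(values):+.2f}")
```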
Direction of the deviation is also important. The transformations above assume a positive skew; if a variable is negatively skewed, it is usually "reflected" first (each score is subtracted from a constant one unit larger than the largest score, so that the skew becomes positive), the transformation is applied to the reflected variable, and the reversed direction of the scores is kept in mind when interpreting the results.