Uploaded data structure:
Full model:
Regression coefficients and 95% confidence intervals:
First x-value (zero) = coefficient value from the full model.
Other cases were sequentially removed from the data and the model was refitted without them.
These observations are likely to be highly influential:
Note: Check the description of potentially influential observations on the 'Description' tab.
Data table of all excluded observations:
Cases (observations) are in the same order as on the exclusion plot (in decreasing order of influence).
Case ID = row number in the provided data.
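The exclusion procedure described above (fit the full model, then repeatedly drop the most influential case and refit) can be sketched in a few lines. This is only an illustration under stated assumptions, not the app's actual code: it uses a simple least-squares line instead of a GLM, and it measures influence as the absolute change in the slope when a case is dropped. The function names fit_slope and exclusion_path are made up for this sketch.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope for the line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def exclusion_path(xs, ys, max_excluded):
    """Return (excluded_ids, slopes).

    slopes[0] is the slope from the full data; slopes[j] is the slope
    after the j most influential cases have been removed.
    """
    ids = list(range(1, len(xs) + 1))  # case ID = 1-based row number
    xs, ys = list(xs), list(ys)
    slopes = [fit_slope(xs, ys)]
    excluded = []
    for _ in range(max_excluded):
        current = slopes[-1]

        # Assumption for this sketch: influence of case i is the absolute
        # change in the slope when case i alone is left out.
        def change(i):
            return abs(fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) - current)

        worst = max(range(len(xs)), key=change)
        excluded.append(ids.pop(worst))
        xs.pop(worst)
        ys.pop(worst)
        slopes.append(fit_slope(xs, ys))
    return excluded, slopes


# Five points on the line y = x plus one outlying case (row 6).
excluded, slopes = exclusion_path([1, 2, 3, 4, 5, 10], [1, 2, 3, 4, 5, 30], 1)
print(excluded, slopes)
```

The first entry of slopes corresponds to x = 0 on the exclusion plot (the full model); each further entry corresponds to one more excluded case.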
Typical Workflow
1. Upload your data: Tab-delimited text file with header (*.txt) or Excel file with one sheet (*.xls, *.xlsx).
2. Specify the model you would like to fit (see Formula interface for hints).
3. Check other settings (GLM family, maximum number of the most influential observations to exclude, criterion to estimate influence).
4. Press the 'Build model' button - your model will be fitted to the full data set. You can check the results on the 'Full model results' tab.
5. Go to the 'Influential effect' tab. This step can be computationally intensive (depending on the data size), so the app may freeze for a while.
6. On the sidebar choose which coefficients you would like to plot.
7. To rebuild the plot for another coefficient, press 'Build model' again.
That's it.
Formula interface
Here are several examples you may try (paste formula without quotes):
'y ~ A + B' - Additive model with 2 predictors (no interaction).
'y ~ log(A) + sqrt(B)' - The same but predictors are transformed.
'y ~ A*B' - Model with main effects and interaction = 'y ~ A + B + A:B'.
'y ~ A*B*C' - The same: main effects and interactions = 'y ~ A+B+C+A:B+A:C+B:C+A:B:C'.
'y ~(A+B+C)^2' - A, B, and C crossed to level 2: 'y ~ A+B+C+A:B+A:C+B:C'.
'y ~ A*B*C-A:B:C' - same as above: main effects plus 2-way interactions.
Models without an intercept (e.g. 'y ~ A + B - 1') do not work at the moment - this will be fixed soon.
Please use your 'real' variable names instead of 'y' or 'A' & 'B' from these examples.
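To make the expansion rules above concrete, here is a small sketch that lists the terms implied by crossing a set of predictors. It only mimics the term expansion for plain variable names; it is not a formula parser, and the function name crossed_terms is hypothetical.

```python
from itertools import combinations


def crossed_terms(variables, max_order=None):
    """List main effects and interactions up to max_order.

    With max_order=None all interactions are included, as in 'A*B*C'.
    With max_order=2 the result matches '(A+B+C)^2'.
    """
    if max_order is None:
        max_order = len(variables)
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms


full = crossed_terms(["A", "B", "C"])            # terms of 'y ~ A*B*C'
two_way = crossed_terms(["A", "B", "C"], 2)      # terms of 'y ~ (A+B+C)^2'
print("y ~ " + " + ".join(full))
print("y ~ " + " + ".join(two_way))
```

Dropping 'A:B:C' from the full list gives the same terms as the two-way version, which is why 'y ~ A*B*C-A:B:C' and 'y ~ (A+B+C)^2' describe the same model.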
With binomial data the response (y) can be either a vector or a matrix with two columns.
If the response is a vector it can be numeric with 0 for failure and 1 for success, or a factor with the first level representing 'failure' and all others representing 'success'.
Alternatively, the response can be a matrix where the first column is the number of 'successes' and the second column is the number of 'failures'. In this case specify the model as 'cbind(success, failure) ~ A + B', where 'success' & 'failure' are column names in the data.
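The two response forms carry the same information when the 0/1 observations are grouped. As a rough illustration (the helper name to_success_failure and the grouping variable are assumptions made for this example, not part of the app), a 0/1 vector can be collapsed into success and failure counts per group like this:

```python
from collections import defaultdict


def to_success_failure(groups, outcomes):
    """Collapse 0/1 outcomes into (successes, failures) counts per group."""
    counts = defaultdict(lambda: [0, 0])
    for group, y in zip(groups, outcomes):
        if y == 1:
            counts[group][0] += 1   # success
        else:
            counts[group][1] += 1   # failure
    return {group: tuple(c) for group, c in counts.items()}


table = to_success_failure(["a", "a", "b", "a", "b"], [1, 0, 1, 1, 0])
print(table)
```

Each resulting pair would fill one row of the 'success' and 'failure' columns used in 'cbind(success, failure) ~ A + B'.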
Zero-inflated and Hurdle models: If a formula of type 'y ~ x1 + x2' is supplied, it not only describes the count regression relationship of y and x1 & x2 but also implies that the same set of regressors is used for the zero component (zero-inflation or hurdle). This could be made more explicit by equivalently writing the formula as 'y ~ x1 + x2 | x1 + x2'. A different set of regressors can also be specified for the zero component, e.g., 'y ~ x1 + x2 | z1 + z2 + z3', giving the count data model y ~ x1 + x2 conditional on (|) the zero-inflation or hurdle model y ~ z1 + z2 + z3.
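The rule for the '|' separator can be written down as a short sketch. It only splits the formula text into its two components and is not how the fitting functions parse formulas internally; the function name split_two_part_formula is hypothetical.

```python
def split_two_part_formula(formula):
    """Return (count_formula, zero_formula) for a two-part count model.

    If no '|' is present, the same regressors are used for both parts.
    """
    response, rhs = (part.strip() for part in formula.split("~", 1))
    if "|" in rhs:
        count_rhs, zero_rhs = (part.strip() for part in rhs.split("|", 1))
    else:
        count_rhs = zero_rhs = rhs
    return f"{response} ~ {count_rhs}", f"{response} ~ {zero_rhs}"


print(split_two_part_formula("y ~ x1 + x2"))
print(split_two_part_formula("y ~ x1 + x2 | z1 + z2 + z3"))
```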
Error distribution family - Model link function used
Gaussian - Identity
Inverse Gaussian - 1/mu^2
Gamma - log
Binomial - logit
Poisson - log
Negative Binomial - log
Quasibinomial - logit
Quasipoisson - log
Zero-Inflated Poisson - log; binomial model with logit link for zero-inflation model
Zero-Inflated Negative Binomial - log; binomial model with logit link for zero-inflation model
Hurdle - log; distribution for the zero hurdle model - binomial model with logit link
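Because coefficients are reported on the scale of the link function, it can help to see what each link in the list above does to a mean value. The sketch below pairs every link with its inverse; the dictionary names LINKS and FAMILY_LINK and the helper back_transform are introduced here for illustration only, and the two-part models are left out because they combine two links.

```python
import math

# Each entry: (link function mu -> eta, inverse link eta -> mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "1/mu^2": (lambda mu: 1 / mu ** 2, lambda eta: 1 / math.sqrt(eta)),
}

# Family-to-link pairs copied from the list above (single-part models only).
FAMILY_LINK = {
    "Gaussian": "identity",
    "Inverse Gaussian": "1/mu^2",
    "Gamma": "log",
    "Binomial": "logit",
    "Poisson": "log",
    "Negative Binomial": "log",
    "Quasibinomial": "logit",
    "Quasipoisson": "log",
}


def back_transform(family, eta):
    """Map a linear-predictor value back to the scale of the mean."""
    _, inverse = LINKS[FAMILY_LINK[family]]
    return inverse(eta)


print(back_transform("Poisson", 0.0))    # exp(0)
print(back_transform("Binomial", 0.0))   # inverse logit of 0
```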
Potentially influential observations
A case is flagged as influential if any of the following holds (k = number of model coefficients, n = number of observations):
- any of its absolute dfbetas values is larger than 1, or
- its absolute dffits value is larger than 3*sqrt(k/(n-k)), or
- abs(1 - covratio) is larger than 3*k/(n-k), or
- its Cook's distance is larger than the 50th percentile (median) of an F-distribution with k and n-k degrees of freedom, or
- its hat value is larger than 3*k/n.
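The closed-form cutoffs in this list are easy to compute by hand. The sketch below does so for n observations and k model coefficients and checks one case against them. The Cook's distance rule is deliberately left out because it needs an F-distribution quantile, which the Python standard library does not provide. The function names and the layout of the case dictionary are assumptions made for this example.

```python
import math


def influence_cutoffs(n, k):
    """Cutoffs for dfbetas, dffits, covratio and hat values."""
    return {
        "dfbetas": 1.0,
        "dffits": 3 * math.sqrt(k / (n - k)),
        "covratio": 3 * k / (n - k),   # applied to abs(1 - covratio)
        "hat": 3 * k / n,
    }


def is_flagged(case, cutoffs):
    """True if the case exceeds at least one cutoff (Cook's distance omitted)."""
    return (
        any(abs(b) > cutoffs["dfbetas"] for b in case["dfbetas"])
        or abs(case["dffits"]) > cutoffs["dffits"]
        or abs(1 - case["covratio"]) > cutoffs["covratio"]
        or case["hat"] > cutoffs["hat"]
    )


cutoffs = influence_cutoffs(n=100, k=4)
case = {"dfbetas": [0.2, -1.3], "dffits": 0.1, "covratio": 1.0, "hat": 0.05}
print(cutoffs)
print(is_flagged(case, cutoffs))  # flagged because one |dfbetas| exceeds 1
```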
Notes
For GLMs (other than the Gaussian family with identity link) regression diagnostic measures are based on one-step approximations which may be inadequate if a case has high influence.
Vuong Non-Nested Hypothesis Test: the null hypothesis is that the two models are indistinguishable. A large, positive test statistic provides evidence of the superiority of model 1 over model 2, while a large, negative test statistic is evidence of the superiority of model 2 over model 1.
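For readers who want to see where the sign of the Vuong statistic comes from, here is a minimal sketch. It assumes the per-observation log-likelihoods of both models are already available (the input lists below are invented numbers), uses the sample standard deviation, and applies no correction for the number of parameters, so it may differ from what a statistical package reports.

```python
import math
from statistics import NormalDist, mean, stdev


def vuong_statistic(loglik1, loglik2):
    """Raw Vuong z statistic and one-sided p-value for 'model 1 is better'.

    loglik1 and loglik2 hold the log-likelihood contribution of each
    observation under model 1 and model 2, in the same order.
    """
    m = [a - b for a, b in zip(loglik1, loglik2)]
    z = math.sqrt(len(m)) * mean(m) / stdev(m)
    p_model1_better = 1 - NormalDist().cdf(z)
    return z, p_model1_better


# Invented pointwise log-likelihoods, for illustration only.
ll_model1 = [-1.0, -1.2, -0.8, -1.1]
ll_model2 = [-1.5, -1.3, -1.4, -1.6]
z, p = vuong_statistic(ll_model1, ll_model2)
print(z, p)
```

Swapping the two inputs flips the sign of z, which matches the reading given above: positive values favour model 1, negative values favour model 2.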
Future plans
Add export of influence measures.
Add batch comparison of likelihood from different models.
Fix handling of the models without intercept.
Add back-transformation of regression coefficients to the original scale (now they're on the scale of the link function).
Add the option to choose other link functions (for example, the complementary log-log link in a binomial GLM for presence-absence data).
Add link-function selection based on lowest residual deviance (-2*logLikelihood).
Add some stopping rules to the exclusion algorithm (e.g. based on Mallows's Cp or something similar).
Add computation of Wald tests using sandwich standard errors for the ordinary Poisson model (because Wald test results might be too optimistic due to misspecification of the likelihood in the case of over-dispersion).
Computational details
R packages used: MASS, pscl, car, XLConnect.
Author - Vladimir Mikryukov (vmikryukov at gmail.com), 04.07.2014
Which model to choose?
Continuous distributions
Standard normal or linear regression - Gaussian
Positive only continuous - Gamma or Inverse Gaussian
Count data
Equidispersed count - Poisson
Overdispersed counts - Quasipoisson
Overdispersed proportions - Quasibinomial
Binary (1/0) response - Logistic
Binomial distribution with m=1 - Binary-Bernoulli
Proportional (y/m, where y = number of 1's) - Binomial
Poisson Regression is used to model count variables. The variance in the Poisson model is identical to the mean.
Negative binomial regression is for modeling count variables, usually over-dispersed count outcome variables. It can be considered a generalization of Poisson regression since it has the same mean structure as Poisson regression plus an extra parameter to model the over-dispersion. For negative binomial regression the variance is assumed to be a quadratic function of the mean.
For quasi-Poisson regression the variance is assumed to be a linear function of the mean.
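The three variance assumptions can be compared side by side. In the sketch below, phi is the quasi-Poisson dispersion parameter and theta is the negative binomial shape parameter, using the common parameterization variance = mu + mu^2/theta; both symbols are introduced here and do not appear in the text above.

```python
def poisson_variance(mu):
    """Poisson: variance equals the mean."""
    return mu


def quasipoisson_variance(mu, phi):
    """Quasi-Poisson: variance is linear in the mean (phi * mu)."""
    return phi * mu


def negbin_variance(mu, theta):
    """Negative binomial: variance is quadratic in the mean."""
    return mu + mu ** 2 / theta


for mu in (1.0, 4.0, 16.0):
    print(mu, poisson_variance(mu), quasipoisson_variance(mu, 2.0), negbin_variance(mu, 2.0))
```

With these values the quasi-Poisson variance stays at twice the mean, while the negative binomial variance grows faster as the mean increases.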
Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to exist in the data, 'true zeros' and 'excess zeros'. Zero-inflated models estimate two equations simultaneously, one for the count model and one for the excess zeros.
The zero-inflated Poisson model has two parts: a Poisson count model and a logit model for predicting excess zeros.
Zero-inflated negative binomial regression does better with overdispersed data, i.e. when the variance is much larger than the mean.
Zero-truncated models are used to model count data for which the value zero cannot occur. They are preferable to ordinary Poisson and negative binomial regression in this setting, because those models assign positive probability to zero counts even though there are no zero values.
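A small numerical sketch may help to see how the two parts of a zero-inflated Poisson model combine. Here lam is the Poisson mean and pi is the probability of an excess zero; both are treated as fixed numbers, whereas in a fitted model they would depend on the regressors. The names are chosen for this example only.

```python
import math


def poisson_pmf(y, lam):
    """Probability of count y under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson: an excess zero with probability pi,
    otherwise a draw from the Poisson count model."""
    from_count_model = (1 - pi) * poisson_pmf(y, lam)
    return pi + from_count_model if y == 0 else from_count_model


lam, pi = 2.0, 0.3
print(poisson_pmf(0, lam), zip_pmf(0, lam, pi))
```

The probability of a zero is larger under the zero-inflated model than under the plain Poisson model with the same lam, because zeros can come from either part.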
Zero-truncated Poisson regression - Useful if you have no overdispersion
Zero-truncated negative binomial regression - When overdispersion exists.
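To illustrate the zero-truncated Poisson case, the sketch below rescales the Poisson probabilities so that zero has probability 0 and the positive counts sum to 1. The parameter lam is the mean of the untruncated Poisson distribution; the function names are hypothetical.

```python
import math


def poisson_pmf(y, lam):
    """Probability of count y under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zero_truncated_poisson_pmf(y, lam):
    """Poisson probabilities conditional on the count being at least 1."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, lam) / (1 - math.exp(-lam))


lam = 2.0
print([zero_truncated_poisson_pmf(y, lam) for y in range(4)])
```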
Hurdle (zero-augmented or zero-altered) models: In addition to over-dispersion, many empirical count data sets exhibit more zero observations than would be allowed for by the Poisson model. One model class capable of capturing both properties is the hurdle model. Hurdle models are two-component models: a truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed.
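The two components of a hurdle model can be combined as in the following sketch, which uses a zero-truncated Poisson for the positive counts and a fixed probability p_zero for the hurdle part. In a fitted model p_zero would come from the binomial (logit) component; the names used here are assumptions for illustration.

```python
import math


def truncated_poisson_pmf(y, lam):
    """Poisson probability of y, conditional on y >= 1."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return poisson / (1 - math.exp(-lam))


def hurdle_pmf(y, lam, p_zero):
    """Hurdle model: zero with probability p_zero, otherwise a positive
    count drawn from the zero-truncated Poisson."""
    if y == 0:
        return p_zero
    return (1 - p_zero) * truncated_poisson_pmf(y, lam)


lam, p_zero = 2.0, 0.4
print([hurdle_pmf(y, lam, p_zero) for y in range(4)])
```

Unlike the zero-inflated model, all zeros here come from the hurdle component, so the share of zeros is set by p_zero alone and can be higher or lower than a plain Poisson model would give.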