<h4>GLM (Generalized Linear Models)</h4>

Uploaded data structure:

Full model:

Regression coefficients and 95% confidence intervals:

First x-value (zero) = coefficient value from the full model.

Other cases were sequentially removed from the data, and the model was refitted without them.

These observations are likely to be highly influential:

Note: Check the description of potentially influential observations on the 'Description' tab.

Data table of all excluded observations:

Cases (observations) are listed in the same order as on the exclusion plot (in decreasing order of influence).

Case ID = row number in the provided data.
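The exclusion plot described above can be sketched as follows. This is a minimal Python illustration, not the app's R code. It assumes a simple linear regression, ranks cases once by Cook's distance in the full model, and then removes them cumulatively. The function names (fit_line, cooks_distances, exclusion_path) are hypothetical, and whether the app re-ranks cases after each removal is not stated here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def cooks_distances(xs, ys):
    """Cook's distance of every case in the full simple-regression model."""
    n, k = len(xs), 2  # k = number of coefficients (intercept and slope)
    intercept, slope = fit_line(xs, ys)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    dists = []
    for x, e in zip(xs, resid):
        h = 1 / n + (x - xbar) ** 2 / sxx  # hat value (leverage)
        dists.append(e * e * h / (k * s2 * (1 - h) ** 2))
    return dists


def exclusion_path(xs, ys, max_exclude):
    """Return (case IDs in removal order, slope after 0, 1, 2, ... removals)."""
    dists = cooks_distances(xs, ys)
    order = sorted(range(len(xs)), key=lambda i: dists[i], reverse=True)
    order = order[:max_exclude]
    slopes = [fit_line(xs, ys)[1]]  # x = 0: coefficient from the full model
    removed = set()
    for i in order:
        removed.add(i)
        keep = [j for j in range(len(xs)) if j not in removed]
        slopes.append(fit_line([xs[j] for j in keep], [ys[j] for j in keep])[1])
    return [i + 1 for i in order], slopes  # case ID = 1-based row number


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]  # the last case is an outlier
ids, slopes = exclusion_path(xs, ys, max_exclude=2)
print(ids, slopes)
```

In this toy data the first five points lie exactly on y = 2x, so the slope returns to 2 once the outlying sixth case is excluded.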

Typical Workflow

1. Upload your data: Tab-delimited text file with header (*.txt) or Excel file with one sheet (*.xls, *.xlsx).

2. Specify the model you would like to fit (see Formula interface for hints).

3. Check other settings (GLM family, maximum number of the most influential observations to exclude, criterion to estimate influence).

4. Press the 'Build model' button - your model will be fitted to the full data set. You may check the results on the 'Full model results' tab.

5. Go to the 'Influential effect' tab. This step can be computationally intensive (depending on the data size), so the app may freeze for a while.

6. On the sidebar, choose which coefficients you would like to plot.

7. To rebuild the plot for another coefficient, press 'Build model' again.

That's it.

Formula interface

Here are several examples you may try (paste formula without quotes):

'y ~ A + B' - Additive model with 2 predictors (no interaction).

'y ~ log(A) + sqrt(B)' - The same but predictors are transformed.

'y ~ A*B' - Model with main effects and interaction = 'y ~ A + B + A:B'.

'y ~ A*B*C' - The same: main effects and interactions = 'y ~ A+B+C+A:B+A:C+B:C+A:B:C'.

'y ~(A+B+C)^2' - A, B, and C crossed to level 2: 'y ~ A+B+C+A:B+A:C+B:C'.

'y ~ A*B*C-A:B:C' - same as above: main effects plus 2-way interactions.

Models without an intercept (e.g. 'y ~ A + B - 1') do not work for the moment - this will be fixed soon.

Please use your 'real' variable names instead of 'y' or 'A' & 'B' from these examples.
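The crossing operators in the examples above can be illustrated with a small Python sketch. It only mimics how 'A*B*C' and '(A+B+C)^2' expand into terms; it is not the R formula parser, and the helper name crossed_terms is hypothetical.

```python
from itertools import combinations


def crossed_terms(factors, max_order=None):
    """Expand crossed factors into main effects and interactions up to max_order."""
    if max_order is None:
        max_order = len(factors)  # full crossing, as in A*B*C
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


print("y ~ " + " + ".join(crossed_terms(["A", "B", "C"])))     # like y ~ A*B*C
print("y ~ " + " + ".join(crossed_terms(["A", "B", "C"], 2)))  # like y ~ (A+B+C)^2
```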

With binomial data the response (y) can be either a vector or a matrix with two columns.

If the response is a vector it can be numeric with 0 for failure and 1 for success, or a factor with the first level representing 'failure' and all others representing 'success'.

Alternatively, the response can be a matrix where the first column is the number of 'successes' and the second column is the number of 'failures'. In this case, specify the model as 'cbind(success, failure) ~ A + B', where 'success' & 'failure' are column names in the data.
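The two response layouts carry the same information when the data can be grouped by predictor values. The following Python sketch (an illustration only, with a hypothetical helper name and made-up toy values) collapses a 0/1 response into the success and failure counts that the two-column form expects.

```python
from collections import defaultdict


def to_success_failure(groups, outcomes):
    """Count successes (1) and failures (0) within each group."""
    counts = defaultdict(lambda: [0, 0])  # [successes, failures]
    for group, y in zip(groups, outcomes):
        counts[group][0 if y == 1 else 1] += 1
    return {group: tuple(c) for group, c in counts.items()}


# Toy data: predictor level and binary outcome for four cases.
print(to_success_failure(["a", "a", "b", "a"], [1, 0, 1, 1]))
```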

Zero-inflated and Hurdle models: If a formula of type 'y ~ x1 + x2' is supplied, it not only describes the count regression relationship of y and x1 & x2 but also implies that the same set of regressors is used for the zero component (zero-inflation or hurdle). This can be made more explicit by equivalently writing the formula as 'y ~ x1 + x2 | x1 + x2'. Of course, a different set of regressors can be specified for the zero component, e.g., 'y ~ x1 + x2 | z1 + z2 + z3', giving the count data model y ~ x1 + x2 conditional on (|) the zero-inflation or hurdle model y ~ z1 + z2 + z3.
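The rule for the zero component can be made concrete with a short Python sketch. It is a plain string split with a hypothetical helper name, shown only to illustrate the convention; it is not how the R packages parse formulas.

```python
def split_two_part_formula(formula):
    """Return (response, count regressors, zero-component regressors)."""
    response, rhs = (part.strip() for part in formula.split("~", 1))
    if "|" in rhs:
        count_rhs, zero_rhs = (part.strip() for part in rhs.split("|", 1))
    else:
        # No '|': the same regressors are used for the zero component.
        count_rhs = zero_rhs = rhs
    return response, count_rhs, zero_rhs


print(split_two_part_formula("y ~ x1 + x2"))
print(split_two_part_formula("y ~ x1 + x2 | z1 + z2 + z3"))
```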

Error distribution family - Model link function used

Gaussian - Identity

Inverse Gaussian - 1/mu^2

Gamma - log

Binomial - logit

Poisson - log

Negative Binomial - log

Quasibinomial - logit

Quasipoisson - log

Zero-Inflated Poisson - log; binomial model with logit link for zero-inflation model

Zero-Inflated Negative Binomial - log; binomial model with logit link for zero-inflation model

Hurdle - log; distribution for the zero hurdle model - binomial model with logit link
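For reference, the four link functions named in this list can be written out with their inverses. The Python sketch below is illustrative only; the dictionary name LINKS is an assumption, and each entry maps the mean mu to the linear predictor eta and back.

```python
import math

# name: (link mu -> eta, inverse link eta -> mu)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "1/mu^2": (lambda mu: 1.0 / mu ** 2, lambda eta: 1.0 / math.sqrt(eta)),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

link, inverse = LINKS["logit"]
eta = link(0.8)           # probability 0.8 on the logit scale
print(eta, inverse(eta))  # the inverse link maps back to the probability
```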

Potentially influential observations

A case is flagged as potentially influential if:

- any of its absolute dfbetas values is larger than 1, or

- its absolute dffits value is larger than 3*sqrt(k/(n-k)), or

- abs(1 - covratio) is larger than 3*k/(n-k), or

- its Cook's distance is larger than the 50th percentile of an F-distribution with k and n-k degrees of freedom, or

- its hat value is larger than 3*k/n.
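The cutoffs in this list depend only on n and k. The Python sketch below computes them, reading n as the number of observations and k as the number of model coefficients (the usual interpretation, which the list itself does not spell out). The function name is hypothetical. The Cook's distance cutoff is left out because it needs an F-distribution quantile function, which the Python standard library does not provide.

```python
import math


def influence_cutoffs(n, k):
    """Cutoffs for dfbetas, dffits, covratio and hat values."""
    return {
        "dfbetas": 1.0,                        # |dfbetas| > 1
        "dffits": 3 * math.sqrt(k / (n - k)),  # |dffits| > 3*sqrt(k/(n-k))
        "covratio": 3 * k / (n - k),           # |1 - covratio| > 3*k/(n-k)
        "hat": 3 * k / n,                      # hat value > 3*k/n
    }


# Example: 50 observations and 3 coefficients.
for name, cutoff in influence_cutoffs(50, 3).items():
    print(name, round(cutoff, 4))
```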

Notes

For GLMs (other than the Gaussian family with identity link) regression diagnostic measures are based on one-step approximations which may be inadequate if a case has high influence.

Vuong Non-Nested Hypothesis Test: the null hypothesis is that the two models are indistinguishable. A large, positive test statistic provides evidence of the superiority of model 1 over model 2, while a large, negative test statistic is evidence of the superiority of model 2 over model 1.
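The sign convention of the Vuong statistic follows from how it is built. The Python sketch below computes the raw (uncorrected) statistic from the per-observation log-likelihoods of two models. The function name and the toy log-likelihood values are made up for illustration, and the sample standard deviation (n-1 denominator) is an assumption that may differ from a particular implementation.

```python
import math
import statistics


def vuong_statistic(loglik1, loglik2):
    """Raw Vuong z-statistic from pointwise log-likelihoods of two models."""
    m = [a - b for a, b in zip(loglik1, loglik2)]  # log-likelihood ratios
    n = len(m)
    return math.sqrt(n) * statistics.mean(m) / statistics.stdev(m)


# Toy per-observation log-likelihoods (illustrative values only).
ll_model1 = [-1.0, -2.0, -1.5, -1.0]
ll_model2 = [-1.5, -2.2, -2.5, -1.3]
print(vuong_statistic(ll_model1, ll_model2))  # positive: favours model 1
print(vuong_statistic(ll_model2, ll_model1))  # negative: same magnitude
```

Under the null hypothesis the statistic is asymptotically standard normal, so it is compared with normal quantiles.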

Future plans

Add export of influence measures.

Add batch comparison of likelihood from different models.

Fix handling of the models without intercept.

Add back-transformation of regression coefficients to the original scale (now they're on the scale of the link function).

Add an option to choose other link functions (for example, the complementary log-log link function in a binomial GLM for presence-absence data).

Add link-function selection based on lowest residual deviance (-2*logLikelihood).

Add stopping rules to the exclusion algorithm (e.g. based on Mallows' Cp).

Add computation of Wald tests using sandwich standard errors for the ordinary Poisson model (because Wald test results might be too optimistic due to misspecification of the likelihood in the case of over-dispersion).

Computational details

Used R-packages: MASS, pscl, car, XLConnect.

Author - Vladimir Mikryukov (vmikryukov at gmail.com), 04.07.2014

Which model to choose?

Continuous distributions

Standard normal or linear regression - Gaussian

Positive only continuous - Gamma or Inverse Gaussian

Count data

Equidispersed count - Poisson

Overdispersed counts - Quasipoisson

Overdispersed proportions - Quasibinomial

Binary (1/0) response - Logistic

Binomial distribution with m=1 - Binary-Bernoulli

Proportional (y/m, where y = number of 1's) - Binomial

Poisson Regression is used to model count variables. The variance in the Poisson model is identical to the mean.

Negative binomial regression is for modeling count variables, usually over-dispersed count outcome variables. It can be considered a generalization of Poisson regression, since it has the same mean structure as Poisson regression plus an extra parameter to model the over-dispersion. For negative binomial regression the variance is assumed to be a quadratic function of the mean.

For quasi-Poisson regression the variance is assumed to be a linear function of the mean.
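The three mean-variance relationships can be compared side by side. The Python sketch below is illustrative only. The function names are hypothetical, phi denotes the quasi-Poisson dispersion parameter, and theta the negative binomial size parameter in the common parameterization where the variance is mu + mu^2/theta.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def quasipoisson_variance(mu, phi):
    """Quasi-Poisson: the variance is linear in the mean."""
    return phi * mu


def negbin_variance(mu, theta):
    """Negative binomial: the variance is quadratic in the mean."""
    return mu + mu ** 2 / theta


for mu in (1.0, 5.0, 20.0):
    print(mu, poisson_variance(mu), quasipoisson_variance(mu, 2.0),
          negbin_variance(mu, 5.0))
```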

Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to exist in the data, 'true zeros' and 'excess zeros'. Zero-inflated models estimate two equations simultaneously, one for the count model and one for the excess zeros.

The zero-inflated Poisson model has two parts: a Poisson count model and a logit model for predicting excess zeros.

Zero-inflated negative binomial regression does better with overdispersed data, i.e. when the variance is much larger than the mean.
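The two kinds of zeros show up directly in the zero-inflated Poisson probability function. The Python sketch below is a minimal illustration with hypothetical names. Here pi_zero is the probability of an excess zero (the quantity the logit part models) and lam is the Poisson mean, and both are fixed numbers rather than functions of predictors.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: a mixture of excess zeros and Poisson counts."""
    from_count_model = (1 - pi_zero) * poisson_pmf(k, lam)
    if k == 0:
        return pi_zero + from_count_model  # excess zeros plus Poisson zeros
    return from_count_model


print(poisson_pmf(0, 2.0), zip_pmf(0, 2.0, 0.3))  # zero is more likely under ZIP
```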

Zero-truncated models are used to model count data for which the value zero cannot occur. For such data this class of models is preferable to ordinary Poisson and negative binomial regression, which assign nonzero probability to zero counts even though zeros cannot be observed.

Zero-truncated Poisson regression - Useful when there is no overdispersion.

Zero-truncated negative binomial regression - Useful when overdispersion exists.

Hurdle (zero-augmented or zero-altered) models: In addition to over-dispersion, many empirical count data sets exhibit more zero observations than would be allowed for by the Poisson model. One model class capable of capturing both properties is the hurdle model. Hurdle models are two-component models: a truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed.
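The two components of a hurdle model can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the count component is a zero-truncated Poisson, the hurdle component is a single fixed probability of zero (p_zero) instead of a fitted binomial model, and all function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zero_truncated_poisson_pmf(k, lam):
    """Poisson probabilities rescaled so that zero cannot occur."""
    if k < 1:
        return 0.0
    return poisson_pmf(k, lam) / (1 - math.exp(-lam))


def hurdle_pmf(k, lam, p_zero):
    """Hurdle model: zero vs. positive, then a truncated count for positives."""
    if k == 0:
        return p_zero
    return (1 - p_zero) * zero_truncated_poisson_pmf(k, lam)


print(hurdle_pmf(0, 2.0, 0.4), hurdle_pmf(1, 2.0, 0.4))
```

Unlike the zero-inflated model, all zeros here come from the hurdle component, since the truncated count component cannot produce a zero.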