
NTCC REPORT
ON
LOGISTIC REGRESSION
SUBMITTED BY
CHANDNI MISHRA
TOWARDS PARTIAL COMPLETION OF M.SC STATISTICS

UNDER SUPERVISION OF
DR. C.M. PANDEY
PROFESSOR AND HEAD
DEPARTMENT OF BIOSTATISTICS AND HEALTH INFORMATICS
SANJAY GANDHI POSTGRADUATE INSTITUTE OF MEDICAL SCIENCES, LUCKNOW
Acknowledgment
I WOULD LIKE TO EXPRESS MY SINCERE GRATITUDE TO MY NTCC GUIDE DR. NEERAJ SINGH AND TRAINING GUIDE, DR. C.M. PANDEY FOR PROVIDING THEIR INVALUABLE GUIDANCE TOWARDS COMPLETION OF THIS REPORT.


ABSTRACT
The purpose of this report is to provide researchers and readers with the basic concepts of logistic regression, illustrated with examples, and to show how results are calculated and interpreted using software such as SPSS.

The report includes tables, graphs, and calculations produced with SPSS, together with their interpretation.

BRIEF INTRODUCTION TO SPSS
SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to undertake a range of statistical procedures.

When SPSS, Inc., an IBM company, was conceived in 1968, its name stood for Statistical Package for the Social Sciences. Since the company's purchase by IBM in 2009, IBM has decided to simply use the name SPSS to describe its core product of predictive analytics. IBM describes predictive analytics as tools that help connect data to effective action by drawing reliable conclusions about current conditions and future events.

SPSS is an integrated system of computer programs designed for the analysis of social sciences data. It is one of the most popular of the many statistical packages currently available for statistical analysis. Its popularity stems from the fact that the program:
Allows for a great deal of flexibility in the format of data.

Provides the user with a comprehensive set of procedures for data transformation and file manipulation.

Offers the researcher a large number of statistical analyses commonly used in social sciences.

CONTENT
1. Introduction to the Logistic Regression Model
Introduction
Fitting the Logistic Regression Model
Testing for the Significance of the Coefficients
Confidence Interval Estimation
2. The Multiple Logistic Regression Model
Introduction
The Multiple Logistic Regression Model
Fitting the Multiple Logistic Regression Model
Testing for the Significance of the Model
Confidence Interval Estimation
3. Interpretation of the Fitted Logistic Regression Model
Introduction
Dichotomous Independent Variable
Polychotomous Independent Variable
Continuous Independent Variable
Multivariable Models
Presentation and Interpretation of the Fitted Values
4. Model-Building Strategies and Methods for Logistic Regression
Introduction
Purposeful Selection of Covariates
CASE STUDY.

APPENDIX

REVIEW –
Types of Data and Measurement Scales: Nominal, Ordinal, Interval, and Ratio.

These are simply ways to categorize different types of variables.  
Nominal- Nominal scales are used for labeling variables, without any quantitative value. A good way to remember this is that "nominal" sounds a lot like "name", and nominal scales are essentially "names" or labels. Examples of nominal variables include region, zip code, gender, or religious affiliation. The nominal scale can also be coded by the researcher to simplify the analysis process, for example M = Male, F = Female.

Ordinal- This level of measurement involves ordering or ranking the variable being measured. The order of the values is what is important and significant, but the differences between them are not really known. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We cannot say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, and discomfort.

Interval- The interval level of measurement not only classifies and orders the measurements, but also specifies that the distances between the intervals on the scale are equivalent from the low end of the scale to the high end. For example, on a standardized intelligence measure, a 10-point difference in IQ scores has the same meaning anywhere along the scale: the difference between scores of 80 and 90 is the same as the difference between 110 and 120. However, it would not be correct to say that a person with an IQ of 100 is twice as intelligent as a person with an IQ of 50, because intelligence scales (and other similar interval scales) do not have a true zero representing a complete absence of intelligence.

Ratio -In this level of measurement, the observations, in addition to having equal intervals, can have a value of zero as well.  The zero in the scale makes this type of measurement unlike the other types of measurement, although the properties are similar to that of the interval level of measurement.  In the ratio level of measurement, the divisions between the points on the scale have an equivalent distance between them.

The four data types

Attribute          Nominal                              Ordinal                          Interval                  Ratio
Other name         Categorical                          Sequence                         Equal interval            Ratio
Structure          Set                                  Fully ordered, rank ordered      Unit size fixed           Zero or reference point fixed
Statistics         Count, mode, chi-squared             + median, rank order correlation + ANOVA, mean, SD         + logs
Example            Set of participants, makes of car    Order of finishing a race        Centigrade scale          Degrees Kelvin (absolute)
Relations defined  A = B, A ≠ B                         A > B                            |A − B| > |C − D|         A/B
What is absolute   Identity of individual entities      Order, sequence                  Intervals, differences    Ratios, proportions
Note: an odds ratio can have a large magnitude even when the underlying probabilities are both small.

Probability
P = (number of outcomes of interest) / (number of all possible outcomes)

Odds = p(occurring) / p(not occurring) = p / (1 − p)

That is, the odds of an event are the number of events divided by the number of non-events.

Odds ratio- An odds ratio is a ratio of two odds:

Odds ratio = odds1 / odds0 = [p1/(1 − p1)] / [p0/(1 − p0)]
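To make these definitions concrete, here is a minimal Python sketch (illustrative only; the analyses in this report are done in SPSS, and the probabilities below are hypothetical):

def odds(p):
    # Odds = p / (1 - p); assumes 0 < p < 1.
    return p / (1.0 - p)

p1, p0 = 0.40, 0.25                # hypothetical probabilities for two groups
print(odds(p1))                    # 0.667
print(odds(p0))                    # 0.333
print(odds(p1) / odds(p0))         # odds ratio = 2.0

Note how the odds ratio (2.0) is larger than the simple ratio of the probabilities (1.6), a gap that widens as the probabilities grow.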

Chapter 1.

Introduction to the Logistic Regression Model
INTRODUCTION
Logistic regression is the appropriate regression analysis to conduct when the dependent variable (y) is dichotomous (binary), such as "yes"/"no" or "1"/"0". Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of these. Generally, the dependent variable is dichotomous, such as male/female, smoker/non-smoker, or success/failure. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. The logistic regression model is the most frequently used regression model for the analysis of such data. The independent variables are often called covariates.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is categorical. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.

There are three primary uses of logistic regression:
1. Prediction of group membership and outcome.

The goal is to correctly predict the category of the outcome of individual cases. Thus, the research question asked is whether an outcome can be predicted from a selected set of independent variables. For instance, in epidemiological studies, can the development of lung cancer be predicted from the incidence and duration of smoking as well as from demographic variables such as gender, age, and social and economic status (SES)?
2. Logistic regression provides knowledge of the relationships and strengths among the variables.

The goal is to identify which independent variables predict the outcome, that is, increase or decrease the probability of the outcome or have no effect. For example, does inclusion of information about the incidence and duration of smoking improve prediction of lung cancer, and is a particular variable associated with an increase or decrease in the probability that a case has lung cancer? These parameter estimates (the coefficients of the predictors included in a model) can also be used to calculate and interpret the odds ratio. For instance, what are the odds that a person has lung cancer at age 65, given that he has smoked 10 packs a day for the past 30 years?
3. Classification of cases.

The goal is to understand how reliable the logistic regression model is in classifying cases for which the outcome is known. For instance, how many people with or without lung cancer are diagnosed correctly? The researcher establishes a cut point, say 0.5, and then asks, for instance: how many people with lung cancer are correctly classified if everyone with a predicted probability above the cut point is diagnosed as having lung cancer?
Why will other regression procedures not work?
Simple linear regression is one quantitative variable predicting another.

Multiple regression is a simple linear regression with more independent variables.

Nonlinear regression still involves two quantitative variables, but the relationship between them is curvilinear.

Running an ordinary linear regression on such data has major problems, because binary data do not follow a normal distribution, a condition needed for most other types of regression, and because the predicted values are not constrained to lie between zero and one.
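A quick numerical illustration of this point: the least-squares line fitted to a binary outcome can produce predicted "probabilities" outside the interval [0, 1]. The Python sketch below uses hypothetical data (numpy assumed), not the CHDAGE data:

import numpy as np

# Hypothetical binary outcome that rises with age.
age = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y   = np.array([ 0,  0,  0,  0,  1,  0,  1,  1,  1,  1], dtype=float)

slope, intercept = np.polyfit(age, y, 1)   # ordinary least squares line
print(intercept + slope * 20)              # about -0.13: an impossible probability
print(intercept + slope * 65)              # about  1.13: an impossible probability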

Example 1: Table 1.1 lists the age in years (AGE) and the presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of "0" to indicate that CHD is absent, or "1" to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this dataset as the CHDAGE data.
[Figure 1.1: scatterplot of CHD (0/1) versus AGE in years for the data in Table 1.1]

A scatterplot of the data in Table 1.1 is given in Figure 1.1.

In this scatterplot, all points fall on one of two parallel lines representing the
absence of CHD (y = 0) or the presence of CHD (y = 1). There is some tendency
for the individuals with no evidence of CHD to be younger than those with evidence
of CHD. While this plot does depict the dichotomous nature of the outcome variable
quite clearly, it does not provide a clear picture of the nature of the relationship
between CHD and AGE.

The main problem with Figure 1.1 is that the variability in CHD at all ages is
large. This makes it difficult to see any functional relationship between AGE and
CHD. One common method of removing some variation, while still maintaining
the structure of the relationship between the outcome and the independent variable,
is to create intervals for the independent variable and compute the mean of the
outcome variable within each group. We use this strategy by grouping age into the
categories (AGEGRP) defined in Table 1.1. Table 1.2 contains, for each age group,
the frequency of occurrence of each outcome, as well as the percent with CHD present.

Table 1.2 Frequency Table of Age Group by CHD
Age group  n  Absent  Present  Mean (proportion with CHD)
20–29 10 9 1 0.1
30–34 15 13 2 0.133
35–39 12 9 3 0.25
40–44 15 10 5 0.333
45–49 13 7 6 0.462
50–54 8 3 5 0.625
55–59 17 4 13 0.765
60–69 10 2 8 0.8
Total 100 57 43 0.43

[Figure 1.2: percent of subjects with CHD plotted against the midpoint of each age interval]
By examining Table 1.2, a clearer picture of the relationship begins to emerge. It
shows that as age increases, the proportion (mean) of individuals with evidence of
CHD increases. Figure 1.2 presents a plot of the percent of individuals with CHD
versus the midpoint of each age interval. This plot provides considerable insight
into the relationship between CHD and AGE in this study, but the functional form
for this relationship needs to be described. The plot in this figure is similar to what
one might obtain if this same process of grouping and averaging were performed
in a linear regression. We note two important differences.

Some important facts:-
The dependent variable in logistic regression follows the Bernoulli distribution having an unknown probability, p.

The Bernoulli distribution is just a special case of the binomial distribution where n = 1 (just one trial).
Success is “1” and failure is “0”.

The probability of success is “p” and failure is “q=1-p”.

In logistic regression, we are estimating an unknown p, for any given linear combination of the independent variables.

Therefore, we need a function that links the linear combination of the independent variables to the Bernoulli parameter p; that link is called the logit.

The first difference concerns the nature of the relationship between the outcome
and independent variables.

In any regression problem, the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as E(Y|x), where Y denotes the outcome variable and x denotes a specific value of the independent variable. The quantity E(Y|x) is read "the expected value of Y, given the value x".

In linear regression, we assume that this mean may be expressed as a linear equation in x. This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and +∞. The column labeled "Mean" in Table 1.2 provides an estimate of E(Y|x). We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the functional relationship between CHD and AGE. With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., 0 ≤ E(Y|x) ≤ 1). This can be seen in Figure 1.2. In addition, the plot shows that this mean approaches zero and one "gradually". The change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles a plot of the cumulative distribution of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we use is based on the logistic distribution.

In order to simplify notation, we use the quantity π(x) = E(Y|x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is:

π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))    (1.1)

A transformation of π(x) that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as:

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x    (1.1*)

The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model. The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.
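A small Python sketch of equations (1.1) and (1.1*), showing that π(x) stays inside (0, 1) while its logit is unbounded; the coefficients β0 = −5.31 and β1 = 0.111 are taken (rounded) from the fitted CHDAGE model presented later in Table 1.3:

import math

def pi(x, b0=-5.31, b1=0.111):
    # Logistic model, equation (1.1).
    e = math.exp(b0 + b1 * x)
    return e / (1.0 + e)

def logit(p):
    # Logit transformation, equation (1.1*).
    return math.log(p / (1.0 - p))

for age in (20, 50, 80):
    print(age, round(pi(age), 3), round(logit(pi(age)), 3))
# The probabilities stay between 0 and 1; the logits are simply b0 + b1*age.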

The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y|x) and constant variance. This is not the case with a dichotomous outcome variable.

In this situation, we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).

In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:
1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, π(x), given in equation (1.1), satisfies this constraint.

2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.

3. The principles that guide an analysis using linear regression also guide us in logistic regression.

FITTING THE LOGISTIC REGRESSION MODEL
For a dichotomous outcome variable, the value of the outcome given x may be written y = π(x) + ε, where ε assumes one of two possible values:

when y = 1, ε = 1 − π(x), with probability π(x);
when y = 0, ε = −π(x), with probability 1 − π(x),

so that ε has mean zero and variance π(x)[1 − π(x)].
Fitting the logistic regression model in equation (1.1) to a set of data requires that we estimate the values of ?0 and ?1, the unknown parameters. The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method provides the foundation for our approach to estimation with the logistic regression model.

The contribution to the likelihood function of the pair (xi, yi) is:

f(xi, yi) = π(xi)^yi [1 − π(xi)]^(1−yi)    (1.2)

The likelihood function is given by:

l(β) = Π_{i=1}^{n} π(xi)^yi [1 − π(xi)]^(1−yi)    (1.3)

Taking logs of both sides, we obtain the log-likelihood:

L(β) = ln[l(β)] = Σ_{i=1}^{n} { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }    (1.4)
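Equation (1.4) translates directly into code; a minimal Python sketch (numpy assumed, with x and y standing for the covariate and the 0/1 outcome vectors):

import numpy as np

def log_likelihood(beta0, beta1, x, y):
    # Log-likelihood, equation (1.4): sum of y*ln(pi) + (1 - y)*ln(1 - pi).
    eta = beta0 + beta1 * x              # the logit g(x) for each subject
    pi = 1.0 / (1.0 + np.exp(-eta))      # pi(x) from equation (1.1)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

Maximizing this function over (β0, β1), for example with a general-purpose numerical optimizer, yields the maximum likelihood estimates described next.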
To find the value of β that maximizes L(β) we differentiate L(β) with respect to β0 and β1 and set the resulting expressions equal to zero. These equations, known as the likelihood equations, are:

Σ_{i=1}^{n} [yi − π(xi)] = 0    (1.5)

Σ_{i=1}^{n} xi[yi − π(xi)] = 0    (1.6)
The value of β given by the solution to equations (1.5) and (1.6) is called the maximum likelihood estimate and is denoted β̂. In general, the "hat" symbol denotes the maximum likelihood estimate of the respective quantity. For example, π̂(xi) is the maximum likelihood estimate of π(xi). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation (1.5) is that

Σ_{i=1}^{n} yi = Σ_{i=1}^{n} π̂(xi).

As an example, consider the data given in Table 1.1. Fitting the model in the software package SPSS, with the continuous variable AGE as the independent variable, produces the output in Table 1.3.
Table 1.3 Variables in the Equation
          B       S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                        Lower    Upper
AGE       .111    .024    21.254   1    .000   1.117    1.066    1.171
Constant  -5.309  1.134   21.935   1    .000   .005
Log-likelihood = −53.676546

The maximum likelihood estimates of β0 and β1 are β̂0 = −5.309 and β̂1 = 0.111. The fitted values are given by the equation:

π̂(x) = e^(−5.309 + 0.111·AGE) / (1 + e^(−5.309 + 0.111·AGE))    (1.7)

and the estimated logit is:

ĝ(x) = −5.309 + 0.111·AGE    (1.8)

The log-likelihood given in Table 1.3 is the value of equation (1.4) computed using β̂0 and β̂1.
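The fitted values in equation (1.7) are easy to verify by hand; an illustrative Python sketch:

import math

def fitted_chd_prob(age):
    # Equation (1.7), using the ML estimates from Table 1.3.
    g = -5.309 + 0.111 * age             # estimated logit, equation (1.8)
    return math.exp(g) / (1.0 + math.exp(g))

print(round(fitted_chd_prob(50), 3))     # about 0.56 for a 50-year-old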

Testing for the Significance of the Coefficients
In logistic regression, comparison of observed to predicted values is based on the log-likelihood function defined in equation (1.4).

The comparison of observed to predicted values using the likelihood function is
based on the following expression:
D = −2 ln[(likelihood of the fitted model) / (likelihood of the saturated model)]    (1.9)

The quantity inside the brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can, therefore, be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation (1.4), equation (1.9) becomes:

D = −2 Σ_{i=1}^{n} { yi ln(π̂i/yi) + (1 − yi) ln[(1 − π̂i)/(1 − yi)] }    (1.10)
The statistic, D, in equation (1.10) is called the deviance, and for logistic
regression, it plays the same role that the residual sum-of-squares plays in linear
regression. In fact, the deviance as shown in equation (1.10), when computed
for linear regression, is identically equal to the SSE.

In particular, to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is:
G = D(model without the variable) ? D(model with the variable).

This statistic, G, plays the same role in logistic regression that the numerator of the partial F-test does in linear regression. Because the likelihood of the saturated model is always common to both values of D being differenced, G can be expressed as:

G = −2 ln[(likelihood without the variable) / (likelihood with the variable)]    (1.12)

For the specific case of a single independent variable, it is easy to show that when the variable is not in the model, the maximum likelihood estimate of β0 is ln(n1/n0), where n1 = Σ yi and n0 = Σ (1 − yi), and the predicted probability for all subjects is constant and equal to n1/n. In this setting, the value of G is:

G = −2 ln{ [(n1/n)^n1 (n0/n)^n0] / Π_{i=1}^{n} π̂i^yi (1 − π̂i)^(1−yi) }    (1.13)
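Equation (1.13) can be checked with a few lines of Python. The sketch below computes G for the CHDAGE model, using n1 = 43 and n0 = 57 from Table 1.2 and the fitted log-likelihood −53.6765 from Table 1.3:

import math

n1, n0 = 43, 57                          # CHD present / absent (Table 1.2)
n = n1 + n0
ll_constant = n1 * math.log(n1 / n) + n0 * math.log(n0 / n)   # constant-only model
ll_fitted = -53.6765                     # log-likelihood from Table 1.3
G = -2 * (ll_constant - ll_fitted)
print(round(G, 2))                       # about 29.31, chi-square with 1 df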
Confidence Interval Estimation.

The Wald test statistic is the ratio of the maximum likelihood estimate of the slope parameter, β̂1, to an estimate of its standard error. Under the null hypothesis and the usual sample size assumptions, this ratio follows a standard normal distribution:

W = β̂1 / SE(β̂1)

The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests for the significance of the model. In particular, the confidence interval estimators for the slope and intercept are, most often, based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals. The endpoints of a 100(1 − α)% confidence interval for the slope and intercept coefficients are:

β̂1 ± z_(1−α/2) SE(β̂1)    and    β̂0 ± z_(1−α/2) SE(β̂0)

where z_(1−α/2) is the upper 100(1 − α/2)% point from the standard normal distribution and SE(·) denotes a model-based estimator of the standard error of the respective parameter estimator. Since we are using software such as SPSS, we do not need to calculate these by hand; as Table 1.3 shows, SPSS reports the confidence interval for exp(β1), and taking natural logs of its endpoints gives a confidence interval for β1 itself.
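As a check, a short Python sketch reproduces the AGE row of Table 1.3 from β̂1 = 0.111 and SE = 0.024:

import math

b1, se, z = 0.111, 0.024, 1.96           # z for a 95% interval
lo, hi = b1 - z * se, b1 + z * se        # Wald-based CI for beta1
print(round(lo, 3), round(hi, 3))        # about (0.064, 0.158)
print(round(math.exp(lo), 3), round(math.exp(hi), 3))   # about (1.066, 1.171), as in Table 1.3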

Chapter 2.

The Multiple Logistic Regression Model
Introduction
In Chapter 1 we introduced the logistic regression model in the context of a model
containing a single variable. In this chapter, we generalize the model to one
with more than one independent variable (i.e., the multivariable or multiple logistic
regression model). Central to the consideration of the multiple logistic regression model is estimating the coefficients and testing for their significance.

The logit of the multiple logistic regression model is given by generalizing equations (1.1) and (1.1*):

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x1 + β2x2 + · · · + βpxp    (2.1)

where, for the multiple logistic regression model,

π(x) = e^g(x) / (1 + e^g(x))    (2.2)
FITTING THE MULTIPLE LOGISTIC REGRESSION MODEL
The method of estimation used in the multivariable case is the same as in the univariable situation: maximum likelihood. The likelihood function is nearly identical to that given in equation (1.3), with the only change being that π(x) is now defined as in equation (2.2). There are p + 1 likelihood equations, obtained by differentiating the log-likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

Σ_{i=1}^{n} [yi − π(xi)] = 0

and

Σ_{i=1}^{n} xij [yi − π(xi)] = 0    for j = 1, 2, . . . , p
As in the univariable model, the solution of the likelihood equations requires software that is available in virtually every statistical software package. Let β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are π̂(xi), the value of the expression in equation (2.2) computed using β̂ and xi.

As an example we use the Global Longitudinal Study of Osteoporosis in Women (GLOW) dataset, considering five variables thought to be of importance: age at enrollment (AGE), weight at enrollment (WEIGHT), history of a previous fracture (PRIORFRAC), whether the woman experienced menopause before age 45 (PREMENO), and self-reported risk of fracture relative to women of the same age (RATERISK), coded at three levels: less, same, or more risk.
TABLE 2.2 Fitted Multiple Logistic Regression Model of Fracture in the First Year of Follow Up (FRACTURE) on Age, Weight, Prior Fracture (PRIORFRAC), Early Menopause (PREMENO), and Self-Reported Risk of Fracture (RATE RISK) from the GLOW Study, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .050 .013 13.966 1 .000 1.051 1.024 1.079
WEIGHT .004 .007 .347 1 .556 1.004 .991 1.018
PRIORFRAC .679 .242 7.858 1 .005 1.973 1.227 3.173
PREMENO .187 .277 .456 1 .499 1.206 .701 2.074
RATERISK                          9.181    2    .010
RATERISK(1)  .534    .276   3.754   1     .053   1.707   .994    2.930
RATERISK(2)  .874    .289   9.139   1     .003   2.397   1.360   4.224
Constant     -5.606  1.221  21.090  1     .000   .004
a. Variable(s) entered on step 1: AGE, WEIGHT, PRIORFRAC, PREMENO, RATERISK.

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 518.075a .085 .125
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Log-Likelihood = -259.03768
In the example given above, the variable RATERISK is modeled using two design variables, since it is coded at three levels. In SPSS, the Categorical option of the Logistic Regression dialog can be used to define such a variable as categorical and create the design variables.

In Table 2.2 the estimated coefficients for the two design variables for RATERISK are indicated by RATERISK(1) and RATERISK(2). The estimated logit is given by the following equation:
ĝ(x) = −5.606 + 0.050·AGE + 0.004·WEIGHT + 0.679·PRIORFRAC + 0.187·PREMENO + 0.534·RATERISK(1) + 0.874·RATERISK(2)
And the associated estimated logistic probabilities are found by using equation (2.2)
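As an illustration, the estimated logit above can be evaluated and converted to a probability via equation (2.2); the covariate values in this Python sketch are hypothetical:

import math

def fracture_prob(age, weight, priorfrac, premeno, raterisk1, raterisk2):
    # Estimated logit from Table 2.2, then equation (2.2).
    g = (-5.606 + 0.050 * age + 0.004 * weight + 0.679 * priorfrac
         + 0.187 * premeno + 0.534 * raterisk1 + 0.874 * raterisk2)
    return math.exp(g) / (1.0 + math.exp(g))

# Hypothetical 65-year-old weighing 70 kg, with a prior fracture,
# self-reported risk "same" (RATERISK(1) = 1, RATERISK(2) = 0):
print(round(fracture_prob(65, 70, 1, 0, 1, 0), 2))   # about 0.30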
Testing For The Significance of The Model
Once we have fit a particular multiple (multivariable) logistic regression model, we begin the process of model assessment. The likelihood ratio test for the overall significance of the p coefficients for the independent variables in the model is performed in exactly the same manner as in the univariable case. The test is based on the statistic G given in equation (1.12). Consider the fitted model whose estimated coefficients are given in Table 2.2. For that model, the value of the log-likelihood, shown at the bottom of the table, is L = −259.0377. The log-likelihood for the constant-only model may be obtained by evaluating the numerator of equation (1.13) or by fitting the constant-only model.

Either method yields the log-likelihood L = −281.1676. Thus the value of the likelihood ratio test statistic is, from equation (1.12), G = −2[−281.1676 − (−259.0377)] = 44.2598, and the p-value for the test is P[χ²(6) > 44.2598] < 0.0001, which is significant well beyond the α = 0.05 level. We reject the null hypothesis and conclude that at least one of the p coefficients is different from zero, an interpretation analogous to that of the F-test in multiple linear regression.

The p-values computed under this hypothesis are shown in the fifth column of Table 2.2. If we use a level of significance of 0.05, then we would conclude that the variables AGE, history of prior fracture (PRIORFRAC) and self-reported rate of risk (RATE RISK) are statistically significant, while WEIGHT and early menopause (PREMENO) are not significant.

As our goal is to obtain the best fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant and compare that reduced model to the full
model containing all of the variables. The results of fitting the reduced model are
given in Table 2.3
The difference between the two models is the exclusion of the variables
WEIGHT and early menopause (PREMENO) from the full model. The likelihood ratio test comparing these two models is obtained using the definition of G
given in equation (1.12). It has a distribution that is chi-square with 2 degrees of
freedom under the hypothesis that the coefficients for both excluded variables are
equal to zero. The value of the test statistic comparing the model in Table 2.3 to
the one in Table 2.2 is
G = −2[−259.4494 − (−259.0377)] = 0.8234
Table 2.3 Fitted Multiple Logistic Regression Model of Fracture in the First Year of
Follow Up (FRACTURE) on AGE, Prior Fracture (PRIORFRAC), and Self-Reported
Risk of Fracture (RATE RISK) from the GLOW Study, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .046 .012 13.618 1 .000 1.047 1.022 1.073
PRIORFRAC .700 .241 8.431 1 .004 2.014 1.256 3.231
RATERISK                         9.223    2    .010
RATERISK(1)  .549    .275   3.979   1     .046   1.731   1.010   2.967
RATERISK(2)  .866    .286   9.150   1     .002   2.377   1.356   4.165
Constant     -4.991  .903   30.565  1     .000   .007
a. Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 518.899a .083 .123
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Log-Likelihood = -259.4494
which, with 2 degrees of freedom, has a p-value of P[χ²(2) > 0.8234] = 0.663.
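These likelihood ratio tests reduce to a chi-square tail probability; a Python sketch using scipy (assumed available) reproduces the comparison of Tables 2.2 and 2.3:

from scipy.stats import chi2

ll_full, ll_reduced = -259.0377, -259.4494   # log-likelihoods from Tables 2.2 and 2.3
G = -2 * (ll_reduced - ll_full)              # equation (1.12)
print(round(G, 4))                           # about 0.823
print(round(chi2.sf(G, df=2), 3))            # about 0.663, the reported p-value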

As the p-value is large, exceeding 0.05, we conclude that the full model is no better
than the reduced model. That is, there is little statistical justification for including
WEIGHT and PREMENO in the model. However, we must not base our models
entirely on tests of statistical significance.

CONFIDENCE INTERVAL ESTIMATION
The methods used to construct confidence interval estimators for a multivariable model are essentially the same as in the univariable case. In Table 2.3, 95 percent confidence intervals are given for the exponentiated coefficients, exp(β); taking natural logs of their endpoints yields confidence intervals for the coefficients themselves.

In Table 2.4 we take the natural logs of the confidence limits for exp(β), so the intervals are on the scale of the coefficients, B.

Variables in the Equation
             B       S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for B
                                                          Lower    Upper
AGE          .046    .012   13.618   1    .000   1.047    0.0218   0.0705
PRIORFRAC    .700    .241   8.431    1    .004   2.014    0.2279   1.1728
RATERISK                    9.223    2    .010
RATERISK(1)  .549    .275   3.979    1    .046   1.731    0.0100   1.0876
RATERISK(2)  .866    .286   9.150    1    .002   2.377    0.3045   1.4267
Constant     -4.991  .903   30.565   1    .000   .007
a. Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.

CHAPTER 3.

Interpretation of the Fitted Logistic Regression Model
INTRODUCTION
We begin this chapter assuming that a logistic regression model has been fit, that the variables in the model are significant in either a clinical or statistical sense, and that the model fits according to some statistical measure of fit.

The interpretation of any fitted model requires that we be able to draw practical inferences from its estimated coefficients. This involves two issues: determining the functional relationship between the dependent variable and the independent variable, and appropriately defining the unit of change for the independent variable.

DICHOTOMOUS INDEPENDENT VARIABLE
The case in which the independent variable is binary or dichotomous provides the conceptual foundation for all the other situations.

We assume that the independent variable, x, is coded as either 0 or 1. The difference in the logit for a subject with x = 1 and x = 0 is:

g(1) − g(0) = (β0 + β1·1) − (β0 + β1·0) = β1

The practical problem is that change on the scale of the log-odds is hard to explain and it may not be especially meaningful to a subject-matter audience. In order to provide a more meaningful interpretation, we need to introduce the odds ratio as a measure of association.
The odds of the outcome being present among individuals with x = 1 is π(1)/[1 − π(1)]. Similarly, the odds of the outcome being present among individuals with x = 0 is π(0)/[1 − π(0)]. The odds ratio, denoted OR, is the ratio of the odds for x = 1 to the odds for x = 0:

OR = {π(1)/[1 − π(1)]} / {π(0)/[1 − π(0)]}    (3.1)
Substituting the expressions for the logistic regression model probabilities in
Table 3.1 into equation (3.1) we obtain
OR =e?0+?11+e?0+?111+e?0+?1e?01+e?011+e?0= e?0+?1e?0= e?0+?1-?0= e?1Hence, for a logistic regression model with a dichotomous independent variable
coded 0 and 1, the relationship between the odds ratio and the regression coefficient is -OR= e?1 (3.2)
Table 3.1 Values of the Logistic Regression Model when the Independent Variable Is Dichotomous

Outcome (y)    x = 1                                   x = 0
y = 1          π(1) = e^(β0+β1) / (1 + e^(β0+β1))      π(0) = e^β0 / (1 + e^β0)
y = 0          1 − π(1) = 1 / (1 + e^(β0+β1))          1 − π(0) = 1 / (1 + e^β0)
Total          1.0                                     1.0
The odds ratio is widely used as a measure of association as it approximates how much more likely or unlikely (in terms of odds) it is for the outcome to be present among those subjects with x = 1 as compared to those subjects with x = 0.

To review, the outcome variable is having a fracture (FRACTURE) in the first year of follow-up.

Here we use whether the woman has had a fracture between the age of 45 and enrollment in the study (PRIORFRAC) as the dichotomous independent variable. The result of cross-classifying fracture during follow-up by prior fracture is presented in Table 3.2.

Table 3.2 Cross-Classification of Prior Fracture and Fracture During Follow-Up in the GLOW Study, n = 500
FRACTURE * PRIORFRAC Crosstabulation (Count)

                PRIORFRAC
                0      1      Total
FRACTURE  0     301    74     375
          1     73     52     125
Total           374    126    500
The frequencies in Table 3.2 tell us that there were 52 subjects with values
(x = 1, y = 1), 73 with (x = 0, y = 1), 74 with (x = 1, y = 0), and 301 with (x = 0, y = 0). The results of fitting a logistic regression model containing the dichotomous covariate PRIORFRAC are shown in Table 3.3.

Table 3.3 Results of Fitting the Logistic Regression Model of Fracture
(FRACTURE) on Prior Fracture (PRIORFRAC) Using the Data in Table 3.2
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486
Constant   -1.417   .130   117.908   1   .000   .243
a. Variable(s) entered on step 1: PRIORFRAC.

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 540.068a .044 .065
Log-Likelihood= -270.03397
The estimate of the odds ratio, using equation (3.2) and the estimated coefficient for PRIORFRAC in Table 3.3, is OR = e^1.064 = 2.9. Readers who have had some previous experience with the odds ratio may wonder why we used a logistic regression package to estimate the odds ratio when we could easily have computed it directly as the cross-product ratio from the frequencies in Table 3.2, namely:

OR = (52 × 301) / (74 × 73) = 2.897

Thus, we see that the slope coefficient from the fitted logistic regression model is
β̂1 = ln[(52 × 301)/(74 × 73)] = 1.0638
We obtain a 100(1 − α)% confidence interval estimator for the odds ratio by first calculating the endpoints of a confidence interval estimator for the log-odds ratio (i.e., β1), and then exponentiating the endpoints of this interval. In general, the endpoints are given by the expression:

exp[β̂1 ± z_(1−α/2) SE(β̂1)]

As an example, consider the estimation of the odds ratio for the dichotomous variable PRIORFRAC. Using the results in Table 3.3, the point estimate is OR = 2.9 and the 95% confidence interval is exp(1.064 ± 1.96 × 0.2231) = (1.87, 4.49).

This interval is typical of many confidence intervals for odds ratios when the point estimate exceeds 1, in that it is skewed to the right of the point estimate. It suggests that the odds of a fracture during follow-up among women with a prior fracture could be as little as 1.9 times or as much as 4.5 times the odds for women without a prior fracture, at the 95% level of confidence.
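Both the cross-product odds ratio and its Wald-based interval can be verified from the cell counts of Table 3.2; a Python sketch (using the standard result that the variance of the log cross-product ratio is the sum of the reciprocals of the four cell counts):

import math

a, b, c, d = 52, 74, 73, 301                 # (x=1,y=1), (x=1,y=0), (x=0,y=1), (x=0,y=0)
or_hat = (a * d) / (b * c)                   # cross-product ratio
b1 = math.log(or_hat)                        # estimated slope coefficient
se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # about 0.2231, as in Table 3.3
lo, hi = math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se)
print(round(or_hat, 3), round(b1, 4))        # 2.897, 1.0638
print(round(lo, 2), round(hi, 2))            # about (1.87, 4.49)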

POLYCHOTOMOUS INDEPENDENT VARIABLE
Suppose that instead of two categories the independent variable has k > 2 distinct values. In the GLOW study, the covariate self-reported risk is coded at three levels (less, same, and more). The cross-tabulation of it with fracture during follow-up (FRACTURE) is shown in Table 3.5. In addition, we show the estimated odds ratio, its 95% confidence interval and log-odds ratio for the same and more versus less risk.

The extension to a situation where the variable has more than three levels is not conceptually different, so all the examples in this section use k = 3. Using SPSS we obtain Tables 3.5 and 3.7.

Table 3.5 Cross-Classification of Fracture During Follow-Up (FRACTURE) by Self-Reported Rate of Risk (RATE RISK) from the GLOW Study, n = 500
FRACTURE * RATERISK Crosstabulation (Count)

                 RATERISK
                 Less(1)   Same(2)                        More(3)                       Total
FRACTURE   0     139       138                            98                            375
           1     28        48                             49                            125
Total            167       186                            147                           500
Odds Ratio       1.0       (48 × 139)/(28 × 138) = 1.73   (49 × 139)/(28 × 98) = 2.48
95% CI           –         (1.02, 2.91)                   (1.46, 4.22)
ln(OR)           0.0       0.55                           0.91

Table 3.6 Specification of the Design Variables for RATERISK Using Reference Cell Coding with Less as the Reference Group
RATE RISK(Code) RATERISK1 RATERISK2
Less(1) 0 0
Same(2) 1 0
More(3) 0 1
Table 3.7 Results of Fitting the Logistic Regression Model to the Data in Table 3.5 Using the Design Variables in Table 3.6
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a RATERISK                  11.247   2    .004
RATERISK(1)  .546    .266   4.203    1     .040   1.727   1.024   2.911
RATERISK(2)  .909    .271   11.242   1     .001   2.482   1.459   4.223
Constant     -1.602  .207   59.831   1     .000   .201

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 550.578a .023 .034
Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

LogLikelihood = -275.28917
Table 3.7 gives the estimated coefficients, their standard errors and p-values, and confidence intervals for the odds ratios.

CONTINUOUS INDEPENDENT VARIABLE
When a logistic regression model contains a continuous independent variable, interpretation
of the estimated coefficient depends on how it is entered into the model and the particular units of the variable. For purposes of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is
linear in the variable.

As an example, we show the results in Table 1.3 of a logistic regression of AGE on CHD status using the data in Table 1.1. The estimated logit is g(AGE) = ?5.310 + 0.111 × AGE. The estimated odds ratio for an increase of 10 years in
age is OR(10) = exp(10 × 0.111) = 3.03. Thus, for every increase of 10 years in age, the odds of CHD being present are estimated to increase 3.03-fold. The validity of this statement is questionable, however, because the increase in the odds of CHD
for a 40-year-old compared to a 30-year-old may be quite different from the odds for a 60-year-old compared to a 50-year-old. This is the unavoidable dilemma when a continuous covariate is modeled linearly in the logit and motivates the
importance of examining the linearity assumption for continuous covariates. The endpoints of a 95% confidence interval for this odds ratio are exp(10 × 0.111 ± 1.96 × 10 × 0.024) = (1.90, 4.86).
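In general, the estimated odds ratio for a change of c units in a continuous covariate is exp(c·β̂1), with confidence interval exp(c·β̂1 ± z·c·SE). A Python sketch using the AGE estimates from Table 1.3:

import math

b1, se, c, z = 0.111, 0.024, 10, 1.96        # 10-year change in AGE
print(round(math.exp(c * b1), 2))            # 3.03
print(round(math.exp(c * b1 - z * c * se), 2),
      round(math.exp(c * b1 + z * c * se), 2))   # about (1.90, 4.86)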

The interpretation of the estimated odds ratio for a continuous variable is similar to that of nominal scale variables. The main difference is that a meaningful change must be defined for the continuous variable.

MULTIVARIABLE MODELS
Fitting a series of univariable models, although useful for a preliminary analysis, rarely provides an adequate or complete analysis of the data in a study, because the independent variables are usually associated with one another and may have different distributions within levels of the outcome variable. Thus, one generally uses a multivariable analysis for a more comprehensive modeling of the data. One goal of such an analysis is to statistically adjust the estimated effect of each variable in the model for differences in the distributions of, and associations among, the other independent variables in the model. Applying this concept to a multivariable logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log-odds adjusting for the other variables in the model.

Another important aspect of multivariable modeling is to assess to what extent, if at all, the estimate of the log-odds of one independent variable changes depending on the value of another independent variable. When the odds ratio for one variable is not constant over the levels of another variable, the two variables are said to have a statistical interaction. In some applied disciplines statistical interaction is referred to as effect modification, terminology that describes the fact that the log-odds of one variable are modified or changed by values of the other variable.

We begin with an example where there is neither statistical adjustment nor statistical interaction. The data come from the GLOW study described in Dataset 2 of the Appendix. The outcome variable is having a fracture during the first year of follow-up (FRACTURE). For the dichotomous variable we use history of prior fracture (PRIORFRAC), and for the continuous covariate we use height in centimeters (HEIGHT). The results from the three fitted models are presented in Table 3.10. In discussing the results we use significance levels from the Wald statistics; in all cases, the same conclusions would be reached had we used likelihood ratio tests.

Table 3.10 Estimated Logistic Regression Coefficients, Standard Errors, Wald Statistics, p-Values and 95% CIs from Three Models Showing No Statistical Adjustment and No Statistical Interaction from the GLOW Study, n = 500
Variables in the Equation
Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
1. PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486
Constant   -1.417   .130   117.908   1   .000   .243
a. Variable(s) entered on step 1: PRIORFRAC.

Variables in the Equation
Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
2. PRIORFRAC 1.013 .225 20.199 1 .000 2.754 1.770 4.284
HEIGHT -.045 .017 6.811 1 .009 .956 .924 .989
Constant   5.895   2.796   4.445   1   .035   363.095
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT.

Variables in the Equation
Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
3. PRIORFRAC -3.055 5.790 .278 1 .598 .047 .000 3999.295
HEIGHT -.054 .022 6.216 1 .013 .947 .907 .988
HEIGHT by PRIORFRAC .025 .036 .494 1 .482 1.026 .956 1.101
Constant   7.361   3.510   4.398   1   .036   1573.846
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT, HEIGHT * PRIORFRAC.

The Wald statistic for the coefficient of PRIORFRAC in Model 1 is significant with p < 0.001. When we add HEIGHT to the model, the Wald statistics are significant at the 1% level for both covariates. Note that there is little change in the estimate of the coefficient for PRIORFRAC:

Δβ̂% = 100 × (1.064 − 1.012)/1.012 = 5.1

indicating that the inclusion of HEIGHT does not statistically adjust the coefficient of PRIORFRAC. Thus we conclude that, in these data, height is not a confounder of prior fracture. The statistical interaction of prior fracture (PRIORFRAC) and height (HEIGHT) is added to Model 2 to obtain Model 3. The Wald statistic for the added product term has p = 0.482 and thus is not significant. In these data, height is not an effect modifier of prior fracture. Hence, the choice is between Model 1 and Model 2. Even though the estimate of the effect of prior fracture is basically the same for the two models, we would choose Model 2, as height (HEIGHT) is not only statistically significant in Model 2 but is an important clinical covariate as well.
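The percent-change diagnostic used above is simple to script; a Python sketch comparing the PRIORFRAC coefficients in Models 1 and 2 of Table 3.10:

def pct_change(theta_crude, theta_adjusted):
    # Percent change in a coefficient after adding a covariate;
    # changes beyond roughly 20% suggest confounding.
    return 100.0 * (theta_crude - theta_adjusted) / theta_adjusted

print(round(pct_change(1.064, 1.013), 1))    # about 5%: no meaningful adjustment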

CHAPTER 4.
Model-Building Strategies and
Methods for Logistic Regression
INTRODUCTION
The goal of any method is to select those variables that result in a “best” model within the scientific context of the problem. In order to achieve this goal, we must have: (i) a basic plan for selecting the variables for the model and (ii) a set of methods for assessing the adequacy of the model both in terms of its individual variables and its overall performance. In this chapter, we discuss methods that address both of these areas.

PURPOSEFUL SELECTION OF COVARIATES
CASE STUDY 1 (The GLOW Study)
STATEMENT
For Purposeful selection, we use the GLOW500 data.

OBJECTIVE
Provides a good example of an analysis designed to identify risk factors for a specified binary outcome.

METHOD/TOOL
We use the software tool SPSS.

Steps:

Step 1: The first step in purposeful selection is to fit a univariable logistic regression model for each covariate. The results of this analysis are shown in Table 4.7. Note that in this table, each row presents the results for the estimated regression coefficient(s) from a model containing only that covariate.

Table 4.7 Results of Fitting Univariable Logistic Regression Models in the GLOW
Data, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .053 .012 20.684 1 .000 1.054 1.031 1.079
Constant   -4.779   .827   33.374   1   .000   .008
a. Variable(s) entered on step 1: AGE.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a WEIGHT -.005 .006 .656 1 .418 .995 .982 1.007
Constant   -.727   .468   2.417   1   .120   .483
a. Variable(s) entered on step 1: WEIGHT.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a HEIGHT -.052 .017 9.134 1 .003 .950 .918 .982
Constant   7.212   2.744   6.910   1   .009   1356.000
a. Variable(s) entered on step 1: HEIGHT.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a BMI .006 .017 .112 1 .738 1.006 .972 1.040
Constant   -1.258   .486   6.686   1   .010   .284
a. Variable(s) entered on step 1: BMI.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486
Constant   -1.417   .130   117.908   1   .000   .243
a. Variable(s) entered on step 1: PRIORFRAC.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a PREMENO .051 .259 .038 1 .845 1.052 .633 1.749
Constant   -1.109   .115   92.397   1   .000   .330
a. Variable(s) entered on step 1: PREMENO.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a MOMFRAC .661 .281 5.526 1 .019 1.936 1.116 3.358
Constant   -1.196   .114   110.932   1   .000   .302
a. Variable(s) entered on step 1: MOMFRAC.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a ARMASSIST .709 .210 11.429 1 .001 2.032 1.347 3.066
Constant   -1.394   .142   96.584   1   .000   .248
a. Variable(s) entered on step 1: ARMASSIST.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a SMOKE -.308 .436 .498 1 .480 .735 .313 1.727
Constant   -1.079   .107   102.450   1   .000   .340
a. Variable(s) entered on step 1: SMOKE.

Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a RATERISK                  11.247   2    .004
RATERISK(1)  .546    .266   4.203    1     .040   1.727   1.024   2.911
RATERISK(2)  .909    .271   11.242   1     .001   2.482   1.459   4.223
Constant     -1.602  .207   59.831   1     .000   .201
a. Variable(s) entered on step 1: RATERISK.

Step 2: We now fit our first multivariable model, containing all covariates that were significant in the univariable analysis at the 25% level. The results of this fit are shown in Table 4.8. Once this model is fit we examine each covariate to ascertain its continued significance, at traditional levels, in the model. We see that the covariate with the largest p-value greater than 0.05 is the design variable that compares women with RATERISK = 2 to women with RATERISK = 1 (RATERISK(1) in the SPSS output, p = 0.131). The likelihood ratio test for the exclusion of self-reported risk of fracture (i.e., deleting both design variables from the model) is G = 5.96, which, with two degrees of freedom, yields p = 0.051, nearly significant at the 0.05 level.

Table 4.8 Results of Fitting the Multivariable Model with All Covariates Significant
at the 0.25 Level in the Univariable Analysis in the GLOW Data, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .034 .013 6.930 1 .008 1.035 1.009 1.062
HEIGHT -.044 .018 5.759 1 .016 .957 .923 .992
PRIORFRAC .645 .246 6.877 1 .009 1.906 1.177 3.088
MOMFRAC .621 .307 4.095 1 .043 1.861 1.020 3.397
ARMASSIST .446 .233 3.667 1 .056 1.562 .990 2.465
RATERISK                         5.820    2    .054
RATERISK(1)  .422    .279   2.284   1     .131   1.525   .882    2.636
RATERISK(2)  .707    .293   5.804   1     .016   2.028   1.141   3.604
Constant     2.709   3.230  .704    1     .402   15.019
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK.

Step 3: Next we check whether the covariates removed from the model in Step 2 confound, or are needed to adjust, the effects of the covariates remaining in the model. In results not shown, we find that the largest percent change is 17%, for the coefficient of ARMASSIST. This does not exceed our criterion of 20%. Thus, we see that while self-reported rate of risk is not a confounder, it is an important covariate. No other covariates are candidates for exclusion, and thus we continue with the model in Table 4.8.

Step 4: In the univariable analysis, the covariates weight (WEIGHT), body mass index (BMI), early menopause (PREMENO), and smoking (SMOKE) were not significant. When each of these covariates is added, one at a time, to the model in Table 4.8, its coefficient does not become significant. The only change of note is that the significance of BMI changes from 0.752 to 0.334. The next step is therefore to check the assumption of linearity in the logit for the continuous covariates age and height.

Before moving to Step 5 we consider another possible model. Since the coefficient for the first design variable (level 2 versus level 1) is not significant, one possibility is to combine levels 1 and 2 (self-reported risk less than or the same as other women) into a new reference category; the investigators considered combining these two categories reasonable.

Hence we fit this model and its results are shown in Table 4.9. In this model, the coefficient for the covariate RATERISK_3 now provides the estimate of the log of the odds ratio comparing the odds of fracture for individuals in level 3 to that of the combined group consisting of levels 1 and 2.

Table 4.9 Results of Fitting the Multivariable Model with RATERISK Dichotomized (More versus Same or Less), n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .033 .013 6.567 1 .010 1.034 1.008 1.060
HEIGHT -.046 .018 6.526 1 .011 .955 .921 .989
PRIORFRAC .664 .245 7.336 1 .007 1.943 1.201 3.142
MOMFRAC .664 .306 4.722 1 .030 1.943 1.067 3.536
ARMASSIST .473 .231 4.176 1 .041 1.604 1.020 2.525
RATERISK .458 .238 3.700 1 .054 1.581 .991 2.521
Constant   2.491   3.237   .592   1   .442   12.070
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, and RATERISK.

Step 5: At this point, we have our preliminary main effects model and must now check for the scale of the logit for continuous covariates age and height.

Step 6: The next step in the purposeful selection procedure is to explore possible interactions between the main effects. The subject matter investigators felt that each pair of main effects represents a plausible interaction. Hence, we fit models that individually added each of the 15 possible interactions to the main effects model. The results are summarized in Table 4.14. Three interactions are significant at the 10 percent level: Age by prior fracture (PRIORFRAC), prior fracture by mother had a fracture (MOMFRAC) and mother had a fracture by arms needed to rise from a chair (ARMASSIST). We note that prior fracture and mother having had a fracture are involved in two of the three significant interactions.

Table 4.14 Log-Likelihood, Likelihood Ratio Test (G, df = 1), and p-Value for the Addition of the Interactions to the Main Effects Model
Interaction            Log-Likelihood    G/Wald    p
Main effects model     −254.9090
Age*Height             −254.8420         0.13      0.716
Age*Priorfrac -252.3920 5.701 0.025
Age*Momfrac -254.8395 0.140 0.708
Age*Armassist -254.8360 0.146 0.702
Age*Raterisk -254.3855 1.50 0.305
Height*Priorfrac -254.8025 0.213 0.644
Height*Momfrac -253.7045 2.438 0.118
Height*Armassist -254.1115 1.588 0.208
Height*Raterisk -254.4220 0.990 0.320
Priorfrac*Momfrac -253.5095 2.793 0.095
Priorfrac*Armassist -254.7960 0.224 0.636
Priorfrac*Raterisk -254.8475 0.122 0.726
Momfrac*Armassist -252.5180 4.699 0.030
Momfrac*Raterisk -254.6425 0.533 0.465
Armassist*Raterisk -254.7925 2.230 0.135
The next step is to fit a model containing the main effects and the three significant interactions. The results of this fit are shown in Table 4.15. The three degree-of-freedom likelihood ratio test of the interactions model in Table 4.15 versus the main effects model in Table 4.9 is G = 11.03 with p = 0.012. Thus, in aggregate, the interactions contribute to the model. However, one interaction, prior fracture by mother's fracture, is not significant, with a Wald statistic p-value of 0.191. Next, we fit the model excluding this interaction; the results are shown in Table 4.16.
Table 4.15 Results of Fitting the Multivariable Model with the Addition of Three
Interactions, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .058 .017 12.172 1 .000 1.060 1.026 1.095
HEIGHT -.049 .018 7.038 1 .008 .952 .919 .987
PRIORFRAC 4.598 1.878 5.993 1 .014 99.240 2.501 3937.490
MOMFRAC 1.472 .423 12.124 1 .000 4.360 1.903 9.986
ARMASSIST .626 .254 6.075 1 .014 1.869 1.137 3.074
RATERISK .474 .241 3.869 1 .049 1.607 1.002 2.577
AGE by PRIORFRAC -.053 .026 4.223 1 .040 .948 .901 .998
MOMFRAC by PRIORFRAC -.847 .648 1.711 1 .191 .429 .121 1.525
ARMASSIST by MOMFRAC -1.167 .617 3.580 1 .058 .311 .093 1.043
Constant   1.011   3.385   .089   1   .765   2.749
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, MOMFRAC * PRIORFRAC, ARMASSIST * MOMFRAC.

Table 4.16 Results of Fitting the Multivariable Model with the Significant Interactions, n = 500
Variables in the Equation
B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)
Lower Upper
Step 1a AGE .057 .017 12.060 1 .001 1.059 1.025 1.094
HEIGHT -.047 .018 6.501 1 .011 .954 .921 .989
PRIORFRAC 4.612 1.880 6.018 1 .014 100.715 2.527 4013.438
MOMFRAC 1.247 .393 10.064 1 .002 3.479 1.610 7.514
ARMASSIST .644 .252 6.538 1 .011 1.904 1.162 3.120
RATERISK .469 .241 3.794 1 .051 1.598 .997 2.562
AGE by PRIORFRAC -.055 .026 4.543 1 .033 .946 .899 .996
ARMASSIST by MOMFRAC -1.281 .623 4.225 1 .040 .278 .082 .942
Constant   .779   3.381   .053   1   .818   2.180
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, ARMASSIST * MOMFRAC.

Interpretation-
The estimated coefficients in the interactions model in Table 4.16 are, with one exception, significant at the five percent level. The exception is the estimated coefficient for the dichotomized self-reported risk of fracture, RATERISK (1 = more, 0 = same or less), with p = 0.051. We elect to retain this covariate in the model since it is clinically important and its significance is very close to five percent. Hence the model in Table 4.16 is our preliminary final model.

APPENDIX
DATASET 1
TABLE 1.1
Age, Age Group, and Coronary Heart Disease
(CHD) Status of 100 Subjects
ID AGE AGEGRP CHD
1 20 1 0
2 23 1 0
3 24 1 0
4 25 1 0
5 25 1 1
6 26 1 0
7 26 1 0
8 28 1 0
9 28 1 0
10 29 1 0
11 30 2 0
12 30 2 0
13 30 2 0
14 30 2 0
15 30 2 0
16 30 2 1
17 32 2 0
18 32 2 0
19 33 2 0
20 33 2 0
21 34 2 0
22 34 2 0
23 34 2 1
24 34 2 0
25 34 2 0
26 35 3 0
27 35 3 0
28 36 3 0
29 36 3 1
30 36 3 0
31 37 3 0
32 37 3 1
33 37 3 0
34 38 3 0
35 38 3 0
36 39 3 0
37 39 3 1
38 40 4 0
39 40 4 1
40 41 4 0
41 41 4 0
42 42 4 0
43 42 4 0
44 42 4 0
45 42 4 1
46 43 4 0
47 43 4 0
48 43 4 1
49 44 4 0
50 44 4 0
51 44 4 1
52 44 4 1
53 45 5 0
54 45 5 1
55 46 5 0
56 46 5 1
57 47 5 0
58 47 5 0
59 47 5 1
60 48 5 0
61 48 5 1
62 48 5 1
63 49 5 0
64 49 5 0
65 49 5 1
66 50 6 0
67 50 6 1
68 51 6 0
69 52 6 0
70 52 6 1
71 53 6 1
72 53 6 1
73 54 6 1
74 55 7 0
75 55 7 1
76 55 7 1
77 56 7 1
78 56 7 1
79 56 7 1
80 57 7 0
81 57 7 0
82 57 7 1
83 57 7 1
84 57 7 1
85 57 7 1
86 58 7 0
87 58 7 1
88 58 7 1
89 59 7 1
90 59 7 1
91 60 8 0
92 60 8 1
93 61 8 1
94 62 8 1
95 62 8 1
96 63 8 1
97 64 8 0
98 64 8 1
99 65 8 1
100 69 8 1
DATASET 2-
The Global Longitudinal Study of Osteoporosis in Women
The Global Longitudinal Study of Osteoporosis in Women (GLOW) is an international study of osteoporosis in women over 55 years of age.
Code Sheet for Variables in the GLOW Study

#   Description                             Codes/Values                              Name
1   Identification code                     1–n                                       SUB_ID
2   Study site                              1–6                                       SITE_ID
3   Physician ID code                       128 unique codes                          PHY_ID
4   History of prior fracture               1 = yes, 0 = no                           PRIORFRAC
5   Age at enrollment                       Years                                     AGE
6   Weight at enrollment                    Kilograms                                 WEIGHT
7   Height at enrollment                    Centimeters                               HEIGHT
8   Body mass index                         kg/m^2                                    BMI
9   Menopause before age 45                 1 = yes, 0 = no                           PREMENO
10  Mother had a hip fracture               1 = yes, 0 = no                           MOMFRAC
11  Arms needed to stand from a chair       1 = yes, 0 = no                           ARMASSIST
12  Former or current smoker                1 = yes, 0 = no                           SMOKE
13  Self-reported risk of fracture          1 = less than others of the same age,     RATERISK
                                            2 = same as others of the same age,
                                            3 = greater than others of the same age
14  Fracture risk score                     Composite risk score                      FRACSCORE
15  Any fracture in the first year          1 = yes, 0 = no                           FRACTURE
IBM SPSS Statistics 20 Command Syntax Reference
TABLE 1.3
LOGISTIC REGRESSION VARIABLES CHD
/METHOD=ENTER AGE
/CRITERIA=PIN (.05) POUT (.10) ITERATE (20) CUT (.5).

TABLE 2.2
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE WEIGHT PRIORFRAC PREMENO RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/CRITERIA=PIN (.05) POUT (.10) ITERATE (20) CUT (.5).

TABLE 2.3
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE PRIORFRAC RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.2
CROSSTABS
/TABLES=FRACTURE BY PRIORFRAC
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.

TABLE 3.3
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PRIORFRAC
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.7
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.10
MODEL 1:-
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PRIORFRAC
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

MODEL 2:-
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PRIORFRAC HEIGHT
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

MODEL 3:-
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PRIORFRAC HEIGHT HEIGHT*PRIORFRAC
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.7
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER WEIGHT
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER HEIGHT
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER BMI
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PRIORFRAC
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER PREMENO
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER MOMFRAC
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER ARMASSIST
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER SMOKE
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.8
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.9
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.15
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC MOMFRAC*PRIORFRAC
ARMASSIST*MOMFRAC
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.16
LOGISTIC REGRESSION VARIABLES FRACTURE
/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC ARMASSIST*MOMFRAC
/CONTRAST (RATERISK) =Indicator (1)
/PRINT=CI (95)
/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).
