NTCC REPORT

ON

LOGISTIC REGRESSION

SUBMITTED BY

CHANDNI MISHRA

TOWARDS PARTIAL COMPLETION OF M.SC STATISTICS

UNDER SUPERVISION OF

DR. C.M. PANDEY

PROFESSOR AND HEAD

DEPARTMENT OF BIOSTATISTICS AND HEALTH INFORMATICS

SANJAY GANDHI POSTGRADUATE INSTITUTE OF MEDICAL SCIENCES, LUCKNOW

Acknowledgment

I WOULD LIKE TO EXPRESS MY SINCERE GRATITUDE TO MY NTCC GUIDE, DR. NEERAJ SINGH, AND MY TRAINING GUIDE, DR. C.M. PANDEY, FOR PROVIDING THEIR INVALUABLE GUIDANCE TOWARDS THE COMPLETION OF THIS REPORT.

ABSTRACT

The purpose of this report is to provide researchers and readers with the basic concepts of LOGISTIC REGRESSION, illustrated with examples, and to show how to carry out the calculations and interpret the results using software such as SPSS.

The report includes tables, graphs, and calculations produced with SPSS, together with their interpretation.

BRIEF INTRODUCTION TO SPSS

SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to undertake a range of statistical procedures.

When SPSS, Inc., an IBM Company, was conceived in 1968, it stood for Statistical Package for Social Sciences. Since the company’s purchase by IBM in 2009, IBM has decided to simply use the name SPSS to describe its core product of predictive analytics. IBM describes predictive analytics as tools that help connect data to effective action by drawing reliable conclusions about current conditions and future events.

SPSS is an integrated system of computer programs designed for the analysis of social sciences data. It is one of the most popular of the many statistical packages currently available for statistical analysis. Its popularity stems from the fact that the program:

Allows for a great deal of flexibility in the format of data.

Provides the user with a comprehensive set of procedures for data transformation and file manipulation.

Offers the researcher a large number of statistical analyses commonly used in social sciences.

CONTENT

1. Introduction to the Logistic Regression Model

Introduction

Fitting the Logistic Regression Model

Testing for the Significance of the Coefficients

Confidence Interval Estimation

2. The Multiple Logistic Regression Model

Introduction

The Multiple Logistic Regression Model

Fitting the Multiple Logistic Regression Model

Testing for the Significance of the Model

Confidence Interval Estimation

3. Interpretation of the Fitted Logistic Regression Model

Introduction

Dichotomous Independent Variable

Polychotomous Independent Variable

Continuous Independent Variable

Multivariable Models

Presentation and Interpretation of the Fitted Values

4. Model-Building Strategies and Methods for Logistic Regression

Introduction

Purposeful Selection of Covariates

CASE STUDY.

APPENDIX

REVIEW

Types of Data; Measurement Scales: Nominal, Ordinal, Interval, and Ratio.

These are simply ways to categorize different types of variables.

Nominal - Nominal scales are used for labeling variables, without any quantitative value. A good way to remember this is that "nominal" sounds a lot like "name", and nominal scales are essentially "names" or labels. Examples of nominal variables include region, zip code, gender, or religious affiliation. A nominal variable can also be coded by the researcher to simplify the analysis, for example: M = Male, F = Female.

Ordinal - This level of measurement involves ordering or ranking the variable to be measured. The order of the values is what is important and significant, but the differences between them are not really known. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, and discomfort.

Interval - The interval level of measurement not only classifies and orders the measurements, but it also specifies that the distances between each interval on the scale are equivalent along the scale from low interval to high interval. For example, on a standardized intelligence measure, a 10-point difference in IQ scores has the same meaning anywhere along the scale. Thus, the difference in IQ test scores between 80 and 90 is the same as the difference between 110 and 120. However, it would not be correct to say that a person with an IQ score of 100 is twice as intelligent as a person with a score of 50. The reason is that intelligence test scales (and other similar interval scales) do not have a true zero that represents a complete absence of intelligence.

Ratio -In this level of measurement, the observations, in addition to having equal intervals, can have a value of zero as well. The zero in the scale makes this type of measurement unlike the other types of measurement, although the properties are similar to that of the interval level of measurement. In the ratio level of measurement, the divisions between the points on the scale have an equivalent distance between them.

The four data types

Attribute          Nominal                       Ordinal                      Interval                 Ratio
Name               Categorical                   Sequence                     Equal interval           Ratio
Property           Set                           Fully ordered (ranked)       Unit size fixed          Zero/reference point fixed
Statistics         Count, mode, chi-squared      + median, rank correlation   + ANOVA, mean, SD        + ratios, logs
Example            Set of participants,          Order of finishing a race    Centigrade scale         Degrees Kelvin (absolute)
                   makes of car
Relations          A = B, A ≠ B                  A > B                        |A − B| > |C − D|        A / B
What is absolute   Identity of entities          Order, sequence              Intervals, differences   Ratios, proportions

Note- odds can have a large magnitude even if the underlying probabilities are low.

Probability

P = (outcomes of interest) / (all possible outcomes)

Odds = p(occurring) / p(not occurring) = p / (1 − p). The odds of an event are the number of events divided by the number of non-events.

Odds ratio - an odds ratio is a ratio of two odds:

Odds ratio = odds1 / odds0 = [p1 / (1 − p1)] / [p0 / (1 − p0)]
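These definitions can be checked with a few lines of Python; the probabilities below are hypothetical, chosen only for illustration (note, as above, that odds ratios can be sizeable even when both probabilities are low):

```python
def odds(p):
    """Odds = p / (1 - p): the ratio of events to non-events."""
    return p / (1.0 - p)

def odds_ratio(p1, p0):
    """Odds ratio = odds under condition 1 divided by odds under condition 0."""
    return odds(p1) / odds(p0)

print(odds(0.5))             # 1.0: even odds when p = 0.5
print(odds_ratio(0.2, 0.1))  # 2.25, even though both probabilities are low
```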

Chapter 1.

Introduction to the Logistic Regression Model

INTRODUCTION

Logistic regression is the appropriate regression analysis to conduct when the dependent variable (y) is dichotomous (binary), such as "yes" or "no", "0" or "1", "A" or "B". Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent variable is dichotomous, such as male/female, smoker/non-smoker, or success/failure. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. The logistic regression model is the most frequently used regression model for the analysis of such data. The independent variables are often called covariates.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is categorical. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.

There are three primary uses of logistic regression:

1. Prediction of group membership and outcome.

The goal is to correctly predict the category of the outcome of individual cases. Thus, the research question asked is whether an outcome can be predicted from a selected set of independent variables. For instance, in epidemiological studies, can the development of lung cancer be predicted from the incidence and duration of smoking as well as from demographic variables such as gender, age, and social and economic status (SES)?

2. Logistic regression provides knowledge of the relationships and strengths among the variables.

The goal is to identify which independent variables predict the outcome, that is, increase or decrease the probability of the outcome or have no effect. For example, does inclusion of information about the incidence and duration of smoking improve prediction of lung cancer, and is a particular variable associated with an increase or decrease in the probability that a case has lung cancer? These parameter estimates (the coefficients of the predictors included in a model) can also be used to calculate and interpret the odds ratio. For instance, what are the odds that a person has lung cancer at age 65, given that he has smoked 10 packs a day for the past 30 years?

3. Classification of cases.

The goal is to understand how reliable the logistic regression model is in classifying cases for whom the outcome is known. For instance, how many people with or without lung cancer are diagnosed correctly? The researcher establishes a cut point of, say, 0.5 and then asks, for instance: how many people with lung cancer are correctly classified if everyone with a predicted probability above this cut point is diagnosed as having lung cancer?

Why will other regression procedures not work?

Simple linear regression is one quantitative variable predicting another.

Multiple regression is a simple linear regression with more independent variables.

Nonlinear regression is still two quantitative variables, but the data is curvilinear.

Running an ordinary linear regression on such data has major problems, since binary data do not follow a normal distribution, which is a condition needed for most other types of regression.
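To see the problem concretely, here is a small sketch (with made-up data, not the CHDAGE data of the next example) showing that an ordinary least-squares line fitted to a 0/1 outcome produces predicted "probabilities" below 0 and above 1:

```python
# Hypothetical binary outcome (0/1) versus age, for illustration only.
ages = [22, 25, 30, 35, 40, 45, 50, 55, 60, 65]
chd  = [ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1]

# Closed-form simple linear regression (ordinary least squares).
n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(chd) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, chd))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(predict(20))  # negative: not a valid probability
print(predict(70))  # greater than 1: not a valid probability
```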

Example 1: Table 1.1 lists the age in years (AGE), and presence or absence of evidence of significant coronary heart disease (CHD), for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of "0" to indicate that CHD is absent, or "1" to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this dataset as the CHDAGE data.

Fig 1.1: Scatterplot of CHD (0/1) by AGE (years).

A scatterplot of the data in Table 1.1 is given in Figure 1.1. In this scatterplot, all points fall on one of two parallel lines representing the absence of CHD (y = 0) or the presence of CHD (y = 1). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and AGE.

The main problem with Figure 1.1 is that the variability in CHD at all ages is large. This makes it difficult to see any functional relationship between AGE and CHD. One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. We use this strategy by grouping age into the categories (AGEGRP) defined in Table 1.1. Table 1.2 contains, for each age group, the frequency of occurrence of each outcome, as well as the percent with CHD present.

Table 1.2

Age group   n     Absent   Present   Mean
20-29       10    9        1         0.1
30-34       15    13       2         0.133
35-39       12    9        3         0.25
40-44       15    10       5         0.333
45-49       13    7        6         0.462
50-54       8     3        5         0.625
55-59       17    4        13        0.765
60-69       10    2        8         0.8
Total       100   57       43        0.43

Fig 1.2: Percent of subjects with CHD by the midpoint of each age interval (years).

By examining Table 1.2, a clearer picture of the relationship begins to emerge. It shows that as age increases, the proportion (mean) of individuals with evidence of CHD increases. Figure 1.2 presents a plot of the percent of individuals with CHD versus the midpoint of each age interval. This plot provides considerable insight into the relationship between CHD and AGE in this study, but the functional form for this relationship needs to be described. The plot in this figure is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We note two important differences.

Some important facts:

The dependent variable in logistic regression follows the Bernoulli distribution with an unknown probability, p.

The Bernoulli distribution is just a special case of the binomial distribution with n = 1 (a single trial).

Success is coded "1" and failure "0".

The probability of success is p and of failure is q = 1 − p.

In logistic regression, we are estimating an unknown p for any given linear combination of the independent variables.

Therefore, we need a function linking the linear combination of the independent variables to the Bernoulli parameter p; that link is called the logit.

The first difference concerns the nature of the relationship between the outcome and independent variables. In any regression problem, the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as E(Y|x), where Y denotes the outcome variable and x denotes a specific value of the independent variable. The quantity E(Y|x) is read "the expected value of Y, given the value x". In linear regression, we assume that this mean may be expressed as a linear equation in x. This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and +∞. The column labeled "Mean" in Table 1.2 provides an estimate of E(Y|x). We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the functional relationship between CHD and AGE. With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., 0 ≤ E(Y|x) ≤ 1). This can be seen in Figure 1.2.

In addition, the plot shows that this mean approaches zero and one "gradually". The change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles a plot of the cumulative distribution of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we use is based on the logistic distribution.

In order to simplify notation, we use the quantity π(x) = E(Y|x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is:

π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))    (1.1)

A transformation of π(x) that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as:

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x.    (1.1*)

The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model. The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.
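The relationship between π(x) in equation (1.1) and the logit g(x) can be verified numerically; a short Python sketch, with illustrative (not estimated) coefficients:

```python
import math

def pi(x, b0, b1):
    """Logistic model, equation (1.1): pi(x) = e^(b0+b1*x) / (1 + e^(b0+b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """Logit transformation: g = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

b0, b1, x = -5.0, 0.1, 40.0   # illustrative values, not fitted estimates
p = pi(x, b0, b1)
print(p)          # bounded between 0 and 1
print(logit(p))   # recovers the linear predictor b0 + b1*x = -1.0
```

Whatever the value of the linear predictor, π(x) stays inside (0, 1), while the logit is free to range over the whole real line.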

The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y|x), and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).

In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:

1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, π(x), given in equation (1.1), satisfies this constraint.

2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.

3. The principles that guide an analysis using linear regression also guide us in logistic regression.

FITTING THE LOGISTIC REGRESSION MODEL

For a dichotomous outcome variable given x, expressed as y = π(x) + ε, the error ε may assume one of two possible values: when y = 1, ε = 1 − π(x), with probability π(x); when y = 0, ε = −π(x), with probability 1 − π(x). Thus ε has mean zero and variance π(x)(1 − π(x)), and is not normally distributed.

Fitting the logistic regression model in equation (1.1) to a set of data requires that we estimate the values of ?0 and ?1, the unknown parameters. The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method provides the foundation for our approach to estimation with the logistic regression model.

Since the observations are assumed to be independent, the contribution to the likelihood of the pair (xi, yi) is:

f(xi, yi) = π(xi)^yi [1 − π(xi)]^(1 − yi)    (1.2)

The likelihood function is given by:

L(β) = ∏ (i = 1 to n) π(xi)^yi [1 − π(xi)]^(1 − yi)    (1.3)

Taking logs on both sides gives the log-likelihood:

ln L(β) = Σ (i = 1 to n) { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }    (1.4)

To find the value of β that maximizes L(β), we differentiate ln L(β) with respect to β0 and β1 and set the resulting expressions equal to zero. These equations, known as the likelihood equations, are:

Σ (i = 1 to n) [yi − π(xi)] = 0    (1.5)

and

Σ (i = 1 to n) xi [yi − π(xi)] = 0.    (1.6)

The value of β given by the solution to equations (1.5) and (1.6) is called the maximum likelihood estimate and is denoted β̂. In general, the "hat" symbol denotes the maximum likelihood estimate of the respective quantity. For example, π̂(xi) is the maximum likelihood estimate of π(xi). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation (1.5) is that:

Σ (i = 1 to n) yi = Σ (i = 1 to n) π̂(xi)

As an example, consider the data given in Table 1.1. Fitting a logistic regression in the software package SPSS, with the continuous variable AGE as the independent variable, produces the output in Table 1.3.

Table 1.3 Variables in the Equation

                  B        S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                 Lower     Upper
Step 1  AGE       .111     .024    21.254   1    .000   1.117    1.066     1.171
        Constant  -5.309   1.134   21.935   1    .000   .005

Log-likelihood = −53.676546

The maximum likelihood estimates of β0 and β1 are β̂0 = −5.309 and β̂1 = 0.111. The fitted values are given by the equation:

π̂(x) = e^(−5.309 + 0.111·AGE) / (1 + e^(−5.309 + 0.111·AGE))    (1.7)

and the estimated logit is:

ĝ(x) = −5.309 + 0.111·AGE    (1.8)

The log-likelihood given in Table 1.3 is the value of equation (1.4) computed using β̂0 and β̂1.
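The fitted equation is easy to evaluate directly; a short sketch using the coefficients reported in Table 1.3 shows how the estimated probability of CHD rises with age:

```python
import math

# Coefficients from Table 1.3: estimated logit = -5.309 + 0.111 * AGE
B0, B1 = -5.309, 0.111

def fitted_prob(age):
    """Fitted probability of CHD at a given age."""
    g = B0 + B1 * age
    return math.exp(g) / (1.0 + math.exp(g))

for age in (30, 50, 70):
    print(age, round(fitted_prob(age), 3))  # probability increases with age
```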

Testing for the Significance of the Coefficients

In logistic regression, comparison of observed to predicted values is based on the log-likelihood function defined in equation (1.4).

The comparison of observed to predicted values using the likelihood function is based on the following expression:

D = −2 ln [ (likelihood of the fitted model) / (likelihood of the saturated model) ]    (1.9)

The quantity inside the brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can, therefore, be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation (1.4), equation (1.9) becomes:

D = −2 Σ (i = 1 to n) { yi ln(π̂i / yi) + (1 − yi) ln[(1 − π̂i) / (1 − yi)] }    (1.10)

The statistic D in equation (1.10) is called the deviance, and for logistic regression it plays the same role that the residual sum-of-squares plays in linear regression. In fact, the deviance shown in equation (1.10), when computed for linear regression, is identically equal to the SSE.

In particular, to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is:

G = D(model without the variable) − D(model with the variable).    (1.11)

This statistic, G, plays the same role in logistic regression that the numerator of the partial F-test does in linear regression. Because the likelihood of the saturated model is always common to both values of D being differenced, G can be expressed as:

G = −2 ln [ (likelihood without the variable) / (likelihood with the variable) ]    (1.12)

For the specific case of a single independent variable, it is easy to show that when the variable is not in the model, the maximum likelihood estimate of β0 is ln(n1/n0), where n1 = Σ yi and n0 = Σ (1 − yi), and the predicted probability for all subjects is constant and equal to n1/n. In this setting, the value of G is:

G = −2 ln [ (n1/n)^n1 (n0/n)^n0 / ∏ (i = 1 to n) π̂i^yi (1 − π̂i)^(1 − yi) ]    (1.13)

Confidence Interval Estimation.

The Wald test statistic is equal to the ratio of the maximum likelihood estimate of the slope parameter, β̂1, to an estimate of its standard error. Under the null hypothesis and the usual sample-size assumptions, this ratio follows a standard normal distribution:

W = β̂1 / SE(β̂1)

The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests for the significance of the model. In particular, the confidence interval estimators for the slope and intercept are, most often, based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals. The endpoints of a 100(1 − α)% confidence interval for the slope and intercept are:

β̂1 ± z(1 − α/2) SE(β̂1)  and  β̂0 ± z(1 − α/2) SE(β̂0)

where z(1 − α/2) is the upper 100(1 − α/2)% point from the standard normal distribution and SE(·) denotes a model-based estimator of the standard error of the respective parameter estimator. Since we are using software such as SPSS, we do not need to compute these by hand: Table 1.3 reports the confidence interval for exp(β1), and taking the natural log of its endpoints gives a confidence interval for β1.
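As a sketch, the Wald statistic and the Wald-based interval can be reproduced from the AGE row of Table 1.3 (β̂1 = 0.111, SE = 0.024):

```python
import math

beta1, se = 0.111, 0.024   # AGE coefficient and its standard error (Table 1.3)
z = 1.96                   # upper 97.5% point of the standard normal

wald = beta1 / se                          # Wald statistic W
lo, hi = beta1 - z * se, beta1 + z * se    # 95% CI for beta1
or_lo, or_hi = math.exp(lo), math.exp(hi)  # 95% CI for the odds ratio exp(beta1)

print(round(wald, 2))                    # about 4.62; its square is close to the
                                         # Wald chi-square 21.254 in Table 1.3
print(round(or_lo, 3), round(or_hi, 3))  # close to (1.066, 1.171) in Table 1.3
```

The small discrepancy between wald squared and the printed Wald chi-square comes only from rounding of the reported coefficient and standard error.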

Chapter 2.

The Multiple Logistic Regression Model

Introduction

In Chapter 1 we introduced the logistic regression model in the context of a model containing a single variable. In this chapter, we generalize the model to one with more than one independent variable (i.e., the multivariable or multiple logistic regression model). Central to the consideration of the multiple logistic regression model is estimating the coefficients and testing for their significance.

The logit of the multiple logistic regression model is obtained by generalizing equations (1.1) and (1.1*):

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x1 + β2x2 + ... + βp xp    (2.1)

where, for the multiple logistic regression model,

π(x) = e^g(x) / (1 + e^g(x))    (2.2)

FITTING THE MULTIPLE LOGISTIC REGRESSION MODEL

The method of estimation used in the multivariable case is the same as in the univariable situation: maximum likelihood. The likelihood function is nearly identical to that given in equation (1.3), with the only change being that π(x) is now defined as in equation (2.2). There are p + 1 likelihood equations, obtained by differentiating the log-likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

Σ (i = 1 to n) [yi − π(xi)] = 0

and

Σ (i = 1 to n) xij [yi − π(xi)] = 0,  for j = 1, 2, . . . , p

As in the univariable model, solving the likelihood equations requires software, which is available in virtually every statistical package. Let β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are π̂(xi), the value of the expression in equation (2.2) computed using β̂ and xi.

As an example, using the Global Longitudinal Study of Osteoporosis in Women (GLOW) dataset, we consider five variables thought to be of importance: age at enrollment (AGE), weight at enrollment (WEIGHT), history of a previous fracture (PRIORFRAC), whether or not the woman experienced menopause before age 45 (PREMENO), and self-reported risk of fracture relative to women of the same age (RATERISK), coded at three levels: less, same, or more risk.

TABLE 2.2 Fitted Multiple Logistic Regression Model of Fracture in the First Year of Follow Up (FRACTURE) on Age, Weight, Prior Fracture (PRIORFRAC), Early Menopause (PREMENO), and Self-Reported Risk of Fracture (RATE RISK) from the GLOW Study, n = 500

Variables in the Equation

                      B       S.E.    Wald     df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                    Lower     Upper
Step 1  AGE           .050    .013    13.966   1    .000   1.051    1.024     1.079
        WEIGHT        .004    .007    .347     1    .556   1.004    .991      1.018
        PRIORFRAC     .679    .242    7.858    1    .005   1.973    1.227     3.173
        PREMENO       .187    .277    .456     1    .499   1.206    .701      2.074
        RATERISK                      9.181    2    .010
        RATERISK(1)   .534    .276    3.754    1    .053   1.707    .994      2.930
        RATERISK(2)   .874    .289    9.139    1    .003   2.397    1.360     4.224
        Constant      -5.606  1.221   21.090   1    .000   .004

Variable(s) entered on step 1: AGE, WEIGHT, PRIORFRAC, PREMENO, RATERISK.

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      518.075             .085                   .125

Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Log-likelihood = −259.03768

In the example given above, the variable RATERISK is modeled using two design variables for its three levels. In software such as SPSS, this is done by declaring the variable as categorical. In Table 2.2 the estimated coefficients for the two design variables for RATERISK are labeled RATERISK(1) and RATERISK(2). The estimated logit is given by the following equation:

ĝ(x) = −5.606 + 0.050·AGE + 0.004·WEIGHT + 0.679·PRIORFRAC + 0.187·PREMENO + 0.534·RATERISK(1) + 0.874·RATERISK(2)

And the associated estimated logistic probabilities are found by using equation (2.2)
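As a sketch, the estimated logit above and equation (2.2) give an estimated probability for any covariate pattern; the subject profile below is hypothetical, chosen only to illustrate the calculation:

```python
import math

def glow_logit(age, weight, priorfrac, premeno, raterisk1, raterisk2):
    """Estimated logit using the coefficients reported in Table 2.2."""
    return (-5.606 + 0.050 * age + 0.004 * weight + 0.679 * priorfrac
            + 0.187 * premeno + 0.534 * raterisk1 + 0.874 * raterisk2)

def glow_prob(*covariates):
    """Estimated probability of fracture, via equation (2.2)."""
    g = glow_logit(*covariates)
    return math.exp(g) / (1.0 + math.exp(g))

# Hypothetical subject: 65 years old, 70 kg, prior fracture, menopause
# after 45, self-reported risk "same" coded as RATERISK(1) = 1.
p = glow_prob(65, 70, 1, 0, 1, 0)
print(round(p, 3))  # estimated probability of fracture in the first year
```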

Testing for the Significance of the Model

Once we have fit a particular multiple (multivariable) logistic regression model, we begin the process of model assessment. The likelihood ratio test for the overall significance of the p coefficients for the independent variables in the model is performed in exactly the same manner as in the univariable case. The test is based on the statistic G given in equation (1.12). Consider the fitted model whose estimated coefficients are given in Table 2.2. For that model, the value of the log-likelihood, shown at the bottom of the table, is L = −259.0377. The log-likelihood for the constant-only model may be obtained by evaluating the numerator of equation (1.13) or by fitting the constant-only model.

Either method yields the log-likelihood L = −281.1676. Thus the value of the likelihood ratio test is, from equation (1.12), G = −2[−281.1676 − (−259.0377)] = 44.2598, and the p-value for the test is P[χ²(6) > 44.2598] < 0.0001, which is significant at well beyond the α = 0.05 level. We reject the null hypothesis and conclude that at least one of the p coefficients is different from zero, an interpretation analogous to that of the F-test in multiple linear regression.

The p-values computed under this hypothesis are shown in the fifth column of Table 2.2. If we use a level of significance of 0.05, then we would conclude that the variables AGE, history of prior fracture (PRIORFRAC) and self-reported rate of risk (RATE RISK) are statistically significant, while WEIGHT and early menopause (PREMENO) are not significant.

As our goal is to obtain the best-fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant, and to compare that reduced model to the full model containing all of the variables. The results of fitting the reduced model are given in Table 2.3.

The difference between the two models is the exclusion of the variables WEIGHT and early menopause (PREMENO) from the full model. The likelihood ratio test comparing these two models is obtained using the definition of G given in equation (1.12). Under the hypothesis that the coefficients for both excluded variables are equal to zero, it has a chi-square distribution with 2 degrees of freedom. The value of the test statistic comparing the model in Table 2.3 to the one in Table 2.2 is:

G = −2[−259.4494 − (−259.0377)] = 0.8234

Table 2.3 Fitted Multiple Logistic Regression Model of Fracture in the First Year of Follow Up (FRACTURE) on AGE, Prior Fracture (PRIORFRAC), and Self-Reported Risk of Fracture (RATERISK) from the GLOW Study, n = 500

Variables in the Equation

                      B       S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1  AGE           .046    .012   13.618   1    .000   1.047    1.022     1.073
        PRIORFRAC     .700    .241   8.431    1    .004   2.014    1.256     3.231
        RATERISK                     9.223    2    .010
        RATERISK(1)   .549    .275   3.979    1    .046   1.731    1.010     2.967
        RATERISK(2)   .866    .286   9.150    1    .002   2.377    1.356     4.165
        Constant      -4.991  .903   30.565   1    .000   .007

Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      518.899             .083                   .123

Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Log-likelihood = −259.4494

which, with 2 degrees of freedom, has a p-value of P[χ²(2) > 0.8234] = 0.663.

As the p-value is large, exceeding 0.05, we conclude that the full model is no better than the reduced model. That is, there is little statistical justification for including WEIGHT and PREMENO in the model. However, we must not base our models entirely on tests of statistical significance.
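The likelihood ratio comparison above can be sketched in a few lines, using the log-likelihoods reported in Tables 2.2 and 2.3 (for a chi-square variate with 2 degrees of freedom, the upper tail probability has the closed form e^(−g/2)):

```python
import math

ll_full = -259.0377     # log-likelihood, full model (Table 2.2)
ll_reduced = -259.4494  # log-likelihood, reduced model (Table 2.3)

G = -2 * (ll_reduced - ll_full)   # likelihood ratio statistic, equation (1.12)
p_value = math.exp(-G / 2.0)      # P[chi-square(2) > G]

print(round(G, 4))        # 0.8234
print(round(p_value, 3))  # 0.663: no evidence WEIGHT and PREMENO are needed
```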

CONFIDENCE INTERVAL ESTIMATION

The methods used for confidence interval estimators for a multivariable model are

essentially the same. Table 2.3, the 95 percent confidence interval for the exponential of the coefficient of variables are given we have to take antilog to obtain CI of coefficient variables.

Table 2.4 Taking the natural log of the confidence limits for Exp(B) in Table 2.3 gives confidence limits for B

Variables in the Equation

                      B       S.E.   Wald     df   Sig.   Exp(B)   95% C.I. for B
                                                                   Lower     Upper
Step 1  AGE           .046    .012   13.618   1    .000   1.047    0.0218    0.0705
        PRIORFRAC     .700    .241   8.431    1    .004   2.014    0.2279    1.1728
        RATERISK                     9.223    2    .010
        RATERISK(1)   .549    .275   3.979    1    .046   1.731    0.0100    1.0876
        RATERISK(2)   .866    .286   9.150    1    .002   2.377    0.3045    1.4267
        Constant      -4.991  .903   30.565   1    .000   .007

Variable(s) entered on step 1: AGE, PRIORFRAC, RATERISK.

CHAPTER 3.

Interpretation of the Fitted Logistic Regression Model

INTRODUCTION

We begin this chapter assuming that a logistic regression model has been fit, that the variables in the model are significant in either a clinical or statistical sense, and that the model fits according to some statistical measure of fit.

The interpretation of any fitted model requires that we be able to draw practical inferences from the estimated coefficients in the model. This involves two issues: determining the functional relationship between the dependent variable and the independent variable, and appropriately defining the unit of change for the independent variable.

When the independent variable is binary or dichotomous:-

This case provides the conceptual foundation for all the

other situations.

We assume that the independent variable, x, is coded as either 0 or 1. The

difference in the logit for a subject with x = 1 and x = 0 is

g(1) − g(0) = (β0 + β1 × 1) − (β0 + β1 × 0) = (β0 + β1) − β0 = β1.

The practical problem is that change on the scale of the log-odds is hard to explain and it may not be especially meaningful to a subject-matter audience. In order to provide a more meaningful interpretation, we need to introduce the odds ratio as a measure of association.

The odds of the outcome being present among individuals with x = 1 is π(1)/[1 − π(1)]. Similarly, the odds of the outcome being present among individuals with x = 0 is π(0)/[1 − π(0)]. The odds ratio, denoted OR, is the ratio of the odds for x = 1 to the odds for x = 0, and is given by the equation:

OR = [π(1)/(1 − π(1))] / [π(0)/(1 − π(0))]   (3.1)

Substituting the expressions for the logistic regression model probabilities in

Table 3.1 into equation (3.1) we obtain

OR = {[e^(β0+β1)/(1 + e^(β0+β1))] / [1/(1 + e^(β0+β1))]} / {[e^β0/(1 + e^β0)] / [1/(1 + e^β0)]} = e^(β0+β1)/e^β0 = e^(β0+β1−β0) = e^β1.

Hence, for a logistic regression model with a dichotomous independent variable coded 0 and 1, the relationship between the odds ratio and the regression coefficient is

OR = e^β1   (3.2)

Table 3.1 Values of the Logistic Regression Model when the Independent Variable Is

Dichotomous

Outcome (y)   Independent variable x = 1            Independent variable x = 0
y = 1         π(1) = e^(β0+β1) / (1 + e^(β0+β1))    π(0) = e^β0 / (1 + e^β0)
y = 0         1 − π(1) = 1 / (1 + e^(β0+β1))        1 − π(0) = 1 / (1 + e^β0)
Total         1.0                                   1.0

The odds ratio is widely used as a measure of association as it approximates how much more likely or unlikely (in terms of odds) it is for the outcome to be present among those subjects with x = 1 as compared to those subjects with x = 0.

To review, the outcome variable is having a fracture (FRACTURE) in the first year of follow-up.

Here we use whether the woman had a fracture between the age of 45 and enrollment in the study (PRIORFRAC) as the dichotomous independent variable. The result of cross-classifying fracture during follow-up by prior fracture is presented in Table 3.2.

Table 3.2 Cross-Classification of Prior Fracture and Fracture During Follow-Up in the GLOW Study, n = 500

FRACTURE * PRIORFRAC Crosstabulation

Count

               PRIORFRAC
                0      1    Total
FRACTURE  0    301     74     375
          1     73     52     125
Total          374    126     500

The frequencies in Table 3.2 tell us that there were 52 subjects with values

(x = 1, y = 1), 73 with (x = 0, y = 1), 74 with (x = 1, y = 0), and 301 with (x = 0, y = 0). The results of fitting a logistic regression model containing the dichotomous covariate PRIORFRAC are shown in Table 3.3.
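Before fitting the model, the odds ratio and the corresponding slope coefficient can be computed directly from these four cell counts. A quick check (Python, standard library only):

```python
import math

# Cell counts from Table 3.2.
n11, n01 = 52, 73    # y = 1 with x = 1 and with x = 0
n10, n00 = 74, 301   # y = 0 with x = 1 and with x = 0

# Cross-product odds ratio; its natural log equals the slope coefficient.
odds_ratio = (n11 * n00) / (n10 * n01)
beta1 = math.log(odds_ratio)
print(round(odds_ratio, 3), round(beta1, 4))  # -> 2.897 1.0638
```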

Table 3.3 Results of Fitting the Logistic Regression Model of Fracture

(FRACTURE) on Prior Fracture (PRIORFRAC) Using the Data in Table 3.2

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486

Constant -1.417 .130 117.908 1 .000 .243
a. Variable(s) entered on step 1: PRIORFRAC.

Model Summary

Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square

1 540.068a .044 .065

Log-Likelihood= -270.03397

The estimate of the odds ratio using equation (3.2) and the estimated coefficient for PRIORFRAC in Table 3.3 is OR = e^1.064 = 2.9. Readers who have had some previous experience with the odds ratio undoubtedly wonder why we used a logistic regression package to estimate the odds ratio when we easily could have computed it directly as the cross-product ratio from the frequencies in Table 3.2, namely,

OR = (52 × 301)/(74 × 73) = 2.897.

Thus, we see that the slope coefficient from the fitted logistic regression model is

β̂1 = ln[(52 × 301)/(74 × 73)] = 1.0638.

We obtain a 100 × (1 − α)% confidence interval estimator for the odds ratio by first calculating the endpoints of a confidence interval estimator for the log-odds ratio (i.e., β1) and then exponentiating the endpoints of this interval. In general, the endpoints are given by the expression

exp[β̂1 ± z(1−α/2) × SE(β̂1)].

As an example, consider the estimation of the odds ratio for the dichotomous variable PRIORFRAC. Using the results in Table 3.3, the point estimate is OR = 2.9 and the 95% confidence interval is exp(1.064 ± 1.96 × 0.2231) = (1.87, 4.49).

This interval is typical of many confidence intervals for odds ratios when the point estimate exceeds 1, in that it is skewed to the right from the point estimate. This confidence interval suggests that the odds of a fracture during follow-up among women with a prior fracture could be as little as 1.9 times or as much as 4.5 times the odds for women without a prior fracture, at the 95% level of confidence.
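This confidence interval can be reproduced from the estimated coefficient and its standard error in Table 3.3. A minimal sketch (Python, standard library only; 1.96 is the 97.5th percentile of the standard normal):

```python
import math

beta1, se = 1.064, 0.2231  # PRIORFRAC estimate and standard error (Table 3.3)
z = 1.96                   # normal quantile for a 95% interval

or_point = math.exp(beta1)
ci = (math.exp(beta1 - z * se), math.exp(beta1 + z * se))
print(round(or_point, 1), round(ci[0], 2), round(ci[1], 2))  # -> 2.9 1.87 4.49
```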

POLYCHOTOMOUS INDEPENDENT VARIABLE

Suppose that instead of two categories the independent variable has k > 2 distinct values. In the GLOW study, the covariate self-reported risk is coded at three levels (less, same, and more). The cross-tabulation of it with fracture during follow-up (FRACTURE) is shown in Table 3.5. In addition, we show the estimated odds ratio, its 95% confidence interval and log-odds ratio for the same and more versus less risk.

The extension to a situation where the variable has more than three levels is not conceptually different, so all the examples in this section use k = 3. Using SPSS we obtain Tables 3.5 and 3.7.

Table 3.5 Cross-Classification of Fracture During Follow-Up (FRACTURE) by Self-Reported Rate of Risk (RATE RISK) from the GLOW Study, n = 500

FRACTURE * RATE RISK Crosstabulation

Count

               RATERISK
               1 (Less)   2 (Same)                   3 (More)                  Total
FRACTURE  0    139        138                        98                        375
          1    28         48                         49                        125
Total          167        186                        147                       500
Odds Ratio     1.0        (48×139)/(28×138) = 1.73   (49×139)/(28×98) = 2.48
95% CI         --         (1.02, 2.91)               (1.46, 4.22)
ln(OR)         0.0        0.55                       0.91

Table 3.6 Specification of the Design Variables for RATERISK Using Reference Cell Coding with Less as the Reference Group

RATE RISK(Code) RATERISK1 RATERISK2

Less(1) 0 0

Same(2) 1 0

More(3) 0 1
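The odds ratios in Table 3.5 can be reproduced directly from the cell counts, using level 1 (less) as the reference. A quick check (Python, standard library only):

```python
import math

# Counts of (no fracture, fracture) for each RATERISK level, from Table 3.5.
counts = {1: (139, 28), 2: (138, 48), 3: (98, 49)}
n0_ref, n1_ref = counts[1]  # reference level: "less"

# Odds ratio of each level versus the reference level.
or_same = (counts[2][1] * n0_ref) / (n1_ref * counts[2][0])
or_more = (counts[3][1] * n0_ref) / (n1_ref * counts[3][0])
print(round(or_same, 2), round(or_more, 2))                       # -> 1.73 2.48
print(round(math.log(or_same), 2), round(math.log(or_more), 2))   # -> 0.55 0.91
```

These match the coefficients for the two design variables in Table 3.7.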

Table 3.7 Results of Fitting the Logistic Regression Model to the Data in Table 3.5 Using the Design Variables in Table 3.6

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a RATERISK 11.247 2 .004
RATERISK(1) .546 .266 4.203 1 .040 1.727 1.024 2.911

RATERISK(2) .909 .271 11.242 1 .001 2.482 1.459 4.223

Constant -1.602 .207 59.831 1 .000 .201

Model Summary

Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square

1 550.578a .023 .034

Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Log-Likelihood = -275.28917

Table 3.7 gives the estimated coefficients, standard errors, p-values, and confidence intervals for the odds ratios.

CONTINUOUS INDEPENDENT VARIABLE

When a logistic regression model contains a continuous independent variable, interpretation

of the estimated coefficient depends on how it is entered into the model and the particular units of the variable. For purposes of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is

linear in the variable.

As an example, we show the results in Table 1.3 of a logistic regression of AGE on CHD status using the data in Table 1.1. The estimated logit is g(AGE) = ?5.310 + 0.111 × AGE. The estimated odds ratio for an increase of 10 years in

age is OR(10) = exp(10 × 0.111) = 3.03. Thus, for every increase of 10 years in age, the odds of CHD being present is estimated to increase by 3.03 times. The validity of this statement is questionable, because the increase in the odds of CHD

for a 40-year-old compared to a 30-year-old may be quite different from the odds for a 60-year-old compared to a 50-year-old. This is the unavoidable dilemma when a continuous covariate is modeled linearly in the logit and motivates the

importance of examining the linearity assumption for continuous covariates. The endpoints of a 95% confidence interval for this odds ratio are exp(10 × 0.111 ± 1.96 × 10 × 0.024) = (1.90, 4.86).
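The odds ratio for a c-unit change and its confidence interval follow the same exponentiation pattern. A minimal sketch (Python, standard library only; AGE values from Table 1.3):

```python
import math

beta, se = 0.111, 0.024  # AGE coefficient and standard error (Table 1.3)
c = 10                   # change of interest: 10 years of age

or_c = math.exp(c * beta)
ci = (math.exp(c * beta - 1.96 * c * se), math.exp(c * beta + 1.96 * c * se))
print(round(or_c, 2), round(ci[0], 2), round(ci[1], 2))  # -> 3.03 1.9 4.86
```

Note that both the log-odds ratio and its standard error scale by c before exponentiating.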

The interpretation of the estimated odds ratio for a continuous variable is similar to that of nominal scale variables. The main difference is that a meaningful change must be defined for the continuous variable.

MULTIVARIABLE MODELS

Fitting a series of univariable models, although useful for a preliminary analysis, rarely provides an adequate or complete analysis of the data in a study because the independent variables are usually associated with one another and may have different distributions within levels of the outcome variable. Thus, one generally uses a multivariable analysis for a more comprehensive modeling of the data. One goal of such an analysis is to statistically adjust the estimated effect

of each variable in the model for differences in the distributions of and associations among the other independent variables in the model. Applying this concept to a multivariable logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log-odds adjusting for the other variables in the model.

Another important aspect of multivariable modeling is to assess to what extent, if at all, the estimate of the log-odds of one independent variable changes depending on the value of another independent variable. When the odds ratio for one variable is not constant over the levels of another variable, the two variables are said to have a statistical interaction. In some applied disciplines statistical interaction is referred to as effect modification; this terminology describes the fact that the log-odds of one variable are modified or changed by values of the other variable.

We begin with an example where there is neither statistical adjustment nor statistical interaction. The data come from the GLOW study described in Dataset 2. The outcome variable is having a fracture during the first year of follow-up (FRACTURE). For the dichotomous covariate we use history of prior fracture (PRIORFRAC), and for the continuous covariate we use height in centimeters (HEIGHT). The results from the three fitted models are presented in Table 3.10. In discussing the results from the examples we use significance levels from the Wald statistics; in all cases, the same conclusions would be reached had we used likelihood ratio tests.

Table 3.10 Estimated Logistic Regression Coefficients, Standard Errors, Wald Statistics, p-Values and 95% CIs from Three Models Showing No Statistical Adjustment and No Statistical Interaction from the GLOW Study, n = 500

Variables in the Equation

Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

1. PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486

Constant -1.417 .130 117.908 1 .000 .243
a. Variable(s) entered on step 1: PRIORFRAC.

Variables in the Equation

Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

2. PRIORFRAC 1.013 .225 20.199 1 .000 2.754 1.770 4.284

HEIGHT -.045 .017 6.811 1 .009 .956 .924 .989

Constant 5.895 2.796 4.445 1 .035 363.095
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT.

Variables in the Equation

Model Variable B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

3. PRIORFRAC -3.055 5.790 .278 1 .598 .047 .000 3999.295

HEIGHT -.054 .022 6.216 1 .013 .947 .907 .988

HEIGHT by PRIORFRAC .025 .036 .494 1 .482 1.026 .956 1.101

Constant 7.361 3.510 4.398 1 .036 1573.846
a. Variable(s) entered on step 1: PRIORFRAC, HEIGHT, HEIGHT * PRIORFRAC.

The Wald Statistic for the coefficient of PRIORFRAC in Model 1 is significant with p < 0.001. When we add HEIGHT to the model the Wald statistics are significant at the 1% level for both covariates. Note that there is little change in the estimate of the coefficient for PRIORFRAC as

Δβ̂% = 100 × (1.064 − 1.013)/1.013 ≈ 5.0

indicating that the inclusion of HEIGHT does not statistically adjust the coefficient of PRIORFRAC. Thus we conclude that, in these data, height is not a confounder of prior fracture. The statistical interaction of prior fracture (PRIORFRAC) and height (HEIGHT) is added to Model 2 to obtain Model 3. The Wald statistic for the added product term has p = 0.482 and thus is not significant. In these data height is not an effect modifier of prior fracture. Hence, the choice is between Model 1 and Model 2. Even though the estimate of the effect of prior fracture is basically the same for the two models, we would choose Model 2, as height (HEIGHT) is not only statistically significant in Model 2 but is an important clinical covariate as well.
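The confounding check above compares the crude and adjusted coefficients for PRIORFRAC. A minimal sketch with the values from Table 3.10 (Python, standard library only):

```python
# Percent change in the PRIORFRAC coefficient when HEIGHT is added (Table 3.10).
b_crude = 1.064      # Model 1: PRIORFRAC alone
b_adjusted = 1.013   # Model 2: PRIORFRAC adjusted for HEIGHT

delta_pct = 100 * (b_crude - b_adjusted) / b_adjusted
print(round(delta_pct, 1))  # -> 5.0, well below a 20% confounding threshold
```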

CHAPTER 4

Model-Building Strategies and

Methods for Logistic Regression

INTRODUCTION

The goal of any method is to select those variables that result in a “best” model within the scientific context of the problem. In order to achieve this goal, we must have: (i) a basic plan for selecting the variables for the model and (ii) a set of methods for assessing the adequacy of the model both in terms of its individual variables and its overall performance. In this chapter, we discuss methods that address both of these areas.

PURPOSEFUL SELECTION OF COVARIATES

CASE STUDY 1 (The GLOW Study)

STATEMENT

For purposeful selection, we use the GLOW500 data.

OBJECTIVE

The GLOW500 data provide a good example of an analysis designed to identify risk factors for a specified binary outcome.

METHOD TOOL

We use the software tool SPSS.

Steps

Step 1: The first step in purposeful selection is to fit a univariable logistic regression model for each covariate. The results of this analysis are shown in Table 4.7. Note that in this table, each row presents the results for the estimated regression coefficient(s) from a model containing only that covariate.

Table 4.7 Results of Fitting Univariable Logistic Regression Models in the GLOW

Data, n = 500

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a AGE .053 .012 20.684 1 .000 1.054 1.031 1.079

Constant -4.779 .827 33.374 1 .000 .008
a. Variable(s) entered on step 1: AGE.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a WEIGHT -.005 .006 .656 1 .418 .995 .982 1.007

Constant -.727 .468 2.417 1 .120 .483
a. Variable(s) entered on step 1: WEIGHT.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a HEIGHT -.052 .017 9.134 1 .003 .950 .918 .982

Constant 7.212 2.744 6.910 1 .009 1356.000
a. Variable(s) entered on step 1: HEIGHT.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a BMI .006 .017 .112 1 .738 1.006 .972 1.040

Constant -1.258 .486 6.686 1 .010 .284
a. Variable(s) entered on step 1: BMI.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a PRIORFRAC 1.064 .223 22.741 1 .000 2.897 1.871 4.486

Constant -1.417 .130 117.908 1 .000 .243
a. Variable(s) entered on step 1: PRIORFRAC.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a PREMENO .051 .259 .038 1 .845 1.052 .633 1.749

Constant -1.109 .115 92.397 1 .000 .330
a. Variable(s) entered on step 1: PREMENO.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a MOMFRAC .661 .281 5.526 1 .019 1.936 1.116 3.358

Constant -1.196 .114 110.932 1 .000 .302
a. Variable(s) entered on step 1: MOMFRAC.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a ARMASSIST .709 .210 11.429 1 .001 2.032 1.347 3.066

Constant -1.394 .142 96.584 1 .000 .248
a. Variable(s) entered on step 1: ARMASSIST.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a SMOKE -.308 .436 .498 1 .480 .735 .313 1.727

Constant -1.079 .107 102.450 1 .000 .340
a. Variable(s) entered on step 1: SMOKE.

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a RATERISK 11.247 2 .004
RATERISK(1) .546 .266 4.203 1 .040 1.727 1.024 2.911

RATERISK(2) .909 .271 11.242 1 .001 2.482 1.459 4.223

Constant -1.602 .207 59.831 1 .000 .201
a. Variable(s) entered on step 1: RATERISK.

Step 2: We now fit our first multivariable model that contains all covariates that are significant in univariable analysis at the 25% level. The results of this fit are shown in Table 4.8. Once this model is fit we examine each covariate to ascertain its continued significance, at traditional levels, in the model. We see that the covariate with the largest p-value that is greater than 0.05

is for RATERISK_2, the design/dummy variable that compares women with RATERISK = 2 to women with RATERISK = 1. The likelihood ratio test for the exclusion of self-reported risk of fracture (i.e., deleting RATERISK_2 and RATERISK_3 from the model) is G = 5.96 which, with two degrees of freedom, yields p = 0.051, nearly significant at the 0.05 level.
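The likelihood ratio test quoted above can be verified numerically; for two degrees of freedom the chi-square upper tail again has the exact closed form exp(−G/2) (Python, standard library only):

```python
import math

G = 5.96  # LR statistic for deleting RATERISK_2 and RATERISK_3 (from the text)
p_value = math.exp(-G / 2)  # exact chi-square tail probability for df = 2
print(round(p_value, 3))  # -> 0.051, just above the 0.05 level
```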

Table 4.8 Results of Fitting the Multivariable Model with All Covariates Significant

at the 0.25 Level in the Univariable Analysis in the GLOW Data, n = 500

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a AGE .034 .013 6.930 1 .008 1.035 1.009 1.062

HEIGHT -.044 .018 5.759 1 .016 .957 .923 .992

PRIORFRAC .645 .246 6.877 1 .009 1.906 1.177 3.088

MOMFRAC .621 .307 4.095 1 .043 1.861 1.020 3.397

ARMASSIST .446 .233 3.667 1 .056 1.562 .990 2.465

RATERISK 5.820 2 .054
RATERISK(1) .422 .279 2.284 1 .131 1.525 .882 2.636

RATERISK(2) .707 .293 5.804 1 .016 2.028 1.141 3.604

Constant 2.709 3.230 .704 1 .402 15.019
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK.

Step 3: Next we check to see if covariate(s) removed from the model in Step 2 confound or are needed to adjust the effects of covariates remaining in the model. In results not shown, we find that the largest percent change is 17% for the coefficient of ARMASSIST. This does not exceed our criterion of 20%. Thus, we see that while the self-reported rate of risk is not a confounder it is an important covariate. No other covariates are candidates for exclusion and thus, we continue using the model in Table 4.8.

Step 4: On univariable analysis, the covariates for weight (WEIGHT), body mass index (BMI), early menopause (PREMENO) and smoking (SMOKE) were not significant. When each of these covariates is added, one at a time, to the model in Table 4.8 its coefficient did not become significant. The only change

of note is that the significance of BMI changed from 0.752 to 0.334. Thus the next step is to check the assumption of linearity in the logit of continuous covariates age and height.

Before moving to step 5 we consider another possible model. Since the coefficient for RATERISK_2 is not significant, one possibility is to combine levels 1 and 2, self-reported risk less than or the same as other women, into a new reference category; combining these two categories was thought to be reasonable.

Hence we fit this model and its results are shown in Table 4.9. In this model, the coefficient for the covariate RATERISK_3 now provides the estimate of the log of the odds ratio comparing the odds of fracture for individuals in level 3 to that of the combined group consisting of levels 1 and 2.

Table 4.9 Results of Fitting the Multivariable Model with the Dichotomized RATERISK Covariate, n = 500

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a AGE .033 .013 6.567 1 .010 1.034 1.008 1.060

HEIGHT -.046 .018 6.526 1 .011 .955 .921 .989

PRIORFRAC .664 .245 7.336 1 .007 1.943 1.201 3.142

MOMFRAC .664 .306 4.722 1 .030 1.943 1.067 3.536

ARMASSIST .473 .231 4.176 1 .041 1.604 1.020 2.525

RATERISK .458 .238 3.700 1 .054 1.581 .991 2.521

Constant 2.491 3.237 .592 1 .442 12.070
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, and RATERISK.

Step 5: At this point, we have our preliminary main effects model and must now check for the scale of the logit for continuous covariates age and height.

Step 6: The next step in the purposeful selection procedure is to explore possible interactions between the main effects. The subject matter investigators felt that each pair of main effects represents a plausible interaction. Hence, we fit models that individually added each of the 15 possible interactions to the main effects model. The results are summarized in Table 4.14. Three interactions are significant at the 10 percent level: Age by prior fracture (PRIORFRAC), prior fracture by mother had a fracture (MOMFRAC) and mother had a fracture by arms needed to rise from a chair (ARMASSIST). We note that prior fracture and mother having had a fracture are involved in two of the three significant interactions.

Table 4.14 Log-Likelihood, Likelihood Ratio Test (G, df = 1), and p-Value for the Addition of the Interactions to the Main Effects Model

Interaction           Log-Likelihood    G/Wald    p
Main effects model    -254.9090
Age*Height            -254.8420         0.13      0.716

Age*Priorfrac -252.3920 5.701 0.025

Age*Momfrac -254.8395 0.140 0.708

Age*Armassist -254.8360 0.146 0.702

Age*Raterisk -254.3855 1.50 0.305

Height*Priorfrac -254.8025 0.213 0.644

Height*Momfrac -253.7045 2.438 0.118

Height*Armassist -254.1115 1.588 0.208

Height*Raterisk -254.4220 0.990 0.320

Priorfrac*Momfrac -253.5095 2.793 0.095

Priorfrac*Armassist -254.7960 0.224 0.636

Priorfrac*Raterisk -254.8475 0.122 0.726

Momfrac*Armassist -252.5180 4.699 0.030

Momfrac*Raterisk -254.6425 0.533 0.465

Armassist*Raterisk -254.7925 2.230 0.135

The next step is to fit a model containing the main effects and the three significant interactions. The results of this fit are shown in Table 4.15. The three degree of freedom likelihood ratio test of the interactions model in Table 4.15 versus the main effects model in Table 4.9 is G = 11.03 with p = 0.012. Thus, in aggregate, the interactions contribute to the model. However, one interaction, prior fracture by mother's fracture, is not significant with a Wald statistic p = 0.191. Next, we fit the model excluding this interaction and the results are shown in Table 4.16.
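The three degree of freedom likelihood ratio test can be checked with the closed-form chi-square upper tail for 3 degrees of freedom, P[χ²(3) > x] = erfc(√(x/2)) + √(2x/π)·e^(−x/2) (Python, standard library only; G is taken from the text):

```python
import math

G = 11.03  # LR statistic: interactions model versus main effects model
# Closed-form upper tail of a chi-square distribution with 3 degrees of freedom.
p_value = math.erfc(math.sqrt(G / 2)) + math.sqrt(2 * G / math.pi) * math.exp(-G / 2)
print(round(p_value, 3))  # -> 0.012, so the interactions contribute in aggregate
```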

Table 4.15 Results of Fitting the Multivariable Model with the Addition of Three

Interactions, n = 500

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a AGE .058 .017 12.172 1 .000 1.060 1.026 1.095

HEIGHT -.049 .018 7.038 1 .008 .952 .919 .987

PRIORFRAC 4.598 1.878 5.993 1 .014 99.240 2.501 3937.490

MOMFRAC 1.472 .423 12.124 1 .000 4.360 1.903 9.986

ARMASSIST .626 .254 6.075 1 .014 1.869 1.137 3.074

RATERISK .474 .241 3.869 1 .049 1.607 1.002 2.577

AGE by PRIORFRAC -.053 .026 4.223 1 .040 .948 .901 .998

MOMFRAC by PRIORFRAC -.847 .648 1.711 1 .191 .429 .121 1.525

ARMASSIST by MOMFRAC -1.167 .617 3.580 1 .058 .311 .093 1.043

Constant 1.011 3.385 .089 1 .765 2.749
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, MOMFRAC * PRIORFRAC, ARMASSIST * MOMFRAC.

Table 4.16 Results of Fitting the Multivariable Model with the Significant Interactions, n = 500

Variables in the Equation

B S.E. Wald df Sig. Exp(B) 95% C.I.for EXP(B)

Lower Upper

Step 1a AGE .057 .017 12.060 1 .001 1.059 1.025 1.094

HEIGHT -.047 .018 6.501 1 .011 .954 .921 .989

PRIORFRAC 4.612 1.880 6.018 1 .014 100.715 2.527 4013.438

MOMFRAC 1.247 .393 10.064 1 .002 3.479 1.610 7.514

ARMASSIST .644 .252 6.538 1 .011 1.904 1.162 3.120

RATERISK .469 .241 3.794 1 .051 1.598 .997 2.562

AGE by PRIORFRAC -.055 .026 4.543 1 .033 .946 .899 .996

ARMASSIST by MOMFRAC -1.281 .623 4.225 1 .040 .278 .082 .942

Constant .779 3.381 .053 1 .818 2.180
a. Variable(s) entered on step 1: AGE, HEIGHT, PRIORFRAC, MOMFRAC, ARMASSIST, RATERISK, AGE * PRIORFRAC, ARMASSIST * MOMFRAC.

Interpretation

The estimated coefficients in the interactions model in Table 4.16 are, with one exception, significant at the five percent level. The exception is the estimated coefficient for the dichotomized self-reported risk of fracture, RATERISK3 (1 = more, 0 = same or less) with p = 0.051. We elect to retain this in the model since the covariate is clinically important and its significance is nearly five percent. Hence the model in Table 4.16 is our preliminary final model.

APPENDIX

DATASET 1

TABLE 1.1

Age, Age Group, and Coronary Heart Disease

(CHD) Status of 100 Subjects

ID AGE AGEGRP CHD

1 20 1 0
2 23 1 0
3 24 1 0
4 25 1 0
5 25 1 1
6 26 1 0
7 26 1 0
8 28 1 0
9 28 1 0
10 29 1 0
11 30 2 0
12 30 2 0
13 30 2 0
14 30 2 0
15 30 2 0
16 30 2 1
17 32 2 0
18 32 2 0
19 33 2 0
20 33 2 0
21 34 2 0
22 34 2 0
23 34 2 1
24 34 2 0
25 34 2 0
26 35 3 0
27 35 3 0
28 36 3 0
29 36 3 1
30 36 3 0
31 37 3 0
32 37 3 1
33 37 3 0
34 38 3 0
35 38 3 0
36 39 3 0
37 39 3 1
38 40 4 0
39 40 4 1
40 41 4 0
41 41 4 0
42 42 4 0
43 42 4 0
44 42 4 0
45 42 4 1
46 43 4 0
47 43 4 0
48 43 4 1
49 44 4 0
50 44 4 0
51 44 4 1
52 44 4 1
53 45 5 0
54 45 5 1
55 46 5 0
56 46 5 1
57 47 5 0
58 47 5 0
59 47 5 1
60 48 5 0
61 48 5 1
62 48 5 1
63 49 5 0
64 49 5 0
65 49 5 1
66 50 6 0
67 50 6 1
68 51 6 0
69 52 6 0
70 52 6 1
71 53 6 1
72 53 6 1
73 54 6 1
74 55 7 0
75 55 7 1
76 55 7 1
77 56 7 1
78 56 7 1
79 56 7 1
80 57 7 0
81 57 7 0
82 57 7 1
83 57 7 1
84 57 7 1
85 57 7 1
86 58 7 0
87 58 7 1
88 58 7 1
89 59 7 1
90 59 7 1
91 60 8 0
92 60 8 1
93 61 8 1
94 62 8 1
95 62 8 1
96 63 8 1
97 64 8 0
98 64 8 1
99 65 8 1
100 69 8 1

DATASET 2-

The Global Longitudinal Study of Osteoporosis in Women

The Global Longitudinal Study of Osteoporosis in Women (GLOW) is an international

study of osteoporosis in women over 55 years of age.

Code Sheet for Variables in the GLOW Study

Variable  Description                              Codes/Values                  Name
1         Identification code                      1-n                           SUB_ID
2         Study site                               1-6                           SITE_ID
3         Physician ID code                        128 unique codes              PHY_ID
4         History of prior fracture                1 = yes, 0 = no               PRIORFRAC
5         Age at enrollment                        Years                         AGE
6         Weight at enrollment                     Kilograms                     WEIGHT
7         Height at enrollment                     Centimeters                   HEIGHT
8         Body mass index                          kg/m^2                        BMI
9         Menopause before age 45                  1 = yes, 0 = no               PREMENO
10        Mother had a hip fracture                1 = yes, 0 = no               MOMFRAC
11        Arms are needed to stand from a chair    1 = yes, 0 = no               ARMASSIST
12        Former or current smoker                 1 = yes, 0 = no               SMOKE
13        Self-reported risk of fracture           1 = less than others of the
                                                   same age, 2 = same as others
                                                   of the same age, 3 = greater
                                                   than others of the same age   RATERISK
14        Fracture risk score                      Composite risk score          FRACSCORE
15        Any fracture in the first year           1 = yes, 0 = no               FRACTURE

IBM SPSS Statistics 20 Command Syntax Reference

TABLE 1.3

LOGISTIC REGRESSION VARIABLES CHD

/METHOD=ENTER AGE

/CRITERIA=PIN (.05) POUT (.10) ITERATE (20) CUT (.5).

TABLE 2.2

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE WEIGHT PRIORFRAC PREMENO RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/CRITERIA=PIN (.05) POUT (.10) ITERATE (20) CUT (.5).

TABLE 2.3

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE PRIORFRAC RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.2

CROSSTABS

/TABLES=FRACTURE BY PRIORFRAC

/FORMAT=AVALUE TABLES

/CELLS=COUNT

/COUNT ROUND CELL.

TABLE 3.3

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PRIORFRAC

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.7

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 3.10

MODEL 1:-

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PRIORFRAC

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

MODEL 2:-

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PRIORFRAC HEIGHT

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

MODEL 3:-

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PRIORFRAC HEIGHT HEIGHT*PRIORFRAC

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.7

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER WEIGHT

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER HEIGHT

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER BMI

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PRIORFRAC

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER PREMENO

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER MOMFRAC

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER ARMASSIST

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER SMOKE

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.8

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.9

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.15

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC MOMFRAC*PRIORFRAC

ARMASSIST*MOMFRAC

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).

TABLE 4.16

LOGISTIC REGRESSION VARIABLES FRACTURE

/METHOD=ENTER AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST RATERISK AGE*PRIORFRAC ARMASSIST*MOMFRAC

/CONTRAST (RATERISK) =Indicator (1)

/PRINT=CI (95)

/CRITERIA=PIN (0.05) POUT (0.10) ITERATE (20) CUT (0.5).