Epidemiology and Population Health Summer Institute at Columbia University

Analysis of Complex Survey Data

CLASS SESSIONS

Monday, June 13, 2011 – Friday, June 17, 2011, 1:00 PM to 5:00 PM

INSTRUCTOR

Katherine M. Keyes, PhD MPH

Columbia University Epidemiology Merit Fellow

(212) 543-5002

Kmk2104@columbia.edu

722 West 168th Street, 2nd Floor, Suite 229C

COURSE DESCRIPTION

This course will provide participants with practical skills to analyze data arising from complex epidemiologic sampling designs. Complex survey data violate typical assumptions about simple random samples of independent observations, thus requiring specialized statistical techniques. National Health and Nutrition Examination Survey (NHANES) data will be used for applied demonstrations, illustrating concepts applicable to all data sets arising from complex survey designs. We will discuss the theory behind complex sampling strategies and the necessity of applying appropriate statistical techniques to analyze these data and make valid inferences. We will demonstrate the appropriate use of sampling weights in the NHANES data and how the appropriate weight is specific to the research question being asked. We will demonstrate how to obtain basic descriptive statistics, appropriate variance estimates, regression parameters, and survival analysis output in SAS and SAS-callable SUDAAN software.

COURSE LEARNING OBJECTIVES

1. Understand the theoretical justifications for complex survey designs, and why special statistical procedures are needed.

2. Learn how to find publicly available data online, download and organize datasets, and manipulate data for analysis.

3. Use SUDAAN software to analyze complex survey data, and be able to do basic frequency and crosstab analysis, linear and logistic regression, and Cox proportional hazards models.

RECOMMENDED COURSE READING LIST

The recommended textbook is:

Research Triangle Institute (2008). SUDAAN Language Manual, Release 10.0. Research Triangle Park, NC: Research Triangle Institute.

COURSE STRUCTURE

Class time is 20 hours total. The structure of each class will be:

1:00-2:00 Lecture

2:00-3:30 Guided exercise

3:30-3:45 Break

3:45-5:00 Independent research project

Each class is divided into three parts:

Lecture. Dr. Keyes will give an overview of the topic of the day, review basic statistical theory and practice, and provide an overview of the use of SUDAAN software for the topics of the day.

Guided exercise. Dr. Keyes will provide example code for the topics of the day and an in-class assignment sheet. Students will run the code provided and answer the in-class assignment questions. We will discuss the answers as a group.

Independent research project. On the first day of class, students will identify a research question of interest in the NHANES data. By the end of the week, students will have completed an analysis of this research question on their own using the tools discussed in lecture.

COURSE SCHEDULE

Session 1 – Introduction to complex survey data

6-13-11

Lecture: Discuss the structure of common complex sampling designs

Discuss finding publicly available data, downloading and organizing data

Merging data in SAS; managing and manipulating study variables

In-Class Assignment: Locate variables, download data, append and merge data, identify, recode, and evaluate missing data in SAS

Independent practice: Select a research question in NHANES that will guide your independent practice throughout the week. Locate variables, download data, append and merge data, identify, recode, and evaluate missing data in SAS.

Required reading:

NHANES general data release document: http://www.cdc.gov/nchs/data/nhanes/nhanes_05_06/general_data_release_doc_05_06.pdf
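
The Session 1 data-management steps (merging files and recoding missing values) can be sketched in a few lines. The course labs use SAS; Python is used here only so the sketch can be run without SAS. It assumes that NHANES files are linked by the respondent sequence number SEQN and that "refused" and "don't know" answers to the illustrative item SMQ020 are coded 7 and 9. The toy rows and the helper names `merge_on_seqn` and `recode_missing` are hypothetical, and the codebook for each variable should be checked for its actual missing-value codes.

```python
def merge_on_seqn(demo_rows, other_rows):
    """Left-join other_rows onto demo_rows by respondent ID (SEQN)."""
    other_by_id = {row["SEQN"]: row for row in other_rows}
    merged = []
    for row in demo_rows:
        combined = dict(row)  # copy so the input rows are not modified
        combined.update(other_by_id.get(row["SEQN"], {}))
        merged.append(combined)
    return merged


def recode_missing(rows, column, missing_codes):
    """Set special codes (e.g. refused / don't know) to None in place."""
    for row in rows:
        if row.get(column) in missing_codes:
            row[column] = None
    return rows


# Toy data (hypothetical values): a demographics file and a questionnaire file.
demo = [
    {"SEQN": 1, "RIAGENDR": 1},
    {"SEQN": 2, "RIAGENDR": 2},
    {"SEQN": 3, "RIAGENDR": 2},
]
questionnaire = [
    {"SEQN": 1, "SMQ020": 1},
    {"SEQN": 2, "SMQ020": 9},  # assumed "don't know" code
]

merged = merge_on_seqn(demo, questionnaire)
recode_missing(merged, "SMQ020", missing_codes={7, 9})

# Evaluate missingness: recoded values plus respondents absent from the file.
n_missing = sum(1 for row in merged if row.get("SMQ020") is None)
print(n_missing, "of", len(merged), "respondents are missing SMQ020")
```

Keeping every respondent from the demographics file (a left join) preserves the full sample, which matters later because design-based variance estimation generally needs all sampled units, not only those with complete data.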

Session 1 Files

Here is the course syllabus:

Here are the PowerPoint slides for today's class:

Here is the Word document that guides you through extracting files from the NHANES website:

Here is an Excel spreadsheet to use as the basis of your data dictionary:

Here is the SAS program file that we will use for the lab today:

Session 2 – Univariate statistics

6-14-11

Lecture: Introduction to SUDAAN software basic language

Create macros in SAS to output large quantities of univariate statistics

Review application of basic statistical tests (what tests, when, why, assumptions)

In-Class Assignment: Practice the CROSSTAB, RATIO, and DESCRIPT procedures in SUDAAN to do univariate analysis and bivariate statistics including chi-square and t-test

Independent practice: Students will create a table of univariate and bivariate statistics associated with their research question of interest, categorize variables appropriately, and use chi-square and t-tests. Also, extract and code potential confounders such as sex, race, age, and income.

Session 2 Files

Here are the PowerPoint slides for today's class:

Here is the SAS program file for lab:

Here is the dataset that we created yesterday:
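
Separately from the course files listed above, the following sketch illustrates why Session 2 treats variance estimation as design-specific. It computes a weighted mean and a Taylor-linearized standard error from stratum and PSU identifiers. This is an illustration in Python under stated assumptions (PSUs treated as sampled with replacement within strata, at least two PSUs per stratum); it is not the course lab code and is not claimed to reproduce the exact computation of any SUDAAN procedure. The function name and the toy records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_se(records):
    """records: iterable of (stratum, psu, weight, y) tuples.

    Returns the weighted mean of y and its linearized standard error,
    assuming with-replacement sampling of PSUs within strata.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized contribution of each observation, summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(psus.values()) / n_h
        # Between-PSU variability within the stratum.
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, sqrt(variance)


# Toy data (hypothetical): (stratum, psu, weight, y)
toy = [
    (1, 1, 1.0, 10.0), (1, 1, 1.0, 12.0), (1, 2, 3.0, 20.0),
    (2, 1, 2.0, 30.0), (2, 2, 1.0, 14.0), (2, 2, 2.0, 16.0),
]
mean, se = weighted_mean_se(toy)
unweighted = sum(y for *_, y in toy) / len(toy)
print(f"weighted mean {mean:.2f} (SE {se:.2f}), unweighted mean {unweighted:.2f}")
```

In the toy data the weighted and unweighted means differ because heavily weighted observations have different values from lightly weighted ones. The standard error depends on variation between PSU totals rather than between individuals.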

Session 3 – Regression

6-15-11

Lecture: Introduction to linear, logistic, Poisson, and polytomous regression procedures in SUDAAN

Review application of regression procedures (what procedure, when, why, assumptions)

Discuss variable categorization (when are categories necessary, how should you categorize your variables)

In-Class Assignment: REGRESS, RLOGIST, MULTILOG, and LOGLINK procedures in SUDAAN

Independent practice: Students will conduct and interpret regression procedures related to their research question of interest.

Session 3 Files

Here are the PowerPoint slides for today's class:

Here is the SAS program file for today's lab:

Session 4 – Survival analysis

6-16-11

Lecture: Introduction to survival analysis and Cox proportional hazards models in SUDAAN

Discuss differences between Kaplan-Meier and life table

Discuss assumptions of Cox proportional hazards models

In-Class Assignment: KAPMEIER, SURVIVAL procedures in SUDAAN

Independent practice: Students will conduct and interpret survival analyses related to their research question of interest.

Session 4 Files

Here are the PowerPoint slides for today's class:

Here is the SAS program file for today's lab:

Here is an Excel file with a formatted Kaplan-Meier curve

Here is the lab from yesterday's class with code for predicted and conditional marginals added

Here is the SAS PROC SURVEY code
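
As an illustration separate from the course files listed above, the Kaplan-Meier (product-limit) estimate discussed in Session 4 can be sketched as follows. The sketch is in Python, is not the course lab code, and only produces point estimates. The optional `weights` argument is an assumption for illustration, replacing counts with sums of sampling weights; design-based standard errors would additionally need the stratum and PSU information. The function name and the toy data are hypothetical.

```python
def kaplan_meier(times, events, weights=None):
    """Product-limit survival estimate.

    times   : follow-up time for each subject
    events  : 1 if the event occurred at that time, 0 if censored
    weights : optional sampling weights (defaults to 1 for everyone)

    Returns a list of (time, survival) pairs at each distinct event time.
    Subjects censored at an event time are counted as at risk at that time.
    """
    if weights is None:
        weights = [1.0] * len(times)
    data = sorted(zip(times, events, weights))
    at_risk = float(sum(weights))
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0.0
        leaving = 0.0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            _, event, w = data[i]
            if event:
                deaths += w
            leaving += w
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data (hypothetical): five subjects, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
# Survival steps down to about 0.80, 0.60, and 0.30 at t = 2, 3, and 5.
```

Unlike a life table, which groups follow-up into fixed intervals, this estimate changes only at the observed event times.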

Session 5 – Special procedures and options

6-17-11

Lecture: Discuss data imputation: theories and assumptions

Discuss weight development

In-Class Assignment: HOTDECK for imputation and WTADJUST for creating weights

Independent practice: Students will finish their independent research project and practice imputing data for at least one of their variables of interest.

Session 5 Files

Slides

Code for imputation lab

Code for weighting lab