# Nutrition survey analysis - Associations

This question was posted in the Assessment forum area and has 13 replies.

### Dr. Farah Ibrahim

Normal user

7 Apr 2011, 07:37

I want to analyze associations between wasting and other variables (morbidity, sex, age group, etc.) in a nutrition survey that used 30 x 24 two-stage cluster sampling. I have tried different programs: Epi Info gives only the odds ratio, with no chi-square and no p-values, when a complex sample design is used; SPSS cannot analyse complex sample designs (only simple and systematic random sampling); and the ENA SMART statistical calculator also assumes simple random sampling.

Are there any recommendations for this type of analysis? Which inferential statistical tests should be used? Are there any guidelines for performing it in STATA?

### Kevin Sullivan

Professor

Normal user

12 Apr 2011, 15:10

Here is some info:

Epi Info, using the "Complex Sample ..." commands, can account for survey stratification, clusters, and sample weights. For a 2x2 table, it can provide the prevalence odds ratio, prevalence ratio, and prevalence difference with confidence limits and DEFF. It does not provide a p-value. I do have a spreadsheet that calculates the Wald statistic, which can be used to derive a p-value. Send me an e-mail and I will send it to you.

SAS can do these analyses with its survey procedures: PROC SURVEYFREQ and PROC SURVEYLOGISTIC.

SPSS can analyze survey data with the optional Complex Samples Module.

Stata can handle complex survey data as can R.

### Anonymous 81

Public Health Nutritionist

Normal user

12 Apr 2011, 15:36

Dear Kevin Sullivan,

I was wondering if you could share the spreadsheet with me.

### Kevin Sullivan

Professor

Normal user

12 Apr 2011, 20:09

Hi Kiross - I just e-mailed it to you. Kevin

### Mark Myatt

Consultant Epidemiologist

Frequent user

13 Apr 2011, 12:14

Be aware that this sort of analysis is likely to be very low powered compared to (e.g.) a case-control study because (1) you often end up with small numbers in table cells because the outcome is relatively rare and the exposure may also be rare, and (2) because the sample design reduces the effective sample size. This analysis will be limited to finding the largest effects.
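As a rough illustration of point (2), the effective sample size is often approximated as the actual sample size divided by the design effect (DEFF). The sketch below assumes that approximation; the DEFF values are hypothetical and are not taken from any survey discussed in this thread.

```python
def effective_sample_size(n: int, deff: float) -> float:
    """Approximate number of simple-random-sample observations
    that a cluster sample of size n is 'worth' for a given DEFF."""
    if deff <= 0:
        raise ValueError("design effect must be positive")
    return n / deff


n = 30 * 24  # 30 clusters x 24 children, as in the original question

for deff in (1.0, 1.5, 2.0):  # hypothetical design effects
    print(f"DEFF = {deff:.1f}: effective n = {effective_sample_size(n, deff):.0f}")
```

With a DEFF of 2, for example, the 720 children carry roughly the information of 360 children from a simple random sample, before any loss of power from rare outcomes or exposures.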

This is a common problem with analysing data from cross-sectional surveys as retrospective cohorts.

### Mark Myatt

Consultant Epidemiologist

Frequent user

13 Apr 2011, 13:43

The cell counts >= 5 rule is a technical issue regarding the assumptions behind chi-square tests. With cell counts < 5 you can use something like the Fisher-Irwin test, which is based on exact hypergeometric probabilities rather than a normal approximation to the binomial.

The issue here is more one of sample size and power. Low frequencies of outcomes and exposures mean that you will have varying, and often low, statistical power. You can do the analysis but it may not find weak or moderate strength associations.
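For readers who have not met the Fisher-Irwin (Fisher exact) test, here is a minimal sketch of the two-sided version for a 2x2 table, using only the Python standard library. The table counts are hypothetical. Note that this sketch assumes simple random sampling and makes no correction for a cluster design.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical small table: rows = exposed / unexposed, columns = wasted / not wasted.
print(fisher_exact_two_sided(3, 1, 1, 3))
```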

### Ranjith

Normal user

16 Aug 2012, 19:46

Is it correct to say that it is OK to use the 'Table' command under 'Statistics' in EPI INFO to run chi squared test and get results for a cluster survey as the survey design does not affect this test results here?

### Bradley A. Woodruff

Self-employed

Technical expert

16 Aug 2012, 20:51

No, it is not correct to use the chi square from the normal "Tables" command in EpiInfo to judge the statistical significance of a difference in some outcome between two or more subgroups of the survey sample if the sampling included cluster sampling. Chi square is meant to tell you the likelihood that a difference between subgroups in the survey sample has occurred solely as a result of sampling error; that is, there is no real difference in the population surveyed. Any measure of sampling error or any measure involving sampling error, such as confidence intervals or p values, will be affected by the increase in variance induced by cluster sampling compared to simple random sampling. The "Tables" command assumes simple random sampling. As a result, the chi square from the "Tables" command will underestimate the variance and p value, and therefore overestimate your confidence that there is a real difference in the population, not just a difference due to sampling error. This could lead to incorrect conclusions. In EpiInfo, you must use the command "Complex Sample Tables" and specify the variable containing the codes for PSU and, if applicable, the variables containing the statistical weights and codes identifying the strata.
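To make the variance point concrete, the sketch below compares a simple-random-sampling standard error for a prevalence with a cluster-based one. It assumes an unstratified, unweighted design and uses a common "ultimate cluster" ratio-estimator formula; this is an illustration, not necessarily the exact computation any particular package performs. The cluster counts are made up.

```python
from math import sqrt


def cluster_proportion(cases, sizes):
    """Prevalence with SRS and cluster-based standard errors, plus DEFF.

    cases[i] = number of children with the outcome in cluster i
    sizes[i] = number of children measured in cluster i
    """
    k = len(cases)
    n = sum(sizes)
    p = sum(cases) / n
    # Ratio-estimator (linearization) variance treating clusters as the units.
    var_cluster = k / (k - 1) * sum((y - p * m) ** 2 for y, m in zip(cases, sizes)) / n**2
    # Variance that would apply under simple random sampling.
    var_srs = p * (1 - p) / (n - 1)
    return p, sqrt(var_srs), sqrt(var_cluster), var_cluster / var_srs


# Hypothetical data: 8 clusters of 24 children with unevenly spread cases.
cases = [0, 1, 5, 2, 0, 6, 1, 3]
sizes = [24] * 8
p, se_srs, se_cluster, deff = cluster_proportion(cases, sizes)
print(f"p = {p:.3f}, SE (SRS) = {se_srs:.4f}, SE (cluster) = {se_cluster:.4f}, DEFF = {deff:.2f}")
```

When cases concentrate in a few clusters, as in this made-up example, the cluster-based standard error is larger than the SRS one, which is why a test that assumes SRS looks more significant than it should.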

### Mark Myatt

Consultant Epidemiologist

Frequent user

16 Aug 2012, 21:40

You can use chi-square tests with cluster sampled data but these must be corrected for the sample design. An uncorrected test would, with design effects above one, be likely to make a type I error (i.e. incorrectly rejecting the null hypothesis) more frequently than expected.

Common software packages (e.g. EpiInfo, STATA) provide methods to correct chi-square tests. You must specify complex sampling procedures. If you do not, the software reports the standard (uncorrected) test. Be sure to specify the sample design correctly. For a SMART type survey, this is simply a matter of telling the software which variable identifies the cluster. Specifying the design can be quite complicated for more complex designs.
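To show the direction and rough size of such a correction, here is a deliberately crude sketch: the ordinary Pearson chi-square for a 2x2 table is divided by an assumed design effect before the p-value is looked up. This is a simplified first-order adjustment for illustration only; survey software generally uses more refined corrections, so treat the output as indicative. The table counts and the DEFF value are hypothetical.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi2_p_1df(x: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


def design_corrected_chi2(chi2: float, deff: float) -> float:
    """Crude first-order correction: shrink the statistic by the design effect."""
    return chi2 / deff


# Hypothetical table: rows = ill / not ill, columns = wasted / not wasted.
chi2 = pearson_chi2_2x2(30, 170, 40, 480)
deff = 2.0  # assumed design effect, not estimated from real data

print(f"uncorrected: chi2 = {chi2:.2f}, p = {chi2_p_1df(chi2):.4f}")
corrected = design_corrected_chi2(chi2, deff)
print(f"corrected:   chi2 = {corrected:.2f}, p = {chi2_p_1df(corrected):.4f}")
```

With any design effect above one, the corrected p-value is larger than the uncorrected one, which is the type I error inflation described above seen from the other side.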

I hope this is of some use.

### Ranjith

Normal user

16 Aug 2012, 22:20

Thank you very much for the clarifications.

The problem is that Complex Sample Tables in EPI INFO does not provide chi-square test results (it only provides the risk ratio, odds ratio, and risk difference). All the other common software packages are quite expensive!

Thank you again.

### Bradley A. Woodruff

Self-employed

Technical expert

16 Aug 2012, 22:42

Exactly. This is why I've stopped using EpiInfo for analysis of larger survey datasets. Although it is somewhat cumbersome, you can use the CSample module in the old DOS EpiInfo v. 6.04d which can be downloaded from http://wwwn.cdc.gov/epiinfo/html/ei6_downloads.htm. If you are using Windows 7, you will need to use the virtual Windows XP environment, as EpiInfo for DOS won't run in Windows 7, at least the 64-bit version. And you will also have to export the data from the Microsoft Access format used by EpiInfo for Windows to a .REC file readable in DOS EpiInfo.

I thought they would add adjusted chi squares to the statistics output for Complex Sample Tables with the release of EpiInfo 7, but, alas, they did not. I suggest contacting EpiInfo support and complaining about this problem; it seriously impedes the utility of the program. I have heard that another program called "R" can handle complex sampling, and it is free. I have never used it, but from what I hear, it is not very user friendly. It can be downloaded from: http://www.r-project.org/. But given the cost of SPSS, STATA, SUDAAN, etc., it may be worth some pain to learn how to use it.

Good luck.

### Ranjith

Normal user

16 Aug 2012, 23:38

Thank you.

I didn't know that you could run a chi-square test in EPI 6. I will give it a try and also send an email to the EpiInfo helpdesk. (If the chi-square test is available in EPI 6, why was it taken out of the Windows version? I wonder...)

Perhaps other users, especially those who are using EPI-ENA to analyse non-anthropometric data could also send similar requests to urge CDC to add this feature?

### Mark Myatt

Consultant Epidemiologist

Frequent user

17 Aug 2012, 10:38

A few things ...

There is very little difference between using the CI on the risk ratio or the risk difference as a hypothesis test and using a chi-square test with a fixed p-value for rejection of the null (i.e. p < 0.05). The test is that the 95% CI on the risk ratio does not include one or that the 95% CI on the risk difference does not include zero.
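A small sketch of this idea, under stated assumptions: given a design-adjusted risk ratio and its 95% confidence limits (as reported by complex sample software), the check below tests whether the interval excludes one, and also backs out an approximate Wald p-value by assuming the interval is symmetric on the log scale. The function names and the example numbers are hypothetical.

```python
from math import erfc, log, sqrt

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def ci_excludes_null(lower: float, upper: float, null: float = 1.0) -> bool:
    """True if the interval does not contain the null value (1 for ratios, 0 for differences)."""
    return not (lower <= null <= upper)


def wald_p_from_ratio_ci(estimate: float, lower: float, upper: float) -> float:
    """Approximate two-sided p-value, assuming a log-symmetric 95% CI."""
    se_log = (log(upper) - log(lower)) / (2 * Z_95)
    z = log(estimate) / se_log
    return erfc(abs(z) / sqrt(2))


# Hypothetical design-adjusted results copied from survey software output.
for rr, lo, hi in [(1.80, 1.10, 2.95), (1.40, 0.90, 2.18)]:
    print(
        f"RR = {rr:.2f} ({lo:.2f}, {hi:.2f}): "
        f"CI excludes 1 = {ci_excludes_null(lo, hi)}, "
        f"approx. p = {wald_p_from_ratio_ci(rr, lo, hi):.3f}"
    )
```

Because the confidence limits already reflect the sample design, a test derived from them inherits that correction.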

I am not a Windows user but I do sometimes run EpiInfo v6.04d on Mac OS-X and BSD UNIX using a utility called DOSBOX. You can get DOSBOX for Windows for free from this site.

R provides complex survey sample analysis through the "survey" package. This works very like the "svy" commands in STATA. The package has been tested against SAS SUDAAN, STATA, SPSS, &c. and gives the same results with benchmark datasets. R is an extremely powerful programming language (based on S and S-Plus) with a rather steep learning curve. I have a (slightly dated) introduction here. This tutorial does not cover the use of the "survey" package but the first 20 or so pages should be enough to get you working with R.

These software packages all take a model-based approach to the problem. It is also possible to use resampling techniques (e.g. the bootstrap) to address it. With a PPS cluster sample you would have to use a "blocking" approach to creating replicates, with the blocks being individual clusters sampled with replacement. The "p-value" would, for a positive association, be something like:

p = (number of replicates with RR <= 1) / (number of replicates + 1)

R can also be used to implement these methods. We use a block bootstrap in the PSM survey method (it gives the same results as CSAMPLE in EpiInfo v6.04d). We use a weighted block bootstrap in the S3M survey method and in the RAM method. The advantage of the bootstrap is that it can be used for test statistics, such as a CI on the difference between two medians, that cannot be used with classical approaches.
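The blocking idea can be sketched in a few lines. The example below is in Python rather than R and is not the implementation used in any of the survey methods named above; it is a minimal, unweighted illustration that resamples whole clusters with replacement, recomputes the risk ratio for each replicate, and applies the p-value formula given earlier. The simulated survey, the function names, and the choice of 999 replicates are all assumptions.

```python
import random


def risk_ratio(children):
    """Risk ratio (exposed vs unexposed) from (exposed, outcome) pairs; None if undefined."""
    exposed_n = sum(1 for e, _ in children if e)
    unexposed_n = len(children) - exposed_n
    exposed_cases = sum(1 for e, o in children if e and o)
    unexposed_cases = sum(1 for e, o in children if not e and o)
    if exposed_n == 0 or unexposed_n == 0 or unexposed_cases == 0:
        return None
    return (exposed_cases / exposed_n) / (unexposed_cases / unexposed_n)


def block_bootstrap_rr(clusters, replicates=999, seed=1):
    """Block bootstrap: resample whole clusters, keeping each cluster's children together."""
    rng = random.Random(seed)
    k = len(clusters)
    rrs = []
    for _ in range(replicates):
        resampled = [rng.choice(clusters) for _ in range(k)]
        children = [child for cluster in resampled for child in cluster]
        rr = risk_ratio(children)
        if rr is not None:  # skip replicates where the ratio cannot be computed
            rrs.append(rr)
    if not rrs:
        raise ValueError("no replicate produced a defined risk ratio")
    rrs.sort()
    # One-sided "p-value" for a positive association, as in the formula above.
    p = sum(1 for rr in rrs if rr <= 1) / (len(rrs) + 1)
    lower = rrs[int(0.025 * len(rrs))]
    upper = rrs[min(len(rrs) - 1, int(0.975 * len(rrs)))]
    return p, (lower, upper)


def simulate_survey(n_clusters=30, cluster_size=24, seed=42):
    """Made-up survey data: each child is an (exposed, wasted) pair."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        shift = rng.uniform(-0.03, 0.03)  # cluster-level variation in risk
        cluster = []
        for _ in range(cluster_size):
            exposed = rng.random() < 0.3
            risk = (0.20 if exposed else 0.08) + shift
            cluster.append((exposed, rng.random() < risk))
        clusters.append(cluster)
    return clusters


p, (lower, upper) = block_bootstrap_rr(simulate_survey())
print(f"bootstrap p = {p:.3f}, 95% percentile CI for RR = ({lower:.2f}, {upper:.2f})")
```

A weighted variant would draw clusters with unequal probabilities; that is left out here to keep the sketch short.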