# Confidence Intervals for Complex Sampling

This question was posted in the Assessment and Surveillance forum area and has 8 replies.

### Ranjith

Normal user

17 May 2012, 13:29

### Kevin Sullivan

Normal user

17 May 2012, 16:08

### Mark Myatt

Consultant Epidemiologist

Frequent user

18 May 2012, 08:38

### Mark Myatt

Consultant Epidemiologist

Frequent user

23 May 2012, 09:45

### Mark Myatt

Consultant Epidemiologist

Frequent user

16 Jul 2012, 10:03

**Cluster / PPS:** Point estimates (odds ratios, risk ratios, means, &c.) derived from PPS cluster samples will be the same as if calculated from a simple random sample. The confidence interval around the estimate will not be the same (it will usually be wider). This is due to loss of sampling variation. It is possible to reduce this loss by careful sample design (i.e. increasing the number of clusters and / or using a within-cluster sampling scheme that helps to maintain sampling variation), although there will be a point at which the cost savings (the main reason for cluster sampling) are lost.

**Stratified sample:** Point estimates will usually differ from those calculated as if the data came from a simple random sample. This is because stratum-specific results must be weighted by some function of the stratum population before being combined into an overall estimate. The confidence interval around the estimate will not be the same (it will usually be narrower).

In the case of hypothesis testing (e.g. chi-square tests), most testing procedures are not optimal when data are autocorrelated, which is often the case with complex samples. The errors associated with a test may differ from those specified (i.e. p < 0.05 may not really be p < 0.05). There are special cases, such as a chi-square test for twinned observations (e.g. one person has two eyes), and you may be lucky enough to find a special case that applies.

There are a number of approaches to dealing with this problem. The most common is, probably, to ignore it and treat the data as if they came from a simple random sample. One approach (modelling) uses procedures that correct for the correlation; these vary in complexity and differ between tests. Another approach is to use resampling (e.g. the bootstrap), which is consistent, simple, and works well in most cases. Both modelling and resampling require familiarity to do properly.

It is probably easier to recast a hypothesis-testing problem as an estimation problem. Most problems are amenable to this approach. For example, the difference between two proportions (for which a chi-square test is commonly used) may be recast as a risk ratio (or odds ratio) with a 95% CI (90% for a one-sided test) that does not include one. You can do this sort of analysis with (e.g.) CSAMPLE.

I hope this is of some use.
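The cluster-level bootstrap mentioned above can be sketched as follows. This is a minimal illustration, not CSAMPLE's method: the data, cluster sizes, and replicate count are all made up, and whole clusters are resampled with replacement so that the within-cluster correlation is preserved in each replicate.

```python
# Minimal sketch of a percentile bootstrap CI for a proportion from a
# cluster sample. All data below are hypothetical and for illustration only.
import random

# Hypothetical data: one list of 0/1 outcomes per cluster.
clusters = [
    [1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0], [1, 0, 1, 0, 0], [0, 1, 0, 0, 0],
]

def bootstrap_ci(clusters, replicates=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI, resampling whole clusters with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        # Draw as many clusters as we started with, with replacement.
        sample = [rng.choice(clusters) for _ in clusters]
        cases = sum(sum(c) for c in sample)
        n = sum(len(c) for c in sample)
        estimates.append(cases / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * replicates)]
    upper = estimates[int((1 - alpha / 2) * replicates)]
    return lower, upper

low, high = bootstrap_ci(clusters)
print(f"95% CI for the proportion: {low:.3f} to {high:.3f}")
```

Resampling clusters (rather than individuals) is what makes the interval honest about the survey design: replicates that happen to draw homogeneous clusters produce more variable estimates, widening the interval accordingly.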

### Elijah Odundo

Information Manager- Nutrition Researcher & M&E

Normal user

12 Feb 2021, 11:33

Dear Dr. Mark Myatt and Dr. Kevin Sullivan,

My name is Odundo, a nutrition researcher working in Kenya. I am doing some analysis (acute malnutrition hotspot analysis) looking at historical data spanning over 10 years.

The data are mainly from SMART surveys, and I would like to estimate the CIs factoring in the sampling methodology (cluster sampling). Ideally, I want to produce graphs with error bars. The data are stored in MS Excel flat files, and I would like to set up a formula in Excel to give me the CIs.

Any assistance with this?

Using MS Excel I have calculated the CI for proportions, but this is obviously narrower than the CIs obtained from ENA for SMART software. I would like to factor in the margin of error on account of the clusters. I have the parameters such as the number of clusters, DEFF, etc. I am avoiding using the ENA software because that would be too manual and time consuming given the need to disaggregate the findings.

Thanks in advance!

### Bradley A. Woodruff

Self-employed

Technical expert

13 Feb 2021, 23:22

Dear Odundo:

Since neither Mark nor Kevin has submitted an answer to your question, let me give it a try. The simplest formulas to calculate 95% confidence intervals assuming simple random sampling (or its equivalent) are:

*Lower confidence limit = mean or proportion − (1.96 × standard error)*

*Upper confidence limit = mean or proportion + (1.96 × standard error)*

However, as you correctly state, the confidence intervals must account for complex sampling. The design effect (DEFF) is the multiplier that determines by how much the sample size must be inflated to maintain the same precision if complex sampling is done rather than simple random sampling. But when calculating measures of precision, such as confidence intervals, we use the square root of DEFF. Therefore, the equations to calculate confidence intervals for complex sampling surveys are:

*Lower confidence limit = mean or proportion − (1.96 × standard error × square root of DEFF)*

*Upper confidence limit = mean or proportion + (1.96 × standard error × square root of DEFF)*

So if you have the mean or proportion, the standard error calculated assuming simple random sampling, and the DEFF, you can calculate the appropriate 95% confidence intervals for estimates derived from data obtained by complex sampling.
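The calculation above is simple enough to sketch in a few lines. This is an illustration of the formula only, with made-up inputs (12% prevalence, n = 900, DEFF = 1.5); the same arithmetic could equally be set up as an Excel formula.

```python
# Sketch of the DEFF-adjusted 95% CI for a proportion described above.
# The inputs are hypothetical, chosen only to illustrate the arithmetic.
import math

def complex_ci(p, n, deff, z=1.96):
    """CI for a proportion from a cluster sample: the simple-random-sampling
    standard error is inflated by the square root of DEFF."""
    se = math.sqrt(p * (1 - p) / n)      # SRS standard error
    half_width = z * se * math.sqrt(deff)
    return p - half_width, p + half_width

# Example: 12% prevalence, n = 900, DEFF = 1.5
low, high = complex_ci(0.12, 900, 1.5)
print(f"95% CI: {low:.3f} to {high:.3f}")   # about 0.094 to 0.146
```

In Excel the equivalent lower limit would be something like `=P - 1.96*SQRT(P*(1-P)/N)*SQRT(DEFF)` with cell references substituted for `P`, `N`, and `DEFF`.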

There are other formulas for calculating confidence intervals which may or may not be more accurate in your situation. My general recommendation would be to use a statistical analysis software package that can account for cluster sampling and automatically calculate the appropriate 95% confidence intervals. Such software includes Epi Info, SAS, SPSS, Stata, and R.

I hope this is helpful.

### Elijah Odundo

Information Manager- Nutrition Researcher & M&E

Normal user

7 Mar 2021, 20:57

Many thanks Bradley

This is profoundly helpful!
