# Confidence Intervals for Complex Sampling

This question was posted in the Assessment and Surveillance forum area and has 5 replies.

### Ranjith

Normal user

17 May 2012, 13:29

Can you calculate confidence intervals for cluster sampling in Open Epi?

### Kevin Sullivan

Normal user

17 May 2012, 16:08

OpenEpi (www.OpenEpi.com) works like a calculator: summary data are entered and the calculations are presented. To calculate confidence limits for a proportion or percentage based on cluster sampling, information is needed for each cluster in order to calculate the design effect (DEFF). OpenEpi does not perform these calculations. However, if you have an estimate of the design effect, there is a link towards the bottom of the OpenEpi menu that leads to a web-based program that can do this calculation:

OpenEpi Prototypes

which will take you to www.sph.emory.edu/~cdckms

At this website click on:

Confidence intervals for a proportion with DEFF

In this module, if you provide the following, it will calculate various confidence limits for a proportion:

Numerator:

Denominator:

Number of Clusters:

Design Effect (DEFF):
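For readers who want to see the arithmetic behind such a module, here is a minimal sketch (not the Emory module's actual code) of a DEFF-adjusted Wald interval in Python. The number of clusters would normally feed a t-multiplier with (clusters - 1) degrees of freedom; this sketch uses a fixed z of 1.96 instead, and all figures in the example are invented.

```python
import math

def deff_adjusted_ci(numerator, denominator, deff, z=1.96):
    """Wald confidence interval for a proportion, inflated by DEFF.

    The simple-random-sample variance p(1 - p)/n is multiplied by the
    design effect, widening the interval to reflect clustering.
    """
    p = numerator / denominator
    se = math.sqrt(deff * p * (1 - p) / denominator)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Invented example: 120 cases out of 900 subjects sampled, DEFF = 2.0
p, lower, upper = deff_adjusted_ci(120, 900, deff=2.0)
```

With DEFF = 1 this reduces to the ordinary simple-random-sample interval; larger DEFF values widen it in proportion to the square root of the design effect.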

### Mark Myatt

Consultant Epidemiologist

Frequent user

18 May 2012, 08:38

The term "complex sampling" is very broad, covering both stratified and cluster sampling (and stratified cluster sampling). The procedures required to analyse (e.g.) a PPS cluster sample (such as SMART) are quite different from those required to analyse a stratified sample.

There are a number of packages that can handle complex sample survey data including EpiInfo (CSAMPLE), SPSS (Complex Samples module), STATA ("svy" commands), R/S-Plus ("survey" library), SUDAAN, &c. These tend to implement model-based procedures and yield approximate results. An alternative (non-parametric) approach is to use bootstrap / jack-knife estimators.

OpenEpi does not provide procedures for complex sample data. The EpiTable module in the MSDOS version of EpiInfo, which is available here, does provide some simple facilities for estimating proportions from cluster-sampled surveys.

I hope this is of some help.

### Mark Myatt

Consultant Epidemiologist

Frequent user

23 May 2012, 09:45

Forgot to mention that the SMART software (ENA for SMART) also handles two-stage cluster-sampled survey data, producing CIs with acceptable coverage and efficiency. This software is, however, specifically designed for SMART-type surveys and is not a general survey analysis package. If you need to estimate the prevalence of undernutrition using two-stage cluster-sampled surveys, then ENA is a useful tool. You can find it here.

### Mark Myatt

Consultant Epidemiologist

Frequent user

16 Jul 2012, 10:03

As I wrote above ... "The term 'complex sampling' is very broad, covering both stratified and cluster sampling (and stratified cluster sampling). The procedures required to analyse (e.g.) a PPS cluster sample (such as SMART) are quite different from those required to analyse a stratified sample".

In general terms ... for estimation:

**Cluster / PPS :** Point estimates (odds ratios, risk ratios, means, &c.) derived from PPS cluster samples will be the same as those calculated as if the data came from a simple random sample. The confidence interval around the estimate will not be the same (it will usually be wider). This is due to loss of sampling variation. It is possible to reduce this loss by careful sample design (i.e. increasing the number of clusters and / or using a within-cluster sampling scheme that helps to maintain sampling variation), although there will be a point at which the cost savings (the main reason for cluster sampling) are lost.
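As an illustration of this widening, a common back-of-envelope approximation (not mentioned in the post above, and with invented figures) is DEFF ≈ 1 + (m - 1)ρ, where m is the average cluster size and ρ the intracluster correlation:

```python
import math

# Invented figures for illustration: 600 subjects in clusters of about 20,
# with a modest intracluster correlation.
n, p = 600, 0.25
m, icc = 20, 0.05
deff = 1 + (m - 1) * icc                 # common approximation to DEFF

se_srs = math.sqrt(p * (1 - p) / n)      # SE as if simple random sample
se_cluster = se_srs * math.sqrt(deff)    # cluster-adjusted SE is larger

half_width_srs = 1.96 * se_srs
half_width_cluster = 1.96 * se_cluster   # the wider interval described above
```

With these (invented) values the design effect is close to 2, so the cluster-adjusted interval is roughly 1.4 times as wide as the naive one.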

**Stratified sample :** Point estimates will usually be different from those calculated as if the data came from a simple random sample. This is because stratum-specific results must be weighted by some function of stratum population before being combined to form an overall estimate. The confidence intervals around the estimates will not be the same (they will usually be narrower).
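A small sketch of the weighting step described above, with invented stratum figures:

```python
# Invented stratum figures: (stratum population, sample size, cases in sample)
strata = [
    (50_000, 200, 30),
    (30_000, 200, 60),
    (20_000, 100, 10),
]

total_pop = sum(pop for pop, _, _ in strata)

# Weight each stratum-specific proportion by its population share
p_weighted = sum((pop / total_pop) * (cases / n) for pop, n, cases in strata)

# Pooling the sample as if it came from a simple random sample gives a
# different point estimate, as described above
p_pooled = sum(c for _, _, c in strata) / sum(n for _, n, _ in strata)
```

Here the weighted estimate (0.185) differs from the pooled estimate (0.200) because the second stratum, with the highest proportion, is over-represented in the sample relative to its population share.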

In the case of hypothesis testing (e.g. chi-square tests), most testing procedures are not optimal when data are autocorrelated, which is often the case with complex samples. Errors associated with a test may be different from those specified (i.e. p < 0.05 may not really be p < 0.05). There are special cases, such as a chi-square test for twinned observations (e.g. one person has two eyes) - you may be lucky enough to find a special case that applies.

There are a number of approaches to dealing with this problem. The most common is, probably, to ignore it and treat the data as if they came from a simple random sample. One approach (modelling) uses procedures to correct for the correlation; these vary in complexity and with the test being used. Another approach is to use resampling (e.g. the bootstrap), which is consistent, simple, and works well in most cases. Both modelling and resampling require familiarity to do properly.

It is probably easier to recast a hypothesis-testing problem as an estimation problem. Most problems are amenable to this approach. For example, the difference between two proportions (for which a chi-square test is commonly used) may be recast as a risk ratio (or odds ratio) with a 95% CI (90% for a single-sided test) that does not include one. You can do this sort of analysis with (e.g.) CSAMPLE.
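A minimal sketch of the resampling approach mentioned above - a percentile bootstrap that resamples whole clusters, so that within-cluster correlation is preserved. The cluster counts in the example are invented:

```python
import random

def cluster_bootstrap_ci(clusters, reps=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for a proportion, resampling whole clusters.

    `clusters` is a list of (cases, size) pairs, one per cluster.
    Resampling clusters (rather than individuals) preserves the
    within-cluster correlation that a naive SRS analysis ignores.
    """
    rng = random.Random(seed)
    k = len(clusters)
    estimates = []
    for _ in range(reps):
        sample = [rng.choice(clusters) for _ in range(k)]
        cases = sum(c for c, _ in sample)
        total = sum(size for _, size in sample)
        estimates.append(cases / total)
    estimates.sort()
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper

# Invented example: 30 clusters of 30 subjects each
clusters = [(i % 10, 30) for i in range(30)]
lower, upper = cluster_bootstrap_ci(clusters)
```

The `survey` library for R and the "svy" commands in STATA mentioned earlier implement the model-based alternatives; the sketch above is only meant to show how little machinery the resampling route needs.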

I hope this is of some use.