# Clarification on SMART survey cluster numbering

This question was posted in the Assessment and Surveillance forum area and has 4 replies.

### Edwin

Normal user

15 Jun 2019, 12:28

We have conducted a SMART survey and I would like to seek expert opinion on a technical question regarding the analysis for our SMART survey data.

SMART uses PPS to assign groups of households to different locations, so for example you have cluster numbers 8, 9, 10…each representing 10 households from the different locations. A particularly large location may be assigned multiple cluster numbers, and therefore more households (say for example Village Shawar gets clusters 19, 20 and 21 together, and therefore 30 households total).
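To illustrate how a large location can end up with several cluster numbers, here is a minimal sketch of systematic PPS allocation. The village names other than Shawar, all population figures, and the function name `pps_systematic` are hypothetical and chosen only for illustration; this is not the exact routine used by any particular SMART software.

```python
import random

def pps_systematic(populations, n_clusters, seed=None):
    """Allocate clusters to locations by systematic PPS sampling.

    populations: dict mapping location name -> population size
                 (hypothetical figures in this example).
    Returns a dict mapping location name -> number of clusters allocated.
    """
    rng = random.Random(seed)
    total = sum(populations.values())
    interval = total / n_clusters            # sampling interval
    start = rng.random() * interval          # random start within the first interval
    points = [start + i * interval for i in range(n_clusters)]

    counts = {name: 0 for name in populations}
    cumulative = 0
    idx = 0
    for name, size in populations.items():
        cumulative += size
        # Every sampling point that falls inside this location's
        # cumulative population range gives it one cluster.
        while idx < n_clusters and points[idx] < cumulative:
            counts[name] += 1
            idx += 1
    # Guard against floating-point rounding at the very end of the list.
    if idx < n_clusters:
        counts[name] += n_clusters - idx
    return counts

# Hypothetical sampling frame: Shawar is much larger than its neighbours.
villages = {"A": 500, "Shawar": 3000, "B": 500, "C": 1000}
allocation = pps_systematic(villages, n_clusters=5, seed=1)
print(allocation)
```

With these made-up numbers the sampling interval is 1000, so Shawar (population 3000) always receives three of the five clusters, whatever the random start.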

The question is: Will it affect the analysis (confidence intervals mainly) if we assign households different cluster IDs, even if they are technically in the same geographic area/large village? As in the example of Village Shawar, should all households in Village Shawar be assigned cluster ID 19, or should they retain the original numbers from sampling (19, 20, 21)?

Looking forward to your feedback and clarification on this.

Thanks,

### Bradley A. Woodruff

Self-employed

Technical expert

16 Jun 2019, 23:23

Edwin:

As you know, the design effect is a measure of the effect of cluster sampling on the final precision of any estimates made from a sample selected using cluster sampling. The design effect is affected by 2 factors:

1) The average size of the clusters in the survey sample (M) and

2) The inherent heterogeneity of distribution of the outcome of interest, as measured by the intracluster correlation coefficient (ICC, sometimes called the rho).

The formula showing this relationship is:

Design effect = 1 + [(M - 1) x ICC]

So you can see that as either M or ICC increases, the design effect increases, indicating that precision decreases and confidence intervals widen. Not good. So we want the smallest average cluster size to maximize precision. That's why we say, given a total sample size, for example 500 households, we want a larger number of smaller clusters. 50 clusters of 10 households each is much better than 10 clusters of 50 households each.
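As a worked example of the formula above, the following sketch compares the two designs mentioned (50 clusters of 10 versus 10 clusters of 50). The ICC value of 0.05 is an assumption picked purely for illustration, and "effective sample size" here simply means the total sample size divided by the design effect.

```python
def design_effect(avg_cluster_size, icc):
    """Design effect = 1 + (M - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc

ASSUMED_ICC = 0.05  # illustrative value only, not taken from any survey

for n_clusters, cluster_size in [(50, 10), (10, 50)]:
    total_n = n_clusters * cluster_size
    deff = design_effect(cluster_size, ASSUMED_ICC)
    effective_n = total_n / deff
    print(f"{n_clusters} clusters x {cluster_size} households: "
          f"DEFF = {deff:.2f}, effective n = {effective_n:.0f}")
```

Under this assumed ICC, the design effect is about 1.45 for clusters of 10 and about 3.45 for clusters of 50, so the same 500 households carry far less information when grouped into fewer, larger clusters.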

You are interested in the case where more than 1 cluster is selected from the same primary sampling unit (in your example, the village of Shawar). And you want to know whether using a single cluster ID number for all the units selected in the same primary sampling unit (that is, all the households selected in Shawar) is better or worse than keeping these clusters separate by using different cluster ID numbers.

The effect of combining all the clusters into one is clear: this would increase the average cluster size, thus increasing the design effect and decreasing the precision. But there may be other considerations to this question. Regardless, in large primary sampling units, I usually divide the area into segments, then randomly select the number of segments I need depending on the number of clusters in that primary sampling unit, then select the required number of households from each selected segment. That way, the clusters are at least geographically distinct, coming from different areas of Shawar village. This may more accurately model what the computer assumes to be separate and distinct clusters.
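The segmentation approach described above could be sketched roughly as follows. The segment names, household lists, and the function `sample_segments` are hypothetical, and simple random selection of households within a segment is an assumption; in the field the household selection method may differ.

```python
import random

def sample_segments(segments, cluster_ids, households_per_cluster, seed=None):
    """Pick one distinct segment per cluster ID, then households within it.

    segments: dict mapping segment name -> list of household identifiers.
    cluster_ids: the cluster numbers allocated to this village by PPS.
    Returns dict mapping cluster ID -> {"segment": name, "households": [...]}.
    """
    rng = random.Random(seed)
    # Randomly choose as many distinct segments as there are clusters.
    chosen = rng.sample(sorted(segments), len(cluster_ids))
    result = {}
    for cid, seg in zip(cluster_ids, chosen):
        # Assumption: simple random sample of households within the segment.
        households = rng.sample(segments[seg], households_per_cluster)
        result[cid] = {"segment": seg, "households": households}
    return result

# Hypothetical layout of Shawar: four segments of 40 households each.
shawar = {
    name: [f"{name}-{i:02d}" for i in range(1, 41)]
    for name in ["north", "south", "east", "west"]
}
plan = sample_segments(shawar, cluster_ids=[19, 20, 21],
                       households_per_cluster=10, seed=3)
for cid, info in plan.items():
    print(cid, info["segment"], info["households"][:3], "...")
```

Each cluster ID then corresponds to a separately sampled part of the village, which is the situation in which keeping separate cluster IDs in the analysis matches how the sample was taken.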

Of course, the best way is to be sure that the primary sampling units are small enough so that none gets more than one cluster in the first stage of sampling.

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

17 Jun 2019, 09:19

The issue is, I think, whether there is one big cluster of n = 30 or three separate clusters of n = 10 each. Separate clusters are usually taken by segmenting the community and taking each cluster from separate segments (this works because PPS will select multiple clusters for the larger communities which will likely have several segments / areas).

As a general rule, the analysis of a sample should be informed by the way the sample was taken. If you have one large cluster then you should treat it as the same (i.e. one large) cluster (i.e. give them the same cluster ID). If you have separately sampled clusters within the same community then you should give them separate IDs.

Point estimates will not change as population weighting is decided beforehand by the PPS sampling procedure. Confidence intervals may change. As a general rule we usually want many small clusters for better precision but the effect will be complicated by within and between cluster variability. The effect on precision will probably not be large if you don't have too many large (combined) clusters.

You could try doing this both ways and see what happens. It is only a matter of changing cluster IDs. Make sure you back up your data first.
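A minimal sketch of trying it both ways is shown below. It uses a generic ratio-estimator (linearization) variance with clusters as the primary sampling units and a normal approximation for the interval; this is an assumption for illustration and not necessarily what ENA for SMART or any other package computes. The case counts per cluster are made-up data.

```python
import math
from collections import defaultdict

def cluster_proportion_ci(records, z=1.96):
    """Proportion and approximate 95% CI from (cluster_id, outcome) records.

    outcome is 1 for a case and 0 otherwise. The variance is the usual
    ratio-estimator formula with clusters treated as sampling units.
    """
    cases = defaultdict(int)
    sizes = defaultdict(int)
    for cluster_id, outcome in records:
        cases[cluster_id] += outcome
        sizes[cluster_id] += 1
    k = len(sizes)                       # number of clusters
    n = sum(sizes.values())              # total sample size
    p = sum(cases.values()) / n          # point estimate
    ss = sum((cases[c] - p * sizes[c]) ** 2 for c in sizes)
    se = math.sqrt(k / (k - 1) * ss / n ** 2)
    return p, p - z * se, p + z * se

# Hypothetical data: number of cases among 10 children in each cluster.
cases_per_cluster = {17: 1, 18: 2, 19: 0, 20: 3, 21: 1, 22: 2, 23: 1, 24: 0}
records = [(cid, 1 if i < n_cases else 0)
           for cid, n_cases in cases_per_cluster.items()
           for i in range(10)]

# Option 1: keep the original cluster IDs (19, 20, 21 stay separate).
separate = cluster_proportion_ci(records)

# Option 2: recode clusters 20 and 21 to 19 (one large Shawar cluster).
recode = {20: 19, 21: 19}
merged_records = [(recode.get(cid, cid), y) for cid, y in records]
combined = cluster_proportion_ci(merged_records)

print("separate IDs:", [round(v, 3) for v in separate])
print("combined ID: ", [round(v, 3) for v in combined])
```

The point estimate is the same under both codings because the same records are used; only the number and size of clusters entering the variance calculation change, so any difference shows up in the width of the interval.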

You should check my answer in the SMART forums:

https://smartmethodology.org/forum/forum/survey-design-sampling.

These seem moribund with no new posts for a couple of years or more. I think the SMART people monitor this forum.

You may want to contact CDC who had a role in developing the SMART method. I have forwarded your message to them.

### Martin

IMPACT

Normal user

18 Jun 2019, 06:41

Thanks for all answers!

@Bradley:

*"Design effect = 1 + [(M - 1) x ICC] *

*So you can see as either M or ICC increases, the design effect increases, indicating that precision decreases and confidence intervals widen. Not good. So we want the smallest average cluster size to maximize precision."*

Are you sure we can just pick the option that would give the best precision mathematically? Would that give us the actual precision we would see if we repeated the experiment 10,000 times?

@Mark:

*"As a general rule, the analysis of a sample should be informed by the way the sample was taken. If you have one large cluster then you should treat it as the same (i.e. one large) cluster (i.e. give them, the same cluster ID). If you have separately sampled clusters within the same community then you should give them separate IDs*."

This would have been my take as well. As far as I am aware, the issue comes from sampling from clusters with replacement, so some clusters come back multiple times. Which category would you say this falls in?

On a different note - How far can you go in "arbitrarily" separating clusters to increase the calculated precision?

Thanks again, any further advice would be sincerely appreciated!

m

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

23 Jul 2019, 13:59

Sorry I missed this question.

We never really sample with replacement. I have only done this with computer-intensive estimators (e.g. the blocked and weighted bootstrap in RAM and S3M surveys) but never as a real-world survey sampling procedure.

WRT separating clusters, I have used this with small sample (RAM) surveys in which we select 16 communities and then take a part of the sample from a number of secondary clusters decided by community layout. I think of this as a within-cluster spatially stratified sample and stick with 16 clusters in analysis. This works quite well and increases precision using implicit stratification. The RAM sample is also spatially stratified so we get a boost from that. We have found that this type of sample with m = 16 and n = 12 (overall n = 192) returns estimates with precision similar to a SMART survey with an overall sample size about three times larger than that.
