# Allocation of sample by proportion

This question was posted in the Assessment and Surveillance forum area and has 4 replies.

### Anonymous 81

Public Health Nutritionist

Normal user

25 Feb 2013, 07:27

We are in the process of conducting a survey of caregivers' satisfaction with the services, using systematic random sampling. The plan is to conduct it in five health facilities. We have calculated the sample size (403). The population and caseload differ across the five facilities. Of the total population, the proportions for clinics A, B, C, D and E are 19%, 16%, 17%, 40% and 8% respectively. Given this difference, can we allocate or apportion the total sample according to population size? If this is possible, the sample in each facility will be 78, 65, 70, 160 and 30 respectively.

### Mark Myatt

Frequent user

25 Feb 2013, 11:36

This is one approach. It gives a self-weighting sample over the five facilities.
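A minimal sketch of proportional allocation, assuming the percentages quoted in the question and a largest-remainder rounding rule (the rounding rule is my assumption, chosen so that the parts add up to the total). The function name `allocate_proportionally` is made up for illustration.

```python
import math

def allocate_proportionally(total_n, shares):
    """Split total_n across strata in proportion to shares.

    Uses largest-remainder rounding so the allocations sum to total_n.
    """
    total_share = sum(shares.values())
    raw = {k: total_n * v / total_share for k, v in shares.items()}
    alloc = {k: math.floor(x) for k, x in raw.items()}
    leftover = total_n - sum(alloc.values())
    # Give the leftover units to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda k: raw[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

# Population shares (%) for clinics A to E, as quoted in the question.
shares = {"A": 19, "B": 16, "C": 17, "D": 40, "E": 8}
allocation = allocate_proportionally(403, shares)
print(allocation)
```

With the quoted percentages this gives 77, 64, 69, 161 and 32. The figures in the question differ by a few units, possibly because the quoted percentages are rounded.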

Another approach would be to use a fixed sample size that would allow an estimate or classification to be made for each facility. For example, m = 5 and n = 96 would give an overall sample size of m * n = 480. This is not much different from your n = 403 but would give estimates with a 95% CI of ± 10% or better from each facility. You would then weight the samples by population size to get an overall average (this is not hard). The latter approach is preferred if you suspect that satisfaction will vary considerably between clinics.
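A rough sketch of the fixed sample size approach, assuming a simple random (or systematic) sample within each facility. The satisfaction counts are hypothetical and are there only to show the weighting step; the population shares are those given in the question.

```python
import math

def half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 50%) with n = 96 per facility: about +/- 10%.
worst_case = half_width(0.5, 96)

# Population shares from the question.
pop_share = {"A": 0.19, "B": 0.16, "C": 0.17, "D": 0.40, "E": 0.08}

# HYPOTHETICAL numbers of satisfied caregivers out of 96 in each facility.
satisfied = {"A": 72, "B": 48, "C": 60, "D": 84, "E": 36}
n_per_facility = 96

# Overall estimate: facility proportions weighted by population share.
overall = sum(pop_share[k] * satisfied[k] / n_per_facility for k in pop_share)

print(f"worst-case half-width: {worst_case:.3f}")
print(f"weighted overall proportion: {overall:.4f}")
```

Each facility gets its own estimate, and the overall figure is just the population-weighted average of those estimates.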

### Anonymous 2443

Normal user

25 Jan 2014, 23:09

Hi Mark, I am wondering how we can calculate the sample size if we do not have confirmed, recent population data for an area that was struck by disaster and from which many people have moved out.

### Mark Myatt

Frequent user

27 Jan 2014, 16:05

There are a number of issues here. I will try to go through them one-by-one.

If you have reasonably accurate and current population data then you can use the PPS approach (as in SMART) to select communities to include in the sample. As populations may be small you may need to consider taking smaller than usual clusters and more of them.

If you do not have the population data required for a PPS sample then you can take a systematic sample of communities. I would use a systematic spatial sampling method such as CSAS or sampling from lists of communities sorted by a spatial factor (e.g. sub-districts, chiefdoms, localities, clinic catchments &c.) to do this. You can then sample in communities as usual but you will also need to collect data that will allow you to estimate the populations of the sampled communities. This need only be something like counting roofs / doorways, estimating the proportion of empty dwellings, and (optional) estimating the mean number living under each roof / behind each doorway. Populations are estimated as either:

w = roofs (or doorways) * (1 - proportion empty)

or:

w = roofs (or doorways) * (1 - proportion empty) * mean HH size

These populations (labelled "w" for "weights" above) are then used when analysing the data. The PPS method (in effect) applies population weights before sampling (prior weighting). This method applies these weights after sampling (posterior weighting). Given reasonably accurate population data the two methods will give comparable results. The analysis procedure required is not, I think, available in ENA for SMART but can be done in SUDAAN, M-Plus, SAS, R, S-Plus, STATA, SPSS, in a spreadsheet, or by hand. If you need more information about this then post a follow-up request to this thread.
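A small sketch of the population estimate, assuming the weight should reflect occupied dwellings, i.e. roofs multiplied by the proportion that are not empty. The village data and the function name `population_weight` are hypothetical.

```python
def population_weight(roofs, prop_empty, mean_hh_size=1.0):
    """Estimate a community's population weight from a roof (or doorway) count.

    ASSUMPTION: occupied dwellings = roofs * (1 - proportion empty).
    Leave mean_hh_size at 1.0 if household size was not estimated.
    """
    if not 0.0 <= prop_empty <= 1.0:
        raise ValueError("prop_empty must be between 0 and 1")
    return roofs * (1.0 - prop_empty) * mean_hh_size

# HYPOTHETICAL field data: (name, roofs counted, proportion empty, mean HH size)
communities = [
    ("Village 1", 120, 0.25, 5.0),
    ("Village 2", 80, 0.50, 4.0),
    ("Village 3", 200, 0.75, 6.0),
]

weights = {name: population_weight(r, e, hh) for name, r, e, hh in communities}
print(weights)
```

Only the relative sizes of the weights matter for the analysis, so the optional household-size step changes the results only if household size differs between communities.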

Access may also be an issue. If you are not pressed for time (i.e. nothing such as an expected "state of emergency" declaration is about to shut down access) then you should take "contingency clusters". This means that you select a few more clusters than you need and use these to replace clusters that you cannot get to. Be sure to document what happened so users of the data know what is and isn't represented. If time is an issue then you can use a method such as RAM which uses fewer clusters (e.g. m = 16) than SMART, a more labour-intensive within-cluster sampling method, and a different estimation procedure (PROBIT) for estimating GAM / SAM prevalence. Sample sizes are relatively small (e.g. we have been using n = 192 in Sierra Leone and n = 200 in Sudan). If you need more information about RAM then post a follow-up request to this thread.

If your total population is small (i.e. less than about 5000) then you may be able to use a smaller sample size. The conventional approach is to calculate your sample size as you usually do and then multiply this by a "finite population correction" (FPC). Data analysis can be done with the usual tools but confidence intervals on estimates are adjusted by a second FPC. If you need more information about this then post a follow-up request to this thread.

You will also need to be careful with some indicators if there has (e.g.) been communal violence (a common reason for displacement) as this can lead to considerable clustering (e.g. of death and destruction) which can result in large survey design effects.

I hope this helps. Post a follow-up message to this thread if you need more information or if I have missed the point.

### Mark Myatt

Frequent user

28 Jan 2014, 16:09

I have received a request for more details regarding the statistical calculations for the methods outlined above.

I will try to give more detail as briefly as I can.

SAMPLE SIZE

A typical sample size calculation for a single proportion in a survey is:

n = DEFF * [(p * (1 - p)) / (precision / 1.96)^2]

where:

DEFF = Expected design effect (usually 2.0 unless we know better)

p = Expected proportion (choose 50% unless we know better)

precision = Desired half-width of the confidence interval (e.g. 0.03 for +/- 3%)

1.96 = Constant for a 95% confidence interval

For example, if we want to estimate a proportion of 10% with a 95% CI of +/- 3% with an expected design effect of 1.5 we would need a sample size of:

n = 1.5 * [(0.1 * (1 - 0.1)) / (0.03 / 1.96)^2] = 576
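The same calculation as a short Python sketch. The function name `sample_size` and its defaults are mine; the formula is the one given above, with precision taken as the half-width of the interval.

```python
def sample_size(p, precision, deff=2.0, z=1.96):
    """Sample size for estimating a single proportion.

    p         -- expected proportion (0 to 1)
    precision -- desired half-width of the confidence interval (0 to 1)
    deff      -- expected design effect
    z         -- 1.96 for a 95% confidence interval
    """
    return deff * (p * (1 - p)) / (precision / z) ** 2

# Worked example from the text: p = 10%, +/- 3%, DEFF = 1.5
n = sample_size(0.10, 0.03, deff=1.5)
print(round(n))
```

This reproduces the n = 576 of the worked example. I round to the nearest whole number here to match the text; rounding up is the more cautious choice.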

This calculation assumes a large population (e.g. N > c. 5000). If you have a smaller population then you can apply a finite population correction. This is:

new.n = (old.n * population) / (old.n + (population - 1))

where "old.n" is the sample size calculated using the first formula given above.

Continuing the example ... if (e.g.) we were sampling from a population of 1000 we would need a sample size of:

new.n = (576 * 1000) / (576 + (1000 - 1)) = 366

quite a saving!
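The finite population correction for sample size can be sketched in the same way. The function name `fpc_sample_size` is made up; the formula is the one above.

```python
def fpc_sample_size(old_n, population):
    """Apply the finite population correction to a sample size."""
    return (old_n * population) / (old_n + (population - 1))

# Worked example from the text: old.n = 576, population = 1000
new_n = fpc_sample_size(576, 1000)
print(round(new_n))
```

As the population grows the correction fades away: with a population of 100000 the same starting sample size only drops to about 573.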

DATA ANALYSIS WITH A FINITE POPULATION

Most statistical packages and estimation formulae assume a large population and will present confidence intervals that are **not** corrected for the size of the population. The FPC in this case is:

FPC = sqrt((population - n) / (population - 1))

Continuing the example we have an FPC of:

FPC = sqrt((1000 - 366) / (1000 - 1)) = 0.7966

We correct the uncorrected confidence limits by scaling their distances from the point estimate by this factor. If (e.g.) we had:

Point estimate = 10.67%

Lower confidence limit = 7.71%

Upper confidence limit = 14.21%

we would scale the confidence limits as:

Corrected LCL = 10.67 - (10.67 - 7.71) * 0.7966 = 8.31%

Corrected UCL = 10.67 + (14.21 - 10.67) * 0.7966 = 13.49%
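A minimal sketch of this correction in Python, using the numbers from the example. The function names `fpc` and `correct_limits` are my own.

```python
import math

def fpc(population, n):
    """Finite population correction factor for confidence limits."""
    return math.sqrt((population - n) / (population - 1))

def correct_limits(estimate, lcl, ucl, factor):
    """Shrink each confidence limit towards the point estimate by factor."""
    return (estimate - (estimate - lcl) * factor,
            estimate + (ucl - estimate) * factor)

# Worked example from the text: population = 1000, n = 366
factor = fpc(1000, 366)
lcl, ucl = correct_limits(10.67, 7.71, 14.21, factor)
print(f"FPC = {factor:.4f}, corrected 95% CI = {lcl:.2f}% to {ucl:.2f}%")
```

The two limits are scaled separately because the uncorrected interval need not be symmetric about the point estimate.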

POSTERIOR WEIGHTING

The first step is to calculate weights for each sampled community. The simplest approach is:

w = N / sum(N)

where:

N = population (e.g. number of roofs) in a sampled community

sum(N) = total population in ALL sampled communities

We can then calculate a point estimate as:

p = sum(w * (c / n))

where:

p = point estimate

w = weight (calculated as above)

c = number of cases in a sampled community

n = sample size in a sampled community

Here are some example coverage data with eight villages sampled (you'd have more):

| Village | Pop | w | c | n | c / n | w * c / n |
|---|---|---|---|---|---|---|
| 1 | 115 | 115/900=0.13 | 29 | 38 | 29/38=0.76 | 0.13*0.76=0.0988 |
| 2 | 91 | 91/900=0.10 | 18 | 32 | 18/32=0.56 | 0.10*0.56=0.0560 |
| 3 | 121 | 121/900=0.13 | 36 | 43 | 36/43=0.84 | 0.13*0.84=0.1092 |
| 4 | 114 | 114/900=0.13 | 15 | 35 | 15/35=0.43 | 0.13*0.43=0.0559 |
| 5 | 98 | 98/900=0.11 | 14 | 42 | 14/42=0.33 | 0.11*0.33=0.0363 |
| 6 | 104 | 104/900=0.12 | 10 | 37 | 10/37=0.27 | 0.12*0.27=0.0324 |
| 7 | 132 | 132/900=0.15 | 5 | 39 | 5/39=0.13 | 0.15*0.13=0.0195 |
| 8 | 125 | 125/900=0.14 | 23 | 42 | 23/42=0.55 | 0.14*0.55=0.0770 |
| SUMS | 900 | 1.00 | 150 | 308 | NA | 0.4851 |

The point estimate is 0.4851 (48.51%).
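For anyone who prefers a script to a spreadsheet, here is a short Python sketch of the same weighted estimate using the eight villages above. The function name `weighted_estimate` is my own. It carries full precision rather than rounding each step to two decimal places, so the answer differs slightly from the hand calculation.

```python
# (population, cases, sample size) for each sampled village, from the table above
villages = [
    (115, 29, 38), (91, 18, 32), (121, 36, 43), (114, 15, 35),
    (98, 14, 42), (104, 10, 37), (132, 5, 39), (125, 23, 42),
]

def weighted_estimate(data):
    """Population-weighted proportion: sum of w * (c / n) with w = N / sum(N)."""
    total_pop = sum(pop for pop, _, _ in data)
    return sum((pop / total_pop) * (c / n) for pop, c, n in data)

p = weighted_estimate(villages)
print(f"{p:.4f}")
```

Without the intermediate rounding the estimate comes out at about 0.484 (48.4%) rather than 0.4851.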

This sort of calculation can be done in a spreadsheet.

The calculation of the 95% confidence interval is also a little involved:

p +/- 1.96 * sqrt(sum((w^2 * (c / n) * (1 - c / n)) / n))

where the sum is taken over all sampled communities.

Continuing using the data above:

| Village | w^2 | c / n | 1 - (c / n) | (w^2 * (c / n) * (1 - c / n)) / n |
|---|---|---|---|---|
| 1 | 0.0169 | 0.76 | 0.24 | 0.00008112 |
| 2 | 0.0100 | 0.56 | 0.44 | 0.00007700 |
| 3 | 0.0169 | 0.84 | 0.16 | 0.00005282 |
| 4 | 0.0169 | 0.43 | 0.57 | 0.00011835 |
| 5 | 0.0121 | 0.33 | 0.67 | 0.00006370 |
| 6 | 0.0144 | 0.27 | 0.73 | 0.00007671 |
| 7 | 0.0225 | 0.13 | 0.87 | 0.00006525 |
| 8 | 0.0196 | 0.55 | 0.45 | 0.00011550 |

SUM = 0.00065045

SQRT(SUM) = 0.02550392

The 95% CI is then:

Lower 95% CL = 0.4851 - 1.96 * 0.02550392 = 0.4351 (43.51%)

Upper 95% CL = 0.4851 + 1.96 * 0.02550392 = 0.5351 (53.51%)
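A Python sketch of the point estimate and its 95% CI for the same eight villages, assuming the variance formula given above (the sum over communities of w^2 * (c / n) * (1 - c / n) / n). The function name `weighted_ci` is my own. It does not round intermediate values, so the limits differ a little from the hand calculation.

```python
import math

# (population, cases, sample size) for each sampled village
villages = [
    (115, 29, 38), (91, 18, 32), (121, 36, 43), (114, 15, 35),
    (98, 14, 42), (104, 10, 37), (132, 5, 39), (125, 23, 42),
]

def weighted_ci(data, z=1.96):
    """Return (estimate, standard error, lower limit, upper limit)."""
    total_pop = sum(pop for pop, _, _ in data)
    p = 0.0
    variance = 0.0
    for pop, c, n in data:
        w = pop / total_pop      # posterior weight for this community
        prop = c / n             # proportion observed in this community
        p += w * prop
        variance += (w ** 2) * prop * (1 - prop) / n
    se = math.sqrt(variance)
    return p, se, p - z * se, p + z * se

p, se, lower, upper = weighted_ci(villages)
print(f"p = {p:.4f}, SE = {se:.4f}, 95% CI = {lower:.4f} to {upper:.4f}")
```

At full precision this gives roughly 0.484 with a 95% CI of about 0.434 to 0.533, close to the hand-calculated 43.51% to 53.51%.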

This sort of calculation can also be done in a spreadsheet.

I hope this is useful.

Someone should check my work.

Did I miss anything?