# Coverage survey using Blocked Weighted Bootstrap

This question was posted in the Assessment and Surveillance forum area and has 14 replies.

### Roman

Normal user

24 Apr 2014, 10:27

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

27 Apr 2014, 10:32

**Blocked:** The block corresponds to the primary sampling unit (PSU = cluster). PSUs are resampled with replacement. Observations within the resampled PSUs are also sampled with replacement.

**Weighted:** RAM and S3M samples do not use population proportional sampling (PPS) to weight the sample prior to data collection (e.g. as is done with SMART surveys). This means that a posterior weighting procedure is required. BBW uses a "roulette wheel" algorithm (see illustration below) to weight (i.e. by population) the selection probability of PSUs in bootstrap replicates. In the case of prior weighting by PPS all clusters are given the same weight. With posterior weighting (as in RAM or S3M) the weight is the population of each PSU. This procedure is very similar to the fitness proportional selection technique used in evolutionary computing.

A total of m PSUs are sampled with replacement for each bootstrap replicate (where m is the number of PSUs in the survey sample). The required statistic is applied to each replicate. The reported estimate consists of the 0.025 (95% LCL), 0.5 (point estimate), and 0.975 (95% UCL) quantiles of the distribution of the statistic across all survey replicates.

Early versions of the BBW did not resample observations within PSUs, following:

```
Cameron AC, Gelbach JB, Miller DL, Bootstrap-based improvements
for inference with clustered errors, Review of Economics and
Statistics, 2008:90;414–427
```

and used a large number (e.g. 3999) of survey replicates. Current versions of the BBW resample observations within PSUs and use a smaller number of survey replicates (e.g. n = 400). This is a more computationally efficient approach.
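The procedure described above can be sketched in a few lines of R. This is only an illustrative sketch and not the BBW code referred to in this post: the data are made up, and the names `survey`, `psu_pop`, `roulette` and `bbw_replicate` are hypothetical. It assumes one row per child, a `psu` column identifying the cluster, and a 0/1 `covered` indicator.

```r
# Made-up survey data: one row per child; 'psu' is the cluster identifier,
# 'covered' is 1 if the child is covered and 0 otherwise
set.seed(1)
survey <- data.frame(psu = rep(1:10, each = 12),
                     covered = rbinom(120, size = 1, prob = 0.4))
# Made-up PSU populations (the posterior weights)
psu_pop <- data.frame(psu = 1:10,
                      pop = c(250, 900, 400, 1200, 300, 650, 800, 150, 500, 1000))

# Roulette wheel: each PSU gets a slice of the wheel proportional to its
# weight and each spin of the wheel lands in one slice
roulette <- function(ids, weights, m)
{
  wheel <- cumsum(weights)                    # upper edge of each slice
  spins <- runif(m) * wheel[length(wheel)]    # m spins of the wheel
  ids[findInterval(spins, wheel) + 1]         # slice each spin landed in
}

# One blocked weighted bootstrap replicate: resample m PSUs (weighted, with
# replacement), then resample observations within each selected PSU
bbw_replicate <- function(survey, psu_pop, statistic)
{
  m <- nrow(psu_pop)
  picked <- roulette(psu_pop$psu, psu_pop$pop, m)
  blocks <- lapply(picked, function(id)
  {
    block <- survey[survey$psu == id, , drop = FALSE]
    block[sample(nrow(block), replace = TRUE), , drop = FALSE]
  })
  statistic(do.call(rbind, blocks))
}

# 400 replicates of the coverage proportion
estimates <- replicate(400,
  bbw_replicate(survey, psu_pop, function(d) mean(d$covered)))

# Point estimate (median) with 95% LCL and UCL
result <- quantile(estimates, probs = c(0.5, 0.025, 0.975))
result
```

The hand-written `roulette()` is there only to show the mechanism. R's built-in `sample(ids, m, replace = TRUE, prob = weights)` selects PSUs with the same probabilities.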
The BBW has been implemented in the R language for Data Analysis and Graphics. The current code (as of 21/03/2014) for the BBW is available here. This code usually forms part of a larger survey analysis workflow.
I am happy to help you get this to work.
BTW : I think an unweighted cluster sample using a spatial sample is the best approach to estimating coverage.
I hope this is of some use.

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

28 Apr 2014, 11:45

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

1 May 2014, 08:16

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

29 May 2014, 08:49

### Ernest Guevarra

Valid International

Frequent user

2 Jun 2014, 07:42

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

2 Jun 2014, 14:49

### Mark Myatt

Epidemiologist at Brixton Health

Frequent user

29 Sep 2015, 08:51

```
Replicate      BP      EP      Difference (BP - EP)
------------   -----   -----   --------------------
1              0.121   0.105    0.016
2              0.133   0.114    0.019
3              0.125   0.129   -0.004
4              0.091   0.113   -0.022
.              .       .        .
.              .       .        .
.              .       .        .
500            0.120   0.112    0.008
------------   -----   -----   --------------------
```

(4) There are two ways to proceed from here. If you are interested in estimating the magnitude of the difference then find the MEDIAN difference (this is the point estimate of the difference) and the 2.5th percentile and the 97.5th percentile (these are the lower and upper 95% confidence limits of the difference). If the confidence interval contains zero then you might conclude that there is no significant difference. If you want a (one-sided) p-value then count the number of differences that are less than or equal to zero and divide that by the number of replicates. If (e.g.) there were 11 differences <= 0 and 500 replicates then p = 11 / 500 = 0.0220.
I present here an example coded in R of a standard bootstrap example related to weight gain in two groups of pigs on different dietary supplements:
```
#
# The weight gains on the two diets
#
diet1 <- c(31, 34, 29, 26, 32, 35, 38, 34, 31, 29, 32, 31)
diet2 <- c(26, 24, 28, 29, 30, 29, 31, 29, 32, 26, 28, 32)
#
# Accumulator for the differences
#
differences <- NULL
#
# Take 500 replicates
#
for(i in 1:500)
{
  #
  # Replicates are mean weight gains on each diet
  #
  r1 <- mean(sample(diet1, replace = TRUE))
  r2 <- mean(sample(diet2, replace = TRUE))
  #
  # Differences
  #
  differences <- c(differences, r1 - r2)
}
#
# Estimates
#
quantile(differences, probs = c(0.5, 0.025, 0.975))
#
# A p-value
#
z <- ifelse(differences <= 0, 1, 0)
sum(z) / 500
```

When I ran this I got difference = 3.17 (95% CI = 1.08; 5.58) with p = 0.0020. Results will vary a little from run to run because the resampling is random. Similar results can be obtained using a simple t-test.
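For comparison, here is a minimal sketch of the t-test on the same data. It assumes that Welch's two-sample test (the default of R's `t.test()`) is an acceptable stand-in for the "simple t-test" mentioned above. Note that `t.test()` reports a two-sided p-value by default, whereas the bootstrap p-value above is one-sided.

```r
#
# The weight gains on the two diets
#
diet1 <- c(31, 34, 29, 26, 32, 35, 38, 34, 31, 29, 32, 31)
diet2 <- c(26, 24, 28, 29, 30, 29, 31, 29, 32, 26, 28, 32)
#
# Welch two-sample t-test (does not assume equal variances)
#
tt <- t.test(diet1, diet2)
#
# Difference in mean weight gain (diet1 - diet2)
#
difference <- unname(tt$estimate[1] - tt$estimate[2])
difference
#
# 95% confidence interval for the difference and two-sided p-value
#
tt$conf.int
tt$p.value
```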
You may ask "Why use the bootstrap?" ... several answers:
(1) It is very efficient with respect to sample size.
(2) It is (in the form given above) non-parametric, using empirical rather than theoretical distributions. There are no assumptions of (e.g.) normality to violate.
(3) We can use any statistic we want. There is (e.g.) no classical test for differences in medians. For the bootstrap above we can do this by replacing "mean" with "median". We could easily have looked at total weight gain by replacing "mean" with "sum". Classical tests are limited to a few statistics.
(4) What you see above is as complicated as it gets. Classical tests can get complicated quite quickly.
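Point (3) can be sketched by passing the statistic as an argument. This is only an illustration: `boot_diff` is a hypothetical helper name, and the seed is set only so that the sketch is reproducible.

```r
#
# The weight gains on the two diets
#
diet1 <- c(31, 34, 29, 26, 32, 35, 38, 34, 31, 29, 32, 31)
diet2 <- c(26, 24, 28, 29, 30, 29, 31, 29, 32, 26, 28, 32)
#
# Bootstrap the difference in any statistic between two groups
#
boot_diff <- function(x, y, statistic, replicates = 500)
{
  replicate(replicates,
    statistic(sample(x, replace = TRUE)) - statistic(sample(y, replace = TRUE)))
}
set.seed(1)
#
# Difference in median weight gain: estimate, 95% CI, one-sided p-value
#
d_median <- boot_diff(diet1, diet2, median)
quantile(d_median, probs = c(0.5, 0.025, 0.975))
p_median <- mean(d_median <= 0)
p_median
#
# Difference in total weight gain
#
d_sum <- boot_diff(diet1, diet2, sum)
quantile(d_sum, probs = c(0.5, 0.025, 0.975))
```

Here `mean(d_median <= 0)` gives the same proportion as counting the differences that are <= 0 and dividing by the number of replicates.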
Anyway ... I hope this helps.