# Individual data entry in mortality data set of ENA for HH size > 20

This question was posted in the Assessment and Surveillance forum area and has 10 replies.

### Anonymous 2411

INGO

Normal user

29 Nov 2013, 18:01

### Mark Myatt

Frequent user

2 Dec 2013, 09:12

All this data entry seems like a lot of fuss, since we usually only want (and, given sample size limitations, can only get) a simple estimate of the crude mortality rate. In this case you need only enter cluster-level summaries. The required calculations can be done in a spreadsheet. The calculations are shown in this Field Exchange article. You may find this spreadsheet useful.

### Victoria Sauveplane

Senior Program Manager, Action Against Hunger CA

Normal user

3 Dec 2013, 14:05

For the latest version of the ENA software (Version November 16th, 2013) or any other questions relating to the SMART methodology, please refer to the SMART website: www.smartmethodology.org

### Tamsin Walters

en-net moderator

Forum moderator

4 Dec 2013, 13:21

*From Juergen Erhardt:*

Dear Mark, thanks for the suggestion to split the households. I think this is the easiest solution until we have increased the maximum number of members per household. In ENA there is also a section where the data are entered at the cluster level. It is nearly identical to the Excel spreadsheet from your link but also gives the confidence intervals adjusted for cluster sampling and the design effect, which cannot be calculated in Excel. It's good to know that the simple estimate at the cluster level is often sufficient.

Professional demographers have sometimes pushed me to remove this section from ENA. They thought that offering this option would mean the better procedure, entering the data at the household level, would be used less.

### Mark Myatt

Frequent user

4 Dec 2013, 16:19

where 'P' is the proportion in the sample, 'Pi' is the proportion in each cluster, and 'K' is the number of clusters. This does give "the confidence intervals adjusted for cluster sampling and the design effect". This works because cumulative incidence can be treated as a proportion. The best that can be said about this procedure is that it returns an approximate 95% CI.

I see no advantage in entering household-level data. Am I missing something?
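A minimal Python sketch of the calculation described above. The formula itself is not reproduced in this post, so the sketch assumes the commonly used PPS cluster estimator SE = sqrt(sum((Pi - P)^2) / (K * (K - 1))), with the 95% CI taken as P ± 1.96 × SE. The function name and the example numbers are hypothetical and are not taken from the spreadsheet.

```python
import math


def cluster_proportion_ci(cases, sizes, z=1.96):
    """Approximate 95% CI for a proportion from cluster-level summaries.

    Assumed estimator (not confirmed by the post):
        SE = sqrt(sum((Pi - P)^2) / (K * (K - 1)))
    cases: number of events (e.g. deaths) in each cluster
    sizes: number of people in each cluster
    """
    if len(cases) != len(sizes) or len(cases) < 2:
        raise ValueError("need matching lists with at least two clusters")
    k = len(cases)                                  # K, number of clusters
    p = sum(cases) / sum(sizes)                     # P, proportion in the sample
    p_i = [c / n for c, n in zip(cases, sizes)]     # Pi, proportion in each cluster
    se = math.sqrt(sum((pi - p) ** 2 for pi in p_i) / (k * (k - 1)))
    lower = max(0.0, p - z * se)                    # a proportion cannot be negative
    upper = p + z * se
    return p, se, lower, upper


# Hypothetical example: four clusters of 100 people each.
p, se, lower, upper = cluster_proportion_ci([1, 0, 2, 1], [100, 100, 100, 100])
print(f"P = {p:.4f}, SE = {se:.5f}, 95% CI = {lower:.4f} to {upper:.4f}")
```

With only a handful of clusters the interval is very rough; the example is there to show the arithmetic, not a realistic survey.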

### Juergen Erhardt

Normal user

4 Dec 2013, 21:18

I’m not a statistician and therefore I don’t know exactly why a SUDAAN procedure is recommended for calculating the confidence interval in cluster sampling. It’s quite complicated and can only be calculated with special software. Since we integrated this into ENA some years ago, I thought it was useful to mention it. In the Excel spreadsheet you linked to I also couldn’t find the calculation of the design effect. As far as I know it can’t be done in Excel, or only in a simplified form.

On the collection of mortality data, Court Robinson (one of the authors of the article you cited) always told me that mortality data shouldn’t be collected at the cluster level. Therefore we added the collection of mortality data at the household level to ENA. It’s supposed to be more accurate and to enable a more detailed analysis. He can probably give more information on this.

### Kevin Sullivan

Professor

Normal user

4 Dec 2013, 21:35

As Mark states, mortality data could be summarized, entered, and analyzed at the cluster level. You can take the cluster design into account using a spreadsheet. One modification I would make to Mark's formula is to use the t-statistic with k-1 degrees of freedom rather than 1.96. Mark's spreadsheet uses the variance approach assuming PPS sampling; however, many statistical programs would analyze the data as a one-stage cluster survey, so the results may differ, usually by a small amount. I am also not sure about the approach to converting to rates in the spreadsheet. I would need to look into this further, but it seems approximately correct. I have developed spreadsheets that can perform these types of analyses and allow the user to input the number of clusters, since there could be more or fewer than 30 clusters.

The original issue deals with the way the data are entered using ENA. If a household has more than 20 individuals, it seems the additional individuals could be placed into a different household. In terms of the analyses, this does not seem like it would affect mortality estimates.

### Mark Myatt

Frequent user

5 Dec 2013, 09:52

is implemented in the spreadsheet (that was my intention anyway). The cluster-specific components of the standard error (SE) are in cells E4:E33. These are summed in cell E34. The standard error is calculated in cell H18. The SE is calculated directly rather than by calculating a design effect and using that to correct the SE calculated as for a simple random sample. I suppose you could get at the design effect (which you may want for sample size related calculations) by calculating the SE as for a simple random sample and then dividing this into the SE calculated using the formula given above.
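A minimal Python sketch of that design effect idea. It assumes the cluster SE is the PPS form sqrt(sum((Pi - P)^2) / (K * (K - 1))) and the simple random sample SE is sqrt(P * (1 - P) / n). The function name and example data are hypothetical. Note that the ratio of the two SEs is on the standard error scale; squaring it gives the design effect as a ratio of variances, which is the form normally used in sample size calculations.

```python
import math


def design_effect(cases, sizes):
    """Return (SE ratio, design effect) from cluster-level summaries.

    Both SE formulas are assumptions made for this sketch:
      cluster SE = sqrt(sum((Pi - P)^2) / (K * (K - 1)))
      SRS SE     = sqrt(P * (1 - P) / n)
    """
    if len(cases) != len(sizes) or len(cases) < 2:
        raise ValueError("need matching lists with at least two clusters")
    k = len(cases)
    n = sum(sizes)
    p = sum(cases) / n
    if p <= 0.0 or p >= 1.0:
        raise ValueError("SRS standard error is zero when P is 0 or 1")
    se_cluster = math.sqrt(
        sum((c / m - p) ** 2 for c, m in zip(cases, sizes)) / (k * (k - 1))
    )
    se_srs = math.sqrt(p * (1 - p) / n)
    ratio = se_cluster / se_srs      # ratio of standard errors
    return ratio, ratio ** 2         # squared ratio = ratio of variances


# Hypothetical example: four clusters of 100 people each.
ratio, deff = design_effect([1, 0, 2, 1], [100, 100, 100, 100])
print(f"SE ratio = {ratio:.3f}, design effect = {deff:.3f}")
```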

Kevin is right, use of the t-distribution (rather than the standard normal) would improve the coverage of the 95% CI. This could be done by changing these cells:

H19 ... change to ... =H17-T.INV(0.975,H3-1)*H18

H20 ... change to ... =H17-T.INV(0.975,H3-1)*H18

Someone should check this. Perhaps Kevin or Juergen should verify this and review the spreadsheet so we can be sure that I am not proposing the use of a "broken" tool. I will then make fixes as required.

I suppose the issue with cluster-level data is that it is generally a bad idea to investigate a clustered phenomenon with a clustered sample, and we expect mortality caused by factors such as infectious diseases and communal violence to cluster. I am confused, however, by the distinction made between collecting data at the cluster level and at the household level since (in most SMART surveys) you still have a cluster sample. Most estimators will aggregate data at the cluster level anyway. I can't see how entering data at the household level can change the fact that we have a cluster sample. Am I missing something? Perhaps Court can clarify this issue.

### Mark Myatt

Frequent user

5 Dec 2013, 09:57

Should have been:

H19 ... change to ... =H17-T.INV(0.975,H3-1)*H18

H20 ... change to ... =H17+T.INV(0.975,H3-1)*H18
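The corrected cell formulas can be sketched in Python as follows. This assumes, from the formulas themselves, that H17 holds the point estimate, H18 the standard error, and H3 the number of clusters. The Python standard library has no equivalent of `T.INV`, so the critical value is passed in; 2.045 is the two-sided 95% value for 29 degrees of freedom (30 clusters) from standard t tables. The function name and input values are hypothetical.

```python
def t_interval(estimate, se, t_crit):
    """Return (lower, upper) = estimate -/+ t_crit * se.

    estimate: point estimate (assumed to be cell H17)
    se:       standard error (assumed to be cell H18)
    t_crit:   critical value of the t-distribution with K - 1 degrees
              of freedom, i.e. what T.INV(0.975, H3-1) returns
    """
    if se < 0 or t_crit <= 0:
        raise ValueError("se must be non-negative and t_crit positive")
    return estimate - t_crit * se, estimate + t_crit * se


T_CRIT_29_DF = 2.045  # 30 clusters -> 29 degrees of freedom (from t tables)

# Hypothetical estimate and standard error.
lower_t, upper_t = t_interval(0.010, 0.004, T_CRIT_29_DF)
lower_z, upper_z = t_interval(0.010, 0.004, 1.96)
print(f"t-based: {lower_t:.5f} to {upper_t:.5f}")
print(f"z-based: {lower_z:.5f} to {upper_z:.5f}")
```

The t-based interval is slightly wider than the one using 1.96, and the difference grows as the number of clusters falls.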

### Mark Myatt

Frequent user

30 Dec 2013, 13:37

The spreadsheet now uses the *t*-distribution with (clusters - 1) degrees of freedom rather than the standard normal distribution to calculate the 95% confidence interval. This modification was suggested by Kevin (see above) and should improve the coverage of the calculated confidence interval.

You can get the new version here.