Tony Crowther: RSMET week 8 Activities

In 2007, HM Revenue & Customs (HMRC) lost CDs containing the confidential personal details of 25 million child benefit recipients.

To decide the sample size needed to be 99% confident that the sampling error (the confidence interval) is below 1%, you could use the calculator linked below. I have included the link so that you can try it yourself.

http://www.surveysystem.com/sscalc.htm
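The same kind of calculation can be sketched in a few lines of Python. This is only an illustration, not the calculator's own code: it assumes the standard sample-size formula for a proportion with a finite population correction, a worst-case proportion of 50%, and a rounded z-value of 2.58 for 99% confidence. The function name `sample_size` and its parameters are my own. Under those assumptions it gives 16,630 for a population of 25 million, the sample size used later in this post.

```python
import math


def sample_size(population, margin, z, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    population: number of records in the sampling frame
    margin:     confidence interval as a fraction (0.01 means 1%)
    z:          z-value for the confidence level (assumed 2.58 for 99%)
    p:          assumed proportion; 0.5 is the worst case
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it slightly because the population is finite.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


needed = sample_size(population=25_000_000, margin=0.01, z=2.58)
print(needed)
```

Using a more precise z-value (about 2.576) would give a slightly smaller figure, so the result depends on how the z-value is rounded.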

The following is taken from the article:

“The data contained on the discs contained the child benefit information of 25 million claims. The data was made up from names, addresses, dates of birth, and National Insurance numbers. It also contained the bank account details of more than 7 million parents, guardians and carers claiming child benefit.”

http://www.silicon.com/special-features/digital-defences/2007/11/20/missing-25-million-child-benefit-records-39169217/

Using the data contained on the discs, a number of methods could be used to sample the data for statistical analysis. The sampling frame is the contents of the two CDs. In the first instance, I suggest sorting the data alphabetically by the claimants' surnames and then assigning a number to each record, so that each record has a unique ID.

Sampling the data

To find a systematic sample of the population, divide the population by the sample size to get the sampling interval. Taking 16,630 as the required sample size, 25,000,000 / 16,630 ≈ 1,503, which means we could take every 1,503rd record. This spreads the sample evenly over the population of 25 million (25,000,000) and is easier to conduct than a simple random sample.
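The interval calculation and the selection of every k-th record can be sketched as follows. This is a minimal sketch, assuming the 25 million records reported in the article have been numbered 0 to 24,999,999 after sorting, and that the starting record is chosen at random within the first interval. The function name `systematic_sample_ids` is hypothetical.

```python
import random


def systematic_sample_ids(population, sample_size, seed=None):
    """Return the sampling interval and the record IDs of a systematic sample."""
    # Interval = population / sample size, rounded down to a whole number.
    interval = population // sample_size
    # Pick a random starting record within the first interval.
    rng = random.Random(seed)
    start = rng.randrange(interval)
    # Take every interval-th record from the start to the end of the frame.
    return interval, list(range(start, population, interval))


interval, ids = systematic_sample_ids(25_000_000, 16_630, seed=42)
print(interval, len(ids))
```

Because the interval is rounded down, the sample comes out a few records larger than the target of 16,630.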

It would also be possible to stratify the data based on age or on the number of children per claim/household. This would give us distinct groups to test and would produce statistics describing the differences based on the ages of the children or the number of children per family.

This method could achieve greater precision: the members within each stratum would be similar in their attributes, making it possible to analyse each stratum individually and then compare the results across the whole sample.
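A stratified draw of this kind could look something like the sketch below. It assumes proportional allocation (each stratum contributes in proportion to its share of the population) and a simple random sample within each stratum. The record layout, the `children` field and the function name `stratified_sample` are all made up for illustration and do not come from the HMRC data.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, total_n, seed=None):
    """Draw a proportionally allocated random sample from each stratum."""
    rng = random.Random(seed)
    # Group the records into strata using the supplied key function.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = {}
    for label, members in strata.items():
        # Proportional allocation, with at least one record per stratum.
        k = max(1, round(total_n * len(members) / len(records)))
        k = min(k, len(members))
        # Simple random sample without replacement inside the stratum.
        sample[label] = rng.sample(members, k)
    return sample


# Hypothetical records: 300 claims with 1, 2 or 3 children each.
records = [{"id": i, "children": 1 + i % 3} for i in range(300)]
sample = stratified_sample(records, key=lambda r: r["children"], total_n=30, seed=1)
for label in sorted(sample):
    print(label, len(sample[label]))
```

The same function could be used with a region field as the key, which would give the regional groups discussed next.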

It would also be possible to divide the population within the frame into regional groups, draw conclusions for each region and then compare the findings. This could be a good method for determining regional variations within the population.

Using this method on the entire population within the frame and then selecting a random sample of records from each of the regions would also provide some interesting statistical data. For example, it might show that families in the north of the country tend to be larger and therefore claim more in benefits. (This example is only used to illustrate the point and has no basis in fact.)

Quota sampling would be possible but not as effective as stratified sampling, because the sample would be non-random and therefore less representative of the population.

Therefore in this situation I would prefer to use a stratified sample.

Tony Crowther

Monday, 6 December 2010
