Reads the Health Survey for England 2009 and performs basic cleaning.

read_2009(
  root = c("X:/", "/Volumes/Shared/")[1],
  file =
    "HAR_PR/PR/Consumption_TA/HSE/Health Survey for England (HSE)/HSE 2009/UKDA-6732-tab/tab/hse09ai.tab",
  select_cols = c("tobalc", "all")[1]
)

Arguments

root

Character string - the root directory. This is the part of the file path that may vary depending on how the network drive is accessed. The default is "X:/", which corresponds to the University of Sheffield's X drive in the School of Health and Related Research. Within the function, the root is pasted onto the front of the file path given in the 'file' argument. If root = NULL, the 'file' argument must therefore contain the complete file path.
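As a minimal sketch of how 'root' and 'file' combine, assuming the function joins them with paste0() (the text above says only that they are "pasted"), and using hypothetical paths for illustration:

```r
# Hypothetical values for illustration only; these are not the package defaults.
root <- "X:/"
file <- "path/to/hse09ai.tab"

# With a root, the two parts are joined into a single path.
full_path <- paste0(root, file)

# With root = NULL, paste0() drops the zero-length argument,
# so 'file' must already hold the complete path.
full_path_no_root <- paste0(NULL, "C:/data/hse09ai.tab")

print(full_path)
print(full_path_no_root)
```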

file

Character string - the file path, including the name and extension of the file. The function has been designed and tested to work with tab-delimited files ('.tab'). Files are read by the function [data.table::fread].

select_cols

Character string - select either: "all" - keep all variables in the survey data; "tobalc" - keep a reduced set of variables associated with tobacco and alcohol consumption and a selected set of survey design and socio-demographic variables that are needed for the functions within the hseclean package to work.

Value

Returns a data table.

Survey details

The HSE 2009 included a general population sample of adults and children, representative of the whole population at both national and regional level, and a boost sample of children aged 2-15. A sub-sample was identified in which the main survey was supplemented with objective measures of physical activity and fitness. For the general population sample, 4,680 addresses were randomly selected in 360 postcode sectors, issued over twelve months from January to December 2009. Where an address was found to have multiple dwelling units, one was selected at random. Where there were multiple households at a dwelling unit, up to three households were included, and if there were more than three, a random selection was made. At each address, all households, and all persons in them, were eligible for inclusion in the survey. Where there were three or more children aged 0-15 in a household, two of the children were selected at random. A nurse visit was arranged for all participants who consented.

In addition to the core general population sample, a boost sample of children aged 2-15 was selected using 12,600 addresses, some in the same postcode sectors as the core sample and some in an additional 180 postcode sectors to supplement the sample obtained in the core sectors. As for the core sample, where there were three or more children in a household, two of the children were selected at random to limit the respondent burden for parents. There was no nurse follow up for this child boost sample.

A total of 4,645 adults and 3,957 children were interviewed, with 1,147 children from the core sample and 2,810 from the boost. The household response rate was 68% for the core sample and 74% for the boost sample. 3,261 adults and 807 children had a nurse visit.

Weighting

Individual weight

For analyses at the individual level, the weighting variable to use is (wt_int). These weights are generated separately for adults and children:

  • for adults (aged 16 or more), the interview weights are a combination of the household weight and a component which adjusts the sample to reduce bias from individual non-response within households;

  • for children (aged 0 to 15), the weights are generated from the household weights and the child selection weights; the selection weights correct for only including a maximum of two children in a household. The combined household and child selection weights were adjusted to ensure that the weighted age/sex distribution matched that of all children in co-operating households.

For analysis of children aged 0-15 in both the Core and the Boost sample, taking into account child selection only and not adjusting for non-response, the (wt_child) variable can be used. For analysis of children aged 2-15 in the Boost sample only, the (wt_childb) variable can be used.
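A minimal sketch of an individual-level weighted estimate using (wt_int) is shown below. The data are made up, and the column names 'age' and 'smoker' are hypothetical stand-ins rather than variables confirmed to be in the survey file. A full analysis would also need to account for the clustered survey design, which this sketch ignores.

```r
# Toy data standing in for the table returned by read_2009().
# 'age' and 'smoker' are hypothetical column names used for illustration.
toy <- data.frame(
  age    = c(25, 40, 67, 10),
  smoker = c(1, 0, 1, NA),
  wt_int = c(0.8, 1.2, 1.0, 1.5)
)

# Restrict to adults (aged 16 or more) before applying the interview weight.
adults <- toy[toy$age >= 16, ]

# Weighted proportion of smokers among adults.
smoking_prevalence <- weighted.mean(adults$smoker, w = adults$wt_int, na.rm = TRUE)
print(smoking_prevalence)
```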

Missing values

  • -1 Not applicable: Used to signify that a particular variable did not apply to a given respondent, usually because of internal routing. For example, men in women-only questions.

  • -2 Schedule not applicable: Used mainly for variables on the self-completions when the respondent was not of the given age range, also used for children without legal guardians in the home who could not participate in the nurse schedule.

  • -8 Don't know / Can't say.

  • -9 No answer / Refused.

How the data is read and processed

The data is read by the function [data.table::fread]. The 'root' and 'file' arguments are pasted together to form the file path. The following values are converted to NA: c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9", "-90", "-90.0", "-99", "N/A"). All variable names are converted to lower case. The cluster and primary sampling unit (PSU) identifiers have the year appended to them. Some variables are renamed for consistency with other years.
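The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual source: the toy input, its column names, and the "_2009" suffix format are all hypothetical, and the renaming step is omitted because the specific renames are not listed here.

```r
library(data.table)

# Toy tab-delimited input standing in for the real survey file.
# Column names are hypothetical.
toy_tab <- "Cluster\tPSU\tAge\tCigDyal\n101\t1\t34\t-1\n102\t2\t-9\t1\n"

# Values treated as missing, as listed in the text above.
na_codes <- c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9",
              "-90", "-90.0", "-99", "N/A")

# Read the data, converting the missing-value codes to NA.
data <- fread(text = toy_tab, sep = "\t", na.strings = na_codes)

# Convert all variable names to lower case.
setnames(data, tolower(names(data)))

# Add the survey year to the cluster and PSU identifiers
# (the exact format of the year label is an assumption).
data[, cluster := paste0(cluster, "_2009")]
data[, psu := paste0(psu, "_2009")]

print(data)
```

Adding the year to the identifiers keeps cluster and PSU codes distinct when several survey years are combined into one table.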

Examples

if (FALSE) {

# Not run: requires access to the network drive holding the survey data.
data_2009 <- read_2009(
  root = "X:/",
  file = "ScHARR/PR_Consumption_TA/HSE/HSE 2009/UKDA-6732-tab/tab/hse09ai.tab"
)

}