Reads the Health Survey for England 2002 and performs basic cleaning.

read_2002(
  root = c("X:/", "/Volumes/Shared/"),
  file =
    "HAR_PR/PR/Consumption_TA/HSE/Health Survey for England (HSE)/HSE 2002/UKDA-4912-tab/tab/hse02ai.tab",
  select_cols = c("tobalc", "all")[1]
)

Arguments

root

Character string - the root directory. This is the part of the file path that may vary depending on how the network drive is accessed. The default is "X:/", which corresponds to the University of Sheffield's X drive in the School of Health and Related Research. Within the function, the root is pasted onto the front of the file path given in the 'file' argument. Thus, if root = NULL, the complete file path must be given in the 'file' argument.

file

Character string - the file path, including the name and extension of the file. The function has been designed and tested to work with tab-delimited files ('.tab'). Files are read by the function [data.table::fread].

select_cols

Character string - select either: "all" - keep all variables in the survey data; "tobalc" - keep a reduced set of variables associated with tobacco and alcohol consumption and a selected set of survey design and socio-demographic variables that are needed for the functions within the hseclean package to work.

Value

Returns a data table. Note that:

  • A single sampling cluster is assigned.

Survey details

As well as providing a sample designed to give a cross-section of the population, HSE 2002 also focussed on the health of a number of specific groups, including: infants and children (aged 0-15), young adults (aged 16-24) and mothers with infants aged under 1. Addresses sampled in each postal sector were systematically allocated to one of two groups: Sample I (29 addresses) or Sample II (9 addresses). Sample I was designed to boost the proportion of children, young people and mothers of infants, and Sample II to provide a sample of the general population. At Sample I addresses all persons aged 0-24 were eligible for inclusion in the survey, as were all mothers of infants aged under 1 (there was no upper age limit for the mothers). At Sample II addresses all persons were eligible for interview. At both Sample I and II addresses, where there were more than two children aged 0-15, two children were selected at random. Information was obtained directly from persons aged 13 and over. Information about children aged under 13 was obtained from a parent, with the child present.

An interview with each eligible person (Stage 1) was followed by a visit by a nurse (Stage 2), who made a number of measurements and in some cases obtained a blood sample and a saliva sample. Both interviewers and nurses used computer-assisted interviewing. Blood and saliva samples were sent to a laboratory for analysis.

Weighting

In HSE 2002, the sample was boosted in order to obtain greater numbers of children, young adults (aged 16-24) and mothers of infants under 1. While children aged 0-15 and young adults aged 16-24 were sampled from all selected addresses, adults aged 25 and over were selected only at Sample II addresses (i.e. they were selected at only 9 out of the 38 addresses included within each postcode sector). Consequently, in HSE 2002, those aged 25 and over were under-represented in the final dataset. Different weights were applied to different age groups as explained below:

  • Children aged 0-15: The number of children interviewed in a household was limited to two, so the sampling fraction is lower in households containing three or more children. To compensate, the child sample is weighted. This ‘child weight’ is the total number of children aged 0-15 in the household divided by the number of selected children in the household. The weighted sample was then adjusted to ensure that the age/sex distribution matched that of all children in co-operating households.

  • Young adults aged 16-24: As all people in the household in this age range were selected for interview, the sample in this age group has a weight of 1.

  • Adults aged 25 and over: The under-representation of adults aged 25+ in the sample is addressed by giving all adults aged 25 and over a weight of 38/9 in the final dataset. The exception is natural mothers of children under the age of 1, who were selected at all addresses and hence were not under-represented.

  • The variable child_wt contains the appropriate weights for each of the three age groups described above. These weights were then scaled by a constant factor so that the weighted sample size across the sample as a whole was the same as the unweighted sample size. The scaled weight variable is tablewt.

  • The tables in the published volumes of the HSE 2002 have been weighted using the child_wt variable. For analysis relating to adults aged 16 and over using both the boost and general population samples, the variable tablewt should be used.
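As a rough worked illustration of the weighting rules above, the following sketch computes unscaled and scaled weights for a handful of made-up respondents. The data frame and its column names (age, children_in_household, children_selected, weight, scaled_weight) are hypothetical and are not variables in the HSE data. The sketch also ignores the age/sex adjustment of the child sample and the exception for mothers of infants.

```r
# Hypothetical respondents: two children from a household with three
# children (two selected), one young adult, and two adults aged 25+.
people <- data.frame(
  age = c(4, 9, 20, 40, 67),
  children_in_household = c(3, 3, NA, NA, NA),
  children_selected = c(2, 2, NA, NA, NA)
)

# Unscaled weight by age group:
#   0-15 : children in household / children selected
#   16-24: 1
#   25+  : 38/9 (adults were sampled at 9 of the 38 addresses per sector)
people$weight <- ifelse(
  people$age <= 15,
  people$children_in_household / people$children_selected,
  ifelse(people$age <= 24, 1, 38 / 9)
)

# Scale by a constant so that the weighted sample size equals the
# unweighted sample size.
people$scaled_weight <- people$weight * nrow(people) / sum(people$weight)

print(people)
```

Scaling by a single constant leaves the relative weights between respondents unchanged; it only makes the weights sum to the number of rows.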

Missing values

  • -1 Not applicable: Used to signify that a particular variable did not apply to a given respondent, usually because of internal routing. For example, men in women-only questions.

  • -2 Schedule not applicable: Used mainly for variables on the self-completions when the respondent was not in the given age range. Also used for children without legal guardians in the home, who could not participate in the nurse schedule.

  • -6 Schedule not obtained: Used to signify that a particular variable was not answered because the respondent did not complete or agree to a particular schedule (i.e. nurse schedule or self-completions).

  • -7 Refused/not obtained: Used only for variables on the nurse schedules. This code indicates that a respondent refused a particular measurement or test, or that the measurement was attempted but not obtained, or was not attempted.

  • -8 Don't know, Can't say.

  • -9 No answer/ Refused

How the data is read and processed

The data is read by the function [data.table::fread]. The 'root' and 'file' arguments are pasted together to form the file path. The following are converted to NA: c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9", "-90", "-90.0", "-99", "N/A"). All variable names are converted to lower case. The cluster and primary sampling unit (PSU) have the year appended to them. Some variables are renamed for consistency with other years.
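The reading steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: the small tab-delimited file, its column names and the temporary directory standing in for 'root' are all made up for the example, and the data.table package is assumed to be installed.

```r
library(data.table)

# Hypothetical stand-in for the survey file: a tiny tab-delimited file
# written to a temporary directory that plays the role of 'root'.
root <- paste0(tempdir(), "/")
file <- "hse_demo.tab"
writeLines(
  c("Pserial\tAge\tCigNum",
    "1\t34\t-1",
    "2\t51\t10",
    "3\t-8\t-9"),
  paste0(root, file)
)

# 'root' and 'file' are pasted together to form the full path.
# With root = NULL, paste0() returns 'file' unchanged.
path <- paste0(root, file)

# Strings treated as missing, as listed in the text above.
na_codes <- c("NA", "", "-1", "-2", "-6", "-7", "-8", "-9",
              "-90", "-90.0", "-99", "N/A")

data <- fread(path, sep = "\t", na.strings = na_codes)

# Convert all variable names to lower case.
setnames(data, names(data), tolower(names(data)))

print(data)
```

Passing the missing-value codes through na.strings means they become NA at read time, so no separate recoding pass is needed afterwards.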

Examples


if (FALSE) {

# Read the 2002 data from the X drive, keeping the default reduced set of columns
data_2002 <- read_2002(
  root = "X:/",
  file = "ScHARR/PR_Consumption_TA/HSE/HSE 2002/UKDA-4912-tab/tab/hse02ai.tab"
)

}