english_hse_data_processing_methods.Rmd
This vignette describes how the data on tobacco and alcohol consumption in the Health Survey for England are processed for use in the Sheffield Tobacco and Alcohol Policy Modelling (STAPM). The vignette covers: (1) the processing of alcohol consumption data; (2) the processing of tobacco consumption data; (3) the processing of socio-demographic covariates relevant to our tobacco and alcohol modelling; (4) the imputation of missing data. This is a working set of notes to help keep track of the methods used to process the data.
The Sheffield Alcohol Policy Model (SAPM) has been used to examine the effects of pricing policies, advertising restrictions and advice on why and how to reduce drinking (Holmes et al. 2014; Purshouse et al. 2013) (see the range of publications and projects on the Sheffield Alcohol Research Group website). Patterns of alcohol consumption in the SAPM modelling are informed primarily by the Health Survey for England data, with additional information e.g. on the division of alcohol consumption between the on- and off-trade provided by the Living Costs and Food Survey.
The hseclean R package was written to help standardise the processing of the Health Survey for England data to produce inputs for the STAPM simulation modelling. This helps data storage and processing to be consistent across the STAPM projects. The code reads, cleans, filters and combines data from multiple survey years.
Alcohol consumption data in the Health Survey for England (HSE) is recorded in four main forms:
Both adults and children have data on whether they drink alcohol or not, and on the frequency of drinking. The main difference between the recording of data for adults and children is that adults have a lot of data on how much and what they drink, but children only have data on the amount drunk in the last week.
The recording of data varies among years of the HSE. We consider years from 2001 onwards. The main features of these changes in recording are:
Due to the variability in recording, we only consider data on the amount drunk by adults and children from 2011 onwards.
We analyse beverage-specific alcohol consumption in terms of beer (combining normal beer, strong beer), wine (combining wine and sherry), spirits, and alcopops.
Calculated for adults (aged 16 years or older) and children (aged 8 to 15 years) by the function hseclean::alc_drink_now_allages(). We combine the information on drinking frequency from adults and children into a single variable.
We calculate the variable drinks_now, which classes someone as either a drinker or a non-drinker. Adults are classed as drinkers if they reported drinking at all in the last 12 months, even if reporting only having 1-2 drinks a year (according to the variable dnoft). Note that this definition of a non-drinker can vary among surveys, e.g. some surveys class only having 1-2 drinks a year as a non-drinker, and this could lead to variation in estimates of the number of non-drinkers.
We calculate the variable drink_freq_7d, which is a numerical variable that describes drinking frequency. Adult drinking frequency is also inferred from the variable dnoft: the function hseclean::alc_drink_freq() converts the categorical responses into the expected number of days in a week that someone drinks.
Missing data on whether or not someone currently drinks (drinks_now) is supplemented by responses to the questions on whether they currently drink and whether they have always been a non-drinker (the variables dnnow, dnany and dnevr).
For children (aged 8-15 years) we infer whether someone drinks or not (drinks_now) from the variable adrinkof. Someone is a non-drinker if they responded never to adrinkof. The categorical responses are converted into the expected number of days in a week that someone drinks as follows:
Missing data on whether or not a child currently drinks (drinks_now) is supplemented by responses to when they last had an alcoholic drink (adrlast): if the last drink was less than six months ago, then we classify them as a drinker; if the last drink was six months or more ago, then we classify them as a non-drinker.
Some standard assumptions are made about the volume and alcohol content of the beverages that are reported to be drunk. The values that we use for these assumptions are based on those used by NatCen to create the derived variables for units of alcohol consumed in the HSE. We have made our own adjustments to these values based on further information from market research data and figures from academic publications.
Alcohol content assumptions are the expected percentages of alcohol that each beverage contains (alcohol by volume, ABV). We use separate values for normal beer (4.4%), strong beer (8.4%), spirits (38%), sherry (17%), wine (12.5%), and alcopops (also known as “ready to drink” or RTD) (4.5%).
Beverage volume assumptions are the expected volumes (ml) of different beverage containers / serving sizes. We use separate values for normal and strong beer (half pint 284ml, small can 330ml, large can 440ml, bottle 330ml), spirits (serving 25ml), sherry (serving 50ml), wine (small glass 125ml, standard glass 175ml, large glass 250ml, bottle 750ml), and alcopops (small can 250ml, small bottle 275ml, large bottle 700ml).
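As a worked illustration of how the alcohol content and volume assumptions combine, the sketch below converts reported servings into UK standard units, using values quoted above and the definition of 1 unit as 10ml of pure ethanol. The function name units_from_serving() is hypothetical and is not part of hseclean.

```r
# Hypothetical helper (not part of hseclean): convert servings to UK units.
# 1 UK standard unit = 10 ml of pure ethanol.
units_from_serving <- function(n_servings, volume_ml, abv_percent) {
  n_servings * volume_ml * (abv_percent / 100) / 10
}

# Two standard (175 ml) glasses of wine at 12.5% ABV
wine_units <- units_from_serving(2, volume_ml = 175, abv_percent = 12.5)

# One large (440 ml) can of normal beer at 4.4% ABV
beer_units <- units_from_serving(1, volume_ml = 440, abv_percent = 4.4)
```

Under these assumptions, the two glasses of wine come to 4.375 units and the can of beer to 1.936 units.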
We estimate the average amount drunk in a week (weekmean) in terms of UK standard units of alcohol (1 unit = 10ml or 8g pure ethanol). The average amount drunk is then categorised as follows:
- abstainer = 0 units/week
- lower_risk drinker = less than 14 units/week
- increasing_risk drinker = 14 or more units/week but less than 35 units/week for females or less than 50 units/week for males
- higher_risk drinker = 35 or more units/week for females or 50 or more units/week for males
Separate variables are produced describing the average weekly units in four beverage categories: beer_units (including cider), wine_units (including sherry), spirit_units, rtd_units (this is alcopops). Further variables on beverage preference are produced:
- per_spirit_units, perc_wine_units, perc_beer_units, perc_rtd_units
- does_not_drink_spirits, drinks_some_spirits, mostly_drinks_spirits, where “mostly drinks” is defined by a single beverage comprising more than 50% of an individual's average weekly consumption.
The processing is done by the function hseclean::alc_weekmean_adult(). The calculation has the following steps:
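Separately from the processing steps, the drinker categories defined earlier in this section can be sketched in R. This is a minimal illustration of the thresholds only, not the hseclean implementation: the function name drinker_category() and the coding of sex as "Male"/"Female" are assumptions.

```r
# Hypothetical helper (not part of hseclean): classify average weekly units.
# Assumes sex is coded as "Male" or "Female".
drinker_category <- function(weekmean, sex) {
  # Upper bound of the increasing-risk band differs by sex
  upper <- ifelse(sex == "Female", 35, 50)
  ifelse(weekmean == 0, "abstainer",
         ifelse(weekmean < 14, "lower_risk",
                ifelse(weekmean < upper, "increasing_risk", "higher_risk")))
}

example <- drinker_category(
  weekmean = c(0, 10, 14, 40, 40, 50),
  sex      = c("Male", "Female", "Male", "Female", "Male", "Male")
)
```

Note that 40 units/week is classed as higher risk for a female but increasing risk for a male.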
The function hseclean::alc_sevenday_adult() processes the information from the questions on adult (16 or more years old) drinking in the last seven days:
- n_days_drink.
We estimate the number of UK standard units of alcohol drunk on the heaviest drinking day (peakday) by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content. We further estimate the total units drunk of each beverage type on the heaviest drinking day (d7nbeer_units, d7sbeer_units, d7spirits_units, d7sherry_units, d7wine_units, d7pops_units).
Binge drinking status is then categorised into the variable binge_cat, with levels did_not_drink, binge and no_binge, where a binge day is defined by males drinking over 8 units and females drinking over 6 units.
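The binge categorisation can be sketched as below. This is an illustration of the thresholds described above rather than the package code: the function name binge_category(), the logical flag drank_last_week and the "Male"/"Female" coding of sex are all assumptions.

```r
# Hypothetical helper (not part of hseclean): classify the heaviest day.
# peakday = units drunk on the heaviest drinking day in the last week.
binge_category <- function(peakday, sex, drank_last_week) {
  # A binge day is strictly over 8 units (males) or 6 units (females)
  threshold <- ifelse(sex == "Male", 8, 6)
  ifelse(!drank_last_week, "did_not_drink",
         ifelse(peakday > threshold, "binge", "no_binge"))
}

binge_example <- binge_category(
  peakday         = c(0, 7, 7, 8, 9),
  sex             = c("Male", "Female", "Male", "Male", "Male"),
  drank_last_week = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)
```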
Note that in 2007 new questions were added asking which glass size was used when wine was consumed. Therefore the post HSE 2007 unit calculations are not directly comparable to previous years’ data.
Missing data is imputed using the means of people who did drink in the last seven days, stratified by year, sex, IMD quintile and age category (0-1, 2-4, 5-7, 8-10, 11-12, 13-15, 16-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90+).
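Imputation by stratified means can be illustrated with base R. The sketch below is generic and uses a toy data frame with made-up values; the function name impute_stratum_mean() is hypothetical, and only two stratifying variables are used instead of the full set listed above.

```r
# Hypothetical sketch: fill missing values with the mean of their stratum.
impute_stratum_mean <- function(x, ...) {
  # ave() applies FUN within each combination of the grouping vectors
  stratum_mean <- ave(x, ..., FUN = function(v) mean(v, na.rm = TRUE))
  ifelse(is.na(x), stratum_mean, x)
}

# Toy data (invented values, for illustration only)
toy <- data.frame(
  sex     = c("Male", "Male", "Male", "Female", "Female"),
  age_cat = c("18-19", "18-19", "18-19", "18-19", "18-19"),
  units   = c(4, NA, 6, 2, NA)
)
toy$units_imputed <- impute_stratum_mean(toy$units, toy$sex, toy$age_cat)
```

In the toy data, the missing male value is replaced by 5 (the mean of 4 and 6) and the missing female value by 2.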
The function hseclean::alc_sevenday_child() processes the information on drinking by children (ages 13-15) in the last seven days. The data on children’s drinking comes in the form of survey questions on whether or not they have drunk each beverage type in the last week, and if so, how much of each was drunk. The main output is the variable total_units7_ch - the total units drunk in the last seven days.
We estimate the number of UK standard units of alcohol drunk in the last 7 days by using the data on how many of what size measures of different beverages were drunk, and combining this with our standard assumptions about beverage volume and alcohol content.
The information from this question is also used to update the drinks_now variable to make it a variable that describes whether or not adults and children drink.
Due to high missingness in this variable, we assume that anyone who has missing data for this variable does not drink. This means that we are likely to under-estimate the number of children who drink.
For the Sheffield Tobacco Policy Model (STPM), we use HSE data from years 2001 to the latest available (although survey weights started to be used from 2003, which is likely to make the data from 2003 slightly more reliable). We use these data to inform the trends in smoking prevalence, the socio-demographic variation in smoking prevalence, and as inputs to a procedure that we use to infer the age-specific probabilities of smoking initiation and quitting (see our smktrans R package). Our upper age limit is 89 years, but otherwise we make use of all ages.
The purpose of this vignette is to explain how we use the HSE data to inform the patterns of tobacco smoking, and to explain how hseclean supports this.
Questions about cigarette smoking have been asked of adults aged 16 and over as part of the HSE series since 1991; we use data from 2001 to the latest year available. We use data on children (12-15 years) and adults (16+ years). There is often a special section in the annual HSE report devoted to describing trends in cigarette smoking, e.g. HSE 2015.
The function hseclean::smk_status() categorises cigarette smoking into current, former and never regular cigarette smokers. If someone smokes either regularly or occasionally, then they are classified as a current regular cigarette smoker. People who used to smoke regularly or occasionally are classified as former smokers; people who have only tried a cigarette once or twice are classified as never smokers. We create a smoking status variable for children aged 8-15 years and adults aged >= 16 years. Ever-smokers are people who are either current or former smokers.
The function hseclean::smk_quit() is in development, and will process the data on the motivation to quit smoking, the reasons for quitting smoking, and the support used to stop smoking. It currently produces only one variable - whether someone wants to quit smoking (y/n).
The function hseclean::smk_former() cleans the data for former smokers on the time since quitting and time spent as a regular smoker. The main issue to overcome is that in the HSE 2015+, time since quit and time spent as a smoker are provided in categories rather than single years. We simulate the single years by picking a value at random within the time interval, using hseclean::num_sim().
We then fill missing data for these variables as follows:
The function hseclean::smk_life_history() cleans the data on the ages when smokers started and stopped being regular cigarette smokers. For each individual smoker, the data recorded in the HSE implies a single age at which a smoker started to smoke and, if they stopped, an age at which they did so. This provides a simplified view of what might be a complicated life history of smoking, e.g. smoking to different frequencies or levels, or starting and stopping multiple times.
Both the start age and stop age will contain error, e.g. due to uncertainty in respondent recall and, for years 2015+, due to the reporting of time intervals in categories rather than single years, which we then impute, introducing random error. Start age is likely to be biased towards earlier ages for two reasons: for adult smokers and former smokers with missing values we use the age at which they first tried a cigarette; and for children the reported start age does not necessarily mean the start of regular smoking, only the age at which they started to smoke.
We also create a variable for the age at which an individual was censored from our data sample - this is their age at the survey + 1 year.
Any missing data is assigned the average start or stop age for each age, sex and IMD quintile.
The function hseclean::smk_amount() cleans the data that describe how much, what and to what level of addiction people smoke. The main variable is the average number of cigarettes smoked per day. For adults, this is calculated from questions about how many cigarettes are smoked typically on a weekday vs. a weekend (this is a weighted average to account for more weekdays in a week than weekends). For children, this is based on asking how many cigarettes were smoked in the last week. Missing values are imputed as the average amount smoked for an age, sex and IMD quintile subgroup.
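The weighted average for adults can be sketched as follows. The weights of 5 weekdays and 2 weekend days are an assumption based on the description above, and the function name cigs_per_day() is hypothetical; the exact calculation in hseclean::smk_amount() may differ.

```r
# Hypothetical helper (not part of hseclean): average cigarettes per day
# from typical weekday and weekend-day consumption.
# Assumes a week of 5 weekdays and 2 weekend days.
cigs_per_day <- function(cigs_weekday, cigs_weekend) {
  (5 * cigs_weekday + 2 * cigs_weekend) / 7
}

# 10 cigarettes on a typical weekday, 17 on a typical weekend day
avg_cigs <- cigs_per_day(cigs_weekday = 10, cigs_weekend = 17)
```

Here the result is 12 cigarettes per day, i.e. (50 + 34) / 7.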
We categorise cigarette preferences based on the answer to ‘what is the main type of cigarette smoked’. For years 2013 and later of the HSE, questions were added that ask how many handrolled vs. machine rolled cigarettes are smoked on a weekday vs. a weekend.
We also categorise the amount smoked, and use information on the time from waking until smoking the first cigarette (this latter variable has a high level of missingness). Together these two variables allow calculation of the heaviness of smoking index.
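For illustration, the heaviness of smoking index is commonly scored from 0 to 6 as the sum of a score for cigarettes per day and a score for time to first cigarette. The bands in the sketch below are the commonly published ones and are an assumption here, since this vignette does not state which bands hseclean uses; the function name hsi_score() is hypothetical.

```r
# Hypothetical helper (not part of hseclean): heaviness of smoking index.
# Assumed bands: cigarettes/day <=10 = 0, 11-20 = 1, 21-30 = 2, 31+ = 3;
# minutes to first cigarette <=5 = 3, 6-30 = 2, 31-60 = 1, >60 = 0.
hsi_score <- function(cigs_per_day, mins_to_first_cig) {
  amount_score <- cut(cigs_per_day,
                      breaks = c(-Inf, 10, 20, 30, Inf),
                      labels = FALSE) - 1
  time_score <- 4 - cut(mins_to_first_cig,
                        breaks = c(-Inf, 5, 30, 60, Inf),
                        labels = FALSE)
  # Missing values in either input give a missing score
  amount_score + time_score
}

hsi_example <- hsi_score(cigs_per_day = c(5, 15, 35),
                         mins_to_first_cig = c(90, 20, 3))
```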
Taking the survey design into account is important when estimating the mean and confidence intervals around summary statistics computed from the data, i.e. it is not possible to accurately estimate sampling error without accounting for survey design. The survey R package (Thomas Lumley 2019) has a collection of functions that incorporate survey design into the calculation of summary statistics. The survey package is used by the function prop_summary() in hseclean to estimate the uncertainty around proportions calculated from a binary variable. prop_summary() was designed to simplify the process of estimating smoking prevalence from the HSE data, stratified by a specified set of variables.
The suppliers of the HSE data introduced tighter information governance rules in 2015, which meant that they stopped providing variables that could be used to identify the age in single years of an individual, and also stopped providing information on number of children in the household. These variables can still be obtained, but only after applying for the secure-access version of the data, which we do not do. Therefore, in our processing of the standard-access version of the data, we use imputation methods to overcome the added restrictions.
The first thing to consider is the influence of survey sampling design, which varies among years. The variables that describe the sampling structure are cluster and PSU (primary sampling unit).
In most years there are also survey weights, which are calculated after the survey data has been collected, that when applied are supposed to make the survey sample representative of the general population e.g. if a particular subgroup has been under-sampled, then it receives a higher survey weight. As we understand the HSE methods, the survey weights supplied with the data consider only the age and sex distribution of the population, and do not consider the distribution of socio-economic or health characteristics. The definition and structure of the survey weights provided with the data varies between years, and is described in the dataset documentation for each year of data. For example, some key changes
hseclean contains separate functions for reading the survey data for each year, e.g. read_2001(), and a description of the survey weights has been added to the help files of those functions. Any processing or combining of survey weights is done in the functions that read each year of data. The function clean_surveyweights() assigns any missing weights the average weight for each year, and standardises the weights to sum to 1 within each year. The resulting survey weight variable for each year is wt_int.
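The two steps applied to the survey weights (filling missing weights with the yearly average, then rescaling to sum to 1 within each year) can be sketched in base R. This is an illustration under those stated rules, not the code of clean_surveyweights(); the function name standardise_weights() and the toy values are hypothetical.

```r
# Hypothetical sketch of the two steps described above (not hseclean code).
standardise_weights <- function(weight, year) {
  # Step 1: replace missing weights with the mean weight for that year
  year_mean <- ave(weight, year, FUN = function(w) mean(w, na.rm = TRUE))
  weight <- ifelse(is.na(weight), year_mean, weight)
  # Step 2: rescale so that the weights sum to 1 within each year
  weight / ave(weight, year, FUN = sum)
}

# Toy weights (invented values) for two survey years
wt_int <- standardise_weights(
  weight = c(1, NA, 3, 2, 2),
  year   = c(2003, 2003, 2003, 2004, 2004)
)
```

In the toy data the missing 2003 weight is first set to 2, so the standardised 2003 weights are 1/6, 2/6 and 3/6.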
From 2015 onwards, the HSE no longer supplies age in single years (to prevent individual identification). For our modelling, we require age in single years, so we apply a method that randomly assigns an age in single years to individuals for whom we only have an age category. The age categories we work with are: 0-1, 2-4, 5-7, 8-10, 11-12, 13-15, 16-17, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90+. These categories are the finest scale version of age that is available for years 2015+. We then select only individuals younger than 90 years for our modelling.
This processing is done by the function clean_age(), which calls the function num_sim() to simulate single years of age. For years 2015+, we also use num_sim() to convert the categorical variables for years since quitting smoking and years spent as a smoker to single years.
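The idea of randomly assigning a single year within a category can be sketched as below. This is not the implementation of num_sim(): the function name simulate_age() is hypothetical, a uniform draw within each category is an assumption, and the open-ended 90+ category is not handled.

```r
# Hypothetical sketch (not the hseclean::num_sim() implementation):
# draw a single year of age uniformly at random within each age category.
# Only closed categories such as "20-24" are supported.
simulate_age <- function(age_cat) {
  bounds <- strsplit(age_cat, "-", fixed = TRUE)
  vapply(bounds, function(b) {
    lower <- as.integer(b[1])
    upper <- as.integer(b[2])
    # sample.int() draws from 1..n, so shift the draw to start at `lower`
    lower + sample.int(upper - lower + 1L, size = 1L) - 1L
  }, integer(1))
}

set.seed(1)  # make the random draws reproducible
age_cat <- c("16-17", "20-24", "85-89")
age <- simulate_age(age_cat)
```

Each simulated age falls within the bounds of its category, and repeated runs with a different seed will give different ages.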
The function clean_demographic() creates variables for ethnicity, sex and quintiles of the Index of Multiple Deprivation (IMDq).
Previous SAPM modelling has used a simple white/non-white classification. The ONS recommend a harmonised ethnicity measure for use in social surveys (ONS, 2017). The use of ethnicity measures is also discussed in Connelly et al. 2016, who recommend testing the sensitivity of analyses to different specifications. We try to map the HSE categories to the ONS recommended groups for England. However, over the years, the HSE has not been clear or consistent in whether it categorises Chinese and Arab as ‘Asian’ or ‘other’. In an attempt to harmonise, we have pooled the Asian and other categories.
Following inspection of the data, the white/non-white classification does look appropriate, especially given the likely limited sample sizes - so the 2 level variable has also been created. Previous Sheffield modelling in the Sheffield Alcohol Policy Model has also used the white/non-white classification.
Individuals in the HSE are not assigned a Townsend quintile of deprivation, but for a project that investigated the cost of alcohol to primary care in England, we needed to predict the Townsend quintile of each individual so that we could use it to stratify our summary of alcohol consumption.
The function use_townsend() adds a Townsend variable to the data. It produces a version of the Health Survey for England data that has the Townsend Index in it, based on the probabilistic mapping between the 2015 English Index of Multiple Deprivation and the Townsend Index from the 2001 census.
It does so based on a matrix (stored in hseclean::imdq_to_townsend) that maps quintiles of the Index of Multiple Deprivation onto the Townsend Index of Deprivation. To produce this we used area-level Office for National Statistics data to estimate the statistical association between the two metrics of deprivation. We used estimates of the Townsend Index from 2001 Census data at Ward level, and the Index of Multiple Deprivation 2015 (IMD 2015) at Lower-layer Super Output Area (LSOA) level. First, we mapped the 2001 definitions of Wards to the 2001 definitions of LSOAs. Second, we mapped the 2001 definitions of LSOAs to the 2011 definitions of LSOAs that are used by the IMD 2015.
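One possible way to apply a probabilistic mapping matrix of this kind is to draw a Townsend quintile at random for each individual, using the row of probabilities for their IMD quintile. The sketch below illustrates that idea only: the probabilities are invented, the function name assign_townsend() is hypothetical, and use_townsend() may apply the real matrix in a different way.

```r
# Invented probabilities for illustration only:
# rows = IMD quintile, columns = Townsend quintile; each row sums to 1.
imd_to_townsend <- matrix(
  c(0.6, 0.2, 0.1, 0.1, 0.0,
    0.2, 0.4, 0.2, 0.1, 0.1,
    0.1, 0.2, 0.4, 0.2, 0.1,
    0.1, 0.1, 0.2, 0.4, 0.2,
    0.0, 0.1, 0.1, 0.2, 0.6),
  nrow = 5, byrow = TRUE
)

# Hypothetical helper: draw one Townsend quintile per person,
# given their IMD quintile (an integer from 1 to 5).
assign_townsend <- function(imd_quintile, mapping) {
  vapply(imd_quintile, function(q) {
    sample.int(ncol(mapping), size = 1L, prob = mapping[q, ])
  }, integer(1))
}

set.seed(42)  # make the random draws reproducible
townsend_quintile <- assign_townsend(c(1, 1, 3, 5), imd_to_townsend)
```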
The function clean_economic_status() creates a variety of variables to classify economic status.
The issues around using occupation-based social classifications for social survey research are discussed by Connelly et al. (2016). They advise using a range of alternative measures, and not creating new measures beyond what is already established.
The classifications considered are:
The main education variable produced by the function clean_education() is a four category description of the age at which someone finished full-time education. The categories are:
If someone was still in full time education at the time of the survey, then if they were younger than 18 years, we assumed they would leave at 16-18, and if they were older than 18 years, we assumed they would leave at 19 years or over.
A further education variable is also produced - which indicates whether an individual reached a degree as their top qualification or not. Here a degree is defined as an “NVQ4/NVQ5/Degree or equiv”.
The function clean_family() processes the data on the number of children in the household and the relationship status of each respondent.
kids is the number of children aged 0-15 years who live in the household. If a 3 year old lives in a household with 2 siblings, aged 6 and 8 years, then we might expect them to be recorded as living in a household with 3 children under age 15 years. The variable is created by combining the HSE data on children and infants in the household. It is categorised into: 0, 1, 2, 3+ children under age 15 years.
The problem with the Health Survey for England is that from 2015 onwards, the number of children in the household is not provided as this information could be identifiable (you can get it if you apply and pay for a secure dataset). Therefore, for years 2015+, the number of children in the household is completely missing and needs to be imputed.
We impute the number of children for years 2015+ automatically in the function clean_family(), based on the correlation between the number of children and a range of demographic and socioeconomic variables in 2012-2014, the last three years for which data on kids is available. This imputation is based on the fit of a multinomial model using the nnet package. The model object is saved in the hseclean package as the object hseclean::impute_kids_model, and is drawn upon by the clean_family() function to impute the data as needed. This imputation won’t work unless the required demographic and socio-economic variables have already been cleaned prior to running clean_family(). There will still be missing values in kids if there are missing values in the predictor variables required by the model. These missing values can be taken care of in a multiple imputation procedure (see vignette("missing_data")).
The function clean_income() processes the data on income.
There are a few different options for classifying income. The need to have a measure that is consistent across years of the Health Survey for England has led us to use equivalised income quintiles only. (Past SAPM modelling has used years of the HSE for which a continuous variable for equivalised income was provided, and calculated our own income groups, but in later years this continuous income variable is not available.)
In the past SAPM modelling, a measure of being “in poverty” vs. “not in poverty” has been used, where the poverty threshold is defined as 60% of the median income for any year. For years in which we only have income quintiles available, it is not possible to make an exact calculation of poverty, but being in poverty will coincide approximately with the lowest 2 income quintiles.
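Where a continuous income variable is available, the poverty threshold can be computed directly. The sketch below uses invented incomes for a single survey year and assumes that “in poverty” means strictly below 60% of the median; the variable names are hypothetical.

```r
# Invented equivalised incomes for one survey year (illustration only)
income <- c(8000, 12000, 15000, 20000, 26000, 31000, 45000)

# Poverty threshold: 60% of the median income for the year
poverty_line <- 0.6 * median(income)

# Assumption: in poverty means strictly below the threshold
in_poverty <- income < poverty_line
```

With these values the median is 20000, the threshold is 12000, and only the first individual is classed as in poverty.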
It would also be possible from the Health Survey for England to classify people as being in receipt of benefits or not, but this is not currently implemented in hseclean, and would need some thought on how to deal with the changing definitions of benefits over time.
The function clean_health_and_bio() cleans data on presence/absence of certain categories of health condition, and on height and weight.
There are a set of 15 categories of long-lasting illnesses (occurring
for or expected to last at least 12 months) that are ascertained
consistently across all years of the HSE. These are:
- Cancer
- Endocrine or metabolic condition
- Mental health condition
- Nervous system condition
- Eye condition
- Ear condition
- Heart or circulatory system condition
- Respiratory condition
- Digestive condition
- Genito-urinary condition
- Skin condition
- Musculo-skeletal condition
- Infectious disease
- Blood and related organs condition
- Other complaints
To prepare for the process that imputes missing data, run the full set of functions to read and clean the data. It is important to note that there has already been some filling-in of missing data done by these cleaning functions using simple rules. The clean_age() function has randomly assigned single years of age within each age category.
The number of children in the household is missing for years 2015+. This is imputed in the function clean_family() based on the fit of a multinomial model to years 2012-2014 (see vignette("covariate_data")).
The function select_data() has the option to filter the data to retain only complete cases for certain variables. To prepare the data, the example code below filters out any incomplete data on key survey variables (age, sex, year, quarter, psu, cluster, imd_quintile). It also filters out any incomplete information on the key smoking and drinking variables, “cig_smoker_status” and “drinks_now”.
The variable with the most missingness in the data is income5cat (19% missing). In this example, the other variables to be imputed are: kids, ethnicity_4cat, eduend4cat, degree, relationship_status, nssec3_lab, activity_lstweek.
To conduct the multiple imputation, we use the R package mice (Stef van Buuren and Karin Groothuis-Oudshoorn 2011). The process of running the multiple imputation can take a long time and consume a lot of RAM. There is a range of mice documentation and tutorials online.
In hseclean, multiple imputation is implemented in a basic way by the impute_data_mice() function.
mice fits a chained series of regression equations that predict the missing values of variables based on their relationships with other selected variables in the data. The impute_data_mice() function currently only imputes categorical variables, which could be one of three types: “logreg” - binary logistic regression; “polr” - ordered proportional odds model; “polyreg” - unordered polytomous logistic regression.
In running the multiple imputation, the number of iterations of the imputed data is selected (choosing a small number e.g. < 5 helps keep the size of the resulting imputed data manageable), and the variables to either be predicted or to inform the prediction are selected. If a variable is just going to inform the prediction of the other variables but is not going to be predicted itself, then the model type is set to “” (an empty string), otherwise to one of “logreg”, “polr” or “polyreg”.
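The selection of model types can be illustrated as a named character vector, which is the form accepted by the method argument of mice::mice(). In the sketch below, the model type chosen for each variable is an illustrative assumption, age_cat is a hypothetical variable name, and the way impute_data_mice() takes these settings is not shown because it is not described here.

```r
# Illustrative assumption: one model type per variable.
# "" means the variable informs the predictions but is not imputed itself.
imputation_methods <- c(
  age_cat             = "",        # hypothetical name; predictor only
  sex                 = "",        # predictor only
  imd_quintile        = "",        # predictor only
  income5cat          = "polr",    # ordered categories
  kids                = "polr",    # ordered categories
  degree              = "logreg",  # binary
  ethnicity_4cat      = "polyreg", # unordered categories
  relationship_status = "polyreg"  # unordered categories
)

# Variables that will actually be imputed
vars_to_impute <- names(imputation_methods)[imputation_methods != ""]
```

Keeping the settings in one named vector makes it easy to check which variables are imputed and which only act as predictors before starting a long-running imputation.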