Data and R code for plotting data on Disability Adjusted Life Years from World Health Organisation
I was interested in gender differences in disease and mortality in humans, and found an extensive data set from the World Health Organisation (http://www.who.int/). Specifically this is disability-adjusted life-years in the year 2012, for the whole world, separated by age-group and by sex.
The excel file downloaded from http://www.who.int/healthinfo/global_burden_disease/estimates/en/index2.html is fairly complex, and contains features that make it a bit complicated for me to read into R. In excel, I did some reformatting and created a few more variables, prior to import into R.
I'm also interested in trying to get big data into good graphs, - and doing good, reproducible analyses - and using github - and teaching others (especially at Sussex). This is my first experiment with github io pages, a demonstration of dplyr, ggplot, and a few related R packages (see below), and explanation of some of the R code.
R packages used are:
dplyr https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
ggplot2 http://docs.ggplot2.org/current/
tidyr https://cran.r-project.org/web/packages/tidyr/index.html
gridExtra https://cran.r-project.org/web/packages/gridExtra/index.html
cowplot https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html
A routine import of data in text (tab-delimited) format:
morta = read.table("who_mortality_2015.txt", header = T, sep = "\t")
Convert global DALY for each sex into millions and view:
id myclass myid male female age years milmale milfemale
1 Neonatal_sepsis_and_infections B B-Inf 22129032 17494859 0-27 days 0.00 22.129032 17.494859
2 Neonatal_sepsis_and_infections B B-Inf 22129032 17494859 0-27 days 0.07 22.129032 17.494859
3 Neonatal_sepsis_and_infections B B-Inf 3332 3549 1-59 months 0.08 0.003332 0.003549
4 Neonatal_sepsis_and_infections B B-Inf 3332 3549 1-59 months 4.90 0.003332 0.003549
5 Neonatal_sepsis_and_infections B B-Inf 894 12022 5-14 years 5.00 0.000894 0.012022
6 Neonatal_sepsis_and_infections B B-Inf 894 12022 5-14 years 14.00 0.000894 0.012022
Use dplyr's select'n'gather functions to reformat the data as 'long':
b = morta %>%
select (myid, milmale, milfemale, years, age) %>%
gather (key = sex, value = daly, -c(myid, years, age))
> head(b)
myid years age sex daly
1 B-Inf 0.00 0-27 days milmale 22.129032
2 B-Inf 0.07 0-27 days milmale 22.129032
3 B-Inf 0.08 1-59 months milmale 0.003332
4 B-Inf 4.90 1-59 months milmale 0.003332
5 B-Inf 5.00 5-14 years milmale 0.000894
6 B-Inf 14.00 5-14 years milmale 0.000894
The factor level manipulations (not shown here) are not elegant, and could probably be improved somehow.
Make a decent custom plot theme (because native ggplot themes are not great):
theme_f1 <- function() {
theme_bw(base_size = 8) +
theme(
plot.margin = unit(c(.25, .5, .25, .25), "cm"),
strip.background = element_rect(fill = "white", colour="white"),
axis.line.y = element_line(size = .3, colour = "black"),
axis.line.x = element_line(size = .3, colour = "black")
)
}
Use this command to check that the custom theme is complete (Should return 'true'):
attr(theme_f1(), "complete")
Plot all causes of DALY throughout age as area graphs; group (and colour) the areas by sex; achieved generally by ggplot2 facet_wrap. Having a full label on each plot for each cause ('id') isn't really practical. The custom identifiers ('myid') work, but adding the Guide as a title isn't great. Note, the y-axis range differs between plots, and can be changed with scales = "fixed" (facet_wrap).
all.pheno.time =
ggplot (b, aes(years, daly, group = sex, fill = sex)) +
geom_area(alpha = .5, position = "identity") +
facet_wrap("myid", scales="free_y") +
theme_f1() +
labs(title = "Guide: B=Birth, C=Cardio, D=Digestive, G=Genetic, I=Infection, J=inJury, M=Metabolic,
N=Neurological, O=Oncology, P=Psychiatric, S=Sensory, U=Uro-genital, Z=other", size = 6) +
scale_y_continuous(name="Disability-adjusted life-years (millions)", labels = NULL) +
scale_x_continuous(name = "Age group (years)") +
theme(legend.position="none",
panel.grid.major.x = element_line(size = 0),
title = element_text(size = 5))
Plot DALY for selected phenotypes, age x sex x DALY. It is also possible to plot these as smoothed lines but I feel as though plain geom_line is truer to the data.
sel.pheno.plot =
ggplot (c, aes(years, milcounts, group = sex, linetype = sex, colour = sex)) +
geom_line(size = 4, alpha = .3, linetype = 1) +
geom_line(size = .5, alpha = .8, colour = "black") +
facet_wrap("id", scales="free_y", labeller = mylabs) +
scale_colour_discrete(name = "Legend", labels=c("Female", "Male")) +
scale_linetype_discrete(name = "Legend", labels = c("Female", "Male")) +
theme_f2() +
scale_y_continuous(name = "Disability-adjusted life years (millions)") +
scale_x_continuous(name = "Age group (years)") +
guides (colour = guide_legend(title = "Legend")) +
theme(legend.position = "bottom")
Scatter plots of male vs female DALY, grouped by age. Warning, points are paired/duplicated due to age/years.
scat.all.ages =
ggplot(morta, aes(milmale, milfemale)) +
geom_abline(intercept = 0, slope = 1, size=.5, linetype = 2, colour="black") +
geom_point(size=2, shape=21) +
geom_smooth(se=TRUE, size = 0) +
scale_x_continuous(name = "male DALY (x10^6)") +
scale_y_continuous(name = "female DALY (x10^6)") +
facet_wrap("age", nrow = 2) +
theme_f1() +
theme(legend.position="none")
Scatter plot of male vs female DALY for age to 15-29 only, points labelled with type.
scat.mid =
ggplot(filter(morta, years == 29.00), aes(milmale, milfemale, label = id)) +
geom_abline(intercept = 0, slope = 1, size=.5, linetype = 2, colour = "black") +
geom_text(size = 2) +
geom_smooth(se = TRUE, size = 0) +
scale_x_continuous(name = "male DALY (x10^6)") +
scale_y_continuous(name = "female DALY (x10^6)") +
facet_wrap("age", nrow = 2) +
theme_f1() +
theme(legend.position = "none")
Combine four plots into one page. This requires the package 'gridExtra'.
all.plots = plot_grid(nrow = 2,
all.pheno.time, sel.pheno.plot,
scat.all.ages, scat.mid,
labels = "AUTO", label_size = 12, scale = 1)
Save combined plots, using package cowplot.
save_plot("suppfig1_all_sexMortality.pdf", all.plots, base_height = 10)
save_plot("fig1_sexMortality.pdf", sel.pheno.plot, base_height = 5)
Now, how do I display images here ??? Here are the links anyway:
https://github.com/willgilks/WHO_DALY_2012/blob/master/fig1_sexMortality.png https://github.com/willgilks/WHO_DALY_2012/blob/master/suppfig1_all_sexMortality.png
There ends my first github io page.