WHO Disability Adjusted Life Years, by gender and cause, 2012

Data and R code for plotting data on Disability Adjusted Life Years from World Health Organisation

View the Project on GitHub willgilks/WHO_DALY_2012

I was interested in gender differences in disease and mortality in humans, and found an extensive data set from the World Health Organisation (http://www.who.int/). Specifically this is disability-adjusted life-years in the year 2012, for the whole world, separated by age-group and by sex.

The excel file downloaded from http://www.who.int/healthinfo/global_burden_disease/estimates/en/index2.html is fairly complex, and contains features that make it a bit complicated for me to read into R. In excel, I did some reformatting and created a few more variables, prior to import into R.

I'm also interested in trying to get big data into good graphs, - and doing good, reproducible analyses - and using github - and teaching others (especially at Sussex). This is my first experiment with github io pages, a demonstration of dplyr, ggplot, and a few related R packages (see below), and explanation of some of the R code.

R packages used are:

dplyr https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

ggplot2 http://docs.ggplot2.org/current/

tidyr https://cran.r-project.org/web/packages/tidyr/index.html

gridExtra https://cran.r-project.org/web/packages/gridExtra/index.html

cowplot https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html

A routine import of data in text (tab-delimited) format:

morta = read.table("who_mortality_2015.txt", header = T, sep = "\t")

Convert global DALY for each sex into millions and view:

                          id myclass  myid     male   female         age years   milmale milfemale
1 Neonatal_sepsis_and_infections       B B-Inf 22129032 17494859   0-27 days  0.00 22.129032 17.494859
2 Neonatal_sepsis_and_infections       B B-Inf 22129032 17494859   0-27 days  0.07 22.129032 17.494859
3 Neonatal_sepsis_and_infections       B B-Inf     3332     3549 1-59 months  0.08  0.003332  0.003549
4 Neonatal_sepsis_and_infections       B B-Inf     3332     3549 1-59 months  4.90  0.003332  0.003549
5 Neonatal_sepsis_and_infections       B B-Inf      894    12022  5-14 years  5.00  0.000894  0.012022
6 Neonatal_sepsis_and_infections       B B-Inf      894    12022  5-14 years 14.00  0.000894  0.012022

Use dplyr's select'n'gather functions to reformat the data as 'long':

b = morta %>%
    select (myid, milmale, milfemale, years, age) %>%
    gather (key = sex, value = daly, -c(myid, years, age))

> head(b)
  myid years         age     sex      daly
1 B-Inf  0.00   0-27 days milmale 22.129032
2 B-Inf  0.07   0-27 days milmale 22.129032
3 B-Inf  0.08 1-59 months milmale  0.003332
4 B-Inf  4.90 1-59 months milmale  0.003332
5 B-Inf  5.00  5-14 years milmale  0.000894
6 B-Inf 14.00  5-14 years milmale  0.000894

The factor level manipulations (not shown here) are not elegant, and could probably be improved somehow.

Make a decent custom plot theme (because native ggplot themes are not great):

theme_f1 <- function() {
    theme_bw(base_size = 8) +
      theme(
        plot.margin = unit(c(.25, .5, .25, .25), "cm"),
        strip.background = element_rect(fill = "white", colour="white"),
        axis.line.y = element_line(size = .3, colour = "black"),
        axis.line.x = element_line(size = .3, colour = "black")
      ) 
    }

Use this command to check that the custom theme is complete (Should return 'true'):

attr(theme_f1(), "complete")

Plot all causes of DALY throughout age as area graphs; group (and colour) the areas by sex; achieved generally by ggplot2 facet_wrap. Having a full label on each plot for each cause ('id') isn't really practical. The custom identifiers ('myid') work, but adding the Guide as a title isn't great. Note, the y-axis range differs between plots, and can be changed with scales = "fixed" (facet_wrap).

all.pheno.time = 
    ggplot (b, aes(years, daly, group = sex, fill = sex)) +
      geom_area(alpha = .5, position = "identity") +
      facet_wrap("myid", scales="free_y") + 
        theme_f1() +
        labs(title = "Guide: B=Birth, C=Cardio, D=Digestive, G=Genetic, I=Infection, J=inJury, M=Metabolic,
             N=Neurological, O=Oncology, P=Psychiatric, S=Sensory, U=Uro-genital, Z=other", size = 6) +
        scale_y_continuous(name="Disability-adjusted life-years (millions)", labels = NULL) +
        scale_x_continuous(name = "Age group (years)") +
          theme(legend.position="none",
                panel.grid.major.x = element_line(size = 0),
                title = element_text(size = 5))

Plot DALY for selected phenotypes, age x sex x DALY. It is also possible to plot these as smoothed lines but I feel as though plain geom_line is truer to the data.

sel.pheno.plot =  
    ggplot (c, aes(years, milcounts, group = sex, linetype = sex, colour = sex)) +
      geom_line(size = 4, alpha = .3, linetype = 1) + 
      geom_line(size = .5, alpha = .8, colour = "black") + 
        facet_wrap("id", scales="free_y", labeller = mylabs) + 
        scale_colour_discrete(name = "Legend", labels=c("Female", "Male")) +
        scale_linetype_discrete(name = "Legend", labels = c("Female", "Male")) +
        theme_f2() +
        scale_y_continuous(name = "Disability-adjusted life years (millions)") +
        scale_x_continuous(name = "Age group (years)") +
        guides (colour = guide_legend(title = "Legend")) +
          theme(legend.position = "bottom")

Scatter plots of male vs female DALY, grouped by age. Warning, points are paired/duplicated due to age/years.

scat.all.ages = 
    ggplot(morta, aes(milmale, milfemale)) + 
      geom_abline(intercept = 0, slope = 1, size=.5, linetype = 2, colour="black") +
      geom_point(size=2, shape=21) +
      geom_smooth(se=TRUE, size = 0) +
        scale_x_continuous(name = "male DALY (x10^6)") +
        scale_y_continuous(name = "female DALY (x10^6)") +
        facet_wrap("age", nrow = 2) +
        theme_f1() +
          theme(legend.position="none")

Scatter plot of male vs female DALY for age to 15-29 only, points labelled with type.

scat.mid =
    ggplot(filter(morta, years == 29.00), aes(milmale, milfemale, label = id)) + 
    geom_abline(intercept = 0, slope = 1, size=.5, linetype = 2, colour = "black") +
    geom_text(size = 2) +
    geom_smooth(se = TRUE, size = 0) +
      scale_x_continuous(name = "male DALY (x10^6)") +
      scale_y_continuous(name = "female DALY (x10^6)") +
      facet_wrap("age", nrow = 2) +
      theme_f1() +
        theme(legend.position = "none")

Combine four plots into one page. This requires the package 'gridExtra'.

all.plots = plot_grid(nrow = 2, 
                    all.pheno.time, sel.pheno.plot, 
                    scat.all.ages, scat.mid, 
                    labels = "AUTO",  label_size = 12, scale = 1)

Save combined plots, using package cowplot.

save_plot("suppfig1_all_sexMortality.pdf", all.plots, base_height = 10)
save_plot("fig1_sexMortality.pdf", sel.pheno.plot, base_height = 5)

Now, how do I display images here ??? Here are the links anyway:

https://github.com/willgilks/WHO_DALY_2012/blob/master/fig1_sexMortality.png https://github.com/willgilks/WHO_DALY_2012/blob/master/suppfig1_all_sexMortality.png

There ends my first github io page.