R8: Plot your data in R - Episode II

Advanced plots

GGplot

Plot layout

Color palettes

Interactive plots

GGplot addons

Author

Modesto

Published

August 17, 2023

Modified

December 2, 2024

1 GGplotting

As we discussed in Lesson R5, exploratory data visualization is one of the greatest advantage of R. One can quickly go from idea to data to graph with a unique balance of flexibility and ease.

There are many graphing options available in R. You already know that the graphing capabilities that come with a basic installation of R are already quite useful. There are also a number of packages for creating advances graphs like grid, plotly, or lattice. In this course we chose to use ggplot2 because it is widely used and also the basis of many derivative packages for specific advanced plots. The main advantage of ggplot is that it breaks plots into components in a way that allows beginners to create relatively complex and aesthetically pleasing plots using an intuitive and relatively easy-to-remember syntax. Moreover, although we are not working on it, it belongs to the set of libraries tidyverse and it is very well integrated with them.

From: https://ourcodingclub.github.io/tutorials/dataviz-beautification/

I must say that ggplot2 is generally less intuitive for beginners. However, once you get used to it, even if you have to look for help or examples to obtain your plots, you will find that ggplot is amazingly powerful and allows you to create perfect plot, with a high level of detail. This is because it uses a “graphing grammar” (see Wilkinson et al.2000), the gg of ggplot2. This is similar to the way that learning a language grammar can help you construct hundreds of different sentences from a small number of verbs, nouns, and adjectives, rather than memorizing each sentence. Similarly, by learning a small amount of the basic components of ggplot2 and the elements of its grammar, you will be able to create hundreds of different plots to represent the data exactly as the way you think it’s the best way or even better. As with any language, the grammar of graphics can be flexible and we may omit some elements o add more elements of the same type, just like we can add diverse kinds of complements (place, time…) to a sentence.

From a general perspective (see ref. 3), plots are composed of the data, the information you want to visualize, and a mapping, the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components (again from ref. 3):

A layer is a collection of geometric elements and statistical transformations. Geometric elements, geoms for short, represent what you actually see in the plot: points, lines, polygons, maps, etc. Statistical transformations, stats for short, summarize the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
Scales or aesthetic map is required to map values in the data space to values in the space. This includes the use of colour, shape or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot (an inverse mapping).
A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
A facet specifies how (usually a factor type vector) to break up and display subsets of data as small multiples. This is also known as conditioning or latticing.
A theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.

Let’s see all those element through some examples.

1.1 Example 1: Advanced plotting with ggplot in one line

Let’s start strong. We are going to generate a complex plot in one code line using the EU Covid-19 vaccines table (already used in R6 lesson and publicly available at https://opendata.ecdc.europa.eu/covid19/vaccine_tracker/csv/data.csv).

#load the data
library(data.table)
vaccines<-fread('data/vaccines_EU_22oct2022.csv')
#install & load ggplot2
if(!require(ggplot2)) install.packages("ggplot2", dependencies = TRUE)

Loading required package: ggplot2

library(ggplot2)
 
#step 1: create the plot object
ggplot(data = vaccines)

There are some alternative ways to create the plot than are useful for more complex plots. You can also create an object in the RStudio environment.

p <- ggplot(data = vaccines)
class(p)

[1] "gg"     "ggplot"

Also, you will see on many websites the use of the pipe (|> or %>% with dplyr()) as the first argument. This is very convenient to combine ggplot with other tidyverse packages, although we are not using that syntax in the following examples.

p <- vaccines |> ggplot()

Now, we can add some data to generate a dot plot:

(p <- ggplot(vaccines, aes(x = YearWeekISO, y = log10(FirstDose), col=TargetGroup)) )

What happened? Where are my points? We added the data and it allows to plot the axis, but we also need to add the type of plot, that is, the geometry layer. We usually build the final plot adding (just with +) more layers to the previous plot object.

(p <- p + geom_point())

Finally, we would like to group our data:

ggplot(vaccines, aes(x = YearWeekISO, y = log10(FirstDose), col=TargetGroup), stat=mean) +
    geom_point() + facet_wrap(~ ReportingCountry)

That is, quite a complex plot in just one code line.

2 Diversity of plots with ggplot

2.1 Example 2: From R base plot to personalized ggplot

The use of ggplot is usually associate to great, gorgeous charts and plots. However, once you learn how to use it and how adapt and re-adapt code to your data, you will probably use ggplot for every graph.

In the following code, we are going to use ggplot to solve the exercise 1 from Lesson R5.

#create the same dataset
set.seed(2023) #set a seed to obtain reproducible random data
(GeneA <- rnorm(50))

 [1] -0.08378436 -0.98294375 -1.87506732 -0.18614466 -0.63348570  1.09079746
 [7] -0.91372727  1.00163971 -0.39926660 -0.46812305  0.32696208 -0.41274690
[13]  0.56203647  0.66335826 -0.60289728  0.69837769  0.59584645  0.45209183
[19]  0.89674396  0.57221651 -0.41165301 -0.29432715  1.21857396  0.24411143
[25] -0.44515196 -1.84780364 -0.62882531 -0.86108069  1.51492030  2.73523893
[31] -0.27487706  1.27665407 -0.81098034 -0.04492278 -0.63941238 -0.43835600
[37]  0.50720220  1.12921920  0.97563765 -0.12460379  0.42833658  0.41872636
[43]  0.43546649 -0.20426209  0.30235349  0.69435780  2.37306566  1.07597029
[49]  0.33538478  0.75342823

(GeneB <- c(rep(-1, 30), rep(2, 20)) + rnorm(50))

 [1] -0.33561372 -2.09088266 -1.42277085  0.18340204  0.58482679  1.28085537
 [7] -3.06545705 -1.40811197 -0.69898543 -0.30338692 -0.01181613 -1.28406323
[13] -2.92226747 -2.17371980  0.04428274 -1.13292810 -0.15512296 -1.37159255
[19]  0.04758385 -0.07458432 -2.47860533 -1.92263494 -1.08490244 -1.86651004
[25] -0.32495764 -1.08070491 -1.12559261 -1.38833109 -0.98317261 -2.12630659
[31]  2.21835181  3.74303686  1.88042146  2.48617995  2.14136852  2.30796765
[37]  2.99408134  1.48218823  1.19286361  0.07682104  0.61086149  2.42867002
[43]  1.70621858  4.27999879  0.83548924  3.24086762  0.50491614  1.76836367
[49]  1.17513326  0.67582686

(tumor <- factor(c(rep("Colon", 30), rep("Lung", 20))))

 [1] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
[13] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
[25] Colon Colon Colon Colon Colon Colon Lung  Lung  Lung  Lung  Lung  Lung 
[37] Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung 
[49] Lung  Lung 
Levels: Colon Lung

#for ggplot it is more convenient to work with dataframes
genes <- data.frame(tumor,GeneA,GeneB)

#basic plots

#geneA
ggplot(genes, aes(x=tumor, y=GeneA,color=tumor)) +
  geom_boxplot()

#geneB with custom colors
ggplot(genes, aes(x=tumor, y=GeneB,fill=tumor)) +
  scale_fill_manual(values=c("#999999", "#E69F00")) +
  geom_boxplot()

#together
#we need to adapt dataset with stack()
genes2 <- cbind(stack(genes[,2:3]),tumor)
names(genes2) <- c("Expression","Gene","Tumor")
#plot
p <- ggplot(genes2, aes(x=Tumor, y=Expression,fill=Tumor)) +
  scale_fill_manual(values=c("#999999", "#E69F00")) +
  geom_boxplot()
p #see the plot

Quick exercise

Do you remember how you did the boxplot in two panels with base R plots?

#define the setup
par(mfrow = c(1, 2)) 

#generate the two plots
boxplot(GeneA ~ tumor, col=c("#999999", "#E69F00"),ylim = c(-4, 5))
boxplot(GeneB ~ tumor, col=c("#999999", "#E69F00"),ylim = c(-4, 5))

Now let’s try the same with ggplot:

How would you generate a two-panels boxplot for each gene with ggplot?

ggplot(genes2, aes(x=Tumor, y=Expression,fill=Gene)) + scale_fill_manual(values=c(“#999999”, “#E69F00”)) + geom_boxplot() ggplot(genes2, aes(x=Tumor, y=Expression,fill=Tumor)) + scale_fill_manual(values=c(“#999999”, “#E69F00”)) + geom_boxplot() + facet_wrap(.~ Gene) ggplot(genes2, aes(x=Gene, y=Expression,fill=Tumor)) + scale_fill_manual(values=c(“#999999”, “#E69F00”)) + geom_boxplot() ggplot(genes2, aes(x=Gene, y=Expression,fill=Gene)) + scale_fill_manual(values=c(“#999999”, “#E69F00”)) + geom_boxplot(.~ Tumor)

What do you think?

As we mentioned above, ggplot may seem more complicated but when you compare it’s a flexible and powerful way to obtain cool plots with one single line code.

For simple plots, the degree of difficulty and the time consumption of making them with base R plot functions like boxplot() or stripchart() or with ggplot() is very similar, but this is only the very tip of the ggplot iceberg.

Customization of your plot is very easy thanks to the themes, and other options, as in the examples below. Check for built-in ggplot themes: https://ggplot2.tidyverse.org/reference/ggtheme.html. Also, you can find some packages with custom themes and you can create your own theme (check this article by Emanuela Furfaro).

(q <- p + facet_grid(.~ Gene) + theme_light())

(q2 <- p+facet_grid(.~ Gene) + theme_dark())

(q3 <- p+facet_grid(.~ Gene) + theme_linedraw())

q+ stat_boxplot(geom ='errorbar', width=0.5, alpha=0.5)

q + theme(legend.position="top")

q + theme(legend.position="bottom")

q + theme(legend.position="none")

Now, let’s try some more cool example plots with these same data. Note the effect of the option position in the dotplots and how we can make different plots with the data, like density curves and violin plots.

r <- ggplot(genes2, aes(x=Gene, y=Expression,fill=Tumor)) +
  scale_fill_manual(values=c("#999999", "#E69F00")) +
  geom_boxplot()

r + geom_dotplot(binaxis='y', stackdir='center',
                 position=position_dodge(0.75))

Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

r + geom_point(pch = 21, position = position_jitterdodge())

#more examples
a <- ggplot(genes2, aes(x = Expression))
a + geom_density(aes(fill = Gene),alpha=0.4) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF")) +
  theme_classic() + facet_grid(.~ Tumor)

library(RColorBrewer)
a <- ggplot(genes2, aes(x = Gene, y=Expression, fill=Tumor)) + 
  scale_fill_brewer(palette = "Pastel1") + 
  geom_violin(alpha=0.4, position="dodge")
a

2.2 Example 3: Color palettes and other plot customization

When you have large datasets or several factors in your data, selecting the colors is not trivial. In R, there are defined color combinations or palettes that you can select in your plot. Moreover, there are also several packages that contain custom color palettes suitable for base plots and/or ggplots, like viridis or RColorBrewer. You may also find interesting the package ggsci, which contains palettes with colors used in scientific journals, data visualization libraries, science fiction movies…

To obtain the desired plot, you sometimes need to rotate the axis x labels, which must be done within a theme() argument. You can customize the label rotation, justification, font…

coli_genomes <- read.csv(file = 'data/coli_genomes_renamed.csv', strip.white = TRUE, stringsAsFactors = TRUE)
p <- ggplot(coli_genomes,aes(x=Strain, y=contigs1kb, fill=Source)) + geom_bar(stat="identity", position = "dodge", alpha=0.6) 
p

p2 <- p + scale_fill_brewer(palette = "Dark2") + theme_linedraw() 
p2

p2b <- p + scale_fill_brewer(palette = "Pastel2") + theme_linedraw() 
p2b

p3 <- p2 + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, face = "bold"))
p3

In the examples above, we generated a plot and then make variations with some custom palettes (p2). The palettes are available in ggplot, but in order to explore and edit them, you need to install the package. Also, in the plot p3, we modified the font and orientation of the axis labels using a new theme() layer, that contains the text personalizations.

As mentioned above, one of the most common palette packages is RColorBrewer. Let’s check it out.

if(!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all()

display.brewer.all(colorblindFriendly = TRUE) #only colorblind-friendly!

#display only the a number of colors from specific palette
display.brewer.pal(n = 3, name = 'Dark2')

Saving your plot

Finally, saving plots is also very easy with ggplot2, check the function ggsave().

#save 
ggsave(filename = "plot_p3.svg", plot = p3,width = 10,height = 6)

Of course, you can also do it the same way than with Base plots:

svg("plot_p3b.svg")
print(p3)
dev.off()

quartz_off_screen 
                2

2.3 Example 4: Plotting multiple variables

One of the nice ways to summarize data is plotting more than one variable in the same plot. However, you should consider that different data may need different scale to be render in the same plot. Thus, you should scale up/down the data of one variable to the values of the second variable. Then, for the secondary axis, you just need to apply the same scaling factor in the opposite direction.

In the following example, we plot the number of Contigs and the Assembly length from our coli_genomes.csv.

#barplot
plots2 <- ggplot(data=coli_genomes) + 
  geom_bar(aes(x = Strain, y = Contigs, color = Source, fill = Source), stat="identity") 
#add the points and adjust the scale in the right axis
plots2b <- plots2 +
  geom_point(aes(x = Strain, y = Assembly_length/15000,  fill = Source, alpha=0.8),col="black",shape=21, size=3)
plots2b

plots2c <- plots2b  +   scale_y_continuous(name = "Number of Contigs",limits = c(0, 400), expand=c(0,0), sec.axis = sec_axis(~15000 * ., name  = "Assembly length")) 
plots2c

#add colors and more customization
plots2custom <- plots2b +   
    scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1") +
    theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold")) +
    guides(alpha = "none", color = "none") 
plots2custom

We did the plot in different stages, to check each of the variables and the scale of the secondary axis before apply the customization. As the two variables have different range of value, we need to scale them. In this example, we scale-down the data (/ 15000) and then scale-up the axis to show the real data (* 15000).

As you noticed, axis customization may entail adjust the limits with limits(), the axis expansion below and above those limits with expand(), and other aspects as the axis ticks with breaks(). We also can add/remove a legend with the argument guides().

3 ggplot and beyond

As mentioned above, ggplot is already an standard and the base of many derivative packages. We are going to see a couple of examples, but there are many, including packages for specific tasks, like working with maps or sequences. We will see in the lesson R9 the packages ggmsa() and ggseqlogo().

3.1 Interactive plots

Another interesting application of ggplot is its use for the generation of interactive plots to be published on websites. One that you might find of interest is the package heatmaply, that generates interactive heatmaps. Further, I find awesome the use of the package plotly() for very quick upgrade of any plot as interactive.

See the examples:

#install.packages("ggplotly")
#install.packages("heatmaply")

library(heatmaply)

Loading required package: plotly


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Loading required package: viridis

Loading required package: viridisLite


======================
Welcome to heatmaply version 1.5.0

Type citation('heatmaply') for how to cite the package.
Type ?heatmaply for the main documentation.

The github page is: https://github.com/talgalili/heatmaply/
Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
You may ask questions at stackoverflow, use the r and heatmaply tags: 
     https://stackoverflow.com/questions/tagged/heatmaply
======================

#aggregate the data with xtabs
matrix <- xtabs( ~ coli_genomes[,4]+coli_genomes[,5])
#xtabs objects must be converted into dataframes, but heatmaply requires a matrix...
heatmaply(as.data.frame.matrix(matrix))

library(plotly)
ggplotly(plots2custom)

4 References

R in action. Robert I. Kabacoff. March 2022 ISBN 9781617296055
R Graphics Cookbook: https://r-graphics.org/ (I recommend the Appendix A: Understanding ggplot).
GGplot cheatsheet: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf
ggplot2: elegant graphics for data analysis: https://ggplot2-book.org/index.html
ggplot in “Introducción a la ciencia de datos” (Spanish!): http://rafalab.dfci.harvard.edu/dslibro/ggplot2.html
“Una breve introducción a ggplot” by Prof. José Ramón Berrendero @UAM (Spanish!): http://verso.mat.uam.es/~joser.berrendero/R/introggplot2.html
Color palettes in R: https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
Awesome ggplot2 addons and tricks: https://github.com/erikgahner/awesome-ggplot2

5 Exercises

1. The table coli_curve.csv contains the growth curves of three E. coli strains. Plot the curves as scatterplot and lines containing the 90% confidence interval for the three strains.

Tips.

Check the ggplot geom geom_smooth() for the confidence interval.
Also, it would be better if you color the data by strains, but keep the information of each sample measure using, for instance, the point shape.

2. The table microbe_download.csv contains the data of worldwide human deceases associated with antimicrobial resistance, as indicated in the column Counterfactual (data from https://vizhub.healthdata.org/microbe/). Explore the data and use it to reproduce the following barplots.

Tips.

You will need to check geom_errorbar() for the first plot.
For the second plot, you will need to order the data by the total number of Deaths.
Also, adjust the plot margins in order to show all the x-axis labels.

6 Extra exercises

Solved exercises from NCI-NIH: https://btep.ccr.cancer.gov/docs/data-visualization-with-r/Practice_Exercises/ExercisesLesson2/
Solved exercises from Babraham Institute ggplot course:

7 Session Info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] heatmaply_1.5.0    viridis_0.6.5      viridisLite_0.4.2  plotly_4.10.4     
 [5] RColorBrewer_1.1-3 ggplot2_3.5.1      data.table_1.16.2  webexercises_1.1.0
 [9] formatR_1.14       knitr_1.48        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    tidyr_1.3.1       stringi_1.8.4    
 [5] digest_0.6.37     magrittr_2.0.3    evaluate_1.0.1    grid_4.4.1       
 [9] iterators_1.0.14  fastmap_1.2.0     plyr_1.8.9        foreach_1.5.2    
[13] jsonlite_1.8.9    seriation_1.5.6   gridExtra_2.3     httr_1.4.7       
[17] purrr_1.0.2       fansi_1.0.6       crosstalk_1.2.1   scales_1.3.0     
[21] codetools_0.2-20  lazyeval_0.2.2    textshaping_0.4.0 registry_0.5-1   
[25] cli_3.6.3         rlang_1.1.4       munsell_0.5.1     withr_3.0.2      
[29] yaml_2.3.10       tools_4.4.1       reshape2_1.4.4    dplyr_1.1.4      
[33] colorspace_2.1-1  webshot_0.5.5     assertthat_0.2.1  ca_0.71.1        
[37] TSP_1.2-4         vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[41] stringr_1.5.1     htmlwidgets_1.6.4 ragg_1.3.3        pkgconfig_2.0.3  
[45] dendextend_1.18.1 pillar_1.9.0      gtable_0.3.6      Rcpp_1.0.13      
[49] glue_1.8.0        systemfonts_1.1.0 xfun_0.48         tibble_3.2.1     
[53] tidyselect_1.2.1  rstudioapi_0.17.1 farver_2.1.2      htmltools_0.5.8.1
[57] rmarkdown_2.28    svglite_2.1.3     labeling_0.4.3    compiler_4.4.1