MÉMOVISUELDAGRONOMIE

Yield gap between countries


The illustration we are about to reproduce shows yield differences between countries for wheat and maize. In this data visualization, yields are represented in the form of a waffle chart.

Basically, a waffle chart is a grid made up of different cells, the number of which can vary according to the value you wish to represent (think of a download bar to which you would gradually add squares one by one until you reach 100). The squares can also be coloured according to different categories, to reproduce a square pie chart.

In this tutorial, we’ll see how waffle charts can be adapted to reproduce distribution curves.

1. Load packages and data

In R, waffle charts may be created with the {waffle} package. In addition to this library, we will load several packages that we will need for this tutorial:

library(tidyverse)
library(waffle)   
# For text customization:
library(ggtext)
# The following spatial packages will only be used to assign continents to countries:
library(rnaturalearth) 
library(sf)        

Average national wheat and corn yields for the period 2010 to 2020. have been downloaded from FAOSTAT.

data<-read_csv('https://raw.githubusercontent.com/BjnNowak/book/main/data/wheat_maize_2010_2020.csv')     

2. Compute mean yields for the period

The great advantage of FAOSTAT data tables is that they are always formatted in the same way. So, once you’re used to them, it’s easy to find your way around. What’s more, these tables are in “long” format, making them easy to process with the tidyverse.

About the columns we’ll be using here:

  • the country code is stored in the ‘Area Code (M49)’ column

  • the ‘Item’ column indicates the type of crop (wheat or maize here)

  • the ‘Value’ column gives the value of the variable of interest (here, yield)

The average yield over the period is therefore calculated as follows:

clean<-data%>%
  # Changing column name to avoid space
  dplyr::rename(M49 = 'Area Code (M49)')%>%
  drop_na(Value)%>%
  group_by(M49, Item)%>%
  # Mean yield in tons
  summarize(yield=mean(Value)/10000)%>%
  ungroup()

head(clean)
## # A tibble: 6 × 3
##   M49   Item         yield
##   <chr> <chr>        <dbl>
## 1 004   Maize (corn)  1.91
## 2 004   Wheat         2.00
## 3 008   Maize (corn)  6.78
## 4 008   Wheat         4.04
## 5 012   Maize (corn)  3.44
## 6 012   Wheat         1.62

We’re now going to assign a continent to each country. To do this, we’ll use the world map available in the {rnaturalearth} package, keeping only the attribute table and ignoring the geometry.

The join will be based on the United Nations codes (column ‘M49’ in our data table, column ‘un_a3’ in the {rnaturalearth} table).

wrld <- ne_countries(type = "countries", scale = "small", returnclass = "sf")%>%
  # Drop geometry and keep only attribute table
  st_drop_geometry()

clean<-clean%>%
  # Join by UN code
  left_join(wrld,by=c('M49'='un_a3'))%>%
  select(M49,Item,yield,continent)%>%
  drop_na(continent)

head(clean)
## # A tibble: 6 × 4
##   M49   Item         yield continent
##   <chr> <chr>        <dbl> <chr>    
## 1 004   Maize (corn)  1.91 Asia     
## 2 004   Wheat         2.00 Asia     
## 3 008   Maize (corn)  6.78 Europe   
## 4 008   Wheat         4.04 Europe   
## 5 012   Maize (corn)  3.44 Africa   
## 6 012   Wheat         1.62 Africa

3. Make waffle chart for wheat

To create our waffle charts, we need to (i) create yield classes and (ii) count the number of countries in each class (with a different count for each continent).

This is done as follows:

clean<-clean%>%
  mutate(
    rd=floor(yield),
    ct=1
  )%>%
  group_by(rd, Item, continent)%>%
  summarize(
    n=sum(ct)
  )%>%
  ungroup()

head(clean)
## # A tibble: 6 × 4
##      rd Item         continent         n
##   <dbl> <chr>        <chr>         <dbl>
## 1     0 Maize (corn) Africa           13
## 2     0 Maize (corn) North America     1
## 3     0 Maize (corn) Oceania           1
## 4     0 Wheat        Africa            5
## 5     0 Wheat        North America     1
## 6     1 Maize (corn) Africa           24

We’re now ready to create our first distribution curve plotted as a waffle chart! But to make things easier, we’ll start with wheat only.

With {waffle}, waffle charts are created with geom_waffle().

With this function, there’s one important thing to know: the yield classes here must be added as successive graphs (facets), not simply as classes on the x-axis.

ggplot(
    # Keep only data for wheat
    clean%>%filter(Item=="Wheat"),
    aes(values=n, fill=continent)
  )+
  waffle::geom_waffle(
    n_rows = 4,              # Number of squares in each row
    color = "white",         # Border color
    flip = TRUE, na.rm=TRUE
  )+
  facet_grid(~rd)+
  coord_equal()

Note that the x-axis and y-axis values don’t correspond to anything significant here (we’ll remove them later). Only the facet titles are important (0 for countries with yields from 0 to 1 ton, etc.).

4. Make waffle chart for both crops

We’re now going to present the values for the two crops, wheat and maize, simultaneously but keeping only the yields below 10 tons.

ggplot(
    # Keep only data with yield below 10 t
    clean%>%filter(rd<10),
    aes(values=n, fill=continent)
  )+
  waffle::geom_waffle(
    n_rows = 4,              # Number of squares in each row
    color = "white",         # Border color
    flip = TRUE, na.rm=TRUE
  )+
  # Add Item in facet_grid
  facet_grid(Item~rd)+
  coord_equal()

If you try to remove the filter on the yield, you’ll see that R returns an error message. What’s the problem? Because, unlike corn, wheat doesn’t have a country with a mean yield above 10 tons. The corresponding facets cannot be represented with ggplot().

The workaround? Create an imaginary continent that can be present for all combinations, but which we’ll then make disappear in the color palettes!

This is how we will proceed:

cpl<-clean%>%
  # Removing some countries with unrealistic yield for maize
  filter(rd<14)%>%
  # Complete all combinations for real continents, but with 0 value 
  # (this prevents cases from being created during the 2nd application of the function)
  complete(
    rd,Item,continent, 
    fill=list(n=0)
  )%>%
  # Add imaginary 'z' continent
  add_row(continent='z',Item='Wheat',n=0,rd=1)%>%
  # Complete all combination for the 'z' continente
  complete(
    rd,Item,continent, 
    fill=list(n=1)
  )

We can now plot the data without filter.

ggplot(
    cpl, # Use cpl (complete) tibble 
    aes(values=n, fill=continent)
  )+
  waffle::geom_waffle(
    n_rows = 4,              # Number of squares in each row
    color = "white",         # Border color
    flip = TRUE, na.rm=TRUE
  )+
  # Add Item in facet_grid
  facet_grid(Item~rd)+
  coord_equal()

5. Plot customization

We now have some work to do to customize the graph ! First thing: create two color palette (one for fill, one for borders) to hide the imaginary z continent.

Here’s the palette color I use for each continent in my book. Using the same color for a given continent throughout the book makes it easier to read.

pal_fill <- c(
  "Africa" = "#FFC43D", "Asia" = "#F0426B", "Europe" = "#5A4FCF",
  "South America" = "#06D6A0", "North America" = "#059E75",
  "Oceania" = "#F68EA6",
  # Set alpha to 0 to hide 'z'
  'z'=alpha('white',0)
)

pal_color <- c(
  "Africa" = "white", "Asia" = "white", "Europe" = "white",
  "South America" = "white", "North America" = "white",
  "Oceania" = "white",
  # Set alpha to 0 to hide 'z'
  'z'=alpha('white',0)
)

It also seems more natural to place corn at the bottom (as it has more yield classes). We’ll do this by reordering the factors:

# Order crop names
cpl$Item <- as.factor(cpl$Item)
cpl$Item <- fct_relevel(cpl$Item, "Wheat","Maize (corn)")

Let’s go for a new version of the plot:

p1<-ggplot(
  cpl,
  aes(values=n,fill=continent,color=continent)
)+
  waffle::geom_waffle(
    n_rows = 3, 
    flip = TRUE, na.rm=TRUE
  )+
  facet_grid(
    Item~rd, 
    # Switch both facet position
    switch="both"
  )+
  scale_x_discrete()+
  scale_fill_manual(values=pal_fill)+
  scale_color_manual(values=pal_color)+
  coord_equal()

p1

We are now going to place the legend in the subtitle (with each continent name colored accordingly), using {ggtext} syntax:

title<-"<b>The Yield Gap</b>"
sub<-"Each square is a country, colored according to continent:<br><b><span style='color:#FFC43D;'>Africa</span></b>, <b><span style='color:#F0426B;'>Asia</span></b>, <b><span style='color:#5A4FCF;'>Europe</span></b>, <b><span style='color:#059E75;'>Central and North America</span></b>, <b><span style='color:#06D6A0;'>South America</span></b> or <b><span style='color:#F68EA6;'>Oceania</span>. "

We may now add these labels to our plot:

p2<-p1+
  # Hide legend
  guides(fill='none',color='none')+
  # Add title and subtitle
  labs(title=title,subtitle=sub)+
  theme(
    # Enable markdown for title and subtitle
    plot.title=element_markdown(),
    plot.subtitle=element_markdown(),
    # "Clean" facets 
    panel.background=element_rect(fill="white"),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    strip.background.x = element_rect(fill="white"),
    strip.background.y = element_rect(fill="dimgrey"),
    strip.text.y = element_text(color="white")
  )

p2

Here’s a relatively “clean” version of the chart!

What can we see here?

  • For both crops, the distribution curves are centered on low yields

  • A majority of African countries show low to very low yields

In this way, major improvements in food production could be achieved by improving yields in the least efficient countries (by moving towards a “Gaussian” distribution) rather than by gaining a few percentages of production in already efficient countries. This is all the more true given that the higher the yields, the greater the quantity of inputs needed to further increase productivity (and therefore the lower the efficiency of these inputs).