In this data science tutorial, we will use an avocado prices dataset to model the kinds of analysis you might want to perform on arbitrary data. This data is interesting because the cost of one popular produce item can be indicative of many things, such as the cost of living in an area and the relative health of the region's residents.

You might want to analyze this data if you are an avocado fanatic, such as a millennial, who wants to know where to get them cheap, or if you are an avocado supplier who wants to see which regions have the strongest demand in order to improve your supply chain model.

In combination with our avocados dataset we will also be analyzing a housing affordability table. We will be joining our two datasets together to analyze which cities are best to buy homes in for avocado lovers. Thus a lot of the data cleaning that we will be performing will be in preparation to join our two datasets together.
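To preview the mechanics of that join, here is a minimal sketch on hypothetical toy data (the region names and values below are made up for illustration, not taken from either dataset):

```r
library(dplyr)

# Hypothetical avocado prices and affordability values, keyed by region
avocado_prices <- tibble::tibble(region = c("Boston", "Houston"),
                                 avg_price = c(1.58, 1.04))
affordability <- tibble::tibble(region = c("Boston", "Houston"),
                                afford_index = c(95, 140))

# An inner join keeps only regions present in both tables
joined <- inner_join(avocado_prices, affordability, by = "region")
joined
```

An inner join drops any row whose key appears in only one table, which is why joining can leave some entities obsolete, a point we will return to in the data cleaning below.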

Required Tools

For this tutorial you will need to have R 3.5.0 or above. We use RStudio, but you can use any IDE you want.

About Our Datasets

The avocado prices dataset details the cost of a single avocado from 2015 to present, across different American cities and regions. It also details the total volume of avocados sold, the quantity they are sold in (small bags, large bags, or XL bags), and the type of avocado (3 different PLU varieties as well as organic or conventional). You can find the dataset here.

List of Topics

  1. Exploratory Data Analysis (EDA)
  2. Machine Learning
  3. Analysis of Findings


These are the packages that we will be using.

library(tidyverse) # readr, dplyr, tidyr, ggplot2
library(lubridate) # month()

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

This is our avocado dataset mentioned earlier.

avocado_file <- "avocado.csv"

avocado <- read_csv(avocado_file) %>%
  readr::type_convert(col_types = cols(Date = col_date(format = "%Y-%m-%d"))) # parsing Date as a Date column

head(avocado)

## # A tibble: 6 x 13
##   Date       AveragePrice `Total Volume` `4046` `4225` `4770` `Total Bags`
##   <date>            <dbl>          <dbl>  <dbl>  <dbl>  <dbl>        <dbl>
## 1 2015-12-27         1.33         64237.  1037. 5.45e4   48.2        8697.
## 2 2015-12-20         1.35         54877.   674. 4.46e4   58.3        9506.
## 3 2015-12-13         0.93        118220.   795. 1.09e5  130.         8145.
## 4 2015-12-06         1.08         78992.  1132  7.20e4   72.6        5811.
## 5 2015-11-29         1.28         51040.   941. 4.38e4   75.8        6184.
## 6 2015-11-22         1.26         55980.  1184. 4.81e4   43.6        6684.
## # … with 6 more variables: `Small Bags` <dbl>, `Large Bags` <dbl>, `XLarge
## #   Bags` <dbl>, type <chr>, year <dbl>, region <chr>

Cleaning our avocado dataset

Here we change the column names so that they are consistent throughout our dataset and easier to read. Then we check all of our columns for missing values.

We also edit how time is represented in our dataset, segmenting the year into quarters rather than worrying about the exact date. This is useful for two reasons. First, it makes sorting by time a bit easier, because two entities with close dates will be lumped into the same category rather than treated as distinct. Second, our affordability table segments time the same way, so it will help with joining the tables later on.
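The bucketing rule used in the cleaning code can be sketched as a small standalone function (the helper name here is ours, introduced only for illustration): each month maps to the starting month of its quarter, with December through February sharing the 12 bucket.

```r
# Map a month number to the starting month of its quarter:
# Dec-Feb -> 12, Mar-May -> 3, Jun-Aug -> 6, Sep-Nov -> 9
quarter_start <- function(month) {
  ifelse(month < 3, 12,
    ifelse(month < 6, 3,
      ifelse(month < 9, 6,
        ifelse(month < 12, 9, 12))))
}

quarter_start(c(1, 4, 7, 10, 12)) # 12 3 6 9 12
```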

# Cleaning the avocado data

avocado_tab <- avocado %>%
  rename("avg_price" = AveragePrice) %>%
  rename("date"= Date) %>%
  rename("total_volume" = `Total Volume`) %>%
  rename("total_bags" = `Total Bags`) %>%
  rename("small_bags" = `Small Bags`) %>%
  rename("large_bags" = `Large Bags`) %>%
  rename("xl_bags" = `XLarge Bags`) %>%
  mutate(month = month(date)) %>%
  # Creating a quarter factor that is represented by the beginning month of the quarter. 
  # This is to allow us to join on the affordability dataframe more easily
  mutate(quarter = case_when(month < 3 ~ 12,
                             month < 6 ~ 3,
                             month < 9 ~ 6,
                             month < 12 ~ 9,
                             TRUE ~ 12)) %>%
  # Creating the same time variable as is in the affordability dataframe using the year 
  # and quarter month to make joining easy
  unite("time", c("year", "quarter"), sep="-", remove = FALSE) %>%
  select(-quarter) %>%
  type_convert(col_types=cols(time=col_date(format="%Y-%m"))) %>%
  # Calculating the mean price of each quarter for each region 
  group_by(time, region) %>%
  mutate(quarter_mean_price = mean(avg_price)) %>%
  ungroup() %>%
  # Standardizing the average price of an avocado by region
  group_by(region) %>%
  mutate(mean_price = mean(quarter_mean_price)) %>%
  mutate(sd_price = sd(avg_price)) %>%
  mutate(norm_price = (quarter_mean_price - mean_price) / sd_price) %>%
  ungroup() %>%
  mutate(capital_count = stringi::stri_count_regex(region, "[[:upper:]]")) # number of capital letters

missing_values <- avocado %>% summarise_all(~ sum(is.na(.)) / n()) # fraction of missing values in each column
missing_values
## # A tibble: 1 x 13
##    Date AveragePrice `Total Volume` `4046` `4225` `4770` `Total Bags`
##   <dbl>        <dbl>          <dbl>  <dbl>  <dbl>  <dbl>        <dbl>
## 1     0            0              0      0      0      0            0
## # … with 6 more variables: `Small Bags` <dbl>, `Large Bags` <dbl>, `XLarge
## #   Bags` <dbl>, type <dbl>, year <dbl>, region <dbl>

We have found that our data does not contain any missing values. We are kind of lucky in this regard, as a lot of the time when doing data analysis you will end up using data that is not as pristine as this dataset and might have many missing values. If you do find missing values in your dataset you have a few remedies to make your data tidier.

What to do about Missing Values (if you have them)

Method 1. Leave as missing: This is the easiest method, but not always the right option. You would mostly use this when dealing with missing categorical or text-based data.

Method 2. Data Manipulation: Imagine if some of our missing values were in a numerical attribute. We could then replace the missing value with the average of the rest of our data in that specific attribute. This is also called Imputation. It is easy to do, and if you do it well, it shouldn’t lead to much bias in your data.
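As a sketch of mean imputation, assume a toy data frame with NAs in a numeric column (the values below are made up for illustration):

```r
library(dplyr)

df <- tibble::tibble(price = c(1.2, NA, 1.5, 1.8, NA))

# Replace each NA with the mean of the observed values
df_imputed <- df %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))

df_imputed$price # 1.2 1.5 1.5 1.8 1.5
```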

Method 3. Removal: A lot of the time when working with data you will end up with data that is better to remove than to work with. You will see this later in our example, when joining our two datasets creates entities that are obsolete in our joined dataframe.
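A minimal sketch of dropping incomplete rows, again on made-up toy data:

```r
library(dplyr)
library(tidyr)

df <- tibble::tibble(region = c("Boston", NA, "Chicago"),
                     price = c(1.5, 1.2, NA))

# drop_na() removes every row containing at least one NA
complete <- df %>% drop_na()
complete # one row left: Boston, 1.5
```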

Part 1. EDA

Let's start off with something simple. Suppose we just want to see which cities had the highest and lowest avocado prices in 2018, the most recent year in the dataset.

city_prices <- avocado_tab %>%
  select(region, year, avg_price) %>%
  group_by(region, year) %>%
  summarise(avg_price = mean(avg_price))

most_expensive <- city_prices %>%
  filter(year == 2018) %>%
  arrange(desc(avg_price)) # highest prices first

least_expensive <- city_prices %>%
  filter(year == 2018) %>%
  arrange(avg_price) # lowest prices first

most_expensive
least_expensive

## # A tibble: 54 x 3
##    region               year avg_price
##    <chr>               <dbl>     <dbl>
##  1 HartfordSpringfield  2018      1.68
##  2 Boston               2018      1.58
##  3 NewYork              2018      1.57
##  4 Chicago              2018      1.56
##  5 SanFrancisco         2018      1.55
##  6 RaleighGreensboro    2018      1.54
##  7 Boise                2018      1.49
##  8 Charlotte            2018      1.48
##  9 SanDiego             2018      1.48
## 10 Northeast            2018      1.47
## # … with 44 more rows
## # A tibble: 54 x 3
##    region            year avg_price
##    <chr>            <dbl>     <dbl>
##  1 Houston           2018      1.04
##  2 DallasFtWorth     2018      1.10
##  3 SouthCentral      2018      1.10
##  4 PhoenixTucson     2018      1.16
##  5 Indianapolis      2018      1.17
##  6 Columbus          2018      1.17
##  7 Detroit           2018      1.18
##  8 NewOrleansMobile  2018      1.20
##  9 Nashville         2018      1.22
## 10 BuffaloRochester  2018      1.23
## # … with 44 more rows

From this we can see that the most expensive region is the Hartford & Springfield areas, and the least expensive region is in Houston.

Average Prices Over Time

Here we plot the overall average price of avocados over time. Using this plot we can see how the prices have changed over time.

avocado_tab %>%
  select(time, quarter_mean_price, region) %>%
  group_by(time) %>%
  summarise(avg_quarter_price = mean(quarter_mean_price)) %>%
  ggplot(aes(y = avg_quarter_price, x = time)) +
  geom_line() + xlab("Time") + ylab("Average Price")

Analyzing the plot, we can see that the price of avocados varies greatly from year to year. There are many reasons these fluctuations could be occurring, but one could theorize that they are caused by different avocado crop yields at different times of year. We should take note of these fluctuations, because they will be important when we analyze how avocado prices and affordability are related.

Purchase Volume by Month

We saw in the previous plot that avocado prices are varying over the course of the year, so now another question that we might ask is, what month of the year are the most avocados being purchased and what are the average prices for each month?

avocado_tab %>%
  select(month, total_volume) %>%
  group_by(month) %>%
  summarise(avg_monthly_volume = mean(total_volume)) %>%
  ggplot(aes(y = avg_monthly_volume, x = factor(month))) +
  geom_bar(stat = "identity") + xlab("Month") + ylab("Average Volume")

Looking at the months all together, we can see that while monthly volumes do not vary that much, there are certain months where, on average, more avocados are purchased. People seem to purchase more avocados in February, May, and June, and conversely buy fewer avocados in the later months of September through December. The highest volume of avocados is purchased in February. Since all of our data is collected from the United States, it could be that people are buying a lot of avocados to make guacamole for Super Bowl Sunday.

Average Price by Month

In what months do avocados cost the most? Analyzing this alongside our previous plot could help us draw some conclusions about why people buy more avocados in one month versus another. We would expect months with lower average costs to see higher average sales.

avocado_tab %>%
  select(month, avg_price) %>%
  group_by(month) %>%
  summarise(avg_monthly_price = mean(avg_price)) %>%
  ggplot(aes(y = avg_monthly_price, x = factor(month))) +
  geom_bar(stat = "identity") + xlab("Month") + ylab("Average Price")

We can see that our hypothesis was generally supported as the month with the highest sales (February), also had the lowest average price.

Volume Sold at Different Price Ranges

This plot could be useful to someone who is concerned with the amount of avocados that sell at different price points. Say an avocado supplier wanted to know how much they should charge for their avocados. It would be very useful to the supplier if they were to know at what price points the retailer is able to best sell avocados. The supplier would then be able to take this into account when analyzing their supply chain costs.

avocado_tab %>%
  select(avg_price, total_volume) %>%
  group_by(avg_price) %>%
  summarise(avg_volume = mean(total_volume)) %>%
  ggplot(aes(y = avg_volume, x = cut(avg_price, breaks = 7))) +
  geom_bar(stat = "identity") + xlab("Price ranges") + ylab("Average Volume")