Weather Data Analysis Project: JFK Airport

# Library for modeling
library(tidymodels)
# Load tidyverse
library(tidyverse)
This is the final project for the Data Analysis with R course, part of the IBM Data Analytics with Excel and R Professional Certificate and the IBM Applied Data Science with R Specialization. In this project, I will practice reading data files, preprocessing data, creating models, improving them, and evaluating them to select the best model.
Introduction
This project relates to the NOAA Weather Dataset - JFK Airport (New York). The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
For this project, you will be using a subset of the dataset, which contains 5727 rows (about 5% of the original rows) and 9 columns. The end goal will be to predict the precipitation using some of the available features. In this project, you will practice reading data files, preprocessing data, creating models, improving models, and evaluating them to ultimately choose the best model.
0. Import required modules
Tidymodels is a collection of packages that use tidyverse principles to carry out the entire modeling process, from preprocessing the initial data, to creating a model, to tuning hyperparameters. The tidymodels packages can be used to produce high-quality statistical and machine learning models. Our Jupyter notebook platform has the Tidyverse, Tidymodels and rlang packages built in, so we do not need to install them before loading. However, if you decide to run this lab locally in RStudio Desktop, you can uncomment the installation lines of code to install these packages before loading them.
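If you are running locally, the installation step can look like the sketch below (kept commented out so it only runs when you deliberately uncomment it; package names are the ones used in this project):

```r
# Uncomment and run once if the packages are not already installed
# (not needed on the hosted Jupyter environment):
# install.packages("tidymodels")
# install.packages("tidyverse")
# install.packages("rlang")
```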
Understand the Dataset
The original NOAA JFK dataset contains 114,546 hourly observations of various local climatological variables (including temperature, wind speed, humidity, dew point, and pressure).
In this project you will use a sample dataset, which is around 293 KB. Link to the sample dataset.
The sample contains 5727 rows (about 5% of the original rows) and 9 columns, which are:
DATE
HOURLYDewPointTempF
HOURLYRelativeHumidity
HOURLYDRYBULBTEMPF
HOURLYWETBULBTEMPF
HOURLYPrecip
HOURLYWindSpeed
HOURLYSeaLevelPressure
HOURLYStationPressure
The original dataset is much bigger. Feel free to explore the original dataset. Link to the original dataset.
For more information about the dataset, check out the preview of NOAA Weather - JFK Airport.
1. Download NOAA Weather Dataset
Use the download.file() function to download the sample dataset from the URL below.
URL <- "https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz"
# Download the file
download.file(URL, destfile = "noaa-weather-sample-data.tar.gz")

Untar the zipped file.
untar("noaa-weather-sample-data.tar.gz", tar = "internal")

Warning in untar2(tarfile, files, list, exdir, restore_times): using pax extended headers
2. Extract and Read into Project
We start by reading in the raw dataset. You should specify the file name as “noaa-weather-sample-data/jfk_weather_sample.csv”.
# Read the CSV file into a dataframe
jfk_weather <- read_csv("noaa-weather-sample-data/jfk_weather_sample.csv")

Rows: 5727 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): HOURLYPrecip
dbl (7): HOURLYDewPointTempF, HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, H...
dttm (1): DATE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Next, display the first few rows of the dataframe.
head(jfk_weather)

# A tibble: 6 × 9
DATE HOURLYDewPointTempF HOURLYRelativeHumidity
<dttm> <dbl> <dbl>
1 2015-07-25 13:51:00 60 46
2 2016-11-18 23:51:00 34 48
3 2013-01-06 08:51:00 33 89
4 2011-01-27 16:51:00 18 48
5 2015-01-03 12:16:00 27 61
6 2013-02-15 20:51:00 35 79
# ℹ 6 more variables: HOURLYDRYBULBTEMPF <dbl>, HOURLYWETBULBTEMPF <dbl>,
# HOURLYPrecip <chr>, HOURLYWindSpeed <dbl>, HOURLYSeaLevelPressure <dbl>,
# HOURLYStationPressure <dbl>
Also, take a glimpse of the dataset to see the different column data types and make sure it is the correct subset dataset with about 5700 rows and 9 columns.
glimpse(jfk_weather)

Rows: 5,727
Columns: 9
$ DATE <dttm> 2015-07-25 13:51:00, 2016-11-18 23:51:00, 2013…
$ HOURLYDewPointTempF <dbl> 60, 34, 33, 18, 27, 35, 4, 14, 51, 71, 76, 19, …
$ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
$ HOURLYDRYBULBTEMPF <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
$ HOURLYWETBULBTEMPF <dbl> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
$ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
$ HOURLYWindSpeed <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
$ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
$ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
3. Select Subset of Columns
The end goal of this project will be to predict HOURLYPrecip (precipitation) using a few other variables. Before you can do this, you first need to preprocess the dataset. Sections 3 to 6 focus on preprocessing.
The first step in preprocessing is to select a subset of data columns and inspect the column types.
The key columns that we will explore in this project are:
HOURLYRelativeHumidity
HOURLYDRYBULBTEMPF
HOURLYPrecip
HOURLYWindSpeed
HOURLYStationPressure
Data Glossary:
‘HOURLYRelativeHumidity’ is the relative humidity given to the nearest whole percentage.
‘HOURLYDRYBULBTEMPF’ is the dry-bulb temperature and is commonly used as the standard air temperature reported. It is given here in whole degrees Fahrenheit.
‘HOURLYPrecip’ is the amount of precipitation in inches to hundredths over the past hour. For certain automated stations, precipitation will be reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
‘HOURLYWindSpeed’ is the speed of the wind at the time of observation given in miles per hour (mph).
‘HOURLYStationPressure’ is the atmospheric pressure observed at the station during the time of observation. Given in inches of Mercury (in Hg).
Select those five columns and store the modified dataframe as a new variable.
sub_jfk_weather <- jfk_weather %>%
select(HOURLYRelativeHumidity,
HOURLYDRYBULBTEMPF,
HOURLYPrecip,
HOURLYWindSpeed,
HOURLYStationPressure)

Show the first 10 rows of this new dataframe.
head(sub_jfk_weather, 10)

# A tibble: 10 × 5
HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
<dbl> <dbl> <chr> <dbl>
1 46 83 0.00 13
2 48 53 0.00 6
3 89 36 0.00 13
4 48 36 0.00 14
5 61 39 T 11
6 79 41 0.00 6
7 51 19 0.00 0
8 65 24 0.00 11
9 90 54 0.06 11
10 94 73 <NA> 5
# ℹ 1 more variable: HOURLYStationPressure <dbl>
4. Clean Up Columns
From the dataframe preview above, we can see that the column HOURLYPrecip - which is the hourly measure of precipitation levels - contains both NA and T values. T specifies trace amounts of precipitation (meaning essentially no precipitation), while NA means not available, and is used to denote missing values. Additionally, some values also have “s” at the end of them, indicating that the precipitation was snow.
Inspect the unique values present in the column HOURLYPrecip (with unique(dataframe$column)) to see these values.
unique(sub_jfk_weather$HOURLYPrecip)

 [1] "0.00" "T" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07"
[10] "0.16" "0.09" "0.22" "0.02s" "0.24" "0.18" "0.05" "0.04" "0.09s"
[19] "0.11" "0.14" "0.25" "0.10" "0.01s" "0.58" "0.12" "0.13" "0.46"
[28] "1.07" "1.19" "0.34" "0.20" "0.36s" "0.42" "0.17" "0.27" "0.35"
[37] "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.36" "0.03s"
[46] "0.07s" "0.54" "0.59" "0.21"
Having characters in values (like the “T” and “s” that you see in the unique values) will cause problems when you create a model because values for precipitation should be numerical. So you need to fix these values that have characters.
Now, for the column HOURLYPrecip: 1. Replace all the T values with “0.00” and 2. Remove the “s” from values like “0.02s”. In R, you can use the function str_remove(column, pattern = "s$") to remove the character “s” from the end of values; the “$” tells R to match at the end of a value. The pattern is a regex pattern. Look here for more information about regex and string matching in R.
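As a quick base-R illustration of the same cleanup (the sample vector below is hypothetical), sub() with the pattern "s$" behaves like str_remove():

```r
# Hypothetical sample of raw precipitation strings
x <- c("0.00", "T", "0.02s")
x <- sub("^T$", "0.00", x)  # replace trace values with zero
x <- sub("s$", "", x)       # strip the trailing snow indicator
x                           # "0.00" "0.00" "0.02"
```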
Remember that you can use tidyverse’s mutate() to update columns.
You can check your work by checking if unique values of HOURLYPrecip still contain any T or s. Store the modified dataframe as a new variable.
jfk_df <- sub_jfk_weather %>%
mutate(HOURLYPrecip = str_replace(HOURLYPrecip, "T", "0.00")) %>%
mutate(HOURLYPrecip = str_remove(HOURLYPrecip, pattern = "s$"))
unique(jfk_df$HOURLYPrecip)

 [1] "0.00" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07" "0.16" "0.09"
[11] "0.22" "0.24" "0.18" "0.05" "0.04" "0.11" "0.14" "0.25" "0.10" "0.58"
[21] "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42" "0.17"
[31] "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.54"
[41] "0.59" "0.21"
5. Convert Columns to Numerical Types
Now that you have removed the characters in the HOURLYPrecip column, you can safely convert the column to a numeric type.
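as.numeric() parses character strings into doubles; any value it cannot parse becomes NA with a warning, which is why the "T" and "s" characters had to be handled first. A small sketch with made-up values:

```r
# Hypothetical cleaned precipitation strings
vals <- c("0.00", "0.05", "1.19")
nums <- as.numeric(vals)  # character -> double
class(nums)               # "numeric"
```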
First, check the types of the columns. You will notice that all are dbl (double or numeric) except for HOURLYPrecip, which is chr (character or string). Use the glimpse function from Tidyverse.
glimpse(jfk_df)

Rows: 5,727
Columns: 5
$ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
$ HOURLYDRYBULBTEMPF <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
$ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "0.00", "0.00",…
$ HOURLYWindSpeed <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
$ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
Convert HOURLYPrecip to the numeric type and store the cleaned dataframe as a new variable.
jfk_df$HOURLYPrecip <- as.numeric(jfk_df$HOURLYPrecip)

We can now see that all fields have a numeric data type.
glimpse(jfk_df)

Rows: 5,727
Columns: 5
$ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
$ HOURLYDRYBULBTEMPF <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
$ HOURLYPrecip <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ HOURLYWindSpeed <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
$ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
6. Rename Columns
Let’s rename the following columns as:
‘HOURLYRelativeHumidity’ to ‘relative_humidity’
‘HOURLYDRYBULBTEMPF’ to ‘dry_bulb_temp_f’
‘HOURLYPrecip’ to ‘precip’
‘HOURLYWindSpeed’ to ‘wind_speed’
‘HOURLYStationPressure’ to ‘station_pressure’
You can use dplyr::rename(). Then, drop any rows with missing values and store the final dataframe as a new variable.
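dplyr::rename() takes new_name = old_name pairs; the base-R equivalent, shown here on a tiny hypothetical frame, just reassigns names():

```r
# Hypothetical one-column frame with the old column name
df <- data.frame(HOURLYWindSpeed = c(13, 6, 13))
names(df)[names(df) == "HOURLYWindSpeed"] <- "wind_speed"
names(df)  # "wind_speed"
```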
jfk_df1 <- jfk_df %>%
rename(
relative_humidity = HOURLYRelativeHumidity,
dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
precip = HOURLYPrecip,
wind_speed = HOURLYWindSpeed,
station_pressure = HOURLYStationPressure) %>%
drop_na( # Remove Missing values
relative_humidity,
dry_bulb_temp_f,
precip,
wind_speed,
station_pressure
)
head(jfk_df1)

# A tibble: 6 × 5
relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
<dbl> <dbl> <dbl> <dbl> <dbl>
1 46 83 0 13 30.0
2 48 53 0 6 30.0
3 89 36 0 13 30.1
4 48 36 0 14 29.8
5 61 39 0 11 30.5
6 79 41 0 6 29.9
7. Exploratory Data Analysis
Now that you have finished preprocessing the dataset, you can start exploring the columns more.
First, split the data into a training and testing set. Splitting a dataset is done randomly, so to have reproducible results set the seed = 1234. Also, use 80% of the data for training.
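initial_split() draws the training rows at random, which is why the seed matters for reproducibility; the idea can be sketched in base R (row count is hypothetical, proportion as in this project):

```r
set.seed(1234)                                  # reproducible split
n <- 100                                        # hypothetical number of rows
train_idx <- sample(n, size = floor(0.8 * n))   # 80% for training
test_idx  <- setdiff(seq_len(n), train_idx)     # remaining 20% for testing
length(train_idx)  # 80
```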
set.seed(1234)
# Split the data into training (80%) and testing (20%) sets
jfk_df1_split <- initial_split(jfk_df1, prop = 0.8)
train_data <- training(jfk_df1_split)
test_data <- testing(jfk_df1_split)

Next, looking at just the training set, plot histograms or box plots of the variables (relative_humidity, dry_bulb_temp_f, precip, wind_speed, station_pressure) for an initial look at their distributions using tidyverse's ggplot. Leave the testing set as is, because it is good practice not to look at the testing set until evaluating the final model.
library(ggplot2)
library(tidyr)
library(dplyr)
# Reshape the data
long_data <- train_data %>%
pivot_longer(cols = c(relative_humidity, dry_bulb_temp_f, precip, wind_speed, station_pressure),
names_to = "variable", values_to = "value")
# Define colors for each variable
color_mapping <- c(
"relative_humidity" = "steelblue2",
"dry_bulb_temp_f" = "tomato",
"precip" = "darkolivegreen2",
"wind_speed" = "yellow2",
"station_pressure" = "#d11141"
)
# Create the faceted histogram with different bin widths
ggplot() +
geom_histogram(data = subset(long_data, variable == "relative_humidity"),
aes(x = value, fill = variable), binwidth = 10, color = "black", na.rm = TRUE) +
geom_histogram(data = subset(long_data, variable == "dry_bulb_temp_f"),
aes(x = value, fill = variable), binwidth = 10, color = "black", na.rm = TRUE) +
geom_histogram(data = subset(long_data, variable == "precip"),
aes(x = value, fill = variable), binwidth = 1, color = "black", na.rm = TRUE) +
geom_histogram(data = subset(long_data, variable == "wind_speed"),
aes(x = value, fill = variable), binwidth = 10, color = "black", na.rm = TRUE) +
geom_histogram(data = subset(long_data, variable == "station_pressure"),
aes(x = value, fill = variable), binwidth = 1, color = "black", na.rm = TRUE) +
scale_fill_manual(values = color_mapping) +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(title = "Histograms of Weather Variables", x = "Value", y = "Frequency")
8. Linear Regression
After exploring the dataset more, you are now ready to start creating models to predict the precipitation (precip).
Create simple linear regression models where precip is the response variable and each of relative_humidity, dry_bulb_temp_f,wind_speed or station_pressure will be a predictor variable, e.g. precip ~ relative_humidity, precip ~ dry_bulb_temp_f, etc. for a total of four simple models. Additionally, visualize each simple model with a scatter plot.
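Under the hood, each simple model is an ordinary least-squares fit of one predictor; a self-contained base-R sketch on synthetic data (the true slope of 2 is arbitrary) shows what lm() recovers:

```r
set.seed(1)
x <- runif(50)                      # synthetic predictor
y <- 2 * x + rnorm(50, sd = 0.1)    # synthetic linear relationship with noise
m <- lm(y ~ x)                      # ordinary least squares
coef(m)[["x"]]                      # estimated slope, close to 2
```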
# Define a linear regression model
lm_spec <- linear_reg() %>%
set_engine(engine = "lm")

# Model 1: precip and relative_humidity
lm_pr <- lm_spec %>%
fit(precip ~ relative_humidity, data = train_data)
summary(lm_pr$fit)
Call:
stats::lm(formula = precip ~ relative_humidity, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.02196 -0.01227 -0.00522 0.00183 1.16980
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.211e-02 2.710e-03 -8.156 4.88e-16 ***
relative_humidity 4.407e-04 3.922e-05 11.237 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04444 on 3248 degrees of freedom
Multiple R-squared: 0.03742, Adjusted R-squared: 0.03712
F-statistic: 126.3 on 1 and 3248 DF, p-value: < 2.2e-16
ggplot(data = train_data, aes(relative_humidity, precip)) + geom_point(color = "grey20", alpha = 0.6) +
geom_jitter(alpha = 0.2) +
geom_smooth(method = "lm", formula = y ~ x, color = "red2") 
# Model 2: precip and dry_bulb_temp_f
lm_pd <- lm_spec %>%
fit(precip ~ dry_bulb_temp_f, data = train_data)
summary(lm_pd$fit)
Call:
stats::lm(formula = precip ~ dry_bulb_temp_f, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.00878 -0.00764 -0.00696 -0.00632 1.18241
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.716e-03 2.677e-03 1.761 0.0783 .
dry_bulb_temp_f 4.233e-05 4.614e-05 0.917 0.3590
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04528 on 3248 degrees of freedom
Multiple R-squared: 0.000259, Adjusted R-squared: -4.878e-05
F-statistic: 0.8415 on 1 and 3248 DF, p-value: 0.359
ggplot(data = train_data, aes(dry_bulb_temp_f, precip)) + geom_point(color = "grey20", alpha = 0.6) +
geom_jitter(alpha = 0.2) +
geom_smooth(method = "lm", formula = y ~ x, color = "red2") 
# Model 3: precip and wind_speed
lm_pw <- lm_spec %>%
fit(precip ~ wind_speed, data = train_data)
summary(lm_pw$fit)
Call:
stats::lm(formula = precip ~ wind_speed, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.01957 -0.00865 -0.00577 -0.00404 1.17675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005920 0.0016474 0.359 0.719
wind_speed 0.0005752 0.0001284 4.479 7.77e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04515 on 3248 degrees of freedom
Multiple R-squared: 0.006138, Adjusted R-squared: 0.005832
F-statistic: 20.06 on 1 and 3248 DF, p-value: 7.77e-06
ggplot(data = train_data, aes(wind_speed, precip)) + geom_point(color = "grey20", alpha = 0.6) +
geom_jitter(alpha = 0.2) +
geom_smooth(method = "lm", formula = y ~ x, color = "red2") 
# Model 4: precip and station_pressure
lm_ps <- lm_spec %>%
fit(precip ~ station_pressure, data = train_data)
summary(lm_ps$fit)
Call:
stats::lm(formula = precip ~ station_pressure, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.02941 -0.01007 -0.00615 -0.00145 1.17496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.790804 0.098928 7.994 1.80e-15 ***
station_pressure -0.026129 0.003298 -7.923 3.17e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04486 on 3248 degrees of freedom
Multiple R-squared: 0.01896, Adjusted R-squared: 0.01866
F-statistic: 62.77 on 1 and 3248 DF, p-value: 3.168e-15
ggplot(data = train_data, aes(station_pressure, precip)) + geom_point(color = "grey20", alpha = 0.6) +
geom_jitter(alpha = 0.2) +
geom_smooth(method = "lm", formula = y ~ x, color = "red2") 
9. Improve the Model
Now, try improving the simple models you created in the previous section.
Create at least two more models, each model should use at least one of the different techniques: 1. Add more features/predictors 2. Add regularization (L1, L2 or a mix) 3. Add a polynomial component
Also, for each of the models you create, check the model performance using the training set and a metric like MSE, RMSE, or R-squared.
Consider using tidymodels if you choose to add regularization and tune lambda.
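RMSE is just the square root of the mean squared residual; writing it out by hand (with toy vectors only) makes the metric concrete before using yardstick's rmse():

```r
# Hand-rolled RMSE, equivalent in spirit to yardstick::rmse()
rmse_value <- function(truth, pred) sqrt(mean((truth - pred)^2))
rmse_value(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```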
# 1. Display the correlation matrix
library(corrplot)  # needed for corrplot() below
# Calculate the correlation matrix
correlation <- jfk_df1 %>%
select(relative_humidity, dry_bulb_temp_f, precip, wind_speed
, station_pressure) %>%
cor(use = "pairwise.complete.obs", method = "pearson")
# Color palette for correlation plot
col <- colorRampPalette(c("#c32242", "#FFFFFF", "#428ce1"))
# Visualize the correlation matrix
corrplot(correlation,
method = "color",
col = col(200),
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 35,
title = "Correlation Matrix",
mar = c(0, 0, 3, 0)) 
library(recipes)
# Define a multiple linear regression model
mlr_spec <- linear_reg() %>%
set_engine(engine = "lm")
# 2. Add new models
# Fit the multiple linear regression model
train_mlr <- mlr_spec %>%
fit(precip ~ relative_humidity + wind_speed + dry_bulb_temp_f,
data = train_data)
# Predict and evaluate the model
train_mlr_results <- train_mlr %>%
predict(new_data = train_data) %>%
mutate(truth = train_data$precip)
rmse(train_mlr_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0441
rsq(train_mlr_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.0533
# Define a polynomial regression model
poly_spec <- linear_reg() %>%
set_engine(engine = "lm")
# Fit the polynomial regression model
train_poly <- poly_spec %>%
fit(precip ~ poly(relative_humidity, degree = 2, raw = TRUE), data = train_data)
# Predict and evaluate the model
train_poly_results <- train_poly %>%
predict(new_data = train_data) %>%
mutate(truth = train_data$precip)
rmse(train_poly_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0439
rsq(train_poly_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.0586
# Create a recipe
jfk_recipe <- recipe(precip ~., data = train_data)
# Define the ridge regression model
tune_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
set_engine("glmnet")
# Set up the work flow
ridge_wf <- workflow() %>%
add_recipe(jfk_recipe)
# Define cross-validation to resample the data
jfk_cvfolds <- vfold_cv(train_data)
# Set up the grid
lambda_grid <- grid_regular(levels = 50,
penalty(range = c(-3, 0.3))) # 50 values of lambda are used.
ridge_grid <- tune_grid(
ridge_wf %>% add_model(tune_spec),
resamples = jfk_cvfolds,
grid = lambda_grid)
# To view the best results for RMSE
show_best(ridge_grid, metric = "rmse")

# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00877 rmse standard 0.0409 10 0.00548 Preprocessor1_Model15
2 0.00751 rmse standard 0.0409 10 0.00548 Preprocessor1_Model14
3 0.00643 rmse standard 0.0409 10 0.00547 Preprocessor1_Model13
4 0.0102 rmse standard 0.0409 10 0.00549 Preprocessor1_Model16
5 0.00551 rmse standard 0.0409 10 0.00546 Preprocessor1_Model12
show_best(ridge_grid, metric = "rsq")

# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.001 rsq standard 0.0728 10 0.00966 Preprocessor1_Model01
2 0.00117 rsq standard 0.0728 10 0.00966 Preprocessor1_Model02
3 0.00136 rsq standard 0.0728 10 0.00966 Preprocessor1_Model03
4 0.00159 rsq standard 0.0728 10 0.00966 Preprocessor1_Model04
5 0.00186 rsq standard 0.0728 10 0.00966 Preprocessor1_Model05
\(\lambda\) = 0.00877 gives the lowest RMSE of 0.0409, and \(\lambda\) = 0.001 gives the highest R-squared of 0.0728.
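In dials, penalty() is parameterized on the log10 scale, so the range c(-3, 0.3) used above corresponds to lambda values between 10^-3 and 10^0.3 (about 0.001 to 2); the 50-value grid can be reproduced in base R as:

```r
# Log-spaced lambda grid matching grid_regular(penalty(range = c(-3, 0.3)), levels = 50)
lambdas <- 10^seq(-3, 0.3, length.out = 50)
range(lambdas)  # roughly 0.001 to 2
```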
# To visualize the RMSE results:
ridge_grid %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
ggplot(aes(penalty, mean)) +
geom_line(linewidth = 1, color = "red") +
scale_x_log10() +
ggtitle("RMSE")

# To visualize the R-squared results:
ridge_grid %>%
collect_metrics() %>%
filter(.metric == "rsq") %>%
ggplot(aes(penalty, mean)) +
geom_line(linewidth = 1, color = "red") +
scale_x_log10() +
ggtitle("R-Squared") 
10. Find Best Model
Compare the regression metrics of each model from section 9 to find the best model overall. To do this,
- Evaluate the models on the testing set using at least one metric (like MSE, RMSE or R-squared).
- After calculating the metrics on the testing set for each model, print them out as a table to easily compare. You can use something like:
model_names <- c("model_1", "model_2", "model_3")
train_error <- c("model_1_value", "model_2_value", "model_3_value")
test_error <- c("model_1_value", "model_2_value", "model_3_value")
comparison_df <- data.frame(model_names, train_error, test_error)
- Finally, from the comparison table you create, conclude which model performed the best.
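Since the comparison here is on R-squared, higher is better, so the best row is the one with the maximum test value; a base-R sketch with placeholder numbers:

```r
# Placeholder model names and R-squared values for illustration only
model_names <- c("model_1", "model_2", "model_3")
test_rsq <- c(0.088, 0.089, 0.123)
best <- model_names[which.max(test_rsq)]  # pick the highest test R-squared
best  # "model_3"
```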
test_mlr <- mlr_spec %>%
fit(precip ~ relative_humidity + wind_speed + dry_bulb_temp_f,
data = test_data)
test_mlr_results <- test_mlr %>%
predict(new_data = test_data) %>%
mutate(truth = test_data$precip)
rmse(test_mlr_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0211
rsq(test_mlr_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.0882
test_poly <- poly_spec %>%
fit(precip ~ poly(relative_humidity, degree = 2, raw = TRUE), data = test_data)
test_poly_results <- test_poly %>%
predict(new_data = test_data) %>%
mutate(truth = test_data$precip)
rmse(test_poly_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0211
rsq(test_poly_results, truth = truth,
estimate = .pred)

# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.0894
# Create a recipe
jfk_recipe <- recipe(precip ~., data = test_data)
tune_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
set_engine("glmnet")
ridge_wf <- workflow() %>%
add_recipe(jfk_recipe) # Specify the work flow
# define cross validation to resample the data:
jfk_cvfolds <- vfold_cv(test_data)
# Set up the grid
lambda_grid <- grid_regular(levels = 50,
penalty(range = c(-3, 0.3))) # 50 values of lambda are used.
ridge_grid <- tune_grid(
ridge_wf %>% add_model(tune_spec),
resamples = jfk_cvfolds,
grid = lambda_grid)
# To view best results for RMSE:
show_best(ridge_grid, metric = "rmse")

# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00186 rmse standard 0.0194 10 0.00284 Preprocessor1_Model05
2 0.00159 rmse standard 0.0194 10 0.00284 Preprocessor1_Model04
3 0.00217 rmse standard 0.0194 10 0.00284 Preprocessor1_Model06
4 0.00136 rmse standard 0.0194 10 0.00284 Preprocessor1_Model03
5 0.00117 rmse standard 0.0194 10 0.00284 Preprocessor1_Model02
show_best(ridge_grid, metric = "rsq")

# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00159 rsq standard 0.124 10 0.0249 Preprocessor1_Model04
2 0.00136 rsq standard 0.124 10 0.0249 Preprocessor1_Model03
3 0.00186 rsq standard 0.124 10 0.0249 Preprocessor1_Model05
4 0.00117 rsq standard 0.123 10 0.0249 Preprocessor1_Model02
5 0.001 rsq standard 0.123 10 0.0249 Preprocessor1_Model01
model_names <- c("Multiple Linear Regression Model",
"Quadratic Polynomial Model",
"Regularized Multiple Linear Regression Model")
train_error <- c(0.05326346, 0.05856079, 0.07757189)  # R-squared on the training set
test_error <- c(0.08815611, 0.0893642, 0.1226883)     # R-squared on the testing set
comparison_df <- data.frame(model_names, train_error, test_error)
print(comparison_df)

                                   model_names train_error test_error
1 Multiple Linear Regression Model 0.05326346 0.08815611
2 Quadratic Polynomial Model 0.05856079 0.08936420
3 Regularized Multiple Linear Regression Model 0.07757189 0.12268830
paste("The best performing model is",
      comparison_df %>%
        filter(test_error == max(test_error)) %>%
        pull(model_names))

[1] "The best performing model is Regularized Multiple Linear Regression Model"
The best performing model is the Regularized Multiple Linear Regression Model, which achieves the highest R-squared on the testing set.
Author(s): Yiwen Li Contributions: Tiffany Zhu
© IBM Corporation 2021. All rights reserved.