Introduction

Today, we will work with daily water temperature and air temperature data observed for 31 rivers in Spain. The goal of this tutorial is to identify the best model for predicting the maximum water temperature given the maximum air temperature. In the preview below, W represents the daily maximum water temperature and A represents the daily maximum air temperature. The data contains almost a full year of data for each of the 31 different rivers.

## # A tibble: 6 x 8
##   JULIAN_DAY  YEAR     L     W     A  TIME MONTH   DAY
##        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1          1  2003   103  14.2  21.2     1     1     1
## 2          2  2003   103  14.4  16.8     2     1     2
## 3          3  2003   103  14.4  15.4     3     1     3
## 4          4  2003   103  10.9  10.8     4     1     4
## 5          5  2003   103  10.8  11.7     5     1     5
## 6          6  2003   103  10.7  12.4     6     1     6

Part 1: Examining the Relationship

Chunk 1: Overall Relationship

Please modify the code chunk to visualize the relationship between A and W with scatterplot (set alpha=0.3) and smooth line.

#

Chunk 2: Location-Specific Relationship

Do You Think This Relationship is the Same for All Locations? Please write a function to visualize the relationship between A and W with scatterplot (set alpha=0.3) and smooth line for a specified Location L.

Test your function to see if it works:

WAPLOT.func(103)
WAPLOT.func(105)
WAPLOT.func(918)

Chunk 3: Split Data into Train and Test Sets

set.seed(216)
TEST.LOCATIONS=sample(x=unique(DATA$L),size=3,replace=F)

TRAIN = anti_join(DATA,tibble(L=TEST.LOCATIONS),by="L")
TEST = semi_join(DATA,tibble(L=TEST.LOCATIONS),by="L")

Chunk 4: Plots of Relationship for Train and Test Data

Please write a function to visualize the relationship between A and W with scatterplot (set alpha=0.3) and smooth line for a specified Dataset.

Test the function:

WAPLOT2.func(TRAIN)
WAPLOT2.func(TEST)

Part 2: Linear Regression Model

Chunk 1: Fitting Linear Model to Train Data

Please complete the code chunk to fit a linear regression model on the training set. (Formula: W~A)

Chunk 2: Getting Predictions from Linear Model

TRAIN2 = TRAIN %>% add_predictions(linmod,var="linpred")
TEST2 = TEST %>% add_predictions(linmod,var="linpred")

Chunk 3: Getting Residuals from Linear Model

TRAIN3 = TRAIN2 %>% add_residuals(linmod,var="linres")
TEST3 = TEST2 %>% add_residuals(linmod,var="linres")

Part 3: Polynomial Regression Model

Chunk 1: Fitting Polynomial Regression Models

poly2mod=lm(W~A+I(A^2),data=TRAIN)
poly3mod=lm(W~A+I(A^2)+I(A^3),data=TRAIN)
poly4mod=lm(W~A+I(A^2)+I(A^3)+I(A^4),data=TRAIN)
anova(linmod,poly2mod,poly3mod,poly4mod,test="Chisq")

Chunk 2: Getting Predictions from Polynomial Models

Please add the predictions of polynomial regression models to the training and testing sets with column names poly2pred, poly3pred and poly4pred.

TRAIN4 =TRAIN3 %>% 
  COMPLETE
  
TEST4 =TEST3 %>% 
  COMPLETE

Chunk 3: Getting Residuals from Polynomial Models

Please add the residuals of polynomial regression models to the training and testing sets with column names poly2res, poly3res and poly4res.

TRAIN5 =TRAIN4 %>% 
  COMPLETE

TEST5 =TEST4 %>% 
  COMPLETE

Part 4: Nonlinear Logistic Model

Chunk 1: Logistic Model Investigation

\[y=d+\frac{a}{1+e^{(b-cx)}}\]

Chunk 2: Essential Functions for Nonlinear Logistic Model

Please write a function to calculate mean squared error for the nonlinear logistic model.

logistic.model=function(COEF,DATA){
  pred=COEF[1]+COEF[2]/(1+exp(COEF[3]-COEF[4]*DATA$A))
}
MSE.logistic=function(COEF,DATA){
  COMPLETE
  return(mse)
}

Chunk 3: Estimate Parameters for Nonlinear Logistic Model

logistic.mod=optim(
  par=c(min(TRAIN$W,na.rm=T),
        max(TRAIN$W,na.rm=T)-min(TRAIN$W,na.rm=T),
        median(TRAIN$A,na.rm=T),
        1),           #Smart Starting Values
  fn=MSE.logistic,    #Function to Minimize
  DATA=TRAIN          #Required Argument
)
print(logistic.mod)

Chunk 4: Obtain Predictions and Residuals

TRAIN6=TRAIN5 %>% mutate(logpred=COMPLETE,
                         logres=COMPLETE)
TEST6=TEST5 %>% mutate(logpred=COMPLETE,
                         logres=COMPLETE)

Intermission:

The function save.image() in R can be used to save all objects in the global environment. This is very helpful when you want work off your results without rerunning all previous R code. The name of the exported information should contain the file extension .Rdata. These files can be extremely large depending how much RAM was utilized in your R session. The function load() can be used to import a previous workspace.

For more information on .Rdata file types, see https://fileinfo.com/extension/rdata for help.

save.image("Tutorial.Rdata")
load("PATH_TO_THE_FILE")

Part 5: Evaluation by Visualization

Chunk 1: Plotting Models

Chunk 2: Predicted Versus Actual Max Water Temperature

Chunk 3: Plotting Residuals Versus Time

Chunk 4: Evaluating The Location-Specific Error

Part 6: Evaluation by Numerical Summary

Chunk 1: Functions Required For Evaluating Prediction

bias.func=function(res){
  bias=mean(res,na.rm=T)
  return(bias)
}

mae.func=function(res){
  mae=mean(abs(res),na.rm=T)
  return(mae)
}

rmse.func=function(res){
  mse=mean(res^2,na.rm=T)
  rmse=sqrt(mse)
  return(rmse)
}

Chunk 2: Checking Functions

Please use apply function to calculate Bias, MAE and rMSE for the models.

ex.res.mat=TEST6 %>% select(linres,poly2res,poly3res,poly4res,logres)
apply(COMPLETE)
apply(COMPLETE)
apply(COMPLETE)

Chunk 3: Get Table Quickly

Chunk 4: HTML Formatted Table