Warning: package 'xgboost' was built under R version 4.4.3
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
library(purrr)
library(kernlab)
Attaching package: 'kernlab'
The following object is masked from 'package:scales':
alpha
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
alpha
Warning in
download.file("https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf",
: URL https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf:
cannot open destfile 'data/camels_attributes_v2.0.pdf', reason 'No such file or
directory'
Warning in
download.file("https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf",
: download had nonzero exit status
# Get data specific text files
types <- c("clim", "geol", "soil", "topo", "vege", "hydro")

# Construct URLs and file names for the data
remote_files <- glue('{root}/camels_{types}.txt')
local_files  <- glue('../data/lab_data/camels_hydro_data/camels_{types}.txt')

# Download specific data
walk2(remote_files, local_files, download.file, quiet = TRUE)

# Read and merge data
camels <- map(local_files, read_delim, show_col_types = FALSE)
camels <- power_full_join(camels, by = 'gauge_id')
Question 1:
zero_q_freq represents the frequency of days where Q = 0 mm/day, indicating no discharge/flow.
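As a quick illustration of the statistic itself (using a made-up daily discharge vector, not the CAMELS data), it is just the share of zero-flow days:

```r
# Hypothetical daily discharge record in mm/day (illustrative values only)
q_daily <- c(0.4, 0.0, 0.2, 0.0, 0.0, 1.1, 0.3, 0.0)

# zero_q_freq: fraction of days with Q = 0, expressed here as a percent
zero_q_freq <- 100 * mean(q_daily == 0)
zero_q_freq
```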
Question 2:
# Gauge Locations Plot
p_mean_plot <- ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = p_mean)) +
  labs(title = "Mean Daily Precipitation Across the Continental US",
       color = "Precipitation (mm)") +
  scale_color_gradient(low = "tan", high = "darkgreen") +
  theme_map() +
  coord_fixed(1.3) + # Ensures correct aspect ratio
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)) # Center the title horizontally

# Aridity Plot
aridity_plot <- ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = aridity)) +
  labs(title = "Aridity Across the Continental US", color = "Aridity Index") +
  scale_color_gradient(low = "dodgerblue", high = "red") +
  theme_map() +
  coord_fixed(1.3) + # Ensures correct aspect ratio
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)) # Center the title horizontally

# Combine Plots
combined_maps <- ggarrange(
  p_mean_plot, aridity_plot,
  ncol = 1, nrow = 2,    # Stack vertically
  common.legend = FALSE, # Separate legends
  legend = "bottom",     # Place the legends at the bottom
  heights = c(1, 1)      # Ensure both plots have equal height
)

# Display combined plots
combined_maps
Question 3:
Lab Activity:
# Model Preparation
camels |>
  select(aridity, p_mean, q_mean) |>
  drop_na() |>
  cor()
# Create a scatter plot of aridity vs rainfall
ggplot(camels, aes(x = aridity, y = p_mean)) +
  # Add points colored by mean flow
  geom_point(aes(color = q_mean)) +
  # Add a linear regression line
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  # Apply the viridis color scale
  scale_color_viridis_c() +
  # Add a title, axis labels, and theme (w/ legend on the bottom)
  theme_linedraw() +
  theme(legend.position = "bottom") +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity", y = "Rainfall", color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
# Create a scatter plot of aridity vs rainfall with log axes
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  # Apply log transformations to the x and y axes
  scale_x_log10() +
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom") +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity", y = "Rainfall", color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
# Scale the legend to the log scale plot
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  # Apply a log transformation to the color scale
  scale_color_viridis_c(trans = "log") +
  scale_x_log10() +
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom",
        # Expand the legend ...
        legend.key.width = unit(2.5, "cm"),
        legend.key.height = unit(.5, "cm")) +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity", y = "Rainfall", color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
# Building the Model
set.seed(123)

# Bad form to perform simple transformations on the outcome variable within a
# recipe. So, we'll do it here.
camels <- camels |>
  mutate(logQmean = log(q_mean))

# Generate the split
camels_split <- initial_split(camels, prop = 0.8)
camels_train <- training(camels_split)
camels_test  <- testing(camels_split)
camels_cv    <- vfold_cv(camels_train, v = 10)

# Create a recipe to preprocess the data
rec <- recipe(logQmean ~ aridity + p_mean, data = camels_train) %>%
  # Log transform the predictor variables (aridity and p_mean)
  step_log(all_predictors()) %>%
  # Add an interaction term between aridity and p_mean
  step_interact(terms = ~ aridity:p_mean) |>
  # Drop any rows with missing values in the predictors
  step_naomit(all_predictors(), all_outcomes())

# Prepare the data
baked_data <- prep(rec, camels_train) |>
  bake(new_data = NULL)

# Interaction with lm
# Base lm sets interaction terms with the * symbol
lm_base <- lm(logQmean ~ aridity * p_mean, data = baked_data)
summary(lm_base)
Call:
lm(formula = logQmean ~ aridity * p_mean, data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity:p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
# Sanity check: the interaction term from the recipe ... these should be equal!!
summary(lm(logQmean ~ aridity + p_mean + aridity_x_p_mean, data = baked_data))
Call:
lm(formula = logQmean ~ aridity + p_mean + aridity_x_p_mean,
data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity_x_p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
Question 3 Deliverable: Adjusted workflow set
# Data Validation
# prep %>% bake %>% predict
test_data <- bake(prep(rec), new_data = camels_test)
test_data$lm_pred <- predict(lm_base, newdata = test_data)

# Model Evaluation
# Statistical
metrics(test_data, truth = logQmean, estimate = lm_pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
# Visual
ggplot(test_data, aes(x = logQmean, y = lm_pred, colour = aridity)) +
  # Apply a gradient color scale
  scale_color_gradient2(low = "brown", mid = "orange", high = "darkgreen") +
  geom_point() +
  geom_abline(linetype = 2) +
  theme_linedraw() +
  labs(title = "Linear Model: Observed vs Predicted",
       x = "Observed Log Mean Flow",
       y = "Predicted Log Mean Flow",
       color = "Aridity")
# Alternative Method: Workflow
# Define model
lm_model <- linear_reg() %>%
  # define the engine
  set_engine("lm") %>%
  # define the mode
  set_mode("regression")

# Instantiate a workflow ...
lm_wf <- workflow() %>%
  # Add the recipe
  add_recipe(rec) %>%
  # Add the model
  add_model(lm_model) %>%
  # Fit the model to the training data
  fit(data = camels_train)

# Adding other models to the workflow
rf_model <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  # Add the recipe
  add_recipe(rec) %>%
  # Add the model
  add_model(rf_model) %>%
  # Fit the model
  fit(data = camels_train)

rf_data <- augment(rf_wf, new_data = camels_test)
dim(rf_data)
The bagged MLP neural network model appears to be the best, with a mean R-squared of ~0.78 and an RMSE of 0.57. Its RMSE is not the lowest, but its R-squared is better than the random forest's (~0.77). If RMSE were the priority, I would go with the random forest model.
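The comparison described above can also be read off numerically with `rank_results()` from workflowsets. A minimal sketch, rebuilt on the built-in mtcars data so it runs standalone (the recipe, resamples, and candidate models here are illustrative stand-ins, not the lab's own objects):

```r
library(tidymodels)

set.seed(1)
folds <- vfold_cv(mtcars, v = 5)

# Two candidate models resampled over one shared recipe
wf_set <- workflow_set(
  preproc = list(base = recipe(mpg ~ ., data = mtcars)),
  models  = list(lm = linear_reg(),
                 tree = decision_tree(mode = "regression"))
) %>%
  workflow_map("fit_resamples", resamples = folds)

# Orders candidates by mean rsq across resamples (rank 1 = best)
rank_results(wf_set, rank_metric = "rsq", select_best = TRUE)
```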
Question 4 Deliverable
# Set seed
set.seed(6515)

camels <- camels %>%
  mutate(logQmean = log(q_mean)) # Add logQmean column to df

# Data Splitting
## Generate split (75/25)
camels_split <- initial_split(camels, prop = 0.75)

## Extract training and testing sets
camels_tr <- training(camels_split)
camels_te <- testing(camels_split)

## 10-fold CV dataset
camels_10cv <- vfold_cv(camels_tr, v = 10)

# Recipe Vars Review
# Check for skewing
camels_tr %>%
  select(pet_mean, p_mean, runoff_ratio, baseflow_index, aridity,
         slope_mean, area_geospa_fabric) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
# Recipe
alt_rec <- recipe(logQmean ~ pet_mean + p_mean + aridity + runoff_ratio +
                    baseflow_index + slope_mean + area_geospa_fabric,
                  data = camels_tr) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_interact(terms = ~ pet_mean:p_mean + aridity:runoff_ratio +
                  area_geospa_fabric:slope_mean) %>%
  # Remove highly correlated predictors to avoid multicollinearity
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_normalize(all_predictors()) %>%
  step_naomit(all_predictors(), all_outcomes())

# Define and Train Models
## Define rf model
rf_alt_model <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_alt_wf <- workflow() %>%
  # Add the recipe
  add_recipe(alt_rec) %>%
  # Add the model
  add_model(rf_alt_model) %>%
  # Fit the model
  fit(data = camels_tr)

rf_predictions <- augment(rf_alt_wf, new_data = camels_te)

## Define xg model
xg_alt_model <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xg_alt_wf <- workflow() %>%
  add_recipe(alt_rec) %>%
  add_model(xg_alt_model) %>%
  fit(data = camels_tr)

xg_predictions <- augment(xg_alt_wf, new_data = camels_te)

## Define neural net model
nn_alt_model <- bag_mlp() %>%
  set_engine("nnet") %>%
  set_mode("regression")

nn_alt_wf <- workflow() %>%
  add_recipe(alt_rec) %>%
  add_model(nn_alt_model) %>%
  fit(data = camels_tr)

nn_predictions <- augment(nn_alt_wf, new_data = camels_te)

## Define linear reg model
lm_alt_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_alt_wf <- workflow() %>%
  add_recipe(alt_rec) %>%
  add_model(lm_alt_model) %>%
  fit(data = camels_tr)

lm_predictions <- augment(lm_alt_wf, new_data = camels_te)

## Define SVM (nonlinear) model
svm_alt_model <- svm_rbf() %>%
  set_engine("kernlab") %>%
  set_mode("regression")

svm_alt_wf <- workflow() %>%
  add_recipe(alt_rec) %>%
  add_model(svm_alt_model) %>%
  fit(data = camels_tr)

svm_predictions <- augment(svm_alt_wf, new_data = camels_te)

# Implement workflow set analysis
alt_wf_set <- workflow_set(
  preproc = list(alt_rec), # use the alternative recipe, not the Question 3 one
  models  = list(rf = rf_alt_model, xg = xg_alt_model, nn = nn_alt_model,
                 lm = lm_alt_model, svm = svm_alt_model)
) %>%
  workflow_map('fit_resamples', resamples = camels_10cv)

autoplot(alt_wf_set)
# Moving forward with the NN Model
## Extract the model coefficients
# Note: coef() on the parsnip model *specification* returns NULL; the fitted
# model lives inside the workflow (see extract_fit_parsnip(nn_alt_wf)).
nn_coeff <- coef(nn_alt_model)
nn_coeff
NULL
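The NULL above is expected: `coef()` was called on the unfitted model specification. A minimal sketch of how to reach the fitted engine object inside a workflow, using a small mtcars linear model as a stand-in for the lab's nn_alt_wf:

```r
library(tidymodels)

# A small fitted workflow (mtcars stand-in for the lab's own workflow)
wf <- workflow() %>%
  add_recipe(recipe(mpg ~ wt + hp, data = mtcars)) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = mtcars)

# extract_fit_parsnip() pulls the fitted parsnip model out of the workflow;
# its $fit slot is the underlying engine object (here, an lm fit)
fit_obj <- extract_fit_parsnip(wf)
coef(fit_obj$fit) # returns actual coefficients, not NULL
```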
## Use the data to make predictions
metrics(nn_predictions, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0615
2 rsq standard 0.998
3 mae standard 0.0111
ggplot(nn_predictions, aes(x = logQmean, y = .pred)) +
  geom_point(aes(color = .pred), size = 2) +
  scale_color_gradient(low = "tan", high = "royalblue") +
  labs(title = "Observed vs Predicted Values with the NN Model",
       x = "Observed Log Mean Flow",
       y = "Predicted Log Mean Flow",
       color = "Predicted Log Mean Flow") + # points are colored by .pred, not aridity
  geom_abline(linetype = 2) +
  theme_linedraw()
Q4b: I chose a complex formula to capture multiple elements of the watershed. Potential evapotranspiration (pet_mean), precipitation (p_mean), and aridity are all related, as are runoff ratio, slope, and catchment area (area_geospa_fabric). Finally, I included baseflow index because predicting flow seems challenging without first understanding what water is already there from groundwater sources, other inputs, etc.
Q4c: I selected the above models in an attempt to find non-linear representations of the data. Given the complexity of the formula in my recipe, I anticipated the data would fit linear models poorly.
Q4e: For the recipe I created, the bag_mlp neural network model performed best. It had the highest mean R-squared (~0.771) and the second-lowest root mean squared error (RMSE, ~0.565). The SVM model had a marginally lower RMSE (~0.561) but also a lower mean R-squared (~0.760). If I really wanted to, I could tune the SVM and NN models to optimize them and then compare them again.
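The tuning step mentioned above would follow the standard tidymodels pattern: mark hyperparameters with `tune()`, then grid-search them over the resamples. A minimal sketch for the SVM, rebuilt on mtcars so it runs standalone (the recipe and data here are illustrative stand-ins for alt_rec and camels_tr):

```r
library(tidymodels)

set.seed(42)
folds <- vfold_cv(mtcars, v = 5)

# Mark cost and rbf_sigma as tunable instead of accepting defaults
svm_tune_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("regression")

svm_tune_wf <- workflow() %>%
  add_recipe(recipe(mpg ~ ., data = mtcars) %>%
               step_normalize(all_predictors())) %>%
  add_model(svm_tune_spec)

# Grid search over 10 candidate (cost, rbf_sigma) pairs
svm_res <- tune_grid(svm_tune_wf, resamples = folds, grid = 10)
show_best(svm_res, metric = "rmse")
```

The same pattern applies to `bag_mlp()` (e.g. `hidden_units = tune()`); `select_best()` plus `finalize_workflow()` would then lock in the winning parameters before the last fit.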
Q4f: I am very happy with the results; outside of a few outliers, my model seems to have improved on the performance of the recipe and models from Part 3.