Warning: package 'lubridate' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(recipes) # For data preprocessing
Attaching package: 'recipes'
The following object is masked from 'package:stringr':
fixed
The following object is masked from 'package:stats':
step
library(broom) # For model diagnosticslibrary(ggpubr) # For visualization
Warning: package 'ggpubr' was built under R version 4.4.3
# Load the airquality data setdata("airquality")# Remove rows where Ozone is NA before preprocessingairquality <- airquality %>%drop_na(Ozone)
# Part 1: Normality Testing# Explore data set structurestr(airquality)
'data.frame': 116 obs. of 6 variables:
$ Ozone : int 41 36 12 18 28 23 19 8 7 16 ...
$ Solar.R: int 190 118 149 313 NA 299 99 19 NA 256 ...
$ Wind : num 7.4 8 12.6 11.5 14.9 8.6 13.8 20.1 6.9 9.7 ...
$ Temp : int 67 72 74 62 66 65 59 61 74 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 6 7 8 9 11 12 ...
summary(airquality)
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 2.300 Min. :57.00
1st Qu.: 18.00 1st Qu.:113.5 1st Qu.: 7.400 1st Qu.:71.00
Median : 31.50 Median :207.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :184.8 Mean : 9.862 Mean :77.87
3rd Qu.: 63.25 3rd Qu.:255.5 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :5
Month Day
Min. :5.000 Min. : 1.00
1st Qu.:6.000 1st Qu.: 8.00
Median :7.000 Median :16.00
Mean :7.198 Mean :15.53
3rd Qu.:8.250 3rd Qu.:22.00
Max. :9.000 Max. :31.00
# Perform Shapiro-Wilk test for normalityshapiro.test(airquality$Ozone)
Shapiro-Wilk normality test
data: airquality$Ozone
W = 0.87867, p-value = 2.79e-08
shapiro.test(airquality$Temp)
Shapiro-Wilk normality test
data: airquality$Temp
W = 0.97944, p-value = 0.0719
shapiro.test(airquality$Solar.R)
Shapiro-Wilk normality test
data: airquality$Solar.R
W = 0.93285, p-value = 2.957e-05
shapiro.test(airquality$Wind)
Shapiro-Wilk normality test
data: airquality$Wind
W = 0.97898, p-value = 0.06537
The Shapiro-Wilk test is used to assess whether a data set follows a normal distribution. It is useful to confirm normality before conduting or running tests which assume normality (i.e. linear regressions).
Hypotheses for Shapiro-Wilk Tests: Null: The data is normally distributed. Alt.: The data is not normally distributed.
If the p-value is > 0.05 then, FTR the null; the data is likely normal and doesn’t show significant non-normality. if the p-value is < 0.05 the, reject the null; the data is likely not normally distributed as it shows significant differences from a normal distribution.
In this case, Ozone and Solar r. have p-values less than 0.05 meaning those data are likely not normally distributed. The Wind and Temp variables have a p-value > 0.05 meaning the data are likely normally distributed.
# Part 2: Data Transformation and Feature Engineering# Convert Month into seasons using case_when()airquality <- airquality %>%mutate(Season =case_when( Month %in%c(11, 12, 1) ~"Winter", Month %in%c(2, 3, 4) ~"Spring", Month %in%c(5, 6, 7) ~"Summer", Month %in%c(8, 9, 10) ~"Fall" ))# Check observations per seasontable(airquality$Season)
Fall Summer
55 61
There are 116 observations in the airquality data set, with 61 from the summer months and 55 from the fall months. There are no spring or winter records in the data.
# Part 3: Data Preprocessing# Normalize predictor variablesrec <-recipe(Ozone ~ Temp + Solar.R + Wind + Season, data = airquality) %>%step_normalize(all_numeric_predictors()) %>%step_impute_mean(all_numeric_predictors())prep_rec <-prep(rec)processed_data <-bake(prep_rec, new_data =NULL)
Normalizing data rescales values to have a mean of 0 and standard deviation of 1. This allows predictors to contribute to a model equally instead of veriables with large magnitudes having more weight in the analysis.
The step_impute_mean() function from the recipies package is used to impute missing values with the mean of the variable.
It’s necessary to prep() and bake() the recipie because prep() calculates necessary statistics (i.e. mean for imputation and scaling factors for normalization) while bake applies the transformation. Without prep() theres no transformation to apply without bake() the transformation doesn’t get applied.
# Part 4: Building a Linear Regression Modellm_model <-lm(Ozone ~ ., data = processed_data)summary(lm_model) # Interpret coefficients, R-squared, and p-values
Call:
lm(formula = Ozone ~ ., data = processed_data)
Residuals:
Min 1Q Median 3Q Max
-39.532 -14.806 -2.438 10.502 99.521
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.034 2.965 13.501 < 2e-16 ***
Temp 16.408 2.482 6.611 1.38e-09 ***
Solar.R 5.038 2.184 2.307 0.0229 *
Wind -11.114 2.315 -4.801 4.96e-06 ***
SeasonSummer 3.984 4.197 0.949 0.3445
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.33 on 111 degrees of freedom
Multiple R-squared: 0.5965, Adjusted R-squared: 0.5819
F-statistic: 41.02 on 4 and 111 DF, p-value: < 2.2e-16
This model is attempting to estimate changes in ozone based on the predictor variables. In this case, the coefficients represent the estimated change in ozone for a one-unit change in the associated predictor variable. A positive coefficient means that ozone will increase with increases in the predictor variable while a negative change indicates that ozone will decrease when the predictor variable increases. The R-squared number assesses how well the model explains variability in the y-axis/independent variable where 1 means the model explains all of the y variability and 0 indicated the model explains none of the y variability. Small p-values (< 0.05) indicate that the predictor significantly impacts the modeled ozone levels while large p-values (> 0.05) may indicate lack of statistical significance. The predicted ozone level when all predictors are zero is 40.034 (ppb)(intercept coeff.). The predictor variables from least to greatest impact are (coeff. magnitude): Temp, Wind and Solar R.
# Part 5: Model Diagnostics# Supplement data with fitted values and residualsmodel_results <-augment(lm_model, processed_data)# Extract residuals and visualize their distributionhistogram <-ggplot(model_results, aes(.resid)) +geom_histogram(bins =20, fill ="blue", alpha =0.5) +ggtitle("Residuals Histogram")residuals_plot <-ggqqplot(model_results$.resid) +ggtitle("QQ Plot of Residuals")# Arrange plots side by sideggarrange(histogram, residuals_plot, ncol =2, nrow =1)
# Scatter plot of actual vs. predicted valuesggscatter(model_results, x ="Ozone", y =".fitted",add ="reg.line", conf.int =TRUE,cor.coef =TRUE, cor.method ="spearman",ellipse =TRUE)
With some large residuals, the predictions may not be particularly accurate. The median close to 0 suggests a relatively balanced residual distribution. Temp, Solar.R and Wind are statistically significant and contribute to the model while season is not and it may not be useful. Overall, the model has an R = 0.83 indicating a strong, positive correlation between the model and the observed data. With a very small p-value << 0.05 the model is fairly strong and has a high likelihood of accurately predicting the observed data.