Daily Assignment 11/12: Air Quality and EDA Modeling

ESS 330 - Quantitative Reasoning

Author

library(tidyverse)  # For data wrangling

Warning: package 'lubridate' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(recipes)    # For data preprocessing


Attaching package: 'recipes'

The following object is masked from 'package:stringr':

    fixed

The following object is masked from 'package:stats':

    step

library(broom)      # For model diagnostics
library(ggpubr)     # For visualization

Warning: package 'ggpubr' was built under R version 4.4.3

# Load the airquality data set
data("airquality")

# Remove rows where Ozone is NA before preprocessing
airquality <- airquality %>% drop_na(Ozone)

# Part 1: Normality Testing
# Explore data set structure
str(airquality)

'data.frame':   116 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 28 23 19 8 7 16 ...
 $ Solar.R: int  190 118 149 313 NA 299 99 19 NA 256 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.9 8.6 13.8 20.1 6.9 9.7 ...
 $ Temp   : int  67 72 74 62 66 65 59 61 74 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 6 7 8 9 11 12 ...

summary(airquality)

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 2.300   Min.   :57.00  
 1st Qu.: 18.00   1st Qu.:113.5   1st Qu.: 7.400   1st Qu.:71.00  
 Median : 31.50   Median :207.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :184.8   Mean   : 9.862   Mean   :77.87  
 3rd Qu.: 63.25   3rd Qu.:255.5   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
                  NA's   :5                                       
     Month            Day       
 Min.   :5.000   Min.   : 1.00  
 1st Qu.:6.000   1st Qu.: 8.00  
 Median :7.000   Median :16.00  
 Mean   :7.198   Mean   :15.53  
 3rd Qu.:8.250   3rd Qu.:22.00  
 Max.   :9.000   Max.   :31.00

# Perform Shapiro-Wilk test for normality
shapiro.test(airquality$Ozone)


    Shapiro-Wilk normality test

data:  airquality$Ozone
W = 0.87867, p-value = 2.79e-08

shapiro.test(airquality$Temp)


    Shapiro-Wilk normality test

data:  airquality$Temp
W = 0.97944, p-value = 0.0719

shapiro.test(airquality$Solar.R)


    Shapiro-Wilk normality test

data:  airquality$Solar.R
W = 0.93285, p-value = 2.957e-05

shapiro.test(airquality$Wind)


    Shapiro-Wilk normality test

data:  airquality$Wind
W = 0.97898, p-value = 0.06537

The Shapiro-Wilk test is used to assess whether a data set follows a normal distribution. It is useful to confirm normality before conduting or running tests which assume normality (i.e. linear regressions).

Hypotheses for Shapiro-Wilk Tests: Null: The data is normally distributed. Alt.: The data is not normally distributed.

If the p-value is > 0.05 then, FTR the null; the data is likely normal and doesn’t show significant non-normality. if the p-value is < 0.05 the, reject the null; the data is likely not normally distributed as it shows significant differences from a normal distribution.

In this case, Ozone and Solar r. have p-values less than 0.05 meaning those data are likely not normally distributed. The Wind and Temp variables have a p-value > 0.05 meaning the data are likely normally distributed.

# Part 2: Data Transformation and Feature Engineering
# Convert Month into seasons using case_when()
airquality <- airquality %>%
  mutate(Season = case_when(
    Month %in% c(11, 12, 1) ~ "Winter",
    Month %in% c(2, 3, 4) ~ "Spring",
    Month %in% c(5, 6, 7) ~ "Summer",
    Month %in% c(8, 9, 10) ~ "Fall"
  ))

# Check observations per season
table(airquality$Season)


  Fall Summer 
    55     61

There are 116 observations in the airquality data set, with 61 from the summer months and 55 from the fall months. There are no spring or winter records in the data.

# Part 3: Data Preprocessing
# Normalize predictor variables
rec <- recipe(Ozone ~ Temp + Solar.R + Wind + Season, data = airquality) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_impute_mean(all_numeric_predictors())

prep_rec <- prep(rec)
processed_data <- bake(prep_rec, new_data = NULL)

Normalizing data rescales values to have a mean of 0 and standard deviation of 1. This allows predictors to contribute to a model equally instead of veriables with large magnitudes having more weight in the analysis.

The step_impute_mean() function from the recipies package is used to impute missing values with the mean of the variable.

It’s necessary to prep() and bake() the recipie because prep() calculates necessary statistics (i.e. mean for imputation and scaling factors for normalization) while bake applies the transformation. Without prep() theres no transformation to apply without bake() the transformation doesn’t get applied.

# Part 4: Building a Linear Regression Model
lm_model <- lm(Ozone ~ ., data = processed_data)
summary(lm_model)  # Interpret coefficients, R-squared, and p-values


Call:
lm(formula = Ozone ~ ., data = processed_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.532 -14.806  -2.438  10.502  99.521 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    40.034      2.965  13.501  < 2e-16 ***
Temp           16.408      2.482   6.611 1.38e-09 ***
Solar.R         5.038      2.184   2.307   0.0229 *  
Wind          -11.114      2.315  -4.801 4.96e-06 ***
SeasonSummer    3.984      4.197   0.949   0.3445    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.33 on 111 degrees of freedom
Multiple R-squared:  0.5965,    Adjusted R-squared:  0.5819 
F-statistic: 41.02 on 4 and 111 DF,  p-value: < 2.2e-16

This model is attempting to estimate changes in ozone based on the predictor variables. In this case, the coefficients represent the estimated change in ozone for a one-unit change in the associated predictor variable. A positive coefficient means that ozone will increase with increases in the predictor variable while a negative change indicates that ozone will decrease when the predictor variable increases. The R-squared number assesses how well the model explains variability in the y-axis/independent variable where 1 means the model explains all of the y variability and 0 indicated the model explains none of the y variability. Small p-values (< 0.05) indicate that the predictor significantly impacts the modeled ozone levels while large p-values (> 0.05) may indicate lack of statistical significance. The predicted ozone level when all predictors are zero is 40.034 (ppb)(intercept coeff.). The predictor variables from least to greatest impact are (coeff. magnitude): Temp, Wind and Solar R.

# Part 5: Model Diagnostics
# Supplement data with fitted values and residuals
model_results <- augment(lm_model, processed_data)

# Extract residuals and visualize their distribution
histogram <- ggplot(model_results, aes(.resid)) +
  geom_histogram(bins = 20, fill = "blue", alpha = 0.5) +
  ggtitle("Residuals Histogram")

residuals_plot <- ggqqplot(model_results$.resid) +
  ggtitle("QQ Plot of Residuals")

# Arrange plots side by side
ggarrange(histogram, residuals_plot, ncol = 2, nrow = 1)

# Scatter plot of actual vs. predicted values
ggscatter(model_results, x = "Ozone", y = ".fitted",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "spearman",
          ellipse = TRUE)

With some large residuals, the predictions may not be particularly accurate. The median close to 0 suggests a relatively balanced residual distribution. Temp, Solar.R and Wind are statistically significant and contribute to the model while season is not and it may not be useful. Overall, the model has an R = 0.83 indicating a strong, positive correlation between the model and the observed data. With a very small p-value << 0.05 the model is fairly strong and has a high likelihood of accurately predicting the observed data.