Back to Article
Article Notebook
Download Source

ESS 330 Final Project Report

Authors
Affiliation

Zachary Cramton

Colorado State University

Bella Conrad

Colorado State University

Rachel Delorie

Colorado State University

Urbanization, Density and Access to Public Parks in the United States

Abstract

Currently, over 50% of the world’s population is living in urban areas and large cities. Urban planning has a key role in the experiences of residents living in these areas, where efficiency, utilitarianism, and profitability have dominated. With more studies focusing on a connection to nature and time outdoors with positive mental and physical health, we wanted to determine if any changes had been made in the urban planning field with regards to this aspect. We hypothesized that there is a linear relationship between urban population density and public open space availability. To determine this, we used datasets from the UN-Habitat Urban Indicators Database and the ParkServ Database. We formatted these datasets to fit together and work efficiently through using R, conducted an exploratory data analysis to define our variables, and then used a machine learning workflow to explore predicted values for different cities. For our built open space results, we used a boosted tree model, which gave us a 2.7644 RMSE and a 0.0136 R-squared, which provide an insignificant relationship. For our open space access results, we used a linear regression model which gave us an 18.5488 RMSE and a 0.6653 R-squared. Looking at these results, we did not find a significant relationship between urban population density and public open space availability but are interested to see if there is potential to develop this relationship. Urban planners should be shifting their values and goals for cities from efficiency and utilitarianism to an increase in natural open spaces instead.

Introduction

Since 2008, the majority of the world’s population has lived in urban areas, a result of urbanization in developing countries (Beall et al., 2010; Kohlhase, 2013). The United States developed earlier than many nations, with more than 50 percent of the population living in urban areas by by the 14th Census in 1920.

Definition of Urban Areas in Census History

Prior to the 2020 Census urban areas were defined as any area with greater than 2500 people. Following the 2010 census urban clusters described areas with populations greater than 2,500 and less than 50,000; urbanized areas described areas with a population greater than 50,000. For the 2020 Census the threshold was changed to 5000 people (Ratcliffe, 2022).

In the century since the 1920 census the percentage of individuals living in urban areas has increased to 80.7% (Slack & Jensen, 2020). As more people moved to urban areas, those areas expanded forming urbanized areas and large cities.

Urban planning has existed for centuries out of necessity, historically dominated by efficiency and utilitarianism, optimizing the world we live in for profitability, corporate productivity, and automobile-based mobility. This optimization came with sacrifices, which now impact an increasingly large majority of the population. In recent years, the discipline has begun to prioritize human factors over utilitarian efficiency. Thousands of years of living in rural settings makes urban living hard for most people’s biology. Connection to nature and time outdoors even in small amounts has been shown to be a vital part maintaining physical and mental health (Bratman et al. (2012)). In an effort to make urban spaces more livable, planners are turning to parks and natural areas to connect people to nature.

Equity issues aside, overturning and correcting more than a century of bad planning is a daunting task. Many cities filled in and built up over the course of the 20th century as land became a premium commodity. How does this density present significant challenges for today’s planning professionals? This research seeks to investigate the relationship between urban demographics like density, and park access. In exploring this relationship, we hypothesize that there is an intermediate/sublinear relationship between urban population density and public open space availability.

Data Overview

This report uses data from the UN-Habitat Urban Indicators Database and the ParkServe® Database maintained by the Trust for Public Land. The UN data relates to the UN-SDG 11.7.1 pertaining to access to open spaces and green areas. The January 2025 version of the UN Open Spaces and Green Areas data includes the average share of urban areas allocated to streets and open public spaces as well as the share of the urban population with convenient access to an open public space.

UN Definition

In this case, the UN defines “convenient access to an open public space” as the “urban population withing 400 meters walking distance along the street network to an open public space” (May et al., 2000).

These data collected by the UN were collected in 2020 and provided as a .xls format spreadsheet. These data were converted to .csv format with Microsoft Excel. The ParkServe® data selected for use is the 2020 data set to match the year the UN data was recorded. Specifically, this report uses elements of the City Park Facts: Acreage & Park System Highlights. The ParkServe® data is much less synthesized and was available as a .xml file. The file was structured for viewing as a spreadsheet rather than for further analysis and included multiple worksheets withing the workbook. In converting the file to a .csv file, the data spread across multiple worksheets was collated in a single worksheet and converted to a summarized dataset .csv file.

These data are lacking a shared numerical position data type but share a city name column formatted as “city_name, two_letter_state_abbreviation”. There is not perfect overlap between cities with data in each database however, there are 25 cities shared between the datasets. Cities present in only one data set will be culled when the data is joined.

Methods

Firstly, we cleaned the data. This is important for our project because the raw data was downloaded as Excel spreadsheets, but some reformatting in Excel was required to effectively export as a .csv file and import the new summarized file to RStudio. Any remaining data cleaning occurred in R as needed, including header changes and creation of additional columns to reflect relational data. We then conducted exploratory data analysis (EDA) to visualize the data to further understand what we were looking at. Next, we joined the datasets by “city name” to have a complete working dataset. This data was combined into a single data frame, using an inner join, because of the high number of cities in one dataset that the other lacks. The new dataset includes only cities found in both datasets, with columns from both.

Limiting Scope

The cities found in only one dataset were cut from the data to accommodate the limited scope of the project. Unfortunately, this left a very small sample size whcih made drawing definitive conclusions difficult. With a bigger scope it is possible that additional data could be used to understand these patterns with more depth.

As we continued, we prepped the data and split it into training and testing datasets. We started by using the 10-fold cross-validation on the training data, but the dataset was too small, so it was overfitting. Then we tried bootstraps before finally switching to 3-fold cross validation for more representative results. We created a recipe and set up several models in regression mode. Using these models and the recipe, we created a workflow set. We then used the map function over workflow using workflow map. We used the highest performing model to fit the data and augment. We moved on to plotting and graphing the data to visually display our test results. If time allowed and our model was more effective, we would have continued to explore using the model to predict values for cities included in only one document.

Results

Exploratory Data Analysis (EDA)

In [1]:
library(baguette)       # Model bagging methods for ensemble learning
Warning: package 'baguette' was built under R version 4.4.3
Loading required package: parsnip
library(flextable)      # Tool for building display-ready tables
Warning: package 'flextable' was built under R version 4.4.3
library(ggfortify)      # Autoplot support for time series, PCA, and more
Warning: package 'ggfortify' was built under R version 4.4.3
Loading required package: ggplot2
Registered S3 method overwritten by 'ggfortify':
  method          from   
  autoplot.glmnet parsnip
library(ggthemes)       # Extra themes and scales for ggplot2
Warning: package 'ggthemes' was built under R version 4.4.3
library(glue)           # String interpolation with embedded R expressions
library(here)           # Simplifies file paths for project-based workflows
Warning: package 'here' was built under R version 4.4.3
here() starts at C:/Users/Zacha/github/CSU/ESS_330/ess_330_final_project
library(parsnip)        # Unified interface to model functions (part of tidymodels)
library(patchwork)      # Combine ggplot2 plots with intuitive syntax
library(plotly)         # Interactive plots built on top of ggplot2 or base R
Warning: package 'plotly' was built under R version 4.4.3

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following objects are masked from 'package:flextable':

    highlight, style
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
library(powerjoin)      # Flexible joins that track and resolve conflicts
Warning: package 'powerjoin' was built under R version 4.4.3
library(purrr)          # For functional programming (mapping)

Attaching package: 'purrr'
The following object is masked from 'package:flextable':

    compose
library(recipes)        # Preprocessing pipelines for modeling (part of tidymodels)
Loading required package: dplyr

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Attaching package: 'recipes'
The following object is masked from 'package:stats':

    step
library(skimr)          # Quick summaries of data frames with descriptive stats
Warning: package 'skimr' was built under R version 4.4.3
library(tidymodels)     # Collection of packages for modeling and machine learning
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ tidyr        1.3.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ rsample      1.2.1     ✔ yardstick    1.3.2
✔ tibble       3.2.1     
Warning: package 'scales' was built under R version 4.4.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::compose()  masks flextable::compose()
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks plotly::filter(), stats::filter()
✖ dplyr::lag()      masks stats::lag()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)      # Core packages for data science: dplyr, ggplot2, readr, etc.
Warning: package 'lubridate' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.4     ✔ stringr   1.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::compose()    masks flextable::compose()
✖ scales::discard()   masks purrr::discard()
✖ dplyr::filter()     masks plotly::filter(), stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tune)           # Tools for hyperparameter tuning (part of tidymodels)
library(vip)            # Variable importance plots for many model types
Warning: package 'vip' was built under R version 4.4.3

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(visdat)         # Visualize missing and data types in your dataset
Warning: package 'visdat' was built under R version 4.4.3
library(workflowsets)   # Manage and compare sets of modeling workflows
library(xgboost)        # Fast, regularized gradient boosting machine
Warning: package 'xgboost' was built under R version 4.4.3

Attaching package: 'xgboost'

The following object is masked from 'package:dplyr':

    slice

The following object is masked from 'package:plotly':

    slice
library(yardstick)      # Metrics for evaluating model performance (part of tidymodels)

# Import data from csvs and clean NAs
parkserve_data <- read_csv(here("data", "clean_data/parkserve_summarized_facts_2020.csv")) %>% 
  drop_na()
Rows: 100 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): city_name, city_pop, land_area, revised_area, percent_designed_par...
dbl  (5): parkland_area, designed_park_area, natural_park_area, parkland_per...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
un_land_use_data <- read_csv(here("data", "clean_data/un_land_use.csv")) %>% 
  drop_na()
Rows: 59 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): city_name
dbl (2): mean_percent_built_open_space, mean_percent_open_space_access

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add columns and finish cleaning parkserve data
clean_parkserve_data <- parkserve_data %>%
  mutate(
    across(-city_name, ~ as.numeric(.x)),  # Convert all columns to numeric except for city name
    parkland_percent = parkland_percent * 100, # Convert parkland percent from ratio
    # Fix design/natural park area percentage calculations
    percent_designed_parks = ifelse(parkland_area == 0, NA, (designed_park_area / parkland_area) * 100),
    percent_natural_parks = ifelse(parkland_area == 0, NA, (natural_park_area / parkland_area) * 100),
    #New calculations
    dn_area_ratio = ifelse(percent_natural_parks == 0, NA, percent_designed_parks / percent_natural_parks),   # Designed-natural area ratio
    parkland_per_capita = ifelse(city_pop == 0, NA, parkland_area / city_pop),   # Parkland per capita
    land_per_capita = ifelse(city_pop == 0, NA, land_area / city_pop),   # Land per capita (opposite of density)
    pop_near_parks = city_pop * percent_half_mile_walk,   # Pop near parks
    park_units_per_area = ifelse(land_area == 0, NA, park_units / land_area),   # Park unit density
    park_units_per_capita = ifelse(city_pop == 0, NA, park_units / city_pop)   # Park units per capita
  )
  
# Join data removing cities found in only one of the two datasets
urban_parks_data <- clean_parkserve_data %>% 
  inner_join(un_land_use_data, by = "city_name")  

# Basic data structure exploration
  # glimpse(urban_parks_data) # DOTHIS:: Un-comment to use
In [2]:
# Descriptive Stats

  # Write function to round numeric columns to two decimal places
  round_numeric <- function(df) {
    df %>% 
      mutate(across(where(is.numeric), ~round(.x, 2)))
  }

  # Summarize stats by variable
    desc_stats_parks <- urban_parks_data %>% 
      select(where(is.numeric)) %>% 
      pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% 
      group_by(variable) %>% 
      summarize(mean = mean(value, na.rm = TRUE), 
                median = median(value, na.rm = TRUE), 
                sd = sd(value, na.rm = TRUE),
                Q1 = quantile(value, 0.25, na.rm = TRUE),
                Q3 = quantile(value, 0.75, na.rm = TRUE)) %>% 
      round_numeric()
      
      # Print descriptive stats with flextable
      desc_stats_flex <- flextable(desc_stats_parks) %>%
        set_caption("Table 1: Summarized Urban Parks Statistics") %>% 
        set_header_labels(
          variable = "Variable",
          mean = "Mean",
          median = "Median",
          sd = "Standard Deviation",
          Q1 = "1st Quartile (Q1)",
          Q3 = "3rd Quartile (Q3)") %>% 
        autofit()

# Find Top/Bottom cities for percent parkland
  
  # Select relevant columns
    simplified_vars <- c("city_name", "city_pop", "revised_area", "pop_density", "parkland_area", "dn_area_ratio", "parkland_percent", "parkland_per_1k_pop", "percent_half_mile_walk", "pop_near_parks")
  
  # Filter top/bottom 10 cities
  top10_park_percent <- urban_parks_data %>% 
    arrange(desc(parkland_percent)) %>% 
    slice_head(n = 10) %>% 
    select(all_of(simplified_vars)) %>% 
    round_numeric()
  
  bottom10_park_percent <- urban_parks_data %>% 
    arrange(parkland_percent) %>% 
    slice_head(n = 10) %>% 
    select(all_of(simplified_vars)) %>% 
    round_numeric()
  
  # Create top/bottom 10 flextables w/ function
  # Create function
  make_best_worst_flextbl <-function(df, caption) {
    flextable(df) %>% 
    set_caption(caption) %>%
    set_header_labels(
      city_name = "City Name",
      city_pop = "City Population",
      revised_area = "City Land Area (Revised) (Acres)",
      pop_density = "Population Density (People/Acre)",
      parkland_area = "City Parkland Area (Acres)",
      parkland_percent = "Percent Parkland",
      parkland_per_1k_pop = "Parkland Per (1000) Capita",
      percent_half_mile_walk = "Percent of Residents within 0.5 Miles of a Park",
      dn_area_ratio = "Designed-Natural Park Area Ratio (Designed Park (%) / Natural Park (%)") %>% 
    autofit()
  }
  
  top10_park_percent_flex <- make_best_worst_flextbl(top10_park_percent, "Table 2: Top 10 Cities for Parkland Percentage")
  
  bottom10_park_percent_flex <- make_best_worst_flextbl(bottom10_park_percent, "Table 3: Bottom 10 Cities for Parkland Percentage")
  
# Make plots to visualize the data
# Histogram: Land Area
land_area_plot <- ggplot(urban_parks_data, aes(x = as.numeric(land_area))) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(x = "Land Area (Acres)", y = "Frequency", title = "Figure 1: Distribution of City Land Areas") +
  theme_minimal() 

# Scatterplot: Land Area vs Parkland Area
land_vs_park_area_plot <- ggplot(urban_parks_data, aes(x = as.numeric(land_area), y = parkland_area)) +
  geom_point(color = "forestgreen") +
  labs(x = "Land Area (Acres)", y = "Parkland Area\n(Acres)", title = "Figure 2: Land Area vs Parkland Area") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  theme_minimal() 

# Scatterplot: Population Density vs Parkland Percent
density_vs_park_percent_plot <- ggplot(urban_parks_data, aes(x = as.numeric(pop_density), y = parkland_percent)) +
  geom_point(color = "darkorange") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "Population Density (People/Acre)", y = "Percent Parkland", title = "Figure 3: Population Density vs\nParkland Percent") +
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  theme_minimal()

# Scatterplot: Designed-Natural Park Area Ratio vs Parkland Percent
dn_area_ratio_vs_park_percent_plot <- ggplot(urban_parks_data, aes(x = dn_area_ratio, y = parkland_percent)) +
  geom_point(color = "mediumvioletred") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "Designed-Natural Park Area Ratio", y = "Percent Parkland", title = "Figure 4: Designed-Natural Park Area\nRatio vs Parkland Percent") +
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  theme_minimal()
  
# Scatterplot: Population Near Parks vs Mean % Open Space Access"
pop_park_proximity_plot <- ggplot(urban_parks_data, aes(x = mean_percent_open_space_access, y = (pop_near_parks/100000))) +
  geom_point(color = "cadetblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "Mean % Open Space Access", y = "100k Population\nNear Parks", title = "Figure 5: Population Near Parks vs\nMean % Open Space Access") +
  theme_minimal()

# Scatterplot: Percent Open Space Access vs Percent Built Open Space
open_space_access_vs_built_plot <- ggplot(urban_parks_data, aes(x = mean_percent_open_space_access, y = mean_percent_built_open_space)) +
  geom_point(color = "seagreen") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "Mean % Open Space Access", y = "Mean % Built\nOpen Space", title = "Figure 6: Accessibility of Built Open Space") +
  theme_minimal()

# Scatterplot: Land Per Capita vs Parkland Area
land_pc_vs_park_area_plot <- ggplot(urban_parks_data, aes(x = land_per_capita, y = parkland_area)) +
  geom_point(color = "royalblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "Land Per Capita (Acres)", y = "Parkland Area (Acres)", title = "Figure 7: Land Per Capita vs Parkland Area") +
  theme_minimal()

# Scatterplot: Parkserve Proxomity Data vs\nUN Access Data
pct_walk_vs_osa_plot <- ggplot(urban_parks_data, aes(x = percent_half_mile_walk, y = mean_percent_open_space_access)) +
  geom_point(color = "darkred") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  labs(x = "% of Pop. < 0.5 Miles\nfrom a Park", y = "Mean % Open\nSpace Access", title = "Figure 8: Park Proximity (Parkserve)\n vs Access (UN)") +
  theme_minimal() 

# Use corr plots to show variables with strong impact on model outputs
  # Function to create a corr plot for a given target variable
  create_corr_plot <- function(data, target_var) {
    
    # Check if target_var exists in the dataset
    if (!(target_var %in% names(data))) {
      stop("Target variable not found in the dataset!")
    }
    
    # Compute correlation of all numeric variables with the target variable
    correlation_matrix <- cor(data %>% select_if(is.numeric), use = "complete.obs")  # Compute correlation matrix
    
    # Extract the correlation values for the target variable and convert to data frame
    correlation_data <- data.frame(
      variable = rownames(correlation_matrix),
      correlation = abs(correlation_matrix[target_var, ])
    ) %>%
        filter(variable != target_var) %>%  # Remove the target variable itself
        arrange(desc(correlation))  # Sort by correlation value
    
    # Create corr plot
    corr_plot <- ggplot(correlation_data, aes(x = reorder(variable, correlation), y = correlation)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      coord_flip() +  # Flip coordinates to create horizontal bars
      labs(x = "Variables", y = "Absolute Correlation", title = paste("Feature Correlation\nw/", target_var)) +
      theme_minimal()
    
    return(corr_plot)
  }

# Make plots
bos_corr_plot <- create_corr_plot(urban_parks_data, "mean_percent_built_open_space") +
  theme(plot.subtitle.title = element_text("Figure 9"))
osa_corr_plot <- create_corr_plot(urban_parks_data, "mean_percent_open_space_access") +
  theme(plot.subtitle.title = element_text("Figure 10"))

# Combine all plots in one figure using patchwork (optional)
pt1_eda_plots <- (land_area_plot | land_vs_park_area_plot) / 
  (density_vs_park_percent_plot | land_pc_vs_park_area_plot) +
    plot_layout(guides = "collect")

pt2_eda_plots <- (dn_area_ratio_vs_park_percent_plot | pct_walk_vs_osa_plot) / 
  (pop_park_proximity_plot | open_space_access_vs_built_plot) +
    plot_layout(guides = "collect")
  
# Display data summary and visualization

  # Display flextables
  desc_stats_flex

Variable

Mean

Median

Standard Deviation

1st Quartile (Q1)

3rd Quartile (Q3)

city_pop

1,113,559.56

655,061.00

1,696,534.55

377,963.00

1,006,142.00

designed_park_area

5,559.20

3,864.00

5,020.97

2,652.00

5,785.00

dn_area_ratio

2.42

1.02

4.90

0.26

1.56

land_area

178,285.40

88,800.00

224,466.29

53,723.00

201,635.00

land_per_capita

0.33

0.16

0.70

0.08

0.28

mean_percent_built_open_space

18.71

18.20

3.23

17.20

20.60

mean_percent_open_space_access

42.06

40.00

17.04

29.80

52.80

natural_park_area

48,113.72

4,538.00

180,621.84

1,429.00

22,527.00

park_units

460.80

302.00

811.06

179.00

416.00

park_units_per_10k_pop

4.55

4.54

1.72

3.28

5.42

park_units_per_area

0.00

0.00

0.00

0.00

0.00

park_units_per_capita

0.00

0.00

0.00

0.00

0.00

parkland_area

53,824.36

9,478.00

180,158.96

5,075.00

28,312.00

parkland_per_1k_pop

143.17

12.09

607.26

9.15

28.14

parkland_per_capita

0.14

0.01

0.61

0.01

0.03

parkland_percent

15.31

12.54

15.58

6.76

17.38

percent_designed_parks

44.66

50.28

27.85

20.43

60.90

percent_half_mile_walk

0.76

0.80

0.20

0.63

0.96

percent_natural_parks

53.78

49.61

28.29

32.55

79.57

pop_density

9.52

6.39

9.84

3.59

12.41

pop_near_parks

939,727.24

360,828.80

1,686,304.23

300,312.72

797,287.04

revised_area

175,389.60

87,844.00

222,631.64

52,765.00

196,098.00

  top10_park_percent_flex

City Name

City Population

City Land Area (Revised) (Acres)

Population Density (People/Acre)

City Parkland Area (Acres)

Designed-Natural Park Area Ratio (Designed Park (%) / Natural Park (%)

Percent Parkland

Parkland Per (1000) Capita

Percent of Residents within 0.5 Miles of a Park

pop_near_parks

Anchorage, AK

299,100

1,086,019

0.28

914,138

0.00

84.17

3,056.30

0.75

224,597.2

New Orleans, LA

386,105

107,655

3.59

27,775

0.11

25.80

71.94

0.80

308,698.7

Washington, DC

702,321

38,955

18.03

9,478

1.09

24.33

13.50

0.98

690,170.8

New York, NY

8,627,852

187,946

45.91

40,190

1.01

21.38

4.66

0.99

8,539,502.8

San Diego, CA

1,399,844

205,918

6.80

39,385

0.29

19.13

28.14

0.81

1,127,322.4

Virginia Beach, VA

457,832

159,341

2.87

28,312

0.26

17.77

61.84

0.68

309,073.2

Boston, MA

687,725

29,175

23.57

5,072

1.02

17.38

7.38

1.00

686,280.8

Honolulu, HI

1,006,142

379,885

2.65

57,141

0.09

15.04

56.79

0.79

797,287.0

Minneapolis, MN

421,339

33,958

12.41

5,075

8.47

14.94

12.04

0.98

413,506.3

Jacksonville, FL

925,142

467,298

1.98

67,707

0.14

14.49

73.19

0.35

322,004.9

  bottom10_park_percent_flex 

City Name

City Population

City Land Area (Revised) (Acres)

Population Density (People/Acre)

City Parkland Area (Acres)

Designed-Natural Park Area Ratio (Designed Park (%) / Natural Park (%)

Percent Parkland

Parkland Per (1000) Capita

Percent of Residents within 0.5 Miles of a Park

pop_near_parks

Durham, NC

275,758

68,678

4.02

2,665

0.11

3.88

9.66

0.51

141,094.34

Memphis, TN

655,061

196,098

3.34

9,194

1.10

4.69

9.15

0.46

300,312.72

Winston-Salem, NC

248,839

83,917

2.97

4,263

7.34

5.08

17.13

0.37

92,267.01

Detroit, MI

660,960

87,844

7.52

5,102

3.63

5.81

7.72

0.80

528,748.17

Toledo, OH

277,467

51,169

5.42

3,175

1.41

6.20

11.44

0.81

225,200.54

Atlanta, GA

498,059

84,250

5.91

5,293

2.70

6.28

10.63

0.72

360,828.80

Cleveland, OH

377,963

46,880

8.06

3,170

1.30

6.76

8.39

0.83

315,251.38

Dallas, TX

1,378,903

215,676

6.39

20,352

1.17

9.44

14.76

0.71

977,793.91

St. Louis, MO

310,144

39,090

7.93

3,749

23.66

9.59

12.09

0.98

303,354.95

Chicago, IL

2,744,859

136,796

20.07

13,609

1.94

9.95

4.74

0.98

2,696,000.51

  # Save patchwork and corr plots
  ggsave("imgs/pt1_eda_plots.png", plot = pt1_eda_plots, width = 8, height = 6, units = "in")
  ggsave("imgs/pt2_eda_plots.png", plot = pt2_eda_plots, width = 8, height = 6, units = "in")
  ggsave("imgs/bos_corr_plot.png", plot = bos_corr_plot, width = 8, height = 6, units = "in")
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
  ggsave("imgs/osa_corr_plot.png", plot = osa_corr_plot, width = 8, height = 6, units = "in")
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
  # Create Presentation Materials (uncomment to use)
# pres_plot1 <- (land_area_plot | land_vs_park_area_plot) /
  # (land_pc_vs_park_area_plot | density_vs_park_percent_plot)
# pres_plot2 <- (pop_park_proximity_plot | pct_walk_vs_osa_plot)

# pres_plot1
# pres_plot2

EDA Summary

The majority of the city’s land areas are distributed within 100,000 and 300,000 acres of a city’s land area. This distribution ranged from 100,000 acres to 550,000 acres, with an outlying data point above 900,000 acres (Figure 1). The frequency of land area decreases as land area increases. We found little to no linear relationship between city land area and city parkland area (Figure 2). City land per capita, measured in acres, does not correlate with Parkland area (Figure 7). The majority of cities have less than one acre of land per capita, and less than 125,000 acres of parkland per capita. Population density is weakly positive correlation with population density. As population density increases, the percentage of parkland in a city also increases (Figure 3). In Figure 5, the majority of data points are below the 45-degree dashed line, indicating that for every data point, the percentage of population living near parks is consistently lower than the mean percentage of open space access. Figure 8 does not show any or a weak correlation between the variables. When examining the correlation of all features within the dataset, we found that Parkland per Capita and Parkland per 1,000 People were the two most positively correlated features to the Mean Percentage of Built Open Space (Figure 5). When applying the same correlation analysis to our second focus variable, Mean Percentage Open Space Access, Park Units Per Area, and Percent of Residents within a Half Mile Walk were the most positively correlated to the variable.

Modeling

In [3]:
# Set a random seed for reproducibility
set.seed(567)

# Split data into training (80%) and testing (20%) sets
up_split <- initial_split(urban_parks_data, prop = 0.8)
up_train <- training(up_split)
up_test <- testing(up_split)

# Create 3-fold cross validation resampling for the training data
up_3cv <- vfold_cv(up_train, v = 3, strata = NULL)
In [4]:
# Recipe for mean_percent_built_open_space
rec_bos_1 <- recipe(mean_percent_built_open_space ~ parkland_per_capita + designed_park_area + land_per_capita + natural_park_area + parkland_area, data = up_train) %>%
  step_nzv(all_predictors()) %>% # Remove predictors with near zero variance to prevent bloated fits
  step_lincomb(all_predictors()) %>% # Removes linear combinations of predictors (redundancy)
  step_YeoJohnson(all_predictors()) %>%  # Apply Yeo-Johnson transformation to numeric variables
  step_corr(all_predictors(), threshold = 0.9)  # Remove highly correlated predictors (threshold > 0.9)

rec_bos_2 <- recipe(mean_percent_built_open_space ~ parkland_per_capita + designed_park_area + land_per_capita, data = up_train) %>%
  step_nzv(all_predictors()) %>% # Remove predictors with near zero variance to prevent bloated fits
  step_lincomb(all_predictors()) %>% # Removes linear combinations of predictors (redundancy)
  step_YeoJohnson(all_predictors()) %>%  # Apply Yeo-Johnson transformation to numeric variables
  step_corr(all_predictors(), threshold = 0.9)  # Remove highly correlated predictors (threshold > 0.9)

# Recipes for mean_percent_open_space_access
rec_osa_1 <- recipe(mean_percent_open_space_access ~ parkland_percent + percent_half_mile_walk + pop_density + parkland_area + pop_near_parks + parkland_per_capita + city_pop, data = up_train) %>%
  # Add interaction terms
  step_interact(terms = ~ parkland_percent:pop_density) %>%
  step_interact(terms = ~ parkland_percent:percent_half_mile_walk) %>%
  step_interact(terms = ~ pop_density:percent_half_mile_walk) %>% 
  step_nzv(all_predictors()) %>% # Remove predictors with near zero variance to prevent bloated fits
  step_lincomb(all_predictors()) %>% # Removes linear combinations of predictors (redundancy)
  step_YeoJohnson(all_predictors()) %>%  # Apply Yeo-Johnson transformation to numeric variables
  step_corr(all_predictors(), threshold = 0.9)  # Remove highly correlated predictors (threshold > 0.9)


rec_osa_2 <- recipe(mean_percent_open_space_access ~ parkland_percent + percent_half_mile_walk + pop_density + parkland_area , data = up_train) %>%
  step_nzv(all_predictors()) %>% # Remove predictors with near zero variance to prevent bloated fits
  step_lincomb(all_predictors()) %>% # Removes linear combinations of predictors (redundancy)
  step_YeoJohnson(all_predictors()) %>%  # Apply Yeo-Johnson transformation to numeric variables
  step_corr(all_predictors(), threshold = 0.9)  # Remove highly correlated predictors (threshold > 0.9)
In [5]:
# Linear Regression
lm_model <- linear_reg() %>%
  set_engine("lm") %>% 
  set_mode("regression")

# Random Forest (Tree-Based Model)
rf_model <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Gradient Boosting Machine (Boosted Tree Model)
# More aggressive regularization to prevent over fitting
xg_model <- boost_tree() %>%
   set_engine("xgboost") %>%
  set_mode("regression")
In [6]:
# Compile list of models
models <- list(lm = lm_model, rf = rf_model, xg = xg_model)
# Compile one recipe list for each output
bos_recipes <- list(bos_1 = rec_bos_1, bos_2 = rec_bos_2)
osa_recipes <- list(osa_1 = rec_osa_1, osa_2 = rec_osa_2)

# Create workflow set for BOS
bos_wf_set <- workflow_set(
  preproc = bos_recipes, 
  models = models) %>% 
  workflow_map('fit_resamples', resamples = up_3cv)
Warning: package 'ranger' was built under R version 4.4.3
# Create workflow set for OSA
osa_wf_set <- workflow_set(
  preproc = osa_recipes, 
  models = models) %>% 
  workflow_map('fit_resamples', resamples = up_3cv)

Vizualizing and Ranking Model Performance

In [7]:
# Visualize Model Performance
bos_modeling_plot <- autoplot(bos_wf_set) + 
  aes(x = wflow_id) + 
  labs(title = "Figure 11: BOS Resample Metrics",
       x = "Model") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

osa_modeling_plot <- autoplot(osa_wf_set) + 
  aes(x = wflow_id) +
  labs(title = "Figure 12: OSA Resample Metrics",
       x = "Model") + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

# Save Model Performance Plots
ggsave("imgs/bos_modeling_plot.png", plot = bos_modeling_plot, width = 8, height = 6, units = "in")
ggsave("imgs/osa_modeling_plot.png", plot = osa_modeling_plot, width = 8, height = 6, units = "in")

# Rank Results
rank_results(bos_wf_set, rank_metric = "rsq", select_best = TRUE) %>% 
  flextable () %>% 
  set_caption("Table 4: Ranking of BOS Model Results")

wflow_id

.config

.metric

mean

std_err

n

preprocessor

model

rank

bos_2_rf

Preprocessor1_Model1

rmse

3.7411029

0.59891367

3

recipe

rand_forest

1

bos_2_rf

Preprocessor1_Model1

rsq

0.3443191

0.28968211

3

recipe

rand_forest

1

bos_1_rf

Preprocessor1_Model1

rmse

3.7018159

0.56685291

3

recipe

rand_forest

2

bos_1_rf

Preprocessor1_Model1

rsq

0.3241036

0.27364624

3

recipe

rand_forest

2

bos_2_xg

Preprocessor1_Model1

rmse

4.8878321

0.94125107

3

recipe

boost_tree

3

bos_2_xg

Preprocessor1_Model1

rsq

0.2500242

0.14210625

3

recipe

boost_tree

3

bos_1_xg

Preprocessor1_Model1

rmse

4.4965564

0.65984413

3

recipe

boost_tree

4

bos_1_xg

Preprocessor1_Model1

rsq

0.1743254

0.08380129

3

recipe

boost_tree

4

bos_1_lm

Preprocessor1_Model1

rmse

21.7443046

17.60228115

3

recipe

linear_reg

5

bos_1_lm

Preprocessor1_Model1

rsq

0.1616320

0.05566050

3

recipe

linear_reg

5

bos_2_lm

Preprocessor1_Model1

rmse

13.7494375

9.82289704

3

recipe

linear_reg

6

bos_2_lm

Preprocessor1_Model1

rsq

0.1116782

0.07409244

3

recipe

linear_reg

6

rank_results(osa_wf_set, rank_metric = "rsq", select_best = TRUE) %>% 
  flextable () %>% 
  set_caption("Table 5: Ranking of OSA Model Results")

wflow_id

.config

.metric

mean

std_err

n

preprocessor

model

rank

osa_1_lm

Preprocessor1_Model1

rmse

275.33964393

256.33604253

3

recipe

linear_reg

1

osa_1_lm

Preprocessor1_Model1

rsq

0.51580872

0.26424787

3

recipe

linear_reg

1

osa_2_lm

Preprocessor1_Model1

rmse

13.46668060

1.70629264

3

recipe

linear_reg

2

osa_2_lm

Preprocessor1_Model1

rsq

0.50547990

0.16955872

3

recipe

linear_reg

2

osa_2_rf

Preprocessor1_Model1

rmse

14.04976508

2.45439883

3

recipe

rand_forest

3

osa_2_rf

Preprocessor1_Model1

rsq

0.30246876

0.12897699

3

recipe

rand_forest

3

osa_1_rf

Preprocessor1_Model1

rmse

13.84363220

2.41203279

3

recipe

rand_forest

4

osa_1_rf

Preprocessor1_Model1

rsq

0.23429515

0.10482788

3

recipe

rand_forest

4

osa_1_xg

Preprocessor1_Model1

rmse

14.59851000

2.15861590

3

recipe

boost_tree

5

osa_1_xg

Preprocessor1_Model1

rsq

0.12253699

0.03106196

3

recipe

boost_tree

5

osa_2_xg

Preprocessor1_Model1

rmse

17.03286366

0.91154682

3

recipe

boost_tree

6

osa_2_xg

Preprocessor1_Model1

rsq

0.06247707

0.04200654

3

recipe

boost_tree

6

Model Tuning

In [8]:
# BOS Model Tuning (XGBoost)
tuned_bos_model <- boost_tree(
  mtry = tune(),
  trees = tune(),
  min_n = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

tune_bos_wf <- workflow() %>%
  add_recipe(rec_bos_2) %>%
  add_model(tuned_bos_model)

# Create parameter set
bos_dials <- parameters(
  mtry(range = c(1, ncol(up_train) - 1)), # Adjust range as needed
  trees(range = c(50, 150)),
  min_n(range = c(2, 5)),
  learn_rate(range = c(0.01, 0.1)),
  tree_depth(range = c(2, 4))
)

bos_grid <- grid_space_filling(bos_dials, size = 5)

bos_model_params <- tune_grid(
  tune_bos_wf,
  resamples = up_3cv,
  grid = bos_grid,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE),
  param_info = bos_dials
)

bos_params_plot <- autoplot(bos_model_params) +
  labs(title = "Figure 13: BOS Tuning Metrics") + 
  theme(plot.title = element_text(hjust = 0.5))

ggsave("imgs/bos_params_plot.png", plot = bos_params_plot, width = 8, height = 6, units = "in")

bos_preds <- collect_metrics(bos_model_params)
bos_hp_best <- select_best(bos_model_params, metric = "mae")

bos_final_wf <- finalize_workflow(tune_bos_wf, bos_hp_best)
bos_final_fit <- last_fit(bos_final_wf, split = up_split)
bos_final_metrics <- collect_metrics(bos_final_fit)


# OSA Model Tuning (Ranger - Random Forest)
tuned_osa_model <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

tune_osa_wf <- workflow() %>%
  add_recipe(rec_osa_2) %>%
  add_model(tuned_osa_model)

# Create parameter set
osa_dials <- parameters(
  mtry(range = c(1, ncol(up_train) - 1)),
  trees(range = c(100, 500)),
  min_n(range = c(2, 10))
)
osa_grid <- grid_space_filling(osa_dials, size = 7)

osa_model_params <- tune_grid(
  tune_osa_wf,
  resamples = up_3cv,
  grid = osa_grid,
  metrics = metric_set(rmse, rsq, mae),
  control = control_grid(save_pred = TRUE)
)
→ A | warning: 8 columns were requested but there were 4 predictors in the data. 4 will be used.
There were issues with some computations   A: x1
→ B | warning: 11 columns were requested but there were 4 predictors in the data. 4 will be used.
There were issues with some computations   A: x1
→ C | warning: 15 columns were requested but there were 4 predictors in the data. 4 will be used.
There were issues with some computations   A: x1
→ D | warning: 18 columns were requested but there were 4 predictors in the data. 4 will be used.
There were issues with some computations   A: x1
→ E | warning: 22 columns were requested but there were 4 predictors in the data. 4 will be used.
There were issues with some computations   A: x1
There were issues with some computations   A: x3   B: x3   C: x3   D: x3   E: x3
osa_params_plot <- autoplot(osa_model_params) +
  labs(title = "Figure 14: OSA Tuning Metrics") + 
  theme(plot.title = element_text(hjust = 0.5))

ggsave("imgs/osa_params_plot.png", plot = osa_params_plot, width = 8, height = 6, units = "in")

osa_preds <- collect_metrics(osa_model_params)
osa_hp_best <- select_best(osa_model_params, metric = "mae")

osa_final_wf <- finalize_workflow(tune_osa_wf, osa_hp_best)
osa_final_fit <- last_fit(osa_final_wf, split = up_split)
osa_final_metrics <- collect_metrics(osa_final_fit)

Model Tuning Results

In [9]:
# Finalize and fit full model for BOS
bos_final_fit_full <- fit(bos_final_wf, data = urban_parks_data)
bos_augmented_preds <- augment(bos_final_fit_full, new_data = urban_parks_data)

# Add residuals
bos_augmented_preds <- bos_augmented_preds %>% 
  mutate(residual_sq = (mean_percent_built_open_space - .pred)^2)

# Determine common axis ranges
bos_range <- range(c(bos_augmented_preds$.pred, bos_augmented_preds$mean_percent_built_open_space), na.rm = TRUE)
bos_residuals_range <- range(bos_augmented_preds$residuals_sq, na.rm = TRUE)
Warning: Unknown or uninitialised column: `residuals_sq`.
Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
Inf
Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
-Inf
# BOS: Predicted vs Actual
bos_map_preds <- ggplot(bos_augmented_preds, aes(x = .pred, y = mean_percent_built_open_space)) +
  geom_point(aes(color = .pred), size = 3, alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  scale_color_viridis_c(name = "Predicted") +
  coord_fixed(ratio = 1, xlim = bos_range, ylim = bos_range) +
  labs(title = "BOS: Predicted vs Actual", x = "Predicted", y = "Actual") +
  theme_minimal() +
  theme(plot.subtitle.title = element_text("Figure 15"))

# BOS: Squared Residuals
bos_map_resid <- ggplot(bos_augmented_preds, aes(x = .pred, y = residual_sq)) +
  geom_point(aes(color = residual_sq), size = 3, alpha = 0.8) +
  scale_color_viridis_c(name = "Residual²") +
  labs(title = "Figure 15: BOS Squared Residuals", x = "Predicted", y = "Residual²") +
  theme_minimal()+
  theme(plot.subtitle.title = element_text("Figure 16"))

# Combine plots
bos_maps_combined <- bos_map_preds | bos_map_resid
ggsave("imgs/bos_maps_combined.png", plot = bos_maps_combined, width = 8, height = 6, units = "in")
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
osa_final_fit_full <- fit(osa_final_wf, data = urban_parks_data)
osa_augmented_preds <- augment(osa_final_fit_full, new_data = urban_parks_data)

osa_augmented_preds <- osa_augmented_preds %>%
  mutate(residual_sq = (mean_percent_open_space_access - .pred)^2)

# Determine common axis range for OSA
osa_range <- range(c(osa_augmented_preds$.pred, osa_augmented_preds$mean_percent_open_space_access), na.rm = TRUE)
osa_residuals_range <- range(osa_augmented_preds$residuals_sq, na.rm = TRUE)
Warning: Unknown or uninitialised column: `residuals_sq`.
Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
Inf
Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
-Inf
# OSA: Predicted vs Actual
osa_map_preds <- ggplot(osa_augmented_preds, aes(x = .pred, y = mean_percent_open_space_access)) +
  geom_point(aes(color = .pred), size = 3, alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  scale_color_viridis_c(name = "Predicted") +
  coord_fixed(ratio = 1, xlim = osa_range, ylim = osa_range) +
  labs(title = "OSA: Predicted vs Actual", x = "Predicted", y = "Actual") +
  theme_minimal()+
  theme(plot.subtitle.title = element_text("Figure 17"))

# OSA: Squared Residuals
osa_map_resid <- ggplot(osa_augmented_preds, aes(x = .pred, y = residual_sq)) +
  geom_point(aes(color = residual_sq), size = 3, alpha = 0.8) +
  scale_color_viridis_c(name = "Residual²") +
  labs(title = "Figure 16: OSA Squared Residuals", x = "Predicted", y = "Residual²") +
  theme_minimal()+
  theme(plot.subtitle.title = element_text("Figure 18"))

# Combine OSA plots
osa_maps_combined <- osa_map_preds | osa_map_resid
ggsave("imgs/osa_maps_combined.png", plot = osa_maps_combined, width = 8, height = 6, units = "in")
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
Warning in plot_theme(plot): The `plot.subtitle.title` theme element is not defined in the element
hierarchy.
The `plot.subtitle.title` theme element is not defined in the element
hierarchy.

Modeling Summary

Initial modeling results for Built Open Space did not perform well. Boosttree, with the engine xgboost, has the lowest Root Mean Squared Error (RMSE), and the highest R-squared value. The RMSE value was ~5, and the R-squared was 0.25, meaning 25% of the variance can be explained by this regression model (Figure 11). Like Built Open Space, the initial models for Open Space Access did not have a clear winner that stood out. Linear Regression was chosen as the best model to display Open Space Access results because it had a relatively low RMSE and an R-squared value of 0.50, meaning 50% of the variance in this model can be explained (Figure 12). After tuning for RMSE, the Boosted Tree model had an RMSE of 2.76 and an R-squared value of 0.01%. The Built Open Space Predicted vs. Actual values appeared to have a linear relationship; however, the squared residuals show little to no correlation between the predicted values and the residuals. For Open Space Access, using Linear Regression resulted in an RMSE of 18.55 and an R-squared value of 0.66. Like Built Open Space, the predicted and actual values for Open Space Access appear to have a linear relationship, but the residuals do not. Open Space Access results have an outlier in the Open Space Access Predicted vs. Residual values, where the predicted value is very low (>20), and the residual is extremely high, with a value of over 150.

Discussion

Our research found no significant correlation between urban population density and public open space access. The small sample size of 25 cities led to some analysis issues when looking at the relationship between urban demographics and park access. Additionally, Anchorage, Alaska, was an outlying point throughout the analysis, consistently being at the extreme ends of our graphs. Yet, due to our small sample size, it was decided not to remove Alaska and make our sample size even smaller. Anchorage tended to skew the data because there is much higher public open space availability with much less population density due to the nature of the state. The small sample size also created issues in our modeling, such as having to perform a 3-fold cross-validation instead of a standard 10-fold cross-validation (see Modeling section). Built Open Space Access results do not show a relationship between Built Open Space predicted vs. actual residuals, meaning the model failed to accurately predict the amount of built open space in the cities. We did not find any correlation between a city’s population density and the amount of open space the city has; therefore, we reject the hypothesis that there is an intermediate/sublinear relationship between urban population density and public open space availability. This means that an increase in population or population density does not lead to city planners creating more green spaces to accommodate a growing urban population. The mental and physical health effects of a large, urban population without sufficient access to outdoor spaces are unknown and are grounds for further study. This report could be improve with further statistical assessment of these results including ANOVA to better determine significance. That said, the small sample size gives these results extremely high uncertainty. If scope had allowed, statistically testing predicted outputs against the original UN dataset where n > 100 would have helped quantify the uncertainty and variability to better draw conclusions with the small dataset.

Conclusion

In conclusion, we found no correlation between urban population density and public open space access, meaning that there has not been a noticeable change in the urban planning sector to accommodate a continuously growing urban population. Additionally, it is unknown what the mental health effects for this population will be. Looking forward, it would be best to find a different dataset with more cities for our sample size so that we could remove any outliers that may skew our data, allowing us to come to a stronger conclusion. It is important for urban planners to start shifting towards more public open space access to make substantial improvements to society’s overall mental and physical health. This change could be difficult due to the recent and historical lack of public open space for city residents, but it is a change that should be recognized and attempted to move towards.

References

Beall, J., Guha-Khasnobis, B., & Kanbur, R. (2010). Urbanization and development: Multidisciplinary perspectives. Oxford University Press.
Bratman, G. N., Hamilton, J. P., & Daily, G. C. (2012). The impacts of nature experience on human cognitive function and mental health. Annals of the New York Academy of Sciences, 1249(1), 118–136. https://doi.org/https://doi.org/10.1111/j.1749-6632.2011.06400.x
Kohlhase, J. E. (2013). The new urban world 2050: Perspectives, prospects and problems. Regional Science Policy & Practice, 5(2), 153–166.
May, R., Rex, K., Bellini, L., Sadullah, S., Nishi, E., James, F., & Mathangani, A. (2000). UN habitat indicators database: Evaluation as a source of the status of urban development problems and programs. Cities, 17(3), 237–244.
Ratcliffe, M. (2022). Redefining Urban Areas following the 2020 Census. In Census.gov. https://www.census.gov/newsroom/blogs/random-samplings/2022/12/redefining-urban-areas-following-2020-census.html
Slack, T., & Jensen, L. (2020). The changing demography of rural and small-town america. Population Research and Policy Review, 39(5), 775–783.