Daily Exercise 06

ESS 330 - Quantitative Reasoning

COVID-19 Data

Data

We are going to practice some data wrangling skills using a real-world dataset about COVID cases curated and maintained by the New York Times. The data has been used to create reports and data visualizations like this, and are archived on a GitHub repo here. Looking at the README in this repository we read:

“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties. A smaller file with only the most recent 30 days of data is also available”

For this lab we will use the historic, recent, country level data which is stored as an updating CSV at this URL:

https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv

Start

I will get you started this week, in the following code chunk, I am attaching the tidyverse package; saving the NY-Times URL as a value called “url”; and I am reading that URL into an object called covid

library(tidyverse)
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv'
covid = read_csv(url)
head(covid, 5)
# A tibble: 5 × 6
  date       county  state   fips  cases deaths
  <date>     <chr>   <chr>   <chr> <dbl>  <dbl>
1 2023-02-22 Autauga Alabama 01001 19732    230
2 2023-02-22 Baldwin Alabama 01003 69641    724
3 2023-02-22 Barbour Alabama 01005  7451    112
4 2023-02-22 Bibb    Alabama 01007  8067    109
5 2023-02-22 Blount  Alabama 01009 18616    261

Hint: You can print the top X rows of a data.frame with slice.

slice(covid, 1:5)

to print the top 5 columns of the raw covid object

Question 1:

Use dplyr verbs to create a data.frame of the 5 counties with the most current cases. Remember, the cases and deaths are cumulative, so you only need to deal with the data for the most current (max) date.

(Hint: filter, arrange, slice)

# Find the most recent date
max_date <- max(covid$date)
most_recent_data <- filter(covid, date == max_date)

# Get the 5 counties with the most cases
top_5_counties_cases <- most_recent_data %>%
  arrange(desc(cases)) %>% 
  slice(1:5)

top_5_counties_cases
# A tibble: 5 × 6
  date       county        state      fips    cases deaths
  <date>     <chr>         <chr>      <chr>   <dbl>  <dbl>
1 2023-03-23 Los Angeles   California 06037 3723182  35960
2 2023-03-23 New York City New York   <NA>  3252967  45123
3 2023-03-23 Miami-Dade    Florida    12086 1547142  12324
4 2023-03-23 Cook          Illinois   17031 1541156  15884
5 2023-03-23 Maricopa      Arizona    04013 1533874  18877

Question 2:

Use dplyr verbs to create a data.frame of the 5 states with the most cases current cases.

(Hint: filter, group_by, summarize, arrange, slice)

# Summarize total cases per state for current data
top_5_states_cases <- most_recent_data %>% 
  group_by(state) %>% 
  summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  arrange(desc(total_cases)) %>%
  slice(1:5)

top_5_states_cases
# A tibble: 5 × 2
  state      total_cases
  <chr>            <dbl>
1 California    12169158
2 Texas          8447233
3 Florida        7542869
4 New York       6805271
5 Illinois       4107931

Question 3:

Use the dplyr verbs to report the 5 counties with the worst current death/cases ratio: (e.g.\(100* (deaths/cases)\))

(Hint: You will need to remove those where cases == 0 and county == Unknown) (Hint: filter, mutate, arrange, slice)

# Calculate death/case ratio for counties;
  # Exclude counties with no cases or missing data
worst_counties_ratio <- most_recent_data %>% 
  filter(cases > 0, county != "Unknown") %>% 
  mutate(death_case_ratio = (deaths / cases)) %>%
  arrange(desc(death_case_ratio)) %>% 
  slice(1:5)

worst_counties_ratio
# A tibble: 5 × 7
  date       county   state    fips  cases deaths death_case_ratio
  <date>     <chr>    <chr>    <chr> <dbl>  <dbl>            <dbl>
1 2023-03-23 Storey   Nevada   32029   197     14           0.0711
2 2023-03-23 Sabine   Texas    48403  1672     94           0.0562
3 2023-03-23 McMullen Texas    48311   196     11           0.0561
4 2023-03-23 Blaine   Nebraska 31009    76      4           0.0526
5 2023-03-23 Grant    Nebraska 31075   114      6           0.0526

Question 4:

Use the dplyr verbs to report the 5 states with the worst current death/case ratio.

(Hint: filter, group_by, summarize, mutate, arrange, slice)

# Caclulate death/case ratio for states
  # Exclude states with 0 cases
worst_states_ratio <- most_recent_data %>% 
  filter(cases > 0) %>%                       # Remove states with 0 cases
  group_by(state) %>% 
  summarize(
    total_cases = sum(cases, na.rm = TRUE),   # Total cases per state
    total_deaths = sum(deaths, na.rm = TRUE)  # Total deaths per state
  ) %>%
  mutate(death_case_ratio = 100 * (total_deaths / total_cases)) %>% 
  arrange(desc(death_case_ratio)) %>%         # Sort by death to case ratio, highest to lowest
  slice(1:5)

worst_states_ratio
# A tibble: 5 × 4
  state        total_cases total_deaths death_case_ratio
  <chr>              <dbl>        <dbl>            <dbl>
1 Pennsylvania     3539135        50701             1.43
2 Michigan         3068195        42311             1.38
3 Georgia          2984923        41056             1.38
4 Nevada            892814        12093             1.35
5 Arizona          2451062        33190             1.35