https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csvDaily Exercise 06
ESS 330 - Quantitative Reasoning
COVID-19 Data
Data
We are going to practice some data wrangling skills using a real-world dataset about COVID cases curated and maintained by the New York Times. The data has been used to create reports and data visualizations like this, and are archived on a GitHub repo here. Looking at the README in this repository we read:
“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties. A smaller file with only the most recent 30 days of data is also available”
For this lab we will use the historic, recent, country level data which is stored as an updating CSV at this URL:
Start
I will get you started this week, in the following code chunk, I am attaching the tidyverse package; saving the NY-Times URL as a value called “url”; and I am reading that URL into an object called covid
library(tidyverse)
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv'
covid = read_csv(url)
head(covid, 5)# A tibble: 5 × 6
date county state fips cases deaths
<date> <chr> <chr> <chr> <dbl> <dbl>
1 2023-02-22 Autauga Alabama 01001 19732 230
2 2023-02-22 Baldwin Alabama 01003 69641 724
3 2023-02-22 Barbour Alabama 01005 7451 112
4 2023-02-22 Bibb Alabama 01007 8067 109
5 2023-02-22 Blount Alabama 01009 18616 261
Hint: You can print the top X rows of a data.frame with slice.
slice(covid, 1:5)to print the top 5 columns of the raw covid object
Question 1:
Use dplyr verbs to create a data.frame of the 5 counties with the most current cases. Remember, the cases and deaths are cumulative, so you only need to deal with the data for the most current (max) date.
(Hint: filter, arrange, slice)
# Find the most recent date
max_date <- max(covid$date)
most_recent_data <- filter(covid, date == max_date)
# Get the 5 counties with the most cases
top_5_counties_cases <- most_recent_data %>%
arrange(desc(cases)) %>%
slice(1:5)
top_5_counties_cases# A tibble: 5 × 6
date county state fips cases deaths
<date> <chr> <chr> <chr> <dbl> <dbl>
1 2023-03-23 Los Angeles California 06037 3723182 35960
2 2023-03-23 New York City New York <NA> 3252967 45123
3 2023-03-23 Miami-Dade Florida 12086 1547142 12324
4 2023-03-23 Cook Illinois 17031 1541156 15884
5 2023-03-23 Maricopa Arizona 04013 1533874 18877
Question 2:
Use dplyr verbs to create a data.frame of the 5 states with the most cases current cases.
(Hint: filter, group_by, summarize, arrange, slice)
# Summarize total cases per state for current data
top_5_states_cases <- most_recent_data %>%
group_by(state) %>%
summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
arrange(desc(total_cases)) %>%
slice(1:5)
top_5_states_cases# A tibble: 5 × 2
state total_cases
<chr> <dbl>
1 California 12169158
2 Texas 8447233
3 Florida 7542869
4 New York 6805271
5 Illinois 4107931
Question 3:
Use the dplyr verbs to report the 5 counties with the worst current death/cases ratio: (e.g.\(100* (deaths/cases)\))
(Hint: You will need to remove those where cases == 0 and county == Unknown) (Hint: filter, mutate, arrange, slice)
# Calculate death/case ratio for counties;
# Exclude counties with no cases or missing data
worst_counties_ratio <- most_recent_data %>%
filter(cases > 0, county != "Unknown") %>%
mutate(death_case_ratio = (deaths / cases)) %>%
arrange(desc(death_case_ratio)) %>%
slice(1:5)
worst_counties_ratio# A tibble: 5 × 7
date county state fips cases deaths death_case_ratio
<date> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2023-03-23 Storey Nevada 32029 197 14 0.0711
2 2023-03-23 Sabine Texas 48403 1672 94 0.0562
3 2023-03-23 McMullen Texas 48311 196 11 0.0561
4 2023-03-23 Blaine Nebraska 31009 76 4 0.0526
5 2023-03-23 Grant Nebraska 31075 114 6 0.0526
Question 4:
Use the dplyr verbs to report the 5 states with the worst current death/case ratio.
(Hint: filter, group_by, summarize, mutate, arrange, slice)
# Caclulate death/case ratio for states
# Exclude states with 0 cases
worst_states_ratio <- most_recent_data %>%
filter(cases > 0) %>% # Remove states with 0 cases
group_by(state) %>%
summarize(
total_cases = sum(cases, na.rm = TRUE), # Total cases per state
total_deaths = sum(deaths, na.rm = TRUE) # Total deaths per state
) %>%
mutate(death_case_ratio = 100 * (total_deaths / total_cases)) %>%
arrange(desc(death_case_ratio)) %>% # Sort by death to case ratio, highest to lowest
slice(1:5)
worst_states_ratio# A tibble: 5 × 4
state total_cases total_deaths death_case_ratio
<chr> <dbl> <dbl> <dbl>
1 Pennsylvania 3539135 50701 1.43
2 Michigan 3068195 42311 1.38
3 Georgia 2984923 41056 1.38
4 Nevada 892814 12093 1.35
5 Arizona 2451062 33190 1.35