The COVID-19 pandemic

This notebook uses the NYT github repository to analyze the progression of COVID-19 throughout the US states. The analyses contained herein are not meant to be used as primary literature on their own, and may have mistakes. I make no claims as to the accuracy of my calculations.

The questions I am interested in asking regarding this pandemic are fairly straightforward:

* What is the case fatality rate through time?
* What do the case / death curves look through time?
* Are the curves flattening?

I have used the 2019 population census projections to normalize data by population, and I also used the census bureau areas to compute population density.

To compute per-day difference, I was originally using a savgol_filter as implemented by scipy. As of April 26, 2020, I am using a Gaussian Kernel smoother with a 2 standard deviation bandwidth.

In [1]:
import datetime as dt
today = dt.datetime.now() 
print('This notebook was last updated on', today.strftime('%A %B %d, %Y at %H:%M'))   
This notebook was last updated on Friday January 22, 2021 at 11:00
In [2]:
import sys
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from matplotlib import rc
from matplotlib import ticker
from matplotlib import dates as mdates
from matplotlib.dates import DateFormatter

rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})

%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style("dark")

mpl.rcParams['xtick.labelsize'] = 16 
mpl.rcParams['ytick.labelsize'] = 16 
mpl.rcParams['legend.fontsize'] = 14

sys.path.append('./utils')

# see https://github.com/dangeles/dangeles.github.io/blob/master/jupyter/utils/covid_utils.py
import covid_utils as cv 

Loading the data

You can find the spreadsheets I downloaded here: https://github.com/dangeles/dangeles.github.io/blob/master/data/

In [3]:
# load into a dataframe:
pop = pd.read_excel('../data/nst-est2019-01.xlsx', comment='#', header=1)

# fetch NYT data:
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
df = pd.read_csv(url, usecols=[0, 1, 3, 4], parse_dates=['date'], squeeze=True)

pop.columns = np.append(np.array(['state']), pop.columns[1:].values)
pop.state = pop.state.str.strip('.')

# merge dfs:
df = df.merge(pop, left_on='state', right_on='state')

# calculate per population numbers:
df['normedPopCases'] = df.cases/ df[2019]
df['normedPopDeaths'] = df.deaths / df[2019]

# find max:
cases = df.groupby('state').cases.apply(max).sum()
death_toll = df.groupby('state').deaths.apply(max).sum()

print('Cases in the US at last update: {0:.2f}'.format(cases / 10 ** 6), 'million')
print('Death toll in the US at last update: {0:.0f} thousand'.format(death_toll / 10 ** 3))
print('Daily deaths: {0:.0f}'.format(df.groupby(['date']).deaths.apply(np.sum).diff().values[-1]))

# predicted deaths based on cases and a 1 in 55 CFR
print('Predicted deaths tomorrow: {0:.0f}'.format(
        df.groupby(['date']).cases.apply(np.sum).diff().values[-20:-8].mean() / 55))
print('Predicted deaths in 2 weeks: {0:.0f}'.format(
        df.groupby(['date']).cases.apply(np.sum).diff().values[-8].mean() / 55))

# calculate worst off states right now:
c_ = []
for n, g in df[df.cases > 10 ** 3].groupby('state'):
    x = (g.date - g.date.min()) / dt.timedelta(days=1)

    if len(g) < 15:
        continue
    y = g.cases.rolling(window=10, win_type='gaussian',
                          center=True).mean(std=2).round()
    y = g.normedPopDeaths.rolling(window=10, win_type='gaussian',
                                  center=True).mean(std=2).round()
#     y = y.diff()
    c_ += [[n, y.dropna().values[-1]]]

worst = pd.DataFrame(c_, columns=['state', 'new_cases'])
worst.sort_values('new_cases', inplace=True)
worst = worst.state.values[-4:]
print('Worst states right now:', worst)
Cases in the US at last update: 24.68 million
Death toll in the US at last update: 410 thousand
Daily deaths: 4142
Predicted deaths tomorrow: 4483
Predicted deaths in 2 weeks: 4338
Worst states right now: ['Michigan' 'Minnesota' 'Wisconsin' 'Wyoming']

COVID in the total US

In [4]:
us  = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
us = us[us.date >= us[us.deaths > 10].date.min()]
us['RefTime'] = (us.date - us.date.min()) / dt.timedelta(days=1)

fig, ax = plt.subplots(ncols=3, sharex=True, figsize=(16, 4))

ax[0].plot(us.RefTime, us.cases, color='black', label='cases')
ax[0].plot(us.RefTime, us.deaths, color='red', label='deaths')
ax[1].plot(us.RefTime, us.cases.diff().rolling(win_type='exponential',
                                               window=14, center=True).mean(tau=10) / 10 ** 3,
          color='black', label='cases per 1000')
ax[2].plot(us.RefTime, us.deaths.diff().rolling(win_type='exponential',
                                                window=14, center=True).mean(tau=10),
          color='red', label='deaths')

ax[1].scatter(us.RefTime, np.gradient(us.cases) / 10 ** 3,
              color='black', label='cases (raw)', alpha=0.1)
ax[2].scatter(us.RefTime, np.gradient(us.deaths),
              color='red', label='deaths (raw)', alpha=0.1)
ax[2].axhline(np.max(np.gradient(us.deaths)), ls='--', color='blue',
              label='Max Daily Deaths = {0:.0f}'.format(np.max(np.gradient(us.deaths))))
ax[2].axhline(np.gradient(us.deaths)[-1], ls='-.', color='red',
              label='Daily Deaths = {0:.0f}'.format(np.gradient(us.deaths)[-1]))

ax[0].set_yscale('log')
# ax[1].set_yscale('log')

ax[0].set_ylabel('Number')
ax[1].set_ylabel('Number Cases per 1000 / Day')
ax[2].set_ylabel('Number Deaths / Day')
fig.text(0.5, -0.04, 'Days since first 10 US deaths',
         ha='center', fontsize=18)
plt.legend(loc=(1, .4))
plt.tight_layout()

Epidemiological curves of COVID-19

I have plotted the cases and deaths through time in the plots below in 2 different ways. The first column shows the absolute number of cases (first row) or deaths (second row). The second column shows the number of cases (deaths) normalized to the population of each state. The second column can be interpreted as your risk of getting COVID-19 through time for any given state, since it tells you the number of cases (or deaths) per million people for each state.

In [5]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10, 7))
ax[0, :] = cv.plot(ax[0], df, 'cases', 'normedPopCases', n1=1, alpha=0.2 )
ax[1, :] = cv.plot(ax[1], df, 'deaths', 'normedPopDeaths',
                   1, 10 ** -6, ylab='Death', alpha=0.2)

for ai in ax:
    for aij in ai:
        aij.set_yscale('log')
_ = ax[0, 1].legend(loc=(1, 0))

ax[1, 1].set_ylim(1, 5 * 10**3)
plt.tight_layout()