The COVID-19 pandemic

This notebook uses data from the NYT GitHub repository to analyze the progression of COVID-19 across the US states. The analyses contained herein are not meant to be used as primary literature on their own and may contain mistakes. I make no claims as to the accuracy of my calculations.

The questions I am interested in asking regarding this pandemic are fairly straightforward:

* What is the case fatality rate through time?
* What do the case and death curves look like through time?
* Are the curves flattening?

I have used the 2019 census population projections to normalize the data by population, and I used the Census Bureau land areas to compute population density.
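
For illustration, here is a minimal sketch of that normalization on made-up numbers; the column names (pop_2019, land_area_sq_mi) are placeholders and do not match the actual census spreadsheet headers:

import pandas as pd

# toy example of the normalization (placeholder column names; illustrative values)
toy = pd.DataFrame({
    'state': ['California', 'Texas'],
    'cases': [500_000, 450_000],
    'pop_2019': [39_512_223, 28_995_881],   # 2019 population estimates
    'land_area_sq_mi': [155_779, 261_232],  # Census Bureau land areas
})
toy['normedPopCases'] = toy.cases / toy.pop_2019         # cases per person
toy['pop_density'] = toy.pop_2019 / toy.land_area_sq_mi  # people per square mile
print(toy[['state', 'normedPopCases', 'pop_density']])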

To compute per-day differences, I originally used a Savitzky-Golay filter (scipy.signal.savgol_filter). As of April 26, 2020, I am using a Gaussian kernel smoother with a bandwidth of 2 standard deviations.
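
To make the switch concrete, here is a minimal sketch of both approaches on a toy cumulative-count series (the real smoothing lives in covid_utils.py; the parameter values here simply mirror the ones used further down):

import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# toy cumulative case counts
cum = pd.Series(np.cumsum(np.random.poisson(100, size=60)).astype(float))

# previous approach: Savitzky-Golay smoothing, then differentiate
daily_savgol = np.gradient(savgol_filter(cum, window_length=11, polyorder=3))

# current approach: Gaussian kernel smoother (std = 2 days), then first differences
daily_gauss = cum.rolling(window=10, win_type='gaussian', center=True).mean(std=2).diff()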

In [1]:
import datetime as dt
today = dt.datetime.now() 
print('This notebook was last updated on', today.strftime('%A %B %d, %Y at %H:%M'))   
This notebook was last updated on Thursday August 06, 2020 at 14:16
In [2]:
import sys
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from matplotlib import rc
from matplotlib import ticker
from matplotlib import dates as mdates
from matplotlib.dates import DateFormatter

rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})

%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
# note: this dict shadows the matplotlib rc function imported above; it is only passed to seaborn below
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style("dark")

mpl.rcParams['xtick.labelsize'] = 16 
mpl.rcParams['ytick.labelsize'] = 16 
mpl.rcParams['legend.fontsize'] = 14

sys.path.append('./utils')

# see https://github.com/dangeles/dangeles.github.io/blob/master/jupyter/utils/covid_utils.py
import covid_utils as cv 

Loading the data

You can find the spreadsheets I downloaded here: https://github.com/dangeles/dangeles.github.io/blob/master/data/

In [3]:
# load into a dataframe:
pop = pd.read_excel('../data/nst-est2019-01.xlsx', comment='#', header=1)

# fetch NYT data:
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
df = pd.read_csv(url, usecols=[0, 1, 3, 4], parse_dates=['date'], squeeze=True)

pop.columns = np.append(np.array(['state']), pop.columns[1:].values)
pop.state = pop.state.str.strip('.')

# merge dfs:
df = df.merge(pop, left_on='state', right_on='state')

df['normedPopCases'] = df.cases / df[2019]
df['normedPopDeaths'] = df.deaths / df[2019]

cases = df.groupby('state').cases.apply(max).sum()
death_toll = df.groupby('state').deaths.apply(max).sum()
print('Cases in the US at last update: {0:.2f}'.format(cases / 10 ** 6), 'million')
print('Death toll in the US at last update: {0:.0f} thousand'.format(death_toll / 10 ** 3)) 

# calculate worst off states right now:
c_ = []
for n, g in df[df.cases > 10 ** 3].groupby('state'):
    x = (g.date - g.date.min()) / dt.timedelta(days=1)

    if len(g) < 15:
        continue
    y = g.cases.rolling(window=10, win_type='gaussian',
                        center=True).mean(std=2).round()
    y = y.diff()
    c_ += [[n, y.dropna().values[-1]]]

worst = pd.DataFrame(c_, columns=['state', 'new_cases'])
worst.sort_values('new_cases', inplace=True)
worst = worst.state.values[-4:]
print('Worst states right now:', worst)
Cases in the US at last update: 4.83 million
Death toll in the US at last update: 159 thousand
Worst states right now: ['Georgia' 'California' 'Texas' 'Florida']

COVID in the total US

In [4]:
us = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
us = us[us.date >= us[us.deaths > 10].date.min()]
us['RefTime'] = (us.date - us.date.min()) / dt.timedelta(days=1)

fig, ax = plt.subplots(ncols=2, sharex=True, figsize=(12, 4))

ax[0].plot(us.RefTime, us.cases, color='black', label='cases')
ax[0].plot(us.RefTime, us.deaths, color='red', label='deaths')
ax[1].plot(us.RefTime, us.cases.diff().rolling(win_type='exponential',
                                               window=8, center=True).mean(tau=10),
          color='black', label='cases')
ax[1].plot(us.RefTime, us.deaths.diff().rolling(win_type='exponential',
                                                window=8, center=True).mean(tau=10),
          color='red', label='deaths')

ax[1].scatter(us.RefTime, np.gradient(us.cases),
              color='black', label='cases (raw)', alpha=0.1)
ax[1].scatter(us.RefTime, np.gradient(us.deaths),
              color='red', label='deaths (raw)', alpha=0.1)
ax[1].axhline(np.max(np.gradient(us.deaths)), ls='--', color='blue', label='Max Daily Deaths = {0:.0f}'.format(np.max(np.gradient(us.deaths))))
ax[1].axhline(np.gradient(us.deaths)[-1], ls='-.', color='red', label='Daily Deaths = {0:.0f}'.format(np.gradient(us.deaths)[-1]))

ax[0].set_yscale('log')
ax[1].set_yscale('log')

ax[0].set_ylabel('Number')
ax[1].set_ylabel('Number / Day')
fig.text(0.5, -0.04, 'Days since first 10 US deaths',
         ha='center', fontsize=18)
plt.legend(loc=(1, .4))
plt.tight_layout()

Epidemiological curves of COVID-19

I have plotted the cases and deaths through time in two different ways in the plots below. The first column shows the absolute number of cases (first row) or deaths (second row). The second column shows the number of cases (or deaths) normalized to the population of each state, i.e., cases (or deaths) per million people, which can be read as the risk of getting COVID-19 through time in any given state.
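
As a quick sanity check on the units, converting a per-capita value into cases per million looks like this (illustrative numbers; I am assuming cv.plot applies the equivalent 10**6 scaling internally):

cases = 4_800           # cumulative cases in a state on a given date (made up)
population = 1_200_000  # 2019 population estimate for that state (made up)
per_capita = cases / population   # what normedPopCases stores; 0.004 here
per_million = per_capita * 10**6  # 4,000 cases per million residents
print(per_capita, per_million)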

In [5]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10, 7))
ax[0, :] = cv.plot(ax[0], df, 'cases', 'normedPopCases', n1=1, alpha=0.2)
ax[1, :] = cv.plot(ax[1], df, 'deaths', 'normedPopDeaths',
                   1, 10 ** -6, ylab='Death', alpha=0.2)

for ai in ax:
    for aij in ai:
        aij.set_yscale('log')
_ = ax[0, 1].legend(loc=(1, 0))

ax[1, 1].set_ylim(1, 5 * 10**3)
plt.tight_layout()

Are the curves flattening?

Notice that the case curves are on a linear scale, while the death curves are on a log scale.

In [6]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(12, 7), constrained_layout=True)

fig.suptitle('Rate of Change of COVID19', fontsize=20)
ax[0, :] = cv.plot(ax[0], df, 'cases', 'normedPopCases',  n1=1, gradient=True, window=8)
ax[1, :] = cv.plot(ax[1], df, 'deaths', 'normedPopDeaths',
                   1, 10 ** -8, ylab='Death', gradient=True, window=8)

# ax[1,0].set_ylim(0, 200)
ax[1,1].set_ylim(0, 20)
_ = ax[0, 1].legend(loc=(1, 0))