갈루아의 반서재

Let's analyze publicly available South Korean coronavirus (COVID-19) data in a Jupyter notebook. First, start the Jupyter notebook server so the examples can be run.

(base) founder@hilbert:~$ source activate AnnaM
(AnnaM) founder@hilbert:~$ cd annam
(AnnaM) founder@hilbert:~/annam$ jupyter notebook --no-browser --ip=0.0.0.0

 

 Libraries 

Import the libraries needed for the examples.

import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import plotly.express as px
from datetime import date, timedelta
from sklearn.cluster import KMeans
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
import plotly.offline as py
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from keras.models import Sequential
from keras.layers import LSTM,Dense
from keras.layers import Dropout
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

Install any library that is missing. In the example below the import fails because seaborn is not installed, so seaborn is installed with pip. The remaining missing libraries are handled the same way.

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-91e381f8ae37> in <module>
      1 import numpy as np
      2 import pandas as pd
----> 3 import seaborn as sns
      4 import matplotlib.pyplot as plt
      5 import matplotlib.dates as mdates

ModuleNotFoundError: No module named 'seaborn'
(AnnaM) founder@hilbert:~/annam$ pip install seaborn
lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (41.4.0)
Installing collected packages: seaborn
Successfully installed seaborn-0.10.0

ModuleNotFoundError: No module named 'fbprophet'

(AnnaM) founder@hilbert:~/annam$ pip install fbprophet
Installing collected packages: Cython, cmdstanpy, pystan, ephem, LunarCalendar, pymeeus, convertdate, holidays, setuptools-git, fbprophet
  Running setup.py install for fbprophet ... done
Successfully installed Cython-0.29.15 LunarCalendar-0.0.9 cmdstanpy-0.4.0 convertdate-2.2.0 ephem-3.7.7.1 fbprophet-0.6 holidays-0.10.1 pymeeus-0.3.7 pystan-2.19.1.1 setuptools-git-1.2

ModuleNotFoundError: No module named 'statsmodels'

(AnnaM) founder@hilbert:~/annam$ conda install -c conda-forge statsmodels
Downloading and Extracting Packages
python_abi-3.7       | 4 KB      | ################################################################################################################### | 100%
certifi-2019.11.28   | 149 KB    | ################################################################################################################### | 100%
patsy-0.5.1          | 187 KB    | ################################################################################################################### | 100%
statsmodels-0.11.1   | 10.1 MB   | ################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

The libraries are now imported successfully.

Using TensorFlow backend.
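
Rather than finding missing libraries one traceback at a time, the whole list can be checked up front. The following is a minimal sketch using only the standard library. The helper name find_missing is an illustrative assumption, and the module list simply mirrors the top-level packages imported above.

```python
import importlib.util

def find_missing(modules):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

# Top-level packages used by the imports in this post
required = ["numpy", "pandas", "seaborn", "matplotlib", "plotly",
            "sklearn", "fbprophet", "statsmodels", "keras", "tensorflow"]

missing = find_missing(required)
print("Missing libraries:", missing if missing else "none")
```

Note that the module name and the package name can differ: the sklearn module, for example, is installed as scikit-learn.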

 

 Reading Data 

First, download the dataset used below.

 

Coronavirus-Dataset

Dataset of COVID-19 in South Korea

www.kaggle.com

The dataset is updated periodically, so download it again whenever you need the latest data. The directory is laid out as follows.

(AnnaM) founder@hilbert:~/annam/kaggle$ tree
.
├── analysis-on-coronavirus.ipynb
└── input
    └── coronavirusdataset
        ├── case.csv
        ├── patient.csv
        ├── route.csv
        ├── time.csv
        └── trend.csv

2 directories, 6 files
path = 'input/coronavirusdataset/'
patient_data_path = path + 'patient.csv'
route_data_path = path + 'route.csv'
time_data_path = path + 'time.csv'

df_patient = pd.read_csv(patient_data_path)
df_route = pd.read_csv(route_data_path)
df_time = pd.read_csv(time_data_path)
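
Because the dataset is refreshed from time to time, it can help to confirm that the expected files are in place before reading them. This is a small sketch under that assumption. The helper missing_files is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def missing_files(data_dir, names):
    """Return the file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]

# Files read by the cells above
expected = ["patient.csv", "route.csv", "time.csv"]
print("Missing files:", missing_files("input/coronavirusdataset", expected))
```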

The overall COVID-19 situation in Korea can be checked at the site below.

http://ncov.mohw.go.kr/index.jsp

 

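Coronavirus Disease 19 (COVID-19)

The official COVID-19 website, providing outbreak status, confirmed patients' travel routes, guidance by target group, promotional materials, FAQ, related institutions (public health centers and screening clinics), government briefings, response guidelines, and more.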

ncov.mohw.go.kr

 

 Looking into patient data 

Let's start with the patient data.

df_patient.head()

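  1. patient_id ID of the confirmed patient (the n-th confirmed case)
  2. sex sex of the patient
  3. birth_year birth year of the patient
  4. country nationality of the patient
  5. region region
  6. group group (cluster) infection
  7. infection_reason reason for infection
  8. infection_order order of infection
  9. infected_by ID of the patient who infected this patient
  10. contact_number number of contacts
  11. confirmed_date date of confirmation
  12. released_date date of release from isolation
  13. deceased_date date of death
  14. state isolated / released / deceased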

Check the number of missing values in each column.

df_patient.isna().sum()
patient_id             0
sex                 7190
birth_year          7203
country                0
region              7432
disease             7841
group               7783
infection_reason    7715
infection_order     7833
infected_by         7799
contact_number      7816
confirmed_date         0
released_date       7813
deceased_date       7833
state                  0
dtype: int64
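
Raw counts are easier to judge as a share of all rows. As a small sketch on a made-up frame (the toy values below are not taken from the dataset), isna().mean() gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_patient (values are made up)
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "sex": ["male", None, None, "female"],
    "birth_year": [1985.0, np.nan, np.nan, np.nan],
})

# Mean of the boolean mask = fraction missing; times 100 = percentage
missing_share = toy.isna().mean() * 100
print(missing_share.round(1))
```

Applied to df_patient, the same expression shows at a glance which columns are almost entirely empty.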

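Use fillna to fill missing birth years with 0, then astype to cast the column to integers. The second line maps the 0 placeholders back to NaN so that they are not mistaken for real birth years (which turns the column back into floats).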

df_patient['birth_year'] = df_patient.birth_year.fillna(0.0).astype(int)
df_patient['birth_year'] = df_patient['birth_year'].map(lambda val: val if val > 0 else np.nan)
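
As an alternative to the 0 placeholder round trip, recent pandas versions offer a nullable integer dtype. This is a different technique from the one used above, shown here only as a sketch on made-up values:

```python
import numpy as np
import pandas as pd

birth_year = pd.Series([1985.0, np.nan, 1992.0])  # made-up values

# Nullable integer dtype: integers where known, <NA> where missing
birth_year_int = birth_year.astype("Int64")
print(birth_year_int)
```

With this dtype the missing entries stay missing while the known years are stored as integers.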

Convert the confirmation date to a datetime type, then count the confirmed cases per day and the cumulative total.

df_patient.confirmed_date = pd.to_datetime(df_patient.confirmed_date)
daily_count = df_patient.groupby(df_patient.confirmed_date).patient_id.count()
accumulated_count = daily_count.cumsum()
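
To make the two counts concrete, here is the same groupby and cumsum on a tiny made-up frame (the toy rows are assumptions for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "confirmed_date": ["2020-01-20", "2020-01-20", "2020-01-22", "2020-01-23"],
})
toy["confirmed_date"] = pd.to_datetime(toy["confirmed_date"])

daily = toy.groupby("confirmed_date")["patient_id"].count()  # cases per day
accumulated = daily.cumsum()                                  # running total
print(daily.tolist(), accumulated.tolist())
```

Days without any confirmed case do not appear in the result, which is why the prediction section later resamples to a daily frequency.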

Calculating patient age

df_patient['age'] = 2020 - df_patient['birth_year'] 

Calculating the age group

import math
def group_age(age):
    if age >= 0: # not NaN
        if age % 10 != 0:
            lower = int(math.floor(age / 10.0)) * 10
            upper = int(math.ceil(age / 10.0)) * 10 - 1
            return f"{lower}-{upper}"
        else:
            lower = int(age)
            upper = int(age + 9) 
            return f"{lower}-{upper}"
    return "Unknown"


df_patient["age_range"] = df_patient["age"].apply(group_age)
df_patient.head()
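
The two branches in group_age produce the same kind of label, so the logic can also be written with floor division. This is a shorter sketch of the same bucketing. The name group_age_simple is hypothetical and not part of the original notebook.

```python
def group_age_simple(age):
    """Map an age to a ten-year bucket label such as '20-29'."""
    # NaN is the only value that is not equal to itself
    if age is None or age != age or age < 0:
        return "Unknown"
    lower = int(age // 10) * 10
    return f"{lower}-{lower + 9}"

for sample in [0, 25, 30, float("nan")]:
    print(sample, group_age_simple(sample))
```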

patient=df_patient

 

 Preprocessing 

date_cols = ["confirmed_date", "released_date", "deceased_date"]
for col in date_cols:
    patient[col] = pd.to_datetime(patient[col])
patient["time_to_release_since_confirmed"] = patient["released_date"] - patient["confirmed_date"]

patient["time_to_death_since_confirmed"] = patient["deceased_date"] - patient["confirmed_date"]
patient["duration_since_confirmed"] = patient[["time_to_release_since_confirmed", "time_to_death_since_confirmed"]].min(axis=1)
patient["duration_days"] = patient["duration_since_confirmed"].dt.days
age_ranges = sorted(set([ar for ar in patient["age_range"] if ar != "Unknown"]))
patient["state_by_gender"] = patient["state"] + "_" + patient["sex"]
accumulated_count.plot()
plt.title('Accumulated Confirmed Count');
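
The duration columns rely on two pandas behaviors: subtracting datetimes yields timedeltas, and a row-wise min skips missing values. A sketch on made-up dates (the toy rows are assumptions) shows the effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "confirmed_date": pd.to_datetime(["2020-02-01", "2020-02-01", "2020-02-05"]),
    "released_date": pd.to_datetime(["2020-02-06", None, None]),
    "deceased_date": pd.to_datetime([None, "2020-02-11", None]),
})

toy["to_release"] = toy["released_date"] - toy["confirmed_date"]
toy["to_death"] = toy["deceased_date"] - toy["confirmed_date"]

# Row-wise minimum ignores NaT, so whichever event happened is picked up
duration = toy[["to_release", "to_death"]].min(axis=1)
duration_days = duration.dt.days
print(duration_days)
```

A patient who is still isolated has neither date, so the duration stays missing for that row.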

Current state of confirmed patients

infected_patient = patient.shape[0]
rp = patient.loc[patient["state"] == "released"].shape[0]
dp = patient.loc[patient["state"] == "deceased"].shape[0]
ip = patient.loc[patient["state"]== "isolated"].shape[0]
rp=rp/patient.shape[0]
dp=dp/patient.shape[0]
ip=ip/patient.shape[0]
print("The percentage of recovery is "+ str(rp*100) )
print("The percentage of deceased is "+ str(dp*100) )
print("The percentage of isolated is "+ str(ip*100) )
The percentage of recovery is 0.7116533231668573
The percentage of deceased is 0.4574914220358368
The percentage of isolated is 98.8308552547973
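
The same percentages can be obtained in one step with value_counts(normalize=True). The sketch below uses a made-up series rather than the real state column:

```python
import pandas as pd

state = pd.Series(["isolated"] * 8 + ["released", "deceased"])  # made-up states

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
share = state.value_counts(normalize=True) * 100
print(share)
```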
states = pd.DataFrame(patient["state"].value_counts())
states["status"] = states.index
states.rename(columns={"state": "count"}, inplace=True)

fig = px.pie(states,
             values="count",
             names="status",
             title="Current state of patients",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="value+percent+label")
fig.show()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-56bd9c98d94c> in <module>
      3 states.rename(columns={"state": "count"}, inplace=True)
      4 
----> 5 fig = px.pie(states,
      6              values="count",
      7              names="status",

AttributeError: module 'plotly.express' has no attribute 'pie'

The installed plotly (4.2.1) does not provide px.pie, so upgrade plotly and run the cell again.

(AnnaM) founder@hilbert:~/annam/kaggle$ pip install plotly==4.5.4
Installing collected packages: plotly
  Found existing installation: plotly 4.2.1
    Uninstalling plotly-4.2.1:
      Successfully uninstalled plotly-4.2.1

Information on released patients

released = df_patient[df_patient.state == 'released']
released.head()

Information on patients in isolation

isolated_state = df_patient[df_patient.state == 'isolated']
isolated_state.head()

Data on deceased patients

dead = df_patient[df_patient.state == 'deceased']
dead.head()

Age distribution of released patients

plt.figure(figsize=(10,6))
sns.set_style("darkgrid")
plt.title("Age distribution of the released")
sns.kdeplot(data=released['age'], shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9332fe0b50>

Age distribution of isolated patients

plt.figure(figsize=(10,6))
sns.set_style("darkgrid")
plt.title("Age distribution of the isolated")
sns.kdeplot(data=isolated_state['age'], shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331f44250>

Age distribution of deceased patients

plt.figure(figsize=(10,6))
sns.set_style("darkgrid")
plt.title("Age distribution of the deceased")
sns.kdeplot(data=dead['age'], shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331ec9350>

male_dead = dead[dead.sex=='male']
female_dead = dead[dead.sex=='female']

Age distribution of the deceased by sex

plt.figure(figsize=(10,6))
sns.set_style("darkgrid")
plt.title("Age distribution of the deceased by gender")
sns.kdeplot(data=female_dead['age'], label="Women", shade=True)
sns.kdeplot(data=male_dead['age'], label="Men", shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331eb0dd0>

plt.figure(figsize=(10,8))
sns.set_style("darkgrid")
sns.distplot(a=male_dead['age'], label="Men", kde=False)
sns.distplot(a=female_dead['age'], label="Women", kde=False)
plt.title("Age distribution of the deceased by sex")
plt.legend()

<matplotlib.legend.Legend at 0x7f9331e30690>

Comparing the age distributions of deceased, released, and isolated patients

sns.kdeplot(data=dead['age'],label='deceased', shade=True)
sns.kdeplot(data=released['age'],label='released', shade=True)
sns.kdeplot(data=isolated_state['age'],label='isolated', shade=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331dc3e90>

* Note: the labels in the resulting figure were assigned incorrectly, so "released" appears twice.

Number of deaths by sex

plt.figure(figsize=(15, 5))
plt.title('Sex')
dead.sex.value_counts().plot.bar();

Infection reason

plt.figure(figsize=(15,5))
plt.title('Infection reason')
df_patient.infection_reason.value_counts().plot.bar();

Current state of patients infected through contact with a patient

sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=df_patient['state'].loc[
    (df_patient['infection_reason']=='contact with patient')
])

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331c43090>

Current state of male patients

sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=df_patient['state'].loc[(df_patient['sex']=="male")])

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331badc50>

Current state of female patients

sns.set(rc={'figure.figsize':(5,5)})
sns.countplot(x=df_patient['state'].loc[(df_patient['sex']=="female")])

<matplotlib.axes._subplots.AxesSubplot at 0x7f9331b80d90>

State of male and female patients by age group

age_gender_hue_order =["isolated_female", "released_female", "deceased_female",
                       "isolated_male", "released_male", "deceased_male"]
custom_palette = sns.color_palette("Reds")[3:6] + sns.color_palette("Blues")[2:5]

plt.figure(figsize=(12, 8))
sns.countplot(x = "age_range",
              hue="state_by_gender",
              order=age_ranges,
              hue_order=age_gender_hue_order,
              palette=custom_palette,
              data=patient)
plt.title("State by gender and age", fontsize=16)
plt.xlabel("Age range", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc="upper right")
plt.show()

Patient age and state by region

sns.set_style("whitegrid")
# seaborn 0.9 and later use `height`; the older `size` argument is deprecated
sns.FacetGrid(df_patient, hue='state', height=10)\
.map(plt.scatter, 'age', 'region')\
.add_legend()
plt.title('Region by age and state')
plt.show()

 

Looking into route data

df_route.head()

Check whether there are any null values

df_route.isna().sum()
patient_id    0
date          0
province      0
city          0
visit         0
latitude      0
longitude     0
dtype: int64
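# The ID column is named patient_id in this version of route.csv (see the output above)
clus = df_route.loc[:, ['patient_id', 'latitude', 'longitude']]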
clus.head(10)

Determining the number of clusters

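K_clusters = range(1,8)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
# Note: the curve below is computed on latitude only; X_axis (longitude) is defined but not used
Y_axis = df_route[['latitude']]
X_axis = df_route[['longitude']]
# score() returns the negative within-cluster sum of squares, so values closer to 0 are better
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.show()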

In the graph above, the score is essentially flat beyond 4 clusters, so we proceed with the number of clusters set to 4.
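
The elbow reading can be tried on synthetic data where the right answer is known. The sketch below is not the notebook's own code: it uses made-up coordinates forming three well-separated groups, clusters on both latitude and longitude, and uses the inertia_ attribute (the within-cluster sum of squares, whose negative is what score() reports).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three made-up, well-separated groups of (latitude, longitude) points
group_centers = np.array([[35.1, 129.0], [35.9, 128.6], [37.5, 127.0]])
points = np.vstack([c + rng.normal(scale=0.02, size=(30, 2)) for c in group_centers])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)  # within-cluster sum of squares

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 4))
```

The inertia drops sharply up to the true number of groups and then flattens, which is the same pattern used above to settle on 4 clusters.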

kmeans = KMeans(n_clusters=4, init='k-means++')
coords = clus[['latitude', 'longitude']]
# fit_predict fits the model once and returns the cluster label of every point
clus['cluster_label'] = kmeans.fit_predict(coords)
centers = kmeans.cluster_centers_
labels = clus['cluster_label']

Visualizing the clusters

clus.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=100, alpha=0.5)

<matplotlib.collections.PathCollection at 0x7f93310d61d0>

With the folium library, these points can be placed on a map to see where the clusters are located.

Showing the infection locations on a map

import folium
southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')

for lat, lon, city in zip(df_route['latitude'], df_route['longitude'], df_route['city']):
    folium.CircleMarker([lat, lon],
                        radius=5,
                        color='red',
                        popup=('City: ' + str(city) + '<br>'),
                        fill_color='red',
                        fill_opacity=0.7).add_to(southkorea_map)
southkorea_map
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-41-fa96feb1b3e7> in <module>
----> 1 import folium
      2 southkorea_map = folium.Map(location=[36.55,126.983333 ], zoom_start=7,tiles='Stamen Toner')
      3 
      4 for lat, lon,city in zip(df_route['latitude'], df_route['longitude'],df_route['city']):
      5     folium.CircleMarker([lat, lon],

ModuleNotFoundError: No module named 'folium'
(AnnaM) founder@hilbert:~/annam/kaggle$ pip install folium
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1

 

Analysis of patients by city

df_route

plt.figure(figsize=(15,5))
plt.title('Number patients in city')
df_route.city.value_counts().plot.bar();

Patients by province

plt.figure(figsize=(15,5))
plt.title('Number patients in province')
df_route.province.value_counts().plot.bar();

Infection locations (places visited)

plt.figure(figsize=(15,5))
plt.title('Visit')
df_route.visit.value_counts().plot.bar();

 

Time from confirmation to release or death

plt.figure(figsize=(12, 8))
sns.boxplot(x="state",
            y="duration_days",
            order=["released", "deceased"],
            data=patient)
plt.title("Time from confirmation to release or death", fontsize=16)
plt.xlabel("State", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Time from confirmation to release or death, by sex

order_duration_sex = ["female", "male"]
plt.figure(figsize=(12, 8))
sns.boxplot(x="sex",
            y="duration_days",
            order=order_duration_sex,
            hue="state",            
            hue_order=["released", "deceased"],
            data=patient)
plt.title("Time from confirmation to release or death by gender",
          fontsize=16)
plt.xlabel("Gender", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Time from confirmation to release or death, by age group

order_duration_age = sorted(patient["age_range"].unique())[:-1]
plt.figure(figsize=(12, 8))
sns.boxplot(x="age_range",
            y="duration_days",
            order=order_duration_age,
            hue="state",
            hue_order=["released", "deceased"],
            data=patient)
plt.title("Time from confirmation to release or death", fontsize=16)
plt.xlabel("Age Range", fontsize=16)
plt.ylabel("Days", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Preparing the data for prediction

data = daily_count.resample('D').first().fillna(0).cumsum()
data = data[20:]
x = np.arange(len(data)).reshape(-1, 1)
y = data.values
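
The resample step matters because days without confirmed cases are absent from daily_count. A sketch on a made-up series (the dates and counts are assumptions) shows how the gap is filled before the running total is taken:

```python
import pandas as pd

# Made-up daily counts with no entry for 2020-01-21
daily = pd.Series([2, 1, 1],
                  index=pd.to_datetime(["2020-01-20", "2020-01-22", "2020-01-23"]))

# One row per calendar day; missing days become 0 new cases, then accumulate
cumulative = daily.resample("D").first().fillna(0).cumsum()
print(cumulative)
```

The missing day appears with an unchanged cumulative value, so the series has one point per day, which is what the regression on np.arange(len(data)) assumes.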

Regression model

from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=[32, 32, 10], max_iter=50000, alpha=0.0005, random_state=26)
_=model.fit(x, y)
test = np.arange(len(data)+7).reshape(-1, 1)
pred = model.predict(test)
prediction = pred.round().astype(int)
week = [data.index[0] + timedelta(days=i) for i in range(len(prediction))]
dt_idx = pd.DatetimeIndex(week)
predicted_count = pd.Series(prediction, dt_idx)
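
The date index for the predictions is built by adding one timedelta per step. As a sketch, pandas can produce the same index directly with date_range. The start date and length below are made-up stand-ins for data.index[0] and len(prediction).

```python
from datetime import timedelta

import pandas as pd

start = pd.Timestamp("2020-02-09")  # made-up first date of the series
n_points = 5                        # made-up number of predicted values

manual = pd.DatetimeIndex([start + timedelta(days=i) for i in range(n_points)])
compact = pd.date_range(start=start, periods=n_points, freq="D")
print(list(manual) == list(compact))
```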

Visualizing the current and predicted confirmed counts

accumulated_count.plot()
predicted_count.plot()
plt.title('Prediction of Accumulated Confirmed Count')
plt.legend(['current confirmed count', 'predicted confirmed count'])
plt.show()

Prophet

prophet= pd.DataFrame(data)
prophet
pr_data = prophet.reset_index()
pr_data.columns = ['ds','y']
pr_data.head()
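
Prophet expects a frame with exactly two columns, ds for the date and y for the value, which is why the index is reset and the columns renamed. The following sketch repeats that reshaping on a made-up series so the result can be inspected without fitting a model:

```python
import pandas as pd

# Made-up cumulative counts indexed by date
cumulative = pd.Series([2.0, 2.0, 3.0, 4.0],
                       index=pd.date_range("2020-01-20", periods=4, freq="D"),
                       name="patient_id")

pr_toy = cumulative.reset_index()  # the date index becomes an ordinary column
pr_toy.columns = ["ds", "y"]       # column names required by Prophet
print(pr_toy)
```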

Prediction

m=Prophet()
m.fit(pr_data)
future=m.make_future_dataframe(periods=365)
forecast=m.predict(future)
forecast

Visualizing the forecast

figure = plot_plotly(m, forecast)
py.iplot(figure) 

figure = m.plot(forecast,xlabel='Date',ylabel='Confirmed Count')

figure=m.plot_components(forecast)

Autoregressive integrated moving average (ARIMA)

# data already holds the cumulative count (cumsum was applied when it was built), so it is used as-is
confirm_cs = pd.DataFrame(data)
arima_data = confirm_cs.reset_index()
arima_data.columns = ['confirmed_date','count']
arima_data.head()

model = ARIMA(arima_data['count'].values, order=(1, 2, 1))
fit_model = model.fit(trend='c', full_output=True, disp=True)
fit_model.summary()

fit_model.plot_predict()
plt.title('Forecast vs Actual')
pd.DataFrame(fit_model.resid).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f93273553d0>

Forecasting future values

arima_forecast = fit_model.forecast(steps=5)
pred_y = arima_forecast[0].tolist()  # the first element holds the point forecasts
pd.DataFrame(pred_y)

LSTM

dataset = pd.DataFrame(data)
dataset.columns = ['Confirmed']
dataset.head()

data = np.array(dataset).reshape(-1, 1)
train_data = dataset[:len(dataset)-5]
test_data = dataset[len(dataset)-5:]
scaler = MinMaxScaler()
scaler.fit(train_data)
scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)
n_input =5
n_features =1
                             
generator = TimeseriesGenerator(scaled_train_data,scaled_train_data, length=n_input, batch_size=1)
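
With length=n_input, the generator pairs each window of n_input consecutive values with the value that follows it. The sketch below reproduces that pairing in plain NumPy to show the shapes involved. It is an illustration under that assumption, not the Keras implementation, and make_windows is a hypothetical helper.

```python
import numpy as np

def make_windows(series, length):
    """Return (inputs, targets): each input holds `length` consecutive values,
    and its target is the value that immediately follows the window."""
    n_samples = len(series) - length
    inputs = np.array([series[i:i + length] for i in range(n_samples)])
    targets = np.array([series[i + length] for i in range(n_samples)])
    return inputs, targets

series = np.arange(8, dtype=float)  # made-up series: 0, 1, ..., 7
X, y = make_windows(series, length=5)
print(X.shape, y.shape)
print(X[0], y[0])
```

A series of N points therefore yields N - n_input training samples, which is consistent with the 22 steps per epoch in the training log for batch_size=1.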

lstm_model = Sequential()
lstm_model.add(LSTM(units = 50, return_sequences = True, input_shape = (n_input, n_features)))
lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(units = 50, return_sequences = True))
lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(units = 50))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(units = 1))
lstm_model.compile(optimizer = 'adam', loss = 'mean_squared_error')
lstm_model.fit(generator, epochs = 30)

Epoch 1/30
22/22 [==============================] - 1s 59ms/step - loss: 0.1572
Epoch 2/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0620
Epoch 3/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0414
Epoch 4/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0301
Epoch 5/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0221
Epoch 6/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0139
Epoch 7/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0203
Epoch 8/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0223
Epoch 9/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0245
Epoch 10/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0294
Epoch 11/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0163
Epoch 12/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0235
Epoch 13/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0146
Epoch 14/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0082
Epoch 15/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0104
Epoch 16/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0097
Epoch 17/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0120
Epoch 18/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0077
Epoch 19/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0160
Epoch 20/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0084
Epoch 21/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0066
Epoch 22/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0064
Epoch 23/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0099
Epoch 24/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0098
Epoch 25/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0098
Epoch 26/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0059
Epoch 27/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0033
Epoch 28/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0040
Epoch 29/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0033
Epoch 30/30
22/22 [==============================] - 0s 7ms/step - loss: 0.0044

<keras.callbacks.callbacks.History at 0x7f932728a110>

losses_lstm = lstm_model.history.history['loss']
plt.figure(figsize = (30,4))
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.xticks(np.arange(0, len(losses_lstm), 1))  # one tick per trained epoch
plt.plot(range(len(losses_lstm)), losses_lstm)

[<matplotlib.lines.Line2D at 0x7f92f2c69110>]

lstm_predictions_scaled = []

batch = scaled_train_data[-n_input:]
current_batch = batch.reshape((1, n_input, n_features))

for i in range(len(test_data)):   
    lstm_pred = lstm_model.predict(current_batch)[0]
    lstm_predictions_scaled.append(lstm_pred) 
    current_batch = np.append(current_batch[:,1:,:],[[lstm_pred]],axis=1)
prediction = pd.DataFrame(scaler.inverse_transform(lstm_predictions_scaled))
prediction.head()
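
The prediction loop above is a rolling forecast: each predicted value is appended to the input window and the oldest value is dropped, so later steps are based on earlier predictions. The sketch below isolates that mechanism. The function names are hypothetical, and last_step is only a stand-in for lstm_model.predict so the example runs without a trained model.

```python
import numpy as np

def rolling_forecast(history, steps, predict_next):
    """Forecast `steps` values, feeding each prediction back into the window."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(np.array(window))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest value, append the new prediction
    return forecasts

def last_step(window):
    """Stand-in predictor: repeat the most recent increase."""
    return window[-1] + (window[-1] - window[-2])

print(rolling_forecast([1.0, 2.0, 3.0, 4.0, 5.0], steps=3, predict_next=last_step))
```

Because every step after the first depends on earlier predictions, errors accumulate as the horizon grows, which is worth keeping in mind when reading the five-day LSTM forecast.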

Reviewing the time series data

Before going on, note that the downloaded dataset (time.csv) only contains the columns test, confirmed, released, and deceased. To run the examples below, the columns acc_test, new_test, acc_confirmed, new_confirmed, acc_released, new_released, acc_deceased, and new_deceased were computed and added.
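
One way to derive those columns is sketched below. It assumes that the raw test, confirmed, released, and deceased columns hold cumulative totals, which should be checked against the downloaded file. The toy rows are made up, and only two of the four columns are shown.

```python
import pandas as pd

# Made-up rows; assumes the raw columns are cumulative totals
toy_time = pd.DataFrame({
    "date": ["2020-02-01", "2020-02-02", "2020-02-03"],
    "test": [100, 150, 230],
    "confirmed": [10, 12, 15],
})

for col in ["test", "confirmed"]:
    toy_time[f"acc_{col}"] = toy_time[col]
    # Daily value = difference from the previous day; the first row keeps its own total
    toy_time[f"new_{col}"] = toy_time[col].diff().fillna(toy_time[col]).astype(int)

print(toy_time)
```

If the raw columns turn out to be daily counts instead, the roles reverse: new_ is the raw column and acc_ is its cumsum().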

df_time = pd.read_csv("input/coronavirusdataset/time.csv")
df_time.shape

(53, 32)

confirm_perc=(df_time['acc_confirmed'].sum()+ df_time['new_confirmed'].sum())/(df_time['acc_test'].sum() + df_time['new_test'].sum())*100
released_perc=(df_time['acc_released'].sum()+ df_time['new_released'].sum())/(df_time['acc_test'].sum() + df_time['new_test'].sum())*100
deceased_perc=(df_time['acc_deceased'].sum()+ df_time['new_deceased'].sum())/(df_time['acc_test'].sum() + df_time['new_test'].sum())*100
print("The percentage of confirm  is "+ str(confirm_perc) )
print("The percentage of released is "+ str(released_perc) )
print("The percentage of deceased is "+ str(deceased_perc) )
The percentage of confirm  is 3.4125021882074793
The percentage of released is 0.07596276211775212
The percentage of deceased is 0.023131074119750517
plt.figure(figsize=(100,30))
plt.bar(df_time.date, df_time.acc_test,label="Test")
plt.bar(df_time.date, df_time.acc_confirmed, label = "Confirmed")
plt.xlabel('Date')
plt.ylabel("Count")
plt.title('Test vs Confirmed',fontsize=100)
plt.legend(frameon=True, fontsize=12)
plt.show()

 

f, ax = plt.subplots(figsize=(100, 30))
ax=sns.scatterplot(x="date", y="acc_test", data=df_time,
             color="blue")
ax=sns.scatterplot(x="date", y="acc_confirmed", data=df_time,
             color="orange")


plt.plot(df_time.date,df_time.acc_test,zorder=1)
plt.plot(df_time.date,df_time.acc_confirmed,zorder=1,color="orange")
plt.title('Test vs Confirmed',fontsize=100)

Text(0.5, 1.0, 'Test vs Confirmed')

Visualizing new tests and new confirmed cases

plt.figure(figsize=(100,30))
plt.bar(df_time.date, df_time.new_test,label="Test")
plt.bar(df_time.date, df_time.new_confirmed, label = "Confirmed")
plt.xlabel('Date')
plt.ylabel("Count")
plt.title('New Test vs New Confirmed',fontsize=100)

plt.legend(frameon=True, fontsize=12)
plt.show()
f, ax = plt.subplots(figsize=(100, 30))
ax=sns.scatterplot(x="date", y="new_test", data=df_time,
             color="blue")
ax=sns.scatterplot(x="date", y="new_confirmed", data=df_time,
             color="orange")


plt.plot(df_time.date,df_time.new_test,zorder=1)
plt.plot(df_time.date,df_time.new_confirmed,zorder=1,color="orange")
plt.title('Test vs Confirmed',fontsize=100)

Text(0.5, 1.0, 'Test vs Confirmed')

Visualizing cumulative confirmed, deceased, and released counts

plt.figure(figsize=(100,30))
plt.bar(df_time.date, df_time.acc_confirmed, label = "Confirmed")
plt.bar(df_time.date, df_time.acc_released,label="released")
plt.bar(df_time.date, df_time.acc_deceased,label="deceased")
plt.xlabel('Date')
plt.ylabel("Count")
plt.legend(frameon=True, fontsize=12)
plt.show()

 

f, ax = plt.subplots(figsize=(100, 30))
# Cumulative (acc_) columns, matching the bar chart in this section
ax=sns.scatterplot(x="date", y="acc_confirmed", data=df_time,
             color="red",label = "confirmed")
ax=sns.scatterplot(x="date", y="acc_released", data=df_time,
             color="blue",label = "released")
ax=sns.scatterplot(x="date", y="acc_deceased", data=df_time,
             color="orange",label = "deceased")
plt.plot(df_time.date,df_time.acc_released,zorder=1,color="blue")
plt.plot(df_time.date,df_time.acc_deceased,zorder=1,color="orange")
plt.plot(df_time.date,df_time.acc_confirmed,zorder=1,color="red")

[<matplotlib.lines.Line2D at 0x7f92f109c5d0>]

Visualizing new confirmed, deceased, and released counts

plt.figure(figsize=(100,30))
plt.bar(df_time.date, df_time.new_confirmed, label = "Confirmed")
plt.bar(df_time.date, df_time.new_released,label="released")
plt.bar(df_time.date, df_time.new_deceased,label="deceased")
plt.xlabel('Date')
plt.ylabel("Count")
plt.legend(frameon=True, fontsize=12)
plt.show()

f, ax = plt.subplots(figsize=(100, 30))
ax=sns.scatterplot(x="date", y="new_confirmed", data=df_time,
             color="red",label = "confirmed")
ax=sns.scatterplot(x="date", y="new_released", data=df_time,
             color="blue",label = "released")
ax=sns.scatterplot(x="date", y="new_deceased", data=df_time,
             color="orange",label = "deceased")
plt.plot(df_time.date,df_time.new_released,zorder=1,color="blue")
plt.plot(df_time.date,df_time.new_deceased,zorder=1,color="orange")
plt.plot(df_time.date,df_time.new_confirmed,zorder=1,color="red")

[<matplotlib.lines.Line2D at 0x7f92efcd2cd0>]

Trend data from Naver keyword searches

trend = pd.read_csv("input/coronavirusdataset/trend.csv")
trend.head()

trend_last30 = trend.tail(30)
tl3 = trend_last30
f, ax = plt.subplots(figsize=(50, 20))
sns.set_style("dark")
ax=sns.scatterplot(x="date", y="coronavirus", data=tl3,
             color="black",label = "coronavirus")
ax=sns.scatterplot(x="date", y="flu", data=tl3,
             color="red",label = "flu")
ax=sns.scatterplot(x="date", y="cold", data=tl3,
             color="blue",label = "cold")
ax=sns.scatterplot(x="date", y="pneumonia", data=tl3,
             color="orange",label = "pneumonia")
plt.plot(tl3.date,tl3.coronavirus,zorder=1,color="black")
plt.plot(tl3.date,tl3.cold,zorder=1,color="blue")
plt.plot(tl3.date,tl3.pneumonia,zorder=1,color="orange")
plt.plot(tl3.date,tl3.flu,zorder=1,color="red")

[<matplotlib.lines.Line2D at 0x7f92f0e8a710>]

 

Original source: https://www.kaggle.com/vanshjatana/analysis-on-coronavirus