🏆🏆 3. 미국 연도별 출생 빈도를 알아보자 (3)

🏆 앞으로 진행할 데이터 분석 과정을 위해 이전 포스팅에서의 전처리된 데이터를 가져오자.

import pandas as pd
birth = pd.read_csv("Data/births.csv", encoding = 'utf-8-sig')

birth_til_1988 = birth[:15067]

birth_date = pd.pivot_table(birth_til_1988, index = ['year','month', 'day'], values = ['births'], aggfunc={'births' : 'sum'})
birth_date = birth_date.reset_index()
birth_date[['day']] = birth_date[['day']].astype('int64')

import numpy as np
quartiles = np.percentile(birth_date['births'], [25,50,75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
birth_date = birth_date.query('(births > @mu - 5*@sig) & (births < @mu + 5*@sig)')

birth_date = birth_date.astype('str')
birth_date['births'] = birth_date['births'].astype('int64')
birth_date['date'] = pd.to_datetime(birth_date['year'] + '-' + birth_date['month'] + '-' + birth_date['day'], errors = 'raise')
birth_date['weekday'] = birth_date['date'].dt.weekday

def weekday_func(row):
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri' 
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
    
    return row

birth_date = birth_date.apply(weekday_func, axis = 1)

birth_date.head()

>>
    year	month	day	births	date	    weekday
1969	1	    1	8486	1969-01-01	Wed
1969	1	    2	9002	1969-01-02	Thu
1969	1	    3	9542	1969-01-03	Fri
1969	1	    4	8960	1969-01-04	Sat
1969	1	    5	8390	1969-01-05	Sun

1. 1969년 일별로 출생 추이 파악하기

🏆 앞서 사용하지 않은 1969년도의 데이터를 바탕으로 일별 출생 추이를 파악해보자.

birth_1969 = birth_date[birth_date['year'] == 1969]
birth_1969

>>
    year	month	day	births	date	    weekday
1969	1	    1	8486	1969-01-01	Wed
1969	1	    2	9002	1969-01-02	Thu
1969	1	    3	9542	1969-01-03	Fri
1969	1	    4	8960	1969-01-04	Sat
1969	1	    5	8390	1969-01-05	Sun
...	...	    ...	    ...	...	    ...	        ...
1969	12	    27	9304	1969-12-27	Sat
1969	12	    28	9004	1969-12-28	Sun
1969	12	    29	10980	1969-12-29	Mon
1969	12	    30	12232	1969-12-30	Tue
1969	12	    31	11122	1969-12-31	Wed

rows × 6 columns

기가 막히게 딱 365개의 행을 가진다. 이런걸 보면 기분이 참 좋다😉.

이 데이터프레임도 윤년을 처리해주고 datetime 함수를 적용하자.

import numpy as np
quartiles = np.percentile(birth_1969['births'], [25,50,75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
birth_1969 = birth_1969.query('(births > @mu - 5*@sig) & (births < @mu + 5*@sig)')

birth_1969['day'] = birth_1969['day'].astype('int64')
birth_1969 = birth_1969.astype('str')
birth_1969['births'] = birth_1969['births'].astype('int64')

birth_1969['date'] = pd.to_datetime(birth_1969['year'] + '-' + birth_1969['month'] + '-' + birth_1969['day'], errors = 'raise')
birth_1969.head()

>>
	year	month	day	births	date	    weekday
1969	1	    1	8486	1969-01-01	Wed
1969	1	    2	9002	1969-01-02	Thu
1969	1	    3	9542	1969-01-03	Fri
1969	1	    4	8960	1969-01-04	Sat
1969	1	    5	8390	1969-01-05	Sun

마지막으로 plotly 시각화해주자.

import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = birth_1969['date'], y = birth_1969['births'], marker_color = '#025918'))

fig.update_layout(
{
    'title' : {'text':'<b>Births in 1969</b>', 'font':{'size':25}, 'x':0.5, 'y':0.92},
    'xaxis' : {'showticklabels':True, 'dtick' : 'M1', 'tickfont' : {'size':15}, 'title' : {'text':'Date', 'font':{'size':20}}},
    'yaxis' : {'showticklabels':True, 'tickfont' : {'size':15}, 'title' : {'text':'Births', 'font':{'size':20}}},
    'template':'presentation'
})

fig.add_annotation(
    x = '1969-12-30', y = 12250,
    
    arrowcolor='#025918',
    arrowhead=2,
    arrowsize=1,
    arrowwidth=2,
    
    text = 'MAX : 1969-12-30,12232',
    font = dict(size = 13, color = 'white'),
    
    bordercolor='#025918',
    borderpad=4,
    borderwidth=2,
    
    ax = 20, ay = -40,
    align = 'center',
    bgcolor = '#025918',
    opacity = 0.8
    )

fig.add_annotation(
    x = '1969-4-20', y = 7930,
    
    arrowcolor='#025918',
    arrowhead=2,
    arrowsize=1,
    arrowwidth=2,
    
    text = 'MIN : 1969-4-20, 7928',
    font = dict(size = 13, color = 'white'),
    
    bordercolor='#025918',
    borderpad=4,
    borderwidth=2,
    
    ax = 20, ay = 50,
    align = 'center',
    bgcolor = '#025918',
    opacity = 0.8
    )

fig.show()

4. Heatmap - 분기별 출생 빈도 (1969 ~ 2008)

🏆 마지막으로 전체적인 출생 빈도를 알아보기 위해 히트맵을 그려보자.

birth_quarter = birth.groupby(['year','month']).agg({'births':'sum'}) 
birth_quarter = birth_quarter.reset_index()
birth_quarter = birth_quarter.astype('str')
birth_quarter['Date'] = pd.to_datetime(birth_quarter['year'] + '-' + birth_quarter['month'], errors = 'raise')
birth_quarter.head()

>>
year	month	births	Date
1969	1	293940	1969-01-01
1969	2	270786	1969-02-01
1969	3	296550	1969-03-01
1969	4	282638	1969-04-01
1969	5	289124	1969-05-01

해당 월의 분기를 구하기 위해서 dt . quarter 메서드를 사용하자.

birth_quarter['Quarter'] = birth_quarter['Date'].dt.quarter
birth_quarter['births'] = birth_quarter['births'].astype('int64')
birth_quarter = birth_quarter.groupby(['year', 'Quarter']).agg({'births':'sum'})
birth_quarter_plotly = birth_quarter.reset_index()
birth_quarter_plotly.head()

>>
year	Quarter	births
1969	1	861276
1969	2	863372
1969	3	952010
1969	4	923548
1970	1	891214

깔끔하게 정리되었다!! 시각화하자🙌.

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x = birth_quarter_plotly['year'], y = birth_quarter_plotly['Quarter'], z = birth_quarter_plotly['births'],
        colorscale = 'blues'))

fig.update_layout(
    {
        'title':{'text':'<b>분기별 출생 정도</b>', 'font':{'size':25}, 'x':0.5, 'y':0.92},
        'xaxis':{'showticklabels':True, 'title':{'text':'년도', 'font':{'size':20}}},
        'yaxis':{'showticklabels':True, 'dtick':1, 'title':{'text':'분기', 'font':{'size':20}}}
    })

fig.show()

👉 전체적인 년도의 분기를 보면 3분기의 출산 추이가 가장 높은 것을 확인할 수 있다. 또한 시간이 지날수록 전체적인 출산율이 높아지고 있음을 확인하자.

🏆 추가로 이렇게 복잡한 과정을 거치지 않고 pandas의 stlye 기능을 사용해서 히트맵을 그릴수도 있다. 간단하고 직관적으로 분석할 때 좋은 방법이므로 알아두면 좋을 것 같다😉!!

birth_quarter_df = birth_quarter_plotly[128:160].set_index(['year','Quarter']).unstack()
birth_quarter_df.style.background_gradient(cmap = 'Reds')

👉 최근 8년간의 추세를 보아도 계속해서 출산율이 높아지고 있음을 확인할 수 있다.

🏆 요렇게 해서 CDC 데이터 분석을 한차례 끝냈다. 디스크 공백기 때문에 예전에 사용했던 함수들도 한번씩 찾아보면서 사용법을 다시 익히는 중이라 자연스럽게 손이 조금 굼떠졌다🤦‍♂️. 기초부터 다시 탄탄히 다진다는 생각으로 열심히 달려봐야겠다🏃‍♂️🏃‍♂️🏃‍♂️!!

🏆 내일이면 개강이다. 5전공 3프로젝트의 2022년 1학기가 될 것 같다. 뭐 하면 되겠지라는 마음으로 그냥 열심히 살아봐야겠다. 아자자잣!!

💡파이썬 데이터 사이언스 핸드북(위키북스,2020)의 저자 Jake VanderPlas 분의 깃허브에서 데이터를 가져왔음을 밝힙니다.

Share on

Twitter Facebook LinkedIn

Programin9

🏆🏆 3. 미국 연도별 출생 빈도를 알아보자 (3)

1. 1969년 일별로 출생 추이 파악하기

4. Heatmap - 분기별 출생 빈도 (1969 ~ 2008)

Share on

Leave a comment

You may also enjoy

2023.01.14
🆘 블로그 이전 안내

2023.01.14
🆘 블로그 이전 안내

2023.01.12
🫀심혈관질환 분석 09-패턴분석4

2023.01.11
🫀심혈관질환 분석 08-패턴분석3

Programin9

1. 1969년 일별로 출생 추이 파악하기

4. Heatmap - 분기별 출생 빈도 (1969 ~ 2008)

Share on

Leave a comment

You may also enjoy

2023.01.14 🆘 블로그 이전 안내

2023.01.14 🆘 블로그 이전 안내

2023.01.12 🫀심혈관질환 분석 09-패턴분석4

2023.01.11 🫀심혈관질환 분석 08-패턴분석3

2023.01.14
🆘 블로그 이전 안내

2023.01.14
🆘 블로그 이전 안내

2023.01.12
🫀심혈관질환 분석 09-패턴분석4

2023.01.11
🫀심혈관질환 분석 08-패턴분석3