🏆🏆 3. 미국 연도별 출생 빈도를 알아보자 (2)

🏆 앞으로 진행할 데이터 분석 과정을 위해 이전 포스팅에서의 전처리된 데이터를 가져오자.

import pandas as pd
birth = pd.read_csv("Data/births.csv", encoding = 'utf-8-sig')

birth_til_1988 = birth[:15067]
birth_til_1988_F = birth_til_1988[birth_til_1988['gender'] == 'F'].copy()
birth_til_1988_F = birth_til_1988_F.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_F = birth_til_1988_F.reset_index()

birth_til_1988_M = birth_til_1988[birth_til_1988['gender'] == 'M'].copy()
birth_til_1988_M = birth_til_1988_M.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_M = birth_til_1988_M.reset_index()

birth_after_1989 = birth[15067:]
birth_after_1989_F = birth_after_1989[birth_after_1989['gender']=="F"].copy()
birth_after_1989_F = birth_after_1989_F[['year','month','births']].copy()

birth_after_1989_M = birth_after_1989[birth_after_1989['gender']=="M"].copy()
birth_after_1989_M = birth_after_1989_M[['year','month','births']].copy()

birth_F = pd.concat([birth_til_1988_F, birth_after_1989_F])
birth_M = pd.concat([birth_til_1988_M, birth_after_1989_M])

1. 연도별 남성 / 여성 출생 수 비교

🏆 위에서 만든 birth_F / birth_M 데이터프레임을 각 연도별로 groupby 해서 출생 수를 구한다.

birth_F_year = birth_F.groupby('year').aggregate({'births' : 'sum'})
birth_M_year = birth_M.groupby('year').aggregate({'births' : 'sum'})
birth_year = pd.merge(birth_F_year, birth_M_year, on = 'year', how = 'inner')
birth_year.columns = ['births_F', 'births_M']
birth_year.head()

>>
        births_F	births_M
year		
1969	1753634	    1846572
1970	1819164	    1918636
1971	1736774	    1826774
1972	1592347	    1673888
1973	1533102	    1613023

plotly로 시각화해주자.

import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_F'], name='Female',
        text = birth_year['births_F'], texttemplate = '%{text:,}'
    ))

fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_M'], name = 'Male',
        text = birth_year['births_M'], texttemplate = '%{text:,}', textfont= dict(color = 'white')
    ))

fig.update_layout(
    {
        'title' : {'text' : '<b>Female/Male births per year</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y': 0.9},
        'xaxis' : {'showticklabels' : True, 'dtick' : 1, 'title' : {'text' : 'year', 'font' : {'size' : 20}}},
        'yaxis' : {'showticklabels' : True, 'tick0' : '1M', 'title' : {'text' : 'births', 'font' : {'size' : 20}}},
        
        'showlegend' : True

    })
fig.show()

👉 전체적인 추세를 보았을 때 남성의 출생 추이가 여성보다 계속 높았던 것을 확인할 수 있다.

👉 또한 1970년대 초반을 제외하고는 점진적으로 증가하는 모습을 보여주며, 이는 아마도 1960년대 말과 1970년대의 오일쇼크와 베트남전쟁에 관련되어 있지 않을까 생각한다. 물론 다른 몇가지 이유가 있을 수 있겠지만, 출생 데이터만을 가지고 정확히 파악하기는 어려울 것 같다.

2. 요일별 출생 추이 확인하기

🏆 이번에는 요일별로 전처리를 해줄 것이기 때문에 데이터프레임의 year, month, day 열을 사용해야한다. 이 세개의 열을 to_datetime 함수를 사용해서 요일을 구해줄 것이다.

birth_date = pd.pivot_table(birth_til_1988, index = ['year','month', 'day'], values = ['births'], aggfunc={'births' : 'sum'})
birth_date.head()

>>
                        births
year	month	day	
1969	1	    1.0	    8486
                2.0	    9002
                3.0 	9542
                4.0	    8960
                5.0	    8390

birth_date = birth_date.reset_index()
birth_date[['day']] = birth_date[['day']].astype('int64')
birth_date.head()

>>
year	month	day	births
1969	1	1	8486
1969	1	2	9002
1969	1	3	9542
1969	1	4	8960
1969	1	5	8390

🏆 시계열 처리는 어지간하면 to_datetime 함수로 다 해결이 되지만, 가끔씩 윤년이 발생하는 경우가 있다. 데이터의 양이 적은 경우에는 윤년이 발생하는 경우를 제외시켜주는 함수를 하나 작성해도 되지만 이 경우에는 연도가 적지 않은 데이터이기 때문에 윤년을 따로 처리해주는 코드를 하나 작성해주겠다(아직 정확한 작동원리는 잘 모르겠다… 좀더 공부해서 따로 블로그에서 다뤄야겠다🤢).

import numpy as np
quartiles = np.percentile(birth_date['births'], [25,50,75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
birth_date = birth_date.query('(births > @mu - 5*@sig) & (births < @mu + 5*@sig)')

birth_date = birth_date.astype('str')
birth_date['births'] = birth_date['births'].astype('int64')
birth_date['date'] = pd.to_datetime(birth_date['year'] + '-' + birth_date['month'] + '-' + birth_date['day'], errors = 'raise')
birth_date.head()

>>
    year	month	day	births	date
1969	1	    1	8486	1969-01-01
1969	1	    2	9002	1969-01-02
1969	1	    3	9542	1969-01-03
1969	1	    4	8960	1969-01-04
1969	1	    5	8390	1969-01-05

요일을 구하기 위해서 . dt . weekday 메서드를 사용하자.

birth_date['weekday'] = birth_date['date'].dt.weekday
birth_date.head()

>>
    year	month	day	births	date	    weekday
1969	1	    1	8486	1969-01-01	2
1969	1	    2	9002	1969-01-02	3
1969	1	    3	9542	1969-01-03	4
1969	1	    4	8960	1969-01-04	5
1969	1	    5	8390	1969-01-05	6

weekday 열이 새로 생긴것을 확인할수 있으나, 숫자로 구성되어있다. 이는 위에서 사용한 메서드의 결과값으로 0부터 일요일을 의미한다. 따라서 이 값을 우리에게 익숙한 영어로 바꿔줄 필요가 있다. 🔑 이 과정은 따로 함수를 만들어서 apply 함수로 데이터프레임에 적용시키는 순서로 진행한다.

def weekday_func(row):
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri' 
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
    
    return row

birth_date = birth_date.apply(weekday_func, axis = 1)
birth_date.head()

>>
	year	month	day	births	date	    weekday
1969	1	    1	8486	1969-01-01	Wed
1969	1	    2	9002	1969-01-02	Thu
1969	1	    3	9542	1969-01-03	Fri
1969	1	    4	8960	1969-01-04	Sat
1969	1	    5	8390	1969-01-05	Sun

위 데이터는 1969년부터 1988년까지의 정보를 담고 있다. 우리는 이 년도들을 10년 단위로 끊어서 요일별로 정리할 것이다. 이를 위해 또 하나의 함수를 만들고 apply 해주자.

def decade_func(row):
    if row['year'] // 10 == 196:
        row['decade'] = 1960
    elif row['year'] // 10 == 197:
        row['decade'] = 1970
    elif row['year'] // 10 == 198:
        row['decade'] = 1980
        
    return row

birth_decade_weekday = birth_date.apply(decade_func, axis = 1).copy()

birth_decade_weekday = birth_decade_weekday[['decade', 'weekday', 'births']].copy()

birth_decade_weekday = birth_decade_weekday.groupby(['decade','weekday']).sum().copy()
birth_decade_weekday = birth_decade_weekday.reset_index()
birth_decade_weekday['decade'] = pd.Categorical(birth_decade_weekday['decade'], categories=[1960,1970,1980], ordered = True)
birth_decade_weekday['weekday'] = pd.Categorical(birth_decade_weekday['weekday'], categories=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], ordered = True)

birth_decade_weekday = birth_decade_weekday.sort_values(['decade','weekday'])
birth_decade_weekday = birth_decade_weekday.set_index('decade')

birth_decade_weekday.head()

>>
        weekday	births
decade		
1960	Mon	    526638
1960	Tue	    549754
1960	Wed	    537910
1960	Thu	    517742
1960	Fri	    531220

우리가 원힌 대로 10년 주기, 요일별로 그룹화가 지어진 것을 확인할 수 있다. 이 과정에서 pd . Categorical ( ) 함수를 사용해서 해당 열의 순서를 지정했다. 이 함수의 기능은 다른 포스트에서 더 자세히 다뤄보아겠다(점점 다룰 게 많아진다…😅).

마지막으로 각 10년 주기로 데이터를 나눠주면 우리가 원하는 최종 데이터가 만들어진다!!

birth_1960 = birth_decade_weekday[birth_decade_weekday.index == 1960]
birth_1970 = birth_decade_weekday[birth_decade_weekday.index == 1970]
birth_1980 = birth_decade_weekday[birth_decade_weekday.index == 1980]

>>
        weekday	births
decade		
Mon	    4949739
Tue 	5173612
Wed	    5045584
Thu	    5013476
Fri	    5069502
Sat	    4214867
Sun	    4041017

import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = birth_1970['weekday'], y = birth_1970['births'], name = '1970'))

fig.add_trace(
    go.Scatter(
        x = birth_1980['weekday'], y = birth_1980['births'], name = '1980'))

fig.update_layout(
{
    'title' : {'text' : '<b>Births by Weekday in 1970 / 1980</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y' : 0.92},
    'showlegend' : True,
    'xaxis' : {'showticklabels' : True, 'title' : {'text' : 'Weekday', 'font' : {'size' : 15}}},
    'yaxis' : {'showticklabels' : True, 'title' : {'text' : 'Births', 'font' : {'size' : 15}}}
})
fig.show()

👉 1970년대 초반의 높지 않은 출생율의 영향으로 전체적으로 봤을 때는 1980년대가 1970년대보다 높은것을 확인할 수 있다. 하지만 주말에는 1970년대가 더 높은 추이를 보인다.

👉 왜 1960년도 데이터도 만들고 안쓰냐는 질문이 생길수도 있다. 하지만 이 데이터가 수집되기 시작한 년도가 1969년도부터이기 때문에 1960년대는 그냥 1969년의 데이터이다. 이걸 다른 년도와 다루는 것이 오히려 그 분석의 의미를 흐리기 때문에 1970년도와 1980년도만을 비교하자.

🏆 이번 데이터는 포스팅 두번이면 끝날 줄 알았는데 뭔가 부가적으로 내가 알고 싶은 것도 많고 기록하고 싶은 것도 많아서인지 주저리주저리 말이 길어졌다… 블로그를 운영하면 할수록 따로 다뤄야할 것도 많아지고 모르는 것도 계속 생기는 기분이다. 그래도 이렇게 정리해 두면 나중에 한번씩 찾아볼 수 있을 테니 더 재밌게 해봐야겠드아!!

🏆 다음 포스팅에서 계속~~

💡파이썬 데이터 사이언스 핸드북(위키북스,2020)의 저자 Jake VanderPlas 분의 깃허브에서 데이터를 가져왔음을 밝힙니다.

Share on

Twitter Facebook LinkedIn

Programin9

🏆🏆 3. 미국 연도별 출생 빈도를 알아보자 (2)

1. 연도별 남성 / 여성 출생 수 비교

2. 요일별 출생 추이 확인하기

Share on

Leave a comment

You may also enjoy

2023.01.14
🆘 블로그 이전 안내

2023.01.14
🆘 블로그 이전 안내

2023.01.12
🫀심혈관질환 분석 09-패턴분석4

2023.01.11
🫀심혈관질환 분석 08-패턴분석3

Programin9

1. 연도별 남성 / 여성 출생 수 비교

2. 요일별 출생 추이 확인하기

Share on

Leave a comment

You may also enjoy

2023.01.14 🆘 블로그 이전 안내

2023.01.14 🆘 블로그 이전 안내

2023.01.12 🫀심혈관질환 분석 09-패턴분석4

2023.01.11 🫀심혈관질환 분석 08-패턴분석3

2023.01.14
🆘 블로그 이전 안내

2023.01.14
🆘 블로그 이전 안내

2023.01.12
🫀심혈관질환 분석 09-패턴분석4

2023.01.11
🫀심혈관질환 분석 08-패턴분석3