🏆 For the analysis that follows, let's load the data that was preprocessed in the previous post.

import pandas as pd
birth = pd.read_csv("Data/births.csv", encoding = 'utf-8-sig')

birth_til_1988 = birth[:15067]
birth_til_1988_F = birth_til_1988[birth_til_1988['gender'] == 'F'].copy()
birth_til_1988_F = birth_til_1988_F.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_F = birth_til_1988_F.reset_index()

birth_til_1988_M = birth_til_1988[birth_til_1988['gender'] == 'M'].copy()
birth_til_1988_M = birth_til_1988_M.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_M = birth_til_1988_M.reset_index()

birth_after_1989 = birth[15067:]
birth_after_1989_F = birth_after_1989[birth_after_1989['gender']=="F"].copy()
birth_after_1989_F = birth_after_1989_F[['year','month','births']].copy()

birth_after_1989_M = birth_after_1989[birth_after_1989['gender']=="M"].copy()
birth_after_1989_M = birth_after_1989_M[['year','month','births']].copy()

birth_F = pd.concat([birth_til_1988_F, birth_after_1989_F])
birth_M = pd.concat([birth_til_1988_M, birth_after_1989_M])

1. 연도별 남성 / μ—¬μ„± μΆœμƒ 수 비ꡐ

πŸ† μœ„μ—μ„œ λ§Œλ“  birth_F / birth_M λ°μ΄ν„°ν”„λ ˆμž„μ„ 각 μ—°λ„λ³„λ‘œ groupby ν•΄μ„œ μΆœμƒ 수λ₯Ό κ΅¬ν•œλ‹€.

birth_F_year = birth_F.groupby('year').aggregate({'births' : 'sum'})
birth_M_year = birth_M.groupby('year').aggregate({'births' : 'sum'})
birth_year = pd.merge(birth_F_year, birth_M_year, on = 'year', how = 'inner')
birth_year.columns = ['births_F', 'births_M']
birth_year.head()
>>
        births_F	births_M
year		
1969	1753634	    1846572
1970	1819164	    1918636
1971	1736774	    1826774
1972	1592347	    1673888
1973	1533102	    1613023
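
As a side note, the same kind of births_F / births_M table can be built in one step with pivot_table instead of two groupby calls and a merge. This is only a minimal sketch on made-up toy data; the `toy` and `toy_year` names and the numbers are assumptions for illustration, not values from births.csv.

```python
import pandas as pd

# Hypothetical toy data in the same long format as births.csv
toy = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1969, 1970, 1970],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "births": [10, 12, 20, 22, 30, 33],
})

# One pivot: rows are years, columns are genders, cells are summed births
toy_year = toy.pivot_table(index="year", columns="gender",
                           values="births", aggfunc="sum")

# Rename F / M to births_F / births_M to match the table above
toy_year.columns = ["births_" + g for g in toy_year.columns]
print(toy_year)
```

The merge version in the post gives the same shape of result; the pivot is just shorter when both genders live in a single dataframe.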

Let's visualize this with plotly.

import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_F'], name='Female',
        text = birth_year['births_F'], texttemplate = '%{text:,}'
    ))

fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_M'], name = 'Male',
        text = birth_year['births_M'], texttemplate = '%{text:,}', textfont= dict(color = 'white')
    ))

fig.update_layout(
    {
        'title' : {'text' : '<b>Female/Male births per year</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y': 0.9},
        'xaxis' : {'showticklabels' : True, 'dtick' : 1, 'title' : {'text' : 'year', 'font' : {'size' : 20}}},
        'yaxis' : {'showticklabels' : True, 'tick0' : '1M', 'title' : {'text' : 'births', 'font' : {'size' : 20}}},
        
        'showlegend' : True

    })
fig.show()

👉 Looking at the overall trend, male births stayed consistently higher than female births.

👉 Apart from the early 1970s, births also rise gradually. My guess is that the early-1970s exception is related to the oil shock and the Vietnam War of the late 1960s and the 1970s. There may well be other reasons, but it is hard to pin them down from birth data alone.

2. Checking births by day of the week

🏆 This time the preprocessing is by day of the week, so we need the dataframe's year, month, and day columns. We will combine these three columns with the to_datetime function and derive the weekday from the result.

birth_date = pd.pivot_table(birth_til_1988, index = ['year','month', 'day'], values = ['births'], aggfunc={'births' : 'sum'})
birth_date.head()
>>
                        births
year	month	day	
1969	1	    1.0	    8486
                2.0	    9002
                3.0 	9542
                4.0	    8960
                5.0	    8390
birth_date = birth_date.reset_index()
birth_date[['day']] = birth_date[['day']].astype('int64')
birth_date.head()
>>
year	month	day	births
0	1969	1	1	8486
1	1969	1	2	9002
2	1969	1	3	9542
3	1969	1	4	8960
4	1969	1	5	8390
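
For reference, pd.to_datetime can also assemble dates directly from year / month / day columns, and errors='coerce' turns impossible dates into NaT instead of raising. The sketch below uses made-up rows (the `toy` frame and its values are assumptions for illustration) and is a different route from the one this post takes.

```python
import pandas as pd

# Hypothetical rows: Feb 30 never exists, Feb 29 exists only in leap years
toy = pd.DataFrame({
    "year":   [1969, 1969, 1972],
    "month":  [1, 2, 2],
    "day":    [1, 30, 29],
    "births": [8486, 20, 9000],
})

# Columns named year / month / day are assembled into one datetime column;
# rows that do not form a real calendar date become NaT
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]], errors="coerce")

# Keep only the rows with a real date
valid = toy.dropna(subset=["date"])
print(valid)
```

With this approach Feb 29, 1972 is kept because 1972 is a leap year, while Feb 30, 1969 is dropped.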

🏆 Most time-series handling is solved by the to_datetime function, but it fails on dates that do not exist, which is something that can come up around leap years. With a small dataset you could write a function that drops those cases by hand, but this data spans quite a few years, so I will use a separate filter instead: the code below keeps only the rows whose birth count lies within 5 robust standard deviations of the median before building the dates (I still don't fully understand how it works… I should study it more and cover it in a separate post🤒).

import numpy as np
quartiles = np.percentile(birth_date['births'], [25,50,75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
birth_date = birth_date.query('(births > @mu - 5*@sig) & (births < @mu + 5*@sig)')
birth_date = birth_date.astype('str')
birth_date['births'] = birth_date['births'].astype('int64')
birth_date['date'] = pd.to_datetime(birth_date['year'] + '-' + birth_date['month'] + '-' + birth_date['day'], errors = 'raise')
birth_date.head()
>>
    year	month	day	births	date
0	1969	1	    1	8486	1969-01-01
1	1969	1	    2	9002	1969-01-02
2	1969	1	    3	9542	1969-01-03
3	1969	1	    4	8960	1969-01-04
4	1969	1	    5	8390	1969-01-05
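
As far as I can tell, the filter above is a robust sigma-clipping step: the median stands in for the mean, and 0.74 times the interquartile range stands in for the standard deviation (for a normal distribution the IQR is about 1.349 sigma, and 1 / 1.349 is about 0.74). Rows with implausibly small counts, which is what mistyped or nonexistent dates tend to have, fall outside the 5-sigma band and are dropped. The sketch below repeats the idea on made-up numbers; the `toy` values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts: seven realistic values plus two implausibly small ones
toy = pd.DataFrame({"births": [9000, 9100, 8900, 9050, 8950, 9200, 8800, 9, 30]})

# Robust center and spread: median and a scaled interquartile range
q25, q50, q75 = np.percentile(toy["births"], [25, 50, 75])
mu = q50
sig = 0.74 * (q75 - q25)

# Keep only rows within 5 robust sigmas of the median
clipped = toy.query("(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)")
print(mu, sig)
print(clipped)
```

Because the median and IQR barely move when a few extreme values are present, the band stays centered on the realistic counts, which an ordinary mean and standard deviation would not guarantee.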

To get the day of the week, let's use the .dt.weekday accessor.

birth_date['weekday'] = birth_date['date'].dt.weekday
birth_date.head()
>>
    year	month	day	births	date	    weekday
0	1969	1	    1	8486	1969-01-01	2
1	1969	1	    2	9002	1969-01-02	3
2	1969	1	    3	9542	1969-01-03	4
3	1969	1	    4	8960	1969-01-04	5
4	1969	1	    5	8390	1969-01-05	6

We can see that a new weekday column has appeared, but it holds numbers. These are the values returned by the accessor used above, where 0 means Monday and 6 means Sunday. So we need to convert them into the English names we are used to. 🔑 To do this, we write a separate function and then apply it to the dataframe with apply.

def weekday_func(row):
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri' 
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
    
    return row

birth_date = birth_date.apply(weekday_func, axis = 1)
birth_date.head()
>>
	year	month	day	births	date	    weekday
0	1969	1	    1	8486	1969-01-01	Wed
1	1969	1	    2	9002	1969-01-02	Thu
2	1969	1	    3	9542	1969-01-03	Fri
3	1969	1	    4	8960	1969-01-04	Sat
4	1969	1	    5	8390	1969-01-05	Sun
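
The same conversion can also be written without a row-wise apply, by mapping the integers from .dt.weekday through a lookup table. This is a minimal sketch on three sample dates; the `dates`, `names`, and `weekday` variables are assumptions for illustration.

```python
import pandas as pd

# Three of the dates shown above
dates = pd.Series(pd.to_datetime(["1969-01-01", "1969-01-04", "1969-01-05"]))

# .dt.weekday counts from Monday = 0 to Sunday = 6
names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
weekday = dates.dt.weekday.map(dict(enumerate(names)))
print(weekday.tolist())
```

On a dataframe with many rows, a vectorized map like this is usually much faster than apply with axis = 1, since it avoids calling a Python function once per row.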

The data above covers 1969 through 1988. We will cut these years into decades and summarize the births by weekday. For that, let's write one more function and apply it.

def decade_func(row):
    # year was cast to str above, so convert it back before the integer division
    year = int(row['year'])
    if year // 10 == 196:
        row['decade'] = 1960
    elif year // 10 == 197:
        row['decade'] = 1970
    elif year // 10 == 198:
        row['decade'] = 1980
        
    return row

birth_decade_weekday = birth_date.apply(decade_func, axis = 1).copy()

birth_decade_weekday = birth_decade_weekday[['decade', 'weekday', 'births']].copy()

birth_decade_weekday = birth_decade_weekday.groupby(['decade','weekday']).sum().copy()
birth_decade_weekday = birth_decade_weekday.reset_index()
birth_decade_weekday['decade'] = pd.Categorical(birth_decade_weekday['decade'], categories=[1960,1970,1980], ordered = True)
birth_decade_weekday['weekday'] = pd.Categorical(birth_decade_weekday['weekday'], categories=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], ordered = True)

birth_decade_weekday = birth_decade_weekday.sort_values(['decade','weekday'])
birth_decade_weekday = birth_decade_weekday.set_index('decade')

birth_decade_weekday.head()
>>
        weekday	births
decade		
1960	Mon	    526638
1960	Tue	    549754
1960	Wed	    537910
1960	Thu	    517742
1960	Fri	    531220

We can see that the data is grouped by decade and by weekday, as intended. Along the way I used pd.Categorical() to set the order of those columns. I will cover what this function does in more detail in another post (the list of things to cover keeps growing…😅).

Finally, splitting the data by decade gives us the final data we want!!
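
To make the ordering step a bit more concrete, here is a small sketch of the decade calculation and the ordered categorical on made-up rows. The `toy` frame and its values are assumptions for illustration, and the decade is computed with integer arithmetic rather than the row-wise function used in this post.

```python
import pandas as pd

# Hypothetical rows; year is a string here, as it is after astype('str')
toy = pd.DataFrame({
    "year":    ["1969", "1975", "1975", "1983"],
    "weekday": ["Sun", "Tue", "Mon", "Mon"],
    "births":  [1, 2, 3, 4],
})

# 1975 // 10 * 10 == 1970, so every year lands on the start of its decade
toy["decade"] = toy["year"].astype(int) // 10 * 10

# An ordered categorical sorts by the given category order, not alphabetically
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
toy["weekday"] = pd.Categorical(toy["weekday"], categories=order, ordered=True)

result = toy.sort_values(["decade", "weekday"])
print(result)
```

Without the categorical, a plain string sort would place Sun before Tue, which is why the explicit category order matters for the weekday axis of the plot.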

birth_1960 = birth_decade_weekday[birth_decade_weekday.index == 1960]
birth_1970 = birth_decade_weekday[birth_decade_weekday.index == 1970]
birth_1980 = birth_decade_weekday[birth_decade_weekday.index == 1980]
birth_1980
>>
        weekday	births
decade		
1980	Mon	    4949739
1980	Tue 	5173612
1980	Wed	    5045584
1980	Thu	    5013476
1980	Fri	    5069502
1980	Sat	    4214867
1980	Sun	    4041017

import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = birth_1970['weekday'], y = birth_1970['births'], name = '1970'))

fig.add_trace(
    go.Scatter(
        x = birth_1980['weekday'], y = birth_1980['births'], name = '1980'))

fig.update_layout(
{
    'title' : {'text' : '<b>Births by Weekday in 1970 / 1980</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y' : 0.92},
    'showlegend' : True,
    'xaxis' : {'showticklabels' : True, 'title' : {'text' : 'Weekday', 'font' : {'size' : 15}}},
    'yaxis' : {'showticklabels' : True, 'title' : {'text' : 'Births', 'font' : {'size' : 15}}}
})
fig.show()

👉 Because births were relatively low in the early 1970s, the 1980s come out higher than the 1970s overall. On weekends, however, the 1970s are higher.

👉 You might ask why the 1960s data was built but never used. The dataset only starts in 1969, so the "1960s" here is just the single year 1969. Putting it next to the other decades would blur the analysis rather than help it, so let's compare only the 1970s and the 1980s.


🏆 I thought this dataset would be done in two posts, but there turned out to be a lot I wanted to know and a lot I wanted to write down, so I ended up rambling… The longer I run this blog, the more things there are to cover separately and the more things I find I don't know. Still, if I keep organizing things like this I can look them up again later, so I'll try to have even more fun with it!!

🏆 To be continued in the next post~~


💡 The data was taken from the GitHub of Jake VanderPlas, author of the Python Data Science Handbook (Wikibooks, 2020).