3. Exploring US Birth Frequency by Year (2)
For the analysis that follows, let's load the data we preprocessed in the previous post.
import pandas as pd
birth = pd.read_csv("Data/births.csv", encoding = 'utf-8-sig')
birth_til_1988 = birth[:15067]
birth_til_1988_F = birth_til_1988[birth_til_1988['gender'] == 'F'].copy()
birth_til_1988_F = birth_til_1988_F.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_F = birth_til_1988_F.reset_index()
birth_til_1988_M = birth_til_1988[birth_til_1988['gender'] == 'M'].copy()
birth_til_1988_M = birth_til_1988_M.groupby(by = ['year', 'month']).aggregate({'births' : 'sum'})
birth_til_1988_M = birth_til_1988_M.reset_index()
birth_after_1989 = birth[15067:]
birth_after_1989_F = birth_after_1989[birth_after_1989['gender']=="F"].copy()
birth_after_1989_F = birth_after_1989_F[['year','month','births']].copy()
birth_after_1989_M = birth_after_1989[birth_after_1989['gender']=="M"].copy()
birth_after_1989_M = birth_after_1989_M[['year','month','births']].copy()
birth_F = pd.concat([birth_til_1988_F, birth_after_1989_F])
birth_M = pd.concat([birth_til_1988_M, birth_after_1989_M])
1. Comparing male / female births by year
Group the birth_F / birth_M DataFrames built above by year and sum the births for each year.
birth_F_year = birth_F.groupby('year').aggregate({'births' : 'sum'})
birth_M_year = birth_M.groupby('year').aggregate({'births' : 'sum'})
birth_year = pd.merge(birth_F_year, birth_M_year, on = 'year', how = 'inner')
birth_year.columns = ['births_F', 'births_M']
birth_year.head()
>>
births_F births_M
year
1969 1753634 1846572
1970 1819164 1918636
1971 1736774 1826774
1972 1592347 1673888
1973 1533102 1613023
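The groupby-then-merge step above can be tried on a tiny table. This is only a sketch: the toy_F / toy_M frames and their numbers are made up for illustration, and the suffixes argument is used here as one possible way to name the two births columns in a single step instead of assigning columns afterwards.

```python
import pandas as pd

# Made-up monthly rows, NOT taken from births.csv
toy_F = pd.DataFrame({"year": [1969, 1969, 1970], "month": [1, 2, 1], "births": [10, 20, 30]})
toy_M = pd.DataFrame({"year": [1969, 1969, 1970], "month": [1, 2, 1], "births": [11, 21, 31]})

# Sum births per year for each gender
toy_F_year = toy_F.groupby("year").aggregate({"births": "sum"})
toy_M_year = toy_M.groupby("year").aggregate({"births": "sum"})

# Both frames have a 'births' column, so suffixes renames them to births_F / births_M
toy_year = pd.merge(toy_F_year, toy_M_year, on="year", how="inner", suffixes=("_F", "_M"))
print(toy_year)
```

With how = 'inner', only years present in both frames survive, which is fine here because both genders cover the same years.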
Let's visualize this with plotly.
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_F'], name = 'Female',
        text = birth_year['births_F'], texttemplate = '%{text:,}'
    ))
fig.add_trace(
    go.Scatter(
        x = birth_year.index, y = birth_year['births_M'], name = 'Male',
        text = birth_year['births_M'], texttemplate = '%{text:,}', textfont = dict(color = 'white')
    ))
fig.update_layout(
    {
        'title' : {'text' : '<b>Female/Male births per year</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y' : 0.9},
        'xaxis' : {'showticklabels' : True, 'dtick' : 1, 'title' : {'text' : 'year', 'font' : {'size' : 20}}},
        'yaxis' : {'showticklabels' : True, 'tick0' : '1M', 'title' : {'text' : 'births', 'font' : {'size' : 20}}},
        'showlegend' : True
    })
fig.show()
Looking at the overall trend, male births stayed consistently higher than female births.
Apart from the early 1970s, births also rise gradually. My guess is that the exception in the early 1970s is related to the oil shock and the Vietnam War of the late 1960s and the 1970s. There may well be other reasons, but it is hard to pin down the cause from birth data alone.
2. Checking birth trends by day of the week
This time we preprocess by day of the week, so we need the year, month, and day columns of the DataFrame. We will combine these three columns with the to_datetime function and derive the weekday from the result.
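As a side note, to_datetime can also assemble dates directly from a DataFrame that has year, month, and day columns, without building strings first. A minimal sketch, assuming integer columns; the toy frame is hypothetical, although its two rows reuse the first two daily counts shown in this post.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the post
toy = pd.DataFrame({
    "year":   [1969, 1969],
    "month":  [1, 1],
    "day":    [1, 2],
    "births": [8486, 9002],
})

# Passing the three columns as a DataFrame lets pandas assemble the dates
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])

# Monday is 0 and Sunday is 6, so 1969-01-01 (a Wednesday) gives 2
toy["weekday"] = toy["date"].dt.weekday
print(toy)
```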
birth_date = pd.pivot_table(birth_til_1988, index = ['year','month', 'day'], values = ['births'], aggfunc={'births' : 'sum'})
birth_date.head()
>>
births
year month day
1969 1 1.0 8486
2.0 9002
3.0 9542
4.0 8960
5.0 8390
birth_date = birth_date.reset_index()
birth_date[['day']] = birth_date[['day']].astype('int64')
birth_date.head()
>>
year month day births
0 1969 1 1 8486
1 1969 1 2 9002
2 1969 1 3 9542
3 1969 1 4 8960
4 1969 1 5 8390
Most time-series handling can be done with to_datetime, but it occasionally raises errors. With a small dataset you could write a function that excludes the rows that trigger errors, but this dataset spans quite a few years, so I will write a separate piece of code to deal with the errors (I still don't fully understand how it works… I'll study it more and cover it in a separate post).
import numpy as np
quartiles = np.percentile(birth_date['births'], [25,50,75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
birth_date = birth_date.query('(births > @mu - 5*@sig) & (births < @mu + 5*@sig)')
birth_date = birth_date.astype('str')
birth_date['births'] = birth_date['births'].astype('int64')
birth_date['date'] = pd.to_datetime(birth_date['year'] + '-' + birth_date['month'] + '-' + birth_date['day'], errors = 'raise')
birth_date.head()
>>
year month day births date
0 1969 1 1 8486 1969-01-01
1 1969 1 2 9002 1969-01-02
2 1969 1 3 9542 1969-01-03
3 1969 1 4 8960 1969-01-04
4 1969 1 5 8390 1969-01-05
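For what it's worth, the filter on births used above looks like a robust sigma-clipping step. It takes the median as the center and 0.74 times the interquartile range as a spread estimate (for normally distributed data the IQR is about 1.35 standard deviations, so 0.74 × IQR is roughly one standard deviation), then keeps only rows within 5 of those units of the median. My assumption is that rows with impossible dates also carry implausible counts, so they drop out before to_datetime ever sees them. The sketch below uses made-up counts; the errors = 'coerce' part is an alternative I am adding for comparison, not something the code above does.

```python
import numpy as np
import pandas as pd

# Made-up daily counts; the last row mimics a bad record with a tiny count
toy = pd.DataFrame({"births": [8486, 9002, 9542, 8960, 8390, 9100, 8800, 12]})

q25, q50, q75 = np.percentile(toy["births"], [25, 50, 75])
mu = q50                   # median as a robust center
sig = 0.74 * (q75 - q25)   # robust stand-in for the standard deviation

# Keep only rows within 5 robust sigmas of the median
clipped = toy.query("(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)")
print(clipped)

# Alternative (an assumption, not the post's method): let to_datetime
# turn impossible dates into NaT instead of raising an error
dates = pd.to_datetime(pd.Series(["1969-02-28", "1969-02-30"]),
                       format="%Y-%m-%d", errors="coerce")
print(dates.isna().tolist())
```

With errors = 'coerce' the bad rows become NaT and can be removed with dropna, whereas the clipping approach removes them based on the births value alone.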
To get the day of the week, use the .dt.weekday accessor.
birth_date['weekday'] = birth_date['date'].dt.weekday
birth_date.head()
>>
year month day births date weekday
0 1969 1 1 8486 1969-01-01 2
1 1969 1 2 9002 1969-01-02 3
2 1969 1 3 9542 1969-01-03 4
3 1969 1 4 8960 1969-01-04 5
4 1969 1 5 8390 1969-01-05 6
We can see that a new weekday column has been added, but it holds numbers. These are the values returned by the accessor used above, where 0 means Monday and 6 means Sunday. So we need to convert them to the English names we are used to. To do that, we write a function and apply it to the DataFrame with apply.
def weekday_func(row):
    # Replace the numeric weekday (0 = Monday ... 6 = Sunday) with its English abbreviation
    if row['weekday'] == 0:
        row['weekday'] = 'Mon'
    elif row['weekday'] == 1:
        row['weekday'] = 'Tue'
    elif row['weekday'] == 2:
        row['weekday'] = 'Wed'
    elif row['weekday'] == 3:
        row['weekday'] = 'Thu'
    elif row['weekday'] == 4:
        row['weekday'] = 'Fri'
    elif row['weekday'] == 5:
        row['weekday'] = 'Sat'
    elif row['weekday'] == 6:
        row['weekday'] = 'Sun'
    return row
birth_date = birth_date.apply(weekday_func, axis = 1)
birth_date.head()
>>
year month day births date weekday
0 1969 1 1 8486 1969-01-01 Wed
1 1969 1 2 9002 1969-01-02 Thu
2 1969 1 3 9542 1969-01-03 Fri
3 1969 1 4 8960 1969-01-04 Sat
4 1969 1 5 8390 1969-01-05 Sun
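The row-by-row apply above works, but the same relabeling can probably be done in one vectorized step. A sketch on a hypothetical three-row frame: the names dictionary is my own helper, and dt.day_name() is shown as a second option that relies on pandas' default English day names.

```python
import pandas as pd

# Hypothetical frame holding three of the dates shown in the output above
toy = pd.DataFrame({"date": pd.to_datetime(["1969-01-01", "1969-01-02", "1969-01-05"])})

# Option 1: map the 0-6 weekday numbers through a lookup dictionary
names = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}
toy["weekday"] = toy["date"].dt.weekday.map(names)

# Option 2: take the full English day name and keep its first three letters
toy["weekday_alt"] = toy["date"].dt.day_name().str[:3]

print(toy)
```

Both columns should agree; on a table with thousands of rows this tends to be noticeably faster than apply with axis = 1.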
The data above covers 1969 through 1988. We will bucket these years into decades and summarize them by weekday. For this, let's write one more function and apply it.
def decade_func(row):
    # 'year' was cast to str earlier, so convert it back before the integer division
    decade = int(row['year']) // 10
    if decade == 196:
        row['decade'] = 1960
    elif decade == 197:
        row['decade'] = 1970
    elif decade == 198:
        row['decade'] = 1980
    return row
birth_decade_weekday = birth_date.apply(decade_func, axis = 1).copy()
birth_decade_weekday = birth_decade_weekday[['decade', 'weekday', 'births']].copy()
birth_decade_weekday = birth_decade_weekday.groupby(['decade','weekday']).sum().copy()
birth_decade_weekday = birth_decade_weekday.reset_index()
birth_decade_weekday['decade'] = pd.Categorical(birth_decade_weekday['decade'], categories=[1960,1970,1980], ordered = True)
birth_decade_weekday['weekday'] = pd.Categorical(birth_decade_weekday['weekday'], categories=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], ordered = True)
birth_decade_weekday = birth_decade_weekday.sort_values(['decade','weekday'])
birth_decade_weekday = birth_decade_weekday.set_index('decade')
birth_decade_weekday.head()
>>
weekday births
decade
1960 Mon 526638
1960 Tue 549754
1960 Wed 537910
1960 Thu 517742
1960 Fri 531220
We can see the data is grouped by decade and by weekday, as intended. Along the way we used pd.Categorical() to fix the order of those columns. I'll cover what this function does in more detail in another post (the list of things to cover keeps growing…).
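Until that post exists, here is a small sketch of why pd.Categorical matters for the sort. Plain strings sort alphabetically, so Fri would come before Mon; an ordered categorical sorts by the category list instead. The toy frame and its numbers are made up.

```python
import pandas as pd

# Made-up rows in scrambled weekday order
toy = pd.DataFrame({"weekday": ["Fri", "Mon", "Sun", "Wed"], "births": [4, 1, 7, 3]})

# Plain strings: alphabetical order
alphabetical = toy.sort_values("weekday")["weekday"].tolist()

# Ordered categorical: order follows the categories list
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
toy["weekday"] = pd.Categorical(toy["weekday"], categories=order, ordered=True)
calendar = toy.sort_values("weekday")["weekday"].tolist()

print(alphabetical)  # ['Fri', 'Mon', 'Sun', 'Wed']
print(calendar)      # ['Mon', 'Wed', 'Fri', 'Sun']
```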
Finally, splitting the data by decade gives us the final data we wanted!!
birth_1960 = birth_decade_weekday[birth_decade_weekday.index == 1960]
birth_1970 = birth_decade_weekday[birth_decade_weekday.index == 1970]
birth_1980 = birth_decade_weekday[birth_decade_weekday.index == 1980]
birth_1980
>>
weekday births
decade
1980 Mon 4949739
1980 Tue 5173612
1980 Wed 5045584
1980 Thu 5013476
1980 Fri 5069502
1980 Sat 4214867
1980 Sun 4041017
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = birth_1970['weekday'], y = birth_1970['births'], name = '1970'))
fig.add_trace(
    go.Scatter(
        x = birth_1980['weekday'], y = birth_1980['births'], name = '1980'))
fig.update_layout(
    {
        'title' : {'text' : '<b>Births by Weekday in 1970 / 1980</b>', 'font' : {'size' : 25}, 'x' : 0.5, 'y' : 0.92},
        'showlegend' : True,
        'xaxis' : {'showticklabels' : True, 'title' : {'text' : 'Weekday', 'font' : {'size' : 15}}},
        'yaxis' : {'showticklabels' : True, 'title' : {'text' : 'Births', 'font' : {'size' : 15}}}
    })
fig.show()
Because births were relatively low in the early 1970s, the 1980s come out higher than the 1970s overall. On weekends, however, the 1970s are higher.
You might ask why we built the 1960s data and then never used it. Data collection starts in 1969, so the "1960s" bucket is really just 1969. Treating it alongside the other decades would blur the analysis, so we compare only the 1970s and the 1980s.
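One caveat worth sketching: the decade buckets here do not cover the same number of days (the 1960s bucket is only 1969, and the daily data used in this section stops at 1988), so a per-day mean may be a fairer yardstick than a raw sum. The snippet below is only an illustration on made-up rows; computing the decade with year // 10 * 10 is my own shortcut and assumes year is an integer column.

```python
import pandas as pd

# Made-up daily rows, NOT taken from births.csv
toy = pd.DataFrame({
    "year":    [1969, 1975, 1975, 1988],
    "weekday": ["Mon", "Mon", "Mon", "Mon"],
    "births":  [100, 200, 300, 400],
})

# Integer arithmetic turns 1975 into 1970, 1988 into 1980, and so on
toy["decade"] = toy["year"] // 10 * 10

# sum depends on how many days fall in the bucket; mean does not
summary = toy.groupby(["decade", "weekday"])["births"].agg(["sum", "mean", "count"])
print(summary)
```

Comparing the mean column across decades keeps the weekday pattern visible without favoring the bucket that happens to contain more days.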
I thought this dataset would take two posts at most, but there turned out to be a lot of extra things I wanted to understand and write down, so I ended up rambling… The longer I run this blog, the more topics pile up that deserve their own post, and the more things I find I don't know. Still, writing it all down like this means I can look it up again later, so I'll keep at it and have more fun with it!!
To be continued in the next post~~