๐๐ 1. ๋ฏธ๊ตญ์ state ๋ณ ๋ฉด์ ๋น ์ธ๊ตฌ์ ๊ตฌํ๊ธฐ(1)
1. ๋ฐ์ดํฐ ์ดํด๋ณด๊ธฐ
๐ ๋ฐ์ดํฐ๋ ๋ค์ url์์ ๊ฐ์ ธ์๋ค.
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
1.1 population data
๐ ์ง์ญ, ์ฐ๋ น๋(total / under18), ๋ฐ์ดํฐ์์ง๋ ๋, ์ธ๊ตฌ์๋ฅผ ๋ณด์ฌ์ฃผ๋ ๋ฐ์ดํฐ
population = pd.read_csv("Data/state-population.csv", encoding = 'utf-8-sig')
population.head()
>>
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
1.2 areas data
๐ ๊ฐ state ๋ณ ๋ฉด์ ์ ๋ณด์ฌ์ฃผ๋ ๋ฐ์ดํฐ
areas = pd.read_csv("Data/state-areas.csv", encoding = 'utf-8-sig')
areas.head()
>>
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
1.3 abbrevs data
๐ ๊ฐ state์ ์ฝ์ด(abbreviation)๋ฅผ ๋ณด์ฌ์ฃผ๋ ๋ฐ์ดํฐ
abbrevs = pd.read_csv("Data/state-abbrevs.csv", encoding = 'utf-8-sig')
abbrevs.head()
>>
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
2. ๋ฐ์ดํฐ ์ ์ฒ๋ฆฌ
2.1.population data
pop = pd.read_csv("Data/state-population.csv", encoding = 'utf-8-sig')
pop.head()
>>
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
2.1.1. null ๊ฐ ์ฒ๋ฆฌ
pop.isnull().sum()
>>
state/region 0
ages 0
year 0
population 20
dtype: int64
population column์ NaN๊ฐ์ด ์์์ ์ ์ ์๋ค.
๋ฐ๋ผ์ population ์ด์์ NaN ๊ฐ์ ๊ฐ์ง๋ ํ์ ๊ฐ์ ธ์๋ณด๋ฉด ๋ค์๊ณผ ๊ฐ๋ค.
pop[pop['population'].isnull()]
>>
state/region ages year population
2448 PR under18 1990 NaN
2449 PR total 1990 NaN
2450 PR total 1991 NaN
2451 PR under18 1991 NaN
2452 PR total 1993 NaN
2453 PR under18 1993 NaN
2454 PR under18 1992 NaN
2455 PR total 1992 NaN
2456 PR under18 1994 NaN
2457 PR total 1994 NaN
2458 PR total 1995 NaN
2459 PR under18 1995 NaN
2460 PR under18 1996 NaN
2461 PR total 1996 NaN
2462 PR under18 1998 NaN
2463 PR total 1998 NaN
2464 PR total 1997 NaN
2465 PR under18 1997 NaN
2466 PR total 1999 NaN
2467 PR under18 1999 NaN
state/region ์ด PR์ธ ํ์์ ๋ชจ๋ NaN ๊ฐ์ด ๋ฐ์ํจ์ ์ ์ ์๋ค.
๊ทธ๋ฌ๋ฉด ์ด NaN ๊ฐ์ ์ฒ๋ฆฌํด์ค์ผ ํ๋๋ฐ, ๊ทธ ์ ์ PR์ ์ ๋ณด๋ฅผ ์๊ธฐ ์ํด์ state/region ์ด PR์ธ ํ์์ NaN์ด ์๋ ๋ฐ์ดํฐ๋ฅผ ํ์ธํด๋ณด์.
PR = pop[pop['state/region']=='PR']
PR = PR[PR['population'].notnull()]
PR.head()
>>
state/region ages year population
2468 PR total 2000 3810605.0
2469 PR under18 2000 1089063.0
2470 PR total 2001 3818774.0
2471 PR under18 2001 1077566.0
2472 PR total 2002 3823701.0
PR.tail()
>>
state/region ages year population
2491 PR under18 2010 896945.0
2492 PR under18 2011 869327.0
2493 PR total 2011 3686580.0
2494 PR under18 2012 841740.0
2495 PR total 2012 3651545.0
์ด๋ฅผ ๋ณด๋ฉด PR ์ง์ญ์ 2000๋
๋ถํฐ ์ง๊ณ๊ฐ ์์๋์์์ ์ ์ ์๋ค.
๋ํ ์์ NaN๋ฐ์ดํฐ๋ฅผ ๋ณด๋ฉด 1990~1999 ๊น์ง๋ NaN ๊ฐ์ด๊ณ , 2000๋
๋ถํฐ๋ ์ค๊ฐ์ ๋๊ธด ๋ถ๋ถ์ด ์๊ธฐ ๋๋ฌธ์ PR์์๋ ๊ทธ๋ฅ NaN ๊ฐ์ drop์ํค๋ฉด ๋ ๊ฒ๊ฐ๋ค.
pop = pop.dropna()
pop.isnull().sum()
>>
state/region 0
ages 0
year 0
population 0
dtype: int64
dropna ๊ฒฐ๊ณผ๋ฅผ ํ์ธํด๋ณด๋ ์ด์ NaN ๊ฐ์ด ์๋ ๊ฒ์ ํ์ธํ ์ ์๋ค.
์ด์ ๋ ๋ค์ ์์ ์ ์ฒ๋ฆฌ ๊ณผ์ ์ ์ํด ์ด์ ์ด๋ฆ์ ๋ฐ๊ฟ์ฃผ์.
abbrevs data ์ abbreviation ์ด์ ์ง์ญ์ ์ฝ์ด๊ฐ ๋ค์ด๊ฐ ์๊ธฐ ๋๋ฌธ์ population ๋ฐ์ดํฐ์ state/region ์ด์ ๋ง์ถฐ์ฃผ๋ฉด ๋ ๊ฒ ๊ฐ๋ค.
population ๋ฐ์ดํฐ์ ์ด์ ๋ฆฌ์คํธ๋ก ๋ฐ์์ค๊ณ , ๊ทธ ๋ฆฌ์คํธ์ ์ ๊ทผํ๋ ๋ฐฉ๋ฒ์ผ๋ก ์ด์ ์ด๋ฆ์ ๋ฐ๊ฟ์ฃผ์๋ค.
cols = pop.columns.to_list()
cols[0] = 'abbreviation'
pop.columns = cols
pop.columns
>> Index(['abbreviation', 'ages', 'year', 'population'], dtype='object')
์ ๋ณํ๋ ๊ฒ์ ํ์ธํ ์ ์๋ค.
NaN ๊ฐ์ ์์ ์คฌ์ผ๋, ์ด์ ๋ ์ค๋ณต๊ฐ์ด ์๋์ง๋ ํ์ธํด๋ณด์.
duplicated() ์ value_counts๋ฅผ ์ด์ฉํ๋ค.
pop.duplicated().sum()
>> 0
pop.value_counts().max()
>> 1
์ค๋ณต๊ฐ์ด ์๋ค๋ ๊ฒฐ๊ณผ๋ฅผ ํ์ธํ ์ ์๋ค.
2.1.2. total / under18 ๋ถ๋ฅํ๊ธฐ
๐ ์ฐ๋ น๋๋ณ๋ก ์งํ๋ฅผ ์์๋ณด๊ธฐ ์ํด ๋ถ๋ฅ
pop_tot = pop[pop['ages'] == 'total']
pop_tot.head()
>>
abbreviation ages year population
1 AL total 2012 4817528.0
3 AL total 2010 4785570.0
5 AL total 2011 4801627.0
6 AL total 2009 4757938.0
9 AL total 2013 4833722.0
pop_18 = pop[pop['ages'] == 'under18']
pop_18.head()
>>
abbreviation ages year population
0 AL under18 2012 1117489.0
2 AL under18 2010 1130966.0
4 AL under18 2011 1125763.0
7 AL under18 2009 1134192.0
8 AL under18 2013 1111481.0
2.2. areas Data
area = pd.read_csv("Data/state-areas.csv", encoding = 'utf-8-sig')
area.head()
>>
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
area.shape
>> (52, 2)
2.2.1. null๊ฐ/ ์ค๋ณต๊ฐ ์ฒ๋ฆฌํ๊ธฐ
area.isnull().sum()
>>
state 0
area (sq. mi) 0
dtype: int64
area.duplicated().sum()
>> 0
NaN ๊ฐ๊ณผ ์ค๋ณต๊ฐ ๋ชจ๋ ์์์ ํ์ธํ ์ ์๋ค. ๊ธฐ๋ถ์ด ์ข๋ค.
2.3. abbreves data
abb = pd.read_csv("Data/state-abbrevs.csv", encoding = 'utf-8-sig')
abb.head()
>>
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
abb.shape
>> (51, 2)
2.3.1. null๊ฐ/ ์ค๋ณต๊ฐ ์ฒ๋ฆฌํ๊ธฐ
abb.isnull().sum()
>>
state 0
abbreviation 0
dtype: int64
abb.duplicated().sum()
>> 0
abbreves ๋ฐ์ดํฐ๋ NaN ๊ฐ๊ณผ ์ค๋ณต๊ฐ์ด ์๋ ๊น๋ํ ๋ฐ์ดํฐ์ด๋ค.
๐ ์ด๋ ๊ฒ ์ธ ๊ฐ์ง ๋ฐ์ดํฐ์ NaN ๊ฐ, ์ค๋ณต๊ฐ์ ํ์ธํ๊ณ ์ฒ๋ฆฌํ๋ค.
๐ ๋ค์ ํฌ์คํธ์์๋ ์ด ๋ฐ์ดํฐ๋ค์ ๋ณํฉํ๊ณ ํํฐ๋งํด์ ์ํ๋ ๋ฐ์ดํฐ ํํ๋ฅผ ๋ง๋๋ ๊ณผ์ ์ ๋ค๋ฃฐ๊ฒ์ด๋ค.
Leave a comment