๐๐ 1. ๋ฏธ๊ตญ์ state ๋ณ ๋ฉด์ ๋น ์ธ๊ตฌ์ ๊ตฌํ๊ธฐ(2)
๐ ์ ๋ฒ ํฌ์คํ ์ ์ด์ด ์ด๋ฒ์๋ ์ ์ฒ๋ฆฌํ ๋ฐ์ดํฐ๋ฅผ ๊ฐ์ง๊ณ ์ฐ๋ฆฌ๊ฐ ์ํ๋ ํํ๋ก ๋ณ๊ฒฝํด๋ณด์.
3. area / abb ๋ณํฉ
๐ area ๋ฐ์ดํฐ์ state๋ฅผ ์ฝ์ด๋ก ๋ํ๋ด๊ธฐ ์ํด ๋์ ๋ณํฉํ์ฌ ์ ๋ฆฌํด์ค๋ค.
๐ ์ฐ์ ๋ ๋ฐ์ดํฐ๋ฅผ ๊ฐ๋จํ ์ดํด๋ณด์.
area.head()
>>
state area(sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
abb.head()
>>
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
๋ ๋ฐ์ดํฐ์ ๊ณตํต์ผ๋ก ์๋ state ์ด์ ๊ธฐ์ค์ผ๋ก ๋ณํฉํด์ฃผ๊ณ , ๋ชจ๋ ๋ฐ์ดํฐ๋ฅผ ๊ฐ์ ธ์ค๊ธฐ ์ํด outer join ํด์ค๋ค.
๋ณํฉ ํ์ ๋ ๋ฐ์ดํฐ ์ค ํ๋์ ๋ฐ์ดํฐ์๋ง ์๋ ์ ๋ณด๋ NaN ๊ฐ์ด ๋ ๊ฐ๋ฅ์ฑ์ด ์๊ธฐ ๋๋ฌธ์ null ๊ฐ ์ ๋ฌด๋ ํ์ธํด์ผํ๋ค.
area_abb = pd.merge(area, abb, on = 'state', how = 'outer')
area_abb.head()
>>
state area (sq. mi) abbreviation
0 Alabama 52423 AL
1 Alaska 656425 AK
2 Arizona 114006 AZ
3 Arkansas 53182 AR
4 California 163707 CA
area_abb.isnull().sum()
>>
state 0
area (sq. mi) 0
abbreviation 1
dtype: int64
abbreviation ์ด์ NaN ๊ฐ์ด ์๋ค. ์ด๋์ ๋์จ ๊ฑด์ง ํ์ธํด๋ณด์.
area_abb[area_abb['abbreviation'].isnull()]
>>
state area (sq. mi) abbreviation
51 Puerto Rico 3515 NaN
Puetro Rico ์ ์ฝ์ด๊ฐ ์ ํด์ง์ง ์์ ๊ฑธ ๋ณผ ์ ์๋ค.
fillna() ํจ์๋ฅผ ํตํด NaN ๊ฐ์ PR๋ก ์ฑ์์ฃผ์.
area_abb = area_abb.fillna("PR")
area_abb[area_abb['state'] == 'Puerto Rico']
>>
state area (sq. mi) abbreviation
51 Puerto Rico 3515 PR
๐ ์ด๋ ๊ฒ area ๋ฐ์ดํฐ์ abbreviation ๋ฐ์ดํฐ๋ฅผ ๋ณํฉํ area_abb ๋ฐ์ดํฐ๋ฅผ ์์ฑํ๋ค.
๐ ์ด์ ์ํ๋ ์ธ๊ตฌ์ ์ ๋ณด๋ฅผ ๊ฐ์ง๊ณ ์๋ population ๋ฐ์ดํฐ์ ๋ณํฉํด๋ณด์.
4.population๊ณผ ๋ณํฉ
4.1. pop_tot ๊ณผ ๋ณํฉ
pop_tot.head()
abbreviation ages year population
1 AL total 2012 4817528.0
3 AL total 2010 4785570.0
5 AL total 2011 4801627.0
6 AL total 2009 4757938.0
9 AL total 2013 4833722.0
area_abb.head()
>>
state area (sq. mi) abbreviation
0 Alabama 52423 AL
1 Alaska 656425 AK
2 Arizona 114006 AZ
3 Arkansas 53182 AR
4 California 163707 CA
pop_tot ๋ฐ์ดํฐ์ area_abb ๋ฐ์ดํฐ์ ๊ณตํต์ผ๋ก ๋ค์ด๊ฐ์๋ abbreviation ์ ๊ธฐ์ค์ผ๋ก ๋ณํฉํ๊ณ , outer join ํด์ฃผ์.
์ด ๊ฒฝ์ฐ๋ NaN ๊ฐ์ด ์๋์ง ํ์ธํด์ผํ๋ค.
final_tot = pd.merge(pop_tot, area_abb, on = 'abbreviation', how = 'outer')
final_tot
>>
abbreviation ages year population state area (sq. mi)
0 AL total 2012 4817528.0 Alabama 52423.0
1 AL total 2010 4785570.0 Alabama 52423.0
2 AL total 2011 4801627.0 Alabama 52423.0
3 AL total 2009 4757938.0 Alabama 52423.0
4 AL total 2013 4833722.0 Alabama 52423.0
... ... ... ... ... ... ...
1257 USA total 2013 316128839.0 NaN NaN
1258 USA total 2009 306771529.0 NaN NaN
1259 USA total 2010 309326295.0 NaN NaN
1260 USA total 2011 311582564.0 NaN NaN
1261 USA total 2012 313873685.0 NaN NaN
1262 rows ร 6 columns
final_tot.isnull().sum()
>>
abbreviation 0
ages 0
year 0
population 0
state 24
area (sq. mi) 24
dtype: int64
NaN ๊ฐ์ด ์๊ธฐ ๋๋ฌธ์ ์ด๋์ ๋์จ ๊ฐ์ธ์ง ํ์ธํ์.
final_tot[final_tot['state'].isnull()]
>>
abbreviation ages year population state area (sq. mi)
1238 USA total 1990 249622814.0 NaN NaN
1239 USA total 1991 252980942.0 NaN NaN
1240 USA total 1992 256514231.0 NaN NaN
1241 USA total 1993 259918595.0 NaN NaN
1242 USA total 1994 263125826.0 NaN NaN
1243 USA total 1995 266278403.0 NaN NaN
1244 USA total 1996 269394291.0 NaN NaN
1245 USA total 1997 272646932.0 NaN NaN
1246 USA total 1998 275854116.0 NaN NaN
1247 USA total 2000 282162411.0 NaN NaN
1248 USA total 1999 279040181.0 NaN NaN
1249 USA total 2001 284968955.0 NaN NaN
1250 USA total 2002 287625193.0 NaN NaN
1251 USA total 2003 290107933.0 NaN NaN
1252 USA total 2004 292805298.0 NaN NaN
1253 USA total 2005 295516599.0 NaN NaN
1254 USA total 2006 298379912.0 NaN NaN
1255 USA total 2007 301231207.0 NaN NaN
1256 USA total 2008 304093966.0 NaN NaN
1257 USA total 2013 316128839.0 NaN NaN
1258 USA total 2009 306771529.0 NaN NaN
1259 USA total 2010 309326295.0 NaN NaN
1260 USA total 2011 311582564.0 NaN NaN
1261 USA total 2012 313873685.0 NaN NaN
ํ์ธ ๊ฒฐ๊ณผ USA์์ NaN ๊ฐ์ด ๋์จ ๊ฒ์ ํ์ธํ ์ ์๋ค
ํ์ง๋ง ์ฐ๋ฆฌ๊ฐ ๊ตฌํ๋ ค๋ state ๋ณ ๋ฉด์ ๋๋น ์ธ๊ตฌ์์๋ ๊ด๊ณ๊ฐ ์๋ ๋ด์ฉ์ด๋ฏ๋ก ๊ทธ๋ฅ dropna()๋ฅผ ํด์ ์์ ์ฃผ๋๋ก ํ์.
final_tot = final_tot.dropna()
final_tot.head()
>>
abbreviation ages year population state area (sq. mi)
0 AL total 2012 4817528.0 Alabama 52423.0
1 AL total 2010 4785570.0 Alabama 52423.0
2 AL total 2011 4801627.0 Alabama 52423.0
3 AL total 2009 4757938.0 Alabama 52423.0
4 AL total 2013 4833722.0 Alabama 52423.0
final_tot.shape
>> (1238, 6)
์๋์ 1262ํ์์ 24ํ์ NaN๊ฐ์ด ์ ๊ฑฐ๋์ด 1238ํ์ด ๋ ๊ฒ๋ ํ์ธํ์.
final_tot.isnull().sum()
>>
abbreviation 0
ages 0
year 0
population 0
state 0
area (sq. mi) 0
dtype: int64
final_tot.duplicated().sum()
>> 0
ํ์ธ ๊ฒฐ๊ณผ NaN ๊ฐ๊ณผ ์ค๋ณต๊ฐ์ด ์๋ ๊น๋ํ ๋ฐ์ดํฐ๊ฐ ๋์๋ค.
4.2. pop_18 ๊ณผ ๋ณํฉ
์ด ๊ฒฝ์ฐ ์ญ์ USA ๊ฐ์ด NaN ์ด ๋์ฌ ๊ฒ์ด๋ค.
dropna() ๋ฅผ ํด์ ์ ๋ฆฌํด์ฃผ์.
final_18 = pd.merge(pop_18, area_abb, on = 'abbreviation', how = 'outer')
final_18 = final_18.dropna()
final_18.head()
>>
abbreviation ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL under18 2010 1130966.0 Alabama 52423.0
2 AL under18 2011 1125763.0 Alabama 52423.0
3 AL under18 2009 1134192.0 Alabama 52423.0
4 AL under18 2013 1111481.0 Alabama 52423.0
5. ๋ฉด์ ๋น ์ธ๊ตฌ์
๐ pop/area ๋ผ๋ ์๋ก์ด ์ด์ ๋ง๋ค์ด์ ํด๋น ๋ด์ฉ์ ๋ฃ์ด์ค๋ค.
๐ 1. final_tot ์์ฑ
final_tot['pop/area'] = final_tot['population']/final_tot['area (sq. mi)']
final_tot.head()
>>
abbreviation ages year population state area (sq. mi) pop/area
0 AL total 2012 4817528.0 Alabama 52423.0 91.897221
1 AL total 2010 4785570.0 Alabama 52423.0 91.287603
2 AL total 2011 4801627.0 Alabama 52423.0 91.593900
3 AL total 2009 4757938.0 Alabama 52423.0 90.760506
4 AL total 2013 4833722.0 Alabama 52423.0 92.206131
๐ 2. fianl_18 ์์ฑ
final_18['pop/area'] = final_18['population']/final_18['area (sq. mi)']
final_18.head()
>>
abbreviation ages year population state area (sq. mi) pop/area
0 AL under18 2012 1117489.0 Alabama 52423.0 21.316769
1 AL under18 2010 1130966.0 Alabama 52423.0 21.573851
2 AL under18 2011 1125763.0 Alabama 52423.0 21.474601
3 AL under18 2009 1134192.0 Alabama 52423.0 21.635389
4 AL under18 2013 1111481.0 Alabama 52423.0 21.202163
๋ชจ๋ ์ ์ฒ๋ฆฌ๊ฐ ๋๋ฌ๋ค!! ์ด์ ๋ถํฐ๋ ํ์ํ ๋ถ๋ถ๋ง ๊ฐ์ ธ์์ ์๊ฐํ๋ฅผ ํ๊ฑฐ๋ ๋ถ์ํ๋ฉด ๋๋ค.
6. 2013๋ / total์์ ๋ฉด์ ๋น ์ธ๊ตฌ์
๐ ๋ฉด์ ๋น ์ธ๊ตฌ์๊ฐ ๋ง์ ์์๋๋ก ์ ๋ ฌํด์ฃผ์.
density_2013_tot = final_tot[final_tot['year']== 2013]
density_2013_tot = density_2013_tot.sort_values('pop/area', ascending=False)
density_2013_tot.head()
>>
abbreviation ages year population state area (sq. mi) pop/area
195 DC total 2013 646449.0 District of Columbia 68.0 9506.602941
1233 PR total 2013 3615086.0 Puerto Rico 3515.0 1028.473969
723 NJ total 2013 8899339.0 New Jersey 8722.0 1020.332378
955 RI total 2013 1051511.0 Rhode Island 1545.0 680.589644
147 CT total 2013 3596080.0 Connecticut 5544.0 648.643579
์ด์ density_2013_tot ๋ฐ์ดํฐ์์ ์์ 10๊ฐ์ ์ง์ญ๋ง ๊ฐ์ ธ์ค์.
density_2013_tot_top10 = density_2013_tot[['abbreviation', 'pop/area']][:10]
density_2013_tot_top10
>>
abbreviation pop/area
195 DC 9506.602941
1233 PR 1028.473969
723 NJ 1020.332378
955 RI 680.589644
147 CT 648.643579
524 MA 634.090384
483 MD 477.860401
187 DE 473.771238
772 NY 360.736613
236 FL 297.345722
๐ ์ ๋ฒ ํฌ์คํธ์ ์ด๋ฒ ํฌ์คํธ๋ฅผ ๊ฑฐ์ณ ๋๋์ด ์ํ๋ ๋ฐ์ดํฐ๋ฅผ ์ป์๋ค!! ์ด์ ๋ค์ ํฌ์คํธ์์๋ ์ด ๋ฐ์ดํฐ๋ฅผ ์๊ฐํํด๋ณด์.
Leave a comment