๐Ÿ† ์ €๋ฒˆ ํฌ์ŠคํŒ…์— ์ด์–ด ์ด๋ฒˆ์—๋Š” ์ „์ฒ˜๋ฆฌํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํ˜•ํƒœ๋กœ ๋ณ€๊ฒฝํ•ด๋ณด์ž.

3. area / abb ๋ณ‘ํ•ฉ

๐Ÿ† area ๋ฐ์ดํ„ฐ์˜ state๋ฅผ ์•ฝ์–ด๋กœ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด ๋‘˜์„ ๋ณ‘ํ•ฉํ•˜์—ฌ ์ •๋ฆฌํ•ด์ค€๋‹ค.
๐Ÿ† ์šฐ์„  ๋‘ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ„๋‹จํžˆ ์‚ดํŽด๋ณด์ž.

area.head()
>>
        state	  area(sq. mi)
0	Alabama	     52423
1	Alaska	     656425
2	Arizona	     114006
3	Arkansas     53182
4	California   163707
abb.head()
>>
        state	   abbreviation
0	Alabama	        AL
1	Alaska	        AK
2	Arizona	        AZ
3	Arkansas	AR
4	California	CA

๋‘ ๋ฐ์ดํ„ฐ์— ๊ณตํ†ต์œผ๋กœ ์žˆ๋Š” state ์—ด์„ ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉํ•ด์ฃผ๊ณ , ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜ค๊ธฐ ์œ„ํ•ด outer join ํ•ด์ค€๋‹ค.
๋ณ‘ํ•ฉ ํ›„์— ๋‘ ๋ฐ์ดํ„ฐ ์ค‘ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์—๋งŒ ์žˆ๋Š” ์ •๋ณด๋Š” NaN ๊ฐ’์ด ๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๊ธฐ ๋–„๋ฌธ์— null ๊ฐ’ ์œ ๋ฌด๋„ ํ™•์ธํ•ด์•ผํ•œ๋‹ค.

area_abb = pd.merge(area, abb, on = 'state', how = 'outer')
area_abb.head()
>>
        state	  area (sq. mi)	abbreviation
0	Alabama	      52423	    AL
1	Alaska	      656425	    AK
2	Arizona	      114006	    AZ
3	Arkansas      53182	    AR
4	California    163707	    CA
area_abb.isnull().sum()
>>
state            0
area (sq. mi)    0
abbreviation     1
dtype: int64

abbreviation ์—ด์— NaN ๊ฐ’์ด ์žˆ๋‹ค. ์–ด๋””์„œ ๋‚˜์˜จ ๊ฑด์ง€ ํ™•์ธํ•ด๋ณด์ž.

area_abb[area_abb['abbreviation'].isnull()]
>>
	state	      area (sq. mi)	abbreviation
51	Puerto Rico	  3515	            NaN

Puetro Rico ์˜ ์•ฝ์–ด๊ฐ€ ์ •ํ•ด์ง€์ง€ ์•Š์€ ๊ฑธ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
fillna() ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด NaN ๊ฐ’์„ PR๋กœ ์ฑ„์›Œ์ฃผ์ž.

area_abb = area_abb.fillna("PR")
area_abb[area_abb['state'] == 'Puerto Rico']
>>
	state	      area (sq. mi)	abbreviation
51	Puerto Rico	  3515	            PR

๐Ÿ† ์ด๋ ‡๊ฒŒ area ๋ฐ์ดํ„ฐ์™€ abbreviation ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ‘ํ•ฉํ•œ area_abb ๋ฐ์ดํ„ฐ๋ฅผ ์™„์„ฑํ–ˆ๋‹ค.
๐Ÿ† ์ด์ œ ์›ํ•˜๋Š” ์ธ๊ตฌ์ˆ˜ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” population ๋ฐ์ดํ„ฐ์™€ ๋ณ‘ํ•ฉํ•ด๋ณด์ž.

4.population๊ณผ ๋ณ‘ํ•ฉ

4.1. pop_tot ๊ณผ ๋ณ‘ํ•ฉ


pop_tot.head()
	abbreviation	ages	year	population
1	     AL	        total	2012	4817528.0
3	     AL	        total	2010	4785570.0
5	     AL	        total	2011	4801627.0
6	     AL	        total	2009	4757938.0
9	     AL	        total	2013	4833722.0
area_abb.head()
>>
        state	  area (sq. mi)	abbreviation
0	Alabama	      52423	    AL
1	Alaska	      656425	    AK
2	Arizona	      114006	    AZ
3	Arkansas      53182	    AR
4	California    163707	    CA

pop_tot ๋ฐ์ดํ„ฐ์™€ area_abb ๋ฐ์ดํ„ฐ์— ๊ณตํ†ต์œผ๋กœ ๋“ค์–ด๊ฐ€์žˆ๋Š” abbreviation ์„ ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉํ•˜๊ณ , outer join ํ•ด์ฃผ์ž.
์ด ๊ฒฝ์šฐ๋„ NaN ๊ฐ’์ด ์žˆ๋Š”์ง€ ํ™•์ธํ•ด์•ผํ•œ๋‹ค.

final_tot = pd.merge(pop_tot, area_abb, on = 'abbreviation', how = 'outer')
final_tot
>>

    abbreviation    ages	year	population	state	    area (sq. mi)
0	AL	    total	2012	4817528.0	Alabama	    52423.0
1	AL	    total	2010	4785570.0	Alabama	    52423.0
2	AL	    total	2011	4801627.0	Alabama	    52423.0
3	AL	    total	2009	4757938.0	Alabama	    52423.0
4	AL	    total	2013	4833722.0	Alabama	    52423.0
        ...	...	...	...	...	...	...
1257	USA	    total	2013	316128839.0	NaN	    NaN
1258	USA	    total	2009	306771529.0	NaN	    NaN
1259	USA	    total	2010	309326295.0	NaN	    NaN
1260	USA	    total	2011	311582564.0	NaN	    NaN
1261	USA	    total	2012	313873685.0	NaN	    NaN

1262 rows ร— 6 columns
final_tot.isnull().sum()
>>
abbreviation      0
ages              0
year              0
population        0
state            24
area (sq. mi)    24
dtype: int64

NaN ๊ฐ’์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋””์„œ ๋‚˜์˜จ ๊ฐ’์ธ์ง€ ํ™•์ธํ•˜์ž.

final_tot[final_tot['state'].isnull()]
>>
	abbreviation	ages	year	population	state	area (sq. mi)
1238	    USA	        total	1990	249622814.0	NaN	NaN
1239	    USA	        total	1991	252980942.0	NaN	NaN
1240	    USA	        total	1992	256514231.0	NaN	NaN
1241	    USA	        total	1993	259918595.0	NaN	NaN
1242	    USA	        total	1994	263125826.0	NaN	NaN
1243	    USA	        total	1995	266278403.0	NaN	NaN
1244	    USA	        total	1996	269394291.0	NaN	NaN
1245	    USA	        total	1997	272646932.0	NaN	NaN
1246	    USA	        total	1998	275854116.0	NaN	NaN
1247	    USA	        total	2000	282162411.0	NaN	NaN
1248	    USA	        total	1999	279040181.0	NaN	NaN
1249	    USA	        total	2001	284968955.0	NaN	NaN
1250	    USA	        total	2002	287625193.0	NaN	NaN
1251	    USA	        total	2003	290107933.0	NaN	NaN
1252	    USA	        total	2004	292805298.0	NaN	NaN
1253	    USA	        total	2005	295516599.0	NaN	NaN
1254	    USA	        total	2006	298379912.0	NaN	NaN
1255	    USA	        total	2007	301231207.0	NaN	NaN
1256	    USA	        total	2008	304093966.0	NaN	NaN
1257	    USA	        total	2013	316128839.0	NaN	NaN
1258	    USA	        total	2009	306771529.0	NaN	NaN
1259	    USA	         total	2010	309326295.0	NaN	NaN
1260	    USA	        total	2011	311582564.0	NaN	NaN
1261	    USA	        total	2012	313873685.0	NaN	NaN

ํ™•์ธ ๊ฒฐ๊ณผ USA์—์„œ NaN ๊ฐ’์ด ๋‚˜์˜จ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค
ํ•˜์ง€๋งŒ ์šฐ๋ฆฌ๊ฐ€ ๊ตฌํ•˜๋ ค๋Š” state ๋ณ„ ๋ฉด์ ๋Œ€๋น„ ์ธ๊ตฌ์ˆ˜์™€๋Š” ๊ด€๊ณ„๊ฐ€ ์—†๋Š” ๋‚ด์šฉ์ด๋ฏ€๋กœ ๊ทธ๋ƒฅ dropna()๋ฅผ ํ•ด์„œ ์—†์• ์ฃผ๋„๋ก ํ•˜์ž.

final_tot = final_tot.dropna()
final_tot.head()
>>
	abbreviation	ages	year	population	state	    area (sq. mi)
0	    AL	        total	2012	4817528.0	Alabama	    52423.0
1	    AL	        total	2010	4785570.0	Alabama	    52423.0
2	    AL	        total	2011	4801627.0	Alabama	    52423.0
3	    AL	        total	2009	4757938.0	Alabama	    52423.0
4	    AL	        total	2013	4833722.0	Alabama	    52423.0
final_tot.shape
>> (1238, 6)

์›๋ž˜์˜ 1262ํ–‰์—์„œ 24ํ–‰์˜ NaN๊ฐ’์ด ์ œ๊ฑฐ๋˜์–ด 1238ํ–‰์ด ๋œ ๊ฒƒ๋„ ํ™•์ธํ•˜์ž.

final_tot.isnull().sum()
>>
abbreviation     0
ages             0
year             0
population       0
state            0
area (sq. mi)    0
dtype: int64
final_tot.duplicated().sum()
>> 0

ํ™•์ธ ๊ฒฐ๊ณผ NaN ๊ฐ’๊ณผ ์ค‘๋ณต๊ฐ’์ด ์—†๋Š” ๊น”๋”ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋˜์—ˆ๋‹ค.

4.2. pop_18 ๊ณผ ๋ณ‘ํ•ฉ


์ด ๊ฒฝ์šฐ ์—ญ์‹œ USA ๊ฐ’์ด NaN ์ด ๋‚˜์˜ฌ ๊ฒƒ์ด๋‹ค.
dropna() ๋ฅผ ํ•ด์„œ ์ •๋ฆฌํ•ด์ฃผ์ž.

final_18 = pd.merge(pop_18, area_abb, on = 'abbreviation', how = 'outer')
final_18 = final_18.dropna()
final_18.head()
>>
	abbreviation	ages	year	population	state	    area (sq. mi)
0	    AL	       under18	2012	1117489.0	Alabama	    52423.0
1	    AL	       under18	2010	1130966.0	Alabama	    52423.0
2	    AL	       under18	2011	1125763.0	Alabama	    52423.0
3	    AL	       under18	2009	1134192.0	Alabama	    52423.0
4	    AL	       under18	2013	1111481.0	Alabama	    52423.0

5. ๋ฉด์  ๋‹น ์ธ๊ตฌ์ˆ˜

๐Ÿ† pop/area ๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๋งŒ๋“ค์–ด์„œ ํ•ด๋‹น ๋‚ด์šฉ์„ ๋„ฃ์–ด์ค€๋‹ค.

๐Ÿ† 1. final_tot ์ƒ์„ฑ

final_tot['pop/area'] = final_tot['population']/final_tot['area (sq. mi)']
final_tot.head()
>>
    abbreviation	ages	year	population	state	    area (sq. mi)	pop/area
0	AL	        total	2012	4817528.0	Alabama	    52423.0	        91.897221
1	AL	        total	2010	4785570.0	Alabama	    52423.0	        91.287603
2	AL	        total	2011	4801627.0	Alabama	    52423.0	        91.593900
3	AL	        total	2009	4757938.0	Alabama	    52423.0	        90.760506
4	AL	        total	2013	4833722.0	Alabama	    52423.0	        92.206131

๐Ÿ† 2. fianl_18 ์ƒ์„ฑ

final_18['pop/area'] = final_18['population']/final_18['area (sq. mi)']
final_18.head()
>>
    abbreviation	ages	    year	population	state	    area (sq. mi)	pop/area
0	AL	        under18	    2012	1117489.0	Alabama	    52423.0	        21.316769
1	AL	        under18	    2010	1130966.0	Alabama	    52423.0	        21.573851
2	AL	        under18	    2011	1125763.0	Alabama	    52423.0	        21.474601
3	AL	        under18	    2009	1134192.0	Alabama	    52423.0	        21.635389
4	AL	        under18	    2013	1111481.0	Alabama	    52423.0	        21.202163

๋ชจ๋“  ์ „์ฒ˜๋ฆฌ๊ฐ€ ๋๋‚ฌ๋‹ค!! ์ด์ œ๋ถ€ํ„ฐ๋Š” ํ•„์š”ํ•œ ๋ถ€๋ถ„๋งŒ ๊ฐ€์ ธ์™€์„œ ์‹œ๊ฐํ™”๋ฅผ ํ•˜๊ฑฐ๋‚˜ ๋ถ„์„ํ•˜๋ฉด ๋œ๋‹ค.

6. 2013๋…„ / total์—์„œ ๋ฉด์  ๋‹น ์ธ๊ตฌ์ˆ˜

๐Ÿ† ๋ฉด์ ๋‹น ์ธ๊ตฌ์ˆ˜๊ฐ€ ๋งŽ์€ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌํ•ด์ฃผ์ž.

density_2013_tot = final_tot[final_tot['year']== 2013]
density_2013_tot = density_2013_tot.sort_values('pop/area', ascending=False)
density_2013_tot.head()
>>
    abbreviation  ages	year	population	state	                area (sq. mi)	    pop/area
195	DC	  total	2013	646449.0	District of Columbia	68.0	            9506.602941
1233	PR	  total	2013	3615086.0	Puerto Rico	        3515.0	            1028.473969
723	NJ	  total	2013	8899339.0	New Jersey	        8722.0	            1020.332378
955	RI	  total	2013	1051511.0	Rhode Island	        1545.0	            680.589644
147	CT	  total	2013	3596080.0	Connecticut	        5544.0	            648.643579

์ด์ œ density_2013_tot ๋ฐ์ดํ„ฐ์—์„œ ์ƒ์œ„ 10๊ฐœ์˜ ์ง€์—ญ๋งŒ ๊ฐ€์ ธ์˜ค์ž.

density_2013_tot_top10 = density_2013_tot[['abbreviation', 'pop/area']][:10]
density_2013_tot_top10
>>	
    abbreviation    pop/area
195	DC	    9506.602941
1233	PR	    1028.473969
723	NJ	    1020.332378
955	RI	    680.589644
147	CT	    648.643579
524	MA	    634.090384
483	MD	    477.860401
187	DE	    473.771238
772	NY	    360.736613
236	FL	    297.345722

๐Ÿ† ์ €๋ฒˆ ํฌ์ŠคํŠธ์™€ ์ด๋ฒˆ ํฌ์ŠคํŠธ๋ฅผ ๊ฑฐ์ณ ๋“œ๋””์–ด ์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์—ˆ๋‹ค!! ์ด์ œ ๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ๋Š” ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐํ™”ํ•ด๋ณด์ž.


๐Ÿ’กํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธ์Šค ํ•ธ๋“œ๋ถ(์œ„ํ‚ค๋ถ์Šค,2020)์˜ ์ €์ž Jake VanderPlas ๋ถ„์˜ ๊นƒํ—ˆ๋ธŒ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™”์Œ์„ ๋ฐํž™๋‹ˆ๋‹ค.

Leave a comment