1. ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ

๐Ÿ† ๋ฐ์ดํ„ฐ๋Š” ๋‹ค์Œ url์—์„œ ๊ฐ€์ ธ์™”๋‹ค.

https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv

https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv

https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv

1.1 population data


๐Ÿ† ์ง€์—ญ, ์—ฐ๋ น๋Œ€(total / under18), ๋ฐ์ดํ„ฐ์ˆ˜์ง‘๋…„๋„, ์ธ๊ตฌ์ˆ˜๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๋ฐ์ดํ„ฐ

population = pd.read_csv("Data/state-population.csv", encoding = 'utf-8-sig')
population.head()
>>
	state/region	ages	year	population
0	AL	       under18	2012	 1117489.0
1	AL	       total	2012	 4817528.0
2	AL	       under18	2010	 1130966.0
3	AL	       total	2010	 4785570.0
4	AL	       under18	2011	 1125763.0

1.2 areas data


๐Ÿ† ๊ฐ state ๋ณ„ ๋ฉด์ ์„ ๋ณด์—ฌ์ฃผ๋Š” ๋ฐ์ดํ„ฐ

areas = pd.read_csv("Data/state-areas.csv", encoding = 'utf-8-sig')
areas.head()
>>
	state	    area (sq. mi)
0	Alabama	      52423
1	Alaska	      656425
2	Arizona	      114006
3	Arkansas      53182
4	California    163707

1.3 abbrevs data


๐Ÿ† ๊ฐ state์˜ ์•ฝ์–ด(abbreviation)๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๋ฐ์ดํ„ฐ

abbrevs = pd.read_csv("Data/state-abbrevs.csv", encoding = 'utf-8-sig')
abbrevs.head()
>>
	state	   abbreviation
0	Alabama	        AL
1	Alaska	        AK
2	Arizona	        AZ
3	Arkansas	AR
4	California	CA

2. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

2.1.population data


pop = pd.read_csv("Data/state-population.csv", encoding = 'utf-8-sig')
pop.head()
>>
	state/region	ages	year	population
0	AL	       under18	2012	 1117489.0
1	AL	       total	2012	 4817528.0
2	AL	       under18	2010	 1130966.0
3	AL	       total	2010	 4785570.0
4	AL	       under18	2011	 1125763.0

2.1.1. null ๊ฐ’ ์ฒ˜๋ฆฌ


pop.isnull().sum()
>>
state/region     0
ages             0
year             0
population      20
dtype: int64

population column์— NaN๊ฐ’์ด ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
๋”ฐ๋ผ์„œ population ์—ด์—์„œ NaN ๊ฐ’์„ ๊ฐ€์ง€๋Š” ํ–‰์„ ๊ฐ€์ ธ์™€๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

pop[pop['population'].isnull()]
>>
	state/region	ages	year	population
2448	     PR	      under18	1990	   NaN
2449	     PR	      total	1990	   NaN
2450	     PR	      total	1991	   NaN
2451	     PR	      under18	1991	   NaN
2452	     PR	      total	1993	   NaN
2453	     PR	      under18	1993	   NaN
2454	     PR	      under18	1992	   NaN
2455	     PR	      total	1992	   NaN
2456	     PR	      under18	1994	   NaN
2457	     PR	      total	1994	   NaN
2458	     PR	      total	1995	   NaN
2459	     PR	      under18	1995	   NaN
2460	     PR	      under18	1996	   NaN
2461	     PR	      total	1996	   NaN
2462	     PR	      under18	1998	   NaN
2463	     PR	      total	1998	   NaN
2464	     PR	      total	1997	   NaN
2465	     PR	      under18	1997	   NaN
2466	     PR	      total	1999	   NaN
2467	     PR	      under18	1999	   NaN

state/region ์ด PR์ธ ํ–‰์—์„œ ๋ชจ๋“  NaN ๊ฐ’์ด ๋ฐœ์ƒํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด ์ด NaN ๊ฐ’์„ ์ฒ˜๋ฆฌํ•ด์ค˜์•ผ ํ•˜๋Š”๋ฐ, ๊ทธ ์ „์— PR์˜ ์ •๋ณด๋ฅผ ์•Œ๊ธฐ ์œ„ํ•ด์„œ state/region ์ด PR์ธ ํ–‰์—์„œ NaN์ด ์•„๋‹Œ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•ด๋ณด์ž.

PR = pop[pop['state/region']=='PR']
PR = PR[PR['population'].notnull()]
PR.head()
>>
	state/region	ages	year	population
2468	     PR	        total	2000	3810605.0
2469	     PR	       under18	2000	1089063.0
2470	     PR	        total	2001	3818774.0
2471	     PR	       under18	2001	1077566.0
2472	     PR	        total	2002	3823701.0
PR.tail()
>>
	state/region	ages	year	population
2491	     PR	       under18	2010	896945.0
2492	     PR	       under18	2011	869327.0
2493	     PR	        total	2011	3686580.0
2494	     PR	       under18	2012	841740.0
2495	     PR	        total	2012	3651545.0

์ด๋ฅผ ๋ณด๋ฉด PR ์ง€์—ญ์€ 2000๋…„๋ถ€ํ„ฐ ์ง‘๊ณ„๊ฐ€ ์‹œ์ž‘๋˜์—ˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
๋˜ํ•œ ์œ„์˜ NaN๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋ฉด 1990~1999 ๊นŒ์ง€๋Š” NaN ๊ฐ’์ด๊ณ , 2000๋…„๋ถ€ํ„ฐ๋Š” ์ค‘๊ฐ„์— ๋Š๊ธด ๋ถ€๋ถ„์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— PR์—์„œ๋Š” ๊ทธ๋ƒฅ NaN ๊ฐ’์„ drop์‹œํ‚ค๋ฉด ๋ ๊ฒƒ๊ฐ™๋‹ค.

pop = pop.dropna()
pop.isnull().sum()
>>
state/region    0
ages            0
year            0
population      0
dtype: int64

dropna ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•ด๋ณด๋‹ˆ ์ด์ œ NaN ๊ฐ’์ด ์—†๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
์ด์ œ๋Š” ๋’ค์— ์žˆ์„ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ์œ„ํ•ด ์—ด์˜ ์ด๋ฆ„์„ ๋ฐ”๊ฟ”์ฃผ์ž.
abbrevs data ์˜ abbreviation ์—ด์— ์ง€์—ญ์˜ ์•ฝ์–ด๊ฐ€ ๋“ค์–ด๊ฐ€ ์žˆ๊ธฐ ๋–„๋ฌธ์— population ๋ฐ์ดํ„ฐ์˜ state/region ์—ด์„ ๋งž์ถฐ์ฃผ๋ฉด ๋ ๊ฒƒ ๊ฐ™๋‹ค.
population ๋ฐ์ดํ„ฐ์˜ ์—ด์„ ๋ฆฌ์ŠคํŠธ๋กœ ๋ฐ›์•„์˜ค๊ณ , ๊ทธ ๋ฆฌ์ŠคํŠธ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์—ด์˜ ์ด๋ฆ„์„ ๋ฐ”๊ฟ”์ฃผ์—ˆ๋‹ค.

cols = pop.columns.to_list()
cols[0] = 'abbreviation'
pop.columns = cols
pop.columns
>> Index(['abbreviation', 'ages', 'year', 'population'], dtype='object')

์ž˜ ๋ณ€ํ˜•๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
NaN ๊ฐ’์„ ์—†์• ์คฌ์œผ๋‹ˆ, ์ด์ œ๋Š” ์ค‘๋ณต๊ฐ’์ด ์žˆ๋Š”์ง€๋„ ํ™•์ธํ•ด๋ณด์ž.
duplicated() ์™€ value_counts๋ฅผ ์ด์šฉํ•œ๋‹ค.

pop.duplicated().sum()
>> 0
pop.value_counts().max()
>> 1

์ค‘๋ณต๊ฐ’์ด ์—†๋‹ค๋Š” ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

2.1.2. total / under18 ๋ถ„๋ฅ˜ํ•˜๊ธฐ


๐Ÿ† ์—ฐ๋ น๋Œ€๋ณ„๋กœ ์ง€ํ‘œ๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด ๋ถ„๋ฅ˜

pop_tot = pop[pop['ages'] == 'total']
pop_tot.head()
>>
	abbreviation	ages	year	population
1	     AL	        total	2012	4817528.0
3	     AL	        total	2010	4785570.0
5	     AL	        total	2011	4801627.0
6	     AL	        total	2009	4757938.0
9	     AL	        total	2013	4833722.0
pop_18 = pop[pop['ages'] == 'under18']
pop_18.head()
>>
	abbreviation	ages	year	population
0	     AL	       under18	2012	1117489.0
2	     AL	       under18	2010	1130966.0
4	     AL	       under18	2011	1125763.0
7	     AL	       under18	2009	1134192.0
8	     AL	       under18	2013	1111481.0

2.2. areas Data


area = pd.read_csv("Data/state-areas.csv", encoding = 'utf-8-sig')
area.head()
>>
	state	    area (sq. mi)
0	Alabama	        52423
1	Alaska	        656425
2	Arizona	        114006
3	Arkansas	53182
4	California	163707
area.shape
>> (52, 2)

2.2.1. null๊ฐ’/ ์ค‘๋ณต๊ฐ’ ์ฒ˜๋ฆฌํ•˜๊ธฐ


area.isnull().sum()
>>
state            0
area (sq. mi)    0
dtype: int64
area.duplicated().sum()
>> 0

NaN ๊ฐ’๊ณผ ์ค‘๋ณต๊ฐ’ ๋ชจ๋‘ ์—†์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ธฐ๋ถ„์ด ์ข‹๋‹ค.

2.3. abbreves data


abb = pd.read_csv("Data/state-abbrevs.csv", encoding = 'utf-8-sig')
abb.head()
>>
	state	   abbreviation
0	Alabama	        AL
1	Alaska	        AK
2	Arizona	        AZ
3	Arkansas	AR
4	California	CA
abb.shape
>> (51, 2)

2.3.1. null๊ฐ’/ ์ค‘๋ณต๊ฐ’ ์ฒ˜๋ฆฌํ•˜๊ธฐ


abb.isnull().sum()
>>
state           0
abbreviation    0
dtype: int64
abb.duplicated().sum()
>> 0

abbreves ๋ฐ์ดํ„ฐ๋„ NaN ๊ฐ’๊ณผ ์ค‘๋ณต๊ฐ’์ด ์—†๋Š” ๊น”๋”ํ•œ ๋ฐ์ดํ„ฐ์ด๋‹ค.


๐Ÿ† ์ด๋ ‡๊ฒŒ ์„ธ ๊ฐ€์ง€ ๋ฐ์ดํ„ฐ์˜ NaN ๊ฐ’, ์ค‘๋ณต๊ฐ’์„ ํ™•์ธํ•˜๊ณ  ์ฒ˜๋ฆฌํ–ˆ๋‹ค.
๐Ÿ† ๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ๋Š” ์ด ๋ฐ์ดํ„ฐ๋“ค์„ ๋ณ‘ํ•ฉํ•˜๊ณ  ํ•„ํ„ฐ๋งํ•ด์„œ ์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ ํ˜•ํƒœ๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •์„ ๋‹ค๋ฃฐ๊ฒƒ์ด๋‹ค.


๐Ÿ’กํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธ์Šค ํ•ธ๋“œ๋ถ(์œ„ํ‚ค๋ถ์Šค,2020)์˜ ์ €์ž Jake VanderPlas ๋ถ„์˜ ๊นƒํ—ˆ๋ธŒ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™”์Œ์„ ๋ฐํž™๋‹ˆ๋‹ค.

Leave a comment