[Python] Processing Numeric Data

by 퍼포마첼라 2025. 3. 14.

 

Calculating new values

 

Load the existing data.

 

Obesity is calculated as waist circumference divided by hip circumference.

 

Use the round() function so the value is displayed to two decimal places.

 

Create a new column and store the result in it.

 

이런 μƒˆλ‘œμš΄ λ°μ΄ν„°λ‘œ λΉ„λ§Œλ„μ™€ 당뇨(diabetes) 사이에 μ–΄λ–€ 상관관계가 μžˆλŠ”μ§€λ„ μ•Œμ•„λ³Ό 수 μžˆκ² λ‹€.

 

Exercise 1

import pandas as pd

patient_df = pd.read_csv('data/patient.csv')

# Write your code here.
bmi = round(patient_df['weight'] / (patient_df['height'] *  patient_df['height']), 1)

patient_df['bmi'] = bmi
patient_df

 

I didn't know that squaring can be written as **2. I should remember that.

 

# Model answer
import pandas as pd

patient_df = pd.read_csv('data/patient.csv')

patient_df['bmi'] = round(patient_df['weight'] / patient_df['height']**2, 1)
patient_df

 


Normalization

 

Scaling

 

Scaling methods

1. Normalization

2. Standardization

 

How to normalize

 

While typing the expression as shown earlier, the line got too long, so I pressed Enter to break it and got an error.

 

 

\

Adding a backslash at the end of the line makes the error go away.

 

Store the normalized values and check them.
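As a rough sketch of the line-break issue, here is min-max normalization written two ways on a small made-up Series (`weights` is hypothetical, not the lecture data). Besides the backslash, wrapping the whole expression in parentheses also lets Python continue the statement onto the next line.

```python
import pandas as pd

# Hypothetical weights in kg; not the lecture's data.
weights = pd.Series([50.0, 60.0, 80.0])

# Option 1: a trailing backslash continues the statement on the next line.
normalized_a = (weights - weights.min()) / \
    (weights.max() - weights.min())

# Option 2: parentheses around the whole expression allow line breaks
# without any backslash.
normalized_b = (
    (weights - weights.min())
    / (weights.max() - weights.min())
)

# After min-max normalization the smallest value is 0 and the largest is 1.
print(normalized_b.min(), normalized_b.max())
```

Both forms compute the same values, so which one to use is mostly a matter of taste.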

 

 

Exercise 1

import pandas as pd

patient_df = pd.read_csv('data/patient.csv')

# Write your code here.
patient_df['weight'] = (patient_df['weight'] - patient_df['weight'].min()) / \
(patient_df['weight'].max() - patient_df['weight'].min())

patient_df

 


Standardization

 

 

Original data

 

평균값 / ν‘œμ€€νŽΈμ°¨

 

κ³„μ‚°ν•œ 값을 μ €μž₯ν•΄μ£Όκ³  확인

 

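If standardization worked correctly, the mean becomes 0 and the standard deviation becomes 1. The mean may be printed as a tiny number in scientific notation such as 2.424625e-17, which is effectively 0.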
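A small sketch of that check, using a made-up Series (`weights` is hypothetical). Note that pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, so the result has a standard deviation of 1 in that sense.

```python
import pandas as pd

# Hypothetical weights in kg; not the lecture's data.
weights = pd.Series([60.0, 70.0, 80.0])

# Subtract the mean, then divide by the standard deviation.
standardized = (weights - weights.mean()) / weights.std()

# The mean should be (approximately) 0 and the standard deviation 1.
print(standardized.mean())
print(standardized.std())
```

On real data the mean is often not exactly 0 because of floating-point rounding, which is where the scientific-notation output comes from.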

https://www.codeit.kr/tutorials/143/scientific-notation-and-fixed-point-notation

 

과학적 ν‘œκΈ°λ²•κ³Ό κ³ μ • μ†Œμˆ˜μ  ν‘œκΈ°λ²• | μ½”λ“œμž‡

| pandasλ₯Ό μ‚¬μš©ν•˜λ‹€ 보면 μˆ«μžκ°€ `2.424625e-17`, `6.974262e-01`... 이런 μ‹μœΌλ‘œ 좜λ ₯될 λ•Œκ°€ μžˆμŠ΅λ‹ˆλ‹€. 이건 '과학적 ν‘œκΈ°λ²•(Scientific Notation)'μ΄λΌλŠ” λ°©λ²•μœΌλ‘œ 숫자λ₯Ό ν‘œν˜„ν•œ κ²λ‹ˆλ‹€. 과학적 ν‘œκΈ°λ²•μ€ 숫자

www.codeit.kr

 

정리

 

Exercise 1

import pandas as pd

patient_df = pd.read_csv('data/patient.csv')

# Write your code here.
patient_df['weight'] = (patient_df['weight'] - patient_df['weight'].mean()) / \
patient_df['weight'].std()

patient_df

 


Binning data with the cut() function

 

 

Original data

 

Check the minimum and maximum values to decide on the age bins.

 

pd.cut()

Use the pandas cut function to specify the bins.

You can then see the data split into 20-30, 30-40, and so on.

 

Store the result in a new column.

 

Looking at the result, there are NaN values.

In (40.0, 50.0], the brackets mean that 40 is not included and 50 is included.

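A parenthesis '(' marks an exclusive bound (strictly greater than or less than), and a square bracket ']' marks an inclusive bound (greater than or equal to, or less than or equal to).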

κ·Έλž˜μ„œ 20살인 ν™˜μžμ˜ λ‚˜μ΄ 그룹이 NaN이 λ‚˜μ˜€κ²Œ 된 것이닀.

 

right = False

The right parameter defaults to True, which gives the result above. Changing it to False makes each bin include its left edge and exclude its right edge, so a bin becomes, for example, 50 or more and less than 60.
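A minimal sketch of `right=False`, again with made-up ages and bin edges:

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on bin edges.
ages = pd.Series([20, 30, 50, 59])

# right=False closes each bin on the left: 20 <= age < 30, 30 <= age < 40, ...
age_groups = pd.cut(ages, bins=[20, 30, 40, 50, 60], right=False)

print(age_groups)
```

Now 20 is assigned to a bin, but an age of exactly 60 would fall outside the last bin, so the top edge still needs some care.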

 

labels

The labels parameter can be used to give the bins names that are easier to read.
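For example, assuming the same kind of made-up ages, one label is passed per bin, so there is always one fewer label than bin edges:

```python
import pandas as pd

# Hypothetical ages; not the lecture's data.
ages = pd.Series([20, 35, 50])

# Five bin edges define four bins, so four labels are needed.
age_groups = pd.cut(
    ages,
    bins=[20, 30, 40, 50, 60],
    right=False,
    labels=['20s', '30s', '40s', '50s'],
)

print(age_groups.tolist())
```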

 

Exercise 1

import pandas as pd
import numpy as np

patient_df = pd.read_csv('data/patient.csv')
patient_df['bmi'] = round(patient_df['weight'] / patient_df['height']**2, 1)

# Write your code here.
bmi_max = patient_df['bmi'].max() + 1  # upper edge of the last BMI bin
patient_df['obesity'] = pd.cut(patient_df['bmi'], bins=[0, 18.5, 25, 30, bmi_max], right=False, labels=['under','healthy','over','obese'])
patient_df

 


apply()

import pandas as pd

df = pd.DataFrame([[1, 9], [4, 16]], columns=['column1', 'column2'])
df

import numpy as np

np.sqrt(9)  # output -> 3.0

df['column2'].apply(np.sqrt)

0    3.0
1    4.0
Name: column2, dtype: float64

 

Exercise

patient_df = pd.read_csv('data/patient.csv')
patient_df.head()

def group_age(x):
    # Map an age to a decade label.
    # Note: anything outside 10-59 (including ages under 10) falls through to '60s'.
    if 10 <= x < 20:
        return '10s'
    elif 20 <= x < 30:
        return '20s'
    elif 30 <= x < 40:
        return '30s'
    elif 40 <= x < 50:
        return '40s'
    elif 50 <= x < 60:
        return '50s'
    else:
        return '60s'

patient_df['age_group'] = patient_df['age'].apply(group_age)
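As a side note, the same idea can be sketched more compactly with integer division. `decade_label` is a hypothetical helper of my own, not part of the exercise, and it behaves slightly differently: ages under 10 become '0s' instead of '60s'.

```python
import pandas as pd

def decade_label(age):
    """Hypothetical helper: map an age to a label such as '20s', capped at '60s'."""
    decade = min(int(age) // 10 * 10, 60)
    return f'{decade}s'

# Made-up ages for illustration.
ages = pd.Series([15, 20, 59, 73])

# apply() calls the function once per value and collects the results.
print(ages.apply(decade_label).tolist())
```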


Codeit 16. Processing numeric data