Four Basic Algorithms

1. Linear Regression

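Linear regression: finding the straight line that best fits the data (determining a and b in y = ax + b).

The weights are obtained with the method of least squares, which minimizes the sum of squared differences between the actual and predicted values.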
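As a rough illustration of least squares for a single feature, the sketch below computes a and b from the closed-form formulas. The toy data is made up for this example and is not from the original notes.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = 2.0 * x_toy + 1.0

# Closed-form least squares for one feature:
#   a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - a * mean_x
x_mean, y_mean = x_toy.mean(), y_toy.mean()
a = ((x_toy - x_mean) * (y_toy - y_mean)).sum() / ((x_toy - x_mean) ** 2).sum()
b = y_mean - a * x_mean

print(a, b)
```

With noise-free data the slope and intercept are recovered exactly; with real data they are the values that minimize the squared error.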

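coef_ : weights (regression coefficients)

intercept_ : bias (intercept)

Ridge (L2 penalty: shrinks the coefficients toward zero), Lasso (L1 penalty: can set coefficients exactly to zero, which removes variables), ElasticNet (L1 + L2)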
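The difference between the two penalties can be seen on synthetic data. This is only a sketch: the data, the alpha values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Only the first feature drives the target; the other two are noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)   # L2: shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)    # L1: zeroes out weak coefficients

print('OLS  :', ols.coef_.round(3))
print('Ridge:', ridge.coef_.round(3))
print('Lasso:', lasso.coef_.round(3))
```

Ridge keeps every feature with smaller weights, while Lasso drops the two noise features entirely.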

1-1. Simple Regression

A single x variable (one feature).

1-2. Multiple Regression

Two or more x variables (several features).
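The following sketch fits both variants with scikit-learn on a made-up, noise-free dataset (the data and names are assumptions for illustration). The only difference is how many columns are passed as features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 2*x1 - 1*x2 + 0.5, with no noise
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0], [5.0, 3.0]])
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

simple = LinearRegression().fit(X_demo[:, [0]], y_demo)  # simple: one feature
multiple = LinearRegression().fit(X_demo, y_demo)        # multiple: two features

print(simple.coef_.shape, multiple.coef_.shape)  # one weight per feature
print(multiple.coef_.round(2), multiple.intercept_.round(2))
```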

1-3. Inspecting the Regression Coefficients

# Check the regression coefficients
# print('* Features:', list(x))
print('* Weights:', model.coef_.round(2))
print('* Bias:', model.intercept_.round(2))

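1-4. Visualizing the Regression Coefficients

# Visualize the weights (x is the feature DataFrame used for training)
import pandas as pd
import matplotlib.pyplot as plt

tmp = pd.DataFrame()
tmp['feature'] = list(x)
tmp['weight'] = model.coef_
tmp.sort_values(by='weight', ascending=True, inplace=True)
plt.figure(figsize=(3,5))
plt.barh(tmp['feature'], tmp['weight'])
plt.show()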

1-5. Drawing the Regression Line

# dist = a * speed + b
import numpy as np

a = model.coef_[0]       # slope for the single feature 'speed'
b = model.intercept_
speed = np.array([x_train['speed'].min(), x_train['speed'].max()])
dist = a * speed + b
# print(speed)
# print(dist)

# Plot the training data together with the fitted line
plt.scatter(x=x_train['speed'], y=y_train)   # training data
# plt.scatter(x=x_test, y=y_test)            # test data
plt.plot(speed, dist, color='red')           # regression line
plt.axhline(y_train.mean(), linestyle='--')  # mean of y
plt.xlabel('Speed(mph)')
plt.ylabel('Dist(ft)')
plt.show()

1-6. Visualizing Predicted vs. Actual Values

# Plot the predicted and actual values
plt.rc('font',size=8)
plt.rc('axes',linewidth=0.3)
plt.figure(figsize=(12,3))
plt.plot(y_test.values, label='Actual',linewidth=0.7, marker='o', markersize=2)
plt.plot(y_pred, label='Predicted',linewidth=0.7, marker='o', markersize=2)
plt.title('Actual vs Predicted', size=15, pad=10)
plt.legend()
plt.ylabel('Dist(ft)')
plt.show()

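2. KNN (K-Nearest Neighbors)

Predicts from the k nearest neighbors of a data point; finding an appropriate value of k is important.

Regression: the mean of the neighbors' values.

Classification: the most frequent class (mode) among the neighbors (k is usually chosen to be odd, to avoid ties).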
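The two prediction rules can be sketched in a few lines of NumPy. This is a simplified illustration with made-up data and a hypothetical helper `knn_predict`, not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k, task):
    """Predict for one query point from its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    neighbors = y_train[nearest]
    if task == 'regression':
        return neighbors.mean()                               # mean of the values
    return Counter(neighbors.tolist()).most_common(1)[0][0]   # most frequent class

X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_value = np.array([10.0, 20.0, 30.0, 100.0, 110.0])
y_class = np.array([0, 0, 1, 1, 1])

reg_pred = knn_predict(X_demo, y_value, np.array([2.5]), k=3, task='regression')
cls_pred = knn_predict(X_demo, y_class, np.array([2.5]), k=3, task='classification')
print(reg_pred, cls_pred)
```

For the query 2.5 the three nearest points are 1, 2, and 3, so the regression result is the mean of 10, 20, and 30, and the classification result is the majority class among 0, 0, and 1.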

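Because the features are measured on different scales, features with large ranges dominate the distance calculation, so the data must be normalized.

In particular, both the training data and the test data are normalized with min-max scaling: the scaler is fitted on the training data only, and the same transformation is then applied to the test data.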
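What min-max scaling does can be written out by hand. The arrays below are invented for illustration; the point is that the minimum and maximum come from the training data only.

```python
import numpy as np

x_train_raw = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
x_test_raw = np.array([[15.0, 700.0]])

# Learn the per-column min and max from the training data only
col_min = x_train_raw.min(axis=0)
col_max = x_train_raw.max(axis=0)

# Apply the same transformation to both sets
x_train_scaled = (x_train_raw - col_min) / (col_max - col_min)
x_test_scaled = (x_test_raw - col_min) / (col_max - col_min)

print(x_train_scaled)
print(x_test_scaled)  # test values can fall outside [0, 1]
```

The second test value is larger than anything seen in training, so it scales to a value above 1. That is expected and is preferable to refitting the scaler on the test data.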

2-1. Normalization

# Import the module
from sklearn.preprocessing import MinMaxScaler

# Normalize: fit on the training data, then reuse the fitted scaler on the test data
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

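- n_neighbors can be set when declaring the model

from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)

- Evaluating a classification model (e.g. KNeighborsClassifier)

# Step 5: evaluate
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))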

2-2. Visualizing the Data After Normalization

plt.figure(figsize=(6,3))
plt.boxplot(x_train, vert=False, labels = list(x)) # horizontal boxes
plt.show()

3. Decision Tree

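A supervised learning algorithm that learns by asking a sequence of questions about the features.

It is a white-box model: the decision process can actually be inspected visually.

Root Node, Terminal Node (leaf), Depth

- Pruning

Adjust the hyperparameters to prune the tree and reduce overfitting.

Three main hyperparameters

max_depth : maximum depth of the tree
min_samples_split : minimum number of samples required to split a node
min_samples_leaf : minimum number of samples required at a leaf (terminal) node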
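The effect of these three hyperparameters can be checked by comparing an unrestricted tree with a restricted one. The sketch below uses scikit-learn's bundled iris dataset and arbitrary parameter values, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure
full = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Restricted tree: the three hyperparameters limit how far it can grow
pruned = DecisionTreeClassifier(max_depth=3,
                                min_samples_split=10,
                                min_samples_leaf=5,
                                random_state=1).fit(X_demo, y_demo)

print('full  :', full.get_depth(), full.get_n_leaves())
print('pruned:', pruned.get_depth(), pruned.get_n_leaves())
```

A shallower tree with fewer leaves fits the training data less closely, which is the intended trade-off when reducing overfitting.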

3-1. Gini Impurity

Gini impurity = 1 - (positive class proportion^2 + negative class proportion^2)

0 means a perfectly pure node; 0.5 is the maximum impurity for two classes.
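The formula translates directly into code. The helper name `gini` and the label lists are assumptions made for this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini([1, 1, 1, 1])    # only one class in the node
mixed = gini([0, 0, 1, 1])   # two classes, evenly split
skewed = gini([0, 1, 1, 1])  # mostly one class
print(pure, mixed, skewed)
```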

3-2. Entropy

0 when the node is pure; 1 at maximum impurity (for two classes, using log base 2).
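A minimal sketch of entropy with log base 2, assuming the usual definition (the negative sum of p * log2(p) over the classes). The helper name `entropy` is chosen for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure_entropy = entropy([1, 1, 1, 1])   # one class only
mixed_entropy = entropy([0, 0, 1, 1])  # two classes, evenly split
print(pure_entropy, mixed_entropy)
```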

3-3. Information Gain

Information gain = parent entropy - weighted average entropy of the children

The larger the information gain, the more the split reduces impurity.
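Information gain for a candidate split can be sketched as follows, assuming each child's entropy is weighted by its share of the parent's samples. The helper names and label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
gain_perfect = information_gain(parent, [[0, 0, 0], [1, 1, 1]])  # pure children
gain_weak = information_gain(parent, [[0, 1, 0], [1, 0, 1]])     # mixed children
print(gain_perfect, round(gain_weak, 3))
```

The tree prefers the first split, because it removes all of the parent's impurity.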

3-4. Visualization (with Graphviz)

# Step 2: declare the model
model = DecisionTreeClassifier(max_depth=5,random_state=1)

# Step 5: evaluate
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

class_names are assigned in ascending order of the class labels [0, 1]. Orange: 0, blue: 1

# Import the visualization modules
from sklearn.tree import export_graphviz
from IPython.display import Image

# Create the image file
export_graphviz(model,                                 # fitted model
                out_file='tree.dot',                   # output file name
                feature_names=list(x),                 # feature names
                class_names=['die', 'survived'],       # target class names
                rounded=True,                          # rounded box corners
                precision=2,                           # decimal places for impurity
                max_depth=3,                           # depth of the tree to draw
                filled=True)                           # fill the boxes with color

# Convert the file
!dot tree.dot -Tpng -otree.png -Gdpi=300

# Display the image file
Image(filename='tree.png')

- Visualizing feature importance

model.feature_importances_

# Feature importance
plt.figure(figsize=(5, 5))
plt.barh(y=list(x), width=model.feature_importances_)
plt.show()

4. Logistic Regression

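Models a classification problem in terms of probability by passing the linear regression equation through the logistic function.

Output : a value greater than 0 and less than 1, i.e. in the open interval (0, 1).
Default threshold : 0.5, but it can be raised or lowered to make the decision criterion stricter or more lenient.

The logistic function is one kind of sigmoid (S-shaped) function and is commonly just called the sigmoid function. It is not the loss function: the loss minimized during training is the log loss (cross-entropy).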
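A minimal sketch of the sigmoid function in plain Python, to show how any real-valued score is squeezed into (0, 1). The helper name `sigmoid` is chosen for illustration; `scipy.special.expit` is a numerically robust equivalent.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real number z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

A score of 0 maps to exactly 0.5, which is why 0.5 is the natural default threshold.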

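4-1. Code

max_iter : maximum number of iterations the solver may take to minimize the loss function (default is 100)

f(x) : linear decision function (regression equation)

Loss function : compares the predicted probabilities with the actual labels; the weights are updated toward the values that minimize it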
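To make the role of the loss function concrete, here is a sketch of binary log loss (cross-entropy) in plain Python. The helper name and the example probabilities are assumptions for illustration; scikit-learn additionally applies regularization by default, which this sketch leaves out.

```python
import math

def binary_log_loss(y_true, p_pred):
    """Mean binary cross-entropy between 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
good = binary_log_loss(y_true, [0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = binary_log_loss(y_true, [0.4, 0.6, 0.3, 0.7])   # leaning the wrong way
print(round(good, 3), round(bad, 3))
```

Probabilities that agree with the labels give a smaller loss, so minimizing the loss pushes the weights toward better predictions.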

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500, random_state=1)

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

4-2. How It Works (Binary and Multiclass Classification)

# Check the regression coefficients
print(list(x))
print(model.coef_)
print(model.intercept_)

# Linear decision function (f(x), z)
z = model.decision_function(x_test)
print(z[10:21])

# Sigmoid function - binary classification (z is 1-D)
from scipy.special import expit
print(expit(z)[10:21].round(2))

# Softmax function - multiclass classification (z is 2-D: samples x classes)
from scipy.special import softmax
print(softmax(z, axis=1)[:5].round(2))

# Check the probabilities
p = model.predict_proba(x_test)
print(p[10:21].round(2))
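The chain from coefficients to probabilities can be verified end to end for the binary case. The sketch below uses a synthetic dataset and the names `clf` and `X_demo`, all of which are assumptions for illustration.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
clf = LogisticRegression(max_iter=500, random_state=1).fit(X_demo, y_demo)

# Step 1: linear score from the coefficients and the intercept
z_manual = X_demo @ clf.coef_[0] + clf.intercept_[0]
# Step 2: the same score as returned by the model
z_model = clf.decision_function(X_demo)
# Step 3: the sigmoid of the score is the probability of class 1
p1_model = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(z_manual, z_model))
print(np.allclose(expit(z_model), p1_model))
```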

4-3. Adjusting the Threshold for Binary Classification

# Get the probabilities
p = model.predict_proba(x_test)
p1 = p[:, 1]  # keep only the probability of class 1
# Threshold = 0.45
y_pred2 = [1 if prob > 0.45 else 0 for prob in p1]
print(classification_report(y_test, y_pred2))
