Anomaly detection

Anomaly detection 개념 학습을 목적으로 정리한 글입니다.

Anomaly 개념

Anomaly / Novelty / Outlier 개념
- Anomalies : 정상 데이터와 본질적으로 다르고 일반적으로 발생빈도가 적은 객체
- Novelties : 정상 데이터와 본질적으로 같지만 유형이 다른 객체
- Outliers VS Noise data 차이 : 노이즈는 데이터 수집에 있어 자연 발생적인 변동성이다. (Noise is random error or variance in a measured variable)

Anomaly 의미

데이터 생성 매커니즘 관점 : generated by a different mechanism
밀도 관점 : true probability density is very low 매우 낮은 발생빈도

Anomaly의 종류

Global Outlier : 정상 데이터와 본질적으로 다르고 일반적으로 발생빈도가 적은 객체
Contextual/Contitional Outlier (local outlier) : 특정 맥락이 선행될 때 이상치로 판달될 수 있는 객체
Collective/Group outlier : 한 번 이상치가 발생할 때 대규모로 발생

Anomaly Detection

Anomaly Detection 의 적용 사례

Industrial Monitoring, System Security, Fraud Detection 등 다양한 분야에 적용되고 있다.

Anomaly Dectection에서의 Label

Label of Anomalies 유무에 따라 모델 학습 과정이 달라진다.

지도 학습
- 높은 Detection 성능, classification 모델 사용 가능.
- label 정보가 존재하는 경우 현실적으로 드물며, lass imbalance 문제 존재
반지도 학습 (일반적인 연구방식)
- 대부분의 경우 정상 데이터에 대한 label 확보가능하여 가장 현실적인 케이스. 비지도 학습보다 성능 보장.
- 정상데이터만으로 학습하여 representation feature 과적합 가능성 높으며, 지도학습보다 낮은 성능.
비지도 학습
- 전체 데이터 셋이 정상이라는 가정이 필요하며, 가장 낮은 성능, noise에 민감함.

Classification vs Anomaly Detection

Anomaly Detection의 목적은 Normal, Abnormal 구분하는 것으로 Supervised Learning 이지만, 실질적인 수행 방식은 Unsupervised Learning이다.
1. Binary Classification : 지도학습으로 주어진 데이터를 잘 구분하는 분류 경계면을 찾는 것이 목적이기 때문에 A, B 모두 Normal로 분류한다.
2. Anomaly Detection : Abnormal data가 소수, 클래스(범주) 대표성이 떨어지기 때문에 Normal data 만 활용하여 어디까지가 Normal인지 boundary 탐지한다. A,B는 Normal이 아니다 로 결론 짓는다. (Abnormal data도 여러 범주가 있을 수 있다.)

둘 중 어느 방법론으로 문제에 접근해야할 지 결정하기 위해 2가지를 고려한다.

범주간의 불균형이 심한가?
abnormal class(minority class) 샘플이 충분한가?
- 충분한 경우 class imbalance 완화 (oversampling/undersampling) 후 classification으로 문제에 접근한다.
- abnormal class(minority class) 샘플이 충분하지 않은 경우 (극소수인 경우) Anomaly Detection으로 접근한다.

Generalization vs Specialization trade-off

Anomaly Detection Model

Anomaly detection 모델링 : Normal data만 Training data로 활용

Anomaly detection 모델링할 때 Normal data만 Training data로 활용한다.

Anomaly Detection Model을 머신러닝, 딥러닝, Hybrid 3가지 유형으로 분류할 수 있다.
(Assumption : 반지도학습 & normal data 가 abnormal data 보다 많다.)

머신러닝

비정형(이미지,자연어) 작동 어려우나, 정형 데이터에 대해서 준수한 성능

OC-SVM, SVDD / Isolation Forest / Clustering

딥러닝

비정형, 시계열 데이터에 대해서 수행 가능, 모델 학습을 위한 loss 함수 설계가 key.

Autoencoder 계열 / Word2Vec 계열 / GAN 계열 / Deep SVDD (one-class Neural Network)

Hybird (머신러닝, 딥러닝 혼합)

딥러닝 모델을 Feature Extractor으로 활용, 해당 output 정형 벡터에 머신러닝 적용

End-to-End 학습이 불가능하며, 최적화된 Feature 선택된다는 보장이 없음

AutoEncoder 계열 + 머신러닝 / Word2Vec(Embedding) 계열 + 머신러닝

Performance Measures 평가 지표

Confusion Matrix for novelty detection
when the cut-off (threshold) is set
Detection Rate = Sensitivity
EER(Equal Error Rate), IE(Integrated Error)

FRR = FAR → EER
모델링 결과 Abnormal/normal score(가능성, 연속적인 값)을 최종 결과로 반환, 분류AUROC 처럼 cutoff 변경하면서 performance 측정한 EER(Equal Error Rate), IE(Integrated Error) 두 값 모두 lower the better

Challenges of Anomaly Detection

Gray Area : Normal vs Abnormal data 의 border가 명확하지 않음
Application-specific outlier detection : 알고리즘이 도메인에 종속적, 도메인에 따라 Abnormal 판별기준이 달라진다.
- Clinical data : small deviation
- marketing data : larger fluctuation
Understandability : why these are outliers 답할 수 있도록 해석가능해야함

Reference : https://tp46.github.io/general/2018/11/27/model-based-novelty-detection/

JINSTORY

Anomaly detection - 개요