Local Outlier Factor
Using local outlier factors, we can detect outliers, i.e. samples that lie far away from most of the other samples. Before outlining the algorithm, a few concepts must be introduced first:
Reachability Distance
$$\mathrm{RD}_k(x, x') = \max\left(\|x - x^{(k)}\|,\ \|x - x'\|\right)$$
where $x^{(k)}$ stands for the $k$th point nearest to $x$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is manually selected.
Local Reachability Density
$$\mathrm{LRD}_k(x) = \left( \frac{1}{k} \sum_{i=1}^{k} \mathrm{RD}_k\!\left(x^{(i)}, x\right) \right)^{-1}$$
Local Outlier Factor
$$\mathrm{LOF}_k(x) = \frac{\frac{1}{k} \sum_{i=1}^{k} \mathrm{LRD}_k\!\left(x^{(i)}\right)}{\mathrm{LRD}_k(x)}$$
Evidently, the larger the LOF of $x$, the more likely $x$ is an outlier. In theory this is a simple algorithm with an intuitive principle; however, when $n$ is very large, it requires a tremendous amount of computation, since all pairwise distances and nearest neighbors must be found.
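As a minimal worked illustration (with hypothetical one-dimensional data), take $k = 1$ and three points $a = 0$, $b = 1$, $c = 10$. The nearest neighbor of both $a$ and $c$ is $b$, and the nearest neighbor of $b$ is $a$. Then $\mathrm{RD}_1(b, a) = \mathrm{RD}_1(a, b) = \max(1, 1) = 1$ and $\mathrm{RD}_1(b, c) = \max(1, 9) = 9$, so $\mathrm{LRD}_1(a) = \mathrm{LRD}_1(b) = 1$ while $\mathrm{LRD}_1(c) = 1/9$. Consequently $\mathrm{LOF}_1(a) = \mathrm{LOF}_1(b) = 1$, but $\mathrm{LOF}_1(c) = 9$, correctly flagging the isolated point.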
Here is a simple MATLAB example:
% Toy data: half uniform on [-10,10]^2, half standard Gaussian; the last point is forced to be an outlier.
n=100; x=[(rand(n/2,2)-0.5)*20; randn(n/2,2)]; x(n,1)=14;
k=3; x2=sum(x.^2,2);
% Pairwise distances sorted row-wise: s holds sorted distances, t the neighbor indices (column 1 is the point itself).
[s,t]=sort(sqrt(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'),2);
% LRD(:,1) is each point's local reachability density; LRD(:,i+1) is that of its i-th nearest neighbor.
for i=1:k+1
  for j=1:k
    RD(:,j)=max(s(t(t(:,i),j+1),k),s(t(:,i),j+1));
  end
  LRD(:,i)=1./mean(RD,2);
end
% LOF: average LRD of the k nearest neighbors divided by the point's own LRD.
LOF=mean(LRD(:,2:k+1),2)./LRD(:,1);
% Plot the data; marker size is proportional to each point's LOF.
figure(1); clf; hold on
plot(x(:,1),x(:,2),'rx');
for i=1:n
  plot(x(i,1),x(i,2),'bo','MarkerSize',LOF(i)*10);
end
KL Divergence
In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x'_{i'}\}_{i'=1}^{n'}$ is given, we can identify the outliers in the test set $\{x_i\}_{i=1}^{n}$ with some degree of confidence.
Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the ratio of the probability density of the normal samples to that of the test samples,
$$w(x) = \frac{p'(x)}{p(x)}$$
where $p'(x)$ is the probability density of the normal samples and $p(x)$ is that of the test samples, without computing either density explicitly. For a normal sample the ratio is close to $1$, while for an outlier it deviates from $1$.
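To make this intuition concrete, here is a small MATLAB sketch with entirely hypothetical densities chosen for illustration (it is not part of the estimation procedure described below): when both densities are known in closed form, the true ratio can be evaluated directly and is indeed near $1$ in the bulk of the normal data and far from $1$ where the test density carries extra outlier mass.

% Hypothetical illustration of the density ratio: p'(x) = N(0,1) for the
% normal samples, p(x) = 0.9*N(0,1) + 0.1*N(5,1) for the test samples.
xs=linspace(-5,8,200);
pn=@(z,mu) exp(-(z-mu).^2/2)/sqrt(2*pi);   % N(mu,1) density
pprime=pn(xs,0);                           % density of the normal samples
ptest=0.9*pn(xs,0)+0.1*pn(xs,5);           % test density with an outlier component
w=pprime./ptest;                           % true ratio: close to 1 near 0, close to 0 near 5
figure(2); clf; plot(xs,w,'b-'); xlabel('x'); ylabel('w(x)');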
To begin with, let us model the density ratio with a parameterized linear model:
$$w_{\alpha}(x) = \sum_{j=1}^{b} \alpha_j \psi_j(x) = \alpha^{T} \psi(x)$$
where $\alpha = (\alpha_1, \cdots, \alpha_b)^{T}$ is the parameter vector and $\psi(x) = (\psi_1(x), \cdots, \psi_b(x))^{T}$ is a vector of non-negative basis functions. Then $w_{\alpha}(x)\, p(x)$ can be seen as an estimate of $p'(x)$. We measure the discrepancy between $w_{\alpha}(x)\, p(x)$ and $p'(x)$ by the KL divergence, i.e.
$$\mathrm{KL}\left(p' \,\middle\|\, w_{\alpha} p\right) = \int p'(x) \log \frac{p'(x)}{w_{\alpha}(x)\, p(x)} \, \mathrm{d}x$$
In general, the KL divergence is non-negative and equals zero only if $w_{\alpha} p = p'$. When the KL divergence is sufficiently small, $w_{\alpha} p$ can therefore be regarded as close to $p'$. To guarantee that $w_{\alpha} p$ is a well-defined probability density, we impose the following constraints:
$$\int w_{\alpha}(x)\, p(x) \, \mathrm{d}x = 1, \qquad w_{\alpha}(x)\, p(x) \ge 0 \quad \forall x$$
Then, approximating the expectations by sample averages, we can turn the estimation above into the following optimization problem:
$$\max_{\alpha} \ \frac{1}{n'} \sum_{i'=1}^{n'} \log w_{\alpha}(x'_{i'}) \quad \text{s.t.} \quad \frac{1}{n} \sum_{i=1}^{n} w_{\alpha}(x_i) = 1, \quad \alpha_1, \ldots, \alpha_b \ge 0$$
We briefly summarize the estimation process:
- Initialize $\alpha$.
- Repeat the following updates until $\alpha$ converges to a suitable precision:
  - $\alpha \leftarrow \alpha + \epsilon A^{T}\left(\mathbf{1} ./ (A\alpha)\right)$
  - $\alpha \leftarrow \alpha + \dfrac{(1 - b^{T}\alpha)\, b}{b^{T} b}$
  - $\alpha \leftarrow \max(0, \alpha)$
  - $\alpha \leftarrow \alpha \,/\, (b^{T}\alpha)$
where $A$ is the matrix whose $(i', j)$-th element is $\psi_j(x'_{i'})$, $b$ is the vector whose $j$-th element is $\frac{1}{n}\sum_{i=1}^{n}\psi_j(x_i)$, $\epsilon$ is a small step size, and $./$ denotes element-wise division.
Here is a MATLAB example with a Gaussian kernel model:
function [ a ] = KLIEP( k, r )
% k: basis functions evaluated at the normal samples (the matrix A in the text)
% r: basis functions evaluated at the test samples (used to form the vector b)
a0=rand(size(k,2),1); b=mean(r)'; c=sum(b.^2);
for o=1:1000
  a=a0+0.01*k'*(1./(k*a0));    % gradient ascent step on the log-likelihood
  a=a+b*(1-sum(b.*a))/c;       % project onto the equality constraint b'*a = 1
  a=max(0,a); a=a/sum(b.*a);   % enforce non-negativity, then renormalize
  if norm(a-a0)<0.001, break, end
  a0=a;
end
end
% Demo: x is the normal sample set, y the test set with one outlier at 5.
n=100; x=randn(n,1); y=randn(n,1); y(n)=5;
hhs=2*[1,5,10].^2; m=5;      % candidate Gaussian bandwidths (2*sigma^2) and number of CV folds
x2=x.^2; xx=repmat(x2,1,n)+repmat(x2',n,1)-2*(x*x');
y2=y.^2; yx=repmat(y2,1,n)+repmat(x2',n,1)-2*y*x';
% Random split of the normal samples into m folds for cross-validation.
u=floor(m*(0:n-1)/n)+1;
u=u(randperm(n));
for hk=1:length(hhs)
  hh=hhs(hk); k=exp(-xx/hh); r=exp(-yx/hh);
  for i=1:m
    g(hk,i)=mean(k(u==i,:)*KLIEP(k(u~=i,:),r));  % held-out score for this bandwidth and fold
  end
end
% Pick the best bandwidth, then compute the density-ratio values for the test samples.
[gh,ggh]=max(mean(g,2));
HH=hhs(ggh);
k=exp(-xx/HH); r=exp(-yx/HH); s=r*KLIEP(k,r);
figure(1); clf; hold on; plot(y,s,'rx');  % small s(i) indicates y(i) is likely an outlier
SVM
Furthermore, outlier detection can also be done with support vector machine techniques. Due to space limitations, we only outline the main structure of the algorithm.
A typical SVM outlier detector finds a hypersphere that contains nearly all of the sample points; a point lying outside the hypersphere can then be regarded as an outlier. Concretely, we obtain the center $c$ and radius $R$ by solving the following optimization problem:
$$\min_{c, R, \xi} \left( R^2 + C \sum_{i=1}^{n} \xi_i \right) \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, 2, \ldots, n$$
It can be solved using Lagrange multipliers:
$$L(c, R, \xi, \alpha, \beta) = R^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left( R^2 + \xi_i - \|x_i - c\|^2 \right) - \sum_{i=1}^{n} \beta_i \xi_i$$
Then its dual problem can be formulated as:
$$\max_{\alpha, \beta} \ \inf_{c, R, \xi} L(c, R, \xi, \alpha, \beta) \quad \text{s.t.} \quad \alpha \ge 0, \ \beta \ge 0$$
From the KKT (stationarity) conditions:
$$\begin{aligned}
\frac{\partial L}{\partial c} = 0 \quad &\Rightarrow \quad c = \frac{\sum_{i=1}^{n} \alpha_i x_i}{\sum_{i=1}^{n} \alpha_i} \\
\frac{\partial L}{\partial R} = 0 \quad &\Rightarrow \quad \sum_{i=1}^{n} \alpha_i = 1 \\
\frac{\partial L}{\partial \xi_i} = 0 \quad &\Rightarrow \quad \alpha_i + \beta_i = C, \quad \forall i = 1, 2, \ldots, n
\end{aligned}$$
Therefore, the dual problem reduces to
$$\hat{\alpha} = \arg\max_{\alpha} \left( \sum_{i=1}^{n} \alpha_i x_i^{T} x_i - \sum_{i,j=1}^{n} \alpha_i \alpha_j x_i^{T} x_j \right) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = 1, \quad 0 \le \alpha_i \le C, \ \forall i = 1, 2, \ldots, n$$
This is a typical quadratic programming problem. After solving it, we can recover $c$ and $R$:
$$\hat{R}^2 = \left\| x_i - \sum_{j=1}^{n} \hat{\alpha}_j x_j \right\|^2, \qquad \hat{c} = \sum_{i=1}^{n} \hat{\alpha}_i x_i$$
where $x_i$ is any support vector, i.e. a point satisfying $\|x_i - \hat{c}\|^2 = \hat{R}^2$, which corresponds to $0 < \hat{\alpha}_i < C$.
Hence, when a sample point $x$ satisfies
$$\|x - \hat{c}\|^2 > \hat{R}^2$$
it can be viewed as an outlier.
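To make the whole procedure concrete, here is a minimal MATLAB sketch (my own illustration, not code from the text above): it solves the dual with quadprog from the Optimization Toolbox, recovers $\hat{c}$ and $\hat{R}^2$ from a support vector, and flags the points outside the hypersphere. The toy data, the value of $C$, and the $10^{-6}$ tolerance are arbitrary choices, and the sketch assumes at least one $\hat{\alpha}_i$ lies strictly between the bounds.

% Minimal SVDD-style sketch (requires the Optimization Toolbox for quadprog).
n=100; x=[randn(n-1,2); 5 5];              % toy data with one obvious outlier
C=0.1; K=x*x';                             % trade-off parameter; linear-kernel Gram matrix
% Dual:  max  sum_i a_i*K(i,i) - a'*K*a   s.t.  sum(a)=1,  0 <= a_i <= C.
% quadprog minimizes 0.5*a'*H*a + f'*a, so set H = 2*K and f = -diag(K).
a=quadprog(2*K, -diag(K), [], [], ones(1,n), 1, zeros(n,1), C*ones(n,1));
c=x'*a;                                    % estimated center
sv=find(a>1e-6 & a<C-1e-6);                % a support vector on the sphere boundary
R2=sum((x(sv(1),:)'-c).^2);                % squared radius
d2=sum((x-repmat(c',n,1)).^2,2);           % squared distances to the center
outliers=find(d2>R2);                      % points outside the hypersphere
figure(1); clf; hold on
plot(x(:,1),x(:,2),'bo');
plot(x(outliers,1),x(outliers,2),'rx','MarkerSize',12);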