A Brief Introduction to Local Outlier Factor and KL-Divergence Anomaly Detection Algorithms

Local Outlier Factor

Using local outlier factors, we can detect outliers that lie far away from most of the samples. Before outlining the algorithm, a few concepts must come first:
Reachability Distance

$$\mathrm{RD}_k(x, x') = \max\left(\|x - x^{(k)}\|,\ \|x - x'\|\right)$$

where $x^{(k)}$ stands for the $k$-th nearest point to $x$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is selected manually.
Local Reachability Density

$$\mathrm{LRD}_k(x) = \left(\frac{1}{k}\sum_{i=1}^{k}\mathrm{RD}_k(x^{(i)}, x)\right)^{-1}$$

Local Outlier Factor

$$\mathrm{LOF}_k(x) = \frac{\frac{1}{k}\sum_{i=1}^{k}\mathrm{LRD}_k(x^{(i)})}{\mathrm{LRD}_k(x)}$$

Evidently, the larger the LOF of $x$, the more likely $x$ is an outlier. The algorithm is simple and its principle intuitive; however, when $n$ is very large it demands a tremendous amount of computation, since all pairwise distances between samples are needed.

Here is a simple MATLAB example:

% Toy data: 50 uniform background points and 50 Gaussian points;
% the last point is forced to be an obvious outlier at x = 14.
n=100; x=[(rand(n/2,2)-0.5)*20; randn(n/2,2)]; x(n,1)=14;
k=3; x2=sum(x.^2,2);
% Pairwise Euclidean distances, sorted row-wise: s holds the sorted
% distances, t the corresponding indices (column 1 is the point itself).
[s, t]=sort(sqrt(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'), 2);

% LRD(:,1) is each point's own local reachability density;
% LRD(:,2:k+1) are the densities of its k nearest neighbors.
for i=1:k+1
    for j=1:k
        RD(:,j)=max(s(t(t(:,i),j+1),k), s(t(:,i),j+1));
    end
    LRD(:,i)=1./mean(RD,2);
end
LOF=mean(LRD(:,2:k+1),2)./LRD(:,1);

% Plot the samples with marker size proportional to the LOF score.
figure(1); clf; hold on
plot(x(:,1),x(:,2),'rx');
for i=1:n
    plot(x(i,1),x(i,2),'bo', 'MarkerSize', LOF(i)*10);
end

KL Divergence

In unsupervised learning problems there is usually little information about the outliers. However, when a set of known normal samples $\{x'_{i'}\}_{i'=1}^{n'}$ is given, we can identify the outliers in the test set $\{x_i\}_{i=1}^{n}$ with some degree of confidence.
Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the probability density ratio of normal samples to test samples,

$$w(x) = \frac{p'(x)}{p(x)}$$

where $p'(x)$ is the probability density of the normal samples and $p(x)$ is that of the test samples, without estimating the two densities separately. The ratio for a normal sample is close to 1, while that of an outlier deviates far from 1.
To begin with, let us express the density-ratio model as a parameterized linear model:

$$w_\alpha(x) = \sum_{j=1}^{b}\alpha_j\psi_j(x) = \alpha^T\psi(x)$$

where $\alpha = (\alpha_1, \cdots, \alpha_b)^T$ is the parameter vector and $\psi(x) = (\psi_1(x), \cdots, \psi_b(x))^T$ is a vector of non-negative basis functions (for example, Gaussian kernels centered at the normal samples, as in the code below). Then $w_\alpha(x)p(x)$ can be seen as an estimate of $p'(x)$. Define the similarity between $w_\alpha(x)p(x)$ and $p'(x)$ as the KL divergence, i.e.

$$\mathrm{KL}(p' \,\|\, w_\alpha p) = \int p'(x)\log\frac{p'(x)}{w_\alpha(x)p(x)}\,\mathrm{d}x$$

In general, the KL divergence is non-negative and equals zero only if $w_\alpha p = p'$. When it is sufficiently small, $w_\alpha p$ can be regarded as close to $p'$. To guarantee that $w_\alpha p$ is a well-defined probability density, we impose the following constraints:

$$\int w_\alpha(x)p(x)\,\mathrm{d}x = 1, \qquad w_\alpha(x)p(x) \ge 0 \quad \forall x$$

Then, approximating the expectations by empirical averages, the estimation above becomes the following optimization problem:

$$\max_\alpha\ \frac{1}{n'}\sum_{i'=1}^{n'}\log w_\alpha(x'_{i'}) \quad \text{s.t.} \quad \frac{1}{n}\sum_{i=1}^{n}w_\alpha(x_i) = 1, \quad \alpha_1, \ldots, \alpha_b \ge 0$$

We briefly summarize the estimation process:

  1. Initialize $\alpha$.
  2. Repeat the following steps until $\alpha$ reaches a suitable precision:
    1. $\alpha \leftarrow \alpha + \epsilon A^T(1./(A\alpha))$ (gradient ascent step on the objective)
    2. $\alpha \leftarrow \alpha + (1 - b^T\alpha)\,b/(b^Tb)$ (projection onto the equality constraint)
    3. $\alpha \leftarrow \max(0, \alpha)$ (enforce non-negativity)
    4. $\alpha \leftarrow \alpha/(b^T\alpha)$ (renormalization)

where $A$ is the matrix whose $(i', j)$-th element is $\psi_j(x'_{i'})$, and $b$ is the vector whose $j$-th element is $\frac{1}{n}\sum_{i=1}^{n}\psi_j(x_i)$.

Here is an example (Gaussian kernel model):

function [ a ] = KLIEP( k, r )
% k: basis functions evaluated at the normal samples (the matrix A above)
% r: basis functions evaluated at the test samples (used to build b)

a0=rand(size(k,2),1); b=mean(r)'; c=sum(b.^2);
for o=1:1000
    a=a0+0.01*k'*(1./(k*a0));        % gradient ascent step
    a=a+b*(1-sum(b.*a))/c;           % projection onto the equality constraint
    a=max(0,a); a=a/sum(b.*a);       % non-negativity and normalization
    if norm(a-a0)<0.001, break, end  % stop once alpha has converged
    a0=a;
end

end
% Demo script (kept in a separate file from KLIEP.m):
% x is the normal sample set; y is the test set with one planted outlier.
n=100; x=randn(n,1); y=randn(n,1); y(n)=5;
hhs=2*[1,5,10].^2; m=5;              % candidate bandwidths (2h^2) and CV folds
x2=x.^2; xx=repmat(x2,1,n)+repmat(x2',n,1)-2*(x*x');
y2=y.^2; yx=repmat(y2,1,n)+repmat(x2',n,1)-2*y*x';
u=floor(m*(0:n-1)/n)+1;
u=u(randperm(n));                    % random fold assignment

% Select the bandwidth by hold-out log-likelihood on the normal samples.
for hk=1:length(hhs)
    hh=hhs(hk); k=exp(-xx/hh); r=exp(-yx/hh);
    for i=1:m
        g(hk,i)=mean(log(k(u==i,:)*KLIEP(k(u~=i,:),r)));
    end
end
[gh,ggh]=max(mean(g,2));
HH=hhs(ggh);

% Retrain with the chosen bandwidth and evaluate the density-ratio
% estimate at the test points; the outlier's ratio falls far below 1.
k=exp(-xx/HH); r=exp(-yx/HH); s=r*KLIEP(k,r);

figure(1); clf; hold on; plot(y,s,'rx');

SVM

Furthermore, outlier detection can be done with support vector machine techniques. For brevity, we only outline the main structure of the algorithm.
A typical SVM outlier detector finds a hypersphere that contains nearly all of the sample points; a point lying outside the hypersphere can then be regarded as an outlier. Concretely, we obtain the center $c$ and radius $R$ by solving the following optimization problem:

$$\min_{c,R,\xi}\ \left(R^2 + C\sum_{i=1}^{n}\xi_i\right) \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0, \quad \forall i = 1, 2, \ldots, n$$

It can be solved using Lagrange multipliers:

$$L(c, R, \xi, \alpha, \beta) = R^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(R^2 + \xi_i - \|x_i - c\|^2\right) - \sum_{i=1}^{n}\beta_i\xi_i$$

Then its dual problem can be formulated as:

$$\max_{\alpha,\beta}\ \inf_{c,R,\xi}\ L(c, R, \xi, \alpha, \beta) \quad \text{s.t.} \quad \alpha \ge 0,\ \beta \ge 0$$

The KKT conditions give:

$$\frac{\partial L}{\partial c} = 0 \;\Rightarrow\; c = \frac{\sum_{i=1}^{n}\alpha_i x_i}{\sum_{i=1}^{n}\alpha_i}, \qquad \frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i = 1, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \beta_i = C, \quad \forall i = 1, 2, \ldots, n$$
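
Substituting these conditions back into the Lagrangian, the terms involving $R^2$ and $\xi_i$ vanish (their coefficients $1 - \sum_i\alpha_i$ and $C - \alpha_i - \beta_i$ are zero), leaving

$$L = \sum_{i=1}^{n}\alpha_i\|x_i - c\|^2 = \sum_{i=1}^{n}\alpha_i x_i^Tx_i - 2c^T\sum_{i=1}^{n}\alpha_i x_i + \|c\|^2 = \sum_{i=1}^{n}\alpha_i x_i^Tx_i - \sum_{i,j=1}^{n}\alpha_i\alpha_j x_i^Tx_j$$

where the last step uses $c = \sum_{i=1}^{n}\alpha_i x_i$ (valid since $\sum_{i=1}^{n}\alpha_i = 1$).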

Therefore, the dual problem reduces to

$$\hat{\alpha} = \arg\max_\alpha\ \left(\sum_{i=1}^{n}\alpha_i x_i^Tx_i - \sum_{i,j=1}^{n}\alpha_i\alpha_j x_i^Tx_j\right) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \forall i = 1, 2, \ldots, n$$

This is a standard quadratic programming problem. After solving it, we can recover $c$ and $R$:

$$\hat{R}^2 = \left\|x_i - \sum_{j=1}^{n}\hat{\alpha}_jx_j\right\|^2, \qquad \hat{c} = \sum_{i=1}^{n}\hat{\alpha}_ix_i$$

where $x_i$ is any support vector satisfying $\|x_i - \hat{c}\|^2 = \hat{R}^2$, i.e., one with $0 < \hat{\alpha}_i < C$.
Hence, when a sample point x satisfies

$$\|x - \hat{c}\|^2 > \hat{R}^2$$

it can be viewed as an outlier.
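
As a concrete illustration, here is a minimal MATLAB sketch of this detector. It is a sketch under stated assumptions rather than code from the original text: it relies on quadprog from the Optimization Toolbox to solve the dual QP, uses a plain linear kernel, and the toy data and the value of C are hypothetical.

% Solve the SVDD dual with quadprog and flag points outside the ball.
n=100; x=[randn(n-1,2); 8 8];         % toy data; the last point is a planted outlier
C=0.1;                                % trade-off parameter (chosen arbitrarily)
K=x*x';                               % Gram matrix of inner products x_i'*x_j
H=2*K; f=-diag(K);                    % quadprog minimizes a'*H*a/2 + f'*a,
                                      % i.e. the negative of the dual objective
Aeq=ones(1,n); beq=1;                 % constraint: sum_i alpha_i = 1
lb=zeros(n,1); ub=C*ones(n,1);        % constraint: 0 <= alpha_i <= C
alpha=quadprog(H,f,[],[],Aeq,beq,lb,ub);
c=alpha'*x;                           % center c = sum_i alpha_i x_i
sv=find(alpha>1e-6 & alpha<C-1e-6);   % boundary support vectors (0 < alpha_i < C)
R2=sum((x(sv(1),:)-c).^2);            % squared radius from any boundary SV
out=find(sum((x-repmat(c,n,1)).^2,2)>R2)  % indices of detected outliers

Replacing the inner products $x_i^Tx_j$ with a kernel function yields the usual kernelized version, at the cost of a slightly more involved radius computation.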
