Log Archiving and Data Mining (Log Center)

http://netkiller.github.io/journal/log.html

Mr. Neo Chen (陈景峰), netkiller, BG7NYT

Xishan Meidi, Minzhi Subdistrict, Longhua New District, Shenzhen, Guangdong, China
518131
+86 13113668890
+86 755 29812080
<netkiller@msn.com>

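Please contact the author before reposting. Any repost must clearly credit the original source and the author, and must include this notice.

Document source: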

http://netkiller.github.io

http://netkiller.sourceforge.net

2014-12-16

Abstract

2013-03-19 First edition

2014-12-16 Second edition


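1. What is log archiving

Archiving is the process of taking log files that are complete and worth keeping, organizing them systematically, and handing them over to a log server for storage.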

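2. Why archive logs

Historical logs can be retrieved and queried at any time.
Logs can be mined for valuable data.
The working state of applications can be inspected.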

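3. When to archive logs

Log archiving should be a formal company policy (an "archiving policy"), and it should be considered from the very start of building a system. If your company has no such practice or policy, I suggest putting one in place right after reading this article.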

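4. Where to keep archived logs

A simple setup can use a single-node server plus backups.

As log volume grows, a distributed file system will eventually be required, possibly along with remote off-site disaster recovery.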

5. Who archives the logs

My answer: automate log archiving, with manual checks or spot checks.

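6. How to archive logs

Gather the logs from all servers in one place. There are several ways to do this.

Common log archiving methods:

ftp: scheduled downloads. Suitable when files are small and log volume is low; logs are downloaded to a designated server on a schedule. The drawbacks are repeated transfers of the same data and poor timeliness.
rsyslog and similar programs: fairly general-purpose, but hard to extend.
rsync: scheduled synchronization. Suitable for large files and better than FTP, but timeliness is still poor.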

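6.1. Log format conversion

First, let me introduce a simple approach.

I wrote a program in D that splits web server log lines with a regular expression and passes the result through a pipe to a database client.

6.1.1. Putting logs into a database

Web server logs are processed through a pipe and then written to the database.

Source code of the processing program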

$ vim match.d
import std.regex;
import std.stdio;
import std.string;
import std.array;

void main()
{
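	// nginx: ten capture groups, one for each column named in the INSERT below
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);

	// apache2 (combined format): nine capture groups, so the INSERT below names one
	// column (http_x_forwarded_for) more than it has values. Remove that column from
	// the list before feeding Apache output to MySQL, which rejects the mismatch.
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);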

	foreach(line; stdin.byLine)
	{

		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront();
			//writeln(c);
			auto value = join(c, "\",\"");
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value );
			writeln(sql);
		}
	}
}

Compile

$ dmd match.d
$ strip match

$ ls
match  match.d  match.o

Basic usage

$ cat access.log | ./match

Advanced usage

$ cat access.log | match | mysql -hlocalhost -ulog -p123456 logging

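To process logs in real time, first create a named pipe and have the log written into that pipe instead of a regular file.

cat PIPE_NAME | match | mysql -hlocalhost -ulog -p123456 logging

This way log records are inserted into the database in real time.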
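The reading side of the named pipe can also be sketched in Python. This is only an illustration of the idea, not part of the original toolchain: the function name follow_fifo is made up for this sketch, and the pipe path is whatever you pass on the command line.

```python
import os
import sys


def follow_fifo(path):
    """Yield lines written to the named pipe at `path` until every writer has closed it."""
    if not os.path.exists(path):
        os.mkfifo(path)          # same effect as running `mkfifo path` in the shell
    # open() blocks here until some process opens the pipe for writing
    with open(path, "r", encoding="utf-8", errors="replace") as pipe:
        for line in pipe:
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 follow.py /path/to/pipe
    for entry in follow_fifo(sys.argv[1]):
        print(entry)
```

The loop ends when the last writer closes the pipe, so a long-running reader would normally reopen the pipe in an outer loop.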

Tip

With minor changes, the program above can be turned into an HBase or Hypertable version.
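For readers who do not use D, the regular-expression step can be sketched in Python as well. The pattern is the apache2 one from match.d; the function names parse_line and to_insert are invented for this sketch, and the %s placeholders assume a MySQL driver that uses that parameter style. Passing values as parameters, rather than pasting them into the SQL text, avoids quoting problems when a log field contains a backslash or quote.

```python
import re

# Same pattern as the apache2 (combined) regex in match.d: nine capture groups.
COMBINED = re.compile(
    r'^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"'
)

# Column names follow the INSERT statement in match.d, minus http_x_forwarded_for,
# which the combined format does not contain.
COLUMNS = (
    "remote_addr", "unknow", "remote_user", "time_local", "request",
    "status", "body_bytes_sent", "http_referer", "http_user_agent",
)


def parse_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))


def to_insert(fields, table="log"):
    """Build a (statement, values) pair for a parameterized INSERT."""
    names = ",".join(fields)
    marks = ",".join(["%s"] * len(fields))
    return f"insert into {table}({names}) values({marks})", tuple(fields.values())


if __name__ == "__main__":
    sample = ('192.168.6.30 - - [21/Mar/2013:16:11:00 +0800] "GET / HTTP/1.1" '
              '304 208 "-" "Mozilla/5.0"')
    print(to_insert(parse_line(sample)))
```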

6.1.2. Apache Pipe

Filter the Apache log through a pipe: CustomLog "| /srv/match >> /tmp/access.log" combined

<VirtualHost *:80>
        ServerAdmin webmaster@localhost

        #DocumentRoot /var/www
        DocumentRoot /www
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        #<Directory /var/www/>
        <Directory /www/>
                Options Indexes FollowSymLinks MultiViews
                AllowOverride None
                Order allow,deny
                allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
        <Directory "/usr/lib/cgi-bin">
                AllowOverride None
                Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                Order allow,deny
                Allow from all
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/error.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        #CustomLog ${APACHE_LOG_DIR}/access.log combined
        CustomLog "| /srv/match >> /tmp/access.log" combined

    Alias /doc/ "/usr/share/doc/"
    <Directory "/usr/share/doc/">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride None
        Order deny,allow
        Deny from all
        Allow from 127.0.0.0/255.0.0.0 ::1/128
    </Directory>

</VirtualHost>

What the log looks like after passing through the pipe

$ tail /tmp/access.log
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET /favicon.ico HTTP/1.1","404","501","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");

6.1.3. Log format

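By defining a custom LogFormat, the server can write its log directly in SQL form. For reference, the stock formats look like this: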

Apache

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

Nginx

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

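However, an SQL-shaped log makes analysis with grep, awk, sed, sort, and uniq more awkward for system administrators, so I still recommend keeping the standard format and splitting it with a regular expression.

To produce a regular, delimited log format in Apache: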

LogFormat \
        "\"%h\",%{%Y%m%d%H%M%S}t,%>s,\"%b\",\"%{Content-Type}o\",  \
        \"%U\",\"%{Referer}i\",\"%{User-Agent}i\""

Import the access.log file into MySQL

LOAD DATA INFILE '/local/access_log' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
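Before loading such a file, it can help to check that the lines really split into the expected number of fields. The sketch below reads the comma-separated format with Python's csv module, using options that mirror the FIELDS clauses above. The sample line and the field names are invented for illustration; they are not taken from a real log.

```python
import csv
import io

# Hypothetical line in the comma-separated LogFormat; the values are made up.
SAMPLE = ('"192.168.6.30",20130321161100,304,"208","text/html",'
          '"/index.html","-","Mozilla/5.0 (KHTML, like Gecko)"\n')

# Field names chosen for this sketch, in the order the LogFormat emits them.
FIELDS = ("host", "time", "status", "bytes", "content_type",
          "url_path", "referer", "user_agent")


def read_records(stream):
    """Parse comma-separated, optionally double-quoted log lines into dicts."""
    reader = csv.reader(
        stream,
        delimiter=",",            # FIELDS TERMINATED BY ','
        quotechar='"',            # OPTIONALLY ENCLOSED BY '"'
        escapechar="\\",          # ESCAPED BY '\\'
        doublequote=False,
        skipinitialspace=True,    # tolerate blanks after a comma
    )
    for row in reader:
        if len(row) == len(FIELDS):      # skip malformed lines
            yield dict(zip(FIELDS, row))


if __name__ == "__main__":
    for record in read_records(io.StringIO(SAMPLE)):
        print(record)
```

The comma inside the quoted User-Agent value stays part of that field, which is the reason for enclosing text fields in double quotes.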

6.1.4. Importing logs into MongoDB

# rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum install mongodb

Log-processing program in D

import std.regex;
//import std.range;
import std.stdio;
import std.string;
import std.array;

void main()
{
	// nginx
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
	// apache2
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);
	foreach(line; stdin.byLine)
	{
		//writeln(line);
		//auto m = match(line, r);
		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront();
			//writeln(c);
			/*
			SQL
			auto value = join(c, "\",\"");
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value );
			writeln(sql);
			*/
			// MongoDB
			string bson = format("db.logging.access.save({
						'remote_addr': '%s',
						'remote_user': '%s',
						'time_local': '%s',
						'request': '%s',
						'status': '%s',
						'body_bytes_sent':'%s',
						'http_referer': '%s',
						'http_user_agent': '%s',
						'http_x_forwarded_for': '%s'
						})",
						c[0],c[2],c[3],c[4],c[5],c[6],c[7],c[8],c[9]
						);
			writeln(bson);

		}
	}

}

Compile the log-processing program

dmd mlog.d

Usage

cat /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx

Processing compressed logs

# zcat /var/log/nginx/*.access.log-*.gz | /srv/mlog | mongo 192.168.6.1/logging -uneo -pchen

Collecting logs in real time

tail -f /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx
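The D program stores every field as a string. For later data mining it may be more convenient to store the status code and byte count as numbers and the timestamp in a sortable form. The following Python sketch shows one way to build such a document from an nginx log line. It deliberately stops at producing a dict (printed as JSON) and does not talk to MongoDB; the function name to_document is an assumption of this sketch.

```python
import json
import re
from datetime import datetime

# nginx "main" format, same pattern as the nginx regex in the D program (ten groups).
NGINX = re.compile(
    r'^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) '
    r'"([^"]+)" "([^"]+)" "([^"]+)"'
)


def to_document(line):
    """Turn one nginx access-log line into a dict with typed fields, or None."""
    m = NGINX.match(line)
    if m is None:
        return None
    addr, _, user, when, request, status, size, referer, agent, forwarded = m.groups()
    return {
        "remote_addr": addr,
        "remote_user": user,
        # ISO 8601 text instead of the raw "21/Mar/2013:16:11:00 +0800" form
        "time_local": datetime.strptime(when, "%d/%b/%Y:%H:%M:%S %z").isoformat(),
        "request": request,
        "status": int(status),
        "body_bytes_sent": int(size),
        "http_referer": referer,
        "http_user_agent": agent,
        "http_x_forwarded_for": forwarded,
    }


if __name__ == "__main__":
    sample = ('192.168.6.30 - - [21/Mar/2013:16:11:00 +0800] "GET / HTTP/1.1" '
              '304 208 "-" "Mozilla/5.0" "-"')
    print(json.dumps(to_document(sample), indent=2))
```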

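6.2. Log center approach

The approach above is simple, but it depends too heavily on the system administrator. Many servers have to be configured, and every application produces logs in a different format, so things become complicated. If a failure occurs along the way, part of the logs is lost.

So I went back to the starting point: every server keeps its logs locally and synchronizes them to the log server on a schedule, which takes care of archiving. In addition, logs are collected remotely and pushed over UDP to the log center, which covers real-time monitoring, capture, and other needs where timeliness matters.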
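The UDP push idea can be sketched in a few lines of Python. This is not the code of the software described in this section; it is a minimal illustration under the assumption that each log line travels as one UTF-8 datagram, and the names push_lines and collect are invented here. UDP gives no delivery guarantee, which is why the scheduled synchronization remains the authoritative archive.

```python
import os
import socket
import tempfile


def push_lines(lines, host, port):
    """Send each log line to the log center as one UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))


def collect(sock, logfile, count):
    """Receive `count` datagrams on a bound UDP socket and append them to `logfile`."""
    with open(logfile, "a", encoding="utf-8") as out:
        for _ in range(count):
            data, _sender = sock.recvfrom(65535)
            out.write(data.decode("utf-8", errors="replace") + "\n")


if __name__ == "__main__":
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    receiver.settimeout(5)                   # do not wait forever for a lost datagram
    port = receiver.getsockname()[1]
    path = os.path.join(tempfile.mkdtemp(), "access.log")

    push_lines(["GET / 200", "GET /favicon.ico 404"], "127.0.0.1", port)
    collect(receiver, path, 2)
    receiver.close()

    with open(path, encoding="utf-8") as f:
        print(f.read())
```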

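For this purpose I spent two or three days writing a piece of software, available at: https://github.com/netkiller/logging

This approach is not the best one; it simply fits my situation, and the software took only two or three days to develop. I plan to extend it later by adding support for shipping logs through a message queue.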

6.2.1. Installing the software

$ git clone https://github.com/netkiller/logging.git
$ cd logging
$ python3 setup.py sdist
$ python3 setup.py install

6.2.2. Node push side

Install the startup script

CentOS

# cp logging/init.d/ulog /etc/init.d

Ubuntu

$ sudo cp init.d/ulog /etc/init.d/	

$ service ulog
Usage: /etc/init.d/ulog {start|stop|status|restart}

Configure the script: open the file /etc/init.d/ulog

Set the IP address of the log center

HOST=xxx.xxx.xxx.xxx

Then configure the ports and which logs to collect

	done << EOF
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
EOF

The format is

Port | Logfile
------------------------------
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log

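1213 is the destination port (the port on the log center), followed by the log file you want to monitor. If a new log file is produced every day, write the path in a form like /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log

Tip

When a new log file is produced every day, ulog has to be restarted on a schedule. The command is /etc/init.d/ulog restart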
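The restart is needed because the $(date ...) part of the path is expanded once, when the init script runs, and the resulting file name then stays fixed for the lifetime of the process. The Python sketch below imitates that one-time expansion with strftime; the function name resolve and the fixed start time are assumptions made for the illustration.

```python
from datetime import datetime


def resolve(template, started_at):
    """Expand strftime fields in `template` once, as the shell does with $(date ...)."""
    return started_at.strftime(template)


if __name__ == "__main__":
    started = datetime(2014, 12, 16, 9, 55, 15)      # moment the service was started
    path = resolve("/tmp/%Y-%m-%d.%H:%M:%S.log", started)
    print(path)
    # The path is fixed from here on. A day later the process still follows this
    # file, so a scheduled restart is what makes it pick up the new name.
```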

Once configuration is complete, start the push program

# service ulog start

Check the status

$ service ulog status
13865 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/rlog -d -H 127.0.0.1 -p 1213 /var/log/nginx/access.log

Stop pushing

# service ulog stop

6.2.3. Log collection side

# cp logging/init.d/ucollection /etc/init.d

# /etc/init.d/ucollection
Usage: /etc/init.d/ucollection {start|stop|status|restart}

Configure the receiving ports and the files to save to: open /etc/init.d/ucollection and find the following section

done << EOF
1213 /tmp/nginx/access.log
1214 /tmp/test/test.log
1215 /tmp/app/$(date +"%Y-%m-%d.%H:%M:%S").log
1216 /tmp/db/$(date +"%Y-%m-%d")/mysql.log
1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
EOF

The format is as follows. This entry means: receive data arriving on port 1213 and save it to the file /tmp/nginx/access.log.

Port | Logfile
1213 /tmp/nginx/access.log

To split logs by date, configure an entry like this

1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log

With the configuration above, the log file is created in the following directory

$ find /tmp/cache/
/tmp/cache/
/tmp/cache/2014
/tmp/cache/2014/12
/tmp/cache/2014/12/16
/tmp/cache/2014/12/16/cache.log

Tip

Likewise, splitting logs requires restarting the collector.

Start the collector

# service ucollection start

Stop the program

# service ucollection stop

Check the status

$ init.d/ucollection status
12429 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1213 -l /tmp/nginx/access.log
12432 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1214 -l /tmp/test/test.log
12435 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1215 -l /tmp/app/2014-12-16.09:55:15.log
12438 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1216 -l /tmp/db/2014-12-16/mysql.log
12441 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1217 -l /tmp/cache/2014/12/16/cache.log

6.2.4. Log monitoring

Monitor the data arriving on port 1213

$ collection -p 1213

192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/docbook.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/journal.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /images/by-nc-sa.png HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /js/q.js HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"

Once started, the newest log lines are streamed over in real time.
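As a small taste of the data mining side, lines like the ones above can be summarized with a few lines of Python. The sketch counts hits per status code and per request path; the function name summarize and the shortened sample lines are assumptions of this illustration, not part of the software described here.

```python
import re
from collections import Counter

# Pulls the request path and the status code out of a combined-format line.
REQUEST = re.compile(r'"(?:[A-Z]+) (\S+) [^"]*" ([0-9]{3}) ')


def summarize(lines):
    """Count hits per status code and per request path."""
    by_status, by_path = Counter(), Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m:
            by_path[m.group(1)] += 1
            by_status[m.group(2)] += 1
    return by_status, by_path


if __name__ == "__main__":
    sample = [
        '192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] '
        '"GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0"',
        '192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] '
        '"GET /js/q.js HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0"',
    ]
    statuses, paths = summarize(sample)
    print(statuses.most_common())
    print(paths.most_common())
```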
