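Blktrace: How It Works and How to Use It

Introduction to Blktrace

Blktrace is a user-space tool that collects detailed information about disk I/O as it passes through the block device layer (the block layer, hence the name "blk trace"): request submission, queueing, merging, completion, and so on.

The block device layer is the "block layer" in the figure below (figure borrowed from 褚霸).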

 

 

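How Blktrace Works

(1) When tracing, blktrace starts as many threads as the machine has logical CPUs and binds each thread to one logical CPU to collect data.

(2) Under the debugfs mount point (by default /sys/kernel/debug) there is one file per thread, and therefore one file descriptor per thread. blktrace then calls ioctl with three arguments: a file descriptor, the request code _IOWR(0x12,115,struct blk_user_trace_setup), and the address of a blk_user_trace_setup structure. This system call hands the setup to the kernel, which runs the corresponding handlers and then writes trace data into these files through the debugfs filesystem.

(3) blktrace is used together with blkparse, which parses the binary data that blktrace produces in its own format.

(4) blkparse only opens the files produced by blktrace, reads the data for display, and prints per-CPU statistics at the end. The states shown in blkparse output (A, U, Q, and so on; see below) are produced by blkparse itself: it computes t->action & 0xffff and converts the numeric value into a letter such as A, Q, or U.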
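The masking step in point (4) can be sketched in Python. This is only an illustration: the numeric codes are assumed to follow the blktrace_act enum in blktrace_api.h and should be checked against your copy of the header, and the name action_letter and the sample value are made up for this example.

```python
# The low 16 bits of t->action hold the action code; the high 16 bits hold
# category flags. Codes are assumed from the blktrace_act enum (verify locally).
ACTION_LETTERS = {
    1: "Q",   # queue
    2: "M",   # back merge
    3: "F",   # front merge
    4: "G",   # get request
    5: "S",   # sleep request
    6: "R",   # requeue
    7: "D",   # issue
    8: "C",   # complete
    9: "P",   # plug
    10: "U",  # unplug IO
    11: "T",  # unplug timer
    12: "I",  # insert
    13: "X",  # split
    14: "B",  # bounce
    15: "A",  # remap
}


def action_letter(action):
    """Mask off the category bits and map the action code to its letter."""
    return ACTION_LETTERS.get(action & 0xFFFF, "?")


# Hypothetical raw value: arbitrary category bits in the high half, code 1 below.
sample = (0x0012 << 16) | 1
print(action_letter(sample))
```

Unknown codes map to "?" here, which is a choice made for this sketch rather than blkparse behaviour.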

 

Installing Blktrace

1. yum install blktrace

2. Get the source (you can also build and install from source)

git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt

cd bt

make

make install

 

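Using Blktrace

Mounting debugfs

As explained earlier in the section on how blktrace works, blktrace relies on the kernel to emit trace data through the debugfs filesystem (debugfs lives in memory).

So debugfs must be mounted before the blktrace tool is used:

mount -t debugfs debugfs /sys/kernel/debug

Alternatively, add the following line to /etc/fstab so that it is mounted automatically at boot:

debugfs /sys/kernel/debug debugfs defaults 0 0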
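Whether debugfs is already mounted can be checked by reading /proc/mounts. Below is a minimal Python sketch of that check; the function name debugfs_mount_points and the sample text are illustrative only.

```python
def debugfs_mount_points(mounts_text):
    """Return the mount points whose filesystem type is debugfs."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


# Illustrative sample; on a real system read the text from /proc/mounts.
sample = (
    "sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0\n"
    "debugfs /sys/kernel/debug debugfs rw,relatime 0 0\n"
)
print(debugfs_mount_points(sample))
```

On a live machine the input would be open("/proc/mounts").read(); an empty result means debugfs still needs to be mounted.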

 

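Tracing a specific disk or partition with blktrace

For the full syntax see man blktrace; only the common usage is covered here.

Output to files

mkdir test && cd test  # blktrace writes its data to the current directory by default; as noted in the section on how blktrace works, each logical CPU has a thread that produces one file, so there will be as many files as CPUs

blktrace -d /dev/sda -o test1

# Traces /dev/sda. The output files are named test1.blktrace.0 through test1.blktrace.(number of CPUs - 1). They contain binary data that must be parsed with blkparse.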
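The naming scheme can be made concrete with a short Python sketch that lists the file names one would expect for a given -o base name. The helper expected_trace_files is hypothetical, and os.cpu_count() is used as an approximation of the number of logical CPUs blktrace sees.

```python
import os


def expected_trace_files(basename, ncpus=None):
    """List the per-CPU file names expected for a given -o base name."""
    if ncpus is None:
        # Approximation: logical CPUs reported by the OS, at least 1.
        ncpus = os.cpu_count() or 1
    return ["%s.blktrace.%d" % (basename, cpu) for cpu in range(ncpus)]


print(expected_trace_files("test1", ncpus=4))
```

Comparing this list with the directory contents after a run is a quick way to confirm that every CPU produced a file.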

 

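Output to the terminal

blktrace -d /dev/sda -o - | blkparse -i -

Passing "-" as the output name sends the data to the terminal, but it is raw binary and unreadable, so it has to be parsed by blkparse in real time.

The -i option of blkparse takes a file name. For blktrace, an output name of "-" stands for the terminal (this symbol is hard-coded in the source), and blkparse likewise accepts "-" to parse what arrives from the terminal side of the pipe.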

 

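Parsing blktrace data with blkparse

For the full syntax see man blkparse; only the common usage is covered here.

Parsing files

blkparse -i test1  # parses all of test1.blktrace.0 through test1.blktrace.(number of CPUs - 1); only files that contain data are counted

Real-time parsing

Real-time parsing is the blktrace-to-blkparse pipeline shown earlier for terminal output.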

 

Usage example

Terminal 1:

blktrace /dev/sda -o - | blkparse -i -   # leave this running

Terminal 2:

dd if=/dev/zero of=/root/a1 bs=4k count=1000

Terminal 1 shows:

8,0   16     3041    94.435078912   891  A   W 72411584 + 8 <- (8,2) 71884224

8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

8,0   16     3043    94.435080790   891  M   W 72411584 + 8 [flush-8:0]

8,0   16     3044    94.435083089   891  A   W 72411592 + 8 <- (8,2) 71884232

 

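Interpreting the output

This is the default output format. In the source, the default format first prints a fixed header and then, depending on the action, prints further information or nothing.

The header that is printed first is equivalent to -f "%D %2c %8s %5T.%9t %5p %2a %3d "

Each letter has the meaning given below. The numbers are field widths in characters, exactly as in printf.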

 

8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

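Because the default format first prints -f "%D %2c %8s %5T.%9t %5p %2a %3d ":

(1) 8,0 corresponds to %D: the major and minor device numbers.

(2) 16 corresponds to %2c: the CPU ID.

(3) 3042 corresponds to %8s: the sequence number (generated by blkparse itself; the actual I/O carries no such number).

(4) 94.435079691 corresponds to %5T.%9t: the timestamp as seconds.nanoseconds.

(5) 891 corresponds to %5p: the process ID.

(6) Q corresponds to %2a: the action. The table of actions follows (for example, Q means "IO handled by request queue code"); fuller descriptions are in the action appendix.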

The following table shows the various actions which may be output.              

Act Description

A IO was remapped to a different device

B IO bounced

C IO completion

D IO issued to driver

F IO front merged with request on queue

G Get request

I IO inserted onto request queue

M IO back merged with request on queue

P Plug request

Q IO handled by request queue code

S Sleep request

T Unplug due to timeout

U Unplug request

X Split

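(7) W corresponds to %3d: the RWBS field (W means a write). The letters mean:

         It contains at least one of "RWD" (R read, W write, D block discard).

         It may additionally carry "B" (barrier) or "S" (synchronous).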
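The seven header fields listed above can be pulled out of a line with a regular expression. The following Python sketch assumes the default format shown in this article; the names parse_header, FIELD_NAMES, and RWBS_NAMES are invented for the example and are not part of blkparse.

```python
import re

# Header fields of the default format "%D %2c %8s %5T.%9t %5p %2a %3d ".
FIELD_NAMES = ("major", "minor", "cpu", "seq", "sec", "nsec", "pid")
HEADER_RE = re.compile(
    r"^\s*(\d+),(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s*(.*)$"
)
RWBS_NAMES = {"R": "read", "W": "write", "D": "discard",
              "B": "barrier", "S": "sync"}


def parse_header(line):
    """Split one blkparse line into header fields plus the remaining text."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    groups = m.groups()
    # zip stops after the seven numeric fields named in FIELD_NAMES.
    fields = {name: int(value) for name, value in zip(FIELD_NAMES, groups)}
    fields["action"] = groups[7]
    fields["rwbs"] = groups[8]
    fields["rwbs_names"] = [RWBS_NAMES.get(ch, ch) for ch in groups[8]]
    fields["rest"] = groups[9]  # action-specific part, left unparsed here
    return fields


line = "8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]"
fields = parse_header(line)
print(fields["action"], fields["rwbs_names"], fields["rest"])
```

The action-specific tail is returned untouched in "rest", because its layout depends on the action letter.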

 

The action-specific part is printed next (this is how the source code does it):

switch (act[0]) {
    case 'R':   /* Requeue */
    case 'C':   /* Complete */
        if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
            char *p = dump_pdu(pdu_buf, pdu_len);
            if (p)
                fprintf(ofp, "(%s) ", p);
            fprintf(ofp, "[%d]\n", t->error);
        } else {
            if (elapsed != -1ULL) {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, t->error);
                else
                    fprintf(ofp, "%llu (%8llu) [%d]\n",
                        (unsigned long long) t->sector,
                        elapsed, t->error);
            } else {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u [%d]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), t->error);
                else
                    fprintf(ofp, "%llu [%d]\n",
                        (unsigned long long) t->sector,
                        t->error);
            }
        }
        break;

    case 'D':   /* Issue */
    case 'I':   /* Insert */
    case 'Q':   /* Queue */
    case 'B':   /* Bounce */
        if (t->action & BLK_TC_ACT(BLK_TC_PC)) {
            char *p;
            fprintf(ofp, "%u ", t->bytes);
            p = dump_pdu(pdu_buf, pdu_len);
            if (p)
                fprintf(ofp, "(%s) ", p);
            fprintf(ofp, "[%s]\n", name);
        } else {
            if (elapsed != -1ULL) {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u (%8llu) [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), elapsed, name);
                else
                    fprintf(ofp, "(%8llu) [%s]\n", elapsed,
                        name);
            } else {
                if (t_sec(t))
                    fprintf(ofp, "%llu + %u [%s]\n",
                        (unsigned long long) t->sector,
                        t_sec(t), name);
                else
                    fprintf(ofp, "[%s]\n", name);
            }
        }
        break;

    case 'M':   /* Back merge */
    case 'F':   /* Front merge */
    case 'G':   /* Get request */
    case 'S':   /* Sleep request */
        if (t_sec(t))
            fprintf(ofp, "%llu + %u [%s]\n",
                (unsigned long long) t->sector, t_sec(t), name);
        else
            fprintf(ofp, "[%s]\n", name);
        break;

    case 'P':   /* Plug */
        fprintf(ofp, "[%s]\n", name);
        break;

    case 'U':   /* Unplug IO */
    case 'T':   /* Unplug timer */
        fprintf(ofp, "[%s] %u\n", name, get_pdu_int(t));
        break;

    case 'A':   /* remap */
        get_pdu_remap(t, &r);
        fprintf(ofp, "%llu + %u <- (%d,%d) %llu\n",
            (unsigned long long) t->sector, t_sec(t),
            MAJOR(r.device_from), MINOR(r.device_from),
            (unsigned long long) r.sector_from);
        break;

    case 'X':   /* Split */
        fprintf(ofp, "%llu / %u [%s]\n", (unsigned long long) t->sector,
            get_pdu_int(t), name);
        break;

    case 'm':   /* Message */
        fprintf(ofp, "%*s\n", pdu_len, pdu_buf);
        break;

    default:
        fprintf(stderr, "Unknown action %c\n", act[0]);
        break;
    }

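Detailed breakdown

Combining the header with the action-specific part gives the following reading of each line.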

8,0   16     3042    94.435079691   891  Q   W 72411584 + 8 [flush-8:0]

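In this line act[0] is 'Q'. The 72411584 that follows is the starting sector number relative to device 8:0 (that is, sda). The "+ 8" means 8 consecutive sectors from that point (a sector is 512 bytes by default, so 8 sectors are 4 KB). The trailing [flush-8:0] is the name of the process.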
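The sector arithmetic for a Q line can be checked with a few lines of Python. This sketch assumes the "sector + count [name]" layout and the 512-byte sector size mentioned above; parse_queue_payload is a made-up helper name.

```python
import re

SECTOR_SIZE = 512  # bytes; the default sector size assumed in the text


def parse_queue_payload(rest):
    """Split 'sector + count [name]' into its parts and add the byte size."""
    m = re.match(r"^(\d+) \+ (\d+) \[(.*)\]$", rest.strip())
    if m is None:
        return None
    start, count, name = int(m.group(1)), int(m.group(2)), m.group(3)
    return {
        "start_sector": start,
        "sectors": count,
        "bytes": count * SECTOR_SIZE,
        "process": name,
    }


info = parse_queue_payload("72411584 + 8 [flush-8:0]")
print(info["bytes"])  # 8 * 512 = 4096
```

The 4096 bytes match the bs=4k used by the dd command in the usage example.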

 

8,0   16     3041    94.435078912   891  A   W 72411584 + 8 <- (8,2) 71884224

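In this line act[0] is 'A'. 72411584 is the starting sector number relative to 8:0 (that is, sda), and "(8,2) 71884224" is the sector number relative to the /dev/sda2 partition. Because /dev/sda2 is a partition on the sda disk, a starting position on sda2 must first be mapped onto sda.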
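The difference between the two sector numbers in an A line is the offset applied by the remap, which for a plain partition would be the partition's starting sector on the disk. A short Python sketch under that assumption (parse_remap_payload is a hypothetical name):

```python
import re


def parse_remap_payload(rest):
    """Parse 'sector + count <- (major,minor) from_sector' from an A line."""
    m = re.match(r"^(\d+) \+ (\d+) <- \((\d+),(\d+)\) (\d+)$", rest.strip())
    if m is None:
        return None
    sector, count, major, minor, from_sector = (int(g) for g in m.groups())
    return {
        "sector": sector,            # position on the target device (sda)
        "sectors": count,
        "from_device": (major, minor),
        "from_sector": from_sector,  # position on the source device (sda2)
        "offset": sector - from_sector,
    }


remap = parse_remap_payload("72411584 + 8 <- (8,2) 71884224")
print(remap["offset"])  # 72411584 - 71884224 = 527360
```

If sda2 is an ordinary partition, 527360 should agree with the start sector that a partitioning tool reports for it.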

 

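Sector numbers are contiguous on the disk, and a formatted disk is divided into many blocks, each of which holds several sectors. So block number = sector number / (sectors per block), where sectors per block = block size / sector size.

From the block number you can find the corresponding inode:

debugfs -R 'icheck block_number' disk_or_partition

If the block number was computed from a sector number relative to sda2, then debugfs -R 'icheck block_number' /dev/sda2 finds the corresponding inode.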
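As a worked example of the sector-to-block conversion, here is a small Python sketch. The 4096-byte block size is an assumption; check the actual block size of your filesystem before relying on the result. The function name sector_to_block is invented for the example.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed in the text


def sector_to_block(sector, block_size=4096):
    """Convert a partition-relative sector number to a filesystem block number."""
    sectors_per_block = block_size // SECTOR_SIZE
    return sector // sectors_per_block


# Sector relative to /dev/sda2, taken from the remap line in the example.
print(sector_to_block(71884224))  # 71884224 // 8 = 8985528
```

The resulting number is what would be passed to icheck for /dev/sda2, provided the sector number is relative to that partition and not to the whole disk.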

 

From the inode you can then find out which file it is:
find / -inum your_inode
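The same lookup can be sketched in Python for cases where the result needs further processing. This is an illustration only: find_by_inode is a made-up helper, and like the find command above it does not restrict itself to one filesystem, even though inode numbers are unique only within a filesystem.

```python
import os


def find_by_inode(root, inode):
    """Walk root and return the paths whose inode number matches."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed
                if os.lstat(path).st_ino == inode:
                    matches.append(path)
            except OSError:
                continue  # entry vanished or is not readable
    return matches
```

A call such as find_by_inode("/root", your_inode) narrows the search to the directory tree that lives on the traced partition.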

 

For a worked example, see the linked article written by an expert at Taobao.

Appendix: action meanings

C – complete: A previously issued request has been completed. The output will detail the sector and size of that request, as well as the success or failure of it.

D – issued: A request that previously resided on the block layer queue or in the io scheduler has been sent to the driver.

I – inserted: A request is being sent to the io scheduler for addition to the internal queue and later service by the driver. The request is fully formed at this time.

Q – queued: This notes intent to queue io at the given location. No real request exists yet.

B – bounced: The data pages attached to this bio are not reachable by the hardware and must be bounced to a lower memory location. This causes a big slowdown in io performance, since the data must be copied to/from kernel buffers. Usually this can be fixed by using better hardware: either a better io controller, or a platform with an IOMMU.

m – message: Text message generated via kernel call to blk_add_trace_msg.

M – back merge: A previously inserted request exists that ends on the boundary of where this io begins, so the io scheduler can merge them together.

F – front merge: Same as the back merge, except this io ends where a previously inserted request starts.

G – get request: To send any type of request to a block device, a struct request container must be allocated first.

S – sleep: No request structures were available, so the issuer has to wait for one to be freed.

P – plug: When io is queued to a previously empty block device queue, Linux will plug the queue in anticipation of future ios being added before this data is needed.

U – unplug: Some request data is already queued in the device, so start sending requests to the driver. This may happen automatically if a timeout period has passed (see next entry) or if a number of requests have been added to the queue.

T – unplug due to timer: If nobody requests the io that was queued after plugging the queue, Linux will automatically unplug it after a defined period has passed.

X – split: On raid or device mapper setups, an incoming io may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of io.

A – remap: For stacked devices, incoming io is remapped to the device below it in the io stack. The remap action details what exactly is being remapped to what.

A figure accompanies this article and may make things clearer.

Date: 2024-11-05 06:03:41
