使用systemtap调试Linux内核
最近在公司看Linux内核的nmi死锁检测功能的实现机制,当然,是因为它变了,所以我才看的,简单来说就是在红帽的某牛提交了一个内核patch:new nmi_watchdog using perf events,这个patch已经被合入到内核主线2.6.38版本,所以使用自该版本开始后内核的Linux系统,其/proc/interrupts显示的中断数不再按每秒1000次的频率增长。关于new nmi_watchdog问题,本文不再多说,后续写专篇文章,下面看使用systemtap调试Linux内核的几个案例,因为我最近就是通过这个手段来理解new nmi_watchdog的实现机制,相比利用printk或kgdb而言,使用systemtap更为简单方便,效率也大大提高。
系统环境:
[root@localhost ~]# cat /etc/issue CentOS release 6.2 (Final) Kernel \r on an \m [root@localhost ~]# uname -a Linux localhost.lenkydomain 3.6.11 #1 SMP Wed Feb 20 21:26:16 CST 2013 x86_64 x86_64 x86_64 GNU/Linux [root@localhost ~]# stap -V Systemtap translator/driver (version 2.1/0.152, non-git sources) Copyright (C) 2005-2012 Red Hat, Inc. and others This is free software; see the source for copying conditions. enabled features: LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS
案例一,判断函数的真实执行路径,比如这个函数:
static inline void x86_assign_hw_event(struct perf_event *event, struct cpu_hw_events *cpuc, int i) { struct hw_perf_event *hwc = &event->hw; hwc->idx = cpuc->assign[i]; hwc->last_cpu = smp_processor_id(); hwc->last_tag = ++cpuc->tags[i]; if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) { a-> hwc->config_base = 0; hwc->event_base = 0; } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) { b-> hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL; hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED); hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30; } else { c-> hwc->config_base = x86_pmu_config_addr(hwc->idx); hwc->event_base = x86_pmu_event_addr(hwc->idx); hwc->event_base_rdpmc = hwc->idx; } }
我想知道nmi_watchdog的perf event走的是路径a?路径b?还是路径c?
以前利用printk的矬做法是修改这个函数,在a、b、c分别插入printk(“aaa\n”);、printk(“bbb\n”);、printk(“ccc\n”);,然后需要重新编译内核,重启机器跑一次逻辑,再根据打印结果做判断。
利用systemtap的做法是在上面三个路径上分别下探测点,并设置执行语句为打印相应的字符串即可。
比如在上面的路径b处下探测点:
[root@localhost ~]# stap -ve 'probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824"){printf("bbb\n")}'
在另一个终端触发nmi_watchdog设置逻辑:
[root@localhost ~]# echo 0 > /proc/sys/kernel/nmi_watchdog ; echo 1 > /proc/sys/kernel/nmi_watchdog ;
可以看到探测点被执行到,并且打印了相关信息,根据打印的信息来看,探测点被多次执行到:
[root@localhost ~]# stap -ve 'probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824"){printf("bbb\n")}' Pass 1: parsed user script and 90 library script(s) using 190140virt/24872res/2788shr/22676data kb, in 170usr/10sys/178real ms. Pass 2: analyzed script: 2 probe(s), 0 function(s), 0 embed(s), 0 global(s) using 358304virt/79680res/18160shr/63552data kb, in 280usr/20sys/310real ms. Pass 3: translated to C into "/tmp/stapGGJa26/stap_b1c9e6f6ba4bf7f4d0a8eb727add532b_1128_src.c" using 358304virt/79840res/18312shr/63552data kb, in 10usr/0sys/5real ms. Pass 4: compiled C into "stap_b1c9e6f6ba4bf7f4d0a8eb727add532b_1128.ko" in 1210usr/290sys/1525real ms. Pass 5: starting run. bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb
三处都下探测点的完整执行情况:
[root@localhost perf_study]# cat x86_assign_hw_event.stp probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:821"){printf("aaa\n")} probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824"){printf("bbb\n")} probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:828"){printf("ccc\n")} [root@localhost perf_study]# stap x86_assign_hw_event.stp bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb bbb
另外一个常见的情况是要判断某个函数在某个逻辑中是否被跑到,以前的做法也是在函数的入口处加上打印语句,然后跑一遍功能逻辑,看是否有信息打印出来,现在利用systemtap可以这样:
[root@localhost perf_study]# stap -e 'probe kernel.function("x86_assign_hw_event"){printf("test\n")}' test test test test test test test test
案例二,获取函数调用堆栈,示例:
[root@localhost perf_study]# cat bt.stp #stap -v bt.stp schedule probe kernel.function(@1){ print("----------------START-------------------------\n") printf("In process [%s]\n", execname()) print_regs() print_backtrace() print("----------------END-------------------------\n") exit() } [root@localhost perf_study]# stap -v bt.stp x86_assign_hw_event Pass 1: parsed user script and 90 library script(s) using 190148virt/24884res/2784shr/22684data kb, in 170usr/10sys/179real ms. Pass 2: analyzed script: 1 probe(s), 4 function(s), 2 embed(s), 0 global(s) using 344464virt/50624res/3872shr/47224data kb, in 500usr/290sys/788real ms. Pass 3: using cached /root/.systemtap/cache/13/stap_13ea16365226db9619f3f14ab2a27efc_2536.c Pass 4: using cached /root/.systemtap/cache/13/stap_13ea16365226db9619f3f14ab2a27efc_2536.ko Pass 5: starting run. ----------------START------------------------- In process [swapper/0] RIP: ffffffff81024f40 RSP: ffff88022fc03d18 EFLAGS: 00000086 RAX: 0000000000000021 RBX: ffff88022fc0c6e0 RCX: 0000000000000000 RDX: ffff88022fc169f8 RSI: ffff88022fc03d7c RDI: ffff8802215efc00 RBP: ffff88022fc03d58 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000003 R12: 0000000000000000 R13: ffff88022fc0c6e0 R14: ffff88022fc0c6e0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88022fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000419248 CR3: 0000000222468000 CR4: 00000000000007f0 0xffffffff81024f40 : x86_pmu_enable+0x140/0x280 [kernel] 0xffffffff8110c7eb : perf_pmu_enable+0x2b/0x40 [kernel] 0xffffffff810247a9 : x86_pmu_commit_txn+0xa9/0xb0 [kernel] 0xffffffff8110ee8a : group_sched_in+0x13a/0x170 [kernel] 0xffffffff8110fb3d : __perf_event_enable+0x29d/0x2e0 [kernel] 0xffffffff8110d038 : remote_function+0x48/0x60 [kernel] 0xffffffff810b2371 : generic_smp_call_function_single_interrupt+0xa1/0x100 [kernel] 0xffffffff810383f7 : smp_call_function_single_interrupt+0x27/0x40 [kernel] 0xffffffff815276ca : call_function_single_interrupt+0x6a/0x70 [kernel] ----------------END------------------------- Pass 5: run completed in 10usr/30sys/4564real ms.
案例三,打印某代码路径上的变量值:
[root@localhost perf_study]# stap -e 'probe kernel.statement("x86_assign_hw_event@arch/x86/kernel/cpu/perf_event.c:824"){printf("%d\n", $hwc->idx)}' 33 33 33 33 33 33 33 33
如果有这样的提示错误:
[root@localhost perf_study]# stap -e 'probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824"){printf("%d\n", $hwc->idx)}' semantic error: not accessible at this address (0xffffffff81024f8e, dieoffset: 0x343357): identifier '$hwc' at <input>:1:81 source: probe kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824"){printf("%d\n", $hwc->idx)} ^ Pass 2: analysis failed. Try again with another '--vp 01' option.
那么可以用-L选项看一下到底有哪些可用变量:
[root@localhost perf_study]# stap -L 'kernel.statement("*@arch/x86/kernel/cpu/perf_event.c:824")' kernel.statement("x86_assign_hw_event@arch/x86/kernel/cpu/perf_event.c:824") $hwc:struct hw_perf_event* $i:int $cpuc:struct cpu_hw_events* $event:struct perf_event* kernel.statement("x86_pmu_enable@arch/x86/kernel/cpu/perf_event.c:824") $n_running:int $cpuc:struct cpu_hw_events* $added:int
可以看到,是因为*匹配到两处(应该是宏的开启与否导致代码行号有移动,但个人暂不确定具体原因是否如此),因此可以明确使用“x86_assign_hw_event@arch/x86/kernel/cpu/perf_event.c:824”。
案例四,获知函数指针的具体指向,比如获取下面event_init函数指针的具体指向:
struct pmu *perf_init_event(struct perf_event *event) { struct pmu *pmu = NULL; int idx; int ret; idx = srcu_read_lock(&pmus_srcu); rcu_read_lock(); pmu = idr_find(&pmu_idr, event->attr.type); rcu_read_unlock(); if (pmu) { event->pmu = pmu; ret = pmu->event_init(event); if (ret) pmu = ERR_PTR(ret); goto unlock; } list_for_each_entry_rcu(pmu, &pmus, entry) { event->pmu = pmu; ret = pmu->event_init(event); if (!ret) goto unlock; if (ret != -ENOENT) { pmu = ERR_PTR(ret); goto unlock; } } pmu = ERR_PTR(-ENOENT); unlock: srcu_read_unlock(&pmus_srcu, idx); return pmu; }
[root@localhost perf_study]# cat perf_init_event.stp probe kernel.statement("perf_init_event@kernel/events/core.c:5892"){ addr = sprintf("%p", $pmu->event_init); print("5892-Function name:\n") print_stack(addr) } probe kernel.statement("perf_init_event@kernel/events/core.c:5900"){ addr = sprintf("%p", $pmu->event_init); print("5900-Function name:\n") print_stack(addr) } [root@localhost perf_study]# stap -v perf_init_event.stp Pass 1: parsed user script and 90 library script(s) using 190092virt/24736res/2784shr/22628data kb, in 160usr/20sys/178real ms. Pass 2: analyzed script: 2 probe(s), 7 function(s), 3 embed(s), 0 global(s) using 357076virt/62224res/3896shr/59836data kb, in 450usr/290sys/743real ms. Pass 3: translated to C into "/tmp/stapsgSbwn/stap_8b6901af9c04b3e4907826ce793aca76_4839_src.c" using 354588virt/65020res/6820shr/59836data kb, in 170usr/10sys/175real ms. Pass 4: compiled C into "stap_8b6901af9c04b3e4907826ce793aca76_4839.ko" in 2610usr/440sys/2892real ms. Pass 5: starting run. 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel] 5900-Function name: 0xffffffff81025510 : x86_pmu_event_init+0x0/0x220 [kernel]
即指向的是函数x86_pmu_event_init()。
转载请保留地址:http://lenky.info/archives/2013/02/24/2209 或 http://lenky.info/?p=2209
备注:如无特殊说明,文章内容均出自Lenky个人的真实理解而并非存心妄自揣测来故意愚人耳目。由于个人水平有限,虽力求内容正确无误,但仍然难免出错,请勿见怪,如果可以则请留言告之,并欢迎来讨论。另外值得说明的是,Lenky的部分文章以及部分内容参考借鉴了网络上各位网友的热心分享,特别是一些带有完全参考的文章,其后附带的链接内容也许更直接、更丰富,而我只是做了一下归纳&转述,在此也一并表示感谢。关于本站的所有技术文章,欢迎转载,但请遵从CC创作共享协议,而一些私人性质较强的心情随笔,建议不要转载。
法律:根据最新颁布的《信息网络传播权保护条例》,如果您认为本文章的任何内容侵犯了您的权利,请以或书面等方式告知,本站将及时删除相关内容或链接。