#542213 openafs-modules-2.6.30-1-686-bigmem: Kernel oops

#542213#5
Date:
2009-08-18 13:29:35 UTC
From:
To:
Not sure if the severity level is correct (does crashing the system
count as breaking unrelated software and causing data loss? :-)).

I've had these for the past few kernel versions. It happens quite rarely,
but leave the machine running for a few days or weeks and it'll crash.

Details --

Aug 18 11:58:23 snout vmunix: [1932743.157090] kernel BUG at /usr/src/modules/openafs/src/libafs/MODLOAD-2.6.30-1-686-bigmem-MP/afs_dcache.c:734!
Aug 18 11:58:23 snout vmunix: [1932743.157093] invalid opcode: 0000 [#1] SMP
Aug 18 11:58:23 snout vmunix: [1932743.157097] last sysfs file: /sys/devices/platform/coretemp.3/temp1_label
Aug 18 11:58:23 snout vmunix: [1932743.157099] Modules linked in: openafs(P) edd joydev sg st sr_mod ide_gd_mod ide_cd_mod cdrom nvidia(P) binfmt_misc ppdev lp bridge stp bnep sco l2cap bluetooth ipmi_devintf ibmaem ibmpex ipmi_msghandler battery vmnet parport_pc parport vmblock vmci vmmon autofs4 acpi_cpufreq cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_stats microcode nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc fuse coretemp firewire_sbp2 loop snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse snd pcspkr soundcore evdev i2c_i801 processor button serio_raw asus_atk0110 snd_page_alloc i2c_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod hid_a4tech ata_generic ide_pci_generic ide_core sd_mod usbhid hid crc_t10dif uhci_hcd firewire_ohci ahci e100 firewire_core crc_itu_t mii pata_jmicron atl1e libata scsi_mod ehci_hcd intel_agp usbcore agpgart therma
Aug 18 11:58:23 snout vmunix:  fan
Aug 18 11:58:23 snout vmunix:  thermal_sys
Aug 18 11:58:23 snout vmunix: [1932743.157183]
Aug 18 11:58:23 snout vmunix: [1932743.157187] Pid: 26115, comm: afs_cachetrim Tainted: P           (2.6.30-1-686-bigmem #1) P5QL-E
Aug 18 11:58:23 snout vmunix: [1932743.157190] EIP: 0060:[<f912a27e>] EFLAGS: 00010286 CPU: 0
Aug 18 11:58:23 snout vmunix: [1932743.157214] EIP is at afs_HashOutDCache+0xc7/0xcc [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157217] EAX: 00000026 EBX: f91e9b20 ECX: e0955e60 EDX: f9172984
Aug 18 11:58:23 snout vmunix: [1932743.157219] ESI: 0000234d EDI: f91e9b20 EBP: 00000000 ESP: e0955e5c
Aug 18 11:58:23 snout vmunix: [1932743.157222]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Aug 18 11:58:23 snout vmunix: [1932743.157225] Process afs_cachetrim (pid: 26115, ti=e0954000 task=f3c09510 task.ti=e0954000)
Aug 18 11:58:23 snout vmunix: [1932743.157227] Stack:
Aug 18 11:58:23 snout vmunix: [1932743.157228]  f9172984 f9212160 00000010 f912a7f1 e0955fbc 00000000 f8745000 f8f7e000
Aug 18 11:58:23 snout vmunix: [1932743.157234]  00000000 00000000 00000010 fffffff7 0000000c 00000001 00007a12 f8316000
Aug 18 11:58:23 snout vmunix: [1932743.157240]  00000005 00000000 4a9635b8 00000000 4a9635e5 00000000 4a9635c1 00000000
Aug 18 11:58:23 snout vmunix: [1932743.157246] Call Trace:
Aug 18 11:58:23 snout vmunix: [1932743.157260]  [<f912a7f1>] ? afs_GetDownD+0x4e6/0x5d2 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157343]  [<f912d618>] ? afs_CacheTruncateDaemon+0x110/0x37e [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157366]  [<f916bdd4>] ? afsd_thread+0x348/0x5c4 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157392]  [<f916ba8c>] ? afsd_thread+0x0/0x5c4 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157418]  [<c0108037>] ? kernel_thread_helper+0x7/0x10
Aug 18 11:58:23 snout vmunix: [1932743.157425] Code: eb fe c7 43 78 00 00 00 00 31 c0 80 4b 72 02 5b 5e c3 68 6e 29 17 f9 e8 ed d3 1f c7 0f 0b 58 eb c6 68 84 29 17 f9 e8 de d3 1f c7 <0f> 0b 5e eb d0 53 89 c3 ff 05 50 e5 18 f9 ff 05 e0 ec 18 f9 e8
Aug 18 11:58:23 snout vmunix: [1932743.157456] EIP: [<f912a27e>] afs_HashOutDCache+0xc7/0xcc [openafs] SS:ESP 0068:e0955e5c
Aug 18 11:58:23 snout vmunix: [1932743.157480] ---[ end trace b6246d06bf7ee213 ]---
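
For triage, the parts of an oops like the one above that usually matter are the BUG location, the faulting function, and the call trace. The following is a minimal sketch of pulling those out of syslog lines. The regular expressions and the `summarize_oops` helper are assumptions written to fit the 2.6.30 i386 format shown in this report; they are not part of any kernel or OpenAFS tool and may not match other kernel versions or architectures.

```python
import re

# Patterns fitted to the log format in this report (an assumption, not a standard).
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
EIP_RE = re.compile(r"EIP is at (?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] \? (?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(lines):
    """Return the BUG location, faulting function and call-trace symbols."""
    summary = {"bug_at": None, "eip_func": None, "trace": []}
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                summary["trace"].append(frame.group("func"))
                continue
            in_trace = False  # first non-frame line ends the trace
        bug = BUG_RE.search(line)
        if bug:
            summary["bug_at"] = (bug.group("path"), int(bug.group("line")))
        eip = EIP_RE.search(line)
        if eip:
            summary["eip_func"] = eip.group("func")
    return summary


# A few lines copied from the log above.
SAMPLE = """\
Aug 18 11:58:23 snout vmunix: [1932743.157090] kernel BUG at /usr/src/modules/openafs/src/libafs/MODLOAD-2.6.30-1-686-bigmem-MP/afs_dcache.c:734!
Aug 18 11:58:23 snout vmunix: [1932743.157093] invalid opcode: 0000 [#1] SMP
Aug 18 11:58:23 snout vmunix: [1932743.157214] EIP is at afs_HashOutDCache+0xc7/0xcc [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157246] Call Trace:
Aug 18 11:58:23 snout vmunix: [1932743.157260]  [<f912a7f1>] ? afs_GetDownD+0x4e6/0x5d2 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157343]  [<f912d618>] ? afs_CacheTruncateDaemon+0x110/0x37e [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157366]  [<f916bdd4>] ? afsd_thread+0x348/0x5c4 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157392]  [<f916ba8c>] ? afsd_thread+0x0/0x5c4 [openafs]
Aug 18 11:58:23 snout vmunix: [1932743.157418]  [<c0108037>] ? kernel_thread_helper+0x7/0x10
Aug 18 11:58:23 snout vmunix: [1932743.157456] EIP: [<f912a27e>] afs_HashOutDCache+0xc7/0xcc [openafs] SS:ESP 0068:e0955e5c
""".splitlines()

summary = summarize_oops(SAMPLE)
print(summary["bug_at"], summary["eip_func"], summary["trace"])
```

On the sample lines this reports afs_dcache.c line 734 as the BUG location and afs_HashOutDCache as the faulting function, reached from afs_GetDownD in the afs_cachetrim thread.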

#542213#14
Date:
2009-08-18 23:29:33 UTC
From:
To:
Yair Mahalalel <yair@snout.mahalalel.com> writes:

That's a somewhat older version.  Have you tried upgrading to 1.4.11 to
see if that helps?  There were a couple of Linux fixes between pre1 and
pre3 that may be related.

I suspect the actual error message is on the line before this one.  Do you
have the complete crash output still available?

#542213#28
Date:
2009-08-24 22:47:40 UTC
From:
To:
Yair Mahalalel <yair@mahalalel.com> writes:

And you're still seeing the same problem?

Hm, yes.  That is somewhat useful, though; it points to a problem with the
dcache layer at least.  Anything else before that?

#542213#33
Date:
2009-08-24 23:04:23 UTC
From:
To:
The problem is a rare system crash. It will probably take a few weeks
until the next one, if at all.

No, nothing relevant. Anything I can add to increase the logging
verbosity?

Thanks,
Yair.

#542213#38
Date:
2009-08-24 23:23:35 UTC
From:
To:
Yair Mahalalel <yair@mahalalel.com> writes:

No, unfortunately, that must be all we have.

I'll pass this along to the upstream developers.

#542213#43
Date:
2014-10-26 00:55:39 UTC
From:
To:
This bug is the same as (or at least very similar to) upstream bug
128767 <https://rt.central.org/rt/Ticket/Display.html?id=128767>. It was
due to certain disk errors being ignored, which caused corruption in the
dcache hash chains that would go unnoticed for a long time (until a
panic like the one in this bug report occurred, much later).
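
To illustrate how this kind of corruption can stay latent, here is a toy model of an index-linked hash table. It is only a sketch under stated assumptions: the class, the method names, and the way the ignored read error resets the record are all hypothetical and are not taken from the OpenAFS source. The point is that the ignored error changes which bucket a slot appears to belong to while the slot is still linked into its old chain, so nothing fails until something later tries to unlink it.

```python
NULLIDX = -1      # end-of-chain marker (hypothetical name)
NBUCKETS = 4      # tiny table so the example is easy to follow


class HashChainModel:
    """Toy index-linked hash table; NOT the OpenAFS implementation."""

    def __init__(self, nslots):
        self.head = [NULLIDX] * NBUCKETS   # first slot in each bucket
        self.next = [NULLIDX] * nslots     # per-slot link to the next slot
        self.key = [None] * nslots         # stand-in for each slot's on-disk record

    def bucket(self, slot):
        return self.key[slot] % NBUCKETS

    def hash_in(self, slot, key):
        """Record the key and push the slot onto the front of its bucket."""
        self.key[slot] = key
        b = self.bucket(slot)
        self.next[slot] = self.head[b]
        self.head[b] = slot

    def hash_out(self, slot):
        """Unlink the slot from its bucket; fail if it is not on the chain."""
        b = self.bucket(slot)
        prev, cur = NULLIDX, self.head[b]
        while cur != NULLIDX:
            if cur == slot:
                if prev == NULLIDX:
                    self.head[b] = self.next[cur]
                else:
                    self.next[prev] = self.next[cur]
                self.next[cur] = NULLIDX
                return
            prev, cur = cur, self.next[cur]
        # Stand-in for the consistency check that ends in a kernel BUG.
        raise RuntimeError("slot %d not found in bucket %d" % (slot, b))

    def reload_ignoring_read_error(self, slot):
        # Simulated failed disk read whose error is ignored: the record is
        # reset while the slot is still linked into its old chain.
        self.key[slot] = 0


table = HashChainModel(nslots=2)
table.hash_in(0, key=5)               # 5 % 4 == 1, so slot 0 goes into bucket 1
table.reload_ignoring_read_error(0)   # silent corruption; nothing fails yet
try:
    table.hash_out(0)                 # now searches bucket 0, which is empty
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

In this model the failure only surfaces in `hash_out`, which may run long after the bad read, which is consistent with a crash that shows up only after days or weeks of uptime.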

There were a variety of different commits to track this down or fix it,
the last of which is commit c8ffb8b9eefc8c2a0c6b41e3e29c0f03940a5fcf,
which was in 1.6.6. The actual panic in this bug report was maybe
'fixed' a bit before that, but I'd need to go looking to get more info.

Yair, if you still care about this, can I assume you haven't seen this
panic in a while?

I defer to Ben if he wants to close this or get confirmation, etc etc.