#1029968 bttv/v4l: WARNING: CPU: 6 PID: 6164 at mm/vmalloc.c:487 __vmap_pages_range_noflush+0x3e0/0x4d0

Package:
src:linux
Source:
src:linux
Submitter:
David Alan Gilbert
Date:
2023-12-03 18:39:04 UTC
Severity:
normal
Tags:
#1029968#5
Date:
2023-01-29 17:14:04 UTC
From:
To:
Dear Maintainer,

*** Reporter, please consider answering these questions, where appropriate ***

   * What led up to the situation?

This was triggered when trying to capture video via a bt878 card using the
https://github.com/ali1234/vhs-teletext  script.  Note 'xawtv' is working
fine, so the card is generally happy.

   * What exactly did you do (or not do) that was effective (or
     ineffective)?

I ran 'teletext record -d /dev/vbi0 > capture.vbi' from the above package
that triggers a read from vbi0; an strace shows:
[pid  6466] read(3, 0x27feb20, 65536)   = -1 ENOMEM (Cannot allocate memory)

and the WARN_ON in the dmesg

   * What was the outcome of this action?

a WARNING in the dmesg, see below; and the python script backtraced
with apparently an -ENOMEM; but there's plenty of RAM:
dg@major:~/vhs-teletext$ free -m
               total        used        free      shared  buff/cache   available
Mem:            7937         789        2963           2        4484        7147
Swap:            974           0         974

   * What outcome did you expect instead?
Happy gathered video.

*** End of the template - remove these template lines ***

#1029968#10
Date:
2023-01-29 19:03:12 UTC
From:
To:
   WORKS https://snapshot.debian.org/archive/debian/20220601T031637Z/pool/main/l/linux-signed-amd64/linux-image-5.17.0-1-amd64_5.17.3-1_amd64.deb
   5.17.0-1-amd64 #1 SMP PREEMPT Debian 5.17.3-1


So I think it's time to move upstream and bisect between 5.17 and 5.18

Dave

#1029968#15
Date:
2023-01-29 18:36:31 UTC
From:
To:
Note an easier way to trigger the bug is    cat /dev/vbi0 > /dev/null

   WORKS https://snapshot.debian.org/archive/debian/20220101T024315Z/pool/main/l/linux-signed-amd64/linux-image-5.10.0-9-amd64_5.10.70-1_amd64.deb
   linux-image-5.10.0-9-amd64_5.10.70-1_amd64.deb - kexecing in works

   WORKS https://snapshot.debian.org/archive/debian/20220101T024315Z/pool/main/l/linux-signed-amd64/linux-image-5.15.0-2-amd64_5.15.5-2_amd64.deb
   5.15.5-2

   FAILS https://snapshot.debian.org/archive/debian/20220701T034227Z/pool/main/l/linux-signed-amd64/linux-image-amd64_5.18.5-1_amd64.deb
     [   98.154835] BUG: unable to handle page fault for address: ffffbc2480f40000
[   98.154848] #PF: supervisor write access in kernel mode
[   98.154853] #PF: error_code(0x000b) - reserved bit violation

That 5.18 failure is a bit different, but the backtrace is similar;
somewhere between 5.18 it's changed from a BUG to a WARN but
back in 5.15 it just worked.

Dave

#1029968#20
Date:
2023-01-30 00:34:46 UTC
From:
To:
I built upstream kernels 5.18.0 and 6.1.0 and both of them work for me.
Which makes life much more painful to find.

#1029968#25
Date:
2023-01-30 21:27:23 UTC
From:
To:
This is IOMMU related.

Upstream 6.1 and 5.18 *do* exhibit the bug, but only with intel_iommu=on
where as Debian seems to default it to on.

#1029968#30
Date:
2023-01-30 22:10:04 UTC
From:
To:
Upstream 5.17 works
Upstream 5.18 fails

(with intel_iommu=on)

Let the bisect begin.

#1029968#35
Date:
2023-02-01 01:52:13 UTC
From:
To:
dg@major:~/kernel/kernel-clone$ git bisect good
f5ff79fddf0efecca538046b5cc20fb3ded2ec4f is the first bad commit
commit f5ff79fddf0efecca538046b5cc20fb3ded2ec4f
Author: Christoph Hellwig <hch@lst.de>
Date:   Sat Feb 26 16:40:21 2022 +0100

    dma-mapping: remove CONFIG_DMA_REMAP

That sounds like a believable cause given that it's IOMMU related
and device related.

#1029968#40
Date:
2023-02-01 02:48:03 UTC
From:
To:
Note that the oops at this bisect point is messier than on the newer
kernels; the newer kernels hit a WARN in __vmap_pages_range_noflush, where
as at this point it slams into a BUG, but the rest of the backtrace is similar;

[   78.988024] BUG: unable to handle page fault for address: ffffbd7fc1100000
[   78.988033] #PF: supervisor write access in kernel mode
[   78.988036] #PF: error_code(0x000b) - reserved bit violation
[   78.988038] PGD 100000067 P4D 100000067 PUD 1001a6067 PMD 22b791067 PTE 8000270800cb9063
[   78.988046] Oops: 000b [#1] PREEMPT SMP PTI
[   78.988050] CPU: 7 PID: 879 Comm: cat Tainted: G          I       5.17.0-rc1dg+ #20
[   78.988054] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./P55M Pro, BIOS P1.50 09/10/2009
[   78.988056] RIP: 0010:__memset+0x24/0x30
[   78.988063] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[   78.988067] RSP: 0018:ffffbd7fc0cc7d50 EFLAGS: 00010206
[   78.988071] RAX: 0000000000000000 RBX: ffff9679eb788a88 RCX: 0000000000002000
[   78.988073] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffbd7fc1100000
[   78.988076] RBP: 0000000080000000 R08: 0000000000000000 R09: ffffbd7fc1100000
[   78.988079] R10: 0000000000010000 R11: ffffbd7fc1100000 R12: 0000000000000010
[   78.988081] R13: 0000000000000010 R14: ffff9679ec8e4130 R15: 0000000000000010
[   78.988084] FS:  00007f91fb65a740(0000) GS:ffff9679efdc0000(0000) knlGS:0000000000000000
[   78.988087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.988090] CR2: ffffbd7fc1100000 CR3: 000000022ce02000 CR4: 00000000000006e0
[   78.988093] Call Trace:
[   78.988096]  <TASK>
[   78.988100]  __videobuf_iolock+0x5cd/0x659 [videobuf_dma_sg]
[   78.988110]  vbi_buffer_prepare+0x1aa/0x2b0 [bttv]
[   78.988125]  __videobuf_read_start+0xb9/0x1d0 [videobuf_core]
[   78.988133]  videobuf_read_stream+0x2cb/0x330 [videobuf_core]
[   78.988140]  bttv_read+0xc5/0x1d0 [bttv]
[   78.988151]  v4l2_read+0x6b/0x80 [videodev]
[   78.988169]  vfs_read+0x97/0x190
[   78.988175]  ksys_read+0x63/0xe0
[   78.988179]  do_syscall_64+0x3a/0x80
[   78.988185]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   78.988191] RIP: 0033:0x7f91fb7550ed
[   78.988194] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d f6 54 0a 00 e8 39 fe 01 00 66 0f 1f 84 00 00 00 00 00 80 3d f1 24 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[   78.988198] RSP: 002b:00007ffdad058348 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   78.988202] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f91fb7550ed
[   78.988204] RDX: 0000000000020000 RSI: 00007f91fb350000 RDI: 0000000000000003
[   78.988207] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[   78.988209] R10: 00007f91fb66cb40 R11: 0000000000000246 R12: 00007f91fb350000
[   78.988212] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000000
[   78.988216]  </TASK>
[   78.988218] Modules linked in: rfkill qrtr tuner_simple tuner_types intel_powerclamp coretemp tuner snd_hda_codec_via tda7432 kvm_intel snd_hda_codec_generic snd_hda_codec_hdmi tvaudio ledtrig_audio msp3400 snd_hda_intel kvm snd_intel_dspcfg bttv snd_intel_sdw_acpi snd_hda_codec irqbypass tea575x snd_hda_core tveeprom videobuf_dma_sg intel_cstate videobuf_core videodev snd_hwdep snd_bt87x intel_uncore snd_pcm serio_raw pcspkr iTCO_wdt i7core_edac mc joydev intel_pmc_bxt snd_timer iTCO_vendor_support snd watchdog soundcore sg evdev acpi_cpufreq firewire_sbp2 fuse msr configfs efi_pstore ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 nouveau sd_mod t10_pi crc_t10dif crct10dif_generic ata_generic crct10dif_common hid_generic usbhid hid mxm_wmi wmi video i2c_algo_bit drm_ttm_helper ttm drm_kms_helper cec rc_core drm ahci libahci pata_via libata ehci_pci ehci_hcd psmouse r8169 usbcore scsi_mod crc32c_intel realtek i2c_i801 mdio_devres i2c_smbus libphy firewire_ohci
[   78.988301]  firewire_core lpc_ich usb_common scsi_common crc_itu_t button
[   78.988311] CR2: ffffbd7fc1100000
[   78.988314] ---[ end trace 0000000000000000 ]---
[   78.988316] RIP: 0010:__memset+0x24/0x30
[   78.988320] Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[   78.988324] RSP: 0018:ffffbd7fc0cc7d50 EFLAGS: 00010206
[   78.988327] RAX: 0000000000000000 RBX: ffff9679eb788a88 RCX: 0000000000002000
[   78.988329] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffbd7fc1100000
[   78.988332] RBP: 0000000080000000 R08: 0000000000000000 R09: ffffbd7fc1100000
[   78.988334] R10: 0000000000010000 R11: ffffbd7fc1100000 R12: 0000000000000010
[   78.988337] R13: 0000000000000010 R14: ffff9679ec8e4130 R15: 0000000000000010
[   78.988340] FS:  00007f91fb65a740(0000) GS:ffff9679efdc0000(0000) knlGS:0000000000000000
[   78.988343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.988346] CR2: ffffbd7fc1100000 CR3: 000000022ce02000 CR4: 00000000000006e0

#1029968#45
Date:
2023-02-01 06:24:51 UTC
From:
To:
Thanks for that thorough analyses!
If you're 'penguin42' on IRC, then I'd suggest to present your findings to
iommu@lists.linux.dev as both the author and the reviewer are highly likely
subscribed to that list.

scripts/get_maintainer.pl drivers/iommu/dma-iommu.c
scripts/get_maintainer.pl kernel/dma/Makefile

list them both and both results have also that ML in their result.

HTH

#1029968#50
Date:
2023-02-01 11:55:01 UTC
From:
To:
* Diederik de Haas (didi.debian@cknow.org) wrote:

Thanks for the reply,

Yep, that's me.

Yep, will do; I'm just going to try a 6.2rc as well just in case it's
got fixed very recently, and have a poke about in case I can see
any obvious cause now I know the change that triggered it.
I'll include the linux-media list as well since it's just as likely
that it's a fault on the v4l/bttv driver.

Dave

#1029968#55
Date:
2023-02-01 13:59:20 UTC
From:
To:
Confirmed still happens on upstream 6.2.0-rc6
#1029968#60
Date:
2023-02-01 17:46:43 UTC
From:
To:
#1029968#65
Date:
2023-02-01 19:07:21 UTC
From:
To:
Control: forwarded -1 https://lore.kernel.org/linux-iommu/Y9qSwkLxeMpffZK%2F@gallifrey/

Thanks :-) Updated bug accordingly.

#1029968#72
Date:
2023-02-04 01:06:02 UTC
From:
To:
It's a bit messy...

a) The patch I bisected to is not the root cause of the bug; it just
triggers a ~9 year old bug in the v4l code - so this patch isn't going
to get changed.

b) The ~9 year old bug is in a particularly hairy piece of memory management
code  in v4l that I doubt anyone is going to fix.

c) The plan is all the drivers using that API are to either be retired
or rewritten using a new API; that's already been done for some of the
drivers and the bttv one is a few months out.  I'm not sure that's
any use to this version of Debian though.

d) The work arounds are:
  1) Disable iommu
  2) some v4l tools can use an mmap interface rather than the read(2)
interface; that seems to be OK.

Dave

#1029968#77
Date:
2023-02-04 18:03:37 UTC
From:
To:
Also reading the upstream conversation, that seems like a good summary :)

I could be wrong (ofc), but I doubt this issue will be fixed in kernel 6.1
which is planned to be Bookworm's kernel.

What you could do is test any patch(es) they put out and provide feedback on
those and if you find that the patch(es) fixes the/your issue, you could
consider providing a "Tested-By: yourname <your.email>" which can have a
positive effect on the maintainer accepting the patch from which it then can
procede further up the chain to Linus.
When it does end up in Linus' tree and it would need to have a new Kconfig
option enabled in Debian's kernel, feel free to request that.

I think IOMMU is a really good thing, so disabling that does not sound ideal,
but it could be an acceptable workaround (for now).
If you want to run Bookworm on your system, you'd probably have to wait till a
suitable Backports kernel becomes available. If you go for Trixie, you should
get (in time) the proper fix by just upgrading.

Thanks for all the investigative work and bringing it to upstream :-)

Cheers,
  Diederik

#1029968#84
Date:
2023-11-12 17:08:34 UTC
From:
To:
This looks like it's fixed in 6.6, I think they had a major rewrite
in there.
It's a conversion to vb2 in the series starting with
d1846d72587e9241e73a18da14a325b43700013b

There are a couple of minor oddities with that
(they list the sequence cost the bttv had) but that's relatively minor.

Dave

#1029968#89
Date:
2023-12-03 16:50:07 UTC
From:
To:
As well as the fixes in 6.6, we also need this patchup series here:

https://lore.kernel.org/linux-media/ZWibhE350L3BTRK8@gallifrey/T/#t

These seem to make it pretty nicely.

#1029968#94
Date:
2023-12-03 18:24:41 UTC
From:
To:
Control: forwarded -1 https://lore.kernel.org/linux-iommu/Y9qSwkLxeMpffZK%2F@gallifrey/ https://lore.kernel.org/linux-media/cover.1701349092.git.hverkuil-cisco@xs4all.nl/

A 6.6 kernel is now available in Experimental

Excellent :) I saw they had "Fixes:" tags, which normally means they'll
get backported to older kernel series like 6.6.