#838569 diffoscope: readelf(1): Ignore data/instruction addresses that are de facto line numbers
- Package: diffoscope
- Source: diffoscope
- Submitter: Daniel Shahaf
- Date: 2018-12-15 15:04:31 UTC
- Severity: wishlist
Dear Maintainer,

A difference in an ELF binary file can cause offsets throughout the file to shift, usually all of them by the same amount. Typical example:

│ │ │ │ │ ./build/../src/nvim/indent_c.c:658
│ │ │ │ │ - 44436: 48 8d 35 01 77 1c 00 lea 0x1c7701(%rip),%rsi
│ │ │ │ │ + 44436: 48 8d 35 f8 76 1c 00 lea 0x1c76f8(%rip),%rsi
│ │ │ │ │ ./build/../src/nvim/main.c:749
│ │ │ │ │ 46eea: 48 8b 3c 24 mov (%rsp),%rdi
│ │ │ │ │ - 46eee: 48 8d 35 7f 50 1c 00 lea 0x1c507f(%rip),%rsi
│ │ │ │ │ + 46eee: 48 8d 35 76 50 1c 00 lea 0x1c5076(%rip),%rsi

Here, 0x1c7701-0x1c76f8 = 0x1c507f-0x1c5076 = 9. There are several screenfuls of such differences, which reduces the signal-to-noise ratio of the output, since all of these differences are secondary; the primary difference is whatever caused the 9-byte shift in the first place. (In this instance, the 9-byte offset was caused by a string literal being present in the first build but not in the second build.)

Could these offset differences in readelf(1) output be ignored, at least optionally? This would make it easier to find the root cause by reading the diff.

Cheers,
Daniel
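The arithmetic in the report can be checked mechanically. The following is only a minimal Python sketch: the regular expression assumes objdump-style `lea 0x…(%rip)` operands as shown in the excerpt, and the names `displacement` and `shifts` are hypothetical helpers, not part of diffoscope.

```python
import re

# Assumption: the displacement appears as "0x<hex>(%rip)", as in the excerpt.
RIP_DISP = re.compile(r"0x([0-9a-f]+)\(%rip\)")


def displacement(line):
    """Return the RIP-relative displacement found in `line`, or None."""
    match = RIP_DISP.search(line)
    return int(match.group(1), 16) if match else None


def shifts(pairs):
    """For each (old_line, new_line) pair, return old minus new displacement."""
    return [displacement(old) - displacement(new) for old, new in pairs]


# The two changed instruction pairs quoted in the report.
pairs = [
    ("44436: 48 8d 35 01 77 1c 00 lea 0x1c7701(%rip),%rsi",
     "44436: 48 8d 35 f8 76 1c 00 lea 0x1c76f8(%rip),%rsi"),
    ("46eee: 48 8d 35 7f 50 1c 00 lea 0x1c507f(%rip),%rsi",
     "46eee: 48 8d 35 76 50 1c 00 lea 0x1c5076(%rip),%rsi"),
]

print(shifts(pairs))  # [9, 9]
```

A constant list such as `[9, 9]` suggests a uniform shift, which is the pattern that produces the secondary noise described here.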
Hi Daniel,

Love the idea! However, my gut cautions against ignoring them, even with an option.

Perhaps there is a perfect solution whereby we would normalise these two offsets to (making it up here!) relative values, but would simply need to note, once in the diff, that we have done so. That way, we have a) still captured the underlying issue, b) reduced the noise, and c) avoided a cumbersome option flag.

Regards,
Chris Lamb wrote on Mon, Sep 19, 2016 at 09:47:10 +0100:
I'm not sure I understand what your idea is. Could you give an example
of how the output might look?
Do you mean, for example,
.
@@ -1,2 +3,4 @@
-0x42
+«0x42 + 0x10»
.
where the original files read "0x42" (first file) and "0x52" (second file)?
Daniel,

Apologies for not explaining myself better - I don't actually have a concrete idea for the output; I was just expressing a wish to avoid a flag to ignore certain things, so I was using a hypothetical solution. We already have a few and I wish they would/could disappear! :)

Regards,
Chris Lamb:

One idea that crossed my mind at some point that might be able to solve this as well: be able to record other kinds of differences than just line-oriented ones. Initially, I thought of this as a way to add image comparison, as I felt sad not knowing any free software that could easily provide features similar to what GitHub offers [1]. But why stop with images?

In the precise case of the readelf output, having a line-oriented diff means we are carrying around useless and confusing information: the line numbers are not helpful in any way to locate and understand the differences. But what if we could replace the line numbers with the instruction addresses? Then the noise mentioned by Daniel disappears. Meanwhile, the actual output will become even more relevant.

Such an approach would require some structural changes to the code, but could have benefits on many fronts.

[1]: https://help.github.com/articles/rendering-and-diffing-images/

Hope that's of some use,
Chris Lamb wrote on Tue, Sep 20, 2016 at 10:26:18 +0100:
Perhaps the output could replace all offsets into the .rodata section by
sequential numbers? For example, if the .rodata section starts at 0xA00
and ends at 0xC00, and the output references 0xA80, 0xA70, 0xB80,
and 0xB70, then those could be translated to .rodata#2, .rodata#1,
.rodata#4, and .rodata#3 respectively. To make this lossless, the
(.rodata#42 ↦ 0xB53) mapping could be appended to the file and included
in the diff.
Example output:
lea «.rodata#1»(%rip),%rsi
⋮
<at the end>
.rodata#1 is 0xA70
.rodata#2 is 0xA80
The actual hex values could be displayed as a tooltip on the 'lea' line,
or appended to that line as a '# comment' that will be considered equal
by the unidiff (like 'diff -w' considers space and tab equal).
Cheers,
Daniel
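The numbering scheme proposed above could look roughly like the following Python sketch. It is an illustration only: `label_offsets` is a hypothetical helper, and treating the section as the half-open range [start, end) is an assumption the proposal does not spell out.

```python
def label_offsets(addresses, start, end, section=".rodata"):
    """Map each distinct address in [start, end) to a sequential label.

    Labels are assigned in increasing address order, so the lowest
    referenced address becomes "<section>#1".
    """
    in_section = sorted({a for a in addresses if start <= a < end})
    return {addr: f"{section}#{i}" for i, addr in enumerate(in_section, 1)}


# Addresses and section bounds taken from the example in the message.
refs = [0xA80, 0xA70, 0xB80, 0xB70]
labels = label_offsets(refs, start=0xA00, end=0xC00)

print([labels[a] for a in refs])
# ['.rodata#2', '.rodata#1', '.rodata#4', '.rodata#3']

# The mapping that would be appended to the output to keep it lossless.
for addr, name in labels.items():
    print(f"{name} is 0x{addr:X}")
```

Because the labels depend only on the relative order of the referenced addresses, a uniform shift of the section leaves the instruction lines identical and moves the whole difference into the appended mapping.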
Jérémy Bobbio wrote on Tue, Sep 20, 2016 at 13:18:49 +0000:

In the example in the OP, the (source code) line numbers and instruction addresses are the same between both builds. It is the .rodata addresses embedded into the instructions that differ.

However, in the .text section, each disassembled instruction is preceded by its address. I think it would make sense to have the diff ignore those addresses: they serve a purpose similar to line numbers, and ignoring them cannot cause a difference to be missed.

Cheers,
Daniel
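As a rough illustration of ignoring the per-instruction address, the leading `address:` field could be dropped before the lines are compared. This Python sketch assumes the objdump-style line layout seen in the excerpts; `strip_instruction_address` is a hypothetical name, not an existing diffoscope function.

```python
import re

# Assumption: an instruction line starts with optional whitespace,
# a lowercase hex address, a colon, and then whitespace.
ADDR_PREFIX = re.compile(r"^\s*[0-9a-f]+:\s+")


def strip_instruction_address(line):
    """Remove the leading instruction address, leaving the bytes and mnemonic."""
    return ADDR_PREFIX.sub("", line, count=1)


print(strip_instruction_address("46eea: 48 8b 3c 24 mov (%rsp),%rdi"))
# 48 8b 3c 24 mov (%rsp),%rdi
```

Source-location lines such as `./build/../src/nvim/main.c:749` do not match the pattern and pass through unchanged.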
Daniel Shahaf: Thanks for pointing this out, I had actually misunderstood the problem at hand. :)
Daniel Shahaf wrote:

Alas I'm not very learned in ELF, so I will trust the specifics are fine, but just to check: … would be displayed (when different, of course!) as *something* like:

- .rodata#1 is 0xA70
+ .rodata#1 is 0xA71

So, tooltips are not only HTML-specific but would also hide data, particularly for a) users who do not even know they need to run their mouse over something, b) users who generally drive their browser via a keyboard (probably more common for users of diffoscope!) and c) users with accessibility requirements.

Anyway, great idea - love it.

Regards,
Jérémy Bobbio wrote: Pff, you don't like my existing image comparison? ;-) Regards,
Chris Lamb wrote on Tue, Sep 20, 2016 at 20:58:29 +0100:

I'm not too familiar with ELF either. I know a little about which C variables live in which section, e.g., .rodata is storage for string literals.

Yes.

I was thinking of something like the HTML <acronym> tag. In my browser, <acronym title="tooltip">foo</acronym> renders «foo» with a dotted underline whose raison d'être is your concern (a). I assume the user agents of people in categories (b) and (c) have similar solutions.

In any case, displaying the values in a comment is probably better since it makes the information available without a user action. (As I said in my last email, the comment should be exempted from being diffed.)
Even so, you can't search the page with CTRL+F and, of course, it makes the output too different between --text and --html :) Anyway, small issue ... Regards,
Daniel Shahaf wrote on Tue, Sep 20, 2016 at 18:47:31 +0000:

flexc++ has a difference on *every* line of several sections (.rodata, .eh_frame, others) because the sections start 0xc0 bytes later in the second build than in the first build:

│ │ │ │ │ - 0x0043a320 01000200 00000000 623a423a 633a433a ........b:B:c:C:
│ │ │ │ │ - 0x0043a330 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:
⋮
│ │ │ │ │ + 0x0043a3e0 01000200 00000000 623a423a 633a433a ........b:B:c:C:
│ │ │ │ │ + 0x0043a3f0 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:

Hence, filing this as a separate issue. #838260 can remain about offsets embedded in instructions.

(It's safe to ignore these addresses because the start/end of the section already appear elsewhere in the diffed output.)
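One way to make such hex dumps comparable is to rewrite each absolute address as an offset from the start of its section. The Python sketch below is illustrative only: `rebase_hexdump` is a hypothetical helper, the line format is assumed from the excerpt, and the section start addresses used in the example are assumptions (the first address shown for each build), not values confirmed by the report.

```python
import re

# Assumption: a dump line is "0x<hex address>" followed by the data columns.
HEX_LINE = re.compile(r"^(\s*)0x([0-9a-f]+)(\s.*)$")


def rebase_hexdump(lines, section_start):
    """Replace each absolute address with its offset from `section_start`."""
    rebased = []
    for line in lines:
        match = HEX_LINE.match(line)
        if match:
            offset = int(match.group(2), 16) - section_start
            line = f"{match.group(1)}+0x{offset:08x}{match.group(3)}"
        rebased.append(line)
    return rebased


first = [
    "0x0043a320 01000200 00000000 623a423a 633a433a ........b:B:c:C:",
    "0x0043a330 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:",
]
second = [
    "0x0043a3e0 01000200 00000000 623a423a 633a433a ........b:B:c:C:",
    "0x0043a3f0 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:",
]

# Hypothetical section starts, 0xc0 apart as described in the message.
print(rebase_hexdump(first, 0x0043A320) == rebase_hexdump(second, 0x0043A3E0))
# True
```

With the addresses rebased, the only remaining difference is the section start itself, which already appears elsewhere in the diffed output.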
forwarded 838569 https://salsa.debian.org/reproducible-builds/diffoscope/issues/17
thanks

I've forwarded this upstream here:

https://salsa.debian.org/reproducible-builds/diffoscope/issues/17

Regards,