#838569 diffoscope: readelf(1): Ignore data/instruction addresses that are de facto line numbers

#838569#5
Date:
2016-09-19 07:05:13 UTC
From:
To:
Dear Maintainer,

A difference in an ELF binary file can cause offsets throughout the
file to shift, usually all of them by the same amount.

Typical example:

│   │   │   │   │  ./build/../src/nvim/indent_c.c:658
│   │   │   │   │ -   44436:	48 8d 35 01 77 1c 00 	lea    0x1c7701(%rip),%rsi
│   │   │   │   │ +   44436:	48 8d 35 f8 76 1c 00 	lea    0x1c76f8(%rip),%rsi
│   │   │   │   │  ./build/../src/nvim/main.c:749
│   │   │   │   │     46eea:	48 8b 3c 24          	mov    (%rsp),%rdi
│   │   │   │   │ -   46eee:	48 8d 35 7f 50 1c 00 	lea    0x1c507f(%rip),%rsi
│   │   │   │   │ +   46eee:	48 8d 35 76 50 1c 00 	lea    0x1c5076(%rip),%rsi

Here, 0x1c7701-0x1c76f8 = 0x1c507f-0x1c5076 = 9.  There are several
screenfuls of such differences, which reduces the signal-to-noise ratio
of the output, since all of these differences are secondary; the primary
difference is whatever caused the 9-byte shift in the first place.

(In this instance, the 9-byte offset was caused by a string literal
being present in the first build but not in the second build.)
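
The arithmetic above can be checked mechanically.  What follows is only
a minimal sketch, not anything diffoscope does today: uniform_shift() is
a hypothetical helper, and the operand pairs are copied by hand from the
example diff rather than parsed from real output.

```python
# Operand pairs copied from the example diff: (first build, second build).
PAIRS = [
    (0x1c7701, 0x1c76f8),  # src/nvim/indent_c.c:658
    (0x1c507f, 0x1c5076),  # src/nvim/main.c:749
]


def uniform_shift(pairs):
    """Return the common difference if every pair differs by the same
    amount, or None if the differences are not all equal."""
    deltas = {first - second for first, second in pairs}
    return deltas.pop() if len(deltas) == 1 else None


if __name__ == "__main__":
    print(uniform_shift(PAIRS))
```

If every changed operand moves by the same amount, the changes are
plausibly secondary to a single insertion or deletion elsewhere.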

Could these offset differences in readelf(1) output be ignored, at least
optionally?  This would make it easier to find the root cause by reading
the diff.

Cheers,

Daniel

#838569#10
Date:
2016-09-19 08:47:10 UTC
From:
To:
Hi Daniel,

Love the idea! However, my gut cautions against ignoring them, even with an
option.

Perhaps there is a perfect solution whereby we would normalise these two
offsets to (making it up here!) relative values, but simply need to
include, once in the diff, a note that we have done so. That way, we have
a) still captured the underlying issue, b) reduced the noise, and
c) avoided a cumbersome option flag.


Regards,

#838569#15
Date:
2016-09-20 08:37:11 UTC
From:
To:
Chris Lamb wrote on Mon, Sep 19, 2016 at 09:47:10 +0100:

I'm not sure I understand what your idea is.  Could you give an example
of how the output might look?

Do you mean, for example,
.
    @@ -1,2 +3,4 @@
    -0x42
    +«0x42 + 0x10»
.
where the original files read "0x42" (first file) and "0x52" (second file)?

#838569#20
Date:
2016-09-20 09:26:18 UTC
From:
To:
Daniel,

Apologies for not explaining myself better - I don't actually have a
concrete idea for the output, but I was just expressing a wish to avoid
a flag to ignore certain things so was using a hypothetical solution.

We already have a few and I wish they would/could disappear! :)


Regards,

#838569#25
Date:
2016-09-20 13:18:49 UTC
From:
To:
Chris Lamb:

One idea that crossed my mind at some point that might be able to solve
this as well: be able to record other kinds of differences than just
line-oriented ones. Initially, I thought of this as a way to add image
comparison as I felt sad not knowing any free software that could easily
provide similar features to what GitHub offers [1].

But why stop with images? In the precise case of the readelf output,
having a line-oriented diff means we are carrying around useless and
confusing information: the line numbers do not help in any way to
locate and understand the differences.

But what if we could replace the line numbers by the instruction
addresses? Then the noise mentioned by Daniel disappears. Meanwhile, the
actual output will become even more relevant.

Such an approach would require some structural changes to the code, but
could have benefits on many fronts.

 [1]: https://help.github.com/articles/rendering-and-diffing-images/

Hope that's of any use,

#838569#30
Date:
2016-09-20 18:18:16 UTC
From:
To:
Chris Lamb wrote on Tue, Sep 20, 2016 at 10:26:18 +0100:

Perhaps the output could replace all offsets into the .rodata section by
sequential numbers?  For example, if the .rodata section starts at 0xA00
and ends at 0xC00, and the output references 0xA80, 0xA70, 0xB80,
and 0xB70, then those could be translated to .rodata#2, .rodata#1,
.rodata#4, and .rodata#3 respectively.  To make this lossless, the
(.rodata#42 ↦ 0xB53) mapping could be appended to the file and included
in the diff.

Example output:

    lea    «.rodata#1»(%rip),%rsi
    ⋮
    <at the end>
    .rodata#1 is 0xA70
    .rodata#2 is 0xA80

The actual hex values could be displayed as a tooltip on the 'lea' line,
or appended to that line as a '# comment' that will be considered equal
by the unidiff (like 'diff -w' considers space and tab equal).
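
A minimal sketch of the renumbering, under stated assumptions:
label_rodata() is a hypothetical helper, the section bounds are passed
in by the caller, and every hex operand in the input is treated as an
absolute address (real disassembly would first need %rip-relative
operands resolved).  Labels are assigned in ascending address order,
which reproduces the numbering in the example above.

```python
import re

HEX = re.compile(r"0x[0-9a-fA-F]+")


def label_rodata(lines, start, end):
    """Replace hex operands that fall in [start, end) with «.rodata#N»
    labels and append the label-to-address mapping as a legend."""
    # Collect every distinct address that points into the section.
    addrs = sorted({
        int(tok, 16)
        for line in lines
        for tok in HEX.findall(line)
        if start <= int(tok, 16) < end
    })
    index = {addr: n for n, addr in enumerate(addrs, 1)}

    def repl(match):
        value = int(match.group(0), 16)
        if value in index:
            return f"«.rodata#{index[value]}»"
        return match.group(0)  # not a .rodata address: leave untouched

    body = [HEX.sub(repl, line) for line in lines]
    legend = [f".rodata#{n} is 0x{addr:X}" for addr, n in index.items()]
    return body + legend


if __name__ == "__main__":
    sample = [
        "lea    0xA80(%rip),%rsi",
        "lea    0xA70(%rip),%rsi",
    ]
    print("\n".join(label_rodata(sample, 0xA00, 0xC00)))
```

Because the legend is appended to the text being diffed, a changed
address still shows up once, in the legend, rather than on every line
that references it.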

Cheers,

Daniel

#838569#35
Date:
2016-09-20 18:47:31 UTC
From:
To:
Jérémy Bobbio wrote on Tue, Sep 20, 2016 at 13:18:49 +0000:

In the example in the OP, the (source code) line numbers and instruction
addresses are the same between both builds.  It is the .rodata addresses
embedded into the instructions that differ.

However, in the .text section, each disassembled instruction is preceded
by its address.  I think it would make sense to have the diff ignore
those addresses: they serve a purpose similar to line numbers, and
ignoring them cannot cause a difference to be missed.
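
As a rough sketch of what ignoring those addresses could mean:
strip_address() below is a hypothetical helper, and the regular
expression assumes the "address, colon, whitespace" layout seen in the
example in the OP.  It is not taken from diffoscope's code.

```python
import re

# Leading "   44436:<whitespace>" as printed before each instruction.
ADDR_PREFIX = re.compile(r"^\s*[0-9a-f]+:\s+")


def strip_address(line):
    """Drop the leading instruction address, if the line has one."""
    return ADDR_PREFIX.sub("", line, count=1)


if __name__ == "__main__":
    print(strip_address("   46eea:\t48 8b 3c 24          \tmov    (%rsp),%rdi"))
    print(strip_address("./build/../src/nvim/main.c:749"))
```

Source-location lines such as "main.c:749" are left alone because they
do not start with a bare hex number followed by a colon and whitespace.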

Cheers,

Daniel

#838569#40
Date:
2016-09-20 19:34:18 UTC
From:
To:
Daniel Shahaf:

Thanks for pointing this out, I had actually misunderstood the problem
at hand. :)

#838569#45
Date:
2016-09-20 19:58:29 UTC
From:
To:
Daniel Shahaf wrote:

Alas I'm not very learned in ELF, so I will trust the specifics are fine,
but just to check:

… would be displayed (when different, of course!) as *something* like:

 -     .rodata#1 is 0xA70
 +     .rodata#1 is 0xA71

So, tooltips are not only HTML-specific but would also hide data, particularly
for a) users who do not even know they need to run their mouse over something,
b) users who generally drive their browser via a keyboard (probably more common
for users of diffoscope!) and c) users with accessibility requirements.

Anyway, great idea - love it.


Regards,

#838569#50
Date:
2016-09-20 19:59:45 UTC
From:
To:
Jérémy Bobbio wrote:

Pff, you don't like my existing image comparison? ;-)


Regards,

#838569#55
Date:
2016-09-21 15:37:43 UTC
From:
To:
Chris Lamb wrote on Tue, Sep 20, 2016 at 20:58:29 +0100:

I'm not too familiar with ELF either.  I know a little about which
C variables live in which section, e.g., .rodata is storage for string
literals.

Yes.

I was thinking of something like the HTML <acronym> tag.  In my browser,
<acronym title="tooltip">foo</acronym> renders «foo» with a dotted underline
whose raison d'être is your concern (a).  I assume the user agents of
people in categories (b) and (c) have similar solutions.

In any case, displaying the values in a comment is probably better since
it makes the information available without a user action.  (As I said in
my last email, the comment should be exempted from being diffed.)
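
One way such an exemption might work, as a sketch only: compare lines
by a key with the trailing comment removed, but print the full lines.
The " # " separator, key() and changed_lines() are all assumptions made
for illustration; difflib is the Python standard library module.

```python
import difflib


def key(line):
    """Comparison key: the line without its trailing ' # comment'.
    The separator includes spaces so '.rodata#1' is not split."""
    return line.split(" # ", 1)[0].rstrip()


def changed_lines(old, new):
    """Return '-'/'+' lines for regions whose keys differ; regions that
    differ only in their comments are treated as equal."""
    matcher = difflib.SequenceMatcher(
        a=[key(line) for line in old],
        b=[key(line) for line in new],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        out += ["-" + line for line in old[i1:i2]]
        out += ["+" + line for line in new[j1:j2]]
    return out


if __name__ == "__main__":
    old = ["lea    «.rodata#1»(%rip),%rsi  # 0x1c7701"]
    new = ["lea    «.rodata#1»(%rip),%rsi  # 0x1c76f8"]
    print(changed_lines(old, new))
```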

#838569#60
Date:
2016-09-21 17:30:22 UTC
From:
To:
Even so, you can't search the page with CTRL+F and, of course, it makes the
output too different between --text and --html :)

Anyway, small issue ...


Regards,

#838569#65
Date:
2016-09-22 13:46:17 UTC
From:
To:
Daniel Shahaf wrote on Tue, Sep 20, 2016 at 18:47:31 +0000:

flexc++ has a difference on *every* line of several sections (.rodata,
.eh_frame, others) because the sections start 0xc0 bytes later in the
second build than in the first build:

│   │   │   │   │ -  0x0043a320 01000200 00000000 623a423a 633a433a ........b:B:c:C:
│   │   │   │   │ -  0x0043a330 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:
⋮
│   │   │   │   │ +  0x0043a3e0 01000200 00000000 623a423a 633a433a ........b:B:c:C:
│   │   │   │   │ +  0x0043a3f0 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:

Hence, filing this as a separate issue.  #838260 can remain about
offsets embedded in instructions.

(It's safe to ignore these addresses because the start/end of the
section already appear elsewhere in the diffed output.)
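
A sketch of one possible normalisation for the hex dump lines: rewrite
each address as an offset from a base chosen by the caller.  rebase()
is a hypothetical helper; in the usage below the base is simply the
first address shown in each excerpt above, which is an assumption, not
necessarily the real start of the section.

```python
import re

# "  0x0043a320 01000200 ..." -> indent, address digits, rest of line.
DUMP_LINE = re.compile(r"^(\s*)0x([0-9a-fA-F]+)(\s.*)$")


def rebase(line, base):
    """Rewrite the leading absolute address as an offset from `base`.
    Lines that do not look like hex dump rows are returned unchanged."""
    match = DUMP_LINE.match(line)
    if not match:
        return line
    indent, addr, rest = match.groups()
    return f"{indent}+0x{int(addr, 16) - base:08x}{rest}"


if __name__ == "__main__":
    first = "  0x0043a330 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:"
    second = "  0x0043a3f0 64663a46 68693a49 3a4b6c3a 4c3a6d3a df:Fhi:I:Kl:L:m:"
    print(rebase(first, 0x0043a320))
    print(rebase(second, 0x0043a3e0))
```

With the addresses rebased, rows with identical contents compare equal
even though the section moved by 0xc0 bytes.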

#838569#74
Date:
2018-12-15 15:02:22 UTC
From:
To:
forwarded 838569 https://salsa.debian.org/reproducible-builds/diffoscope/issues/17
thanks

I've forwarded this upstream here:

https://salsa.debian.org/reproducible-builds/diffoscope/issues/17


Regards,