#891867 diffoscope: improve .changes diffs

Package:
diffoscope
Source:
diffoscope
Submitter:
Helmut Grohne
Date:
2021-05-26 15:06:03 UTC
Severity:
wishlist
Tags:
#891867#5
Date:
2018-03-01 20:53:00 UTC
From:
To:
Hi awesome reproducible team!

I'm working on adding more build profiles and diffoscope + reproducible
builds turn out to be an awesome tool in validating that my profiles
only drop packages and never change package contents. :)

Now I've run into a strange corner case with diffing .changes files. If
the last entry in a Checksums-Sha256 field is removed, new last entry is
removed and added. For example:

├── Files
│ @@ -1,12 +1,8 @@
│
│   a6434caf7853f0e84013e6004b934351 10496 debug optional cracklib-runtime-dbgsym_2.9.2-5.1_amd64.deb
│   93d547da959c0080bb45cc05d354539f 148972 admin optional cracklib-runtime_2.9.2-5.1_amd64.deb
│ - 608dbc69affb8158f7f203e3c83e5595 11084 libs optional cracklib2_2.9.2-5.1_amd64.buildinfo
│ + 1941a0fd2dbf8b3df8536caab1895993 9042 libs optional cracklib2_2.9.2-5.1_amd64.buildinfo
│   b2010c8e3498d469ae3b5bf5e5469da1 21416 debug optional libcrack2-dbgsym_2.9.2-5.1_amd64.deb
│   f099f5629a8b0cad9e5fba5ded9f88d9 32052 libdevel extra libcrack2-dev_2.9.2-5.1_amd64.deb
│   e3b100bcf99561c965cc5da8ef58e206 120108 debian-installer optional libcrack2-udeb_2.9.2-5.1_amd64.udeb
│ - e04c3f5f14a2c8eb2022ba8eb1d5d915 54788 libs optional libcrack2_2.9.2-5.1_amd64.deb
│ - 85fa278ff743cf93ead1075664df42f2 12880 debug optional python-cracklib-dbgsym_2.9.2-5.1_amd64.deb
│ - 674e3f26f4301f5c253fddb0366a3832 23392 python optional python-cracklib_2.9.2-5.1_amd64.deb
│ - c1bceeb542a7dbed7d2989fd93bba60b 14160 debug optional python3-cracklib-dbgsym_2.9.2-5.1_amd64.deb
│ - b9868203bdda021286129428b5eb3e55 23424 python optional python3-cracklib_2.9.2-5.1_amd64.deb
│ + e04c3f5f14a2c8eb2022ba8eb1d5d915 54788 libs optional libcrack2_2.9.2-5.1_amd64.deb

The DotChangesFile comaprator performs a super call into DebControlFile,
which pulls the Checksums-Sha256 field out of the parsed deb822 file. In
doing so it strips the trailing newline and this what makes for the
difference above. I propose working around the issue with the following
patch:
--- a/diffoscope/comparators/debian.py
+++ b/diffoscope/comparators/debian.py
@@ -142,16 +142,16 @@
         # Compare Files as string
         if self.deb822.get('Files'):
             differences.append(Difference.from_text(
-                self.deb822.get_as_string('Files'),
-                other.deb822.get_as_string('Files'),
+                self.deb822.get_as_string('Files') + "\n",
+                other.deb822.get_as_string('Files') + "\n",
                 self.path,
                 other.path,
                 source='Files',
             ))
         else:
             differences.append(Difference.from_text(
-                self.deb822.get_as_string('Checksums-Sha256'),
-                other.deb822.get_as_string('Checksums-Sha256'),
+                self.deb822.get_as_string('Checksums-Sha256') + "\n",
+                other.deb822.get_as_string('Checksums-Sha256') + "\n",
                 self.path,
                 other.path,
                 source='Checksums-Sha256',

Many thanks to Mattia Rizzolo for pointing precisely at the relevant
source code.

Helmut

#891867#10
Date:
2018-03-01 21:00:27 UTC
From:
To:
Hi Helmut!

Thanks for the report with patch :)

Why not .strip() them all instead, out of interest?

Aw, you know how to flatter us!


Regards,

#891867#15
Date:
2018-03-01 21:00:27 UTC
From:
To:
Hi Helmut!

Thanks for the report with patch :)

Why not .strip() them all instead, out of interest?

Aw, you know how to flatter us!


Regards,

#891867#18
Date:
2018-03-01 21:24:01 UTC
From:
To:
Because deb822.get_as_string() returns a single string containing all of
'Checksums-Sha256' (in this case) with the \n embedded in that string.
So to .str() them all you'd need to instead do a .splitlines() of
deb822.get_as_string() and then add something clever to Difference to
handle the array instead of a text.

I am not a fan of "randomly" adding \n either, but for sure it is the
simplest way to make the diff behaviour here saner.

#891867#23
Date:
2018-03-01 21:20:00 UTC
From:
To:
Control: tags -1 - patch

Feel free to do so. I just felt that changing the type from str to
List(str) would be non-trivial and I couldn't immediately figure out
whether that works as intended. It would have the additional benefit of
also removing the indentation. When removing the first line you get the
very same problem with indentation rather than newline. Diff these
against one another:

"line1\n line2\n line3"
"line1\n line2"
"line2\n line3"

So yeah, we need a better solution here.

Helmut

#891867#30
Date:
2018-03-01 21:46:29 UTC
From:
To:
Mattia Rizzolo wrote:

Ah, I see:

  '\n 75308f81588e44097465180b3572e3469d70c34fccd3ba14978d8a76d4036c7f 688 source_1.dsc\n ce8b1507e6cc936c6252 […]

[…]

Assuming you mean .strip() ..  Okay, why don't we .rstrip()? That would at
least normalise the suffix without screwing up the suffix? I'm probably
missing something :)


Best wishes,

#891867#35
Date:
2018-03-01 22:01:06 UTC
From:
To:
Helmut Grohne wrote:

(.strip() has a prototype of "str" → "str" so I am confused about all these
references to "[str]".)


Regards,

#891867#40
Date:
2018-12-15 15:01:23 UTC
From:
To:
forwarded 891867 https://salsa.debian.org/reproducible-builds/diffoscope/issues/30
thanks

I've forwarded this upstream here:

https://salsa.debian.org/reproducible-builds/diffoscope/issues/30


Regards,

#891867#49
Date:
2020-05-29 08:23:23 UTC
From:
To:
#891867#54
Date:
2020-05-29 20:03:08 UTC
From:
To:
Control: reopen -1

If you close a bug, please include the fixed version.

I confirm that the issue is reproducible with diffoscope version 145.

Helmut

#891867#61
Date:
2021-05-26 08:59:03 UTC
From:
To:
Hey Helmut,

So sorry, I didn't see this email at the time.

Will do. :)  Glancing through the about salsa issue, it was fixed in
both 145 but also in 146.

Can you confirm in, say, version 175? If so, do you have a testcase that
I can replicate locally?


Regards,

#891867#66
Date:
2021-05-26 12:21:46 UTC
From:
To:
Hi Chris,

I find this way of dealing with the issue quite disrespectful. I've
included a patch (which admittedly is more of a workaround) and the bug
log fully explains what the issue is, why it is there and what can be
done to fix it. You didn't like my solution, fine, I removed the patch
tag.

To replicate it locally, you can create a rfc822 file, copy it and
remove the first or last line of a multiline field in your copy.

As for a real-world example, look for any reproducible source package
that builds both arch:all and arch:any packages. Say dash. Now perform a
full build.  Then look into what kind of package comes last (or first)
in the .dsc's Package-List. Happens to be that dash (arch:any) is last.
Now drop it by performing an indep-only build. Go compare your builds.
This is what you see (with version 175):

| ...
| │ ├── Checksums-Sha256
| │ │ @@ -1,4 +1,2 @@
| │ │
| │ │ - f54228a4191361ffd0bc9ebe8b22d2545356aff9efa6d4a949dc8fa22114519c 34828 ash_0.5.11+git20210120+802ebd4-1_all.deb
| │ │ - e09828d59480ac2182e7781c358cea5c60cc513d68ee435f07809b50253a01cc 154972 dash-dbgsym_0.5.11+git20210120+802ebd4-1_amd64.deb
| │ │ - 572fc08ed88b8c79a99322110f8597ef0cbb49ccbdb1a31b485cdc40461debd2 115512 dash_0.5.11+git20210120+802ebd4-1_amd64.deb
| │ │ + f54228a4191361ffd0bc9ebe8b22d2545356aff9efa6d4a949dc8fa22114519c 34828 ash_0.5.11+git20210120+802ebd4-1_all.deb

This is a real-world issue, well-understood since three years.

If you don't like the original workaround, how about this variant that
deals with line endings and indentation simultaneously?

| def normalize_multiline(lines: str) -> str:
|     return "".join(line.strip() + "\n" for line in lines.splitlines())

Can we please stop running in circles one step per year?

Helmut