#807270 mk-origtargz: create reproducible tarballs and --mtime option

Package:
devscripts
Source:
devscripts
Description:
scripts to make the life of a Debian Package maintainer easier
Submitter:
Hans-Christoph Steiner
Date:
2025-03-23 15:51:02 UTC
Severity:
wishlist
#807270#5
Date:
2015-12-06 21:21:04 UTC
From:
To:
Whenever mk-origtargz is repacking a zipball, it should zero out the
timestamps in the tar format so that the process produces the same
tarball every time it runs.  This can be done using tar's --mtime= flag.
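
A minimal sketch of the idea in Python, using the standard `tarfile` module as a stand-in for `tar --mtime=`. The names `zero_mtime` and `pack` and the `pkg` archive prefix are hypothetical, and this is not how mk-origtargz is implemented:

```python
import tarfile


def zero_mtime(info):
    """Filter for TarFile.add(): clamp every member timestamp to the epoch."""
    info.mtime = 0
    return info


def pack(src_dir, out_path):
    """Write src_dir into an uncompressed tar with all mtimes set to 0."""
    # TarFile.add() walks directories in sorted order, so member order is stable.
    with tarfile.open(out_path, "w", format=tarfile.GNU_FORMAT) as tar:
        tar.add(src_dir, arcname="pkg", filter=zero_mtime)
```

Ownership and permission bits still come from the filesystem here, so this only removes the timestamp as a source of variation.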

Additionally, it would be very useful if mk-origtargz also had a --mtime
option which forced the tarball to be repacked using the date given to
the --mtime="Wed Oct 28 10:12:27 2015 -0700" flag.  Here's an example of
how to do that in Perl:

https://stackoverflow.com/a/16728218
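
The same idea can be sketched in Python rather than Perl. This assumes the date is always written in exactly the layout shown above (GNU tar itself accepts many more layouts); `mtime_from_string` and `fixed_mtime_filter` are hypothetical helper names:

```python
import tarfile
from datetime import datetime


def mtime_from_string(stamp):
    """Convert a 'Wed Oct 28 10:12:27 2015 -0700' style date to epoch seconds."""
    parsed = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y %z")
    return int(parsed.timestamp())


def fixed_mtime_filter(stamp):
    """Build a TarFile.add() filter that forces every member to one timestamp."""
    epoch = mtime_from_string(stamp)

    def _filter(info):
        info.mtime = epoch
        return info

    return _filter
```

The returned filter can be passed as `filter=` to `tarfile.TarFile.add()` when repacking.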

This gets us ever closer to the goals of reproducible builds, where we
can guarantee that, given the same original source code, the resulting
binaries are always the same.  For more on that topic:

https://reproducible-builds.org/

#807270#10
Date:
2015-12-07 13:30:10 UTC
From:
To:
Hi,

This is an "important wishlist" :-)

Let's read the date from debian/changelog top entry and set mtime
as described here.
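
Reading the date of the top changelog entry could look roughly like this. The sketch assumes the usual trailer line layout (` -- Name <email>  date`, with two spaces before the date); `changelog_epoch` is a hypothetical name and any changelog text fed to it in examples is made up:

```python
import re
from email.utils import parsedate_to_datetime

# Trailer line of a debian/changelog entry: " -- Name <email>  RFC 2822 date"
TRAILER = re.compile(r"^ -- .*>  (.+)$")


def changelog_epoch(changelog_text):
    """Return the date of the first (topmost) changelog entry as epoch seconds."""
    for line in changelog_text.splitlines():
        match = TRAILER.match(line)
        if match:
            return int(parsedate_to_datetime(match.group(1)).timestamp())
    raise ValueError("no changelog trailer line found")
```

The resulting integer is what would be handed to tar as the mtime.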

 Currently, mk-origtargz calls gzip with "-n".

 xz and bzip2 do not seem to have such an option, so we set no flag.

 None of these are guaranteed to produce the same result, since
 compression seems to be arch dependent (at least gzip).

If you know any options to improve REPRODUCIBILITY of gzip/xz/bzip2, let
us know.
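
For reference, what `gzip -n` changes is the header: it omits the original file name and timestamp. A small Python sketch of the same effect follows; `gzip_n` is a hypothetical name, and the xz and bzip2 stream formats have no name or timestamp field to clear in the first place:

```python
import gzip


def gzip_n(data):
    """Compress in memory with a zero header timestamp and no stored file name."""
    return gzip.compress(data, compresslevel=9, mtime=0)


payload = b"example tarball bytes\n" * 100
first = gzip_n(payload)
second = gzip_n(payload)
```

This only pins the header. Whether the compressed stream itself is identical across zlib versions or architectures is a separate question that the sketch does not answer.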

Osamu

#807270#15
Date:
2015-12-07 14:04:12 UTC
From:
To:
Hi,

Second thought ...

uscan/mk-origtargz/uupdate is not run during the binary package building
process.  Does reproducible builds aim to create the source package in a
reproducible way?

If reproducible builds is aiming for binary build reproducibility,
changing the behavior of uscan/mk-origtargz/uupdate has no impact.
...

Why do you need this?  unzip preserves the file timestamps inside of the
zip archive.  Am I right?  Is this something we need to do for repacking
of tar.gz?

Yah, if it is needed.

Well ... it is simpler than this as long as we know what date to set.
Just run tar with --mtime option in the code with the reference file or
date string.

Regards,

Osamu

#807270#20
Date:
2015-12-07 19:07:53 UTC
From:
To:
Osamu Aoki:

We want to have the whole process able to be inspected, including the
process of making the source tarballs.  But yes, binary reproducibility
is more important.  In this case, it is pretty easy to make reproducible
source tarballs, so I think it's worth doing.

I believe unzip will preserve the timestamps.

As long as mk-origtargz has an --mtime option, then we can use the most
appropriate date.  For example, with Android SDK packages, we can get
the git commit date of the release, since upstream does not post release
tarballs, only git tags.  It is this use case that made me want
mk-origtargz to support --mtime.

.hc

#807270#25
Date:
2015-12-09 12:04:56 UTC
From:
To:
Hi,

Please also remember that reproducing upstream content, including the
file time stamps, is an important factor.

So why do you wish to overwrite mtime?  Does upstream release a zipball
with a different time stamp every time a user requests a download?

Please be concrete about the need, with an actual example package, so we
are not expanding on fantasy.

If we add features, we need to add an infinite number of them unless
there is a strong case which makes the addition useful.  Does the Android
SDK zipball have random timestamps inside?

Regards,

Osamu

#807270#30
Date:
2015-12-10 10:42:46 UTC
From:
To:
Osamu Aoki:

Yes, Google's http://googlesource.com website provides nice .tgz
download links for every commit, but those tarballs are different every
time.

http://googlesource.com uses the current date/time as the time stamp
each time you download it.  The timestamp is the most likely variation
when producing source tarballs from git/etc.
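
A rough sketch of repacking such a download so that two fetches of the same commit end up byte-identical. It uses Python's `tarfile` instead of GNU tar; `normalize` is a hypothetical name, and resetting the owner to root is an assumption made for the example rather than something requested in this bug:

```python
import io
import tarfile


def normalize(tar_bytes, mtime):
    """Rewrite a (possibly compressed) tarball with fixed mtime, owner and order."""
    src = tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*")
    out = io.BytesIO()
    with src, tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        for member in sorted(src.getmembers(), key=lambda m: m.name):
            member.mtime = mtime
            member.uid = member.gid = 0
            member.uname = member.gname = "root"
            # Only regular files carry data; directories and links do not.
            data = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, data)
    return out.getvalue()
```

The output is an uncompressed tar, so compression (and its own reproducibility) remains a separate step.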

.hc

#807270#35
Date:
2015-12-10 16:47:24 UTC
From:
To:
Hi,

OK.  This is deprecated source but now you are talking about not just
zipball but tarball.

OK so your feature request is to have such an option not just for zip
but for all archives.  That makes some sense to me now.

Good night.  I need some time to think.

Osamu

#807270#40
Date:
2017-08-31 08:57:02 UTC
From:
To:
user reproducible-builds@lists.alioth.debian.org
usertags 807270 toolchain
thanks

Hey all,

Adding a Reproducible Builds usertag and pinging the ML -- I hadn't
spotted this wishlist bug before.


Best wishes,

#807270#45
Date:
2025-03-20 19:22:27 UTC
From:
To:
That parameter was explicitly added for reproducibility here:
https://salsa.debian.org/kernel-team/linux/-/commit/ea024852d4

The Debian kernel team switched from their own 'genorig.py' script to
using ``uscan``, which IIUC invokes mk-origtargz here:
https://salsa.debian.org/kernel-team/linux/-/commit/55243dbd8d6842f

But I want to use my local clone of the upstream kernel instead of
downloading ~250MB each time, so I want to restore that 'genorig.py'
script for myself, but still get identical results.

The sha256sums of the uncompressed tar archives are identical and
diff-ing the extracted orig.tar.xz archives showed no difference at all.
So I went looking what could be the reason why the sha256sums of the
orig.tar.xz files were different. And that's when I found the first
mentioned commit. And reproducibility is good, so it seems best if
mk-origtargz is improved to produce reproducible results.
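
The comparison described above (identical uncompressed tar, different .xz) can be sketched in Python as follows. The two presets are only an illustration of how compressor settings can change the file without changing the payload; they are not the parameter from the commit mentioned above, and `compare_xz` is a hypothetical name:

```python
import hashlib
import lzma


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def compare_xz(first, second):
    """Report whether two .xz blobs match as files and as decompressed payloads."""
    return {
        "same_file": sha256(first) == sha256(second),
        "same_payload": sha256(lzma.decompress(first))
        == sha256(lzma.decompress(second)),
    }


payload = b"identical tar stream\n" * 1000
result = compare_xz(
    lzma.compress(payload, preset=6),
    lzma.compress(payload, preset=9),
)
print(result)
```

When `same_payload` is true but `same_file` is false, the difference lies in the compression step rather than in the tar archive.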

So a +1 on this feature request from me.

Cheers,
  Diederik

#807270#50
Date:
2025-03-20 21:37:15 UTC
From:
To:
+1 on reproducible tarballs.  I've been spending way too much time to
achieve this for 'make dist' tarballs of a couple of projects (libtasn1,
libidn2, inetutils, ...).  It is not a simple matter.  Modification time
of files is used by 'make' for dependency rebuild ordering and may also
end up as timestamps inside files.

"Diederik de Haas" <didi.debian@cknow.org> writes:

Here is one resource to read for more hints:

https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html

/Simon

#807270#55
Date:
2025-03-20 23:09:01 UTC
From:
To:
sure, +1, patches welcome! :) \o/

#807270#60
Date:
2025-03-21 09:06:50 UTC
From:
To:
Holger Levsen <holger@layer-acht.org> writes:

Attached starting point, thoughts?

https://salsa.debian.org/debian/devscripts/-/merge_requests/490

The patch needs review/improvement from those more familiar with
mk-origtargz and the debian/tests/ framework.

My main argument is that solving this is harder than it looks, and I
fear that solving the general problem here may actually be infeasible.
It can help to realize this, otherwise one may think that solving this
is just a matter of adding the right parameters (which is what the patch
attempts to do).

While we could attempt to continue patching things, how about a bigger
question: why do we re-create tarballs?

I guess there are many different use-cases, but I believe some of them
are symptoms of some bigger problem.  The solution in those use-cases
isn't to improve reproducibility of tarball re-creation, it is to avoid
creating our own tarballs.  Maybe some use-cases really do require us to
re-create tarballs, and maybe in those particular cases designing a
solution to the --mtime concern is feasible.

For those wanting to understand why solving the --mtime concern is a
hard problem, here is a partial helper tool to aid with this:

https://lists.gnu.org/archive/html/bug-gnulib/2025-02/msg00166.html

I dislike all that complexity though, so for some upstream projects
(libtasn1, libidn2, inetutils, ...) I am using a heavy hammer like this:

TAR_OPTIONS += --mode=go+u,go-w --mtime=$(abs_top_srcdir)/NEWS
mtime-NEWS-to-git-HEAD:
	$(AM_V_GEN)if test -e $(srcdir)/.git \
			&& command -v git > /dev/null; then \
		touch -m -t "$$(git log -1 --format=%cd --date=format-local:%Y%m%d%H%M.%S)" $(srcdir)/NEWS; \
	fi

We could do the same in Debian, replacing NEWS with last timestamp of
debian/changelog, but it is important to remember that this is an ugly
workaround rather than a solution.  Solving it like this will lead to
other problems.  Solving it properly requires going to the root cause of
the problem, which is what Bruno is chasing in that e-mail thread.

/Simon

#807270#65
Date:
2025-03-23 15:35:07 UTC
From:
To:
I had made some comments on the MR, but I think it's useful to keep it
all together, so I'll redo that here. At the end of the message.

... having looked into this a bit more, I agree. (more later)

I consider that out of scope for this bug, so I won't comment on that.
rules and (especially if everyone uses that) consistency.
There is one 'problem' though: it only supports git (for now?).

The ``genorig.py`` script stored the orig_date like this:

  orig_date = time.strftime("%a, %d %b %Y %H:%M:%S +0000",
      time.gmtime(
          os.stat(os.path.join(self.dir, self.orig, 'Makefile'))
          .st_mtime))

And then orig_date is used to set the --mtime parameter to tar.
That ``genorig.py`` script also had a useful comment:

    # exclude_files() will change dir mtimes.  Capture the original
    # release time so we can apply it to the final tarball.
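
A self-contained sketch of that ordering: capture the date first, then prune. `capture_orig_date` and `prune` are hypothetical names, not functions from ``genorig.py``, and the choice of reference file is left to the caller:

```python
import os
import time


def capture_orig_date(reference_file):
    """Format the reference file's mtime as a UTC date string."""
    mtime = os.stat(reference_file).st_mtime
    return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(mtime))


def prune(tree, relative_paths):
    """Delete files, as an exclusion step would.

    Removing an entry updates the mtime of its parent directory, which is
    why the date has to be captured before this runs.
    """
    for rel in relative_paths:
        os.remove(os.path.join(tree, rel))
```

Calling `capture_orig_date()` before `prune()` keeps the release date available even though the directory timestamps no longer reflect it afterwards.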

I don't really care which date format is used, but I do care that it's
used consistently. And if the archive is repackaged or not, the mtime
should be the same (which was the whole idea behind storing orig_date).
Similarly it shouldn't matter if the archive is created via ``uscan`` or
via a call to ``mk-origtargz`` directly.

It's indeed a(n ugly) workaround but I do think it's useful; having each
package declare which upstream file to use sounds like a very bad idea.
into too many rabbit holes, I'll settle for a decent one ;-)

And now for a review of the patch/MR itself:

First of all: thanks for a proper commit message :-)
to go with ustar and not go with any of the other archive formats.
I usually put links at the bottom of the commit message which can be
used for background/further reading, but the commit message itself
should contain all the information needed.

https://www.gnu.org/software/tar/manual/html_section/Formats.html

says about ustar: "Archive format defined by POSIX.1-1988 and later."
which I think is a really good argument (I like standards).

I also see 'posix' as archive format:
"The format defined by POSIX.1-2001 and later."
"This archive format will be the default format for future versions of
GNU tar."

POSIX.1-2001 doesn't sound too recent, so why not go with that?
There may be very valid reasons, but please describe why you choose NOT
to go with that.
That can then be used in the future to re-evaluate that choice.

btw: Is this what you mean by 'pax'?
The serverfault page describes it as POSIX.1-2001, but the Formats page
doesn't have the word 'pax'.
The upstream tar git repo does have a 'paxutils' submodule, not to be
confused with the 'pax-utils' Debian package.
Then there's also a 'pax' Debian package "Portable Archive Interchange
(cpio, pax, tar)" which sounds useful (?), but its package description
has this: "This is the MirBSD paxtar implementation supporting the
formats ... old tar, and ustar, but not the format known as pax yet" :-O

[ continuing with 'Formats' ]
"The default format for GNU tar is defined at compilation time. You may
check it by running tar --help, and examining the last lines of its
output. Usually, GNU tar is configured to create archives in ‘gnu’
format, however, a future version will switch to ‘posix’."

```sh
diederik@bagend:~$ tar --version
tar (GNU tar) 1.35
diederik@bagend:~$ tar --help | tail -n3
*This* tar defaults to:
--format=gnu -f- -b20 --quoting-style=escape --rmt-command=/usr/sbin/rmt
--rsh-command=/usr/bin/rsh
```

The ``gnu`` archive format description has:
"Format used by GNU tar versions up to 1.13.25."
I didn't see a format specification in ``debian/rules``, so it seems the
default is still ``gnu``?
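
To illustrate why the format has to be pinned at all, here is a small sketch using Python's `tarfile`, which offers the same three format names. This is Python's implementation, not GNU tar, so the details may differ; `pack_one` and the 150-character member name are made up for the example:

```python
import io
import tarfile

FORMATS = {
    "ustar": tarfile.USTAR_FORMAT,
    "gnu": tarfile.GNU_FORMAT,
    "posix": tarfile.PAX_FORMAT,
}


def pack_one(name, payload, fmt):
    """Write a single member with mtime 0 in the requested archive format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=FORMATS[fmt]) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        info.mtime = 0
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


long_name = "x" * 150  # longer than the 100-byte ustar name field, no "/" to split on
gnu_bytes = pack_one(long_name, b"data", "gnu")
posix_bytes = pack_one(long_name, b"data", "posix")
```

With this input, Python's `ustar` writer refuses the name with a `ValueError`, while `gnu` and `posix` both succeed but encode the long name differently, so the same tree yields different bytes depending on the format.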

It sounds to me that ``ustar`` or ``posix`` are better than ``gnu``, but
is it wise if the Debian tar package uses a different archive format
(by default) than what mk-origtargz does/will do?
The Debian tar maintainer had its last upload 4 years ago and I haven't
found any upload by its official 'uploader' ... ever, so CC-ing them
didn't sound too useful.

And then I stopped myself from going into more rabbit holes ...

Excellent.

Idem.

Why? Does this change the permissions of the files in the archive?
If so, then that sounds like a bad idea.
If it is useful, then the reasoning for doing that should be documented
with an optional link to the Guix page (?) that you used for its
justification.

Without it, this patch won't close bug 807270, but referencing that bug
in this patch seems *very* useful.
And I want to reiterate that "exclude_files() will change dir mtimes",
which IIUC makes things NOT reproducible.

You're using a mix of tabs and spaces above; please use only spaces to
match the rest of the file.

Cheers,
  Diederik

#807270#70
Date:
2025-03-23 15:44:20 UTC
From:
To:
Archive format selection:
pax                      POSIX 1003.1-2001 (pax) format
posix                    same as pax

Cheers,
  Diederik

#807270#75
Date:
2025-03-23 15:48:08 UTC
From:
To:
Forgot to mention this, but please also add a link to
https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html

which you shared in your mail before the patch and is *really* useful!

Cheers,
  Diederik