#1009196 texlive-binaries: Reproducible content of .fmt files

Package:
texlive-binaries
Source:
texlive-bin
Description:
Binaries for TeX Live
Submitter:
Roland Clobus
Date:
2022-05-04 16:03:03 UTC
Severity:
wishlist
Tags:
#1009196#5
Date:
2022-04-08 16:57:56 UTC
From:
To:
Hello maintainers of texlive-binaries,

While working on the “reproducible builds” effort [1], I have noticed that the
live image for Cinnamon in bookworm is no longer reproducible [2].

The attached patch ensures that the output of the function 'exception_strings'
always uses the same order of the hyphenation exceptions.
I've written the solution in C, perhaps someone more versed in lua could
rewrite it more elegantly.
(The lua manual says for the 'next' function: 'The order in which the indices
are enumerated is not specified' [3])
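
The idea of enumerating the exceptions in a fixed order can be sketched in plain C as follows. This is a minimal illustration, not the attached patch: the helper names `compare_keys` and `sort_exception_keys` are hypothetical, and the sketch assumes the keys have already been collected from the Lua table into an array of C strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings (char *). */
static int compare_keys(const void *a, const void *b)
{
    const char *const *ka = a;
    const char *const *kb = b;
    return strcmp(*ka, *kb);
}

/*
 * Sort the collected exception keys in place, so that whatever writes
 * them out afterwards sees the same order on every run, regardless of
 * the order in which the hash table happened to enumerate them.
 * Hypothetical helper: not the function from the attached patch.
 */
static void sort_exception_keys(const char **keys, size_t count)
{
    assert(keys != NULL || count == 0);
    if (count > 1)
        qsort(keys, count, sizeof keys[0], compare_keys);
}
```

Sorting by byte value with `strcmp` is enough here, because the goal is only a stable order, not a linguistically meaningful one.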

With the attached patch applied, I'm able (with the help of FORCE_SOURCE_DATE=1
and SOURCE_DATE_EPOCH) to reproducibly rebuild the .fmt files, as created by
'fmtutil --sys --all'.

Small test case to reproduce:
export FORCE_SOURCE_DATE=1
export SOURCE_DATE_EPOCH=$(date +%s)
for i in `seq 1 10`; do
  luahbtex -ini -jobname=luahbtex -progname=luahbtex luatex.ini > /dev/null
  md5sum luahbtex.*
done

With kind regards,
Roland Clobus

 [1]: https://wiki.debian.org/ReproducibleBuilds
 [2]:
https://jenkins.debian.net/view/live/job/reproducible_debian_live_build_cinnamon_bookworm/
 [3]: http://www.lua.org/manual/5.4/manual.html#pdf-next

#1009196#10
Date:
2022-04-11 04:56:42 UTC
From:
To:
Hi Luigi, hi all luatex devs,

here at Debian we got a bug report about the reproducibility of luatex
format dumps. It contains a patch to make the hyphenation exception list
sorted. (I attach the patch)

Could you please check whether this is still relevant for the
latest release of luatex?

Thanks

Norbert

#1009196#15
Date:
2022-04-11 07:00:19 UTC
From:
To:
it actually defeats one of the security properties of lua (which was
explicitly introduced at some point: make sure that hashes have random
order each run so that it's harder to retrieve sensitive data from mem)

that said, it means that as soon as something other than exceptions gets
stored in the format one can face the same issue (although one can work
around that by sorting etc)
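
Why a per-run seed changes the enumeration order can be illustrated with a toy seeded hash. This is a sketch under stated assumptions: `toy_hash` and `toy_bucket` are hypothetical names, and the mixing function is deliberately simple and is not Lua's actual hash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy seeded string hash, for illustration only; this is NOT Lua's
 * actual hash function. The seed is mixed into the starting state, so
 * every resulting hash value depends on it.
 */
static unsigned int toy_hash(const char *key, unsigned int seed)
{
    size_t len = strlen(key);
    unsigned int h = seed ^ (unsigned int)len;
    for (size_t i = 0; i < len; i++)
        h = h * 31u + (unsigned char)key[i];
    return h;
}

/*
 * Bucket a key would land in for a table with `buckets` slots. A
 * traversal that walks the buckets in index order visits the keys in
 * an order determined by these indices, and therefore by the seed.
 */
static unsigned int toy_bucket(const char *key, unsigned int seed,
                               unsigned int buckets)
{
    assert(buckets > 0);
    return toy_hash(key, seed) % buckets;
}
```

With a fixed seed the bucket of each key is the same on every run, so the traversal order is too; with a seed taken from the clock it generally is not.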

if you want reproducibility for some testing, mess with this instead:

#if !defined(luai_makeseed)
#include <time.h>
#define luai_makeseed()		cast(unsigned int, time(NULL))
#endif

anyway, formats with embedded lua data (serialized or bytecode) are never
guaranteed to be the same unless one makes some effort

fwiw: the easiest solution is to not store patterns and exceptions in
the format and just load them runtime which is just as fast (in
retrospect not a good idea to store it but it was needed for some plain
compatibility testing)

Hans

(who in the past has been bitten by this 'random feature' when we made
the switch to 5.3, or maybe it was even 5.2; it used to be 'random per
binary' and became 'random per run' but we decided to stick with
official lua)
----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl -----------------------------------------------------------------
#1009196#20
Date:
2022-04-11 11:01:32 UTC
From:
To:
Hi Hans, hi Roland,

thanks for your answer.

Well, that is a good point to *not* implement the change.

Roland, do you have any comments? I guess the striving for reproducibility
is not as important as security.

So if something along these lines is to be done, it would need to
change the sort order if and only if FORCE_SOURCE_DATE=1 is set in the
environment (this is what has been required for TeX engines to obey
SOURCE_DATE_EPOCH settings).

Roland, if you have time, please adjust the patch to work within the
above constraints.

Best regards

Norbert

#1009196#25
Date:
2022-04-11 11:48:44 UTC
From:
To:
Not only the fmt: every output (temporary data, log and PDF) could suffer
from the same problem if it depends on a Lua table that is not an array.
The format should serialize only arrays, or use a metatable
(e.g.
https://stackoverflow.com/questions/30970034/lua-in-pairs-with-same-order-as-its-written
)
Even if we hard-code an ordered table data structure in some way, it is
still the responsibility of the format to use it, but then metatables
are more flexible.

#1009196#30
Date:
2022-04-11 13:26:04 UTC
From:
To:
If the final output (pdf) has traces of that, it might be of concern.
But for now the discussion is about the fmt dump, which is independent
of these items.

Best regards

Norbert

#1009196#35
Date:
2022-04-11 14:34:00 UTC
From:
To:
Hello Hans, Norbert,

Thanks for your answers.

Well, reproducibility is *another* aspect of security; this time not for
the regular environments that users will use, but for build environments.

Reproducibility (as enforced by SOURCE_DATE_EPOCH) is typically enabled
in an environment that generates binaries from source code for
redistribution. It will guarantee that the build environment has not
been tampered with, because you can (if you have made a similar build
environment yourself) generate the binary files bit-for-bit identical.
For a regular, production environment you should not have
SOURCE_DATE_EPOCH set.

Other programming languages also have solved the security risks
associated with the randomness of the hashes and reproducibility, see
[1]. For Perl, the hashes can be de-randomized with PERL_HASH_SEED.
Python uses PYTHONHASHSEED.
For Lua an environment variable LUA_HASH_SEED could be introduced, or
per default the value of SOURCE_DATE_EPOCH (if set) instead of
time(NULL) could be used to seed the hashes.

The texlive-binaries in Debian contain an embedded copy of Lua 5.3. The
Lua 5.4 version of luai_makeseed is more complex, see [2]. I'll write a
feature request for Lua later, that is out-of-scope for this scenario.

Ack. Thanks for the pointer to luai_makeseed, that was some missing
information that I needed. I'll post an updated patch soon (most
probably much smaller and more elegant). As written above, the hash seed
will be de-randomized only when both FORCE_SOURCE_DATE=1 and
SOURCE_DATE_EPOCH are set.

With kind regards,
Roland Clobus

[1] https://reproducible-builds.org/docs/stable-outputs/
[2] https://sources.debian.org/src/lua5.4/5.4.4-1/src/lstate.c/?hl=73#L73

#1009196#40
Date:
2022-04-11 15:29:11 UTC
From:
To:
fyi: it is unlikely that luatex will move to 5.4 because it might break
existing code and/or introduce incompatibilities (so we assume 5.3 for now)

Hans
#1009196#45
Date:
2022-04-19 07:16:50 UTC
From:
To:
Hello list,

This patch applies to Lua-based TeX binaries: only when both
FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set, it initialises the
Lua seed to the value of SOURCE_DATE_EPOCH instead of a random value.
With this patch, the .fmt files can be generated bit-for-bit identical.

Regarding the patch:
* This patch is intended only for Lua 5.3 that is embedded in
texlive-binaries
* A re-definition of `luai_makeseed` is unfortunately not sufficient for
Lua 5.3; for 5.4.4 and later it would be. [1]
* I've added no validation for the content of SOURCE_DATE_EPOCH:
** 1) That happens in other code locations already
** 2) Even if the value would be incorrect, the Lua seed will still be
de-randomized
* Do you want some comment lines?
* The sorting from my previous patch is no longer required. Only
lstate.c needs to be modified.
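
The rule described in this list can be sketched as a small C helper. This is an illustration under stated assumptions, not the patch itself: the names `choose_hash_seed` and `hash_seed_from_environment` are hypothetical, and the sketch does not show how the value is wired into lstate.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Pick the seed for the hash function. Hypothetical helper, not the
 * code from the patch: it only illustrates the rule "fixed seed if and
 * only if FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are both set".
 */
static unsigned int choose_hash_seed(const char *force_source_date,
                                     const char *source_date_epoch,
                                     unsigned int fallback)
{
    if (force_source_date != NULL && strcmp(force_source_date, "1") == 0
        && source_date_epoch != NULL) {
        /* No validation on purpose: strtoul yields 0 for non-numeric
         * text, which is still the same value on every run. */
        return (unsigned int)strtoul(source_date_epoch, NULL, 10);
    }
    return fallback;
}

/* Read both variables from the environment; fall back to the clock. */
static unsigned int hash_seed_from_environment(void)
{
    return choose_hash_seed(getenv("FORCE_SOURCE_DATE"),
                            getenv("SOURCE_DATE_EPOCH"),
                            (unsigned int)time(NULL));
}
```

Passing the two strings as parameters keeps the decision testable without touching the process environment; only the thin wrapper calls `getenv`.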

With kind regards,
Roland Clobus


PS: If you later intend to upgrade to another version of Lua, the fixed
seed value can help you in automated tests to see different behaviour
due to the upgrade.

[1]
https://github.com/lua/lua/commit/97e394ba1805fbe394a5704de660403901559e54

#1009196#50
Date:
2022-04-19 07:52:46 UTC
From:
To:
Thank you very much for your patch, I will check it this weekend.
#1009196#55
Date:
2022-04-19 09:18:24 UTC
From:
To:
Hello list,
While preparing for a generic change request for Lua, I found a mail by
Hans Hagen [1], stating that all cases have been found in luatex.
Sorting the table (as in my original patch) is also a solution, but my
proposed patch in lstate.c will fix the root cause.

I would rather fix the root cause.
If you prefer the sorting patch, I'll adapt it to activate only when
FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set.

With kind regards,
Roland Clobus

[1] http://lua-users.org/lists/lua-l/2014-07/msg00564.html

#1009196#60
Date:
2022-05-04 13:03:42 UTC
From:
To:
Hello luigi, list,
Have you found the time already to review my patch? [1]

With kind regards,
Roland Clobus

[1] https://mailman.ntg.nl/pipermail/dev-luatex/2022-April/006659.html

#1009196#65
Date:
2022-05-04 13:16:58 UTC
From:
To:
Yes, Hans and I are discussing.
If possible, I would like to use a --reproducible switch at the command
line.

#1009196#70
Date:
2022-05-04 16:00:52 UTC
From:
To:
teams, instead of using SOURCE_DATE_EPOCH. I would rather suggest to use
SOURCE_DATE_EPOCH, which is already in the code base, instead of adding
a new code path.

If you find the time, please read the documentation on SOURCE_DATE_EPOCH
[1] and the page that mentions a checklist [2].

The short summary: SOURCE_DATE_EPOCH has been standardized and is
primarily intended to be used by rebuilders of the binaries, not the
developers or end-users.

In the past, when SOURCE_DATE_EPOCH was getting established, texlive
additionally added FORCE_SOURCE_DATE=1. Nowadays, if it can be avoided,
I would recommend using only SOURCE_DATE_EPOCH.
See [3] for all uses of FORCE_SOURCE_DATE in Debian. As you can see, it
is mainly used in several tests to ensure that packages have output that
can be compared against a reference.

With kind regards,
Roland Clobus

[1] https://reproducible-builds.org/docs/source-date-epoch/
[2]
https://wiki.debian.org/ReproducibleBuilds/StandardEnvironmentVariables#Checklist
[3] https://codesearch.debian.net/search?q=FORCE_SOURCE_DATE&literal=0