- Package: src:licensecheck
- Source: licensecheck
- Submitter: Gianfranco Costamagna
- Date: 2025-04-08 11:09:01 UTC
- Severity: wishlist
Hi,

as discussed on IRC, it might be useful to use binwalk (now with a Python library/binding) to spot what is hidden or embedded in binary blobs, and then use the correct tool to search for copyrights and licenses.

Or, as Jonas suggested on IRC, use it in --strict mode when the parse has failed, to let the user know what the blob contained and so better understand why licensecheck failed to parse it.

Pabs suggested the hachoir tool.

cheers,
Gianfranco
maybe licensecheck would optionally get it involved if asked to. Jonas
(in a private email) suggested the slowdown issue is already quite
serious, so this path needs to be undertaken with caution.
However, I have this idea.
My personal issue with licensecheck is that it tries to parse binary
files, but parses them as text, thus dumping huge lumps of binary junk
into the generated copyright file:
Files: ./data/icons/hicolor/48x48/apps/com.github.maoschanz.drawing.png
./help/C/figures/icon.png
Copyright: ^@CC Attribution-ShareAlike
http:creativecommons.org/licenses/by-sa/4.0/ÃTb^E^@^@^E<U+0091>IDATh<U+0081>í<U+0098>kl^TU^X<U+0086><U+009F>3{ë^h<U+008B>
n¸¤¼Ñ^E;{Ö<%¥Ì^Qz'ü<U+0083>E<U+0082>ë½<U+009D>
r^RtÅ|вxÒk³Ú ð¼
License: CC-BY-SA
FIXME
This output is useless, wrong, and it takes a lot of time to generate
since binary files are typically bigger than text files.
What if we tried to detect binary files before parsing them?
A very dumb algorithm would:
1) Check the first 8-16-32-whatever-sensible bytes for magic sequences
of files that might contain copyright/license metadata, e.g. PNG, JPEG,
SVG… (we need to keep this list short)
2) If something’s detected, parse that in a special way; Perl seems to
have a lot of modules for that
3) If nothing is found but the file looks binary (TBD how we detect this),
use hachoir or whatever is suitable if available, otherwise say UNKNOWN
4) Never dump binary stuff
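
A minimal sketch of steps 1 and 3, in Python for illustration only
(licensecheck itself is Perl). The magic table, the 32-byte sniff size,
the 8192-byte sample and the 30% control-character threshold are all
assumptions, not anything licensecheck does today. SVG is left out of the
table because it is XML text and has no fixed magic prefix.

```python
from typing import Optional

# Hypothetical, deliberately short table of magic prefixes (step 1).
MAGIC_PREFIXES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
}

SNIFF_BYTES = 32     # assumed header size for magic sniffing
SAMPLE_BYTES = 8192  # assumed sample size for the binary heuristic

# Control characters that commonly occur in ordinary text files.
TEXT_CONTROLS = {7, 8, 9, 10, 12, 13, 27}


def sniff_magic(header: bytes) -> Optional[str]:
    """Return a file type name if the header starts with a known magic."""
    for magic, name in MAGIC_PREFIXES.items():
        if header.startswith(magic):
            return name
    return None


def looks_binary(sample: bytes) -> bool:
    """Guess binary: a NUL byte, or over 30% unusual control characters."""
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    suspicious = sum(1 for b in sample if b < 32 and b not in TEXT_CONTROLS)
    return suspicious / len(sample) > 0.30


def classify(data: bytes) -> str:
    """Return a known type, 'unknown-binary' (step 3), or 'text'."""
    kind = sniff_magic(data[:SNIFF_BYTES])
    if kind is not None:
        return kind
    if looks_binary(data[:SAMPLE_BYTES]):
        return "unknown-binary"
    return "text"
```

One known weakness of this heuristic: UTF-16 text contains NUL bytes and
would be reported as unknown-binary.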
At worst, a filter to remove non-ASCII stuff from binary-looking files
would be very useful.
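
Such a fallback filter could look roughly like this (again Python, purely
illustrative; the function name and the 200-character cap are made up).
It only drops non-printable bytes, so printable fragments of binary data
can still survive, and it would also strip legitimate non-ASCII names, so
it should only be applied to files that already look binary.

```python
import re

# Runs of characters outside printable ASCII (space through tilde).
_NON_PRINTABLE = re.compile(r"[^\x20-\x7e]+")


def sanitize_field(raw: str, max_len: int = 200) -> str:
    """Replace non-printable runs with a space, squeeze whitespace, truncate."""
    cleaned = _NON_PRINTABLE.sub(" ", raw)
    cleaned = " ".join(cleaned.split())
    return cleaned[:max_len]
```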
Quoting Andrej Shadura (2019-11-13 14:29:13)

Licensecheck currently expects to be handed only sourcecode.

I agree that it makes great sense to expand to handle other file types as well, but *how* to handle other files depends on why you are running licensecheck at all.

Original purpose as authored by KDE developers was conformance with a narrow subset of licenses. Extending to cover binary files would then probably mean a select few well-known extensions handed over to well-defined parsers - and everything else being either skipped or treated as an error, depending on the more narrow use-case.

Common use nowadays for Debian packaging is to detect as many copyright and licensing hints as possible. Extending to cover binary files would then probably consult libfile-libmagic-perl and/or file extensions, and maintain a list of more detailed parsers to hand it over to based on those. Which detailed parser(s) to use and how insistently to drill into content depends again on the more narrow use-case. Should licensecheck detect or ignore or declare "None" for PDF content? PDF metadata fields? RDF resource embedded in PDF headers? Metadata embedded in ICC profile embedded in PNG object embedded in PDF object?

I think that parsing binary data in Licensecheck should be optional, to limit complexity for those using it only for processing text-based sourcecode. I think it should be configurable which parsers to use when, and offer some high-level "profiles" for common use-cases.

For use right now, I recommend combining licensecheck with helper scripts that are part of cdbs (but *not* build-depending on or otherwise using cdbs). For examples of using those helper scripts to pre-parse some binary files and skip select other ones, while not accidentally silencing later introduced unknown types of files, see file debian/copyright-check in the source code of ghostscript (or pandoc or valentina), and the files /usr/lib/cdbs/license-miner and /usr/lib/cdbs/licensecheck2dep5 in package cdbs.

 - Jonas
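
To make the "profiles" idea concrete, here is one possible shape for such
a configuration, sketched in Python for illustration. The profile names,
file type names and actions are all hypothetical; nothing like this exists
in licensecheck today. Each profile maps a detected file type to an
action, with "*" as the fallback for types the profile does not list.

```python
from typing import Dict

# Hypothetical profiles: detected file type -> action.
PROFILES: Dict[str, Dict[str, str]] = {
    # Narrow license-conformance use: anything unexpected is an error.
    "conformance": {"text": "parse", "png": "skip", "*": "error"},
    # Debian-packaging use: dig for hints, report unknown types
    # instead of silently ignoring them.
    "packaging": {"text": "parse", "png": "parse", "pdf": "parse", "*": "report"},
}


def action_for(profile: str, file_type: str) -> str:
    """Look up the action for a file type, falling back to the '*' rule."""
    rules = PROFILES[profile]
    return rules.get(file_type, rules["*"])
```

With these assumed tables, a PDF is an error under "conformance" but gets
parsed under "packaging", and an unlisted type under "packaging" is
reported rather than skipped.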