#867302 libstring-copyright-perl: incorrectly parses multi-line copyright notices

#867302#5
Date:
2017-07-05 15:45:17 UTC
From:
To:
Dear Maintainer,

For https://sources.debian.net/src/sagemath/7.6-2/sage/src/sage/misc/edit_module.py/

$ licensecheck --copyright src/sage/misc/edit_module.py
src/sage/misc/edit_module.py: GPL
  [Copyright: 2007 Nils Bruin <nbruin@sfu.ca> and]

This is wrong, but I can work around it with the following sed script:

$ cat src/sage/misc/edit_module.py | tr '\n' '\t' | sed -e 's/\(,\|\band\)\s*\t#\?\s*/\1 /g' | tr '\t' '\n' > fixed.py
$ licensecheck --copyright fixed.py
fixed.py: GPL
  [Copyright: 2007 Nils Bruin <nbruin@sfu.ca> and William Stein <wstein@math.ucsd.edu>]

It would be good if this logic were incorporated into licensecheck itself. I'd
help, but my Perl is really bad.

(Also, perhaps the # in the regex should be (?:#|//|/\*) or something like that.)

X

#867302#10
Date:
2017-07-05 16:23:00 UTC
From:
To:
Looks like I can do this by editing /usr/share/perl5/String/Copyright.pm as follows:

 		# stringify objects
 		$copyright = "$copyright";
+		$copyright =~ s/(,|\band)\s*\n(?:#|\/\/|\/\*)?\s*/$1 /g;

Please test and apply if it's good!
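
For anyone who wants to try the idea without editing Copyright.pm, the effect of the substitution above can be sketched outside Perl. This is a minimal Python illustration, not the String::Copyright code: `join_wrapped_holders` is a hypothetical helper, and the sample notice is an assumed reconstruction of a header whose holder list wraps after "and".

```python
import re

# Same idea as the substitution above: a line ending in "," or the word "and"
# is glued to the next line, dropping an optional comment marker
# (#, // or /*) and the indentation that follows it.
CONTINUATION_RE = re.compile(r"(,|\band)\s*\n(?:#|//|/\*)?\s*")


def join_wrapped_holders(text):
    """Hypothetical helper: merge wrapped copyright-holder lines into one."""
    return CONTINUATION_RE.sub(r"\1 ", text)


# Assumed sample input, not copied from the Sage sources.
sample = (
    "#  Copyright (C) 2007 Nils Bruin <nbruin@sfu.ca> and\n"
    "#                     William Stein <wstein@math.ucsd.edu>\n"
)
joined = join_wrapped_holders(sample)
```

Like the one-line patch, this joins on any trailing comma, so it can also glue unrelated lines together; that is a known limitation of the sketch.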

X

#867302#23
Date:
2017-07-05 18:07:00 UTC
From:
To:
Ximin Luo:

This breaks some of my test cases; attached is an updated patch. It gives good results for Sage:

$ licensecheck -l200 --copyright src/sage/plot/arrow.py src/sage/combinat/words/paths.py src/sage/sets/finite_set_maps.py src/sage/modular/modform/all.py
src/sage/plot/arrow.py: GPL
  [Copyright: 2006 Alex Clemesha <clemesha@gmail.com>, William Stein <wstein@gmail.com>, 2008 Mike Hansen <mhansen@gmail.com>, 2009 Emily Kirkman]

src/sage/combinat/words/paths.py: GPL (v2 or later)
  [Copyright: 2009 Sebastien Labbe <slabqc@gmail.com>, / 2008 Arnaud bergeron <abergeron@gmail.coms>,]

src/sage/sets/finite_set_maps.py: GPL
  [Copyright: 2010 Florent Hivert <Florent.Hivert@univ-rouen.fr>,]

src/sage/modular/modform/all.py: GPL
  [Copyright: 2004-2006 William Stein <wstein@gmail.com>]

It's a little complicated - it uses replacement expressions. If you can think of a better way of doing it, please let me know!

X

#867302#28
Date:
2017-07-05 18:15:02 UTC
From:
To:
Hi Ximin,

Quoting Ximin Luo (2017-07-05 17:45:17)

Unfortunately it is not as simple as throwing a regex at it: One of my
reasons for taking over and working on licensecheck was a remark once on
d-devel@ that it was far too slow to be usable for Chromium, and I
wanted to (silently so as to not make too much of a fool of myself) take
the challenge of optimizing it.

Unlike in its days living in devscripts, licensecheck routines to
match copyright holders have been separated into a new library,
String::Copyright (libstring-copyright-perl in Debian), and the code has
been refactored to use a single large RE2-compatible regex to match each
copyright statement, in the hope of some day switching to the RE2
engine and becoming faster...

My first brief look at this has revealed a few bugs: In the next release of
licensecheck the leading # is stripped _before_ handing over to
String::Copyright code (as was intended for years).

Have a look (if interested) at /usr/share/perl5/String/Copyright.pm and
in particular the (huge when expanded) $signs_and_more_re at line 138.

Replacing $blank_re with $blank_or_break_re in $owners_re (line 136)
succeeds in detecting the second copyright holder, but then also bogusly
includes the license statement as a copyright holder.
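
To illustrate that trade-off with a toy example (these are not the actual String::Copyright patterns; `BLANK`, `BLANK_OR_BREAK` and `match_owners` are made-up stand-ins, and the notice is an assumed sample with comment markers already stripped):

```python
import re

# Toy stand-ins for the separators discussed above (assumptions, not the
# real $blank_re / $blank_or_break_re definitions).
BLANK = r"[ \t]"
BLANK_OR_BREAK = r"[ \t\n]"


def match_owners(text, separator):
    """Return the year-and-owners span matched by a toy copyright regex."""
    pattern = r"Copyright \(C\) (\d{4}(?:" + separator + r"+\S+)+)"
    found = re.search(pattern, text)
    return found.group(1) if found else None


# Assumed sample notice, not copied from the Sage sources.
notice = (
    "Copyright (C) 2007 Nils Bruin <nbruin@sfu.ca> and\n"
    "              William Stein <wstein@math.ucsd.edu>\n"
    "\n"
    "Distributed under the terms of the GNU General Public License (GPL)\n"
)

same_line_only = match_owners(notice, BLANK)
across_breaks = match_owners(notice, BLANK_OR_BREAK)
```

With the narrow separator the second holder is missed; with the wide one it is found, but the match runs on into the license sentence.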

That is the most elegant signature I have seen. Ever!

It beats my primary school teacher who used "kh" to mean both her
initials and an abbreviation of the Danish equivalent of "kind regards".


 - Jonas

#867302#37
Date:
2017-07-05 19:09:41 UTC
From:
To:
Quoting Ximin Luo (2017-07-05 20:07:00)

Thanks!

I thought you wrote you were not into perl ;-)

I will take a closer look and get back to you on this.

 - Jonas

#867302#42
Date:
2017-07-05 19:13:59 UTC
From:
To:
Quoting Ximin Luo (2017-07-05 20:07:00)

The patch relaxes the $dash_re regex to match multiple dashes.  Can you
provide me an example of where that is useful?

 - Jonas

#867302#47
Date:
2017-07-05 19:17:00 UTC
From:
To:
Jonas Smedegaard:

Thanks for the tips! I'm not sure if you got my other follow-ups to the bug report - I did in fact find String::Copyright, but I didn't know about its history or the plans for it, so thanks for filling me in on that.

At any rate, here is an updated version of my patch, along with some test cases for Sage's copyright notices.

I did try to think of a way to achieve the same logic *inside* the massive $re regexes. However I don't think this is possible, at least with my current approach - which tries to be conservative in order to adapt to humans being annoyingly inconsistent.

What it does is join subsequent lines only when their indent is greater than that of the main line (the one with the "Copyright" part). This means I have to call length() in an expression-replacement, which I don't think is possible to do inside a normal regex...
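
A rough sketch of that indent rule, written in Python rather than Perl for readability. It is not the attached patch: `join_indented_continuations` is a hypothetical name, the sample is assumed, and comment markers are assumed to have been stripped already.

```python
import re


def join_indented_continuations(text):
    """Join lines indented deeper than the preceding "Copyright" line."""
    out = []
    main_indent = None  # indent of the current "Copyright" line, if any
    for line in text.split("\n"):
        stripped = line.lstrip(" \t")
        indent = len(line) - len(stripped)
        if re.match(r"copyright\b", stripped, re.IGNORECASE):
            # A new main line: remember its indent for the lines that follow.
            main_indent = indent
            out.append(line)
        elif main_indent is not None and stripped and indent > main_indent:
            # Deeper indent than the main line: treat as a continuation.
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            # Blank or not indented deeper: the notice has ended.
            main_indent = None
            out.append(line)
    return "\n".join(out)


# Assumed sample input, not copied from the Sage sources.
sample = (
    "Copyright (C) 2007 Nils Bruin <nbruin@sfu.ca> and\n"
    "                   William Stein <wstein@math.ucsd.edu>\n"
    "\n"
    "Distributed under the terms of the GNU General Public License (GPL)\n"
)
joined = join_indented_continuations(sample)
```

Because the license sentence is not indented deeper than the "Copyright" line, it stays on its own line instead of being absorbed into the holder list.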

As for speed:

# with the patch
$ time debian/rules debian/licensecheck.copyright
licensecheck -l250 -i ^sage/build/ -r --deb-machine --merge-licenses sage > "debian/licensecheck.copyright"

real	0m35.318s
user	0m35.204s
sys	0m0.056s

# without the patch
$ time debian/rules debian/licensecheck.copyright
licensecheck -l250 -i ^sage/build/ -r --deb-machine --merge-licenses sage > "debian/licensecheck.copyright"

real	0m31.168s
user	0m31.040s
sys	0m0.076s
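
For scale, the difference between the two wall-clock times above works out as follows (a trivial calculation on the numbers shown; only one timing of each variant is shown, so treat it as indicative only):

```python
# Wall-clock ("real") times from the two runs above, in seconds.
with_patch = 35.318
without_patch = 31.168

extra_seconds = with_patch - without_patch
overhead_pct = extra_seconds / without_patch * 100

print(f"extra: {extra_seconds:.2f}s, overhead: {overhead_pct:.1f}%")
```

That is roughly 4 extra seconds, or about 13% slower with the patch.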

X

#867302#52
Date:
2017-07-05 19:21:00 UTC
From:
To:
Jonas Smedegaard:

Yes, if you look at copyright-test.sh that I just sent in that other email, and run it in the sage/ directory of the sagemath package, you'll see that this $dash_re is useful for src/sage/modular/modform/all.py:

#       Copyright (C) 2004--2006 William Stein <wstein@gmail.com>
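
As a sketch of what accepting such a range looks like (Python for illustration only; `YEAR_RANGE_RE` and `normalize_year_ranges` are hypothetical names and are not the $dash_re from String::Copyright):

```python
import re

# One or more ASCII hyphens between two four-digit years, e.g. 2004--2006.
YEAR_RANGE_RE = re.compile(r"\b(\d{4})\s*-+\s*(\d{4})\b")


def normalize_year_ranges(line):
    """Collapse a multi-dash year range to the single-dash form."""
    return YEAR_RANGE_RE.sub(r"\1-\2", line)


line = "#       Copyright (C) 2004--2006 William Stein <wstein@gmail.com>"
normalized = normalize_year_ranges(line)
```

The normalized line then carries the range as 2004-2006, which is the form licensecheck reports for this file with the patch applied.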

X

#867302#57
Date:
2017-07-05 22:13:43 UTC
From:
To:
Quoting Ximin Luo (2017-07-05 21:17:00)

I did see your other emails, but only after I posted my initial reply (I
am slow at writing emails).

I have now published App::Licensecheck 3.0.30 to CPAN, and if it
survives CPANtesters inspections then I will release that to Debian.
That release does not fix the topic of this bugreport, but it does fix a
bug: String::Copyright expects plain text as input but was passed
text with comment markers by App::Licensecheck.  That seems to be what
complicates your patch, so I will ask you to please try again with that
newer App::Licensecheck to see how much you can reduce the patch.

If you want to try with the 3.0.30 release before it gets packaged for
Debian, you can do it like this:

  sudo apt install cpanminus
  cpanm App::Licensecheck
  export PATH="$HOME/perl5/bin:$PATH"
  export PERL5LIB="$HOME/perl5/lib/perl5"

...and when done exploring (assuming you want _any_ local CPAN gone):

  rm -rf ~/perl5 ~/.cpanm

NB! It is easiest for me if you file a new bugreport for each separate
issue - e.g. the one of not matching double-dashed year ranges.  Fine if
you work on a patch that addresses multiple issues, but still safer to
report the issues separately, so that I don't accidentally miss fixing
some of it, e.g. if I choose to resolve things differently than with
your tested patch.

Thanks :-)

 - Jonas
