#696090 libdpkg-perl: Dpkg::Control(::Hash) parse performance degraded by tie'ing

#696090#5
Date:
2012-12-16 16:49:46 UTC
From:
To:
Hi,

Trying to use Dpkg::Control on a "Packages_amd64" file[1], I got a bit
annoyed with its performance.  I believe the problem is basically that
Dpkg::Control(::Hash) uses its tied hash when parsing and thereby
"wasting" a lot time.

I have attached a small "Benchmark" script to demonstrate the problem.
In this script I have deviced 3 different ways of using the
Dpkg::Control object to parse the file.

 dpkg-std - This is the method I have seen used in e.g. devscripts
 dpkg-break-tie - My "poor's man" method of avoiding the tie overhead.
 dpkg-reuse-obj - The "break-tie" with an additional reuse of the
                  Dpkg::Hash object.  A "->clear()" method or an extra
                  object to ->parse() could be the API for this case.

My testing suggest that dpkg-std and dpkg-break-tie is about a factor
2 apart.  I realise that this is not a perfect test as the tied hash
do also keep track of insert order (and "corrects" the case of field
names).

For reference, the Lintian variant of this parses the same file in
about 3 seconds.  I don't think it is the underlying way of parsing
that makes the difference.  And the Dpkg::Control object has features
not provided by the Lintian variant (which just gives you a hashref
with lowercase keys).

~Niels

[1] I believe my particular copy is a merge of the main, contrib and
non-free Packages_amd64 files, but I guess the one from main should be
large enough to reproduce this.

#696090#10
Date:
2012-12-16 17:47:03 UTC
From:
To:
Hi!

Right, this seems a bit too much. I'm working right now on optimizing
the parsing code w/o sacrificing the general interface.

(A Packages file can also be parsed using a Dpkg::Index object.)

Thanks,
Guillem

#696090#15
Date:
2013-02-01 06:53:34 UTC
From:
To:
Hi!
to move the capitalization to the output functions, and to change the
parser to bypass the tied hash. Then, as mentioned in the other bug
report, in addition adding support for the user to ignore ordering and
always using direct hash speeds up things a bit. Here are some numbers
(although I might have messed up the tests at the time, I'll run those
again once I'm cleaning this up before pushing):


,-- master:
dpkg-break-tie: 88 wallclock secs (87.68 usr +  0.12 sys = 87.80 CPU) @  0.06/s (n=5)
dpkg-reuse-obj: 29 wallclock secs (29.67 usr +  0.06 sys = 29.73 CPU) @  0.17/s (n=5)
  dpkg-std: 88 wallclock secs (87.67 usr +  0.05 sys = 87.72 CPU) @  0.06/s (n=5)
               s/iter dpkg-break-tie       dpkg-std dpkg-reuse-obj
dpkg-break-tie   17.6             --            -0%           -66%
dpkg-std         17.5             0%             --           -66%
dpkg-reuse-obj   5.95           195%           195%             --
`---

,-- after commit 'Move field capitalization to output functions':
dpkg-break-tie: 41 wallclock secs (40.81 usr +  0.08 sys = 40.89 CPU) @  0.12/s (n=5)
dpkg-reuse-obj: 30 wallclock secs (29.55 usr +  0.12 sys = 29.67 CPU) @  0.17/s (n=5)
  dpkg-std: 74 wallclock secs (74.44 usr +  0.07 sys = 74.51 CPU) @  0.07/s (n=5)
               s/iter       dpkg-std dpkg-break-tie dpkg-reuse-obj
dpkg-std         14.9             --           -45%           -60%
dpkg-break-tie   8.18            82%             --           -27%
dpkg-reuse-obj   5.93           151%            38%             --
`---

,-- after commit 'Optimize parsing bypassing the tied hash':
  dpkg-std: 37 wallclock secs (36.89 usr +  0.12 sys = 37.01 CPU) @  0.14/s (n=5)
         s/iter dpkg-std
dpkg-std   7.40       --
`---

,-- after commit 'Do not preserve insertion order' (w/ preserve_order = 0):
Benchmark: timing 5 iterations of dpkg-std...
  dpkg-std: 34 wallclock secs (33.98 usr +  0.08 sys = 34.06 CPU) @  0.15/s (n=5)
         s/iter dpkg-std
dpkg-std   6.81       --
`---

,-- after commit 'Do not preserve insertion order' (w/ direct_hash = 1):
  dpkg-std: 33 wallclock secs (32.93 usr +  0.08 sys = 33.01 CPU) @  0.15/s (n=5)
         s/iter dpkg-std
dpkg-std   6.60       --
`---


,-- after all optimizations (w/ preserve_order = 1):
dpkg-break-tie: 38 wallclock secs (37.30 usr +  0.08 sys = 37.38 CPU) @  0.13/s (n=5)
dpkg-reuse-obj: 27 wallclock secs (26.85 usr +  0.15 sys = 27.00 CPU) @  0.19/s (n=5)
  dpkg-std: 37 wallclock secs (37.32 usr +  0.10 sys = 37.42 CPU) @  0.13/s (n=5)
               s/iter       dpkg-std dpkg-break-tie dpkg-reuse-obj
dpkg-std         7.48             --            -0%           -28%
dpkg-break-tie   7.48             0%             --           -28%
dpkg-reuse-obj   5.40            39%            38%             --
`---

,-- after all optimizations (w/ preserve_order = 0 -> no order tracking):
dpkg-break-tie: 35 wallclock secs (34.62 usr +  0.06 sys = 34.68 CPU) @  0.14/s (n=5)
dpkg-reuse-obj: 23 wallclock secs (22.95 usr +  0.07 sys = 23.02 CPU) @  0.22/s (n=5)
  dpkg-std: 35 wallclock secs (34.42 usr +  0.09 sys = 34.51 CPU) @  0.14/s (n=5)
               s/iter dpkg-break-tie       dpkg-std dpkg-reuse-obj
dpkg-break-tie   6.94             --            -0%           -34%
dpkg-std         6.90             0%             --           -33%
dpkg-reuse-obj   4.60            51%            50%             --
`---

,-- after all optimizations (w/ direct_hash = 1 -> no Hash::Tie):
dpkg-break-tie: 33 wallclock secs (32.84 usr +  0.07 sys = 32.91 CPU) @  0.15/s (n=5)
dpkg-reuse-obj: 23 wallclock secs (22.71 usr +  0.05 sys = 22.76 CPU) @  0.22/s (n=5)
  dpkg-std: 32 wallclock secs (32.75 usr +  0.06 sys = 32.81 CPU) @  0.15/s (n=5)
               s/iter dpkg-break-tie       dpkg-std dpkg-reuse-obj
dpkg-break-tie   6.58             --            -0%           -31%
dpkg-std         6.56             0%             --           -31%
dpkg-reuse-obj   4.55            45%            44%             --
`---


Thanks,
Guillem

#696090#20
Date:
2013-10-06 08:40:24 UTC
From:
To:
Hi Guillem,

Thanks for working on this bug.

As far as I can tell, you reported that you had fixes for this bug back
in Feburary.  I had a look today at the public dpkg git repository and I
cannot see any of the commits you mentioned; is there something that has
been holding up these changes in the past 8 months?

~Niels

#696090#25
Date:
2013-10-15 17:44:54 UTC
From:
To:
Hi Niels!

No problem.

Yeah, I mentioned it in passing in [0] (should have probably CCed you,
sorry). The patches I had were problematic in that they might not
preserve backward compat in some usage cases, and that switching old
code to always expect lower-case field names seems error-prone as it's
not easy to spot the code users. I'll rereview the patches to see if
some can be salvaged, otherwise what I'd like to do is cleanup much of
the current Dpkg::Control mess by introducing a new hierarchy, something
like Dpkg::Deb822 or similar, which would also not have circular deps by
having field definitions by class, instead of a global one.

  [0] <http://lists.debian.org/20130725014341.GA15604@gaara.hadrons.org>

Hope that explains.

Thanks,
Guillem