#960154 Feed UDD with just-in-time packaging hints from Lintian

Package:: lintian

Source:: lintian

Submitter:: Jelmer Vernooij

Date:: 2021-04-20 19:57:02 UTC

Severity:: normal

#960154#5

Date:: 2020-05-09 23:37:52 UTC

From:

To:

Not sure whether to file this against UDD or lintian or detagtive; filing it
against lintian since it changed more recently and I suspect that's related.

For a while now, the UDD tables for lintian have been empty. They were
previously populated with lintian data.

(Independently of that, I'm curious if there are other avenues of
getting this information that don't involve UDD)

#960154#10

Date:: 2020-05-10 00:30:56 UTC

From:

To:

clone 960154 -1
retitle 960154 lintian: please provide a stable, parsable output
reassign -1 qa.debian.org
user qa.debian.org@packages.debian.org
usertags -1 udd
block -1 by 960154
retitle -1 udd.debian.org: change the lintian importer to use the new export file
thanks

It was caused by a change in lintian, but it will have to be fixed in
both lintian and udd.

This was discussed recently in #-qa.
The frontend rewrite lechner did completely removed the logs.gz that
was used by udd (and not only udd, I fear…)

I believe the first step is going to have lintian decide on a nice
machine-parsable (text!) format and publish that; then udd will adapt
its importer.


(PS, lechner: I went looking in lindsay to see if there was already a
replacement for the logs.gz file, and I noticed that the files under
/srv/lintian.debian.org/ are all with mixed ownership lintian:lintian
and lechner:lintian.... that might not be an issue right now since you
are the only one working on that part of lintian, but it's not quite
nice to any other team member in the future, even if the umask seems to
allow all team members to edit those files; I recommend you get into the
habit of deploying that stuff entirely under the role account.)

#960154#25

Date:: 2021-04-11 19:45:14 UTC

From:

To:

Hi,

As you know, both of these already happened several months ago. I have
not commented here because I am still chewing on a related, but much
harder problem:

Lintian will soon cease to run blindly across the archive and instead
produce packaging hints on demand, as uploads are received by the
archive. There is no batch process anymore that will produce files for
the entire archive the way you expect. Instead, Lintian's new website
https://lintian.debian.*net* offers a JSON interface [1] to get up to
date information similar to DAKweb. [2] The Node.js maintainers
already use it.

Retitling to better reflect the problem at hand. Ideas are welcome.

Kind regards
Felix Lechner

[1] https://lintian.debian.net/query
[2] https://ftp-team.pages.debian.net/dak/epydoc/dakweb-module.html

#960154#30

Date:: 2021-04-13 16:45:53 UTC

From:

To:

[ Adding lucas@ to CC since he is the main person behind UDD after all ]

Indeed, I consider that done by now.

I'd have probably used a different bug, but guess we'll cope.

So, if we really go down this route, I think we need to:

* Have the importer able to run a full import of everything, which means
  looping through all sources (which means running some ~30k HTTP GETs)
  and storing them.
* Figure out a way for UDD to know it needs to check the status of a
  package.  This likely means a job that compares the set of known
  (package, version, suite) (is the tuple right?) with what is available
  in the lintian table: if something is missing query the lintian
  website for new data.
* perhaps have the lintian website *push* new data to udd.d.o.  I'm
  conflicted if this should be just a trigger ("hey I've just processed
  this, check it out yourself") or if it should carry the actual data as
  well.  I'm sure you'd like a HTTP post or such, but I can tell you
  that we'd likely prefer something through SSH.
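
The comparison job from the second bullet could be sketched roughly as
below. This is only an illustration: the (package, version, suite) key is
the tuple questioned above, and `missing_keys` and `stale_keys` are
hypothetical helper names, not existing UDD code.

```python
from typing import Iterable, Set, Tuple

# Assumed key shape; the thread itself asks whether this tuple is right.
Key = Tuple[str, str, str]  # (package, version, suite)


def missing_keys(archive: Iterable[Key], lintian: Iterable[Key]) -> Set[Key]:
    """Archive entries with no row in the lintian table: candidates to query."""
    return set(archive) - set(lintian)


def stale_keys(archive: Iterable[Key], lintian: Iterable[Key]) -> Set[Key]:
    """Rows in the lintian table whose key is no longer in the archive."""
    return set(lintian) - set(archive)
```

Each key returned by `missing_keys` would then trigger one query against
the lintian website; a bootstrap is simply the case where the lintian
table starts out empty.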


Since after all you did look at udd several times, I believe you should
already be able to implement the first 2?



All this said, I still don't understand why you wouldn't be able to
provide a view of everything.  Since you set up that API, couldn't you
have an endpoint with *all* packages and everything, like the current
dump?  That sounds much more trivial than what you are proposing…

#960154#35

Date:: 2021-04-13 18:03:12 UTC

From:

To:

Hi,

I thought you might get upset the other way around. This bug already
blocks a UDD counterpart (#960156). The solutions I offered to date
were stop gaps.

I do not believe that is practicable. There are other ideas below.

Such a polling technique seems likewise like a so-so solution.

I love this idea (from Jelmer), if you can make it work. We will
publish the files you consume now in real time. You can subscribe via
RabbitMQ and collect them, if that is helpful to you.

As a UDD user, I believe the data may be better off being curated in
real time, if the effort can be justified. The tables don't always
match up.

It is a speed issue. We are in the process of moving to DSA-operated
equipment. Maybe they have faster disks.

Kind regards
Felix Lechner

#960154#38

Date:: 2021-04-13 18:15:46 UTC

From:

To:

These two points (and noting that the second also takes care of the
first) are still needed, for whenever UDD misses a notification or
similar, or for bootstrapping the tables (else we'd need a complete
re-run of all lintian, which I understand that with the new setup is
going to be somewhat rarer than it used to be as well, so…).

Mh, I have never used RabbitMQ myself, but I suppose it's one way, and
probably more "contemporary" than you providing SSH triggers or so.
However, I'd have no clue how to incorporate a long-running process in
the current UDD setup; I'll have to leave that to Lucas.

#960154#43

Date:: 2021-04-13 18:26:24 UTC

From:

To:

From the UDD point of view, I would very much prefer to get a full dump,
something I can import every few hours, rather than having to deal with a
stream of updates or with querying a per-package API.

Currently the full import (that runs twice a day) takes about 10 minutes
(and I don't remember if it has been optimized, so there might be room
for improvement).

Lucas

#960154#48

Date:: 2021-04-13 18:49:27 UTC

From:

To:

Hi,

Since few users ever need *all* data, would it make sense to
re-conceive UDD as a "query broker" to help people get the data they
actually need?

The power of COPY. The Lintian website currently takes 12 hours to
import a single run across the archive in 42 bulk UPSERTS via JSON
(but will eventually cease to generate data that way).

Kind regards
Felix Lechner

#960154#53

Date:: 2021-04-13 18:58:07 UTC

From:

To:

Well if you adopt it, feel free to reimplement it the way you want :)

No, it's just PREPARE/EXECUTE (inside a single big transaction -- I
think)
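
A rough sketch of that pattern (one parameterised statement executed for
every row, all inside a single transaction) might look as follows. It
uses Python's sqlite3 module purely as a stand-in for PostgreSQL, and the
two-column `lintian` table and the `full_import` helper are assumptions
made for illustration, not the actual UDD importer.

```python
import sqlite3
from typing import Iterable, Tuple


def make_demo_table(conn: sqlite3.Connection) -> None:
    # Hypothetical, heavily simplified stand-in for the UDD 'lintian' table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lintian ("
        " package TEXT NOT NULL, tag TEXT NOT NULL)"
    )


def full_import(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> bool:
    """Replace the table contents in one transaction; keep old data on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM lintian")
            # The same compiled statement is reused for every row.
            conn.executemany(
                "INSERT INTO lintian (package, tag) VALUES (?, ?)", rows
            )
    except sqlite3.Error:
        return False
    return True
```

Because the delete and the inserts share one transaction, a failed import
leaves the previous data in place rather than an empty table.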

Lucas

#960154#58

Date:: 2021-04-14 08:48:54 UTC

From:

To:

(Adding debian-qa@ to Cc to broaden the discussion a bit)

Hi,

On the issue of lintian.d.n/lintian.d.o/UDD/tracker.d.o, I wonder if the
separation of concerns is the right one.

I think that in Debian, we would aim for a better separation between:

A/ QA tools development, focused on getting the good tools to analyze
packages (output: tools, as Debian packages)

B/ infrastructure that mass-processes the archive using QA tools. (output:
current status of each package in Debian, analyzed with the latest
version of a given tool, as a parsable file)

C/ infrastructure that gathers the current status from all instances of
(B) and exposes it per-package, per-maintainer, per-team, etc.

(C) could even be split into:
  (C.1) infrastructure that gathers the status and stores it into a
  common DB;
  (C.2) infrastructure that uses (C.1) to provide useful
  per-package/per-maintainer frontends (views).

lintian.d.n is again an attempt at solving (B) and (C) at the same time.
While I don't want to prevent anyone from working on their projects of
choice, I wonder if someone else shouldn't work on a 'lintian archive
runner' service whose sole mission would be to provide the current
status of the archive against the current version of lintian as
something parsable, just to feed UDD/tracker/others.

Lucas

#960154#63

Date:: 2021-04-14 16:20:41 UTC

From:

To:

Hi Lucas,

TL;DR please find one idea to solve your issue below

Just for lintian.d.n (which is about to be transferred to
lintian.d.o), that is exactly what we provide. It just won't be one
file like it used to be. We plan instead to produce packaging hints
based on heuristics designed to provide the best service to
*maintainers*. I am sorry about the inconvenience, but as a service
facing the public—a distinction you likewise recognized in your
previous message—the change makes sense for us. We hope to prioritize
based on:

- packages for which no or no recent runs are available
- frequency of uploads (more uploads, better data)
- team requirements (for their statistics)

UDD can subscribe to the AMQP "results" queue and decide
independently, i.e. based on other input, when "a run across the
archive" is substantially complete. We previously used DAKweb for that
purpose, but our services are now available in real time.

But why wait? Why not just add a "lintian_version" column to your
table [1] and update the table at regular intervals, when you have
collected a sufficient number of runs? The Lintian version is in our
JSON results. Next, cut from your table those sources no longer known
to the archive.

For an example of how to do that, please see here for a solution via
DAKweb. [2] That is the script we use now to DROP, via ON DELETE
CASCADE, website data that is obsolete due to changes in the archive.
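
As a minimal sketch of the two steps suggested here (adding a
"lintian_version" column, then cutting sources the archive no longer
knows), the following uses an in-memory SQLite database as a stand-in for
PostgreSQL. The `sources` table, the reduced `lintian` columns, and the
helper names are assumptions for illustration; the real UDD schema is the
one linked at [1].

```python
import sqlite3
from typing import Iterable, Set


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when enabled
    conn.execute("CREATE TABLE sources (source TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE lintian ("
        " source TEXT NOT NULL REFERENCES sources(source) ON DELETE CASCADE,"
        " tag TEXT NOT NULL)"
    )
    # The extra column proposed in the message above.
    conn.execute("ALTER TABLE lintian ADD COLUMN lintian_version TEXT")
    return conn


def prune_sources(conn: sqlite3.Connection, known_to_archive: Iterable[str]) -> Set[str]:
    """Delete sources absent from the archive; their hint rows go via the cascade."""
    current = {row[0] for row in conn.execute("SELECT source FROM sources")}
    gone = current - set(known_to_archive)
    conn.executemany(
        "DELETE FROM sources WHERE source = ?", [(s,) for s in sorted(gone)]
    )
    conn.commit()
    return gone
```

In this sketch the set passed as `known_to_archive` would come from an
archive query such as DAKweb; only the parent rows are deleted explicitly.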

HTH

Kind regards
Felix Lechner

[1] https://udd.debian.org/schema/udd.html#public.table.lintian
[2] https://salsa.debian.org/lintian/taxiv/-/blob/master/get-archive-state#L149-150

#960154#68

Date:: 2021-04-19 14:57:08 UTC

From:

To:

Hi,

For some data, such as Lintian packaging hints, there may be a
powerful combination of AMQP and PostgreSQL. UDD could even provide a
RabbitMQ instance as the primary interface for dynamic data
collection.

Very soon, UDD will collect Lintian's packaging hints (formerly known
as tags) in real time. Instead of grouping data by Lintian run, our
runners could already produce rows suitable for the 'lintian' table.
(Alternatively, RabbitMQ could take apart the Lintian runs and
re-broadcast the data hint by hint on an adjacent channel.) In a super
simple design, UDD could collect those hints and true them up with its
more stable data sources like packages in the archive.
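
Taking a run apart "hint by hint" could look something like the sketch
below. The JSON field names ("source", "version", "lintian_version",
"hints", "tag", "note") are assumptions made for the example and are not
taken from Lintian's actual output; `run_to_rows` is likewise a
hypothetical helper.

```python
import json
from typing import List, Tuple

# (source, version, lintian_version, tag, note); an assumed row shape.
Row = Tuple[str, str, str, str, str]


def run_to_rows(message_body: bytes) -> List[Row]:
    """Flatten one run (a JSON message body) into one row per packaging hint."""
    run = json.loads(message_body)
    return [
        (
            run["source"],
            run["version"],
            run["lintian_version"],
            hint["tag"],
            hint.get("note", ""),  # tolerate hints without extra context
        )
        for hint in run.get("hints", [])
    ]
```

A queue consumer would call something like this on each message and hand
the rows to the importer; a run without hints simply yields no rows.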

That design would weigh relevance over completeness. UDD data would
always be current even though occasionally a packaging hint might be
lost. No sweat—the missing hint will be captured next week.

The point behind this email is a hope that a conceptual insight might
emerge: UDD could become an event collector. The result would be an
up-to-date Lintian table that also ties to UDD's static data—which I
do not believe it does currently.

Kind regards
Felix Lechner

#960154#73

Date:: 2021-04-20 19:43:23 UTC

From:

To:

Hi,

Fully agreed on this. tracker.debian.org is clearly in the scope
of (C) but had started to move into (B). Once I realized this, I decided
that it would be better to have a separate project; that's how I ended
up designing "debusine".

See https://salsa.debian.org/freexian-team/debusine/-/blob/master/docs/devel/why.rst

As I announced a few days ago, I will invest Freexian's money
in this project so you're welcome to watch the project (in gitlab speak,
aka enable notifications) so that you can contribute to its design.

The first milestone will be oriented towards package building,
not lintian processing but I'm happy to include this in the roadmap
at some point.

Cheers,
