Not sure whether to file this against UDD or lintian or detagtive; filing it against lintian since it changed more recently and I suspect that's related. For a while now, the UDD tables for lintian have been empty. They were previously populated with lintian data. (Independently of that, I'm curious if there are other avenues of getting this information that don't involve UDD)
clone 960154 -1
retitle 960154 lintian: please provide a stable, parsable output
reassign -1 qa.debian.org
user qa.debian.org@packages.debian.org
usertags -1 udd
block -1 by 960154
retitle -1 udd.debian.org: change the lintian importer to use the new export file
thanks

It was caused by a change in lintian, but it will have to be fixed in both lintian and udd. This was discussed recently in #-qa. The frontend rewrite lechner did completely removed the logs.gz that was used by udd (and not only udd, I fear…)

I believe the first step is going to be to have lintian decide on a nice machine-parsable (text!) format and publish that; then udd will adapt its importer.

(PS, lechner: I went looking in lindsay to see if there was already a replacement for the logs.gz file, and I noticed that the files under /srv/lintian.debian.org/ all have mixed ownership, lintian:lintian and lechner:lintian.... That might not be an issue right now since you are the only one working on that part of lintian, but it's not very nice to any other team member in the future, even if the umask seems to allow all team members to edit those files; I recommend you get into the habit of deploying that stuff entirely under the role account.)
Hi,

As you know, both of these already happened several months ago. I have not commented here because I am still chewing on a related, but much harder problem: Lintian will soon cease to run blindly across the archive and instead produce packaging hints on demand, as uploads are received by the archive.

There is no batch process anymore that will produce files for the entire archive the way you expect. Instead, Lintian's new website https://lintian.debian.*net* offers a JSON interface [1] to get up-to-date information, similar to DAKweb. [2] The Node.js maintainers already use it.

Retitling to better reflect the problem at hand. Ideas are welcome.

Kind regards
Felix Lechner

[1] https://lintian.debian.net/query
[2] https://ftp-team.pages.debian.net/dak/epydoc/dakweb-module.html
[ Adding lucas@ to CC since he is the main person behind UDD after all ]
Indeed, I consider that done by now.
I'd have probably used a different bug, but guess we'll cope.
So, if we really go down this route, I think we need to:
* Have the importer able to run a full import of everything, which means
looping through all sources (which means running some ~30k HTTP GETs)
and storing them.
* Figure out a way for UDD to know it needs to check the status of a
package. This likely means a job that compares the set of known
(package, version, suite) (is the tuple right?) with what is available
in the lintian table: if something is missing query the lintian
website for new data.
* Perhaps have the lintian website *push* new data to udd.d.o. I'm
conflicted about whether this should be just a trigger ("hey I've just
processed this, check it out yourself") or whether it should carry the
actual data as well. I'm sure you'd like an HTTP POST or such, but I
can tell you that we'd likely prefer something through SSH.
Since after all you did look at udd several times, I believe you should
already be able to implement the first 2?
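
The comparison job in the second point can be sketched as a plain set difference. This is only an illustration: the key follows the (package, version, suite) tuple that the message itself questions, the function name `missing_results` is hypothetical, and the package names and versions are made-up sample data, not archive contents.

```python
from typing import Iterable, Set, Tuple

# Hypothetical key; the thread leaves open whether this tuple is the right one.
Key = Tuple[str, str, str]  # (package, version, suite)


def missing_results(archive: Iterable[Key], lintian: Iterable[Key]) -> Set[Key]:
    """Return the archive entries for which no lintian result is stored yet."""
    return set(archive) - set(lintian)


# Made-up sample data standing in for the archive tables and the lintian table.
archive = {("hello", "2.10-2", "sid"), ("zsh", "5.8-5", "sid")}
lintian = {("hello", "2.10-2", "sid"), ("gone", "1.0-1", "sid")}

# Only these entries would need a query against the lintian website.
to_fetch = sorted(missing_results(archive, lintian))
```

With an empty lintian table the same function returns every archive entry, which is how the second point would also cover the bootstrap case of the first.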
All this said, I still don't understand why you wouldn't be able to
provide a view of everything. Since you set up that API, couldn't you
have an endpoint with *all* packages and everything, like the current
dump? That sounds much more trivial than what you are proposing…
Hi,

I thought you might get upset the other way around. This bug already blocks a UDD counterpart (#960156).

The solutions I offered to date were stopgaps.

I do not believe that is practicable. There are other ideas below.

Such a polling technique likewise seems like a so-so solution.

I love this idea (from Jelmer), if you can make it work. We will publish the files you consume now in real time. You can subscribe via RabbitMQ and collect them, if that is helpful to you.

As a UDD user, I believe the data may be better off being curated in real time, if the effort can be justified. The tables don't always match up.

It is a speed issue. We are in the process of moving to DSA-operated equipment. Maybe they have faster disks.

Kind regards
Felix Lechner
These two points (noting that the second also takes care of the first) are still needed, for whenever UDD misses a notification or similar, or for bootstrapping the tables (else we'd need a complete re-run of all of lintian, which I understand is going to be somewhat rarer with the new setup than it used to be, so…).

Mh, I have never used RabbitMQ myself, but I suppose it's one way, and probably more "contemporary" than you providing SSH triggers or so. However, I'd have no clue how to incorporate a long-running process into the current UDD setup; I'll have to leave that to Lucas.
From the UDD point of view, I would very much prefer to get a full dump, something I can import every few hours, rather than having to deal with a stream of updates or with querying a per-package API.

Currently the full import (which runs twice a day) takes about 10 minutes (and I don't remember if it has been optimized, so there might be room for improvement).

Lucas
Hi, Since few users ever need *all* data, would it make sense to re-conceive UDD as a "query broker" to help people get the data they actually need? The power of COPY. The Lintian website currently takes 12 hours to import a single run across the archive in 42 bulk UPSERTS via JSON (but will eventually cease to generate data that way). Kind regards Felix Lechner
Well if you adopt it, feel free to reimplement it the way you want :) No, it's just PREPARE/EXECUTE (inside a single big transaction -- I think) Lucas
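
A minimal sketch of the "one statement, many executions, single big transaction" import pattern mentioned above. It is not the UDD importer: SQLite from Python's standard library stands in for PostgreSQL so the example is self-contained, `executemany` with one parameterized statement plays the role of PREPARE/EXECUTE, and the three-column table and the name `import_dump` are hypothetical simplifications.

```python
import sqlite3

# Hypothetical, simplified columns; the real UDD "lintian" table differs.
INSERT_SQL = "INSERT INTO lintian (package, tag, information) VALUES (?, ?, ?)"


def import_dump(conn, rows):
    """Replace the table contents with `rows` inside one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM lintian")
        conn.executemany(INSERT_SQL, rows)  # one statement, reused per row
    return conn.execute("SELECT COUNT(*) FROM lintian").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lintian (package TEXT, tag TEXT, information TEXT)")

# Made-up sample rows standing in for a full dump.
count = import_dump(conn, [("hello", "some-tag", "details"), ("zsh", "other-tag", "")])
```

Because the delete and the inserts share one transaction, a failed import leaves the previous contents in place, so readers never see a half-filled table.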
(Adding debian-qa@ to Cc to broaden the discussion a bit)

Hi,

On the issue of lintian.d.n/lintian.d.o/UDD/tracker.d.o, I wonder if the separation of concerns is the right one. I think that in Debian, we should aim for a better separation between:

A/ QA tools development, focused on getting good tools to analyze packages (output: tools, as Debian packages)

B/ infrastructure that mass-processes the archive using QA tools (output: current status of each package in Debian, analyzed with the latest version of a given tool, as a parsable file)

C/ infrastructure that gathers the current status from all instances of (B) and exposes it per-package, per-maintainer, per-team, etc.

(C) could even be split into:
(C.1) infrastructure that gathers the status and stores it into a common DB;
(C.2) infrastructure that uses (C.1) to provide useful per-package/per-maintainer frontends (views).

lintian.d.n is again an attempt at solving (B) and (C) at the same time. While I don't want to prevent anyone from working on their projects of choice, I wonder if someone else shouldn't work on a 'lintian archive runner' service whose sole mission would be to provide the current status of the archive against the current version of lintian as something parsable, just to feed UDD/tracker/others.

Lucas
Hi Lucas,

TL;DR: please find one idea to solve your issue below.

Just for lintian.d.n (which is about to be transferred to lintian.d.o), that is exactly what we provide. It just won't be one file like it used to be. We plan instead to produce packaging hints based on heuristics designed to provide the best service to *maintainers*. I am sorry about the inconvenience, but as a service facing the public (a distinction you likewise recognized in your previous message), the change makes sense for us.

We hope to prioritize based on:

- packages for which no or no recent runs are available
- frequency of uploads (more uploads, better data)
- team requirements (for their statistics)

UDD can subscribe to the AMQP "results" queue and decide independently, i.e. based on other input, when "a run across the archive" is substantially complete. We previously used DAKweb for that purpose, but our services are now available in real time.

But why wait? Why not just add a "lintian_version" column to your table [1] and update the table at regular intervals, when you have collected a sufficient number of runs? The Lintian version is in our JSON results.

Next, cut from your table those sources no longer known to the archive. For an example of how to do that, please see here for a solution via DAKweb. [2] That is the script we now use to drop, via ON DELETE CASCADE, website data that is obsolete due to changes in the archive.

HTH

Kind regards
Felix Lechner

[1] https://udd.debian.org/schema/udd.html#public.table.lintian
[2] https://salsa.debian.org/lintian/taxiv/-/blob/master/get-archive-state#L149-150
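
The pruning step can be illustrated with a small, self-contained sketch. It is not the script linked in [2] and not the UDD schema: SQLite stands in for PostgreSQL, the two tables are hypothetical simplifications (only the "lintian_version" column name comes from the message), the function name `prune_obsolete` is invented, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and thus cascades) when asked to.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sources (source TEXT PRIMARY KEY);
CREATE TABLE lintian (
    source          TEXT NOT NULL REFERENCES sources(source) ON DELETE CASCADE,
    tag             TEXT NOT NULL,
    lintian_version TEXT NOT NULL  -- version of Lintian that produced the row
);
""")


def prune_obsolete(conn, archive_sources):
    """Delete sources unknown to the archive; their lintian rows cascade away."""
    stored = {row[0] for row in conn.execute("SELECT source FROM sources")}
    obsolete = sorted(stored - set(archive_sources))
    with conn:  # one transaction for the whole cleanup
        conn.executemany("DELETE FROM sources WHERE source = ?",
                         [(s,) for s in obsolete])
    return obsolete


# Made-up sample data: "gone" is no longer in the archive.
with conn:
    conn.executemany("INSERT INTO sources VALUES (?)", [("hello",), ("gone",)])
    conn.executemany("INSERT INTO lintian VALUES (?, ?, ?)",
                     [("hello", "some-tag", "2.100.0"),
                      ("gone", "some-tag", "2.99.0")])

removed = prune_obsolete(conn, {"hello"})
remaining = [row[0] for row in conn.execute("SELECT source FROM lintian")]
```

Only the parent rows are deleted explicitly; the foreign key with ON DELETE CASCADE removes the dependent hint rows, which is the mechanism the message refers to.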
Hi,

For some data, such as Lintian packaging hints, there may be a powerful combination of AMQP and PostgreSQL. UDD could even provide a RabbitMQ instance as the primary interface for dynamic data collection.

Very soon, UDD will be able to collect Lintian's packaging hints (formerly known as tags) in real time. Instead of grouping data by Lintian run, our runners could already produce rows suitable for the 'lintian' table. (Alternatively, RabbitMQ could take apart the Lintian runs and re-broadcast the data hint by hint on an adjacent channel.)

In a super simple design, UDD could collect those hints and true them up with its more stable data sources, like packages in the archive. That design would weigh relevance over completeness. UDD data would always be current, even though occasionally a packaging hint might be lost. No sweat: the missing hint will be captured next week.

The point behind this email is a hope that a conceptual insight might emerge: UDD could become an event collector. The result would be an up-to-date Lintian table that also ties to UDD's static data, which I do not believe it does currently.

Kind regards
Felix Lechner
Hi,

Fully agreed on this. tracker.debian.org is clearly in the scope of (C) but had started to move into (B). Once I realized this, I decided that it would be better to have a separate project; that's how I ended up designing "debusine". See https://salsa.debian.org/freexian-team/debusine/-/blob/master/docs/devel/why.rst

As I announced a few days ago, I will invest Freexian's money in this project, so you're welcome to watch the project (in GitLab speak, i.e. enable notifications) so that you can contribute to its design. The first milestone will be oriented towards package building, not lintian processing, but I'm happy to include this in the roadmap at some point.

Cheers,