* Package name    : archivebox
  Version         : 0.2.4
  Upstream Author : Nick Sweeting
* URL             : https://archivebox.io/
* License         : MIT/Expat?
  Programming Lang: Python
  Description     : open source self-hosted web archive

ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).

You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox works by rendering the pages in a headless browser, then saving all the requests and fully loaded pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the original content disappears off the internet. It also automatically extracts assets like git repositories, audio, video, subtitles, images, and PDFs into separate files using youtube-dl, pywb, and wget.

ArchiveBox doesn’t require a constantly running server or backend, instead you just run the ./archive command each time you want to import new links and update the static output. It can import and export JSON (among other formats), so it’s easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).

----

I'm not using this just yet because the upstream packaging is somewhat weird right now.

https://github.com/pirate/ArchiveBox/issues/120#issuecomment-471027516

It's eventually going to end up on PyPI, at which point I'll look at packaging this myself.

There is, as far as I know, no similar tool in Debian right now. There are web crawlers and grabbers, but nothing as comprehensive as this.

I'd be happy to co-maintain this or delegate to whoever is interested.
Many people use xul-ext-scrapbook, but this means they are stuck with firefox 52.9.0esr-1. I proposed packaging webext-scrapbookq (#898545), but TBH it does not seem to be an adequate replacement. At first sight, archivebox looks better. I'm willing to help under the PAPT (or other team) umbrella.
That would be great. I have very little time to work on this right now but would be available to test a package. Upstream did get their act together to fix Python packaging to be more standard, I believe. There are talks of uploading to PyPI and using setup.py properly, and I think it might be time to give it another try, for what that's worth. A.
Dear colleagues,

I mourn the loss of Scrapbook, and ArchiveBox appears to be a viable alternative. I noticed this ITP, but since there were no signs of initial packaging I took the liberty of making a first draft:

https://salsa.debian.org/debian/archivebox

Let's work together on it, shall we? I hope that my initial packaging can serve as a starting point.

Thanks.

---
It is a mistake to try to look too far ahead. The chain of destiny can only be grasped one link at a time. -- Winston Churchill
Feel free to take over this ITP, I'm swamped. :) a.
So do I. Thanks for that!

Like anarcat, I'm too busy with other things now, but I'm starting to test your draft package right now. Observations so far:

- the link "https://nicksweeting.com/images/archive.png" in archivebox/templates/link_index.html should be replaced with a local copy
- when starting archivebox, I'm getting a strange message: "fatal: not a git repository (or any of the parent directories): .git"
- archivebox likes to write to /usr/share/output/, which probably should be ~/.cache/archivebox/
- I'm trying to archive www.debian.org using the command
    echo https://www.debian.org/ | archivebox
  but get the error: "! Failed to archive link: KeyError: 'domain'" and the resulting subdirectory /usr/share/output/archive/1583964646 remains empty

I'm desperately in need of something replacing scrapbook. If you are working on archivebox, I'd in turn test the package and send bug reports, maybe even with patches.
Hi Martin,

Thanks a lot for the feedback.

Upstream addressed that already, so I've packaged a new upstream snapshot to pick up the changes.

That message came from version detection that assumes the archivebox executable is running from a git repository. I've patched away this logic and embedded the package version.

Indeed it is an inconvenient default, even though it is customizable by setting the "OUTPUT_DIR" environment variable or in "~/.ArchiveBox.conf" as per the template in /usr/share/doc/archivebox/examples/ArchiveBox.conf.default. I've patched ArchiveBox to use the current working directory. "~/.cache/archivebox/" feels oddly specific...

It has been fixed by recent updates to the packaging. I could successfully archive some web sites, but the archived debian.org seems to be missing CSS, so the menu is not rendered properly...

I need a Scrapbook alternative as well, but I'll be working slowly towards that goal due to pressure from other priorities. I have yet to learn how to use ArchiveBox... Also, upstream is preparing some serious changes for the next release -- I hope it won't be too difficult to package.

I'd appreciate any help with ArchiveBox.

Did you have a look at Archivematica by any chance?

---
Facebook in particular is the most appalling spying machine that has ever been invented. -- Julian Assange
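The output-directory fallback described in the message above can be sketched roughly as follows. This is only an illustration, not the actual Debian patch or ArchiveBox's own code: the function name `resolve_output_dir` is hypothetical, only the "OUTPUT_DIR" environment variable comes from the thread, and the "~/.ArchiveBox.conf" file is deliberately not handled here.

```python
import os
from pathlib import Path


def resolve_output_dir(environ=None, cwd=None):
    """Pick the directory the archive should be written to.

    Assumption (hypothetical helper, not ArchiveBox code): a non-empty
    OUTPUT_DIR environment variable wins; otherwise the current working
    directory is used, as the patched package is described to do.
    """
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)

    configured = environ.get("OUTPUT_DIR", "").strip()
    if configured:
        # Expand "~" so values like "~/archive" behave as a user expects.
        return Path(configured).expanduser().resolve()
    return cwd.resolve()


if __name__ == "__main__":
    print(resolve_output_dir())
```

Passing `environ` and `cwd` explicitly keeps the helper easy to exercise without touching the real environment.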
Thank you very much. I'll probably take over in a few weeks, once I find another window of opportunity to spend more time on ArchiveBox...

---
It is a fine thing to be honest, but it is also very important to be right. -- Winston Churchill
Hi all,

I'm @pirate (the ArchiveBox creator). Thanks for all of your work here so far with trying to get ArchiveBox packaged.

I've finally gotten around to packaging ArchiveBox for Debian myself. It's currently on a PPA; if you have any problems or concerns with it, just let me know.

https://github.com/ArchiveBox/debian-archivebox
https://github.com/ArchiveBox/ArchiveBox/blob/dev/stdeb.cfg
https://launchpad.net/~archivebox/+archive/ubuntu/archivebox

v0.5.3 (the latest version) has a small bug with a missing Python package (base32-crockford), but it will be fixed in v0.5.4, which I'll be releasing shortly.

It's my first time packaging for Debian so I might've gotten some things wrong, but so far it seems to be working well on my Ubuntu test machines.
Hi Martin, Antoine,

I've updated ArchiveBox to the latest release and it seems to work well. Could you have a look and provide your feedback please?

https://salsa.debian.org/debian/archivebox

Thanks.

---
There are occasions when it pays better to fight and be beaten than not to fight at all. -- George Orwell

---
A study on infectivity of asymptomatic SARS-CoV-2 carriers, concludes weak transmission. "The median contact time for patients was four days and that for family members was five days." -- https://pubmed.ncbi.nlm.nih.gov/32513410/
Hi! And thanks for your efforts to package archivebox.

I have picked it up from there and tried to build the current version (0.6.2). For that I have refreshed the patches; here is an MR:

https://salsa.debian.org/debian/archivebox/-/merge_requests/1

Feel free to pick up whatever you need from there.

There is a test error:

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-6.0.2, py-1.10.0, pluggy-0.13.0
rootdir: /tmp/build-area/archivebox-0.6.2
plugins: django-3.5.1, sugar-0.9.4
collected 44 items / 1 error / 43 selected

==================================== ERRORS ====================================
____ ERROR collecting .pybuild/cpython3_3.9/build/tests/test_extractors.py _____
ImportError while importing test module '/tmp/build-area/archivebox-0.6.2/.pybuild/cpython3_3.9/build/tests/test_extractors.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
...
archivebox/system.py:14: in <module>
    from .vendor.atomicwrites import atomic_write as lib_atomic_write
E   ModuleNotFoundError: No module named 'archivebox.vendor.atomicwrites'
=========================== short test summary info ============================
ERROR tests/test_extractors.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.60s ===============================

but this does not prevent the package from building (?).

Anyway, I'm not familiar with this style of tracking only the debian dir in salsa (I normally use gbp and push the upstream sources as well).

How can you enable salsa CI, BTW?

Paolo
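Since the collection error above did not stop the build, a small import smoke check is one way to catch a missing vendored module earlier. The following is a hedged sketch only: the helper `module_available` is hypothetical and not part of ArchiveBox or the packaging; the module name is the one from the traceback.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located by the import system.

    For a dotted name, find_spec imports the parent packages first, so a
    missing parent surfaces as ModuleNotFoundError (an ImportError).
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Module name taken from the traceback quoted above.
    wanted = "archivebox.vendor.atomicwrites"
    if not module_available(wanted):
        raise SystemExit(f"missing module: {wanted}")
```

Run with the built package on the Python path, this exits non-zero when the vendored module is absent, which a build step could treat as a failure.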
I installed the package above and when I tried the archivebox command I got the same error about the missing atomicwrites module.

This is easy to fix by adding lib/python3.9/site-packages/atomicwrites/__init__.py from https://pypi.org/project/atomicwrites/ 1.4.1 as debian/vendor/atomicwrites.py.

A better way of vendoring dependencies would be to use gbp components, so that they are versioned, uscan looks for new versions, etc.

Another issue is that if I try to "archivebox add" a URL I get these warnings:

! SINGLEFILE_BINARY: single-file (unable to detect version)
! READABILITY_BINARY: readability-extractor (unable to detect version)
! MERCURY_BINARY: mercury-parser (unable to detect version)

Indeed the page is archived as (I have the recommended dep chromium):

- Chrome > PDF
- Chrome Screenshot
- wget > HTML
- Archive.org
- Headers
- Chrome > HTML
- Media
- Git

But these fail:

- Chrome > SingleFile
- Readability
- Mercury
- Git (not sure why)

Getting the first three to work would require installing the JS dependencies listed in package.json, which is a mess.

But after the atomicwrites fix the package seems to be usable as-is and worth uploading!

Paolo
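To see quickly which of the optional helpers named in the warnings above are present on a test machine, something like the following could be used. It is a sketch under assumptions: it only checks whether the binary names from the warnings are found on PATH, which is not necessarily how ArchiveBox itself detects versions, and `find_missing_binaries` is a hypothetical helper.

```python
import shutil

# Config keys and binary names copied from the warnings quoted above.
OPTIONAL_BINARIES = {
    "SINGLEFILE_BINARY": "single-file",
    "READABILITY_BINARY": "readability-extractor",
    "MERCURY_BINARY": "mercury-parser",
}


def find_missing_binaries(binaries=OPTIONAL_BINARIES, which=shutil.which):
    """Return {config_key: binary_name} for binaries not found on PATH."""
    return {key: name for key, name in binaries.items() if which(name) is None}


if __name__ == "__main__":
    for key, name in find_missing_binaries().items():
        print(f"{key}: {name} not found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the real `shutil.which` is used.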