republisher/README.md
Abel Luck 507074b80e
Add media retention cleanup command
2026-05-27 13:04:47 +02:00


# AnyNews Republisher
The AnyNews Republisher mirrors news content to alternative distribution points, either to circumvent censorship or to reach communities affected by high Internet costs, slow or limited access, or natural disasters.
The organization with the original news content is the "publisher".
The AnyNews Republisher is managed through a local web UI. Sources, schedules, and job executions are stored in SQLite. On a schedule, the Republisher crawls the configured sources and mirrors their content (text and media) into a self-contained RSS feed that can be served without access to the original publisher.
The [AnyNews app][app] can then be configured to use this mirror (or more than one such mirror).
The Republisher currently accepts the following source input types:
- RSS and Atom feeds
- Pangea sources via `pygea`

[app]: https://gitlab.com/guardianproject/anynews/anynews-web-client

## Usage
Sync dependencies and start the admin UI:
```sh
uv sync --all-groups
uv run repub
```
With no arguments, `uv run repub` starts the web UI in local dev mode and serves published feed files at `/feeds/...` from `out/feeds/...`.
By default the UI listens on `127.0.0.1:8080`. You can override that with `REPUBLISHER_HOST` and `REPUBLISHER_PORT`, or with:
```sh
uv run repub serve --host 0.0.0.0 --port 8080
```
If you invoke the `serve` subcommand explicitly, use `--dev-mode` to expose published feeds directly from the Quart app:
```sh
uv run repub serve --dev-mode
```
In `--dev-mode`, requests under `/feeds/...` are served from `out/feeds/...`.
In production, do not rely on Quart to serve published feeds. Configure the reverse proxy to serve `out/feeds/...` directly at `/feeds/...`.
Important: the admin UI has no built-in authentication. Keep it bound to localhost or put it behind a trusted network layer such as Tailscale.
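
Because the UI has no authentication, a deployment wrapper may want to check the bind address before starting it. The following is a minimal sketch of such a check, not part of the Republisher itself. The helper name `is_loopback_bind` is hypothetical; only `REPUBLISHER_HOST` and the `127.0.0.1` default come from this README.

```python
import ipaddress
import os


def is_loopback_bind(host: str) -> bool:
    """Return True when `host` is a loopback address or the name localhost."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a public hostname): treat as exposed.
        return False


# The README documents 127.0.0.1 as the default listen address.
host = os.environ.get("REPUBLISHER_HOST", "127.0.0.1")
if not is_loopback_bind(host):
    print(f"warning: admin UI would bind to {host}, which is not loopback")
```

A non-loopback address is not necessarily wrong (for example, a Tailscale interface), so the sketch only warns instead of refusing to start.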
Once the UI is running:
1. Open `http://127.0.0.1:8080/`.
2. Create a source. Feed sources take a feed URL. Pangea sources take a domain plus category configuration.
3. Open `Settings` and set `Feed URL` to the public origin that serves mirrored feeds, for example `https://mirror.example`.
4. Configure the job schedule and any spider arguments.
5. Use `Run now` to trigger an immediate crawl, or leave the job enabled for scheduled runs.
6. Watch running jobs and logs live from the Runs pages.

Operational notes:
- The default database path is `republisher.db`. Set `REPUBLISHER_DB_PATH` to use a different SQLite file.
- Mirrored feeds are written under `out/feeds/<slug>/`.
In production, expose `out/feeds/` directly from the reverse proxy at `/feeds/`.
- `Feed URL` is used to generate absolute media URLs and `atom:link rel="self"` in exported feeds.
- Image output is profile-driven. `REPUBLISHER_IMAGE` defines the full-size
variants; the first profile supplies the canonical image URL used when feed
image URLs are rewritten.
- Default image profiles keep source bytes under `images/source/`, write
full-size variants under `images/full/`, and write thumbnail profiles from
`REPUBLISHER_IMAGE_THUMBNAILS` under `images/thumbs/`.
- Explicit item image media is exported as Media RSS image groups with named
thumbnails. Inline HTML images are mirrored and rewritten in content, but are
not promoted to item-level Media RSS.
- Image profile names and transform settings are part of generated filenames.
Reordering `REPUBLISHER_IMAGE` changes canonical feed image URLs.
- Job logs and stats artifacts are written under `out/logs/`.
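
To illustrate how `Feed URL` relates to the on-disk layout, here is a minimal sketch of how a public URL could be composed from the origin, a feed slug, and a path relative to `out/feeds/<slug>/`. It assumes feeds are exposed at `<Feed URL>/feeds/<slug>/...` as described above; the function name `public_url`, the slug `daily-news`, and the file `photo.jpg` are hypothetical, and the Republisher's own URL generation may differ.

```python
from urllib.parse import urljoin


def public_url(feed_url: str, slug: str, relative_path: str) -> str:
    """Join the public origin, the /feeds/<slug>/ prefix, and a file path."""
    # Normalize to exactly one trailing slash so urljoin appends to the origin.
    base = feed_url.rstrip("/") + "/"
    return urljoin(base, f"feeds/{slug}/{relative_path.lstrip('/')}")


print(public_url("https://mirror.example", "daily-news", "feed.rss"))
print(public_url("https://mirror.example/", "daily-news", "images/full/photo.jpg"))
```

The first call yields `https://mirror.example/feeds/daily-news/feed.rss`, which is the kind of value that would appear in `atom:link rel="self"` under these assumptions.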

Media cleanup:
- Published media can outlive the current feed when articles fall out of the
feed window. Use `cleanup-media` to delete old media files that are no longer
referenced by the latest published `feed.rss`.
- The default retention window is 25 days. Run a dry run first:
```sh
uv run repub cleanup-media --feeds-dir out/feeds --days 25 --dry-run
```
- Remove `--dry-run` to delete matching files. The command protects media
referenced by the latest published feed and uses a lock to avoid racing with
active crawls.
- For config-driven deployments, pass the runtime config so cleanup uses the
configured `out_dir` and media directory names:
```sh
uv run repub cleanup-media --config repub.toml --dry-run
```
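
The retention rule described above (old enough and no longer referenced by the latest `feed.rss`) can be sketched as follows. This is an illustration of the idea only, not the `cleanup-media` implementation: the function name `stale_media` is hypothetical, reference detection is a naive substring match on relative paths, file age is taken from the modification time, and locking is omitted.

```python
import time
from pathlib import Path
from typing import Optional


def stale_media(feed_dir: Path, days: int = 25, now: Optional[float] = None) -> list:
    """List files under feed_dir older than `days` and not named in feed.rss."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    feed_file = feed_dir / "feed.rss"
    if not feed_file.exists():
        # Without a published feed nothing can be protected, so report nothing.
        return []
    feed_text = feed_file.read_text(encoding="utf-8")
    stale = []
    for path in sorted(feed_dir.rglob("*")):
        if not path.is_file() or path == feed_file:
            continue
        relative = path.relative_to(feed_dir).as_posix()
        # Assumption: a referenced file's relative path appears in the feed XML.
        referenced = relative in feed_text
        if not referenced and path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # "example-slug" is a placeholder; this only lists candidates, like a dry run.
    for candidate in stale_media(Path("out/feeds/example-slug")):
        print(candidate)
```

Like `--dry-run`, the sketch only reports candidates and deletes nothing.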
The legacy one-shot config-driven crawler is still available:
```sh
uv run repub crawl -c repub.toml
```
For config-driven crawls, set the public feed origin in `scrapy.settings.REPUBLISHER_FEED_URL`:
```toml
[scrapy.settings]
REPUBLISHER_FEED_URL = "https://mirror.example"
```
## Roadmap
- [x] Mirrors RSS feed XML offline
- [x] Downloads media and enclosures
- [x] Rewrites media URLs
- [x] Profile-driven image normalization, compression, and thumbnails
- [x] Audio transcoding
- [x] Video transcoding
- [x] Download and rewrite media embedded in content/CDATA fields
- [x] Config file to drive the program
- [x] Add SQLite database and simple admin UI to replace config
- [x] Integrate pygea as input source
- [ ] Operationalize with metrics and error reporting
## License
republisher, a tool to mirror RSS/ATOM feeds completely offline
Copyright (C) 2024-2026 Abel Luck

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.