diff --git a/README.md b/README.md
index b7353ec..ec2a848 100644
--- a/README.md
+++ b/README.md
@@ -4,60 +4,55 @@ The AnyNews Republisher is a tool for mirroring news content to alternative dist
 
 The organization with the original news content is the "publisher".
 
-The AnyNews Republisher can be configured with various publisher news sources. Then on an interval the Republisher crawls the sources, mirrors the content (text and media) offline into an RSS feed.
+The AnyNews Republisher is managed through a local web UI. Sources, schedules, and job executions are stored in SQLite. On an interval the Republisher crawls the configured sources and mirrors the content (text and media) offline into an RSS feed.
 
 The [AnyNews app][app] can then be configured to use this mirror (or more than one such mirror).
 
 The Republisher currently accepts the following source input types:
 
-- RSS Feeds
+- RSS and Atom feeds
+- Pangea sources via `pygea`
 
 [app]: https://gitlab.com/guardianproject/anynews/anynews-web-client
 
+## Usage
+Sync dependencies and start the admin UI:
-``` shell
-nix develop
+```sh
 uv sync --all-groups
-cat > repub.toml <<'EOF'
-out_dir = "out"
-
-[[feeds]]
-name = "Guardian Project Podcast"
-slug = "gp-pod"
-url = "https://guardianproject.info/podcast/podcast.xml"
-
-[[feeds]]
-name = "NASA Breaking News"
-slug = "nasa"
-url = "https://www.nasa.gov/rss/dyn/breaking_news.rss"
-EOF
-uv run repub --config repub.toml
+uv run repub
 ```
 
-`out_dir` may be relative or absolute. Relative paths are resolved against the
-directory containing the config file. Each feed now needs a user-provided
-`slug`, which is used for output paths and filenames. Optional Scrapy runtime
-overrides can be set in the same file:
+By default the UI listens on `127.0.0.1:8080`. You can override that with `REPUB_HOST` and `REPUB_PORT`, or with:
 
-```toml
-[scrapy.settings]
-LOG_LEVEL = "DEBUG"
-DOWNLOAD_TIMEOUT = 30
+```sh
+uv run repub serve --host 0.0.0.0 --port 8080
 ```
 
-Additional feed definitions can also be imported from one or more TOML files,
-including a `pygea`-generated `manifest.toml`:
+Important: the admin UI has no built-in authentication. Keep it bound to localhost or put it behind a trusted network layer such as Tailscale.
 
-```toml
-feed_config_files = ["/absolute/path/to/pygea/feed/manifest.toml"]
+Once the UI is running:
+
+1. Open `http://127.0.0.1:8080/`.
+2. Create a source. Feed sources take a feed URL. Pangea sources take a domain plus category configuration.
+3. Configure the job schedule and any spider arguments.
+4. Use `Run now` to trigger an immediate crawl, or leave the job enabled for scheduled runs.
+5. Watch running jobs and logs live from the Runs pages.
+
+Operational notes:
+
+- The default database path is `republisher.db`. Set `REPUBLISHER_DB_PATH` to use a different SQLite file.
+- Mirrored feeds are written under `out/feeds//`.
+- Job logs and stats artifacts are written under `out/logs/`.
+
+The legacy one-shot config-driven crawler is still available:
+
+```sh
+uv run repub crawl -c repub.toml
 ```
 
-Imported files only need `[[feeds]]` entries with `name`, `slug`, and `url`.
-
-See [`demo/README.md`](/home/abel/src/guardianproject/anynews/republisher-redux/demo/README.md) for a self-contained example config.
-
-## TODO
+## Roadmap
 
 - [x] Offlines RSS feed xml
 - [x] Downloads media and enclosures
@@ -68,9 +63,8 @@ See [`demo/README.md`](/home/abel/src/guardianproject/anynews/republisher-redux/
 - [ ] Image compression - Do we want this? -> DEFERED for now
 - [x] Download and rewrite media embedded in content/CDATA fields
 - [x] Config file to drive the program
-- [ ] Add sqlite database and simple admin UI to replace config
-- [ ] Integrate pygea as input source
-- [ ] Daemonize the program
+- [x] Add sqlite database and simple admin UI to replace config
+- [x] Integrate pygea as input source
 - [ ] Operationalize with metrics and error reporting
 
 ## License
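
Taken together, the Usage steps introduced in this diff can be sketched as a dry-run shell script. Everything here is illustrative: the `run` wrapper is a hypothetical stand-in that only echoes each command (so the sketch is safe to execute outside a checkout), while the `uv` invocations and the `REPUB_HOST`, `REPUB_PORT`, and `REPUBLISHER_DB_PATH` variables are the ones the new README text documents:

```sh
# Dry-run sketch of the documented workflow. `run` is a hypothetical
# wrapper that only echoes its arguments; swap it out for direct
# execution inside a real checkout.
run() { echo "+ $*"; }

# SQLite state (sources, schedules, job executions); defaults to republisher.db
export REPUBLISHER_DB_PATH="$PWD/republisher.db"

# The admin UI has no authentication, so keep it bound to localhost
export REPUB_HOST=127.0.0.1
export REPUB_PORT=8080

run uv sync --all-groups              # install dependencies
run uv run repub                      # admin UI at http://127.0.0.1:8080/
run uv run repub crawl -c repub.toml  # legacy one-shot crawler
```

The same overrides can also be passed as flags (`uv run repub serve --host 0.0.0.0 --port 8080`), per the diff above.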