Compare commits

...

23 commits

Author SHA1 Message Date
31e1da937f add dev-mode 2026-03-30 15:36:12 +02:00
0803617e62 add empty table placeholders 2026-03-30 15:28:56 +02:00
8716579508 humanize sizes 2026-03-30 15:25:28 +02:00
947ef8e833 remove most subtitles 2026-03-30 15:25:10 +02:00
d8f2e03d36 be consistent with env var names 2026-03-30 15:23:34 +02:00
6fd3b598ab output to out/feeds/* 2026-03-30 15:21:39 +02:00
beac981047 update readme 2026-03-30 15:20:27 +02:00
36cf98a91c fix output paths 2026-03-30 15:10:47 +02:00
8af28c2f68 implement scrapy + pygea job runner 2026-03-30 15:04:41 +02:00
916968c579 reconcile stale execs 2026-03-30 14:18:55 +02:00
90674e6515 tweak sidebar 2026-03-30 14:18:51 +02:00
51728a5401 shim renders app shell 2026-03-30 14:16:15 +02:00
c210168d65 tweak job runs 2026-03-30 14:14:59 +02:00
2b2a3f1cc0 implement job runner and scheduler 2026-03-30 14:02:39 +02:00
328a70ff9b edit sources 2026-03-30 13:49:00 +02:00
847aeae772 db backed source creation 2026-03-30 13:37:25 +02:00
b9e288a22d add sqlite database 2026-03-30 13:31:06 +02:00
06066c2394 create sources in memory 2026-03-30 13:23:36 +02:00
9e826fcee8 separeate pages 2026-03-30 13:11:37 +02:00
3fc999a69b add a datastar action 2026-03-30 12:48:32 +02:00
33dbb143fd add datastar SSE rendering 2026-03-30 12:34:38 +02:00
2accb26546 add datastar and render shim 2026-03-30 12:27:45 +02:00
9ce576e7e8 with htpy and css 2026-03-30 12:13:04 +02:00
35 changed files with 7536 additions and 85 deletions

3
.gitignore vendored

@ -12,3 +12,6 @@ data
logs logs
archive archive
*egg-info *egg-info
*.db
*.db-shm
*.db-wal


@ -1,5 +1,7 @@
# republisher-redux # republisher-redux
See @README.md
## Overview ## Overview
- `republisher-redux` is a Scrapy-based tool that mirrors RSS and Atom feeds for offline use. - `republisher-redux` is a Scrapy-based tool that mirrors RSS and Atom feeds for offline use.
@ -8,6 +10,71 @@
- Nix development and packaging use `flake.nix`. - Nix development and packaging use `flake.nix`.
- Formatting is managed through `treefmt-nix`, exposed via `nix fmt`. - Formatting is managed through `treefmt-nix`, exposed via `nix fmt`.
- Prefer an immutable, functional programming style
- functions that operate on data over classes that encapsulate state
- No backwards-compatibility guarantees; prefer breaking changes over backwards compat and complexity.
- Think carefully and implement the most concise solution that changes as little code as possible.
## HTML/Datastar Rules
Very important rules for datastar usage.
The views are pure functions data in -> html out.
- we only use full page morph mode. no diffing
Why large/fat/main morphs (aka immediate mode)?
By only using data: mode morph and always targeting the main element of the document the API can be massively simplified. This avoids having the explosion of endpoints you get with HTMX and makes reasoning about your app much simpler.
- we only have a single render function per page
By having a single render function per page you can simplify the reasoning about your app to view = f(state). You can then reason about your pushed updates as a continuous signal rather than discrete event stream. The benefit of this is you don't have to handle missed events, disconnects and reconnects. When the state changes on the server you push down the latest view, not the delta between views. On the client idiomorph can translate that into fine grained dom updates.
- any database change -> re render all connected users with 200ms throttle
When your events are not homogeneous, you can't miss events, so you cannot throttle your events without losing data.
But, wait! Won't that mean every change will cause all users to re-render? Yes, but at a maximum rate determined by the throttle. This might sound scary at first, but in practice:
The more shared views the users have, the more likely it is that most connected users will have to re-render when a change happens.
The more events that are happening, the more likely it is that most users will have to re-render.
This means you actually end up doing more work with a non-homogeneous event system under heavy load than with this simple, throttled homogeneous event system (especially if there's any sort of common/shared view between users).
- Signals are only for ephemeral client side state
Signals should only be used for ephemeral client side state. Things like: the current value of a text input, whether a popover is visible, current csrf token, input validation errors. Signals can be controlled on the client via expressions, or from the backend via patch-signals.
- Signals in elements should be declared __ifmissing
Because signals are only being used to represent ephemeral client state that means they can only be initialised by elements and they can only be changed via expressions on the client or from the server via patch-signals in an action. Signals in elements should be declared __ifmissing unless they are "view only".
- View-only signals are signals that can only be changed by the server. These should not be declared __ifmissing; instead they should be made "local" by starting their key with an _, which prevents the client from sending them up to the server.
- Actions should not update the view themselves directly
Actions should not update the view via patch elements. This is because the changes they make would get overwritten on the next render-fn that pushes a new view down the updates SSE connection. However, they can still be used to update signals as those won't be changed by elements patch. This allows you to do things like validation on the server.
- Stateless views
The only way for actions to affect the view returned by the render-fn running in a connection is via the database. This ensures CQRS. This means there is no connection state that needs to be persisted or maintained (so missed events and shutdowns/deploys will not lead to lost state). Even when you are running in a single process there is no way for an action (command) to communicate with/affect a view render (query) without going through the database.
- CQRS
Actions modify the database and return a 204 or a 200 if they patch-signals.
Render functions re-render when the database changes and send an update down the updates SSE connection.
- Work sharing (caching)
Work sharing is the term I'm using for sharing renders between connected users. This can be useful when a lot of connected users share the same view. For example a leader board, game board, presence indicator etc. It ensures the work (eg: query and html generation) for that view is only done once regardless of the number of connected users. The simplest way to do this is to recalculate and cache values after a batch has been run.
- Use data-on:pointerdown/mousedown over data-on:click
This is a small one but can make even the slowest of networks feel much snappier.
- No CORS
By hosting all assets on the same origin we avoid the need for CORS. This avoids additional server round trips and helps reduce latency.
- Rendering an initial shim
Rather than returning the whole page on initial render and having two render paths (one for the initial render and one for subsequent renders), a shell is rendered and then populated when the page connects to the updates endpoint for that page. This has a few advantages:
The page will only render dynamic content if the user has javascript and first party cookies enabled.
The initial shell page can be generated and compressed once.
The server only does more work for actual users and less work for link preview crawlers and other bots (that don't support javascript or cookies).
## Workflow ## Workflow
- Use Python 3.13. - Use Python 3.13.
@ -44,3 +111,7 @@ uv run repub crawl -c repub.toml
- The console entrypoint is `repub`. - The console entrypoint is `repub`.
- Runtime ffmpeg availability is provided by the flake package and devshell. - Runtime ffmpeg availability is provided by the flake package and devshell.
- Tests live under `tests/`. - Tests live under `tests/`.
- `prompts/` is git ignored intentionally
- Never search the web for this repo. If an external resource, document, or reference is needed, stop and ask the user to provide it.
- Treat the repo-root `republisher.db` as user-owned local state. Do not delete or reset it as part of routine testing or verification.
- For automated tests or isolated verification, use a separate database path via `REPUBLISHER_DB_PATH` instead of mutating or removing the repo-root database.
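The "any database change -> re-render all connected users with 200ms throttle" rule above can be sketched as follows. This is a minimal illustration, assuming a plain `asyncio.Queue` as the change signal; the names `throttled_render_loop`, `push`, and `demo` are hypothetical and are not part of the repository, whose own plumbing (`RefreshBroker`, `render_stream`) appears in `repub/datastar.py` in this diff.

```python
import asyncio
from collections.abc import Callable


async def throttled_render_loop(
    changes: asyncio.Queue,
    render: Callable[[], str],
    push: Callable[[str], None],
    *,
    interval: float = 0.2,
) -> None:
    """Push at most one full view per `interval`, however many changes arrive."""
    while True:
        await changes.get()            # block until something changed
        await asyncio.sleep(interval)  # let a burst of changes accumulate
        while not changes.empty():     # collapse the burst into one render
            changes.get_nowait()
        push(render())                 # send the latest view, not a delta


async def demo() -> list[str]:
    # Hypothetical stand-ins for the database and the SSE connection.
    state = {"count": 0}
    pushed: list[str] = []
    changes: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(
        throttled_render_loop(
            changes,
            lambda: f"<main id='morph'>{state['count']}</main>",
            pushed.append,
            interval=0.01,
        )
    )
    for _ in range(5):  # five writes in quick succession
        state["count"] += 1
        changes.put_nowait("db-changed")
    await asyncio.sleep(0.1)
    task.cancel()
    return pushed
```

Because the view is a function of current state, dropping the intermediate change notifications loses nothing: the single render that follows the burst already reflects all five writes.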


@ -4,60 +4,65 @@ The AnyNews Republisher is a tool for mirroring news content to alternative dist
The organization with the original news content is the "publisher". The organization with the original news content is the "publisher".
The AnyNews Republisher can be configured with various publisher news sources. Then on an interval the Republisher crawls the sources, mirrors the content (text and media) offline into an RSS feed. The AnyNews Republisher is managed through a local web UI. Sources, schedules, and job executions are stored in SQLite. On an interval the Republisher crawls the configured sources and mirrors the content (text and media) offline into an RSS feed.
The [AnyNews app][app] can then be configured to use this mirror (or more than one such mirror). The [AnyNews app][app] can then be configured to use this mirror (or more than one such mirror).
The Republisher currently accepts the following source input types: The Republisher currently accepts the following source input types:
- RSS Feeds - RSS and Atom feeds
- Pangea sources via `pygea`
[app]: https://gitlab.com/guardianproject/anynews/anynews-web-client [app]: https://gitlab.com/guardianproject/anynews/anynews-web-client
## Usage
Sync dependencies and start the admin UI:
``` shell ```sh
nix develop
uv sync --all-groups uv sync --all-groups
cat > repub.toml <<'EOF' uv run repub
out_dir = "out"
[[feeds]]
name = "Guardian Project Podcast"
slug = "gp-pod"
url = "https://guardianproject.info/podcast/podcast.xml"
[[feeds]]
name = "NASA Breaking News"
slug = "nasa"
url = "https://www.nasa.gov/rss/dyn/breaking_news.rss"
EOF
uv run repub --config repub.toml
``` ```
`out_dir` may be relative or absolute. Relative paths are resolved against the With no arguments, `uv run repub` starts the web UI in local dev mode and serves published feed files from `/feeds/...` out of `out/feeds/...`.
directory containing the config file. Each feed now needs a user-provided
`slug`, which is used for output paths and filenames. Optional Scrapy runtime
overrides can be set in the same file:
```toml By default the UI listens on `127.0.0.1:8080`. You can override that with `REPUBLISHER_HOST` and `REPUBLISHER_PORT`, or with:
[scrapy.settings]
LOG_LEVEL = "DEBUG" ```sh
DOWNLOAD_TIMEOUT = 30 uv run repub serve --host 0.0.0.0 --port 8080
``` ```
Additional feed definitions can also be imported from one or more TOML files, If you invoke the `serve` subcommand explicitly, use `--dev-mode` to expose published feeds directly from the Quart app:
including a `pygea`-generated `manifest.toml`:
```toml ```sh
feed_config_files = ["/absolute/path/to/pygea/feed/manifest.toml"] uv run repub serve --dev-mode
``` ```
Imported files only need `[[feeds]]` entries with `name`, `slug`, and `url`. In `--dev-mode`, requests under `/feeds/...` are served from `out/feeds/...`.
See [`demo/README.md`](/home/abel/src/guardianproject/anynews/republisher-redux/demo/README.md) for a self-contained example config. Important: the admin UI has no built-in authentication. Keep it bound to localhost or put it behind a trusted network layer such as Tailscale.
## TODO Once the UI is running:
1. Open `http://127.0.0.1:8080/`.
2. Create a source. Feed sources take a feed URL. Pangea sources take a domain plus category configuration.
3. Configure the job schedule and any spider arguments.
4. Use `Run now` to trigger an immediate crawl, or leave the job enabled for scheduled runs.
5. Watch running jobs and logs live from the Runs pages.
Operational notes:
- The default database path is `republisher.db`. Set `REPUBLISHER_DB_PATH` to use a different SQLite file.
- Mirrored feeds are written under `out/feeds/<slug>/`.
- Job logs and stats artifacts are written under `out/logs/`.
The legacy one-shot config-driven crawler is still available:
```sh
uv run repub crawl -c repub.toml
```
## Roadmap
- [x] Offlines RSS feed xml - [x] Offlines RSS feed xml
- [x] Downloads media and enclosures - [x] Downloads media and enclosures
@ -68,9 +73,8 @@ See [`demo/README.md`](/home/abel/src/guardianproject/anynews/republisher-redux/
- [ ] Image compression - Do we want this? -> DEFERED for now - [ ] Image compression - Do we want this? -> DEFERED for now
- [x] Download and rewrite media embedded in content/CDATA fields - [x] Download and rewrite media embedded in content/CDATA fields
- [x] Config file to drive the program - [x] Config file to drive the program
- [ ] Add sqlite database and simple admin UI to replace config - [x] Add sqlite database and simple admin UI to replace config
- [ ] Integrate pygea as input source - [x] Integrate pygea as input source
- [ ] Daemonize the program
- [ ] Operationalize with metrics and error reporting - [ ] Operationalize with metrics and error reporting
## License ## License


@ -239,7 +239,10 @@
inherit src; inherit src;
dontConfigure = true; dontConfigure = true;
dontBuild = true; dontBuild = true;
nativeBuildInputs = [ testVenv ]; nativeBuildInputs = [
pkgs.pyright
testVenv
];
checkPhase = '' checkPhase = ''
runHook preCheck runHook preCheck
pyright pyright


@ -19,6 +19,7 @@ dependencies = [
"aiosqlite>=0.21.0,<0.22.0", "aiosqlite>=0.21.0,<0.22.0",
"datastar-py>=0.8.0,<0.9.0", "datastar-py>=0.8.0,<0.9.0",
"greenlet>=3.2.4,<4.0.0", "greenlet>=3.2.4,<4.0.0",
"htpy>=25.12.0,<26.0.0",
"peewee>=3.19.0,<4.0.0", "peewee>=3.19.0,<4.0.0",
"pygea @ git+https://guardianproject.dev/anynews/pygea.git", "pygea @ git+https://guardianproject.dev/anynews/pygea.git",
] ]
@ -49,6 +50,9 @@ include-package-data = true
where = ["."] where = ["."]
include = ["repub*"] include = ["repub*"]
[tool.setuptools.package-data]
repub = ["sql/*.sql"]
[tool.pytest.ini_options] [tool.pytest.ini_options]
testpaths = ["tests"] testpaths = ["tests"]
@ -65,6 +69,14 @@ max-line-length = "88"
[tool.pyright] [tool.pyright]
include = ["repub", "tests"] include = ["repub", "tests"]
exclude = [
"repub/crawl.py",
"repub/exporters.py",
"repub/media.py",
"repub/rss.py",
"repub/spiders",
"repub/srcset.py",
]
pythonVersion = "3.13" pythonVersion = "3.13"
typeCheckingMode = "basic" typeCheckingMode = "basic"
reportMissingImports = false reportMissingImports = false

412
repub/components.py Normal file

@ -0,0 +1,412 @@
from __future__ import annotations
import htpy as h
from htpy import Node, Renderable
def base_layout(*, page_title: str, stylesheet_href: str, content: Node) -> Renderable:
return h.html(lang="en", class_="h-full bg-slate-100")[
h.head[
h.meta(charset="utf-8"),
h.meta(name="viewport", content="width=device-width, initial-scale=1"),
h.title[page_title],
h.link(rel="stylesheet", href=stylesheet_href),
],
h.body(
class_="h-full bg-linear-to-br from-stone-100 via-amber-50 to-orange-100 text-slate-900"
)[content],
]
def nav_link(
*, label: str, href: str, active: bool = False, badge: str | None = None
) -> Renderable:
link_class = (
"group flex items-center justify-between rounded-xl px-3 py-2 text-sm font-medium transition "
+ (
"bg-white text-slate-950 shadow-sm ring-1 ring-white/10"
if active
else "text-slate-300 hover:bg-white/5 hover:text-white"
)
)
badge_class = "rounded-full px-2 py-0.5 text-[11px] font-semibold " + (
"bg-amber-200 text-amber-950" if active else "bg-slate-800 text-slate-300"
)
return h.a(href=href, class_=link_class)[
h.span[label],
badge and h.span(class_=badge_class)[badge],
]
def admin_sidebar(*, current_path: str) -> Renderable:
return h.aside(
class_="relative overflow-hidden bg-slate-950 px-6 py-8 text-white lg:min-h-screen"
)[
h.div(
class_="absolute inset-x-0 top-0 h-40 bg-radial from-amber-400/25 via-amber-400/10 to-transparent"
),
h.div(class_="relative flex h-full flex-col")[
h.div(class_="flex items-center gap-3")[
h.div(
class_="flex size-11 items-center justify-center rounded-2xl bg-amber-400 text-base font-black text-slate-950"
)["AR"],
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.24em] text-amber-300"
)["Republisher"],
],
],
h.nav(class_="mt-10 space-y-2")[
nav_link(
label="Dashboard",
href="/",
active=current_path == "/",
badge="Live",
),
nav_link(
label="Sources",
href="/sources",
active=current_path.startswith("/sources"),
badge="12",
),
nav_link(
label="Runs",
href="/runs",
active=current_path.startswith("/runs")
or current_path.startswith("/job/"),
badge="3",
),
],
h.div(class_="mt-auto rounded-3xl bg-white/5 p-5 ring-1 ring-white/10")[
h.p(class_="text-sm font-semibold text-white")[
"AnyNews Republisher v2.0"
],
h.p(class_="mt-4 text-xs uppercase tracking-[0.22em] text-slate-400")[
"by Guardian Project"
],
],
],
]
def header_action_link(*, href: str, label: str) -> Renderable:
return h.a(
href=href,
class_="inline-flex items-center rounded-full bg-amber-400 px-4 py-2.5 text-sm font-semibold text-slate-950 shadow-sm transition hover:bg-amber-300",
)[label]
def header_secondary_link(*, href: str, label: str) -> Renderable:
return h.a(
href=href,
class_="inline-flex items-center rounded-full border border-white/15 bg-white/5 px-4 py-2.5 text-sm font-semibold text-white transition hover:bg-white/10",
)[label]
def muted_action_link(*, href: str, label: str) -> Renderable:
return h.a(
href=href,
class_="inline-flex items-center rounded-full border border-slate-200 bg-white px-3.5 py-2 text-sm font-semibold text-slate-700 shadow-sm transition hover:bg-slate-50",
)[label]
def inline_link(*, href: str, label: str, tone: str = "default") -> Renderable:
classes = {
"default": "text-slate-700 hover:text-slate-950",
"amber": "text-amber-700 hover:text-amber-800",
"rose": "text-rose-700 hover:text-rose-800",
}
return h.a(
href=href,
class_=f"inline-flex items-center whitespace-nowrap text-sm font-semibold {classes[tone]}",
)[label]
def inline_button(
*, label: str, tone: str = "default", disabled: bool = False
) -> Renderable:
classes = {
"default": "bg-stone-100 text-slate-700 hover:bg-stone-200",
"danger": "bg-rose-50 text-rose-700 hover:bg-rose-100",
"success": "bg-emerald-100 text-emerald-800 hover:bg-emerald-200",
}
class_name = (
"cursor-not-allowed bg-slate-100 text-slate-400" if disabled else classes[tone]
)
return h.button(
type="button",
disabled=disabled,
class_=f"inline-flex items-center whitespace-nowrap rounded-full px-3 py-1.5 text-sm font-semibold transition {class_name}",
)[label]
def page_shell(
*,
current_path: str,
eyebrow: str,
title: str,
description: str | None = None,
actions: Node | None = None,
content: Node,
) -> Renderable:
return h.main(
id="morph",
class_="min-h-screen lg:grid lg:grid-cols-[18rem_minmax(0,1fr)]",
)[
admin_sidebar(current_path=current_path),
h.div(class_="px-4 py-4 sm:px-5 lg:px-6 lg:py-5")[
h.div(class_="mx-auto max-w-7xl space-y-5")[
h.section[
h.div(
class_="flex flex-col gap-4 sm:flex-row sm:items-start sm:justify-between"
)[
h.div(class_="max-w-3xl")[
h.h1(
class_="text-3xl font-semibold tracking-tight text-slate-950"
)[title],
(
description
and h.p(class_="mt-1 text-sm text-slate-600")[
description
]
),
],
actions and h.div(class_="flex flex-wrap gap-2")[actions],
]
],
content,
]
],
]
def section_card(*, content: Node) -> Renderable:
return h.section(class_="space-y-4")[content]
def table_section(
*,
eyebrow: str | None = None,
title: str,
subtitle: str | None = None,
empty_message: str,
headers: tuple[str, ...],
rows: tuple[tuple[Node, ...], ...],
actions: Node | None = None,
) -> Renderable:
def render_row(row: tuple[Node, ...]) -> Renderable:
first_cell, *other_cells = row
return h.tr(class_="align-top")[
h.td(class_="py-4 pr-6 pl-4 text-sm font-medium text-slate-950 sm:pl-6")[
first_cell
],
(
h.td(
class_="px-3 py-4 align-top text-sm whitespace-nowrap text-slate-600"
)[cell]
for cell in other_cells
),
]
body_rows: Node
if rows:
body_rows = (render_row(row) for row in rows)
else:
body_rows = h.tr[
h.td(
colspan=str(len(headers)),
class_="px-4 py-8 text-center text-sm text-slate-500 sm:px-6",
)[empty_message]
]
return h.section[
h.div(class_="flex flex-col gap-3 sm:flex-row sm:items-end sm:justify-between")[
h.div[
eyebrow
and h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)[eyebrow],
h.h2(class_="mt-1 text-xl font-semibold text-slate-950")[title],
subtitle and h.p(class_="mt-1 text-sm text-slate-600")[subtitle],
],
actions,
],
h.div(
class_="mt-3 overflow-hidden rounded-2xl bg-white shadow-sm ring-1 ring-slate-200"
)[
h.div(class_="overflow-x-auto")[
h.table(
class_="relative w-full min-w-[72rem] divide-y divide-slate-200 table-auto"
)[
h.thead(class_="bg-stone-50")[
h.tr[
(
h.th(
scope="col",
class_="px-3 py-2.5 text-left text-xs font-semibold uppercase tracking-[0.18em] whitespace-nowrap text-slate-500 first:pl-4 sm:first:pl-6",
)[header]
for header in headers
)
]
],
h.tbody(class_="divide-y divide-slate-200 bg-white")[body_rows],
]
]
],
]
def stat_card(*, label: str, value: str, detail: str) -> Renderable:
return h.div(
class_="rounded-3xl bg-white/85 p-5 shadow-sm ring-1 ring-slate-200 backdrop-blur"
)[
h.dt(class_="text-sm font-medium text-slate-500")[label],
h.dd(class_="mt-3 text-3xl font-semibold tracking-tight text-slate-950")[value],
h.p(class_="mt-2 text-sm text-slate-600")[detail],
]
def input_field(
*,
label: str,
field_id: str,
value: str = "",
placeholder: str = "",
help_text: str | None = None,
signal_name: str | None = None,
disabled: bool = False,
) -> Renderable:
class_name = (
"mt-2 block w-full rounded-2xl border-0 px-3.5 py-2.5 text-sm shadow-sm ring-1 "
+ (
"cursor-not-allowed bg-slate-100 text-slate-500 ring-slate-200"
if disabled
else "bg-white text-slate-900 ring-slate-200 placeholder:text-slate-400 focus:outline-hidden focus:ring-2 focus:ring-amber-500"
)
)
return h.div[
h.label(for_=field_id, class_="block text-sm font-medium text-slate-900")[
label
],
h.input(
{"data-bind": signal_name} if signal_name is not None else {},
id=field_id,
name=field_id,
type="text",
value=value,
placeholder=placeholder,
disabled=disabled,
class_=class_name,
),
help_text and h.p(class_="mt-2 text-xs text-slate-500")[help_text],
]
def select_field(
*,
label: str,
field_id: str,
options: tuple[str, ...],
selected: str,
help_text: str | None = None,
signal_name: str | None = None,
) -> Renderable:
return h.div[
h.label(for_=field_id, class_="block text-sm font-medium text-slate-900")[
label
],
h.select(
{"data-bind": signal_name} if signal_name is not None else {},
id=field_id,
name=field_id,
class_="mt-2 block w-full rounded-2xl border-0 bg-white px-3.5 py-2.5 text-sm text-slate-900 shadow-sm ring-1 ring-slate-200 focus:outline-hidden focus:ring-2 focus:ring-amber-500",
)[
(
h.option(value=option, selected=option == selected)[option]
for option in options
)
],
help_text and h.p(class_="mt-2 text-xs text-slate-500")[help_text],
]
def textarea_field(
*,
label: str,
field_id: str,
value: str,
rows: str = "4",
signal_name: str | None = None,
) -> Renderable:
return h.div[
h.label(for_=field_id, class_="block text-sm font-medium text-slate-900")[
label
],
h.textarea(
{"data-bind": signal_name} if signal_name is not None else {},
id=field_id,
name=field_id,
rows=rows,
class_="mt-2 block w-full rounded-2xl border-0 bg-white px-3.5 py-2.5 text-sm text-slate-900 shadow-sm ring-1 ring-slate-200 placeholder:text-slate-400 focus:outline-hidden focus:ring-2 focus:ring-amber-500",
)[value],
]
def toggle_field(
*,
label: str,
description: str,
signal_name: str,
checked: bool = False,
) -> Renderable:
signal_value = str(checked).lower()
return h.div(
{"data-signals__ifmissing": f"{{{signal_name}: {signal_value}}}"},
class_="rounded-3xl bg-white p-4 shadow-sm",
)[
h.div(class_="flex items-start justify-between gap-4")[
h.div[
h.h3(class_="text-sm font-semibold text-slate-900")[label],
h.p(class_="mt-1 text-sm text-slate-600")[description],
],
h.label(class_="mt-0.5 cursor-pointer")[
h.div(
{
"data-class:bg-amber-500": f"${signal_name}",
"data-class:bg-slate-200": f"!${signal_name}",
},
class_="group relative inline-flex w-11 shrink-0 rounded-full bg-slate-200 p-0.5 outline-offset-2 outline-amber-500 transition",
)[
h.span(
{
"data-class:translate-x-5": f"${signal_name}",
"data-class:translate-x-0": f"!${signal_name}",
},
class_="size-5 translate-x-0 rounded-full bg-white shadow-xs ring-1 ring-slate-900/5 transition-transform",
),
h.input(
{"data-bind": signal_name},
type="checkbox",
name=signal_name,
checked=checked,
class_="sr-only",
),
],
],
]
]
def status_badge(*, label: str, tone: str) -> Renderable:
tones = {
"running": "bg-emerald-100 text-emerald-800",
"scheduled": "bg-sky-100 text-sky-800",
"idle": "bg-slate-200 text-slate-700",
"failed": "bg-rose-100 text-rose-800",
"done": "bg-emerald-100 text-emerald-800",
}
return h.span(
class_=f"inline-flex rounded-full px-2.5 py-1 text-xs font-semibold {tones[tone]}"
)[label]
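The `toggle_field` component above builds its `data-signals__ifmissing` attribute with a triple-braced f-string, which is easy to misread. A minimal sketch of just that step, using only the standard library; the helper name `ifmissing_signal` is hypothetical and does not exist in `repub/components.py`.

```python
def ifmissing_signal(signal_name: str, checked: bool) -> dict[str, str]:
    """Build the attribute dict that declares a signal only if it is missing."""
    # Python's True/False must become the JavaScript literals true/false.
    signal_value = str(checked).lower()
    # In an f-string, "{{" and "}}" emit literal braces, so "{{{name}: ...}}"
    # yields an object literal such as "{enabled: true}".
    return {"data-signals__ifmissing": f"{{{signal_name}: {signal_value}}}"}


print(ifmissing_signal("enabled", True))
```

Keeping the signal declaration as a plain dict matches how the component passes attributes that are not valid Python keyword names as the first positional argument to the element.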


@ -30,6 +30,14 @@ class RepublisherConfig:
scrapy_settings: dict[str, Any] scrapy_settings: dict[str, Any]
def feed_output_dir(*, out_dir: Path, feed_slug: str) -> Path:
return out_dir / "feeds" / feed_slug
def feed_output_path(*, out_dir: Path, feed_slug: str) -> Path:
return feed_output_dir(out_dir=out_dir, feed_slug=feed_slug) / "feed.rss"
def _resolve_path(base_path: Path, value: str) -> Path: def _resolve_path(base_path: Path, value: str) -> Path:
path = Path(value).expanduser() path = Path(value).expanduser()
if not path.is_absolute(): if not path.is_absolute():
@ -173,7 +181,7 @@ def build_feed_settings(
out_dir: Path, out_dir: Path,
feed_slug: str, feed_slug: str,
) -> Settings: ) -> Settings:
feed_dir = out_dir / feed_slug feed_dir = feed_output_dir(out_dir=out_dir, feed_slug=feed_slug)
image_dir = base_settings.get("REPUBLISHER_IMAGE_DIR", IMAGE_DIR) image_dir = base_settings.get("REPUBLISHER_IMAGE_DIR", IMAGE_DIR)
video_dir = base_settings.get("REPUBLISHER_VIDEO_DIR", VIDEO_DIR) video_dir = base_settings.get("REPUBLISHER_VIDEO_DIR", VIDEO_DIR)
audio_dir = base_settings.get("REPUBLISHER_AUDIO_DIR", AUDIO_DIR) audio_dir = base_settings.get("REPUBLISHER_AUDIO_DIR", AUDIO_DIR)
@ -192,7 +200,7 @@ def build_feed_settings(
{ {
"REPUBLISHER_OUT_DIR": str(out_dir), "REPUBLISHER_OUT_DIR": str(out_dir),
"FEEDS": { "FEEDS": {
str(out_dir / f"{feed_slug}.rss"): { str(feed_output_path(out_dir=out_dir, feed_slug=feed_slug)): {
"format": "rss", "format": "rss",
"postprocessing": [], "postprocessing": [],
"feed_name": feed_slug, "feed_name": feed_slug,
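The effect of this change on the output layout can be checked in isolation. The two helpers below are copied from the hunk above; the `out` directory and the `nasa` slug are example values only.

```python
from pathlib import Path


def feed_output_dir(*, out_dir: Path, feed_slug: str) -> Path:
    # Every feed now lives under a shared "feeds" directory.
    return out_dir / "feeds" / feed_slug


def feed_output_path(*, out_dir: Path, feed_slug: str) -> Path:
    # The feed file has a fixed name inside the feed's own directory.
    return feed_output_dir(out_dir=out_dir, feed_slug=feed_slug) / "feed.rss"


new_path = feed_output_path(out_dir=Path("out"), feed_slug="nasa")
old_path = Path("out") / "nasa.rss"  # layout before this change
print(old_path.as_posix(), "->", new_path.as_posix())
```

Before the change the feed file sat next to the feed directory as `<out_dir>/<slug>.rss`; afterwards both the media directory and the feed file share `<out_dir>/feeds/<slug>/`, which is what lets `/feeds/...` be served from a single directory tree.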


@ -11,6 +11,7 @@ from repub.config import (
FeedConfig, FeedConfig,
build_base_settings, build_base_settings,
build_feed_settings, build_feed_settings,
feed_output_dir,
load_config, load_config,
) )
from repub.media import check_runtime from repub.media import check_runtime
@ -30,7 +31,9 @@ class FeedNameFilter:
def prepare_output_dirs(out_dir: Path, feed_name: str) -> None: def prepare_output_dirs(out_dir: Path, feed_name: str) -> None:
(out_dir / "logs").mkdir(parents=True, exist_ok=True) (out_dir / "logs").mkdir(parents=True, exist_ok=True)
(out_dir / "httpcache").mkdir(parents=True, exist_ok=True) (out_dir / "httpcache").mkdir(parents=True, exist_ok=True)
(out_dir / feed_name).mkdir(parents=True, exist_ok=True) feed_output_dir(out_dir=out_dir, feed_slug=feed_name).mkdir(
parents=True, exist_ok=True
)
def create_feed_crawler( def create_feed_crawler(

89
repub/datastar.py Normal file

@ -0,0 +1,89 @@
from __future__ import annotations

import asyncio
import hashlib
from collections.abc import AsyncGenerator, Awaitable, Callable
from typing import Protocol

from datastar_py import ServerSentEventGenerator as SSE
from datastar_py.sse import DatastarEvent


class HtmlRenderable(Protocol):
    def __html__(self) -> str: ...


RenderResult = str | HtmlRenderable
RenderFunction = Callable[[], Awaitable[RenderResult]]


class RefreshBroker:
    def __init__(self) -> None:
        self._subscribers: dict[asyncio.Queue[object], asyncio.AbstractEventLoop] = {}

    def subscribe(self) -> asyncio.Queue[object]:
        queue: asyncio.Queue[object] = asyncio.Queue(maxsize=1)
        self._subscribers[queue] = asyncio.get_running_loop()
        return queue

    def unsubscribe(self, queue: asyncio.Queue[object]) -> None:
        self._subscribers.pop(queue, None)

    def publish(self, event: object = "refresh-event") -> None:
        for queue, loop in tuple(self._subscribers.items()):
            loop.call_soon_threadsafe(_publish_event, queue, event)


def _publish_event(queue: asyncio.Queue[object], event: object) -> None:
    if queue.full():
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            pass
    try:
        queue.put_nowait(event)
    except asyncio.QueueFull:
        return


async def render_sse_event(
    render: RenderFunction, *, last_event_id: str | None = None
) -> tuple[str | None, DatastarEvent | None]:
    html = _coerce_html(await render())
    event_id = _render_hash(html)
    if event_id == last_event_id:
        return last_event_id, None
    return event_id, SSE.patch_elements(html, event_id=event_id)


async def render_stream(
    queue: asyncio.Queue[object],
    render: RenderFunction,
    *,
    last_event_id: str | None = None,
    render_on_connect: bool = True,
) -> AsyncGenerator[DatastarEvent, None]:
    if render_on_connect:
        last_event_id, event = await render_sse_event(
            render, last_event_id=last_event_id
        )
        if event is not None:
            yield event
    while True:
        await queue.get()
        last_event_id, event = await render_sse_event(
            render, last_event_id=last_event_id
        )
        if event is not None:
            yield event


def _coerce_html(view: RenderResult) -> str:
    if isinstance(view, str):
        return view
    return view.__html__()


def _render_hash(html: str) -> str:
    return hashlib.blake2s(html.encode("utf-8"), digest_size=16).hexdigest()


@ -34,14 +34,19 @@ def parse_args(argv: list[str] | None = None) -> tuple[str, argparse.Namespace]:
 serve_parser = subparsers.add_parser("serve", help="Start the republisher web UI")
 serve_parser.add_argument(
 "--host",
-default=os.environ.get("REPUB_HOST", "127.0.0.1"),
+default=os.environ.get("REPUBLISHER_HOST", "127.0.0.1"),
 help="Host interface for the web UI",
 )
 serve_parser.add_argument(
 "--port",
-default=os.environ.get("REPUB_PORT", "8080"),
+default=os.environ.get("REPUBLISHER_PORT", "8080"),
 help="Port for the web UI",
 )
+serve_parser.add_argument(
+"--dev-mode",
+action="store_true",
+help="Serve published feeds from /feeds for local development",
+)
 crawl_parser = subparsers.add_parser("crawl", help="Run the feed crawler once")
 crawl_parser.add_argument(
@ -51,11 +56,11 @@ def parse_args(argv: list[str] | None = None) -> tuple[str, argparse.Namespace]:
 help="Path to runtime config TOML file",
 )
 if not raw_args:
-raw_args = ["serve"]
+raw_args = ["serve", "--dev-mode"]
 elif raw_args[0] in {"-c", "--config"}:
 raw_args = ["crawl", *raw_args]
 elif raw_args[0] not in {"serve", "crawl"}:
-raw_args = ["serve", *raw_args]
+raw_args = ["serve", "--dev-mode", *raw_args]
 args = parser.parse_args(raw_args)
 command = args.command or "serve"
@ -72,10 +77,10 @@ def entrypoint(argv: list[str] | None = None) -> int:
 try:
 port = int(args.port)
 except ValueError:
-logger.error("Invalid REPUB_PORT/--port value: %s", args.port)
+logger.error("Invalid REPUBLISHER_PORT/--port value: %s", args.port)
 return 2
-app = create_app()
+app = create_app(dev_mode=bool(args.dev_mode))
 app.run(host=args.host, port=port)
 return 0
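The argv defaulting in the `parse_args` hunk above can be read as a small pure function. The sketch below is only an illustration, not code from this change: `normalize_argv` is a hypothetical helper name, and it mirrors nothing more than the branch logic visible in the hunk.

```python
def normalize_argv(raw_args: list[str]) -> list[str]:
    """Hypothetical helper mirroring the argv defaulting shown in parse_args."""
    if not raw_args:
        # No arguments at all: start the web UI in dev mode.
        return ["serve", "--dev-mode"]
    if raw_args[0] in {"-c", "--config"}:
        # A leading config flag implies a one-off crawl.
        return ["crawl", *raw_args]
    if raw_args[0] not in {"serve", "crawl"}:
        # Any other leading token is treated as an option to an implicit serve.
        return ["serve", "--dev-mode", *raw_args]
    # An explicit subcommand is passed through unchanged.
    return list(raw_args)


print(normalize_argv([]))
print(normalize_argv(["--port", "9000"]))
print(normalize_argv(["serve"]))
```

Under this reading, only the implicit forms gain `--dev-mode`; an explicit `serve` is left as typed.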

468
repub/job_runner.py Normal file

@ -0,0 +1,468 @@
from __future__ import annotations
import argparse
import json
import signal
import sys
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from pygea.config import LoggingConfig, PygeaConfig, ResultsConfig, RuntimeConfig
from scrapy.crawler import CrawlerProcess
from scrapy.statscollectors import StatsCollector
from twisted.python.failure import Failure
from repub.config import (
FeedConfig,
RepublisherConfig,
build_base_settings,
build_feed_settings,
feed_output_dir,
)
from repub.crawl import prepare_output_dirs
from repub.model import (
Job,
Source,
SourceFeed,
SourcePangea,
database,
initialize_database,
)
from repub.spiders.rss_spider import RssFeedSpider
def _json_default(value: Any) -> Any:
if isinstance(value, datetime):
if value.tzinfo is None:
return value.replace(tzinfo=UTC).isoformat()
return value.astimezone(UTC).isoformat()
return str(value)
def _normalized_stats(stats: dict[str, Any]) -> dict[str, Any]:
cache_store = int(stats.get("httpcache/store", 0))
cache_hits = int(stats.get("httpcache/hit", 0))
cache_misses = int(stats.get("httpcache/miss", 0))
return {
**stats,
"requests_count": int(stats.get("downloader/request_count", 0)),
"items_count": int(stats.get("item_scraped_count", 0)),
"warnings_count": int(stats.get("log_count/WARNING", 0)),
"errors_count": int(stats.get("log_count/ERROR", 0)),
"bytes_count": int(stats.get("downloader/response_bytes", 0)),
"retries_count": int(stats.get("retry/count", 0)),
"exceptions_count": int(stats.get("spider_exceptions/count", 0)),
"cache_size_count": cache_store,
"cache_object_count": cache_store + cache_hits + cache_misses,
}
class ExecutionStatsCollector(StatsCollector):
def __init__(self, crawler: Any):
super().__init__(crawler)
self._stats_path = Path(crawler.settings["REPUB_JOB_STATS_PATH"])
self._stats_path.parent.mkdir(parents=True, exist_ok=True)
def set_value(self, key: str, value: Any, spider: Any | None = None) -> None:
super().set_value(key, value, spider)
self._write_snapshot()
def set_stats(self, stats: dict[str, Any], spider: Any | None = None) -> None:
super().set_stats(stats, spider)
self._write_snapshot()
def inc_value(
self,
key: str,
count: int = 1,
start: int = 0,
spider: Any | None = None,
) -> None:
super().inc_value(key, count, start, spider)
self._write_snapshot()
def max_value(self, key: str, value: Any, spider: Any | None = None) -> None:
super().max_value(key, value, spider)
self._write_snapshot()
def min_value(self, key: str, value: Any, spider: Any | None = None) -> None:
super().min_value(key, value, spider)
self._write_snapshot()
def clear_stats(self, spider: Any | None = None) -> None:
super().clear_stats(spider)
self._write_snapshot()
def open_spider(self, spider: Any | None = None) -> None:
super().open_spider(spider)
self._write_snapshot()
def _persist_stats(self, stats: dict[str, Any]) -> None:
self._write_snapshot(stats)
def _write_snapshot(self, stats: dict[str, Any] | None = None) -> None:
payload = {
"timestamp": datetime.now(UTC).isoformat(),
**_normalized_stats(self._stats if stats is None else stats),
}
with self._stats_path.open("a", encoding="utf-8") as handle:
handle.write(json.dumps(payload, sort_keys=True, default=_json_default))
handle.write("\n")
def pangea_feed_class():
from pygea.pangeafeed import PangeaFeed
return PangeaFeed
def generate_pangea_feed(
*,
name: str,
slug: str,
domain: str,
category_name: str,
content_type: str,
only_newest: bool,
max_articles: int,
oldest_article: int,
include_authors: bool,
exclude_media: bool,
include_content: bool,
content_format: str,
out_dir: str | Path,
log_path: str | Path,
) -> Path:
resolved_out_dir = Path(out_dir).resolve()
resolved_log_path = Path(log_path).resolve()
pangea_out_dir = feed_output_dir(out_dir=resolved_out_dir, feed_slug=slug)
config = PygeaConfig(
config_path=resolved_out_dir / "pygea-runtime.toml",
domain=domain,
default_content_type=content_type,
feeds=(
{
"name": category_name,
"slug": slug,
"only_newest": only_newest,
"content_type": content_type,
},
),
runtime=RuntimeConfig(
api_key=None,
max_articles=max_articles,
oldest_article=oldest_article,
authors_p=include_authors,
no_media_p=exclude_media,
content_inc_p=include_content,
content_format=content_format,
verbose_p=True,
),
results=ResultsConfig(
output_to_file_p=True,
output_file_name="pangea.rss",
output_directory=pangea_out_dir.parent,
),
logging=LoggingConfig(
log_file=resolved_log_path,
default_log_level="INFO",
),
)
feed_class = pangea_feed_class()
feed = feed_class(config, list(config.feeds))
feed.acquire_content()
feed.generate_feed()
output_path = feed.disgorge(slug)
if output_path is None:
raise RuntimeError(f"pygea did not write an output file for {name!r}")
return output_path.resolve()
@dataclass(frozen=True)
class JobSourceConfig:
source_name: str
source_slug: str
source_type: str
spider_arguments: dict[str, str]
feed_url: str | None = None
pangea_domain: str | None = None
pangea_category: str | None = None
content_type: str | None = None
only_newest: bool = True
max_articles: int = 10
oldest_article: int = 3
include_authors: bool = True
exclude_media: bool = False
include_content: bool = True
content_format: str = "MOBILE_3"
def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Run a republisher job worker")
parser.add_argument("--job-id", type=int, required=True)
parser.add_argument("--execution-id", type=int, required=True)
parser.add_argument("--db-path", required=True)
parser.add_argument("--out-dir", required=True)
parser.add_argument("--stats-path", required=True)
return parser.parse_args(argv)
def main(argv: list[str] | None = None) -> int:
args = parse_args(argv)
stop_requested = False
process: CrawlerProcess | None = None
def request_stop(signum: int, frame: object | None) -> None:
del signum, frame
nonlocal stop_requested
stop_requested = True
print(
f"worker[{args.job_id}:{args.execution_id}]: graceful stop requested",
flush=True,
)
if process is None:
return
try:
from twisted.internet import reactor
call_from_thread = getattr(reactor, "callFromThread", None)
if callable(call_from_thread):
call_from_thread(process.stop)
else:
process.stop()
except Exception as error:
print(
f"worker[{args.job_id}:{args.execution_id}]: failed to stop reactor gracefully: {error}",
flush=True,
)
signal.signal(signal.SIGTERM, request_stop)
signal.signal(signal.SIGINT, request_stop)
try:
source_config = _load_job_source_config(
db_path=args.db_path, job_id=args.job_id
)
except Exception as error:
print(
f"worker[{args.job_id}:{args.execution_id}]: failed to load job config: {error}",
flush=True,
)
return 1
out_dir = Path(args.out_dir).resolve()
stats_path = Path(args.stats_path).resolve()
log_path = stats_path.with_suffix(".log")
try:
feed = _resolve_feed(
source_config=source_config,
out_dir=out_dir,
log_path=log_path,
)
process = CrawlerProcess(
_build_crawl_settings(
out_dir=out_dir,
feed=feed,
stats_path=stats_path,
)
)
print(
f"worker[{args.job_id}:{args.execution_id}]: starting crawl for {source_config.source_slug}",
flush=True,
)
exit_code = _run_crawl(
process=process,
feed=feed,
spider_arguments=source_config.spider_arguments,
)
except Exception as error:
print(
f"worker[{args.job_id}:{args.execution_id}]: crawl failed: {error}",
flush=True,
)
return 1
if stop_requested:
print(
f"worker[{args.job_id}:{args.execution_id}]: stopping after graceful request",
flush=True,
)
return 130
if exit_code == 0:
print(
f"worker[{args.job_id}:{args.execution_id}]: completed successfully",
flush=True,
)
return exit_code
def _load_job_source_config(*, db_path: str, job_id: int) -> JobSourceConfig:
initialize_database(db_path)
primary_key = getattr(Job, "_meta").primary_key
with database.connection_context():
job = (
Job.select(Job, Source)
.join(Source)
.where(primary_key == job_id)
.get_or_none()
)
if job is None:
raise ValueError(f"job {job_id} does not exist")
source = job.source
spider_arguments = _parse_spider_arguments(job.spider_arguments)
if source.source_type == "feed":
feed = SourceFeed.get_or_none(SourceFeed.source == source)
if feed is None:
raise ValueError(
f"feed source {source.slug!r} is missing its feed config"
)
return JobSourceConfig(
source_name=source.name,
source_slug=source.slug,
source_type=source.source_type,
spider_arguments=spider_arguments,
feed_url=feed.feed_url,
)
pangea = SourcePangea.get_or_none(SourcePangea.source == source)
if pangea is None:
raise ValueError(
f"pangea source {source.slug!r} is missing its pangea config"
)
return JobSourceConfig(
source_name=source.name,
source_slug=source.slug,
source_type=source.source_type,
spider_arguments=spider_arguments,
pangea_domain=pangea.domain,
pangea_category=pangea.category_name,
content_type=pangea.content_type,
only_newest=bool(pangea.only_newest),
max_articles=int(pangea.max_articles),
oldest_article=int(pangea.oldest_article),
include_authors=bool(pangea.include_authors),
exclude_media=bool(pangea.exclude_media),
include_content=bool(pangea.include_content),
content_format=pangea.content_format,
)
def _parse_spider_arguments(raw_value: str) -> dict[str, str]:
arguments: dict[str, str] = {}
for raw_line in raw_value.splitlines():
line = raw_line.strip()
if line == "":
continue
key, separator, value = line.partition("=")
key = key.strip()
if separator == "" or key == "":
raise ValueError(
f"invalid spider argument {raw_line!r}; expected key=value"
)
arguments[key] = value
return arguments
def _resolve_feed(
*,
source_config: JobSourceConfig,
out_dir: Path,
log_path: Path,
) -> FeedConfig:
if source_config.source_type == "feed":
assert source_config.feed_url is not None
return FeedConfig(
name=source_config.source_name,
slug=source_config.source_slug,
url=source_config.feed_url,
)
generated_feed_path = generate_pangea_feed(
name=source_config.source_name,
slug=source_config.source_slug,
domain=_require_value(source_config.pangea_domain, "pangea_domain"),
category_name=_require_value(source_config.pangea_category, "pangea_category"),
content_type=_require_value(source_config.content_type, "content_type"),
only_newest=source_config.only_newest,
max_articles=source_config.max_articles,
oldest_article=source_config.oldest_article,
include_authors=source_config.include_authors,
exclude_media=source_config.exclude_media,
include_content=source_config.include_content,
content_format=source_config.content_format,
out_dir=out_dir,
log_path=log_path.with_suffix(".pygea.log"),
)
print(
f"pygea: generated intermediate feed at {generated_feed_path}",
flush=True,
)
return FeedConfig(
name=source_config.source_name,
slug=source_config.source_slug,
url=generated_feed_path.as_uri(),
)
def _build_crawl_settings(*, out_dir: Path, feed: FeedConfig, stats_path: Path):
base_settings = build_base_settings(
RepublisherConfig(
config_path=out_dir / "job-runner.toml",
out_dir=out_dir,
feeds=(feed,),
scrapy_settings={},
)
)
prepare_output_dirs(out_dir, feed.slug)
settings = build_feed_settings(base_settings, out_dir=out_dir, feed_slug=feed.slug)
settings.set("LOG_FILE", None, priority="cmdline")
settings.set(
"STATS_CLASS",
"repub.job_runner.ExecutionStatsCollector",
priority="cmdline",
)
settings.set("REPUB_JOB_STATS_PATH", str(stats_path), priority="cmdline")
return settings
def _run_crawl(
*,
process: CrawlerProcess,
feed: FeedConfig,
spider_arguments: dict[str, str],
) -> int:
results: list[Failure | None] = []
deferred = process.crawl(
RssFeedSpider,
feed_name=feed.slug,
url=feed.url,
**spider_arguments,
)
def handle_success(_: object) -> None:
results.append(None)
return None
def handle_error(failure: Failure) -> None:
print(failure.getTraceback(), flush=True)
results.append(failure)
return None
deferred.addCallbacks(handle_success, handle_error)
process.start()
return 1 if any(result is not None for result in results) else 0
def _require_value(value: str | None, field_name: str) -> str:
if value is None or value == "":
raise ValueError(f"missing {field_name}")
return value
if __name__ == "__main__":
sys.exit(main())
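To make the `spider_arguments` text format concrete, here is a standalone sketch that repeats the parsing rules of `_parse_spider_arguments` outside the module. The sample keys (`max_items`, `user_agent`, `query`) are made up for illustration and are not arguments that `RssFeedSpider` is known to accept.

```python
def parse_spider_arguments(raw_value: str) -> dict[str, str]:
    """Standalone copy of the key=value parsing rules, for illustration only."""
    arguments: dict[str, str] = {}
    for raw_line in raw_value.splitlines():
        line = raw_line.strip()
        if line == "":
            # Blank lines are ignored.
            continue
        # Split on the first "=" only, so values may contain "=".
        key, separator, value = line.partition("=")
        key = key.strip()
        if separator == "" or key == "":
            raise ValueError(
                f"invalid spider argument {raw_line!r}; expected key=value"
            )
        # The value is stored as-is; whitespace after "=" is kept.
        arguments[key] = value
    return arguments


# Hypothetical sample input: one argument per line.
sample = "max_items = 25\n\nuser_agent=Example Bot/1.0\nquery=a=b\n"
parsed = parse_spider_arguments(sample)
print(parsed)
```

Two details follow from the rules as written: only the key is stripped, so `max_items = 25` stores the value `" 25"` with its leading space, and a line without `=` raises `ValueError` instead of being skipped.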

747
repub/jobs.py Normal file

@ -0,0 +1,747 @@
from __future__ import annotations
import json
import subprocess
import sys
from dataclasses import dataclass
from datetime import UTC, datetime, timedelta
from pathlib import Path
from typing import Callable, TextIO, cast
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger
from repub.config import feed_output_dir, feed_output_path
from repub.model import Job, JobExecution, JobExecutionStatus, Source, database, utc_now
SCHEDULER_JOB_PREFIX = "job-"
POLL_JOB_ID = "runtime-poll-workers"
SYNC_JOB_ID = "runtime-sync-jobs"
@dataclass(frozen=True)
class JobArtifacts:
log_path: Path
stats_path: Path
@classmethod
def for_execution(
cls, *, log_dir: Path, job_id: int, execution_id: int
) -> "JobArtifacts":
prefix = f"job-{job_id}-execution-{execution_id}"
return cls(
log_path=log_dir / f"{prefix}.log",
stats_path=log_dir / f"{prefix}.jsonl",
)
@dataclass
class RunningWorker:
execution_id: int
process: subprocess.Popen[str]
log_handle: TextIO
artifacts: JobArtifacts
stats_offset: int = 0
@dataclass(frozen=True)
class ExecutionLogView:
job_id: int
execution_id: int
title: str
description: str
status_label: str
status_tone: str
log_text: str
error_message: str | None = None
class JobRuntime:
def __init__(
self,
*,
log_dir: str | Path,
refresh_callback: Callable[[], None] | None = None,
graceful_stop_seconds: float = 15.0,
) -> None:
self.log_dir = Path(log_dir)
self.refresh_callback = refresh_callback
self.graceful_stop_seconds = graceful_stop_seconds
self.scheduler = BackgroundScheduler(timezone=UTC)
self._workers: dict[int, RunningWorker] = {}
self._started = False
def start(self) -> None:
if self._started:
return
self._reconcile_stale_executions()
self.scheduler.start()
self.scheduler.add_job(
self.poll_workers,
"interval",
id=POLL_JOB_ID,
seconds=0.25,
replace_existing=True,
max_instances=1,
coalesce=True,
)
self.scheduler.add_job(
self.sync_jobs,
"interval",
id=SYNC_JOB_ID,
seconds=1,
replace_existing=True,
max_instances=1,
coalesce=True,
)
self.sync_jobs()
self._started = True
def shutdown(self) -> None:
for execution_id in tuple(self._workers):
worker = self._workers.pop(execution_id)
if worker.process.poll() is None:
worker.process.kill()
worker.process.wait(timeout=2)
worker.log_handle.close()
if self._started:
self.scheduler.shutdown(wait=False)
self._started = False
def sync_jobs(self) -> None:
with database.connection_context():
jobs = tuple(Job.select().where(Job.enabled == True)) # noqa: E712
desired_ids = set()
for job in jobs:
scheduler_job_id = _scheduler_job_id(_job_id(job))
desired_ids.add(scheduler_job_id)
self.scheduler.add_job(
self.run_scheduled_job,
trigger=_job_trigger(job),
args=(_job_id(job),),
id=scheduler_job_id,
replace_existing=True,
max_instances=1,
coalesce=True,
misfire_grace_time=1,
)
for scheduled_job in tuple(self.scheduler.get_jobs()):
if (
scheduled_job.id.startswith(SCHEDULER_JOB_PREFIX)
and scheduled_job.id not in desired_ids
):
self.scheduler.remove_job(scheduled_job.id)
def run_scheduled_job(self, job_id: int) -> None:
self.run_job_now(job_id, reason="scheduled")
def run_job_now(self, job_id: int, *, reason: str) -> int | None:
del reason
self.start()
with database.connection_context():
job = Job.get_or_none(id=job_id)
if job is None:
return None
already_running = (
JobExecution.select()
.where(
(JobExecution.job == job)
& (JobExecution.running_status == JobExecutionStatus.RUNNING)
)
.exists()
)
if already_running:
return None
execution = JobExecution.create(
job=job,
started_at=utc_now(),
running_status=JobExecutionStatus.RUNNING,
)
execution_id = _execution_id(execution)
artifacts = JobArtifacts.for_execution(
log_dir=self.log_dir, job_id=job_id, execution_id=execution_id
)
artifacts.log_path.parent.mkdir(parents=True, exist_ok=True)
log_handle = artifacts.log_path.open("a", encoding="utf-8", buffering=1)
log_handle.write(
f"scheduler: starting execution {execution_id} for job {job_id}\n"
)
process = subprocess.Popen(
[
sys.executable,
"-u",
"-m",
"repub.job_runner",
"--job-id",
str(job_id),
"--execution-id",
str(execution_id),
"--db-path",
str(database.database),
"--out-dir",
str(self.log_dir.parent),
"--stats-path",
str(artifacts.stats_path),
],
stdout=log_handle,
stderr=subprocess.STDOUT,
text=True,
)
self._workers[execution_id] = RunningWorker(
execution_id=execution_id,
process=process,
log_handle=log_handle,
artifacts=artifacts,
)
self._trigger_refresh()
return execution_id
def request_execution_cancel(self, execution_id: int) -> bool:
with database.connection_context():
execution = JobExecution.get_or_none(id=execution_id)
if execution is None:
return False
if execution.running_status != JobExecutionStatus.RUNNING:
return False
if execution.stop_requested_at is None:
execution.stop_requested_at = utc_now()
execution.save()
worker = self._workers.get(execution_id)
if worker is not None and worker.process.poll() is None:
worker.log_handle.write(
f"scheduler: graceful stop requested for execution {execution_id}\n"
)
worker.process.terminate()
self._trigger_refresh()
return True
def set_job_enabled(self, job_id: int, *, enabled: bool) -> bool:
with database.connection_context():
job = Job.get_or_none(id=job_id)
if job is None:
return False
job.enabled = enabled
job.save()
self.sync_jobs()
self._trigger_refresh()
return True
def poll_workers(self) -> None:
for execution_id in tuple(self._workers):
worker = self._workers[execution_id]
self._apply_stats(worker)
self._enforce_graceful_stop(worker)
returncode = worker.process.poll()
if returncode is None:
continue
self._apply_stats(worker)
with database.connection_context():
execution = JobExecution.get_by_id(execution_id)
execution.ended_at = utc_now()
execution.running_status = _final_status(
execution=execution,
returncode=returncode,
)
execution.save()
worker.log_handle.close()
del self._workers[execution_id]
self._trigger_refresh()
def _apply_stats(self, worker: RunningWorker) -> None:
if not worker.artifacts.stats_path.exists():
return
with worker.artifacts.stats_path.open("r", encoding="utf-8") as handle:
handle.seek(worker.stats_offset)
payload = handle.read()
worker.stats_offset = handle.tell()
lines = [line for line in payload.splitlines() if line.strip()]
if not lines:
return
stats = json.loads(lines[-1])
with database.connection_context():
execution = JobExecution.get_by_id(worker.execution_id)
execution.requests_count = int(stats.get("requests_count", 0))
execution.items_count = int(stats.get("items_count", 0))
execution.warnings_count = int(stats.get("warnings_count", 0))
execution.errors_count = int(stats.get("errors_count", 0))
execution.bytes_count = int(stats.get("bytes_count", 0))
execution.retries_count = int(stats.get("retries_count", 0))
execution.exceptions_count = int(stats.get("exceptions_count", 0))
execution.cache_size_count = int(stats.get("cache_size_count", 0))
execution.cache_object_count = int(stats.get("cache_object_count", 0))
execution.raw_stats = json.dumps(stats, sort_keys=True)
execution.save()
self._trigger_refresh()
def _enforce_graceful_stop(self, worker: RunningWorker) -> None:
with database.connection_context():
execution = JobExecution.get_by_id(worker.execution_id)
if execution.stop_requested_at is None:
return
elapsed = utc_now() - _coerce_datetime(execution.stop_requested_at)
if (
elapsed >= timedelta(seconds=self.graceful_stop_seconds)
and worker.process.poll() is None
):
worker.process.kill()
def _trigger_refresh(self) -> None:
if self.refresh_callback is not None:
self.refresh_callback()
def _reconcile_stale_executions(self) -> None:
with database.connection_context():
stale_executions = tuple(
JobExecution.select(JobExecution, Job)
.join(Job)
.where(JobExecution.running_status == JobExecutionStatus.RUNNING)
)
for execution in stale_executions:
job = cast(Job, execution.job)
execution_id = _execution_id(execution)
artifacts = JobArtifacts.for_execution(
log_dir=self.log_dir,
job_id=_job_id(job),
execution_id=execution_id,
)
artifacts.log_path.parent.mkdir(parents=True, exist_ok=True)
# Decide the final status first so the log line matches what is stored.
final_status = (
JobExecutionStatus.CANCELED
if execution.stop_requested_at is not None
else JobExecutionStatus.FAILED
)
with artifacts.log_path.open("a", encoding="utf-8") as log_handle:
log_handle.write(
f"scheduler: execution marked {final_status.name.lower()} after app restart\n"
)
execution.ended_at = utc_now()
execution.running_status = final_status
execution.save()
if stale_executions:
self._trigger_refresh()
def load_runs_view(
*, log_dir: str | Path, now: datetime | None = None
) -> dict[str, tuple[dict[str, object], ...]]:
reference_time = now or datetime.now(UTC)
resolved_log_dir = Path(log_dir)
with database.connection_context():
jobs = tuple(Job.select(Job, Source).join(Source).order_by(Source.name.asc()))
running_executions = tuple(
JobExecution.select(JobExecution, Job, Source)
.join(Job)
.join(Source)
.where(JobExecution.running_status == JobExecutionStatus.RUNNING)
.order_by(JobExecution.started_at.desc())
)
completed_executions = tuple(
JobExecution.select(JobExecution, Job, Source)
.join(Job)
.join(Source)
.where(
JobExecution.running_status.in_(
(
JobExecutionStatus.SUCCEEDED,
JobExecutionStatus.FAILED,
JobExecutionStatus.CANCELED,
)
)
)
.order_by(JobExecution.ended_at.desc())
.limit(20)
)
running_by_job = {
_job_id(execution.job): execution for execution in running_executions
}
return {
"running": tuple(
_project_running_execution(execution, resolved_log_dir, reference_time)
for execution in running_executions
),
"upcoming": tuple(
_project_upcoming_job(job, running_by_job.get(job.id), reference_time)
for job in jobs
),
"completed": tuple(
_project_completed_execution(execution, resolved_log_dir, reference_time)
for execution in completed_executions
),
}
def load_dashboard_view(
*, log_dir: str | Path, now: datetime | None = None
) -> dict[str, object]:
reference_time = now or datetime.now(UTC)
runs_view = load_runs_view(log_dir=log_dir, now=reference_time)
output_dir = Path(log_dir).parent
with database.connection_context():
sources = tuple(Source.select().order_by(Source.name.asc()))
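failed_last_day = (
JobExecution.select()
.where(
(JobExecution.running_status == JobExecutionStatus.FAILED)
& (JobExecution.ended_at.is_null(False))
# Count only failures that ended within the past 24 hours.
& (JobExecution.ended_at >= reference_time - timedelta(days=1))
)
.count()
)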
upcoming_ready = sum(
1 for job in runs_view["upcoming"] if str(job["run_reason"]) == "Ready"
)
footprint_bytes = _directory_size(output_dir)
return {
"running": runs_view["running"],
"source_feeds": tuple(
_project_source_feed(source, output_dir, reference_time)
for source in sources
),
"snapshot": {
"running_now": str(len(runs_view["running"])),
"upcoming_today": str(upcoming_ready),
"failures_24h": str(failed_last_day),
"artifact_footprint": _format_bytes(footprint_bytes),
},
}
def load_execution_log_view(
*, log_dir: str | Path, job_id: int, execution_id: int
) -> ExecutionLogView:
with database.connection_context():
execution = JobExecution.get_or_none(id=execution_id)
route = f"/job/{job_id}/execution/{execution_id}/logs"
if execution is None or _job_id(cast(Job, execution.job)) != job_id:
return ExecutionLogView(
job_id=job_id,
execution_id=execution_id,
title=f"Job {job_id} / execution {execution_id}",
description="Plain text log view routed through the app.",
status_label="Unavailable",
status_tone="failed",
log_text="",
error_message="Execution does not exist.",
)
artifacts = JobArtifacts.for_execution(
log_dir=Path(log_dir),
job_id=job_id,
execution_id=execution_id,
)
if not artifacts.log_path.exists():
return ExecutionLogView(
job_id=job_id,
execution_id=execution_id,
title=f"Job {job_id} / execution {execution_id}",
description="Plain text log view routed through the app.",
status_label=_execution_status_label(execution),
status_tone=_execution_status_tone(execution),
log_text="",
error_message="Log file has not been created yet.",
)
return ExecutionLogView(
job_id=job_id,
execution_id=execution_id,
title=f"Job {job_id} / execution {execution_id}",
description=f"Route: {route}",
status_label=_execution_status_label(execution),
status_tone=_execution_status_tone(execution),
log_text=artifacts.log_path.read_text(encoding="utf-8"),
)
def _job_trigger(job: Job) -> CronTrigger:
expression = " ".join(
(
str(job.cron_minute),
str(job.cron_hour),
str(job.cron_day_of_month),
str(job.cron_month),
str(job.cron_day_of_week),
)
)
return CronTrigger.from_crontab(expression, timezone=UTC)
def _scheduler_job_id(job_id: int) -> str:
return f"{SCHEDULER_JOB_PREFIX}{job_id}"
def _project_running_execution(
execution: JobExecution, log_dir: Path, reference_time: datetime
) -> dict[str, object]:
job = cast(Job, execution.job)
job_id = _job_id(job)
execution_id = _execution_id(execution)
artifacts = JobArtifacts.for_execution(
log_dir=log_dir, job_id=job_id, execution_id=execution_id
)
started_at = _coerce_datetime(
cast(datetime | str, execution.started_at or execution.created_at)
)
runtime = reference_time - started_at
return {
"source": job.source.name,
"slug": job.source.slug,
"job_id": job_id,
"execution_id": execution_id,
"started_at": started_at.strftime("%Y-%m-%d %H:%M UTC"),
"runtime": f"running for {int(runtime.total_seconds())}s",
"status": "Stopping" if execution.stop_requested_at else "Running",
"stats": _stats_summary(execution),
"worker": (
"graceful stop requested"
if execution.stop_requested_at
else "streaming stats from worker jsonl"
),
"log_href": f"/job/{job_id}/execution/{execution_id}/logs",
"log_exists": artifacts.log_path.exists(),
"cancel_post_path": f"/actions/executions/{execution_id}/cancel",
}
def _project_upcoming_job(
job: Job, running_execution: JobExecution | None, reference_time: datetime
) -> dict[str, object]:
job_id = _job_id(job)
trigger = _job_trigger(job)
next_run = (
trigger.get_next_fire_time(None, reference_time)
if job.enabled and running_execution is None
else None
)
return {
"source": job.source.name,
"slug": job.source.slug,
"job_id": job_id,
"next_run": (
_humanize_relative_time(reference_time, next_run)
if next_run is not None
else ("Running now" if running_execution is not None else "Not scheduled")
),
"next_run_at": next_run.isoformat() if next_run is not None else None,
"schedule": " ".join(
(
str(job.cron_minute),
str(job.cron_hour),
str(job.cron_day_of_month),
str(job.cron_month),
str(job.cron_day_of_week),
)
),
"enabled_label": "Enabled" if job.enabled else "Disabled",
"enabled_tone": "scheduled" if job.enabled else "idle",
"run_disabled": running_execution is not None,
"run_reason": "Already running" if running_execution is not None else "Ready",
"toggle_label": "Disable" if job.enabled else "Enable",
"toggle_enabled": not job.enabled,
"run_post_path": f"/actions/jobs/{job_id}/run-now",
"toggle_post_path": f"/actions/jobs/{job_id}/toggle-enabled",
"delete_post_path": f"/actions/jobs/{job_id}/delete",
}
def _project_completed_execution(
execution: JobExecution, log_dir: Path, reference_time: datetime
) -> dict[str, object]:
job = cast(Job, execution.job)
job_id = _job_id(job)
execution_id = _execution_id(execution)
artifacts = JobArtifacts.for_execution(
log_dir=log_dir, job_id=job_id, execution_id=execution_id
)
ended_at = (
_coerce_datetime(cast(datetime | str, execution.ended_at))
if execution.ended_at is not None
else None
)
return {
"source": job.source.name,
"slug": job.source.slug,
"job_id": job_id,
"execution_id": execution_id,
"ended_at": (
_humanize_relative_time(reference_time, ended_at)
if ended_at is not None
else "Pending"
),
"ended_at_iso": ended_at.isoformat() if ended_at is not None else None,
"status": _execution_status_label(execution),
"status_tone": _execution_status_tone(execution),
"stats": _stats_summary(execution),
"summary": (
"Canceled by operator"
if execution.running_status == JobExecutionStatus.CANCELED
else (
"Worker exited successfully"
if execution.running_status == JobExecutionStatus.SUCCEEDED
else "Worker exited with failure"
)
),
"log_href": f"/job/{job_id}/execution/{execution_id}/logs",
"log_exists": artifacts.log_path.exists(),
}
def _project_source_feed(
source: Source, output_dir: Path, reference_time: datetime
) -> dict[str, object]:
source_slug = str(source.slug)
source_dir = feed_output_dir(out_dir=output_dir, feed_slug=source_slug)
feed_path = feed_output_path(out_dir=output_dir, feed_slug=source_slug)
feed_exists = feed_path.exists()
updated_at = (
datetime.fromtimestamp(feed_path.stat().st_mtime, tz=UTC)
if feed_exists
else None
)
return {
"source": source.name,
"slug": source_slug,
"feed_href": f"/feeds/{source_slug}/feed.rss",
"feed_status_label": "Available" if feed_exists else "Missing",
"feed_status_tone": "done" if feed_exists else "failed",
"feed_exists": feed_exists,
"last_updated": (
_humanize_relative_time(reference_time, updated_at)
if updated_at is not None
else "Never published"
),
"last_updated_iso": updated_at.isoformat() if updated_at is not None else None,
"artifact_footprint": _format_bytes(_directory_size(source_dir)),
}
def _execution_status_label(execution: JobExecution) -> str:
status = JobExecutionStatus(execution.running_status)
return {
JobExecutionStatus.PENDING: "Pending",
JobExecutionStatus.RUNNING: (
"Stopping" if execution.stop_requested_at else "Running"
),
JobExecutionStatus.SUCCEEDED: "Succeeded",
JobExecutionStatus.FAILED: "Failed",
JobExecutionStatus.CANCELED: "Canceled",
}[status]
def _execution_status_tone(execution: JobExecution) -> str:
status = JobExecutionStatus(execution.running_status)
return {
JobExecutionStatus.PENDING: "idle",
JobExecutionStatus.RUNNING: "running",
JobExecutionStatus.SUCCEEDED: "done",
JobExecutionStatus.FAILED: "failed",
JobExecutionStatus.CANCELED: "idle",
}[status]
def _stats_summary(execution: JobExecution) -> str:
bytes_count = cast(int, execution.bytes_count)
return (
f"{execution.requests_count} requests, "
f"{execution.items_count} items, "
f"{_format_summary_bytes(bytes_count)}"
)
def _final_status(*, execution: JobExecution, returncode: int) -> JobExecutionStatus:
if execution.stop_requested_at is not None:
return JobExecutionStatus.CANCELED
if returncode == 0:
return JobExecutionStatus.SUCCEEDED
return JobExecutionStatus.FAILED
def _coerce_datetime(value: datetime | str) -> datetime:
if isinstance(value, datetime):
if value.tzinfo is None:
return value.replace(tzinfo=UTC)
return value.astimezone(UTC)
parsed = datetime.fromisoformat(value)
if parsed.tzinfo is None:
return parsed.replace(tzinfo=UTC)
return parsed.astimezone(UTC)
def _job_id(job: Job) -> int:
return int(job.get_id())
def _execution_id(execution: JobExecution) -> int:
return int(execution.get_id())
def _directory_size(path: Path) -> int:
if not path.exists():
return 0
return sum(entry.stat().st_size for entry in path.rglob("*") if entry.is_file())
def _format_bytes(value: int) -> str:
if value < 1024:
return f"{value} B"
if value < 1024 * 1024:
return f"{value / 1024:.1f} KB"
if value < 1024 * 1024 * 1024:
return f"{value / (1024 * 1024):.1f} MB"
return f"{value / (1024 * 1024 * 1024):.1f} GB"
def _format_summary_bytes(value: int) -> str:
if value == 1:
return "1 byte"
if value < 1024:
return f"{value} bytes"
if value < 1024 * 1024:
return f"{value / 1024:.1f} KiB"
if value < 1024 * 1024 * 1024:
return f"{value / (1024 * 1024):.1f} MiB"
return f"{value / (1024 * 1024 * 1024):.1f} GiB"
def _humanize_relative_time(reference_time: datetime, target_time: datetime) -> str:
delta_seconds = int(round((target_time - reference_time).total_seconds()))
if delta_seconds == 0:
return "now"
absolute_delta_seconds = abs(delta_seconds)
units = (
("day", 24 * 60 * 60),
("hour", 60 * 60),
("minute", 60),
)
for label, size in units:
if absolute_delta_seconds >= size:
count = max(1, round(absolute_delta_seconds / size))
suffix = "" if count == 1 else "s"
if delta_seconds > 0:
return f"in {count} {label}{suffix}"
return f"{count} {label}{suffix} ago"
second_suffix = "" if absolute_delta_seconds == 1 else "s"
if delta_seconds > 0:
return f"in {absolute_delta_seconds} second{second_suffix}"
return f"{absolute_delta_seconds} second{second_suffix} ago"
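
Review note: `_format_bytes` and `_format_summary_bytes` above use the same 1024-based thresholds and differ only in their unit labels. The following is a minimal table-driven sketch of that shared logic, not the code in this change; `format_binary_size`, `BINARY_UNITS`, and the `units` parameter are hypothetical names introduced for illustration.

```python
from __future__ import annotations

# Hypothetical helper, not part of the diff: one loop over binary thresholds.
BINARY_UNITS = ("KiB", "MiB", "GiB")


def format_binary_size(value: int, units: tuple[str, ...] = BINARY_UNITS) -> str:
    """Format a byte count in 1024-based steps with one decimal place."""
    if value == 1:
        return "1 byte"
    if value < 1024:
        return f"{value} bytes"
    scaled = float(value)
    label = units[0]
    for label in units:
        scaled /= 1024
        if scaled < 1024:
            break
    # Values beyond the last unit stay in that unit, as in the original helper.
    return f"{scaled:.1f} {label}"
```

Passing `units=("KB", "MB", "GB")` would reproduce the labels used by `_format_bytes`, although that function also prints small values as `B` rather than `bytes`.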


@ -54,12 +54,25 @@ class VideoMeta(TypedDict):
 bit_rate: float
+def _decode_ffmpeg_output(output: Any) -> str:
+if isinstance(output, bytes):
+return output.decode("utf-8", errors="replace")
+return str(output)
+def _print_ffmpeg_error_output(error: ffmpeg.Error) -> None:
+if error.stderr:
+print(_decode_ffmpeg_output(error.stderr), file=sys.stderr)
+if error.stdout:
+print(_decode_ffmpeg_output(error.stdout))
 def probe_media(file_path) -> Dict[str, Any]:
 """Probes `file_path` using ffmpeg's ffprobe and returns the data."""
 try:
 return ffmpeg.probe(file_path)
 except ffmpeg.Error as e:
-print(e.stderr, file=sys.stderr)
+_print_ffmpeg_error_output(e)
 logger.error(f"Failed to probe io {file_path}")
 logger.error(e)
 raise RuntimeError(f"Failed to probe io {file_path}") from e
@ -217,7 +230,7 @@ def transcode_audio(input_file: str, output_dir: str, params: Dict[str, str]) ->
 **params,
 loglevel="quiet",
 )
-.run()
+.run(capture_stdout=True, capture_stderr=True)
 )
 before = os.path.getsize(input_file) / 1024
 after = os.path.getsize(output_file) / 1024
@ -229,8 +242,7 @@ def transcode_audio(input_file: str, output_dir: str, params: Dict[str, str]) ->
 )
 return output_file
 except ffmpeg.Error as e:
-print(e.stderr, file=sys.stderr)
-print(e.stdout)
+_print_ffmpeg_error_output(e)
 logger.error(e)
 raise RuntimeError(f"Failed to compress audio {input_file}") from e
@ -310,7 +322,7 @@ def transcode_video(input_file: str, output_dir: str, params: Dict[str, Any]) ->
 **params,
 # loglevel="quiet",
 )
-.run()
+.run(capture_stdout=True, capture_stderr=True)
 )
 else:
 passes = params["passes"]
@ -323,16 +335,18 @@ def transcode_video(input_file: str, output_dir: str, params: Dict[str, Any]) ->
 "-stats"
 )
 logger.info("Running pass #1")
-std_out, std_err = ffoutput.run(capture_stdout=True)
-print(std_out)
-print(std_err)
+ffoutput.run(capture_stdout=True, capture_stderr=True)
 logger.info("Running pass #2")
 ffoutput = ffinput.output(video, audio, output_file, **passes[1])
 ffoutput = ffoutput.global_args(
 # "-loglevel", "quiet",
 "-stats"
 )
-ffoutput.run(overwrite_output=True)
+ffoutput.run(
+capture_stdout=True,
+capture_stderr=True,
+overwrite_output=True,
+)
 before = os.path.getsize(input_file) / 1024
 after = os.path.getsize(output_file) / 1024
@ -344,7 +358,7 @@ def transcode_video(input_file: str, output_dir: str, params: Dict[str, Any]) ->
 )
 return output_file
 except ffmpeg.Error as e:
-print(e.stderr, file=sys.stderr)
+_print_ffmpeg_error_output(e)
 logger.error("Failed to transcode")
 logger.error(e)
 raise RuntimeError(f"Failed to transcode video: {e.stderr.decode()}") from e
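
Review note: these hunks make the ffmpeg calls capture stdout and stderr, so the error object carries raw bytes that have to be decoded before printing. The sketch below illustrates that decoding rule in isolation. `FakeFfmpegError` is a hypothetical stand-in that mimics only the `stdout` and `stderr` attributes, so the example runs without ffmpeg installed; `decode_output`, `print_error_output`, and `demo` are likewise illustrative names.

```python
import io
import sys
from contextlib import redirect_stderr, redirect_stdout
from typing import Any


class FakeFfmpegError(Exception):
    """Hypothetical stand-in exposing only `stdout` and `stderr`."""

    def __init__(self, stdout: Any, stderr: Any) -> None:
        super().__init__("ffmpeg error")
        self.stdout = stdout
        self.stderr = stderr


def decode_output(output: Any) -> str:
    # Same rule as `_decode_ffmpeg_output`: bytes are decoded leniently,
    # anything else goes through str().
    if isinstance(output, bytes):
        return output.decode("utf-8", errors="replace")
    return str(output)


def print_error_output(error: FakeFfmpegError) -> None:
    # Empty or missing streams are skipped entirely.
    if error.stderr:
        print(decode_output(error.stderr), file=sys.stderr)
    if error.stdout:
        print(decode_output(error.stdout))


def demo() -> tuple[str, str]:
    """Return what was written to (stdout, stderr) for a sample error."""
    out, err = io.StringIO(), io.StringIO()
    with redirect_stdout(out), redirect_stderr(err):
        print_error_output(FakeFfmpegError(stdout=None, stderr=b"bad \xff input"))
    return out.getvalue(), err.getvalue()
```

Decoding with `errors="replace"` means an invalid byte becomes U+FFFD instead of raising a second exception inside the error handler.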

446
repub/model.py Normal file

@ -0,0 +1,446 @@
from __future__ import annotations
import os
from datetime import UTC, datetime
from enum import IntEnum
from importlib import resources
from importlib.resources.abc import Traversable
from pathlib import Path
from peewee import (
BooleanField,
Check,
DateTimeField,
ForeignKeyField,
IntegerField,
Model,
SqliteDatabase,
TextField,
)
DEFAULT_DB_PATH = Path("republisher.db")
DATABASE_PRAGMAS = {
"busy_timeout": 5000,
"cache_size": 15625,
"foreign_keys": 1,
"journal_mode": "wal",
"page_size": 4096,
"synchronous": "normal",
"temp_store": "memory",
}
SCHEMA_GLOB = "*.sql"
database = SqliteDatabase(None, pragmas=DATABASE_PRAGMAS)
class JobExecutionStatus(IntEnum):
PENDING = 0
RUNNING = 1
SUCCEEDED = 2
FAILED = 3
CANCELED = 4
def utc_now() -> datetime:
return datetime.now(UTC)
def resolve_database_path(db_path: str | Path | None = None) -> Path:
raw_value = (
os.environ.get("REPUBLISHER_DB_PATH", DEFAULT_DB_PATH)
if db_path is None
else db_path
)
raw_path = Path(raw_value)
return raw_path.expanduser().resolve()
def schema_paths() -> tuple[Traversable, ...]:
schema_dir = resources.files("repub").joinpath("sql")
return tuple(
sorted(
(path for path in schema_dir.iterdir() if path.name.endswith(".sql")),
key=lambda path: path.name,
)
)
def initialize_database(db_path: str | Path | None = None) -> Path:
resolved_path = resolve_database_path(db_path)
resolved_path.parent.mkdir(parents=True, exist_ok=True)
if not database.is_closed():
database.close()
database.init(str(resolved_path), pragmas=DATABASE_PRAGMAS)
database.connect(reuse_if_open=True)
try:
connection = database.connection()
for path in schema_paths():
connection.executescript(path.read_text(encoding="utf-8"))
finally:
database.close()
return resolved_path
def source_slug_exists(slug: str) -> bool:
with database.connection_context():
return Source.select().where(Source.slug == slug).exists()
def load_source_form(slug: str) -> dict[str, object] | None:
with database.connection_context():
source = Source.get_or_none(Source.slug == slug)
if source is None:
return None
job = Job.get(Job.source == source)
form_data: dict[str, object] = {
"name": source.name,
"slug": source.slug,
"source_type": source.source_type,
"notes": source.notes,
"spider_arguments": job.spider_arguments,
"enabled": job.enabled,
"cron_minute": job.cron_minute,
"cron_hour": job.cron_hour,
"cron_day_of_month": job.cron_day_of_month,
"cron_day_of_week": job.cron_day_of_week,
"cron_month": job.cron_month,
"feed_url": "",
"pangea_domain": "",
"pangea_category": "",
"content_format": "MOBILE_3",
"content_type": "articles",
"max_articles": "10",
"oldest_article": "3",
"only_newest": True,
"include_authors": True,
"exclude_media": False,
"include_content": True,
}
if source.source_type == "feed":
feed = SourceFeed.get(SourceFeed.source == source)
form_data["feed_url"] = feed.feed_url
else:
pangea = SourcePangea.get(SourcePangea.source == source)
form_data.update(
{
"pangea_domain": pangea.domain,
"pangea_category": pangea.category_name,
"content_format": pangea.content_format,
"content_type": pangea.content_type,
"max_articles": str(pangea.max_articles),
"oldest_article": str(pangea.oldest_article),
"only_newest": pangea.only_newest,
"include_authors": pangea.include_authors,
"exclude_media": pangea.exclude_media,
"include_content": pangea.include_content,
}
)
return form_data
def create_source(
*,
name: str,
slug: str,
source_type: str,
notes: str,
spider_arguments: str,
enabled: bool,
cron_minute: str,
cron_hour: str,
cron_day_of_month: str,
cron_day_of_week: str,
cron_month: str,
feed_url: str = "",
pangea_domain: str = "",
pangea_category: str = "",
content_type: str = "",
only_newest: bool = True,
max_articles: int | None = None,
oldest_article: int | None = None,
include_authors: bool = True,
exclude_media: bool = False,
include_content: bool = True,
content_format: str = "",
) -> Source:
with database.connection_context():
with database.atomic():
source = Source.create(
name=name,
slug=slug,
source_type=source_type,
notes=notes,
)
if source_type == "feed":
SourceFeed.create(
source=source,
feed_url=feed_url,
)
else:
SourcePangea.create(
source=source,
domain=pangea_domain,
category_name=pangea_category,
content_type=content_type,
only_newest=only_newest,
max_articles=max_articles,
oldest_article=oldest_article,
include_authors=include_authors,
exclude_media=exclude_media,
include_content=include_content,
content_format=content_format,
)
Job.create(
source=source,
enabled=enabled,
spider_arguments=spider_arguments,
cron_minute=cron_minute,
cron_hour=cron_hour,
cron_day_of_month=cron_day_of_month,
cron_day_of_week=cron_day_of_week,
cron_month=cron_month,
)
return source
def update_source(
source_slug: str,
*,
name: str,
slug: str,
source_type: str,
notes: str,
spider_arguments: str,
enabled: bool,
cron_minute: str,
cron_hour: str,
cron_day_of_month: str,
cron_day_of_week: str,
cron_month: str,
feed_url: str = "",
pangea_domain: str = "",
pangea_category: str = "",
content_type: str = "",
only_newest: bool = True,
max_articles: int | None = None,
oldest_article: int | None = None,
include_authors: bool = True,
exclude_media: bool = False,
include_content: bool = True,
content_format: str = "",
) -> Source | None:
with database.connection_context():
with database.atomic():
source = Source.get_or_none(Source.slug == source_slug)
if source is None:
return None
source.name = name
source.notes = notes
source.source_type = source_type
source.save()
job = Job.get(Job.source == source)
job.enabled = enabled
job.spider_arguments = spider_arguments
job.cron_minute = cron_minute
job.cron_hour = cron_hour
job.cron_day_of_month = cron_day_of_month
job.cron_day_of_week = cron_day_of_week
job.cron_month = cron_month
job.save()
if source_type == "feed":
SourcePangea.delete().where(SourcePangea.source == source).execute()
feed = SourceFeed.get_or_none(SourceFeed.source == source)
if feed is None:
SourceFeed.create(source=source, feed_url=feed_url)
else:
feed.feed_url = feed_url
feed.save()
else:
SourceFeed.delete().where(SourceFeed.source == source).execute()
pangea = SourcePangea.get_or_none(SourcePangea.source == source)
if pangea is None:
SourcePangea.create(
source=source,
domain=pangea_domain,
category_name=pangea_category,
content_type=content_type,
only_newest=only_newest,
max_articles=max_articles,
oldest_article=oldest_article,
include_authors=include_authors,
exclude_media=exclude_media,
include_content=include_content,
content_format=content_format,
)
else:
pangea.domain = pangea_domain
pangea.category_name = pangea_category
pangea.content_type = content_type
pangea.only_newest = only_newest
pangea.max_articles = max_articles
pangea.oldest_article = oldest_article
pangea.include_authors = include_authors
pangea.exclude_media = exclude_media
pangea.include_content = include_content
pangea.content_format = content_format
pangea.save()
return source
def delete_job_source(job_id: int) -> bool:
with database.connection_context():
with database.atomic():
job = Job.get_or_none(id=job_id)
if job is None:
return False
source = Source.get_by_id(job.source_id)
return source.delete_instance() > 0
def load_sources() -> tuple[dict[str, object], ...]:
with database.connection_context():
sources = tuple(Source.select().order_by(Source.created_at.desc()))
source_ids = tuple(int(source.get_id()) for source in sources)
if not source_ids:
return ()
jobs = {
job.source_id: job for job in Job.select().where(Job.source.in_(source_ids))
}
feed_configs = {
config.source_id: config
for config in SourceFeed.select().where(SourceFeed.source.in_(source_ids))
}
pangea_configs = {
config.source_id: config
for config in SourcePangea.select().where(
SourcePangea.source.in_(source_ids)
)
}
return tuple(
_project_source(source, jobs, feed_configs, pangea_configs)
for source in sources
)
def _project_source(
source: "Source",
jobs: dict[int, "Job"],
feed_configs: dict[int, "SourceFeed"],
pangea_configs: dict[int, "SourcePangea"],
) -> dict[str, object]:
source_id = int(source.get_id())
job = jobs[source_id]
if source.source_type == "feed":
upstream = feed_configs[source_id].feed_url
source_type = "Feed"
else:
pangea = pangea_configs[source_id]
upstream = f"{pangea.domain} / {pangea.category_name}"
source_type = "Pangea"
return {
"name": source.name,
"slug": source.slug,
"source_type": source_type,
"upstream": upstream,
"schedule": (
f"cron: {job.cron_minute} {job.cron_hour} {job.cron_day_of_month} "
f"{job.cron_month} {job.cron_day_of_week}"
),
"last_run": "Never run",
"state": "Enabled" if job.enabled else "Disabled",
"state_tone": "scheduled" if job.enabled else "idle",
}
class BaseModel(Model):
class Meta:
database = database
class Source(BaseModel):
created_at = DateTimeField(default=utc_now)
updated_at = DateTimeField(default=utc_now)
name = TextField()
slug = TextField(unique=True)
source_type = TextField(constraints=[Check("source_type IN ('feed', 'pangea')")])
notes = TextField(default="")
class Meta:
table_name = "source"
class SourceFeed(BaseModel):
source = ForeignKeyField(Source, primary_key=True, backref="feed_config")
feed_url = TextField()
etag = TextField(null=True)
last_modified = TextField(null=True)
class Meta:
table_name = "source_feed"
class SourcePangea(BaseModel):
source = ForeignKeyField(Source, primary_key=True, backref="pangea_config")
domain = TextField()
category_name = TextField()
content_type = TextField()
only_newest = BooleanField()
max_articles = IntegerField()
oldest_article = IntegerField()
include_authors = BooleanField()
exclude_media = BooleanField()
include_content = BooleanField()
content_format = TextField()
class Meta:
table_name = "source_pangea"
class Job(BaseModel):
source = ForeignKeyField(Source, unique=True, backref="job")
created_at = DateTimeField(default=utc_now)
updated_at = DateTimeField(default=utc_now)
enabled = BooleanField()
spider_arguments = TextField(default="")
cron_minute = TextField()
cron_hour = TextField()
cron_day_of_month = TextField()
cron_day_of_week = TextField()
cron_month = TextField()
class Meta:
table_name = "job"
class JobExecution(BaseModel):
job = ForeignKeyField(Job, backref="executions")
created_at = DateTimeField(default=utc_now)
started_at = DateTimeField(null=True)
ended_at = DateTimeField(null=True)
stop_requested_at = DateTimeField(null=True)
running_status = IntegerField(
default=JobExecutionStatus.PENDING,
constraints=[Check("running_status BETWEEN 0 AND 4")],
)
requests_count = IntegerField(default=0)
items_count = IntegerField(default=0)
warnings_count = IntegerField(default=0)
errors_count = IntegerField(default=0)
bytes_count = IntegerField(default=0)
retries_count = IntegerField(default=0)
exceptions_count = IntegerField(default=0)
cache_size_count = IntegerField(default=0)
cache_object_count = IntegerField(default=0)
raw_stats = TextField(default="{}")
class Meta:
table_name = "job_execution"
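
Review note: `DATABASE_PRAGMAS` enables write-ahead logging, which is why the `.gitignore` hunk also ignores `*.db-shm` and `*.db-wal`. The sketch below applies a subset of those pragmas with the standard-library `sqlite3` module instead of peewee and reads them back. `open_with_pragmas`, `read_pragma`, and `demo` are hypothetical helpers, and the temporary database path is an assumption made only for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

# Subset of DATABASE_PRAGMAS from repub/model.py.
PRAGMAS = {
    "busy_timeout": 5000,
    "foreign_keys": 1,
    "journal_mode": "wal",
    "synchronous": "normal",
}


def open_with_pragmas(path: Path) -> sqlite3.Connection:
    connection = sqlite3.connect(path)
    for name, value in PRAGMAS.items():
        # Names and values come from the constant above, never from user input.
        connection.execute(f"PRAGMA {name} = {value}")
    return connection


def read_pragma(connection: sqlite3.Connection, name: str):
    return connection.execute(f"PRAGMA {name}").fetchone()[0]


def demo() -> dict:
    # WAL needs a file-backed database; an in-memory one keeps its own mode.
    with tempfile.TemporaryDirectory() as directory:
        connection = open_with_pragmas(Path(directory) / "demo.db")
        try:
            return {name: read_pragma(connection, name) for name in PRAGMAS}
        finally:
            connection.close()
```

SQLite reports `synchronous` as an integer when read back, so `normal` comes back as `1`.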

15
repub/pages/__init__.py Normal file

@ -0,0 +1,15 @@
from repub.pages.dashboard import dashboard_page, dashboard_page_with_data
from repub.pages.runs import execution_logs_page, runs_page
from repub.pages.shim import shim_page
from repub.pages.sources import create_source_page, edit_source_page, sources_page
__all__ = [
"create_source_page",
"dashboard_page",
"dashboard_page_with_data",
"edit_source_page",
"execution_logs_page",
"runs_page",
"shim_page",
"sources_page",
]

267
repub/pages/dashboard.py Normal file

@ -0,0 +1,267 @@
from __future__ import annotations
from collections.abc import Mapping
import htpy as h
from htpy import Node, Renderable
from repub.components import (
admin_sidebar,
header_action_link,
inline_button,
inline_link,
muted_action_link,
stat_card,
status_badge,
table_section,
)
def _text(values: Mapping[str, object], key: str) -> str:
return str(values[key])
def _running_execution_row(execution: Mapping[str, object]) -> tuple[Node, ...]:
status_tone = "running" if _text(execution, "status") != "Succeeded" else "done"
return (
h.div[
h.div(class_="font-semibold text-slate-950")[_text(execution, "source")],
h.p(class_="mt-0.5 font-mono text-[11px] text-slate-500")[
_text(execution, "slug")
],
],
h.div[
h.p(class_="font-medium text-slate-900")[
f"#{_text(execution, 'execution_id')}"
],
h.p(class_="mt-0.5 text-[11px] text-slate-500")[
f"job {_text(execution, 'job_id')}"
],
],
h.div[
h.p(class_="font-medium text-slate-900")[_text(execution, "started_at")],
h.p(class_="mt-0.5 text-[11px] text-slate-500")[
_text(execution, "runtime")
],
],
status_badge(label=_text(execution, "status"), tone=status_tone),
h.div(class_="min-w-56 whitespace-normal")[
h.p(class_="font-medium text-slate-900")[_text(execution, "stats")],
h.p(class_="mt-0.5 text-[11px] text-slate-500")[_text(execution, "worker")],
],
h.div(class_="flex flex-nowrap items-center gap-3")[
inline_link(
href=_text(execution, "log_href"),
label="View log",
tone="amber",
),
inline_button(label="Stop", tone="danger"),
],
)
def dashboard_header() -> Renderable:
return h.section[
h.div(
class_="flex flex-col gap-4 sm:flex-row sm:items-start sm:justify-between"
)[
h.div[
h.h1(class_="text-3xl font-semibold tracking-tight text-slate-950")[
"Republisher"
],
],
h.div(class_="flex flex-wrap gap-2")[
header_action_link(href="/sources/create", label="Create source"),
muted_action_link(href="/sources", label="View sources"),
],
]
]
def operational_snapshot(*, snapshot: Mapping[str, str] | None = None) -> Renderable:
values = snapshot or {
"running_now": "0",
"upcoming_today": "0",
"failures_24h": "0",
"artifact_footprint": "0 B",
}
return h.section[
h.div(class_="mb-3 flex items-end justify-between gap-4")[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-slate-500"
)["Overview"],
h.h2(class_="mt-1 text-xl font-semibold tracking-tight text-slate-950")[
"Operational snapshot"
],
],
],
h.dl(class_="grid gap-3 md:grid-cols-2 xl:grid-cols-4")[
stat_card(
label="Running now",
value=values["running_now"],
detail="Currently active job executions.",
),
stat_card(
label="Upcoming today",
value=values["upcoming_today"],
detail="Enabled jobs that are ready for their next run.",
),
stat_card(
label="Failures in 24h",
value=values["failures_24h"],
detail="Recent failed executions recorded by the scheduler.",
),
stat_card(
label="Artifact footprint",
value=values["artifact_footprint"],
detail="Current artifact size under the output path.",
),
],
]
def running_executions_table(
*, running_executions: tuple[Mapping[str, object], ...] | None = None
) -> Renderable:
rows = tuple(
_running_execution_row(execution) for execution in (running_executions or ())
)
headers = ("Source", "Execution", "Started", "Status", "Stats", "Actions")
def render_row(row: tuple[Node, ...]) -> Renderable:
first_cell, *other_cells = row
return h.tr(class_="align-top")[
h.td(class_="py-3 pr-6 pl-4 text-sm font-medium text-slate-950 sm:pl-4")[
first_cell
],
(
h.td(
class_="px-3 py-3 align-top text-sm whitespace-nowrap text-slate-600"
)[cell]
for cell in other_cells
),
]
body_rows: Node
if rows:
body_rows = (render_row(row) for row in rows)
else:
body_rows = h.tr[
h.td(
colspan=str(len(headers)),
class_="px-4 py-8 text-center text-sm text-slate-500",
)["No job executions are running."]
]
return h.section[
h.div(class_="mb-3 flex items-end justify-between gap-4")[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Live work"],
h.h2(class_="mt-1 text-xl font-semibold text-slate-950")[
"Running executions"
],
],
muted_action_link(href="/runs", label="Open runs"),
],
h.div(
class_="overflow-hidden rounded-2xl bg-white shadow-sm ring-1 ring-slate-200"
)[
h.div(class_="overflow-x-auto")[
h.table(
class_="w-full min-w-[70rem] divide-y divide-slate-200 table-auto"
)[
h.thead(class_="bg-stone-50")[
h.tr[
(
h.th(
scope="col",
class_="px-3 py-2.5 text-left text-[11px] font-semibold uppercase tracking-[0.18em] whitespace-nowrap text-slate-500 first:pl-4",
)[header]
for header in headers
)
]
],
h.tbody(class_="divide-y divide-slate-200 bg-white")[body_rows],
]
]
],
]
def _source_feed_row(source_feed: Mapping[str, object]) -> tuple[Node, ...]:
last_updated_iso = source_feed.get("last_updated_iso")
last_updated = (
h.time(
datetime=str(last_updated_iso),
title=str(last_updated_iso),
class_="font-medium text-slate-900",
)[str(source_feed["last_updated"])]
if last_updated_iso is not None
else h.p(class_="font-medium text-slate-900")[str(source_feed["last_updated"])]
)
return (
h.div[
h.div(class_="font-semibold text-slate-950")[str(source_feed["source"])],
h.p(class_="mt-0.5 font-mono text-[11px] text-slate-500")[
str(source_feed["slug"])
],
],
h.div(class_="min-w-64")[
inline_link(
href=str(source_feed["feed_href"]),
label=str(source_feed["feed_href"]),
tone="amber",
)
],
status_badge(
label=str(source_feed["feed_status_label"]),
tone=str(source_feed["feed_status_tone"]),
),
last_updated,
h.p(class_="font-medium text-slate-900")[
str(source_feed["artifact_footprint"])
],
)
def published_feeds_table(
*, source_feeds: tuple[Mapping[str, object], ...] | None = None
) -> Renderable:
rows = tuple(_source_feed_row(source_feed) for source_feed in (source_feeds or ()))
return table_section(
eyebrow="Published feeds",
title="Published feeds",
empty_message="No feeds have been published yet.",
headers=("Source", "Feed URL", "Status", "Last updated", "Disk usage"),
rows=rows,
actions=muted_action_link(href="/sources", label="Manage sources"),
)
def dashboard_page() -> Renderable:
return dashboard_page_with_data()
def dashboard_page_with_data(
*,
snapshot: Mapping[str, str] | None = None,
running_executions: tuple[Mapping[str, object], ...] | None = None,
source_feeds: tuple[Mapping[str, object], ...] | None = None,
) -> Renderable:
return h.main(
id="morph",
class_="min-h-screen lg:grid lg:grid-cols-[18rem_minmax(0,1fr)]",
)[
admin_sidebar(current_path="/"),
h.div(class_="px-4 py-4 sm:px-5 lg:px-6 lg:py-5")[
h.div(class_="mx-auto max-w-7xl space-y-5")[
dashboard_header(),
operational_snapshot(snapshot=snapshot),
running_executions_table(running_executions=running_executions),
published_feeds_table(source_feeds=source_feeds),
]
],
]
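
Review note: `running_executions_table` renders either one row per execution or a single placeholder row that spans every column. The sketch below shows that branch with plain strings rather than htpy, so it is a simplified stand-in and not the markup this page produces; `render_body` is a hypothetical name.

```python
from html import escape

HEADERS = ("Source", "Execution", "Started", "Status", "Stats", "Actions")


def render_body(rows: tuple, empty_message: str) -> str:
    """Render table rows, or one full-width placeholder row when empty."""
    if not rows:
        # The placeholder spans all columns so the table keeps its shape.
        return f'<tr><td colspan="{len(HEADERS)}">{escape(empty_message)}</td></tr>'
    return "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
```

Deriving `colspan` from the header tuple, as the page code does, keeps the placeholder correct if a column is added later.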

358
repub/pages/runs.py Normal file

@ -0,0 +1,358 @@
from __future__ import annotations
from collections.abc import Mapping
import htpy as h
from htpy import Node, Renderable
from repub.components import (
inline_link,
muted_action_link,
page_shell,
section_card,
status_badge,
table_section,
)
def _action_button(
*,
label: str,
tone: str = "default",
disabled: bool = False,
post_path: str | None = None,
) -> Renderable:
classes = {
"default": "bg-stone-100 text-slate-700 hover:bg-stone-200",
"danger": "bg-rose-50 text-rose-700 hover:bg-rose-100",
}
class_name = (
"cursor-not-allowed bg-slate-100 text-slate-400" if disabled else classes[tone]
)
attributes: dict[str, str] = {}
if post_path is not None and not disabled:
attributes["data-on:pointerdown"] = f"@post('{post_path}')"
return h.button(
attributes,
type="button",
disabled=disabled,
class_=(
"inline-flex items-center whitespace-nowrap rounded-full px-3 py-1.5 "
f"text-sm font-semibold transition {class_name}"
),
)[label]
def _text(values: Mapping[str, object], key: str) -> str:
return str(values[key])
def _maybe_text(values: Mapping[str, object], key: str) -> str | None:
value = values.get(key)
if value is None or value == "":
return None
return str(value)
def _flag(values: Mapping[str, object], key: str) -> bool:
return bool(values[key])
def _running_row(execution: Mapping[str, object]) -> tuple[Node, ...]:
return (
h.div[
h.div(class_="font-semibold text-slate-950")[_text(execution, "source")],
h.p(class_="mt-1 font-mono text-xs text-slate-500")[
_text(execution, "slug")
],
],
h.div[
h.p(class_="font-medium text-slate-900")[
f"#{_text(execution, 'execution_id')}"
],
],
h.div[
h.p(class_="font-medium text-slate-900")[_text(execution, "started_at")],
h.p(class_="mt-1 text-xs text-slate-500")[_text(execution, "runtime")],
],
status_badge(label=_text(execution, "status"), tone="running"),
h.div(class_="min-w-56 whitespace-normal")[
h.p(class_="font-medium text-slate-900")[_text(execution, "stats")],
h.p(class_="mt-1 text-xs text-slate-500")[_text(execution, "worker")],
],
h.div(class_="flex flex-nowrap items-center gap-3")[
inline_link(
href=_text(execution, "log_href"),
label="View log",
tone="amber",
),
_action_button(
label="Stop",
tone="danger",
post_path=_maybe_text(execution, "cancel_post_path"),
),
],
)
def _upcoming_row(job: Mapping[str, object]) -> tuple[Node, ...]:
next_run_at = _maybe_text(job, "next_run_at")
next_run_label: Node = h.p(class_="font-medium text-slate-900")[
_text(job, "next_run")
]
if next_run_at is not None:
next_run_label = h.time(
{
"data-next-run-at": next_run_at,
"title": next_run_at,
},
datetime=next_run_at,
class_="font-medium text-slate-900",
)[_text(job, "next_run")]
return (
h.div[
h.div(class_="font-semibold text-slate-950")[_text(job, "source")],
h.p(class_="mt-1 font-mono text-xs text-slate-500")[_text(job, "slug")],
],
h.div[next_run_label,],
h.p(class_="font-mono text-xs text-slate-600")[_text(job, "schedule")],
status_badge(
label=_text(job, "enabled_label"),
tone=_text(job, "enabled_tone"),
),
h.p(class_="max-w-40 whitespace-normal text-sm text-slate-500")[
_text(job, "run_reason")
],
h.div(class_="flex flex-nowrap items-center gap-2")[
_action_button(
label="Run now",
disabled=_flag(job, "run_disabled"),
post_path=_maybe_text(job, "run_post_path"),
),
_action_button(
label=_text(job, "toggle_label"),
post_path=_maybe_text(job, "toggle_post_path"),
),
_action_button(
label="Delete",
tone="danger",
post_path=_maybe_text(job, "delete_post_path"),
),
],
)
def _completed_row(execution: Mapping[str, object]) -> tuple[Node, ...]:
ended_at = _maybe_text(execution, "ended_at_iso")
ended_at_label: Node = h.p(class_="font-medium text-slate-900")[
_text(execution, "ended_at")
]
if ended_at is not None:
ended_at_label = h.time(
{
"data-ended-at": ended_at,
"title": ended_at,
},
datetime=ended_at,
class_="font-medium text-slate-900",
)[_text(execution, "ended_at")]
return (
h.div[
h.div(class_="font-semibold text-slate-950")[_text(execution, "source")],
h.p(class_="mt-1 font-mono text-xs text-slate-500")[
_text(execution, "slug")
],
],
h.div[
h.p(class_="font-medium text-slate-900")[
f"#{_text(execution, 'execution_id')}"
],
],
h.div[
ended_at_label,
h.p(class_="mt-1 text-xs text-slate-500")[_text(execution, "summary")],
],
status_badge(
label=_text(execution, "status"),
tone=_text(execution, "status_tone"),
),
h.div(class_="min-w-48 whitespace-normal")[
h.p(class_="font-medium text-slate-900")[_text(execution, "stats")]
],
inline_link(
href=_text(execution, "log_href"),
label="View log",
tone="amber",
),
)
def runs_page(
*,
running_executions: tuple[Mapping[str, object], ...] | None = None,
upcoming_jobs: tuple[Mapping[str, object], ...] | None = None,
completed_executions: tuple[Mapping[str, object], ...] | None = None,
) -> Renderable:
running_items = running_executions or ()
upcoming_items = upcoming_jobs or ()
completed_items = completed_executions or ()
running_rows = tuple(_running_row(execution) for execution in running_items)
upcoming_rows = tuple(_upcoming_row(job) for job in upcoming_items)
completed_rows = tuple(_completed_row(execution) for execution in completed_items)
return page_shell(
current_path="/runs",
eyebrow="Execution control",
title="Runs",
actions=muted_action_link(href="/sources", label="Back to sources"),
content=(
table_section(
eyebrow="Live work",
title="Running job executions",
empty_message="No job executions are running.",
headers=(
"Source",
"Execution",
"Started",
"Status",
"Stats",
"Actions",
),
rows=running_rows,
),
table_section(
eyebrow="Queue",
title="Upcoming jobs",
empty_message="No jobs are scheduled.",
headers=(
"Source",
"Next run",
"Cron",
"State",
"Run now",
"Actions",
),
rows=upcoming_rows,
),
table_section(
eyebrow="History",
title="Completed job executions",
empty_message="No job executions have completed yet.",
headers=(
"Source",
"Execution",
"Ended",
"Status",
"Summary",
"Log",
),
rows=completed_rows,
),
h.script[
"""
window.repubFormatNextRuns = window.repubFormatNextRuns || (() => {
const relativeFormatter = new Intl.RelativeTimeFormat(undefined, { numeric: 'auto' });
const absoluteFormatter = new Intl.DateTimeFormat(undefined, {
dateStyle: 'medium',
timeStyle: 'short',
timeZoneName: 'short',
});
const formatRelative = (targetDate) => {
const diffSeconds = Math.round((targetDate.getTime() - Date.now()) / 1000);
const units = [
['day', 86400],
['hour', 3600],
['minute', 60],
['second', 1],
];
for (const [unit, size] of units) {
if (Math.abs(diffSeconds) >= size || unit === 'second') {
return relativeFormatter.format(Math.round(diffSeconds / size), unit);
}
}
return relativeFormatter.format(0, 'second');
};
const format = () => {
document.querySelectorAll('time[data-next-run-at], time[data-ended-at]').forEach((element) => {
const relativeAt =
element.getAttribute('data-next-run-at') ??
element.getAttribute('data-ended-at');
if (!relativeAt) return;
const targetDate = new Date(relativeAt);
if (Number.isNaN(targetDate.getTime())) return;
element.textContent = formatRelative(targetDate);
element.title = absoluteFormatter.format(targetDate);
});
};
format();
if (!window.repubNextRunTimer) {
window.repubNextRunTimer = window.setInterval(format, 30000);
}
});
window.repubFormatNextRuns();
"""
],
),
)
def execution_logs_page(
*,
job_id: int,
execution_id: int,
log_view: Mapping[str, object] | None = None,
) -> Renderable:
if log_view is None:
log_view = {
"title": f"Job {job_id} / execution {execution_id}",
"description": "",
"status_label": "Unavailable",
"status_tone": "failed",
"log_text": "",
"error_message": "Execution log is only available from persisted job runs.",
}
error_message = _maybe_text(log_view, "error_message")
error_notice = (
h.div(
class_="mt-3 rounded-2xl bg-rose-50 px-4 py-3 text-sm font-medium text-rose-800"
)[
h.p["Execution log unavailable"],
h.p(class_="mt-1 font-normal")[error_message],
]
if error_message is not None
else None
)
return page_shell(
current_path=f"/job/{job_id}/execution/{execution_id}/logs",
eyebrow="Execution log",
title=_text(log_view, "title"),
actions=muted_action_link(href="/runs", label="Back to runs"),
content=(
section_card(
content=(
h.div(class_="flex items-end justify-between gap-4")[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Route"],
h.h2(class_="mt-2 text-xl font-semibold text-slate-950")[
f"/job/{job_id}/execution/{execution_id}/logs"
],
],
status_badge(
label=_text(log_view, "status_label"),
tone=_text(log_view, "status_tone"),
),
],
error_notice,
h.pre(
class_="mt-3 overflow-x-auto rounded-[1.5rem] bg-slate-950 p-5 text-xs leading-6 text-emerald-200"
)[_text(log_view, "log_text")],
)
),
),
)
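
Review note: the inline script in `runs_page` picks the largest unit that fits the time difference and rounds the scaled value before handing it to `Intl.RelativeTimeFormat`. The sketch below restates that unit selection in Python so it can be checked without a browser. `pick_relative_unit` and `js_round` are hypothetical names, and `js_round` is an approximation of JavaScript's `Math.round`, which rounds halves toward positive infinity.

```python
import math

UNITS = (("day", 86400), ("hour", 3600), ("minute", 60), ("second", 1))


def js_round(value: float) -> int:
    """Round half toward positive infinity, like JavaScript's Math.round."""
    return math.floor(value + 0.5)


def pick_relative_unit(diff_seconds: int) -> tuple:
    """Return (count, unit) for a signed difference in seconds."""
    for unit, size in UNITS:
        # "second" always matches, so differences under a minute land there.
        if abs(diff_seconds) >= size or unit == "second":
            return js_round(diff_seconds / size), unit
    return 0, "second"  # not reached; mirrors the script's final fallback
```

Negative counts stand for past times, which is how the script's formatter distinguishes "ago" from "in".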

71
repub/pages/shim.py Normal file

@ -0,0 +1,71 @@
from __future__ import annotations
import htpy as h
from htpy import Node, Renderable
from repub.components import admin_sidebar
ON_LOAD_JS = (
"@post(window.location.pathname + "
"(window.location.search + '&u=').replace(/^&/,'?'), "
"{retryMaxCount: Infinity})"
)
TAB_ID_JS = "self.crypto.randomUUID().substring(0,8)"
def shim_page(
*, datastar_src: str, current_path: str, head: Node | None = None
) -> Renderable:
return h.html(lang="en")[
h.head[
h.meta(charset="UTF-8"),
head,
h.script(id="js", defer=True, type="module", src=datastar_src),
h.meta(name="viewport", content="width=device-width, initial-scale=1.0"),
],
h.body[
h.div({"data-signals:tabid": TAB_ID_JS}),
h.div(
{
"data-init": ON_LOAD_JS,
"data-on:online__window": ON_LOAD_JS,
}
),
h.noscript["Your browser does not support JavaScript!"],
h.main(
id="morph",
class_="min-h-screen lg:grid lg:grid-cols-[18rem_minmax(0,1fr)]",
)[
admin_sidebar(current_path=current_path),
h.div(class_="px-4 py-4 sm:px-5 lg:px-6 lg:py-5")[
h.div(class_="mx-auto max-w-7xl space-y-5")[
h.section[
h.div(
class_="flex flex-col gap-4 sm:flex-row sm:items-start sm:justify-between"
)[
h.div(class_="max-w-3xl")[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Connecting"],
h.h1(
class_="mt-1 text-3xl font-semibold tracking-tight text-slate-950"
)["Loading page"],
],
]
],
h.section(
class_="overflow-hidden rounded-2xl bg-white shadow-sm ring-1 ring-slate-200"
)[
h.div(class_="animate-pulse space-y-4 p-6")[
h.div(class_="h-5 w-40 rounded-full bg-stone-100"),
h.div(class_="h-12 rounded-2xl bg-stone-100"),
h.div(class_="h-12 rounded-2xl bg-stone-100"),
h.div(class_="h-12 rounded-2xl bg-stone-100"),
]
],
]
],
],
],
]
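
A note on `ON_LOAD_JS` above: the expression appends a `u` query parameter to the current URL before posting, and the `replace(/^&/,'?')` call only matters when the page has no query string yet. The following is a minimal Python sketch of that string logic, not part of the commit; `reconnect_url` is a hypothetical name used only for illustration.

```python
import re


def reconnect_url(pathname: str, search: str) -> str:
    """Mirror the JavaScript string logic in ON_LOAD_JS (illustrative helper).

    `search` is either empty or starts with "?", like window.location.search.
    """
    # Append "&u=" first; if the query string was empty, the result starts
    # with "&", which is then swapped for "?".
    suffix = re.sub(r"^&", "?", search + "&u=")
    return pathname + suffix


if __name__ == "__main__":
    print(reconnect_url("/runs", ""))
    print(reconnect_url("/runs", "?page=2"))
```

With an empty query string the result ends in `?u=`; with an existing one, `&u=` is appended and nothing is replaced.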

425
repub/pages/sources.py Normal file

@ -0,0 +1,425 @@
from __future__ import annotations
from collections.abc import Mapping
import htpy as h
from htpy import Node, Renderable
from repub.components import (
header_action_link,
inline_link,
input_field,
muted_action_link,
page_shell,
section_card,
select_field,
status_badge,
table_section,
textarea_field,
toggle_field,
)
PANGEA_CONTENT_FORMATS = (
"WTF_0",
"TEXT_ONLY",
"WTF_1",
"MOBILE_1",
"MOBILE_2",
"MOBILE_3",
"WTF_2",
"XML_TX",
"JSON",
)
PANGEA_CONTENT_TYPES = (
"articles",
"audioclips",
"videoclips",
"breakingnews",
"mostpopular",
"topstories",
)
def _value(source: Mapping[str, object] | None, key: str, default: str = "") -> str:
if source is None:
return default
return str(source.get(key, default))
def _checked(source: Mapping[str, object] | None, key: str, default: bool) -> bool:
if source is None:
return default
value = source.get(key, default)
return bool(value)
def _source_row(source: Mapping[str, object]) -> tuple[Node, ...]:
return (
h.div[
h.div(class_="font-semibold text-slate-950")[str(source["name"])],
h.p(class_="mt-1 font-mono text-xs text-slate-500")[str(source["slug"])],
],
h.p(class_="font-medium whitespace-nowrap text-slate-900")[
str(source["source_type"])
],
h.p(class_="max-w-sm truncate font-mono text-xs text-slate-600")[
str(source["upstream"])
],
h.p(class_="font-medium whitespace-nowrap text-slate-900")[
str(source["schedule"])
],
h.div(class_="min-w-32 whitespace-normal")[
status_badge(
label=str(source["state"]),
tone=str(source["state_tone"]),
),
h.p(class_="mt-2 text-xs text-slate-500")[str(source["last_run"])],
],
h.div(class_="flex flex-nowrap items-center gap-3")[
inline_link(
href=f"/sources/{source['slug']}/edit", label="Edit", tone="amber"
),
inline_link(href="/runs", label="View runs"),
],
)
def sources_table(
*, sources: tuple[Mapping[str, object], ...] | None = None
) -> Renderable:
rows = tuple(_source_row(source) for source in (sources or ()))
return table_section(
eyebrow="Inventory",
title="Sources",
empty_message="No sources yet.",
headers=("Source", "Type", "Upstream", "Schedule", "Job state", "Actions"),
rows=rows,
actions=header_action_link(href="/sources/create", label="Create source"),
)
def sources_page(
*, sources: tuple[Mapping[str, object], ...] | None = None
) -> Renderable:
return page_shell(
current_path="/sources",
eyebrow="Source management",
title="Sources",
actions=header_action_link(href="/sources/create", label="Create source"),
content=sources_table(sources=sources),
)
def source_form(
*,
mode: str,
action_path: str,
source: Mapping[str, object] | None = None,
) -> Renderable:
source_type = _value(source, "source_type", "pangea")
slug = _value(source, "slug")
title = "Source and job setup" if mode == "create" else "Edit source"
eyebrow = "Create" if mode == "create" else "Edit"
status_label = "New source" if mode == "create" else "Existing source"
submit_label = "Create source" if mode == "create" else "Save changes"
initial_signals = "{sourceType: 'pangea'}"
if mode == "edit":
initial_signals = f"{{sourceType: '{source_type}', sourceSlug: '{slug}'}}"
return section_card(
content=(
h.div(
class_="flex flex-col gap-3 sm:flex-row sm:items-end sm:justify-between"
)[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)[eyebrow],
h.h2(class_="mt-2 text-xl font-semibold text-slate-950")[title],
],
status_badge(label=status_label, tone="scheduled"),
],
h.form(
{
"data-signals": "{_formError: '', _formSuccess: ''}",
"data-signals__ifmissing": initial_signals,
"data-on:submit": f"@post('{action_path}')",
},
class_="mt-5 space-y-6",
)[
h.div(
{
"data-show": "$_formError !== ''",
"data-text": "$_formError",
},
class_="rounded-2xl bg-rose-50 px-4 py-3 text-sm font-medium text-rose-800",
),
h.div(
{
"data-show": "$_formSuccess !== ''",
"data-text": "$_formSuccess",
},
class_="rounded-2xl bg-emerald-100 px-4 py-3 text-sm font-medium text-emerald-800",
),
h.div(class_="grid gap-4 md:grid-cols-2")[
input_field(
label="Source name",
field_id="source-name",
value=_value(source, "name"),
signal_name="sourceName",
),
input_field(
label="Slug",
field_id="source-slug",
value=slug,
help_text="Immutable after creation.",
signal_name="sourceSlug",
disabled=mode == "edit",
),
h.div[
h.label(
for_="source-type",
class_="block text-sm font-medium text-slate-900",
)["Source type"],
h.select(
{"data-bind": "sourceType"},
id="source-type",
name="source-type",
class_="mt-2 block w-full rounded-2xl border-0 bg-white px-3.5 py-2.5 text-sm text-slate-900 shadow-sm ring-1 ring-slate-200 focus:outline-hidden focus:ring-2 focus:ring-amber-500",
)[
h.option(value="feed", selected=source_type == "feed")[
"feed"
],
h.option(value="pangea", selected=source_type == "pangea")[
"pangea"
],
],
],
],
h.div(
{"data-show": "$sourceType === 'feed'"},
class_="space-y-4 rounded-[1.5rem] bg-stone-50 p-5",
)[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Feed source options"],
h.h3(class_="mt-2 text-lg font-semibold text-slate-950")[
"Feed settings"
],
],
h.div(class_="grid gap-4 md:grid-cols-2")[
input_field(
label="Feed URL",
field_id="feed-url",
value=_value(source, "feed_url"),
placeholder="https://example.com/feed.xml",
signal_name="feedUrl",
),
],
],
h.div(
{"data-show": "$sourceType === 'pangea'"},
class_="space-y-4 rounded-[1.5rem] bg-stone-50 p-5",
)[
h.div[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Pangea source options"],
h.h3(class_="mt-2 text-lg font-semibold text-slate-950")[
"Pangea settings"
],
],
h.div(class_="grid gap-4 lg:grid-cols-3")[
input_field(
label="Pangea domain",
field_id="pangea-domain",
value=_value(source, "pangea_domain"),
signal_name="pangeaDomain",
),
input_field(
label="Category name",
field_id="pangea-category",
value=_value(source, "pangea_category"),
signal_name="pangeaCategory",
),
select_field(
label="Content format",
field_id="content-format",
options=PANGEA_CONTENT_FORMATS,
selected=_value(source, "content_format", "MOBILE_3"),
signal_name="contentFormat",
),
select_field(
label="Content type",
field_id="content-type",
options=PANGEA_CONTENT_TYPES,
selected=_value(source, "content_type", "articles"),
signal_name="contentType",
),
input_field(
label="Max articles",
field_id="max-articles",
value=_value(source, "max_articles", "10"),
signal_name="maxArticles",
),
input_field(
label="Oldest article (days)",
field_id="oldest-article",
value=_value(source, "oldest_article", "3"),
signal_name="oldestArticle",
),
],
h.div(class_="grid gap-4 lg:grid-cols-3")[
toggle_field(
label="Only newest",
description="Limit Pangea syncs to the newest material available in the selected category.",
signal_name="onlyNewest",
checked=_checked(source, "only_newest", True),
),
toggle_field(
label="Include authors",
description="Carry author bylines into mirrored output where upstream data exists.",
signal_name="includeAuthors",
checked=_checked(source, "include_authors", True),
),
toggle_field(
label="Exclude media",
description="Skip image and media attachment mirroring for this source.",
signal_name="excludeMedia",
checked=_checked(source, "exclude_media", False),
),
toggle_field(
label="Include content",
description="Store article body content in mirrored output when the upstream provides it.",
signal_name="includeContent",
checked=_checked(source, "include_content", True),
),
],
],
h.div(class_="grid gap-4 lg:grid-cols-2")[
textarea_field(
label="Notes",
field_id="source-notes",
value=_value(source, "notes"),
signal_name="sourceNotes",
),
textarea_field(
label="Spider arguments",
field_id="spider-arguments",
value=_value(
source,
"spider_arguments",
"language=en\ndownload_media=true",
),
signal_name="spiderArguments",
),
],
h.div(
class_="grid gap-6 xl:grid-cols-[minmax(0,1.3fr)_minmax(20rem,0.9fr)]"
)[
h.div(class_="rounded-[1.5rem] bg-stone-50 p-5")[
h.div[
h.h3(class_="text-lg font-semibold text-slate-950")[
"Cron schedule"
],
h.p(class_="mt-1 text-sm text-slate-600")[
"Stored in UTC and displayed in the browser timezone."
],
],
h.div(class_="mt-5 grid gap-4 sm:grid-cols-2 xl:grid-cols-5")[
input_field(
label="Minute",
field_id="cron-minute",
value=_value(source, "cron_minute", "*/30"),
signal_name="cronMinute",
),
input_field(
label="Hour",
field_id="cron-hour",
value=_value(source, "cron_hour", "*"),
signal_name="cronHour",
),
input_field(
label="Day of month",
field_id="cron-day-of-month",
value=_value(source, "cron_day_of_month", "*"),
signal_name="cronDayOfMonth",
),
input_field(
label="Day of week",
field_id="cron-day-of-week",
value=_value(source, "cron_day_of_week", "*"),
signal_name="cronDayOfWeek",
),
input_field(
label="Month",
field_id="cron-month",
value=_value(source, "cron_month", "*"),
signal_name="cronMonth",
),
],
],
h.div(class_="rounded-[1.5rem] bg-stone-50 p-5")[
h.p(
class_="text-xs font-semibold uppercase tracking-[0.22em] text-amber-600"
)["Job defaults"],
h.h3(class_="mt-2 text-lg font-semibold text-slate-950")[
"Initial job state"
],
h.div(class_="mt-5 grid gap-4")[
toggle_field(
label="Job enabled",
description="Scheduler will consider the new job immediately after creation.",
signal_name="jobEnabled",
checked=_checked(source, "enabled", True),
),
],
],
],
h.div(
class_="flex flex-wrap justify-end gap-3 border-t border-slate-200 pt-6"
)[
muted_action_link(href="/sources", label="Cancel"),
h.button(
type="submit",
class_="rounded-full bg-slate-950 px-4 py-2.5 text-sm font-semibold text-white transition hover:bg-slate-800",
)[submit_label],
],
],
)
)
def create_source_page(*, action_path: str = "/actions/sources/create") -> Renderable:
actions = (
muted_action_link(href="/sources", label="Back to sources"),
header_action_link(href="/runs", label="View runs"),
)
return page_shell(
current_path="/sources/create",
eyebrow="Source creation",
title="Create source",
actions=actions,
content=source_form(mode="create", action_path=action_path),
)
def edit_source_page(
*,
slug: str,
source: Mapping[str, object],
action_path: str,
) -> Renderable:
actions = (
muted_action_link(href="/sources", label="Back to sources"),
header_action_link(href="/runs", label="View runs"),
)
return page_shell(
current_path=f"/sources/{slug}/edit",
eyebrow="Source editing",
title="Edit source",
actions=actions,
content=source_form(mode="edit", action_path=action_path, source=source),
)
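
A note on `source_form` above: in edit mode the initial signals are built by interpolating `source_type` and `slug` into a quoted JavaScript literal, so a value containing a single quote would break the expression. One possible alternative, sketched here under the assumption that the attribute accepts any JavaScript object expression (a JSON object is one), is to serialize with `json.dumps`. `initial_signals` is a hypothetical helper, not a function from this repository.

```python
import json
from typing import Optional


def initial_signals(source_type: str, slug: Optional[str] = None) -> str:
    """Build the initial signals object as a JSON string (illustrative helper)."""
    signals = {"sourceType": source_type}
    if slug is not None:
        # Only edit mode carries a slug in the original form.
        signals["sourceSlug"] = slug
    # json.dumps escapes quotes and backslashes inside the values.
    return json.dumps(signals)


if __name__ == "__main__":
    print(initial_signals("pangea"))
    print(initial_signals("feed", "o'brien-feed"))
```

Slugs may already be restricted to safe characters elsewhere in the code base, in which case the interpolation is harmless; the sketch only shows a form that does not depend on that.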


@ -8,7 +8,7 @@ from scrapy.utils.spider import iterate_spider_output
 from repub.items import ChannelElementItem, ElementItem
 from repub.rss import CDATA, CONTENT, ITUNES, MEDIA, E, munge_cdata_html, normalize_date
-from repub.utils import FileType, determine_file_type, local_file_path
+from repub.utils import FileType, determine_file_type, local_file_path, local_image_path
 class BaseRssFeedSpider(Spider):
@ -34,13 +34,15 @@ class BaseRssFeedSpider(Spider):
     def rewrite_file_url(self, file_type: FileType, url):
         file_dir = self.settings["REPUBLISHER_FILE_DIR"]
+        local_path = local_file_path(url)
         if file_type == FileType.IMAGE:
             file_dir = self.settings["REPUBLISHER_IMAGE_DIR"]
+            local_path = local_image_path(url)
         elif file_type == FileType.VIDEO:
             file_dir = self.settings["REPUBLISHER_VIDEO_DIR"]
         elif file_type == FileType.AUDIO:
             file_dir = self.settings["REPUBLISHER_AUDIO_DIR"]
-        return f"/{file_dir}/{local_file_path(url)}"
+        return f"{file_dir}/{local_path}"
     def rewrite_image_url(self, url):
         return self.rewrite_file_url(FileType.IMAGE, url)

98
repub/sql/001_initial.sql Normal file

@ -0,0 +1,98 @@
CREATE TABLE IF NOT EXISTS source (
id INTEGER PRIMARY KEY,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
name TEXT NOT NULL,
slug TEXT NOT NULL UNIQUE,
source_type TEXT NOT NULL CHECK (source_type IN ('feed', 'pangea')),
notes TEXT NOT NULL DEFAULT ''
);
CREATE TABLE IF NOT EXISTS source_feed (
source_id INTEGER PRIMARY KEY,
feed_url TEXT NOT NULL,
etag TEXT,
last_modified TEXT,
FOREIGN KEY (source_id) REFERENCES source(id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS source_pangea (
source_id INTEGER PRIMARY KEY,
domain TEXT NOT NULL,
category_name TEXT NOT NULL,
content_type TEXT NOT NULL,
only_newest INTEGER NOT NULL CHECK (only_newest IN (0, 1)),
max_articles INTEGER NOT NULL,
oldest_article INTEGER NOT NULL,
include_authors INTEGER NOT NULL CHECK (include_authors IN (0, 1)),
exclude_media INTEGER NOT NULL CHECK (exclude_media IN (0, 1)),
include_content INTEGER NOT NULL CHECK (include_content IN (0, 1)),
content_format TEXT NOT NULL,
FOREIGN KEY (source_id) REFERENCES source(id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS job (
id INTEGER PRIMARY KEY,
source_id INTEGER NOT NULL UNIQUE,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
enabled INTEGER NOT NULL CHECK (enabled IN (0, 1)),
spider_arguments TEXT NOT NULL DEFAULT '',
cron_minute TEXT NOT NULL,
cron_hour TEXT NOT NULL,
cron_day_of_month TEXT NOT NULL,
cron_day_of_week TEXT NOT NULL,
cron_month TEXT NOT NULL,
FOREIGN KEY (source_id) REFERENCES source(id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS job_execution (
id INTEGER PRIMARY KEY,
job_id INTEGER NOT NULL,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
started_at TEXT,
ended_at TEXT,
stop_requested_at TEXT,
running_status INTEGER NOT NULL DEFAULT 0 CHECK (running_status BETWEEN 0 AND 4),
requests_count INTEGER NOT NULL DEFAULT 0,
items_count INTEGER NOT NULL DEFAULT 0,
warnings_count INTEGER NOT NULL DEFAULT 0,
errors_count INTEGER NOT NULL DEFAULT 0,
bytes_count INTEGER NOT NULL DEFAULT 0,
retries_count INTEGER NOT NULL DEFAULT 0,
exceptions_count INTEGER NOT NULL DEFAULT 0,
cache_size_count INTEGER NOT NULL DEFAULT 0,
cache_object_count INTEGER NOT NULL DEFAULT 0,
raw_stats TEXT NOT NULL DEFAULT '{}',
FOREIGN KEY (job_id) REFERENCES job(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS job_enabled_idx
ON job (enabled);
CREATE INDEX IF NOT EXISTS job_execution_job_created_at_idx
ON job_execution (job_id, created_at DESC);
CREATE INDEX IF NOT EXISTS job_execution_status_started_at_idx
ON job_execution (running_status, started_at DESC);
CREATE INDEX IF NOT EXISTS job_execution_status_ended_at_idx
ON job_execution (running_status, ended_at DESC);
CREATE TRIGGER IF NOT EXISTS source_set_updated_at
AFTER UPDATE ON source
FOR EACH ROW
BEGIN
UPDATE source
SET updated_at = CURRENT_TIMESTAMP
WHERE id = NEW.id;
END;
CREATE TRIGGER IF NOT EXISTS job_set_updated_at
AFTER UPDATE ON job
FOR EACH ROW
BEGIN
UPDATE job
SET updated_at = CURRENT_TIMESTAMP
WHERE id = NEW.id;
END;
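
A note on the `updated_at` triggers above: each one issues an `UPDATE` on the table it is attached to. That does not loop as long as recursive triggers stay disabled, which is SQLite's default. The sketch below, using only the standard-library `sqlite3` module and a reduced copy of the `source` table, is one way to check that behaviour; it is an illustration, not code from this commit.

```python
import sqlite3

# Reduced version of the `source` table and its trigger, for illustration only.
SCHEMA = """
CREATE TABLE source (
    id INTEGER PRIMARY KEY,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    name TEXT NOT NULL
);
CREATE TRIGGER source_set_updated_at
AFTER UPDATE ON source
FOR EACH ROW
BEGIN
    UPDATE source
    SET updated_at = CURRENT_TIMESTAMP
    WHERE id = NEW.id;
END;
"""


def updated_at_before_and_after() -> tuple:
    """Return updated_at before and after an UPDATE that fires the trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Seed an obviously old timestamp so the change is visible.
    conn.execute(
        "INSERT INTO source (id, updated_at, name) "
        "VALUES (1, '2000-01-01 00:00:00', 'old name')"
    )
    query = "SELECT updated_at FROM source WHERE id = 1"
    before = conn.execute(query).fetchone()[0]
    # The trigger's own UPDATE does not re-fire the trigger because
    # recursive triggers are off unless PRAGMA recursive_triggers is set.
    conn.execute("UPDATE source SET name = 'new name' WHERE id = 1")
    after = conn.execute(query).fetchone()[0]
    conn.close()
    return before, after


if __name__ == "__main__":
    print(updated_at_before_and_after())
```

Separately, the `ON DELETE CASCADE` clauses in this schema only take effect on connections where `PRAGMA foreign_keys = ON` has been issued; whether the application's database layer does so is not visible in this diff.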

1417
repub/static/app.css Normal file

File diff suppressed because it is too large


@ -0,0 +1 @@
@import "tailwindcss" source("../");

File diff suppressed because one or more lines are too long


@ -1,27 +1,503 @@
from __future__ import annotations
-from quart import Quart
+import asyncio
import hashlib
from collections.abc import AsyncGenerator, Awaitable, Callable
from pathlib import Path
from typing import TypedDict, cast
from urllib.parse import urlparse
import htpy as h
from datastar_py import ServerSentEventGenerator as SSE
from datastar_py.quart import DatastarResponse, read_signals
from datastar_py.sse import DatastarEvent
from htpy import Renderable
from peewee import IntegrityError
from quart import Quart, Response, request, send_from_directory, url_for
from repub.datastar import RefreshBroker, render_stream
from repub.jobs import (
JobRuntime,
load_dashboard_view,
load_execution_log_view,
load_runs_view,
)
from repub.model import (
Job,
create_source,
delete_job_source,
initialize_database,
load_source_form,
load_sources,
source_slug_exists,
update_source,
)
from repub.pages import (
create_source_page,
dashboard_page_with_data,
edit_source_page,
execution_logs_page,
runs_page,
shim_page,
sources_page,
)
from repub.pages.sources import PANGEA_CONTENT_FORMATS, PANGEA_CONTENT_TYPES
REFRESH_BROKER_KEY = "repub.refresh_broker"
JOB_RUNTIME_KEY = "repub.job_runtime"
DEFAULT_LOG_DIR = Path("out/logs")
DEFAULT_FEEDS_DIR = Path("out/feeds")
RenderFunction = Callable[[], Awaitable[Renderable]]
-def create_app() -> Quart:
+class SourceFormData(TypedDict):
name: str
slug: str
source_type: str
notes: str
spider_arguments: str
enabled: bool
cron_minute: str
cron_hour: str
cron_day_of_month: str
cron_day_of_week: str
cron_month: str
feed_url: str
pangea_domain: str
pangea_category: str
content_format: str
content_type: str
max_articles: int | None
oldest_article: int | None
only_newest: bool
include_authors: bool
exclude_media: bool
include_content: bool
DEFAULT_PANGEA_CONTENT_FORMAT = "MOBILE_3"
DEFAULT_PANGEA_CONTENT_TYPE = "articles"
DEFAULT_PANGEA_MAX_ARTICLES = "10"
DEFAULT_PANGEA_OLDEST_ARTICLE = "3"
def _render_shim_page(
*, stylesheet_href: str, datastar_src: str, current_path: str
) -> tuple[str, str]:
head = (
h.title["Republisher Admin UI"],
h.link(rel="stylesheet", href=stylesheet_href),
)
body = str(
shim_page(datastar_src=datastar_src, current_path=current_path, head=head)
)
etag = hashlib.sha256(body.encode("utf-8")).hexdigest()
return body, etag
def create_app(*, dev_mode: bool = False) -> Quart:
app = Quart(__name__)
app.config["REPUB_DB_PATH"] = str(initialize_database())
app.config.setdefault("REPUB_LOG_DIR", DEFAULT_LOG_DIR)
app.config.setdefault("REPUB_FEEDS_DIR", DEFAULT_FEEDS_DIR)
app.config["REPUB_DEV_MODE"] = dev_mode
app.extensions[REFRESH_BROKER_KEY] = RefreshBroker()
app.extensions[JOB_RUNTIME_KEY] = None
@app.get("/feeds/<path:feed_path>")
async def published_feed(feed_path: str) -> Response:
if not bool(app.config["REPUB_DEV_MODE"]):
return Response(status=404)
response = await send_from_directory(
str(Path(app.config["REPUB_FEEDS_DIR"])),
feed_path,
)
if Path(feed_path).suffix == ".rss":
response.mimetype = "application/rss+xml"
return response
@app.get("/")
-async def index() -> str:
-return """<!doctype html>
-<html lang="en">
-<head>
-<meta charset="utf-8">
-<meta name="viewport" content="width=device-width, initial-scale=1">
-<title>Republisher</title>
-</head>
-<body>
-<main>
-<h1>Hello, world!</h1>
-<p>Republisher web UI is starting here.</p>
-</main>
-</body>
-</html>
-"""
+@app.get("/sources")
+@app.get("/sources/create")
+@app.get("/sources/<string:slug>/edit")
+@app.get("/runs")
+@app.get("/job/<int:job_id>/execution/<int:execution_id>/logs")
+async def page_shim(
+slug: str | None = None,
+job_id: int | None = None,
+execution_id: int | None = None,
+) -> Response:
+del slug, job_id, execution_id
+body, etag = _render_shim_page(
+stylesheet_href=url_for("static", filename="app.css"),
+datastar_src=url_for("static", filename="datastar@1.0.0-RC.8.js"),
+current_path=request.path,
+)
if request.if_none_match.contains(etag):
response = Response(status=304)
response.set_etag(etag)
return response
response = Response(body, mimetype="text/html")
response.set_etag(etag)
return response
@app.post("/")
async def dashboard_patch() -> DatastarResponse:
return _page_patch_response(app, lambda: render_dashboard(app))
@app.post("/sources")
async def sources_patch() -> DatastarResponse:
return _page_patch_response(app, lambda: render_sources(app))
@app.post("/sources/create")
async def create_source_patch() -> DatastarResponse:
return _page_patch_response(app, lambda: render_create_source(app))
@app.post("/sources/<string:slug>/edit")
async def edit_source_patch(slug: str) -> DatastarResponse:
return _page_patch_response(app, lambda: render_edit_source(slug))
@app.post("/actions/sources/create")
async def create_source_action() -> DatastarResponse:
signals = cast(dict[str, object], await read_signals())
source, error = validate_source_form(
signals,
slug_exists=source_slug_exists,
)
if error is not None:
return DatastarResponse(
SSE.patch_signals({"_formError": error, "_formSuccess": ""})
)
assert source is not None
try:
create_source(**source)
except IntegrityError:
return DatastarResponse(
SSE.patch_signals(
{"_formError": "Slug must be unique.", "_formSuccess": ""}
)
)
get_job_runtime(app).sync_jobs()
trigger_refresh(app)
return DatastarResponse(SSE.redirect("/sources"))
@app.post("/actions/sources/<string:slug>/edit")
async def edit_source_action(slug: str) -> DatastarResponse:
signals = cast(dict[str, object], await read_signals())
source, error = validate_source_form(
signals,
slug_exists=lambda candidate: candidate != slug
and source_slug_exists(candidate),
immutable_slug=slug,
)
if error is not None:
return DatastarResponse(
SSE.patch_signals({"_formError": error, "_formSuccess": ""})
)
assert source is not None
if update_source(slug, **source) is None:
return DatastarResponse(
SSE.patch_signals(
{"_formError": "Source does not exist.", "_formSuccess": ""}
)
)
get_job_runtime(app).sync_jobs()
trigger_refresh(app)
return DatastarResponse(SSE.redirect("/sources"))
@app.post("/runs")
async def runs_patch() -> DatastarResponse:
return _page_patch_response(app, lambda: render_runs(app))
@app.post("/actions/jobs/<int:job_id>/run-now")
async def run_job_now_action(job_id: int) -> Response:
get_job_runtime(app).run_job_now(job_id, reason="manual")
trigger_refresh(app)
return Response(status=204)
@app.post("/actions/jobs/<int:job_id>/toggle-enabled")
async def toggle_job_enabled_action(job_id: int) -> Response:
job = Job.get_or_none(id=job_id)
if job is not None:
get_job_runtime(app).set_job_enabled(job_id, enabled=not job.enabled)
trigger_refresh(app)
return Response(status=204)
@app.post("/actions/jobs/<int:job_id>/delete")
async def delete_job_action(job_id: int) -> Response:
delete_job_source(job_id)
get_job_runtime(app).sync_jobs()
trigger_refresh(app)
return Response(status=204)
@app.post("/actions/executions/<int:execution_id>/cancel")
async def cancel_execution_action(execution_id: int) -> Response:
get_job_runtime(app).request_execution_cancel(execution_id)
trigger_refresh(app)
return Response(status=204)
@app.post("/job/<int:job_id>/execution/<int:execution_id>/logs")
async def logs_patch(job_id: int, execution_id: int) -> DatastarResponse:
async def render() -> Renderable:
return await render_execution_logs(
app, job_id=job_id, execution_id=execution_id
)
return _page_patch_response(app, render)
@app.before_serving
async def start_runtime() -> None:
get_job_runtime(app).start()
@app.after_serving
async def stop_runtime() -> None:
get_job_runtime(app).shutdown()
return app
def get_refresh_broker(app: Quart) -> RefreshBroker:
return cast(RefreshBroker, app.extensions[REFRESH_BROKER_KEY])
def get_job_runtime(app: Quart) -> JobRuntime:
runtime = cast(JobRuntime | None, app.extensions.get(JOB_RUNTIME_KEY))
if runtime is None:
runtime = JobRuntime(
log_dir=app.config["REPUB_LOG_DIR"],
refresh_callback=lambda: trigger_refresh(app),
)
app.extensions[JOB_RUNTIME_KEY] = runtime
return runtime
def trigger_refresh(app: Quart, event: object = "refresh-event") -> None:
get_refresh_broker(app).publish(event)
async def render_dashboard(app: Quart | None = None) -> Renderable:
if app is None:
return dashboard_page_with_data()
view = load_dashboard_view(log_dir=app.config["REPUB_LOG_DIR"])
return dashboard_page_with_data(
snapshot=cast(dict[str, str], view["snapshot"]),
running_executions=cast(tuple[dict[str, object], ...], view["running"]),
source_feeds=cast(tuple[dict[str, object], ...], view["source_feeds"]),
)
async def render_sources(app: Quart | None = None) -> Renderable:
sources = None if app is None else load_sources()
return sources_page(sources=sources)
async def render_create_source(app: Quart | None = None) -> Renderable:
del app
return create_source_page()
async def render_edit_source(slug: str) -> Renderable:
source = load_source_form(slug)
if source is None:
return sources_page(sources=())
return edit_source_page(
slug=slug,
source=source,
action_path=f"/actions/sources/{slug}/edit",
)
async def render_runs(app: Quart | None = None) -> Renderable:
if app is None:
return runs_page()
view = load_runs_view(log_dir=app.config["REPUB_LOG_DIR"])
return runs_page(
running_executions=cast(tuple[dict[str, object], ...], view["running"]),
upcoming_jobs=cast(tuple[dict[str, object], ...], view["upcoming"]),
completed_executions=cast(tuple[dict[str, object], ...], view["completed"]),
)
async def render_execution_logs(
app: Quart | None = None, *, job_id: int, execution_id: int
) -> Renderable:
if app is None:
return execution_logs_page(job_id=job_id, execution_id=execution_id)
log_view = load_execution_log_view(
log_dir=app.config["REPUB_LOG_DIR"],
job_id=job_id,
execution_id=execution_id,
)
return execution_logs_page(
job_id=job_id,
execution_id=execution_id,
log_view={
"title": log_view.title,
"description": log_view.description,
"status_label": log_view.status_label,
"status_tone": log_view.status_tone,
"log_text": log_view.log_text,
"error_message": log_view.error_message,
},
)
def _page_patch_response(app: Quart, render: RenderFunction) -> DatastarResponse:
queue = get_refresh_broker(app).subscribe()
stream = render_stream(
queue,
render=render,
last_event_id=request.headers.get("last-event-id"),
)
return DatastarResponse(_unsubscribe_on_close(queue, stream, app))
async def _unsubscribe_on_close(
queue: object, stream: AsyncGenerator[DatastarEvent, None], app: Quart
) -> AsyncGenerator[DatastarEvent, None]:
try:
async for event in stream:
yield event
finally:
get_refresh_broker(app).unsubscribe(cast(asyncio.Queue[object], queue))
def validate_source_form(
signals: dict[str, object] | None,
*,
slug_exists: Callable[[str], bool],
immutable_slug: str | None = None,
) -> tuple[SourceFormData | None, str | None]:
if signals is None:
return None, "Missing form data."
source_name = _read_string(signals, "sourceName")
source_slug = _read_string(signals, "sourceSlug")
source_type = _read_string(signals, "sourceType")
feed_url = _read_string(signals, "feedUrl")
pangea_domain = _read_string(signals, "pangeaDomain")
pangea_category = _read_string(signals, "pangeaCategory")
content_format = _read_string(signals, "contentFormat")
content_type = _read_string(signals, "contentType")
max_articles = _read_string(signals, "maxArticles")
oldest_article = _read_string(signals, "oldestArticle")
source_notes = _read_string(signals, "sourceNotes")
spider_arguments = _normalize_multiline(_read_string(signals, "spiderArguments"))
cron_minute = _read_string(signals, "cronMinute")
cron_hour = _read_string(signals, "cronHour")
cron_day_of_month = _read_string(signals, "cronDayOfMonth")
cron_day_of_week = _read_string(signals, "cronDayOfWeek")
cron_month = _read_string(signals, "cronMonth")
errors: list[str] = []
if source_name == "":
errors.append("Source name is required.")
if source_slug == "":
errors.append("Slug is required.")
elif immutable_slug is not None and source_slug != immutable_slug:
errors.append("Slug is immutable.")
elif slug_exists(source_slug):
errors.append("Slug must be unique.")
if source_type not in {"feed", "pangea"}:
errors.append("Source type must be feed or pangea.")
if source_type == "feed":
if feed_url == "":
errors.append("Feed URL is required for feed sources.")
elif not _is_valid_url(feed_url):
errors.append("Feed URL must be a valid URL.")
if source_type == "pangea":
content_format = content_format or DEFAULT_PANGEA_CONTENT_FORMAT
content_type = content_type or DEFAULT_PANGEA_CONTENT_TYPE
max_articles = max_articles or DEFAULT_PANGEA_MAX_ARTICLES
oldest_article = oldest_article or DEFAULT_PANGEA_OLDEST_ARTICLE
if pangea_domain == "":
errors.append("Pangea domain is required.")
if pangea_category == "":
errors.append("Category name is required.")
if content_format not in PANGEA_CONTENT_FORMATS:
errors.append("Content format is invalid.")
if content_type not in PANGEA_CONTENT_TYPES:
errors.append("Content type is invalid.")
if _parse_int(max_articles) is None:
errors.append("Max articles must be an integer.")
if _parse_int(oldest_article) is None:
errors.append("Oldest article must be an integer.")
cron_values = (
cron_minute,
cron_hour,
cron_day_of_month,
cron_day_of_week,
cron_month,
)
if any(value == "" for value in cron_values):
errors.append("All cron fields are required.")
if errors:
return None, " ".join(errors)
enabled = _read_bool(signals, "jobEnabled")
source: SourceFormData = {
"name": source_name,
"slug": source_slug,
"source_type": source_type,
"notes": source_notes,
"spider_arguments": spider_arguments,
"feed_url": feed_url,
"pangea_domain": pangea_domain,
"pangea_category": pangea_category,
"content_format": content_format,
"content_type": content_type,
"max_articles": _parse_int(max_articles),
"oldest_article": _parse_int(oldest_article),
"enabled": enabled,
"only_newest": _read_bool(signals, "onlyNewest", default=True),
"include_authors": _read_bool(signals, "includeAuthors", default=True),
"exclude_media": _read_bool(signals, "excludeMedia", default=False),
"include_content": _read_bool(signals, "includeContent", default=True),
"cron_minute": cron_minute,
"cron_hour": cron_hour,
"cron_day_of_month": cron_day_of_month,
"cron_day_of_week": cron_day_of_week,
"cron_month": cron_month,
}
return source, None
def _read_string(signals: dict[str, object], key: str) -> str:
return str(signals.get(key, "")).strip()
def _read_bool(signals: dict[str, object], key: str, *, default: bool = False) -> bool:
value = signals.get(key, default)
if isinstance(value, bool):
return value
if isinstance(value, str):
return value.lower() in {"true", "1", "on", "yes"}
return bool(value)
def _normalize_multiline(value: str) -> str:
return value.replace("\r\n", "\n").replace("\r", "\n")
def _parse_int(value: str) -> int | None:
try:
return int(value)
except ValueError:
return None
def _is_valid_url(value: str) -> bool:
parsed = urlparse(value)
return parsed.scheme in {"http", "https"} and parsed.netloc != ""
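
A note on `page_shim` and `_render_shim_page` above: the shim body is hashed with SHA-256 and the digest is used as an ETag, so a client that already holds the same shim gets an empty 304 response. The sketch below shows that flow without Quart. The set of strings standing in for `request.if_none_match` is a simplification, and `conditional_response` is a hypothetical name.

```python
import hashlib


def body_etag(body: str) -> str:
    """Derive a content-based ETag the same way _render_shim_page hashes the body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def conditional_response(body: str, if_none_match: set) -> tuple:
    """Return (status, body, etag) for a conditional GET.

    `if_none_match` is a plain set of ETag strings here, a stand-in for the
    parsed If-None-Match header that the framework provides.
    """
    etag = body_etag(body)
    if etag in if_none_match:
        # The client already has this exact body: send no payload.
        return 304, "", etag
    return 200, body, etag


if __name__ == "__main__":
    status, _, etag = conditional_response("<html></html>", set())
    print(status, etag)
    print(conditional_response("<html></html>", {etag})[0])
```

Because the ETag depends only on the rendered shim, it changes whenever the shim markup, stylesheet URL, or script URL changes, and stays stable otherwise.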


@@ -141,12 +141,20 @@ def test_build_feed_settings_derives_output_paths_from_feed_slug(
     assert feed_settings["REPUBLISHER_OUT_DIR"] == str(out_dir)
     assert feed_settings["LOG_FILE"] == str(out_dir / "logs" / "info-marti.log")
     assert feed_settings["HTTPCACHE_DIR"] == str(out_dir / "httpcache")
-    assert feed_settings["IMAGES_STORE"] == str(out_dir / "info-marti" / "images")
-    assert feed_settings["AUDIO_STORE"] == str(out_dir / "info-marti" / "audio")
-    assert feed_settings["VIDEO_STORE"] == str(out_dir / "info-marti" / "video")
-    assert feed_settings["FILES_STORE"] == str(out_dir / "info-marti" / "files")
+    assert feed_settings["IMAGES_STORE"] == str(
+        out_dir / "feeds" / "info-marti" / "images"
+    )
+    assert feed_settings["AUDIO_STORE"] == str(
+        out_dir / "feeds" / "info-marti" / "audio"
+    )
+    assert feed_settings["VIDEO_STORE"] == str(
+        out_dir / "feeds" / "info-marti" / "video"
+    )
+    assert feed_settings["FILES_STORE"] == str(
+        out_dir / "feeds" / "info-marti" / "files"
+    )
     assert feed_settings["FEEDS"] == {
-        str(out_dir / "info-marti.rss"): {
+        str(out_dir / "feeds" / "info-marti" / "feed.rss"): {
             "format": "rss",
             "postprocessing": [],
             "feed_name": "info-marti",
@@ -181,5 +189,9 @@ def test_build_feed_settings_uses_runtime_media_dir_overrides(tmp_path: Path) ->
     assert feed_settings["REPUBLISHER_VIDEO_DIR"] == "videos-custom"
     assert feed_settings["REPUBLISHER_AUDIO_DIR"] == "audio-custom"
-    assert feed_settings["VIDEO_STORE"] == str(out_dir / "gp-pod" / "videos-custom")
-    assert feed_settings["AUDIO_STORE"] == str(out_dir / "gp-pod" / "audio-custom")
+    assert feed_settings["VIDEO_STORE"] == str(
+        out_dir / "feeds" / "gp-pod" / "videos-custom"
+    )
+    assert feed_settings["AUDIO_STORE"] == str(
+        out_dir / "feeds" / "gp-pod" / "audio-custom"
+    )

tests/test_dev_mode.py Normal file

@@ -0,0 +1,71 @@
from __future__ import annotations

import asyncio
from pathlib import Path

from repub.web import create_app


def test_dev_mode_serves_published_feeds(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "dev-mode.db"
    feeds_dir = tmp_path / "out" / "feeds"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app(dev_mode=True)
        app.config["REPUB_FEEDS_DIR"] = feeds_dir
        feed_path = feeds_dir / "demo-source" / "feed.rss"
        feed_path.parent.mkdir(parents=True)
        feed_path.write_text("<rss/>\n", encoding="utf-8")

        client = app.test_client()
        response = await client.get("/feeds/demo-source/feed.rss")

        assert response.status_code == 200
        assert response.mimetype == "application/rss+xml"
        assert await response.get_data(as_text=True) == "<rss/>\n"

    asyncio.run(run())


def test_dev_mode_serves_feed_enclosure_assets(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "dev-mode-assets.db"
    feeds_dir = tmp_path / "out" / "feeds"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app(dev_mode=True)
        app.config["REPUB_FEEDS_DIR"] = feeds_dir
        enclosure_path = feeds_dir / "demo-source" / "audio" / "episode.mp3"
        enclosure_path.parent.mkdir(parents=True)
        enclosure_path.write_bytes(b"mp3-data")

        client = app.test_client()
        response = await client.get("/feeds/demo-source/audio/episode.mp3")

        assert response.status_code == 200
        assert await response.get_data() == b"mp3-data"

    asyncio.run(run())


def test_default_mode_does_not_serve_published_feeds(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "default-mode.db"
    feeds_dir = tmp_path / "out" / "feeds"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        app.config["REPUB_FEEDS_DIR"] = feeds_dir
        feed_path = feeds_dir / "demo-source" / "feed.rss"
        feed_path.parent.mkdir(parents=True)
        feed_path.write_text("<rss/>\n", encoding="utf-8")

        client = app.test_client()
        response = await client.get("/feeds/demo-source/feed.rss")

        assert response.status_code == 404

    asyncio.run(run())


@@ -1,6 +1,9 @@
+import io
+import logging
 from types import SimpleNamespace
+from typing import cast

-from repub.entrypoint import FeedNameFilter
+from repub.entrypoint import FeedNameFilter, entrypoint, logger, parse_args


 def test_feed_name_filter_accepts_matching_item() -> None:
@@ -15,3 +18,70 @@ def test_feed_name_filter_rejects_non_matching_item() -> None:
     feed_filter = FeedNameFilter({"feed_name": "nasa"})

     assert feed_filter.accepts(item) is False
+
+
+def test_parse_args_uses_republisher_host_and_port_env_vars(monkeypatch) -> None:
+    monkeypatch.setenv("REPUBLISHER_HOST", "0.0.0.0")
+    monkeypatch.setenv("REPUBLISHER_PORT", "9090")
+
+    command, args = parse_args(["serve"])
+
+    assert command == "serve"
+    assert args.host == "0.0.0.0"
+    assert args.port == "9090"
+
+
+def test_parse_args_supports_dev_mode_flag() -> None:
+    command, args = parse_args(["serve", "--dev-mode"])
+
+    assert command == "serve"
+    assert args.dev_mode is True
+
+
+def test_parse_args_defaults_to_dev_mode_when_no_args() -> None:
+    command, args = parse_args([])
+
+    assert command == "serve"
+    assert args.dev_mode is True
+
+
+def test_entrypoint_rejects_invalid_republisher_port(monkeypatch) -> None:
+    monkeypatch.setenv("REPUBLISHER_PORT", "not-a-number")
+    stream = io.StringIO()
+    handlers = [
+        cast(logging.StreamHandler[io.StringIO], handler) for handler in logger.handlers
+    ]
+    original_streams = [handler.stream for handler in handlers]
+    for handler in handlers:
+        handler.stream = stream
+
+    try:
+        exit_code = entrypoint(["serve"])
+    finally:
+        for handler, original_stream in zip(handlers, original_streams):
+            handler.stream = original_stream
+
+    assert exit_code == 2
+    assert "Invalid REPUBLISHER_PORT/--port value" in stream.getvalue()
+
+
+def test_entrypoint_passes_dev_mode_to_create_app(monkeypatch) -> None:
+    recorded: dict[str, object] = {}
+
+    class StubApp:
+        def run(self, *, host: str, port: int) -> None:
+            recorded["host"] = host
+            recorded["port"] = port
+
+    def fake_create_app(*, dev_mode: bool) -> StubApp:
+        recorded["dev_mode"] = dev_mode
+        return StubApp()
+
+    monkeypatch.setattr("repub.entrypoint.create_app", fake_create_app)
+
+    exit_code = entrypoint(
+        ["serve", "--dev-mode", "--host", "0.0.0.0", "--port", "9090"]
+    )
+
+    assert exit_code == 0
+    assert recorded == {"dev_mode": True, "host": "0.0.0.0", "port": 9090}


@@ -1,6 +1,10 @@
 from pathlib import Path

+from scrapy.settings import Settings
+
 from repub import entrypoint as entrypoint_module
+from repub.spiders.rss_spider import RssFeedSpider
+from repub.utils import FileType, local_audio_path, local_image_path


 def test_entrypoint_supports_file_feed_urls(tmp_path: Path, monkeypatch) -> None:
@@ -29,9 +33,33 @@ DOWNLOAD_TIMEOUT = 5
     exit_code = entrypoint_module.entrypoint(["--config", str(config_path)])

-    output_path = tmp_path / "out" / "local-file.rss"
+    output_path = tmp_path / "out" / "feeds" / "local-file" / "feed.rss"
     assert exit_code == 0
     assert output_path.exists()
     output = output_path.read_text(encoding="utf-8")
     assert "<title>Local Demo Feed</title>" in output
     assert "<title>Local Demo Entry</title>" in output
+
+
+def test_rss_spider_rewrites_public_asset_urls_as_relative_paths() -> None:
+    spider = RssFeedSpider(feed_name="demo", url="https://example.com/feed.rss")
+    spider.settings = Settings(
+        values={
+            "REPUBLISHER_IMAGE_DIR": "images",
+            "REPUBLISHER_FILE_DIR": "files",
+            "REPUBLISHER_AUDIO_DIR": "audio",
+            "REPUBLISHER_VIDEO_DIR": "video",
+        }
+    )
+
+    assert (
+        spider.rewrite_image_url("https://example.com/media/photo.jpg")
+        == f"images/{local_image_path('https://example.com/media/photo.jpg')}"
+    )
+    assert (
+        spider.rewrite_file_url(
+            FileType.AUDIO,
+            "https://example.com/media/podcast.mp3",
+        )
+        == f"audio/{local_audio_path('https://example.com/media/podcast.mp3')}"
+    )

tests/test_jobs.py Normal file

@@ -0,0 +1,85 @@
from __future__ import annotations

from datetime import UTC, datetime
from pathlib import Path

from repub.jobs import load_runs_view
from repub.model import (
    Job,
    JobExecution,
    JobExecutionStatus,
    create_source,
    initialize_database,
)


def test_load_runs_view_humanizes_completed_execution_summary_bytes(
    tmp_path: Path,
) -> None:
    initialize_database(tmp_path / "jobs-completed.db")
    source = create_source(
        name="Completed source",
        slug="completed-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/completed.xml",
    )
    job = Job.get(Job.source == source)
    JobExecution.create(
        job=job,
        running_status=JobExecutionStatus.SUCCEEDED,
        ended_at=datetime(2026, 3, 30, 12, 0, tzinfo=UTC),
        requests_count=14,
        items_count=11,
        bytes_count=16_410_269,
    )

    view = load_runs_view(
        log_dir=tmp_path / "out" / "logs",
        now=datetime(2026, 3, 30, 12, 30, tzinfo=UTC),
    )

    assert view["completed"][0]["stats"] == "14 requests • 11 items • 15.7 MiB"


def test_load_runs_view_humanizes_running_execution_summary_bytes(
    tmp_path: Path,
) -> None:
    initialize_database(tmp_path / "jobs-running.db")
    source = create_source(
        name="Running source",
        slug="running-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/running.xml",
    )
    job = Job.get(Job.source == source)
    JobExecution.create(
        job=job,
        running_status=JobExecutionStatus.RUNNING,
        started_at=datetime(2026, 3, 30, 12, 0, tzinfo=UTC),
        requests_count=14,
        items_count=11,
        bytes_count=1_536,
    )

    view = load_runs_view(
        log_dir=tmp_path / "out" / "logs",
        now=datetime(2026, 3, 30, 12, 30, tzinfo=UTC),
    )

    assert view["running"][0]["stats"] == "14 requests • 11 items • 1.5 KiB"

tests/test_model.py Normal file

@@ -0,0 +1,170 @@
from __future__ import annotations

import sqlite3
from pathlib import Path

import pytest
from peewee import IntegrityError

from repub.model import (
    Job,
    Source,
    database,
    initialize_database,
    resolve_database_path,
)


def test_resolve_database_path_defaults_to_republisher_db(
    monkeypatch: pytest.MonkeyPatch, tmp_path: Path
) -> None:
    monkeypatch.chdir(tmp_path)
    monkeypatch.delenv("REPUBLISHER_DB_PATH", raising=False)

    assert resolve_database_path() == tmp_path / "republisher.db"


def test_resolve_database_path_prefers_environment_variable(
    monkeypatch: pytest.MonkeyPatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "env-configured.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    assert resolve_database_path() == db_path


def test_initialize_database_bootstraps_schema_from_sql_files(tmp_path: Path) -> None:
    db_path = tmp_path / "bootstrap.db"

    initialize_database(db_path)

    connection = sqlite3.connect(db_path)
    try:
        table_names = {
            row[0]
            for row in connection.execute(
                """
                SELECT name
                FROM sqlite_master
                WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
                """
            )
        }
        assert table_names == {
            "job",
            "job_execution",
            "source",
            "source_feed",
            "source_pangea",
        }
        defaults = {
            row[1]: row[4]
            for row in connection.execute("PRAGMA table_info('source_pangea')")
        }
        assert defaults["content_type"] is None
        assert defaults["only_newest"] is None
        assert defaults["max_articles"] is None
        assert defaults["oldest_article"] is None
        assert defaults["include_authors"] is None
        assert defaults["exclude_media"] is None
        assert defaults["include_content"] is None
        assert defaults["content_format"] is None
    finally:
        connection.close()


def test_initialize_database_configures_sqlite_pragmas(tmp_path: Path) -> None:
    db_path = tmp_path / "pragmas.db"

    initialize_database(db_path)

    database.connect(reuse_if_open=True)
    try:
        pragma_values = {
            "cache_size": database.execute_sql("PRAGMA cache_size").fetchone()[0],
            "page_size": database.execute_sql("PRAGMA page_size").fetchone()[0],
            "journal_mode": database.execute_sql("PRAGMA journal_mode").fetchone()[0],
            "synchronous": database.execute_sql("PRAGMA synchronous").fetchone()[0],
            "temp_store": database.execute_sql("PRAGMA temp_store").fetchone()[0],
            "foreign_keys": database.execute_sql("PRAGMA foreign_keys").fetchone()[0],
            "busy_timeout": database.execute_sql("PRAGMA busy_timeout").fetchone()[0],
        }
        assert pragma_values == {
            "cache_size": 15625,
            "page_size": 4096,
            "journal_mode": "wal",
            "synchronous": 1,
            "temp_store": 2,
            "foreign_keys": 1,
            "busy_timeout": 5000,
        }
    finally:
        database.close()


def test_initialize_database_creates_scheduler_and_execution_indexes(
    tmp_path: Path,
) -> None:
    db_path = tmp_path / "indexes.db"

    initialize_database(db_path)

    connection = sqlite3.connect(db_path)
    try:
        index_names = {
            row[0]
            for row in connection.execute(
                """
                SELECT name
                FROM sqlite_master
                WHERE type = 'index'
                AND name IN (
                    'job_enabled_idx',
                    'job_execution_job_created_at_idx',
                    'job_execution_status_started_at_idx',
                    'job_execution_status_ended_at_idx'
                )
                """
            )
        }
        assert index_names == {
            "job_enabled_idx",
            "job_execution_job_created_at_idx",
            "job_execution_status_started_at_idx",
            "job_execution_status_ended_at_idx",
        }
    finally:
        connection.close()


def test_job_table_allows_exactly_one_job_per_source(tmp_path: Path) -> None:
    initialize_database(tmp_path / "jobs.db")
    source = Source.create(
        name="Guardian feed mirror",
        slug="guardian-feed",
        source_type="feed",
    )

    Job.create(
        source=source,
        enabled=True,
        spider_arguments="",
        cron_minute="15",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
    )

    with pytest.raises(IntegrityError):
        Job.create(
            source=source,
            enabled=True,
            spider_arguments="language=en",
            cron_minute="30",
            cron_hour="*",
            cron_day_of_month="*",
            cron_day_of_week="*",
            cron_month="*",
        )


@@ -1,8 +1,10 @@
+import sys
 from pathlib import Path
 from types import SimpleNamespace

 import pytest

+from repub import media
 from repub.config import (
     FeedConfig,
     RepublisherConfig,
@@ -48,3 +50,141 @@ def test_pipeline_from_crawler_uses_configured_store(
     assert pipeline.settings is crawler.settings
     assert pipeline.store.basedir == crawler.settings[store_setting]
+
+
+def test_transcode_audio_captures_ffmpeg_output(monkeypatch, tmp_path: Path) -> None:
+    input_file = tmp_path / "input.mp3"
+    input_file.write_bytes(b"12345")
+    output_dir = tmp_path / "audio-out"
+    output_dir.mkdir()
+    run_calls: list[dict[str, object]] = []
+
+    class FakeOutput:
+        def __init__(self, output_path: Path):
+            self.output_path = output_path
+
+        def run(self, **kwargs):
+            run_calls.append(kwargs)
+            self.output_path.write_bytes(b"12")
+            return b"", b""
+
+    class FakeInput:
+        def output(self, output_file: str, **params):
+            del params
+            return FakeOutput(Path(output_file))
+
+    monkeypatch.setattr(media.ffmpeg, "input", lambda _: FakeInput())
+
+    result = media.transcode_audio(
+        str(input_file),
+        str(output_dir),
+        {"extension": "mp3", "acodec": "libmp3lame"},
+    )
+
+    assert result == str(output_dir / "converted.mp3")
+    assert run_calls == [{"capture_stdout": True, "capture_stderr": True}]
+
+
+def test_transcode_video_two_pass_does_not_print_ffmpeg_output(
+    monkeypatch, tmp_path: Path
+) -> None:
+    input_file = tmp_path / "input.mp4"
+    input_file.write_bytes(b"12345")
+    output_dir = tmp_path / "video-out"
+    output_dir.mkdir()
+    run_calls: list[dict[str, object]] = []
+    printed: list[tuple[tuple[object, ...], dict[str, object]]] = []
+
+    class FakeOutput:
+        def __init__(self, output_path: Path | None):
+            self.output_path = output_path
+
+        def global_args(self, *args):
+            del args
+            return self
+
+        def run(self, **kwargs):
+            run_calls.append(kwargs)
+            if self.output_path is not None:
+                self.output_path.write_bytes(b"12")
+            return b"pass-out", b"pass-err"
+
+    class FakeInput:
+        video = object()
+        audio = object()
+
+        def output(self, *args, **params):
+            del params
+            output_path = next(
+                (
+                    Path(arg)
+                    for arg in args
+                    if isinstance(arg, str) and arg.endswith(".mp4")
+                ),
+                None,
+            )
+            return FakeOutput(output_path)
+
+    monkeypatch.setattr(media.ffmpeg, "input", lambda _: FakeInput())
+    monkeypatch.setattr(
+        "builtins.print", lambda *args, **kwargs: printed.append((args, kwargs))
+    )
+
+    result = media.transcode_video(
+        str(input_file),
+        str(output_dir),
+        {
+            "extension": "mp4",
+            "passes": [
+                {"f": "null"},
+                {"c:v": "libx264"},
+            ],
+        },
+    )
+
+    assert result == str(output_dir / "converted.mp4")
+    assert run_calls == [
+        {"capture_stdout": True, "capture_stderr": True},
+        {
+            "capture_stdout": True,
+            "capture_stderr": True,
+            "overwrite_output": True,
+        },
+    ]
+    assert printed == []
+
+
+def test_transcode_video_prints_ffmpeg_output_on_error(
+    monkeypatch, tmp_path: Path
+) -> None:
+    input_file = tmp_path / "input.mp4"
+    input_file.write_bytes(b"12345")
+    output_dir = tmp_path / "video-out"
+    output_dir.mkdir()
+    printed: list[tuple[str, bool]] = []
+
+    class FakeOutput:
+        def run(self, **kwargs):
+            del kwargs
+            raise media.ffmpeg.Error("ffmpeg", b"video-stdout", b"video-stderr")
+
+    class FakeInput:
+        def output(self, *args, **params):
+            del args, params
+            return FakeOutput()
+
+    def fake_print(*args, **kwargs):
+        printed.append((str(args[0]), kwargs.get("file") is sys.stderr))
+
+    monkeypatch.setattr(media.ffmpeg, "input", lambda _: FakeInput())
+    monkeypatch.setattr("builtins.print", fake_print)
+
+    with pytest.raises(RuntimeError):
+        media.transcode_video(
+            str(input_file),
+            str(output_dir),
+            {"extension": "mp4", "c:v": "libx264"},
+        )
+
+    assert ("video-stderr", True) in printed
+    assert ("video-stdout", False) in printed


@@ -0,0 +1,513 @@
from __future__ import annotations

import asyncio
import json
import socketserver
import threading
import time
from datetime import UTC, datetime, timedelta
from http.server import BaseHTTPRequestHandler
from pathlib import Path

from repub.job_runner import generate_pangea_feed
from repub.jobs import JobArtifacts, JobRuntime, load_runs_view
from repub.model import (
    Job,
    JobExecution,
    JobExecutionStatus,
    Source,
    create_source,
    initialize_database,
)
from repub.web import create_app, get_job_runtime, render_execution_logs, render_runs

FIXTURE_FEED_PATH = (
    Path(__file__).resolve().parents[1] / "demo" / "fixtures" / "local-feed.rss"
).resolve()


def test_job_runtime_syncs_enabled_jobs_into_apscheduler(tmp_path: Path) -> None:
    initialize_database(tmp_path / "scheduler.db")
    enabled_source = create_source(
        name="Enabled source",
        slug="enabled-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=True,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/enabled.xml",
    )
    disabled_source = create_source(
        name="Disabled source",
        slug="disabled-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="15",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/disabled.xml",
    )
    enabled_job = Job.get(Job.source == enabled_source)
    disabled_job = Job.get(Job.source == disabled_source)

    runtime = JobRuntime(log_dir=tmp_path / "out" / "logs")
    try:
        runtime.start()
        runtime.sync_jobs()

        scheduled_ids = {job.id for job in runtime.scheduler.get_jobs()}
        assert f"job-{enabled_job.id}" in scheduled_ids
        assert f"job-{disabled_job.id}" not in scheduled_ids

        enabled_job.enabled = False
        enabled_job.save()
        runtime.sync_jobs()

        scheduled_ids = {job.id for job in runtime.scheduler.get_jobs()}
        assert f"job-{enabled_job.id}" not in scheduled_ids
    finally:
        runtime.shutdown()


def test_job_runtime_run_now_writes_log_and_stats_and_marks_success(
    tmp_path: Path,
) -> None:
    initialize_database(tmp_path / "run-now.db")
    source = create_source(
        name="Manual source",
        slug="manual-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url=FIXTURE_FEED_PATH.as_uri(),
    )
    job = Job.get(Job.source == source)

    runtime = JobRuntime(log_dir=tmp_path / "out" / "logs")
    try:
        runtime.start()
        execution_id = runtime.run_job_now(job.id, reason="manual")
        assert execution_id is not None

        execution = _wait_for_terminal_execution(execution_id)
        artifacts = JobArtifacts.for_execution(
            log_dir=tmp_path / "out" / "logs",
            job_id=job.id,
            execution_id=execution_id,
        )

        assert execution.running_status == JobExecutionStatus.SUCCEEDED
        assert execution.started_at is not None
        assert execution.ended_at is not None
        assert execution.requests_count > 0
        assert execution.items_count > 0
        assert execution.bytes_count > 0
        assert artifacts.log_path.exists()
        assert artifacts.stats_path.exists()

        output_path = tmp_path / "out" / "feeds" / "manual-source" / "feed.rss"
        assert output_path.exists()
        output_text = output_path.read_text(encoding="utf-8")
        assert "<title>Local Demo Feed</title>" in output_text
        assert "<title>Local Demo Entry</title>" in output_text

        stats_lines = [
            json.loads(line)
            for line in artifacts.stats_path.read_text(encoding="utf-8").splitlines()
        ]
        assert len(stats_lines) >= 2
        assert stats_lines[-1]["requests_count"] == execution.requests_count
    finally:
        runtime.shutdown()


def test_job_runtime_cancel_marks_execution_canceled(tmp_path: Path) -> None:
    initialize_database(tmp_path / "cancel.db")

    with _slow_feed_server() as feed_url:
        source = create_source(
            name="Cancelable source",
            slug="cancelable-source",
            source_type="feed",
            notes="",
            spider_arguments="",
            enabled=False,
            cron_minute="*/5",
            cron_hour="*",
            cron_day_of_month="*",
            cron_day_of_week="*",
            cron_month="*",
            feed_url=feed_url,
        )
        job = Job.get(Job.source == source)

        runtime = JobRuntime(log_dir=tmp_path / "out" / "logs")
        try:
            runtime.start()
            execution_id = runtime.run_job_now(job.id, reason="manual")
            assert execution_id is not None

            _wait_for_running_execution(execution_id)
            runtime.request_execution_cancel(execution_id)

            execution = _wait_for_terminal_execution(execution_id)
            artifacts = JobArtifacts.for_execution(
                log_dir=tmp_path / "out" / "logs",
                job_id=job.id,
                execution_id=execution_id,
            )

            assert execution.running_status == JobExecutionStatus.CANCELED
            assert execution.ended_at is not None
            assert execution.stop_requested_at is not None
            assert "graceful stop requested" in artifacts.log_path.read_text(
                encoding="utf-8"
            )
        finally:
            runtime.shutdown()


def test_job_runtime_start_reconciles_stale_running_execution(tmp_path: Path) -> None:
    initialize_database(tmp_path / "stale-running.db")
    source = create_source(
        name="Stale source",
        slug="stale-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/stale.xml",
    )
    job = Job.get(Job.source == source)
    execution = JobExecution.create(
        job=job,
        started_at="2026-03-30 12:30:00+00:00",
        running_status=JobExecutionStatus.RUNNING,
    )
    artifacts = JobArtifacts.for_execution(
        log_dir=tmp_path / "out" / "logs",
        job_id=job.id,
        execution_id=int(execution.get_id()),
    )
    artifacts.log_path.parent.mkdir(parents=True, exist_ok=True)
    artifacts.log_path.write_text(
        "worker: process lost during app restart\n",
        encoding="utf-8",
    )

    runtime = JobRuntime(log_dir=tmp_path / "out" / "logs")
    try:
        runtime.start()

        reconciled_execution = JobExecution.get_by_id(execution.get_id())
        assert reconciled_execution.running_status == JobExecutionStatus.FAILED
        assert reconciled_execution.ended_at is not None
        assert "marked failed after app restart" in artifacts.log_path.read_text(
            encoding="utf-8"
        )
    finally:
        runtime.shutdown()


def test_generate_pangea_feed_writes_pangea_rss_file(
    monkeypatch, tmp_path: Path
) -> None:
    class StubPangeaFeed:
        def __init__(self, config, feeds):
            self.config = config
            self.feed = feeds[0]

        def acquire_content(self) -> None:
            return None

        def generate_feed(self) -> None:
            return None

        def disgorge(self, slug: str):
            output_path = self.config.results.output_directory / slug / "pangea.rss"
            output_path.parent.mkdir(parents=True, exist_ok=True)
            output_path.write_text(
                "<rss><channel><title>Pangea Fixture</title></channel></rss>\n",
                encoding="utf-8",
            )
            return output_path

    monkeypatch.setattr(
        "repub.job_runner.pangea_feed_class",
        lambda: StubPangeaFeed,
    )

    output_path = generate_pangea_feed(
        name="Pangea source",
        slug="pangea-source",
        domain="example.org",
        category_name="News",
        content_type="articles",
        only_newest=True,
        max_articles=10,
        oldest_article=3,
        include_authors=True,
        exclude_media=False,
        include_content=True,
        content_format="MOBILE_3",
        out_dir=tmp_path / "out",
        log_path=tmp_path / "out" / "logs" / "pangea.log",
    )

    assert output_path == (tmp_path / "out" / "feeds" / "pangea-source" / "pangea.rss")
    assert output_path.exists()
    assert "Pangea Fixture" in output_path.read_text(encoding="utf-8")


def test_load_runs_view_humanizes_completed_execution_end_time(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "runs-view.db"
    log_dir = tmp_path / "out" / "logs"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    app = create_app()
    app.config["REPUB_LOG_DIR"] = log_dir
    source = create_source(
        name="Completed source",
        slug="completed-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/completed.xml",
    )
    job = Job.get(Job.source == source)
    reference_time = datetime(2026, 1, 15, 12, 0, tzinfo=UTC)
    ended_at = reference_time - timedelta(hours=2)
    JobExecution.create(
        job=job,
        running_status=JobExecutionStatus.SUCCEEDED,
        ended_at=ended_at,
    )

    view = load_runs_view(log_dir=app.config["REPUB_LOG_DIR"], now=reference_time)

    completed = view["completed"][0]
    assert completed["ended_at"] == "2 hours ago"
    assert completed["ended_at_iso"] == ended_at.isoformat()


def test_render_runs_uses_database_backed_jobs_and_executions(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "runs-page.db"
    log_dir = tmp_path / "out" / "logs"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    app = create_app()
    app.config["REPUB_LOG_DIR"] = log_dir
    source = create_source(
        name="Runs page source",
        slug="runs-page-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=True,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url=FIXTURE_FEED_PATH.as_uri(),
    )
    job = Job.get(Job.source == source)
    runtime = get_job_runtime(app)
    runtime.start()
    try:
        execution_id = runtime.run_job_now(job.id, reason="manual")
        assert execution_id is not None
        execution = _wait_for_terminal_execution(execution_id)

        async def run() -> None:
            body = str(await render_runs(app))

            assert "runs-page-source" in body
            assert "Running job executions" in body
            assert "Upcoming jobs" in body
            assert "Completed job executions" in body
            assert f"/job/{job.id}/execution/{execution.get_id()}/logs" in body
            assert "Succeeded" in body
            assert "Run now" in body

        asyncio.run(run())
    finally:
        runtime.shutdown()


def test_render_execution_logs_handles_missing_execution_and_missing_log_file(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "log-errors.db"
    log_dir = tmp_path / "out" / "logs"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    app = create_app()
    app.config["REPUB_LOG_DIR"] = log_dir
    source = create_source(
        name="Log source",
        slug="log-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/log-source.xml",
    )
    job = Job.get(Job.source == source)
    execution = JobExecution.create(
        job=job,
        running_status=JobExecutionStatus.FAILED,
    )

    async def run() -> None:
        missing_execution = str(
            await render_execution_logs(app, job_id=job.id, execution_id=9999)
        )
        missing_log = str(
            await render_execution_logs(app, job_id=job.id, execution_id=execution.id)
        )

        assert "Execution log unavailable" in missing_execution
        assert "Execution does not exist." in missing_execution
        assert "Execution log unavailable" in missing_log
        assert "Log file has not been created yet." in missing_log

    asyncio.run(run())


def test_delete_job_action_removes_source_job_and_execution_history(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "delete-job.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        client = app.test_client()
        source = create_source(
            name="Delete source",
            slug="delete-source",
            source_type="feed",
            notes="",
            spider_arguments="",
            enabled=True,
            cron_minute="*/30",
            cron_hour="*",
            cron_day_of_month="*",
            cron_day_of_week="*",
            cron_month="*",
            feed_url="https://example.com/delete.xml",
        )
        job = Job.get(Job.source == source)
        execution = JobExecution.create(
            job=job,
            running_status=JobExecutionStatus.SUCCEEDED,
        )

        response = await client.post(f"/actions/jobs/{job.id}/delete")

        assert response.status_code == 204
        assert Source.get_or_none(Source.slug == "delete-source") is None
        assert Job.get_or_none(id=job.id) is None
        assert JobExecution.get_or_none(id=int(execution.get_id())) is None

    asyncio.run(run())


def _wait_for_running_execution(
    execution_id: int, *, timeout_seconds: float = 2.0
) -> JobExecution:
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        execution = JobExecution.get_by_id(execution_id)
        if execution.running_status == JobExecutionStatus.RUNNING:
            return execution
        time.sleep(0.02)
    raise AssertionError(f"execution {execution_id} never entered RUNNING state")


def _wait_for_terminal_execution(
    execution_id: int, *, timeout_seconds: float = 4.0
) -> JobExecution:
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        execution = JobExecution.get_by_id(execution_id)
        if execution.running_status in {
            JobExecutionStatus.SUCCEEDED,
            JobExecutionStatus.FAILED,
            JobExecutionStatus.CANCELED,
        }:
            return execution
        time.sleep(0.02)
    raise AssertionError(f"execution {execution_id} did not finish in time")


class _SlowFeedRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:  # noqa: N802
        time.sleep(2.0)
        payload = FIXTURE_FEED_PATH.read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format: str, *args: object) -> None:
        del format, args


class _ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True


class _slow_feed_server:
    def __enter__(self) -> str:
        self._server = _ThreadedTCPServer(("127.0.0.1", 0), _SlowFeedRequestHandler)
        self._thread = threading.Thread(
            target=self._server.serve_forever,
            kwargs={"poll_interval": 0.01},
            daemon=True,
        )
        self._thread.start()
        host = str(self._server.server_address[0])
        port = int(self._server.server_address[1])
        return f"http://{host}:{port}/slow-feed.rss"

    def __exit__(self, exc_type, exc, tb) -> None:
        del exc_type, exc, tb
        self._server.shutdown()
        self._server.server_close()
        self._thread.join(timeout=1)

tests/test_web.py Normal file

@ -0,0 +1,924 @@
from __future__ import annotations
import asyncio
import os
from datetime import UTC, datetime, timedelta
from pathlib import Path
from typing import Any, cast
from repub.components import status_badge
from repub.datastar import RefreshBroker, render_sse_event, render_stream
from repub.jobs import load_dashboard_view
from repub.model import (
Job,
JobExecution,
JobExecutionStatus,
Source,
SourceFeed,
SourcePangea,
create_source,
)
from repub.pages.runs import runs_page
from repub.web import (
create_app,
get_refresh_broker,
render_create_source,
render_dashboard,
render_edit_source,
render_execution_logs,
render_runs,
render_sources,
)
def test_status_badge_uses_green_done_tone() -> None:
badge = str(status_badge(label="Succeeded", tone="done"))
assert "bg-emerald-100 text-emerald-800" in badge
assert "Succeeded" in badge
def test_runs_page_renders_completed_execution_end_time_as_relative_hoverable_time() -> (
None
):
ended_at = "2026-01-15T10:00:00+00:00"
body = str(
runs_page(
completed_executions=(
{
"source": "Completed source",
"slug": "completed-source",
"job_id": 7,
"execution_id": 42,
"ended_at": "2 hours ago",
"ended_at_iso": ended_at,
"status": "Succeeded",
"status_tone": "done",
"stats": "1 requests • 1 items • 1 bytes",
"summary": "Worker exited successfully",
"log_href": "/job/7/execution/42/logs",
},
)
)
)
assert "data-ended-at" in body
assert f'data-ended-at="{ended_at}"' in body
assert f'datetime="{ended_at}"' in body
assert f'title="{ended_at}"' in body
assert ">2 hours ago<" in body
def test_root_get_serves_datastar_shim() -> None:
    async def run() -> None:
        client = create_app().test_client()
        response = await client.get("/")
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert response.headers["ETag"]
        assert body.startswith("<!doctype html>")
        assert (
            '<script id="js" defer type="module" src="/static/datastar@1.0.0-RC.8.js"></script>'
            in body
        )
        assert 'data-signals:tabid="self.crypto.randomUUID().substring(0,8)"' in body
        assert 'data-init="@post(window.location.pathname +' in body
        assert "retryMaxCount: Infinity" in body
        assert "data-on:online__window=" in body
        assert '<main id="morph"' in body
        assert 'href="/sources"' in body
        assert 'href="/runs"' in body
        assert "Connecting" in body

    asyncio.run(run())


def test_create_app_bootstraps_default_database_path(
    monkeypatch, tmp_path: Path
) -> None:
    monkeypatch.chdir(tmp_path)
    app = create_app()
    assert Path(app.config["REPUB_DB_PATH"]) == tmp_path / "republisher.db"
    assert (tmp_path / "republisher.db").exists()


def test_root_get_honors_if_none_match() -> None:
    async def run() -> None:
        client = create_app().test_client()
        initial = await client.get("/")
        etag = initial.headers["ETag"]
        response = await client.get("/", headers={"If-None-Match": etag})
        assert response.status_code == 304
        assert response.headers["ETag"] == etag

    asyncio.run(run())


def test_dashboard_post_serves_morph_component() -> None:
    async def run() -> None:
        client = create_app().test_client()
        async with client.request("/?u=shim", method="POST") as connection:
            await connection.send_complete()
            chunk = await asyncio.wait_for(connection.receive(), timeout=1)
            raw_connection = cast(Any, connection)
            assert raw_connection.status_code == 200
            assert raw_connection.headers["Content-Type"] == "text/event-stream"
            assert b"event: datastar-patch-elements" in chunk
            assert b"id: " in chunk
            assert b'<main id="morph"' in chunk
            assert b"Operational snapshot" in chunk
            assert b"Running executions" in chunk
            await connection.disconnect()

    asyncio.run(run())


def test_render_sse_event_skips_unchanged_view() -> None:
    async def run() -> None:
        async def render() -> str:
            return '<main id="morph">same</main>'

        event_id, event = await render_sse_event(render)
        repeated_id, repeated_event = await render_sse_event(
            render, last_event_id=event_id
        )
        assert repeated_id == event_id
        assert event is not None
        assert repeated_event is None

    asyncio.run(run())


def test_app_refresh_broker_publishes_events() -> None:
    async def run() -> None:
        app = create_app()
        broker = get_refresh_broker(app)
        queue = broker.subscribe()
        broker.publish()
        event = await asyncio.wait_for(queue.get(), timeout=1)
        assert event == "refresh-event"
        broker.unsubscribe(queue)

    asyncio.run(run())


def test_render_stream_yields_on_connect_and_refresh() -> None:
    async def run() -> None:
        queue = RefreshBroker().subscribe()
        renders = 0

        async def render() -> str:
            nonlocal renders
            renders += 1
            return f'<main id="morph">{renders}</main>'

        stream = render_stream(queue, render)
        first = await anext(stream)
        await queue.put("refresh-event")
        second = await anext(stream)
        await stream.aclose()
        assert "1</main>" in first
        assert "2</main>" in second

    asyncio.run(run())


def test_render_dashboard_shows_dashboard_information_architecture(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "dashboard-render.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        body = str(await render_dashboard(app))
        assert "Operational snapshot" in body
        assert "Running executions" in body
        assert "Published feeds" in body
        assert 'href="/sources"' in body
        assert 'href="/runs"' in body
        assert "Create source" in body

    asyncio.run(run())


def test_render_dashboard_shows_empty_state_rows(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "dashboard-empty.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        body = str(await render_dashboard(app))
        assert "No job executions are running." in body
        assert "No feeds have been published yet." in body

    asyncio.run(run())


def test_load_dashboard_view_measures_log_artifact_path(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "dashboard-footprint.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    create_app()
    out_dir = tmp_path / "out"
    log_dir = out_dir / "logs"
    cache_dir = out_dir / "httpcache"
    log_dir.mkdir(parents=True)
    cache_dir.mkdir(parents=True)
    (log_dir / "run.log").write_bytes(b"x" * 1024)
    (cache_dir / "cache.bin").write_bytes(b"y" * 2048)
    snapshot = load_dashboard_view(log_dir=log_dir)["snapshot"]
    assert cast(dict[str, str], snapshot)["artifact_footprint"] == "3.0 KB"


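The expected `"3.0 KB"` above equals the 1024-byte log plus the 2048-byte cache file, which suggests the footprint covers the whole output directory rather than `logs` alone. A minimal sketch of that arithmetic, assuming binary (1024-based) units and one decimal place; `sketch_directory_footprint` and `sketch_humanize_size` are hypothetical helpers, not functions from `repub.jobs`.

```python
from pathlib import Path


def sketch_directory_footprint(root: Path) -> int:
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(path.stat().st_size for path in root.rglob("*") if path.is_file())


def sketch_humanize_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based units (an assumed convention)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = float(num_bytes)
    for unit in ("KB", "MB", "GB"):
        size /= 1024
        if size < 1024:
            return f"{size:.1f} {unit}"
    # Anything at or above 1024 GB is reported in terabytes.
    return f"{size / 1024:.1f} TB"
```

With this convention an empty or missing directory formats as `0 B`, matching the value the feed-artifact test expects for a source that was never published.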
def test_render_dashboard_describes_log_artifact_footprint(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "dashboard-footprint-copy.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        body = str(await render_dashboard(app))
        assert "Current artifact size under the output path." in body

    asyncio.run(run())


def test_load_dashboard_view_lists_source_feed_artifacts(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "dashboard-feeds.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    app = create_app()
    out_dir = tmp_path / "out"
    log_dir = out_dir / "logs"
    app.config["REPUB_LOG_DIR"] = log_dir
    log_dir.mkdir(parents=True)
    create_source(
        name="Available source",
        slug="available-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/available.xml",
    )
    create_source(
        name="Missing source",
        slug="missing-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/missing.xml",
    )
    feed_dir = out_dir / "feeds" / "available-source"
    feed_dir.mkdir(parents=True)
    feed_path = feed_dir / "feed.rss"
    feed_path.write_bytes(b"x" * 1024)
    (feed_dir / "audio.mp3").write_bytes(b"y" * 2048)
    reference_time = datetime(2026, 3, 30, 12, 30, tzinfo=UTC)
    updated_at = reference_time - timedelta(minutes=32)
    updated_at_epoch = updated_at.timestamp()
    os.utime(feed_path, (updated_at_epoch, updated_at_epoch))
    source_feeds = cast(
        tuple[dict[str, object], ...],
        load_dashboard_view(log_dir=log_dir, now=reference_time)["source_feeds"],
    )
    assert source_feeds == (
        {
            "source": "Available source",
            "slug": "available-source",
            "feed_href": "/feeds/available-source/feed.rss",
            "feed_status_label": "Available",
            "feed_status_tone": "done",
            "feed_exists": True,
            "last_updated": "32 minutes ago",
            "last_updated_iso": updated_at.isoformat(),
            "artifact_footprint": "3.0 KB",
        },
        {
            "source": "Missing source",
            "slug": "missing-source",
            "feed_href": "/feeds/missing-source/feed.rss",
            "feed_status_label": "Missing",
            "feed_status_tone": "failed",
            "feed_exists": False,
            "last_updated": "Never published",
            "last_updated_iso": None,
            "artifact_footprint": "0 B",
        },
    )


def test_render_dashboard_shows_source_feed_links_and_statuses(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "dashboard-feed-links.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    app = create_app()
    app.config["REPUB_LOG_DIR"] = tmp_path / "out" / "logs"
    create_source(
        name="Published source",
        slug="published-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/published.xml",
    )
    create_source(
        name="Missing source",
        slug="missing-source",
        source_type="feed",
        notes="",
        spider_arguments="",
        enabled=False,
        cron_minute="*/5",
        cron_hour="*",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        feed_url="https://example.com/missing.xml",
    )

    async def run() -> None:
        published_feed = tmp_path / "out" / "feeds" / "published-source" / "feed.rss"
        published_feed.parent.mkdir(parents=True)
        published_feed.write_text("<rss/>\n", encoding="utf-8")
        body = str(await render_dashboard(app))
        assert "Published feeds" in body
        assert 'href="/feeds/published-source/feed.rss"' in body
        assert 'href="/feeds/missing-source/feed.rss"' in body
        assert "Available" in body
        assert "Missing" in body
        assert "Never published" in body

    asyncio.run(run())


def test_render_sources_shows_table_and_create_link() -> None:
    async def run() -> None:
        body = str(await render_sources())
        assert ">Sources<" in body
        assert 'href="/sources/create"' in body
        assert "No sources yet." in body
        assert "guardian-feed" not in body
        assert "podcast-audio" not in body

    asyncio.run(run())


def test_render_create_source_shows_dedicated_form_page() -> None:
    async def run() -> None:
        body = str(await render_create_source())
        assert ">Create source<" in body
        assert "Source and job setup" in body
        assert "data-signals__ifmissing" in body
        assert "/actions/sources/create" in body
        assert 'data-show="$sourceType === &#39;feed&#39;"' in body
        assert 'data-show="$sourceType === &#39;pangea&#39;"' in body
        assert "jobEnabled" in body
        assert "onlyNewest" in body
        assert "includeAuthors" in body
        assert "excludeMedia" in body
        assert "includeContent" in body
        assert "TEXT_ONLY" in body
        assert "breakingnews" in body
        assert "Pangea domain" in body
        assert "Feed URL" in body
        assert "Cron schedule" in body
        assert "Initial job state" in body
        assert "Pangea mobile articles" not in body
        assert "pangea-mobile" not in body
        assert "guardianproject.info" not in body
        assert (
            "Primary Pangea mobile article mirror for the operator landing page."
            not in body
        )
        assert "language=en,download_media=true" not in body
        assert "language=en\ndownload_media=true" in body
        assert 'value="articles"' in body
        assert 'value="10"' in body
        assert 'value="3"' in body
        assert 'value="*/30"' in body
        assert 'value="*"' in body

    asyncio.run(run())


def test_render_edit_source_shows_existing_values(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "edit-page.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    create_app()
    create_source(
        name="Kenya health desk",
        slug="kenya-health",
        source_type="pangea",
        notes="Regional health alerts.",
        spider_arguments="language=en\ndownload_media=true",
        enabled=True,
        cron_minute="0",
        cron_hour="*/6",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        pangea_domain="example.org",
        pangea_category="Health",
        content_type="breakingnews",
        only_newest=True,
        max_articles=12,
        oldest_article=5,
        include_authors=True,
        exclude_media=False,
        include_content=True,
        content_format="MOBILE_3",
    )

    async def run() -> None:
        body = str(await render_edit_source("kenya-health"))
        assert "Edit source" in body
        assert "/actions/sources/kenya-health/edit" in body
        assert "Kenya health desk" in body
        assert "kenya-health" in body
        assert 'id="source-slug"' in body
        assert (
            'id="source-slug" name="source-slug" type="text" value="kenya-health"'
            in body
        )
        assert " disabled " in body
        assert "cursor-not-allowed bg-slate-100 text-slate-500" in body
        assert "example.org" in body
        assert "Health" in body
        assert "language=en\ndownload_media=true" in body

    asyncio.run(run())


def test_create_source_action_creates_pangea_source_and_job_in_database(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "sources.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        client = app.test_client()
        response = await client.post(
            "/actions/sources/create",
            headers={"Datastar-Request": "true"},
            json={
                "sourceName": "Kenya health desk",
                "sourceSlug": "kenya-health",
                "sourceType": "pangea",
                "pangeaDomain": "example.org",
                "pangeaCategory": "Health",
                "contentFormat": "MOBILE_3",
                "contentType": "breakingnews",
                "maxArticles": "12",
                "oldestArticle": "5",
                "sourceNotes": "Regional health alerts.",
                "spiderArguments": "language=en\ndownload_media=true",
                "cronMinute": "0",
                "cronHour": "*/6",
                "cronDayOfMonth": "*",
                "cronDayOfWeek": "*",
                "cronMonth": "*",
                "jobEnabled": True,
                "onlyNewest": True,
                "includeAuthors": True,
                "excludeMedia": False,
            },
        )
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert "window.location = '/sources'" in body
        source = Source.get(Source.slug == "kenya-health")
        pangea = SourcePangea.get(SourcePangea.source == source)
        job = Job.get(Job.source == source)
        rendered_sources = str(await render_sources(app))
        assert source.name == "Kenya health desk"
        assert source.source_type == "pangea"
        assert pangea.content_type == "breakingnews"
        assert pangea.include_content is True
        assert job.enabled is True
        assert job.spider_arguments == "language=en\ndownload_media=true"
        assert job.cron_hour == "*/6"
        assert "kenya-health" in rendered_sources
        assert "example.org / Health" in rendered_sources
        assert "Enabled" in rendered_sources

    asyncio.run(run())


def test_create_source_action_creates_feed_source_and_job_in_database(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "feed-sources.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        client = app.test_client()
        response = await client.post(
            "/actions/sources/create",
            headers={"Datastar-Request": "true"},
            json={
                "sourceName": "NASA feed",
                "sourceSlug": "nasa-feed",
                "sourceType": "feed",
                "feedUrl": "https://www.nasa.gov/rss/dyn/breaking_news.rss",
                "sourceNotes": "Primary NASA mirror.",
                "spiderArguments": "",
                "cronMinute": "30",
                "cronHour": "*",
                "cronDayOfMonth": "*",
                "cronDayOfWeek": "*",
                "cronMonth": "*",
                "jobEnabled": False,
            },
        )
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert "window.location = '/sources'" in body
        source = Source.get(Source.slug == "nasa-feed")
        feed = SourceFeed.get(SourceFeed.source == source)
        job = Job.get(Job.source == source)
        rendered_sources = str(await render_sources(app))
        assert source.source_type == "feed"
        assert feed.feed_url == "https://www.nasa.gov/rss/dyn/breaking_news.rss"
        assert job.enabled is False
        assert "nasa-feed" in rendered_sources
        assert "https://www.nasa.gov/rss/dyn/breaking_news.rss" in rendered_sources
        assert "Disabled" in rendered_sources

    asyncio.run(run())


def test_edit_source_action_updates_existing_source_and_job_in_database(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "edit-source.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    create_app()
    create_source(
        name="Kenya health desk",
        slug="kenya-health",
        source_type="pangea",
        notes="Regional health alerts.",
        spider_arguments="language=en\ndownload_media=true",
        enabled=True,
        cron_minute="0",
        cron_hour="*/6",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        pangea_domain="example.org",
        pangea_category="Health",
        content_type="breakingnews",
        only_newest=True,
        max_articles=12,
        oldest_article=5,
        include_authors=True,
        exclude_media=False,
        include_content=True,
        content_format="MOBILE_3",
    )

    async def run() -> None:
        app = create_app()
        client = app.test_client()
        response = await client.post(
            "/actions/sources/kenya-health/edit",
            headers={"Datastar-Request": "true"},
            json={
                "sourceName": "Kenya health desk nightly",
                "sourceSlug": "kenya-health",
                "sourceType": "pangea",
                "pangeaDomain": "example.org",
                "pangeaCategory": "Nightly",
                "contentFormat": "TEXT_ONLY",
                "contentType": "articles",
                "maxArticles": "25",
                "oldestArticle": "7",
                "sourceNotes": "Updated nightly run.",
                "spiderArguments": "language=sw\ninclude_audio=false",
                "cronMinute": "15",
                "cronHour": "2",
                "cronDayOfMonth": "*",
                "cronDayOfWeek": "*",
                "cronMonth": "*",
                "jobEnabled": False,
                "onlyNewest": False,
                "includeAuthors": False,
                "excludeMedia": True,
                "includeContent": True,
            },
        )
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert "window.location = '/sources'" in body
        source = Source.get(Source.slug == "kenya-health")
        pangea = SourcePangea.get(SourcePangea.source == source)
        job = Job.get(Job.source == source)
        rendered_sources = str(await render_sources(app))
        assert source.name == "Kenya health desk nightly"
        assert source.notes == "Updated nightly run."
        assert pangea.category_name == "Nightly"
        assert pangea.content_format == "TEXT_ONLY"
        assert pangea.max_articles == 25
        assert pangea.include_authors is False
        assert pangea.exclude_media is True
        assert job.enabled is False
        assert job.spider_arguments == "language=sw\ninclude_audio=false"
        assert job.cron_hour == "2"
        assert "Kenya health desk nightly" in rendered_sources
        assert "example.org / Nightly" in rendered_sources
        assert "Disabled" in rendered_sources

    asyncio.run(run())


def test_edit_source_action_rejects_slug_changes(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "edit-invalid.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))
    create_app()
    create_source(
        name="Kenya health desk",
        slug="kenya-health",
        source_type="pangea",
        notes="Regional health alerts.",
        spider_arguments="language=en\ndownload_media=true",
        enabled=True,
        cron_minute="0",
        cron_hour="*/6",
        cron_day_of_month="*",
        cron_day_of_week="*",
        cron_month="*",
        pangea_domain="example.org",
        pangea_category="Health",
        content_type="breakingnews",
        only_newest=True,
        max_articles=12,
        oldest_article=5,
        include_authors=True,
        exclude_media=False,
        include_content=True,
        content_format="MOBILE_3",
    )

    async def run() -> None:
        app = create_app()
        client = app.test_client()
        response = await client.post(
            "/actions/sources/kenya-health/edit",
            headers={"Datastar-Request": "true"},
            json={
                "sourceName": "Kenya health desk",
                "sourceSlug": "kenya-health-renamed",
                "sourceType": "pangea",
                "pangeaDomain": "example.org",
                "pangeaCategory": "Health",
                "contentFormat": "MOBILE_3",
                "contentType": "breakingnews",
                "maxArticles": "12",
                "oldestArticle": "5",
                "sourceNotes": "Regional health alerts.",
                "spiderArguments": "language=en\ndownload_media=true",
                "cronMinute": "0",
                "cronHour": "*/6",
                "cronDayOfMonth": "*",
                "cronDayOfWeek": "*",
                "cronMonth": "*",
                "jobEnabled": True,
                "onlyNewest": True,
                "includeAuthors": True,
                "excludeMedia": False,
                "includeContent": True,
            },
        )
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert "Slug is immutable." in body
        assert Source.get(Source.slug == "kenya-health").name == "Kenya health desk"
        assert Source.select().where(Source.slug == "kenya-health-renamed").count() == 0

    asyncio.run(run())


def test_create_source_action_validates_duplicate_slug_and_pangea_type(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "duplicate.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        Source.create(
            name="Guardian feed mirror",
            slug="guardian-feed",
            source_type="feed",
        )
        client = app.test_client()
        response = await client.post(
            "/actions/sources/create",
            headers={"Datastar-Request": "true"},
            json={
                "sourceName": "Duplicate guardian",
                "sourceSlug": "guardian-feed",
                "sourceType": "pangea",
                "pangeaDomain": "example.org",
                "pangeaCategory": "News",
                "contentFormat": "WEB",
                "contentType": "not-a-real-type",
                "maxArticles": "ten",
                "oldestArticle": "3",
                "cronMinute": "0",
                "cronHour": "*",
                "cronDayOfMonth": "*",
                "cronDayOfWeek": "*",
                "cronMonth": "*",
                "jobEnabled": True,
            },
        )
        body = await response.get_data(as_text=True)
        assert response.status_code == 200
        assert "Slug must be unique." in body
        assert "Content format is invalid." in body
        assert "Content type is invalid." in body
        assert "Max articles must be an integer." in body
        assert Source.select().where(Source.name == "Duplicate guardian").count() == 0

    asyncio.run(run())


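The validation test above expects every problem to be reported in one response instead of stopping at the first failure. A minimal sketch of that collect-all-errors approach follows. `sketch_validate_source` is a hypothetical function, and the allowed content formats and types are only the values that appear in these tests, not a confirmed list from the project.

```python
def sketch_validate_source(
    payload: dict[str, object],
    existing_slugs: set[str],
    *,
    # Assumption: only values seen in the tests are treated as valid here.
    content_formats: frozenset[str] = frozenset({"MOBILE_3", "TEXT_ONLY"}),
    content_types: frozenset[str] = frozenset({"articles", "breakingnews"}),
) -> list[str]:
    """Return every validation message that applies to the payload."""
    errors: list[str] = []
    if payload.get("sourceSlug") in existing_slugs:
        errors.append("Slug must be unique.")
    if payload.get("sourceType") == "pangea":
        if payload.get("contentFormat") not in content_formats:
            errors.append("Content format is invalid.")
        if payload.get("contentType") not in content_types:
            errors.append("Content type is invalid.")
        try:
            # Form values arrive as strings, so parse before accepting them.
            int(str(payload.get("maxArticles", "")))
        except ValueError:
            errors.append("Max articles must be an integer.")
    return errors
```

Returning a list rather than raising on the first problem lets the form show all messages at once, which is consistent with the test asserting four messages in a single response body.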
def test_render_runs_shows_running_upcoming_and_completed_tables(
    monkeypatch, tmp_path: Path
) -> None:
    db_path = tmp_path / "runs-render.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        source = create_source(
            name="Runs render source",
            slug="runs-render-source",
            source_type="feed",
            notes="",
            spider_arguments="",
            enabled=True,
            cron_minute="*/30",
            cron_hour="*",
            cron_day_of_month="*",
            cron_day_of_week="*",
            cron_month="*",
            feed_url="https://example.com/runs.xml",
        )
        job = Job.get(Job.source == source)
        execution = JobExecution.create(
            job=job,
            running_status=JobExecutionStatus.SUCCEEDED,
        )
        body = str(await render_runs(app))
        assert "Running job executions" in body
        assert "Upcoming jobs" in body
        assert "Completed job executions" in body
        assert "runs-render-source" in body
        assert f"/job/{job.id}/execution/{execution.get_id()}/logs" in body
        assert "data-next-run-at" in body
        assert "in " in body
        assert "Already running" not in body

    asyncio.run(run())


def test_render_runs_shows_empty_state_rows(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "runs-empty.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        app = create_app()
        body = str(await render_runs(app))
        assert body.count("No job executions are running.") == 1
        assert "No jobs are scheduled." in body
        assert "No job executions have completed yet." in body

    asyncio.run(run())


def test_render_execution_logs_uses_app_route(monkeypatch, tmp_path: Path) -> None:
    db_path = tmp_path / "logs-render.db"
    monkeypatch.setenv("REPUBLISHER_DB_PATH", str(db_path))

    async def run() -> None:
        log_dir = tmp_path / "out" / "logs"
        app = create_app()
        app.config["REPUB_LOG_DIR"] = log_dir
        source = create_source(
            name="Log render source",
            slug="log-render-source",
            source_type="feed",
            notes="",
            spider_arguments="",
            enabled=False,
            cron_minute="*/30",
            cron_hour="*",
            cron_day_of_month="*",
            cron_day_of_week="*",
            cron_month="*",
            feed_url="https://example.com/logs.xml",
        )
        job = Job.get(Job.source == source)
        execution = JobExecution.create(
            job=job,
            running_status=JobExecutionStatus.RUNNING,
        )
        log_path = log_dir / f"job-{job.id}-execution-{execution.get_id()}.log"
        log_path.parent.mkdir(parents=True, exist_ok=True)
        log_path.write_text(
            "\n".join(
                (
                    "scheduler: run_now requested",
                    "worker: starting simulated crawl",
                    "worker: waiting for more log lines ...",
                )
            ),
            encoding="utf-8",
        )
        body = str(
            await render_execution_logs(
                app, job_id=job.id, execution_id=int(execution.get_id())
            )
        )
        assert f"Job {job.id} / execution {execution.get_id()}" in body
        assert f"/job/{job.id}/execution/{execution.get_id()}/logs" in body
        assert "waiting for more log lines" in body

    asyncio.run(run())

14
uv.lock generated

@ -504,6 +504,18 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/07/c6/80c95b1b2b94682a72cbdbfb85b81ae2daffa4291fbfa1b1464502ede10d/hpack-4.1.0-py3-none-any.whl", hash = "sha256:157ac792668d995c657d93111f46b4535ed114f0c9c8d672271bbec7eae1b496", size = 34357, upload-time = "2025-01-22T21:44:56.92Z" },
]

[[package]]
name = "htpy"
version = "25.12.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "markupsafe" },
]
sdist = { url = "https://files.pythonhosted.org/packages/b6/23/e00bbc355e70444d16c90a0f1fdce108c67379fe65e9312cd026c13db976/htpy-25.12.0.tar.gz", hash = "sha256:7d3f4aaa10b35c5e46dfa804df1f3f18772caf8efee6e6a035b5dee89a5d6af8", size = 291259, upload-time = "2025-12-01T20:35:01.666Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/61/f1/a2f2caf14b03e7fab4801ac6018a4ac996de3e82a573e7aa21f3cb11a7cc/htpy-25.12.0-py3-none-any.whl", hash = "sha256:642e69278d6f8f4643acc2d2d13c21682ceb5fb4860ecbbce042f171577fff54", size = 21141, upload-time = "2025-12-01T20:35:00.13Z" },
]

[[package]]
name = "hypercorn"
version = "0.18.0"
@ -1077,6 +1089,7 @@ dependencies = [
    { name = "feedparser" },
    { name = "ffmpeg-python" },
    { name = "greenlet" },
    { name = "htpy" },
    { name = "lxml" },
    { name = "peewee" },
    { name = "pillow" },
@ -1108,6 +1121,7 @@ requires-dist = [
    { name = "feedparser", specifier = ">=6.0.11,<7.0.0" },
    { name = "ffmpeg-python", specifier = ">=0.2.0,<0.3.0" },
    { name = "greenlet", specifier = ">=3.2.4,<4.0.0" },
    { name = "htpy", specifier = ">=25.12.0,<26.0.0" },
    { name = "lxml", specifier = ">=5.2.1,<6.0.0" },
    { name = "peewee", specifier = ">=3.19.0,<4.0.0" },
    { name = "pillow", specifier = ">=10.3.0,<11.0.0" },