archivebox.crawls.models
Module Contents
Classes
- Seed: A fountain that produces URLs (+metadata) each time it’s queried. Each query of a Seed can produce the same list of URLs, or a different list each time; the list it returns is used to create a new Crawl and seed it with new pending Snapshots.
- CrawlSchedule: A record for a job that should run repeatedly on a given schedule.
- CrawlQuerySet: Enhanced QuerySet for Crawl that adds some useful methods.
- Crawl: A single session of URLs to archive starting from a given Seed and expanding outwards. An “archiving session”, so to speak.
- Outlink: A record of a link found on a page, pointing to another page.
API
- class archivebox.crawls.models.Seed(*args: Any, **kwargs: Any)[source]
Bases: base_models.models.ModelWithReadOnlyFields, base_models.models.ModelWithSerializers, base_models.models.ModelWithUUID, base_models.models.ModelWithKVTags, base_models.models.ABIDModel, base_models.models.ModelWithOutputDir, base_models.models.ModelWithConfig, base_models.models.ModelWithNotes, base_models.models.ModelWithHealthStats

A fountain that produces URLs (+metadata) each time it’s queried, e.g.:
- file:///data/sources/2024-01-02_11-57-51__cli_add.txt
- file:///data/sources/2024-01-02_11-57-51__web_ui_add.txt
- file:///Users/squash/Library/Application Support/Google/Chrome/Default/Bookmarks
- https://getpocket.com/user/nikisweeting/feed
- https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml
- …

Each query of a Seed can produce the same list of URLs, or a different list each time. The list of URLs it returns is used to create a new Crawl and seed it with new pending Snapshots.
When a crawl is created, a root_snapshot is initially created with a URI set to the Seed URI. The seed’s preferred extractor is executed on that URI, which produces an ArchiveResult containing outlinks. The outlinks then get turned into new pending Snapshots under the same crawl, and the cycle repeats until Crawl.max_depth.
Each consumption of a Seed by an Extractor can produce new URLs, as Seeds can point to stateful remote services, files whose contents change, directories that gain new files, etc.
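To make the expansion cycle concrete, here is a minimal sketch of the loop described above. Only the Seed → root Snapshot → outlinks → new pending Snapshots flow comes from this docstring; run_extractor() and the .outlinks attribute are hypothetical stand-ins, not real ArchiveBox APIs.

```python
# Minimal sketch of the crawl expansion cycle described above.
# run_extractor() and archiveresult.outlinks are hypothetical stand-ins.

def expand_crawl(crawl):
    # Depth 0: the root Snapshot is created from the Seed's URI.
    pending = [crawl.create_root_snapshot()]
    depth = 0

    while pending and depth < crawl.max_depth:
        next_pending = []
        for snapshot in pending:
            archiveresult = run_extractor(snapshot)  # hypothetical helper
            for url in archiveresult.outlinks:       # hypothetical attribute
                # Each outlink becomes a new pending Snapshot in the same Crawl.
                next_pending.append(crawl.snapshot_set.create(url=url))
        pending = next_pending
        depth += 1
```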
Initialization
Overridden __init__ method ensures we have a stable creation timestamp that fields can use within initialization code before saving to the DB.
- output_dir_symlinks[source]
[('index.json', 'self.as_json()'), ('config.toml', 'benedict(self.config).as_toml()'), ('seed/', 'se…
- classmethod from_file(source_file: pathlib.Path, label: str = '', parser: str = 'auto', tag: str = '', created_by: int | None = None, config: dict | None = None)[source]
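A usage sketch of from_file() following the signature above; the source path, label, and tag values are illustrative placeholders.

```python
from pathlib import Path

from archivebox.crawls.models import Seed

# Sketch based on the from_file() signature above; the path, label,
# and tag are illustrative placeholders, not required values.
seed = Seed.from_file(
    source_file=Path('/data/sources/2024-01-02_11-57-51__cli_add.txt'),
    label='CLI add on 2024-01-02',
    parser='auto',      # auto-detect the file format
    tag='imported',
    created_by=None,    # fall back to the default user
    config=None,
)
```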
- property scheduled_crawl_set: django.db.models.QuerySet[crawls.models.CrawlSchedule][source]
- property snapshot_set: django.db.models.QuerySet[core.models.Snapshot][source]
- class archivebox.crawls.models.CrawlSchedule(*args: Any, **kwargs: Any)[source]
Bases: base_models.models.ModelWithReadOnlyFields, base_models.models.ModelWithSerializers, base_models.models.ModelWithUUID, base_models.models.ModelWithKVTags, base_models.models.ABIDModel, base_models.models.ModelWithNotes, base_models.models.ModelWithHealthStats
A record for a job that should run repeatedly on a given schedule.
It pulls from a given Seed and creates a new Crawl for each scheduled run. The new Crawl will inherit all the properties of the crawl_template Crawl.
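A sketch of wiring a Seed to a recurring crawl. This reference does not show CrawlSchedule’s fields, so the schedule and template field names below are assumptions; only the Seed → CrawlSchedule → Crawl relationship is documented here.

```python
from archivebox.crawls.models import Crawl, CrawlSchedule, Seed

# The `schedule` and `template` field names are assumptions not confirmed
# by this reference; only the Seed -> CrawlSchedule -> Crawl flow is.
seed = Seed.objects.get(uri='https://example.com/some/rss.xml')
template_crawl = Crawl.from_seed(seed, max_depth=1)  # documented classmethod

schedule = CrawlSchedule.objects.create(
    schedule='0 */6 * * *',    # assumed crontab-style schedule field
    template=template_crawl,   # assumed FK to the crawl_template Crawl
)
# Each scheduled run would then create a new Crawl that inherits the
# template crawl's properties, as described above.
```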
Initialization
Overridden __init__ method ensures we have a stable creation timestamp that fields can use within initialization code before saving to the DB.
- property snapshot_set: django.db.models.QuerySet[core.models.Snapshot][source]
- class archivebox.crawls.models.CrawlQuerySet[source]
Bases:
django.db.models.QuerySet
Enhanced QuerySet for Crawl that adds some useful methods.
To get all the snapshots for a given set of Crawls: Crawl.objects.filter(seed__uri='https://example.com/some/rss.xml').snapshots() -> QuerySet[Snapshot]
To get all the archiveresults for a given set of Crawls: Crawl.objects.filter(seed__uri='https://example.com/some/rss.xml').archiveresults() -> QuerySet[ArchiveResult]
To export the list of Crawls as CSV or JSON: Crawl.objects.filter(seed__uri='https://example.com/some/rss.xml').export_as_csv() -> str, or .export_as_json() -> str
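The docstring’s examples above, consolidated into one runnable sketch; every method name comes from this class, and the feed URI is the docstring’s own example value.

```python
from archivebox.crawls.models import Crawl

# All method names come from the CrawlQuerySet docstring above;
# the feed URI is the docstring's own example value.
crawls = Crawl.objects.filter(seed__uri='https://example.com/some/rss.xml')

snapshots = crawls.snapshots()            # QuerySet[Snapshot]
archiveresults = crawls.archiveresults()  # QuerySet[ArchiveResult]

csv_export = crawls.export_as_csv()       # str
json_export = crawls.export_as_json()     # str
```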
- snapshots(**filter_kwargs) django.db.models.QuerySet[core.models.Snapshot] [source]
- archiveresults() django.db.models.QuerySet[core.models.ArchiveResult] [source]
- class archivebox.crawls.models.Crawl(*args: Any, **kwargs: Any)[source]
Bases: base_models.models.ModelWithReadOnlyFields, base_models.models.ModelWithSerializers, base_models.models.ModelWithUUID, base_models.models.ModelWithKVTags, base_models.models.ABIDModel, base_models.models.ModelWithOutputDir, base_models.models.ModelWithConfig, base_models.models.ModelWithHealthStats, workers.models.ModelWithStateMachine
A single session of URLs to archive starting from a given Seed and expanding outwards. An “archiving session”, so to speak.

A new Crawl should be created for each load from a Seed (because a Seed can produce a different set of URLs every time it’s loaded). E.g. every scheduled import from an RSS feed should create a new Crawl, and each subsequent load from the same Seed creates a new Crawl.

Every “Add” task triggered from the Web UI, CLI, or Scheduled Crawl should create a new Crawl with the seed set to a file URI, e.g. file:///sources/_{ui,cli}_add.txt containing the user’s input.

Initialization

Overridden __init__ method ensures we have a stable creation timestamp that fields can use within initialization code before saving to the DB.
- output_dir_template[source]
'archive/crawls/{getattr(crawl, crawl.abid_ts_src).strftime("%Y%m%d")}/{crawl.abid}'
- output_dir_symlinks[source]
[('index.json', 'self.as_json'), ('seed/', 'self.seed.output_dir'), ('persona/', 'self.persona.outpu…
- snapshot_set: django.db.models.Manager[core.models.Snapshot][source]
None
- classmethod from_seed(seed: archivebox.crawls.models.Seed, max_depth: int = 0, persona: str = 'Default', tags_str: str = '', config: dict | None = None, created_by: int | None = None)[source]
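A usage sketch of from_seed() following the signature above; the Seed lookup and the argument values are illustrative.

```python
from archivebox.crawls.models import Crawl, Seed

# Sketch based on the from_seed() signature above; the lookup and
# argument values are illustrative placeholders.
seed = Seed.objects.get(uri='https://example.com/some/rss.xml')

crawl = Crawl.from_seed(
    seed,
    max_depth=2,          # follow outlinks up to two hops from the root
    persona='Default',
    tags_str='rss,news',
    config=None,
    created_by=None,      # fall back to the default user
)
```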
- property template[source]
If this crawl was created under a CrawlSchedule, returns the original template Crawl it was based on.
- pending_snapshots() django.db.models.QuerySet[core.models.Snapshot] [source]
- pending_archiveresults() django.db.models.QuerySet[core.models.ArchiveResult] [source]
- create_root_snapshot() core.models.Snapshot [source]
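Putting the documented methods together: a sketch of driving one crawl by hand. Every call below appears in this class’s API above; in practice the workers’ state machine (see the ModelWithStateMachine base) advances these steps rather than user code.

```python
from archivebox.crawls.models import Crawl, Seed

# Every call below is documented on Crawl above. Driving a crawl by hand
# is only a sketch; normally the workers' state machine advances it.
seed = Seed.objects.get(uri='https://example.com/some/rss.xml')
crawl = Crawl.from_seed(seed, max_depth=1)

root_snapshot = crawl.create_root_snapshot()   # Snapshot for the Seed's own URI

# Work remaining before this crawl is finished:
todo_snapshots = crawl.pending_snapshots()     # QuerySet[Snapshot]
todo_results = crawl.pending_archiveresults()  # QuerySet[ArchiveResult]
```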