This document is a starting point and reference to familiarize yourself with this codebase.
In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing.
A look at the tangles of communication between different parts of the system:
graph LR;
subgraph shtrove;
subgraph web[api/web server];
ingest;
search;
browse;
rss;
atom;
oaipmh;
end;
worker["background worker (celery)"];
indexer["indexer daemon"];
rabbitmq["task queue (rabbitmq)"];
postgres["database (postgres)"];
elasticsearch;
web---rabbitmq;
web---postgres;
web---elasticsearch;
worker---rabbitmq;
worker---postgres;
worker---elasticsearch;
indexer---rabbitmq;
indexer---postgres;
indexer---elasticsearch;
end;
source["metadata source (e.g. osf.io backend)"];
user["web user, either by browsing directly or via web app (like osf.io)"];
subscribers["feed subscription tools"];
source-->ingest;
user-->search;
user-->browse;
subscribers-->rss;
subscribers-->atom;
subscribers-->oaipmh;
A brief look at important areas of code as they happen to exist now.
- `trove`: django app for rdf-based apis
    - `trove.digestive_tract`: most of what happens after ingestion
        - stores records and identifiers in the database
        - initiates indexing
    - `trove.extract`: parsing ingested metadata records into resource descriptions
    - `trove.derive`: from a given resource description, create special non-rdf serializations
    - `trove.render`: from an api response modeled as rdf graph, render the requested mediatype
    - `trove.models`: database models for identifiers and resource descriptions
    - `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`)
    - `trove.vocab`: identifies and describes concepts used elsewhere
        - `trove.vocab.trove`: describes types, properties, and api paths in the trove api
        - `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested)
    - `trove.openapi`: generate openapi json for the trove api from thesaurus in `trove.vocab.trove`
- `share`: django app with search indexes and remnants of sharev2
    - `share.models`: database models for external sources, users, and other system book-keeping
    - `share.oaipmh`: provide data via OAI-PMH
    - `share.search`: all interaction with elasticsearch
        - `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data
        - `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies
        - `share.search.index_messenger`: for sending messages to the indexer daemon
- `api`: django app with remnants of the legacy sharev2 api
    - `api.views.feeds`: allows custom RSS and Atom feeds
    - otherwise, subject to possible deprecation
- `osf_oauth2_adapter`: django app for login via osf.io
- `project`: the actual django project
    - default settings at `project.settings`
    - pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`)
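The relationship between an abstract index strategy, its implementations, and a daemon that sends each batch to every active strategy can be sketched roughly as follows. This is only an illustration of the pattern: `SketchIndexStrategy`, `FlatStrategy`, `DedupedStrategy`, `index_batch`, and `send_to_all` are hypothetical names and do not reflect the actual `IndexStrategy` interface in `share.search`.

```python
from abc import ABC, abstractmethod


class SketchIndexStrategy(ABC):
    """Hypothetical stand-in; NOT the real IndexStrategy interface."""

    @abstractmethod
    def index_batch(self, record_ids: list[int]) -> int:
        """Index the given records; return how many were newly handled."""


class FlatStrategy(SketchIndexStrategy):
    """One approach: keep every update, in arrival order."""

    def __init__(self) -> None:
        self.indexed: list[int] = []

    def index_batch(self, record_ids: list[int]) -> int:
        self.indexed.extend(record_ids)
        return len(record_ids)


class DedupedStrategy(SketchIndexStrategy):
    """Another approach to the same data: keep each record id once."""

    def __init__(self) -> None:
        self.indexed: set[int] = set()

    def index_batch(self, record_ids: list[int]) -> int:
        _before = len(self.indexed)
        self.indexed.update(record_ids)
        return len(self.indexed) - _before


def send_to_all(
    strategies: list[SketchIndexStrategy],
    record_ids: list[int],
) -> list[int]:
    """Hand the same batch to every active strategy (assumed daemon behavior)."""
    return [_strategy.index_batch(record_ids) for _strategy in strategies]
```

The point of the pattern is that the caller never needs to know which indexing approach a given strategy takes; it only hands over the same batch of updates to each one.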
SHARE/trove uses the resource description framework (rdf):
- the content of each ingested metadata record is an rdf graph focused on a specific resource
- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered in a variety of ways
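To make "an rdf graph focused on a specific resource" more concrete, here is a minimal sketch that models a graph as a set of plain Python triples, without any rdf library. The `Triple` type, the `describe` helper, and the `example.org` URIs are assumptions made for illustration; they are not names from this codebase.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# hypothetical identifiers, for illustration only
FOCUS = "https://example.org/resource/abc"
PERSON = "https://example.org/person/1"
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# one metadata record: a small graph whose statements center on FOCUS
RECORD: frozenset[Triple] = frozenset({
    Triple(FOCUS, DCTERMS + "title", "An example title"),
    Triple(FOCUS, DCTERMS + "creator", PERSON),
    Triple(PERSON, FOAF + "name", "A. Person"),
})


def describe(graph: frozenset[Triple], focus: str) -> dict[str, set[str]]:
    """Collect the predicate/object pairs stated directly about the focus."""
    _description: dict[str, set[str]] = {}
    for _triple in graph:
        if _triple.subject == focus:
            _description.setdefault(_triple.predicate, set()).add(_triple.obj)
    return _description
```

Calling `describe(RECORD, FOCUS)` returns only the statements whose subject is the focus resource; the statement about the person stays in the graph but is not part of the direct description.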
Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly.
Prefer using open, standard, well-defined namespaces wherever possible (DCAT is a good place to start; see trove.vocab.namespaces for others already in use). When app-specific concepts must be defined, use the TROVE namespace (https://share.osf.io/vocab/2023/trove/).
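As a rough illustration of building full URI strings from shared namespaces, the sketch below defines a tiny `Namespace` helper. The helper class is hypothetical (the real helpers in `trove.vocab.namespaces` may look quite different), and `exampleConcept` is a made-up term used only to show the shape of a TROVE-namespace URI.

```python
class Namespace:
    """Tiny helper that builds full URI strings from a shared prefix."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix

    def __getattr__(self, name: str) -> str:
        # only reached for attributes that are not found normally
        if name.startswith("_"):
            raise AttributeError(name)
        return self._prefix + name

    def __getitem__(self, name: str) -> str:
        # bracket access, for names that are not valid python attributes
        return self._prefix + name


# an open, standard namespace
DCAT = Namespace("http://www.w3.org/ns/dcat#")
# the app-specific namespace named above
TROVE = Namespace("https://share.osf.io/vocab/2023/trove/")

DATASET_TYPE: str = DCAT.Dataset
# "exampleConcept" is hypothetical, not a term defined by trove
EXAMPLE_CONCEPT: str = TROVE["exampleConcept"]
```

Either access style yields an ordinary string, so the full URI can be stored, compared, or exposed outwardly without any special type.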
A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple (source, identifier) that uniquely and persistently identifies a metadata record in a source repository. This identifier may be any string value, provided by the external source.
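One way to picture a suid is as an immutable, hashable pair that can key a lookup of records. The sketch below assumes that shape; the `Suid` type, its field names, and `latest_by_suid` are illustrative and are not the database models used in this codebase.

```python
from typing import NamedTuple


class Suid(NamedTuple):
    source: str      # which external source provided the record
    identifier: str  # any string the source uses to name the record


def latest_by_suid(records: list[tuple[Suid, str]]) -> dict[Suid, str]:
    """Keep only the most recently seen record body for each suid."""
    _latest: dict[Suid, str] = {}
    for _suid, _body in records:
        _latest[_suid] = _body  # later entries replace earlier ones
    return _latest
```

Because both parts of the pair take part in equality, the same identifier string from two different sources names two different records.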
(an incomplete list)
- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in)
- prefer full type annotations in python code, wherever reasonably feasible
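A short, made-up function showing both conventions together (the function itself is only an example and does not come from this codebase):

```python
from collections import Counter
from collections.abc import Iterable


def count_words(lines: Iterable[str]) -> Counter[str]:
    """Count case-insensitive word occurrences across the given lines."""
    # local names carry a leading underscore; imported and built-in names do not
    _counts: Counter[str] = Counter()
    for _line in lines:
        for _word in _line.split():
            _counts[_word.lower()] += 1
    return _counts
```

At a glance, `_counts`, `_line`, and `_word` read as internal to the function, while `Counter` and `Iterable` read as coming from elsewhere.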
inspired by this writeup and this example architecture document
