OAI-PMH Repo Handler
OAI-PMH Repo Handler implements an OAI-PMH Aggregator service for CESSDA Metadata Aggregator.
CESSDA Metadata Aggregator - OAI-PMH Repo Handler
HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records. This program is part of CESSDA Metadata Aggregator.
Source code is hosted on GitHub at https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.
Features
The OAI-PMH Repo Handler implements an OAI-PMH Aggregator service. The aggregator provides an OAI-PMH endpoint which enables tracing of record origin using OAI-PMH provenance containers. The OAI-PMH specification is publicly available at http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm and provenance containers are described at http://www.openarchives.org/OAI/2.0/guidelines-provenance.htm. The aggregator adheres to the implementation Guidelines for Aggregators, Caches and Proxies, which is available at http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm.
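As an illustration of how a harvester might trace record origin, the following minimal sketch reads a provenance container from the about element of a record. The element and attribute names follow the OAI provenance guidelines; the sample XML, the archive.example.org URL and the function name origin_descriptions are assumptions made for this example and are not taken from the aggregator's output.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.openarchives.org/OAI/2.0/provenance"

# Hypothetical <about> element, written by hand for this example.
SAMPLE_ABOUT = """\
<about xmlns="http://www.openarchives.org/OAI/2.0/">
  <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
    <originDescription harvestDate="2024-01-24T10:00:00Z" altered="true">
      <baseURL>https://archive.example.org/oai</baseURL>
      <identifier>oai:archive.example.org:study-1</identifier>
      <datestamp>2023-12-01</datestamp>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </originDescription>
  </provenance>
</about>
"""


def origin_descriptions(about_xml):
    """Return one dict per originDescription found in an <about> element."""
    root = ET.fromstring(about_xml)
    ns = {"prov": PROV_NS}
    origins = []
    # iter() also visits nested originDescription elements, if any.
    for desc in root.iter(f"{{{PROV_NS}}}originDescription"):
        origins.append({
            "base_url": desc.findtext("prov:baseURL", namespaces=ns),
            "identifier": desc.findtext("prov:identifier", namespaces=ns),
            "datestamp": desc.findtext("prov:datestamp", namespaces=ns),
            "harvest_date": desc.get("harvestDate"),
            "altered": desc.get("altered") == "true",
        })
    return origins


for origin in origin_descriptions(SAMPLE_ABOUT):
    print(origin["base_url"], origin["identifier"])
```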
The aggregator implements the following OAI-PMH features:
All six OAI-PMH verbs of OAI-PMH protocol 2.0.
ResumptionTokens to partition large list responses.
Selective harvesting via OAI-sets and datestamps.
Configurable support for deleted records.
Configurable support for OAI-identifiers.
Configurable support for arbitrary OAI-sets.
Built-in OAI set for grouping by study language.
Built-in OAI set for grouping by OpenAIRE data.
The following metadata formats are supported:
OAI-DC using metadataprefix oai_dc.
DDI 2.5 using metadataprefix oai_ddi25.
OpenAIRE Datacite using metadataprefix oai_datacite.
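To show how selective harvesting and resumptionTokens fit together from a client's point of view, here is a minimal sketch of a harvesting loop. It relies only on standard OAI-PMH 2.0 request arguments (verb, metadataPrefix, set, from, resumptionToken). The function names and the injectable fetch callable are assumptions made for this example, and the base URL passed in is whatever your deployment uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL.

    resumptionToken is an exclusive argument: when it is present,
    only the verb is sent alongside it.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumptionToken of a list response, or None when complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI_NS}}}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None  # a missing or empty token means the list is complete
    return token.text.strip()


def harvest(base_url, metadata_prefix, fetch=None, **selectors):
    """Yield raw response pages until the list response is complete."""
    fetch = fetch or (lambda url: urlopen(url).read())
    url = list_records_url(base_url, metadata_prefix, **selectors)
    while url:
        page = fetch(url)
        yield page
        token = next_token(page)
        url = list_records_url(base_url, resumption_token=token) if token else None
```

A call such as harvest(base_url, "oai_ddi25", set_spec="thematic:social_sciences") would then request DDI 2.5 records from a single set, provided that set is configured on the server.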
The application exposes a /metrics endpoint, which provides certain statistics about the running instance of the application. This endpoint is provided by prometheus-client. The following metrics are exposed:
| Metric | Type | Explanation |
|---|---|---|
| | Counter | Total number of requests received |
| | Counter | Number of requests received per user-agent |
| | Counter | Number of successful requests |
| | Counter | Number of failed requests |
| | Summary | Response time in milliseconds |
| | Gauge | Total number of OAI-PMH records (includes records marked as deleted) |
| | Gauge | Total number of OAI-PMH records (excludes records marked as deleted) |
| | Gauge | Total number of distinct publishers (defined by the repository’s declared OAI-PMH base URL) |
| | Gauge | Number of OAI-PMH records per publisher (includes records marked as deleted) |
| | Gauge | Number of OAI-PMH records per publisher (excludes records marked as deleted) |
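For a quick look at what a /metrics response contains, the following rough sketch splits Prometheus text exposition output into samples. The metric names in the sample text (example_requests_total, example_records_total) are placeholders invented for this example; they are not the names exposed by the application. The parser is deliberately simplified and assumes that samples carry no trailing timestamp.

```python
# Hypothetical /metrics output; the metric names are placeholders.
SAMPLE_METRICS = """\
# HELP example_requests_total Total number of requests received
# TYPE example_requests_total counter
example_requests_total 42.0
example_records_total{publisher="https://archive.example.org/oai"} 7.0
"""


def parse_samples(text):
    """Return (name, raw_labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field, so label values
        # that contain spaces (such as user-agent strings) stay intact.
        sample, value = line.rsplit(" ", 1)
        name, _, labels = sample.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


for name, labels, value in parse_samples(SAMPLE_METRICS):
    print(name, labels, value)
```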
Requirements
Python 3.8 or newer.
Running CESSDA Metadata Aggregator DocStore instance.
Installation
On Ubuntu 20.04
Get Package
Clone the repository using Git.
git clone https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git
Or fetch a specific release using a tag. For example, to get the 0.2.0 release:
git clone --branch 0.2.0 https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git
Install OAI-PMH Repo Handler
It is recommended to install packages inside a Python virtual environment to isolate the install. This package also provides a Dockerfile to help set up a containerized environment.
Create the Python virtual environment and activate it.
python3 -m venv cdcagg-env
source cdcagg-env/bin/activate
Install Python packages.
cd cessda.cdc.aggregator.oai-pmh-repo-handler
pip install -r requirements.txt
pip install .
To upgrade an existing install, use the --upgrade flag in pip commands. Pip uses the only-if-needed upgrade strategy by default since version 10.0.0, but for backwards compatibility the option is also included in the example.
pip install --upgrade -r requirements.txt --upgrade-strategy=only-if-needed
pip install . --upgrade --upgrade-strategy=only-if-needed
Run
Replace <docstore-url> with a URL pointing to a DocStore server. Replace <base-url> with the OAI-PMH Base URL of your endpoint. Replace <admin-email> with the administrator's email address.
python -m cdcagg_oai --document-store-url <docstore-url> --oai-pmh-base-url <base-url> --oai-pmh-admin-email <admin-email>
Configuration reference
To list all available configuration options, use --help.
python -m cdcagg_oai --help
Note that most configuration options can be specified via command line arguments, configuration file options and environment variables.
Prometheus client provides additional configuration options that can be set using environment variables:
PROMETHEUS_DISABLE_CREATED_SERIES for disabling series suffixed by _created.
PROMETHEUS_MULTIPROC_DIR for storing metrics when running in multiprocess mode.
Refer to Prometheus client documentation for more information.
Build OAI sets based on source endpoint
The aggregator provides a way to define OAI sets which group records by the source OAI-PMH endpoint. This functionality relies on a mapping file which maps the source OAI-PMH endpoint base-url value to an OAI-PMH setspec value. In order to use the mapping file, its filepath must be given to the program via configuration and the program must be able to read the file.
See example of a mapping file for syntax reference. The mapping file is expected to be valid YAML.
The value that corresponds with the url key in the mapping file is used to query the Document Store. Results are grouped using a setspec value source:<source-key-value>, where <source-key-value> is the value of the source key in the mapping file.
For example, if the mapping file has the following definition
-
  url: 'archive.org'
  source: 'archive'
  setname: 'Some archive'
  description: 'Describe some archive'
then all records that are harvested from archive.org are grouped in setspec source:archive.
Values for setname and description are used in the ListSets response to describe the set contents.
When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-source-path <mapping-file-path>.
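The relationship between mapping entries and setspec values can be sketched as follows. This is an illustration only, not the aggregator's own code: the mapping is shown as the list of dictionaries a YAML loader would typically produce for the definition above, and the function names and the shape of the ListSets entries are assumptions made for the example.

```python
# Mapping entries as a YAML loader would typically return them.
MAPPING = [
    {
        "url": "archive.org",
        "source": "archive",
        "setname": "Some archive",
        "description": "Describe some archive",
    },
]


def source_setspec(entry):
    """Build the setspec value for one mapping entry."""
    return f"source:{entry['source']}"


def setspec_for_url(mapping, base_url):
    """Return the setspec that groups records harvested from base_url."""
    for entry in mapping:
        if entry["url"] == base_url:
            return source_setspec(entry)
    return None  # no source set is defined for this endpoint


def list_sets(mapping):
    """Describe each source set the way a ListSets response would."""
    return [
        {
            "setSpec": source_setspec(entry),
            "setName": entry["setname"],
            # description is optional in the mapping file
            "setDescription": entry.get("description"),
        }
        for entry in mapping
    ]


print(setspec_for_url(MAPPING, "archive.org"))
```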
Build arbitrary OAI sets
Arbitrary sets can be built using the configurable sets functionality. Records can be grouped into arbitrary sets by creating a mapping file which defines OAI set properties and the record identifiers belonging to the defined set. The record identifiers correspond to the _aggregator_identifier values of Study records, which are the same values that are used as default OAI-identifiers.
The set builder supports a single top-level spec value with multiple second-level spec values. Second-level spec values are always prefixed with the top-level spec value: <setSpec>top-level:second-level</setSpec>. The top-level setspec contains records matching all identifiers defined in second-level set definitions.
See example of a mapping file for syntax reference. The mapping file is expected to be valid YAML.
A single spec key must be found at the top level. The spec value is used as a top-level OAI setspec value and identifies that this setspec value gets interpreted as a configurable OAI set. The nodes key contains a list of second-level set definitions. The second-level spec values must be unique and each list item must contain a list of identifiers that belong to that particular OAI set.
For example, if the mapping file has the following definition
spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - spec: 'social_sciences'
    name: 'Social sciences'
    description: 'Studies in social sciences'
    identifiers:
      - id_1
      - id_2
  - spec: 'humanities'
    name: 'Humanities'
    description: 'Studies in humanities'
    identifiers:
      - id_2
      - id_3
      - id_4
then thematic is the top-level setspec node. It contains two child nodes: social_sciences and humanities. A ListRecords request with spec=thematic will return records from all its second-level nodes. A ListRecords request with spec=thematic:social_sciences will return records with _aggregator_identifier values id_1 and id_2. The record with identifier id_2 belongs to both second-level setspec nodes.
Only a single top-level node is supported. It must contain at least one second-level child node.
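The lookup described above can be sketched in a few lines. This is a simplified illustration and not the aggregator's own implementation: the configuration is shown as the dictionary a YAML loader would typically return for the example definition, and the function name identifiers_for is an assumption made for this example.

```python
# The example definition as a YAML loader would typically return it.
CONFIG = {
    "spec": "thematic",
    "name": "Thematic",
    "description": "Thematic grouping of records",
    "nodes": [
        {"spec": "social_sciences", "name": "Social sciences",
         "identifiers": ["id_1", "id_2"]},
        {"spec": "humanities", "name": "Humanities",
         "identifiers": ["id_2", "id_3", "id_4"]},
    ],
}


def identifiers_for(config, setspec):
    """Return record identifiers selected by a setspec, without duplicates.

    Returns None when the setspec does not belong to this configurable set.
    """
    top, _, second = setspec.partition(":")
    if top != config["spec"]:
        return None
    nodes = config["nodes"]
    if second:
        # A second-level setspec selects a single child node.
        nodes = [node for node in nodes if node["spec"] == second]
    identifiers = []
    for node in nodes:
        for identifier in node["identifiers"]:
            if identifier not in identifiers:
                identifiers.append(identifier)
    return identifiers


print(identifiers_for(CONFIG, "thematic"))
print(identifiers_for(CONFIG, "thematic:social_sciences"))
```

Note that id_2 appears only once in the result for the top-level setspec even though it is listed under both child nodes.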
Instead of specifying set definitions directly, the second-level node may alternatively specify a path key which holds the absolute path of an external mapping file that contains second-level set definitions.
The external configuration must specify spec, name and identifiers keys and may have an optional description key. The external configuration file can specify a single node or multiple nodes in a list.
Main configuration file with path:
spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - path: '/absolute/path/to/ext/conf.yaml'
External configuration file with a single node:
spec: 'history'
name: 'History'
description: 'Studies in history'
identifiers:
  - id_5
  - id_6
External configuration file with a list of nodes:
- spec: 'history'
  name: 'History'
  description: 'Studies in history'
  identifiers:
    - id_5
    - id_6
- spec: 'literature'
  name: 'Literature'
  description: 'Literature Studies'
  identifiers:
    - id_7
    - id_8
The external configuration cannot further refer to an external configuration file.
The mapping file syntax is validated on server startup. The file is not loaded in-memory, but always read on-demand. Exceptions may occur after server startup, if the file is changed after initial syntax check.
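The kind of checks a syntax validation could perform on an external configuration file can be sketched as below. This is only an illustration of the rules stated above (required keys, unique spec values, no nested external files); it is not the validation code the server runs, and the function name validate_nodes is an assumption made for this example.

```python
REQUIRED_KEYS = ("spec", "name", "identifiers")


def validate_nodes(nodes):
    """Return a list of problems found in second-level set definitions."""
    if isinstance(nodes, dict):
        nodes = [nodes]  # an external file may hold a single node
    problems = []
    if not nodes:
        problems.append("at least one second-level node is required")
    seen_specs = set()
    for index, node in enumerate(nodes):
        if "path" in node:
            # External files cannot refer to further external files.
            problems.append(f"node {index}: nested 'path' is not allowed")
            continue
        for key in REQUIRED_KEYS:
            if key not in node:
                problems.append(f"node {index}: missing key '{key}'")
        spec = node.get("spec")
        if spec is not None:
            if spec in seen_specs:
                problems.append(f"node {index}: duplicate spec '{spec}'")
            seen_specs.add(spec)
    return problems


single = {"spec": "history", "name": "History", "identifiers": ["id_5", "id_6"]}
print(validate_nodes(single))
```

Because the file is read on demand rather than held in memory, a check like this at startup cannot guarantee that later reads succeed if the file is edited afterwards.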
When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-configurable-path <mapping-file-path>.
License
See the LICENSE file.
Changelog
All notable changes to the CDC Aggregator OAI-PMH Repo Handler will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
0.8.0 - 2024-04-30
Added
Support Study.principal_investigator attributes external_link, external_link_uri, external_link_role and external_link_title. Render /CodeBook/stdyDscr/citation/rspStmt/AuthEnty/ExtLink to DDI-C (metadataprefixes ddi_c and oai_ddi25).
Changed
0.7.0 - 2024-01-24
Fixed
Empty resumptionToken is now interpreted correctly by XSLT processing. It means the list response is complete. (Fixes #67)
Use new study schema introduced by Shared Library 0.6.0 to count metrics exposed via /metrics endpoint. This improves performance by decreasing response times of the endpoint.
Changed
Require Kuha OAI-PMH Repo Handler 1.4.1 in requirements.txt.
Require CESSDA CDC Aggregator Shared Library 0.6.0 in requirements.txt and setup.py.
0.6.0 - 2023-08-29
Added
Add /metrics endpoint to serve prometheus metrics (Implements #43)
Fixed
Read configuration option --server-process-count. (Fixes #45)
0.5.0 - 2023-03-17
Added
Include study.study_uris in element dc:identifier in oai_dc serialization. (Implements #40)
0.4.0 - 2022-12-21
Added
Add hard-coded resourceType to oai_datacite serialization which always has the value Dataset. (Implements #33)
Add hard-coded dc:type element to oai_dc serialization which always has the value Dataset. (Implements #36)
Add XML Stylesheet to make OAI responses more human-friendly. (Implements #22)
Configuration option to control XML Stylesheets (--oai-pmh-stylesheet-url):
Set to empty string to disable stylesheets completely.
Set to a full URL to serve the stylesheet from some external file server.
Start with a slash (‘/’) to serve via Kuha OAI-PMH Repo Handler server.
Defaults to ‘/v0/oai/static/oai2.xsl’, which works with other default configuration values and uses Kuha OAI-PMH Repo Handler server to actually serve the file.
Changed
Require Kuha OAI-PMH Repo Handler 1.2.0 in requirements.txt and setup.py.
Fixed
Make sure oai_datacite serialization yields valid Datacite v3. (Fixes #35)
Remove invalid xml:lang attributes.
Wrap geoLocationPlace inside geoLocation element.
0.3.0 - 2022-11-22
Added
Changed
Add primary lookup to oai_datacite Publisher from Study.distributors. The current lookup from Study.publishers will remain as a secondary source. (Fixes #31)
Update dependencies:
Require Aggregator Shared Library 0.5.0 in requirements.txt and setup.py.
Require Kuha OAI-PMH Repo Handler 1.1.0 in requirements.txt and setup.py.
Require Kuha Common 2.0.1 in requirements.txt and 2.0.0 or newer in setup.py.
Require Tornado 6.2.0 in requirements.txt.
Require Genshi 0.7.7 in requirements.txt.
Fixed
0.2.1 - 2022-06-29
Changed
Require Aggregator Shared Library 0.3.0 in requirements.txt.
Require Kuha Common 1.0.0 in requirements.txt.
Require Kuha OAI-PMH Repo Handler 1.0.1 in requirements.txt.
Fixed
0.2.0 - 2021-12-17
Added
Mapping file syntax for source-sets (SourceAggMDSet-class) now supports setname and description key-value pairs. The setname is mandatory while description is optional.
Mapping file for configurable-sets now supports external mapping files via path key. (Implements #18)
Mapping file for configurable-sets is validated upon server startup.
Changed
Mapping file configuration option for SourceAggMDSet --oai-set-sources-path defaults to None, which implies that the set is discarded (not loaded) on server startup. The operator is in charge of creating and configuring the mapping file.
Update dependencies in requirements.txt
PyYAML 6.0.0
ConfigArgParse 1.5.3
Kuha Common to Git commit 8e7de1f16530decc356fee660255b60fcacaea23
Kuha OAI-PMH Repo Handler to Git commit cbe6d16bbe00369ccddc8a0ae5bcd64f8476755e
CDC Aggregator Shared Library 0.2.0
Fixed
Value for the altered attribute in Provenance containers is now either ‘true’ or ‘false’. (Fixes #14)
Empty setName elements for language-sets are populated with generated values. Key-value pairs for setname are expected to be defined for source-sets in configured mapping file. (Fixes #15)
Source set no longer falls back to automatically generating sets based on source archive’s baseUrl. (Fixes #15)
deletedRecord declaration is now configurable. (Fixes #16)
Provenance container’s baseUrl element name should be baseURL. (Fixes #19)
0.1.0 - 2021-09-21
Added
New codebase for CDC Aggregator OAI-PMH Repo Handler.
HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records.