OAI-PMH Repo Handler

OAI-PMH Repo Handler implements an OAI-PMH Aggregator service for CESSDA Metadata Aggregator.

CESSDA Metadata Aggregator - OAI-PMH Repo Handler

HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records. This program is part of CESSDA Metadata Aggregator.

Source code is hosted on GitHub at https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.

Features

The OAI-PMH Repo Handler implements an OAI-PMH Aggregator service. The aggregator provides an OAI-PMH endpoint which enables tracing of record origin using OAI-PMH provenance containers. The OAI-PMH specification is publicly available at http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm and provenance containers are described at http://www.openarchives.org/OAI/2.0/guidelines-provenance.htm. The aggregator adheres to the implementation Guidelines for Aggregators, Caches and Proxies, which is available at http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm.
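As an illustration of how a harvester might trace record origin, the following minimal sketch reads the originDescription elements of a provenance container. The sample XML, the example.org URLs and the function name origin_descriptions are assumptions made for this example; only the element and attribute names come from the OAI-PMH provenance guidelines.

```python
import xml.etree.ElementTree as ET

PROVENANCE_NS = "http://www.openarchives.org/OAI/2.0/provenance"

# Hypothetical <about> element, written by hand for illustration only.
SAMPLE_ABOUT = """
<about xmlns="http://www.openarchives.org/OAI/2.0/">
  <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
    <originDescription harvestDate="2024-01-01T00:00:00Z" altered="false">
      <baseURL>https://archive.example.org/oai</baseURL>
      <identifier>oai:archive.example.org:study-1</identifier>
      <datestamp>2023-12-24</datestamp>
      <metadataNamespace>ddi:codebook:2_5</metadataNamespace>
    </originDescription>
  </provenance>
</about>
"""


def origin_descriptions(about_xml):
    """Return one dict per originDescription found in the given XML."""
    root = ET.fromstring(about_xml)
    ns = {"p": PROVENANCE_NS}
    origins = []
    # iter() also visits nested originDescription elements, which the
    # guidelines allow when a record has passed through several aggregators.
    for desc in root.iter("{%s}originDescription" % PROVENANCE_NS):
        origins.append({
            "base_url": desc.findtext("p:baseURL", namespaces=ns),
            "identifier": desc.findtext("p:identifier", namespaces=ns),
            "datestamp": desc.findtext("p:datestamp", namespaces=ns),
            "harvest_date": desc.get("harvestDate"),
            "altered": desc.get("altered"),
        })
    return origins


origins = origin_descriptions(SAMPLE_ABOUT)
```

Each returned dict names the source endpoint in base_url and the record's original identifier, which is the information needed to locate the record at its origin.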

The aggregator implements the following OAI-PMH features:

  • All six OAI-PMH verbs of OAI-PMH protocol 2.0.

  • ResumptionTokens to partition large list responses.

  • Selective harvesting via OAI-sets and datestamps.

  • Configurable support for deleted records.

  • Configurable support for OAI-identifiers.

  • Configurable support for arbitrary OAI-sets.

  • Built-in OAI set for grouping by study language.

  • Built-in OAI set for grouping by OpenAIRE data.

The following metadata formats are supported:

  • OAI-DC using metadataprefix oai_dc.

  • DDI 2.5 using metadataprefix oai_ddi25.

  • OpenAIRE Datacite using metadataprefix oai_datacite.
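The features above can be exercised from a client with a short harvesting loop. The sketch below is not part of this package; it assumes a hypothetical base URL and uses only the Python standard library. It issues ListIdentifiers requests for one metadata prefix and optional set, then follows resumptionTokens until the response carries an empty or missing token.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base_url, **params):
    """Build an OAI-PMH request URL from a base URL and request arguments."""
    return base_url + "?" + urllib.parse.urlencode(params)


def http_fetch(url):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, metadata_prefix, set_spec=None,
                        fetcher=http_fetch):
    """Yield record identifiers, following resumptionTokens page by page."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetcher(build_url(base_url, **params)))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.findtext(".//" + OAI_NS + "resumptionToken")
        if not token:
            # A missing or empty resumptionToken ends the list.
            break
        # A resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
```

Usage, assuming an endpoint at the hypothetical address https://aggregator.example.org/oai: list(harvest_identifiers("https://aggregator.example.org/oai", "oai_ddi25")). The fetcher parameter exists so the loop can be tried against canned responses without network access.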

The application exposes a /metrics endpoint, which provides certain statistics about the running instance of the application. This endpoint is provided by prometheus-client. The following metrics are exposed:

Metric                             Type     Explanation
requests_total                     Counter  Total number of requests received
requests_per_user_agent_total      Counter  Number of requests received per user-agent
requests_succeeded_total           Counter  Number of successful requests
requests_failed_total              Counter  Number of failed requests
requests_duration                  Summary  Response time in milliseconds
records_total                      Gauge    Total number of OAI-PMH records (includes records marked as deleted)
records_total_without_deleted      Gauge    Total number of OAI-PMH records (excludes records marked as deleted)
publishers_total                   Gauge    Total number of distinct publishers (defined by the repository’s declared OAI-PMH base URL)
publishers_counts                  Gauge    Number of OAI-PMH records per publisher (includes records marked as deleted)
publishers_counts_without_deleted  Gauge    Number of OAI-PMH records per publisher (excludes records marked as deleted)
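To show how these metrics might be consumed, the following sketch parses a small sample in the Prometheus text format and derives the number of deleted records from the two records gauges. The sample text and its values are invented for illustration, and the parser is deliberately simplified: it assumes one sample per line with no trailing timestamp. For real use, the parser shipped with prometheus-client is likely the better choice.

```python
# Invented sample output; a running instance will report other values.
SAMPLE_METRICS = """\
# TYPE records_total gauge
records_total 1200.0
# TYPE records_total_without_deleted gauge
records_total_without_deleted 1150.0
"""


def parse_metrics(text):
    """Map each sample name (including any label part) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        # Split on the last space so that label values containing spaces
        # (for example user-agent strings) stay part of the name.
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def deleted_records(samples):
    """Number of records marked as deleted, derived from the two gauges."""
    return samples["records_total"] - samples["records_total_without_deleted"]


samples = parse_metrics(SAMPLE_METRICS)
deleted = deleted_records(samples)
```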

Requirements

  • Python 3.8 or newer.

  • Running CESSDA Metadata Aggregator DocStore instance.

Installation

On Ubuntu 20.04

Get Package

Clone the repository using Git.

git clone https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git

Or fetch a specific release using a tag. For example, to get the 0.2.0 release:

git clone --branch 0.2.0 https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git

Install OAI-PMH Repo Handler

It is recommended to install packages inside a Python virtual environment to isolate the installation. This package also provides a Dockerfile to help set up a containerized environment.

Create the Python virtual environment and activate it.

python3 -m venv cdcagg-env
source cdcagg-env/bin/activate

Install Python packages.

cd cessda.cdc.aggregator.oai-pmh-repo-handler
pip install -r requirements.txt
pip install .

To upgrade an existing installation, use the --upgrade flag in the pip commands. Pip has used the only-if-needed upgrade strategy by default since version 10.0.0, but the option is included in the example for backwards compatibility.

pip install --upgrade -r requirements.txt --upgrade-strategy=only-if-needed
pip install . --upgrade --upgrade-strategy=only-if-needed

Run

Replace <docstore-url> with a URL pointing to a DocStore server. Replace <base-url> with the OAI-PMH base URL of your endpoint. Replace <admin-email> with the administrator's email address.

python -m cdcagg_oai --document-store-url <docstore-url> --oai-pmh-base-url <base-url> --oai-pmh-admin-email <admin-email>

Configuration reference

To list all available configuration options, use --help.

python -m cdcagg_oai --help

Note that most configuration options can be specified via command line arguments, configuration file options and environment variables.

Prometheus client provides additional configuration options that can be set using environment variables:

  • PROMETHEUS_DISABLE_CREATED_SERIES for disabling series suffixed by _created.

  • PROMETHEUS_MULTIPROC_DIR for storing metrics when running in multiprocess mode.

Refer to Prometheus client documentation for more information.
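When the server is started from a supervising Python script, the run command and the Prometheus environment variables can be assembled programmatically. The sketch below is one possible way to do that. The function names are hypothetical, and the value "True" for PROMETHEUS_DISABLE_CREATED_SERIES is an assumption that should be checked against the Prometheus client documentation.

```python
import os
import subprocess
import sys


def build_command(docstore_url, base_url, admin_email):
    """Assemble the documented run command as an argument list."""
    return [
        sys.executable, "-m", "cdcagg_oai",
        "--document-store-url", docstore_url,
        "--oai-pmh-base-url", base_url,
        "--oai-pmh-admin-email", admin_email,
    ]


def build_env(multiproc_dir=None, disable_created_series=False):
    """Copy the current environment and add Prometheus client settings."""
    env = dict(os.environ)
    if multiproc_dir:
        env["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir
    if disable_created_series:
        # Assumed value; see the Prometheus client documentation.
        env["PROMETHEUS_DISABLE_CREATED_SERIES"] = "True"
    return env


def launch(command, env):
    """Run the server in the foreground; raises if it exits with an error."""
    return subprocess.run(command, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and copying os.environ keeps the activated virtual environment visible to the child process.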

Build OAI sets based on source endpoint

The aggregator provides a way to define OAI sets which group records by the source OAI-PMH endpoint. This functionality relies on a mapping file which maps the base URL of the source OAI-PMH endpoint to an OAI-PMH setspec value. To use the mapping file, its file path must be given to the program via configuration, and the program must be able to read the file.

See example of a mapping file for syntax reference. The mapping file is expected to be valid YAML.

The url value in the mapping file is used to query the Document Store. Results are grouped using a setspec value source:<source-key-value>, where <source-key-value> corresponds to the value of source in the mapping file.

For example, if the mapping file has the following definition

-
  url: 'archive.org'
  source: 'archive'
  setname: 'Some archive'
  description: 'Describe some archive'

then all records that are harvested from archive.org are grouped in setspec source:archive.

Values for setname and description are used in the ListSets response to describe the set contents.

When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-source-path <mapping-file-path>.
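The relationship between a mapping entry and the resulting setspec can be sketched in a few lines. This is an illustration only, not the handler's implementation: the mapping is written as a Python literal equivalent to the YAML example, the function names are hypothetical, and exact string comparison of the url value is an assumption.

```python
# Python equivalent of the YAML mapping example above.
SOURCE_MAPPING = [
    {
        "url": "archive.org",
        "source": "archive",
        "setname": "Some archive",
        "description": "Describe some archive",
    },
]


def source_setspec(base_url, mapping):
    """Return 'source:<source>' for a source base URL, or None if unmapped."""
    for entry in mapping:
        if entry["url"] == base_url:  # Assumed: exact match on the url value.
            return "source:" + entry["source"]
    return None


def list_sets(mapping):
    """Describe each source set the way a ListSets response would."""
    return [
        {
            "setSpec": "source:" + entry["source"],
            "setName": entry["setname"],
            "setDescription": entry.get("description"),  # Optional key.
        }
        for entry in mapping
    ]


setspec = source_setspec("archive.org", SOURCE_MAPPING)
```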

Build arbitrary OAI sets

Arbitrary sets can be built using the configurable sets functionality.

Records can be grouped into arbitrary sets by creating a mapping file which defines OAI set properties and the record identifiers belonging to the defined set. The record identifiers correspond to the _aggregator_identifier values of Study records, which are the same values that are used as default OAI-identifiers.

The set builder supports a single top-level spec value with multiple second-level spec values. Second-level spec values are always prefixed with the top-level spec value, for example <setSpec>top-level:second-level</setSpec>. The top-level setspec contains records matching all identifiers defined in second-level set definitions.

See example of a mapping file for syntax reference. The mapping file is expected to be valid YAML.

A single spec key must be present at the top level. Its value is used as the top-level OAI setspec value and identifies the setspec as a configurable OAI set. The nodes key contains a list of second-level set definitions. The second-level spec values must be unique, and each list item must contain a list of identifiers that belong to that particular OAI set.

For example, if the mapping file has the following definition

spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - spec: 'social_sciences'
    name: 'Social sciences'
    description: 'Studies in social sciences'
    identifiers:
    - id_1
    - id_2
  - spec: 'humanities'
    name: 'Humanities'
    description: 'Studies in humanities'
    identifiers:
    - id_2
    - id_3
    - id_4

then thematic is the top-level setspec node. It contains two child nodes: social_sciences and humanities. A ListRecords request with set=thematic will return records from all its second-level nodes. A ListRecords request with set=thematic:social_sciences will return records with _aggregator_identifier values id_1 and id_2. The record with identifier id_2 belongs to both second-level setspec nodes.

Only a single top-level node is supported. It must contain at least one second-level child node.
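The resolution rules described above can be expressed as a small lookup. The sketch below mirrors the thematic example with a Python literal instead of YAML; identifiers_for is a hypothetical helper written for this illustration and says nothing about how the handler queries the Document Store.

```python
# Python equivalent of the YAML mapping example above.
CONFIGURABLE_SETS = {
    "spec": "thematic",
    "name": "Thematic",
    "description": "Thematic grouping of records",
    "nodes": [
        {"spec": "social_sciences", "name": "Social sciences",
         "description": "Studies in social sciences",
         "identifiers": ["id_1", "id_2"]},
        {"spec": "humanities", "name": "Humanities",
         "description": "Studies in humanities",
         "identifiers": ["id_2", "id_3", "id_4"]},
    ],
}


def identifiers_for(setspec, config):
    """Return identifiers for a setspec, or None if the setspec is unknown."""
    top, _, second = setspec.partition(":")
    if top != config["spec"]:
        return None
    nodes = config["nodes"]
    if second:
        nodes = [node for node in nodes if node["spec"] == second]
        if not nodes:
            return None
    # The top-level set is the union of its second-level sets; identifiers
    # shared by several nodes (id_2 here) are listed once.
    result = []
    for node in nodes:
        for identifier in node["identifiers"]:
            if identifier not in result:
                result.append(identifier)
    return result


top_level = identifiers_for("thematic", CONFIGURABLE_SETS)
second_level = identifiers_for("thematic:social_sciences", CONFIGURABLE_SETS)
```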

Instead of specifying set definitions directly, a second-level node may alternatively specify a path key whose value is the absolute path of an external mapping file that contains second-level set definitions.

The external configuration must specify spec, name and identifiers keys and may have an optional description key. The external configuration file can specify a single node or multiple nodes in a list.

Main configuration file with path

spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - path: '/absolute/path/to/ext/conf.yaml'

External configuration file with a single node

spec: 'history'
name: 'History'
description: 'Studies in history'
identifiers:
- id_5
- id_6

External configuration file with a list of nodes

- spec: 'history'
  name: 'History'
  description: 'Studies in history'
  identifiers:
  - id_5
  - id_6
- spec: 'literature'
  name: 'Literature'
  description: 'Literature Studies'
  identifiers:
  - id_7
  - id_8

The external configuration cannot further refer to an external configuration file.
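The rules for external configuration files can be summarised as a validation step. The following sketch takes already-parsed YAML content (a dict for a single node or a list of dicts) and checks it against the constraints stated above. The function name and error messages are hypothetical, and the handler's own validation may differ in detail.

```python
REQUIRED_KEYS = ("spec", "name", "identifiers")


def normalize_external(loaded):
    """Return external set definitions as a list after basic validation.

    `loaded` is the parsed content of an external configuration file:
    either a single node (dict) or a list of nodes.
    """
    nodes = loaded if isinstance(loaded, list) else [loaded]
    for node in nodes:
        if "path" in node:
            # External files must not point to further external files.
            raise ValueError("nested external configuration is not supported")
        missing = [key for key in REQUIRED_KEYS if key not in node]
        if missing:
            raise ValueError("missing keys: " + ", ".join(missing))
    return nodes


single = normalize_external(
    {"spec": "history", "name": "History", "identifiers": ["id_5", "id_6"]}
)
```

Normalizing both shapes to a list lets the rest of the code treat inline and external second-level nodes the same way.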

The mapping file syntax is validated on server startup. The file is not kept in memory but is read on demand, so exceptions may occur after startup if the file is changed after the initial syntax check.

When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-configurable-path <mapping-file-path>.

License

See the LICENSE file.

Changelog

All notable changes to the CDC Aggregator OAI-PMH Repo Handler will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

0.8.0 - 2024-04-30

Added

  • Support Study.principal_investigator attributes external_link, external_link_uri, external_link_role and external_link_title.

  • Render /CodeBook/stdyDscr/citation/rspStmt/AuthEnty/ExtLink to DDI-C (metadataprefixes ddi_c and oai_ddi25).

Changed

  • Require CESSDA CDC Aggregator Shared Library 0.7.0 in requirements.txt and setup.py. (Implements #72)

  • Require Kuha OAI-PMH Repo Handler 1.5.0 in requirements.txt and setup.py. (Implements #72)

  • Require Kuha Common 2.4.0 in requirements.txt and setup.py. (Implements #72)

0.7.0 - 2024-01-24

Fixed

  • Empty resumptionToken is now interpreted correctly by XSLT processing. It means the list response is complete. (Fixes #67)

  • Use new study schema introduced by Shared Library 0.6.0 to count metrics exposed via /metrics endpoint. This improves performance by decreasing response times of the endpoint.

Changed

  • Require Kuha OAI-PMH Repo Handler 1.4.1 in requirements.txt.

  • Require CESSDA CDC Aggregator Shared Library 0.6.0 in requirements.txt and setup.py.

0.6.0 - 2023-08-29

Added

  • Add /metrics endpoint to serve prometheus metrics (Implements #43)

Fixed

  • Read configuration option --server-process-count. (Fixes #45)

0.5.0 - 2023-03-17

Added

  • Include study.study_uris to element dc:identifier in oai_dc serialization. (Implements #40)

0.4.0 - 2022-12-21

Added

  • Add hard-coded resourceType to oai_datacite serialization which always has the value Dataset. (Implements #33)

  • Add hard-coded dc:type element to oai_dc serialization which always has the value Dataset. (Implements #36)

  • Add XML Stylesheet to make OAI responses more human-friendly. (Implements #22)

  • Configuration option to control XML Stylesheets (--oai-pmh-stylesheet-url):

    • Set to empty string to disable stylesheets completely.

    • Set to a full URL to serve the stylesheet from some external file server.

    • Start with a slash (‘/’) to serve via Kuha OAI-PMH Repo Handler server.

    • Defaults to ‘/v0/oai/static/oai2.xsl’, which works with other default configuration values and uses Kuha OAI-PMH Repo Handler server to actually serve the file.

Changed

  • Require Kuha OAI-PMH Repo Handler 1.2.0 in requirements.txt and setup.py.

Fixed

  • Make sure oai_datacite serialization yields valid Datacite v3. (Fixes #35)

    • Remove invalid xml:lang attributes.

    • Wrap geoLocationPlace inside geoLocation element.

0.3.0 - 2022-11-22

Added

  • Grant & funding information to oai_datacite and oai_ddi25 metadata. (Implements #34)

  • Related publication identifiers and agencies to oai_datacite and oai_ddi25 metadata. (Implements #34)

Changed

  • Add primary lookup to oai_datacite Publisher from Study.distributors. The current lookup from Study.publishers will remain as a secondary source. (Fixes #31)

  • Update dependencies:

    • Require Aggregator Shared Library 0.5.0 in requirements.txt and setup.py.

    • Require Kuha OAI-PMH Repo Handler 1.1.0 in requirements.txt and setup.py.

    • Require Kuha Common 2.0.1 in requirements.txt and 2.0.0 or newer in setup.py.

    • Require Tornado 6.2.0 in requirements.txt.

    • Require Genshi 0.7.7 in requirements.txt.

Fixed

  • Change lookup order of preferred PublicationYear value for oai_datacite. (Fixes #30)

  • Format PublicationYear value for oai_datacite so that it is a year, instead of a full datestamp. (Fixes #30)

  • Include mandatory Date property to oai_datacite. (Fixes #29)

0.2.1 - 2022-06-29

Changed

  • Require Aggregator Shared Library 0.3.0 in requirements.txt.

  • Require Kuha Common 1.0.0 in requirements.txt.

  • Require Kuha OAI-PMH Repo Handler 1.0.1 in requirements.txt.

Fixed

  • Include missing fields to oai_ddi25 metadata: (Fixes #26)

    • /codeBook/docDscr/citation/titlStmt/titl

    • /codeBook/stdyDscr/citation/holdings/@URI

    • /codeBook/stdyDscr/stdyInfo/sumDscr/dataKind

  • Render subject in oai_dc metadata even if study.keyword does not have a value. (Fixes #27)

0.2.0 - 2021-12-17

Added

  • Mapping file syntax for source-sets (SourceAggMDSet-class) now supports setname and description key-value pairs. The setname is mandatory while description is optional.

  • Mapping file for configurable-sets now supports external mapping files via path key. (Implements #18)

  • Mapping file for configurable-sets is validated upon server startup.

Changed

  • Mapping file configuration option for SourceAggMDSet --oai-set-sources-path defaults to None, which implies that the set is discarded (not loaded) on server startup. The operator is in charge of creating and configuring the mapping file.

  • Update dependencies in requirements.txt

    • PyYAML 6.0.0

    • ConfigArgParse 1.5.3

    • Kuha Common to Git commit 8e7de1f16530decc356fee660255b60fcacaea23

    • Kuha OAI-PMH Repo Handler to Git commit cbe6d16bbe00369ccddc8a0ae5bcd64f8476755e

    • CDC Aggregator Shared Library 0.2.0

Fixed

  • Value for the altered attribute in Provenance containers is now either ‘true’ or ‘false’. (Fixes #14)

  • Empty setName elements for language-sets are populated with generated values. Key-value pairs for setname are expected to be defined for source-sets in configured mapping file. (Fixes #15)

  • Source set no longer falls back to automatically generating sets based on source archive’s baseUrl. (Fixes #15)

  • deletedRecord declaration is now configurable. (Fixes #16)

  • Provenance container’s baseUrl element name should be baseURL. (Fixes #19)

0.1.0 - 2021-09-21

Added

  • New codebase for CDC Aggregator OAI-PMH Repo Handler.

  • HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records.