External Data Services (EDS)

Definitions and Terminology

The following terms are used throughout this document:

Administrator: a privileged user of the OSDU Platform who is entitled to use the External Data Services (EDS) module through the EDS Console utility or Postman collection.

Connected Source Registry Entry: an OSDU master-data type that defines high-level business and technical information about an External Data Service. As of M23, Administrators use the EDS Console utility or Postman collection to register external data sources; each registration creates a Connected Source Registry Entry in the Administrator’s OSDU Platform.

Connected Source Data Job: an OSDU master-data type that defines the technical configuration for scheduled data jobs. These jobs are executed by the OSDU EDS workflow services to retrieve and ingest master data, work product component data, and/or reference data from the external source specified in a parent Connected Source Registry Entry. The relationship between a Connected Source Registry Entry and its Connected Source Data Jobs is one-to-many: one Connected Source Registry Entry record can have one or many associated Connected Source Data Job records. Administrators use the EDS Console utility or Postman collection to create scheduled data jobs, which in turn creates one or more Connected Source Data Job records in the Administrator’s OSDU Platform.

Core External Data Workflow Services: software on the Administrator’s OSDU Platform that automatically consumes Connected Source Registry Entry and Connected Source Data Job records to schedule automated data jobs, retrieve metadata from an external data source, and ingest it into the Administrator’s OSDU Platform.

External Data Service: any data source that is external to the Administrator’s OSDU Platform. The information contained in the Connected Source Registry Entry and associated Connected Source Data Job records for this source is used by the automated Core External Data Workflow Services to retrieve and ingest data.

JSON: JavaScript Object Notation, a text-based data interchange format and the current method of data exchange for OSDU APIs.

JSON payload: the request or response body of an OSDU API call, in JSON format.

JSON schema: a JSON document that describes the accepted representation of an object in JSON format.

Record: a JSON document of a certain data type persisted in the OSDU platform.

Source Registry: the entire collection of records of type ConnectedSourceRegistryEntry and ConnectedSourceDataJob persisted in the consuming OSDU Platform.
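As an illustration only, the Source Registry and the one-to-many relationship between its two record types can be pictured with a small in-memory model. This is a minimal sketch, not the OSDU schema: the class names mirror the record types named above, but every field (`entry_id`, `name`, `job_id`, `registry_entry_id`, `schedule`) and the `SourceRegistry` helper class are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConnectedSourceRegistryEntry:
    # Hypothetical fields; the real master-data schema defines many more.
    entry_id: str
    name: str


@dataclass
class ConnectedSourceDataJob:
    # Hypothetical fields; registry_entry_id points at the parent entry.
    job_id: str
    registry_entry_id: str
    schedule: str  # assumed cron-style expression, for illustration only


@dataclass
class SourceRegistry:
    """In-memory stand-in for the collection of both record types."""

    entries: Dict[str, ConnectedSourceRegistryEntry] = field(default_factory=dict)
    jobs: List[ConnectedSourceDataJob] = field(default_factory=list)

    def add_entry(self, entry: ConnectedSourceRegistryEntry) -> None:
        self.entries[entry.entry_id] = entry

    def add_job(self, job: ConnectedSourceDataJob) -> None:
        # A data job always belongs to exactly one registered source.
        if job.registry_entry_id not in self.entries:
            raise KeyError(f"unknown registry entry: {job.registry_entry_id}")
        self.jobs.append(job)

    def jobs_for(self, entry_id: str) -> List[ConnectedSourceDataJob]:
        # One registry entry can have one or many data jobs.
        return [j for j in self.jobs if j.registry_entry_id == entry_id]
```

In this sketch a job cannot be added before its parent entry exists, which reflects the parent/child wording of the definitions rather than any documented platform behavior.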

Reference Data Mapping: standardizes data consumed from various Data Providers to match the naming convention of each Operator (the recipient), so that the data integrates cleanly and can be used effectively within the Operator’s OSDU environment.

Master Data Value Mapping (Parent Data Mapping): standardizes data consumed from various Data Providers to match the Natural key and Synthetic ID of each Operator (the recipient), so that the data integrates cleanly and can be used effectively within the Operator’s OSDU environment.
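The idea behind these mappings can be illustrated with a lookup table that rewrites provider values into the Operator's own values. The sketch below is an assumption about the general technique, not the EDS implementation: the function name `map_reference_values`, the mapping table, and the example identifiers are all hypothetical.

```python
from typing import Any, Dict


def map_reference_values(value: Any, mapping: Dict[str, str]) -> Any:
    """Return a copy of `value` with mapped strings replaced.

    Walks nested dicts and lists; any string that appears as a key in
    `mapping` is replaced by the Operator's value. Other values are kept.
    """
    if isinstance(value, dict):
        return {k: map_reference_values(v, mapping) for k, v in value.items()}
    if isinstance(value, list):
        return [map_reference_values(v, mapping) for v in value]
    if isinstance(value, str):
        return mapping.get(value, value)
    return value


# Hypothetical provider-to-operator mapping and record, for illustration only.
example_mapping = {"provider:unit:FT": "operator:unit:ft"}
provider_record = {"data": {"DepthUnit": "provider:unit:FT", "Tags": ["provider:unit:FT", "other"]}}
mapped_record = map_reference_values(provider_record, example_mapping)
```

The function returns a new structure and leaves the provider record unchanged, so the original values remain available for comparison. A Master Data Value Mapping would presumably follow the same pattern, keyed on natural keys or IDs instead of reference values.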

Naturalization: the process that adds the data file to the OSDU Platform and converts the child dataset of the work product component (WPC) from “external” to “internal”.

High-level Overview

[Figure: high-level overview diagram]

The objective of the External Data Services (EDS) functionality is to retrieve metadata from a registered External Data Source and ingest it into the Administrator’s OSDU Platform. The metadata then becomes discoverable and accessible to consuming OSDU applications through the OSDU Search API, just like any other data ingested into the system, while the associated bulk data (files) stays at the external source for delivery on demand.

From a consumer’s perspective (that of the Administrator’s organization), one of the most valuable use cases is the ability to search, from its own OSDU Platform, for data that the organization has licensed from an External Data Source, without the overhead of storing the bulk data files that the ingested metadata refers to.

From an external supplier’s perspective, the most valuable use case is the ability to participate in the OSDU ecosystem and to supply and advertise offerings to customers, even without an OSDU implementation. Suppliers without a full OSDU implementation only need an OSDU-compliant “wrapper” that presents an OSDU-compliant interface for the Search and Dataset services.

Connected Source Registry Entry and Connected Source Data Job records, collectively known as the Source Registry, are created by an Administrator using the EDS Console utility or Postman collection. This is the only part of the solution that requires human intervention.

From there, the Source Registry information is automatically consumed by the Core External Data Workflow Services to schedule data jobs, fetch the desired master data, work product component data, and/or reference data from the specified external source, and ingest it into the Administrator’s OSDU Platform. Upon ingestion, the data is search-indexed just like any other data and is accessible to end users through the Search API (entitlements permitting). If a user decides to retrieve the bulk data file or files associated with a work product component, the Platform detects whether the file is stored externally. If it is, the Proxy Dataset Service retrieves the file from the registered external source and delivers it to the user (not in M19 scope). This process is transparent to the end user.
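The retrieval decision described above (serve internal files directly, route external files through the Proxy Dataset Service) can be sketched as a simple dispatch. This is a minimal illustration under stated assumptions: the `is_external` flag, the function `retrieve_file`, and the two fetcher callables are hypothetical and do not correspond to a documented OSDU API.

```python
from typing import Callable, Dict

Fetcher = Callable[[Dict], bytes]


def retrieve_file(
    dataset_record: Dict,
    fetch_internal: Fetcher,
    fetch_via_proxy: Fetcher,
) -> bytes:
    """Return the file bytes for a dataset record.

    `is_external` is a hypothetical flag standing in for however the
    Platform detects that the file lives at the external source.
    """
    if dataset_record.get("is_external", False):
        # Stored at the registered external source: go through the proxy.
        return fetch_via_proxy(dataset_record)
    # Stored on the Administrator's own platform: fetch it directly.
    return fetch_internal(dataset_record)
```

The caller receives bytes in both cases and never needs to know which path was taken, which is the sense in which the process is transparent to the end user.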

Dependent Services

EDS depends on the following services, and on their own dependencies, being present in pre-shipping:

  • Data Ingestion with scheduling
  • Dataset
  • Search
  • Storage
  • Workflow

DAGs available in EDS

The DAGs below need to be registered in Airflow. Registration of the DAGs is not covered in this document.

  • eds_ingest
  • eds_scheduler
  • eds_naturalization

DAGs used by EDS

  • osdu_ingest
  • osdu_ingest_by_reference
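
Before running EDS jobs, it can be useful to confirm that all of the DAGs listed above are present. The sketch below only compares names: the helper `missing_dags` is hypothetical, and how the list of registered DAG IDs is obtained from Airflow is deliberately left out. It also assumes that the DAGs used by EDS must be registered alongside the EDS DAGs.

```python
from typing import Iterable, List

# DAG names taken from the lists above.
EDS_DAGS = ("eds_ingest", "eds_scheduler", "eds_naturalization")
OSDU_DAGS = ("osdu_ingest", "osdu_ingest_by_reference")


def missing_dags(registered: Iterable[str]) -> List[str]:
    """Return the expected DAG names that are absent from `registered`.

    The result keeps the order used in the lists above, EDS DAGs first.
    """
    registered_set = set(registered)
    return [name for name in EDS_DAGS + OSDU_DAGS if name not in registered_set]
```

An empty result means every expected DAG name was found; any returned name points at a DAG that still needs to be registered.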