Ocean Data Discovery

Online communities and research institutes provide ocean data, but for this data to be accessible it must be easy to reach. Our team is exploring ways to harness AI for the task of discovering online ocean data. In the data discovery phase, candidate datasets that fit a set of study parameters are collected. The process of data discovery can be divided into three distinct steps: search, link, and identify.

Search

In the search step, a list of candidate studies is collected. Search is performed on repositories or through portals that provide access to multiple repositories, hereafter referred to as data sources. Data sources may contain either textual descriptions of studies or the datasets themselves.

Identify

The same data may appear in several datasets because it was used in several studies. Researchers must therefore read the data collection procedures of every study they use carefully, both to make sure that their data do not contain duplicate measurements and to identify each dataset, or even each data point, uniquely.
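One way to picture the duplicate check is the minimal sketch below. It assumes that a measurement can be keyed by its position, time, depth, and measured parameter. The record fields, the rounding tolerance, and the dataset names are all hypothetical and are not taken from any ODINI tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Measurement(NamedTuple):
    dataset: str      # hypothetical dataset identifier
    lat: float        # degrees north
    lon: float        # degrees east
    time: str         # ISO 8601 timestamp
    depth_m: float    # depth in metres
    parameter: str    # e.g. "temperature"
    value: float


def measurement_key(m: Measurement, decimals: int = 3) -> tuple:
    """Key that identifies a measurement regardless of which dataset holds it."""
    return (round(m.lat, decimals), round(m.lon, decimals),
            m.time, round(m.depth_m, 1), m.parameter)


def find_duplicates(measurements):
    """Map each key seen in more than one dataset to the sorted dataset names."""
    seen = defaultdict(set)
    for m in measurements:
        seen[measurement_key(m)].add(m.dataset)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Illustrative records only; the values are made up.
records = [
    Measurement("study_a", 35.1234, 18.5, "2008-04-01T12:00:00Z", 10.0, "temperature", 16.2),
    Measurement("study_b", 35.1234, 18.5, "2008-04-01T12:00:00Z", 10.0, "temperature", 16.2),
    Measurement("study_b", 36.0, 19.0, "2008-04-02T12:00:00Z", 10.0, "temperature", 15.9),
]
duplicates = find_duplicates(records)
```

In practice the choice of key and tolerance would depend on how each study reports its collection procedure, which is why a manual reading is still needed.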

Link

The linking process entails connecting studies to their datasets (and vice versa) and connecting each derived dataset to the one or more datasets from which it was derived.
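The links described here can be thought of as a small directed graph. The sketch below is one possible representation, not the project's actual data model; the class name, method names, and dataset identifiers are hypothetical.

```python
from collections import defaultdict


class LinkGraph:
    """Links between studies and datasets, and between derived datasets."""

    def __init__(self):
        self.study_datasets = defaultdict(set)   # study -> datasets it uses
        self.dataset_studies = defaultdict(set)  # dataset -> studies using it
        self.derived_from = defaultdict(set)     # dataset -> parent datasets

    def link_study(self, study, dataset):
        """Record the study-to-dataset link in both directions."""
        self.study_datasets[study].add(dataset)
        self.dataset_studies[dataset].add(study)

    def link_derivation(self, child, parent):
        """Record that `child` is derived from `parent`."""
        self.derived_from[child].add(parent)

    def ancestors(self, dataset):
        """All datasets that `dataset` is directly or indirectly derived from."""
        found, stack = set(), [dataset]
        while stack:
            for parent in self.derived_from.get(stack.pop(), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found


# Hypothetical example: one merged product built from two cruises.
graph = LinkGraph()
graph.link_study("study_a", "merged_product")
graph.link_derivation("merged_product", "cruise_1")
graph.link_derivation("merged_product", "cruise_2")
graph.link_derivation("cruise_2", "raw_ctd_casts")
sources = graph.ancestors("merged_product")
```

Walking the derivation links in this way gives the full set of datasets that a given dataset depends on, which is the information needed to spot shared origins.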

Ocean Data Discovery Application

This application will allow researchers to use a unified ontology-based data discovery form to search for oceanographic datasets. The datasets can then be collected and used in the data integration application. This application is under construction.

Sometimes trying to solve a problem leads you to interesting questions. We were working with datasets whose location was identified by a geocode (LAT/LONG) and wanted to know in which named sea they were collected (The Weddell Sea? The Mediterranean?). It turns out there is no public service that reverse geocodes coordinates into location names. So we made one. Thanks to the efforts of Antonio Zaitoun, the see-sea service is now online: https://see-sea.odini.net/docs and allows you to locate the closest saltwater body, reef, or gulf. The names and locations were retrieved from wikidata.org.
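The general idea of finding the closest named sea for a coordinate pair can be sketched as a nearest-neighbour lookup by great-circle distance. This is not the see-sea implementation or its API. The two gazetteer entries use rough, illustrative coordinates, and the function names are hypothetical.

```python
import math

# Hypothetical mini-gazetteer: name -> (lat, lon) of a rough reference point.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Weddell Sea": (-73.0, -45.0),
    "Mediterranean Sea": (35.0, 18.0),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_sea(lat, lon, gazetteer=GAZETTEER):
    """Name of the gazetteer entry whose reference point is nearest to (lat, lon)."""
    return min(gazetteer, key=lambda name: haversine_km(lat, lon, *gazetteer[name]))
```

A single reference point per sea ignores the sea's extent, so a point near the edge of a large sea could be matched to a smaller neighbour. A production service would more likely need boundary geometries or many points per water body.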

Ocean Data Sources

The data to be integrated in the ocean data integration process is available in various web-based sources, which can be divided roughly into three categories: Raw sources, Repositories and Portals. The evolving list of sources currently used in ODINI is found in the tables at right.

Data is collected by devices and expeditions that are part of numerous scientific projects and ongoing efforts. Many of these efforts maintain their own websites, where they publish the data periodically or as a live stream. We call these Raw sources. For example, the SESAME project website (sesame-ip.eu), now defunct, contained announcements of the data it periodically collected.

Repositories store data from multiple sources, which use them to archive their data and make it more widely available. The SESAME project data discussed above was archived at the PANGAEA data repository (pangaea.de) and is available there to date, even though the SESAME project webpage is long gone.

Portals provide an interface to search in multiple repositories but do not host the data themselves. For example, dataone.org is a portal through which one can search multiple repositories and find the SESAME project data now hosted on PANGAEA.

Here at ODINI we are in the process of mapping out these multiple types of data sources by crawling the web and creating a database of such sources. In future work, we intend to establish the data lineage of the different sources to be able to understand the coverage of the portals and repositories with respect to the raw sources and with respect to each other.
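Once lineage between sources is known, coverage could be expressed as the fraction of raw sources reachable through a given repository or portal. The sketch below assumes two simple lookup tables, one for what each repository archives and one for what each portal indexes. The function names and the entries "raw_source_x" and "repo_y" are hypothetical, and the tables are illustrative rather than a record of actual lineage.

```python
def reachable_raw_sources(source, archives, indexes):
    """Raw sources reachable from a repository (directly) or a portal (via repositories)."""
    if source in archives:                       # source is a repository
        return set(archives[source])
    raw = set()
    for repository in indexes.get(source, ()):   # source is a portal
        raw |= set(archives.get(repository, ()))
    return raw


def coverage(source, all_raw, archives, indexes):
    """Fraction of the known raw sources that `source` gives access to."""
    if not all_raw:
        return 0.0
    reachable = reachable_raw_sources(source, archives, indexes)
    return len(reachable & all_raw) / len(all_raw)


# Illustrative lineage tables; "raw_source_x" and "repo_y" are made up.
all_raw = {"sesame-ip.eu", "raw_source_x"}
archives = {"pangaea.de": {"sesame-ip.eu"}, "repo_y": {"raw_source_x"}}
indexes = {"dataone.org": {"pangaea.de"}}

portal_coverage = coverage("dataone.org", all_raw, archives, indexes)
repository_coverage = coverage("pangaea.de", all_raw, archives, indexes)
```

Comparing the reachable sets of two portals or repositories in the same way would show how much they overlap with each other.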