Implementation Progress / Issues

From Earth Science Information Partners (ESIP)
Revision as of 05:06, July 19, 2012 by Nhoebelheinrich (talk | contribs)

Testing Analysis Tools

CrossRef Implementation Issues

NjH mtg notes from tcw with Patricia Feeney at crossref.org

Best way to accomplish what we want is to assign DOIs as components of the dataset / collection DOIs. However, the following facts are important to note:

  • DOIs for overall datasets already assigned.
  • limited metadata available to retain (PF will send me a sample of a component list, and the fields possible to include as MD along with what's req'd and not)
  • component DOIs will be persistent in the crossref system; however, at this point, the DOIs assigned to the components are not queryable by the crossref system since they are not indexing the component MD at this time; maybe at some point, but not in the near future. They have no funds allocated to do this at this time.
  • can submit component list MD & DOIs in batches of 5 MB at a time.
  • need to make sure the XML file(s?) validate against the crossref.xsd before submitting; otherwise the entire batch could be rejected
  • response time once the XML is submitted ranges from several minutes to (rarely) a day, so feedback is close to immediate
  • have sample component XML file
  • types of component MD needed include a description in narrative form, the format type, and the URL created per DOI specs, e.g., 10:etc.
  • cost per component DOI is $0.06, billed at time of creation; no cost for updating MD or DOI to registered DOI (assuming that membership fee for organization has been paid). Need to calculate cost of Glacier Photo project.
  • will be one component DOI for either jpeg or tiff of same photo, so is MD based rather than format based.
  • after submission, response will be in the form of an XML-based submission log describing what happens to each submission, plus a summary, so we can find out how many failed.
  • if component DOI is included in a citation, will be for reference only. No ability to retrieve at this point except as part of the search results list returned from a query that finds the overall dataset DOI.
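
The batch-size limit and per-DOI fee noted above lend themselves to a quick planning calculation. The following is a minimal sketch only: the 5 MB limit and the $0.06 fee come from the notes, but the function names, the reading of "5 MB" as 5 × 1024 × 1024 bytes, and the optional per-batch overhead are assumptions for illustration, not CrossRef specifications.

```python
from decimal import Decimal

# Assumption: "5 MB" is read as 5 * 1024 * 1024 bytes; confirm with CrossRef.
BATCH_LIMIT_BYTES = 5 * 1024 * 1024
# Fee per component DOI as stated in the meeting notes.
COST_PER_COMPONENT_DOI = Decimal("0.06")


def split_into_batches(records, limit=BATCH_LIMIT_BYTES, overhead=0):
    """Group serialized component records (strings) into batches.

    Each batch's UTF-8 size, plus a hypothetical fixed `overhead` in bytes
    for the surrounding XML wrapper, stays within `limit`.
    """
    batches = []
    current = []
    size = overhead
    for record in records:
        record_size = len(record.encode("utf-8"))
        if overhead + record_size > limit:
            raise ValueError("a single record exceeds the batch limit")
        # Start a new batch when adding this record would pass the limit.
        if current and size + record_size > limit:
            batches.append(current)
            current = []
            size = overhead
        current.append(record)
        size += record_size
    if current:
        batches.append(current)
    return batches


def estimate_cost(component_count, unit_cost=COST_PER_COMPONENT_DOI):
    """Return the one-time registration cost for `component_count` DOIs."""
    return unit_cost * component_count
```

For instance, `estimate_cost(1000)` returns a cost of 60.00 dollars for a hypothetical 1,000 components; the actual number of Glacier Photo components would need to be substituted.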

A component list is better than a citation list, since no DOIs are assigned when a citation list is created as part of an existing DOI. Publishers can find out if an entity that is on a citation list has been cited by other publishers' systems, but the entity itself will have only the metadata that is available as part of the citation form. (Don't know what these are exactly, but since there are no DOIs assigned, it does not seem relevant.)


EZID and DataCite Implementation Issues

NjH Mtg Notes from tcw with Joan Starr, EZID Service Mgr at CDL:

Questions to Joan:

1. It appears from the DOI documentation (DOI Handbook v4.4 pdf) that it should be possible to create citable DOIs for an overall (collection in library terms) dataset, and for digital components or sub-resources within that overall dataset. Is that true?

Answer: Yes, it's possible to create a DOI for digital components or sub-resources within a digital collection per the DOI schema. The EZID service is intended to allow just that for data sets.

2. As it does not appear to be possible at this time to create citable DOIs for sub-resources of a collection-level item with the CrossRef implementation of DOI, what is the plan for the EZID service?

Two part Answer: DataCite organization and EZID service

The EZID service is designed to allow users to obtain and manage long-term, citable identifiers either for individual or batched digital resources using the DOI or the ARK identifier schemes. The service can create and resolve identifiers on behalf of the user and also allow the user to enter and maintain information about the identifier ("metadata"). Eventually, the service will also allow the deposit of the object to which the identifier refers. The service is available via both a programming interface (an API that software can use) and a web user interface. The service is relatively new and explained more fully on the EZID website at: http://www.cdlib.org/services/uc3/ezid/index.html. Info about the EZID API can be found at: http://www.cdlib.org/uc3/docs/ezidapi.html.

The EZID service must be understood within the context of DataCite, an international consortium of data creators and collectors to which CDL belongs. (See http://datacite.org/) This consortium has a number of international members who are working with scientific and technical data, including national data centers and institutes such as the Australian National Data Service (ANDS), Canada Institute for Scientific and Technical Information, and the British Library. See the current list of members at http://datacite.org/members.html. This group is in the process of finalizing a set of descriptive metadata that they wish to recommend for use for datasets. I have received an early, public version of the recommended metadata set (or "kernel") that was put out for public comment in August of this year ("DataCite Metadata Kernel for the Publication and Citation of Research Data"). The list includes both required and optional elements. The first five elements are intended to be enough to create a citation for any resource that has a registered DOI, i.e., [Creator] ([PublicationDate]): [Title]. [Publisher]. [doi:DOI]. [http://dx.doi.org/DOI]. The Metadata group is now re-drafting the metadata recommendations based on the fairly extensive feedback that they received, with the expectation that the finalized list will be released by the end of this calendar year.
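
The five-element citation pattern quoted above can be illustrated with a short sketch. The function name is hypothetical, the values in the usage note are made-up placeholders rather than real Glacier Photo metadata, and writing the resolver link as http://dx.doi.org/ followed by the DOI is an assumption about what the final bracket in the pattern is meant to show.

```python
def format_citation(creator, publication_date, title, publisher, doi):
    """Assemble a citation string from the five draft-kernel elements.

    Follows the order quoted from the draft:
    Creator (PublicationDate): Title. Publisher. doi:DOI. resolver URL
    """
    return (
        f"{creator} ({publication_date}): {title}. {publisher}. "
        f"doi:{doi}. http://dx.doi.org/{doi}"
    )
```

Called with placeholder values such as `format_citation("Doe, J.", "2010", "Sample Glacier Photos", "Example Data Center", "10.5072/EXAMPLE")`, this yields a single line with the creator, date, title, publisher, DOI and resolver link in that order.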

Some important facts to consider in using the EZID service:

1. EZID is available via a web based user interface designed for an individual scientist or data creator perhaps, and an API that can be integrated into other services, and probably facilitate batch use. The latter approach seems the one we should take for this project, especially as it could be used for both DOIs and ARKs. Practicability of the approach will need to be investigated by Yuechen, however.

2. At the moment, only University of California or DataOne partners can use the EZID service without prior negotiation with CDL. I suspect the ESIP Federation would not have too much difficulty negotiating use of the service for the testbed at least.

3. To negotiate use of EZID, we would have to establish an EZID User Group with the following responsibilities per the EZID Service Guidelines. See: http://www.cdlib.org/services/uc3/docs/EZIDServiceGuidelines.pdf

3.4 About EZID User Groups: The EZID notion of a group (or "owner group") is an aggregation of users that collectively inherits the identifiers owned by individuals in the group. EZID uses groups in three ways.

  • First, if an individual member of a group is no longer active, for whatever reason, we will work with the group administrator to assign a new owner to the member's identifiers.
  • Second, the EZID group is also the mechanism by which the EZID system controls the categories of identifiers (the “prefix”) an individual member can make using EZID.
  • Lastly, we will use the group record information for billing purposes when we implement our cost recovery program. See section 5.5, Financial Responsibilities.

4.5 Rights/Intellectual Property: The UC Curation Center and CDL make no claims of ownership about identifiers or metadata entered into EZID. Ownership of the identifiers is determined by EZID user and owner group. See section 3.4 above for more information about EZID groups.

Presumably either the ESIP Federation or one of its members could assume the role of the user group? This needs group discussion.

4. The EZID service relies upon 3rd parties to manage the relationship to DataCite, the registration agency for the DOIs, i.e., The German National Library of Science and Technology (TIB) in Hannover, Germany. Currently the primary Handle Servers at TIB and Swiss Federal Institute of Technology (ETH), Zurich, store the core registration records. There is a mirror run by the Corporation for National Research Initiatives (CNRI), in Reston, VA. TIB technical staff members guarantee a minimum of 24/5 service reliability of the resolution and registration infrastructure.

5. What EZID stores / doesn't store & ESIP user group ongoing responsibilities:

  • EZID stores the identifier string and its metadata, and internally generated, administrative metadata.
  • EZID does not store passwords except in encrypted form via a one-way hash for account security purposes. All stored data for the identifiers owned by "our" group would be accessible using the API. The method for doing so is included in the API documentation.
  • The ESIP user group would have to take responsibility for maintaining the location of the digital resources, and for maintaining the metadata for the location of the resources per its permanent ID. The group can set this value, per the EZID API, Version 2. This would be the target URL for the resources. It appears that there can be separate targets for different formats of the resource, i.e., the jpeg and the tiff versions for this collection / set.
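
How a target URL and descriptive metadata might be packaged for the API can be sketched offline. This assumes a line-oriented "name: value" request body with percent-encoding of "%", line breaks and (in names) colons, and a reserved `_target` element for the location, which is my understanding of the EZID API documentation and should be verified there. The function names, the `datacite.title` value and the example.org URL are hypothetical, and nothing here contacts the service.

```python
import re


def _escape(text, is_name=False):
    """Percent-encode characters that would break a "name: value" line.

    Assumption: '%', CR and LF are always encoded, and ':' is also
    encoded inside element names.
    """
    pattern = r"[%:\r\n]" if is_name else r"[%\r\n]"
    return re.sub(pattern, lambda m: "%%%02X" % ord(m.group(0)), text)


def build_metadata_body(target_url, metadata):
    """Build a plain-text metadata body: the target first, then other elements."""
    lines = ["%s: %s" % (_escape("_target", True), _escape(target_url))]
    for name, value in metadata.items():
        lines.append("%s: %s" % (_escape(name, True), _escape(value)))
    return "\n".join(lines)
```

If separate targets per format turn out to be supported, one body could be built per format, e.g. one call with the jpeg URL and another with the tiff URL; whether those map to one identifier or several is a question for testing.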

6. At present, there is no cost for the EZID service. In the future, however, the UC Curation Center running the service intends to charge a fee for cost recovery purposes. The business model for those costs is not yet public, but should be within several months (depending upon the CA state university bureaucracy).

7. In the paper announcing the draft MD Kernel, the Metadata group provided a comparison of their terms to those specified by DOI, among others. It appears from a quick review of the metadata that we have for the Glacier data set that we will have all the mandatory metadata, plus other optional metadata that will allow us to create opaque identifiers, and also make use of the descriptive information that is available for each / most of the components of the Glacier Photo data set or "collection". The feasibility of adding the other descriptive info will need to be determined as we test use of the service / API.

EZID Implementation Questions for Organizations Using Service

1. Which ID schemes do we want to use? DOI and/or ARK are both available at this time.

Answer: For MD testbed, both. [Verify.]


2. What type of organization is requesting this service? A single entity? Groups of researchers?

Answer: TBD. If ESIP per se, would ESIP be described as a single entity or a group of entities? Could individual organizations create EZIDs under ESIP aegis?


3. IDs must have owners who promise maintenance of the ID over time. Must match to a single email address of an individual or an organization. Who should this be?

  • Maintenance of the ID over time implies that the individual or organization will take responsibility for changing the descriptive metadata for the resource should this change.
  • Very little metadata is required, but much more could / should(?) be present. See the MD comparison spreadsheet attached to this page at http://wiki.esipfed.org/index.php/File:MDComps4DOI_v2.pdf.
  • Question of versions: what / when is a resource "versioned"? Is there a means to express this in the EZID relationship scheme?


4. Occasionally, there are system announcements coming from CDL regarding the EZID service. Who should receive the system announcements? What is their email address?


5. Although the business model for the EZID service is not yet finalized by the UC system, it appears that there needs to be an organization which can pay an annual membership fee.

  • This membership fee is not designed to be very expensive as the service is geared toward academic and non-profit organizations.
  • It is also designed to be charged in lieu of a cost per ID assigned up to a limit (TBD, but probably around 500K).
  • The purpose of the ID limit at no charge is to prevent the "runaway" creation of IDs and to encourage the intentional assignment of them.


6. Need to have a rough idea of the number of IDs to be assigned in order to determine whether they will require their own name authority.

  • The name authority will be assigned or created by user group, e.g., ESIP, but maintained within the EZID system presumably.