RDTF metadata that is exposed using the community formats approach must be made available under an open licence, using a non-proprietary file format (such as one of those listed in the examples section below).
Where HTTP is used, one or more sitemaps (conforming to the Sitemap protocol [12]) should also be made available, listing the available files. The sitemaps should be listed in a robots.txt file. Sitemaps should use the following RDTF extension to differentiate RDTF files from other content:
<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf"> <!-- namespace extension -->
  <url>
    <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
    ...
  </url>
</urlset>
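As an illustration only, a harvester might pick the RDTF entries out of such a sitemap along the following lines. This is a minimal sketch, not part of the guidelines: the function name `rdtf_locations` and the sample document are assumptions, and only the two namespace URIs are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used in the sitemap example above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RDTF_NS = "http://purl.org/rdtf"


def rdtf_locations(sitemap_xml: str) -> list[str]:
    """Return the text of every rdtf:loc element found inside a sitemap url entry.

    Hypothetical helper: ordinary <loc> entries are ignored, so a harvester
    sees only the files flagged with the RDTF extension.
    """
    root = ET.fromstring(sitemap_xml)
    locations = []
    for url in root.iter(f"{{{SITEMAP_NS}}}url"):
        for loc in url.findall(f"{{{RDTF_NS}}}loc"):
            if loc.text and loc.text.strip():
                locations.append(loc.text.strip())
    return locations


# Invented sample: one RDTF file and one ordinary web page.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf">
  <url>
    <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
  </url>
  <url>
    <loc>http://example.org/index.html</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for location in rdtf_locations(EXAMPLE):
        print(location)
```

A crawler that only understands the standard `loc` element would skip the RDTF entry, which appears to be the intent of the extension.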
Where possible, all significant resources associated with the collection of interest should be described using separate records. For example, there should be separate records describing a physical museum artifact and any digital surrogates of that artifact (e.g. images). For the purposes of these guidelines, a significant resource is one that is likely to be of interest to end-users, differs from other resources in terms of format or other attributes, and may have different ownership and/or usage restrictions to other resources. Note that ‘resources’ (as used here) may include conceptual entities (e.g. a FRBR ‘work’), people and organisations as well as both physical and digital objects. Where metadata is encoded using CSV [14] (or similar), a record corresponds to a row in the table. Where metadata is encoded using the OAI-PMH, a record corresponds to an OAI-PMH record.
All metadata records should contain an attribute/field/property that can be used as a label or title for the resource. Where metadata is encoded using CSV (or similar), this label should appear in a column called ‘label’ or ‘title’. Where metadata is encoded using CSV (or similar), any identifier for the resource (e.g. an ISBN) should appear in a column called ‘identifier’. In addition, for any metadata encoded using CSV, the first row should contain the column headings, there should be no use of footnotes and all rows should be of the same length.
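The CSV guidance above could be checked mechanically before a file is published. The sketch below is one possible reading of it, not a definitive validator: the function name `check_csv`, the case-insensitive header comparison and the wording of the messages are all assumptions, and the check for footnotes is left out because it cannot be detected reliably.

```python
import csv
import io


def check_csv(text: str) -> list[str]:
    """Return a list of problems found in CSV text; an empty list means none.

    Hypothetical helper covering three points of the guidance: a heading row
    with a 'label' or 'title' column, an 'identifier' column, and rows of
    equal length.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    # Assumption: headings are compared case-insensitively.
    header = [heading.strip().lower() for heading in rows[0]]
    if "label" not in header and "title" not in header:
        problems.append("no 'label' or 'title' column")
    # The guidance asks for this column only where identifiers exist,
    # so treat this message as a prompt rather than an error.
    if "identifier" not in header:
        problems.append("no 'identifier' column")

    # Row numbers count parsed rows, with the heading row as row 1.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


if __name__ == "__main__":
    sample = "identifier,title\nid-001,Example title\nid-002\n"
    for problem in check_csv(sample):
        print(problem)
```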
Metadata about the resources associated with multiple collections may be made available. In general, where HTTP is used, there should be one file per collection; where the OAI-PMH is used, collections should be partitioned into separate repositories or separate sets within a single repository. Note that ‘collection’ (as used here) simply means any grouping of resources for curatorial, discovery or some other purpose.
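Where records for several collections are held in a single table, the "one file per collection" arrangement could be produced along the following lines. This is a sketch under stated assumptions: the `collection` field, the function name `split_by_collection` and the `<collection>.csv` naming scheme are invented for illustration and are not prescribed by these guidelines.

```python
import csv
import io
from collections import defaultdict


def split_by_collection(records, collection_field="collection"):
    """Group record dicts by collection and return {filename: CSV text}.

    Hypothetical helper: each collection becomes one CSV file whose first
    row holds the column headings; the grouping field itself is dropped.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[collection_field]].append(record)

    files = {}
    for name, rows in groups.items():
        # Take the headings from the first record in the group.
        fieldnames = [field for field in rows[0] if field != collection_field]
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=fieldnames,
            extrasaction="ignore",  # silently skip the grouping field
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        files[f"{name}.csv"] = buffer.getvalue()
    return files


if __name__ == "__main__":
    sample = [
        {"identifier": "a1", "title": "Letter", "collection": "letters"},
        {"identifier": "m1", "title": "Map", "collection": "maps"},
        {"identifier": "a2", "title": "Postcard", "collection": "letters"},
    ]
    for filename, content in split_by_collection(sample).items():
        print(filename)
        print(content)
```

Each resulting file could then be listed in the sitemap as a separate entry.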
For libraries, typical file format examples include library catalogue records encoded using MARC21 [15] or MODS [16], BibTeX [17], RIS [18], the CrossRef output schema [19], Dublin Core records encoded using XML [20], the Europeana Semantic Elements (ESE) format and formats based on JSON [21], Atom [22] or RSS [23].
For museums, typical file format examples include museum catalogue records encoded using SPECTRUM [24], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.
For archives, typical file format examples include archival descriptions encoded using EAD [25], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.
- there are likely to be many existing tools (both proprietary and open source) available that will display (and possibly modify) the metadata.
- the context of where metadata originated is more likely to be lost, leading to issues around data provenance, trust, etc.;
- the semantic relationships between the attributes used to describe resources (particularly across multiple collections) may be unclear;
- citing resources of interest is likely to be problematic.
- software tools for processing the metadata will probably need to be both format- and provider-specific, with many ad hoc heuristic techniques being adopted (including the need to cope with different interpretations of the same format by different providers);
- the metadata is unlikely to carry explicit links to other metadata/content, making web-effects unlikely.
As a non-technical person – the need for the RDTF extension strikes an odd note – suggesting a lack of openness!
Yes I think this should say something like ‘ISAD(G) compatible archival descriptions encoded in EAD’ – this may help constrain some of the wilder excesses possible with EAD as well!
While I think this would be the minimum, here we would want to encourage ‘related’ EAC-CPF files (compatible with ISAAR(CPF)) as well!
I agree with the assertions in these paragraphs – but I think it could be pointed out that much of the programmatic join up on the Web uses this approach and works to some degree. It’s not necessarily a dead end. The development of tools which are “both format- and provider-specific” could, in the end, be the most pragmatic approach, as such tools become easier to create and adapt.
I agree with both previous comments
In this section (and benefits) there seems to be a mixture of end users (e.g. academic wanting resource and cataloguer taking record for local use) perhaps being more specific would be helpful. As it stands I’m unclear why citing a resource would be problematic – if I want to cite the resource I’m going to go to the resource and cite it (rather than the aggregated record) [or are we using cite differently?].
Re. JSON and Atom: If this document is aspirational, then yes – I think these need to be mentioned. JSON is becoming a default already for certain important domains of functionality – for example for developers building mobile interfaces.
Many systems will export CSV with an internal ID – typically in the first column, often an integer (e.g. ‘rownum’, ‘recID’). Is this something we might want to actively discourage from being included in ‘RDTF CSVs’? As in actually say, “please mint and include a URL *instead* of any internal, system-level ID”.
I still feel slightly unsure about this. The idea that a sitemap might list available ‘files’ doesn’t seem to quite fit with the demands of the next but one paragraph (7) which calls for a fine granularity of record.
We’re going to quite quickly get to the point of unanticipated, just-in-time serialisation of records based on dynamic criteria for some collections, aren’t we? Not sure how a sitemap works in this scenario.
Given all the discussion going on about mechanisms for sharing metadata/finding resources in the scholarly works community and other communities (such as Open Educational Resources), which are using RSS/Atom quite a lot for this purpose, I’m a bit surprised not to see it in this list.
Yes there are limits to the use of RSS/Atom in this way but it’s not stopping several discovery services using it.
“formats based on Atom or RSS”
Add? ‘such as GData [ http://code.google.com/apis/gdata/ ] or OData [ http://www.odata.org/ ]‘
I think the newly published Dataset Publishing Language (DSPL http://code.google.com/apis/publicdata/docs/developer_guide.html ) adds a lot to the discussion of this para. The DSPL allows for the publication of one or more CSV files supported by an XML file containing metadata that describes the data in the CSV files.
The DSPL can be regarded in a couple of ways: 1) as a candidate representation; 2) at a more abstract level as an implementation of an idea about how best to represent and model data, with an associated vocabulary for describing that idea. So for example, terms are described for identifying data as “dimensions” (categorical) or as “metrics” (non-categorical, time-varying, numeric values), with a top-tip/handy rule of thumb recommendation that “your dataset will be more flexible if you keep metrics to a minimum, and instead create meaningful dimensions”.
Another example would be one record per item in an Atom/GData feed.
I’m not sure the fact that something is easy to do is really one of the benefits of doing it.
It sort of feels like even community formats isn’t all that easy. It could be just the mention of lots of different formats starts to seem off-putting and overwhelming. I’d almost want to keep GZip, OAI-PMH etc. out of the ‘first impression’, maybe by re-structuring the document somehow, so people know that getting CSVs out there is good enough.
I agree that there is a lot of detail in this section that starts to overwhelm – see my comment elsewhere.
The only thing I’d say is I’m not convinced that in GLAM csv is going to be the ‘go to’ format it might be in other areas. If you are a library, you’ll have a MARC export. If you are an archive, EAD etc.
Couldn’t this also be recommended, though not required, as a nudge that might also lead to increased use of compression across a site? (Assuming that we think compression is a Good Thing?!;-)
So HTTP POST would not be okay?
There are lots of things that could be weakly classed as an “open license” that conflict with other “open licenses”. Should the statement be phrased in terms of “licensed with an open license that is compatible with X, such as Y, Z”?
Must this be the case? It’s bound to put people off
Re. HTTP GET, might it be possible to say something like “… HTTP GET requests, such as when included in standard web pages”, to give a sense of what this might mean. If the guidelines only aimed at technical implementors then maybe this isn’t a problem, but otherwise it could be.
As the others, I think the ‘separate records’ principle here is good per se, but that won’t make things easier with the data. Perhaps that should be acknowledged in the costs…
Btw it may be worth mentioning that ESE re-uses Dublin Core a lot.
There’s nothing at http://purl.org/rdtf/. Isn’t that sending a conflicting message with the “should use the following RDTF extension”?
I don’t understand why provenance is thought to be such an issue for this option. The integrity of the metadata seems least at risk in its native format
Should there be a recommendation to explain the scope of the metadata? Although MARC is an open format, there is provision for locally defined metadata. Suggest a recommendation that if locally defined metadata is made available, definitions are provided to save aggregator effort – see para 23.
Does this mean that duplication of metadata descriptions is expected/acceptable?
CSV could also be of interest to libraries. Not all systems have the capability to output MARC. A csv output may also be more accessible to potential users than MARC.
A tabulated display might be clearer and would highlight those standards which are used across GLAM.
While no longer supported by BL, a significant number of libraries in UK and Ireland still use UKMARC.
“Where possible all significant resources…should be described using separate records”. This is aspirational from library perspective. In FRBR terms, most MARC records are composites of Work, Expression and Manifestation. Some records may aggregate description of the original work with details of a surrogate manifestation. Legacy bibliographic records may describe multiple resources, for example: manifestations issued with different bindings. It is unlikely that the modeling necessary to achieve a 1:1 relationship between records and resources could be justified in relation to the community format approach. It is an issue that we will give much more thought to in future as we develop our open data into open linked data.
After some twitter discussion around this with Jane Stevenson I think there is a danger that the very basic nature of what is required here is obscured by other detail that may not apply.
My reading is:
The ‘Community Format’ approach would allow a (MARC/EAD/other sector specific format) dump that could be posted to a website. For those taking this approach the only other requirement would be for a sitemap file.
I think this might come as a relief to many (especially, but not only, small organisations).
It feels like some of the other issues (e.g. raised in 7 & 8) could be ignored by those following this approach.
If I’ve read this correctly it may be worth spelling out – make this entry point sound as simple as possible.
What do you see as the purpose of the sitemap? Just wondering if we should be recommending use of any of the ‘optional’ aspects of the sitemap spec? Thinking specifically of ‘lastmod’ (and possibly ‘changefreq’)
I can see what Leonard means, and thought of this too. But ISAD(G) is not a file format – EAD complies with ISAD(G). You could add EAC-CPF though, as that’s the format that complies with ISAAR(CPF).
‘Collection’ for archives generally means one grouping of items created/accumulated by the same person/organisation, which may easily consist of several thousand items. I think the guidelines may have to be more explicit about what constitutes a ‘significant resource’ for archives. Don’t want to disappear into semantics too much, but I think it’s a fundamental difference with museums and libraries.
Some libraries in the UK use UNIMARC; some are still using the obsolete UKMARC format. And MADS goes with MODS. To list or not to list … is a perennial problem, but standards anxiety may lead to “typical” being overlooked.
Despite the “Where possible”, this might be off-putting to libraries with legacy records that embed a print original and digital surrogates in a single record, and confuse libraries adopting FRBR, where a “record” typically consists of linked records for Work, Expression, Manifestation and Item. Perhaps a specific note about legacy metadata records would clarify.
In response to Ralph:
That would have to be Atom with several Atom extensions to achieve a harvesting framework with a functionality that is similar to that of OAI-PMH. As a matter of fact, several Atom extensions that could make this possible have been proposed. And, with some colleagues, I have briefly investigated the possibility of defining a profile of Atom that provides PMH-like functionality. Given there is increased interest in that direction, an effort to this end may be launched some time soon.
It’s true that complete OAI-PMH duplication would require significant enhancement of Atom feeds. But a simple feed of records can be accomplished with Atom as it is now.
You’re assuming we need the extra functionality of PMH over web feeds. Largely, nobody does any more, which is why it isn’t used outside its original community, and why use is shrinking rather than growing.
I wouldn’t call OAI-PMH an “encoding”: it’s a delivery mechanism.
Agreed, how about ‘made available’ ?
… oh, and might be worth mentioning VRA, widely used for art collections.
Not sure it’s entirely useful to divide into three paragraphs; some formats cross over and there is some duplication here. Maybe better to have a list of examples?
In my view the provenance context is easiest to preserve in this approach: one statement per record is sufficient. This gets really tricky with aggregations of RDF triples.
Owen, Re: identifier for the record – this is a simple typo – should be identifier for the resource – will correct in the text to prevent further confusion. Thanks.
Perhaps the main guidance that needs to be given for those providing CSV or other generic formats is to have accompanying documentation that clearly documents the structure and nature of the data. Interpreting formats like JSON and CSV without this will be impossible
JSON sufficiently important to section of the developer community I think…
Identifiers such as ISBN are useful – wouldn’t want to see these excluded. However, I do think the opportunity to ask for a URI if one is available is too good to miss.
My own preference is that ‘community formats with URIs’ is an additional option alongside ‘community format’ (in the same way you have RDF vs Linked Data)
You say ‘any identifier for the record’ and then ‘e.g. an ISBN’. Risking getting into semantics… but of course an ISBN isn’t an identifier for the record, it’s an identifier for something else (probably something close to the manifestation in FRBR terms).
So – is the identifier expected here to identify the resource described, or the record describing the resource?
It doesn’t seem unreasonable to ask for label/title – but what is it for? Just as an example, if you had a MARC record for a letter it might not have a ‘title’ (in a 2XX field), although it would have something in the record somewhere that could be used as a substitute for a title in an index or human readable display.
To think about CSV specifically: if we get a CSV file of records describing letters then it would be pointless if the ‘label’ or ‘title’ column contained just the word ‘Correspondence’ – although that would be within these guidelines.
What I’m getting at is the perceived purpose of label/title field. Why not just say ‘should include at least one field which clearly describes the item in human readable format’ or some such?
It feels wrong to be doing something to deliberately hide stuff from Google – isn’t this up to Google?
Is this mixing two types of mechanism?
Should you add ISAD(G), ISAAR(CPF) and other standards from the International Council on Archives? These do not give formal file formats, but they do list the elements and structure that should be included.
I’d prefer an RSS or Atom feed to OAI-PMH. They certainly have broader acceptance outside the library community.
You’d likely want to include LIDO in this list. http://bit.ly/gAkco0 – it’s the backbone of Athena http://www.athenaeurope.org/.
Question from the authors: The use of an RDTF extension here is to prevent accidental harvesting of, say, large files of MARC records by Google’s crawlers. Does this make sense?
Question from the authors: This is currently the only reference to JSON, Atom and RSS. Are JSON and Atom sufficiently important that they should get a mention? If so, how?
Wonder if it’s worth mooting as a default format – you can expose MARC with OAI-PMH for the six people in the world who give a toss, but you should also provide RSS with basic details for wider impact.
Question from the authors: Is the guidance to include a column called ‘identifier’ sufficient? Should we also (or instead) ask for a column called ‘uri’?
I hope this comment is appearing in the right place – I have deliberately used the ‘reply’ function despite being warned not to…
I agree with Owen – I would like to see a ‘community formats with unique URLs as identifiers option’. In fact this would be my preferred path frankly. And yes, I think I do mean ‘URL’…. ;-)
Question from the authors: Is the “provide a ‘label’ or ‘title’ and ‘identifier’ column” stuff useful? What about the other CSV guidance? Should we say anything else?
I agree with Owen that a clear description is what is needed. I’m not sure how this will work if you have an archive that includes a series of letters – they may have the title ‘letters’ because the collection has the full descriptive title. You’d need both.
Question from the authors: Have we got the list of example community formats right? Are things missing?