Metadata guidelines for the UK RDTF

Bill Stockting
2/18/2011

As a non-technical person – the need for the RDTF extension strikes an odd note – suggesting a lack of openness!

Bill Stockting
2/18/2011

Yes I think this should say something like ‘ISAD(G) compatible archival descriptions encoded in EAD’ – this may help constrain some of the wilder excesses possible with EAD as well!

While I think this would be the minimum here, we would want to encourage ‘related’ EAC-CPF files (compatible with ISAAR(CPF)) as well!

Paul Walk
2/18/2011

I agree with the assertions in these paragraphs – but I think it could be pointed out that much of the programmatic join up on the Web uses this approach and works to some degree. It’s not necessarily a dead end. The development of tools which are “both format- and provider-specific” could, in the end, be the most pragmatic approach, as such tools become easier to create and adapt.

Paul Walk
2/18/2011

I agree with both previous comments

John Robertson
2/18/2011

In this section (and benefits) there seems to be a mixture of end users (e.g. an academic wanting a resource and a cataloguer taking a record for local use); perhaps being more specific would be helpful. As it stands I’m unclear why citing a resource would be problematic – if I want to cite the resource I’m going to go to the resource and cite it (rather than the aggregated record) [or are we using cite differently?].

Paul Walk
2/18/2011

Re. JSON and Atom: If this document is aspirational, then yes – I think these need to be mentioned. JSON is becoming a default already for certain important domains of functionality – for example for developers building mobile interfaces.

Paul Walk
2/18/2011

Many systems will export CSV with an internal ID – typically in the first column, often an integer (e.g. ‘rownum’, ‘recID’). Is this something we might want to actively discourage from being included in ‘RDTF CSVs’? As in actually say, “please mint and include a URL *instead* of any internal, system-level ID”.

Paul Walk
2/18/2011

I still feel slightly unsure about this. The idea that a sitemap might list available ‘files’ doesn’t seem to quite fit with the demands of the next but one paragraph (7) which calls for a fine granularity of record.
We’re going to quite quickly get to the point of unanticipated, just-in-time serialisation of records based on dynamic criteria for some collections, aren’t we? Not sure how a sitemap works in this scenario.

John Robertson
2/18/2011

Given all the discussion going on about mechanisms for sharing metadata/finding resources in the scholarly works community and other communities (such as Open Educational Resources), which are using RSS/Atom quite a lot for this purpose, I’m a bit surprised not to see it in this list.
Yes there are limits to the use of RSS/Atom in this way but it’s not stopping several discovery services using it.

Tony Hirst
2/18/2011

“formats based on Atom or RSS”
Add? ‘such as GData [ http://code.google.com/apis/gdata/ ] or OData [ http://www.odata.org/ ]’

Tony Hirst
2/18/2011

I think the newly published Dataset Publishing Language (DSPL http://code.google.com/apis/publicdata/docs/developer_guide.html ) adds a lot to the discussion of this para. The DSPL allows for the publication of one or more CSV files supported by an XML file containing metadata that describes the data in the CSV files.

The DSPL can be regarded in a couple of ways: 1) as a candidate representation; 2) at a more abstract level as an implementation of an idea about how best to represent and model data, with an associated vocabulary for describing that idea. So for example, terms are described for identifying data as “dimensions” (categorical) or as “metrics” (non-categorical, time-varying, numeric values), with a top-tip/handy rule of thumb recommendation that “your dataset will be more flexible if you keep metrics to a minimum, and instead create meaningful dimensions”.

Tony Hirst
2/18/2011

Another example would be one record per item in an Atom/GData feed.

Adrian Stevenson
2/18/2011

I’m not sure the fact that something is easy to do is really one of the benefits of doing it.

It sort of feels like even the community formats approach isn’t all that easy. It could be that just the mention of lots of different formats starts to seem off-putting and overwhelming. I’d almost want to keep GZip, OAI-PMH etc. out of the ‘first impression’, maybe by re-structuring the document somehow, so people know that getting CSVs out there is good enough.

    Owen Stephens
    2/18/2011

    I agree that there is a lot of detail in this section that starts to overwhelm – see my comment elsewhere.

    The only thing I’d say is I’m not convinced that in GLAM csv is going to be the ‘go to’ format it might be in other areas. If you are a library, you’ll have a MARC export. If you are an archive, EAD etc.

Tony Hirst
2/18/2011

Couldn’t this also be recommended, though not required, as a nudge that might also lead to increased use of compression across a site? (Assuming that we think compression is a Good Thing?!;-)

Tony Hirst
2/18/2011

So HTTP POST would not be okay?

Tony Hirst
2/18/2011

There are lots of things that could be weakly classed as an “open license” that conflict with other “open licenses”. Should the statement be phrased in terms of “licensed with an open license that is compatible with X, such as Y, Z”?

Adrian Stevenson
2/18/2011

Must this be the case? It’s bound to put people off

Adrian Stevenson
2/18/2011

Re. HTTP GET, might it be possible to say something like “… HTTP GET requests, such as when included in standard web pages”, to give a sense of what this might mean? If the guidelines are only aimed at technical implementors then maybe this isn’t a problem, but otherwise it could be.

Antoine Isaac
2/17/2011

Like the others, I think the ‘separate records’ principle here is good per se, but that won’t make things easier with the data. Perhaps that should be acknowledged in the costs…

Antoine Isaac
2/17/2011

Btw it may be worth mentioning that ESE re-uses Dublin Core a lot.

Antoine Isaac
2/17/2011

There’s nothing at http://purl.org/rdtf/. Isn’t that sending a conflicting message with the “should use the following RDTF extension”?

Alan Danskin
2/16/2011

I don’t understand why provenance is thought to be such an issue for this option. The integrity of the metadata seems least at risk in its native format

Alan Danskin
2/16/2011

Should there be a recommendation to explain the scope of the metadata? Although MARC is an open format, there is provision for locally defined metadata. Suggest a recommendation that if locally defined metadata is made available, definitions are provided to save aggregator effort – see para 23.

Alan Danskin
2/16/2011

Does this mean that duplication of metadata descriptions is expected/acceptable?

Alan Danskin
2/16/2011

CSV could also be of interest to libraries. Not all systems have the capability to output MARC. A csv output may also be more accessible to potential users than MARC.

Alan Danskin
2/16/2011

A tabulated display might be clearer and would highlight those standards which are used across GLAM.

Alan Danskin
2/16/2011

While no longer supported by BL, a significant number of libraries in the UK and Ireland still use UKMARC.

Alan Danskin
2/16/2011

“Where possible all significant resources…should be described using separate records”. This is aspirational from a library perspective. In FRBR terms, most MARC records are composites of Work, Expression and Manifestation. Some records may aggregate a description of the original work with details of a surrogate manifestation. Legacy bibliographic records may describe multiple resources, for example manifestations issued with different bindings. It is unlikely that the modelling necessary to achieve a 1:1 relationship between records and resources could be justified in relation to the community format approach. It is an issue that we will give much more thought to in future as we develop our open data into open linked data.

Owen Stephens
2/15/2011

After some twitter discussion around this with Jane Stevenson I think there is a danger that the very basic nature of what is required here is obscured by other detail that may not apply.

My reading is:

The ‘Community Format’ approach would allow a (MARC/EAD/other sector specific format) dump that could be posted to a website. For those taking this approach the only other requirement would be for a sitemap file.

I think this might come as a relief to many (especially, but not only, small organisations).

It feels like some of the other issues (e.g. raised in 7 & 8) could be ignored by those following this approach.

If I’ve read this correctly it may be worth spelling out – make this entry point sound as simple as possible.

Owen Stephens
2/15/2011

What do you see as the purpose of the sitemap? Just wondering if we should be recommending use of any of the ‘optional’ aspects of the sitemap spec? Thinking specifically of ‘lastmod’ (and possibly ‘changefreq’)

Jane Stevenson
2/15/2011

I can see what Leonard means, and thought of this too. But ISAD(G) is not a file format – EAD complies with ISAD(G). You could add EAC-CPF though, as that’s the format that complies with ISAAR(CPF).

Jane Stevenson
2/15/2011

‘Collection’ for archives generally means one grouping of items created/accumulated by the same person/organisation, which may easily consist of several thousand items. I think the guidelines may have to be more explicit about what constitutes a ‘significant resource’ for archives. Don’t want to disappear into semantics too much, but I think it’s a fundamental difference with museums and libraries.

Gordon Dunsire
2/8/2011

Some libraries in the UK use UNIMARC; some are still using the obsolete UKMARC format. And MADS goes with MODS. To list or not to list … is a perennial problem, but standards anxiety may lead to “typical” being overlooked.

Gordon Dunsire
2/8/2011

Despite the “Where possible”, this might be off-putting to libraries with legacy records that embed a print original and digital surrogates in a single record, and confuse libraries adopting FRBR, where a “record” typically consists of linked records for Work, Expression, Manifestation and Item. Perhaps a specific note about legacy metadata records would clarify.

Herbert Van de Sompel
2/7/2011

In response to Ralph:

That would have to be Atom with several Atom extensions to achieve a harvesting framework with functionality similar to that of OAI-PMH. As a matter of fact, several Atom extensions that could make this possible have been proposed. And, with some colleagues, I have briefly investigated the possibility of defining a profile of Atom that provides PMH-like functionality. Given there is increased interest in that direction, an effort to this end may be launched some time soon.

    Ralph
    2/16/2011

    It’s true that complete OAI-PMH duplication would require significant enhancement of Atom feeds. But a simple feed of records can be accomplished with Atom as it is now.

    Scott Wilson
    2/17/2011

    You’re assuming we need the extra functionality of PMH over web feeds. Largely, nobody does any more, which is why it isn’t used outside its original community, and why use is shrinking rather than growing.

Richard Light
2/7/2011

I wouldn’t call OAI-PMH an “encoding”: it’s a delivery mechanism.

    John Robertson
    2/18/2011

    Agreed, how about ‘made available’ ?

Julie Allinson
2/6/2011

… oh, and might be worth mentioning VRA, widely used for art collections.

Julie Allinson
2/6/2011

Not sure it’s entirely useful to divide into three paragraphs: some formats cross over and there is some duplication here. Maybe better to have a list of examples?

Stefan Gradmann
2/6/2011

In my view the provenance context is easiest to preserve in this approach: one statement per record is sufficient. This gets really tricky with aggregations of RDF triples.

Andy Powell
2/4/2011

Owen, Re: identifier for the record – this is a simple typo – should be identifier for the resource – will correct in the text to prevent further confusion. Thanks.

Owen Stephens
2/4/2011

Perhaps the main guidance that needs to be given to those providing CSV or other generic formats is to have accompanying documentation that clearly describes the structure and nature of the data. Interpreting formats like JSON and CSV without this will be impossible.

Owen Stephens
2/4/2011

It feels wrong to be doing something to deliberately hide stuff from Google – isn’t this up to Google?

Owen Stephens
2/4/2011

Is this mixing two types of mechanism?

Leonard Will
2/4/2011

Should you add ISAD(G), ISAAR(CPF) and other standards from the International Council on Archives? These do not give formal file formats, but they do list the elements and structure that should be included.

Ralph LeVan
2/3/2011

I’d prefer an RSS or Atom feed to OAI-PMH. They certainly have broader acceptance outside the library community.

Günter Waibel
2/3/2011

You’d likely want to include LIDO in this list. http://bit.ly/gAkco0 – it’s the backbone of Athena http://www.athenaeurope.org/.

Andy Powell
2/3/2011

Question from the authors: The use of an RDTF extension here is to prevent accidental harvesting of, say, large files of MARC records by Google’s crawlers. Does this make sense?

Andy Powell
2/3/2011

Question from the authors: This is currently the only reference to JSON, Atom and RSS. Are JSON and Atom sufficiently important that they should get a mention? If so, how?

    Scott Wilson
    2/17/2011

    Wonder if it’s worth mooting as a default format – you can expose MARC with OAI-PMH for the six people in the world who give a toss, but you should also provide RSS with basic details for wider impact.

Andy Powell
2/3/2011

Question from the authors: Is the guidance to include a column called ‘identifier’ sufficient? Should we also (or instead) ask for a column called ‘uri’?

    Owen Stephens
    2/4/2011

    Identifiers such as ISBN are useful – wouldn’t want to see these excluded. However, I do think the opportunity to ask for a URI if one is available is too good to miss.

    My own preference is that ‘community formats with URIs’ is an additional option alongside ‘community format’ (in the same way you have RDF vs Linked Data)

    Paul Walk
    2/18/2011

    I hope this comment is appearing in the right place – I have deliberately used the ‘reply’ function despite being warned not to…

    I agree with Owen – I would like to see a ‘community formats with unique URLs as identifiers option’. In fact this would be my preferred path frankly. And yes, I think I do mean ‘URL’…. ;-)

Andy Powell
2/3/2011

Question from the authors: Is the “provide a ‘label’ or ‘title’ and ‘identifier’ column” stuff useful? What about the other CSV guidance? Should we say anything else?

    Owen Stephens
    2/4/2011

    It doesn’t seem unreasonable – but what is it for? Just as an example, if you had a MARC record for a letter it might not have a ‘title’ (in a 2XX field), although it would have something in the record somewhere that could be used as a substitute for a title in an index or human readable display.

    To think about CSV specifically if we get a CSV file of records describing letters then it would be pointless if the ‘label’ or ‘title’ column contained just the word ‘Correspondence’ – although within these guidelines.

    What I’m getting at is the perceived purpose of label/title field. Why not just say ‘should include at least one field which clearly describes the item in human readable format’ or some such?

    Owen Stephens
    2/4/2011

    You say ‘any identifier for the record’ and then ‘e.g. and ISBN’. Risking getting into semantics… but of course an ISBN isn’t an identifier for the record, it’s an identifier for something else (probably something close to the manifestation in FRBR terms)

    So – is the identifier expected here to identify the resource described, the record describing the resource?

    Owen Stephens
    2/4/2011

    JSON sufficiently important to section of the developer community I think

    Jane Stevenson
    2/15/2011

    I agree with Owen that a clear description is what is needed. I’m not sure how this will work if you have an archive that includes a series of letters – they may have the title ‘letters’ because the collection has the full descriptive title. You’d need both.

Andy Powell
2/3/2011

Question from the authors: Have we got the list of example community formats right? Are things missing?


The community formats approach

(1)

Guidance

(2)

RDTF metadata that is exposed using the community formats approach must be made available under an open licence, using a non-proprietary file format (such as one of those listed in the examples section below).

(3)

The metadata must be made available using simple HTTP GET requests or the OAI-PMH [11].

(4)

Where HTTP is used, one or more sitemaps (conforming to the Sitemap protocol [12]) should also be made available, listing the available files. The sitemaps should be listed in a robots.txt file. Sitemaps should use the following RDTF extension to differentiate RDTF files from other content:

(5)

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf"> <!-- namespace extension -->
  <url>
    <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
    ...
  </url>
</urlset>
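
As an illustration of how an aggregator might consume such a sitemap, the following is a minimal sketch in Python that collects the RDTF file locations. The function name and the embedded sample document are hypothetical; the namespace URIs are those shown in the example above, and the sketch assumes the RDTF entries appear as rdtf:loc elements inside url elements.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RDTF_NS = "http://purl.org/rdtf"  # namespace URI taken from the sitemap example


def rdtf_locations(sitemap_xml):
    """Return the text of every rdtf:loc element found inside a sitemap <url>.

    Hypothetical helper: accepts the sitemap document as bytes or str.
    """
    root = ET.fromstring(sitemap_xml)
    locations = []
    for url in root.iter("{%s}url" % SITEMAP_NS):
        for loc in url.findall("{%s}loc" % RDTF_NS):
            if loc.text and loc.text.strip():
                locations.append(loc.text.strip())
    return locations


# Sample document modelled on the example above (illustrative only).
EXAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rdtf="http://purl.org/rdtf">
  <url>
    <rdtf:loc>http://example.org/rdtf/catalogrecords.marc</rdtf:loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for location in rdtf_locations(EXAMPLE):
        print(location)
```

Because only elements in the RDTF namespace are collected, ordinary sitemap entries for web pages are ignored, which is the differentiation the extension is intended to provide.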

(6)

Where HTTP is used, GZip compression [13] may be used to reduce file sizes.
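
A minimal sketch of what this might look like in practice, using Python's standard gzip module. The function names are hypothetical, and the check for the GZip magic bytes is simply one way a consumer could cope with providers that do not compress their files.

```python
import gzip


def compress_for_download(data):
    """Provider side: GZip-compress a metadata file before publishing it."""
    return gzip.compress(data)


def read_downloaded(payload):
    """Consumer side: decompress if the payload looks like GZip, else pass through.

    GZip streams begin with the two magic bytes 0x1f 0x8b.
    """
    if payload[:2] == b"\x1f\x8b":
        return gzip.decompress(payload)
    return payload


if __name__ == "__main__":
    # Illustrative data only: a repetitive CSV dump compresses well.
    original = b"identifier,title\n" + b"urn:example:1,An example record\n" * 200
    compressed = compress_for_download(original)
    print(len(original), "bytes before,", len(compressed), "bytes after")
    print(read_downloaded(compressed) == original)
```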

(7)

Where possible, all significant resources associated with the collection of interest should be described using separate records. For example, there should be separate records describing a physical museum artifact and any digital surrogates of that artifact (e.g. images). For the purposes of these guidelines, a significant resource is one that is likely to be of interest to end-users, differs from other resources in terms of format or other attributes, and may have different ownership and/or usage restrictions to other resources. Note that ‘resources’ (as used here) may include conceptual entities (e.g. a FRBR ‘work’), people and organisations as well as both physical and digital objects. Where metadata is encoded using CSV [14] (or similar), a record corresponds to a row in the table. Where metadata is encoded using the OAI-PMH, a record corresponds to an OAI-PMH record.

(8)

All metadata records should contain an attribute/field/property that can be used as a label or title for the resource. Where metadata is encoded using CSV (or similar), this label should appear in a column called ‘label’ or ‘title’, and any identifier for the resource (e.g. an ISBN) should appear in a column called ‘identifier’. In addition, for any metadata encoded using CSV, the first row should contain the column headings, there should be no use of footnotes and all rows should be of the same length.
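
The CSV guidance above could be checked mechanically before a file is published. The following Python sketch is one possible approach, not part of the guidelines: the function name is hypothetical, column headings are compared case-insensitively (an assumption, since the guidelines do not say), and a missing ‘identifier’ column is reported only as something to review, because the guidance applies only where an identifier exists. The absence of footnotes cannot be checked reliably and is left out.

```python
import csv
import io


def check_rdtf_csv(text):
    """Return a list of problems found in a CSV document (empty list if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    # The first row is expected to hold the column headings.
    header = [name.strip().lower() for name in rows[0]]
    if "label" not in header and "title" not in header:
        problems.append("no 'label' or 'title' column")
    if "identifier" not in header:
        problems.append("no 'identifier' column (fine only if no identifiers exist)")

    # Every row should have the same number of fields as the heading row.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                "row %d has %d fields, expected %d" % (number, len(row), len(header))
            )
    return problems


if __name__ == "__main__":
    good = "identifier,title\nurn:example:1,Letter to the editor\n"
    bad = "recID,name\n1,Letter,extra\n"
    print(check_rdtf_csv(good))
    print(check_rdtf_csv(bad))
```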

(9)

Metadata about the resources associated with multiple collections may be made available. In general, where HTTP is used, there should be one file per collection; where the OAI-PMH is used, collections should be partitioned into separate repositories or separate sets within a single repository. Note that ‘collection’ (as used here) simply means any grouping of resources for curatorial, discovery or some other purpose.
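
Where collections are partitioned into OAI-PMH sets, a harvester can request one collection at a time. The sketch below builds such a request in Python; the base URL and set name are hypothetical, while the ListRecords verb, the metadataPrefix and set arguments, and the oai_dc prefix come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Restrict the harvest to a single set, i.e. one collection.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository and set name, for illustration only.
    print(list_records_url("http://example.org/oai", set_spec="maps"))
```

The resulting URL is fetched with an ordinary HTTP GET; large result sets are returned in batches using the protocol's resumptionToken mechanism, which this sketch does not cover.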

(10)

Examples

(11)

For libraries, typical file format examples include library catalogue records encoded using MARC21 [15] or MODS [16], BibTeX [17], RIS [18], the CrossRef output schema [19], Dublin Core records encoded using XML [20], the Europeana Semantic Elements (ESE) format, and formats based on JSON [21], Atom [22] or RSS [23].

(12)

For museums, typical file format examples include museum catalogue records encoded using SPECTRUM [24], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.

(13)

For archives, typical file format examples include archival descriptions encoded using EAD [25], database tables dumped as CSV files, Dublin Core records encoded using XML, the ESE format and formats based on Atom or RSS.

(14)

Benefits

(15)

As an end-user:

(16)
  • there are likely to be many existing tools (both proprietary and open source) available that will display (and possibly modify) the metadata.

(17)

As a provider:

(18)
  • it is simple to make the metadata available.

(19)

Costs/issues

(20)

As an end-user:

(21)
  • the context of where metadata originated is more likely to be lost, leading to issues around data provenance, trust, etc.;
  • the semantic relationships between the attributes used to describe resources (particularly across multiple collections) may be unclear;
  • citing resources of interest is likely to be problematic.

(22)

As an aggregator:

(23)
  • software tools for processing the metadata will probably need to be both format- and provider-specific, with many ad hoc heuristic techniques being adopted (including the need to cope with different interpretations of the same format by different providers);
  • the metadata is unlikely to carry explicit links to other metadata/content, making web-effects unlikely.