About smart data & microrepositories
What are
Todo: what are micro repositories (and are they smart data?). So basically the architectural overview, I guess. A pointer to http://www.interedition.eu/wiki/index.php/Interaction_model#Interactive_Work_Flows too, I guess.
Rationale
Why do we need / want / do this? Basically: sustainability and usability
- Institutions and data archives will tumble, the internet will not
- Top-down data format standardisation does not work (it is neither facilitating nor flexible)
- We need a distributed, simple way of sharing, versioning and sustaining data
- We need our services to be able to tap into data
- We need the solution to proliferate something like 'linked data'
- We need the sharing solution to be as simple as mail, so that it may become a killer app
- We need a solution that has zero install, possibly zero client footprint, and is lightweight and implementation agnostic
Trust
Why will/should scholars want to get data off their hard drives into this system?
How will scholars know that the ‘right’ thing will happen when they share the data - that Peter can edit it, but Paul can only comment and Philip should just be able to read it?
- Rights management, and it needs to be damn usable and intuitive (column style, I guess)
- Basically it sounds like we’d need a revision control frontend (some scholars are control freaks and will want to review/approve *any* changes), but a nice, usable one. Not, however, for annotations/comments: you don't own annotations. Border case: the annotation marker/span.
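The Peter/Paul/Philip example could be sketched as a simple ordered role check. This is a minimal illustration only: the `Role` levels, the `ACCESS` table and `is_allowed` are hypothetical names, and the assumption that a higher role includes the lower ones is ours, not a decision recorded on this page.

```python
from enum import IntEnum
from typing import Dict


class Role(IntEnum):
    """Hypothetical access levels; a higher value is assumed to include the lower ones."""
    READ = 1
    COMMENT = 2
    EDIT = 3


# Hypothetical access list for one micro repository, using the names from the text.
ACCESS: Dict[str, Role] = {
    "Peter": Role.EDIT,
    "Paul": Role.COMMENT,
    "Philip": Role.READ,
}


def is_allowed(user: str, needed: Role, access: Dict[str, Role] = ACCESS) -> bool:
    """Return True if `user` holds at least the `needed` role; unknown users get nothing."""
    granted = access.get(user)
    return granted is not None and granted >= needed
```

A review/approve step for edits would sit on top of a check like this; it is left out here because the page has not settled how it should work.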
Annotations
... Apture-like (www.apture.com), because annotations need to live outside the data owner's data/data model. They:
- should not be under the direct control of the data creator
- but should be layered, not mixed into the data itself directly.
- Some scholars won’t want to see all these pesky annotations when they look at the data. Other scholars will want annotations peer reviewed or some such. What to do about that?
...so others’ annotations are not shown by default:
- in the view of the original author/researcher
- in the view of anonymous
- but are shown in the annotator's view
So: how does the Open Annotation Collaboration project cope with this / what is their take on this?
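The default visibility rule listed above (others' annotations hidden from the original author and from anonymous visitors, shown to the annotator) could be sketched as a filter over a separate annotation layer. The `Annotation` fields and the `default_visible` function are hypothetical; peer review or opt-in display of others' annotations is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """One entry in a layer kept outside the annotated data itself."""
    author: str  # who wrote the annotation
    start: int   # start offset of the annotated span (the marker)
    end: int     # end offset of the annotated span
    body: str    # the annotation text


def default_visible(layer: List[Annotation], viewer: Optional[str]) -> List[Annotation]:
    """Return the annotations a viewer sees by default.

    `viewer` is None for an anonymous visitor, who sees no annotations.
    The data owner gets no special treatment: only a viewer's own
    annotations are shown by default.
    """
    if viewer is None:
        return []
    return [a for a in layer if a.author == viewer]
```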
Data Formats or not?
Data formats of this smart data: you were claiming, in essence, that we need no data format, I thought, assuming everything would be expressed in JSON. That doesn't make sense. Another nice thing that doesn't make sense: smart data is data that has an API, which sets it apart from 'just' data. Smart data is data that is processable, that has operations on it, that is a service in itself. To what use though?
...which makes a little more sense if we take your model of dedicated microservices for serving up this data (see http://www.interedition.eu/wiki/index.php/Interaction_model#Interactive_Work_Flows), each with a small API such as:
- getVersion( version )
- portTo( format )
- submitRevision( my_changes )
Two views of data: one, data being pushed through services as a stream; the other, data being served from a micro-repository with a small API like the one above.
A microrepository's getVersion( version ) function can be used to obtain a certain stream; combined with streams from other microrepositories, it forms the input for a certain micro services workflow. That results in *new* data, which can be stored/published as a new micro repository.
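To make the three operations concrete, here is a minimal in-memory sketch. Everything beyond the three operation names is an assumption made for illustration: the `MicroRepository` class, 1-based version numbers, treating a revision as a full replacement of the content, and the two output formats "text" and "json".

```python
import json
from typing import List, Optional


class MicroRepository:
    """In-memory sketch; versions are numbered from 1 in submission order."""

    def __init__(self, initial: str) -> None:
        self._versions: List[str] = [initial]

    def get_version(self, version: int) -> str:
        """Return the content stored under the given 1-based version number."""
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"no such version: {version}")
        return self._versions[version - 1]

    def submit_revision(self, my_changes: str) -> int:
        """Store `my_changes` as a new version and return its number.

        Assumption: a revision is the complete new content, not a diff.
        """
        self._versions.append(my_changes)
        return len(self._versions)

    def port_to(self, fmt: str, version: Optional[int] = None) -> str:
        """Serialize a version (the latest by default) in the requested format."""
        number = len(self._versions) if version is None else version
        content = self.get_version(number)
        if fmt == "text":
            return content
        if fmt == "json":
            return json.dumps({"version": number, "content": content})
        raise ValueError(f"unsupported format: {fmt}")


# Streams from two repositories feed a (trivial) workflow; the result is
# published as a new micro repository.
witness_a = MicroRepository("text of witness A")
witness_b = MicroRepository("text of witness B")
combined = MicroRepository(witness_a.get_version(1) + "\n" + witness_b.get_version(1))
```

In a real deployment these methods would sit behind a web API rather than a Python class; the sketch only shows how small the surface could be.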
Cloud based
We want the hosting of micro repositories to be neither institutionalized nor centralized. As with micro services, distribution and redundancy through publishing them in the cloud applies. This is important because, at the moment, most scholars don’t have much of a way to publish things online. They can copy ‘flat data’ things to their university web space, if they know how, and that is pretty much it. Or else they can use a free cloud service like Google or Heroku, but that comes with some severe resource limitations and takes some serious technical chops.
What we do need is an academic cloud that is agnostic to who's hosting what and where, but has a shared business model for billing: (total cloud load / the number of institutions) * FTE in a particular institution.
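Read literally, the billing formula works out as follows. The function name, the parameter names and the numbers in the example are hypothetical, and the unit of "cloud load" is left open, as it is in the text.

```python
def institution_share(total_cloud_load: float, institutions: int, fte: float) -> float:
    """Apply the formula literally: (total cloud load / number of institutions) * FTE."""
    if institutions <= 0:
        raise ValueError("institutions must be positive")
    return total_cloud_load / institutions * fte


# Hypothetical numbers: a total load of 1000 units, 4 institutions, 2 FTE here.
share = institution_share(1000.0, 4, 2.0)  # (1000 / 4) * 2 = 500.0
```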
Security
Security and persistent ID/DOI schemes are conceivable for this too.
- Authentication: is this person who s/he says s/he is? It's not *solved*, but it's being solved around us by (bigger) projects: Eduroam, Shibboleth, etc.
- Authorization: does this person have the right to do what s/he is trying to do? Probably part of the data on the authentication token.
- Defense against other attacks (what other attacks?)
- stream sanitization (finding embedded blocks of code, too large a stream, etc.)
- encryption (SSL) if necessary
- making the ‘services’ bit of the micro-repository as small as possible to protect against coding vulnerabilities.
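As one small piece of stream sanitization, the "too large a stream" case could be handled by refusing input as soon as it passes a size limit, rather than after reading all of it. This is only a sketch: `read_limited`, `StreamTooLarge` and the `MAX_BYTES` value are hypothetical, and detecting embedded code is not covered.

```python
import io
from typing import BinaryIO

MAX_BYTES = 1_000_000  # hypothetical limit; a real service would choose its own


class StreamTooLarge(Exception):
    """Raised when an incoming stream exceeds the accepted size."""


def read_limited(stream: BinaryIO, max_bytes: int = MAX_BYTES, chunk_size: int = 8192) -> bytes:
    """Read the whole stream in chunks, stopping as soon as it exceeds `max_bytes`."""
    received = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached within the limit.
            return bytes(received)
        received.extend(chunk)
        if len(received) > max_bytes:
            raise StreamTooLarge(f"stream exceeds {max_bytes} bytes")


# Usage with an in-memory stream standing in for an uploaded revision.
data = read_limited(io.BytesIO(b"a small revision"), max_bytes=100)
```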