Data models - OKP OSZK

Data models and data sharing
(instead of data exchange)

The data model must be flexible, and its choice must rest entirely with each participating system. There should be no restriction on how the data are described or on how the entities are connected. Only this approach can ensure that the exchange and maintenance of data are not confined to the library domain and can be extended to any other domain with its own free and individual data formats. The link to cultural institutions such as museums and archives is obvious, but cooperation must also be possible with institutions that maintain geographic name spaces, and with the domains of the military, education, transport, production, design, etc. The data sources should not be limited to institutions, as many domain experts and many scientists are not embedded in institutional hierarchies. They should be able to enrich the platform data; the doors for ingesting digital data and catalogue information must be open to the widest possible range of participants. The legacies of writers, with many electronic manuscripts and unpublished materials, must be able to find their way into safeguarded and archived library platforms. Citizen science and crowdsourcing must be enabled. The data model should allow for any type of data format and must be definable in a flexible way, without programmers’ intervention.

We must aim to have entities described by the experts who have the deepest and most reliable knowledge about them. Such entities can be of any kind; those most commonly used in a library context are persons, institutions (libraries, publishing houses), geographic name spaces and chronological dates, the most important ones being works, instances and agents. Each entity should be described only once and linked by connections to other entities. Enrichment should happen strictly by adding information, with each addition carrying precise information about its source and, above all, reflecting the trustworthiness of that source. This approach gives each entity and its connections a quality level, which is a clear indication of trustworthiness. It allows every participant to attach information freely and with immediate effect, while consumers and users can choose the range of trustworthiness they accept as they work with and use the selected data. In our project we will additionally introduce a module called Loca Credibilia (Trusted Source / Place), which identifies the source and originator of a digital object, making it possible to trace back the route of a given digital object. All entries can have many versions and can have competing data entries, but even at the level of the individual field the information about the source, and with it the trustworthiness, is provided and stored.
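
The idea of append-only enrichment with source and trustworthiness recorded per field can be illustrated with a minimal Python sketch. It is not the platform's actual data model: the class names (Statement, Entity), the numeric trust scale from 0 to 100, and the source identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """One value for one field, together with where it came from."""
    field_name: str
    value: str
    source: str  # who supplied the value (hypothetical identifier)
    trust: int   # assumed scale: 0 = unverified ... 100 = authoritative


@dataclass
class Entity:
    """An entity described once; enrichment only ever appends statements."""
    entity_id: str
    statements: list[Statement] = field(default_factory=list)

    def add(self, field_name: str, value: str, source: str, trust: int) -> None:
        # Nothing is overwritten: competing values coexist as separate statements.
        self.statements.append(Statement(field_name, value, source, trust))

    def view(self, min_trust: int) -> dict[str, list[str]]:
        """Return the field values whose source meets the consumer's threshold."""
        result: dict[str, list[str]] = {}
        for s in self.statements:
            if s.trust >= min_trust:
                result.setdefault(s.field_name, []).append(s.value)
        return result


# Invented example data: two competing birth years from different sources.
person = Entity("person:example-1")
person.add("birth_year", "1823", source="national-library", trust=90)
person.add("birth_year", "1822", source="volunteer-42", trust=30)
person.add("occupation", "poet", source="volunteer-42", trust=30)

strict_view = person.view(min_trust=80)  # only highly trusted statements
open_view = person.view(min_trust=0)     # everything, including competing values
```

In this sketch every contribution takes effect immediately, and the choice of how much to trust is left to the consumer at read time, which mirrors the separation described above.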

With the growing amount of data produced and waiting to be processed, it would be naïve to assume that the problem could be solved by increasing the number of staff and hiring new cataloguers. The involvement of a broader expert community is a must, but we should not ignore the option of machine-supported processing. The machine can carry out processes on its own, without manual human intervention, and there are tools and means by which participants can teach the machine workflow and significantly increase the accuracy of the computer’s processing. Many processes that improve the quality of digital materials are already making colleagues’ lives easier, and this capability can easily be introduced in cataloguing web-harvested materials as well, or in recognising faces, objects, etc. when identifying and cataloguing still and moving images. The machine can give huge support in improving data quality.

Sharing the work offers a great opportunity to overcome the staff shortage (and often the shortage of domain-specific expertise). It enhances the quality of data and can result in unprecedented productivity and richness of data. The model of cooperation must be open and transparent for every possible participant. The willingness to shape the world and its information systems is there, as young users are used to doing so, and it is time for libraries to make use of this potential. An additional benefit is that users who contribute to the richness of the library will very likely use the materials provided by the system and will come closer to the library domain. The separation between the open world and the closedness/secretiveness of libraries can disappear when library treasures are integrated into a broad collaborative system. Bridges to Wikidata, Wikimedia, OpenStreetMap and the like can be built and used more extensively, and this will connect further users to the library domain.

The most significant transition in the data concept is that, with the introduction of linked data, we rely on the data being where it should be. We do not duplicate the data: we may cache some data temporarily for speedy access, but we do not store them for long and we do not keep unnecessary duplicates. This makes data exchange unnecessary, so the significance of exchange formats will fade with time. This is the direction we are heading in, but we are all aware that there is an ecosystem of legacy data from the past (and present), which we will have to feed for around a decade from now. This makes complex interfacing necessary between the linked universe and the existing systems based on data duplication.
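
The principle of resolving data at its source and caching it only briefly can be sketched as follows. This is an illustrative assumption, not a description of the platform's implementation: the LinkedResolver class, the 60-second lifetime, the example.org URI and the stand-in fetch function are all hypothetical, and a real resolver would retrieve the record from the system that owns it.

```python
import time
from typing import Callable


class LinkedResolver:
    """Resolve entity links at their source, keeping only a short-lived cache."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # retrieves the record from its owner
        self._ttl = ttl_seconds  # how long a cached copy may be reused
        self._clock = clock      # injectable so expiry can be demonstrated
        self._cache: dict[str, tuple[float, dict]] = {}

    def resolve(self, uri: str) -> dict:
        now = self._clock()
        cached = self._cache.get(uri)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: speedy access without refetching
        data = self._fetch(uri)  # expired or unknown: go back to the source
        self._cache[uri] = (now, data)
        return data


# Stand-in for a remote system; it records each request it receives.
calls: list[str] = []


def fake_fetch(uri: str) -> dict:
    calls.append(uri)
    return {"uri": uri, "label": "Example place"}


fake_now = [0.0]  # manually advanced clock for the demonstration
resolver = LinkedResolver(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])

first = resolver.resolve("https://example.org/place/1")   # fetched from source
second = resolver.resolve("https://example.org/place/1")  # served from cache
fake_now[0] = 120.0                                        # cache entry expires
third = resolver.resolve("https://example.org/place/1")   # fetched again
```

The cached copy is never treated as a stored duplicate: once its lifetime has passed, the resolver returns to the owning system, so corrections made there reach every consumer.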
