Why can’t we just use Google?

Chief Executive Kevin Gosling addresses a key question at the heart of the Collections Trust’s recent feasibility study for DCMS on mapping digitised collections.

In my last blog, about our report for DCMS on Mapping digitised collections in England, I set out some of the strategic reasons why we might want to search across all museum collections. Here I want to look at what an aggregator is and why UK museums need one.

The brief for the DCMS study was to develop and evaluate a practical framework for collecting relevant data in order to map cultural collections and, given the existing state of technology, to consider what functionalities a tool based on this framework might possess.

The brief also talked about ‘a searchable database of cultural content’, which implied something centralised, even if its data came from lots of different cultural heritage institutions.

The framework we and our collaborators, Knowledge Integration, proposed was based upon an aggregator – a tool that, in short, would do three things:

  • Bring together data from a wide range of institutions, in whatever form it was supplied;
  • Use a flexible selection of plug-in tools and services to process, clean and enhance that data, making clear what it had done and keeping any changes separate from the original data; and
  • Make the data available in various ways for uses that were limited only by any licensing restrictions that contributing institutions might specify.

We decided on the aggregator model, but is it the right one? Why can’t we just search all the online databases of individual institutions simultaneously in real time? In fact, why can’t we just use Google? Surely that would be a lot easier?

Simultaneous searching

Online tools such as flight comparison websites do indeed search many different databases simultaneously in real time, a process called federated or broadcast searching.

In the cultural sector, from the 1990s onwards, libraries successfully shared bibliographic data through a number of ‘virtual union catalogues’ that used the federated searching model. These simultaneously searched the online public access catalogues of many different library services in real time and delivered the results to the user as a single ‘hit list’.

The libraries’ federated approach ensured that the search results were as up to date as possible and reduced the need for centralised data storage. However, the user experience could be poor, as the search speed was only as fast as the slowest response and potentially relevant results could be missed if an individual catalogue was offline for any reason.
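
For the more technically minded, the sketch below shows roughly how a federated search behaves. It is purely illustrative: the catalogue endpoints and the JSON response format are hypothetical, not any real library system's API.

```python
# A sketch of federated ("broadcast") searching, assuming each catalogue
# exposes a simple JSON search endpoint. The endpoint URLs and response
# structure are hypothetical, purely to illustrate the approach.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

CATALOGUE_ENDPOINTS = [
    "https://library-a.example.org/opac/search",
    "https://library-b.example.org/opac/search",
    "https://library-c.example.org/opac/search",
]


def search_one(endpoint: str, query: str, timeout: float = 10.0) -> list[dict]:
    """Query a single catalogue and return its records (or raise on failure)."""
    response = requests.get(endpoint, params={"q": query}, timeout=timeout)
    response.raise_for_status()
    return response.json().get("records", [])


def federated_search(query: str) -> list[dict]:
    """Fan the query out to every catalogue at once and merge the results."""
    hits: list[dict] = []
    with ThreadPoolExecutor(max_workers=len(CATALOGUE_ENDPOINTS)) as pool:
        futures = [pool.submit(search_one, url, query) for url in CATALOGUE_ENDPOINTS]
        for future in as_completed(futures):
            try:
                hits.extend(future.result())
            except requests.RequestException:
                # An offline or failing catalogue simply drops out of the
                # hit list, so its potentially relevant results are missed.
                pass
    return hits
```

Even in this toy version, the merged hit list cannot be returned until the slowest catalogue has answered or timed out, and any catalogue that fails simply vanishes from the results.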

Moreover, the federated approach demands a high level of consistency between the data from different institutions. In a simultaneous search there is, with current technology, no time to analyse and tweak messy data. This is less of a problem with simple bibliographic records that follow rigorous standards, but it would be a challenge with more complex and variable data from a wider range of cultural heritage collections.

Even assuming all 1,700 Accredited museums managed to get their collections online under their own steam – and kept the information up to date – the variability of the data would simply be too great for the federated approach to be viable.

The aggregation model

The technical term for collecting relevant data into a searchable database is aggregation and the system that does it is an aggregator. By themselves, these are fairly neutral terms and do not imply any specific solution beyond some kind of centralised database that is pre-loaded with ‘cached’ information, gathered one way or another from other data sources.

Note that not all the original source data actually needs to be cached. What’s usually required is enough information for indexing purposes plus a link back to the original data or digital assets such as image files, which would take up too much storage space if copied into the aggregator’s own database.
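
To make that concrete, a single cached record in an aggregator of this kind might hold little more than the following. This is a hypothetical sketch, with field names of my own choosing rather than any proposed standard.

```python
# A hypothetical cached index record: just enough text to search on, plus
# links back to the contributing institution's own record and media files.
from dataclasses import dataclass, field


@dataclass
class CachedRecord:
    title: str
    description: str                 # free text used for indexing
    institution: str                 # who supplied the data
    source_url: str                  # link back to the original record
    image_urls: list[str] = field(default_factory=list)  # assets stay at source
    licence: str = "unspecified"     # any re-use terms set by the contributor


example = CachedRecord(
    title="Watercolour sketch",
    description="A watercolour attributed to J M W Turner.",
    institution="Example Museums Service",
    source_url="https://collections.example.org/objects/1234",
    image_urls=["https://collections.example.org/media/1234.jpg"],
    licence="CC BY-NC",
)
```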

Aggregators such as Google and other search engines don’t whizz round the entire World Wide Web in the milliseconds after you hit the search button. Rather, they refer to the massive databases they made earlier, which are updated regularly by the automated process of ‘crawling’ the Web. Having information to hand in this way speeds things up for the user and means that potentially relevant content is less likely to be missed due to a website being temporarily offline.

Aggregators currently gather their cached data in one or more of the following ways:

  • By crawling web pages using bots – this is a crude, free-text approach, although it can be refined if the web pages have machine-readable annotations, such as ‘embedded microdata’, that help the bot interpret the content.
  • By harvesting data exported from the source and imported into the aggregator using a defined standard template known as a protocol – this can be automated or done manually using spreadsheets (a rough sketch of one such protocol in action follows this list).
  • By using Application Programming Interfaces (APIs), which are tools that either proactively ‘push’ data from the original source to the aggregator or allow the aggregator to ‘pull’ data from the original source – this is not quite as straightforward as harvesting, because a certain amount of configuration is needed to connect the aggregator to the specific API of a data source.
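
In the cultural heritage world, a widely used harvesting protocol is OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. A harvest might look something like the rough sketch below; the repository address is made up, and a real harvest would also follow the protocol’s resumptionToken paging to collect every record.

```python
# A sketch of harvesting simple Dublin Core records over OAI-PMH.
# The repository URL is hypothetical.
import xml.etree.ElementTree as ET

import requests

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def harvest(repository_url: str):
    """Fetch one page of records from an OAI-PMH endpoint and yield (id, title)."""
    response = requests.get(
        repository_url,
        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
        timeout=30,
    )
    response.raise_for_status()
    tree = ET.fromstring(response.content)
    for record in tree.iterfind(".//oai:record", NAMESPACES):
        identifier = record.findtext(".//oai:identifier", default="", namespaces=NAMESPACES)
        title = record.findtext(".//dc:title", default="(untitled)", namespaces=NAMESPACES)
        yield identifier, title


# e.g. for identifier, title in harvest("https://collections.example.org/oai"): ...
```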

There are also other data-sharing and data-gathering methods used by cultural heritage institutions and aggregators. These include publishing information about collections as linked data (or, when published with an ‘open’ licence for re-use, linked open data).

In linked data, complex information, for example a catalogue record about a Turner watercolour, is broken down into a series of semantic statements, but instead of text – JMW Turner – to denote the painter, an identifier, such as http://vocab.getty.edu/ulan/500026846, is used to make a link to authoritative information about him published somewhere else (in this case, the Union List of Artist Names).
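
For a flavour of what this looks like in practice, here is a toy version of that statement built with the rdflib library. The object record’s URI is made up; the creator URI is the real ULAN identifier quoted above.

```python
# A toy linked-data statement, "this watercolour was created by J M W Turner",
# expressed with identifiers rather than free text. The object URI is
# hypothetical; the creator URI is the Getty ULAN identifier for Turner.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

graph = Graph()
watercolour = URIRef("https://collections.example.org/objects/1234")
turner = URIRef("http://vocab.getty.edu/ulan/500026846")

graph.add((watercolour, DCTERMS.creator, turner))

print(graph.serialize(format="turtle"))
# (abridged output)
# <https://collections.example.org/objects/1234>
#     dcterms:creator <http://vocab.getty.edu/ulan/500026846> .
```

Because both ends of the statement are stable identifiers, anyone else’s system can connect this record to everything else published about Turner.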

If this sounds complicated, that’s because it is, and there are further complexities, too, which put this approach beyond the reach of all but the largest and most technically sophisticated institutions.

Google’s an aggregator – let’s just use that

As our DCMS study demonstrated, there are practical limitations to using Google to find all – and only – the cultural heritage items that might be relevant to a search.

Imagine you’re a curator looking for potential loans for a forthcoming exhibition about Charles Darwin’s life and work. As I’ve explained, Google (and other search engines like it) is a general-purpose tool that treats most web content as a stream of free text. It therefore misses out on the potential benefits of structured metadata (data about data) that could distinguish between, say, records of things created by Charles Darwin, things collected by him and things about him.

Emerging developments such as embedded microdata might eventually go some way towards improving this situation, but somebody, or some automated tool, will still need to create and add meaningful annotations to each relevant web page.
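
To give a sense of what those annotations involve, the hypothetical snippet below builds a schema.org description ready to be embedded in an object’s web page. It uses JSON-LD, a close relative of embedded microdata that serves the same purpose; the distinction between ‘creator’ and ‘about’ is exactly the kind of structure a free-text crawl cannot see.

```python
# A hypothetical schema.org annotation for a single object page, built as
# JSON-LD. Embedded in the page (inside a <script type="application/ld+json">
# tag), it tells a crawler that Darwin is the creator of this item rather
# than just a name that happens to appear in the text.
import json

annotation = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": "Manuscript notebook",
    "creator": {
        "@type": "Person",
        "name": "Charles Darwin",
    },
    # A record about Darwin, or of something he collected, would use
    # different properties (such as "about") instead of "creator".
}

print(json.dumps(annotation, indent=2))
```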

Google’s custom search engine does allow developers to provide a search interface that is limited to a specified website or group of sites. However, the main disadvantage of this approach, particularly for a framework intended to be an impartial resource on the nation’s digitised cultural heritage, is that the ‘relevance ranking’ of a web page is determined by Google’s secret algorithms which, among other things, seek to boost Google’s advertising revenue.

What’s more, the ‘just use Google’ approach has the same major drawback as the federated searching model. For example, in order for their collections to show up in search results, every single one of the country’s 1,700 Accredited museums would have to have a ‘crawl-able’ online collection as part of its own website. This is usually the complicated and expensive part of developing a new site and, judging from the research carried out for this study, it’s currently beyond the means of many cultural heritage institutions, even larger local authority services.

For all the reasons set out above, in our report to DCMS we argued that the only viable model for the framework is an aggregator that can gather and deal with the data it needs from cultural heritage institutions of all sizes and levels of technical capacity, through all the aggregation methods currently used and likely to emerge in coming years.

The Collections Trust continues to advocate for a sustainable national aggregator for the museums and heritage sector…