José Eduardo

AI at Europeana. Strategies for mitigating bias.

Discover the fusion of culture and cutting-edge AI with José Eduardo, a machine learning engineer from Europeana, at this ‘AI to amplify’ talk! Dive into the transformative world where AI not only preserves but enhances our rich cultural heritage. Learn how Europeana is tackling the challenges of digital curation with innovative AI applications—from breathing new life into low-resolution images to smart watermark detection and duplicate identification. Get to know the strategies that combat bias and ensure diversity and accessibility in the digital realm.
 
 
Summary

In his talk, José Eduardo, a machine learning engineer at Europeana, discusses AI's role in enhancing the accessibility and diversity of cultural heritage materials in Europeana's vast online gallery. He covers some of the recent machine learning work at Europeana, with special focus on strategies to improve data quality and mitigate bias, such as documentation, dataset curation, evaluation and model interpretability. He also explains how a trained classifier can distinguish watermarked images, aiding in data curation, and emphasises the importance of documentation for transparency and reproducibility, as well as the use of model explainability to ensure AI models focus on relevant data features.

Transcript

Hi, my name is José and I'm a machine learning engineer at Europeana, and this is my talk for the event “AI to amplify”. This is the index for the talk: I will briefly introduce Europeana, what we do and what it has to do with AI. Then I will give a brief introduction to the AI work that we do at Europeana, spanning projects such as super-resolution, watermark detection and duplicate image detection, and also say something about the strategies that we use for mitigating bias.

So first of all, what is Europeana? Europeana is a foundation based in the Netherlands and, in a nutshell, a gigantic online gallery of cultural heritage material with more than 50 million objects, including content and metadata. One of our goals is to make our data as accessible and diverse as possible, and therefore we follow the FAIR principles: we try to make our data findable, accessible, interoperable and reusable.

As you can imagine, there is a lot of technology involved in this endeavour, and this is where AI plays a significant role, which I will describe now. There are different goals for artificial intelligence and machine learning technologies at Europeana. One of them is enrichment, where we contextualize, enrich or enhance the content that we receive with our cultural heritage objects, such as the images or the text.

We enrich this content with metadata, and we can also augment the existing metadata. If, for instance, we are given some metadata containing an entity such as a place or a person, by using AI we can identify those entities in the metadata and perhaps link them with their Wikipedia entries.
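The talk does not name the tooling behind this enrichment; as a rough illustration only, a named-entity-recognition step like the following sketch (using spaCy as an assumed stand-in) could provide the entities that are later linked to Wikipedia or Wikidata:

```python
# A minimal sketch of the entity recognition step in metadata enrichment.
# spaCy and the model name are illustrative assumptions, not Europeana's actual stack.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
description = "Portrait of Rembrandt van Rijn, painted in Amsterdam around 1660."

doc = nlp(description)
for ent in doc.ents:
    # Entities of type PERSON or GPE (places) are candidates for linking
    # to external knowledge bases such as Wikipedia or Wikidata.
    print(ent.text, ent.label_)
```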
Another goal of our efforts is data curation and quality. As we said, our content is quite diverse and huge, and therefore it is unavoidable to find some quality issues such as duplicate images or low-resolution images. So what we try to do is curate our collections, so that we are at least aware of the objects that compromise quality and can enhance our collections accordingly.

As I mentioned, I will briefly describe three projects, namely enhancing the resolution of low-resolution images, detection of watermarks, and the visualization and curation of datasets. In this first project, as you can see, some of our images have low resolution, because we have providers from all over the world. We already detect the presence of those, but so far we haven't tried to correct them. Now, with the advances of generative AI, in particular image-to-image models, we already have off-the-shelf models that can effectively double the resolution of images. These models have been trained on regular imagery, not specifically cultural heritage imagery, but they nonetheless work quite well. As a matter of fact, we conducted a quantitative evaluation and indeed saw the advantages of using this type of model. They can significantly increase the resolution and therefore improve the visual quality and the user experience on our platform. So that was the first problem.
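The specific model is not named in the talk; as a rough sketch of what applying such an off-the-shelf image-to-image upscaler can look like, here is an example assuming the publicly available Swin2SR model from the Hugging Face transformers library:

```python
# A minimal sketch of off-the-shelf super-resolution; the exact model Europeana
# uses is not stated, Swin2SR is simply one publicly available 2x upscaler.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Swin2SRForImageSuperResolution

processor = AutoImageProcessor.from_pretrained("caidas/swin2SR-classical-sr-x2-64")
model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-classical-sr-x2-64")

image = Image.open("low_res_scan.jpg").convert("RGB")  # hypothetical input file
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert the reconstructed tensor (values in [0, 1]) back to an image and save it.
upscaled = outputs.reconstruction.squeeze().clamp(0, 1).permute(1, 2, 0).numpy()
Image.fromarray((upscaled * 255).round().astype("uint8")).save("upscaled_scan.png")
```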

The second problem that I'm going to mention here is the detection of images with watermarks. As you can see here, in the lower right corner of the images we have some kind of property mark, like a stamp, that the aggregator or the owner of those objects probably added. This can effectively hamper the user experience. So at the very least we would like to identify those images so that we can take action regarding them, either removing them or flagging them to the user. The way we built this solution was by training a model.

So first we had to gather many examples, in particular close to 2,000 watermarked and normal images, and we used this dataset to train a classifier that outputs whether an image contains a watermark or not. This was quite powerful and effective, and we are still working on it.
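The exact training setup is not described in the talk; a minimal sketch of such a binary watermark classifier, assuming the roughly 2,000 labelled images sit in watermark/clean folders and a pretrained ResNet-18 backbone is fine-tuned, could look like this:

```python
# A minimal sketch of a binary watermark / no-watermark classifier, assuming the
# labelled images are arranged in folders data/train/{watermark,clean} (hypothetical paths).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Start from an ImageNet-pretrained backbone and replace the head with 2 classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```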

And then the last use case that I will mention here is the visualization of datasets and the detection of duplicate images, both of which are based on technologies relying on image similarity. We can train a model to obtain embeddings, which are low-dimensional representations, basically vectors: arrays of numbers that represent the images semantically, meaning that images with similar content will have very similar vectors.

So we can convert images into vectors, which is quite powerful, because if those vectors encode semantic information, then we can calculate the similarity between images and therefore quantify how different two images are. In the extreme, we can detect duplicate images as those whose vectors are very close to each other. This is quite powerful because it allows us to identify a lot of duplicate images in massive datasets, because the same data sometimes comes from different sources, or the same source uploads the same data twice by mistake.
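As an illustration only (the embedding model Europeana uses is not named in the talk), here is a sketch of duplicate detection via embeddings and cosine similarity, with a pretrained ResNet-18 standing in as the feature extractor and hypothetical file names:

```python
# A minimal sketch of duplicate detection via image embeddings; a generic
# pretrained backbone stands in for whatever embedding model is actually used.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Use a ResNet with its classification head removed as a feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed(path: str) -> torch.Tensor:
    """Map an image file to a single embedding vector."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return backbone(x).squeeze(0)

# Hypothetical file names; images whose embeddings are almost parallel
# (cosine similarity close to 1) are treated as likely duplicates.
a, b = embed("object_001.jpg"), embed("object_002.jpg")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
if similarity > 0.98:  # threshold chosen for illustration only
    print(f"Likely duplicates (cosine similarity {similarity:.3f})")
```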

So we would like to identify where this happens. And as you can see in these visualizations, having embeddings, the vector representation of these objects, also allows us to visualize them. With some nice visualization software, we can even drag around and basically navigate this kind of embedding space. As we can see here, this automatically separates the data by color, for instance into darker and lighter images, and these are just example properties: if we want to explore other properties of the dataset, we can still use this technology. So it is quite useful for dataset curation.
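Here is a minimal sketch of such an embedding-space visualization, assuming t-SNE as the 2D projection (the actual method and software are not specified in the talk) and random vectors standing in for real embeddings:

```python
# A minimal sketch of projecting image embeddings to 2D for visual dataset
# exploration; t-SNE and the random toy data are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512))  # stand-in for per-image embedding vectors
brightness = rng.uniform(size=500)        # stand-in for a per-image property to color by

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=brightness, cmap="gray", s=6)
plt.title("Collection projected from embedding space")
plt.show()
```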

And finally, more in line with the topic of this workshop: what do we do, what strategies do we follow, in order to mitigate bias? This is our two cents on how we try to address it. The full concept of bias in machine learning goes a little bit beyond our scope, but we nonetheless try to make some efforts to combat the obvious sources of bias that appear when working on machine learning.

So the first initiative or effort that we use to mitigate bias is documentation. We try to document the provenance and the uses of the datasets. Why? Basically, so that we know exactly where the data is coming from: which providers, which sources. This fosters transparency and accountability, and also reproducibility. If we know exactly what data a model was trained with, then we will probably be able to replicate that model.

I included here a screenshot of a paper that is a reference in the field of documenting datasets and models, and I encourage you to have a look. Documentation has a double function. For creators, it's important to document because it gives you clarity and order about the process, and therefore encourages reflection while creating models or datasets. And obviously it has advantages for consumers, because it ensures they have the information they need to reproduce the data or model, or to use it in the proper way.

As I mentioned in one of the use cases, we also focus our efforts on dataset curation, because we think it's important to know exactly what data we are working with. Here you can see a visualization, built using image similarity techniques, of a massive dataset of tens of thousands of images. With it we can explore and visually navigate the dataset and identify those data points that are not really useful, or those that are especially useful and that we want to keep for a special purpose.

A very important aspect of machine learning is how the evaluation is done. This also depends on the problem at hand; it's not a trivial task and there are plenty of sources of bias in it. The key point is being able to evaluate the generalization capability of the model: how well does it perform on unseen data, on data that it has not been trained on? Here are some tips and recommendations that we follow in our own experiments.

We try to make the training and test splits in a way that avoids data leakage, so that the same data is not present in both the training and the test set. Also, experimentally, we found that it's important to do stratified sampling, so that when we divide the training and the test data we keep more or less the same distribution amongst classes as in the original data and don't create artificial imbalances between the training and test data.
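A minimal sketch of such a stratified, leakage-free split with scikit-learn, using toy stand-ins for the real image paths and labels:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real dataset: file names plus binary labels
# (1 = watermarked, 0 = clean). The imbalance is deliberate, to mimic reality.
paths = [f"img_{i:04d}.jpg" for i in range(2000)]
labels = [1 if i % 10 == 0 else 0 for i in range(2000)]

# stratify keeps the class proportions of `labels` in both splits, so the
# split itself does not introduce an artificial imbalance, and each image
# ends up in exactly one split (no leakage between train and test).
train_paths, test_paths, y_train, y_test = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42
)
```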

To properly quantify the generalization capability of our model, we usually also do cross-validation, which means creating several train/test splits, repeating the experiment for each of them and then averaging the results. With several experiments we get a better estimate of the performance of the model on unseen data, which is what we are mainly interested in.
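And a minimal sketch of stratified cross-validation with scikit-learn, with a simple classifier and random toy data standing in for the real model and dataset:

```python
# A minimal sketch of (stratified) cross-validation; the real pipeline trains a
# deep classifier, here a simple model and random features stand in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.random((2000, 512))                    # toy stand-in for image embeddings
y = (rng.random(2000) < 0.1).astype(int)       # toy watermark labels, ~10% positive

# Five different train/test splits; the averaged score is a steadier estimate
# of performance on unseen data than any single split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(), X, y, cv=cv, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f} ± {scores.std():.3f}")
```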
Another method that we use to make sure there is nothing weird going on in our models, that there is no shortcut learning, is model explainability. One of the main criticisms of AI, or deep learning rather, is that the models are usually black boxes: it's difficult for us humans to straightaway interpret the output or the decisions of the model, or the process that led the model to take a certain decision.

Usually the images are represented mathematically as pixels, the model takes them and, after it has been trained, makes a decision through a lot of matrix multiplications, and that's it: it produces an outcome with a certain confidence score. But that's usually not enough for humans. We want to make sure that the model is actually solving the task it's supposed to solve, in the way it should solve it.

So for example, here in the case of watermark detection, these are the same type of images that I mentioned in the project before. We want to detect a watermark. We know, as humans, that the watermark is located in a certain area, in particular the lower right corner in these examples. So we know that if the model decides that an image contains a watermark, it must be because it has seen something around where the watermark is.

And indeed, when we apply explainability techniques, which allow us to visualize a heat map of the areas in an image most relevant for the outcome of the model, we see that the most relevant areas lie over the area that contains the watermark, which is good. This means that the model is paying attention to the area we want it to pay attention to, and that nothing strange is happening.
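A minimal sketch of how such a heat map can be produced with a Grad-CAM style technique (the talk does not name the exact explainability method used), with a pretrained ResNet-18 and a hypothetical input file standing in for the real watermark classifier and data:

```python
# A minimal sketch of a Grad-CAM style heat map; this is only one common way to
# obtain such a map, and the pretrained ResNet-18 stands in for the real
# two-class watermark classifier.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Capture activations and gradients of the last convolutional block.
activations, gradients = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: activations.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: gradients.update(v=go[0]))

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
x = preprocess(Image.open("watermarked.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

scores = model(x)
scores[0, scores.argmax()].backward()  # gradient of the predicted class score

# Weight each feature map by the mean of its gradients, sum, ReLU, upsample, normalize.
weights = gradients["v"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heat map in [0, 1]
```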

If the explainability map showed us that the model is looking in a completely different place, then we would say: okay, the model is classifying this image correctly as a watermark, but it's actually paying attention to another area, so there might be something going on in terms of data quality. Perhaps there has been some misclassification or some mistake in the training data, something like that. So that's an extra diagnostic tool for models.

And this is the end of my talk. Thank you very much.


Best-of

Here you will find short excerpts from the talk. They can be used as micro-content, for example as an introduction to discussions or as individual food for thought.
 
 


Ideas for further use

  • The talk can be used as a learning resource for students and researchers interested in the intersection of AI and cultural heritage. It can serve as a case study in courses on machine learning, data science, digital humanities, or museum studies.
  • AI professionals and cultural institution staff could use the talk for training purposes, to understand the practical applications of AI in their field and to stay updated on the latest technologies and methodologies for managing digital collections.
  • The talk is suitable for presentation at AI, technology, and cultural heritage conferences, as well as workshops focused on digital transformation in the arts and humanities sectors.
  • For policymakers and funders in the technology and cultural sectors, the talk can inform decisions on supporting projects that leverage AI for cultural preservation and accessibility.


Licence

José Eduardo: AI at Europeana. Strategies for mitigating bias.
by José Eduardo Cejudo Grano de Oro (Europeana) for: Goethe Institut | AI2Amplify is licensed under Attribution-ShareAlike 4.0 International

Documentation of the AI to amplify project @eBildungslabor

Linus Zoll & Google DeepMind Bild Linus Zoll & Google DeepMind / Better Images of AI / Generative Image models / Licenced by CC-BY 4.0