Where does the data come from? On the responsible development and use of AI

Miðeind’s tenth anniversary year is drawing to a close. Over the past decade, the company has transformed from a tiny language technology startup into one of Iceland's largest firms training AI models from scratch. We see enormous potential in AI, have been positive advocates for it, and have participated in making it better and more robust. Its utility is clearly evident in our product, Málstaður, which has enjoyed an excellent reception. Language technology for Icelandic has made great strides in recent years, largely thanks to AI.

However, AI is controversial, and not without cause. Large AI models are energy-intensive to train and run; they require vast amounts of data, and that data is not always obtained freely. The weightiest questions, though, concern human existence: What makes us human? Is there a risk of diluting what makes human creativity unique if we outsource ever more tasks to machines?

This debate is prominent among those in creative industries, and their concerns are not unfounded. There was a recent uproar when the long-established family business Kjörís launched a new Christmas ice cream featuring packaging illustrated with a rather poor and uninspired AI drawing that bore little resemblance to Icelandic reality. Around the same time, controversy erupted over the Ministry of Education and Children's Affairs' contracts with the American AI giant Anthropic, as the company has been shown to have extensively and illicitly scraped literary works, including those of Icelandic authors.

It is not unreasonable for artists to ask, under these circumstances, whether the goal of widespread AI adoption is simply to eliminate their jobs. The Writers' Union of Iceland raised this very question at a recent lunch meeting titled “Will computers be writing the Christmas books?” Diverse viewpoints were exchanged, but we can likely all agree that the thought of our work being used to train mechanical successors, so that they can eventually oust us, is rather depressing.

This discussion hits close to home for Miðeind because we train AI ourselves and have assisted international partners in training AI, including by providing them with open data such as the Árni Magnússon Institute's Gigaword Corpus and a filtered version of the Common Crawl dataset. There is, therefore, every reason to reiterate our stance on data collection for such training.

Miðeind's purpose is, and always has been, to ensure Icelanders have access to the best possible language technology in our own tongue. AI cannot learn a language without data, and the larger the models, the greater the need. This is a significant hurdle for a very small language like Icelandic, which does not have abundant open data sources at its disposal. If technology is to speak Icelandic, every letter and every byte counts. It is also crucial that datasets are diverse, reflecting everything from literary fiction to news articles, and academic papers to social media chatter.

That said, it has never been an option for Miðeind to train language models on unethically sourced data or to encourage others to do so. In fact, in conversations with the government and other stakeholders, we have advocated for finding a way to compensate authors for the use of their works in AI training—provided, of course, that they consent to such use in the first place.

Regardless of the data source, many find the trend of outsourcing tasks requiring creativity and reflection to AI to be tiresome, or even worrying. The text generated by Large Language Models (LLMs) is sometimes clichéd, monotonous, and characterized by a certain flatness. Yet we use these models anyway; the reason is simply that they are incredibly powerful tools that can save us a vast amount of time and effort. As with most tools, the quality of the output reflects the care put into the work. It can take time and several iterations to craft a good prompt for an LLM. As we use the technology more, we learn to recognize its limitations and better assess which tasks suit it best.

At Miðeind, we use generative AI for various tasks, and some of our products feature functionality powered by Large Language Models. When we decide to utilize generative AI in product development, our guiding principle is always to use it responsibly and for good. If a task can be solved effectively with a smaller, more agile model, we choose that path. It is worth noting that while many have become accustomed to speaking of "AI" as a singular entity—usually referring to generative models like GPT-5 or Gemini—artificial intelligence is by no means just one thing.

All of Miðeind’s main language technology products rely on AI to some extent, but in most cases, these involve small models that we have trained from scratch or fine-tuned. Our most popular product, Málfríður, is an example of the former; it is trained on various open text corpora along with data we have generated ourselves. Ultimately, Miðeind’s goal is to support our language and ensure it continues to be used in all corners of society. To prevent English from taking over in certain domains, Icelandic language technology simply must keep pace with the capabilities available in English. In that regard, generative AI is truly a godsend when used correctly.

Finally, it is worth considering Moravec's paradox: tasks that are difficult for humans are easy to teach AI, yet it struggles with what we humans find obvious and natural. Creative writing—Christmas books and the like—requires not only literacy but also human experience, physical existence, emotional depth, and intuition. These are things that are difficult to construct from data. It is our conviction at Miðeind that the future lies not in choosing between human creativity and AI, but in an interplay where technology supports creativity without encroaching upon it. We have worked tirelessly toward this vision for ten years and look forward to seeing what the next decade brings.
