How AI Is Bringing the Vatican Secret Archives Into the Light of Day

The Vatican Secret Archives are one of the most important troves of historical documents in the world today. While the term “Secret Archives” is a bit of a misnomer (a better translation might be “private Vatican Apostolic archives”), the archive contains the records and documents of all of the actions taken by the Catholic Church over a period of more than 800 years. One does not need to be religious in the slightest to see the historical interest in preserving documents that, in a few cases, date back to the end of the 8th century and concern an institution that played a pivotal role in politics, religious practice, statecraft, and culture across Europe and the world. Items previously exhibited from the Vatican Archives include the 1521 bull of excommunication against Martin Luther and a letter written by Mary, Queen of Scots while awaiting her execution (presumably written in late 1586 or early 1587, given the relatively short period between her trial and execution).

Wikipedia notes that the archive is considered to be complete from 1198 to the present day, though material from 1939 forward is still closed to researchers. But even though the Vatican Archive is technically open to researchers, there’s been a larger problem with making use of much of the documentation. The problem, put simply, is the language itself. Below is a sample of what’s known as Caroline minuscule script. The document, Liber septimus regestorum domini Honorii pope III, apparently refers to Pope Honorius III, who was head of the Catholic Church from 1216 to 1227. I’m inferring the relationship, but the dates and popes seem to check out.

In Codice Ratio

I haven’t found a specific statement that confirms this, but Wikipedia notes that Caroline minuscule (also known as Carolingian minuscule) was in use from approximately 800 to 1200 AD, which would fit the pope’s own time period. But as you may have noticed, the text is extremely difficult to read. If you don’t have a degree in Medieval Latin and a detailed knowledge of the script, it’s going to be more or less impossible to decipher. This has dramatically limited the ability of scholars to make much use of the documents; The Atlantic notes that the Vatican Secret Archives is both one of the largest and the “most useless” archives in existence for this exact reason. Scholars have previously attempted to adapt optical character recognition, or OCR, for use in the archives, with limited success. OCR only works reliably on typeset characters, which these aren’t. Key to the function of OCR is the ability to recognize the spaces between letters in order to distinguish the letters themselves, and connected handwriting often has no such spaces. Attempting to teach OCR systems to read whole words instead of letters works, but the requirements of building a database of words are enormous enough that, as The Atlantic notes, scholars have turned to other methods. Enter a new solution: In Codice Ratio.

In Codice Ratio breaks characters down into pen strokes by measuring differences in line thickness, proposes letter boundaries based on where the thinner joins fall, then turns to an AI trained by high schoolers to judge whether the identified letters are accurate.
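As a rough illustration of the stroke-splitting idea (not the project's actual code, which isn't described in detail here), the sketch below treats the number of ink pixels in each column as a crude stand-in for line thickness and proposes a cut wherever a thin join sits between two thicker strokes. The toy image, the function names, and the `max_ink` threshold are all assumptions made for this example.

```python
from typing import List

# Toy binary image of three vertical strokes linked by thin joins.
# '#' is ink, '.' is blank parchment. Purely illustrative data.
WORD = [
    "##...##...##",
    "##...##...##",
    "############",
]


def ink_profile(image: List[str]) -> List[int]:
    """Count ink pixels in each column (a crude proxy for stroke thickness)."""
    return [sum(row[x] == "#" for row in image) for x in range(len(image[0]))]


def candidate_cuts(profile: List[int], max_ink: int = 1) -> List[int]:
    """Return the midpoint of every thin run that has thick ink on both sides."""
    cuts, x, n = [], 0, len(profile)
    while x < n:
        if 0 < profile[x] <= max_ink:
            start = x
            # Walk to the end of this run of thin columns.
            while x < n and 0 < profile[x] <= max_ink:
                x += 1
            thick_left = start > 0 and profile[start - 1] > max_ink
            thick_right = x < n and profile[x] > max_ink
            if thick_left and thick_right:
                cuts.append((start + x - 1) // 2)
        else:
            x += 1
    return cuts


def split_columns(image: List[str], cuts: List[int]) -> List[List[str]]:
    """Slice the image into stroke candidates at the proposed cut columns."""
    edges = [0, *cuts, len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(edges, edges[1:])]


profile = ink_profile(WORD)
cuts = candidate_cuts(profile)
pieces = split_columns(WORD, cuts)
print(profile, cuts, len(pieces))
```

Each resulting piece is only a candidate. Deciding whether a piece, or a run of adjacent pieces, is really a letter is the job handed to the trained classifier.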

High schoolers are first shown valid examples of a medieval “g” (in green) and examples of what does not constitute a “g” (in the red boxes). They are then asked to determine which of the candidates in the white boxes are genuine letters, and which are letter groups that the OCR software thinks might be letters but actually aren’t. Here’s The Atlantic:

The setup did require some expert input: Scholars had to pick out the perfect examples in green, as well as the false friends in red. But once they did this, there was no more need for them. The students didn’t even need to be able to read Latin. All they had to do is match visual patterns. At first, ‘the idea of involving high-school students was considered foolish,’ says [Paolo] Merialdo, who dreamed up In Codice Ratio. ‘But now the machine is learning thanks to their efforts. I like that a small and simple contribution by many people can indeed contribute to the solution of a complex problem.’

That wasn’t the end of the training; the scholars working on the project also had to bake in some common sense, since the OCR software wasn’t always able to cleanly distinguish certain letter groups. But it turns out that some letter combinations are far more likely in Latin than others: a double “n” as in the word “Anno” may look superficially similar to a set of four i’s, but “nn” is vastly more common than “iiii.” Right now, the software is still learning. While its 96 percent accuracy is impressive, that error rate still leaves at least one typo in about one-third of the words. As one can imagine, this would be rather disconcerting to read.
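The “nn” versus “iiii” check can be sketched with a simple letter-pair (bigram) model. This is a stand-in technique rather than a description of what In Codice Ratio actually uses: the six-word corpus, the add-one smoothing, and the function names are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter
from typing import Iterable, List

# Tiny illustrative word list; a real system would use a large Latin corpus.
TOY_CORPUS = ["anno", "domini", "nomine", "omnium", "annus", "innocentius"]


def bigram_counts(words: Iterable[str]) -> Counter:
    """Count adjacent letter pairs, with ^ and $ marking word boundaries."""
    counts: Counter = Counter()
    for word in words:
        padded = f"^{word}$"
        counts.update(zip(padded, padded[1:]))
    return counts


def mean_log_score(reading: str, counts: Counter, alpha: float = 1.0) -> float:
    """Average smoothed log-frequency of the reading's letter pairs."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen pairs
    padded = f"^{reading.lower()}$"
    pairs = list(zip(padded, padded[1:]))
    logs = [math.log((counts[p] + alpha) / (total + alpha * vocab)) for p in pairs]
    return sum(logs) / len(logs)


def pick_reading(candidates: List[str], counts: Counter) -> str:
    """Choose the candidate transcription whose letter pairs look most Latin."""
    return max(candidates, key=lambda r: mean_log_score(r, counts))


counts = bigram_counts(TOY_CORPUS)
print(pick_reading(["aiiiio", "anno"], counts))
```

Averaging over pairs keeps longer readings from being penalized simply for having more letters. In practice a score like this would presumably be combined with the classifier's visual confidence rather than used on its own.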

But there are two takeaways here. First, even 96 percent accuracy is often enough to read texts and would therefore be an improvement over needing a medieval Latin scholar every time one wanted to investigate a particular pope or document. Second, the current performance of In Codice Ratio represents the early baseline of the software, not the final product. If this approach proves to work, it could be instrumental in recovering text from documents that are now too degraded to be handled or too difficult to read. The high school students who helped train the AI didn’t need to be able to read Latin; all they needed was a grasp of pattern recognition.
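For a sense of how 96 percent accuracy translates into typos per word, here is a back-of-the-envelope calculation. It assumes the figure is a per-letter rate and that errors on different letters are independent, neither of which is confirmed above, so treat it as a rough sanity check only.

```python
def word_error_rate(char_accuracy: float, word_length: int) -> float:
    """Chance that a word has at least one wrong letter, assuming independent errors."""
    return 1.0 - char_accuracy ** word_length


# Print the chance of at least one typo for a few word lengths.
for length in (3, 5, 8, 10):
    print(length, round(word_error_rate(0.96, length), 3))
```

Under those assumptions, a ten-letter word has roughly a one-in-three chance of containing at least one typo, while shorter words fare noticeably better.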