AI and Archives. Breakdowns and Breakthroughs.

By Eric Hoyt, Kahl Family Professor and WCFTR and MHDL Director

The past year has been a time of AI-powered breakdowns and breakthroughs for the Wisconsin Center for Film and Theater Research (WCFTR) and Media History Digital Library (MHDL).

The biggest and most disruptive series of AI-fueled breakdowns have been the server outages that have disrupted service to the MHDL search engine Lantern. These crashes are the result of LLM bots crawling the website, scraping the scanned trade papers and fan magazines to train their next gen models.

If you have been negatively affected by the Lantern outages, we are sorry. We hate that robot traffic has denied our loyal base of human users the ability to reliably utilize Lantern and the collections that we have spent the last 15 years curating. It was crushing to learn of instructors changing entire student research assignments because of the inconsistent service.

We are determined not to let these intermittent outages resume. After months of blocking offending IP addresses in an endless game of whack-a-mole, we have now implemented stronger protections to keep our servers online for our human users.

That is not, however, the end to our AI story.

During the same stretch of time that we have been slammed by AI breakdowns, we have achieved several breakthroughs in deploying AI. The breakthroughs are allowing us to accurately recognize text, extract structured data, and segment different sections of page scans (e.g., headline, news story, advertisement).

Ben Pettis wrote a great blog post last summer about our efforts to teach computers how to read pressbooks. It’s a project that required three talented graduate students (Areyana Proctor, Lore FitzWhittemore, and Olivia Riley) to annotate 4,000 pressbook page images to train a YOLO (You Only Look Once) object detection model. Utilizing a workflow designed by Ben Pettis and Sam Hansen, we then ran the fine-tuned YOLO model on full corpus of pressbooks and output detected segments. Below is an example of the output of the fine-tuned model, after running it with the support of our friends at UW-Madison’s Center for High Throughput Computing.

Fig 1. Output from fine-tuned YOLOv11 segmentation model.

Why use the adjective “fine-tuned”? In this case, fine-tuned is an AI development term used to acknowledge that our pressbook detection model is built atop much larger convolutional neural networks—which themselves were built atop massive labor, environmental, and financial resources. We have an ethical duty when approaching AI for archives to ask ourselves: can we deploy the technology in ways that are responsible in light of the costs? The modest, lightweight, fine-tuned approach helps us meet this responsibility.

We are also finding that results from modest and targeted deployments of AI can yield tremendous gains for archives like ours. On a very basic level, current LLMs far outperform older OCR software packages for text recognition. The context-sensitive and predictive language mechanisms built into LLMs output the text from old pressbooks with tremendous accuracy. You can see the examples below, which compare the LLM Chandra model against the older effOCR software package on a volume available in the MHDL’s Pressbooks Collection. For those of us who developed the Lantern search engine—and I suspect most researchers who use it—the difference and improvement is remarkable.

How else can AI help with the research process into film history? For Project Ballyhoo, Ben, Sam, Kallan Benjamin, and I have been using those fine-tuned pressbook segments to look for instances in which American newspapers reprinting the pre-written publicity stories served up to them by Hollywood publicists (via local movie theater managers). We have found many, many such instances. As we will be discussing more in a forthcoming publication, we have found that the pressbook text generally appeared in newspapers alongside advertisements for the same movie, from the same pressbook. Below are examples from Gold Diggers of 1933. Whether one wants to label it a quid pro quo, or an extension of the advertising itself, many exhibitors expected, and received, “news” coverage when they purchased an advertisement. Despite codes of journalistic ethics that mandated the separation of news and advertising, many publishers sold the two together to exhibitors, part and parcel.

As my team and I continue to explore the reprinting of pressbooks in newspapers, we have had to do so with drastically fewer resources due, in part, to a poorly conceived and highly destructive deployment of AI. Project Ballyhoo holds the peculiar distinctions of being one of the last projects to ever receive a NEH Digital Humanities Advancement Grant—conferred in January 2025, terminated three months later by DOGE and the acting NEH chairman. We joined the lawsuit led by the ACLS, AHA, and MLA to restore the funding for two grants that had been revoked from the WCFTR and hundreds of other organizations across the country. We are proud that our work received specific praise in a federal judicial decision last July that deemed the grant terminations “unlawful”—a conclusion that was reaffirmed and elaborated upon a few weeks ago in a second ruling.

As the New York Times has reported in depth, one of the things that the litigation’s deposition and discovery process revealed was that the DOGE staffers used a ChatGPT prompt to assess, target, and de-fund projects. DOGE pointed ChatGPT toward a folder of project abstracts and offered the prompt: “Does the following relate at all to D.E.I.? Respond factually in less than 120 characters. Begin with ‘Yes’ or ‘No.’” In a memorable moment from the depositions that has now become something of an Internet meme, one of the DOGE staffers struggled to define D.E.I. when asked explain to what it was.

This use of AI—and the larger demolition of a once great NEH—is infuriating and wrong for so many reasons. To try to keep this blog post on topic, however, I will simply point out one reason: chatbots should not be making conclusions, let alone consequential decisions, for questions that center on Humanistic inquiry. AI can improve the searchability of collections. AI can recognize patterns in those collections, helping us find needles in haystacks. But it requires humans to determine the meaning—to assess what all of those needles and haystacks tell us that is interesting and new, and why any of them should matter to us at all.

With all of this in mind, we at the WCFTR will continue to explore the possibilities of AI for archives. Thoughtfully and responsibly. Prepared for more breakdowns and breakthroughs. And holding onto those most human qualities of all: creativity, community, and resiliency.