Data Science For Social Good: Exploring Unstructured Genocide Survivor Testimony Metadata Using Data Driven Techniques

EE376A (Winter 2019)

By Anchit Narain, Nicolo Zulaybar and Jack Andraka

Project Overview

In 1994, Steven Spielberg founded the Shoah Foundation Institute in response to the many letters he received from survivors of the Holocaust who wished to share their stories in the months and years following the premiere of his acclaimed film, Schindler’s List.

From 1994 to 1999, the Shoah Foundation collected and digitized roughly 52,000 testimonies of survivors and other witnesses of the Holocaust. This information is now preserved in the USC Shoah Foundation Institute’s online Visual History Archive (VHA). However, the Shoah Foundation believes that simply storing the testimonies is not enough; the full social and educational potential of these testimonies cannot be realized without the development of tools that allow researchers to search, study and exhibit the archives’ more than 100,000 hours of testimony.

So far, most users of the VHA have been historians, but the Shoah Foundation would like to open the VHA to a much broader, general audience. To that end, the Shoah Foundation has partnered with renowned data sculptor Refik Anadol to unveil an emotionally moving creative piece, built from trends and data found in the VHA, at the 75th annual UN General Assembly later this year.

The Shoah Foundation reached out to our team of EE376A students to carry out the initial data exploration of the VHA’s vast survivor testimony metadata, looking for interesting trends or conclusions that Refik and his team could then use in their data sculpting.

What Does the VHA Metadata Look Like?

Holocaust historians viewed each individual testimony and tagged each one-minute segment with appropriate keywords, drawn from an aggregated 50,000-term thesaurus. This keyword thesaurus also forms the basis of the VHA’s indexing system, which has permitted researchers to perform detailed searches for relevant testimonies or segments of testimonies.

Each interview/testimony has an individual XML file containing the metadata for that specific interview. Each testimony is uniquely indexed by its filename and interview code, which allows each interview to be found easily in the database itself.

Each XML file is not organized in a rectangular format, however. The XML structure is best characterized as a “list of lists”, where each interview itself is a giant list, and each element in the list is a set of keyword tags for each minute of the interview (each in their own list if they correspond to different interview questions in the same minute). The name of the interviewee and each question they were asked is available in the XML files, but the questions are not accessible fields as they aren’t keyword indexed like the individual responses.
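To make the “list of lists” shape concrete, here is a minimal Python sketch that parses one interview file into nested lists. The element and attribute names (`interview`, `segment`, `response`, `keyword`, `code`) and the sample keywords are assumptions made up for illustration; the actual VHA schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: every tag, attribute, and keyword below is an
# illustrative assumption, not the real VHA metadata format.
SAMPLE_XML = """
<interview code="00001">
  <segment minute="1">
    <response><keyword>family life</keyword><keyword>schools</keyword></response>
  </segment>
  <segment minute="2">
    <response><keyword>deportation</keyword></response>
    <response><keyword>ghettos</keyword><keyword>food</keyword></response>
  </segment>
</interview>
"""


def parse_interview(xml_text):
    """Return (interview_code, segments).

    segments is a list with one entry per one-minute segment; each entry is
    a list of responses, and each response is a list of keyword strings.
    """
    root = ET.fromstring(xml_text.strip())
    segments = []
    for segment in root.findall("segment"):
        responses = [
            [kw.text.strip() for kw in response.findall("keyword")]
            for response in segment.findall("response")
        ]
        segments.append(responses)
    return root.get("code"), segments


code, segments = parse_interview(SAMPLE_XML)
print(code, segments)
```

The second minute in the sample holds two responses, which is why a single flat list of keywords per interview would lose information.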

This is an example of part of a single testimony metadata file. Notice that only the white colored responses/fields are keyword indexed in the metadata at one minute intervals. The questions themselves are not.

Additionally, there is a keyword hierarchy that follows the Z39.19 standard; however, we did not work with this data because we did not receive it in time. The following section discusses the approaches we took to clean, format, and explore the interview metadata described above.

What Work Did We Do on/with the Metadata?

As we didn’t receive the interview metadata until later in the quarter (end of week 7), we primarily focused on cleaning, reformatting, and then exploring certain aspects of the interview data. The following tables and charts reflect information gathered from exploring the demographic information extracted from a sample of 500 interviews.

One of the first steps we took was to construct rectangular/tabular data from the list-of-lists metadata so that we could more easily conduct frequency analysis and other forms of data exploration. We focused on creating 2 main tables: one that concatenated all the biographical information from each XML file into a single table, and another that concatenated every question asked of each interviewed survivor into one table. Attached below are partial screenshots of these tables:

Snapshot of the biographical information table. The majority of interviewees are survivors of the European Holocaust, though there were testimonies from survivors of other genocides as well.

Snapshot of all the unique questions asked to each interviewee.

A more detailed analysis of each interview showed that not all interviewees were asked the same set of questions, but in general, there existed a subset of questions that were constant across all interviews, and those formed the basis for the majority of the frequency analysis performed on demographic data. The table above and chart below both illustrate this point.
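As a rough sketch of how the question table and the shared subset of questions could be derived, the Python below flattens per-interview question lists into rows and intersects them. The interview codes and question texts are placeholders invented for illustration, and `to_rows` and `common_questions` are hypothetical helper names, not the code we ran.

```python
# Toy data: interview codes and question texts are made up for illustration.
interviews = {
    "00001": ["What is your name?", "Where were you born?", "Did you have siblings?"],
    "00002": ["What is your name?", "Where were you born?"],
    "00003": ["Where were you born?", "What is your name?", "When did you emigrate?"],
}


def to_rows(interviews):
    """Flatten {code: [questions]} into one (code, position, question) row per question."""
    return [
        (code, position, question)
        for code, questions in interviews.items()
        for position, question in enumerate(questions, start=1)
    ]


def common_questions(interviews):
    """Return the set of questions that appear in every interview."""
    question_sets = [set(questions) for questions in interviews.values()]
    return set.intersection(*question_sets) if question_sets else set()


rows = to_rows(interviews)
shared = common_questions(interviews)
print(len(rows), sorted(shared))
```

Rows of this shape can be written straight to a CSV file or loaded into a data frame, which is what makes the later frequency analysis straightforward.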

One example of the frequency analysis we performed on the reformatted, tabulated data was finding the relative distribution of where the interviewees were originally from and where they fled to seek refuge during or after the Holocaust.

From the above figures, we can see that the majority of interviewees in the randomly sampled subset of 500 testimonies were from Poland, and that most of them migrated to the US to seek asylum.
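A frequency analysis of this kind can be sketched with `collections.Counter`. The records and the field names `country_of_birth` and `country_of_refuge` below are placeholders chosen for illustration; they are not rows from the VHA sample and do not reproduce the figures.

```python
from collections import Counter

# Placeholder records: values and field names are illustrative assumptions.
records = [
    {"code": "A", "country_of_birth": "Poland", "country_of_refuge": "United States"},
    {"code": "B", "country_of_birth": "Poland", "country_of_refuge": "Israel"},
    {"code": "C", "country_of_birth": "Hungary", "country_of_refuge": "United States"},
    {"code": "D", "country_of_birth": "Poland", "country_of_refuge": None},
]


def relative_frequency(records, field):
    """Return {value: share of records}, skipping records where the field is missing."""
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}


birth = relative_frequency(records, "country_of_birth")
refuge = relative_frequency(records, "country_of_refuge")
print(birth)
print(refuge)
```

Skipping missing values means the shares are computed over the interviews that actually report the field, which matters when not every interviewee was asked the same questions.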

We are currently working on extracting intermediate location data from the interviews (names of camps, towns, etc. where the interviewees resided before receiving asylum), which we then plan to plot using the Google Maps API to track the journeys of individual interviewees, further visualizing their stories of survival.

Future Work

One interesting area of future exploration is centered around using Martin Niemöller’s famous poem, “First they came for…”, as a template for tracking the progression of violated rights among genocide victims. We would divide each interview into 100 separate bins and, in each, calculate the frequency of keywords documented in the metadata. In general, the interviewees recount their stories in chronological order, so we could assume that the frequency of keywords mentioned in each bin correlates to which issues survivors faced at different times during the genocide. We would then compare the temporal distribution of issues among different survivors of the same genocide to create an overall chronological trend map demarcating the order in which survivors’ rights were infringed upon. Finally, we would compare this across different genocides to analyze whether similar tactics were used universally across different genocides or even across different marginalized groups in the same genocide.
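The binning step could look something like the sketch below. It assumes each interview has already been reduced to a list of per-minute keyword lists in interview order; the function name `keyword_counts_by_bin` and the sample keywords are hypothetical. Interviews with fewer segments than bins simply leave some bins empty.

```python
from collections import Counter


def keyword_counts_by_bin(segments, n_bins=100):
    """Split an interview into n_bins equal-width position bins and count
    keywords per bin.

    segments: list of per-minute keyword lists, in interview order.
    Returns a list of n_bins Counters.
    """
    bins = [Counter() for _ in range(n_bins)]
    n_segments = len(segments)
    for i, keywords in enumerate(segments):
        # Map segment position i in [0, n_segments) onto a bin in [0, n_bins).
        bin_index = i * n_bins // n_segments
        bins[bin_index].update(keywords)
    return bins


# Toy interview with four one-minute segments, split into two bins.
toy_segments = [["schools"], ["schools", "family life"], ["ghettos"], ["ghettos"]]
toy_bins = keyword_counts_by_bin(toy_segments, n_bins=2)
print(toy_bins)
```

Because bins are defined by relative position rather than by minute, interviews of different lengths can be compared bin by bin.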

Outreach Event

We implemented a simple “list of lists” simulation activity at the outreach event at Nixon Elementary. We filled smaller Tupperware boxes with M&M’s of various colors, closed these containers, and placed them in a grid pattern inside a larger Tupperware box. There were 5 such large Tupperware boxes, with 4-5 smaller boxes in each. We then asked individual students to compete against each other to count the number of orange M&M’s in all the boxes; the goal was to reach the correct count in the fastest time.

We showed the students 2 possible algorithms for counting the M&M’s. The first was to open the larger Tupperware boxes and then count the orange M&M’s in the smaller boxes by inspection only (not opening each smaller box, but instead looking through the transparent bottoms). The second was to open each smaller Tupperware box individually and count the orange M&M’s by hand. The competitors could pick whichever algorithm they preferred, or create their own, to count the 25 orange M&M’s as quickly as possible.

The inspection method worked best, and those who completed it in the shortest time by the end of the night all received grand prizes: a giant bag of M&M’s, potatoes, or onions (depending on whether they were first, second, or third fastest overall). We also handed out a packet explaining why this method worked fastest, and we hope the students tackle list-of-lists problems using a similar method if they come across them in the future.
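For readers who prefer code, the counting task maps onto a sum over a nested list. This is only an illustrative sketch; the box contents below are made up and do not match the boxes used at the event.

```python
# Made-up contents: each large box is a list of small boxes,
# and each small box is a list of M&M colors.
large_boxes = [
    [["orange", "red"], ["blue", "orange", "orange"]],
    [["green"], ["orange", "yellow"]],
]


def count_color(large_boxes, color):
    """Count M&M's of one color across every small box in every large box."""
    return sum(
        small_box.count(color)
        for large_box in large_boxes
        for small_box in large_box
    )


orange_total = count_color(large_boxes, "orange")
print(orange_total)
```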
