The next time you’re scrolling through Netflix, take a second look at the thumbnails. Better yet, compare the images you receive with the images a friend gets on their account. Chances are, the thumbnails are different. You just might like the image you got better than your friend’s. This isn’t a glitch: Netflix has begun to personalize thumbnail art for individual viewers. Through in-house studies that measured how long users looked at a thumbnail before moving on or clicking for more information, the company found that artwork captures 82% of a user’s focus while browsing. While the company used to receive artwork for titles from the studios, such as materials from billboards or movie posters, the importance of retaining a user’s attention has pushed Netflix to develop new techniques. An enticing image, tailor-made for a user’s specific movie history and preferences, can increase the chances a user clicks on a title and maximize the time they spend using Netflix.
A single season of an average TV show (about 10 episodes) contains nearly 9 million total still image frames. So how does Netflix come up with just one image to display? In an effort to reduce human labor and maximize its ability to efficiently personalize content, Netflix designed the AVA (Aesthetic Visual Analysis) system. AVA has two main steps, frame annotation and image ranking, which both use digital analysis tools to evaluate images.
In the frame annotation stage, a program analyzes every static frame in a given show. This is one of the first major advantages of using an automated approach to creating thumbnails: even the most dedicated artist would find it difficult to sort through every single frame. In this stage, frames are judged on three different characteristics to create metadata about each frame.
The first characteristic is visual: it considers factors such as brightness, color, and blurriness. This information is assigned to each frame, as seen in the example below from the film Bright.
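To make these visual measurements concrete in code, here is a minimal sketch that computes a brightness value and a simple sharpness value for a grayscale frame stored as a nested list of pixel intensities. The function names and the use of Laplacian variance as a stand-in for blurriness are assumptions made for illustration; Netflix has not been quoted here on how AVA actually computes these values.

```python
from statistics import mean, pvariance

def brightness(frame):
    """Mean pixel intensity of a grayscale frame (values 0-255)."""
    return mean(p for row in frame for p in row)

def sharpness(frame):
    """Variance of a 4-neighbor Laplacian over interior pixels.

    Low values suggest a blurry or featureless frame.
    Assumes the frame is at least 3x3 pixels.
    """
    h, w = len(frame), len(frame[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (frame[y - 1][x] + frame[y + 1][x]
                   + frame[y][x - 1] + frame[y][x + 1]
                   - 4 * frame[y][x])
            responses.append(lap)
    return pvariance(responses)

def visual_metadata(frame):
    """Bundle the measurements into per-frame metadata (hypothetical schema)."""
    return {"brightness": brightness(frame), "sharpness": sharpness(frame)}

# Two toy 5x5 frames: a flat gray one and a high-contrast checkerboard.
flat = [[100] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
```

The flat frame has no edges at all, so its sharpness is zero, while the checkerboard scores far higher. A real system would run on full-resolution color frames and would likely use an image library rather than nested lists.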
The second characteristic is contextual: it considers what is going on in a given frame by examining the motion occurring in it and highlighting faces and objects that might be important to a show. In this frame from Stranger Things, the faces of the characters are identified, with main character Eleven singled out in the bottom right.
The third characteristic is composition: it considers visual principles important to aesthetics, like symmetry, position of major faces or objects, and depth of field. With important objects (such as the woman in the frame below) selected and overlaid upon a 3×3 grid, Netflix can assess an image with principles such as the Rule of Thirds. The “rule” notes that images whose subjects are aligned upon the intersections of the gridlines tend to be more interesting to a viewer. We can see that the image does indeed follow this principle, and might be considered a good candidate by AVA.
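One way to turn the Rule of Thirds into a number is to measure how far the center of the main subject sits from the nearest gridline intersection. The sketch below does this under simple assumptions: the subject is reduced to a single center point, and the function name and 0-to-1 scale are hypothetical rather than anything AVA is documented to use.

```python
from math import hypot

def rule_of_thirds_score(cx, cy, width, height):
    """Return a score in [0, 1] for a subject centered at (cx, cy).

    1.0 means the subject sits exactly on one of the four intersections
    of the 3x3 grid; 0.0 means it sits in a corner of the frame.
    """
    intersections = [(width * i / 3, height * j / 3)
                     for i in (1, 2) for j in (1, 2)]
    nearest = min(hypot(cx - px, cy - py) for px, py in intersections)
    # A frame corner is the farthest any point can be from its nearest intersection.
    worst = hypot(width / 3, height / 3)
    return 1.0 - nearest / worst
```

For a 300×300 frame, a subject at (100, 100) lands on an intersection and scores highest, while a subject dead center at (150, 150) scores about halfway between the best and worst cases.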
Image ranking takes the analysis further than the metadata characteristics above. In this stage, criteria developed by Netflix’s team of artists are used to rank images. Notable criteria include how depicted characters are posed, blurriness, and the importance of a given character. Through training a deep-learning model to identify main characters, a higher score can be given to thumbnails that contain main characters and a lower one to secondary characters or extras. Another important measurement is visual diversity, which includes factors such as colors used, camera angle, and image composition.
For example, the above image contains thumbnails that were given a low score through the ranking system. The left image has a poor pose, the middle image is blurry, and the rightmost frame features a secondary character in Stranger Things. A last criterion is maturity, which ranks frames that contain sensitive or branded content much lower in order to remove them from consideration.
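The ranking criteria above can be sketched as a weighted score with a heavy penalty for sensitive or branded frames. Everything specific here is an assumption for illustration: the `Frame` fields, the weights, and the size of the penalty are hypothetical, since the source does not say how Netflix combines its criteria, and visual diversity is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    name: str
    pose: float        # 0..1, hypothetical pose quality
    sharpness: float   # 0..1, higher means less blurry
    character: float   # 0..1, 1.0 for a main character, lower for extras
    sensitive: bool = False  # sensitive or branded content

# Hypothetical weights, chosen only to make the example readable.
WEIGHTS = {"pose": 0.3, "sharpness": 0.3, "character": 0.4}
MATURITY_PENALTY = 10.0  # large enough to push a frame to the bottom

def score(frame):
    """Weighted sum of the criteria, minus a penalty for sensitive content."""
    total = (WEIGHTS["pose"] * frame.pose
             + WEIGHTS["sharpness"] * frame.sharpness
             + WEIGHTS["character"] * frame.character)
    return total - (MATURITY_PENALTY if frame.sensitive else 0.0)

def rank(frames):
    """Return frames sorted from best to worst score."""
    return sorted(frames, key=score, reverse=True)

candidates = [
    Frame("main_sharp", pose=0.9, sharpness=0.9, character=1.0),
    Frame("extra_sharp", pose=0.9, sharpness=0.9, character=0.2),
    Frame("main_blurry", pose=0.9, sharpness=0.1, character=1.0),
    Frame("main_branded", pose=1.0, sharpness=1.0, character=1.0, sensitive=True),
]
ranked_names = [f.name for f in rank(candidates)]
```

With these made-up numbers, the sharp frame of a main character ranks first, and the branded frame falls to the bottom even though it is otherwise the strongest candidate.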
After frame annotation and image ranking, the resulting thumbnails are taken to (human!) artists and stylized. This leads to an extensive testing process to determine which thumbnails from this set of approved images are best for different types of users.
After generating thumbnails, Netflix uses A/B testing to find what the best thumbnails are for each user. Specifically, this type of testing seeks to understand when artwork influenced a member to play a title (or not) and when a member would have played the title regardless of image. This requires dividing users into two groups, A and B. Netflix notes that these groups “should be as homogeneous as possible in order to draw statistically meaningful conclusions from the test.” In order to ensure each group contains a similar proportion of members, Netflix determines group homogeneity “with respect to a set of key dimensions, of which country and device type (i.e. smart TV, game console) are the most prominent.”
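A simple way to get groups that look alike on country and device type is stratified assignment: bucket members by those two dimensions, then split each bucket evenly between A and B. The sketch below assumes members are plain dictionaries with hypothetical `id`, `country`, and `device` keys; it illustrates the idea and is not a description of Netflix's allocation system.

```python
import random
from collections import defaultdict

def assign_groups(members, seed=0):
    """Split members into groups A and B, balanced within each
    (country, device) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for m in members:
        strata[(m["country"], m["device"])].append(m["id"])
    groups = {}
    for ids in strata.values():
        rng.shuffle(ids)  # random order within the stratum
        for i, member_id in enumerate(ids):
            groups[member_id] = "A" if i % 2 == 0 else "B"
    return groups

# Forty toy members spread over two countries and two device types.
members = [
    {"id": i,
     "country": "US" if i < 20 else "DE",
     "device": "tv" if i % 2 == 0 else "console"}
    for i in range(40)
]
groups = assign_groups(members)
```

Because the split happens inside each stratum, both groups end up with the same mix of countries and devices, so a difference in engagement is less likely to be explained by who happened to land in which group.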
Both groups see thumbnails created by the AVA system, but Group A sees an image suggested by one algorithm, while Group B gets an image suggested by a slightly different one. If Group B has higher engagement, then this new algorithm is used for all members. This is a straightforward way to select the best thumbnail, but there’s one huge drawback: during the testing process, a large number of users are exposed to thumbnails they weren’t interested in to begin with. In the field of statistics, this kind of drawback is called “regret.” Ideally, it would be possible to engineer a test process that figures out the best thumbnail as fast as possible, while also minimizing the regret.
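To see what “higher engagement” and “regret” might look like numerically, the sketch below compares the play rates of two groups with a standard two-proportion z-test and estimates how many plays were lost by showing the weaker image. The function names and sample numbers are made up for illustration, and the choice of test is an assumption; the source does not say which statistics Netflix uses.

```python
from math import sqrt, erf

def two_proportion_z(plays_a, n_a, plays_b, n_b):
    """Return (z, one-sided p-value) for the claim 'B plays more often than A'."""
    p_a, p_b = plays_a / n_a, plays_b / n_b
    pooled = (plays_a + plays_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value from the standard normal distribution.
    p_value = 0.5 * (1 - erf(z / sqrt(2)))
    return z, p_value

def expected_regret(n_worse, rate_best, rate_worse):
    """Expected plays lost by showing the weaker image to n_worse members."""
    return n_worse * (rate_best - rate_worse)

# Hypothetical results: 10,000 impressions per group.
z, p = two_proportion_z(plays_a=1000, n_a=10000, plays_b=1100, n_b=10000)
lost_plays = expected_regret(n_worse=10000, rate_best=0.11, rate_worse=0.10)
```

In this made-up example Group B's advantage is statistically convincing, but reaching that conclusion meant showing the weaker image 10,000 times, which is exactly the cost that regret tries to capture.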
Luckily, there already exists a large array of algorithms to minimize regret in situations like this one, which are known generally as contextual bandit problems. The name comes from the “one-armed bandit,” a nickname for the slot machine: picture a gambler facing a series of slot machines who needs to decide which machines to try in order to find the one that will provide the best results. Put simply, a contextual bandit problem is one in which there is access to information about a situation (context) and only one action can be chosen per situation. Only the outcome of the chosen action is observed, and that outcome is mapped to a reward. The goal is to maximize the reward for the algorithm averaged over all users.
For Netflix’s challenge of serving title art to users, reward is defined as the number of plays divided by the number of member impressions with an artwork. Netflix’s context is a vector of features provided as input to the model, and can include country, genre, language preferences, previous titles played by the member, time of day, and even what device is being used. Training data for the algorithm comes from randomizing the model’s predictions, creating both “tuples” which record member, title, and image and also a dataset that records whether a tuple resulted in an impression.
For different groups, Netflix’s contextual bandit algorithm tests variations of algorithms that pair contexts (such as a user’s preferences) to outcomes (a thumbnail). This information is all used to create a ranked list of images for each member, and the highest-ranked image is presented to that member. For example, Netflix released the following infographic on which thumbnail for Unbreakable Kimmy Schmidt was ranked most highly.
With such a large membership, Netflix can use the algorithm to reduce regret by having each member provide feedback on just a small portion of the catalog. This means members spend less time looking at thumbnails that aren’t ultimately the best. Through this, the company can increase efficiency by finding ideal images (such as the thumbnail in the bottom right) without cycling through an extensive number of users.
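The contextual bandit idea can be sketched with one of the simplest strategies, epsilon-greedy: most of the time show the image with the best plays-per-impression so far for the member's context, and occasionally show a random image to keep learning. This is a swapped-in textbook technique, not Netflix's actual model. The class name, the use of country alone as the context, and the play probabilities in the simulation are all assumptions for illustration.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    def __init__(self, images, epsilon=0.1, seed=0):
        self.images = list(images)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.impressions = defaultdict(int)  # (context, image) -> times shown
        self.plays = defaultdict(int)        # (context, image) -> times played

    def take_fraction(self, context, image):
        """Reward estimate: plays divided by impressions for this pairing."""
        shown = self.impressions[(context, image)]
        return self.plays[(context, image)] / shown if shown else 0.0

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.images)  # explore a random image
        # Exploit: pick the image with the best estimate for this context.
        return max(self.images, key=lambda img: self.take_fraction(context, img))

    def update(self, context, image, played):
        self.impressions[(context, image)] += 1
        self.plays[(context, image)] += int(played)

# Simulated world with made-up play probabilities per (country, image).
TRUE_RATES = {
    ("US", "character"): 0.30, ("US", "title_card"): 0.10,
    ("DE", "character"): 0.10, ("DE", "title_card"): 0.30,
}
bandit = EpsilonGreedyBandit(["character", "title_card"], epsilon=0.1, seed=1)
world = random.Random(2)
for _ in range(20000):
    context = world.choice(["US", "DE"])
    image = bandit.choose(context)
    played = world.random() < TRUE_RATES[(context, image)]
    bandit.update(context, image, played)
```

Unlike a fixed A/B split, the bandit shifts impressions toward the better image for each context while the experiment is still running, so fewer members are shown the weaker option. A production system would use a much richer context vector and a learned model rather than a lookup table per context.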
Out in the World
In practice, what can we learn from the results of Netflix’s personalization program? One surprisingly insightful takeaway is that preferences for art vary by region. For example, the most popular thumbnail images in Germany are abstract, while US audiences prefer seeing a main character. The Netflix show Sense8, which has a fairly diverse and international audience, is a good example of these differences. In Germany, A/B testing revealed the most successful thumbnail to be a simple title card with very minimal art. In contrast, the most popular thumbnail in the United States depicted a main character, following many of the previously mentioned composition rules: a strong pose, display of emotion, rule of thirds, and depth of field with the background buildings.
This observation is supported by the literature on the psychology of Americans and Germans. Kurt Lewin, one of the pioneers of applied psychology, writes that “the average social distance between different individuals seems to be smaller in the United States…that means the American is more willing to be open to other individuals…than the German” (Lewin). The desire to connect a TV show with a character’s face seems to reflect the value Americans place on understanding others, something that is less important to German culture. The artwork generation process, which may seem complicated, robotic, and cold, actually provides fascinating insights and confirmations about international differences in aesthetic value.
We see even more specific psychological observations within the American context. Netflix observed over time that an image’s tendency to be ranked the highest dropped greatly if more than three people were depicted. This consistently led to less interest from potential viewers, and led to changes in how television shows such as Orange is the New Black were presented over time.
Netflix’s attempts to maximize viewership have not gone unnoticed. Neither have the flaws of relying upon an automated process. Sometimes the “best images” that are produced are hilariously unfitting, such as here:
Additionally, many users have felt that Netflix is targeting viewers using their race. For example, several black Netflix users received a thumbnail photo of a black woman and her father for Like Father starring Kristen Bell (who is white).
These pictured characters are not the father-daughter duo central to the movie, and in fact are only on screen for about 10 minutes. After this tweet garnered internet attention, other Twitter users began comparing their thumbnails.
In this side by side, we see that a black user receives mostly black actors in their thumbnails while a white user receives only white actors in theirs for the same movie. Crucially, this is perceived by many viewers as an offensive attempt to win over minority audiences with misleading images.
Part of this problem may be a lack of understanding regarding the selection process. Conspiracy theories aside, it is likely that a combination of viewing history and an algorithm’s best guess at demographics is driving such targeted thumbnails. We can likely rule out malicious human behavior. However, this does raise questions about the biases in the algorithms used by companies such as Netflix. If black viewers are being targeted and put into a racial box by the AVA system and A/B testing, the onus of changing these processes does fall upon the engineers and executives at Netflix. While Netflix has claimed that there is no intentional targeting, “blaming it on the algorithm” seems irresponsible. Despite the impressive technical workings of personalization, it may go too far when systems begin to make offensive decisions about what audiences want. This also reveals that there is still progress to be made in truly understanding what motivates people to choose entertainment. Your thumbnails are different than mine, but maybe we shouldn’t trust that these differences reflect who we really are yet.
For the outreach event, I sought to make the selection and sorting processes clear to students. I also wanted to capture the very visual nature of the problem at hand.
While I did include information about how culture shapes preferences and the problematic “targeting” of demographics, the crux of the outreach was selection. I printed out several frames from Incredibles 2 and challenged students to enact each stage of an algorithm. I used a poster board and velcro to allow students to move images around. For example:
- Student A conducts the visual step. For most kids, this mostly involved removing blurry photos (although there are other visual characteristics evaluated).
- Student B enacts the contextual step, selecting images that contain main characters and may feature action or crucial plot points.
- Student C ranks the images based upon composition. I would explain a concept like depth of field or the rule of thirds, and ask students to rank images accordingly.
- Student D “A/B tests” the different images by asking people around them which they would prefer. I had them try to do this as fast as they could. While this isn’t quite the same as the contextual bandits algorithm, it simply highlights the importance of testing and the value of speed.
- The students end up with a final product.
Sottek, T.C. “The Thumbnails Are Always Changing on Netflix Because You’re Being Tested.” The Verge, 3 May 2016, www.theverge.com/2016/5/3/11582382/netflix-thumbnail-test.
Barton, Gina. “Why Your Netflix Thumbnails Change Regularly.” Vox, 21 Nov. 2018, www.vox.com/2018/11/21/18106394/why-your-netflix-thumbnail-coverart-changes.
“Artwork Personalization at Netflix.” Medium, Netflix TechBlog, 7 Dec. 2017, medium.com/netflix-techblog/artwork-personalization-c589f074ad76.
Sharf, Zack. “Netflix Accused of Promoting Content by Targeting Viewers’ Race, but the Company Says That’s Impossible.” IndieWire, 16 Jan. 2019, www.indiewire.com/2018/10/netflix-accused-targeting-viewers-race-posters-thumbnails-1202014458/.
“Selecting the Best Artwork for Videos through A/B Testing.” Medium, Netflix TechBlog, 3 May 2016, medium.com/netflix-techblog/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6.
Segran, Elizabeth. “Netflix Knows Which Pictures You’ll Click On–And Why.” Fast Company, 18 Apr. 2017, www.fastcompany.com/3059450/netflix-knows-which-pictures-youll-click-on-and-why.
Lewin, K. (1936), Some Social-Psychological Differences Between the United States and Germany. Journal of Personality, 4: 265-293. doi:10.1111/j.1467-6494.1936.tb02034.x
“It’s All A/Bout Testing.” Medium, Netflix TechBlog, 29 Apr. 2016, medium.com/netflix-techblog/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15.