American Music Awards Tweets

CMSC320 Final Tutorial
Made By: Duy Nguyen, Bissaka Kenah, Henry Cheung

Introduction

This tutorial analyzes tweets about the 2018 American Music Awards (AMAs). The goal is to identify indicators that could predict the winners of the award show. The data comes from a Kaggle dataset containing 3,283 tweets that include the official hashtag #AMAs and 2,300 user IDs (people who retweeted the Twitter voting tweet) from the 4 days before the award show. The award show also allowed Twitter voting, so the dataset contains the tweet that announced this along with the retweeters' IDs. This tutorial shows how we tidied the data and looks at which common characteristics of tweets pointed to the winners of the show, in order to determine winning indicators and, furthermore, "predict" the winners. We do this by understanding which users are most successful in predicting the AMA winners and, essentially, why they are successful. One example indicator is location. Depending on the genre, the number of tweets about a song or artist in a certain location could be a strong predictor of which artist or song will win a category. For example, many tweets about Dua Lipa from Los Angeles with the AMA hashtag would be a good indicator of her winning the pop category and could therefore be used to predict the winner of that category. This analysis will hopefully provide some insight into the role of viewer input in music award shows and how much, or how little, they take this input into account.

Required Tools

To put together the tutorial, we used Jupyter Notebook since it runs Python and presents the dataset clearly. We used the libraries below to complete the analysis. The functions are listed alongside the libraries.

Data Preparation

The data is first downloaded from https://www.kaggle.com/eliasdabbas/american_music_awards_tweets. It contains CSV (comma-separated values) files that hold data on tweets related to the American Music Awards in 2018. We load the CSV files into the notebook using Pandas, a software library written for the Python programming language for data manipulation and analysis.
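As a rough sketch of the loading step, the snippet below reads CSV data into a dataframe with `pd.read_csv`. The in-memory sample and its column names (other than `tweet_full_text`) are assumptions made so the example runs on its own; with the real download you would pass the path of the CSV file instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded Kaggle file. The rows and most column names
# here are made up for illustration and may differ from the real dataset.
sample_csv = io.StringIO(
    "tweet_id,tweet_full_text,tweet_retweet_count\n"
    "1,Voting for Camila Cabello #AMAs,12\n"
    "2,BTS at the #AMAs tonight,40\n"
)

# With the real data, replace sample_csv with the path to the CSV file,
# e.g. pd.read_csv("amas_combined.csv").
tweets_df = pd.read_csv(sample_csv)
print(tweets_df.shape)
print(tweets_df.head())
```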

Data Tidying

The data table contains many unneeded columns that take up a lot of space in the dataframe. We drop most of them to keep the dataframe clean and simple, keeping only the full text of the tweet, the tweet's id, the user description, and the number of retweets the tweet got.
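One way to do this is to select the columns to keep rather than list every column to drop. This is a minimal sketch: only `tweet_full_text` is named in this tutorial, so the other column names and the sample rows are assumptions.

```python
import pandas as pd

# Toy dataframe standing in for the raw tweets table (made-up rows).
tweets_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "tweet_full_text": ["Go Cardi B! #AMAs", "Watching the #AMAs"],
    "user_description": ["music fan", "pop culture blog"],
    "tweet_retweet_count": [3, 0],
    "tweet_source": ["iPhone", "Web"],    # example of an unneeded column
    "user_created_at": ["2015", "2012"],  # example of an unneeded column
})

# Keep only the four columns used in the rest of the analysis.
keep_cols = ["tweet_full_text", "tweet_id", "user_description", "tweet_retweet_count"]
tweets_df = tweets_df[keep_cols].copy()
print(tweets_df.columns.tolist())
```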

We want to classify each tweet by the Music Awards nominee it mentions. To do this, we make a regular expression (regex) for each nominee, then apply a classifier function to each tweet_full_text in the dataframe. This gives us a new column holding the classification.

We can now use the regex variables above to make a classifier function that separates the tweets based on the artists mentioned in them.

We can apply the function above to each tweet's text to classify it into a category. The category represents which nominee the tweet was referring to.
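The regex and classifier steps can be sketched as follows. The patterns, the `classify` helper, the `classifier` column name, and the sample tweets are all illustrative assumptions, not the exact ones used in the notebook; a real run would need one pattern per nominee.

```python
import re

import pandas as pd

# Hypothetical patterns covering only three nominees.
NOMINEE_PATTERNS = {
    "Camila Cabello": re.compile(r"camila|cabello", re.IGNORECASE),
    "Taylor Swift": re.compile(r"taylor\s*swift|taylorswift13", re.IGNORECASE),
    "BTS": re.compile(r"\bbts\b|bts_twt", re.IGNORECASE),
}

def classify(text):
    """Return the first nominee whose pattern matches, else 'Unclassified'."""
    for nominee, pattern in NOMINEE_PATTERNS.items():
        if pattern.search(text):
            return nominee
    return "Unclassified"

tweets_df = pd.DataFrame({"tweet_full_text": [
    "Camila Cabello owned the #AMAs",
    "so proud of @BTS_twt #AMAs",
    "Taylor Swift opening the show!",
    "What a night #AMAs",
]})

# Apply the classifier to every tweet and store the label in a new column.
tweets_df["classifier"] = tweets_df["tweet_full_text"].apply(classify)
print(tweets_df)
```

Note that this sketch assigns a tweet to the first matching nominee only, so a tweet mentioning two artists is counted once.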

Some tweets will remain unclassified, either because of their language or because they do not mention any nominee.

We made a new table that counts how many times each nominee is mentioned. We can use it to make a bar graph that presents this visually.
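A minimal sketch of the counting and plotting step is shown here. It assumes the labels live in a column called `classifier` and that matplotlib is used for the bar graph; the rows are made up, and `nominee` and `mentions` are illustrative column names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up classification results for illustration.
tweets_df = pd.DataFrame({"classifier": [
    "Camila Cabello", "BTS", "Camila Cabello", "Unclassified",
    "Taylor Swift", "Camila Cabello", "BTS",
]})

# Ignore unclassified tweets, then count mentions per nominee.
classified = tweets_df[tweets_df["classifier"] != "Unclassified"]
nom_df = (
    classified["classifier"]
    .value_counts()
    .rename_axis("nominee")
    .reset_index(name="mentions")
)
print(nom_df)

# Bar graph of mentions per nominee.
fig, ax = plt.subplots()
ax.bar(nom_df["nominee"], nom_df["mentions"])
ax.set_xlabel("Nominee")
ax.set_ylabel("Mentions")
ax.set_title("Mentions per nominee")
fig.tight_layout()
```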

Looking at the number of mentions, the most popular artist by far during the 2018 AMAs was Camila Cabello with over 9,000 mentions, with no other artist getting close. Taylor Swift and BTS were fairly close to each other, at around 4,800 to 5,100 mentions each. In comparison, Shawn Mendes and Cardi B each received only a little over 2,000 mentions, less than half of what Taylor Swift and BTS each got. The numbers only get smaller from there: aside from Dua Lipa, no other nominated artist reached four-digit mentions.

Now, we will analyze the trending topics on Twitter as of October 9th, 2018.

Some of the columns are not needed such as the query, url, woeid, and time. The important ones kept in the table are the name of the trend, volume, and location.

As you can see, though, some of the trending topics on Twitter are not related to the American Music Awards at all, so we need to filter them out.

Now that the dataframe is filtered, it needs to be grouped by name, taking the mean of the tweet_volume per name. Some nominees are missing because they either trended too late or did not have enough tweets to trend.
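The filtering and grouping can be sketched as below. Matching trend names exactly against a list of nominees is an assumption about how the filter works; the notebook may use a different rule. The `name`, `location`, and `tweet_volume` columns follow the description above, and the rows are made up.

```python
import pandas as pd

# Made-up trends table with one unrelated trending topic.
trends_df = pd.DataFrame({
    "name": ["Camila Cabello", "#MondayMotivation", "Camila Cabello",
             "Taylor Swift", "Cardi B"],
    "location": ["Portland", "Portland", "Chicago", "Chicago", "Chicago"],
    "tweet_volume": [120000, 50000, 80000, 60000, 30000],
})

# Partial, illustrative list of nominees.
nominees = ["Camila Cabello", "Taylor Swift", "Cardi B", "BTS"]

# Keep only trends whose name is a nominee, then average volume per name.
ama_trends = trends_df[trends_df["name"].isin(nominees)]
trend_volume = (
    ama_trends.groupby("name", as_index=False)["tweet_volume"]
    .mean()
    .sort_values("tweet_volume", ascending=False)
)
print(trend_volume)
```

A nominee with no rows in the trends table (BTS in this toy data) simply does not appear in the result, which is how the missing nominees arise.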

This dataframe will tell us how popular the above nominees are in trends. We can visualize this using a bar graph similar to before.

The trending artists graph looks somewhat similar to the graph of tweets mentioning each artist, with quite a few exceptions. Some artists, such as Camila Cabello, Taylor Swift, and Cardi B, are still trending substantially, but others that were mentioned often, such as BTS, do not appear on the trending list. As stated earlier, this depends on how popular the nominees are and how soon they started trending. If a nominee is not popular enough, or trends too late, they will not appear in the trends and are considered missing.

Graphing the tweet volume per location

Next, we will take a look at the locations of the trends. This will be first visualized as a table, and later be converted to a bar graph.
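A sketch of the per-location table is shown below. Summing `tweet_volume` within each location is an assumption; the notebook may aggregate differently. The rows are made up.

```python
import pandas as pd

# Made-up AMA-related trends with their locations.
ama_trends = pd.DataFrame({
    "name": ["Camila Cabello", "Taylor Swift", "Camila Cabello",
             "Cardi B", "Camila Cabello"],
    "location": ["Portland", "Portland", "Phoenix", "Phoenix", "Boston"],
    "tweet_volume": [120000, 70000, 110000, 50000, 90000],
})

# Total AMA tweet volume per city, largest first.
location_volume = (
    ama_trends.groupby("location")["tweet_volume"]
    .sum()
    .sort_values(ascending=False)
)
print(location_volume)
```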

From the graph, we can see that Portland, Phoenix, Long Beach, and Chicago are the cities with the highest tweet volume about the AMAs. Portland leads with over 2.8 million tweets, Phoenix is not far behind with around 2.6 million, and Long Beach and Chicago each have over 2 million.

We can also see that most of the major metropolitan cities have very similar tweet volumes of around 1.25 million tweets. Boston is the last major city whose tweet volume is over 1.5 million.

Graphing the Top 4 Cities

In this section, we will graph the four cities with the highest tweet volumes: Portland, Phoenix, Long Beach, and Chicago. The goal is to find how influential certain nominees are in certain cities, measured by the tweet volume below.

We will first show the tweet volume of artists in Portland as a table, then convert it to a bar graph.
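The per-city table could be built with a small helper like the one sketched here. The `city_volume` function is hypothetical, the rows are made up, and summing the volume per artist is an assumption.

```python
import pandas as pd

# Made-up AMA-related trends for two cities.
ama_trends = pd.DataFrame({
    "name": ["Camila Cabello", "Taylor Swift", "Shawn Mendes",
             "Camila Cabello", "Cardi B"],
    "location": ["Portland", "Portland", "Portland", "Chicago", "Chicago"],
    "tweet_volume": [300000, 200000, 100000, 250000, 90000],
})

def city_volume(df, city):
    """Total tweet volume per trend name within one city, largest first."""
    in_city = df[df["location"] == city]
    return in_city.groupby("name")["tweet_volume"].sum().sort_values(ascending=False)

portland_volume = city_volume(ama_trends, "Portland")
print(portland_volume)
```

The same helper can be called with "Phoenix", "Long Beach", or "Chicago" to build the other three tables.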

For the most part, all four cities adhere to the national consensus, each with its own exceptions. In all of them, Camila Cabello dominates at #1 while Taylor Swift consistently takes the #2 spot. After the top two, though, the cities start to differ.

Let's start with Portland, whose graph is rather similar to the national graph but shows a few differences after Camila Cabello and Taylor Swift in the top two spots. Strikingly, Cardi B and Drake do not appear at all on the Portland graph, even though they appear on the nationwide tweet volume graph. Instead, Shawn Mendes, Khalid, and Post Malone make up the rest of the visible Portland graph, which falls in line with much of the rest of the country.

Looking at the other city graphs, two cities deviate from the pattern by having a different artist at the #3 spot. In Chicago, Cardi B holds that spot, while other popular artists such as Shawn Mendes, Khalid, and Post Malone fill the following positions. Meanwhile, Drake has the #3 spot in Phoenix, with the other popular artists taking the next few positions. In comparison, Long Beach is basically in line with the national tweet volume graph, with no notable exceptions.

Statistical Analysis

Chances of Winning vs. Mentions

In this section, we look at the winners of each category in the 2018 AMAs and compare them to the number of mentions each artist had. We first list the winners of each category in the table below.

This cell appends a column to the nom_df table indicating whether the nominee won at the 2018 AMAs. Artists who won are marked with a 1 in their row, and those who did not win are marked with a 0.
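The win column can be sketched as follows. The `nominee`, `mentions`, and `won` column names are assumptions, the mention counts are made up, and the winner set is deliberately partial (only the two winners named later in this tutorial); "Example Artist A" and "Example Artist B" are placeholders, not real nominees.

```python
import pandas as pd

# Toy nominee table with made-up mention counts.
nom_df = pd.DataFrame({
    "nominee": ["Marshmello", "Kane Brown", "Example Artist A", "Example Artist B"],
    "mentions": [40, 25, 900, 300],
})

# Partial winner set for illustration only.
winners = {"Marshmello", "Kane Brown"}

# 1 if the nominee won a category, 0 otherwise.
nom_df["won"] = nom_df["nominee"].isin(winners).astype(int)
print(nom_df)
```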

We are going to fit a logistic regression and plot it to show the impact that an artist's number of mentions has on their chances of winning.

From the graph, we can see that while having a large number of mentions makes it substantially more likely that an artist wins a category at the show, many artists who had only a fraction of the mentions of the more popular artists also won a category.

The logistic regression shows a positive correlation between the number of mentions and the likelihood of winning a category at the show. However, the effect is small, and based on the p-value the correlation is not a strong one. As a result, we reject the hypothesis, mentioned earlier, that an artist's popularity would basically guarantee a win at the 2018 AMAs.

This is due to the number of winners who had far fewer mentions and less popularity than their peers, yet still won. Award shows are not always popularity contests: choosing the winner of each category involves factors other than how many mentions or interactions an artist gets on Twitter. Categories are also pre-determined from the start. This includes genres such as country, electronic dance music (EDM), and alternative rock, whose nominees had very small sections or no section at all on the trending graph.

Looking at the logistic regression, one might ask why this model is appropriate. A binary logistic regression requires the dependent variable to be binary (here, whether or not an artist won) and the observations to be independent of each other. It also requires little or no multicollinearity among the independent variables, which in this case are the mentions of each artist at the 2018 AMAs. It further assumes that the independent variables are linearly related to the log odds. Finally, logistic regression typically needs a fairly large sample size, with a minimum of around 10 cases of the least frequent outcome for each independent variable in the model (here, Twitter mentions). For the most part, there are hundreds to thousands of mentions of each artist in the graph, and with the number of artists nominated at the AMAs, this requirement is met.
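For reference, the linearity assumption can be written out for a single predictor. The symbols are our own notation rather than anything from the notebook: $p_i$ is the probability that artist $i$ wins, $x_i$ is that artist's number of mentions, and $\beta_0$, $\beta_1$ are the fitted intercept and slope.

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i
\qquad\Longleftrightarrow\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}
$$

A positive $\beta_1$ means each additional mention multiplies the odds of winning by $e^{\beta_1}$.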

Some more information on why this model was used can be found here.

Trends vs Win

The two following code blocks essentially duplicate the experiment above (and the corresponding code), but instead of showing the relationship between mentions and chances of winning, they show the relationship between whether or not an artist trended and their chances of winning. This was done by adding a dataframe column that denotes whether or not the artist trended using a binary variable, then fitting a logistic regression on that column and the data on whether or not the artist won.
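The binary trended column can be sketched like this. The artist names are placeholders, the `trended` and `won` column names are assumptions, and the values are made up; the cross-tabulation is just a quick way to inspect the relationship before fitting a regression.

```python
import pandas as pd

# Placeholder nominees with made-up win flags.
nom_df = pd.DataFrame({
    "nominee": ["Artist A", "Artist B", "Artist C", "Artist D", "Artist E"],
    "won": [1, 0, 1, 0, 0],
})

# Hypothetical set of names that appeared in the trends table.
trending_names = {"Artist A", "Artist B"}

# 1 if the nominee trended, 0 otherwise.
nom_df["trended"] = nom_df["nominee"].isin(trending_names).astype(int)

# Counts of winners and non-winners among trending and non-trending artists.
summary = pd.crosstab(nom_df["trended"], nom_df["won"])
print(summary)
```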

This regression shows a much lower p-value, but it is still too high to reject the null hypothesis. A positive relationship between trending and likelihood of winning can also be seen in the bar plot. What separates trending from mentions, as explained by data scientist Gilad Lotan, is that "Twitter's algorithm determines what is trending by favoring sharp spikes rather than gradual sustained growth." This means that trends are determined by a combination of volume and how much time it takes to create that volume, which helps explain the stronger correlation in trends vs wins compared to mentions vs wins. Trends are more focused on the volume of tweets, which means they are more likely to correlate with AMA winners if the winner trended around the time of the AMAs. Source: https://rethinkmedia.org/blog/how-does-twitter-decide-what-trending#:~:text=Twitter's%20algorithm%20determines%20what%20is,it%20takes%20to%20create%20volume.&text=Rather%2C%20the%20Twitter%20conversation%20simply%20built%20over%20time.

Location Popularity vs Win

The following code block examines the 4 cities with the highest tweet volumes (Portland, Chicago, Phoenix, Long Beach) and the artists with the highest tweet volumes, and compares the artists' popularity in each city (determined by the proportion of tweets in that city that went to each artist). A logistic regression is then performed on this data to understand which city has the highest impact on the winner, if any.

The four cities analyzed above had the highest tweet volumes, so the hypothesis here was that a trending artist's popularity in those cities would be a stronger indicator for predicting the winner. The popularity formula was as follows, using Portland as the example:

Portland popularity formula (only calculated for trending artists in Portland): Tweet Volume / Total Portland AMA Tweet Volume
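As a small worked example of the formula, the sketch below divides each artist's Portland tweet volume by the city's total AMA tweet volume. The volumes are made up and the `popularity` column name is an assumption.

```python
import pandas as pd

# Made-up Portland tweet volumes for three trending artists.
portland = pd.DataFrame({
    "name": ["Camila Cabello", "Taylor Swift", "Shawn Mendes"],
    "tweet_volume": [500000, 300000, 200000],
})

# Popularity = artist's tweet volume / total Portland AMA tweet volume.
total_volume = portland["tweet_volume"].sum()
portland["popularity"] = portland["tweet_volume"] / total_volume
print(portland)
```

By construction the popularity values for a city sum to 1, so they can be compared across cities with very different total volumes.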

Based on the high p-values, nothing can be firmly concluded from this data, but the logistic regression results indicate that Portland is the strongest indicator of the winners out of the 4 cities. This makes sense, as Portland also has the highest tweet volume and therefore a higher impact on the trending status of an artist.

Conclusion

Wrapping Up

From the data we collected, we can infer that the number of tweets an artist gets relating to the AMAs is not an ideal way of predicting the winners, because award shows are usually more than a pure popularity contest. This is especially apparent given that several artists and groups won an award without really trending or having notable tweet volumes. In fact, quite a few more winners did not trend than did. Many of these categories are pre-determined from the start, which led to some artists who trended far below most others winning their category. Examples include Marshmello and Kane Brown, who won Favorite EDM Artist and Favorite Male Country Artist despite getting little traction in tweets and not trending.

Sources of Error and Missing Data
The visualizations of tweets and tweet volumes revealed some errors and missing data that need to be addressed. One error is with the dataset itself, specifically amas_combined.csv. This dataset contains tweets with #AMAs but does not label the nominee associated with each tweet. Because of this, we had to categorize each tweet with regular expressions, which is not perfect. This method does not represent the dataset at its true scale, and it could be improved with more accurate categories. The second issue is missing tweet volume data. As mentioned earlier, many of the nominees did not trend at all. According to Elias Dabbas, the author of the dataset on Kaggle, many of the topics trended too late or did not have enough tweets to generate a trend. Because of this missing data, the less popular nominees are not well represented in the tutorial.

Future Ideas
Although this data collection and analysis process produced some telling results, there is always room for improvement. Specifically, some enhanced assessments could be performed to further explore this data and obtain even more information. One example would be assessing previous years. This would provide more data points, which could help examine consistent trends across years in order to predict the following years. In addition, because this data is based on Twitter, another area to explore is the number of followers an artist has. That factor is probably included in the error term of the statistical analysis, because an artist may be talked about more simply because they have a lot of followers, not necessarily because their music is better. This may lead them to have a lot of mentions but not win. Finally, it is important to note that some categories are more popular than others, so another future experiment could look at Twitter discussions of categories in conjunction with artists, since artists may be discussed more if their category is popular, thus creating bias in the prediction.

Links for more info

Information on Twitter trends:

https://rethinkmedia.org/blog/how-does-twitter-decide-what-trending#:~:text=Twitter's%20algorithm%20determines%20what%20is,it%20takes%20to%20create%20volume.&text=Rather%2C%20the%20Twitter%20conversation%20simply%20built%20over%20time.

https://thesocialsavior.com/blog/how-does-twitter-go-about-deciding-whats-trending/

https://www.newsweek.com/how-do-twitter-trends-work-1526430

Information on AMA voting:

https://www.theamas.com/2020-voting-rules/#:~:text=VOTING%20RESULTS%3A%20The%20total%20number,the%20winner%20of%20the%20Award.