Predicting the Popularity of Social Curation
(Vietnamese title, translated: Predicting Prominent Social Network Content)
Kieu Thanh Binh Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi Supervised by Assoc. Prof. Pham Bao Son
A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science December 2015
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Hanoi, December 30th, 2015
Signed:
ABSTRACT
The amount and variety of social media content, such as status updates, images, movies, and music, are increasing rapidly. Accordingly, social curation services are emerging as a new way to connect, select, and organize information on a massive scale. One noticeable feature of social curation services is that they are loosely supervised: the content that users create in the service is manually collected, selected, and maintained. A large proportion of this content is created arbitrarily by inexperienced users. In this thesis, we look into social curation, in particular the Storify website, the most popular social curation service for creating stories from elements drawn from various domains such as Twitter, Flickr, and YouTube. We implemented a machine learning method with feature extraction to filter this content and to predict the popularity of social curation data.
Publication: Binh Thanh Kieu, Son Bao Pham and Ryutaro Ichise. Predicting the Popularity of Social Curation. In Proceedings of the 6th International Conference on Knowledge and Systems Engineering (KSE 2014), pp. 413-424, Springer.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest gratitude to my supervisor, Assoc. Prof. Pham Bao Son, for his patient guidance and continuous support throughout the years. He was always there when I needed help, and responded to my queries helpfully and promptly. I would like to especially thank Prof. Ryutaro Ichise and his colleagues for their help during my time at Ichise Laboratory, NII. I sincerely acknowledge the Vietnam National University, Hanoi, the Toshiba Foundation Scholarship, and especially Assoc. Prof. Pham Bao Son for financially supporting my master's study. Finally, this thesis would not have been possible without the support and love of my mother and my father. Thank you!
Abbreviations: BoW (Bag-of-Words), ML (Machine Learning), NLP (Natural Language Processing), SNS (Social Networking Service), SVM (Support Vector Machine)
Chapter 1 Introduction
Along with the rapid growth of the Internet, social networks are increasingly attracting users, young people in particular. Therefore, the study of social networks is getting more and more attention. Social network services such as Facebook, Myspace, and Twitter have become viable sources of information for many online users. These websites are increasingly used for communicating breaking news, sharing eyewitness accounts, and organizing groups of people. At the most basic level, a curation service offers the ability to manually collect, select, and maintain this social media information. This is very different from other social information sources, and we can utilize this characteristic for efficient content mining.
The emergence of Web 2.0 and online social networking services, such as Digg, YouTube, Facebook, and Twitter, has changed how users generate and consume online content. For example, YouTube, well known for its fast-growing user-generated content, reports that 100 hours of video are uploaded every minute, according to YouTube Statistics. Online social networking services, augmented with support for multimedia content, sharing, and commenting on other users' content, constitute a significant part of Internet users' web experience. The question is: how do users find interesting content? Or, how does certain content rise in popularity? If we can answer these questions, we can predict which content is most likely to become popular and filter out the rest. Moreover, once unpopular content that gets little attention is filtered out, the remaining good content can be used to build an automatic system for curating social content.
Predicting the popularity
However, predicting the popularity of content is a difficult task for many reasons. First, the effects of external phenomena (e.g., media, natural, and geopolitical events) are difficult to incorporate into models (Lee et al., 2010). Second, cascades of information are difficult to forecast (Cha et al., 2009). Finally, the underlying contexts, such as locality, relevance to users, resonance, and impact, are not easy to decipher (Bandari et al., 2012). Design is changing into an experience-oriented discipline; consequently, designers need appropriate tools and methods to incorporate experiential aspects into their designs. A story is a crafted experience, and storytelling is the craft. Therefore, understanding the structural strategies behind storytelling and learning how to incorporate them into a design process is relevant for designers when they want to envision, discuss, and influence user experiences. In this thesis, I introduce the Storify website and a method for analyzing it. Storify is a multi-modal tool that provides design teams with an experiential approach towards designing interactive products by incorporating dramaturgical techniques from film and sequential art.
The rest of the paper is organized as follows. In the second section, we explain the social curation service, our target data source, and details of the dataset specifications. In the third section, we review related work. The fourth section is devoted to the formulation of predicting view counts of a curation list. The fifth section describes experiments and the evaluation of our results. The last section concludes this paper with a discussion about future work.
Chapter 2 Literature review
2.1 Social Curation
2.1.1 Definition
The word “curate” is defined as selecting, organizing, and looking after the items in a collection or exhibition. The word is derived from the Latin root “curare” (“to cure”), which means “to care”. Curation involves assembling, managing, and presenting some type of collection. For example, curators of art galleries and museums research, select, and acquire pieces for their institutions’ collections and oversee interpretation, displays, and exhibits. Social curation is the collaborative oversight of collections of web content on sites organized around particular types of content, such as Pinterest (a site for sharing and organizing images) and Storify (a site for collecting and publishing stories). Social media has given rise to two new figures, or at least two new names for some of the most common behaviors in this environment: content curators and content creators. Content is whatever is expressed through some medium, such as speech, writing, or any of the various arts, for self-expression, distribution, marketing, or publication. We first give a brief explanation of content creation and content curation. On the one hand, content creation is the contribution of information to any media, and most especially to digital media, for an end user or audience in specific contexts. Typical forms of content creation include maintaining and updating websites, blogging, photography, videography, online commentary, the maintenance of social media accounts, and the editing and distribution of digital media. A Pew survey described content creation as the creation of "the material people contribute to the online world" (Horrigan, 2004). On the other hand, content curation is not a new phenomenon. Museums and galleries have curators to select items for collection and display. Content curation is the process of collecting, organizing, and displaying information relevant to a particular topic or area of interest.
Statistics suggest that the vast majority of users passively consume content; they never create or share. A small portion act as content curators, filtering out the best content for others. An even smaller minority are content creators, who produce new original content that is shared by content curators and consumed by users. On the Internet, thanks to social media, these proportions are shifting: more people create, and a very large share of users now share content. Still, these behaviors remain clearly visible. In defining our own behavior in social media we can choose one profile or the other, but at the professional level it is worth understanding the differences and benefits of each.
Creating original, high-quality content on a topic at a regular pace is one of the most arduous tasks, but it often brings great rewards: it attracts interested people, builds a community around the creator, and earns recommendations from other users, which act as votes of confidence in the social economy. Content curators, on the other hand, need little creative or compositional work but a great deal of monitoring and information processing. As many examples show, they can become highly relevant if they manage to build a community around the information on a topic that they share daily. A content curator is typically a social network user who reads a lot and shares a lot, whether through the Share option on Facebook, by republishing the most interesting posts of others on his or her wall, by retweeting the best of what he or she reads on Twitter, or likewise on Pinterest, LinkedIn, Instagram, Tumblr, and similar services. Such users first read widely, then apply an initial filter, and finally share the best selection. Hence the title of curator: they take care over the content they share, helping to combat the current "infoxication" (information overload). Thus, if we choose the right curators to follow for the topics we care about, we receive better and more interesting information in less time.
Social Curation Service
Social networks are spaces for dialog and conversation that have grown into ubiquitous information exchanges. Youth today refer to social networks, aggregators, and mobile apps for most of their information instead of singling out specific media for news, politics, personal communication, and leisure. In turn, social networks have provided new functions that help users curate information in meaningful and productive ways. Social curation involves aggregating, organizing, and sharing the content created by others to add context, narrative, and meaning. Artists, changemakers, and organizations use social curation to showcase the full range of conversations around a topic, add more nuance to their own original content, and crowdsource content from their community members. The rise of social curation can be attributed to three broad trends.
• Firstly, people are creating a constant stream of social media content, including updates, location check-ins, blog posts, photos, and videos.
• Secondly, people are using their social networks to filter relevant content by following others who share similar interests.
• Thirdly, social media platforms are also curating content by giving curation tools to users (YouTube playlists, Flickr galleries, Amazon lists, Foodspotting guides), using editors and volunteers (YouTube Politics, Tumblr Tags), or using algorithms (YouTube Trends, auto-generated YouTube channels, LinkedIn Today).
One notable trend in Social Networking Service (SNS)-related research is agglomerating multiple information sources or services to obtain a deeper understanding of social media content. For example, Mejova and Srinivasan employ a domain adaptation technique for sentiment analysis in three different social media streams: weblogs, review articles, and tweets on Twitter (Mejova and Srinivasan, 2012). Hu et al. (2012) extend a topic model (Blei et al., 2003) to associate tweets with real events in order to discover the topical segmentation of an event.
Kulshrestha et al. studied the impact of offline geolocations on online social network activities and participants (Kulshrestha et al., 2012). However, the first two studies focus on a single modality, namely text-based datasets. In this paper, we employ the social curation service as a complementary information source for the automatic understanding and mining of content in social networks. This is closer to (Kulshrestha et al., 2012) in the sense that the information source is cross-modal: they combine a social network structure with offline geographical information, while in our case social curation lists are associated with stories.
Figure 2.1: Content Creators
To the best of our knowledge, only a few studies deal with social curation services, such as the work of Duh et al. (2012). That paper analyzed curation lists consisting of Twitter messages (tweets). The authors also studied the objectives and topics of curation lists, and reported that there are many styles and usages among social curation services. The difference from our work is again in the modality. Their focus was unimodal: Duh et al. (2012) mainly consider text messages (i.e., tweets). In our work, we extract various kinds of information (features) from a curation list in order to understand and evaluate the quality of the data by predicting its popularity. To be more specific, users involved in a social curation service are classified into three types in Figure 2.2 (Duh et al., 2012). First, content creators generate social media content (or simply, content) that is posted to social networking services. The formats and domains of the content are diverse: text messages like tweets, photos taken by mobile phones, weblogs, movies, and so on. Second, curators collect and evaluate this posted content, and re-organize it to form compound content (called a curation, a summary, or a curation list) based on the opinions, perspectives, and interests of the curators. Usually, a curation list is created by one user. However, some curation lists are generated through the interaction of multiple curators. Third, content consumers enjoy, share, and consume social media content created by content creators, as well as content expressed by the curation lists. Note that a user can be a content creator, curator, and content consumer at the same time.
Figure 2.2: Content Curators
As a result, a number of niche social curation platforms have emerged to enable people to curate different types of content, including links, photos, sounds, and videos. We should emphasize that each curation list is a kind of loosely supervised but organized social dataset. This means that social media items in the same curation list are expected to share the same context to a certain degree: a curation list is manually generated to fully convey one idea to the consumer. This is a very distinct characteristic compared to other social media, which are unorganized in many cases.
The website Storify is the most well-known site for people telling stories by curating social media. Storify was launched in September 2010, and accounts were invitation-only until April 2011. The site is now open to everyone, and users only need a Twitter account. Storify provides a function to filter out poor content and unreliable sources. If social media changes or misinterprets context, Storify can help curators put it back together again (Fincham, 2011). Storify allows curators to embed dynamic images, text, tweets, and even Facebook status updates, and then knit these all together with background and context provided by the storyteller. It is an engaging way for us to learn how to work out what is true and what is speculation. We have also found that using Twitter has taught us how to look for sources and news, and Storify has helped teach us how to think about and write context and narrative. Each story is a curation list, and all stories share some characteristics: they are manually collected (bundling a collection of content from diverse sources), manually selected (re-organizing the content to give one’s own perspective), and manually maintained (publishing the resulting story for consumers).
Figure 2.3: Example of a Storify list
The Storify data is in the form of lists of Twitter messages. An example of a list is shown in Figure 2.3. A list of tweets corresponds to what we call a story, which represents a manually filtered and organized bundle. Lists in Storify draw on Twitter as their source. The lists may be created individually in private or collaboratively in public, as determined by the initial curator. In the Storify curation interface, the curator begins the list curation process by looking through his or her Twitter timeline (tweets from users that he or she follows), or by directly searching tweets via relevant words or hashtags. The curator can drag and drop these tweets into a list, reorder them freely, and also add annotations such as a list header and in-place comments.
Table 2.1: Statistics of curated domains
Domain       Number of Elements   Proportion
Twitter      8,514,006            75.5%
Storify      1,206,794            10.7%
YouTube      190,611              1.7%
Facebook     169,361              1.5%
Instagram    155,762              1.4%
Flickr       127,192              1.2%
Others       920,089              8%
Table 2.2: Element types
Type     Number      Proportion
Quote    7,715,616   68.4%
Text     1,195,625   10.6%
Image    1,436,673   12.7%
Video    206,265     1.8%
Link     732,096     6.5%
We first provide some statistics to get a feel for the curation data. We collected all the data from 2010 to April 2013, which amounted to 63,419 users and 352,540 stories. This corresponds to a total of 11,283,815 elements from various domains. Table 2.1 describes the domains used in the stories. Twitter is the largest source domain, with more than 75% of the elements, and Flickr is the smallest individually listed source, with 1.2% of the elements. The statistics of the element types are shown in Table 2.2. The five types of elements in stories are quote, text, image, video, and link. Because Storify users use a huge number of tweets, quote elements account for a large share of nearly 70%. Media content, i.e., images and videos, makes up approximately 15%. Text elements are written by the curator to add more information, explain, or link elements. The Storify API provides the four main actions shown in Table 2.3. The Storify website allows users to comment on each element or on a story as a whole. However, the average numbers of comments, element comments, and similar actions are quite small. Therefore, approaches utilizing user comments and actions are not suitable for this dataset (Ahmed et al., 2013).
Table 2.3: Storify action statistics
Action             Number        Average
Views              642,666,347   1823 per story
Comments           21,306        0.06 per story
Element comments   21,133        0.002 per element
Likes              206,265       0.12 per story
Several studies have investigated social curation as a new source for data mining. Pinterest is the most popular website for sharing images and video, and the third most popular social network in the US behind Facebook and Twitter. The website is built around the activity of collecting digital images and videos and pinning them to a pin board. Each pin is essentially a visual bookmark, and the pin boards are thematic collections of the bookmarks, where context is added to the collected information. Hall and Zarro described some of the user actions on Pinterest and created a dataset to find the pin content of Pinterest users across a wide variety of subject areas (Hall and Zarro, 2012). Besides sites that curate only images or video, other sites curate statuses, comments, and news sources in order to write blogs and stories. Storyful is a social media news agency established in 2010 with the aim of filtering news, or newsworthy content, from the vast quantities of noisy data on social networks such as Twitter and YouTube. Storyful invests considerable time in the manual curation of content on these networks. Its goal sounds more or less the same as Storify’s, but there is one important difference. Storyful aims to deliver content for news organizations, whereas Storify is more of a tool for journalists. It allows journalists to use its template to write stories that include relevant tweets and Facebook posts without losing the original formatting or links. Journalists can create interactive stories with clear links to original pictures or tweets. Greene et al. proposed a variety of criteria for generating user list recommendations based on content analysis, network analysis, and the “crowdsourcing” of existing user lists (Greene et al., 2012). In addition, the Togetter website is a rapidly growing social curation website in Japan. Togetter averaged more than 4 million user views per month in 2011. The Togetter curation data mainly exist in the form of lists of Twitter messages. Ishiguro et al. used Togetter data for the automatic understanding and mining of images (Ishiguro et al., 2012), and Duh et al. created a system that suggests new tweets to increase the curator’s productivity and breadth of perspective (Duh et al., 2012). Our research examines another social curation website, Storify. The structure of a Storify list is quite similar to that of a Togetter list. The only difference is the language: the common language of Togetter is Japanese, whereas that of Storify is English. However, we are interested in another aspect: the quality of the curation lists made by users.
The problem of predicting the popularity of online content concerns how much attention the content will ultimately receive. Research shows that user attention is allocated in a rather asymmetric way, with most content getting only a few views and downloads, whereas a few items receive a significant amount of user attention; thus, filtering this content will save viewers much time. There are different ways to quantify the attention that content receives. Many researchers use the number of views as the measure of popularity, for example on YouTube (Szabo and Huberman, 2010), Vimeo (Ahmed et al., 2013), and Flickr (van Zwol et al., 2010). Others measure popularity by user actions, such as the number of user votes on Digg (Jamali and Rangwala, 2009) or the number of retweets on Twitter (Hong et al., 2011). Still others formulate the problem as the change in the number of views that content receives over time. Predicting the popularity of news articles is a complex and difficult task, and different prediction methods and strategies have been proposed in several recent studies (Szabo and Huberman, 2010) (Tsagkias et al., 2010). The common point of all these methods is that they focus on predicting the exact attention that an article will generate in the near future.
First, some researchers have studied features that describe the underlying social network of the users and content and that can be leveraged to predict popularity (Hogg and Lerman, 2012) (Jamali and Rangwala, 2009) (Lerman and Hogg, 2010) (Tsagkias et al., 2010). The authors of (Kim et al., 2011) (Lee et al., 2010) (Lee et al., 2012) (Tsagkias et al., 2010) studied features that take into account the comments found in blogs to predict popularity. However, few other works forecast a value for the actual popularity of individual content. Lee et al. used survival analysis to evaluate the probability that a given content item receives more than some number x of hits (Lee et al., 2010) (Lee et al., 2012). Hong et al. developed a coarse multi-class classifier-based approach to determine whether given Twitter hashtags are retweeted x ≤ (0; 100; 10000; ∞) times (Hong et al., 2011). Similarly, Lakkaraju and Ajmera used support vector machines (SVMs) to predict whether a given content item falls into a group that attracts x ≤ (10%; 25%; 50%; 75%; 100%) of the attention in a system (Lakkaraju and Ajmera, 2011), while Jamali and Rangwala predicted the popularity of content by using an entropy measure (Jamali and Rangwala, 2009). Finally, Szabo and Huberman presented a linear regression model based on the number of views (Szabo and Huberman, 2010); this method has since been used to predict popularity by applying regression to different feature spaces (Bandari et al., 2012) (Hogg and Lerman, 2012) (Lerman and Hogg, 2010) (Tsagkias et al., 2010). In this work, the popularity of a social curation is measured by the number of views that the content will receive in the near future. We propose three groups for categorizing the popularity level of social curation, and we build a predictor based on a machine learning method, SVM, with feature selection to classify stories into these groups.
Chapter 3 Predicting the Popularity of Social Curation
3.1 Problem Formulation
3.1.1 Regression
Formally, we predict the view count $y_i$ of content $i$ from information about the content, $x_i$. This is a typical regression problem: we try to minimize the error between the predicted view count $\hat{y}_i$ and the true view count $y_i$ by adjusting an unknown parameter $w$ that governs the regression function $\hat{y}_i = f(x_i; w)$. Given content and social curation lists, we extract several features $x_i$ and predict a view count for each content item. Social curation lists contain many kinds of information that are useful for predicting view counts.
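The regression formulation above can be illustrated with a minimal sketch. The thesis does not fix the form of $f$, so the linear model, the squared-error loss, the gradient-descent fit, and the toy data below are all assumptions made purely for illustration; the names `predict`, `mean_squared_error`, and `fit` are hypothetical.

```python
def predict(x, w):
    """Assumed linear form of f(x; w): w[0] is a bias, w[1:] weight the features."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def mean_squared_error(xs, ys, w):
    """Average squared gap between predicted and true view counts."""
    return sum((predict(x, w) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, lr=0.05, steps=5000):
    """Adjust w by plain gradient descent on the mean squared error."""
    n = len(xs)
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict(x, w) - y
            grad[0] += 2 * err / n
            for j, xj in enumerate(x):
                grad[j + 1] += 2 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy, noise-free data (not from the Storify dataset): y = 1 + 2*x0 + 3*x1.
xs = [[0, 1], [1, 0], [1, 1], [2, 1], [0, 2]]
ys = [1 + 2 * x0 + 3 * x1 for x0, x1 in xs]
w = fit(xs, ys)  # should end up close to [1, 2, 3] on this toy data
```

On real view counts, which are heavily skewed, one would probably regress on a transformed target such as the logarithm of the view count; that choice is likewise not specified in the text.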
Similar to ordinary content, the popularity of a social curation is defined by the number of user views. We predict how many views a story will receive in the near future. However, it is difficult to predict the exact amount of attention, and people are mostly interested in whether content is popular; thus, instead of predicting the exact number, we cast the task as a multi-class classification problem that predicts the popularity level that a curation list will reach after three months, based on the number of views. Although our system cannot predict the exact amount of attention, it helps users distinguish popular content from unpopular content. We divide the number of views into three classes: class 1 (not popular), with fewer than 10 views; class 2 (less popular), with between 10 and 1000 views; and class 3 (very popular), with more than 1000 views. We used an SVM to classify stories into these classes. LibSVM (Chang and Lin, 2011) with a radial basis function (RBF) kernel and default parameters, together with the feature selection tool (wei Chen, 2005), was used to optimize the result. We extracted three types of features, namely curation features, curator features, and text features. Curator features are features of the users who collect and organize elements from some domains and create curation lists. Curation features are features related to the content of the curation lists. Text features cover all the text content of the curation lists.
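The three-class labeling described above can be written down directly. The following is a minimal sketch of the labeling step only (the classifier itself is not shown); the function name `popularity_class` and the sample view counts are hypothetical, and the boundary values 10 and 1000 are assigned to class 2, following the wording "between 10 and 1000".

```python
def popularity_class(views):
    """Map a raw view count to one of the three popularity classes."""
    if views < 10:
        return 1  # class 1: not popular (fewer than 10 views)
    if views <= 1000:
        return 2  # class 2: less popular (between 10 and 1000 views)
    return 3      # class 3: very popular (more than 1000 views)


# Illustrative view counts only; these are not taken from the dataset.
view_counts = [3, 10, 250, 1000, 1001, 48210]
labels = [popularity_class(v) for v in view_counts]
```

These labels would then serve as the targets of the multi-class SVM.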
3.2 Feature Extraction
Social curation lists contain many kinds of information that are useful for classification. For example, if a curation list includes many Twitter elements, its view count is expected to increase; likewise, if the elements match the context of the curation list, the content will attract much more attention. In this study, as the social curation lists include a large number of Twitter messages, we used features that are applicable to predicting the number of retweets and microblogging popularity. We divided the features into the three distinct sets mentioned above: curator features (which relate to the author of the story), curation features (which encompass various statistics of the content in the story), and text features.
The following are the five curator features:
(i) The number of users who follow the curator of the content
(ii) The number of users whom the curator of the content follows
(iii) The number of stories written by the curator
(iv) The user’s language (English or not)
(v) When the curator of the content started using Storify
These features were selected from the content creator features proposed by Ishiguro et al. (Ishiguro et al., 2012). We implemented these features as our baseline system. The numbers of followers and friends have been consistently shown to be good indicators of retweetability, whereas the number of stories has not been found to have a significant impact (Suh et al., 2010). Our prior analysis also showed that stories written in English are more likely to be viewed, so we used a binary feature indicating whether the user’s language is English. The date when a curator started using Storify reflects their experience. Normally, long-time users have more experience and produce more popular stories than new users do. We are not aware of any prior work that analyzes the effect of language or date on content popularity.
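As a rough illustration, the five curator features could be assembled into a numeric vector as follows. This is a sketch under stated assumptions: the record layout and field names (`followers`, `following`, `stories`, `language`, `joined`) are hypothetical and do not come from the Storify API, and encoding the start date as the account age in days is one possible choice that the text does not prescribe.

```python
from datetime import date


def curator_features(curator, today):
    """Build the five curator features from a hypothetical curator record."""
    return [
        curator["followers"],                    # (i) users following the curator
        curator["following"],                    # (ii) users the curator follows
        curator["stories"],                      # (iii) stories written by the curator
        1 if curator["language"] == "en" else 0, # (iv) binary English indicator
        (today - curator["joined"]).days,        # (v) account age in days (assumed encoding)
    ]


# Made-up example record; values are illustrative only.
example_curator = {
    "followers": 120,
    "following": 80,
    "stories": 14,
    "language": "en",
    "joined": date(2011, 4, 1),
}
curator_vector = curator_features(example_curator, today=date(2013, 4, 1))
```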
The following are the seven curation features:
(i) The number of hashtags
(ii) The number of versions
(iii) The number of embeds
(iv) The story’s language (English or not)
(v) The ratio of popular tweet elements to total elements (a tweet is popular if it has more than 100 retweets)
(vi) The ratio of popular image and video elements to total elements (an image or video is popular if it has more than 1000 views)
(vii) The total number of elements
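The curation features listed above can be sketched in the same spirit. The story layout below is an assumption for illustration: the field names (`hashtags`, `versions`, `embeds`, `language`, `elements`) and the element keys (`type`, `retweets`, `views`) are hypothetical, not the actual Storify data schema. Only the thresholds (more than 100 retweets, more than 1000 views) come from the text.

```python
def curation_features(story):
    """Build the seven curation features from a hypothetical story record."""
    elements = story["elements"]
    total = len(elements)
    popular_tweets = sum(
        1 for e in elements
        if e["type"] == "tweet" and e.get("retweets", 0) > 100
    )
    popular_media = sum(
        1 for e in elements
        if e["type"] in ("image", "video") and e.get("views", 0) > 1000
    )
    return [
        story["hashtags"],                          # (i) number of hashtags
        story["versions"],                          # (ii) number of versions
        story["embeds"],                            # (iii) number of embeds
        1 if story["language"] == "en" else 0,      # (iv) binary English indicator
        popular_tweets / total if total else 0.0,   # (v) share of popular tweets
        popular_media / total if total else 0.0,    # (vi) share of popular images/videos
        total,                                      # (vii) total number of elements
    ]


# Made-up example story; values are illustrative only.
example_story = {
    "hashtags": 3,
    "versions": 2,
    "embeds": 1,
    "language": "en",
    "elements": [
        {"type": "tweet", "retweets": 250},
        {"type": "tweet", "retweets": 5},
        {"type": "image", "views": 5000},
        {"type": "video", "views": 200},
    ],
}
curation_vector = curation_features(example_story)
```

An empty story yields ratios of 0.0 rather than a division error; how such stories should be treated in practice is a design choice left open here.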