5.4 Data exploration

Twitter exposes a lot of information, and rtweet returns only part of it. Even so, lp is a large data frame with 17,905 observations and 88 variables. Let’s see what variable names we have.

names(lp)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

It is not practical to describe each of these variables here. Broadly, they are a mix of user data and tweet data. For instance, user_id is the unique ID of the user, while status_id is the unique ID of the tweet.
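
The difference between tweet-level and user-level variables can be illustrated with a minimal sketch. The data frame lp_toy below is a hypothetical stand-in for lp, made up for illustration; only the column names come from the output above.

```r
library(dplyr)

# Hypothetical toy data standing in for lp: three tweets from two users.
lp_toy <- tibble(
  user_id         = c("u1", "u1", "u2"),
  status_id       = c("s1", "s2", "s3"),
  text            = c("first tweet", "second tweet", "third tweet"),
  followers_count = c(120L, 120L, 45L)
)

# status_id identifies a tweet, so it is unique per row.
n_tweets <- n_distinct(lp_toy$status_id)

# user_id identifies an account, so it repeats when someone tweets twice.
n_users <- n_distinct(lp_toy$user_id)

# User-level columns such as followers_count repeat along with user_id,
# so one row per user is enough to describe the accounts.
users <- lp_toy %>%
  distinct(user_id, followers_count)
```

In this toy example there are three tweets but only two users, which is why user-level summaries usually start by reducing the data to one row per user.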

I would like to show you two interesting variables: source and verified. The first records the app or device that was used to send the tweet. The second tells us whether the user has a verified Twitter account.

Using the count() function from dplyr, we can see which source is the most popular. Because the same person may tweet multiple times, we keep only distinct user_id-source pairs.

lp %>% 
  distinct(user_id, source) %>% 
  count(source, sort = TRUE) %>% 
  top_n(10)
## Selecting by n
## # A tibble: 10 x 2
##    source                  n
##    <chr>               <int>
##  1 Twitter for iPhone   8810
##  2 Twitter for Android  3485
##  3 Twitter Web Client   1251
##  4 Twitter Web App       655
##  5 Facebook              214
##  6 TweetDeck             203
##  7 Twitter for iPad      170
##  8 Tweetbot for iΟS       60
##  9 Instagram              55
## 10 IFTTT                  47

This particular pattern, with Twitter for iPhone well ahead of Twitter for Android, is more prevalent in the US. In other countries, Twitter for Android usually tops the list.
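
To see why the distinct() step matters, here is a minimal sketch on made-up data. The data frame toy and its values are hypothetical; the pipeline mirrors the one used on lp above. The last step uses slice_max(), which assumes dplyr 1.0.0 or later, as a possible alternative to top_n().

```r
library(dplyr)

# Hypothetical toy data: user u1 tweets three times from the same app.
toy <- tibble(
  user_id = c("u1", "u1", "u1", "u2", "u3"),
  source  = c("Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone",
              "Twitter for Android", "Twitter for Android")
)

# Counting rows counts tweets, so the prolific user inflates their app.
by_tweet <- toy %>%
  count(source, sort = TRUE)

# Keeping distinct user_id-source pairs first counts users per app instead.
by_user <- toy %>%
  distinct(user_id, source) %>%
  count(source, sort = TRUE)

# Top row by n; with_ties = FALSE returns exactly one row even if counts tie.
top_source <- by_user %>%
  slice_max(n, n = 1, with_ties = FALSE)
```

Counted by tweet, the iPhone app leads in the toy data; counted by user, the Android app does. Note that a user who tweets from two different apps still contributes one pair to each of them.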

How many verified accounts do we have in our sample?

lp %>% 
  distinct(user_id, verified) %>% 
  count(verified, sort = TRUE)
## # A tibble: 2 x 2
##   verified     n
##   <lgl>    <int>
## 1 FALSE    14695
## 2 TRUE       370
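
To put these counts in perspective, they can be turned into shares. The sketch below re-enters the two counts printed above by hand rather than recomputing them from lp, and the names verified_counts and verified_share are introduced here only for illustration.

```r
library(dplyr)

# Counts copied from the output above (not recomputed from lp).
verified_counts <- tibble(
  verified = c(FALSE, TRUE),
  n        = c(14695L, 370L)
)

# Share of each group among all distinct users in the sample.
verified_share <- verified_counts %>%
  mutate(share = n / sum(n))
```

With these numbers, verified accounts make up roughly 2.5% of the 15,065 distinct users.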

It’s impressive that the sample contains 370 verified accounts. Later we will see whether the sentiment of their tweets differs from that of non-verified accounts.