5.4 Data exploration
Twitter makes a lot of information available, but rtweet returns only some of it. Even so, lp is a large data frame with 17,905 observations and 88 variables. Let's see what variable names we have.
names(lp)
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "hashtags" "symbols"
## [17] "urls_url" "urls_t.co"
## [19] "urls_expanded_url" "media_url"
## [21] "media_t.co" "media_expanded_url"
## [23] "media_type" "ext_media_url"
## [25] "ext_media_t.co" "ext_media_expanded_url"
## [27] "ext_media_type" "mentions_user_id"
## [29] "mentions_screen_name" "lang"
## [31] "quoted_status_id" "quoted_text"
## [33] "quoted_created_at" "quoted_source"
## [35] "quoted_favorite_count" "quoted_retweet_count"
## [37] "quoted_user_id" "quoted_screen_name"
## [39] "quoted_name" "quoted_followers_count"
## [41] "quoted_friends_count" "quoted_statuses_count"
## [43] "quoted_location" "quoted_description"
## [45] "quoted_verified" "retweet_status_id"
## [47] "retweet_text" "retweet_created_at"
## [49] "retweet_source" "retweet_favorite_count"
## [51] "retweet_retweet_count" "retweet_user_id"
## [53] "retweet_screen_name" "retweet_name"
## [55] "retweet_followers_count" "retweet_friends_count"
## [57] "retweet_statuses_count" "retweet_location"
## [59] "retweet_description" "retweet_verified"
## [61] "place_url" "place_name"
## [63] "place_full_name" "place_type"
## [65] "country" "country_code"
## [67] "geo_coords" "coords_coords"
## [69] "bbox_coords" "status_url"
## [71] "name" "location"
## [73] "description" "url"
## [75] "protected" "followers_count"
## [77] "friends_count" "listed_count"
## [79] "statuses_count" "favourites_count"
## [81] "account_created_at" "verified"
## [83] "profile_url" "profile_expanded_url"
## [85] "account_lang" "profile_banner_url"
## [87] "profile_background_url" "profile_image_url"
It is not practical to describe each of these variables here. Broadly, they are a mix of user data and tweet data. For instance, user_id is the unique ID of the user, while status_id is the unique ID assigned to each tweet.
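One way to get a feel for this mix is to pull out groups of columns by their name prefix or suffix. The sketch below is only an illustration: toy_lp is a made-up stand-in for lp with invented values, and only its column names are taken from the listing above.

```r
library(dplyr)

# Toy stand-in for lp (hypothetical data; column names follow the listing above)
toy_lp <- tibble(
  user_id         = c("u1", "u2"),
  status_id       = c("s1", "s2"),
  text            = c("first tweet", "second tweet"),
  retweet_text    = c(NA, "original tweet"),
  retweet_user_id = c(NA, "u9"),
  followers_count = c(10L, 250L)
)

# Number of rows (observations) and columns (variables)
dim(toy_lp)

# Columns whose names start with "retweet_"; in the real data most of these
# describe the retweeted status (retweet_count is an exception)
retweet_cols <- toy_lp %>% select(starts_with("retweet_"))
names(retweet_cols)

# Identifier columns end in "_id"
id_cols <- toy_lp %>% select(ends_with("_id"))
names(id_cols)
```

The same select() calls should work on lp itself, where the quoted_ and place_ prefixes mark other groups of related columns.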
I would like to show you two interesting variables: source and verified. The first records the device or application that was used to send the tweet. The second tells us whether the account is verified on Twitter.
Using the count() function from dplyr, we can see which device is the most popular. Because the same person may tweet multiple times, we keep only distinct user_id-source pairs.
lp %>%
distinct(user_id, source) %>%
count(source, sort = TRUE) %>%
top_n(10)
## Selecting by n
## # A tibble: 10 x 2
## source n
## <chr> <int>
## 1 Twitter for iPhone 8810
## 2 Twitter for Android 3485
## 3 Twitter Web Client 1251
## 4 Twitter Web App 655
## 5 Facebook 214
## 6 TweetDeck 203
## 7 Twitter for iPad 170
## 8 Tweetbot for iΟS 60
## 9 Instagram 55
## 10 IFTTT 47
This pattern, with Twitter for iPhone far ahead, is more prevalent in the US. In other countries, Twitter for Android usually tops the list.
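In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which names the ordering column explicitly and avoids the "Selecting by n" message. A minimal sketch of the same pipeline on invented data follows; toy_lp is a hypothetical stand-in for lp.

```r
library(dplyr)

# Toy stand-in for lp (hypothetical data); u1 tweets twice from the same device
toy_lp <- tibble(
  user_id = c("u1", "u1", "u2", "u3", "u3", "u4", "u5"),
  source  = c("Twitter for iPhone", "Twitter for iPhone",
              "Twitter for Android", "Twitter for iPhone",
              "TweetDeck", "Twitter for Android",
              "Twitter for iPhone")
)

top_sources <- toy_lp %>%
  distinct(user_id, source) %>%    # one row per user-device pair
  count(source, sort = TRUE) %>%   # users per device, largest first
  slice_max(n, n = 2)              # keep the 2 most common devices

top_sources
```

Note that distinct() counts u1 only once for Twitter for iPhone, while u3 is counted once for each of the two devices used.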
How many verified accounts do we have in our sample?
lp %>%
distinct(user_id, verified) %>%
count(verified, sort = TRUE)
## # A tibble: 2 x 2
## verified n
## <lgl> <int>
## 1 FALSE 14695
## 2 TRUE 370
It's impressive that we have 370 verified accounts. Later we will see whether the sentiment of their tweets differs from that of non-verified accounts.
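To put such a count in perspective, the pipeline can be extended with a share column. The sketch below runs on invented data; toy_lp and the share column are illustrative names, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for lp (hypothetical data); u1 appears twice
toy_lp <- tibble(
  user_id  = c("u1", "u1", "u2", "u3", "u4"),
  verified = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

verified_share <- toy_lp %>%
  distinct(user_id, verified) %>%   # one row per account
  count(verified, sort = TRUE) %>%  # accounts per verification status
  mutate(share = n / sum(n))        # proportion of all accounts

verified_share
```

With the counts reported above, the same calculation gives 370 / (14695 + 370), or roughly 2.5% of accounts.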