Introduction

The goal of this project is to predict the word or phrase from a tweet that captures its provided sentiment. We are given a training dataset with an original tweet, its sentiment, and the selected text that captures its sentiment.

The metric in this competition is the word-level Jaccard score: the number of words shared by the predicted and actual selected text, divided by the number of words in their union.

This script takes about a minute to knit, with the help of an RData file in which most of the computationally heavy lifting has already been done.
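
The caching idea is roughly the pattern sketched below; the file name precomputed.RData and the compute_heavy_features() helper are illustrative placeholders, not the actual objects used in this kernel.

# load cached results if they exist, otherwise recompute and cache them
# (file name and helper function are placeholders for illustration)
if (file.exists("precomputed.RData")) {
  load("precomputed.RData")                      # restores the saved objects into the workspace
} else {
  heavy_features <- compute_heavy_features()     # hypothetical expensive step
  save(heavy_features, file = "precomputed.RData")
}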

library(tidyverse)                   # dplyr, stringr, ggplot2 for wrangling and plots
library(knitr); library(kableExtra)  # kable() tables
library(lattice)                     # barchart()

train <- read.csv("train.csv")
test <- read.csv("test.csv")
sample_submission <- read.csv("sample_submission.csv")

The training data has 27,481 tweets, and the test data has 3,534 tweets. Let’s implement the evaluation method first.

# modified Jaccard score that counts repeated words
jaccard <- function(str1, str2) {
  a <- str_split(str_to_lower(str1), " ", simplify = TRUE)                # lowercase, split on spaces
  b <- str_split(str_to_lower(as.character(str2)), " ", simplify = TRUE)  # lowercase, split on spaces
  common <- b[b %in% intersect(a, b)]                                     # words of b that also appear in a (keeping repeats)
  length(common) / (length(a) + length(b) - length(common))
}
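
As a quick sanity check on the function (the strings are made-up examples, not rows from the data):

jaccard("I am so happy today", "so happy")        # 2 / (5 + 2 - 2) = 0.4
jaccard("nothing in common", "different words")   # no shared words -> 0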

Basic Cleaning

  1. Remove one tweet that is blank.
  2. Change the data type of the text columns to “character” for easier manipulation.
  3. Remove leading and trailing spaces.
  4. Remove rows where the selected text shares no space-delimited words with the original text. These are most likely human annotation errors, and similar cases presumably exist in the test data as well.

## remove one row in train with blank tweet
train <- train[!(train$text == ""), ]

## change to character
train$text <- as.character(train$text)
train$selected_text <- as.character(train$selected_text)
test$text <- as.character(test$text)

## remove leading and trailing spaces
# create function (equivalent to str_trim() for plain spaces)
rm_spaces <- function(x) {
  x %>%
    str_remove("^ +") %>%
    str_remove(" +$")
}
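
A quick check on a made-up string:

rm_spaces("  late night snack ")   # "late night snack"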

# apply function
train <- train %>%
  mutate(text_clean = rm_spaces(text))
test <- test %>%
  mutate(text_clean = rm_spaces(text))

## column for word-level Jaccard similarity between the cleaned text and the selected text
train$jaccard <- 0
for (i in 1:nrow(train)) {
  train$jaccard[i] <-  jaccard(train$text_clean[i], train$selected_text[i])
} 
rm(i)
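
The loop works fine at this scale; an equivalent vectorized form using base mapply() would be:

# same result as the loop above, without explicit indexing
train$jaccard <- mapply(jaccard, train$text_clean, train$selected_text, USE.NAMES = FALSE)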

## some rows have odd selected text (word fragments not delimited by spaces, extra characters); remove them for now
# train %>%
#   filter(jaccard == 0) %>%
#   select("text", "selected_text")
weird_train <- train[(train$jaccard == 0), ]
train <- train[!(train$jaccard == 0), ]

# remove mojibake characters ("ï", "¿", "½" are the bytes of a mis-decoded Unicode replacement character)
train$text_clean <- str_replace_all(train$text_clean, "ï|¿|½", "")
test$text_clean <- str_replace_all(test$text_clean, "ï|¿|½", "")
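
A more aggressive alternative (just a sketch, not what is done here) would be to strip every character that cannot be represented in ASCII in one pass with iconv():

# sketch: drop all non-ASCII characters (from = "" means "assume the current locale's encoding")
train$text_clean <- iconv(train$text_clean, from = "", to = "ASCII", sub = "")
test$text_clean  <- iconv(test$text_clean,  from = "", to = "ASCII", sub = "")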

Data Exploration

Sentiment Distribution

The most common sentiment is neutral (~41%), followed by positive (~31%) and negative (~28%) in both training and test datasets.

rbind(train = summary(train$sentiment) / length(train$sentiment),
      test = summary(test$sentiment) / length(test$sentiment)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("condensed", "responsive", "hover"), full_width = F)

          negative   neutral  positive
train    0.2790136 0.4125970 0.3083893
test     0.2832484 0.4046406 0.3121109

barchart(train$sentiment, main = "Training Sentiment Distribution", 
         scales = list(cex = c(1.8, 1)))
barchart(test$sentiment, main = "Test Data Sentiment Distribution",
         scales = list(cex = c(1.8, 1)))

Comparing Selected and Original

Neutral sentiments are very often captured best by the entire tweet, while positive and negative sentiments are most commonly captured by a couple of key words.

# proportion of selected text that is exactly the same
train$exact_same <- as.character(train$text_clean) == as.character(train$selected_text)
train %>%
  group_by(sentiment) %>%
  summarize(exact_same = sum(exact_same),
            total_tweets = n(),
            proportion_same = exact_same/total_tweets) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed", "hover"), full_width = F)

sentiment   exact_same   total_tweets   proportion_same
negative          1139           7513         0.1516039
neutral           9947          11110         0.8953195
positive          1115           8304         0.1342726

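Since train$jaccard already measures the full cleaned tweet against the selected text, its mean by sentiment is the training score of a naive "predict the whole tweet" baseline (a quick sketch, not part of the original analysis):

# mean Jaccard of a "predict the entire cleaned tweet" baseline, by sentiment
train %>%
  group_by(sentiment) %>%
  summarize(whole_tweet_jaccard = mean(jaccard))
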
train %>%
  ggplot() +
  geom_histogram(mapping = aes(x = jaccard, fill = sentiment), 
                 position = 'dodge',
                 bins = 15) + 
  labs(title = "Similarity of original text with the portion that captures its sentiment",
       x = "Jaccard score")