How Many "Useful" Votes Will a Yelp Review Receive?
Introduction
A YELP review can receive up to 3 different kinds of votes: a funny vote, a cool vote or a useful vote. I hypothesized that the counts of these votes play a role in determining a review's rating. I decided to test this hypothesis on businesses that self-identified as restaurants. I focused on the reviews with at least 3 votes. The dependent variable, star ratings, was converted into a binary variable with a "high" for any rating above or equal to a 4 and "low" for any rating below a 4. Cool votes were the best predictor of ratings, ahead of useful votes and funny votes. The model could be improved by including additional variables. Understanding which votes are important to a review is important for both the reviewer and the business. The reviewer can continue to write great reviews and the businesses potentially get better ratings.
Data processing
The complete YELP data is partitioned into 5 "json" files: user, tip, review, checkin and business files. For my analysis I used the business and review files. The business dataset provides attributes such as location, stars and hours of operation for the different businesses. The reviews dataset holds information such as the review text and the votes received by a review. There are 61,000+ businesses and 1.5+ million reviews.
For my research question I chose to focus on businesses that self-identified as restaurants. In addition, I focused on reviews that had at least 3 votes. I split the data into a training dataset and a testing dataset in an 80-20 training-testing split.
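An 80-20 split of this kind can be sketched in base R as below. This is only an illustration under stated assumptions: the `reviews` data frame is simulated, and `sample()` draws a simple random split, whereas the analysis itself uses `createDataPartition()` from the caret package.

```r
# Simulated stand-in for the filtered reviews (hypothetical data, not Yelp data)
set.seed(123)
n_reviews <- 1000
reviews <- data.frame(
  id    = seq_len(n_reviews),
  stars = sample(1:5, n_reviews, replace = TRUE)
)

# Draw 80% of the row indices for training; the remaining 20% form the test set
train_idx <- sample(seq_len(n_reviews), size = floor(0.8 * n_reviews))
train <- reviews[train_idx, ]
test  <- reviews[-train_idx, ]

# Row counts of the two sets
c(train = nrow(train), test = nrow(test))
```

A simple random split like this does not balance the response across the two sets, which is one reason a stratified helper may be preferred in practice.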
Once the data of interest were selected, the next step was to inspect the predictor variables and the response variable. The modal rating is 4 stars. The response variable, stars, is on a 5 point scale. I create a binary response variable by splitting ratings into "High" and "Low" categories, where high includes any rating greater than or equal to 4. Low covers the ratings below a 4.
Vote counts for each type (cool, funny and useful) are right-skewed, so I group the counts into deciles to account for extreme values.
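Grouping skewed counts into deciles can be sketched in base R as follows. The helper `to_ntile` and the simulated `votes` vector are assumptions made for illustration; the analysis itself uses `ntile()` from dplyr, whose handling of uneven bucket sizes may differ from this sketch.

```r
# Hypothetical helper (not part of the original analysis): assign each value to
# one of n rank-based buckets of nearly equal size; ties are broken by position.
to_ntile <- function(x, n = 10) {
  r <- rank(x, ties.method = "first")
  as.integer(floor(n * (r - 1) / length(x)) + 1)
}

set.seed(1)
votes <- rgeom(1000, prob = 0.3)  # simulated right-skewed counts, not Yelp data
deciles <- to_ntile(votes)

# Each decile holds the same number of reviews, however extreme the raw counts
table(deciles)
```

Because the buckets are built from ranks, a review with 400 votes and one with 40 votes can land in the same top decile, which is what limits the influence of extreme values.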
Modeling method
My response variable is binary: high star rating vs low star rating. To model a binary variable, I chose to use a binary logistic regression model with a logit link function. I then use the caret package to partition the data into a training and a testing set. The final models were of the form:
star_rating (1 = high, 0 = low) ~ decile(cool votes) + decile(useful votes) + decile(funny votes) + decile(total votes)
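A model of this form can be fitted with `glm()` from base R. The sketch below uses simulated data with two decile predictors only; the data frame `sim`, the generating coefficients and the reduced formula are assumptions for illustration and do not reproduce the Yelp results.

```r
set.seed(42)
n <- 2000

# Simulated decile predictors (hypothetical data, not Yelp reviews)
sim <- data.frame(
  cool.decile  = sample(1:10, n, replace = TRUE),
  funny.decile = sample(1:10, n, replace = TRUE)
)

# Log-odds used only to generate the toy binary response
eta <- -0.5 + 0.3 * sim$cool.decile - 0.15 * sim$funny.decile
sim$stars_high <- rbinom(n, size = 1, prob = plogis(eta))

# Binary logistic regression with a logit link
fit <- glm(stars_high ~ cool.decile + funny.decile,
           data = sim, family = binomial(link = "logit"))

# Coefficients are on the log-odds scale
coef(fit)

# type = "response" returns the predicted probability that stars_high is 1
p_high <- predict(fit, type = "response")
```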
Results and conclusion
Cool votes were the best predictor of review ratings. This result matches my hypothesis. Useful votes and funny votes were weaker predictors. To a reader/customer, I believe a "cool" review is more attractive than a "funny" one. In the fitted model, a decile increase in cool votes is associated with a 34% increase in the odds of a high rating (odds ratio 1.34). The odds ratios for the other two vote types are below 1 once the other predictors are held fixed: a decile increase in funny votes is associated with a 14% decrease in the odds of a high rating (odds ratio 0.86), and a decile increase in useful votes with a 9% decrease (odds ratio 0.91). Against the test dataset, the reported accuracy was 31% and the area under the ROC curve was 0.68.
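The percentage interpretation of a logistic regression coefficient comes from exponentiating it. The short sketch below does this for the four coefficient estimates as printed in the summary(modFit) output in the Code section; the variable names `coefs`, `odds_ratio` and `pct_change` are introduced here for illustration only.

```r
# Coefficient estimates copied from the summary(modFit) output in the Code section
coefs <- c(cool.decile   =  0.295192,
           funny.decile  = -0.147257,
           useful.decile = -0.095128,
           total.decile  = -0.044906)

# exp(coefficient) is the odds ratio for a one-decile increase
odds_ratio <- exp(coefs)

# Percent change in the odds of a high rating per one-decile increase
pct_change <- round(100 * (odds_ratio - 1), 1)
pct_change
```

A positive value means the odds of a high rating rise with each decile; a negative value means they fall, holding the other predictors fixed.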
Code
# --------------------------------------------------------------------------------------
# load packages
library(jsonlite)
library(plyr)
library(dplyr)
library(gmodels)
library(ggplot2)
library(e1071)
library(AppliedPredictiveModeling)
library(caret)
library(MASS)
library(reshape2)
library(ROCR)

# set working directory
setwd("~/DS_course_project/data_files")

# Read "R" datasets
dfBus <- readRDS("business.rds")
dfRev <- readRDS("review.rds")

# clean variable names
names(dfBus) <- make.names(names(dfBus), unique = TRUE, allow_ = FALSE)
names(dfRev) <- make.names(names(dfRev), unique = TRUE, allow_ = FALSE)

# --------------------------------------------------------------------------------------
# select businesses to analyze
# collapse list of business categories into one variable
select_biz <- (dfBus[,c(1,4)])
dx <- (select_biz$categories)
n <- length(dx)
business_categ <- character(n)
for (i in 1:n){
  business_categ[i] = paste(dx[[i]], collapse = ' / ')
}
dx2 <- data.frame(select_biz$business.id, business_categ)

# --------------------------------------------------------------------------------------
# Select "Restaurant" businesses
for(i in 1:nrow(dx2)){
  business_ct = dx2$business_categ[i]
  if(grepl("restaurants",business_ct,ignore.case = TRUE) == TRUE) {
    dx2$busi_to_keep[i] <- 1
  } else dx2$busi_to_keep[i] <- 0
}
dx3 <- dx2[which(dx2$busi_to_keep == 1), ]
business.id <- dx3[,1]
businesses_to_keep <- as.data.frame(business.id)

# --------------------------------------------------------------------------------------
# Reviews data
dfRvws <- dfRev

# merge reviews to the businesses
dfRvwsAll <- merge(dfRvws,businesses_to_keep,by="business.id")
dfRvwsAll$votes.total <- dfRvwsAll$votes.funny+dfRvwsAll$votes.useful+dfRvwsAll$votes.cool
Revs <- dfRvwsAll[,c(3,4,6:9)]

# Select reviews with at least 3 total votes
bQtr2x <- Revs[ which(Revs$votes.total >= 3),]

# --------------------------------------------------------------------------------------
# data exploration
summary(bQtr2x[,-1])
##      stars        votes.funny       votes.useful       votes.cool
##  Min.   :1.000   Min.   :  0.000   Min.   :  0.000   Min.   :  0.00
##  1st Qu.:3.000   1st Qu.:  0.000   1st Qu.:  2.000   1st Qu.:  1.00
##  Median :4.000   Median :  1.000   Median :  2.000   Median :  1.00
##  Mean   :3.658   Mean   :  1.669   Mean   :  3.187   Mean   :  2.07
##  3rd Qu.:5.000   3rd Qu.:  2.000   3rd Qu.:  4.000   3rd Qu.:  2.00
##  Max.   :5.000   Max.   :141.000   Max.   :166.000   Max.   :137.00
##   votes.total
##  Min.   :  3.000
##  1st Qu.:  3.000
##  Median :  5.000
##  Mean   :  6.926
##  3rd Qu.:  7.000
##  Max.   :444.000

# standard deviation
sapply(bQtr2x[,-1], sd)
##        stars  votes.funny votes.useful   votes.cool  votes.total
##     1.284196     2.548183     3.074900     2.681225     7.672825

# group predictor variables in deciles to account for extreme values
bQtr3 <- bQtr2x
bQtr3$cool.decile <- ntile(bQtr3$votes.cool, 10)
bQtr3$funny.decile <- ntile(bQtr3$votes.funny, 10)
bQtr3$useful.decile <- ntile(bQtr3$votes.useful, 10)
bQtr3$total.decile <- ntile(bQtr3$votes.total, 10)

# tabulate response variable
table(bQtr3$stars)
##
##     1     2     3     4     5
## 22910 24811 35985 77546 73881

# transform response variable
bQtr3$stars_high <- ifelse(bQtr3$stars <= 3, 0, 1)
table(bQtr3$stars_high)
##
##      0      1
##  83706 151427

# Boxplots of vote types by stars
dfTr3x <- bQtr3[,c(7:11)]
dfTr3y <- melt(dfTr3x, measure.vars = 2:4)
ggplot(dfTr3y, aes(x=factor(stars_high), y=value,fill=variable))+
  geom_boxplot()+
  facet_grid(.~variable)+
  ggtitle("Deciles of vote counts (funny, cool, useful, total) by stars(ratings)") +
  labs(x="Stars(ratings)",y="deciles of votes")

# --------------------------------------------------------------------------------------
# Partition into training and test data
set.seed(123)
trainRows <- createDataPartition(bQtr3$stars_high, p = .8, list = FALSE)
trainDf <- bQtr3[trainRows,-c(1:6)]
trainDf2 <- bQtr3[trainRows,-c(1,2,7:10)]
testDf <- bQtr3[-trainRows,-c(1:6,11)]
testDf2 <- bQtr3[-trainRows,-c(1,2,7:11)]
testResp <- bQtr3[-trainRows,11]

# --------------------------------------------------------------------------------------
# Fit glm logistic model
set.seed(123)
modFit <- glm(stars_high ~ ., data = trainDf, family=binomial(link='logit'))
summary(modFit) # model summary
##
## Call:
## glm(formula = stars_high ~ ., family = binomial(link = "logit"),
##     data = trainDf)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4706  -1.1610   0.7257   0.9027   1.9891
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.608337   0.014598  41.674   <2e-16 ***
## cool.decile    0.295192   0.002749 107.383   <2e-16 ***
## funny.decile  -0.147257   0.002679 -54.963   <2e-16 ***
## useful.decile -0.095128   0.003228 -29.465   <2e-16 ***
## total.decile  -0.044906   0.004773  -9.408   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 244906  on 188106  degrees of freedom
## Residual deviance: 226479  on 188102  degrees of freedom
## AIC: 226489
##
## Number of Fisher Scoring iterations: 4

# odds ratios and 95% CIs using profiled log-likelihood
exp(cbind(OR = coef(modFit), confint(modFit)))
##                      OR     2.5 %    97.5 %
## (Intercept)   1.8373740 1.7855612 1.8907148
## cool.decile   1.3433846 1.3361738 1.3506500
## funny.decile  0.8630719 0.8585473 0.8676116
## useful.decile 0.9092567 0.9035188 0.9150260
## total.decile  0.9560873 0.9471851 0.9650735

# Run anova() to analyze the table of deviance
anova(modFit, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: stars_high
##
## Terms added sequentially (first to last)
##
##
##               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                         188106     244906
## cool.decile    1   7763.0    188105     237143 < 2.2e-16 ***
## funny.decile   1   6991.4    188104     230151 < 2.2e-16 ***
## useful.decile  1   3583.6    188103     226568 < 2.2e-16 ***
## total.decile   1     88.5    188102     226479 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# --------------------------------------------------------------------------------------
# Prediction
# assessing predictive ability of model
# predicted probabilities: glm does not predict a class, but it can produce Pr(event)
# NOTE: predict(type = "response") returns P(stars_high = 1); the "1 -" below turns it
# into P(stars_high = 0), so the accuracy printed here scores the flipped prediction
fitted.results <- 1- predict(modFit,newdata = testDf,
                             type = "response")
# If P(y=1|X) > 0.5 then y = 1 otherwise y=0.
fitted.results <- ifelse(fitted.results > 0.5,1,0) # classification
misClasificError <- mean(fitted.results != testResp)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.309126865989027"

# plot the ROC curve
p <- predict(modFit,newdata=testDf,type='response')
pr <- prediction(p, testResp)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

# Area under the curve
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.6798151

Source: https://rstudio-pubs-static.s3.amazonaws.com/126492_564a813a82084955b2c65ef0323b743c.html