How Many "Useful" Votes Will a Yelp Review Receive?

Introduction

A Yelp review can receive up to three different kinds of votes: a funny vote, a cool vote or a useful vote. I hypothesized that the counts of these votes play a role in determining a review's star rating. I decided to test this hypothesis on businesses that self-identified as restaurants, and I focused on reviews with at least 3 votes. The dependent variable, star rating, was converted into a binary variable with "high" for any rating greater than or equal to 4 and "low" for any rating below 4. Cool votes were the best predictor of ratings, followed by useful votes, with funny votes third. The model could be improved by including additional variables. Understanding which votes matter to a review is important for both the reviewer and the business. The reviewer can continue to write great reviews and the businesses potentially get better ratings.

Data processing

The complete Yelp data is partitioned into 5 "json" files: user, tip, review, checkin and business files. For my analysis I used the business and review files. The business dataset provides attributes such as location, stars and hours of operation for the different businesses. The reviews dataset holds information such as the review text and the votes received by a review. There are 61,000+ businesses and 1.5+ million reviews.
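
The code section of this report starts from .rds files rather than the raw JSON. The following is a minimal sketch of how such a file might be produced, assuming the raw files hold one JSON object per line and that nested fields such as the votes are flattened into columns like votes.funny. The two records and the temporary file paths are made up for illustration.

```r
library(jsonlite)

# Two made-up records, one JSON object per line (field names are assumptions)
json_lines <- c(
  '{"review_id": "r1", "stars": 4, "votes": {"funny": 0, "useful": 2, "cool": 1}}',
  '{"review_id": "r2", "stars": 2, "votes": {"funny": 3, "useful": 1, "cool": 0}}'
)
json_path <- tempfile(fileext = ".json")
writeLines(json_lines, json_path)

# stream_in() parses line-delimited JSON into a data frame
raw <- jsonlite::stream_in(file(json_path), verbose = FALSE)

# flatten() expands the nested "votes" object into votes.funny, votes.useful, votes.cool
reviews <- jsonlite::flatten(raw)
names(reviews)

# Cache the result so later sessions can load it with readRDS()
rds_path <- tempfile(fileext = ".rds")
saveRDS(reviews, rds_path)
reloaded <- readRDS(rds_path)
```

Caching to .rds avoids re-parsing a large JSON file on every run.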

For my research question I chose to focus on businesses that self-identified as restaurants. In addition, I focused on reviews that had at least 3 votes. I split the data into a training dataset and a testing dataset using an 80-20 split.
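
The selection can be sketched on toy data as follows. This is a vectorized illustration under my own assumptions, not the code used for the analysis (which appears in the code section and uses loops). The data frames biz and rev and their contents are hypothetical.

```r
# Hypothetical businesses with a list column of categories
biz <- data.frame(business.id = c("b1", "b2", "b3"), stringsAsFactors = FALSE)
biz$categories <- list(c("Restaurants", "Pizza"), c("Dentists"), c("Bars", "Restaurants"))

# Collapse each category list into one string, then keep restaurants
categ <- vapply(biz$categories, paste, character(1), collapse = " / ")
keep <- biz[grepl("restaurants", categ, ignore.case = TRUE), "business.id", drop = FALSE]

# Hypothetical reviews with vote counts
rev <- data.frame(
  business.id  = c("b1", "b1", "b2", "b3"),
  votes.funny  = c(0, 1, 5, 2),
  votes.useful = c(1, 2, 5, 0),
  votes.cool   = c(1, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Keep reviews of restaurants only, then reviews with at least 3 total votes
rev <- merge(rev, keep, by = "business.id")
rev$votes.total <- rev$votes.funny + rev$votes.useful + rev$votes.cool
rev3 <- rev[rev$votes.total >= 3, ]
rev3
```

Here the dentist (b2) is dropped by the merge, and the first b1 review is dropped because it has only 2 votes in total.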

Once the data of interest were selected, the next step was to inspect the predictor variables and the response variable. The modal rating is 4 stars. The response variable, stars, is on a 5-point scale. I create a binary response variable by splitting ratings into "High" and "Low" categories, where High includes any rating greater than or equal to 4 and Low covers ratings below 4.
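
As a quick check of the split, the star counts tabulated in the code output later in this report can be collapsed by hand. The vector name counts is mine; the numbers are copied from that table.

```r
# Star-rating counts copied from the table(bQtr3$stars) output in this report
counts <- c(`1` = 22910, `2` = 24811, `3` = 35985, `4` = 77546, `5` = 73881)
stars <- as.integer(names(counts))

modal_rating <- names(counts)[which.max(counts)]  # most frequent rating
high <- sum(counts[stars >= 4])                   # reviews labelled "High"
low  <- sum(counts[stars < 4])                    # reviews labelled "Low"
share_high <- round(high / sum(counts), 3)

c(high = high, low = low)
share_high
```

The totals agree with the 0/1 table in the code output (83706 low, 151427 high), so roughly 64% of the selected reviews fall in the High class.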

Vote counts for each type (cool, funny and useful) are right-skewed, so I reduce the influence of extreme values by grouping the counts into deciles.
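
A small sketch of the decile grouping with dplyr's ntile(), using made-up right-skewed counts rather than the Yelp data. One property worth knowing is that ntile() builds equal-sized buckets and ignores ties, so reviews with the same vote count can land in different deciles.

```r
library(dplyr)

set.seed(1)
votes <- rgeom(1000, prob = 0.4)  # made-up right-skewed vote counts

decile <- ntile(votes, 10)

# Bucket sizes are equal by construction
table(decile)

# Range of raw counts inside each decile
tapply(votes, decile, range)

# Tied values (here, zero votes) are spread over several deciles
zero_deciles <- sort(unique(decile[votes == 0]))
zero_deciles
```

With many tied low counts, the lower deciles are separated by row order rather than by vote count, which is worth keeping in mind when reading the coefficients.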

Modeling method

My response variable is binary: high star rating vs. low star rating. To model a binary variable, I chose a binary logistic regression model with a logit link function. I then use the caret package to partition the data into a training set and a testing set. The final models were of the form:

star_rating (1 = high, 0 = low) ~ decile(cool votes) + decile(useful votes) + decile(funny votes) + decile(total votes)
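
To make the model form concrete, here is a minimal sketch that fits the same kind of logistic regression to simulated data. Everything in it is an assumption for illustration: the data frame sim, the "true" effects, and the use of base R sample() as a stand-in for the caret partitioning step.

```r
set.seed(42)
n <- 2000

# Simulated decile predictors (made up)
sim <- data.frame(
  cool.decile  = sample(1:10, n, replace = TRUE),
  funny.decile = sample(1:10, n, replace = TRUE)
)

# Made-up true effects on the log-odds scale
eta <- 0.5 + 0.3 * sim$cool.decile - 0.15 * sim$funny.decile
sim$stars_high <- rbinom(n, 1, plogis(eta))

# 80-20 split with base R
train_rows <- sample(n, size = 0.8 * n)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Binary logistic regression with a logit link
fit <- glm(stars_high ~ cool.decile + funny.decile,
           data = train, family = binomial(link = "logit"))
coef(fit)

# Predicted probability of a high rating for the held-out rows
p_high <- predict(fit, newdata = test, type = "response")
```

The fitted coefficients should land near the simulated effects, which is a useful sanity check before applying the same call to real data.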

Results and conclusion

Cool votes were the best predictor of review ratings, which matches my hypothesis. Useful votes came second and funny votes third. To a reader or customer, I believe a "cool" review is more attractive than a "funny" one. Holding the other predictors fixed, a one-decile increase in cool votes is associated with a 34% increase in the odds of a high rating (odds ratio 1.34). A one-decile increase in useful votes is associated with a 9% decrease in the odds (odds ratio 0.91), and a one-decile increase in funny votes with a 14% decrease (odds ratio 0.86). Against the test dataset, the reported accuracy was 31% and the area under the curve was 0.68.
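
The percentage changes in odds come from exponentiating the logit coefficients. A short check, using the estimates copied from the model summary in the code section (the vector name coefs is mine):

```r
# Coefficient estimates copied from summary(modFit) in this report
coefs <- c(cool.decile = 0.295192, funny.decile = -0.147257,
           useful.decile = -0.095128, total.decile = -0.044906)

odds_ratio <- exp(coefs)                        # multiplicative change in odds per decile
pct_change <- round(100 * (odds_ratio - 1), 1)  # same change as a percentage

round(odds_ratio, 3)
pct_change
# cool.decile 34.3, funny.decile -13.7, useful.decile -9.1, total.decile -4.4
```

These are changes in odds, not in probability or in log odds: an odds ratio of 1.34 multiplies the odds of a high rating by 1.34 for each one-decile step.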

Code

          # --------------------------------------------------------------------------------------
          # load packages
          library(jsonlite)
          library(plyr)
          library(dplyr)
          library(gmodels)
          library(ggplot2)
          library(e1071)
          library(AppliedPredictiveModeling)
          library(caret)
          library(MASS)
          library(reshape2)
          library(ROCR)

          # set working directory
          setwd("~/DS_course_project/data_files")

          # Read "R" datasets
          dfBus <- readRDS("business.rds")
          dfRev <- readRDS("review.rds")

          # clean variable names
          names(dfBus) <- make.names(names(dfBus), unique = TRUE, allow_ = FALSE)
          names(dfRev) <- make.names(names(dfRev), unique = TRUE, allow_ = FALSE)

          # --------------------------------------------------------------------------------------
          # select businesses to analyze
          # collapse list of business categories into one variable
          select_biz <- dfBus[, c(1, 4)]
          dx <- select_biz$categories
          n <- length(dx)
          business_categ <- character(n)
          for (i in 1:n) {
            business_categ[i] <- paste(dx[[i]], collapse = ' / ')
          }
          dx2 <- data.frame(select_biz$business.id, business_categ)

          # --------------------------------------------------------------------------------------
          # Select "Restaurant" businesses
          for (i in 1:nrow(dx2)) {
            business_ct <- dx2$business_categ[i]
            if (grepl("restaurants", business_ct, ignore.case = TRUE)) {
              dx2$busi_to_keep[i] <- 1
            } else {
              dx2$busi_to_keep[i] <- 0
            }
          }
          dx3 <- dx2[which(dx2$busi_to_keep == 1), ]
          business.id <- dx3[, 1]
          businesses_to_keep <- as.data.frame(business.id)

          # --------------------------------------------------------------------------------------
          # Reviews data
          dfRvws <- dfRev
          # merge reviews to the businesses
          dfRvwsAll <- merge(dfRvws, businesses_to_keep, by = "business.id")
          dfRvwsAll$votes.total <- dfRvwsAll$votes.funny + dfRvwsAll$votes.useful + dfRvwsAll$votes.cool
          Revs <- dfRvwsAll[, c(3, 4, 6:9)]
          # Select reviews with at least 3 total votes
          bQtr2x <- Revs[which(Revs$votes.total >= 3), ]

          # --------------------------------------------------------------------------------------
          # data exploration
          summary(bQtr2x[, -1])

          ##      stars        votes.funny       votes.useful       votes.cool
          ##  Min.   :1.000   Min.   :  0.000   Min.   :  0.000   Min.   :  0.00
          ##  1st Qu.:3.000   1st Qu.:  0.000   1st Qu.:  2.000   1st Qu.:  1.00
          ##  Median :4.000   Median :  1.000   Median :  2.000   Median :  1.00
          ##  Mean   :3.658   Mean   :  1.669   Mean   :  3.187   Mean   :  2.07
          ##  3rd Qu.:5.000   3rd Qu.:  2.000   3rd Qu.:  4.000   3rd Qu.:  2.00
          ##  Max.   :5.000   Max.   :141.000   Max.   :166.000   Max.   :137.00
          ##   votes.total
          ##  Min.   :  3.000
          ##  1st Qu.:  3.000
          ##  Median :  5.000
          ##  Mean   :  6.926
          ##  3rd Qu.:  7.000
          ##  Max.   :444.000

          # standard deviation
          sapply(bQtr2x[, -1], sd)

          ##        stars  votes.funny votes.useful   votes.cool  votes.total
          ##     1.284196     2.548183     3.074900     2.681225     7.672825

          # group predictor variables in deciles to account for extreme values
          bQtr3 <- bQtr2x
          bQtr3$cool.decile <- ntile(bQtr3$votes.cool, 10)
          bQtr3$funny.decile <- ntile(bQtr3$votes.funny, 10)
          bQtr3$useful.decile <- ntile(bQtr3$votes.useful, 10)
          bQtr3$total.decile <- ntile(bQtr3$votes.total, 10)

          # tabulate response variable
          table(bQtr3$stars)

          ##
          ##     1     2     3     4     5
          ## 22910 24811 35985 77546 73881

          # transform response variable
          bQtr3$stars_high <- ifelse(bQtr3$stars <= 3, 0, 1)
          table(bQtr3$stars_high)

          ##
          ##      0      1
          ##  83706 151427

          # Boxplots of vote types by stars
          dfTr3x <- bQtr3[, c(7:11)]
          # measure.vars = 2:4 melts the funny, useful and total deciles;
          # cool.decile and stars_high stay as id columns
          dfTr3y <- melt(dfTr3x, measure.vars = 2:4)
          ggplot(dfTr3y, aes(x = factor(stars_high), y = value, fill = variable)) +
            geom_boxplot() +
            facet_grid(. ~ variable) +
            ggtitle("Deciles of vote counts (funny, cool, useful, total)
                     by stars(ratings)") +
            labs(x = "Stars(ratings)", y = "deciles of votes")

          # --------------------------------------------------------------------------------------
          # Partition into training and test data
          set.seed(123)
          trainRows <- createDataPartition(bQtr3$stars_high, p = .8, list = FALSE)
          trainDf <- bQtr3[trainRows, -c(1:6)]
          trainDf2 <- bQtr3[trainRows, -c(1, 2, 7:10)]
          testDf <- bQtr3[-trainRows, -c(1:6, 11)]
          testDf2 <- bQtr3[-trainRows, -c(1, 2, 7:11)]
          testResp <- bQtr3[-trainRows, 11]

          # --------------------------------------------------------------------------------------
          # Fit glm logistic model
          set.seed(123)
          modFit <- glm(stars_high ~ ., data = trainDf, family = binomial(link = 'logit'))
          summary(modFit) # model summary

          ##
          ## Call:
          ## glm(formula = stars_high ~ ., family = binomial(link = "logit"),
          ##     data = trainDf)
          ##
          ## Deviance Residuals:
          ##     Min       1Q   Median       3Q      Max
          ## -2.4706  -1.1610   0.7257   0.9027   1.9891
          ##
          ## Coefficients:
          ##                Estimate Std. Error z value Pr(>|z|)
          ## (Intercept)    0.608337   0.014598  41.674   <2e-16 ***
          ## cool.decile    0.295192   0.002749 107.383   <2e-16 ***
          ## funny.decile  -0.147257   0.002679 -54.963   <2e-16 ***
          ## useful.decile -0.095128   0.003228 -29.465   <2e-16 ***
          ## total.decile  -0.044906   0.004773  -9.408   <2e-16 ***
          ## ---
          ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
          ##
          ## (Dispersion parameter for binomial family taken to be 1)
          ##
          ##     Null deviance: 244906  on 188106  degrees of freedom
          ## Residual deviance: 226479  on 188102  degrees of freedom
          ## AIC: 226489
          ##
          ## Number of Fisher Scoring iterations: 4

          # odds ratios and 95% CIs using profiled log-likelihood
          exp(cbind(OR = coef(modFit), confint(modFit)))

          ##                      OR     2.5 %    97.5 %
          ## (Intercept)   1.8373740 1.7855612 1.8907148
          ## cool.decile   1.3433846 1.3361738 1.3506500
          ## funny.decile  0.8630719 0.8585473 0.8676116
          ## useful.decile 0.9092567 0.9035188 0.9150260
          ## total.decile  0.9560873 0.9471851 0.9650735

          # Run anova() to analyze the table of deviance
          anova(modFit, test = "Chisq")

          ## Analysis of Deviance Table
          ##
          ## Model: binomial, link: logit
          ##
          ## Response: stars_high
          ##
          ## Terms added sequentially (first to last)
          ##
          ##
          ##               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
          ## NULL                         188106     244906
          ## cool.decile    1   7763.0    188105     237143 < 2.2e-16 ***
          ## funny.decile   1   6991.4    188104     230151 < 2.2e-16 ***
          ## useful.decile  1   3583.6    188103     226568 < 2.2e-16 ***
          ## total.decile   1     88.5    188102     226479 < 2.2e-16 ***
          ## ---
          ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          # --------------------------------------------------------------------------------------
          # Prediction
          # assessing predictive ability of model
          # glm does not predict a class; it can produce Pr(event).
          # predict(type = "response") returns P(stars_high = 1); subtracting it
          # from 1 gives P(stars_high = 0), which is what gets thresholded below.
          fitted.results <- 1 - predict(modFit, newdata = testDf, type = "response")
          # If the value is > 0.5 then y = 1 otherwise y = 0.
          fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
          # classification
          misClasificError <- mean(fitted.results != testResp)
          print(paste('Accuracy', 1 - misClasificError))

          ## [1] "Accuracy 0.309126865989027"

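
The prediction step thresholds 1 - predict(...) rather than the predicted probability itself. A minimal sketch on made-up data (not the Yelp test set) shows what that does to accuracy: thresholding 1 - p gives the complement of the accuracy obtained by thresholding p, as long as no predicted probability is exactly 0.5. All names and numbers below are assumptions for illustration.

```r
set.seed(7)

# Made-up labels and predicted probabilities of class 1
truth <- rbinom(500, 1, 0.6)
p <- plogis(rnorm(500, mean = ifelse(truth == 1, 0.8, -0.8)))

# Accuracy when thresholding the probability of class 1
acc <- mean(ifelse(p > 0.5, 1, 0) == truth)

# Accuracy when thresholding 1 - p instead
acc_flipped <- mean(ifelse(1 - p > 0.5, 1, 0) == truth)

c(acc = acc, acc_flipped = acc_flipped, total = acc + acc_flipped)
```

Under the same assumption, the accuracy printed above (0.309) would correspond to about 1 - 0.309 = 0.691 if the predicted probabilities were thresholded directly.
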
          # plot the ROC curve
          p <- predict(modFit, newdata = testDf, type = 'response')
          pr <- prediction(p, testResp)
          prf <- performance(pr, measure = "tpr", x.measure = "fpr")
          plot(prf)

          # Area under the curve
          auc <- performance(pr, measure = "auc")
          auc <- auc@y.values[[1]]
          auc

          ## [1] 0.6798151        
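
One way to read the AUC is as the probability that a randomly chosen high-rated review gets a higher predicted score than a randomly chosen low-rated one. The sketch below computes it two ways in base R on made-up scores, as an illustration of that interpretation rather than a replacement for the ROCR calls above. The variable names are mine.

```r
set.seed(11)

# Made-up labels and scores
truth <- rbinom(400, 1, 0.6)
score <- plogis(rnorm(400, mean = ifelse(truth == 1, 0.5, -0.5)))

# AUC from ranks (Mann-Whitney form)
r <- rank(score)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc_rank <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# AUC by comparing every positive with every negative (ties count one half)
pos <- score[truth == 1]
neg <- score[truth == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(auc_rank = auc_rank, auc_pairs = auc_pairs)
```

Because the AUC depends only on the ordering of the scores, it is unaffected by the choice of a 0.5 cutoff.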


Source: https://rstudio-pubs-static.s3.amazonaws.com/126492_564a813a82084955b2c65ef0323b743c.html
