Vlad Sandulescu

The winning solution to the KDD Cup 2016 competition - Predicting the future relevance of research institutions

2016-09-28T09:08:20+00:00

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining, or KDD for short, is one of the premier academic conferences on data science, large-scale data mining and machine learning. Many well-known researchers choose to publish their research at KDD and the conference attracts most of the big names in the machine learning community. The KDD Cup challenge is held in conjunction with the conference and every year it gathers many participants competing to win the award. This year the top three teams received $10,000, $6,500 and $3,500 respectively. More than 500 teams registered for the competition, which comprised three stages, each time-boxed to one month.

To compete in the KDD Cup is something I have been meaning to do for a few years now. To win it on the first go was simply incredible! This year, the stars aligned just right and I actually got some time to focus on this three-month-long competition. My friend and former colleague Mihai Chiru joined me and we proved once again we make one hell of a good team. The competition progressed great for us and in each phase we improved our predictions and overall rank.

The task? The research task was to predict the ranking of affiliations (universities, research institutions or companies) based on the number of their accepted full research papers at 8 future academic conferences in 2016. Each research paper is written by a number of authors, each affiliated with a university or company. The authors of a paper receive equal scores (1/number of paper authors) for each accepted paper. The score for each affiliation is then simply the sum of the scores of its authors. The affiliations are ranked according to these scores (also called ‘relevance’ by the competition organizers) and this represents the predicted ranking in the competition.
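
To make the scoring rule concrete, here is a minimal sketch in Python. The function name and the toy papers are made up for illustration, and it assumes one affiliation per author, which is all the description above covers.

```python
from collections import defaultdict

def affiliation_relevance(papers):
    """Sum the per-author credit (1 / number of authors) by affiliation.

    `papers` is a list of papers, each given as the list of its authors'
    affiliations (one entry per author).
    """
    scores = defaultdict(float)
    for author_affiliations in papers:
        credit = 1.0 / len(author_affiliations)  # equal share per author
        for affiliation in author_affiliations:
            scores[affiliation] += credit
    return dict(scores)

# Hypothetical toy data: two accepted papers.
papers = [
    ["Univ A", "Univ A", "Company B"],  # three authors, 1/3 each
    ["Company B", "Univ C"],            # two authors, 1/2 each
]
scores = affiliation_relevance(papers)
# Rank affiliations by decreasing relevance.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example Company B collects 1/3 + 1/2, Univ A collects 2/3 and Univ C collects 1/2, so Company B ends up ranked first.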

Ok, but how were the teams evaluated? The evaluation metric used to rank the teams was NDCG@20, a typical ranking evaluation metric. Basically, the more a team's predicted ranking differed from the true ranking, the lower its score. The true ranking became known once the accepted full research papers were published at the chosen conference. You can find more details about how NDCG@20 is computed together with a straightforward example here.
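
For readers who prefer code to formulas, below is a small sketch of NDCG@k. It assumes the common linear-gain form, DCG = sum of rel_i / log2(i + 1) over the top k positions; the exact variant used by the organizers is the one described in their material, so treat this only as an illustration. The function names and toy relevance values are hypothetical.

```python
import math

def dcg_at_k(relevances, k=20):
    """Discounted cumulative gain over the first k relevance values."""
    # enumerate() is 0-based, so position i (1-based) has discount log2(i + 1).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted_ranking, true_relevance, k=20):
    """NDCG@k of a predicted ranking against the true relevance scores."""
    gains = [true_relevance.get(aff, 0.0) for aff in predicted_ranking]
    ideal = sorted(true_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical true relevance scores for three affiliations.
true_relevance = {"A": 3.0, "B": 2.0, "C": 1.0}
perfect = ndcg_at_k(["A", "B", "C"], true_relevance)
reversed_order = ndcg_at_k(["C", "B", "A"], true_relevance)
```

A perfect ranking scores 1.0, and any ranking that puts less relevant affiliations ahead of more relevant ones scores below that.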

What data did we use? Microsoft, the competition organizer, offered a free snapshot of the Microsoft Academic Graph (MAG), a heterogeneous graph containing various historical information about many academic papers. From this giant graph we extracted information about all the papers ever published at the chosen conferences, about all their authors and the affiliations the authors belonged to.

So how did we do it? First things first, we looked at our data. We checked for any obvious trends in the papers from top affiliations accepted at each conference. We consider a top affiliation to be one which had a large number of accepted papers at a conference in the last five years. The assumption is that, for large conferences at least, the top 20 places each year will belong to the more prolific affiliations, which are likely to have participated in the conference in the past. We also chose to focus on the first 20 places because the evaluation metric used in the competition is NDCG@20. We explored the dataset in a lot of ways but will only mention the most important finding in this article.

In Figure 1 we plot the number of full research papers accepted at the KDD conference between 2011 and 2015 for the top 20 affiliations. The length of each line shows the range of the number of accepted papers for the affiliation and the mean number of papers is marked by the larger dot on each line. The plot shows the mean number of papers across all years could be a good predictor of how an affiliation will score in the future.

Figure 1: Full range and mean value of the number of accepted full research papers for top 20 affiliations at KDD between 2011 and 2015

Second, all good predictive models need their trustworthy baseline. In the first phase of the competition we aim to build a solid baseline model. For this, we compute the probabilities that full research papers belong to affiliations, based on their number of accepted papers across the past five years. We then rank the affiliations according to these probabilities and this becomes our baseline model against which we will compare all our other models.
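
As a rough illustration of such a baseline, the sketch below turns historical paper counts into probabilities and ranks affiliations by them. This is a simplified reading of the idea, not the exact formulation from our paper; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def baseline_probabilities(yearly_counts):
    """Estimate P(paper belongs to affiliation) from pooled historical counts.

    `yearly_counts` maps a year to a dict {affiliation: accepted papers}.
    """
    totals = Counter()
    for counts in yearly_counts.values():
        totals.update(counts)  # adds the counts of each year to the running totals
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {aff: n / grand_total for aff, n in totals.items()}

# Hypothetical counts for two past years of one conference.
history = {
    2014: {"A": 3, "B": 1},
    2015: {"A": 2, "C": 2},
}
probabilities = baseline_probabilities(history)
baseline_ranking = sorted(probabilities, key=probabilities.get, reverse=True)
```

Every later model is then compared against the ranking this produces.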

In the second phase and also in the final phase of the competition, we experiment with two classes of models: mixed models and gradient boosted decision trees (GBDT). The former is more interpretable while the latter has more predictive power. We set the relevance of each affiliation as the target of our predictions in whatever models we try. The relevance is, if you remember, the sum of the fractional contributions by all the authors of an affiliation for a conference in a year. You can imagine now our dataset is a large matrix where the columns are our features (I’ll explain which features we use in just a bit) and the rows are the observations. Basically, each row holds the feature values for one conference, one affiliation and one year.

More data usually helps, so the next thing we try is to increase the dataset size by using information from more years and from conferences related to each of the conferences we are interested in making predictions for. What does ‘related’ mean? Most researchers publish their work at different conferences. However, they specialize in a specific area and so the conferences they publish at have to be more or less similar in at least a few respects. We use authors and keywords from the papers in MAG to cluster similar conferences together. It is a straightforward way to grow the dataset even more. The intuition behind this is that the information from related conferences will reinforce the patterns discovered by the models, because prolific affiliations are prolific across all conferences they submit to, not just at one of them. We compute the Jaccard similarity for both authors and keywords for any pair of conferences in the MAG. From this, we can determine which conferences are, for example, most similar to KDD in terms of common authors and common papers’ keywords. Experimenting with different numbers of related conferences allowed us to expand our training dataset immensely and greatly improve our predictions.
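
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch of how related conferences could be picked with it is shown below; the helper names and the author sets are toy values, not data from MAG, and the same functions would work on keyword sets.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target, conference_sets, top_n=2):
    """Return the top_n conferences most similar to `target`."""
    others = [c for c in conference_sets if c != target]
    others.sort(key=lambda c: jaccard(conference_sets[target], conference_sets[c]),
                reverse=True)
    return others[:top_n]

# Toy author sets per conference (hypothetical author ids).
conference_authors = {
    "KDD": {"a1", "a2", "a3", "a4"},
    "ICDM": {"a2", "a3", "a4", "a5"},
    "WWW": {"a1", "a6", "a7"},
    "SIGGRAPH": {"a9"},
}
related = most_similar("KDD", conference_authors, top_n=2)
```

The number of related conferences to keep (`top_n` here) is exactly the kind of knob we tuned per target conference.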

Ok, with the dataset in place comes the fun part: feature engineering. We experimented with many features, but will only mention here the ones which worked best for us. We created features meant to capture each affiliation’s long and short-term relevance trends:

Stats of all previous relevance scores (std, sum, mean, median, min, max)
Previous relevance scores computed in windows from previous year up to 4 years ago
Stats of previous relevance scores (std, sum, mean, median, min, max) computed in windows from previous year up to 4 years ago
Drift trend of previous relevance scores
Exponential weighted moving average of previous relevance scores with estimated smoothing parameter
Exponential weighted moving average of previous relevance scores, computed with a fixed smoothing parameter
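
A few of these features can be sketched with nothing but the Python standard library. The snippet assumes `history` holds one affiliation's past relevance scores ordered from oldest to newest. The drift here is taken as the average year-over-year change, which is one common definition and an assumption on my part for this sketch; the smoothing parameter value is also just an example.

```python
import statistics

def window_stats(history, window):
    """Summary statistics over the last `window` relevance scores."""
    recent = history[-window:]
    return {
        "sum": sum(recent),
        "mean": statistics.mean(recent),
        "median": statistics.median(recent),
        "min": min(recent),
        "max": max(recent),
        "std": statistics.pstdev(recent),  # population standard deviation
    }

def ewma(history, alpha):
    """Exponentially weighted moving average with a fixed smoothing parameter."""
    smoothed = history[0]
    for value in history[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed

def drift(history):
    """Average change per year between the first and last observation."""
    if len(history) < 2:
        return 0.0
    return (history[-1] - history[0]) / (len(history) - 1)

# Hypothetical relevance scores for one affiliation, oldest first.
history = [2.0, 4.0, 3.0, 5.0]
features = {
    "last_2_years": window_stats(history, 2),
    "ewma_0.5": ewma(history, alpha=0.5),
    "drift": drift(history),
}
```

A larger `alpha` makes the moving average follow recent years more closely, while a smaller one favors the long-term level.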

Dataset+Features+Baseline+Tuning=Profit. The final step in any competition is to polish your predictions through tuning. In the final phase of the competition the organizers chose 3 well-known conferences for validation: FSE, MM and MOBICOM. We search for the feature configuration for which the GBDT model gives the best predictions for each of the conferences. We perform a grid search on different combinations of features and numbers of related conferences. Thus the training dataset was of course different for each of the 3 conferences. Although the final feature sets were different overall between the conferences, some of the features do well across all conferences: the exponential smoothing features improved the final predictions for all of them.

Figure 2 shows the corresponding results of the best feature configurations for each conference. We used the tuning process to choose the best feature sets for each of the conferences, such that all the scores of the GBDT model are above the probabilities model baseline.

Figure 2: Results for the best configuration of the engineered features

Conclusions and tips for ML competitions. This was my first shot at the KDD Cup competition and it couldn’t have ended better. I believe our systematic way of building the models, coupled with some careful feature engineering, was what in the end set us apart from the other teams. So here are some pointers for you to keep in mind. You have probably read all of them before, but they cannot be stressed enough.

Tidy up your data and explore it! I mean really explore it, plot it like crazy. Spend a lot of time on doing this because you will get that extra intuition to create awesome features. You can squeeze out some more performance by tuning your models or stacking them and then re-stacking them and then doing a final ensemble of it all, you know, the Kaggle way. But well thought out features will get you a much more elegant win. I don’t think a single company puts ensembles of 20 models into production.
Set up your own validation procedure and baseline model. You will compare all your other models with the baseline and this is how you will progress. Flooding the leaderboards in order to test your model is going to get you nowhere.
Try simple models first and move on to more complicated ones gradually. Simple models are interpretable and can help you spot new features.
Teamwork is very important because it acts as a natural ensemble. You will not get all the ideas yourself.
Eyes on the prize! Don’t give up no matter how badly you do in the competition, because you will at least learn something new.

That’s it.

P.S.1. The approach is fully described in the paper Predicting the future relevance of research institutions - The winning solution of the KDD Cup 2016. Parts of this article were shamelessly taken from the paper. Some things were only superficially mentioned in this article. So I encourage you to read the entire paper for a full overview of the solution.

P.S.2. This entire article was also published on the Adform Engineering Blog

We won the KDD Cup 2016!

2016-08-09T07:08:20+00:00

Update:

Check out the abstract and read the paper which fully explains our approach.

You can read the papers from the top 12 teams which participated in the KDD Cup 2016 here.

ABSTRACT

The world’s collective knowledge is evolving through research and new scientific discoveries. It is becoming increasingly difficult to objectively rank the impact research institutes have on global advancements. However, since the funding, governmental support, staff and students quality all mirror the projected quality of the institution, it becomes essential to measure the affiliation’s rating in a transparent and widely accepted way. We propose and investigate several methods to rank affiliations based on the number of their accepted papers at future academic conferences. We carry out our investigation using publicly available datasets such as the Microsoft Academic Graph, a heterogeneous graph which contains various information about academic papers. We analyze several models, starting with a simple probabilities-based method and then gradually expand our training dataset, engineer many more features and use mixed models and gradient boosted decision trees models to improve our predictions.

The research task was to predict the ranking of affiliations (universities, research institutions or companies) based on the number of their accepted full research papers at various academic conferences in 2016. Microsoft, the competition organizer, offered a free snapshot of the Microsoft Academic Graph (MAG), a heterogeneous graph containing various historical information about many academic papers.

The competition progressed great for us and in each phase we improved our predictions and overall rank. I believe our systematic way to build the models coupled with some careful feature engineering was what in the end set us apart from the other teams.

The approach is fully described in the paper Predicting the future relevance of research institutions - The winning solution of the KDD Cup 2016. I will present the paper in August in San Francisco at the KDD Cup 2016: Towards measuring the impact of research institutions workshop, part of KDD 2016.

Data Science and Machine Learning in Copenhagen Meetup - March 2016

2016-03-05T21:08:26+00:00

[Update]: I have moved the slides to my presentation here.

On March 1st, we decided to share with the data science community in Copenhagen some of the cool projects we work on at Adform. So I helped organize the second edition of the Machine Learning and Beer event at our office and welcomed around 60 machine learning enthusiasts. The meetup turned out great and the positive feedback we got from the participants made it all worthwhile. Nice to see the small Copenhagen DS community growing and becoming more sophisticated, with more and more companies making machine learning part of their roadmaps.

You can check out the event photos, but you need to be a group member in order to see them.

Detecting Singleton Review Spammers Using Semantic Similarity

2015-03-26T21:08:26+00:00

My paper “Detecting Singleton Review Spammers Using Semantic Similarity” with Martin Ester was accepted at the Rumors and Deception in Social Media workshop at WWW 2015. I will present the work at WWW this year in Florence.

Check out the abstract and view the paper.

ABSTRACT

Online reviews have increasingly become a very important resource for consumers when making purchases. Though it is becoming more and more difficult for people to make well-informed buying decisions without being deceived by fake reviews. Prior works on the opinion spam problem mostly considered classifying fake reviews using behavioral user patterns. They focused on prolific users who write more than a couple of reviews, discarding one-time reviewers. The number of singleton reviewers however is expected to be high for many review websites. While behavioral patterns are effective when dealing with elite users, for one-time reviewers, the review text needs to be exploited. In this paper we tackle the problem of detecting fake reviews written by the same person using multiple names, posting each review under a different name. We propose two methods to detect similar reviews and show the results generally outperform the vectorial similarity measures used in prior works. The first method extends the semantic similarity between words to the reviews level. The second method is based on topic modeling and exploits the similarity of the reviews topic distributions using two models: bag-of-words and bag-of-opinion-phrases. The experiments were conducted on reviews from three different datasets: Yelp (57K reviews), Trustpilot (9K reviews) and Ott dataset (800 reviews).

Predicting what user reviews are about with LDA and gensim

2014-09-09T00:00:00+00:00

I was rather impressed with the interest and feedback I received for my Opinion phrases prototype - code repository here. So yesterday, I decided to rewrite my previous post on topic prediction for short reviews using Latent Dirichlet Allocation (LDA) and its implementation in gensim.

I have previously worked with topic modeling for my MSc thesis, but there I used the Semilar toolkit and a looot of C# code. Having read many articles about gensim, I was itching to actually try it out.

Why would we be interested in extracting topics from reviews?

It is becoming increasingly difficult to handle the large number of opinions posted on review platforms and at the same time offer this information in a useful way to each user, so he or she can quickly decide whether to buy the product or not. Topic-based aggregations and short review summaries are used to group and condense what other users think about the product, in order to personalize the content served to a new user and shorten the time needed to make a buying decision.

A short example always works best. Suppose a review says: The mailing pack that was sent to me was very thorough and well explained,correspondence from the shop was prompt and accurate,I opted for the cheque payment method which was swift in getting to me. All in all, a fast efficient service that I had the upmost confidence in,very professionally executed and I will suggest you to my friends when there mobiles are due for recycling :-)

Some of the topics that could come out of this review could be delivery, payment method and customer service.

In short, knowing what the review talks about helps automatically categorize and aggregate on individual keywords and aspects mentioned in the review, assign aggregated ratings for each aspect and personalize the content served to a user. Or simply calculate the efficiency of each of the departments in a company by what people write in their reviews - in this example, the guys in the customer service department as well as the delivery guys would be pretty happy.

What do I need to run the code?

You can clone the repository and play with the Yelp dataset, which contains many reviews, or use your own short document dataset and extract the LDA topics from it.

Get the Yelp academic dataset and import the reviews from the json file into your local MongoDB by running the yelp/yelp-reviews.py file. Use MongoDB, take my word for it, you’ll never write to a text file ever again! You will also need PyMongo, NLTK, NLTK data (in Python run import nltk, then nltk.download()). I personally have these Corpora modules installed: Brown Corpus, SentiWordNet, WordNet, as well as the following Models: Treebank Part of Speech Tagger (HMM), Treebank Part of Speech Tagger (Maximum Entropy), Punkt Tokenizer Models. Finally, don’t forget to install gensim.

OK, enough foreplay, this is how the code works. Skip to the results if you are not interested in running the prototype.

How does the prototype work?

Well, the main goal of the prototype is to try to extract topics from a large reviews corpus and then predict the topic distribution for a new unseen review. Please read this paper first, before checking out the source code, as I have followed it rather closely and tried to reproduce their results. These guys won a prize in the Yelp dataset challenge and in order for me to check if I get similar results, I also experimented on the Yelp academic dataset.

Future plans include trying out the prototype on Trustpilot reviews, once we open up the Consumer APIs to the world. I plan to do another blog post then, in which I will explain how you can run the prototype on top of the Trustpilot API and get nice results from it.

If you clone the repository, you will see a few Python files which make up the execution pipeline: yelp/yelp-reviews.py, reviews.py, corpus.py, train.py, display.py and predict.py. I have not yet made a main class to run the entire prototype, as I expect people might want to tweak this pipeline in a number of ways. For example, some may prefer a corpus containing more than just nouns, or avoid writing to Mongo, or keep more than 10000 words, or use more/less than 50 topics and so on.

You should just run the following files in order.

yelp/yelp-reviews.py - gets the reviews from the json file and imports them to MongoDB in a collection called Reviews
reviews.py / reviews_parallel.py - loops through all the reviews in the initial dataset and for each review it: splits the review into sentences, removes stopwords, extracts part-of-speech tags for all the remaining tokens, and stores each review, i.e. reviewId, business name, review text and the vector of (word, POS tag) pairs, in a new MongoDB database called Tags, in a collection called Reviews. If you have many reviews, try running reviews_parallel.py, which uses the Python multiprocessing features to parallelize this task and use multiple processes to do the POS tagging.
corpus.py - loops through all the reviews from the new MongoDB collection in the previous step, filters out all words which are not nouns, uses WordNetLemmatizer to look up the lemma of each noun, and stores each review together with its nouns’ lemmas in a new MongoDB collection called Corpus.
train.py - feeds the reviews corpus created in the previous step to the gensim LDA model, keeping only the 10000 most frequent tokens and using 50 topics.
display.py - loads the saved LDA model from the previous step and displays the extracted topics.
predict.py - given a short text, it outputs the topic distribution. Simply look out for the highest weights on a couple of topics and that will basically give the “basket(s)” where to place the text.
stopwords.txt - stopwords list created by Gerard Salton and Chris Buckley for the experimental SMART information retrieval system at Cornell University
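
To give a feel for what happens between corpus.py and train.py, here is a plain-Python sketch of pruning the vocabulary to the most frequent tokens and turning each review into a bag of words. It is not the code from the repository and does not use the gensim API; the function names and the three toy documents are made up, and `keep_n` is set to 3 only so the pruning is visible (the prototype keeps 10000 tokens).

```python
from collections import Counter

def build_vocabulary(documents, keep_n=10000):
    """Map the keep_n most frequent tokens to integer ids."""
    counts = Counter(token for doc in documents for token in doc)
    return {token: idx for idx, (token, _) in enumerate(counts.most_common(keep_n))}

def to_bag_of_words(doc, vocabulary):
    """Convert a token list to sorted (token_id, count) pairs, dropping unknown tokens."""
    counts = Counter(vocabulary[token] for token in doc if token in vocabulary)
    return sorted(counts.items())

# Hypothetical lemmatized nouns extracted from three short reviews.
documents = [
    ["pizza", "sauce", "pizza"],
    ["service", "pizza", "table"],
    ["sauce", "service"],
]
vocabulary = build_vocabulary(documents, keep_n=3)
corpus = [to_bag_of_words(doc, vocabulary) for doc in documents]
```

A list of (token id, count) pairs per document is the general shape of input a topic model such as LDA is trained on.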

The resulting topics

POS tagging the entire review corpus and training the LDA model takes considerable time, so expect to leave your laptop running overnight while you dream of phis and thetas. It took ~10h on my personal laptop (Lenovo T420s with Intel i5 inside and 8GB of RAM) to do POS tagging for all 1,125,458 Yelp reviews (I used reviews_parallel.py for this). I ran the LDA model for 50 topics, but feel free to choose more.

Here are the resulting 50 topics; ignore the words written in parentheses for now:

0: (food or sauces or sides) 0.028sauce + 0.019meal + 0.018meat + 0.017salad + 0.016food + 0.015menu + 0.015side + 0.015flavor + 0.013dish + 0.012pork
1: (breakfast) 0.122egg + 0.096breakfast + 0.065bacon + 0.064juice + 0.033sausage + 0.032fruit + 0.024morning + 0.023brown + 0.023strawberry + 0.022crepe
2: (restaurant owner) 0.074owner + 0.073year + 0.048family + 0.032business + 0.029company + 0.028day + 0.026month + 0.025time + 0.024home + 0.021daughter
3: (terrace or surroundings) 0.065park + 0.030air + 0.028management + 0.027dress + 0.027child + 0.026parent + 0.025training + 0.024fire + 0.020security + 0.020treatment
4: (seafood) 0.091shrimp + 0.090crab + 0.077lobster + 0.060seafood + 0.054nail + 0.042salon + 0.039leg + 0.033coconut + 0.032oyster + 0.031scallop
5: (thai food) 0.055soup + 0.054rice + 0.045roll + 0.036noodle + 0.032thai + 0.032spicy + 0.029bowl + 0.028chicken + 0.026dish + 0.023beef
6: (cafe) 0.086sandwich + 0.063coffee + 0.048tea + 0.026place + 0.018cup + 0.016market + 0.015cafe + 0.015bread + 0.013lunch + 0.013order
7: (service) 0.068food + 0.049order + 0.044time + 0.042minute + 0.038service + 0.034wait + 0.030table + 0.029server + 0.024drink + 0.024waitress
8: (dessert) 0.078cream + 0.071ice + 0.059flavor + 0.056dessert + 0.049cake + 0.039chocolate + 0.021sweet + 0.015butter + 0.014taste + 0.013apple
9: (greek food) 0.052topping + 0.039yogurt + 0.034patty + 0.033hubby + 0.026flavor + 0.026sample + 0.024gyro + 0.022sprinkle + 0.021coke + 0.020greek
10: (service) 0.055time + 0.037job + 0.032work + 0.026hair + 0.025experience + 0.024class + 0.020staff + 0.020massage + 0.018day + 0.017week
11: (mexican food) 0.131chip + 0.081chili + 0.071margarita + 0.056fast + 0.031dip + 0.030enchilada + 0.026quesadilla + 0.026gross + 0.024bell + 0.020pastor
12: (price) 0.082money + 0.046% + 0.042tip + 0.040buck + 0.040ticket + 0.037price + 0.033pay + 0.029worth + 0.027cost + 0.024ride
13: (location or not sure) 0.061window + 0.058soda + 0.056lady + 0.037register + 0.031ta + 0.030man + 0.028haha + 0.026slaw + 0.020secret + 0.018wet
14: (italian food) 0.144pizza + 0.038wing + 0.031place + 0.029sauce + 0.026cheese + 0.023salad + 0.021pasta + 0.019slice + 0.016brisket + 0.015order
15: (family place or drive-in) 0.157car + 0.150kid + 0.030drunk + 0.028oil + 0.026truck + 0.024fix + 0.021college + 0.016vehicle + 0.016guy + 0.013arm
16: (bar or sports bar) 0.196beer + 0.069game + 0.049bar + 0.047watch + 0.038tv + 0.034selection + 0.033sport + 0.017screen + 0.017craft + 0.014playing
17: (hotel or accommodation) 0.134room + 0.061hotel + 0.044stay + 0.036pool + 0.027view + 0.024nice + 0.020gym + 0.018bathroom + 0.016area + 0.015night
18: (restaurant or atmosphere) 0.073wine + 0.050restaurant + 0.032menu + 0.029food + 0.029glass + 0.025experience + 0.023service + 0.023dinner + 0.019nice + 0.019date
19: (not sure) 0.052son + 0.027trust + 0.025god + 0.024crap + 0.023pain + 0.023as + 0.021life + 0.020heart + 0.017finish + 0.017word
20: (location or not sure) 0.057mile + 0.052arizona + 0.041theater + 0.037desert + 0.034middle + 0.029island + 0.028relax + 0.028san + 0.026restroom + 0.022shape
21: (club or nightclub) 0.064club + 0.063night + 0.048girl + 0.037floor + 0.037party + 0.035group + 0.033people + 0.032drink + 0.027guy + 0.025crowd
22: (brunch or lunch) 0.171wife + 0.071station + 0.058madison + 0.051brunch + 0.038pricing + 0.025sun + 0.024frequent + 0.022pastrami + 0.021doughnut + 0.016gas
23: (casino) 0.212vega + 0.103la + 0.085strip + 0.047casino + 0.040trip + 0.018aria + 0.014bay + 0.013hotel + 0.013fountain + 0.011studio
24: (service) 0.200service + 0.092star + 0.090food + 0.066place + 0.051customer + 0.039excellent + 0.035! + 0.030time + 0.021price + 0.020experience
25: (pub or fast-food) 0.254dog + 0.091hot + 0.026pub + 0.023community + 0.022cashier + 0.021way + 0.021eats + 0.020york + 0.019direction + 0.019root
26: (not sure) 0.087box + 0.040adult + 0.028dozen + 0.027student + 0.026sign + 0.025gourmet + 0.018decoration + 0.018shopping + 0.017alot + 0.016eastern
27: (bar) 0.120bar + 0.085drink + 0.050happy + 0.045hour + 0.043sushi + 0.037place + 0.035bartender + 0.023night + 0.019cocktail + 0.015menu
28: (italian food) 0.029chef + 0.027tasting + 0.024grand + 0.022caesar + 0.021amazing + 0.020linq + 0.020italian + 0.018superb + 0.016garden + 0.015al
29: (not sure) 0.064bag + 0.061attention + 0.040detail + 0.031men + 0.027school + 0.024wonderful + 0.023korean + 0.023found + 0.022mark + 0.022def
30: (mexican food) 0.122taco + 0.063bean + 0.043salsa + 0.043mexican + 0.034food + 0.032burrito + 0.029chip + 0.027rice + 0.026tortilla + 0.021corn
31: 0.096waffle + 0.057honey + 0.034cheddar + 0.032biscuit + 0.030haze + 0.025chicken + 0.024cozy + 0.022let + 0.022bring + 0.021kink
32: 0.033lot + 0.027water + 0.027area + 0.027) + 0.025door + 0.023( + 0.021space + 0.021parking + 0.017people + 0.013thing
33: 0.216line + 0.054donut + 0.041coupon + 0.030wait + 0.029cute + 0.027cooky + 0.024candy + 0.022bottom + 0.019smoothie + 0.018clothes
34: 0.090phoenix + 0.077city + 0.042downtown + 0.037gem + 0.026seating + 0.025tourist + 0.022convenient + 0.021joke + 0.020pound + 0.017tom
35: 0.072lol + 0.056mall + 0.041dont + 0.035omg + 0.034country + 0.030im + 0.029didnt + 0.028strip + 0.026real + 0.025choose
36: 0.159place + 0.036time + 0.026cool + 0.025people + 0.025nice + 0.021thing + 0.021music + 0.020friend + 0.019‘m + 0.018super
37: 0.138steak + 0.068rib + 0.063mac + 0.039medium + 0.026bf + 0.026side + 0.025rare + 0.021filet + 0.020cheese + 0.017martini
38: 0.075patio + 0.064machine + 0.055outdoor + 0.039summer + 0.038smell + 0.032court + 0.032california + 0.027shake + 0.026weather + 0.023pretzel
39: 0.124card + 0.080book + 0.079section + 0.049credit + 0.042gift + 0.040dj + 0.022pleasure + 0.019charge + 0.018fee + 0.017send
40: 0.081store + 0.073location + 0.049shop + 0.039price + 0.031item + 0.025selection + 0.023product + 0.023employee + 0.023buy + 0.020staff
41: 0.048az + 0.048dirty + 0.034forever + 0.033pro + 0.032con + 0.031health + 0.027state + 0.021heck + 0.021skill + 0.019concern
42: 0.037time + 0.028customer + 0.025call + 0.023manager + 0.023day + 0.020service + 0.018minute + 0.017phone + 0.017guy + 0.016problem
43: 0.197burger + 0.166fry + 0.038onion + 0.030bun + 0.022pink + 0.021bacon + 0.021cheese + 0.019order + 0.018ring + 0.015pickle
44: 0.069picture + 0.052movie + 0.052foot + 0.034vip + 0.031art + 0.030step + 0.024resort + 0.022fashion + 0.021repair + 0.020square
45: 0.054sum + 0.043dim + 0.042spring + 0.034diner + 0.032occasion + 0.029starbucks + 0.025bonus + 0.024heat + 0.022yesterday + 0.021lola
46: 0.071shot + 0.041slider + 0.038met + 0.038tuesday + 0.032doubt + 0.023monday + 0.022stone + 0.022update + 0.017oz + 0.017run
47: 0.152show + 0.050event + 0.046dance + 0.035seat + 0.031band + 0.029stage + 0.019fun + 0.018time + 0.015scene + 0.014entertainment
48: 0.099yelp + 0.094review + 0.031ball + 0.029star + 0.028sister + 0.022yelpers + 0.017serf + 0.016dream + 0.015challenge + 0.014‘m
49: 0.137food + 0.071place + 0.038price + 0.033lunch + 0.027service + 0.026buffet + 0.024time + 0.021quality + 0.021restaurant + 0.019eat

All right, they look pretty cohesive, which is a good sign. Now comes the manual topic naming step, where we can assign one representative keyword to each topic. This is useful when predicting the topics of new unseen reviews. I have suggested some keywords based on my instant inspiration, which you can see in the parentheses. I got bored after half of them, but I feel I made the point. You only need to set these keywords once and summarize each topic. I suggested some keywords while watching over the Kung Pao Chicken and having a beer…so my keywords may not match yours! Anyway, you get the idea.

It’s up to you how you choose the keywords: you can be broader or more precise about what interests you in the topic, or select the most frequent word in the topic and set that as the keyword, etc.

Predicting the topics of new unseen reviews

OK, now that we have the topics, let’s see how the model predicts the topics distribution for a new review:

It’s like eating with a big Italian family. Great, authentic Italian food, good advice when asked, and terrific service. With a party of 9, last minute on a Saturday night, we were sat within 15 minutes. The owner chatted with our kids, and made us feel at home. They have meat-filled raviolis, which I can never find. The Fettuccine Alfredo was delicious. We had just about every dessert on the menu. The tiramisu had only a hint of coffee, the cannoli was not overly sweet and they had this custard with wine that was so strangely good. It was an overall great experience!

The output of the predict.py file given this review is: [(0, 0.063979336376367435), (2, 0.19344804518265865), (6, 0.049013217061090186), (7, 0.31535985308065378), (8, 0.074829314265223476), (14, 0.046977300077683241), (15, 0.044438343698184689), (18, 0.09128157138884592), (28, 0.085020844956249786)]

Thus, the review is characterized mostly by topics 7 (32%) and 2 (19%). The remaining topics, such as 1, 3, 4, 5 and so on, have weights close to zero, which is why they are missing from the output. Well, what do you know, those two topics are about the service and the restaurant owner.

Another one: Either the quality has gone down or my taste buds have higher expectations than the last time I was here (about 2 years ago). Now that SF has so many delicious Italian choices where the pasta is made in-house/homemade, it was tough for me to eat the store-bought pasta. The pasta lacked texture and flavor, and even the best sauce couldn’t change my disappointment. The gnocchi tasted better, but I just couldn’t get over how cheap the pasta tasted. I discovered another spot in North Beach called X and I was really impressed with their pasta, so that’s my new go-to spot.

Distribution: [(2, 0.049949761363727557), (14, 0.67415587326751736), (28, 0.14795291772795682), (33, 0.044461283686581303), (44, 0.044349729171608801)]

Clearly, the review is about topic 14, which is Italian food.

Third time’s the charm: Really superior service in general; their reputation precedes them and they deliver. It can indeed be tough to get seating, but I find them willingly accommodating when they can be, and seating at the bar can be really enjoyable, actually. So many wonderful items to choose from, but don’t forget to save room for the over-the-top chocolate souffle; elegant and wondrous. Oh and hello, roast Maine lobster, mini quail and risotto with dungeness crab. De-lish.

[(0, 0.12795812236631765), (4, 0.25125769311344842), (8, 0.097887323141830185), (17, 0.15090844416208612), (24, 0.12415345702622631), (27, 0.067834960190092219), (35, 0.06375000000000007), (41, 0.06375000000000007)]

The topics predicted are topic 4 - seafood and topic 24 - service. Right on the money again. I just picked the first couple of topics but these can be selected based on their distribution, i.e. taking all above a set threshold.
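As a minimal sketch of the threshold idea, the snippet below keeps every topic whose probability is at or above a cut-off and sorts them by probability. The function name `topics_above` and the 0.1 threshold are illustrative assumptions, not part of predict.py; the input is the distribution printed above.

```python
def topics_above(distribution, threshold=0.1):
    """Keep (topic_id, probability) pairs at or above the threshold, most probable first."""
    selected = [(topic, prob) for topic, prob in distribution if prob >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Distribution predicted for the third review above.
distribution = [
    (0, 0.12795812236631765), (4, 0.25125769311344842),
    (8, 0.097887323141830185), (17, 0.15090844416208612),
    (24, 0.12415345702622631), (27, 0.067834960190092219),
    (35, 0.06375000000000007), (41, 0.06375000000000007),
]

top = topics_above(distribution, threshold=0.1)
print([topic for topic, _ in top])  # [4, 17, 0, 24]
```

The right threshold depends on the number of topics and on how peaked the distributions are, so it is worth tuning on a handful of reviews.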

It isn’t generally this sunny in Denmark though… Take a closer look at the topics and you’ll notice some are hard to summarize and some are overlapping. This is where a bit of LDA tweaking can improve the results.

While this method is very simple and very effective, it still needs some polishing, but that is beyond the goal of the prototype. LDA is, however, one of the main techniques used in the industry to categorize text, and for the simplest review tagging it may very well be sufficient.

Opinion spam detection - Literature review

2014-09-07T00:00:00+00:00

I have decided to write a series of posts about several findings in my thesis on opinion spam detection. This is the pilot episode, a literature review of the most significant research papers on opinion spam until now.

OK, here goes. The opinion spam problem was first formulated by Jindal and Liu in the context of product reviews (Jindal & Liu, 2008). By analyzing several million reviews from the popular Amazon.com, they showed how widespread the problem of fake reviews was. The existing detection methods can be split in two ways. First, in machine learning terms, they are either supervised or unsupervised. Second, by their features, they are behavioral, linguistic or a combination of these two. Jindal and Liu categorized spam reviews into three categories: non-reviews, brand-only reviews and untruthful reviews. They trained a logistic regression classifier using duplicate or near-duplicate reviews as positive training data, i.e. fake reviews, and the rest of the reviews as truthful reviews. They combined reviewer behavioral features with textual features and they aimed to demonstrate that the model could be generalized to detect non-duplicate review spam. This was the first documented research on the problem of opinion spam and thus did not benefit from existing training databases. The authors had to build their own dataset, and the simplest approach was to use near-duplicate reviews as examples of deceptive reviews. Although this initial model showed good results, it is still an early investigation into this problem.

(Lim, Nguyen, Jindal, Liu, & Lauw, 2010) is also an early work on detecting review spammers, which proposed scoring techniques for the spamicity degree of each reviewer. The authors tested their model on Amazon reviews, which were first taken through several data preprocessing steps. In this stage, they decided to only keep reviews from highly active users - users that had written at least 3 reviews. The detection methods are based on several predefined abnormality indicators, such as general rating deviation, early deviation (i.e. how soon after a product appears on the website a suspicious user posts a review about it) or clusters of very high/low ratings. The features were linearly combined into a spamicity formula, with the weights computed empirically in order to maximize the value of the normalized discounted cumulative gain (NDCG) measure. The measure showed how well a particular ranking improves on the overall goal. The training data was constructed, as mentioned earlier, from Amazon reviews, which were manually labeled by human evaluators. Although an agreement measure is used to compute the inter-evaluator agreement percentage, so that a review is considered fake if all of the human evaluators agree, this method of manually labeling deceptive reviews has been shown to lead to low accuracy when testing on real-life fake review data. First, (Ott, Choi, Cardie, & Hancock, 2011) demonstrated that humans cannot reliably detect fake reviews simply by reading the text. Second, (Mukherjee, Liu, & Glance, 2012) showed that not even fake reviews produced through crowdsourcing methods are valid training data because the models do not generalize well on real-life test data.
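To make the normalized discounted cumulative gain measure a bit more concrete, here is a small sketch of one common formulation (gain divided by the log2 of the rank position). It is an illustration only, not the exact variant used by Lim et al.; the function names and the binary "is a labeled spammer" relevance values are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: items further down the ranking count less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Reviewers ordered by a spamicity score; 1 = labeled as spammer, 0 = not.
good_ranking = [1, 1, 0, 0]
bad_ranking = [0, 0, 1, 1]
print(ndcg(good_ranking), ndcg(bad_ranking))
```

A ranking that places the labeled spammers first scores 1.0, and the score drops as they slide down the list, which is what makes the measure usable as a target when tuning the weights.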

(Wang, Xie, Liu, & Yu, 2012) considered the triangular relationship among stores, reviewers and their reviews. This was the first study to capture such relationships between these concepts and study their implications. They introduced 3 measures meant to do this: the reliability of the stores, the trustworthiness of the reviewers and the honesty of the reviews. Each concept depends on the other two in a circular way, i.e. a store is more reliable when it contains honest reviews written by trustworthy reviewers, and so on for the other two concepts. They proposed a heterogeneous graph-based model, called the review graph, with 3 types of nodes, each type of node being characterized by a spamicity score inferred using the other 2 types. In this way, they aimed to capture much more information about stores, reviews and reviewers than by focusing only on behavioral, reviewer-centric features. This is also the first study on store reviews, which are different from product reviews. The authors argue that when looking at product reviews, while it may be suspicious to have multiple reviews from the same person for similar products, it is ok for the same person to buy multiple similar products from the same store and write a review every time about the experience. In almost all fake product review studies which use cosine similarity as a measure of review content alikeness, a high value is considered a clear signal of cheating, since spammers do not spend much time writing new reviews all the time, but reuse the exact same words. However, when considering store reviews, it is possible for the same user to make valid purchases from similar stores, thus reusing the content of his older reviews and not writing completely different reviews all the time. (Wang, Xie, Liu, & Yu, 2012) used an iterative algorithm to rank the stores, reviewers and reviews respectively, claiming that top rankers in each of the 3 categories are suspicious. They evaluated the top 10 and bottom 10 ranked spammer reviewers using human evaluators and computed the inter-evaluator agreement. The resulting store reliability score, again for the top 10 and bottom 10 ranked stores, was evaluated by comparison with store data from the Better Business Bureau, an organization that keeps track of business reliability and possible consumer scams.
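For readers who have not met it before, the cosine similarity mentioned above can be sketched over plain word counts as below. This is a bare-bones illustration, assuming whitespace tokenization and no TF-IDF weighting; the studies cited may well preprocess the text differently.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


reused = cosine_similarity("great store fast delivery", "great store fast delivery")
unrelated = cosine_similarity("great store", "awful pasta")
```

A value close to 1 means the two reviews share almost the same words, which is the signal product review studies treat as suspicious and which, following the argument above, is less conclusive for store reviews.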

(Xie, Wang, Lin, & Yu, 2012) observed that the vast majority of reviewers (more than 90% in their study of resellerratings.com reviews up to 2010) only wrote one review, so they focused their research on this type of reviewer. They also claim, similarly to (Feng, Xing, Gogar, & Choi, 2012), that a flow of fake reviews coming from a hired spammer distorts the usual distribution of ratings for the product, leaving distributional traces behind. Xie et al. observed that the normal flow of reviews is not correlated with the given ratings over time. Fake reviews come in bursts of either very high ratings, i.e. 5-stars, or very low ratings, i.e. 1-star, so the authors aim to detect time windows in which these abnormally correlated patterns appear. They considered the number of reviews, the average rating and the ratio of singleton reviews, looking for values which stick out across different time windows. The paper makes an important contribution to opinion spam detection by being the first study to date to formulate the singleton spam review problem. Previous works have disregarded this aspect completely by purging singleton reviews from their training datasets and focusing more on tracking the activity of reviewers as they write multiple reviews. It is of course reasonable to claim that the more information is saved about a user and the more data points about a user’s activity exist, the easier it is to profile that user and assert with greater accuracy whether he is a spammer or not. Still, it is simply not negligible that a large percentage of users on review platforms write only one review.
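The three per-window signals named above (number of reviews, average rating, ratio of singleton reviews) are easy to compute once reviews are bucketed by time. The sketch below shows only that bucketing step, not the time series method of Xie et al.; the function name, the 30-day window and the `(day, rating, is_singleton)` tuple layout are all assumptions made for the example.

```python
from collections import defaultdict


def window_stats(reviews, window_days=30):
    """Map window index -> (review count, average rating, singleton ratio).

    reviews: iterable of (day, rating, is_singleton) tuples, where day is the
    number of days since the first review of the store.
    """
    buckets = defaultdict(list)
    for day, rating, is_singleton in reviews:
        buckets[day // window_days].append((rating, is_singleton))

    stats = {}
    for window, items in sorted(buckets.items()):
        count = len(items)
        avg_rating = sum(rating for rating, _ in items) / count
        singleton_ratio = sum(1 for _, single in items if single) / count
        stats[window] = (count, avg_rating, singleton_ratio)
    return stats


# Made-up toy data: a quiet first month, then a burst of 5-star reviews.
reviews = [(1, 4, False), (12, 3, False),
           (35, 5, True), (36, 5, True), (38, 5, True), (40, 5, False)]
stats = window_stats(reviews)
```

In this toy data the second window has more reviews, a higher average rating and a much higher singleton ratio than the first, which is the kind of joint jump the paper looks for.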

(Feng, Xing, Gogar, & Choi, 2012) published the first study to tackle opinion spam as a distributional anomaly problem, considering crawled data from Amazon and TripAdvisor. They claim product reviews are characterized by natural distributions which are distorted by hired spammers when writing fake reviews. Their contribution consists of first introducing the notion of a natural distribution of opinions and second of conducting a range of experiments that find a connection between distributional anomalies and the time windows when deceptive reviews were written. For the purpose of evaluation they used a gold standard dataset containing 400 known deceptive reviews written by hired people, created by (Ott, Choi, Cardie, & Hancock, 2011). Their proposed method achieves a maximum accuracy of only 72.5% on the test dataset and thus is suitable as a technique to pinpoint suspicious activity within a time window and draw attention to suspicious products or brands. On its own, however, this technique is not a complete solution where individual reviews can be deemed fake or truthful; it simply brings to the foreground short, delimited time windows where methods from other studies can be applied to detect spammers.

(Li, Huang, Yang, & Zhu, 2011) used supervised learning and manually labeled reviews crawled from Epinions to detect product review spam. They also added to the model the helpfulness scores and comments the users associated with each review. Due to the dataset size of about 60K reviews and the fact that manual labeling was required, an important assumption was made - reviews that receive fewer helpful votes from people are more suspicious. Based on this assumption, they filtered the review data accordingly, e.g. only considering reviews which have at least 5 helpfulness votes or comments. They achieved a 0.58 F-Score using their supervised model, which outperformed the heuristic methods used at that time to detect review spam. However, this result is very low when compared with that of more recent review spam detection models. The main reasons for this are the training of the model on manually labeled fake review data, as well as the initial data pre-processing step where reviews were selected based on their helpfulness votes. (Mukherjee et al., 2013) also make the assumption that deceptive reviews get fewer votes. But their model evaluation later showed that helpfulness votes not only perform poorly but may also be abused - groups of spammers working together to promote certain products may give many votes to each other's reviews. The same conclusion has also been expressed by (Lim, Nguyen, Jindal, Liu, & Lauw, 2010).

(Ott, Choi, Cardie, & Hancock, 2011) produced the first dataset of gold-standard deceptive opinion spam, employing crowdsourcing through Amazon Mechanical Turk. They demonstrated that humans cannot distinguish fake reviews by simply reading the text, the results of these experiments showing an at-chance probability. The authors found that although part-of-speech n-gram features give a fairly good prediction on whether an individual review is fake, the classifier actually performed slightly better when psycholinguistic features were added to the model. The expectation was also that truthful reviews resemble more of an informative writing style, while deceptive reviews are more similar in genre to imaginative writing. The authors coupled the part-of-speech tags in the review text which had the highest frequency distribution with the results obtained from a text analysis tool previously used to analyze deception. Testing their classifier against the gold-standard dataset, they revealed clue words deemed as signs of deceptive writing. However, this can be seen as overly simplistic, as some of these words, which according to the results have a higher probability to appear in a fake review, such as “vacation” or “family”, may as well appear in truthful reviews. The authors finally concluded that the domain context has an important role in the feature selection process. Simply put, the imagination of spammers is limited - e.g. in the case of hotel reviews, they tend not to be able to give spatial details regarding their stay. While the classifier achieved good results on the gold-standard dataset, once spammers learn about the clue words, they could simply avoid using them, thus lowering the classifier accuracy when applied to real-life data in the long term.

(Mukherjee, Liu, & Glance, 2012) were the first to try to solve the problem of opinion spam resulting from a group collaboration between multiple spammers. The method they proposed first extracts candidate groups of users using a frequent itemset mining technique. For each group, several individual and group behavioral indicators are computed, e.g. the time differences between group members when posting, the rating deviation between group members compared with the rest of the product reviewers, the number of products the group members worked together on, or review content similarities. The authors also built a dataset of fake reviews, with the help of human judges who manually labeled a number of reviews. They experimented both with learning to rank methods, i.e. ranking of groups based on their spamicity score, and with classification using SVM and logistic regression, using the labeled review data for training. The algorithm, called GSRank, considerably outperformed existing methods by achieving an area under the curve (AUC) of 95%. This score makes it a very strong candidate for production environments where the community of users is very active and each user writes more than one review. However, not many users write a lot of reviews; there is only a relatively small percentage of “elite” contributing users. So this method would best be coupled with a method for detecting singleton reviewers, such as the method from (Xie, Wang, Lin, & Yu, 2012).

(Mukherjee, Venkataraman, Liu, & Glance, 2013) questioned the validity of previous research results based on supervised learning techniques trained on fake reviews generated through Amazon Mechanical Turk (AMT). They tested the method of (Ott, Choi, Cardie, & Hancock, 2011) on known fake reviews from Yelp. The assumption was that the company had perfected its detection algorithm over the past decade and so its results should be trustworthy. Surprisingly, unlike (Ott, Choi, Cardie, & Hancock, 2011), who reported a 90% accuracy using the fake reviews generated through the AMT tool, the experiments of (Mukherjee, Venkataraman, Liu, & Glance, 2013) showed only a 68% accuracy when they tested Ott’s model on Yelp data. This led the authors to claim that any previous model trained using reviews collected through the AMT tool can only offer near chance accuracy and is useless when applied to real-life data. However, the authors do not rule out the effectiveness of using n-gram features in the model, and they showed that the highest accuracy on Yelp data was achieved using a combination of behavioral and linguistic features. Their experiments show little improvement in accuracy when adding n-gram features. Probably the most interesting conclusion is that behavioral features considerably outperform n-gram features alone.

(Mukherjee et al., 2013) built an unsupervised model called the Author Spamicity Model that aims to split the users into two clusters - truthful users and spammers. The intuition is that the two types of users are naturally separable due to the behavioral footprints left behind when writing reviews. The authors studied the distributional divergence between the two types and tested their model on real-life Amazon reviews. Most of the behavioral features in the model were already used in two earlier studies by (Mukherjee, Liu, & Glance, 2012) and (Mukherjee, Venkataraman, Liu, & Glance, 2013). In these studies though, the model was trained using supervised learning. The novelty of the method proposed in this paper is a posterior density analysis of each of the features used. This analysis is meant to validate the relevance of each model feature and also increase the knowledge of their expected values for truthful and fake reviews respectively.

(Fei et al., 2013) focused on detecting spammers that write reviews in short bursts. They represented the reviewers and the relationships between them in a graph and used a graph propagation method to classify reviewers as spammers. Classification was done using supervised learning, by employing human evaluation of the identified honest/deceptive reviewers. The authors relied on behavioral features to detect periods in time when review bursts per product coincided with reviewer bursts, i.e. a reviewer is very prolific just when the number of reviews recorded for a particular product is higher than its usual average. The authors discarded singleton reviewers from the initial dataset, since these provide little behavior information - all the model features used in the burst detection model require extensive reviewing history for each user. By discarding singleton reviewers, this method is similar to the one proposed by (Mukherjee, Liu, & Glance, 2012). These methods can thus only detect fake reviews written by elite users on a review platform. Exploiting review posting bursts is an intuitive way to obtain smaller time windows where suspicious activity occurs. This can be seen as a way to break the fake review detection problem into smaller chunks and employ other methods which then have to work with considerably fewer data points. This would decrease the computational and time complexity of the detection algorithm.

(Mukherjee, Venkataraman, Liu, & Glance, 2013) made an interesting observation in their study: the spammers caught by Yelp’s filter seem to have “overdone faking” in their attempt to sound more genuine. In their deceptive reviews, they tried to use words that appear in genuine reviews almost equally frequently, thus avoiding reusing the exact same words in their reviews. This is exactly the reason why a cosine similarity measure is not enough to catch subtle spammers in real-life scenarios, such as Yelp’s.

References

Jindal, N., & Liu, B. (2008). Opinion Spam and Analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 219–230). New York, NY, USA: ACM. doi:10.1145/1341531.1341560
Lim, E.-P., Nguyen, V.-A., Jindal, N., Liu, B., & Lauw, H. W. (2010). Detecting Product Review Spammers Using Rating Behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (pp. 939–948). New York, NY, USA: ACM. doi:10.1145/1871437.1871557
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (pp. 309–319). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2002472.2002512
Mukherjee, A., Kumar, A., Liu, B., Wang, J., Hsu, M., Castellanos, M., & Ghosh, R. (2013). Spotting Opinion Spammers Using Behavioral Footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 632–640). New York, NY, USA: ACM. doi:10.1145/2487575.2487580
Mukherjee, A., Liu, B., & Glance, N. (2012). Spotting Fake Reviewer Groups in Consumer Reviews. In Proceedings of the 21st International Conference on World Wide Web (pp. 191–200). New York, NY, USA: ACM. doi:10.1145/2187836.2187863
Mukherjee, A., Venkataraman, V., Liu, B., & Glance, N. (2013). What yelp fake review filter might be doing. In Proceedings of the International Conference on Weblogs and Social Media.
Mukherjee, A., Venkataraman, V., Liu, B., & Glance, N. (2013). Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews. UIC-CS-03-2013. Technical Report.
Wang, G., Xie, S., Liu, B., & Yu, P. S. (2012). Identify Online Store Review Spammers via Social Review Graph. ACM Trans. Intell. Syst. Technol., 3(4), 61:1–61:21. doi:10.1145/2337542.2337546
Xie, S., Wang, G., Lin, S., & Yu, P. S. (2012). Review Spam Detection via Time Series Pattern Discovery. In Proceedings of the 21st International Conference Companion on World Wide Web (pp. 635–636). New York, NY, USA: ACM. doi:10.1145/2187980.2188164
Feng, S., Xing, L., Gogar, A., & Choi, Y. (2012). Distributional Footprints of Deceptive Product Reviews. In J. G. Breslin, N. B. Ellison, J. G. Shanahan, & Z. Tufekci (Eds.), ICWSM. The AAAI Press. Retrieved from http://dblp.uni-trier.de/db/conf/icwsm/icwsm2012.html#FengXGC12
Li, F., Huang, M., Yang, Y., & Zhu, X. (2011). Learning to Identify Review Spam. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three (pp. 2488–2493). AAAI Press. doi:10.5591/978-1-57735-516-8/IJCAI11-414
Fei, G., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M., & Ghosh, R. (2013). Exploiting Burstiness in Reviews for Review Spammer Detection. In Seventh International AAAI Conference on Weblogs and Social Media.

Opinion phrases

2014-09-02T00:00:00+00:00

In Predicting what user reviews are about with LDA and gensim I played with extracting topics from short reviews and given a new review, tried to predict the most probable topic(s) it can be associated with. LDA relies on a bag-of-words model, which is a very popular document representation approach. The model disregards any syntactic dependencies between the words, i.e. any grammar, as well as word order in the documents. For a deeper read about the assumptions made by the LDA model, try to digest Blei’s paper…if you dare!

Anyway, much of the research on opinion mining used the bag-of-words model, but as Samaneh Abbasi Moghaddam also suggests in her PhD thesis Aspect-based opinion mining in online reviews, it is not clear whether this approach is actually the most effective. Instead, she experimented with an LDA model based on opinion phrases. The full details are found in her paper, but I will give a short summary of the method. In a nutshell, she concluded that bag-of-opinion-phrases models outperform the bag-of-words topic models, and that using grammatical relationships outperforms the existing preprocessing techniques in extracting aspects from user reviews.

What is an opinion phrase, anyway? An opinion phrase is defined as a pair (aspect, sentiment) like camera nice or room clean. This is the stuff that people are generally interested in when reading a review, the key points that sum up a user’s experience with the product. A very simple way to extract these pairs is to look for nouns and then pick the nearest adjective around each of them. This approach has obvious shortcomings.
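To see both the idea and its shortcomings, here is a toy sketch of the nearest-adjective heuristic. To stay self-contained it takes tokens that are already tagged by hand; the simplified `NOUN`/`ADJ` tag names and the function name are assumptions for the example, and a real pipeline would get the tags from a part-of-speech tagger.

```python
def nearest_adjective_pairs(tagged_tokens):
    """Pair every noun with the adjective closest to it in the sentence."""
    adjectives = [i for i, (_, tag) in enumerate(tagged_tokens) if tag == "ADJ"]
    pairs = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag == "NOUN" and adjectives:
            nearest = min(adjectives, key=lambda j: abs(j - i))
            pairs.append((word, tagged_tokens[nearest][0]))
    return pairs


# "The room was clean but the staff was rude" works as hoped.
easy = [("The", "DET"), ("room", "NOUN"), ("was", "VERB"), ("clean", "ADJ"),
        ("but", "CONJ"), ("the", "DET"), ("staff", "NOUN"), ("was", "VERB"),
        ("rude", "ADJ")]
print(nearest_adjective_pairs(easy))  # [('room', 'clean'), ('staff', 'rude')]

# "The battery life is not long" loses the negation and splits the compound.
hard = [("The", "DET"), ("battery", "NOUN"), ("life", "NOUN"), ("is", "VERB"),
        ("not", "ADV"), ("long", "ADJ")]
print(nearest_adjective_pairs(hard))  # [('battery', 'long'), ('life', 'long')]
```

The second sentence shows why distance alone is not enough: the heuristic reports that the battery is long, when the review says the battery life is not.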

Luckily, we are contemporary with some clever Stanford guys who made the Stanford CoreNLP tool. Given a sentence, it can extract syntactic dependencies between words, output the words’ base forms and even predict the overall sentiment of a text.

What syntactical relations should be exploited? The following grammatical dependencies are used to extract and construct the opinion phrases inside a short sentence. They are all present and explained in her PhD thesis. The Stanford typed dependencies manual explains very well the grammatical representations used below.

In order to end up with opinion phrases, basic patterns are extracted first:

Adjectival complement (acomp): The camera looks nice parsed to acomp(looks, nice).
Adjectival modifier (amod): This camera has great zoom parsed to amod(zoom, great).
“And” conjunct (conj_and): This camera has great zoom and resolution parsed to conj_and(zoom, resolution).
Copula (cop): The screen is wide parsed to cop(wide, is).
Direct object (dobj): I love the quality parsed to dobj(love, quality).
Negation modifier (neg): The battery life is not long parsed to neg(long, not).
Noun compound modifier (nn): The battery life is not long parsed to nn(life, battery).
Nominal subject (nsubj): The screen is wide parsed to nsubj(wide, screen).

The simple patterns are then combined in a tree-like manner to obtain more valuable opinion phrases. (N indicates a noun, A an adjective, V a verb, h a head term, m a modifier, and < h, m > an opinion phrase)¹.

1. amod(N, A) → < N, A >: This camera has great zoom and resolution → (zoom, great)
2. acomp(V, A) + nsubj(V, N) → < N, A >: The camera case looks nice → (case, nice)
3. cop(A, V) + nsubj(A, N) → < N, A >: The screen is wide and clear → (screen, wide)
4. dobj(V, N) + nsubj(V, N0) → < N, V >: I love the picture quality → (picture, love)
5. < h1, m > + conj_and(h1, h2) → < h2, m >: This camera has great zoom and resolution → (zoom, great), (resolution, great)
6. < h, m1 > + conj_and(m1, m2) → < h, m2 >: The screen is wide and clear → (screen, wide), (screen, clear)
7. < h, m > + neg(m, not) → < h, not + m >: The battery life is not long → (battery life, not long)
8. < h, m > + nn(h, N) → < N + h, m >: The camera case looks nice → (camera case, nice)
9. < h, m > + nn(N, h) → < h + N, m >: I love the picture quality → (picture quality, love)
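The way the combined patterns build on the basic ones can be sketched in a few lines. This is not the Java code from the repository, just a hypothetical Python illustration covering six of the patterns above (amod, cop + nsubj, both conjunction patterns, negation and the first noun compound pattern). It assumes the dependencies are already available as (relation, governor, dependent) triples, with the conjunction relation written `conj_and`.

```python
def extract_opinion_phrases(deps):
    """Build (head, modifier) opinion phrases from dependency triples."""
    def pairs(name):
        return [(gov, dep) for rel, gov, dep in deps if rel == name]

    # amod(N, A) -> <N, A>
    phrases = set(pairs("amod"))

    # cop(A, V) + nsubj(A, N) -> <N, A>
    copula_adjectives = {adj for adj, _ in pairs("cop")}
    for adj, noun in pairs("nsubj"):
        if adj in copula_adjectives:
            phrases.add((noun, adj))

    # conj_and on heads: <h1, m> -> <h2, m>; on modifiers: <h, m1> -> <h, m2>
    for first, second in pairs("conj_and"):
        for head, mod in list(phrases):
            if head == first:
                phrases.add((second, mod))
            if mod == first:
                phrases.add((head, second))

    # neg(m, not): <h, m> -> <h, not m>
    negated = {mod for mod, _ in pairs("neg")}
    phrases = {(h, ("not " + m) if m in negated else m) for h, m in phrases}

    # nn(h, N): <h, m> -> <N h, m>
    compounds = dict(pairs("nn"))
    return {((compounds[h] + " " + h) if h in compounds else h, m)
            for h, m in phrases}


# "The battery life is not long"
battery = [("cop", "long", "is"), ("nsubj", "long", "life"),
           ("neg", "long", "not"), ("nn", "life", "battery")]
print(extract_opinion_phrases(battery))  # {('battery life', 'not long')}
```

Note that the negation and compound steps replace the shorter phrase instead of keeping both, so only the most specific phrase survives.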

Of course, these syntactical relations can be improved further by looking for more combined patterns. I did not think about this too much, but I have a gut feeling there are more combined patterns out there.

Pruning the extracted patterns As you can see in the code if you look in the GitHub repository, the pruning step is quite important, as the “tree leaves” are the most significant patterns to keep, the final ones. I wanted opinion phrases like fish had and bill paid to go away, so I added the usual stopwords removal to the pruning step. This basically means eliminating the patterns which contain any stopword.
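The stopword part of the pruning can be sketched as follows. The repository code is Java and does more than this; the Python snippet below only shows the "drop any phrase containing a stopword" rule, and the tiny `STOPWORDS` set is a made-up stand-in for a real stopword list.

```python
# Illustrative stand-in for a real stopword list.
STOPWORDS = {"had", "paid", "the", "a", "is", "was"}


def prune(phrases, stopwords=STOPWORDS):
    """Drop every opinion phrase that contains at least one stopword."""
    return [phrase for phrase in phrases
            if not any(word.lower() in stopwords for word in phrase.split())]


phrases = ["fish had", "bill paid", "service great", "view Amazing"]
print(prune(phrases))  # ['service great', 'view Amazing']
```

One thing to watch when swapping in a full stopword list: many of them contain "not", which would also throw away negated phrases such as battery life not long.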

Some results Given a real-life user review: Great food and atmosphere. Plenty of TVs to watch the games. The chef and his partners just opened this great location. Pumpkin Soup with pumpkin oil and croutons is such a great start to the Fall season. Wood fired oven pumping out flatbreads. Sweet Potato gnocchi made in house with roasted corn and gorgonzola crema is unbelievable. Very impressive selection of beer handles and delicious cocktails. Amazing view of the sunset as well. Can’t wait to return.

the extracted opinion phrases were: [“atmosphere Great”, “food Great”, “location great”, “start great”, “corn roasted”, “gorgonzola crema roasted”, “selection impressive”, “view Amazing”]

Pretty good, right?

Here’s another one: This place is amazing. I come here at least once a month & am never disappointed. Food & service is always great. The buffalo burger it TDF as well as the bruschetta. Outside seating is so cute with a lights (great for date nights) live music inside is a wonderful touch. This place is great to meet with friends, family or date night.

opinion phrases: [“place amazing”, “service great”, “burger buffalo”, “seating cute”, “touch wonderful”, “seating Outside”, “place great”]

By now, it should be pretty obvious how much easier it is to aggregate by specific aspects such as place and service.
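A quick sketch of that aggregation, using the phrases extracted from the review above. It assumes the last word of each phrase is the sentiment and everything before it is the aspect, which is a simplification (it would mis-split a negated phrase like battery life not long); the function name is hypothetical.

```python
from collections import defaultdict


def group_by_aspect(phrases):
    """Collect the sentiment words mentioned for each aspect."""
    grouped = defaultdict(list)
    for phrase in phrases:
        aspect, sentiment = phrase.rsplit(" ", 1)
        grouped[aspect.lower()].append(sentiment.lower())
    return dict(grouped)


phrases = ["place amazing", "service great", "burger buffalo", "seating cute",
           "touch wonderful", "seating Outside", "place great"]
grouped = group_by_aspect(phrases)
print(grouped["place"])    # ['amazing', 'great']
print(grouped["seating"])  # ['cute', 'outside']
```

Run over many reviews of the same restaurant, the same grouping gives a per-aspect summary of what people say about the place, the service and so on.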

Another one: The mailing pack that was sent to me was very thorough and well explained,correspondence from the shop was prompt and accurate,I opted for the cheque payment method which was swift in getting to me. All in all, a fast efficient service that I had the upmost confidence in,very professionally executed and I will suggest you to my friends when there mobiles are due for recycling :-)

opinion phrases: [“correspondence, prompt”, “correspondence, accurate”, “service, efficient”]

OK, one more and that’s it: Aside from the wait to order and the other wait to get your food! I was there for a late lunch on Friday and I opted to forgo my usual salad choice and go for the eggplant parm sandwich - yum! Each bite of perfectly crusted eggplant had the most amazing tangy tomato sauce and melted cheese and it was oh so dlvine! I had it on wheat bread and finished my entire sandwich (and yes, I ordered a full size!) What a treat! Don’t try to special order at the restaurant - my boyfriend attempted to create his own sandwich and what he ended up with was nothing close to what he ordered. I can’t wait to go back to ‘make my own pizza’ - i know about that thanks to him! I have a feeling that I’ll be visiting this place quite a bit since I’m now living in the area after a recent move. It’s a good thing - well, it’s a tasty thing, maybe not so good for the waistline!

opinion phrases: [“salad choice usual”, “lunch late”, “salad choice forgo”, “tomato sauce amazing”, “eggplant crusted”, “tomato sauce tangy”, “sandwich finished”, “sandwich entire”, “size ordered”, “size full”, “sandwich create”, “order special”, “pizza make”, “move recent”, “bit visiting”, “thing good”, “thing tasty”]

In this last one, you may notice phrases like thing good, sandwich entire, sandwich finished and sandwich create, which don’t really help much in any aggregation, so it would be nice to eliminate them. This could easily, if time-consumingly, be done by shoving words like thing, entire and create into the stopwords list. Otherwise, the LDA model should filter these out, assuming not many people mention the exact same phrases.

Check out the code repository, try extracting opinion phrases by running the code yourself and don’t forget to tweet to me if you have any comments.

The code is written in Java and it requires Stanford CoreNLP 3.4, Stanford Parser, JUnit and the Mongo Java driver (if you plan to run it over many reviews stored in Mongo, because why wouldn’t you already keep the reviews in Mongo, right?). If you do not want to use Mongo, just call the run method in the Extract class, giving it the text to extract opinion phrases from.

That is basically it.

¹ Abbasi Moghaddam, Samaneh. Aspect-based opinion mining in online reviews. Diss. Applied Sciences: School of Computing Science, 2013.

Opinion spam detection through semantic similarity

2014-07-12T00:00:00+00:00

Check out my thesis on Opinion spam detection through semantic similarity and the presentation slides.
You can also take a look at the repository if you need some LaTeX inspiration.