
Curiousily

Create Dataset for Sentiment Analysis by Scraping Google Play App Reviews using Python

Deep Learning, NLP, Machine Learning, Neural Network, Sentiment Analysis, Python
2 min read


TL;DR In this tutorial, you’ll learn how to create a dataset for Sentiment Analysis by scraping user reviews for Android apps. You’ll convert the app and review information into Pandas DataFrames and save them to CSV files.

You’ll learn how to:

  • Set a goal and inclusion criteria for your dataset
  • Get real-world user reviews by scraping Google Play
  • Use Pandas to convert and save the dataset into CSV files

Setup

Let’s check the versions of the required packages and set up the imports:

# Load the watermark extension (assumes the watermark package is installed)
%load_ext watermark
%watermark -v -p pandas,matplotlib,seaborn,google_play_scraper

CPython 3.6.9
IPython 5.5.0

pandas 1.0.3
matplotlib 3.2.1
seaborn 0.10.0
google_play_scraper 0.0.2.3

import json
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

The Goal of the Dataset

You want to get feedback for your app. Both negative and positive reviews are useful. Negative ones, though, can reveal critical missing features or service downtime, especially when the same complaint shows up frequently.

Lucky for us, Google Play has plenty of apps, reviews, and scores. We can scrape app info and reviews using the google-play-scraper package.

You can choose plenty of apps to analyze. But different app categories contain different audiences, domain-specific quirks, and more. We’ll start simple.

We want apps that have been around for some time, so that opinions have accumulated organically. We want to mitigate the effect of advertising campaigns as much as possible. Apps are constantly being updated, so the time of the review is an important factor.

Ideally, you would want to collect every possible review and work with that. However, in the real world data is often limited (too large, inaccessible, etc.). So, we’ll do the best we can.
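To make the "been around some time" criterion concrete, here is a minimal sketch of a check you could run on scraped app information. The `released` and `minInstalls` fields are the ones that appear in the app info printed later in this post; the function name `meets_criteria` and both thresholds are made up for illustration, so pick values that fit your own goal:

```python
from datetime import datetime

def meets_criteria(info, today, min_age_years=2, min_installs=100_000):
    """Return True if an app looks old and popular enough to include.

    The thresholds are illustrative assumptions, not values from this tutorial.
    """
    # 'released' looks like "Nov 10, 2011" in the scraped app info
    released = datetime.strptime(info['released'], '%b %d, %Y')
    age_years = (today - released).days / 365.25
    return age_years >= min_age_years and info['minInstalls'] >= min_installs

# Values taken from the Any.do sample shown later in the post
sample_info = {'released': 'Nov 10, 2011', 'minInstalls': 10_000_000}
# A made-up app that is too new and too small
new_app_info = {'released': 'Jan 05, 2020', 'minInstalls': 5_000}

print(meets_criteria(sample_info, today=datetime(2020, 4, 10)))   # True
print(meets_criteria(new_app_info, today=datetime(2020, 4, 10)))  # False
```

Passing `today` explicitly keeps the check reproducible, which helps if you re-run the selection months later.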

Let’s choose some apps that fit the criteria from the Productivity category. We’ll use AppAnnie to select some of the top US apps:

app_packages = [
    'com.anydo',
    'com.todoist',
    'com.ticktick.task',
    'com.habitrpg.android.habitica',
    'cc.forestapp',
    'com.oristats.habitbull',
    'com.levor.liferpgtasks',
    'com.habitnow',
    'com.microsoft.todos',
    'prox.lab.calclock',
    'com.gmail.jmartindev.timetune',
    'com.artfulagenda.app',
    'com.tasks.android',
    'com.appgenix.bizcal',
    'com.appxy.planner'
]

Scraping App Information

Let’s scrape the info for each app:

app_infos = []

for ap in tqdm(app_packages):
    info = app(ap, lang='en', country='us')
    # Drop the bulky 'comments' field; pop() avoids a KeyError if the key is absent
    info.pop('comments', None)
    app_infos.append(info)

100%|██████████| 15/15 [00:02<00:00, 6.34it/s]

We got the info for all 15 apps. Let’s write a helper function that prints JSON objects a bit better:

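def print_json(json_object):
    # default=str converts values that json cannot serialize (e.g. dates) to strings
    json_str = json.dumps(
        json_object,
        indent=2,
        sort_keys=True,
        default=str
    )
    # Print the JSON with terminal syntax highlighting
    print(highlight(json_str, JsonLexer(), TerminalFormatter()))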
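Why pass `default=str`? A quick sketch, assuming the scraper returns some values (such as review timestamps) as `datetime` objects, which the review output later in this post suggests. The `review` dict below is a made-up stand-in:

```python
import json
from datetime import datetime

# A toy record with a datetime value, standing in for a scraped review
review = {'score': 1, 'at': datetime(2020, 4, 5, 22, 25, 57)}

try:
    json.dumps(review)
    serializable = True
except TypeError:
    # datetime objects are not JSON serializable by default
    serializable = False

# default=str is called for every value json cannot handle on its own
as_json = json.dumps(review, default=str, sort_keys=True)

print(serializable)  # False
print(as_json)       # {"at": "2020-04-05 22:25:57", "score": 1}
```

The trade-off is that every unsupported value becomes a plain string, which is fine for printing but loses the original type.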

Here is the information for one app from the list:

print_json(app_infos[0])
{
  "adSupported": null,
  "androidVersion": "Varies",
  "androidVersionText": "Varies with device",
  "appId": "com.anydo",
  "containsAds": null,
  "contentRating": "Everyone",
  "contentRatingDescription": null,
  "currency": "USD",
  "description": "<b>\ud83c\udfc6 Editor's Choice by Google</b>\r\n\r\nAny.do is a To Do List, Calendar, Planner, Tasks & Reminders App That Helps Over 25M People Stay Organized and Get More Done.\r\n\r\n<b>\ud83e\udd47 \"It\u2019s A MUST HAVE PLANNER & TO DO LIST APP\" (NYTimes, USA TODAY, WSJ & Lifehacker).</b>\r\n\r\nAny.do is a free to-do list, planner & calendar app for managing and organizing your daily tasks, to-do lists, notes, reminders, checklists, calendar events, grocery lists and more.\r\n\r\n\ud83d\udcc5 Organize Your Tasks & To-Do List in Seconds\r\n\r\n\u2022 ADVANCED CALENDAR & DAILY PLANNER - Keep your to-do list and calendar events always at hand with our calendar widget. Any.do to-do list & planner support daily calendar view, 3-day Calendar view, Weekly calendar view & agenda view, with built-in reminders. Review and organize your calendar events and to do list side by side.\r\n\r\n\u2022 SYNCS SEAMLESSLY - Keeps all your to do list, tasks, reminders, notes, calendar & agenda always in sync so you\u2019ll never forget a thing. Sync your phone\u2019s calendar, google calendar, Facebook events, outlook calendar or any other calendar so you don\u2019t forget an important event.\r\n\r\n\u2022 SET REMINDERS - One time reminders, recurring reminders, Location reminders & voice reminders. NEW! Easily create tasks and get reminders in WhatsApp.\r\n\r\n\u2022 WORK TOGETHER - Share your to do list and assign tasks with your friends, family & colleagues from your task list to collaborate and get more done. \r\n\r\n---\r\n\r\nALL-IN-ONE PLANNER & CALENDAR APP FOR GETTING THINGS DONE\r\nCreate and set reminders with voice to your to do list. \r\nFor better task management flow we added a calendar integration to keep your agenda always up to date. \r\nFor better productivity, we added recurring reminders, location reminders, one-time reminder, sub-tasks, notes & file attachments. \r\nTo keep your to do list up to date, we\u2019ve added a daily planner and focus mode.\r\n\r\nINTEGRATIONS\r\nAny.do To do list, Calendar, planner & Reminders Integrates with Google Calendar, Outlook, WhatsApp, Slack, Gmail, Google Tasks, Evernote, Trello, Wunderlist, Todoist, Zapier, Asana, Microsoft to-do, Salesforce, OneNote, Google Assistant, Amazon Alexa, Office 365, Exchange, Jira & More.\r\n\r\nTO DO LIST, CALENDAR, PLANNER & REMINDERS MADE SIMPLE\r\nDesigned to keep you on top of your to do list, tasks and calendar events with no hassle. With intuitive drag and drop of tasks, swiping to mark to-do's as complete, and shaking your device to remove completed from your to do list - you can stay organized and enjoy every minute of it.\r\n\r\nPOWERFUL TO DO LIST TASK MANAGEMENT\r\nAdd a to do list item straight from your email / Gmail / Outlook inbox by forwarding do@Any.do. Attach files from your computer, Dropbox, or Google Drive to your to- tasks.\r\n\r\nDAILY PLANNER & LIFE ORGANIZER\r\nAny.do is a to do list, a calendar, an inbox, a notepad, a checklist, task list, a board for post its or sticky notes, a task & project management tool, a reminder app, a daily planner, a family organizer, an agenda, a bill planner and overall the simplest productivity tool you will ever have. \r\n\r\nSHARE LISTS, ASSIGN & ORGANIZE TASKS\r\nTo plan & organize projects has never been easier. Now you can share lists between family members, assign tasks to each other, chat and much more. Any.do will help you and the people around you stay in-sync and get reminders so that you can focus on what matters, knowing you had a productive day and crossed off your to do list.\r\n\r\nGROCERY LIST & SHOPPING LIST\r\nAny.do task list, calendar, agenda, reminders & planner is also great for shopping lists at the grocery store. Simply create a list on Any.do, share it with your loved ones and see them adding their shopping items in real-time.",
  "descriptionHTML": "<b>\ud83c\udfc6 Editor&#39;s Choice by Google</b><br><br>Any.do is a To Do List, Calendar, Planner, Tasks &amp; Reminders App That Helps Over 25M People Stay Organized and Get More Done.<br><br><b>\ud83e\udd47 &quot;It\u2019s A MUST HAVE PLANNER &amp; TO DO LIST APP&quot; (NYTimes, USA TODAY, WSJ &amp; Lifehacker).</b><br><br>Any.do is a free to-do list, planner &amp; calendar app for managing and organizing your daily tasks, to-do lists, notes, reminders, checklists, calendar events, grocery lists and more.<br><br>\ud83d\udcc5 Organize Your Tasks &amp; To-Do List in Seconds<br><br>\u2022 ADVANCED CALENDAR &amp; DAILY PLANNER - Keep your to-do list and calendar events always at hand with our calendar widget. Any.do to-do list &amp; planner support daily calendar view, 3-day Calendar view, Weekly calendar view &amp; agenda view, with built-in reminders. Review and organize your calendar events and to do list side by side.<br><br>\u2022 SYNCS SEAMLESSLY - Keeps all your to do list, tasks, reminders, notes, calendar &amp; agenda always in sync so you\u2019ll never forget a thing. Sync your phone\u2019s calendar, google calendar, Facebook events, outlook calendar or any other calendar so you don\u2019t forget an important event.<br><br>\u2022 SET REMINDERS - One time reminders, recurring reminders, Location reminders &amp; voice reminders. NEW! Easily create tasks and get reminders in WhatsApp.<br><br>\u2022 WORK TOGETHER - Share your to do list and assign tasks with your friends, family &amp; colleagues from your task list to collaborate and get more done. <br><br>---<br><br>ALL-IN-ONE PLANNER &amp; CALENDAR APP FOR GETTING THINGS DONE<br>Create and set reminders with voice to your to do list. <br>For better task management flow we added a calendar integration to keep your agenda always up to date. <br>For better productivity, we added recurring reminders, location reminders, one-time reminder, sub-tasks, notes &amp; file attachments. <br>To keep your to do list up to date, we\u2019ve added a daily planner and focus mode.<br><br>INTEGRATIONS<br>Any.do To do list, Calendar, planner &amp; Reminders Integrates with Google Calendar, Outlook, WhatsApp, Slack, Gmail, Google Tasks, Evernote, Trello, Wunderlist, Todoist, Zapier, Asana, Microsoft to-do, Salesforce, OneNote, Google Assistant, Amazon Alexa, Office 365, Exchange, Jira &amp; More.<br><br>TO DO LIST, CALENDAR, PLANNER &amp; REMINDERS MADE SIMPLE<br>Designed to keep you on top of your to do list, tasks and calendar events with no hassle. With intuitive drag and drop of tasks, swiping to mark to-do&#39;s as complete, and shaking your device to remove completed from your to do list - you can stay organized and enjoy every minute of it.<br><br>POWERFUL TO DO LIST TASK MANAGEMENT<br>Add a to do list item straight from your email / Gmail / Outlook inbox by forwarding do@Any.do. Attach files from your computer, Dropbox, or Google Drive to your to- tasks.<br><br>DAILY PLANNER &amp; LIFE ORGANIZER<br>Any.do is a to do list, a calendar, an inbox, a notepad, a checklist, task list, a board for post its or sticky notes, a task &amp; project management tool, a reminder app, a daily planner, a family organizer, an agenda, a bill planner and overall the simplest productivity tool you will ever have. <br><br>SHARE LISTS, ASSIGN &amp; ORGANIZE TASKS<br>To plan &amp; organize projects has never been easier. Now you can share lists between family members, assign tasks to each other, chat and much more. Any.do will help you and the people around you stay in-sync and get reminders so that you can focus on what matters, knowing you had a productive day and crossed off your to do list.<br><br>GROCERY LIST &amp; SHOPPING LIST<br>Any.do task list, calendar, agenda, reminders &amp; planner is also great for shopping lists at the grocery store. Simply create a list on Any.do, share it with your loved ones and see them adding their shopping items in real-time.",
  "developer": "Any.do Calendar & To-Do List",
  "developerAddress": "Any.do Inc.\n\n6 Agripas Street, Tel Aviv\n6249106 ISRAEL",
  "developerEmail": "feedback+androidtodo@any.do",
  "developerId": "5304780265295461149",
  "developerInternalID": "5304780265295461149",
  "developerWebsite": "https://www.any.do",
  "free": true,
  "genre": "Productivity",
  "genreId": "PRODUCTIVITY",
  "headerImage": "https://lh3.googleusercontent.com/dZknnlk1LM8fYS3wjOvVHOmWKOGH1HAe691Yuh7LAeBj6a730A1CQqZnXxjNahAYUFFw",
  "histogram": [27291, 9246, 13735, 29904, 262997],
  "icon": "https://lh3.googleusercontent.com/zgOLUXCHkF91H8xuMTMLT17smwgLPwSBjUlKVWF-cZRFjlv-Uvtman7DiHEii54fbEE",
  "installs": "10,000,000+",
  "minInstalls": 10000000,
  "offersIAP": true,
  "price": 0,
  "privacyPolicy": "https://www.any.do/privacy",
  "ratings": 343174,
  "recentChanges": "Faster and smoother for better user experience!",
  "recentChangesHTML": "Faster and smoother for better user experience!",
  "released": "Nov 10, 2011",
  "reviews": 122170,
  "score": 4.43388,
  "screenshots": [
    "https://lh3.googleusercontent.com/C-L3_FPMlKVrZItAORaszhnQzlzMyXcqF_-oGaabHm_OnwUW1jz02BXBVSKi0HRUtQ",
    "https://lh3.googleusercontent.com/uAP6G5ANQcgVs4Uj6yrcsAo4OUhejTJRVCXOxnAVA5Efit_OtAnrOYyL1SUHj1rv",
    "https://lh3.googleusercontent.com/AI5mLFu0Atsl0km2FO9_IwJXNy_1q1_X6Ua3EVMZNedp0dsDToDRaWQ1UDvI6mb1-I0",
    "https://lh3.googleusercontent.com/bYCAn3mjgB4ugSY0PL-PCcMBfbvXCSFkzL-pLSIIbZ8sQByQPerHboPQ2fA126K4LDtU",
    "https://lh3.googleusercontent.com/u-dX4lpTepsvXs33ds4xxYpApuGS4JBAEb0UsvY_fPbptxnF0QxaKNW0-tJVXaP8a1E",
    "https://lh3.googleusercontent.com/qvUz_9IXHQd6FSLUALZo8NKLx-s4uDGyElPOGRsU28TCEficQc0BoNRloRRLqUkH2A",
    "https://lh3.googleusercontent.com/tEyGs6MGlY97ccLc4c_HxV9xNOpsvwQyHz6uGAezkVtxm1ydAaTj5EZSUgqlg69qrrk",
    "https://lh3.googleusercontent.com/StN0i2BskOs6HCfaPO0DMBOCQMCag3okWVI_SlFJtMytwbgNMBnD5i9hbSqdNlGxffmn",
    "https://lh3.googleusercontent.com/GRKqWfo-PLzCKwpgZ8fej4PGsUp1q9eM5a3LQeiYCOW-KUpCOIHXOp3mteZWbJ-pz4My",
    "https://lh3.googleusercontent.com/pFQQ_qi8u92duWCNXpEcNKpH2lVpD_hFd5f-UlTP_f6wft3YyYLMzwLitxt-UI6G8vs",
    "https://lh3.googleusercontent.com/AoeCU6bT1x0eHRvJwvQyOSKJ31oSayox959qMNVaSzz3uN9bvk1cGek5zyRDe1BdtA",
    "https://lh3.googleusercontent.com/vICme1f4J9vFt8wY3xBY-LshGgYyvSbsa4TLJyEtNsy0alUI0i9oMQVq8oJ4l_yR1Aw",
    "https://lh3.googleusercontent.com/7sn9m__iVM-peiG6_jkKBuE-QVH_xDaycF_oR1XJlwcAC45ybNZ_Exor09ENOJ41Q2U",
    "https://lh3.googleusercontent.com/9I_m2ZXgPtiU4Po4cw_cyIaEpZxynxQ1n3YkhFgakATfbu63a8_f8vGQDxKOHYITzew"
  ],
  "size": "Varies with device",
  "summary": "Task Manager \u2705 Organizer \ud83d\udcc5 Agenda \ud83d\udcdd Daily Reminders \ud83d\udd14 All-in-One Simple App.",
  "summaryHTML": "Task Manager \u2705 Organizer \ud83d\udcc5 Agenda \ud83d\udcdd Daily Reminders \ud83d\udd14 All-in-One Simple App.",
  "title": "Any.do: To do list, Calendar, Planner & Reminders",
  "updated": 1586258773,
  "url": "https://play.google.com/store/apps/details?id=com.anydo&hl=en&gl=us",
  "version": "Varies with device",
  "video": "https://www.youtube.com/embed/2nkllLD0x6o?ps=play&vq=large&rel=0&autohide=1&showinfo=0",
  "videoImage": "https://i.ytimg.com/vi/2nkllLD0x6o/hqdefault.jpg"
}

This contains lots of information, including the number of ratings, the number of reviews, and the number of ratings for each score (1 to 5). Let’s ignore all of that and have a look at their beautiful icons:

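def format_title(title):
    # Keep the part before the first ':' (or '-' if there is no ':'), capped at 10 characters
    sep_index = title.find(':') if title.find(':') != -1 else title.find('-')
    if sep_index != -1:
        title = title[:sep_index]
    return title[:10]

# 2 rows x 7 columns: with 15 apps, the last app has no slot in the grid
fig, axs = plt.subplots(2, len(app_infos) // 2, figsize=(14, 5))

for i, ax in enumerate(axs.flat):
    ai = app_infos[i]
    # Reading from a URL works with the Matplotlib 3.2.1 used here; later releases deprecate it
    img = plt.imread(ai['icon'])
    ax.imshow(img)
    ax.set_title(format_title(ai['title']))
    ax.axis('off')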

[Figure: grid of the app icons, each labeled with its shortened title]
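To see what the title shortening does, here is a small sketch that runs `format_title` (repeated so the snippet runs on its own) on the Any.do title from the sample above and on two made-up titles:

```python
def format_title(title):
    # Same logic as above: cut at ':' (or '-' if there is no ':'), then cap at 10 characters
    sep_index = title.find(':') if title.find(':') != -1 else title.find('-')
    if sep_index != -1:
        title = title[:sep_index]
    return title[:10]

# Real title from the sample app info
print(format_title('Any.do: To do list, Calendar, Planner & Reminders'))  # Any.do
# Hypothetical titles, for illustration only
print(format_title('Habit Tracker - Daily Planner'))  # Habit Trac
print(format_title('To-Do: Tasks & Notes'))           # To-Do
```

Note that a colon takes precedence over a hyphen, so a hyphen inside the name survives as long as the title also contains a colon.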

We’ll store the app information for later by converting the JSON objects into a Pandas dataframe and saving the result into a CSV file:

app_infos_df = pd.DataFrame(app_infos)
app_infos_df.to_csv('apps.csv', index=False, header=True)
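Before moving on, it’s worth a quick look at the `histogram` field from the sample app info, since it shows how skewed real ratings are. A minimal sketch, assuming the five values are the rating counts for 1 to 5 stars (which matches the reported `score`):

```python
# Values copied from the Any.do sample above; assumed order: 1 to 5 stars
histogram = [27291, 9246, 13735, 29904, 262997]

total = sum(histogram)
# Fraction of all ratings that each star level receives
shares = [count / total for count in histogram]
# Weighted average of the star levels
mean_score = sum(star * count for star, count in enumerate(histogram, start=1)) / total

print(total)
print([round(share, 3) for share in shares])
print(round(mean_score, 4))
```

The weighted mean lands within rounding distance of the reported score of 4.43388, and more than three quarters of the ratings are 5 stars. That imbalance is why sampling the same number of reviews per score matters for a sentiment dataset.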

Scraping App Reviews

In an ideal world, we would get all the reviews. But there are lots of them and we’re scraping the data. That wouldn’t be very polite. What should we do?

We want:

  • Balanced dataset - roughly the same number of reviews for each score (1-5)
  • A representative sample of the reviews for each app

We can satisfy the first requirement by using the scraping package’s option to filter reviews by score. For the second, we’ll sort the reviews by relevance (Sort.MOST_RELEVANT), which surfaces the reviews Google Play considers most helpful. Just in case, we’ll get a subset of the newest reviews, too:

app_reviews = []

for ap in tqdm(app_packages):
    for score in range(1, 6):
        for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:
            rvs, _ = reviews(
                ap,
                lang='en',
                country='us',
                sort=sort_order,
                # Request twice as many 3-star reviews as for the other scores
                count=200 if score == 3 else 100,
                filter_score_with=score
            )
            # Tag each review with how it was sampled and which app it belongs to
            for r in rvs:
                r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'
                r['appId'] = ap
            app_reviews.extend(rvs)

100%|██████████| 15/15 [00:45<00:00, 3.01s/it]
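How many reviews can this loop return at most? A back-of-the-envelope sketch using the numbers from the loop above (the variable names are just for illustration):

```python
n_apps = 15
sort_orders = 2  # most relevant and newest
# Requested count per score: 200 for 3 stars, 100 otherwise
counts = {score: 200 if score == 3 else 100 for score in range(1, 6)}

max_per_app = sort_orders * sum(counts.values())
max_total = n_apps * max_per_app

print(max_per_app)  # 1200
print(max_total)    # 18000
```

This is only an upper bound. The scraper can return fewer reviews than requested, for example when an app doesn’t have enough reviews with a given score.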

Note that we’re adding the app id and sort order to each review. Here’s an example for one:

print_json(app_reviews[0])

{
  "appId": "com.anydo",
  "at": "2020-04-05 22:25:57",
  "content": "Update: After getting a response from the developer I would change my rating to 0 stars if possible. These guys hide behind confusing and opaque terms and refuse to budge at all. I'm so annoyed that my money has been lost to them! Really terrible customer experience. Original: Be very careful when signing up for a free trial of this app. If you happen to go over they automatically charge you for a full years subscription and refuse to refund. Terrible customer experience and the app is just OK.",
  "repliedAt": "2020-04-07 14:09:03",
  "replyContent": "Our policy and TOS are completely transparent and can be found in the Help Center and our main page. In addition, a payment can only be made upon the user's authorization via the app and Google Play. We provide users with a full 7 days trial to test the app with an additional 48 hours for a refund, along with priority support for all issues.",
  "reviewCreatedVersion": "4.17.0.3",
  "score": 1,
  "sortOrder": "most_relevant",
  "thumbsUpCount": 37,
  "userImage": "https://lh3.googleusercontent.com/a-/AOh14GiHdfNEu1DwwcJ6yNyju8Yvn4JwjpzuXvD74aVmDA",
  "userName": "Andrew Thomas"
}

repliedAt and replyContent contain the developer’s response to the review. Both are missing when the developer has not replied.
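If you want to keep track of which reviews got a developer response, a flag column is enough. A minimal sketch on made-up rows, assuming missing replies come back as `None` (the `hasReply` column name is hypothetical):

```python
import pandas as pd

# Toy rows (made up) that mimic the scraped review fields
toy_reviews = [
    {'content': 'Great app', 'score': 5,
     'replyContent': None, 'repliedAt': None},
    {'content': 'Keeps crashing', 'score': 1,
     'replyContent': 'Sorry about that!', 'repliedAt': '2020-04-07 14:09:03'},
]

df = pd.DataFrame(toy_reviews)
# True where the developer replied, False where replyContent is missing
df['hasReply'] = df['replyContent'].notna()

print(df[['score', 'hasReply']])
```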

How many app reviews did we get?

len(app_reviews)

15750
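The total alone doesn’t tell us whether the dataset is balanced. Here is a sketch of how you might check the counts per score and per sort order with Pandas. The rows below are made up; with the real data you would build the DataFrame from `app_reviews` instead:

```python
import pandas as pd

# Toy rows (made up) with the three fields needed for the check
toy_reviews = [
    {'appId': 'com.anydo', 'score': 1, 'sortOrder': 'most_relevant'},
    {'appId': 'com.anydo', 'score': 1, 'sortOrder': 'newest'},
    {'appId': 'com.anydo', 'score': 3, 'sortOrder': 'newest'},
    {'appId': 'com.todoist', 'score': 3, 'sortOrder': 'most_relevant'},
    {'appId': 'com.todoist', 'score': 5, 'sortOrder': 'newest'},
]
df = pd.DataFrame(toy_reviews)

# Number of reviews for each score
per_score = df['score'].value_counts().sort_index()
# Scores as rows, sort orders as columns
per_score_and_sort = pd.crosstab(df['score'], df['sortOrder'])

print(per_score)
print(per_score_and_sort)
```

A score with far fewer rows than the others is a hint that some apps didn’t have enough reviews for it.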

Let’s save the reviews to a CSV file:

app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('reviews.csv', index=False, header=True)
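One thing to keep in mind when loading the file later: CSV stores dates as plain text. A minimal sketch of a round trip, using an in-memory buffer and made-up rows instead of reviews.csv, that restores the `at` and `repliedAt` columns as dates with `parse_dates`:

```python
import io
import pandas as pd

# Toy rows (made up) with the date fields from the review sample
toy_reviews = pd.DataFrame([
    {'appId': 'com.anydo', 'score': 1,
     'at': '2020-04-05 22:25:57', 'repliedAt': '2020-04-07 14:09:03'},
    {'appId': 'com.anydo', 'score': 5,
     'at': '2020-04-06 10:00:00', 'repliedAt': None},
])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_reviews.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the listed columns back into datetimes
restored = pd.read_csv(buffer, parse_dates=['at', 'repliedAt'])

print(restored.dtypes)
```

Missing replies survive the round trip as missing values, so a check like `restored['repliedAt'].isna()` still works after reloading.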

Summary

Well done! You now have a dataset with more than 15k user reviews from 15 productivity apps. Of course, you can go crazy and get much, much more.

You learned how to:

  • Set goals and expectations for your dataset
  • Scrape Google Play app information
  • Scrape user reviews for Google Play apps
  • Save the dataset to CSV files

Next, we’re going to use the reviews for sentiment analysis with BERT. But first, we’ll have to do some text preprocessing!
