How to use Python and the Reddit API to build a local database of Reddit posts and comments — Beginner’s Guide

Brent Gaisford
May 20, 2022


Using PRAW (the Python Reddit API Wrapper) to build a local database of Reddit posts and comments — Beginner’s Guide

There are two popular Python modules for accessing Reddit data. We’re going to use PRAW rather than Pushshift for two reasons. First, PRAW lets you query Reddit itself for the most up-to-date posts and information, rather than waiting for things to be indexed and updated in Pushshift’s database (although Pushshift is a better way to get historical data). Second, PRAW lets you use all the features of the Reddit API — so if at some point you think you might want to programmatically post, comment, or otherwise interact with Reddit beyond scraping data, you’ll want to do that with PRAW. Please use that power for good, though; the world doesn’t need any more spam bots.

You can install PRAW using either conda or pip, depending on your preference. Use either of:

pip install praw
conda install -c conda-forge praw

You’ll also need the pandas module. If you don’t have that installed already, do that now as well.
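As with PRAW, you can install pandas with either installer:

pip install pandas
conda install -c conda-forge pandas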

First, let’s get access to the Reddit API

Head over to https://www.reddit.com/prefs/apps

Scroll down to the bottom of the page and click the “are you a developer? create an app…” button

Now you have a few fields to fill out:

Select script
name: Whatever you feel like
description: not required
about url: not required
redirect uri: required — you can use http://localhost:8080 to reference your local machine

After hitting submit, you’ll get a screen that shows your client_id in the top left and your secret in the middle — you’ll need both of these in a moment.

That’s all the setup we need to start using the API in read mode to pull data!

To post to Reddit, you’ll need to do some more configuration to authenticate via OAuth. We’ll cover that in a future post, but you can read about it here in the PRAW documentation as well if you want to get started right now.
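As a quick preview, PRAW’s documentation describes a password flow for script-type apps: you pass your Reddit username and password along with your client credentials when creating the Reddit instance. A minimal sketch (every value below is a placeholder) looks something like this:

import praw

reddit = praw.Reddit(
    client_id="THE CLIENT ID YOU GOT FROM REDDIT",
    client_secret="THE SECRET YOU GOT FROM REDDIT",
    username="YOUR REDDIT USERNAME",
    password="YOUR REDDIT PASSWORD",
    user_agent="A UNIQUE USER AGENT YOU SET RIGHT NOW")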

Now let’s start our data pull

Here’s the good stuff — the code. First we need to start and authorize our PRAW instance.

import praw
import csv
import pandas as pd
reddit = praw.Reddit(
    client_id="THE CLIENT ID YOU GOT FROM REDDIT",
    client_secret="THE SECRET YOU GOT FROM REDDIT",
    user_agent="A UNIQUE USER AGENT YOU SET RIGHT NOW")

You’ll need to fill in the client_id and secret that Reddit assigned you earlier, as well as create a unique name for your user agent. Here is the format PRAW recommends for your user agent, but it can be any unique name you want: "<platform>:<app ID>:<version string> (by u/<Reddit username>)"
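For example, a user agent following that convention might look like the line below (the platform, app name, and username here are made-up placeholders):

user_agent="windows:my-reddit-database:v1.0 (by u/your_username)"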

Now let’s test if you’re successfully connected by pulling some basic data about the r/redditdev subreddit:

test_subreddit = reddit.subreddit("redditdev")
print(test_subreddit.display_name)
print(test_subreddit.title)
print(test_subreddit.description)

If that code worked and returned data about r/redditdev, you’re successfully connected to the Reddit API!

Pulling all the most recent posts in a subreddit and creating a local database

The code below will create a csv file and populate it with the most recent ~900 posts from your subreddit of choice (the largest batch you can pull using PRAW). For a very active subreddit, those posts might only cover a few days to a week; for a smaller subreddit, they could span months or years. To change the subreddit you’re querying, change the value of my_subreddit in the code below.

A csv file with the posts will be created in the same directory as your code is running — to change that, change the file name and directory where it appears in the code below.

THIS WILL TAKE QUITE A WHILE TO RUN. The Reddit API only allows a limited number of requests per minute. PRAW automatically batches and paces your queries to stay under that limit, but it still means this code block could take as long as a couple of minutes to run.

Every time you run this block, it will update your local submission DB and add any new posts since the last time you ran it! There are two major pieces to this code:

  1. First, we pull all the recent posts and all the useful attributes about those posts and save the results down to a local csv.
  2. Second, we de-duplicate our database by removing posts that were already in our local csv*

*Note that we don’t have to worry about people deleting their posts when dealing with duplicates in the submissions database. If a post is deleted, it will no longer appear in an API call, so we always want the most recent information about a post to get the most up-to-date score and comment count.

my_subreddit = "redditdev"
submission_headers = ['author', 'created_utc', 'id',
                      'is_original_content', 'is_self',
                      'link_flair_text', 'locked', 'name',
                      'num_comments', 'over_18', 'permalink',
                      'score', 'selftext', 'spoiler', 'stickied',
                      'subreddit', 'title', 'upvote_ratio', 'url']
#Note: 'a' opens the file in append mode to avoid overwriting data
with open("reddit_test_submission_db.csv", 'a',
          encoding="utf-8", newline='') as f_object:
    dictwriter_object = csv.DictWriter(f_object,
                                       fieldnames=submission_headers)
    #Write the header row if the csv is brand new (empty)
    if f_object.tell() == 0:
        dictwriter_object.writeheader()
    newposts = reddit.subreddit(my_subreddit).new(limit=None)
    for post in newposts:
        #Below are all the fields we'll request from PRAW for each post
        data = {'author': post.author,
                'created_utc': post.created_utc,
                'id': post.id,
                'is_original_content': post.is_original_content,
                'is_self': post.is_self,
                'link_flair_text': post.link_flair_text,
                'locked': post.locked,
                'name': post.name,
                'num_comments': post.num_comments,
                'over_18': post.over_18,
                'permalink': post.permalink,
                'score': post.score,
                'selftext': post.selftext,
                'spoiler': post.spoiler,
                'stickied': post.stickied,
                'subreddit': post.subreddit,
                'title': post.title,
                'upvote_ratio': post.upvote_ratio,
                'url': post.url}
        dictwriter_object.writerow(data)

#Code below will delete duplicates on successive pulls
post_db = pd.read_csv("reddit_test_submission_db.csv",
                      names=submission_headers, header=0)
post_db.drop_duplicates(subset="permalink",
                        keep="last", inplace=True)
post_db.to_csv("reddit_test_submission_db.csv",
               index=False, chunksize=1000)

If you want to customize the attributes you save down for each post, check the PRAW documentation here and update the submission headers and data lines of code accordingly. That link also includes a description of all the fields we’re saving down, which you can use as a data dictionary.
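For example, if you also wanted to record whether each post was edited or distinguished by a moderator (both attributes appear in the PRAW submission docs), you could extend the headers and the data dictionary as sketched below. If you add columns, start a fresh csv so the header matches the new fields.

#Add the new columns to the header list
submission_headers += ['edited', 'distinguished']
#...then add the matching entries to the data dictionary inside the loop
data['edited'] = post.edited
data['distinguished'] = post.distinguished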

Here’s what the top five rows of your data should look like after running the code.

Pull all the comments for those posts and save them down to a comment database

Similarly to above, this block will create a csv file in the same directory where your code is running — change the file references to change that.

THIS CODE BLOCK WILL TAKE A VERY, VERY LONG TIME. Because of Reddit’s rate limit on requests per minute, this block could take as long as 3–4 hours or more to execute the first time you run it, depending on how many comments each post typically receives in your subreddit of choice.

This code block has three major sections:

  1. Create a list of the submissions for which we want to pull comments. On an ongoing basis, it’s important not to request comments we’ve already pulled; if we kept requesting all the comments for every submission in our local database, the processing time would get longer and longer each time we run the program. Instead, we check our list of submissions and build a list of only the new ones. In the code below, we find the most recent submission that already has comments in our database, take the date it was posted, subtract three days (so we don’t miss comments on posts that were still active the last time we pulled), and then request comments for every submission in our database made after that date.
  2. Use PRAW to query the Reddit API for all comments on that list of submissions. The code below requests all useful attributes for those comments and then appends them to a local csv.
  3. Delete duplicate comments from our comment database. Because of the three-day margin of error, we will pull some comments that are already in our database. However, people sometimes delete their comments; when that happens, we want to preserve the original text, so we keep the oldest instance of the comment in our data. Where a comment was not deleted, we keep the most recent pull, which captures the final upvote score.

#This block creates the list of submissions for which we want comments
comment_headers = ['author', 'body', 'created_utc',
                   'distinguished', 'edited', 'id',
                   'is_submitter', 'link_id', 'parent_id',
                   'permalink', 'saved', 'score', 'stickied',
                   'submission', 'subreddit', 'subreddit_id']
#Opening in append mode creates the comment csv if it doesn't exist yet
open('reddit_test_comment_db.csv', 'a').close()
try:
    comments_db = pd.read_csv('reddit_test_comment_db.csv',
                              usecols=["submission"],
                              names=comment_headers, header=0)
except pd.errors.EmptyDataError:
    #First run: the comment csv is still empty
    comments_db = pd.DataFrame(columns=comment_headers)
submission_db = pd.read_csv("reddit_test_submission_db.csv",
                            usecols=["created_utc", "id"])
#Filter down to submissions for which we don't yet have comments
comments_set = set(comments_db["submission"])
#259,200 is three days' worth of seconds
try:
    time_cutoff = max([created_utc for post_id, created_utc in
                       zip(submission_db.id, submission_db.created_utc)
                       if post_id in comments_set]) - 259200
except ValueError:
    #No comments pulled yet, so start from the oldest submission we have
    time_cutoff = submission_db["created_utc"].min()
submissions_to_pull = submission_db.loc[
    submission_db["created_utc"] >= time_cutoff, "id"]

#This block pulls all comments for the submissions we have identified
with open("reddit_test_comment_db.csv", 'a',
          encoding="utf-8", newline='') as f_object:
    dictwriter_object = csv.DictWriter(f_object,
                                       fieldnames=comment_headers)
    #Write the header row if the csv is brand new (empty)
    if f_object.tell() == 0:
        dictwriter_object.writeheader()
    for row in submissions_to_pull:
        submission = reddit.submission(id=row)
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            data = {'author': comment.author,
                    'body': comment.body,
                    'created_utc': comment.created_utc,
                    'distinguished': comment.distinguished,
                    'edited': comment.edited,
                    'id': comment.id,
                    'is_submitter': comment.is_submitter,
                    'link_id': comment.link_id,
                    'parent_id': comment.parent_id,
                    'permalink': comment.permalink,
                    'saved': comment.saved,
                    'score': comment.score,
                    'stickied': comment.stickied,
                    'submission': comment.submission,
                    'subreddit': comment.subreddit,
                    'subreddit_id': comment.subreddit_id}
            dictwriter_object.writerow(data)

#Now drop duplicate rows
comment_db = pd.read_csv("reddit_test_comment_db.csv",
                         names=comment_headers, header=0)
#First drop duplicates that were not edited, keeping the latest pull
comment_db.drop_duplicates(subset=["permalink", "body", "id"],
                           keep="last", inplace=True)
#Then drop duplicates that were edited or deleted, keeping the first pull
comment_db.drop_duplicates(subset="permalink",
                           keep="first", inplace=True)
comment_db.to_csv("reddit_test_comment_db.csv",
                  index=False, chunksize=1000)

If you want to customize the attributes you are saving down about each comment, check the PRAW documentation here and update the comment headers and “data” line of code accordingly. That link also includes a description of all the fields we’re saving down that you can use as a data dictionary.
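The same pattern works for comments. For instance, to also keep the rendered HTML of each comment (body_html in the PRAW comment docs), you could add:

#Add the new column to the header list
comment_headers += ['body_html']
#...then add the matching entry to the data dictionary inside the loop
data['body_html'] = comment.body_html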

Your first five rows should look like this after running the code.

That’s it: you’ve made a local database of Reddit posts and comments that you can keep up to date!

Wait a few days (or a few hours) and run your code again. That will add all the new posts and comments made in the interim to your database.

This is a great first step toward working with the Reddit API and pulling down data to use in all kinds of analysis. Congrats!

If you want to keep your database up to date over a longer period of time, one improvement would be to set up a local SQL database instead of using csv files; it will be much more reliable and accessible in the long run.
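As a rough sketch of that idea, you could load the csv files we built above into a SQLite database using Python’s built-in sqlite3 module and pandas (the database file and table names below are just placeholders):

import sqlite3
import pandas as pd

#Load the csv databases we built above
posts = pd.read_csv("reddit_test_submission_db.csv")
comments = pd.read_csv("reddit_test_comment_db.csv")

#Write them into tables in a local SQLite database file
connection = sqlite3.connect("reddit_data.sqlite")
posts.to_sql("submissions", connection, if_exists="replace", index=False)
comments.to_sql("comments", connection, if_exists="replace", index=False)
connection.close()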

This would also be a great program to add as a cron job or scheduled task. That way you can set it to run periodically in the background of your computer and keep your databases up to date without you having to run your code manually.
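On Linux or macOS, for example, a crontab entry along these lines (the interpreter and script paths are placeholders for wherever yours actually live) would run the script every night at 2 a.m.:

0 2 * * * /usr/bin/python3 /path/to/your_reddit_script.py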

And finally, you could also slim down the code above. A number of the lines are only needed the first time you run the program to create the csv file, or can be dropped entirely if you create your csv file first. You can customize your program by removing those lines going forward — but don’t feel like you have to. If you keep running the code without any tweaks, your database will keep growing over time as new posts are made.

Happy coding y’all!
