Web Scraping with Python and the Reddit API

Introduction

In this guide, we will be using Python to scrape data from Reddit. Reddit is a social news aggregation and forum-style discussion website. Registered members submit content to the site in the form of links, text posts, images, and videos, which are then “upvoted” or “downvoted” by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics including news, science, movies, video games, music, books, and almost anything else you can think of.

What is Web Scraping?

Web scraping is the process of extracting information from websites. This can be done manually by a human user or automatically by a computer program. Web scraping is a powerful tool for data collection and analysis, and it has many applications in various academic and non-academic fields.

Why Scrape Reddit?

Reddit is a popular website with a large and diverse user base from around the world. It contains a vast amount of data on a wide range of topics, making it a valuable resource for data analysis. By scraping Reddit, you can collect data on user behavior, trends, opinions, and more. This data can be used for market research, sentiment analysis, content analysis, cultural analysis, and other purposes.
Many websites provide access to their data through APIs (Application Programming Interfaces), which offer a structured way to retrieve data. Reddit has its own API that allows you to access its data programmatically, and using it is probably the most efficient way to collect Reddit data. It is important to note, however, that not all websites allow scraping, and some may have restrictions on how their data can be used. Always be sure to read and understand the terms of service of any website you plan to scrape.

Prerequisites

Before we get started, you will need to have Python installed on your computer. You can download Python from the official website (https://www.python.org).

You will also need to install the following Python library:

PRAW: The Python Reddit API Wrapper (PRAW) is a Python package that allows you to access Reddit’s API. You can install it using pip by running the following command in your terminal:
pip install praw

Lastly, you will want a text editor to write your code in. I recommend using VSCode.

Getting Started

To get started with web scraping on Reddit, you will need to create a Reddit account and obtain API credentials. Here are the steps to do this:
1. Go to the Reddit website and create an account if you don’t already have one.
2. Go to the Reddit Apps page (https://www.reddit.com/prefs/apps) and click on the “Create App” or “Create Another App” button.
3. Fill in the required fields. For the name and description, you can enter anything you like, e.g. “reddit scraper” and “This app scrapes recent subreddit titles”, respectively. You can leave the “about URL” field blank. For the redirect URI, enter http://localhost:8080. For the app type, select “script”, since this is a personal use script.
4. Click on the “Create App” button to create your app.
After creating your app, you will see a page with your app’s “client ID” (the string of characters underneath the app name and the words “personal use script”) and “client secret.” You will need these credentials to authenticate your app when accessing the Reddit API.

Building the Reddit Scraper

Now that you have your Reddit account and API credentials, you can start writing a Python script to scrape data from Reddit. As an example, let’s create a scraper that retrieves the most recent post titles from a specific subreddit.
Create a new Python script (e.g., reddit_scraper.py) and import the praw library first:
import praw
Next, we want to create a new function that utilizes our credentials to access the Reddit API:
def connect_to_reddit():
    # Create a Reddit instance using your API credentials
    reddit = praw.Reddit(
        client_id='your_client_id',          # the string under your app's name
        client_secret='your_client_secret',  # the "secret" value on your app's page
        user_agent='u/your_username'         # identifies your script to Reddit
    )
    return reddit
This function creates a new Reddit instance using your client ID, client secret, and user agent. Make sure to replace the placeholder client ID and secret with the values Reddit provided you. The user agent is a unique identifier that helps Reddit determine the source of the API request; for this, you can simply use your Reddit username.
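To quickly check that your credentials work, you can print the Reddit instance’s read_only attribute (a minimal sanity check; since we supply only a client ID and secret, with no Reddit username or password, PRAW connects in read-only mode, which is all we need for scraping public posts):
reddit = connect_to_reddit()
print(reddit.read_only)  # Prints True, since we have not logged in as a specific user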
Now, let’s create a function that retrieves the most recent post titles from a specific subreddit:
def get_recent_post_titles(subreddit_name, post_limit=10):
    reddit = connect_to_reddit()                  # Connect to the Reddit API
    subreddit = reddit.subreddit(subreddit_name)  # Choose the subreddit

    # Fetch the newest posts, up to post_limit
    recent_posts = subreddit.new(limit=post_limit)

    # Extract just the title from each post
    post_titles = [post.title for post in recent_posts]
    return post_titles
This function takes the name of a subreddit and an optional post limit as input, connects to Reddit using our credentials, retrieves the most recent posts from the specified subreddit, and returns a list of post titles.
Next, we’ll want to call this function with the desired subreddit name and post limit:
if __name__ == "__main__":
    subreddit_name = 'dogs'  # The subreddit to scrape
    post_limit = 5           # How many recent posts to retrieve

    titles = get_recent_post_titles(subreddit_name, post_limit)
    print(f"Most recent {post_limit} post titles from r/{subreddit_name}:")

    # Print each title with its index number
    for idx, title in enumerate(titles, 1):
        print(f"{idx}. {title}")
Let’s break this last part down. First, we specify the name of the subreddit we want to scrape (subreddit_name) and the number of recent posts we want to retrieve (post_limit). We then call the get_recent_post_titles function with these parameters and store the returned list of post titles in the titles variable. Finally, we print out the post titles with their corresponding index numbers using a for loop.
When you run the code in this example, you should see the 5 most recent post titles from the “dogs” subreddit printed to the console.
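To run the script, save the file and execute it from your terminal:
python reddit_scraper.py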

What Next?

This is just a simple example of how you can scrape data from Reddit using Python. There are many other ways to interact with the Reddit API and extract different types of data. You can explore the PRAW documentation to learn more about the capabilities of the library and how to use it effectively.
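For example, here is a small sketch (reusing our connect_to_reddit function; the function name and parameters here are just illustrative) that uses PRAW’s top() method to fetch a subreddit’s highest-scoring posts from the past week, along with each post’s score and comment count:
def get_top_posts_this_week(subreddit_name, post_limit=5):
    reddit = connect_to_reddit()
    subreddit = reddit.subreddit(subreddit_name)

    # top() accepts a time_filter of 'hour', 'day', 'week', 'month', 'year', or 'all'
    for post in subreddit.top(time_filter='week', limit=post_limit):
        print(f"{post.score} points | {post.num_comments} comments | {post.title}")
Besides title, each post object exposes attributes such as score, num_comments, url, and selftext, which you can mix and match depending on what data you want to collect.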
In addition, there are many other Python libraries and tools available for web scraping, such as Beautiful Soup (for parsing HTML) and Scrapy (a full crawling framework).
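As a minimal sketch of that approach (using requests and Beautiful Soup, installed with pip install requests beautifulsoup4, and https://example.com as a stand-in URL), you would download a page’s HTML yourself and then parse it:
import requests
from bs4 import BeautifulSoup

# Download a page and parse its HTML
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Print the text of every link on the page
for link in soup.find_all('a'):
    print(link.get_text())
This approach works on sites without an API, but remember that the same terms-of-service caveats apply.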
To make your results more interesting than just printing text to the console, you can also combine web scraping with data analysis libraries like Pandas and visualization libraries like Matplotlib or Seaborn to gain insights from the data you collect.
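As one illustrative sketch (assuming you have installed both libraries with pip install pandas matplotlib), you could load the scraped titles into a Pandas DataFrame and plot a simple histogram of their lengths:
import pandas as pd
import matplotlib.pyplot as plt

# Scrape a larger sample of recent titles
titles = get_recent_post_titles('dogs', post_limit=100)

# Load the titles into a DataFrame and compute each title's length
df = pd.DataFrame({'title': titles})
df['length'] = df['title'].str.len()

# Plot a histogram of title lengths
df['length'].plot(kind='hist', bins=20, title='Post title lengths in r/dogs')
plt.xlabel('Characters')
plt.show()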
To learn more Python tips and tricks or discuss your own projects with fellow students, feel free to join the Python User’s Group (PUG) on the CUNY Commons.
Happy scraping!