About This Project

A comprehensive dataset of American city discourse on Reddit

What We're Building

The Reddit Data Collector is an ongoing effort to assemble one of the most complete publicly available archives of city-level conversation on Reddit across the United States. We are systematically crawling 47 major U.S. city subreddits, capturing every post and comment published between 2019 and 2025 — a seven-year window that spans some of the most transformative periods in modern American urban life.

The resulting dataset covers the COVID-19 pandemic and its aftermath, the remote work migration that reshaped housing markets, the racial justice protests of 2020, historic inflation cycles, changes in policing and public safety discourse, debates over transit, zoning, and development, the rise and fall of pandemic-era policies, and the everyday lived experience of millions of Americans in cities from New York to Honolulu.

47 City Subreddits
7 Years (2019–2025)
658 Crawl Tasks (47 subreddits × 7 years × 2 data types)
200M+ Projected Records

Why City Subreddits?

City subreddits are a uniquely valuable source of naturalistic, longitudinal text data. Unlike Twitter/X (which skews toward broadcast and hot takes) or Facebook groups (which are typically closed), city subreddits are public, threaded, long-form, and deeply local. They contain everything from restaurant recommendations and sunset photos to heated debates about property taxes and public transit.

Each subreddit functions as a persistent digital town square for its city. Regular posters develop ongoing identities. Conversations reference specific streets, neighborhoods, businesses, and local politicians. The comment threads preserve genuine back-and-forth deliberation — agreements, disagreements, humor, frustration, and advice — in a way that few other platforms capture at scale.

For researchers, this makes city subreddits ideal for studying urban sentiment, migration patterns, housing affordability perceptions, pandemic response attitudes, and how communities process major events. The data is geographically anchored, temporally dense, and rich in social context.

What We Collect

For each post, we store:

Content: Title, body text (self-posts), external URLs
Metadata: Author, score (upvotes minus downvotes), number of comments, post flair, NSFW flag
Temporal: Unix timestamp of creation (second-level precision)
Identity: Reddit post ID and subreddit name

For each comment, we store:

Content: Full comment body text
Metadata: Author, score
Temporal: Unix timestamp of creation
Threading: Link ID (parent post) and parent ID (for nested reply chains)
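
As an illustration of how the two threading fields fit together, the sketch below rebuilds nested reply chains from a flat list of comments. It assumes Reddit's fullname convention, in which a parent ID starting with t3_ refers to a post and one starting with t1_ refers to a comment. The field names (id, link_id, parent_id, created_utc, body) and the sample records are illustrative only and are not taken from the project's actual schema.

```python
from collections import defaultdict


def build_thread(comments):
    """Group comments by parent so each reply chain can be walked top-down.

    Returns a dict mapping a parent fullname ("t3_..." for the post,
    "t1_..." for a comment) to the list of its direct replies.
    """
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)
    return children


def render(children, parent, depth=0):
    """Return indented lines for every comment under `parent`, depth-first."""
    lines = []
    replies = sorted(children.get(parent, []), key=lambda c: c["created_utc"])
    for comment in replies:
        lines.append("  " * depth + comment["body"])
        # A comment's own fullname is "t1_" followed by its ID.
        lines.extend(render(children, "t1_" + comment["id"], depth + 1))
    return lines


# Hypothetical sample: one post (t3_abc) with three comments.
sample = [
    {"id": "c1", "link_id": "t3_abc", "parent_id": "t3_abc",
     "created_utc": 100, "body": "Try the taco truck on 5th."},
    {"id": "c2", "link_id": "t3_abc", "parent_id": "t1_c1",
     "created_utc": 160, "body": "Seconded."},
    {"id": "c3", "link_id": "t3_abc", "parent_id": "t3_abc",
     "created_utc": 130, "body": "Is it open Sundays?"},
]

for line in render(build_thread(sample), "t3_abc"):
    print(line)
```

Top-level comments are those whose parent ID equals the link ID; every other comment hangs off another comment, which is what makes the nested chains recoverable from flat rows.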

All data is stored in a normalized SQLite database with full-text indexing. Raw data is also archived as compressed JSONL files (zstd) for portability and backup.
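
A minimal sketch of what such a store could look like for posts is shown below. The table and column names are assumptions chosen to mirror the fields listed above, and the use of SQLite's FTS5 extension is also an assumption (the description only says "full-text indexing"). It requires a SQLite build with FTS5 enabled.

```python
import sqlite3

SCHEMA = """
CREATE TABLE posts (
    id           TEXT PRIMARY KEY,   -- Reddit post ID
    subreddit    TEXT NOT NULL,
    author       TEXT,
    title        TEXT,
    selftext     TEXT,               -- body text for self-posts
    url          TEXT,               -- external URL for link posts
    score        INTEGER,
    num_comments INTEGER,
    flair        TEXT,
    over_18      INTEGER,            -- NSFW flag stored as 0/1
    created_utc  INTEGER NOT NULL    -- Unix timestamp, seconds
);
CREATE INDEX idx_posts_sub_time ON posts (subreddit, created_utc);

-- Separate full-text index; post_id is stored but not tokenized.
CREATE VIRTUAL TABLE posts_fts USING fts5(post_id UNINDEXED, title, selftext);
"""


def insert_post(conn, post):
    """Store one post and mirror its searchable text into the FTS table."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO posts VALUES "
        "(:id, :subreddit, :author, :title, :selftext, :url, "
        ":score, :num_comments, :flair, :over_18, :created_utc)",
        post,
    )
    if cur.rowcount == 1:  # skip the FTS row when the post already existed
        conn.execute(
            "INSERT INTO posts_fts (post_id, title, selftext) "
            "VALUES (:id, :title, :selftext)",
            post,
        )


def search(conn, query):
    """Return IDs of posts matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT post_id FROM posts_fts WHERE posts_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


# Hypothetical records for demonstration only.
demo_posts = [
    {"id": "p1", "subreddit": "Seattle", "author": "user_a",
     "title": "Light rail extension approved",
     "selftext": "Council voted last night.", "url": None, "score": 120,
     "num_comments": 45, "flair": "News", "over_18": 0,
     "created_utc": 1650000000},
    {"id": "p2", "subreddit": "Austin", "author": "user_b",
     "title": "Best tacos downtown?",
     "selftext": "Also, where do people find parking?", "url": None,
     "score": 30, "num_comments": 12, "flair": None, "over_18": 0,
     "created_utc": 1650003600},
]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for post in demo_posts:
    insert_post(conn, post)
print(search(conn, "rail"))
```

Keeping the full-text index in its own table leaves the main posts table free for ordinary indexed queries by subreddit and time range.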

The 47 Cities

Our selection covers most U.S. cities with a population above approximately 500,000, plus several culturally significant cities and metros that punch above their weight in online discourse. The list spans all major regions:

Northeast: New York, Boston, Philadelphia, Pittsburgh, Baltimore, Buffalo, Washington D.C.
Southeast: Atlanta, Miami, Charlotte, Nashville, Memphis, New Orleans, Orlando, Jacksonville, Tampa, Raleigh, Louisville
Midwest: Chicago, Detroit, Minneapolis, Cleveland, Columbus, Cincinnati, Indianapolis, Milwaukee, Kansas City, St. Louis, Omaha
Southwest: Houston, Dallas, San Antonio, Austin, Phoenix, Tucson, Albuquerque, Las Vegas, Oklahoma City
West: Los Angeles, San Francisco, San Diego, Seattle, Portland, Denver, Sacramento, Salt Lake City, Honolulu

Data Source

We use the Arctic Shift API, a community-maintained alternative to the Pushshift archive, which is no longer open to the general public. Arctic Shift provides free, rate-limited access to historical Reddit data without requiring Reddit API credentials or OAuth tokens.

The crawler operates with a concurrency limit of 3 simultaneous subreddit crawls, a 0.5-second delay between API requests, and exponential backoff on rate limit responses. We aim to be a respectful consumer of this community resource. No proxies or scraping tricks are used — just patient, well-behaved API calls.
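
The throttling rules described above can be sketched with asyncio alone. This is not the project's crawler: the RateLimited exception, the function names, and the pluggable fetch callable are all hypothetical, and the 0.5-second delay is applied within each subreddit crawl, which is one possible reading of "between API requests". In practice, fetch would presumably wrap an aiohttp session call.

```python
import asyncio

MAX_CONCURRENT_CRAWLS = 3   # from the text: 3 simultaneous subreddit crawls
REQUEST_DELAY = 0.5         # from the text: 0.5 s between API requests


class RateLimited(Exception):
    """Hypothetical signal that the API returned a rate-limit response."""


async def fetch_with_backoff(fetch, url, base_delay=1.0, max_retries=5):
    """Call `fetch(url)`, doubling the wait after each rate-limit response."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except RateLimited:
            if attempt == max_retries:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before retrying.
            await asyncio.sleep(base_delay * 2 ** attempt)


async def crawl_subreddit(fetch, urls, semaphore,
                          delay=REQUEST_DELAY, base_delay=1.0):
    """Fetch one subreddit's pages in order while holding a crawl slot."""
    results = []
    async with semaphore:
        for url in urls:
            page = await fetch_with_backoff(fetch, url, base_delay=base_delay)
            results.append(page)
            await asyncio.sleep(delay)  # pause before the next request
    return results


async def crawl_all(fetch, plan, delay=REQUEST_DELAY, base_delay=1.0):
    """Crawl every subreddit in `plan` ({name: [urls]}), at most 3 at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_CRAWLS)
    jobs = [
        crawl_subreddit(fetch, urls, semaphore, delay, base_delay)
        for urls in plan.values()
    ]
    pages = await asyncio.gather(*jobs)
    return dict(zip(plan.keys(), pages))
```

A caller would run it with asyncio.run(crawl_all(fetch, plan)), where plan maps each subreddit name to the list of request URLs for that crawl. The semaphore caps how many subreddit crawls are active at once, while requests inside a crawl stay strictly sequential.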

Technical Architecture

Python 3.13
FastAPI
SQLite (WAL mode)
asyncio + aiohttp
Jinja2
Leaflet.js
Chart.js
zstd compression

The system consists of two main components: an async crawler that fetches and stores data, and a dashboard that visualizes collection progress. The crawler supports pause/resume via a progress tracking table — each subreddit, year, and data type (posts or comments) is tracked independently, so interrupted crawls pick up exactly where they left off without re-downloading data.
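
To make the pause/resume idea concrete, here is a minimal sketch of a progress table keyed by subreddit, year, and data type. The table name, column names, and helper functions are assumptions for illustration, as is the choice to store a resume timestamp per task. With 47 subreddits, 7 years, and 2 data types, the same enumeration yields the 658 crawl tasks.

```python
import sqlite3
from datetime import datetime, timezone
from itertools import product

YEARS = range(2019, 2026)        # 2019 through 2025 inclusive
KINDS = ("posts", "comments")


def year_bounds(year):
    """Return [start, end) Unix timestamps for a calendar year in UTC."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def init_progress(conn, subreddits):
    """Create one row per (subreddit, year, kind); existing rows are kept."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS crawl_progress (
               subreddit   TEXT NOT NULL,
               year        INTEGER NOT NULL,
               kind        TEXT NOT NULL,
               resume_from INTEGER NOT NULL,  -- next timestamp to fetch from
               done        INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (subreddit, year, kind)
           )"""
    )
    rows = [
        (sub, year, kind, year_bounds(year)[0])
        for sub, year, kind in product(subreddits, YEARS, KINDS)
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO crawl_progress "
        "(subreddit, year, kind, resume_from) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def save_checkpoint(conn, subreddit, year, kind, resume_from, done=False):
    """Record how far a task has progressed so a restart can continue there."""
    conn.execute(
        "UPDATE crawl_progress SET resume_from = ?, done = ? "
        "WHERE subreddit = ? AND year = ? AND kind = ?",
        (resume_from, int(done), subreddit, year, kind),
    )
    conn.commit()


def pending_tasks(conn):
    """Return unfinished tasks with the timestamp each should resume from."""
    return conn.execute(
        "SELECT subreddit, year, kind, resume_from FROM crawl_progress "
        "WHERE done = 0 ORDER BY subreddit, year, kind"
    ).fetchall()


conn = sqlite3.connect(":memory:")
init_progress(conn, ["Seattle", "Austin"])   # placeholder subreddit names
print(len(pending_tasks(conn)))              # 2 subreddits x 7 years x 2 kinds = 28
```

Because each task carries its own checkpoint, an interrupted run only needs to call pending_tasks and continue each task from its stored timestamp.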

The public-facing map interface uses Leaflet.js with CartoDB dark tiles and draws a circle marker for each city, sized by record count and colored by crawl status. City detail panels plot time series data with Chart.js and provide a paginated dataframe viewer with direct links back to the original Reddit posts and comments.

Intended Uses

This dataset is being assembled for academic and analytical purposes. Potential applications include:

Urban studies: Comparing discourse patterns across cities of different sizes, regions, and political leanings
NLP research: Training or fine-tuning language models on geographically segmented, temporally rich text
Sentiment analysis: Tracking how public opinion on housing, transit, crime, and development shifts over time
Pandemic response: Studying how different cities experienced and discussed COVID-19 from 2020 through recovery
Migration studies: Identifying patterns in "I just moved to..." and "I'm leaving because..." posts
Community dynamics: Analyzing how online civic discourse evolves, who participates, and what drives engagement

Ethics and Privacy

All data collected is publicly posted content on Reddit. We do not collect private messages, user profiles, or any data behind authentication. Author usernames are stored as they appear on the platform. Deleted posts and comments (marked as [deleted] or [removed]) retain only their metadata, as the content was already removed from public view before archival.

We encourage responsible use of this dataset. It should not be used for harassment, doxxing, or targeting individuals based on their posting history.

About Applesauce

This project is part of the Applesauce platform — a suite of tools for data collection, analysis, and resale built by independent developers. Other Applesauce projects include Sell (photo-to-listing for resellers) and Pickl (food safety label analysis).