About This Project
A comprehensive dataset of American city discourse on Reddit
What We're Building
The Reddit Data Collector is an ongoing effort to assemble one of the most complete publicly available archives of city-level conversation on Reddit across the United States. We are systematically crawling 47 major U.S. city subreddits, capturing every post and comment published between 2019 and 2025 — a seven-year window that spans some of the most transformative periods in modern American urban life.
The resulting dataset covers the COVID-19 pandemic and its aftermath, the remote work migration that reshaped housing markets, the racial justice protests of 2020, historic inflation cycles, changes in policing and public safety discourse, debates over transit, zoning, and development, the rise and fall of pandemic-era policies, and the everyday lived experience of millions of Americans in cities from New York to Honolulu.
Why City Subreddits?
City subreddits are a uniquely valuable source of naturalistic, longitudinal text data. Unlike Twitter/X (which skews toward broadcast and hot takes) or Facebook groups (which are typically closed), city subreddits are public, threaded, long-form, and deeply local. They contain everything from restaurant recommendations and sunset photos to heated debates about property taxes and public transit.
Each subreddit functions as a persistent digital town square for its city. Regular posters develop ongoing identities. Conversations reference specific streets, neighborhoods, businesses, and local politicians. The comment threads preserve genuine back-and-forth deliberation — agreements, disagreements, humor, frustration, and advice — in a way that few other platforms capture at scale.
For researchers, this makes city subreddits ideal for studying urban sentiment, migration patterns, housing affordability perceptions, pandemic response attitudes, and how communities process major events. The data is geographically anchored, temporally dense, and rich in social context.
What We Collect
For each post, we store:
- Content: Title, body text (self-posts), external URLs
- Metadata: Author, score (upvotes minus downvotes), number of comments, post flair, NSFW flag
- Temporal: Unix timestamp of creation (second-level precision)
- Identity: Reddit post ID and subreddit name
For each comment, we store:
- Content: Full comment body text
- Metadata: Author, score
- Temporal: Unix timestamp of creation
- Threading: Link ID (parent post) and parent ID (for nested reply chains)
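The threading fields make it possible to rebuild nested reply chains offline. The following is a minimal sketch, not the project's actual code. It assumes each stored comment keeps Reddit's fullname prefixes in `parent_id` (`t3_` for a post, `t1_` for a comment); the function names `build_reply_tree` and `walk_thread` are hypothetical.

```python
from collections import defaultdict


def build_reply_tree(comments):
    """Map each parent fullname to the IDs of its direct replies.

    Assumption: every comment is a dict with 'id' and 'parent_id', where
    parent_id is 't3_<post id>' for a top-level comment and
    't1_<comment id>' for a reply to another comment.
    """
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment["id"])
    return children


def walk_thread(children, post_id):
    """Yield (depth, comment_id) pairs for one post in depth-first order."""
    # Reverse before pushing so siblings are popped in their stored order.
    stack = [(0, cid) for cid in reversed(children.get(f"t3_{post_id}", []))]
    while stack:
        depth, cid = stack.pop()
        yield depth, cid
        for child in reversed(children.get(f"t1_{cid}", [])):
            stack.append((depth + 1, child))
```

Walking with an explicit stack rather than recursion avoids hitting Python's recursion limit on very deep reply chains.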
All data is stored in a normalized SQLite database with full-text indexing. Raw data is also archived as compressed JSONL files (zstd) for portability and backup.
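As an illustration of what such a layout could look like, here is a hedged sketch using Python's built-in `sqlite3` module. The table names, column names, and helper functions are assumptions made for this example, not the project's real schema, and the full-text index assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3

# Hypothetical schema: column names mirror the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT NOT NULL,
    author TEXT,
    title TEXT,
    selftext TEXT,
    url TEXT,
    score INTEGER,
    num_comments INTEGER,
    flair TEXT,
    over_18 INTEGER,
    created_utc INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS comments (
    id TEXT PRIMARY KEY,
    subreddit TEXT NOT NULL,
    author TEXT,
    body TEXT,
    score INTEGER,
    link_id TEXT NOT NULL,
    parent_id TEXT NOT NULL,
    created_utc INTEGER NOT NULL
);
CREATE VIRTUAL TABLE IF NOT EXISTS comments_fts
    USING fts5(id UNINDEXED, body);
"""


def open_db(path=":memory:"):
    """Open the database, request WAL mode, and create tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # no effect for in-memory databases
    conn.executescript(SCHEMA)
    return conn


def add_comment(conn, comment):
    """Insert a comment once and mirror its body into the full-text index."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO comments "
        "(id, subreddit, author, body, score, link_id, parent_id, created_utc) "
        "VALUES (:id, :subreddit, :author, :body, :score, "
        ":link_id, :parent_id, :created_utc)",
        comment,
    )
    if cur.rowcount == 1:  # skip the index when the row already existed
        conn.execute(
            "INSERT INTO comments_fts (id, body) VALUES (:id, :body)", comment
        )
    conn.commit()


def search_comments(conn, query):
    """Return IDs of comments whose body matches an FTS5 query, best first."""
    rows = conn.execute(
        "SELECT id FROM comments_fts WHERE comments_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Using the Reddit ID as the primary key with `INSERT OR IGNORE` keeps re-fetched records from creating duplicates.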
The 47 Cities
Our selection covers most U.S. cities with a population above approximately 500,000, plus several smaller but culturally significant cities and metros that punch above their weight in online discourse. The list spans all major regions:
- Northeast: New York, Boston, Philadelphia, Pittsburgh, Baltimore, Buffalo, Washington D.C.
- Southeast: Atlanta, Miami, Charlotte, Nashville, Memphis, New Orleans, Orlando, Jacksonville, Tampa, Raleigh, Louisville
- Midwest: Chicago, Detroit, Minneapolis, Cleveland, Columbus, Cincinnati, Indianapolis, Milwaukee, Kansas City, St. Louis, Omaha
- Southwest: Houston, Dallas, San Antonio, Austin, Phoenix, Tucson, Albuquerque, Las Vegas, Oklahoma City
- West: Los Angeles, San Francisco, San Diego, Seattle, Portland, Denver, Sacramento, Salt Lake City, Honolulu
Data Source
We use the Arctic Shift API, a community-maintained successor to the Pushshift archive, which has been closed to general public use since 2023. Arctic Shift provides free, rate-limited access to historical Reddit data without requiring Reddit API credentials or OAuth tokens.
The crawler operates with a concurrency limit of 3 simultaneous subreddit crawls, a 0.5-second delay between API requests, and exponential backoff on rate limit responses. We aim to be a respectful consumer of this community resource. No proxies or scraping tricks are used — just patient, well-behaved API calls.
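The throttling policy described above can be sketched with `asyncio` alone. This is an illustrative outline rather than the crawler's actual code: the `fetch` callable, the `RateLimited` exception, the one-second base backoff, and the retry cap are all assumptions, and the 0.5-second delay is applied per subreddit crawl here, which the text does not specify.

```python
import asyncio

MAX_CONCURRENT_SUBREDDITS = 3  # concurrency limit stated in the text
REQUEST_DELAY_SECONDS = 0.5    # delay between API requests stated in the text


class RateLimited(Exception):
    """Raised by the hypothetical fetch callable on a rate limit response."""


async def fetch_with_backoff(fetch, url, *, base_delay=1.0, max_retries=5,
                             sleep=asyncio.sleep):
    """Call fetch(url), doubling the wait after each rate limit response."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except RateLimited:
            if attempt == max_retries:
                raise
            await sleep(base_delay * 2 ** attempt)


async def crawl_subreddit(fetch, name, urls, semaphore, *,
                          delay=REQUEST_DELAY_SECONDS, sleep=asyncio.sleep):
    """Fetch one subreddit's URLs in order while holding a semaphore slot."""
    async with semaphore:
        pages = []
        for url in urls:
            pages.append(await fetch_with_backoff(fetch, url, sleep=sleep))
            await sleep(delay)  # fixed pause between consecutive requests
        return name, pages


async def crawl_all(fetch, plan, *, sleep=asyncio.sleep):
    """Crawl every subreddit in plan, at most three at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_SUBREDDITS)
    tasks = [
        crawl_subreddit(fetch, name, urls, semaphore, sleep=sleep)
        for name, urls in plan.items()
    ]
    return dict(await asyncio.gather(*tasks))
```

In a real crawler, `fetch` would wrap an `aiohttp` request and raise `RateLimited` on an HTTP 429 response. Passing `fetch` and `sleep` in as parameters keeps the throttling logic testable without network access.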
Technical Architecture
- Python 3.13
- FastAPI
- SQLite (WAL mode)
- asyncio + aiohttp
- Jinja2
- Leaflet.js
- Chart.js
- zstd compression
The system consists of two main components: an async crawler that fetches and stores data, and a dashboard that visualizes collection progress. The crawler supports pause/resume via a progress tracking table — each subreddit, year, and data type (posts or comments) is tracked independently, so interrupted crawls pick up exactly where they left off without re-downloading data.
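One way to implement this kind of checkpointing is a table keyed by subreddit, year, and data type. The sketch below is a guess at the general shape, not the project's schema: the `crawl_progress` table, its columns, and both helper functions are hypothetical, and it assumes the crawler pages forward by creation timestamp.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical progress table: one row per (subreddit, year, data type).
PROGRESS_SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_progress (
    subreddit TEXT NOT NULL,
    year      INTEGER NOT NULL,
    kind      TEXT NOT NULL CHECK (kind IN ('posts', 'comments')),
    last_utc  INTEGER,
    done      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (subreddit, year, kind)
)
"""


def save_checkpoint(conn, subreddit, year, kind, last_utc, done=False):
    """Record the newest timestamp stored so far for one crawl unit."""
    conn.execute(
        "INSERT INTO crawl_progress (subreddit, year, kind, last_utc, done) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT (subreddit, year, kind) DO UPDATE SET "
        "last_utc = excluded.last_utc, done = excluded.done",
        (subreddit, year, kind, last_utc, int(done)),
    )
    conn.commit()


def resume_point(conn, subreddit, year, kind):
    """Return the Unix timestamp to resume from, or None if the unit is done."""
    year_start = int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())
    row = conn.execute(
        "SELECT last_utc, done FROM crawl_progress "
        "WHERE subreddit = ? AND year = ? AND kind = ?",
        (subreddit, year, kind),
    ).fetchone()
    if row is None:
        return year_start  # never started: begin at January 1 of that year
    last_utc, done = row
    if done:
        return None
    return last_utc if last_utc is not None else year_start
```

Committing the checkpoint in the same database as the data means an interrupted run can restart from the last saved timestamp for each unit independently.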
The public-facing map interface uses Leaflet.js with CartoDB dark tiles. Circle markers are sized by record count and colored by crawl status. City detail panels load time series data via Chart.js and provide a paginated table viewer with direct links back to the original Reddit posts and comments.
Intended Uses
This dataset is being assembled for academic and analytical purposes. Potential applications include:
- Urban studies: Comparing discourse patterns across cities of different sizes, regions, and political leanings
- NLP research: Training or fine-tuning language models on geographically segmented, temporally rich text
- Sentiment analysis: Tracking how public opinion on housing, transit, crime, and development shifts over time
- Pandemic response: Studying how different cities experienced and discussed COVID-19 from 2020 through recovery
- Migration studies: Identifying patterns in "I just moved to..." and "I'm leaving because..." posts
- Community dynamics: Analyzing how online civic discourse evolves, who participates, and what drives engagement
Ethics and Privacy
All data collected is publicly posted content on Reddit. We do not collect private messages, user profiles, or any data behind authentication. Author usernames are stored as they appear on the platform. Deleted posts and comments (marked as [deleted] or [removed]) retain only their metadata, as the content was already removed from public view before archival.
We encourage responsible use of this dataset. It should not be used for harassment, doxxing, or targeting individuals based on their posting history.
About Applesauce
This project is part of the Applesauce platform — a suite of tools for data collection, analysis, and resale built by independent developers. Other Applesauce projects include Sell (photo-to-listing for resellers) and Pickl (food safety label analysis).