Recently I found myself in need of some open-source information. There are multiple ways to gather information on the web. You could use Selenium to scrape the web (although sites make that more difficult every year). One of the most effective ways is to use an API, and one of the best to use is Reddit’s.
Reddit itself already gathers all sorts of information from the web. The subreddits and topics are endless, which makes it very much a one-stop shop for any topic you want to research. I really appreciated Reddit’s open-source mentality while working on this, so let me share it with you.
First, log in to your Reddit account and navigate to the apps settings page (www.reddit.com/prefs/apps). Scroll to the bottom, and in the developer section click the button to create an app.
Fill in the appropriate information. I selected ‘script’ since this is a Python script. In the redirect URL, I entered my Twitter account in case Reddit felt the need to find me. In return, you’ll get a personal use script key and that sweet super secret key.
After all that hard work, let’s jump into the code. Open your favorite IDE and import the requirements for this project. Since we’re calling an API we’ll need the requests module, and since I’m making my own JSON database I’ll need pandas.
import pandas as pd
import requests
Those are the only tools we’ll need. Next, let’s set up the variables we need to request a token from Reddit’s API; it won’t let us extract any information without one. I’ll load my personal use script key and secret key into CLIENT_ID and SECRET_KEY. These are the authentication credentials I’ll pass to Python’s requests module.
CLIENT_ID = 'script_key_here'
SECRET_KEY = 'secret_key_here'
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY)
Now, before we can use requests to get our access token from Reddit, we need to send along a data object with our username and password. For good practice, we should keep our password in a separate text file and load it into a pw variable.
with open('pw.txt', 'r') as f:
    pw = f.read().strip()  # strip the trailing newline most editors add
Our data object will then look like this:
data = {
    'grant_type': 'password',
    'username': 'your_username_here',
    'password': pw,  # the variable loaded from pw.txt, not the string "pw"
}
Great. Along with the request we’re sending, we also need a headers object.
headers = {'User-Agent': 'MyAPI/0.0.1'}
Okay, with all that we can now send our POST request to Reddit’s API. In return, we’ll save the token it gives us in a TOKEN variable and add it to our headers object, since the headers don’t carry it yet. We’ll need it to get information back from Reddit.
To make the post request and get our token:
res = requests.post('https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=headers)
From there load the token we get back into a TOKEN variable.
TOKEN = res.json()['access_token']
And add it to our header object from above.
headers['Authorization'] = f'bearer {TOKEN}'
In the end here’s how all that looks.
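Put together, the steps above might look something like the sketch below. The function name get_reddit_headers and its http parameter are hypothetical, added only for illustration: http stands for anything with a requests-style post() method, such as the requests module itself, which also lets you try the flow without hitting the network. The (client_id, secret_key) tuple is requests’ shorthand for HTTP Basic auth, so it plays the same role as the HTTPBasicAuth object above.

```python
TOKEN_URL = 'https://www.reddit.com/api/v1/access_token'


def get_reddit_headers(http, client_id, secret_key, username, password,
                       user_agent='MyAPI/0.0.1'):
    """Fetch an access token and return headers that carry it.

    `http` is a hypothetical parameter: any object with a requests-style
    .post() method, e.g. the requests module or a requests.Session.
    """
    headers = {'User-Agent': user_agent}
    data = {
        'grant_type': 'password',
        'username': username,
        'password': password,
    }
    # A (user, password) tuple is requests' shorthand for HTTP Basic auth.
    res = http.post(TOKEN_URL, auth=(client_id, secret_key),
                    data=data, headers=headers)
    res.raise_for_status()  # raise on an HTTP error status
    token = res.json()['access_token']
    headers['Authorization'] = f'bearer {token}'
    return headers
```

With the variables from earlier, the call would be headers = get_reddit_headers(requests, CLIENT_ID, SECRET_KEY, 'your_username_here', pw).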
Whew… they never said coding was easy.
Now that we have our token, we can make a GET request to Reddit’s OAuth URL. As a tech enthusiast, I’ll scope the technology subreddit. I just need to attach my headers and a parameter to limit the number of posts returned.
res = requests.get('https://oauth.reddit.com/r/technology/hot', headers=headers, params={'limit': '50'})
From here I can start building my JSON object with this information. First, let me initiate my data object much like a JSON file is formatted.
data={"results": []}
Okay, after printing out res.json() to see what Reddit returned, I know the data I want is in ['data']['children']. In particular, I’m looking for the title, image, and URL of each technology post for my own tech website. I also see that some posts in the Technology subreddit are self posts asking questions rather than actual news posts. In my for loop, I’ll add a conditional to skip those so my JSON is as clean as possible.
Time to use a for loop, and append this to my data object.
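The loop might look roughly like the sketch below. As far as I know, title, url, thumbnail, and is_self are how these values appear in Reddit’s listing JSON, but treat the field names as assumptions and check them against your own res.json() output. Using thumbnail for the image is a guess in particular. The build_results function and the sample payload are hypothetical, made up so the sketch runs without a network call.

```python
def build_results(listing):
    """Collect title, image, and URL for every non-self post in a listing."""
    data = {'results': []}
    for child in listing['data']['children']:
        post = child['data']
        # Self posts are text-only questions or discussions, so skip them.
        if post.get('is_self'):
            continue
        data['results'].append({
            'title': post.get('title'),
            'image': post.get('thumbnail'),  # assumed field for the image
            'url': post.get('url'),
        })
    return data


# Hypothetical payload shaped like a trimmed-down Reddit listing.
sample = {'data': {'children': [
    {'data': {'title': 'New chip announced',
              'url': 'https://example.com/chip',
              'thumbnail': 'https://example.com/chip.jpg',
              'is_self': False}},
    {'data': {'title': 'Which laptop should I buy?',
              'url': 'https://example.com/question',
              'thumbnail': 'self',
              'is_self': True}},
]}}

print(build_results(sample))
```

Against the real response, the call would be data = build_results(res.json()).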
Now that my data is filled out, I can write a JSON file using json.dump() from Python’s built-in json module.
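A minimal sketch of that last step is below. The function name save_results and the file name reddit_tech.json are placeholders of mine, not anything Reddit requires.

```python
import json


def save_results(data, path='reddit_tech.json'):
    """Write the collected results to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        # indent=2 keeps the file readable; ensure_ascii=False writes
        # non-ASCII characters in titles as-is instead of \u escapes.
        json.dump(data, f, indent=2, ensure_ascii=False)
```

Calling save_results(data) writes the file next to the script.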
And boom I have my own JSON file with the information I wanted straight from Reddit’s own API!