While a small dataset might fit in your main memory, not all datasets will. Sometimes you cannot read the entire dataset at once and use it for processing. This is especially true for training large deep neural networks, which typically need large amounts of training data. Datasets consisting of images, videos, audio, etc. might also be too large to fit in memory. For example, a dataset of reddit comments from June 2019 is about 164 GB after extracting the archive (it can be obtained from https://files.pushshift.io/reddit/comments/ ).
So it is essential to know how to read huge datasets lazily, i.e. only reading from the data source when the data is needed. Python generators to the rescue! A generator is a function that returns an iterator that is lazily evaluated. Since this is an iterator, we can use it in a for loop. Let’s see some examples to make the idea concrete.
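As a quick toy illustration before we move to real data (this snippet is just a sketch with made-up numbers, not part of the reddit example), here is a small generator function and a for loop consuming it:

def squares(n):
    # Nothing in this body runs until a consumer asks for the next value.
    for i in range(n):
        yield i * i

for value in squares(5):
    print(value)  # prints 0, 1, 4, 9, 16, one value per loop iteration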
For this example, I will read data from a file that contains reddit comments. The size of the file is about 164 GB, so it won’t fit in memory and we need to read the file contents on demand. The file is in the jsonlines format, i.e. each line is a reddit comment that is a valid json object. You can also download a sample jsonl file from here: https://gist.github.com/jangedoo/c8b84f9ec37b1c1970afb07e1bebfe6f .
Let’s imagine that we want to count the number of posts made by each user. To implement this, we’ll read the file line by line and, while reading, update a per-user counter every time we see a post made by that user.
import json

def read_comments(path: str):
    with open(path, "r") as f:
        for line in f:
            yield json.loads(line)

read_comments("./reddit_comments/RC_2019-06")
<generator object read_comments at 0x7f77e42b0c00>
First we implement a function that reads the file. We’ll use Python’s open function to read it. Note that the file object returned by open is also lazily evaluated: it does not read the contents of the file unless you ask for it. To ask for a line from the file, we iterate over the f object in a for loop. For every iteration of the loop, we convert the line given by f into a Python dict by calling the json.loads function and yield it. Here yield is the important part: it makes our read_comments function a generator function. If you print the result of calling the function, it should print something like <generator object read_comments at 0x7f77e42b0c00>. It didn’t read the contents of the file at all. Reading happens only when someone asks for it. Typically a for loop, map, filter, and the functions in the itertools module are the ones that consume from the generator.
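To make that laziness concrete, here is a small sketch (reusing read_comments and the same file path as above) showing that wrapping the generator in map or filter still reads nothing; lines are only read from disk when the for loop pulls items:

comments = read_comments("./reddit_comments/RC_2019-06")
authors = map(lambda c: c["author"], comments)            # nothing read yet
real_users = filter(lambda a: a != "[deleted]", authors)  # still nothing read

# Only now, as the for loop pulls items, are lines read from the file.
for author in real_users:
    print(author)
    break  # stop after the first author; the rest of the file is untouched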
You can also use the next function to ask for one item at a time, as shown below. Each time you call next(comments), the comments generator gives you the next item.
comments = read_comments("./reddit_comments/RC_2019-06")
next(comments)
{'all_awardings': [],
'author': 'bbynug',
'author_created_utc': 1524515156,
'author_flair_background_color': None,
'author_flair_css_class': None,
...
'stickied': False,
'subreddit': 'iamverysmart',
'subreddit_id': 't5_2yuej',
'subreddit_name_prefixed': 'r/iamverysmart',
'subreddit_type': 'public',
'total_awards_received': 0}
Now we have a generator object called comments from which we get one json object at a time. Our goal is to count the number of times each user has posted a comment. First, we map each comment json object to a string, the username, using the map function. Then we keep track of the number of times each user was seen using a Counter. Also, to avoid reading my full comments file while testing, I’ll only read the first 100 comments from comments using the itertools.islice function. If you are using the sample file from above, this limit will not change anything since there are only 10 lines in the entire file.
Note that map and the functions in the itertools module accept both lists and generators as inputs, and they always evaluate lazily.
from itertools import islice
from collections import Counter
limited_comments = islice(comments, 100)
users = map(lambda c: c["author"], limited_comments)
comments_counter = Counter(users)
comments_counter.most_common(10)
[('[deleted]', 5),
('Shep-Hard', 1),
('grukalar_', 1),
('garbag3acct', 1),
('TheSurgicalOne', 1),
('CousinMabel', 1),
('Vaulter1', 1),
('MoonisHarshMistress', 1),
('88CELTIC', 1),
('RoyalHealer', 1)]
Using the code above, we can safely analyze files of almost any size without getting memory errors. At any given time, only one json object is kept in memory, plus a Counter object to keep track of users and their comment counts.
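If you want counts over the whole file rather than just the first 100 comments, you can update the Counter inside a for loop so that memory use stays flat no matter how big the file is. This is a sketch reusing read_comments and the same file path; running it on the full 164 GB dump will of course take a while:

from collections import Counter

comments_counter = Counter()
for comment in read_comments("./reddit_comments/RC_2019-06"):
    # Only the current comment dict is held in memory; the Counter grows
    # with the number of distinct authors, not the number of comments.
    comments_counter[comment["author"]] += 1

print(comments_counter.most_common(10))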
While lazy evaluation is great for working with datasets that do not fit in memory, it has some limitations. Here are two important ones to keep in mind (see the short sketch after this list).
- There is no going back. Once you read an item from a generator, you cannot ask the generator for that item again. It will only give you the next item.
- You cannot use the len function to determine the number of elements present. To know how many elements a generator yields, we need to keep count of the items read ourselves until the generator no longer yields anything. And once a generator is exhausted, i.e. all items from the generator have been consumed, it cannot be used again: it will not yield any more items, and calling next on it raises a StopIteration exception. So you’ll need to recreate the generator object and read from the beginning.
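Here is a short sketch of both limitations in action, again reusing read_comments and the same file path:

comments = read_comments("./reddit_comments/RC_2019-06")
first = next(comments)   # consumed; there is no way to rewind and get it back

# len(comments) raises a TypeError, so we count by consuming the rest ourselves.
remaining = sum(1 for _ in comments)

# The generator is now exhausted; asking for more raises StopIteration.
try:
    next(comments)
except StopIteration:
    # Recreate the generator to read the file from the beginning again.
    comments = read_comments("./reddit_comments/RC_2019-06")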
Here are some links if you want to read more about generators.
Take a look at https://docs.python.org/3.7/library/itertools.html for all available functions in itertools. They are very handy and are worth learning.
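As a small taste (just a sketch, not an exhaustive tour), islice and takewhile can be combined on the comments generator, and only as much of the file as needed is read:

from itertools import islice, takewhile

comments = read_comments("./reddit_comments/RC_2019-06")

# Lazily take comments until we hit one from a deleted account, then stop;
# the rest of the file is never read.
until_deleted = takewhile(lambda c: c["author"] != "[deleted]", comments)

# Look at no more than 5 of them.
for comment in islice(until_deleted, 5):
    print(comment["author"])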