Lambda function with itertools count() and groupby()
Can someone please explain the groupby operation and the lambda function being used on this SO post?
key=lambda k, line=count(): next(line) // chunk
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # Debug output. Print only the key: calling list(group) here would
            # exhaust the group before the lines are written out below.
            print(k)
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.
Martijn explained it really well, however I have a follow-up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning count() to the variable line just once, outside the function.
line = count()
groups = groupby(datafile, key=lambda k, line: next(line) // chunk)
and it resulted in
TypeError: <lambda>() missing 1 required positional argument: 'line'
Calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by groupby.
groups = groupby(datafile, key=lambda k: next(count()) // chunk)
I'm learning Python on my own, so any help or pointers to reference materials or PyCon talks are much appreciated. Anything really!
Author: theguyoverthere. Posted: 27 December 2017.
itertools.count() is an infinite iterator of increasing integer numbers.
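The lambda stores a count() instance as the default value of its line keyword argument. That default is evaluated only once, when the lambda is defined, so every time the lambda is called the local variable line references the same object.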
next() advances an iterator, retrieving the next value:
>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3
So the lambda retrieves the next count in the sequence with next(line) and divides that value by chunk using floor division (keeping only the integer portion). The k argument is ignored.
Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc.:
>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.
When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and that the results are grouped by; the code uses it only to name the output file. group is an iterable of lines from datafile, and will contain chunk lines, except possibly the last group, which holds whatever lines remain.
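To see the grouping in isolation, here is a minimal sketch that runs the same key function over a small in-memory list instead of a file. The names `lines` and `chunked` are made up for illustration, and `chunk` is set to 3 only to keep the output short.

```python
from itertools import count, groupby

chunk = 3
# Seven stand-in "lines"; a real file object would be iterated the same way.
lines = ["line %d\n" % i for i in range(7)]

# The default argument is created once, so each call advances the same counter.
key = lambda k, line=count(): next(line) // chunk

# Materialise each group right away: groupby() groups share the underlying
# iterator and are no longer usable once the outer loop advances.
chunked = [(k, list(group)) for k, group in groupby(lines, key=key)]

for k, group in chunked:
    print(k, group)
```

The last group holds a single line, because seven lines do not divide evenly into groups of three.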
In response to the updated OP...
The itertools.groupby iterator offers ways to group items together, giving more control when a key function is defined.
A lambda function is a functional, shorthand way of writing a regular function. For example:
>>> keyfunc = lambda k, line=count(): next(line) // chunk
Is equivalent to this regular function:
>>> def keyfunc(k, line=count()):
...     return next(line) // chunk
Keywords: iterator, functional programming, anonymous functions
"Why is line=count() passed as an argument to the lambda function every time?"
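The reason is the same as for normal functions. The line parameter by itself is a required positional parameter. groupby() calls the key function with a single argument (the item), so nothing is ever supplied for line, hence the TypeError. When a default value is assigned in the signature, the parameter becomes optional, and that default is evaluated once, when the function is defined, not on every call.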
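The difference between the two signatures can be checked directly. The sketch below uses hypothetical names (`needs_two`, `with_default`, `stored_counter`) and a small `chunk` so the results are easy to read.

```python
from itertools import count, groupby

chunk = 2
data = ["a", "b", "c", "d"]

# No default: `line` is required, but groupby() calls key(item) with one argument.
needs_two = lambda k, line: next(line) // chunk
try:
    list(groupby(data, key=needs_two))  # list() forces the lazy iterator to run
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# With a default: count() is evaluated once, at definition time, and the
# resulting object is stored on the function and reused by every call.
with_default = lambda k, line=count(): next(line) // chunk
stored_counter = with_default.__defaults__[0]
keys = [with_default(item) for item in data]
print(keys)
```

Inspecting `__defaults__` shows that the counter lives on the function object itself, which is why its state survives between calls.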
You can still create the counter outside the function with line = count(), as long as you bind the result as the default value of a keyword argument:
>>> chunk = 3
>>> line = count()
>>> keyfunc = lambda k, line=line: next(line) // chunk  # make `line` a keyword arg
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]  # note `count()` continues
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
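If the stateful default feels too implicit, one alternative (not from the original post) is to number the items with enumerate() and derive the key from that index. A minimal sketch, where `chunks_of` is a hypothetical helper name:

```python
from itertools import groupby

def chunks_of(iterable, chunk):
    """Yield lists of up to `chunk` consecutive items from `iterable`."""
    # enumerate() pairs each item with its position, so the key function
    # needs no hidden counter: position // chunk is the chunk number.
    numbered = enumerate(iterable)
    for _, group in groupby(numbered, key=lambda pair: pair[0] // chunk):
        # Drop the index again and keep only the original items.
        yield [item for _, item in group]

result = list(chunks_of("abcdefg", 3))
print(result)
```

Each call to `chunks_of` starts numbering from zero, so there is no shared counter that keeps running between uses.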
"... calling next on count() directly within the lambda expression, resulted in all the lines in the input file getting bunched together i.e. a single key was generated by groupby"
Try the following experiment with next():
>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2
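As expected, next() yields the next item from the count() iterator. (A for loop does the same thing implicitly when it iterates over an iterator.) The important point is that an iterator does not reset: next() simply gives the next item in line, as seen in the former example. Writing next(count()) inside the lambda, however, creates a brand-new count() on every call, so next() always returns 0. Every line therefore gets the key 0 // chunk == 0, and groupby puts them all in one group.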
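The single-key result from the question can be reproduced in a few lines. In this sketch (names such as `fresh_each_call` are made up for illustration) the key function builds a new count() on every call:

```python
from itertools import count, groupby

chunk = 3
data = list("abcdefg")

# A new counter is built on every call, so next() always returns its first value.
fresh_each_call = lambda k: next(count()) // chunk
keys = [fresh_each_call(item) for item in data]
print(keys)

# With a constant key, groupby() has nothing to split on: one group, all items.
groups = [(k, list(g)) for k, g in groupby(data, key=fresh_each_call)]
print(groups)
```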
As @Martijn Pieters pointed out, next(line) // chunk computes a floored integer that groupby uses as an id for each line, bunching consecutive lines with the same id together, which is also expected. See the references for more on how groupby works.
- Docs for itertools.count
- Docs for itertools.groupby
- Beazley, D. and Jones, B. "7.7 Capturing Variables in Anonymous Functions," Python Cookbook, 3rd ed. O'Reilly. 2013.