Efficient sifting iterator for Python
Did you ever want to sift an iterator into two "sub" iterators? Now you can:
>>> even, odd = sift(xrange(10), lambda x: x % 2 == 0)
>>> list(even)
[0, 2, 4, 6, 8]
>>> list(odd)
[1, 3, 5, 7, 9]
Note that the input iterable is only lazily consumed:
>>> X = iter(xrange(10))
>>> even, odd = sift(X, lambda x: x % 2 == 0)
>>> X.next()
0
>>> X.next()
1
>>> X.next()
2
>>> list(even)
[4, 6, 8]
>>> list(odd)
[3, 5, 7, 9]
A tiny warning
Beware that in order to consume one of the returned iterators fully, you might be consuming a big part of the the input iterable and keeping its values buffered in memory. It's best to consume the largest iterable first, whenever possible:
>>> small_ones, large_ones = sift(xrange(10000), lambda x: x < 9000)
# low memory peak (no buffering needed)
>>> list(small_ones)
[0, ..., 8999]
>>> list(large_ones)
[9000, ..., 9999]
# at its peak, 9000 items are buffered in memory
>>> list(large_ones)
[9000, ..., 9999]
>>> list(small_ones)
[0, ..., 8999]
The implementation: https://gist.github.com/nvie/5804375
Written by Vincent Driessen
Related protips
2 Responses
Y U NO partition
?!
def partition(pred, iterable):
'Use a predicate to partition entries into false entries and true entries'
# partition(is_odd, range(10)) --> 0 2 4 6 8 and 1 3 5 7 9
t1, t2 = tee(iterable)
return filterfalse(pred, t1), filter(pred, t2)
http://docs.python.org/dev/library/itertools.html#itertools-recipes
(1) "Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed."
(2) tee() actually duplicates all elements from the input, making it much less memory-efficient as you need to consume both returned iterators relatively simultaneously, in order to prevent storing the complete input in memory. My version only keeps the minimum amount of input in memory.
>>> truthy, falsey = sift(xrange(10**6), lambda x: True)
>>> for x in truthy: pass # as long as you consume `truthy` first, this costs no memory at all
The tee() based version keeps the full input in memory until falsey is consumed, too.