Last Updated: August 12, 2019
·
3.605K
· nvie

Efficient sifting iterator for Python

Did you ever want to sift an iterator into two "sub" iterators? Now you can:

>>> even, odd = sift(xrange(10), lambda x: x % 2 == 0)
>>> list(even)
[0, 2, 4, 6, 8]
>>> list(odd)
[1, 3, 5, 7, 9]

Note that the input iterable is only lazily consumed:

>>> X = iter(xrange(10))
>>> even, odd = sift(X, lambda x: x % 2 == 0)
>>> X.next()
0
>>> X.next()
1
>>> X.next()
2
>>> list(even)
[4, 6, 8]
>>> list(odd)
[3, 5, 7, 9]

A tiny warning

Beware that in order to consume one of the returned iterators fully, you might be consuming a big part of the the input iterable and keeping its values buffered in memory. It's best to consume the largest iterable first, whenever possible:

>>> small_ones, large_ones = sift(xrange(10000), lambda x: x < 9000)

# low memory peak (no buffering needed)
>>> list(small_ones)
[0, ..., 8999]
>>> list(large_ones)
[9000, ..., 9999]

# at its peak, 9000 items are buffered in memory
>>> list(large_ones)
[9000, ..., 9999]
>>> list(small_ones)
[0, ..., 8999]

The implementation: https://gist.github.com/nvie/5804375

2 Responses
Add your response

Y U NO partition ?!

def partition(pred, iterable):
    'Use a predicate to partition entries into false entries and true entries'
    # partition(is_odd, range(10)) --> 0 2 4 6 8   and  1 3 5 7 9
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)

http://docs.python.org/dev/library/itertools.html#itertools-recipes

over 1 year ago ·

(1) "Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed."

(2) tee() actually duplicates all elements from the input, making it much less memory-efficient as you need to consume both returned iterators relatively simultaneously, in order to prevent storing the complete input in memory. My version only keeps the minimum amount of input in memory.

>>> truthy, falsey = sift(xrange(10**6), lambda x: True)
>>> for x in truthy: pass  # as long as you consume `truthy` first, this costs no memory at all

The tee() based version keeps the full input in memory until falsey is consumed, too.

over 1 year ago ·