I'm trying to write a generator function (or achieve the equivalent) which takes an iterable xs in Python and counts the "runs". (This is a problem in Thinking Functionally with Haskell by Bird, which I want to translate into Python using Python's laziness features.) So
list(iter(count_runs(['a', 'a', 'b', 'c', 'a', 'd', 'd'])))
# => [(2, 'a'), (1, 'b'), (1, c'), (1, 'a'), (2, 'd')]
In Haskell it's
countRuns :: [a] -> [(Int, a)]
countRuns [] = []
countRuns x:xs = (1 + length us, x):countRuns vs
where us, vs = span (==x) xs
In Python, I'd like to write somethin like
from itertools import takewhile, dropwhile
def count_runs(xs):
# get first element x of xs, if it exists
us, vs = (takewhile(lambda y: y==x, xs),
dropwhile(lambda y: y==x, xs))
yield (1 + len(list(us)), x)
yield from count_runs(vs)
But the problem is that vs is an iterator already, so I will run into trouble if I call takewhile and dropwhile on it in the next recursion. (When I call list(takewhile(..., xs)) in the next recursion, it will get rid of the first element of dropwhile(..., xs) as well, because they're both looking at the same iterator.
How can I fix this issue, and what is the correct way to get the first element in the second line?
A significant difference between
spanandtakewhileis thattakewhileconsumes the first non-xvalue in order to determine when to stop yielding values. As a result, you'll lose any singleton items in the input; in particular,takewhileloses the firstbin producing the leading set ofas. The iterator protocol has no way of peeking at the next element of an iterator, nor does it have a way to put back an element it consumes.Instead, you'll need two independent iterators: one for
takewhileto produce the desired prefix, and another from which to drop that prefix for the recursive call.(Note that the
itertoolsdocumentation implements something similar tospanin its recipe section as abefore_and_afterfunction. It doesn't usetee, but I refer you to the actual implementation for details).)
However, most of this work is already done for you by
itertools.groupby.