Last Updated: February 25, 2016
· jeromebanks

Scalable Reach Estimates with Sketch Sets

You can calculate an approximate number of uniques (aka estimated reach) of a data set in Hive by using the sketch_set UDFs from Brickhouse ( http://github.com/klout/brickhouse ):

select estimated_reach( sketch_set( cookie ) )
from weblogs;

instead of

select count( distinct cookie )
from weblogs;
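
The same pattern should carry over to grouped queries, where an exact count( distinct ... ) per group tends to be the expensive part. The following is only a sketch: it assumes the Brickhouse functions are already registered in your Hive session, and the `dt` column is a hypothetical date column that may not exist in your weblogs table.

```sql
-- Approximate uniques per day.
-- Assumption: weblogs has a date column named dt (hypothetical).
-- Assumption: sketch_set and estimated_reach are already registered.
select dt,
       estimated_reach( sketch_set( cookie ) ) as approx_uniques
from weblogs
group by dt;
```

The result is an estimate, so expect it to differ slightly from the exact count( distinct cookie ) for each group.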