Scalable Reach Estimates with Sketch Sets
You can calculate an approximate number of uniques (also known as estimated reach) for a data set in Hive by using the sketch_set UDFs from Brickhouse (http://github.com/klout/brickhouse):
select estimated_reach( sketch_set( cookie ) )
from weblogs;
instead of the exact count:
select count( distinct cookie )
from weblogs;
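To give a feel for why the sketch version scales, here is a minimal Python sketch of a K-minimum-values estimator, the general technique that sketch sets are based on. It is an illustration under assumptions, not Brickhouse's actual code: the function names mirror the UDF names only for readability, and the MD5 hash, the default k of 5000, and the (k - 1) / max_hash estimator are choices made for this example.

```python
import hashlib
import heapq


def hash_to_unit(value):
    """Map a string to a pseudo-random float in [0, 1] using MD5 (assumed hash)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def sketch_set(values, k=5000):
    """Return the k smallest distinct hash values seen, in ascending order."""
    heap = []        # max-heap (negated hashes) holding the current k smallest
    members = set()  # same hashes, for duplicate detection
    for v in values:
        h = hash_to_unit(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    return sorted(members)


def merge_sketches(a, b, k=5000):
    """Combine two sketches; the result equals the sketch of the combined data."""
    return sorted(set(a) | set(b))[:k]


def estimated_reach(sketch, k=5000):
    """Estimate the number of distinct values from a sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k uniques: the count is exact
    # The k-th smallest hash of n uniform values sits near k / n.
    return (k - 1) / sketch[-1]


if __name__ == "__main__":
    cookies = ["cookie-%d" % (i % 100000) for i in range(300000)]
    print(estimated_reach(sketch_set(cookies)))
```

Each sketch holds at most k hashes regardless of input size, and sketches merge cheaply, so partial results can be combined across mappers or across days without revisiting the raw rows. A count( distinct ) has no such mergeable summary.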
Written by Jerome Banks