Simple histograms in SQL

July 4, 2011

by Jason Buberel

10 comments

I’m hoping that this being the 4th of July holiday that neither Mike, Scott nor Jon will be paying enough attention to the blog to be alarmed by the appearance of a random bit of Postgres hackery. ┬áThis came about as the result of my wanting to get a feel for some of the initial rental data that we’ve begun collecting. And before you ask: No, sorry, I don’t have a date when we’ll begin incorporating this new set of data into our product lineup.

As I was browsing the data we collected last week, I began thinking about what the general distribution of rental prices looked like. Of course, a histogram is the obvious visualization for answering questions like that. But I dreaded having to jump through all of the hoops necessary to get the data into a charting tool like Excel – export, transfer, import, format, fiddle, tweak, fiddle, tweak, rinse repeat. I just wanted a quick snapshot, not a publication quality rendering.

I began by looking through the Postgres documentation on string manipulation functions, and came across the SQL ‘lpad’ function. With a bit of clever grouping, counting and scaling, I was able to quickly generate the following histogram of rental prices, grouped into $100 buckets:

 rent |                         frequency                          
------+------------------------------------------------------------
    0 | 
    1 | 
    2 | 
    3 | 
    4 | ###
    5 | ##################
    6 | ######################################
    7 | ######################################################
    8 | ##########################################################
    9 | ####################################################
   10 | ###########################################
   11 | ################################
   12 | ###########################
   13 | ########################
   14 | ###################
   15 | ################
   16 | ############
   17 | ##########
   18 | ########
   19 | #######
   20 | #####
   21 | ####
   22 | ###
   23 | ###
   24 | ##
   25 | ##
   26 | #
   27 | #
   28 | #
   29 | 
   30 | 

This particular histogram uses a bucket size of 100 and a unit size of 500 (you’ll see those numbers appear a few times in the query). Here is the exact SQL used:

select (re.highprice/100)::int as "rent", 
    lpad('', (count(re.*)/500)::int, '#') as "frequency" 
    from rental_entry re 
    group by ( re.highprice/100)::int 
    order by ( re.highprice/100)::int ;

The unit size of 500 was an arbitrary choice, based on the total number of entries in our sample data set and my desire to fit the output in my terminal screen. The key is to adjust the bucket and unit size constants based on the data to fit the output to a reasonably sized monitor.

Happy 4th of July!