Simple histograms in SQL

July 4, 2011

by Jason Buberel

10 comments

I’m hoping that this being the 4th of July holiday that neither Mike, Scott nor Jon will be paying enough attention to the blog to be alarmed by the appearance of a random bit of Postgres hackery. ┬áThis came about as the result of my wanting to get a feel for some of the initial rental data that we’ve begun collecting. And before you ask: No, sorry, I don’t have a date when we’ll begin incorporating this new set of data into our product lineup.

As I was browsing the data we collected last week, I began thinking about what the general distribution of rental prices looked like. Of course, a histogram is the obvious visualization for answering questions like that. But I dreaded having to jump through all of the hoops necessary to get the data into a charting tool like Excel – export, transfer, import, format, fiddle, tweak, fiddle, tweak, rinse repeat. I just wanted a quick snapshot, not a publication quality rendering.

I began by looking through the Postgres documentation on string manipulation functions, and came across the SQL ‘lpad’ function. With a bit of clever grouping, counting and scaling, I was able to quickly generate the following histogram of rental prices, grouped into $100 buckets:

 rent |                         frequency                          
------+------------------------------------------------------------
    0 | 
    1 | 
    2 | 
    3 | 
    4 | ###
    5 | ##################
    6 | ######################################
    7 | ######################################################
    8 | ##########################################################
    9 | ####################################################
   10 | ###########################################
   11 | ################################
   12 | ###########################
   13 | ########################
   14 | ###################
   15 | ################
   16 | ############
   17 | ##########
   18 | ########
   19 | #######
   20 | #####
   21 | ####
   22 | ###
   23 | ###
   24 | ##
   25 | ##
   26 | #
   27 | #
   28 | #
   29 | 
   30 | 

This particular histogram uses a bucket size of 100 and a unit size of 500 (you’ll see those numbers appear a few times in the query). Here is the exact SQL used:

select (re.highprice/100)::int as "rent", 
    lpad('', (count(re.*)/500)::int, '#') as "frequency" 
    from rental_entry re 
    group by ( re.highprice/100)::int 
    order by ( re.highprice/100)::int ;

The unit size of 500 was an arbitrary choice, based on the total number of entries in our sample data set and my desire to fit the output in my terminal screen. The key is to adjust the bucket and unit size constants based on the data to fit the output to a reasonably sized monitor.

Happy 4th of July!

{ 10 comments }

Konrad July 4, 2011 at 6:40 pm

Hell aye, i want write a transaction analogous this height of one s ambitio didnt have epoch of childbirth, may i repost this Simple histograms in SQL | Altos Research: to what extent s the Market?

Jason Buberel July 6, 2011 at 8:23 pm

You bet – feel free to share.

Michael Shaun July 5, 2011 at 10:41 pm

That's awesome..

Jason Buberel July 6, 2011 at 8:23 pm

Thanks. I'm not sure how useful it will be in practice, but it was fun to fiddle with.

Jeff Morris July 6, 2011 at 12:41 pm

Great Histography Jason! Nice one…

Jason Buberel July 6, 2011 at 8:24 pm

Thanks – but is that even a real word ;-) ?

Jeff Morris July 11, 2011 at 3:20 pm

Yes Jason, it is…

James July 11, 2011 at 6:43 am

Proving your geek spurs. SAS may lose their monopoly on text-based graphics!

Jason Buberel July 11, 2011 at 8:58 am

Yeah, it does have an 80's feel to it.

Real Estate Blogger July 27, 2011 at 12:55 pm

Always been a big fan of Stata to do my research

Comments on this entry are closed.

Additional comments powered by BackType

Previous post:

Next post: