Friday, February 27, 2009

Firehose continued …

I’m writing this at 38,000 feet (so they tell us) after four long, intense days at the TDWI conference. I’m still wired (maybe it was the Mocha Frappuccino at the airport), and my book is done, so it seems as good a time as any to try to get some impressions down.

Las Vegas: I found it daunting. The trip from the airport to Caesar’s Palace took so long, and the streets were so full of people, that I had no desire at all to revisit those streets once I got into the hotel. The only shows I wanted to see cost more than I could justify spending on entertainment, so I stayed in. Between my traveling book collection (the Prydain series by Lloyd Alexander, classics I first read in junior high and not since), long conference days, and a broad selection of restaurants, I never needed to leave the hotel.

TDWI: This was my first TDWI (The Data Warehousing Institute) conference. It won’t be my last. I’m very impressed with this organization. The focus is on relevant education, and I found all four of the day-long sessions I took fascinating, taught by knowledgeable and highly experienced practitioners (not just trainers following a curriculum). At work we’re in the process of building our first real data warehouse, and the sessions gave me two things: validation that we’ve done a lot of things right just by thinking hard and applying good sense – yes, starting from business questions and mapping those to conceptual entities and then to actual data elements was a good idea – and a number of refinements we can apply to make our ongoing design work better.

Two different classes in dimensional modeling yielded two very different paths to very similar outcomes, each with advantages and disadvantages. I suspect I won’t be able to resist synthesizing them, taking elements from each approach. Mapping all or most of the data elements from the source tables into Fact Groups, without worrying too much about whether they are all “needed” to answer the questions the business is ready to ask yet, satisfies the data-packrat in me. After all, sometimes our business customers don’t even know a particular element is available, so how would they think to ask for it? On the other hand, filling out a Fact-Qualifier Matrix is a great way to be sure I’m headed in the right direction, and our own similar approach turned up required elements we might not otherwise have looked for, or might not have grouped with the relevant Facts.
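To make the idea concrete, here’s a rough sketch of a Fact-Qualifier Matrix as a tiny Python structure: facts as rows, qualifiers (dimensions) as columns, a True cell meaning “this fact must be analyzable by this qualifier.” All the names below are invented for illustration, not taken from our actual warehouse.

```python
# Hypothetical Fact-Qualifier Matrix. Facts are rows; qualifiers
# (candidate dimensions) are columns. True = "the business needs to
# slice this fact by this qualifier."
fact_qualifier_matrix = {
    "VisitCount":   {"Date": True, "Clinic": True, "Provider": True,  "TreatmentType": False},
    "ChargeAmount": {"Date": True, "Clinic": True, "Provider": False, "TreatmentType": True},
}

def required_dimensions(matrix):
    """Union of qualifiers any fact needs -- the dimension tables to design."""
    needed = set()
    for quals in matrix.values():
        needed.update(q for q, required in quals.items() if required)
    return sorted(needed)

print(required_dimensions(fact_qualifier_matrix))
# prints ['Clinic', 'Date', 'Provider', 'TreatmentType']
```

Even at this toy scale, the matrix makes gaps visible: a qualifier no fact needs, or a fact nobody can slice, jumps out of the grid in a way it doesn’t from a pile of source-table column lists.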

A class in “Predictive Modeling for Non-statisticians” sounded too intriguing to pass up, and it was indeed a fascinating class. The instructor made some fairly bold assertions about the ability to use “brute force” methods with relatively large data sets, as opposed to the much more sophisticated methods that must be applied to the relatively small sets (often sample sizes of a few dozen or a few hundred) around which classical statistical science was developed. I felt he proved those assertions rather well, and they matched my own semi-scientific intuitions about the power of large numbers when trying to understand one’s data. The point was not ultimately that statistical methods should not be applied, but that deep knowledge of the meaning of one’s data is even more necessary to useful analysis than academic statistical knowledge, particularly when dealing with data sets of at least tens of thousands of rows, which are much more common today than they were decades ago. Indeed, in my day-to-day work I’m often dealing with millions of rows, and we are not all that large an organization! If I want a sample of people who have had a specific treatment experience, I don’t usually have to think a whole lot about the minimum acceptable sample size – if anything, I use time bounds to keep the sample size down to something small enough that I can process the data in a reasonable time frame. This lets me make assertions like “In 10% of cases where A occurred over the past 12 months, B also occurred.” I don’t state confidence or error, because confidence is 100% and error is zero. Within the time frame stated, I’ve examined all instances, not a sample.
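The “no confidence interval needed” point can be shown in a few lines: when you scan every row in the time window rather than sampling, the proportion you report is exact for that window. The records below are invented stand-ins for the A/B example above.

```python
# Invented full-population data: every case in the stated time window,
# as (had_event_A, had_event_B) pairs. No sampling involved.
records = [
    (True, True), (True, False), (True, False),
    (True, False), (True, False), (False, True),
]

# Restrict to the cases where A occurred, then count how often B
# co-occurred. Because this is the whole population for the window,
# the result is a census figure, not an estimate.
a_cases = [b for a, b in records if a]
proportion = sum(a_cases) / len(a_cases)

print(f"{proportion:.0%} of A cases also had B")
# prints "20% of A cases also had B"
```

A statistician would rightly add that the exactness only holds for the stated window; the moment you generalize to next year’s cases, you’re back to inference.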

hiatus

The loaner laptop ran out of battery on the plane, so I had to shut down. 24 hours later, I can say it was good to be back in the office today. Along with some “real work”, I downloaded the InfoBright open-source column-store DBMS just as a little skunk-works project. I have it in mind to convert my most-used set of tables from the operational snapshot I spend most of my time in to a star schema and compare query performance between different DBMSs. If building out the star schema gets to be too much fuss, I’ll just port the existing tables straight over and compare with those. After all, I already know how to write those queries!
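The shape of that experiment can be sketched in a few lines. This uses SQLite in memory as a stand-in (I haven’t set up InfoBright here), and the table and column names are invented; the point is just the form of the star-join query whose performance would be compared across engines.

```python
# Toy star schema: one dimension table, one fact table, and a
# star-join query (constrain on the dimension, aggregate the fact).
# SQLite stands in for the column-store DBMS; names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_visits (date_key INTEGER, visit_count INTEGER);
    INSERT INTO dim_date VALUES (1, 2008), (2, 2009);
    INSERT INTO fact_visits VALUES (1, 10), (1, 5), (2, 7);
""")

cur.execute("""
    SELECT d.year, SUM(f.visit_count)
    FROM fact_visits f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
    ORDER BY d.year
""")
rows = cur.fetchall()
conn.close()

print(rows)
# prints [(2008, 15), (2009, 7)]
```

The same query run against the straight-ported operational tables versus the star schema, on each candidate DBMS, is exactly the comparison the skunk-works project is after.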
