Thursday, September 29, 2011

Want to know your users better? Run your own analytics!

(Originally posted on theicebreak blog.)

Theicebreak is all about understanding relationships. It makes sense then that we would want to ensure that we have a strong understanding of who we are in a relationship with - our user base.

To do so, we really focus on tracking every feature and functionality through cohort analysis, as championed by Fred Wilson, Dave McClure,  Eric Ries's lean startup movement and others.

There are now a couple of services that provide easy tools to do so for startups, including everyone's favorites...Mixpanel and KISSMetrics. We tried them all, but immediately hit some crucial limitations of these online platforms:
  1. We have multiple interfaces (web, mobile, api) and we would have to individually integrate with analytics on each one. A way to solve this is to integrate on the server side.
  2. We wanted to combine this data with other attributes, for example, break down a certain feature by male vs female users, or married vs engaged couples.
  3. We needed to be able to run custom queries and reports based on the data in a simple way.
  4. This data is private and very important for the company, and we would like to 'own' it.
  5. It should be inexpensive :)
Most of the analytics services (except the really expensive/complex ones) are typically focused on implementing the tracking in javascript or other front-end APIs. Some of them do support server side logging, but it is limited.

The only way we could meet the above requirements is to build our own system. Once we started to write it, it actually turned out to be quite easier than we originally thought, and also a lot more powerful.

The key requirements (besides solving the problems I described above) were:
  1. It should be quick to build and simple to use.
  2. It should be re-usable, and possible to run lots of complex analyses with ease.
  3. It shouldn't take any more effort to log different action types as the site grows.
We created a very simple 'events' table called event_log (we use mysql). This table basically had 5 fields:
Any action that a user takes, like answering an icebreaker, changing the password, posting a moment, updating stats, etc. gets an entry in this table. We also store the event_action_id that points to the actual action. This way we can dig deeper if needed. We focus on actual 'actions', not page views.

This is done via a simple api call to log an entry to this table after the user performs an action. This has been extremely powerful for us, and forms the basis of most of our decision making reports.

Because the table is simple but contains all the necessary fields, we can now run complex cohorts as well as simple reports just off of this table. There can be joins and other analyses that can be very easily executed without too much overhead. It also really simplifies the task of getting the same data on different types of actions, as it's all normalized in a clean, single table.

As an example, a basic cohort table can be easily generated by a query similar to:
week(u.time_created) AS cohort,
week(g.time_created) AS period,
count(*) FROM event_log g
JOIN user u ON u.uid=g.uid
u.time_created > '2011-01-01' AND
g.event_type = 1
GROUP BY cohort, period
ORDER BY cohort, period
A quick post-processing of the results and you can generate some very useful charts out of it; below is an example simulation of how retention would look for a typical website for a few weeks following a launch-spike.

Adding a simple dropdown for the event type, and you can get the same chart broken down by events. You can also very easily add other filters by doing joins with other tables and get real insights on who your users are.

For example, we can run reports like:
  • What's the usage of the icebreaker feature among married males?
  • What's the overall retention broken down by each relationship type (i.e. married, dating, etc)?
  • Do males add more stats or females, and how do these numbers change over time?

This is a good initial approach; it provides a lot more flexibility than other services, and will work effectively until you hit a few million users. Then you will start having 'scaling issues' (a good problem to have).

There are definitely some drawbacks to this approach, like:
  • The events table can grow really large. We only track the 'important' events, and for the unimportant ones, we track it at intervals (for example, we only log one visit per day). You can also vertically split the table to keep the size under control.
  • This is not connected to front-end analytics; doing so might complicate the system.
  • You have to write this from scratch. Even though it was relatively easy for us to write, it's still more effort than adding 3 lines of javascript code.
  • It can get slow... but it's not a user-facing feature, and we can cache the results. We also run this on a replica database, so it doesn't really affect the performance of our live service.

We use these cohort reports in conjunction with regular analytics (getclicky and Google analytics) for traditional page views, visitors, funnels, campaigns, and conversions. This has helped us understand a ton about how the site is being used, and what our users want and are doing.