Case study: How Safari Books Online uses BigQuery for Business Intelligence

Editor’s Note: This article is written by guest author Daniel Peter. Daniel is a Sr. Programmer Analyst at Safari Books Online, an online library subscription service founded by O’Reilly Media and Pearson Education. Safari Books Online uses BigQuery to solve a few key challenges: building detailed business dashboards to spot trends and manage abuse, improving sales team effectiveness through sales intelligence, and enabling ad-hoc querying to answer specific business questions. They chose BigQuery over other technologies because of the speed of querying with a familiar SQL-like language and the lack of required maintenance.

Safari Books Online has a large and diverse customer base, constantly searching and accessing our growing library of over 30,000 books and videos from an increasing array of desktop and mobile devices. This activity stream contains powerful knowledge which we can use to improve our service and increase profitability. Locked up in the mountains of usage data are trends such as top users and top titles, as well as the connections that help us follow up on sales inquiries.

Our usage data was much too massive (in the billions of records) to query online in its entirety with our previous toolset. Analysis could be done with third-party web analytics tools such as Omniture, but those tools lacked the ability to query and explore record-level data in real time, and they didn’t offer a good backend for developing visualizations. SQL-like querying had to be done on smaller chunks of the data, and it was labor intensive and slow. We were impatient waiting for MySQL queries to finish, and were often in doubt as to whether they would finish at all. Once you reach database client timeouts of 10 minutes, you are in the realm of not being able to do ad-hoc analysis. Answering operational questions (such as tracking down abuse) can be time-sensitive, so speed is important. We played with Hadoop, but it proved to take a lot of resources to maintain, so we ended up putting it on the back burner for future projects.

We then learned about Google BigQuery through a video from Google I/O, and it seemed like a perfect fit for our use case, so we decided to try it out. Read on to see how we used BigQuery to quickly turn otherwise difficult-to-use data into actionable knowledge.

Getting our data into BigQuery

No matter what database platform you use, the Extract-Transform-Load (ETL) step can be a challenge depending on your data sources. BigQuery is no exception. Google has a great article on Loading Data into BigQuery to help make this easier.

This diagram shows an overview of how the data flowed through the landing and ETL servers leading up to BigQuery.

Our usage data comes from our content delivery network (CDN) and our web application server logs. This data was packaged into time-based batches and automatically copied into a ‘landing directory’. It needed some transformation to load into BigQuery in a format where we could get the most use out of it.

Here are the basic steps we went through:

Get a list of all source files we have previously loaded to BigQuery

This was as simple as calling the bq command-line tool with the --format=json flag from a PHP script:

bq --format=json query "SELECT sourcefile FROM PageViews GROUP BY sourcefile"

That gave us this JSON output:

[{"sourcefile":"PageViews.201207122232.gz"},
{"sourcefile":"PageViews.201207111006.gz"},
{"sourcefile":"PageViews.201207121024.gz"}]

This output could easily be brought into an array using PHP’s json_decode function.
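
A minimal sketch of that step (the variable names here are illustrative, not our production script), assuming the bq tool is installed and authenticated:

// Run bq, decode its JSON output, and collapse it into a flat array
// of source file names that are already loaded into BigQuery.
$json = shell_exec('bq --format=json query "SELECT sourcefile FROM PageViews GROUP BY sourcefile"');
$loadedFiles = array();
foreach (json_decode($json, true) as $row) {
    $loadedFiles[] = $row['sourcefile'];   // e.g. "PageViews.201207122232.gz"
}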

Validate our list of source files waiting to be loaded into BigQuery

We listed all the files in our landing directory and computed their checksums with the Linux md5sum command. We waited a short time, then collected the list and checksums again. Any file whose md5 sum had changed was still being uploaded, so we removed it from our list to allow it more time to finish.

We also removed any files from the list which had already been loaded into BigQuery.

We then checked the remaining files for integrity with the Linux gunzip -t command.
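
Put together, the validation pass might look roughly like this (the landing directory path and wait time are illustrative, and $loadedFiles is assumed to come from the previous step):

// Checksum every file in the landing directory, wait, then checksum again.
$checksums = array();
foreach (glob('/landing/*.gz') as $file) {
    $checksums[$file] = shell_exec('md5sum ' . escapeshellarg($file));
}
sleep(120);   // give in-flight uploads time to finish

$readyFiles = array();
foreach ($checksums as $file => $sum) {
    if (shell_exec('md5sum ' . escapeshellarg($file)) !== $sum) {
        continue;   // checksum changed: the file is still being uploaded
    }
    if (in_array(basename($file), $loadedFiles)) {
        continue;   // already loaded into BigQuery
    }
    exec('gunzip -t ' . escapeshellarg($file), $output, $status);
    if ($status === 0) {
        $readyFiles[] = $file;   // passed the integrity check
    }
}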

Transform the files

We process around 40 compressed text files, each about 60MB, per 24-hour period. Doing ETL on this volume of data needs a decent server. We used a Linux machine with 256GB of RAM and 32 cores, but you could get by with less. The larger the server the better, especially during development when you are iterating over and over. For production, where you only process the incremental data, you could scale down the resources. The main thing is that an hour of usage data must take less than an hour to process, or the data in BigQuery will never catch up!

The process starts with unzipping the files and parsing the text line by line.
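
In PHP, the outer loop can be as simple as streaming each gzipped file with gzopen; the per-line transformation is sketched further below:

$in = gzopen($file, 'r');
while (($line = gzgets($in)) !== false) {
    // parse and transform $line, then write it to the output file
}
gzclose($in);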

An example line might look like this:

50.0.113.6 - - [21/May/2012:06:12:51 +0000] "GET /techbus.safaribooksonline.com/9780596805395/how_python_runs_programs?reader=html HTTP/1.1" 504 417 "-" "Apple-PubSub/65.28" "-"

We wanted to exclude empty or irrelevant fields, reformat other fields, and break up multi-value fields, ending up with a table where each row looks like this:

host (STRING): 50.0.113.6
datetime (STRING): 2012/05/21 06:12:51
httpMethod (STRING): GET
domain (STRING): techbus.safaribooksonline.com
path (STRING): /9780596805395/
resource (STRING): how_python_runs_programs
querystring (STRING): reader=html
status (INTEGER): 504
bytes (INTEGER): 417
useragent (STRING): Apple-PubSub/65.28
sourcefile (STRING): PageViews.201207122232.gz

This PHP function worked well to parse up the web log line entry into an array:

str_getcsv($line, " ", '"', '\\');

We did some special processing on several of the fields:

  • We opted to store the datetime as a string rather than as a timestamp integer (Editor's note: BigQuery has since launched support for a TIMESTAMP type). We used PHP to get the date parts and reformat them into the format Y/m/d H:i:s, which sorts correctly as text.
  • We broke up the HTTP request into domain, path, resource, and querystring using PHP’s explode function and some other logic.
  • We added the name of the source file, so we could trace a record back to its original source file and avoid inserting any source file more than once.

We wrote out each transformed line to a new file, escaping special characters, and delimiting fields with a pipe | character.
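
Here is a rough sketch of that per-line transformation, matching the example line and the target layout above. $out and $sourceFile are assumed to be set by the surrounding loop, and escaping of stray pipe characters inside fields is omitted for brevity:

$fields = str_getcsv($line, " ", '"', '\\');

// Rebuild the timestamp as a sortable Y/m/d H:i:s string.
$dt = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($fields[3] . ' ' . $fields[4], '[]'));
$datetime = $dt->format('Y/m/d H:i:s');

// Split "GET /domain/path/resource?querystring HTTP/1.1" into its pieces.
list($httpMethod, $request) = explode(' ', $fields[5]);
$requestParts = explode('?', $request, 2);
$querystring  = isset($requestParts[1]) ? $requestParts[1] : '';
$segments = explode('/', ltrim($requestParts[0], '/'));
$domain   = array_shift($segments);                 // techbus.safaribooksonline.com
$resource = array_pop($segments);                   // how_python_runs_programs
$path     = '/' . implode('/', $segments) . '/';    // /9780596805395/

// Write the pipe-delimited row, including the source file name.
fwrite($out, implode('|', array($fields[0], $datetime, $httpMethod, $domain, $path,
    $resource, $querystring, $fields[6], $fields[7], $fields[9], basename($sourceFile))) . "\n");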

We then compressed the transformed file with the Linux gzip -f -v command.

Copy the files to Google Cloud Storage

Copying the files up to Google Cloud Storage is a breeze thanks to the gsutil command line utility. It is literally as easy as copying files around on your local machine. Once installed, all we had to do was iterate through our transformed files and gsutil cp them up to Google:

foreach ($arrFiles as $file) {
  echo 'Sending file to Google Cloud Storage: '.$file;
  echo shell_exec('gsutil cp '.$file.' gs://data_warehouse/');
  echo ' (COMPLETE).';
  echo "\n";	
}

Load the files into BigQuery

We created this schema file:

[
{"type":"STRING","name":"host","mode":"NULLABLE"},
{"type":"STRING","name":"datetime","mode":"NULLABLE"},
{"type":"STRING","name":"httpMethod","mode":"NULLABLE"},
{"type":"STRING","name":"domain","mode":"NULLABLE"},
{"type":"STRING","name":"path","mode":"NULLABLE"},
{"type":"STRING","name":"resource","mode":"NULLABLE"},
{"type":"STRING","name":"querystring","mode":"NULLABLE"},
{"type":"INTEGER","name":"status","mode":"NULLABLE"},
{"type":"INTEGER","name":"bytes","mode":"NULLABLE"},
{"type":"STRING","name":"useragent","mode":"NULLABLE"},
{"type":"STRING","name":"sourcefile","mode":"NULLABLE"}
]

and saved it as schema.txt

Loading our files from Google Cloud Storage into BigQuery is very easy. With the bq command-line tool installed, we just ran this command:

bq load --field_delimiter='|' --max_bad_records=99999999 dataset.tableName gs://data_warehouse/file1.gz,gs://data_warehouse/file2.gz,gs://data_warehouse/file3.gz schema.txt
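
From our ETL script this was just another shell_exec call; a sketch, assuming $gsFiles holds the gs:// URIs copied in the previous step:

// bq expects the URI list comma-separated, with no spaces.
$uris = implode(',', $gsFiles);
echo shell_exec("bq load --field_delimiter='|' --max_bad_records=99999999 "
    . 'dataset.tableName ' . escapeshellarg($uris) . ' schema.txt');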

A large batch of gzipped files ingests surprisingly quickly. Gzipping and batching is the way to go for us, though Google recommends uncompressed source files for the highest possible ingestion speeds due to parallelized load operations. Either way, we can monitor the status of these load jobs through the BigQuery Browser Tool’s “Job History”:

Using the data once it is in BigQuery

BigQuery proved to be a good platform for dashboard data, and it lets us drill further into the data behind those dashboards with the browser-based BigQuery UI for ad-hoc analysis.

Here is a dashboard used internally by our operations team to monitor top users:

If anything raises questions, you can get more information about a user, a title, or an IP address by querying it with the BigQuery Browser Tool.

Here is a dashboard we use internally to keep an eye on trending top titles:

Again, if anything in this dashboard raises questions, BigQuery can answer them for you.

The data is fed to these dashboards by an intermediary scheduled job executed via cron. The script queries BigQuery with the bq command-line tool, gets the result as JSON, then contacts some other web services to fetch information which is not stored in BigQuery, such as user data and book thumbnail images. It is all then mashed up and presented in the dashboard. This is a good pattern: you wouldn’t want to query BigQuery every time the dashboard is loaded, as it could get expensive. It is best to do the analytical heavy lifting with BigQuery, then store the results in something like a cheap LAMP stack for mass consumption. (Editor's note: BigQuery now has query caching, which handles some of this automatically)
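
A stripped-down sketch of such a cron script (the query, the enrichment URL, and the cache path are illustrative, not our production values):

// Ask BigQuery for the top titles of the last 24 hours; datetime is stored
// as a sortable Y/m/d H:i:s string, so a plain string comparison works.
$since = gmdate('Y/m/d H:i:s', time() - 86400);
$sql   = "SELECT path, COUNT(*) AS views FROM dataset.PageViews "
       . "WHERE datetime >= '$since' GROUP BY path ORDER BY views DESC LIMIT 20";
$rows  = json_decode(shell_exec('bq --format=json query ' . escapeshellarg($sql)), true);

// Mash in data BigQuery doesn't have, e.g. a title and thumbnail for each path.
foreach ($rows as &$row) {
    $meta = json_decode(file_get_contents('https://internal.example/titles?path=' . urlencode($row['path'])), true);
    $row['title']     = $meta['title'];
    $row['thumbnail'] = $meta['thumbnail'];
}

// Cache the result where the LAMP-stack dashboard can read it cheaply.
file_put_contents('/var/www/dashboard/top_titles.json', json_encode($rows));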

Sales Intelligence

Another use case BigQuery was great for was scouring our web logs for leads coming into our sales department. This allowed us to relate a lead to other accounts in our system, or gauge someone's level of interest in us by the number of preview books they had read on our site in the past month. The lead is created in our CRM system, and then BigQuery asynchronously searches through our logs with a query like this:

$query->setQuery("SELECT userid, COUNT(userid) AS countUserid FROM [backoffice.PageViews] WHERE remoteip='".$ip."' GROUP BY userid ORDER BY countUserid DESC LIMIT 20");

The result is returned quickly, and re-attached to the lead record, allowing our sales department to gain intelligence and contact the lead without having to wait around for the information.

This gives us a nice summary of the users (and anonymous requests) associated with the lead's IP address. While IP addresses are not perfect identifiers, they are much better than nothing!

You can see that someone from this lead's IP address has anonymously viewed 232 pages of our books (User ID 0), and that some other users who already have accounts with us are active on that IP.

Rather than the bq command-line tool, this use case was better suited to server-to-server OAuth with the Google API Client Library for PHP. This enabled us to create a web service which mashes up data from different sources, one of them BigQuery, and returns the result to our CRM. One amazing testament to the speed of BigQuery is that I didn't even have to implement the web service asynchronously: it returns the results from BigQuery within the timeout period of the CRM's web service request!
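
The shape of that web service looked something like the sketch below. The class names are from the 2012-era Google API Client Library for PHP (0.6.x), and the project ID, key path, and service account address are placeholders; newer versions of the library use different class names.

require_once 'google-api-php-client/src/Google_Client.php';
require_once 'google-api-php-client/src/contrib/Google_BigqueryService.php';

// Authenticate server-to-server with a service account private key.
$client = new Google_Client();
$client->setApplicationName('lead-intelligence');
$client->setAssertionCredentials(new Google_AssertionCredentials(
    '1234567890@developer.gserviceaccount.com',              // placeholder service account
    array('https://www.googleapis.com/auth/bigquery'),
    file_get_contents('/path/to/privatekey.p12')
));

// Run the lead query synchronously and read back the rows.
$bigquery = new Google_BigqueryService($client);
$query    = new Google_QueryRequest();
$query->setQuery("SELECT userid, COUNT(userid) AS countUserid FROM [backoffice.PageViews] "
    . "WHERE remoteip='" . $ip . "' GROUP BY userid ORDER BY countUserid DESC LIMIT 20");
$response = $bigquery->jobs->query('your-project-id', $query);
foreach ($response->getRows() as $row) {
    $cells = $row->getF();   // accessor names mirror the rows[].f[].v REST structure
    // attach $cells[0]->getV() (userid) and $cells[1]->getV() (page view count) to the CRM lead
}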

Ad-Hoc

Other use cases involve pure ad-hoc queries which aren’t driven by dashboards but by specific business questions.

When we released our Android apps, it was easy to run a query on our usage data to see Android usage grouped by day. From the Google BigQuery Browser Tool, we could then export the query results as a CSV file and load them into Google Spreadsheets to make a chart:
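
The query behind that chart was along these lines (legacy BigQuery SQL; the exact filter is illustrative):

SELECT LEFT(datetime, 10) AS day, COUNT(*) AS pageviews
FROM [backoffice.PageViews]
WHERE useragent CONTAINS 'Android'
GROUP BY day
ORDER BY day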

Here is an example of looking at our top users by number of searches during a particular hour of the day:
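
A query of that shape might look like the sketch below; how a “search” is identified in our logs (the resource filter) is illustrative rather than our exact production logic:

SELECT userid, COUNT(*) AS searches
FROM [backoffice.PageViews]
WHERE resource = 'search'
  AND datetime >= '2012/05/21 06:00:00'
  AND datetime < '2012/05/21 07:00:00'
GROUP BY userid
ORDER BY searches DESC
LIMIT 20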

There really is no limit to the kinds of insights you can gain from investigating your data. From an ad-hoc perspective, there is a huge sense of power in being able to answer any question about your billions of rows of data in seconds. For example, you can find out whether that strange traffic on your site is a forum spam bot or a malicious user.

As recommended by Google, we have divided our data up into years:

Normally it would be a bit of an inconvenience to have to union the tables together, but BigQuery has a nice shortcut for this:

SELECT page
FROM PageViews2006, PageViews2007, PageViews2008
WHERE page CONTAINS('google')
GROUP BY page

We just tack on as many years of time-sharded tables as we need to go back, and BigQuery unions them together, since we have kept the schemas the same. At the end of each year we archive off our “active” table and start a new one. This gives us the best of both worlds: the accessibility of having all our data in one big table, with the performance of keeping the tables separate. As we grow, it will make sense to shard the data further into monthly rather than yearly increments.

Summary

Generating meaningful knowledge out of massive amounts of data in a timely manner is a challenge Safari Books Online, like many businesses, needed to solve. Being able to keep on top of trends as they happen, while you can still do something about them rather than a month after the fact, is very valuable. It reduces abuse problems and increases top-line revenue by providing marketing intelligence on trending topics and lead intelligence that helps close sales more effectively.

Future

We feel we are just scratching the surface of what is possible with BigQuery. Having our data in a queryable, high-performance platform is a huge asset, both for our planned projects and for whatever unplanned ad-hoc questions may come up.