I had another one of those WURFL hacking episodes aimed at getting WURFL to work better for sites with large groups of active devices. And here it is. I’ve started calling it WURFL lightweight. Everyone looks up device info by user agent, so it’s optimized for that. It also assumes a few things:

  • The suggested caching method, multicache, is the caching method being used. It will be, no choice about it any more.
  • You want to front load as much work into parsing as you can. Parsing is a one-time offline process. Do as much as you can there to speed up the work that has to happen when a request comes in.
  • Rewriting files while the library is working is bad. No more agent to id cache (though I might bring this back in some form; tune in next update to find out why... dah dah dah!), and no more option to auto update.

And that’s all it does. I ripped out the code related to other stuff, especially older stuff. And I tweaked the caching and algorithms a bit:

  • The agent name to device id mapping files are divided up into a bunch of smaller files based on the initial two letters of the user agent. A lot less to load, and a lot less to search through.
  • There was a loop over the user agents that looked for a match by shortening the agent to match by one character on each iteration and rescanning the whole array every time. I thought maybe this was because the underlying PHP implementation is heavily optimized for those operations, but my testing hasn’t shown that. So I changed it to a single pass through the agents that finds the longest matching prefix.
  • Sort the agents in the serialized array. This does change the matches for some devices. When a user agent matches multiple entries to the same length, the existing implementation returned the first match. I still do that, but with the new sort order the first match can be a different device. In some cases this corrects problems. In others I think the old order was acting as another de facto defaulting mechanism, and I might have broken something. Either way, the list of differences was pretty short: about 70 entries out of the 3600 unique device agents I pulled from log files.
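
To make the bucketing and the single-pass match concrete, here is a rough sketch. It is written in Python for brevity (the library itself is PHP), and every name in it (`build_buckets`, `lookup`, the sample agents and device ids) is made up for illustration; it is not the actual WURFL lightweight code. It assumes "longest matching prefix" means the known agent that shares the most leading characters with the incoming agent, with ties going to the first entry in sorted order.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# One bucket holds (known_agent, device_id) pairs, sorted by agent string.
Bucket = List[Tuple[str, str]]


def bucket_key(user_agent: str) -> str:
    """Buckets are keyed on the first two characters of the user agent."""
    return user_agent[:2]


def build_buckets(agent_to_id: Dict[str, str]) -> Dict[str, Bucket]:
    """Parse-time step: split the full agent map into small sorted buckets."""
    buckets: Dict[str, Bucket] = defaultdict(list)
    for agent, device_id in agent_to_id.items():
        buckets[bucket_key(agent)].append((agent, device_id))
    for entries in buckets.values():
        entries.sort()
    return dict(buckets)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lookup(buckets: Dict[str, Bucket], user_agent: str) -> Optional[str]:
    """Request-time step: one pass over a single bucket, keeping the best match."""
    best_id: Optional[str] = None
    best_len = 0
    for agent, device_id in buckets.get(bucket_key(user_agent), []):
        n = common_prefix_len(agent, user_agent)
        if n > best_len:  # strict comparison: on a tie the first entry wins
            best_len, best_id = n, device_id
    return best_id
```

With a map containing `Nokia6600/1.0` and `Nokia6630/1.0`, looking up `Nokia6630/2.0 Profile` only touches the `No` bucket and picks the 6630 entry, while a bare `Nokia66` ties between the two and returns whichever sorts first. That tie is exactly where sorting can change the result compared to the old file order.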

This one has given me quite a speed boost for the stuff I was testing with; we’ll have to see how it does in the wild. Here are the further refinements I’m thinking of:

  • Move the fallthrough processing into the parser and make the multicache device entries flat. Right now, pulling a device from the multicache means pulling the device entry itself, looking for a fallback entry for that device and pulling it if present, repeating until you’re at the root, and then collapsing all those entries into the device entry. Instead, do all that processing in update_lw_cache.php and put the full version in the multicache file. We’d be reading the same number of blocks or fewer, and doing no processing at request time. This should be a net win, unless the shared base devices end up cached and the processor time to merge them costs less than the caching penalty from the larger working set. Hmm... this is one to test, and it could vary for different setups.
  • Cache the user-agent to device lookup once we find a new one, like the agent to id cache did before. HOWEVER, don’t make it hold anything besides new agent to ID mappings. Then we can also make the initial lookup a hash lookup in the array, and only fall back to a linear scan if we can’t find the agent we’re looking for. That’ll be pretty hot.
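
Both refinements can be sketched roughly as follows, again in Python rather than PHP. None of this exists yet, so treat it as a guess at the shape. The device layout (a dict with `fall_back` and `capabilities` keys), the `AgentLookup` class and the in-memory `learned` dict are all assumptions for illustration; a real version would have to persist the learned mappings somewhere.

```python
from typing import Callable, Dict, List, Optional

# Assumed layout: {"fall_back": parent_id, "capabilities": {...}} per device id.
Devices = Dict[str, dict]


def flatten_device(device_id: str, devices: Devices) -> dict:
    """Walk the fallback chain to the root and merge it into one flat entry."""
    chain: List[dict] = []
    seen = set()
    current: Optional[str] = device_id
    # Stop at an unknown parent (the root) or if the chain loops back on itself.
    while current is not None and current in devices and current not in seen:
        seen.add(current)
        chain.append(devices[current])
        current = devices[current].get("fall_back")
    flat: dict = {}
    for entry in reversed(chain):  # root first, so specific devices override
        flat.update(entry.get("capabilities", {}))
    return flat


def flatten_all(devices: Devices) -> Dict[str, dict]:
    """Parse-time step: precompute the flat entry for every device."""
    return {device_id: flatten_device(device_id, devices) for device_id in devices}


class AgentLookup:
    """Exact hash lookup first; linear scan only for agents never seen before."""

    def __init__(self, known: Dict[str, str],
                 scan: Callable[[str], Optional[str]]) -> None:
        self.known = known                  # agents from the parsed data
        self.scan = scan                    # slow path, e.g. a prefix scan
        self.learned: Dict[str, str] = {}   # holds only new agent -> id mappings

    def device_id(self, user_agent: str) -> Optional[str]:
        if user_agent in self.known:
            return self.known[user_agent]
        if user_agent in self.learned:
            return self.learned[user_agent]
        found = self.scan(user_agent)
        if found is not None:
            self.learned[user_agent] = found
        return found
```

The trade-off from the first bullet shows up in `flatten_all`: every flat entry carries a full copy of what it inherited, so the cache files get bigger in exchange for skipping the merge on each request.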