WURFL Lightweight for PHP

I had another one of those WURFL hacking episodes aimed at getting WURFL to work better for sites with large groups of active devices. And here it is. I’ve started calling it WURFL lightweight. Everyone looks up device info by user agent, so it’s optimized for that. It also assumes a few things:

  • The suggested caching method, multicache, is the caching method being used. It will be, no choice about it any more.
  • You want to front load as much work into parsing as you can. Parsing is a one-time offline process. Do as much as you can there to speed up the work that has to happen when a request comes in.
  • Rewriting files as the library is working is bad. No more agent to id cache (though I might bring this back in some form. Tune in next update to find out why.. dah dah dah!), no more option to auto update.

And that’s all it does. I ripped out the code related to other stuff. Especially older stuff. And I tweeked the caching and algorithms a bit:

  • The agent name to device id mapping files are divided up into a bunch of smaller files based on the initial two letters of the user agent. A lot less to load, and a lot less to search through.
  • There was an loop over the user agents to find the one that matched that decreased the length of the agent to match by one character on each iteration and just scanned the array over and over. I thought maybe this was because the PHP implementation underneath was heavily optimized to handle those operations, but my testing hasn’t shown that. So I changed it to a single loop through the agents and finding the longest matching prefix.
  • Sort the agents in the serialized array. This does change the matches for some devices. In the case where a user agent matches multiple entries to the same length the existing implementation returned the first match. Which I still do, but now that the entries are shorter that can be a different device. In some cases this corrects problems. In others I think it was acting as another defacto defaulting mechanism and I might have broken something. But in either case the differences list was pretty short (about 70 entries) out of the 3600 unique device agents I pulled from log files.

This one has given me quite a speed boost for the stuff I was testing with, we’ll have to see how it does in the wild. Here are the further refinements I’m thinking of:

  • Move the fallthrough processing into the parser and make the multicache device entries flat. Pulling a device from the multicache means pulling the device entry itself. Look for a fallback entry for that device, and if present pull it. Repeat until you’re at the root. And then collapse all those entries into the device entry. Do all that processing in update_lw_cache.php and put the full version in the multicache file. We’re reading at least the same number of blocks or less, doing no processing. This should be a net win. Unless the common source for the base devices end up cached, and the processor time to merge them is less expensive than the cache hit from the extra working set. Hmm.. this is one to test, and could vary for different setups.
  • Caching the user-agent to device lookup once we find a new one… like the agent to id cache did before.. HOWEVER not make it hold anything besides new agent to ID mappings. Then we can also make the initial lookup in the array hash the lookup, and only scan linear if we can’t find the agent we’re looking for. That’ll be pretty hot.
This entry was posted in ThisIsMobility. Bookmark the permalink.

5 Responses to WURFL Lightweight for PHP

  1. Strangiato says:

    I like your ideas for speeding up the WURFL for php.
    Using only multichache layout for the files and getting rid of the agent2id file really cleans up the code quite a bit.

    I used your approach with a couple of tweaks.

    1. Instead of breaking the user-agent to wurfl id into multiple little cache files based on the first to characters of the user agent, I simply made that hash structure 2 levels deep.
    The first level contains keys which contain the value of the user agent up to the first slash. That key then points to a hash containing the full user_agent to wurfl ids for agents that start with similar values.
    This is much simpler as there is only one file to worry about and you don’t have to scan the ENTIRE key set of the user agents. You can quickly tell if a similar agent exists, and if it does you only have to scan a handful looking for the best one. (Which brings up another point, first check if there is an exact match using isset() before trying the linear scan looking for the longest match.)

    2. I went ahead and stored in each device cache file the entire capability set. Its trivial during parsing to look these up, and when loading a device’s capabilities at runtime you only have to load a single file. The resultant files are all around 16k. (An extra benefit to this is that it is easily adaptable to storing these as blobs in a database.)

    3. Minor tweaks that probably aren’t worth the speedup.
    3a – Modified the matchlen() function to scan each user_agent character by character rather than repeatedly calling substr(). Theoretically scanning a string once is faster than scanning a string over and over (1+2+3+4+5…etc)
    3b – getDeviceCapability() seemed to be doing something odd, it was linearly scanning all the capabilities looking for a key match. The structure is a hash, why not use isset() and directlly return the value requested if it exists, much faster.

  2. miker says:

    I actually had the hash as a two level structure first, and then moved it off to individual files. Even with an accelerator in place to cache a compiled version of the agents files, reading in the entire array structure bloated the runtime some. For the test set I had, keeping the array parts distinct helped out.

    How much of a difference did #2 make for you? I’m considering doing that, but I’m not sure if the result is just going to be that the block cache layer at the OS level just needs to cache more files. With the existing ssytem there is the benefit of having a smaller on disk working set to cache.

    I actually ended up using a direct check before scanning, but I didn’t bother posting it. Turns out it only hits an exact match 30% of the time and didn’t make much of a difference. I’m considering trying to make WURFL adaptive, so that it learns new User Agents as it sees them. That was one of the benefits of agent2id, even though the cache of capabilities detracted from the lookup benefits for me. I’m thinking of putting something of the sort back in there.

  3. Strangiato says:

    Don’t have speed figures for you atm. I don’t actually use WURFL at my current position, but I am thinking of using it for a small personal project and was debating between using the Java vs PHP for that. I noticed the warnings about the speed of the PHP version and decided to just take a look and hack away for the heck of it.
    As for an adaptive WURFL, what has worked in the past for me is to record once for each unique user-agent during the day, the full set of HTTP headers. As a nightly process check for unknown user-agents. If a UAProf is given, use that to generate device information if possible, as a last resort use the HTTP accept headers to update device information. A UAProf to WURL entry utility is available on the WURFL site, (I think in ruby) but simple enough to do in PHP/Perl/Python. Then you’d run the wurfl update script if needed. Not realtime…but then again for the small project i’m working on, realtime device updating isn’t critical.

  4. miker says:

    Ooo, I didn’t know about the UAProf to WURFL script, I’ll have to look that up. Thanks!

  5. Chris Abbott says:

    Apart from all the UAProfiles, there’s a new approach to the device detection problem in the “DetectRight” link to have a play with. You might also want to check out what happened at the W3C Device Description Workshop in Madrid in July to see how the industry proposes to address the problem of everyone writing their own device repositories :)

    Chris

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">