Review: A Model-Based Approach for Crawling Rich Internet Applications

Researchers at the University of Ottawa and IBM have developed a new search engine crawler for the Deep Web of AJAX applications. Indexing most modern AJAX web apps is a challenge, since the state of the page depends on much more than just the URL: it also depends on the server state, the client JavaScript state, cookies, and the DOM (Document Object Model) that represents the page’s HTML.

This complex state makes it difficult to crawl AJAX apps, since the search engine needs to “reset”, or go back to the beginning, every time it wants to explore a different branch of states. In addition, it can be hard for the search engine to tell whether two states or pages are even the same, since some content, like timestamps, changes every time you view the page.
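One common way to approach the state-equivalence problem is to mask volatile content before comparing pages. The sketch below is only an illustration, not the method from the paper: the `VOLATILE_PATTERNS` list, `normalize_dom`, and `state_id` are hypothetical names, and a real crawler would likely compare parsed DOM trees rather than raw HTML strings.

```python
import hashlib
import re

# Hypothetical patterns for content assumed to change on every page load.
VOLATILE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?"),  # ISO-like timestamps
    re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),                  # bare clock times
]


def normalize_dom(html: str) -> str:
    """Mask volatile fragments and collapse whitespace runs."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("<volatile>", html)
    return re.sub(r"\s+", " ", html).strip()


def state_id(html: str) -> str:
    """Return a fingerprint that is stable across volatile-only changes."""
    return hashlib.sha256(normalize_dom(html).encode("utf-8")).hexdigest()


page_a = "<div><p>Inbox (3)</p><span>Loaded at 2014-09-04 09:36:10</span></div>"
page_b = "<div><p>Inbox (3)</p><span>Loaded at 2014-09-17 00:42:55</span></div>"
page_c = "<div><p>Inbox (4)</p><span>Loaded at 2014-09-17 00:42:55</span></div>"

same_state = state_id(page_a) == state_id(page_b)       # differ only by timestamp
different_state = state_id(page_a) != state_id(page_c)  # real content differs
```

With this kind of fingerprint, two views of the same page taken at different times map to one state, while a genuine content change still produces a new one.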

To make headway in scanning AJAX applications and indexing their content, the research team developed a new “Model-based” approach to search engine crawling. While traditional techniques have primarily relied on breadth-first or depth-first search, the Model-based approach tries to optimize the order of exploration based on where it expects to find the most new information.
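The difference from breadth-first or depth-first search can be sketched with a priority queue: instead of a fixed visiting order, the crawler always fires the event its model scores as most promising. This is a simplified illustration under stated assumptions, not the paper's algorithm. The toy `app` graph, the `crawl` function, and the `score` heuristic are all hypothetical, and the sketch ignores the cost of resetting the application to reach a state.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Toy application model: state -> {event: resulting state}. Purely illustrative.
App = Dict[str, Dict[str, str]]


def crawl(
    app: App, start: str, score: Callable[[str, str], float]
) -> Tuple[List[Tuple[str, str]], Set[str]]:
    """Fire events in order of expected payoff rather than BFS/DFS order."""
    discovered: Set[str] = {start}
    order: List[Tuple[str, str]] = []
    frontier: List[Tuple[float, int, str, str]] = []  # max-heap via negated score
    counter = 0  # tie-breaker so equal scores keep insertion order

    def push(state: str) -> None:
        nonlocal counter
        for event in app.get(state, {}):
            heapq.heappush(frontier, (-score(state, event), counter, state, event))
            counter += 1

    push(start)
    while frontier:
        _, _, state, event = heapq.heappop(frontier)
        order.append((state, event))
        target = app[state][event]
        if target not in discovered:
            discovered.add(target)
            push(target)
    return order, discovered


app: App = {
    "home": {"menu_about": "about", "open_item": "item"},
    "about": {"menu_home": "home"},
    "item": {"refresh": "item", "menu_about": "about"},
}

# Hypothetical heuristic: assume menu events rarely reveal new states.
order, discovered = crawl(
    app, "home", lambda state, event: 0.1 if event.startswith("menu") else 1.0
)
```

Here the crawler fires `open_item` before any menu event because the heuristic predicts a higher chance of reaching an unseen state; a better model simply means a better `score` function.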

While the researchers spent much of their time analyzing a “Hypercube” model meant to be more efficient than other algorithms, testing proved that a simpler “Menu-based” model was best. The Menu-based model classifies events into three categories – menu buttons, loop-backs, and standard events. Menu buttons lead to a different page with the same state every time they are clicked. Loop-backs cause a page to link back to itself when they are clicked. The menu-based approach proved the most effective, since many of the redundant states in Rich Internet Applications are menu-based.
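The three event categories can be illustrated with a small classifier over observed transitions. This is a hypothetical reading of the definitions above, not the paper's implementation: the `classify_events` function, the `EventKind` enum, and the rule that an event must be seen from at least two source states before it counts as a menu button are all assumptions made for the sketch.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class EventKind(Enum):
    MENU = "menu"
    LOOP_BACK = "loop-back"
    STANDARD = "standard"


def classify_events(
    observations: Iterable[Tuple[str, str, str]]
) -> Dict[str, EventKind]:
    """Classify events from (source_state, event, target_state) observations."""
    by_event: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, event, target in observations:
        by_event[event].append((source, target))

    kinds: Dict[str, EventKind] = {}
    for event, transitions in by_event.items():
        sources = {s for s, _ in transitions}
        targets = {t for _, t in transitions}
        if all(s == t for s, t in transitions):
            # Every firing left the page in the state it started from.
            kinds[event] = EventKind.LOOP_BACK
        elif len(sources) >= 2 and len(targets) == 1:
            # Same destination regardless of where it was fired (assumed rule).
            kinds[event] = EventKind.MENU
        else:
            kinds[event] = EventKind.STANDARD
    return kinds


observations = [
    ("home", "nav_about", "about"),
    ("item", "nav_about", "about"),
    ("home", "refresh", "home"),
    ("item", "refresh", "item"),
    ("home", "open", "item"),
    ("about", "open", "contact"),
]
kinds = classify_events(observations)
```

Once an event is labeled as a menu button or a loop-back, a crawler can predict its outcome from other states without firing it again, which is where the savings on redundant states would come from.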

Overall, while the Model-based search of AJAX applications has promise, it falls short of what is needed to scan most apps because it does not handle user input fields. Most of the data in AJAX applications hides behind input fields, which are built for intelligent human users instead of machines. Another weakness of the Model-based approach is that certain aspects of the page can change based on events outside of the search bot’s control, such as screen width changes in pages with responsive design, the passage of time, or even external input or triggers. Still, it may be the case that some day search engines will be able to not just read the content of web pages, but also understand what purpose they are meant to accomplish.

Given the slow, incremental advances in searching AJAX applications, the search engine crawler interface for most web apps is still the responsibility of the app’s development team. Teams building AJAX applications for the public must create custom search engine interfaces that enable quick indexing of relevant data. Sites such as Facebook and LinkedIn have used these techniques to enable public Google search of profiles, while keeping separate AJAX interfaces for their users. Still, with the ever-increasing ubiquity of Big Data, a brave new world is emerging for scanning the Deep Web. As standards such as SOAP are implemented by server-side interfaces, search engines can find new ways to index data that was previously solely the purview of custom applications.

Written by Andrew Palczewski

About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience in managing development of software projects.

3 thoughts on “Review: A Model-Based Approach for Crawling Rich Internet Applications”

Flossie says:

September 4, 2014 at 9:36 am

Write more, that's all I have to say. Literally, it seems as though you relied on the video to make your point. You clearly know what you're talking about, why throw away your intelligence on just posting videos to your blog when you could be giving us something enlightening to read?

internet service athens oh says:

September 6, 2014 at 4:13 am

Nice weblog here! Also, your website loads up very fast! What web host are you using? Can I get your affiliate link for your host? I want my web site loaded up as fast as yours lol

search says:

September 17, 2014 at 12:42 am

Fantastic post, very informative. I wonder why the other experts in this sector do not notice this. You should continue your writing. I’m sure you have a huge reader base already!
