Curated content for content, computing, and digital experience professionals

Day: February 3, 2009

Native Database Search vs. Commercial Search Engines

This topic is random and a short response to a question that popped up recently from a reader seeking technical research on the subject. Since none was available in the Gilbane library of studies, I decided to think about how to answer the subject with some practical suggestions.

The focus is on an enterprise with a substantial amount of content aggregated from a diverse universe of industry-specific information, and what to do about searching it. If the information has been parsed and stored in an RDBMS, is it not better to leverage the SQL query engine native to that RDBMS? Typical database engines might be DB2, MS Access, MS SQL, MySQL, Oracle, or Progress Software.

To be clear, I am not a developer, but I worked closely with software engineers for 20 years when I owned a software company. We worked with several DBMS products, three of which supported SQL queries. The application we invented and supported was a forerunner of today’s content management systems, with a variety of retrieval (search) interfaces. The retrievable content our product supported was limited to metadata plus abstracts up to two or three pages in length; the typical database sizes of our customers ranged from 250,000 to a couple of million records.

This is small potatoes compared to what search engines typically traverse and index today, but scale was always an issue, and we were well aware of the limitations of the SQL engines in supporting contextual searching, phrase searching, and complex Boolean queries. It was essential that indexes be built in real time as records were added, whether manually through screen forms or through batch loads. The engine needed to support explicit adjacency (phrase) searching as well as keywords anywhere in a field, in a record, or in a set. Saving and repurposing results, storing search strategies, narrowing large sets incrementally, and browsing indexes of terminology (taxonomy navigation) to select unique terms that would enable a Boolean “and” or “or” query were all part of the application. When our original text-based DBMS vendor went belly-up, we spent a couple of years test-driving numerous RDBMS products to find one that would support the types of searches our customers expected. We settled on Progress Software, primarily because of its support for search and its experience as an OEM to application software vendors like us. Development time was minimized because of good application-building tools and index-building utilities.
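To make the adjacency and Boolean requirements above concrete, here is a minimal sketch of a positional inverted index in Python. It is an illustration only, not the design of our product or of any vendor's engine; the class name `PositionalIndex`, its methods, and the sample records are all hypothetical. The point is that phrase searching needs word positions, which a plain SQL pattern match does not give you.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PositionalIndex:
    """Toy inverted index (hypothetical): term -> {record_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, record_id, text):
        # Index the record at the moment it is added, so it is searchable at once.
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault(record_id, []).append(pos)

    def term(self, word):
        """Return the set of records containing the word anywhere."""
        return set(self.postings.get(word.lower(), {}))

    def phrase(self, text):
        """Return the set of records where the words are adjacent and in order."""
        terms = tokenize(text)
        if not terms:
            return set()
        hits = set()
        for rec, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                # Every following word must sit exactly i positions after the first.
                if all(start + i in self.postings.get(t, {}).get(rec, ())
                       for i, t in enumerate(terms[1:], 1)):
                    hits.add(rec)
                    break
        return hits


index = PositionalIndex()
index.add(1, "Enterprise search engines index content in real time")
index.add(2, "The search for a real-time enterprise database engine")
index.add(3, "Real time indexing of enterprise content")

phrase_hits = index.phrase("enterprise search")                     # adjacency
and_hits = index.term("enterprise") & index.term("database")       # Boolean "and"
or_hits = index.term("engines") | index.term("indexing")           # Boolean "or"
```

Because each query returns a plain set of record IDs, narrowing a large result set incrementally is just another set intersection. Records 1 and 2 both contain "enterprise" and "search", but only record 1 has them side by side, which is the distinction a keyword-anywhere match cannot make.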

So, what does that have to do with the original question, native RDBMS search vs. standalone enterprise search? Based on discussions with, and observations of, developers trying to optimize search for special applications using generic search tools for database retrieval, I would offer the following. Search is very hard, and advanced search, including concept searching, Boolean operations, and text analytics, is harder still. Developers of enterprise search solutions have grappled with and solved search problems that arise in environments where content is dynamically changing and growing, where different user interfaces are needed for diverse audiences and types of queries, and where query results require a variety of display formats. Also, in e-commerce applications, interfaces require routine screen face-lifts that are best supported by specialized tools for that purpose.

Then you need to consider all these development requirements; they do not come out-of-the-box with SQL search:

  • Full text indexes and database field or metadata indexes require independent development efforts for each database application that needs to be queried.
  • Security databases must be developed to match each application where individual access to specific database elements (records or rows) is required.
  • Natural language queries require integration with taxonomies, thesauri, or ontologies; this means software development independent of the native search tools.
  • Interfaces must be developed for search engine administrators to make routine updates to taxonomies and thesauri, retrieval and results ranking algorithms, adjustments to include/exclude target content in the databases. These content management tasks require substantive content knowledge but should not require programming expertise and must be very efficient to execute.
  • Social features that support interaction among users and personalization options must be built.
  • Connectors need to be built to federate search across other content repositories that are non-native and may even be outside the enterprise.

Any one of these efforts is a multi-person and perpetual activity. The sheer scale of the development tasks militates against trying to sustain state-of-the-art search in-house with the relatively minimalist tools provided in most RDBMS suites. The job is never done, and in-depth search expertise is hard to come by. Software companies that specialize in search for enterprises are also diverse in what they offer and the vertical markets they support well. Bottom line: identify your business needs and find the search vendor that matches your problem with a solution they will continue to support with regular updates and services. Finally, search performance and processing speed are another huge factor to consider. For this you need some serious technical assessment. If the target application is going to be a big revenue generator with heavy loads and huge processing demands, do not overlook this step. Do benchmarks to prove the performance and scalability.

WoodWing Releases Enterprise 6 Content Publishing Platform

WoodWing Software has released Enterprise 6, the latest version of the company’s content publishing platform. Equipped with a new editing application called “Content Station”, Enterprise 6 offers article planning tools, direct access to any type of content repository, and integrated Web delivery functionality. Content Station allows users to create articles for delivery to the Web, print, and mobile devices, and offers out-of-the-box integration with the open-source Web content management system Drupal. Content Station works with Enterprise’s new server plug-ins to allow users to search, select, and retrieve content stored in other third-party repositories such as digital asset management systems, archives, and wire systems. Video, audio, and text files can then be collected into “dossiers”, edited, and set for delivery to a variety of outputs, all from a single user-interface. A built-in XML editor lets authors create documents intended solely for digital output. The content planning application lets managers assign content to users both inside and outside of the office. Enterprise’s Web publishing capabilities feature a direct integration with Drupal. Content authors click on a single button to preview or deliver content directly to Drupal and get information such as page views, ratings, and comments back from the Web CMS. And if something needs to be pulled from the site, editors can simply click “Unpublish”. They don’t have to contact a separate Web editor or navigate through another system’s interface. The server plug-in architecture also allows for any other Web content management system to be connected. http://www.woodwing.com/

Should you Migrate from SGML to XML?

An old colleague of mine from more than a dozen years ago found me on LinkedIn today. And within five minutes we got caught up after a gap of several years. I know, reestablishing lost connections happens all the time on social media sites. I just get a kick out of it every time it happens. But this is the XML blog, not the social media one, so…

My colleague works at a company that has been using SGML and XML technology for more than 15 years. Their data is still in SGML. They feel they can always export to XML and do not plan to migrate their content and applications to XML any time soon. The funny thing was that he was slightly embarrassed about still being in SGML.

Wait a minute! There is no reason to think SGML is dead and has to be replaced. Not in general. Maybe for specific applications a business case supports the upgrade, but it doesn’t have to every time. Not yet.

I know of several organizations that still manage data in the SGML they developed years ago. Early systems, such as those of several big publishers, some state and federal government applications, and financial systems, were developed when there was only one choice. SGML, like XML, is a structured format. They are very, very similar. One format can be used to create the other very easily. These organizations have already sunk their investment into developing the SGML system and data, as well as training their users in its use. The incremental benefits of moving to XML do not support the costs of the migration. Not yet.

This brings up my main point: structured data can be managed in many forms. These include XML, SGML, XHTML, databases, and probably other forms. The data may be structured and follow rules for hierarchy, occurrence, data typing, and so on, yet not be managed as XML, only exported as XML when needed. My personal opinion is that XML stored in databases provides one of the best combinations of structured content management features, but different business needs suggest a variety of approaches may be suitable. Flat files stored in folders and formatted in old-school SGML might still be enough and not warrant migration. Then again, it depends on the environment and the business objectives.

When XML first came out, someone coined the phrase that SGML stood for “Sounds Good, Maybe Later” because it was more expensive and difficult to implement. XML is more Web-aware and somewhat more clearly defined, and therefore tools operate more consistently. Many organizations that felt SGML could not be justified were later able to justify migrating to XML. Others migrated right away to take advantage of the new tools or related standards. XML also eliminates some features of SGML that never seemed to work right. It demands well-formed data, which reduces ambiguity and simplifies a few things. And tools have come a long way and are much more numerous, as expected.
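The well-formedness point can be shown with a small sketch using Python's standard `xml.etree.ElementTree` parser. The two sample fragments and the helper name `is_well_formed` are made up for illustration. The first fragment omits end tags, which an SGML DTD may permit; the second closes every element, which XML requires.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# SGML-style fragment (hypothetical): the <item> end tags are omitted.
sgml_style = "<list><item>First<item>Second</list>"

# The same content with every element explicitly closed.
xml_style = "<list><item>First</item><item>Second</item></list>"

sgml_ok = is_well_formed(sgml_style)   # False: an XML parser rejects it
xml_ok = is_well_formed(xml_style)     # True
```

An SGML parser needs the DTD to work out where the omitted tags belong, while an XML parser can check the second fragment with no DTD at all. That is a large part of why XML tools behave more consistently.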

XML is definitely more successful in terms of number and range of applications and XML adoption is an easier case to make today than SGML was back in the day. But many existing SGML applications still have legs. I would not suggest that a new application start off with SGML today, but I might modify the old saying to “Sounds Good, Migrate Later”.

So, when is it a good idea to migrate from SGML to XML? There are many tools available that do things with XML data better than they do with other structured forms. Many XML tools support SGML as well, but database systems can now manage content as an XML data type and use XPath to address nodes in processing. Wikis and other tools can produce XML content and utilize other standards based on XML, but not SGML that I am aware of. If you want to take advantage of features of Web or XML tools, you might want to start planning your migration. But if your system is operational and stable, the benefits might not yet justify the investment and disruption from migrating. Not yet!
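As a rough illustration of what XPath-style addressing buys you once content is in XML, here is a short sketch using Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document and its element names are invented for the example. Databases with an XML data type have their own query syntax, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; the element and attribute names are assumptions.
doc = ET.fromstring("""
<manual>
  <chapter id="c1"><title>Installation</title></chapter>
  <chapter id="c2"><title>Configuration</title></chapter>
</manual>
""")

# Select every chapter title by its path from the root element.
titles = [t.text for t in doc.findall("./chapter/title")]

# Select one title by filtering on an attribute value.
config_title = doc.find(".//chapter[@id='c2']/title").text
```

Here `titles` holds both chapter titles in document order, and `config_title` holds the title of the chapter whose `id` is `c2`.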