Component-based query sytem (Article)

Description ::

0. Index

1. Background
1.1 Old-style searches
1.2 Modular searches
1.3 Single-field searches
1.4 Boolean-valued query editor
2. Proposal: Set-valued query editor
2.1 First implementation
2.2 Second implementation
2.3 Enhancements, ideas

1. Background

I'm generally passionate about making search functionality work (this website somewhat excepted); I've discovered it's a lot easier to get your users to cooperate in entering data in a reasonable manner if they can get something back out of the database in a usable fashion; they have to see the fruits of their labor, it's not enough that they're being ordered to do it or that someone else is getting good reports out of it. Reports are another topic, today's is ad-hoc searches. We'll look at some existing options, and then also look (because it's all about me) at some systems I've designed, and how they might work for you (because it's also all about you!)

1.1 Old-style searches

You've seen these in all sorts of applications, web-based or not. They include a smattering of often-used fields, with little or no indication of how they should be used. Some text fields allow you to space-separate multiple terms for a little or search, while others will search for exact matches, matches that begin with your query, or matches that contain your query, often with little or no indication as to which. Numeric fields are often equals searches, but occasionally the programmers may bless you with a greater-than, less-than, or equals drop-down; not-equal is rather rare, generally with good reason. The entire query is generally considered a huge and search. You might be able to switch the whole thing to or in a few cases, but you certainly won't be given the option of setting up or(and(...,...),not(...),...) searches. In the example to the right, you'll see that you're given one scenario in which you can search by not; if any of the other fields were the one you needed to negate, that'd just be too bad. The problem isn't that you have to take what you're given -- that's almost always the case with software, and when it's not, it's only because you can modify the software yourself. The problems are the lack of consistency, the lack of modularity, and the lack of documentation as to what exactly the search screen will do for you.

A typical multi-field search
(Gmail)

Now, before you think I'm entirely bashing these -- I've built these, and they have their place. They're particularly good for first-time users, they're familiar, and they're quick to use -- you just start typing, and hit enter. That counts for a lot, and you should keep that in mind when evaluating the solution offered further down.

On the other hand, they're not friendly to power users, and it seems users in general have trouble understanding the implications of a and search: they will habitually fill in all the fields presented to them with whatever they think they know to be true, and then wonder why they don't find what they're looking for. We could switch all of these interfaces to default to or, but users would then be overwhelmed with results -- a whole new can of worms to deal with. It seems to me that we need a system that helps the user ask exactly the right question in the first place.

1.2 Modular searches

A slight improvement can be found in search systems that offer you chunks of search criteria, rather than the all-at-once, "shotgun" approach seen above. By forcing the user to select which chunks really matter, by making it slightly more difficult to fill in information that isn't absolutely critical to a successful search, these systems slow down users' habit of data-entry and focuses them on searching by just a few fields at a time.

A modular "chunked" search
(GlobalSpec)

These systems probably still don't offer the ability to nest searches or use the same chunk more than once in a single query, but with a little work it could allow a or(and(...,...),and(...,...,...) search -- each chunk would be a and block, and the whole query could be turned into a or search. From there, a little more work, allowing a chunk to be used several times, could give you even better or searches -- something like searching for pipe with a certain diameter, and made by one of several reputable manufacturers -- without overloading a single text field with the ability to contain several values for an implicit or search. The less a user must guess at the abilities of the search system, the better.

1.3 Single-field searches

On the other end of the spectrum, we have single-search fields. You've used these -- Google, for one. For searching the "web", I'll agree these are a pretty good idea. Webpages are composed primarily of text, and contrary to popular belief, the fact that a computer output the information does not mean it's in a format immediately usable by another computer. We therefore have millions of pages that, when read by a human, seem useful -- but would be useless to a multi-field, specialized query system. Luckily for you, companies like Google have invested millions in extracting what standardized information they can from webpages rather than just the pure text, providing you with neat features like "define:", or movie show times. The single-search text box is actually interpreting your query, trying to figure out if you're asking a question that could be better asked, and more appropriately served, in database terms. When you ask a question that is obviously related to the current weather in a particular area, it uses an entirely different method of answering your question than when you ask it what the weather is usually like in that same area. In effect, it's a complicated search screen that guesses which field you filled in by analyzing the text of your "plain english" question.

A single-search with extra features
(DeviantArt)

You can usually get to some of these search fields by using the "advanced" search, giving you an idea of what the search system is capable of. And then sometimes, you can't: in the case of DeviantArt, you'll have to find an article buried in the site's help system that tells you what arcane codes to type to use certain search features. By default it's a text-based search, which will look at author's name, comments, and other text related to images; when these modifiers are applied it then knows to search a few other things, like the size of an image or its categorization. One problem here is that an unrecognized search token will just default back to searching text; you won't necessarily be told that you asked the question incorrectly. DeviantArt just spits your question back at you, but without telling you how it interpreted it -- is it really searching by image size, or is it looking for a comment attached to the image that happens to contain the text "800x600"?

An image search
(Google)

Google's advanced image search will ask you some interesting questions, but as you can see in the screenshot, you'll then be somewhat concerned when your requests are not re-displayed in the search box. The URL still contains my request for small monochromatic images, but the text box in the page does not. I've learned from using it that I can search by filetype by typing in filetype:jpg, and in the future I can do so without going through the advanced search screen, but the system hasn't taught me how to ask my other questions directly. There's no apparent color:mono or size:small modifier. I'm a prisonner of the advanced-search screen. Why do I care? Haven't I expressed a preference for complicated multi-field searches anyway?

The thing is, that single-search box is really neat. A user can save his query just by saving the text in that box (rather than the URL, which most users know very little about), and send it to a friend. Simple searches can be typed directly, without lots of clicking and selecting. And if you look at the DeviantArt example again, you'll see that we can perform those complicated and, or, not searches I'm so keen on. You can even use parenthesis to group search expressions, nesting them! That's power. That power comes at a cost, of course: you can nest your queries, but unless they're text-related queries, you're probably going to write them incorrectly, and you won't know it. You'll just be frustrated that your neat idea didn't work out. Again, lots of good and bad, and we should keep all of this in mind.

1.4 Boolean-valued query editor

Jumping back to the other end of the spectrum, there's the idea of an expression editor -- one that exposes every search capability as its own node, with and, or, and no boolean operators. In other words, The Works.

A game event editor
(FreeSpace 2)

The advantage here is that it's clear that the query is well-defined; it may not be correct, but it won't be the system's fault. You have all the information you need to explain exactly what you want. You can nest complicated queries, and each type of question has its own specialized fields -- not generic text boxes. (Well, actually, in the case of the FreeSpace rule editor, I think the parameters for a given expression node in the tree were all provided in text box, but they didn't have to be. Let's just assume for the sake of argument that they did this more safely than they really did.) The downside, of course, is that it's not easy to create simple expressions, they're hard to copy-paste, they may even be hard to read to make sure you really got them right. There's always a trade-off.

I should note that the above served as one source of inspiration for what follows; the other was what I had heard (but never seen) about the QMan ("query manager") component of RPMS, a piece of government-funded software used by the Indian Health Services (IHS). QMan was (and still is, to my knowledge) entirely text-based (mainframe, terminal or console window), but what I had heard about it lead me to believe it had some neat nesting capabilities I'm about to describe. Whether or not it actually does, it served as inspiration as well.

2. Proposal: Set-valued query editor

2.1 First implementation

I originally embarked on this project because of the requirements for GPRA reports; these government-mandated reports are expressed mostly in terms of counts and ratios of people meeting certain criteria; many of these criteria fully or partially repeat each other, especially with the ratios, and they change as frequently as legislation changes. (That's another topic: periodic reports are only useful so long as their definition remains constant; it's not useful to compare apples from the previous year and oranges from the current year.) We needed something that would allow us to quickly build queries, double-check them with our users, change them as needed, and re-use parts for consistency. I had a secondary goal of providing users with an advanced search, to replace the traditional search screens that were constantly being upgraded with random fields for just a few users, resulting in a mish-mash of search capabilities.

A nested query definition
(Prometheus, AST)

The query system was a collection of individual query modules, each self-contained. There were a few generic boolean operations, such as "all of", "any of", "some of" (for "must match 3 or more of the following conditions" queries that GPRA required), and "none of"; the rest were very specific -- searches by name, tribal affiliation, condition and procedure codes, measurements (including lab results), participation in various group activities, etc. Each component knew what time of item it could find: people, visits, measurements, etc. One special component knew how to change a query from one type to another: you could search for people who had visits matching certain criteria; this allowed a single query to contain modules that returned different types of items. In the end though, the query always returned a set of items of like type. For reporting purposes, it was usually the count of those items that mattered.

Each component returned some SQL that defined how it would perform its search, exactly, regardless of any other components in the query. Each component could therefore be used individually or as part of a group. I was using Firebird 1.x at the time, and had to work around its limitations. I couldn't create queries of the form SELECT ... FROM (SELECT ... FROM), which limited my ability to create one query that included the output from all the components. I could, however, easily create views, and that's what I did. Each component therefore wound up creating one view in the database, all of identical column layout. Components that needed to pull from several sources, such as the boolean all of components, would create a query that joined between the views created by their child queries.

The boolean components were implemented more or less as follows:

- All of: SELECT a.id FROM a INNER JOIN b ON b.id = a.id INNER JOIN c ON c.id = a.id ... INNER JOIN z ON z.id = a.id

- Any of: SELECT a.id FROM a UNION SELECT b.id FROM b ... UNION SELECT z.id FROM z

- Some of (u to v of w): SELECT i.id FROM (union of child queries) GROUP BY id HAVING COUNT(*) BETWEEN u AND v

In a few cases, I knew that creating one large query that used views recursively would fail to optimize well; for those components that had a bad habit of slowing things down, I created reusable temp tables; when asked for the name of their view, they would run their part of the query, dump the results to a temporary table, and return the name of a view that would pull results from that temporary table. This hand-optimization was unfortunate, but necessary. The other limitation was that views, as with all metadata changes (DDL) in Firebird, depend on committing the transaction. I had to make several passes: one to create views (committing often), one to populate temp tables and run queries, and another to drop the views when everything was finished, and everything had to be done in just the right order to avoid dependency errors.

I mentioned earlier that many parts of these queries would be reused; I wanted to take advantage of that at runtime as well. Each query component knew how to serialize itself (turn its definition into a simple string that is easily saved to a database, or the clipboard, and understood later), and I used that ability to detect duplicate chunks in a given query. When found, I would force both parts of the query to return the same view name; if the results were going to be sent to a temporary table, only one copied would be created. As I recall, I tried to always force duplicate chunks to prepare themselves ahead of time and send the results to a temporary table, to avoid recalculating the same expression twice. Although the query definition was shown to the user as a tree, with clearly distinct branches, it was internally a graph, where branches could merge before execution.

I'll come back to technical details later (and moreso) for the newer version of this system, but let's talk about the benefits to the user. The user could save all of the query, or just a subtree, and drop it into another query. Users could use copy/paste semantics to move a chunk from one place to another, or email the definition to another user. Sadly, the serialized (text) version of the query was very messy, not something you would want users typing in themselves. Unlike the single-search queries we saw above, there was no way to just type in a simple query and "go". Users could also take an entire chunk of a query, and reduce it just to a comment; once they were sure a certain part was correct, they could stop looking at it and just trust that it was right. I realize it's hard to teach programmers to comment their code, and even harder for non-programmers, but at least the capability was there. Any component could be used any number of times, the queries could be infinitely nested, their definitions stored and re-used at will. What wasn't present was a system to make sure a re-used query chunk was kept synchronized across the different places it was used -- once pasted into a query, such a chunk lost its identity, and would not be updated when its template was modified. To address the readability issue, there was a standardized mechanism for getting a "plain text" explanation of a query out of any component; placed on a header page of a report, it explained every detail of the query that prepared the report, and even included the user's comments, if any were ever provided. Years later, it would still be possible to determine exactly why a printed report contained what it did. As much fun as it is to save query definitions, there's also the issue of saving query results; a mechanism was therefore provided to save the results to the database, and a search component was provided whose result would be to pull that saved result back out. You could therefore run one query, dump its result to the database, then create a whole new query that used the results (rather than the definition) of the previous one.

Queries and actions
(Prometheus, AST)

To keep things consistent, functionality such as "save these results for later use" was provided as actions. Like queries, you could prepare several (even several of the same type) and run them all at once on the results of your query (without rerunning the query.) Like queries, they could serialize themselves and entire sets of actions (like macros) could be saved for later use.

END of article content.

[this time, no outline was written, just a TOC. on my list to come back to.]

Continued at top