Web application development is currently a hot spot for the application of design patterns. Common patterns and development strategies all revolve around the separation of “business logic and display logic.” The focus has so far been on the separation of the interface code and the code that implements the functionality of the application, but almost completely ignored has been the separation of code that produces or obtains data and code that processes data that should be happening within the application code.
Obtaining data is a trivial task, which is why it generally receives so little focus. However, obtaining data correctly is not a trivial task. Filtering addresses this directly, by providing the critical component between the data source and the data sink. Many abstraction layers exist for PHP, such as ADOdb, PDO – PHP Data Objects, and Zend_Db. However, all of these provide a simple transport layer for sending a query to a database and managing its result. They wrap around the real data source (a MySQL, PostgreSQL, etc. database) and provide an endpoint that, in theory, will allow a programmer to change data sources without modifying his code. As a result, all of these data sources lack the ability to uniformly filter through results, which results in either clunky code using an abstracted interface or lots of handwritten raw queries that are non-portable anyway. Adding filtering as a means of controlling the data source is a solution to this problem.
Here, we will digress from web applications to the toolkit of black magic that is Lisp/Scheme. If you’re not interested in the theoretical build up, skip down two sections to where we return to more practical matters.
Theoretical Reasoning
I was first introduced to the concept of filtering while reading Structure and Interpretation of Computer Programs the textbook for MIT’s introductory computer science course, 6.001. The basic idea is that a filter is a way to interleave the creation of a sequence of values and its processing. For example, let’s consider what happens if one wishes to find the first three prime Fibonacci numbers less than 10000 (which are 2, 3, and 5). There are two ways to do this: first, the generation and testing of numbers can be treated separately, which allows the methods for computing Fibonacci numbers to be separated from the method for testing for prime numbers. In this case, if we wanted to change our search from Fibonacci numbers to Catalan numbers, it’d be as simple as substituting a different generation method. With this strategy, 20 numbers must be generated, but only 6 are ultimately tested before the requirements are met. This means that 14, or 70%, of the generated numbers are discarded without being used, which is unacceptable.
The second option is the much more natural choice: combine the generation of Fibonacci numbers with the test for primes. However, this means that the entire program must be changed in order to go from Fibonacci number to Catalan numbers. Most people would avoid this by creating a method that uses some sort of internal state to return the next Fibonacci number with each successive call, which improves portability and pluggability, but is not thread safe. This is, however, a start, and it can be formalized into the “stream” concept described in SICP.
A stream is simple: it is a sequence of values that are generated as requested (an instance of lazy evaluation). A filter is simply a function that takes a single value and returns a boolean indicating whether or not that value meets the requirements of the filter. It is then easy to build a stream filter, which takes a filter function and a stream, and returns a stream that is the result of applying the filter to every element of the stream. Recall, however, that streams are implemented with lazy evaluation, so the filter is not actually applied to the underlying stream until values are requested, which ensures low overhead in processing large streams where only the first few values are required—only those values will be computed. Naturally, filters can naturally be stacked to produce a series of values that match a variety of different criteria, without compromising the integrity of any of the individual checks, and without giving care to the ultimate source of the data. A Fibonacci stream can easily be replaced with a Catalan stream by passing a different parameter into the prime filter stream. This achieves the modularity and independence of the first approach while maintaining the efficiency of the second approach: the first three values of the prime stream are requested before the appropriate conditions are met, which in turn only requests the first 6 values from the Fibonacci stream (4 values to find 2, 1 value to find 3, and one value to find 5). There are a lot of possibilities with filters. They can be constructed as simple functions, or as functors/closures with internal state, which allows for extended parameterization.
From Lisp to the Real World
Now that filters have shown their theoretical promise in Lisp, it is time to take them to an imperative language that is commonly used to build applications: C, C++, or Java. This comes with a number of important changes: in Lisp, it is easily to implement lazy evaluation on a large scale; in C++, it is not; in Lisp, the primary data structure is a list; in C++, a number of data structures are going to be used. This means that the most reasonable definition for a stream and a stream filter must change. In C++, a stream can be any data structure, and a stream filter is any object or function that can take a filter predicate and a fully populated data structure, and return a data structure of the same type that contains all of the objects from the given structure that match the filter predicate. This could be extended to work with infinite streams or use some forms of lazy evaluation, but it has been much more common, in my experience, to be interested in extracting subsets of existing data.
We’ll start by declaring filters as objects, because this will enable us to take multiple filters and stack them on top of one another to create simple, yet flexible ways of matching multiple conditions as if with an AND clause. It would be possible to create filters that implement any logical operation on other filters (for example, finding items that match either of two filters (OR) or only one of two filters (XOR)), but we’ll concern ourselves with the most immediately useful (and therefore worth the most time investment—the rest is “interesting” from a theoretical standpoint, but until it comes up in practice, isn’t worth investigating as more than a conceptual notion). It is also possible to use templates/generics and other OO tricks to make filters work with any data structure via iterators, but we’ll start with a really simple type of filter: one that takes a vector of elements and returns a new vector of elements that match the filtering predicate. Templates will still be necessary, however, to ensure that the filter can interact with any type of object:
namespace filters {
template <typename Find>
class Filter {
typedef Filter<Find> FILTER_TYPE;
typedef std::vector<Find> CONTAINER;
typedef typename CONTAINER::iterator ITER;
public:
Filter(void) : filterFunc(NULL), subFilter(NULL) {}
Filter(bool (*filterFunc)(Find item)) : filterFunc(filterFunc), subFilter(NULL) {}
~Filter(void) {}
void stack(FILTER_TYPE *f);
bool test(Find item);
CONTAINER *filter(CONTAINER *items);
virtual bool operator()(Find item);
private:
bool (*filterFunc)(Find item);
FILTER_TYPE *subFilter;
};
};
In particular, this filter takes as a constructor parameter any one function that takes as a single argument a pointer to an object of the specified type and returns a boolean indicating whether the object should be accepted by the filter. One additional feature is present though. The filter object is actually a functor, because it allows for the filter to not only encode state information, but to represent the predicate used to test items as well. When implementing real filters, because they may represent comparisons to some standard value, but the filter definition calls only for one parameter to the test function (this is a consequence of ability to curry variables through lambda functions in Lisp/Scheme), so it is impossible to use an external function as a filter, as that standard value would also have to be supplied. I’ll give an example of this as part of a practical filter demonstration.
Now the meat of the filter has to be filled in. Note that this is not a real filter! This is the mechanism by which all filters work. For completion’s sake, however, we’ll implement the functor nature of this filter to simply test that an item is not null. Also note that because this is a template definition, this could should actually replace the declaration lines above, within the class declaration, because it must be included with the header so that the compiler can generate the appropriate source code:
void stack(FILTER_TYPE *f) {
if (this->subFilter == NULL)
this->subFilter = f;
else
this->subFilter->stack(f);
}
bool test(Find item) {
bool status;
if (this->filterFunc != NULL)
status = (*this->filterFunc)(item);
else
status = (*this)(item);
if (subFilter != NULL)
status = status & this->subFilter->test(item);
return status;
}
CONTAINER *filter(CONTAINER *items) {
CONTAINER *rtn = new CONTAINER;
ITER iter;
for (iter = items->begin(); iter != items->end(); iter++) {
if (this->test(*iter))
rtn->push_back(*iter);
}
return rtn;
}
virtual bool operator()(Find item) {
if (item != NULL)
return true;
return false;
}
Overall this is pretty simple and implements all of the basic contracts that must be satisfied by a stream filter from Lisp. Note however that the filter function will suffer from the same performance issues as a naïve approach when trying to find the first n values that match some criteria. This is a different problem from normal filtering, and a subclass of Filter should be used to do this by overriding filter() to expect two parameters and only iterate until the desired number of elements has been founds. The test() method is already designed to handle the situation by testing a value against all filters at once, rather than testing all values against a single filter, so it is unaffected by the change of strategy.
Finally, let me give a practical example. As part of a C++ library I’m developing for myself to use on some game-like projects, I’ve created a few filters to do useful searches for me. One of them is to locate all objects (positions represented by single points) within a specific distance from some reference point. This base class is generic enough that its only dependence is the Point class itself, but that should be simple enough to implement on your own:
namespace filters {
using namespace geom;
template <typename Find>
class DistanceFilter : public Filter<Find *> {
typedef Point *(Find::*FindPointFunc)(void) const;
public:
DistanceFilter(Point *p, double d, FindPointFunc getPosition, bool outer = false) :
Filter<Find *>(), pFunc(getPosition), p(p), dist(d), invert(outer) {}
~DistanceFilter(void) {}
virtual bool operator()(Find *item) {
Point *point = (item->*pFunc)();
return (p->distance(point) < dist) ^ invert;
}
private:
FindPointFunc pFunc;
Point *p;
double dist;
bool invert;
};
};
This particular class takes as state variables the reference point and the distance to test against. It also requires a boolean indicating whether the test is for less than or greater than the specified distance (i.e. Whether the location of the object is inside or outside the circle of radius d). Finally, it takes a function pointer to the method that should be used to obtain the target object’s position. Most people would say that function pointers (and pointers in general) are quite dangerous, but I’m reasonably certain that in this case, because the compiler must expand all of the templates, it is able to remove the function pointer entirely. If not, then I’m sure that this could be re-written in such a way that it would retain a generic application, but without the function pointer. The time investment in figuring this out seems wasted to me, because the solution I have works well for my purposes.
From the Real World to PHP
“That’s great,” you may be saying (if you read the section on C++), “but how does this apply to a situation involving a database abstraction layer? The DB already provides the data that matches the criteria that I want, so there is no point in doing additional filtering on the results, unless I’m doing something horribly wrong with my SQL queries!” And you would be correct. If your queries aren’t giving you exactly what you want, and it isn’t because your query would require complicated SQL as compared to some simple iterative parsing, I would agree that you should fix the queries, not introduce additional post-processing. But, what if you have a lot of different combinations of things to search for? There must be an easier way to specify combinations of searches than through explicitly written queries for each combination.
Imagine you’re implementing webmail using PHP. How many ways are there to search through emails? By sender, by subject, by date, by read status, etc. The list goes on. And what if you want an “advanced” search wherein you can specify multiple parameters at once? The headaches that would result from attempting to write out queries for all possible combinations of search features (If there are five different things tagged to an email that one can search through, there are 32 ways to combine some, all, or none of those criteria into a search—do you really want to write out 32 queries?). The solution is to take the filtering concept described it above, and refactor it to work with databases. The only difference is that now, instead of acting on sets of data, the filter will produce a query to send to a database, in order to return the desired data set.
In PHP, like in C++, a filter will be a class. Using knowledge about the structure of the table, it will create a way to easily specify the parameters to a query, provide security against SQL injections, and create combinatorial queries without any explicit query writing. Also, in keeping with the tradition of stream filters, these filters will be stackable, but that implementation is beyond the scope of this article (read: interesting but I haven’t done it yet). Although it is not as commonly useful, due to limitations on PHP (resolved with late static bindings, added in 5.3 and up), all filters will exist as objects with state, rather than only the ones that need state. It has to be done in order to centralize some of the core functionality. Ideally, a query searching through email using a filter would look something like this:
$params = array();
$params['sender'] = 'joe@smith.com';
$params['date'] = '20081117'; // YYYYMMDD
$filter = new EmailFilter();
$query = $filter->create_query($params);
$result = mysql_query($query);
while ($email = mysql_fetch_array($result)) {
// ... Process here.
}
Options such as sender, date, subject, etc. could be added and removed at will, with the filter generating a correct query in any case, and, even more importantly, generating a reasonable, logical query. MySQL will be able to handle any correct, yet inane queries, but it would be nice if they resembled something that a human might create for debugging purposes, and possibly for optimization purposes (10% in parse time goes a long way if you’re hammering out thousands of queries). For more complicated queries, joins, etc. this becomes increasingly difficult, but a lot of queries that are commonly used are pretty simple.
In a stroke of luck, the filters are even easier to implement in PHP than they were in C++. 99% of the functionality is again in the base class, with the default behavior for derived classes specifying only the table that a query should use:
abstract class Filter {
protected $table = 'default';
public function create_query($params) {
$where = $this->decode_filter($params);
$query = 'SELECT * FROM `'.$this->table.'` '.$where;
return $query;
}
public function decode_filter($params) {
$wheres = array();
foreach ($params as $key => $name) {
$wheres[] = '`'.mysql_real_escape_string($key).'` = "'.mysql_real_escape_string($name).'"';
}
if (!empty($wheres))
return 'WHERE '.implode($wheres, ' AND ');
return '';
}
}
class EmailFilter extends Filter {
public function EmailFilter() {
$this->table = 'emails';
}
}
However, this implementation is really stupid. It does nothing with its data, and it provides none of the features I mentioned earlier, outside of basic processing of columns and match values. And, with things the way that they are, the EmailFilter class serves almost no purpose, as it provides a simple piece of configuration data that could actually exist as a parameter to create_query() and save a lot on processing overhead. A real filter should be able to handle more than just simple string data, and it should be capable of handling complex data types. The purpose of any library is to simplify the life of the user (the developer using the application), so if it is unable to provide functionality without lots of extra work on the programmer’s part, it is not worth using.
With that in mind, the base Filter class must be modified to become aware of the columns in some way, so that it can be smarter with data preparation, and it must provide hooks for subclasses to specify new columns and provide their own processing for those columns:
abstract class Filter {
protected $table = 'default';
protected $special;
protected $columns;
protected $wheres;
public function __construct() {
$this->wheres = array();
$this->special = array();
$this->columns = array();
}
public function create_query($params) {
$where = $this->decode_filter($params);
$query = 'SELECT * FROM `'.$this->table.'` '.$where;
return $query;
}
public function decode_filter($params) {
$this->wheres = array();
foreach ($params as $key => $value) {
$this->process($key, $value);
}
if (!empty($this->wheres))
return 'WHERE '.implode($this->wheres, ' AND ');
return '';
}
public function process($key, $value) {
if (!empty($this->special) && in_array($key, array_keys($this->special))) {
return call_user_func($this->special[$key]['callback'], $key, $value);
}
else if (is_array($value)) {
$values = array();
foreach ($value as $item) {
$values[] = $this->handle_type($key, $item);
}
$this->wheres[] = '`'.mysql_real_escape_string($key).'` IN ('.implode($values, ', ').')';
}
else {
$this->wheres[] = '`'.mysql_real_escape_string($key).'` = '.$this->handle_type($key, $value);
}
}
protected function handle_type($key, $value) {
if (isset($this->columns[$key]) && $this->columns[$key]['type'] == 'int')
return (int) $value;
else
return '"'.mysql_real_escape_string($value).'"';
}
}
class EmailFilter extends Filter {
public function __construct() {
$this->table = 'emails';
$this->special['recipient'] = array('callback' => array($this, 'process_recipient'));
}
public function process_recipient($key, $value) {
$this->process('to', $value);
$this->process('cc', $value);
$this->process('bcc', $value);
}
}
The end result is a very flexible class, Filter, that can accept data in a number of formats (literals or arrays of literals—objects are slightly more complicated to handle). Along with it is a very simple example “real” filter that implements one simplifying macro, ‘recipient,’ which is used to populate values for the to, cc, and bcc fields of the email search with one phrase. The use of a child class is much more justified here, as it is now adding significant new functionality by providing a definition for a special key word processor that is not included with the normal output. Processing for object data types can be implemented readily in this way by manually taking the relevant object fields and passing them back up to a call to process. Remember, the end goal is ease of use for the developer who must use the filters to create SQL queries, and I think it is pretty easy to pass an array of values to the filter and have it implicitly iterate over it, processing each value according to the rules already specified:
$filter = new EmailFilter();
$params = array('from' => array('john_smith@gmail.com', 'tj@monticello.org'), 'subject' => 'Hello');
echo $filter->create_query($params);
echo '';
$params = array('recipient' => array('t1@gmail.com', 't2@gmail.com'));
echo $filter->create_query($params);
Produces:
SELECT * FROM `emails` WHERE `from` IN ("john_smith@gmail.com", "tj@monticello.org") AND `subject` = "Hello"
SELECT * FROM `emails` WHERE `to` IN ("t1@gmail.com", "t2@gmail.com") AND `cc` IN ("t1@gmail.com", "t2@gmail.com") AND `bcc` IN ("t1@gmail.com", "t2@gmail.com")
Conclusions
Overall, the concept of a filter is an essential component of the database access and abstraction layer. I wouldn’t call this a design pattern (I hate that phrase), but more of an idiom. There’s no one filtering solution that’s correct for every situation, and they aren’t always the answer. If a handful of static queries are necessary, then that is what the code should do. If the code will be changing what it is looking for or creating complicated expressions with partial information, then some form of automated query generation based on filtering is essential. It abstracts the way that data is represented from the application programmer in the same way that functionality is abstracted away from the user interface developer, allowing him to focus more on how the data needs to be used than on how it should be obtained.
Don’t forget that filters do have a penalty associated with them. The processing time is never free, but it isn’t really significant, because of the scale on which filters operate (a handful of items to process, not thousands). In true CS spirit, though, because servers are getting faster, the important thing is development time, not absolute efficiency. Filters fit quite neatly into the rapid application development paradigm by reducing the amount of time spent creating and debugging queries (if the formulaic query is right once, when it creates a different query it’ll probably still be right unless you give it garbage). The O(n) claim of efficiency, of course, pertains to the C++ implementation of filters, where each item that is to be examined is checked O(n) times or fewer, if the filters are stacked properly. I could go on for pages about how to order them to try to maximize this type of efficiency, but it’s not the point here. In PHP, even, the generation of a query is still O(n), but it is a different type of penalty, because it is associated with string processing, rather than comparisons.
Finally, remember that this is just a brief (?!) overview of the possibilities. I certainly haven’t explored all of the options, and there are a lot of different ways filters can be implemented, in both C++ and in PHP, and there are a number of things that I didn’t implement that may prove useful.

