MARTIN-LUTHER-UNIVERSITÄT HALLE-WITTENBERG
Institut für Informatik
Prof. Dr. Stefan Brass

 

Datalog with Ordered Facts

 

Paper about Datalog with ordered predicates, i.e. predicates which are lists/arrays of facts instead of sets of facts. [Slides of the talk at Datalog 2.0 2012]

The program basically does now what is promised in the paper and on this page, with the following exceptions (features are mentioned in the paper, but not yet implemented):

But currently the main thing to do is more testing (the program currently consists of 3328 lines of Prolog).

 


Introduction to the Language, Part I: Standard Datalog


The input syntax is Datalog with extensions for ordered predicates. First, Datalog is a simplified and purified variant of Prolog (a logic programming language). Datalog is intended mainly for database applications. A program consists of a set of rules and facts.

Facts, Predicates, Constants

A fact consists of a predicate, followed by arguments in parentheses (...), separated by commas. An argument in a fact is a constant, i.e.

Identifiers must start with a lowercase letter to distinguish them from variables, e.g. X. Variables are not allowed in facts. An example for a fact is

emp('Andrew', 4000, 'Manager').

This fact consists of the predicate emp and three arguments. For instance, the first argument is the string 'Andrew'. A fact corresponds to a tuple/row in a relational database, and the predicate corresponds to a relation name. The arguments correspond to the column values. In standard Datalog, the columns are identified by position, not by name as in relational databases. The contents of a database relation is specified in Datalog as a set of facts, for instance:

% Table emp(EName, Sal, Job).
emp('Andrew', 4000, 'Manager').
emp('Betty', 3000, 'Programmer').
emp('Chris', 3000, 'Programmer').
emp('Doris', 2000, 'Clerk').
emp('Eddy', 1000, 'Salesman').
emp('Fred', 1000, 'Programmer').

The first line is an example for a comment: A comment starts with a percent sign and extends to the end of the line. Note that every fact (and every rule) must be delimited with a full stop ".".

Rules, Variables

A rule consists of a head and a body. The head is a single literal, where a literal looks like a fact, but can contain variables as arguments (the optional order specification is discussed below). The body consists of several literals, separated by commas. A rule states that if the body is true, the head must be true, too. Therefore, head and body are separated by <- (in Prolog, one would use :-). An example for a rule is:

programmer(X) <- emp(X, Y, 'Programmer').

This rule defines a derived predicate programmer (corresponding to a view in a relational database). Whenever the system finds a match for the body literal, i.e. there is a way to assign values to the variables to that the body is already known to be true, the system derives the corresponding instance of the head literal. From the facts given above and from the rule, the system automatically derives the following facts:

programmer('Betty').
programmer('Chris').
programmer('Fred').

Predicates Without Arguments

As in Prolog, predicates without arguments do not use ():

it_rains.
use_umbrella <- it_rains.

Multiple Rules about a Predicate

It is possible to define a derived predicate with several rules. Then all these rules can be used to derive facts about the new predicate. This can be used to express disjunctive conditions (logical "or"). For instance, the following rules define administrative employees as managers or clerks:

admin_emp(X) <- emp(X, Y, 'Manager').
admin_emp(X) <- emp(X, Y, 'Clerk').

Note that variable names are local to rules: A variable X in one rule has no connection to a variable X in another rule.

Recursion

Derived predicates can be used in rule bodies, and recursive definitions are possible, too. Suppose the following database relation containing direct supervisors is given:

supervisor('Betty', 'Andrew').
supervisor('Chris', 'Betty').
supervisor('Doris', 'Andrew').
supervisor('Eddy', 'Andrew').
supervisor('Fred', 'Betty').

Now it is possible to define indirect supervisors (the "transitive closure" of the supervisor relation) as follows:

boss(X, Y) <- supervisor(X, Y).
boss(X, Z) <- supervisor(X, Y), boss(Y, Z).

At the beginning, only the first non-recursive rule can be applied. It basically copies the supervisor-facts to boss-facts. In the second round, one already has these boss-facts, and can use them in the second, recursive rule. Thus, one now derives the boss-relations over two hierarchy steps. In the third round, one could use the newly derived facts, but since the company has no more hierarchy levels, one does not find matching supervisor-facts. Since no new facts can be derived, the recursion ends here.

In contrast to Prolog, termination can be guaranteed. Even programs like the following do not cause an infinite recursion:

p(X) <- p(X).
p(a).

One the recursive rule is applied, the system detects that the derived fact is a duplicate, and eliminates it. When nothing new can be derived, query evaluation stops. Note also that the sequence of facts and rules is not important (also in contrast to Prolog).

Built-in Predicates, Comparison Operators

Besides predicates defined by facts and rules, rule bodies can also contain built-in predicates, which are defined by code inside the system. At the moment, our system contains only the comparisons <, <=, =, !=, >=, >. More arithmetic operations will follow soon. Of course, for comparison operators the usual infix syntax is used, e.g.

good_salary(Name) <- emp(Name, Sal, Job), Sal > 2500.

This is an example of a rule with two body literals. The body literals are always conjunctively connected (logical "and"), i.e. both body literals must be satisfied for the rule to "fire", and derive a fact about good_salary (with the same value for the variable Name which was used to satisfy the body of the rule). In the example, the following facts are derived:

good_salary('Andrew').
good_salary('Betty').
good_salary('Chris').

In the current version, a comparison silently fails, if the arguments have incompatibel types, e.g. a string is compared with a number. Do not rely on this behaviour, a later version may print an error message. However, that is not as easy as it might seem first. In Datalog, the order of body literals is not important. Thus, it would not be correct to print an error message as long as another body literal could fail. The correct way to do this is to use another logic value "error" besides "true" and "false". If the body is evaluated to "error", one can print the message. This was all too complicated for the first version, therefore, the comparison simply returns "false" if a string and a number is compared (or if two identfiers/atoms are compared: there is no defined order on them). However, != ("not equals") returns true in such cases.

Range Restriction (Safety)

All variables used in a rule must be bound to a concrete value when the rule is applied. Therefore, every variable must appear in a normal body literal (not with a built-in predicate, not negated). For instance, it is not permissible that a variable is used only in the rule head, but not in the body. Otherwise a single application of the rule could generate an unbounded number of new facts, which would at least be difficult.

In connection with built-in predicates, there are many cases where the range-restriction rule could be extended, e.g. a comparison like X = 1 obviously binds X to a value. However, to simplify the implementation, such cases are not considered, and X must in addition appear in normal body literal. At the moment this is anyway not very important, but when built-in predicates are added to perform arithmetic computations, the range-restriction rule will be relaxed.

Queries

At the moment, queries are handled in a simple way: The derived facts for a special predicate answer are printed at the end. At the moment, this is only a small prototype to allow experimentation with the language. But it will develop over time. Of course, in a real system, one would distinguish between

At the moment, all three components are contained in a single input file. The predicate answer is similar to the function main in C/C++: It is the top derived predicate, all other predicates are relevant only if they are used directly or indirectly by answer. Probably, in a future version, an alternative syntax (more similar to the goals in Prolog) will be introduced, but at the moment, the special predicate answer suffices. Anyway, for the bottom-up evaluation used in deductive databases, such a solution is very natural.

For printing the arguments of the derived answer-facts, the variables in the head of the first rule about answer are used. Thus, in order to get more readable output, it is probably advisable to have normally only one rule about answer, with only variables as arguments in the rule head. This is no real restriction, since one can always use another auxiallary predicate to first compute the answers. For a future version, it is planned to have also a tabular output as in a relational system, currently these variable bindings are printed. This is semantically no difference, the tabular output would only look nicer in most cases.

Negation, Stratification

Body literals may be negated, which means that the corresponding positive literal is not derivable. The negation operator is written \+ as in many Prolog systems. This is similar to the "not derivable" symbol in logic. E.g., the following rule finds the top manager of the company, who has no supervisor:

has_supervisor(X) <- supervisor(X,Y).
top_manager(X) <- emp(X, Y, Z), \+ has_supervisor(X).

The auxillary predicate has_supervisor is necessary to satisfy the range restriction requirement: Every variable must appear in a positive (i.e. non-negated) body literal (with a normal, i.e. not built-in predicate). Actually, this rule could be relaxed: It is no problem if a variable appears only in a single negative body literal, and nowhere else in the rule. Probably, a future version will be more generous in this respect, but the current version has only the simple range restriction requirement.

In order to avoid inconsistencies like "p is derivable if p is not derivable", recursion through negation is forbidden. Formally, it must be possible to assign levels (non-negative integers) to the predicates such that

There are many proposals for relaxing this condition, and a few applications which would really need this. However, since this is not the main emphasis of the prototype, the standard stratification will do for the first version. Furthermore, the stratification is also used in connection with ordered predicates, and there possible generalizations are subject of future research.

 


Introduction to the Language, Part II: Ordered Predicates


In standard Datalog, predicates are sets of facts: There are no duplicates, and the order/sequence is not defined. This corresponds to relational databases, where relations are sets (or multisets in SQL) of tuples. However, in relational databases, at least for the output, it is possible to use an ORDER BY clause in SQL. Whereas this could long be seen as only a cosmetical part of the final presentation of the query result, newer developments of SQL made order important also during query evaluation, for the main logic of the query. For instance, in Oracle it is possible to retrieve the three employees with highest salary as follows:

SELECT ENAME, SAL
FROM   (SELECT ENAME, SAL
        FROM EMP
        ORDER BY SAL DESC)
WHERE ROWNUM <= 3

This example is interesting, because it shows that a subquery, corresponding to a view or a derived predicate, might need a defined order. In older SQL standards, it was not possible to use ORDER BY in subqueries or view definitions. Now it has become important and was legalized in SQL-2008.

Ordered Predicates, Defining Order

Our language extension is to permit "ordered predicates" in addition to the standard predicates. The semantics is a multiset of facts, on which a partial order is defined. This includes especially arrays, where the facts have a linear order. In order to distinguish the two types of predicates, a special declaration of the ordered predicates is needed, e.g.

ordered emp_by_sal/2.

As usual in Prolog, it is possible to use the same predicate name with different arities (number of arguments). Therefore it is necessary to say explicitly that only emp_by_sal with two arguments is an ordered predicate. Predicates for which no ordered-declaration is given are considered as standard predicates. The declaration must precede the first use of the predicate in a rule head or body.

One can also view the ordered predicates as standard predicates with an additional, usually hidden, list-valued argument which defines the order. It is list-valued because there can be several ordering criteria of different priority. If two lists differ already in the first element, that defines the order. If they are equal in the first element, but differ in the second, that defines the order. If one list is a prefix of the other list, the prefix comes first. The lists can contain special terms of the form desc(X), which are sorted in inverse order. So the list elements are sorted as follows:

If different kinds of terms are compared, this is likely to be an error. The current implementation prints no warning, but this might be added in future. There is also a slight incompatibility with the comparison lterals: While 'a' < 1 and 'a' >= 1 currently both fail because of incompatible types, in the order routine this behaviour is not possible: One must come first, and currently this is the number. If a later version produces a type error at runtime, this asymmetric behaviour is removed.

It is possible that distinct facts have idential ordering values: For the functions rank and dense_rank (see below) this is important, for the function row_number (and thus for output) the implementation can decide which fact to put first. The inverse case is also possible: Two facts can differ only in the special ordering argument, then they are basically duplicates.

Because the ordering argument is often hidden, and it might not be explicitly stored to improve efficiency, it is not contained in the standard list of arguments, but it is written in angle brackets between the predicate and the normal argument list. In the example, we need to order employees by salary in descending order. Thus, the ordering argument is desc(Sal). However, the term would look a little long, therefore the notation ^Sal is used. The symbol ^ is the nearest one can get in ASCII to an arrow in upward direction, which symbolizes the direction of increasing values. Thus, the predicate emp_by_sal, which corresponds to the subquery in the SQL query above, is defined by the following rule:

emp_by_sal<^Sal>(EName, Sal) <- emp(EName, Sal, Job).

If for equal salaries, one wants to order by employee name (ascending), one can add a sorting criterion of smaller priority:

emp_by_sal<^Sal,EName>(EName, Sal) <- emp(EName, Sal, Job).

It is possible to define ordered predicates with several rules, then the union of the derived facts is computed, and the ordering values of all derived facts for the predicate determine the sort order. I.e. the facts derivable with the different rules can appear intermixed in the resulting sequence. Often, however, one wants that the facts derivable with the first rule come first, and then the facts derivable with the second rule, and so on. It is possible to use constants like 1, 2, ... in the ordering argument, but that remindes of the infamous line numbers in basic. Therefore, one can use the special ordering argument @, which is automatically replaced by the current rule number. The paper also contains a proposal for a default ordering value, which often helps to leave out the ordering argument completely, but that is not yet implemented in the current version. It is high on the TODO-list and will appear in the next version. With the default sort order, one would basically get the same order as in Prolog (Prolog has a defined order in which it produces answers, but it is not based on data values like the salary, but on the execution algorithm).

Using Positions in the Extension of Ordered Predicates

An important function of the system is that it condenses the ordering arguments, which are lists of values, basically to a single index: The position in the sorted sequence. While <...> can be used only in the rule head, in the body, one can access the "array index" of a fact. Therefore, our goal to select the three employees with highest salary can be solved as follows:

ordered emp_by_sal/2.
emp_by_sal<^Sal>(EName, Sal) <- emp(EName, Sal, Job).
answer(EName, Sal) <- emp_by_sal[N](EName, Sal), N <= 3.

This is equivalent to the above SQL query (repeated here for convenience):

SELECT ENAME, SAL
FROM   (SELECT ENAME, SAL
        FROM EMP
        ORDER BY SAL DESC)
WHERE ROWNUM <= 3

In the above solution, SQL as well as our Datalog extension, the order of the result tuples is not defined. Of course, one can decide to order the three employees by salary (highest first), i.e. to attach an ORDER BY SAL DESC clause to the outer query, too. In our Datalog extension, we declare answer as an ordered predicate, and can either use the salary as in SQL or the array index to specify the order. This latter solution is interesting, because it shows how an order can be preserved from a body literal to the defined predicate:

ordered answer/2.
answer<N>(EName, Sal) <- emp_by_sal[N](EName, Sal), N <= 3.

The default order specification, mentioned above, would do this automatically: If the predicate in the head is declared as an ordered predicate, and no order is specified, the default is first the rule number, and then the array indices of all body literals with ordered predicates in the sequence in which they appear in the body. In this way, rules are a bit like comprehension syntax: One first constructs a list of facts or fact-combinations and then can filter it with additional conditions.

Of course, it is also possible to use a different ordering sequence for the query result, i.e. we could sort the three employees with highest salary by name. This shows that more than one sort order can be used in a single query.

Partitioning

Suppose we want the employee with the highest salary for each job. E.g. Doris, a clerk, will be printed, because she is the only clerk, although she earns less than the manager Andrew. Then we need not a linear order on all facts for a predicate, but several chains (one for each job). Between the chains, the facts should be uncomparable. This is done by first partioning the facts by the job, and then ordering them only within a partition.

ordered emp_job/2.
emp_job<Job|^Sal>(EName, Sal, Job) <- emp(EName, Sal, Job).

This is somewhat similar to a two-dimensional array: The first dimension is the job (without particular order, so no indexes can be used here), the second dimension is the position in the sequence sorted by salary (for the given job). However, the syntax permits to access only the index in [...]. If the job is needed, it must be a standard argument of the fact. Given the above predicate, the top earner for each job is computed as follows:

answer(EName, Sal, Job) <- emp_job[1](EName, Sal, Job).

Ranking

In the last example, it is implementation-depended whether Betty or Chris is printed, because both earn 3000, which is the maximum for programmers in the example. Of course, such a behaviour is often not wanted, and may be considered unfair in other applications (e.g., when the best student is selected). Therefore, the SQL standard now contains a function RANK(), used as in the following example:

SELECT ENAME, SAL
FROM   (SELECT ENAME, SAL,
        RANK() OVER (PARTITION BY JOB ORDER BY SAL DESC) AS POS
        FROM EMP)
WHERE POS = 1

The function RANK() counts the number of tuples which have a strictly "better" ordering value and then adds 1 to start counting with 1 and not with 0. So in case several tuples are equal with respect to the ordering value, all get the same rank. However, afterwards it jumps to the actual number of tuples, so there are holes the in sequence of RANK() values. In contrast, DENSE_RANK has no holes, but then it loses the connection to the number of tuples. The standard index (which has no holes, corresponds to the number of tuples, but is sometimes arbitrary) is called ROW_NUMBER(). Here are the values of the three functions in the example (without partitioning):

EName Sal ROW_NUMBER RANKDENSE_RANK
'Andrew' 4000 111
'Betty' 3000 222
'Chris' 3000 322
'Doris' 2000 443
'Eddy' 1000 554
'Fred' 1000 654

In our proposal, one can choose a ranking function by using the corresponding keyword in the example, e.g.

answer(EName, Sal, Job) <- emp_job[rank:1](EName, Sal, Job).

This ensures that both, Betty and Chris, will be printed, because they have the same salary.

If one needs several ranking functions at the same time, one can access them in a single pair of brackets. E.g. suppose one wants to print the values of the three ranking functions (without partitioning), one could do that as follows:

ordered emp_by_sal/2.
emp_by_sal<^Sal>(EName, Sal) <- emp(EName, Sal, Job).
answer(EName, Sal, N, R, D) <- emp_by_sal[N,rank:R,dense_rank:D](EName, Sal).

Stratification

It is not surprising that with recursion and the possibility to determine the first literal, one can get contradictory/inconsistent situations, as with recursion through negation. E.g. the following input has no reasonable semantics and must therefore be excluded:

ordered p/1.
p<10>(a) <- p[1](b).
p<20>(b).

If p(b) is the first element in the sorted list p, then p(a) is true, which then would come first. But then p(b) is no longer the first element.

The solution is the same as in the case of negation: We require that it is possible to assign non-negative integers as "levels" to the predicates, such that for rules containing p[...](...) in the body, and q in the head, the level of q is strictly greater than the level of p. The same must hold for negation: For rules containing \+ p(...) in the body, and qin the head, the level of q must also be strictly greater than the level of p. Furthermore, in any case, if p appears in the body of a rule, and q appears in the head of the same rule (i.e. q depends on p), the level of q must be greater than or equal to the level of p.

These levels determine a computation sequence: The system can first compute all facts about predicates of level 0. There can be recursion between these predicates, but it is not yet possible to refer to the position of a fact in a sorted sequence, or the non-existence (non-derivability) of a fact. Once all derivable facts about predicates of level 0 are computed, the system sorts them and assigns positions. Now the facts for the predicates of level 1 can be computed. Their rules can refer to positions of facts of level 0 predicates, and also use level 0 predicates in negated body literals. This is possible because all derivable facts of level 0 predicates have been computed already and we know that no new facts can become derivable later.

Of course, the system can do query evaluation more efficiently, and does not necessarily have to really compute all derivable facts. However, it is important to have a very simple evaluation algorithm as a constructive definition of the semantics. The current prototype actually computes all derivable facts, but its purpose is currently only a demonstration of the language. We will work on more efficient query evaluation later.

The make this clear: Recursion with ordered predicates is possible, however, one cannot use the ordering position [...] in the recursive calls. In predicates of higher strata (i.e. which are not mutually recursive with the current predicate), one can of course use the ordering position.

Output

The starting point for the development of the language was the need to specify output declaratively. The idea was to define the output of a Datalog program by permitting to derive text pieces which are printed in some defined order. While answer is the "main"-predicate for standard queries, an ordered predicate output of arity 1 (i.e. with one argument) is the "main" predicate for doing output with Datalog. The difference is that if answer-facts are derivable, all arguments are printed with TABs to separate them, and a newline after each fact. For output, there is only a single argument, and this is printed for all derivable facts (in the defined order) without any separation (not even a space). In this way, a text can be constructed. A simple example is:

ordered output/1.
output<@>('Hello, ').
output<@>(Name) <- name(Name).
output<@>('.\n').
 
name('Nina').

With <@>, the current rule number is used to define the output sequence, i.e. the three derivable facts are printed in the sequence the facts/rules are written in the program. In the paper, a default order specification is proposed, which is precisely <@> in this example, so this part would not be needed, one could write for instance simply output('Hello '). This is not implemented yet, but will soon be added. The paper also proposes a "pattern syntax" for output, which is a big simplification and makes output very natural, but again it is only an abbreviation for standard rules, and will only be added in a later version.

In the example, three output facts are derivable:

output[1]('Hello, ').
output[2]('Nina').
output[3]('.\n').

So as explected, the program prints:

Hello, Nina.

(followed by a line break). The power lies in the possibility to use auxillary predicates to compute parts of the text. Suppose we want to compute an HTML table with the employee names and their salary ordered by employee names. This can be done as follows:

ordered sal_table/1.
ordered sal_table_row/1.
 
sal_table<@>('<table>\n').
sal_table<@>('<tr> <th>Employee</th> <th>Salary</th> </tr>\n').
sal_table<@,Pos>(Text) <- sal_table_row[Pos](Text).
sal_table<@>('</table>\n').
 
sal_table_row<EName,@>('<tr><td>') <- emp(EName, Sal, Job).
sal_table_row<EName,@>(EName) <- emp(EName, Sal, Job).
sal_table_row<EName,@>('</td><td>') <- emp(EName, Sal, Job).
sal_table_row<EName,@>(Sal) <- emp(EName, Sal, Job).
sal_table_row<EName,@>('</td></tr>\n') <- emp(EName, Sal, Job).

As mentioned before, this can be greatly simplified with the upcoming pattern syntax (see paper). Nevertheless it shows that predicates can be used to define pieces of a text, even parameterized pieces. The rule

sal_table<@,Pos>(Text) <- sal_table_row[Pos](Text).

shows how text from one predicate can be inserted into the text for another predicate.

The complete example can be downloaded here.

Loops over Facts: Accessing the Next and Last Fact

The possibility to number facts and to compute the next number permits to write loops over sets of facts. This can be used to write aggregation functions. E.g., the following program computes the sum of all salaries (the predicate is still misses in the current version of the prototype, therefore, this program cannot be testet yet):

ordered emp_list/1.
emp_list<EName>(EName, Sal) <- emp(EName, Sal, Job).
 
% sal_sum(N, S) means that the sum of the first N-1 elements is S.
% I.e. N is the index of the next element to add to the sum.
sal_sum(1, 0).
sal_sum(N1, S1) <-
    sal_sum(N, S),
    emp_list[N,next:N1](EName, Sal),
    S1 is S + Sal.
 
answer(S) <- sal_sum(nil, S).

The keyword next: in an array index binds the following variable to the index of the next array element, or nil, if the current element is the last. This simplifies the writing of loops, but is not strictly necessary. One could write the recursion also as follows:

sal_sum(1, 0).
sal_sum(N1, S1) <-
    sal_sum(N, S),
    emp_list[N](EName, Sal),
    N1 is N + 1,
    S1 is S + Sal.
 
answer(S) <-
     sal_sum(N, S),
     \+ emp_list[N](_,_).

This is not strictly equivalent to the previous solution, because in case emp_list is empty, this produces the sum 0, whereas the solution with next: gives the empty set of answers.

Of course, the possibility to compute aggregates does not mean that the language should not contain explicit aggregate functions. Aggregate functions are well-known from SQL and are obviously useful. These examples are intended only to demonstrate the expressive power of the sorting and numbering construct. Actually, it is no surprise that this is possible, since the LDL++ system can compute aggregates with nondeterministic choice. Our construct permits a kind of deterministic choice. Although a future version of the language will of course include aggregate functions, it is good to see that one can compute them directly, so one can also handle non-standard cases.

One can also use the symbol last in the array index to access the last fact in the sorted list. E.g., one can compute the minimal and maximal salary as follows:

ordered sal_list/1.
sal_list<Sal>(Sal) <- emp(EName, Sal, Job).
 
sal_range(Min, Max) <-
    sal_list[1](Min),
    sal_list[last](Max).

The complete example can be downloaded here.

 


Stefan Brass (brass@informatik.uni-halle.de), May 26, 2012

Original URL: http://www.informatik.uni-halle.de/~brass/order/   [XHTML 1.0 Checked]   [CSS Checked]   [Links Checked]