<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<article lang="en">
<articleinfo>
<author>
<firstname>Petr</firstname>
<surname>Pajas</surname>
</author>
<title role="main_title"><medialabel> <inlinemediaobject>
<imageobject>
<imagedata fileref="images/pmltq.png" />
</imageobject>
<textobject>
<phrase>PML Tree Query</phrase>
</textobject>
</inlinemediaobject> </medialabel></title>
<subtitle>Documentation for version 0.7.10 (beta)</subtitle>
</articleinfo>
<section>
<title>Introduction</title>
<para>PML Tree Query (PML-TQ) is a query language and search engine
for querying multi-layer annotated treebanks stored in the PML
data format. It can be used to query all kinds of treebanks: dependency,
constituency, multi-layered, and parallel treebanks, as well as other
kinds of richly structured annotation.</para>
<para>The query language is declarative and offers both textual and
graphical representation of queries. Currently, there are two
implementations of the query engine, one based on a relational database
(Oracle or PostgreSQL &gt;= 8.4), the other based on Perl and the TrEd
toolkit. Three user interfaces are available: a web-based interface for
the database-based query engine displaying query results as SVG, a
full-featured graphical user interface for both engines available as a
plug-in to the tree editor TrEd, and a text-only command-line
interface.</para>
<para>Main features:</para>
<orderedlist>
<listitem>
<para>queries can span over all layers of annotation (including
annotation dictionaries)</para>
</listitem>
<listitem>
<para>allows arbitrary logical constraints</para>
</listitem>
<listitem>
<para>supports output filters (generate custom text output, compute
statistics, ...)</para>
</listitem>
<listitem>
<para>offers graphical query representation with relations (links)
between nodes depicted as arrows</para>
</listitem>
<listitem>
<para>graphical user interface built into TrEd</para>
</listitem>
<listitem>
<para>understands PML data model (no conversion, no information
loss)</para>
</listitem>
</orderedlist>
<para>TODO: comparison with other query languages and engines (NetGraph,
TGrep2, TigerSearch, Nite, XPath, LPath+)</para>
</section>
<section>
<title>Basic concepts</title>
<para>A <firstterm>PML-TQ query</firstterm> consists of a
<firstterm>selective part</firstterm> that selects nodes from the treebank
and an optional sequence of <firstterm>output filters</firstterm> that are
used to extract data from the matching nodes, post-process the results,
compute statistics, generate tabular output, etc.</para>
<para>The <firstterm>selective part</firstterm> of a PML-TQ query
postulates requirements on one or more nodes from the treebank and their
mutual relationships (e.g. on the topological configuration in the tree
structure). It is formed by one or more <firstterm>node
selectors</firstterm>, which form the outermost scope of the query. Inner
scopes of the query are given by nested subqueries as described
later.</para>
<para>A <firstterm>node selector</firstterm> represents a node in the
treebank of a certain type (in the PML data model, the nodes in the
treebank annotation can be typed; the query can also refer to several
annotation layers with different types of nodes) and postulates
constraints on its properties including relationships to nodes represented
by other selectors.</para>
<para>Selectors may <firstterm>nest</firstterm> other selectors; a nested
selector belongs to the same scope as the containing selector. The nested
selector may explicitly specify the relation of its matching node to the
node matched by the containing selector; the default relation is
<literal>child</literal>. The nesting of selectors can thus naturally
follow the topology of the matching tree.</para>
<para>Selectors can also be named and referred to from other node
selectors; however, in many cases, the need for explicitly naming them can
be eliminated by nesting.</para>
<para>A <firstterm>match</firstterm> of a query is a mapping which assigns
to each outermost-scoped selector a node from a treebank (called a
<firstterm>matching node</firstterm>) of the type specified by the
selector, in such a way that all the matching nodes are mutually distinct
and simultaneously satisfy the constraints postulated by their
corresponding selectors (including constraints on their mutual
relationships). The <firstterm>match</firstterm> can be represented as a
tuple of the matching nodes ordered according to some canonical ordering
of the selectors from the outermost scope of the query. There can be zero,
one, or more distinct matches of the query in the treebank (two matches
are distinct if, as ordered tuples, they differ in at least one
node).</para>
<para>Non-identity rule: Two distinct selectors <emphasis>in the same
scope</emphasis> of the query always represent two distinct nodes in each
match of the query or sub-query.</para>
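<para>As a hedged illustration of the non-identity rule (the textual
syntax is introduced in the Tutorial section), the following sketch
assumes the PDT 2.0 node type <literal>t-node</literal> and the functor
value <literal>ACT</literal>. Because both nested selectors belong to the
same scope, the query should match only a node that has two
<emphasis>different</emphasis> children with the functor
<literal>ACT</literal>; a single such child cannot satisfy both selectors
at once.</para>
<programlisting># two selectors in the same scope must match two distinct nodes
t-node [
  t-node [ functor='ACT' ],
  t-node [ functor='ACT' ]
]</programlisting>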
<para>Selectors can postulate the following types of
<firstterm>constraints</firstterm>: <itemizedlist>
<listitem>
<para>predicates</para>
</listitem>
<!--
<listitem>
<para>directly nested node selectors</para>
</listitem>
-->
<listitem>
<para>references to other selectors</para>
</listitem>
<listitem>
<para>subqueries</para>
</listitem>
<listitem>
<para>boolean combinations of the above</para>
</listitem>
</itemizedlist> In the following descriptions, we refer to the selector
postulating a constraint as the <firstterm>current
selector</firstterm>.</para>
<para><firstterm>Predicate constraints</firstterm> assert equality,
inequality, or regular expression match between values computed from
terms. An atomic term is a constant (integer, float, or character string),
or an attribute of a node matched by the current selector or some other
selector in the current or outer scope of the query. A term is either an
atomic term or a term obtained from other terms using arithmetical
(<literal>+, *, -, div, mod</literal>) or string (concatenation
<literal>&amp;</literal> ) operators, or functions.</para>
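<para>For illustration, a predicate may compare an attribute of the
current node with an attribute of a node matched by another selector.
The following sketch assumes PDT 2.0 data and assumes that an attribute
of a named selector is written as <literal>$name.attribute</literal>; it
looks for a t-node that has a child carrying the same functor as the
node itself:</para>
<programlisting># a node and its child share the same functor value
t-node $a := [
  child t-node [ functor = $a.functor ]
]</programlisting>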
<para>A <firstterm>reference</firstterm> is a constraint on the
relationship of a node matched by some named selector to the node matched
by the current selector. The referred selector must either belong to the
same scope as the current selector or to its outer scope.</para>
<para>A <firstterm>subquery</firstterm> is formed by a selector (called
the <firstterm>leading selector of the subquery</firstterm>) nested in the
current selector and augmented by restrictions on the number of
occurrences, computed as the number of distinct nodes matched by the
leading selector of the subquery relative to a fixed match of the
selectors in the current and outer scope (including the current selector).
For example, to postulate a constraint that each node matched by the
current selector must have at least two child nodes, we create a subquery
in the form of a nested selector in the child relation to the current
selector and restrict the number of occurrences to two or more.</para>
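<para>In the textual syntax, the example just described could be
sketched as follows. The sketch assumes the PDT 2.0 node type
<literal>t-node</literal> and assumes that the restriction on the number
of occurrences is written as a count followed by <literal>x</literal> in
front of the nested selector, with <literal>2+</literal> standing for
<quote>two or more</quote>:</para>
<programlisting># a node with at least two children
t-node [ 2+x child t-node [ ] ]</programlisting>
<para>The nested selector here is the leading selector of a subquery, so
the children it counts are not part of the resulting match.</para>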
<para>The leading selector can nest other selectors. Each subquery starts
a new scope whose outer scope is the scope of the containing selector
together with the containing selector's outer scope (if any). Unlike
selectors from the outermost scope, selectors declared within a subquery
do not represent any particular node in the resulting match. They can
refer to selectors from the same scope, and also to selectors from the
outer scope, but not vice versa (selectors from the outer scope cannot
refer to the selectors in the subquery).</para>
<para>A subquery constraint is verified as follows: for each match of the
selectors in the current and outer scope, all matches of the subquery are
located (these may coincide with nodes matched by the selectors in the
outer scope). The distinct nodes matched by the leading selector of the
subquery are counted, and their number is compared with the restrictions
on the number of occurrences. The constraint is satisfied if and only if
these restrictions are met.</para>
<para>A constraint can also be a boolean combination of other constraints;
a nested node selector occurring in a boolean combination with other
nested node selectors or constraints is considered to be a subquery with
at least one occurrence.</para>
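<para>A minimal sketch of such a boolean combination, assuming PDT 2.0
data (the attribute <literal>gram/sempos</literal> and the functor values
<literal>ACT</literal> and <literal>PAT</literal>), might look as follows.
Both nested selectors occur inside a disjunction, so each of them is
treated as a subquery requiring at least one occurrence, and neither
contributes a node to the resulting match:</para>
<programlisting># a semantic verb that has an ACT child or a PAT child (or both)
t-node [
  gram/sempos='v',
  child t-node [ functor='ACT' ] or child t-node [ functor='PAT' ]
]</programlisting>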
<para>A PML-TQ query can be visualized as a graph consisting of one or
more trees whose nodes are the selectors connected by edges according to
the nesting of selectors and subqueries. In this sense we may sometimes
refer to selectors as <firstterm>query nodes</firstterm> and to the query
as a <firstterm>query graph</firstterm> or <firstterm>query tree</firstterm>
(a technical root can be added above all the trees so that a forest
becomes a single tree). The edges can be labeled or colored to represent
different relationships between nodes. References to named selectors can
be represented by an additional layer of links (edges) in the graph that
may go across the basic tree structure of the query tree.</para>
<!--
<glossary>
<glossentry>
<glossterm>
node selector (query node)
matching node
attribute
node type
relation
variable
subquery
reference
output filter
</glossterm>
<glossdef>
<para></para>
</glossdef>
</glossentry>
</glossary>
-->
</section>
<section id="tutorial">
<title>Tutorial</title>
<para>The purpose of this tutorial is to show how to create and run
queries from the tree editor TrEd. The textual form of the query can also
be used in the web or command-line interface.</para>
<para>As our examples, we use queries over the Prague Dependency Treebank
2.0; conceptually similar queries can be applied to most other treebanks,
although the node types and attributes will probably be different.</para>
<para>The tutorial gradually passes from very simple to complex queries
and demonstrates various common syntactic constructions of the PML-TQ
language. We always show how to write the query in the textual form as
well as how to build and run the query graphically in the tree editor
TrEd.</para>
<section>
<title>Getting started</title>
<para>To start using PML-TQ from the tree editor TrEd, press <keycombo
action="simul">
<keysym>Shift</keysym>
<keysym>F3</keysym>
</keycombo> or select <menuchoice>
<guimenu>Macros</guimenu>
<guisubmenu>TredMacro</guisubmenu>
<guimenuitem>New Tree Query</guimenuitem>
</menuchoice> from the main menu in TrEd. <mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tred_select_search_s.png"
format="PNG" />
</imageobject>
</mediaobject> Choose <guibutton>Treebank (server)</guibutton>; if
doing this for the first time, you will be asked to fill in the connection
form: as <literal>url</literal>, enter the search engine URL including the
port number (e.g. <literal>http://mysearchserver.org:8082</literal>,
where <replaceable>mysearchserver.org</replaceable> is the host name or
IP address of your PML-TQ search service and
<replaceable>8082</replaceable> is the port the search service is
listening on); fill in <literal>username</literal> and
<literal>password</literal> with your credentials for the search service
(which you can obtain from the administrator of the PML-TQ search
service). When done, confirm with <literal>OK</literal>. TrEd will
attempt to contact the search service you have specified and will ask it
for a list of available treebanks. Subsequently, you will be asked to
select the treebanks for which you want to configure a connection.</para>
<para>Once your connections are configured, a dialog window with a list
of available connections will be offered. To select a connection, simply
choose it from the list and press <guibutton>Select</guibutton>. You may
also use the buttons <guibutton>Add Service</guibutton> to add/remove
connections to other treebanks provided by the server of the selected
connection, <guibutton>New URL</guibutton> to create a connection to a
new server, <guibutton>Info</guibutton> to display information about the
selected connection (such as name and description of the treebank),
<guibutton>Edit</guibutton> to modify the connection data
(<literal>url</literal>, <literal>username</literal>, and
<literal>password</literal>), and <guibutton>Remove</guibutton> to
remove the selected connection from the list.</para>
<para>A window with an empty query tree is displayed by TrEd:
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tred_empty_query_s.png"
format="PNG" />
</imageobject>
</mediaobject></para>
<para>You can also use the toolkit without TrEd from a web browser
(although you will not be able to build the query graphically nor see
the graphical representation of your query). To start, just open the
search engine URL and log in using any web browser capable of combining
JavaScript, CSS, and SVG (<ulink
url="http://www.w3.org/Graphics/SVG/">Scalable Vector Graphics</ulink>).
At the time of writing this tutorial, the best choice is the Opera
browser, followed by Google Chrome, Firefox, and Safari (please avoid
Microsoft Internet Explorer because it lacks native SVG support).</para>
</section>
<section>
<title>A simple query</title>
<para>Now we may create our first simple query. We shall search for all
nodes of the type <literal>t-node</literal> (tectogrammatical nodes in
PDT 2.0) whose attribute <literal>functor</literal> equals
<literal>DPHR</literal>. In TrEd, the query can be created in several
ways:</para>
<itemizedlist>
<listitem>
<para>Method 1: Press <keysym>Insert</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/add_node.png" format="PNG" />
</imageobject>
<textobject>
<phrase>Add node</phrase>
</textobject>
</inlinemediaobject> </guibutton> on the toolbar to create a new
selector, choose <literal>t-node</literal> from the offered list of
node types and confirm with <guibutton>OK</guibutton>.</para>
<para>Then press <keysym>=</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/test_equality.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Equality</phrase>
</textobject>
</inlinemediaobject> </guibutton> to create a new atomic
constraint, and fill <literal>functor</literal> as
<literal>a</literal> and <literal>"DPHR"</literal> as
<literal>b</literal> in the form (both the attribute name and its
value can be selected from a list). <mediaobject>
<imageobject role="html">
<imagedata align="center"
fileref="tred_new_equality_test_s.png" format="PNG" />
</imageobject>
</mediaobject></para>
</listitem>
<listitem>
<para>Method 2: Press <keysym>e</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/edit_file.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Edit query</phrase>
</textobject>
</inlinemediaobject> </guibutton> to open the query text editor
and type <programlisting>t-node [ functor='DPHR' ]</programlisting>
into the text field.</para>
</listitem>
<listitem>
<para>Method 3: Open the query text editor as above, but use helper
buttons below the text field to build the text of the query: press
<guibutton>Type</guibutton> and select <literal>t-node</literal>,
press <guibutton>[ ]</guibutton>, put the cursor between the
brackets by clicking there or by pressing the <keysym>left
arrow</keysym> two times, press <guibutton>Attribute</guibutton> and
select <literal>functor</literal> and finally press
<guibutton>=</guibutton> and select
<literal>"DPHR"</literal>.</para>
</listitem>
</itemizedlist>
<para>The result in TrEd will look like this:</para>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tutorial_queries_1.png"
format="PNG" />
</imageobject>
</mediaobject>
<para>To start the search now, press <keysym>Space</keysym> or
<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/search_filter.png" format="PNG" />
</imageobject>
<textobject>
<phrase>Query</phrase>
</textobject>
</inlinemediaobject> </guibutton> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/search.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Search</phrase>
</textobject>
</inlinemediaobject> </guibutton>. After a while, a window will pop up
indicating whether some results have been found, and pressing
<literal>Display</literal> will show the first result in a new view:
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tred_first_result_s.png"
format="PNG" />
</imageobject>
</mediaobject> To see the full sentence in the text field above the
trees, click on the result view on the right. The next result can be
displayed by pressing the key <keysym>n</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/search_next.png" format="PNG" />
</imageobject>
<textobject>
<phrase>Next match</phrase>
</textobject>
</inlinemediaobject> </guibutton>.</para>
<note>
<para>By default, the search engine returns up to 100 matches (in no
particular order), which should be more than sufficient for viewing a
few matching examples. This limit can be changed in the search engine
configuration (displayed by pressing <keysym>C</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/configure.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Configure</phrase>
</textobject>
</inlinemediaobject> </guibutton>), but raising this limit may slow
the search. We shall later see how to compute the number of all
matches, using output filters.</para>
</note>
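<para>As a brief preview, and assuming that an output filter is appended
to the selective part of the query after <literal>&gt;&gt;</literal>, the
total number of matches of our first query could be obtained roughly as
follows (a sketch only; output filters are treated in detail later in
this document):</para>
<programlisting># count all matches instead of listing them
t-node [ functor='DPHR' ] &gt;&gt; count()</programlisting>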
<para>Press <keysym>p</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/search_previous.png" format="PNG" />
</imageobject>
<textobject>
<phrase>Previous match</phrase>
</textobject>
</inlinemediaobject> </guibutton> to go back to the previous match and
press <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/apply.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>This match</phrase>
</textobject>
</inlinemediaobject> </guibutton> to bring the current match back into
view.</para>
<para>Note that the result view contains not only the matching tree but
the complete document, so it is possible to see the tree preceding or
following the currently matching tree by pressing <keysym>Page
Up</keysym>/<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/1leftarrow.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Previous Tree</phrase>
</textobject>
</inlinemediaobject> </guibutton> or <keysym>Page
Down</keysym>/<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/1rightarrow.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Next Tree</phrase>
</textobject>
</inlinemediaobject> </guibutton>, respectively.</para>
</section>
<section>
<title>A query with two nodes</title>
<para>We shall now make the query more complex by adding another node to
it. We shall ask for a t-node with functor "DPHR" that has a child
(since DPHR t-nodes are often leaves, it may be interesting to see what
children we get).</para>
<para>To add a new child, select the only node in our query tree and
press <keysym>Insert</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/add_node.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Add node</phrase>
</textobject>
</inlinemediaobject> </guibutton> on the toolbar. A pop-up window now
appears, asking for the type of relation of the new node to the existing node.
The default value is <literal>child</literal>. Since this is what we
want, click <guibutton>OK</guibutton>. TrEd automatically assumes the
new query node to be a selector for nodes of the type
<literal>t-node</literal>, since, according to the PML schema of the
tectogrammatical layer, a <literal>t-node</literal> can only have a
<literal>t-node</literal> for its child; otherwise it would offer a list
of node types to choose from.</para>
<para>In text form, the query can be equivalently expressed as
<programlisting>t-node [ functor='DPHR', t-node [ ] ]</programlisting>
or <programlisting>t-node [ functor='DPHR', child t-node [ ] ]</programlisting>
These forms use nesting of node selectors; the first form makes use of
the fact that the default relation of a nested selector to the selector
in which it is nested is <literal>child</literal>.</para>
<para>The query can also be expressed without nesting, using names,
either as <programlisting>t-node $a := [ functor='DPHR', child $b ];
t-node $b := [ ];</programlisting> or <programlisting>t-node $a := [ functor='DPHR' ];
t-node $b := [ parent $a ];</programlisting> naming the two nodes
<literal>$a</literal> and <literal>$b</literal> and either indicating
that <literal>$a</literal> has a child <literal>$b</literal> or that
<literal>$b</literal> has a parent <literal>$a</literal>.</para>
</section>
<section>
<title>Disjunctions, regular expressions and set enumerations</title>
<para>We now extend our query expression to also cover t-nodes with
functor <literal>CPHR</literal>. This can be done in three different
ways:</para>
<para>Using a disjunction: <programlisting>t-node [ functor='DPHR' or functor='CPHR', t-node [ ] ]</programlisting>
Apart from editing the query, this can be created using the GUI e.g. in
the following steps: <orderedlist>
<listitem>
<para>Select the top query node (the one with the
<literal>functor="DPHR"</literal> constraint).</para>
</listitem>
<listitem>
<para>Press <keysym>h</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/toggle_hide_subtree.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>(Un)Expand</phrase>
</textobject>
</inlinemediaobject> </guibutton> to expand the constraints to
auxiliary nodes in the query tree. The result will look like
this:</para>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tutorial_queries_1.png"
format="PNG" />
</imageobject>
</mediaobject>
</listitem>
<listitem>
<para>Select the auxiliary node representing the constraint
<literal>functor="DPHR"</literal> and press <keysym>o</keysym> or
<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/or.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>OR</phrase>
</textobject>
</inlinemediaobject> </guibutton> to create an auxiliary
<literal>OR</literal> node above it.</para>
</listitem>
<listitem>
<para>Select the <literal>OR</literal> node and press
<keysym>=</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/test_equality.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Equality</phrase>
</textobject>
</inlinemediaobject> </guibutton> to create a new equality
constraint and as above, fill <literal>functor</literal> as
<literal>a</literal> and <literal>"CPHR"</literal> as
<literal>b</literal> in the form. Alternatively, you can select
the node <literal>functor="DPHR"</literal>, press <keycombo
action="simul">
<keysym>Ctrl</keysym>
<keysym>Insert</keysym>
</keycombo> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/editcopy.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Copy</phrase>
</textobject>
</inlinemediaobject> </guibutton> to copy it into the clipboard,
then select the <literal>OR</literal> node and press <keycombo
action="simul">
<keysym>Shift</keysym>
<keysym>Insert</keysym>
</keycombo> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/editpaste.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Paste</phrase>
</textobject>
</inlinemediaobject> </guibutton> to paste it. Then press
<keysym>Enter</keysym> or double-click the node or just the word
<literal>"DPHR"</literal> on one of the two nodes and change
<literal>"DPHR"</literal> to <literal>"CPHR"</literal>.</para>
</listitem>
<listitem>
<para>The result will look like this: <mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tutorial_queries_2b.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="tutorial_queries_2b.pdf"
format="PDF" />
</imageobject>
</mediaobject> Selecting the top t-node and pressing
<keysym>h</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/toggle_hide_subtree.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>(Un)Expand</phrase>
</textobject>
</inlinemediaobject> </guibutton> hides the auxiliary nodes, and
gives this: <mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="tutorial_queries_2c.png"
format="PNG" />
</imageobject>
</mediaobject></para>
</listitem>
</orderedlist></para>
<para>Using a regular expression: <programlisting>t-node [ functor ~ '[CD]PHR', t-node [ ] ]</programlisting>
The symbol <literal>~</literal> (tilde) denotes a binary relation between
two values that is true if and only if the value on the left, interpreted
as a string, matches the value on the right, interpreted as a regular
expression. The regular expression <literal>[CD]PHR</literal> is matched
by any string containing either <literal>CPHR</literal> or
<literal>DPHR</literal> as a substring. If, for example,
<literal>XDPHR</literal> were a possible functor value, we would have to
be more precise and rewrite the expression as <programlisting>t-node [ functor ~ "^[CD]PHR$", t-node [ ] ]</programlisting>
Since the meta-characters <literal>^</literal> and <literal>$</literal>
match only at the start and the end of a string, respectively, the value of
<literal>functor</literal> must now be exactly <literal>CPHR</literal>
or <literal>DPHR</literal>.</para>
<para>Creating a regular expression test graphically in TrEd is similar
to creating an equality test: either press the <keysym>~</keysym> key or the
<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/test_regexp.png" format="PNG"
valign="middle" />
</imageobject>
<textobject>
<phrase>RegExp</phrase>
</textobject>
</inlinemediaobject> </guibutton> toolbar button. Do not forget to
enclose the regular expression (field labeled <literal>b</literal> in
the dialog) in apostrophes or quotes, since in the PML-TQ syntax it is
just a literal string.</para>
<para>Using a set enumeration: <programlisting>t-node [ functor in { "CPHR", "DPHR" }, t-node [ ] ]</programlisting></para>
<para>The relation <literal>in</literal> asserts that the value computed
from the expression on the left equals the value of at least one of the
expressions listed in the set enumeration on the right.</para>
<para>The query text editor provides a button <guibutton>in { ...
}</guibutton> that, for attributes with a fixed set of possible values,
allows the enumeration to be created by selecting the desired values
from a list.</para>
</section>
<section>
<title>Types of relations (links)</title>
<section>
<title>Structural relations</title>
<para>The nodes in the query can be linked by several types of
relations. The built-in relations comprise structural relations (child,
parent, ancestor, descendant, sibling, same-tree-as, and
same-document-as) and ordering relations (depth-first-precedes,
depth-first-follows, order-precedes, and order-follows). The name of a
built-in relation can optionally be followed by a pair of colons
<literal>::</literal> in order to distinguish it from PML reference
relations described below.</para>
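<para>As a small sketch of a built-in relation other than
<literal>child</literal>, and assuming PDT 2.0 data with the functor
values <literal>PRED</literal> and <literal>DPHR</literal>, the following
query looks for a <literal>PRED</literal> node that has a
<literal>DPHR</literal> node anywhere in its subtree:</para>
<programlisting># the nested node may be a child, grandchild, etc. of the PRED node
t-node [
  functor='PRED',
  descendant t-node [ functor='DPHR' ]
]</programlisting>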
</section>
<section>
<title>PML Reference Links</title>
<para>The PML data model allows connecting nodes (and other data
structures) by so-called PML references. In PML-TQ one can use any PML
reference as a relation by using the attribute path of an attribute
containing the reference, optionally followed by
<literal>-&gt;</literal> (in order to prevent a collision with a
similarly named built-in or implementation-specific relation). For
example, in PDT 2.0, the nodes on the t-layer are connected to nodes
on the a-layer using PML references in the attributes
<literal>a/lex.rf</literal> and <literal>a/aux.rf</literal>. The
following query uses the <literal>a/lex.rf</literal> PML reference as
a relation:</para>
<programlisting>
# t-layer dependency reversed on a-layer
a-node $A := [
  child a-node $B := [ ]
];
t-node [
  child t-node [
    a/lex.rf $A
  ],
  a/lex.rf $B
];
</programlisting>
<para>PML references are also used in PDT 2.0 to represent textual and
grammatical coreference links (attributes
<literal>coref_text.rf</literal> and
<literal>coref_gram.rf</literal>). For example, the following query
searches for a grammatical coreference where the referring node precedes
the referred node. The query defines selectors for two
tectogrammatical nodes <literal>$referring</literal> and
<literal>$referred</literal> connected by a grammatical-coreference
link <literal>coref_gram.rf</literal>, such that the lexical
counterpart of <literal>$referred</literal> follows that of
<literal>$referring</literal> in the ordering of the a-layer (which
coincides with the ordering of the original sentence).</para>
<programlisting>
t-node $referring := [
  a/lex.rf a-node $referring_lex := [],
  coref_gram.rf t-node $referred := [
    a/lex.rf a-node $referred_lex := [ order-follows $referring_lex ],
  ]
]</programlisting>
<para>In the last example, the two t-nodes were directly connected by
a grammatical-coreference link. If we want to look for nodes connected
by a chain of grammatical-coreference links, we can do it by using a
transitive closure of the relation <literal>coref_gram.rf</literal>,
which can be expressed in PML-TQ as
<literal>coref_gram.rf{1,}</literal>. The lower bound
<literal>1</literal> means we are looking for chains of length at
least 1 and the absence of the upper bound means that we put no limits
on the length of the chain.</para>
<programlisting>
t-node $referring := [
  a/lex.rf a-node $referring_lex := [],
  coref_gram.rf{1,} t-node $referred := [
    a/lex.rf a-node $referred_lex := [ order-follows $referring_lex ],
  ]
]</programlisting>
<para>In TrEd, a relation can be made transitive by specifying the
minimum and/or maximum bound; this can be done e.g. by double-clicking
on an existing relation arrow.</para>
<para>Note that in the case of a cyclic chain of PML references, the
chain's maximum length is the number of distinct nodes in the chain
plus one (i.e. the chain is allowed to start and end on the same node,
but it is not allowed to continue another round along the cycle). For
example, the following query searches for a cycle in the annotation of
textual coreference in the PDT 2.0 tectogrammatical annotation
(indeed, there is one cycle of length 2 left there by a
mistake):</para>
<programlisting>
t-node $t := [
  coref_text.rf{1,} $t
]</programlisting>
</section>
<section>
<title>Implementation- or corpus-specific relations</title>
<para>Finally, any particular implementation or installation of the
PML-TQ query engine can extend the language by defining and
implementing additional specific relations. The relations behave
syntactically as the built-in relations and must use different names
than the built-in relations (their name can be followed by a pair of
colons <literal>::</literal> in order to distinguish them from a PML
reference relation).</para>
<para>The current implementation defines two relations specific to
the PDT 2.0 annotation: <literal>echild</literal> and
<literal>eparent</literal>. These relations can be used on both the
tectogrammatical and the analytical layer and represent the effective
dependency, rather than the technical dependency represented by the
built-in relations <literal>child</literal> and
<literal>parent</literal>. Thus, they abstract from certain
constructions such as coordination and apposition as well as the
dominance of prepositions (<literal>afun="AuxP"</literal>) and
connectives (<literal>afun="AuxC"</literal>) on the analytical
layer.</para>
<para>Here are a few examples of queries using these relations:</para>
<programlisting>
# a semantic verb with an ACT(or) and EFF(ect)
t-node [
gram/sempos='v',
echild t-node [ functor='ACT' ],
echild t-node [ functor='EFF' ],
]</programlisting>
<programlisting>
# a t-node with two effective parents (common modifier of coordinated nodes)
t-node [
eparent t-node [ ],
eparent t-node [ ],
]</programlisting>
<programlisting>
# a verb with no actant
t-node $a := [ gram/sempos='v',
! echild t-node [ functor in { 'ACT','PAT','ADDR','ORIG','EFF' } ]
]</programlisting>
<programlisting>
# reversed effective dependency on a-layer and t-layer
# excluding numeric constructions
a-node $A := [
m/tag !~ '^C',
echild a-node $B := [
m/tag !~ '^C'
]
];
t-node [
a/lex.rf $B,
echild t-node [ a/lex.rf $A ]
];</programlisting>
<para>Just like PML reference relations, specific relations can be
used in the transitive form by setting minimum and maximum bounds, for
example:</para>
<programlisting># effective descendant
t-node [ echild{1,} t-node [ ] ]</programlisting>
<programlisting># effective grandchild (a chain of exactly two echild steps)
t-node [ echild{2,2} t-node [ ] ]</programlisting>
</section>
</section>
<section id="tut_member_selector">
<title>Querying labeled references using the <literal>member</literal>
selector</title>
<para>The member selector is useful for querying some types of
complex-valued node attributes, e.g. lists of complex structures. Such
attributes do not occur in PDT 2.0, but do occur for example in
CoNLL-2009 Shared Task data when converted to PML.</para>
<para>Each node in CoNLL-2009 ST data can be annotated as an argument of
some other node, called predicate. The argument node carries an
attribute <literal>apreds</literal> which is a list of all predicates it
belongs to. The list consists of structures with two members: a PML
reference to the predicate node (<literal>apreds/target.rf</literal>)
and an argument label <literal>apreds/label</literal>. The set of
argument labels differs from language to language; for Czech data, the
labels correspond to tectogrammatical functors and the predicates to
effective parents.</para>
<para>So, each structure in the list represents one labeled semantic
relation. To be able to combine constraints on the target node
(predicate) with the label of the relation that points to it, we must
use a feature of PML-TQ called <literal>member</literal>
selectors.</para>
<para>The following example finds a PAT argument and its corresponding
predicate:</para>
<programlisting>node $arg := [
member apreds [
label = 'PAT',
target.rf node $pred := [ ]
]
]</programlisting>
<para>The intermediate <literal>member</literal> selector matches one
element of the <literal>apreds</literal> list at a time and tests its
label. If the label matches, the nested node selector for the
<literal>target.rf</literal> PML-reference relation takes action.</para>
<para>More details and further examples are given in <xref
linkend="member_selectors" />.</para>
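<para>As a further sketch, several <literal>member</literal> selectors
can presumably be combined within one node selector. The following
untested query, built only from the constructs shown above, is meant to
find a node that is a PAT argument of some predicate and at the same
time an ACT argument of some predicate. The label
<literal>ACT</literal> and the selector names
<literal>$pat_pred</literal> and <literal>$act_pred</literal> are chosen
for illustration and assume Czech CoNLL-2009 ST data:</para>
<programlisting>node $arg := [
  member apreds [
    label = 'PAT',
    target.rf node $pat_pred := [ ]
  ],
  member apreds [
    label = 'ACT',
    target.rf node $act_pred := [ ]
  ]
]</programlisting>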
</section>
<section>
<title>Subqueries (testing existence, non-existence and number of
occurrences)</title>
<para>Sometimes it is useful to test existence, non-existence or number
of occurrences of a node related to our query. For example, to find all
predicates without a subject in PDT 2.0, we could use the following
query:</para>
<programlisting>a-node [ afun='Pred', 0x echild a-node [ afun='Sb' ] ]</programlisting>
<para>The query finds an a-node with <literal>afun='Pred'</literal> that
has no effective children with <literal>afun='Sb'</literal>. This is
expressed using a selector preceded by a restriction on number of
occurrences (0x - zero times), which is called a subquery.</para>
<para>To create a subquery graphically in TrEd, simply create the
selector <literal>echild a-node [ afun='Sb' ]</literal> as usual and
then, with the corresponding query subtree selected, press
<keysym>x</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/subquery.png"
format="PNG" valign="middle" />
</imageobject>
<textobject>
<phrase>Occurrences</phrase>
</textobject>
</inlinemediaobject> </guibutton>.</para>
<para>Of course, we could constrain the number of occurrences to a
non-zero value, too. For example, to find all predicates that govern one
subject or one object, but not both, we could use the following
query:</para>
<programlisting>a-node [ afun='Pred', 1x echild a-node [ afun in {'Sb','Obj'} ] ]</programlisting>
<para>The nodes matched by subqueries are not part of the result match
(in our example, the match would consist of the predicate nodes only,
the subjects or objects would not be included).</para>
<para>The number of occurrences of a subquery can be constrained not
only to a single number (0 or 1 in our examples) but to any finite union
of bounded or partially unbounded intervals of non-negative integers; e.g.
<literal>0|2..4|6+x</literal> restricts the number of occurrences to
zero, two to four, or six or more, eliminating one and five. While the
plus sign stands for <emphasis>or more</emphasis>, the minus sign means
<emphasis>or fewer</emphasis>, as in <literal>4-x</literal> (occurring
four or fewer times).</para>
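<para>To illustrate such interval specifications, the following sketches
reuse the predicate query from above. They are untested examples
assembled from the notation just described, not queries taken from the
treebank documentation:</para>
<programlisting># predicates governing two to four effective objects
a-node [ afun='Pred', 2..4x echild a-node [ afun='Obj' ] ]</programlisting>
<programlisting># predicates with at most one effective subject
a-node [ afun='Pred', 1-x echild a-node [ afun='Sb' ] ]</programlisting>
<programlisting># predicates governing no effective object, or three or more of them
a-node [ afun='Pred', 0|3+x echild a-node [ afun='Obj' ] ]</programlisting>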
<para>Subqueries are also created using boolean operators, such as
negation:</para>
<programlisting>a-node [ afun='Pred', ! echild a-node [ afun='Sb' ] ]</programlisting>
<para>In this example, the selector <literal>! echild a-node [ afun='Sb'
]</literal> is automatically turned into a (still negated) subquery with
one or more occurrences; the query becomes: <programlisting>a-node [ afun='Pred', ! 1+x echild a-node [ afun='Sb' ] ]</programlisting></para>
<para>To create this query graphically, create the selector
<literal>echild a-node [ afun='Sb' ]</literal> as usual, then select the
query node corresponding to <literal>a-node [ afun='Pred' ]</literal>,
create a negation node (press <keysym>!</keysym> or <guibutton>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/not.png"
format="PNG" valign="middle" />
</imageobject>
<textobject>
<phrase>NOT</phrase>
</textobject>
</inlinemediaobject> </guibutton>) and drag the query subtree
corresponding to the subject onto the negation node.</para>
<para>A common use of subqueries is also constraining nodes on a
descending path from one node to another. Let us for example formulate a
query searching for a descending chain of tectogrammatical nodes with
the functor <literal>RSTR</literal> (restrictive or descriptive
adnominal modification). We want the chain to satisfy the following
conditions:</para>
<orderedlist>
<listitem id="subq-i1">
<para>The chain is connected to a node <literal>$N</literal> which
is a semantic noun (<literal>gram/sempos ~ "^n"</literal>) and has
a functor other than <literal>RSTR</literal>.</para>
</listitem>
<listitem id="subq-i3">
<para>The chain is at least 3 nodes long.</para>
</listitem>
<listitem id="subq-i4">
<para>The descending chain ends with a node <literal>$R</literal>
with the functor <literal>RSTR</literal>.</para>
</listitem>
<listitem id="subq-i5">
<para>The chain cannot descend beyond <literal>$R</literal>, i.e.
<literal>$R</literal> has no child node with the functor
<literal>RSTR</literal>.</para>
</listitem>
<listitem id="subq-i2">
<para>All nodes that belong to the chain have the functor
<literal>RSTR</literal>.</para>
</listitem>
</orderedlist>
<para>The corresponding query looks like this:</para>
<programlisting>
t-node $N:= [
# condition <xref linkend="subq-i1" />.
gram/sempos ~ "^n",
functor != "RSTR",
# conditions <xref linkend="subq-i3" />. and <xref linkend="subq-i4" />.
descendant{3,} t-node $R := [
functor = "RSTR",
# condition <xref linkend="subq-i5" />.
0x t-node [ functor = "RSTR" ]
],
# condition <xref linkend="subq-i2" />.
0x descendant t-node [
!functor = "RSTR",
descendant $R
],
];</programlisting>
<para>Note how the condition <xref linkend="subq-i2" />. is expressed:
we say that there is no descendant of $N dominating $R whose
<literal>functor</literal> would not equal <literal>RSTR</literal>.
Thus, we have rewritten the original condition of the form
<inlineequation id="eq-subq1">
<?dbtex delims='no'?>
<alt role="tex">\[\forall x C(x,N,R) \]</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="subq_eq1.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="subq_eq1.pdf" format="PDF" />
</imageobject>
<textobject>
<phrase role="math">∀x C(x,N,R)</phrase>
</textobject>
</inlinemediaobject>
</inlineequation> as <inlineequation id="eq-subq2">
<?dbtex delims='no'?>
<alt role="tex">\[\neg\exists x \neg C(x,N,R) \]</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="subq_eq2.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="subq_eq2.pdf" format="PDF" />
</imageobject>
<textobject>
<phrase role="math">¬ ∃x ¬C(x,N,R)</phrase>
</textobject>
</inlinemediaobject>
</inlineequation>.</para>
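<para>The same rewriting applies to simpler universal conditions. As a
minimal, untested sketch built from the constructs used above, the
following query is meant to find t-nodes all of whose children have the
functor <literal>RSTR</literal>. The additional <literal>1+x</literal>
subquery excludes leaf nodes, for which the universal condition would
hold trivially:</para>
<programlisting>t-node [
  # at least one child exists
  1+x child t-node [ ],
  # there is no child whose functor differs from RSTR
  0x child t-node [ functor != "RSTR" ]
]</programlisting>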
<!-- <programlisting>t-node [ functor='DPHR', 2+x child t-node [ ] ]</programlisting> -->
<!-- <programlisting>t-node [ functor='DPHR', 2+x descendant t-node [ ] ]</programlisting> -->
<!-- <programlisting>t-node [ 2+x child t-node [ functor='DPHR' ] ]</programlisting> -->
<!-- <programlisting>t-node [ 2+x echild t-node [ functor='DPHR' ] ]</programlisting> -->
</section>
<section>
<title>Looking for small result trees?</title>
<para>Sometimes you want to find a good small example tree demonstrating
some linguistic phenomenon. You want it to fit on a presentation slide
or an article page. You can do so by putting a limit on the tree
size.</para>
<para>Using a subquery this can be done as follows:</para>
<programlisting>t-node [
10-x same-tree-as t-node [],
functor='DPHR', # the rest of your query
]</programlisting>
<para>This selects t-nodes with <literal>functor='DPHR'</literal> in
trees with at most 10 other t-nodes. Using functions, this can be
written as</para>
<programlisting>t-root [
descendants() &lt;= 10,
descendant t-node [ functor='DPHR' ]
]</programlisting>
<para>but note that in this case the <literal>t-root</literal> appears
as a node in the result set. To avoid it, we can write</para>
<programlisting>t-node [
functor='DPHR',
1x ancestor t-root [ descendants() &lt;= 10 ]
]</programlisting>
<para>For treebanks that do not have a special node type for the root
node, we can write e.g.:</para>
<programlisting>node [
functor='DPHR',
1x ancestor node [
depth() = 0, # the root
descendants() &lt;= 10
]
]</programlisting>
</section>
<section>
<title>Functions</title>
<para>PML-TQ provides a set of built-in functions that can be used in
expressions constraining nodes and also in output filters. The functions
can be split into the following categories:</para>
<itemizedlist>
<listitem>
<para>functions returning information about the tree
structure</para>
</listitem>
<listitem>
<para>functions returning information about documents</para>
</listitem>
<listitem>
<para>string functions</para>
</listitem>
<listitem>
<para>numerical functions</para>
</listitem>
<listitem>
<para>group functions (applicable only in output filters)</para>
</listitem>
</itemizedlist>
<para>For description of all individual functions, refer to <xref
linkend="functions" />. Here, we only give a few examples demonstrating
the use of some of the functions from the first category on a few common
query constructions, usually also expressible by means of subqueries.
Whether it is more efficient to use functions than subqueries may depend
on implementation.</para>
<programlisting># a leaf node (using functions)
t-node [ sons()=0 ]</programlisting>
<para>The above query can be created graphically in TrEd by creating a
<literal>t-node</literal> selector (press <keysym>Insert</keysym> or
<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="pmltq_icons/add_node.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Add node</phrase>
</textobject>
</inlinemediaobject> </guibutton> on the toolbar and select a node
type (<guilabel>t-node</guilabel> in our example) from the list), then
create an equality constraint by pressing <keysym>=</keysym> or
<guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt"
fileref="pmltq_icons/test_equality.png" format="PNG" />
</imageobject>
<textobject>
<phrase>Equality</phrase>
</textobject>
</inlinemediaobject> </guibutton> and fill-in
<literal>sons()</literal> as <guilabel>a</guilabel> and
<literal>0</literal> as <guilabel>b</guilabel>.</para>
<para>Alternatively, type the query in the text form editor (opened by
pressing <keysym>e</keysym> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/edit_file.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Edit query</phrase>
</textobject>
</inlinemediaobject> </guibutton>). The function names with argument
templates can be inserted using the <guilabel>Functions</guilabel>
button in the editor.</para>
<para>Other queries involving functions can be created similarly.</para>
<programlisting># a leaf node (using a subquery)
t-node [ 0x child t-node [ ] ]</programlisting>
<programlisting># right-most child
t-node [ rbrothers()=0 ]</programlisting>
<programlisting># left-most child
t-node [ lbrothers()=0 ]</programlisting>
<programlisting># first leaf node in a subtree of $t (using functions)
t-node $t := [
descendant t-node [
sons()=0,
depth_first_order()-depth()=depth_first_order($t)-depth($t)
]
]</programlisting>
<programlisting># first leaf node in a subtree of $t (using a subquery)
t-node $t := [
descendant t-node $d := [ sons()=0 ],
0x descendant [ sons()=0, depth-first-precedes $d ],
]</programlisting>
<programlisting># last leaf node in a subtree of $t
t-node $t := [
descendant t-node [
sons()=0,
depth_first_order()-depth()=depth_first_order($t)+descendants($t)-1-depth($t)
]
]</programlisting>
<programlisting># last leaf node in a subtree of $t (using a subquery)
t-node $t := [
descendant t-node $d := [ sons()=0 ],
0x descendant [ depth-first-follows $d ],
]</programlisting>
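<para>Conditions on a number of children other than zero can be written
in both styles as well. The following two sketches, assembled from the
constructs above and not verified against a particular engine, should
both match t-nodes with exactly two children:</para>
<programlisting># a node with exactly two children (using functions)
t-node [ sons()=2 ]</programlisting>
<programlisting># a node with exactly two children (using a subquery)
t-node [ 2x child t-node [ ] ]</programlisting>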
</section>
<section>
<title>Output filters</title>
<para>Output filters are used for extracting data from the nodes matched
by the query and generating tabular output. Filters must follow the
selective part of the query and start with <literal>&gt;&gt;</literal>.
Filters can be chained: the first filter extracts data from the matching
nodes and all subsequent filters operate on the output from the
immediately preceding filter. Details can be found in the <link
linkend="outputFilter">PML-TQ Syntax Reference</link> and <xref
linkend="agg_functions" />.</para>
<para>The TrEd GUI does not provide any graphical builder for
output filters, nor does it represent the filters graphically. To enter a
filter, open the entire query in the query editor (press <keycombo
action="simul">
<keysym>Ctrl</keysym>
<keysym>E</keysym>
</keycombo> or <guibutton> <inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="10pt" fileref="icons/edit_file.png"
format="PNG" />
</imageobject>
<textobject>
<phrase>Edit query</phrase>
</textobject>
</inlinemediaobject> </guibutton> on the toolbar), place the cursor at
the end of the query and enter the filter code. Various buttons in the
editor can be used to insert special symbols, and templates for
functions and common constructions.</para>
<para>One of the simplest filters uses the group function
<literal>count()</literal> to compute the total number of matches of the
query in the treebank:</para>
<programlisting># counting occurrences
t-node [ functor='DPHR' ]
&gt;&gt; count()</programlisting>
<para>The group functions <literal>min()</literal>,
<literal>max()</literal>, and <literal>avg()</literal> can be used to
compute minimum, maximum, and average values of data extracted from the
matching nodes. For example, to compute the maximum number of child nodes
of a t-node with the functor <literal>DPHR</literal>, we can use the
following:</para>
<programlisting>
t-node $n := [ functor='DPHR' ]
&gt;&gt; max(sons($n))</programlisting>
<para>The following query computes maximum, minimum and average size of
a tectogrammatical tree:</para>
<programlisting>
t-root $n := [ ]
&gt;&gt; descendants($n)
&gt;&gt; max(), min(), avg()</programlisting>
<para>The above query uses two filters: the first extracts the number of
descendants from each node matched by the selector
<literal>$n</literal>, the second computes maximum, minimum and average
value from the values returned by the first filter.</para>
<para>The following query shows a common grouping construction using the
'for' clause. It extracts the attribute <literal>functor</literal> from
the matched nodes and for each distinct value counts the number of times it
occurs:</para>
<programlisting>
t-node $n := [ ]
&gt;&gt; for $n.functor
give $1, count()</programlisting>
<para>Note that <literal>$1</literal> in the <literal>give</literal>
clause refers to the first (and only) key used in the
<literal>for</literal> clause, i.e. to
<literal>$n.functor</literal>.</para>
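<para>A <literal>for</literal> clause may also list several keys. As an
untested sketch, assuming that an attribute path such as
<literal>gram/sempos</literal> can serve as a key in the same way as
<literal>functor</literal>, the following query would count the
occurrences of every combination of functor and semantic part of speech;
<literal>$1</literal> and <literal>$2</literal> refer to the first and
the second key, respectively:</para>
<programlisting>t-node $n := [ ]
&gt;&gt; for $n.functor, $n.gram/sempos
   give $1, $2, count()</programlisting>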
<para>By appending a <literal>sort by</literal> clause to a filter, we
may reorder the rows it produces by some of its columns. In the
following query, the output of the filter is sorted using the second
output column (the <literal>count()</literal>) in descending order as
the primary key and the first output column (the <literal>$1</literal>
in the <literal>give</literal> clause) in the default (ascending)
order:</para>
<programlisting>t-node $n := [ ]
&gt;&gt; for $n.functor
give $1, count()
sort by $2 desc, $1</programlisting>
<para>The <literal>for</literal> clause can be used to create groups not
only by attribute values, but also by some of the matching nodes. For
example, in order to find out how many grammatical-coreference links can
start in one tectogrammatical node, we may use the following
query:</para>
<programlisting>
t-node $referring := [
coref_gram.rf t-node $referred := [ ]
];
&gt;&gt; for $referring give count()
&gt;&gt; max()</programlisting>
<para>The selective part of the query matches every pair of
tectogrammatical nodes that are linked by a grammatical-coreference
link. The first filter groups the resulting pairs of nodes by the first
of the nodes (<literal>$referring</literal>) and outputs the number of
pairs in each group; this is the number of grammatical-coreference links
starting in the node <literal>$referring</literal>. The second filter
simply computes the maximum of the values returned by the first
filter.</para>
<para>The <literal>for</literal> clause partitions all input rows into
groups before any further processing and the subsequent
<literal>give</literal> clause then produces one output row for each
group, letting all group functions, such as <literal>count()</literal>,
<literal>min()</literal>, <literal>max()</literal>, etc. operate on the
particular group.</para>
<para>PML-TQ further supports a syntax that allows different partitions
to be defined for different group functions and also lets the
<literal>give</literal> clause operate on all input rows. This is done
by following the function arguments by an <literal>over</literal>
clause. Here we show an example where we use one of the ranking group
functions (<literal>row_number()</literal>) to select just a few top
ranking rows from each group. Please refer to <xref
linkend="grouping_explained" /> for more examples.</para>
<para>In the following query we extract the syntactic label
(<literal>afun</literal>) and the part of speech (the first position of
the morphological tag) from every node on the analytical
(morphosyntactical) layer of PDT 2.0. Then we apply further filters to
the output in order to obtain the three most frequent parts of speech for
each <literal>afun</literal>. If several parts of speech occur the same
number of times for a given afun, we sample those three that come first
alphabetically.</para>
<programlisting>
a-node $a:= [ ]
&gt;&gt; $a.afun, substr($a.m/tag,0,1) # get afun and part of speech (POS)
&gt;&gt; for $1,$2 give $1, $2, count() # count occurrences of POS for each afun
&gt;&gt; $1, $2, row_number(over $1 sort by $3 desc, $2) # get the rank of each POS over the afun
sort by $1, $3
&gt;&gt; filter $3 &lt;= 3
&gt;&gt; $1, $2, $3
</programlisting>
</section>
</section>
<section id="user-interfaces">
<title>User Interfaces</title>
<section id="tred-interface">
<title>Graphical interface in TrEd</title>
<para>A complete graphical user interface for PML-TQ is available as an
extension to the tree editor TrEd and provides the following
features:</para>
<itemizedlist>
<listitem>
<para>interactive graphical query builder</para>
</listitem>
<listitem>
<para>intelligent text query editor</para>
</listitem>
<listitem>
<para>client interface for remote PML-TQ search server</para>
</listitem>
<listitem>
<para>built-in sequential search engine for local files</para>
</listitem>
<listitem>
<para>visualization of resulting trees and documents</para>
</listitem>
<listitem>
<para>multiple views for queries spanning over several layers or
trees</para>
</listitem>
<listitem>
<para>query history (queries stored in a local file)</para>
</listitem>
</itemizedlist>
<para>To use this interface, start by downloading and installing the
tree editor TrEd from <ulink
url="http://ufal.mff.cuni.cz/~pajas/tred/"></ulink>. The PML-TQ
interface is available as an extension called <application>PML Tree
Query Interface for TrEd (pmltq)</application> which can be either
selected during installation of TrEd (on Windows) or from TrEd as
follows:</para>
<orderedlist>
<listitem>
<para>Start TrEd</para>
</listitem>
<listitem>
<para>Select <menuchoice>
<guimenu>Setup</guimenu>
<guimenuitem>Manage Extensions ...</guimenuitem>
</menuchoice> from the main menu</para>
</listitem>
<listitem>
<para>In the extension manager dialog, press the <guibutton>Get New
Extensions</guibutton> button (connection to the Internet is
required at this point).</para>
</listitem>
<listitem>
<para>In the list of available extensions locate an extension titled
<application>PML Tree Query Interface for TrEd (pmltq)</application>
and check the <guibutton>Install</guibutton> checkbutton beside
it.</para>
</listitem>
<listitem>
<para>If you intend to query some treebanks for which a specific
TrEd extension is provided, such as <application>Prague Dependency
Treebank 2.0</application>, <application>Penn
Treebank</application>, <application>Penn Arabic
Treebank</application>, <application>Tiger Corpus</application>,
<application>CoNLL 2009 Shared Task</application> data set, etc.,
check those extensions as well.</para>
</listitem>
<listitem>
<para>Press the <guibutton>Install Selected</guibutton> button, wait
for the installation to complete and close the extension manager
with the <guibutton>Close</guibutton> button.</para>
</listitem>
</orderedlist>
<para>See <xref linkend="tutorial" /> for a quick introduction to the
query interface.</para>
</section>
<section id="web-interface">
<title>Web interface</title>
<para>The PML-TQ search servers provide a client interface in the form
of a web application that can be accessed by any web browser capable of
combining JavaScript, CSS, and SVG (Scalable Vector Graphics). At the
time of writing this tutorial, the best choice is the Opera browser,
followed by Google Chrome, Firefox, and Safari (please avoid Microsoft
Internet Explorer because it lacks native SVG support).</para>
<para>Unlike the TrEd interface, this interface does not require any
installation, but lacks some features such as graphical query builder
and graphical representation of the query (the queries must be entered
in the text form), and of course does not support querying local files.
The history of past queries is available for queries run on the
particular PML-TQ search service.</para>
<para>For a quick introduction to the query language, see <xref
linkend="tutorial" /> (only the text version of the queries is
applicable).</para>
<para>To start using the web interface, simply open the PML-TQ server
URL in a web browser, enter your query and press <guibutton>Submit New
Query</guibutton>.</para>
<figure id="fig-opera">
<title>The Opera browser displaying a query and a result tree</title>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="images/screenshot_web_s.png"
format="PNG" />
</imageobject>
</mediaobject>
</figure>
<para>Results of queries with output filters are displayed as an HTML
table; queries without filters are rendered using SVG with the matching
nodes highlighted by colors. A simple toolbar is displayed above the
tree, with buttons for scaling the SVG image and displaying next, Nth,
or previous match. For each matching node a button in a corresponding
color is created above the displayed tree; by pressing the button, the
tree containing the particular node can be displayed, which is useful
for queries whose match can span across several trees or several
annotation layers (see <xref linkend="fig-opera" />). Below the toolbar,
the file name, tree number, total number of trees in the file and the
sentence (or other kind of textual representation) of the tree are
displayed. The <literal>&lt;</literal> and <literal>&gt;</literal> links
preceding the file name can be used to display neighboring trees from
the same document.</para>
</section>
<section id="command-line-interface">
<title>Command-line interface</title>
<para>PML-TQ comes with a command-line client utility called
<literal>pmltq</literal>. The tool can be used to perform queries
remotely (by connecting to a remote PML-TQ search engine) or locally
(using one of the tools that come with the TrEd toolkit:
<literal>btred</literal>, <literal>ntred</literal>, and
<literal>jtred</literal>).</para>
<para>Currently this client can produce plain-text output only,
so it is best used in connection with queries that provide output
filters.</para>
<para>TODO: general usage yet to be documented. Run <programlisting>pmltq --help</programlisting>
to display information about the program command-line options and usage.
<!--
describe 'pmltq' command-line utility,
describe usage from btred command line
--></para>
</section>
<section id="btred-interface">
<title>BTrEd interface</title>
<para></para>
</section>
</section>
<section id="query-language">
<title>Query Language</title>
<para>Basic examples, syntax reference.</para>
<section>
<title>Common syntax constructions</title>
<para>Selectors</para>
<informaltable>
<tgroup cols="2">
<tbody>
<row>
<entry><literal><replaceable>TYPE</replaceable> [ ...
]</literal></entry>
<entry><para>Selector for nodes of a given type (list of types
depends on the treebank)</para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ ...
]</literal></entry>
<entry><para>Named selector (can be referred to as
<literal>$a</literal>)</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>Node relationships</para>
<informaltable>
<tgroup cols="2">
<tbody>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ child
<replaceable>TYPE</replaceable> $b:= [ ] ]</literal></entry>
<entry><para><literal>$b</literal> is a child of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ child
<replaceable>TYPE</replaceable> $b:= [ lbrothers()=0 ]
]</literal></entry>
<entry><para><literal>$b</literal> is the first child of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ child
<replaceable>TYPE</replaceable> $b:= [
lbrothers()=<replaceable>N</replaceable>-1 ] ]</literal></entry>
<entry><para><literal>$b</literal> is the
<replaceable>N</replaceable>-th child of <literal>$a</literal>
</para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ child
<replaceable>TYPE</replaceable> $b:= [ rbrothers()=0 ]
]</literal></entry>
<entry><para><literal>$b</literal> is the last child of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ child
<replaceable>TYPE</replaceable> $b:= [
rbrothers()=<replaceable>N</replaceable>-1 ] ]</literal></entry>
<entry><para><literal>$b</literal> is the
<replaceable>N</replaceable>-th to last child of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
sons()=1, child <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is the only child of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ parent
<replaceable>TYPE</replaceable> $b:= [ ] ]</literal></entry>
<entry><para><literal>$b</literal> is the parent of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ ancestor
<replaceable>TYPE</replaceable> $b:= [ ] ]</literal></entry>
<entry><para><literal>$b</literal> dominates
<literal>$a</literal> (<literal>$b</literal> is an ancestor of
<literal>$a</literal>) </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
ancestor{1,2} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is the parent or grand parent
of <literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
descendant <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is dominated by
<literal>$a</literal> (<literal>$b</literal> is a descendant of
<literal>$a</literal>) </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
descendant{1,2} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is a child or grandchild of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
descendant <replaceable>TYPE</replaceable> $b:= [ ], 0x
descendant [ order-precedes $b ] ]</literal></entry>
<entry><para><literal>$b</literal> is a left-most descendant of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
descendant <replaceable>TYPE</replaceable> $b:= [ ], 0x
descendant [ order-follows $b ] ]</literal></entry>
<entry><para><literal>$b</literal> is a right-most descendant of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [ sibling
<replaceable>TYPE</replaceable> $b:= [ ] ]</literal></entry>
<entry><para><literal>$b</literal> is a sibling of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
sibling{,-1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is a preceding sibling of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
sibling{-1,-1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is the immediately preceding
sibling of <literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
sibling{1,} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is a following sibling of
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
sibling{1,1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> is the immediately following
sibling of <literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
same-tree-as <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
order-follows <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> and follows <literal>$a</literal>
</para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
order-follows{1,1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> and immediately follows
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
order-precedes <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> and precedes <literal>$a</literal>
</para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
order-precedes{1,1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> and immediately precedes
<literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
order-follows{-1,1} <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same tree as
<literal>$a</literal> and either immediately precedes or
immediately follows <literal>$a</literal> </para></entry>
</row>
<row>
<entry><literal><replaceable>TYPE</replaceable> $a := [
same-document-as <replaceable>TYPE</replaceable> $b:= [ ]
]</literal></entry>
<entry><para><literal>$b</literal> belongs to the same document
as <literal>$a</literal> </para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>Logical expressions</para>
<para>Value comparison</para>
<para>Arithmetical and string operators</para>
<para>Subquery and number of occurrences</para>
<para>Functions</para>
<para>Output filters</para>
</section>
<section>
<title>PML-TQ Syntax Reference</title>
<variablelist>
<varlistentry>
<term>query</term>
<listitem>
<mediaobject id="rail_query">
<imageobject role="html">
<imagedata fileref="rail_query.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_query.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>query : (nodeSelector + ';') ( ';' (outputFilter*) )?
;</phrase>
</textobject>
</mediaobject>
<para>A PML-TQ query consists of one or more node selectors
separated by semicolons, optionally followed by a list of output
filters.</para>
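<para>As a minimal sketch of this structure (the attribute names
<literal>afun</literal> and <literal>m/tag</literal> are borrowed from the
PDT 2.0 examples used elsewhere in this document and are otherwise
assumptions about the treebank), the following query consists of one node
selector, terminated by a semicolon, and one output filter:
<programlisting>a-node $x := [ afun = 'Sb' ];
&gt;&gt; $x.m/tag</programlisting>
The selector matches analytical nodes whose <literal>afun</literal> equals
<literal>Sb</literal>; the filter then outputs the value of
<literal>m/tag</literal> for each match.</para>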
</listitem>
</varlistentry>
<varlistentry>
<term>nodeSelector</term>
<listitem>
<mediaobject id="rail_nodeSelector">
<imageobject role="html">
<imagedata fileref="rail_nodeSelector.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_nodeSelector.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>nodeSelector : TYPE (variable ':=' )? \\ '['
constraints? ']' ;</phrase>
</textobject>
</mediaobject>
<para>Defines a node selector that selects all nodes of type
<replaceable>TYPE</replaceable> that satisfy the given constraints.
The selector can optionally be associated with a variable for
reference from other nodes.</para>
<para>The intended way to translate the selector syntax to English
is to read <quote>node of type <replaceable>TYPE</replaceable>
that has <replaceable>constraint-1</replaceable>, has
<replaceable>constraint-2</replaceable>, ... and has
<replaceable>constraint-N</replaceable></quote>.</para>
<para>A selector can be used as a constraint of some other
selector (i.e. it can be nested). If not preceded by a name of a
relation, it selects among child nodes of the node matched by the
containing selector; if preceded by a name of a relation, it
selects among nodes that are in the particular relation to the
node matched by the containing selector.</para>
<para>For example, the query <literal>a-node $x := [ descendant
a-node [ afun=$x.afun ] ]</literal> reads in English as <quote>
Find a node <literal>$x</literal> of type a-node that has a
descendant node of type a-node that has <literal>afun</literal>
equal to <literal>afun</literal> of <literal>$x</literal>.
</quote> Over PDT 2.0 data it selects all analytical nodes whose
subtree contains an analytical node with the same value of the
attribute <literal>afun</literal> (the query returns pairs of
nodes with the described relationship).</para>
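<para>The difference between a nested selector with and without a relation
name can be sketched as follows (an illustrative pair of queries, assuming
PDT 2.0 analytical trees). The query
<programlisting>a-node $p := [ a-node [ afun = 'Sb' ] ]</programlisting>
selects nodes <literal>$p</literal> that have a child with
<literal>afun</literal> equal to <literal>Sb</literal>, whereas
<programlisting>a-node $p := [ descendant a-node [ afun = 'Sb' ] ]</programlisting>
also accepts a matching node located deeper in the subtree of
<literal>$p</literal>.</para>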
</listitem>
</varlistentry>
<varlistentry>
<term>variable</term>
<listitem>
<mediaobject id="rail_variable">
<imageobject role="html">
<imagedata fileref="rail_variable.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_variable.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>variable : DOLLAR NAME</phrase>
</textobject>
</mediaobject>
<para>Variables are used to name node selectors and refer to them
from other parts of the query. A variable starts with a
'<literal>$</literal>' (dollar) character followed by a
NAME, which consists of an alphabetic character or underscore followed by zero
or more alphanumeric characters or underscores. For example,
<literal>$foo_02</literal> and <literal>$x</literal> are valid
variable names, while <literal>$23</literal> is not.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>constraints</term>
<listitem>
<mediaobject id="rail_constraints">
<imageobject role="html">
<imagedata fileref="rail_constraints.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_constraints.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>constraints : constraint + ',';</phrase>
</textobject>
</mediaobject>
<para>One or more constraints separated by commas.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>constraint</term>
<listitem>
<mediaobject id="rail_constraint">
<imageobject role="html">
<imagedata fileref="rail_constraint.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_constraint.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>constraint : ( ('!')? (predicate | relationSelector |
optionalSelector | memberSelector | relation
variable | subquery | '(' constraint ')' ) | '+'? relationSelector | (constraint
('and' | 'or' | ',' ) constraint)) ;
relationSelector : relation? nodeSelector ;
</phrase>
</textobject>
</mediaobject>
<para>The logical operators have the following precedence (in
decreasing order): '<literal>!</literal>',
'<literal>and</literal>', '<literal>or</literal>',
'<literal>,</literal>'; except for having the lowest precedence,
comma (<literal>,</literal>) has the semantics of
<literal>and</literal>. A constraint is either a binary test
predicate (<literal>=</literal>, <literal>!=</literal>,
<literal>~</literal>, <literal>!~</literal>,
<literal>&lt;</literal>, <literal>&lt;=</literal>,
<literal>&gt;</literal> etc.) on expressions (terms), a node
selector, member selector, subquery, a reference to a named node
selector (indicating that a node matched by referred selector must
be in the corresponding relation to the node matched by the
current selector), or a logical combination of any of these.
However, a node selector used in a complex logical expression is
treated as a subquery with at least one occurrence.</para>
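<para>The precedence rules can be illustrated by the following pair of
selectors (a sketch only; the attribute values are illustrative and assume
PDT 2.0 data). In
<programlisting>a-node [ afun = 'Sb' or afun = 'Obj', m/tag ~ '^N' ]</programlisting>
the comma binds most loosely, so the constraints read as
<literal>(afun = 'Sb' or afun = 'Obj') and m/tag ~ '^N'</literal>. In
<programlisting>a-node [ afun = 'Sb' or afun = 'Obj' and m/tag ~ '^N' ]</programlisting>
the operator <literal>and</literal> binds more tightly than
<literal>or</literal>, so the constraints read as
<literal>afun = 'Sb' or (afun = 'Obj' and m/tag ~ '^N')</literal>.</para>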
<para>The intended way of reading a constraint aloud in English in
the context of a node selector is to precede it with the word
<quote>has</quote>.</para>
<para>
By default, the selectors of a query match mutually distinct nodes. If, however, the
<firstterm>relationSelector</firstterm> is prefixed with <literal>+</literal>,
the nodes it matches need not be distinct from the nodes matched
by the other selectors of the query. This is useful for printing whole sentences, as in the following
example:
<programlisting>
a-node
[ afun = 'AuxV',
ancestor a-root $r :=
[ + descendant a-node $a := [ ] ] ];
>> for $r.id,$a.m/form,$a.ord give $1,$2,$3
>> give distinct concat($2, ' ' over $1 sort by $3)</programlisting>
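This example prints every sentence that contains an auxiliary verb
(<literal>afun='AuxV'</literal>). Without the <literal>+</literal> prefix, the node
matched by the first selector could not also be matched among the descendants of the
root node, so it would be missing from the printed sentence.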
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>optionalSelector</term>
<listitem>
<mediaobject id="rail_optionalSelector">
<imageobject role="html">
<imagedata fileref="rail_optionalSelector.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_optionalSelector.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>optionalSelector : '?' relation?
nodeSelector;</phrase>
</textobject>
</mediaobject>
<para>If a nested node selector is preceded by a question mark, it
is <firstterm>optional</firstterm>. This means that if no node
matches this selector, the selector is assumed to match the same
node as its containing selector, and all selectors or subqueries
directly nested in the optional selector are then evaluated as if
they were nested in the containing selector. For example,
<literal>a-node $a := [ afun='Sb', ? a-node $b:= [ afun='AuxC', $c
:= [ afun='Obj'] ] ]</literal>, with <literal>$b</literal> optional, matches either a
descending chain of three a-nodes 'Sb-&gt;AuxC-&gt;Obj' (the
optional selector <literal>$b</literal> matching the middle node) or just the pair
'Sb-&gt;Obj', in which case both <literal>$a</literal> and <literal>$b</literal> are
identified with the 'Sb' node (the constraint <literal>afun='AuxC'</literal> on
<literal>$b</literal> is disregarded). It does <emphasis>not</emphasis>, for instance,
match a descending chain 'Sb-&gt;ExD-&gt;Obj'.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>memberSelector</term>
<listitem>
<mediaobject id="rail_memberSelector">
<imageobject role="html">
<imagedata fileref="rail_memberSelector.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_memberSelector.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>memberSelector : 'member' attributePath (variable ':='
)? \\ '[' constraints? ']'</phrase>
</textobject>
</mediaobject>
<para>A member selector can be used to match complex values in node
attributes of alternative, list, or sequence type. The selected
value is then treated almost as a node. Although this is not
indicated explicitly in the syntax, member selectors cannot nest
node selectors via tree-structure or ordering relations.</para>
<para>Member selectors are described in more detail in <xref
linkend="member_selectors" />.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>subquery</term>
<listitem>
<mediaobject id="rail_subquery">
<imageobject role="html">
<imagedata fileref="rail_subquery.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_subquery.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>subquery : occurrences 'x' ( memberSelector |
relation? nodeSelector ); occurrences : ( NUMBER | NUMBER '+'
| NUMBER '-' | NUMBER '..' NUMBER ) + '|';</phrase>
</textobject>
</mediaobject>
<para>A subquery is a nested selector with a constraint on the
number of occurrences of the matching nodes (with respect to a
fixed match of the containing selector). For example,
<literal>3x</literal> specifies that the subquery must match
<emphasis>exactly</emphasis> three times, <literal>3+x</literal>
specifies that it must match <emphasis>at least</emphasis> three
times, <literal>3-x</literal> specifies that it must match <emphasis>at
most</emphasis> three times, and <literal>3..10x</literal>
specifies that it must match <emphasis>at least</emphasis> three
times and at most ten times.</para>
<note>
<para>Node selectors that belong to a subquery cannot be
referred to by name from outside the subquery (outer
scope).</para>
</note>
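<para>The occurrence specifications can be sketched with the following
selectors (illustrative only, assuming PDT 2.0 analytical trees and
<literal>afun</literal> values). The selector
<programlisting>a-node [ 0x a-node [ ] ]</programlisting>
matches nodes that have no child of type <literal>a-node</literal>, while
<programlisting>a-node [ afun = 'Pred', 2+x a-node [ afun = 'Obj' ] ]</programlisting>
matches nodes with <literal>afun</literal> equal to <literal>Pred</literal>
that have at least two children with <literal>afun</literal> equal to
<literal>Obj</literal>. Following the grammar above, a range such as
<literal>1..2x</literal> would accept one or two such children.</para>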
</listitem>
</varlistentry>
<varlistentry>
<term>relation</term>
<listitem>
<mediaobject id="rail_relation">
<imageobject role="html">
<imagedata fileref="rail_relation.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_relation.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>relation : treeRelation | orderingRelation |
pmlRefRelation | implementationSpecificRelation ;</phrase>
</textobject>
</mediaobject>
<para>PML-TQ has the following types of relation:</para>
<variablelist>
<varlistentry>
<term>tree relations</term>
<listitem>
<mediaobject id="rail_treeRelation">
<imageobject role="html">
<imagedata fileref="rail_treeRelation.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_treeRelation.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>treeRelation : ( 'child' | 'parent' |
'same-tree-as' | 'descendant' distanceInterval? |
'ancestor' distanceInterval? | 'sibling'
distanceInterval?) '::'? ; distanceInterval : LEFTBRACE
NUMBER? ',' NUMBER? RIGHTBRACE ;</phrase>
</textobject>
</mediaobject>
<para>These are child, parent, same-tree-as, descendant, ancestor, and
sibling. The last three can be followed by a distance
interval of the form <literal>{min,max}</literal>,
<literal>{,max}</literal>, or <literal>{min,}</literal>,
where <literal>min</literal> and <literal>max</literal> are
integers with <literal>min</literal> less than or equal to
<literal>max</literal>. These values must be positive for
the relations <literal>descendant</literal> and
<literal>ancestor</literal>. For the
<literal>sibling</literal> relation, negative distance bound
values range over preceding siblings and positive bound
values range over following siblings.</para>
<para>For example, in the query <programlisting>t-node $a := [ sibling{-1,2} t-node $b := [ ] ]</programlisting>
the node <literal>$b</literal> can be either the immediately
preceding sibling of <literal>$a</literal> or one of the two
siblings immediately following <literal>$a</literal>.</para>
<para>In the query <programlisting>t-node $a := [ descendant{3,} t-node $b := [ ] ]</programlisting>
<literal>$b</literal> matches a descendant of
<literal>$a</literal> such that there are at least two nodes
on the path strictly between <literal>$a</literal> and <literal>$b</literal>.</para>
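<para>As a further sketch (illustrative, reusing the node type
<literal>t-node</literal> from the examples above), an interval whose
bounds are equal restricts a transitive relation to an exact distance:
<programlisting>t-node $a := [ ancestor{2,2} t-node $b := [ ] ]</programlisting>
Here <literal>$b</literal> is the grandparent of
<literal>$a</literal>.</para>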
</listitem>
</varlistentry>
<varlistentry>
<term>ordering relations</term>
<listitem>
<mediaobject id="rail_orderingRelation">
<imageobject role="html">
<imagedata fileref="rail_orderingRelation.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_orderingRelation.pdf"
format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>orderingRelation : ('depth-first-precedes' |
'depth-first-follows' | 'order-precedes' |
'order-follows') distanceInterval? '::'? ;</phrase>
</textobject>
</mediaobject>
<para>The relations <literal>depth-first-precedes</literal> and
<literal>depth-first-follows</literal> constrain the mutual
position of two nodes in the canonical depth-first ordering
of the tree.</para>
<para>The relations <literal>order-precedes</literal> and
<literal>order-follows</literal> are only available for
treebanks with an explicit total ordering on the trees
(typically dependency treebanks).</para>
<para>These relations can be followed by a distance interval
of the forms <literal>{min,max}</literal>, or
<literal>{,max}</literal>, or <literal>{min,}</literal>
where <literal>min</literal> and <literal>max</literal> are
(possibly negative) integers, with <literal>min</literal>
less than or equal to <literal>max</literal>.</para>
<para>For example, in <programlisting>t-node $a := [ depth-first-precedes t-node $b := [ order-precedes $a ] ]</programlisting>
<literal>$a</literal> matches a node that precedes
<literal>$b</literal> in the depth-first order but follows
<literal>$b</literal> in the total ordering of the tree;
thus, for instance, <literal>$b</literal> can be a
descendant of <literal>$a</literal> that precedes
<literal>$a</literal>.</para>
<para>In the following query, <literal>$a</literal> must
immediately precede <literal>$b</literal> in the total
ordering of the tree: <programlisting>t-node $a := [ order-precedes{,1} t-node $b := [ ] ]</programlisting>
Conversely, in <programlisting>t-node $a := [ order-precedes{2,} t-node $b := [ ] ]</programlisting>
<literal>$a</literal> precedes <literal>$b</literal>, but
there must be at least one other node between the nodes
<literal>$a</literal> and <literal>$b</literal>. The query
<programlisting>t-node $a := [ order-precedes{1,2} t-node $b := [ ] ]</programlisting>
allows at most one node between <literal>$a</literal> and
<literal>$b</literal> in the total ordering of the
tree.</para>
<para>A negative distance bound extends the relation in the
opposite direction. For example, in this query:
<programlisting>t-node $a := [ order-precedes{-1,1} t-node $b := [ ] ]</programlisting>
<literal>$a</literal> either immediately precedes
<literal>$b</literal> or immediately follows
<literal>$b</literal>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>relations based on PML-references</term>
<listitem>
<mediaobject id="rail_pmlRefRelation">
<imageobject role="html">
<imagedata fileref="rail_pmlRefRelation.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_pmlRefRelation.pdf"
format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>pmlRefRelation : attributePath '-&gt;'?
;</phrase>
</textobject>
</mediaobject>
<para>These relations represent ID-based references. They
are derived from attributes declared in the specific PML
schema of the treebank as PML references and represent the
relation from a referring node (the one containing the
reference) to the referenced node (the one whose ID the
reference contains).</para>
<para>The name of the relation is an attribute path to a PML
reference (declared in the PML schema as <literal>&lt;cdata
format="PMLREF"/&gt;</literal>).</para>
<para>For example, nodes of the type
<literal>t-node</literal> in PDT 2.0 are structures declared
as follows: <programlisting> &lt;type name="t-node.type"&gt;
&lt;structure role="#NODE" name="t-node"&gt;
...
&lt;member name="a"&gt;
&lt;structure&gt;
&lt;member name="lex.rf"&gt;
&lt;cdata format="PMLREF"/&gt;
&lt;/member&gt;
&lt;member name="aux.rf"&gt;
&lt;list ordered="0"&gt;
&lt;cdata format="PMLREF"/&gt;
&lt;/list&gt;
&lt;/member&gt;
&lt;/structure&gt;
&lt;/member&gt;
...
&lt;/structure&gt;
&lt;/type&gt;</programlisting></para>
<para>So, for a <literal>t-node</literal>, the attribute
paths <literal>a/lex.rf</literal> and
<literal>a/aux.rf</literal> refer to PML references. They
represent a pointer to a lexical counterpart node on the
analytical layer, and zero or more pointers to related
auxiliary nodes on the analytical layer (prepositions,
conjunctions, auxiliary verbs, etc.). We can thus use these
relations as follows: <programlisting>t-node $t:= [ a/lex.rf a-node $a:= [ afun='Sb' ], a/aux.rf a-node $x:= [ afun = 'AuxP' ] ]</programlisting>
The query locates a <literal>t-node</literal>
<literal>$t</literal> whose lexical counterpart on the
analytical layer is the node <literal>$a</literal> with
<literal>afun='Sb'</literal> and that contains at least one
pointer to an auxiliary node <literal>$x</literal> with
<literal>afun='AuxP'</literal> (a preposition).</para>
<para>Since the names of node attributes may collide
with the names of other types of relations, a relation can be
forced to be interpreted as a <literal>pmlRefRelation</literal> by appending
the characters <literal>-&gt;</literal> to it.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>implementation specific relations</term>
<listitem>
<mediaobject id="railimplementationSpecificRelation">
<imageobject role="html">
<imagedata fileref="railimplementationSpecificRelation.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="railimplementationSpecificRelation.pdf"
format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>implementationSpecificRelation : NAME
distanceInterval? '::'? ;</phrase>
</textobject>
</mediaobject>
<para>An implementation may provide an additional set of
relations, provided that their names do not collide
with the tree relations and ordering relations.</para>
<para>For example, the standard implementation of PML-TQ
defines the relations <literal>echild</literal> and
<literal>eparent</literal> for the PDT 2.0 treebank to
represent the so-called effective-child and effective-parent
relations.</para>
<para>These relations can be optionally followed by an
interval bounding the number of transitive applications of
the relation. It can be of the forms
<literal>{min,max}</literal>, or <literal>{,max}</literal>,
or <literal>{min,}</literal> where <literal>min</literal>
and <literal>max</literal> are positive integers, with
<literal>min</literal> less than or equal to
<literal>max</literal>.</para>
<para>For example, <literal>eparent{1,}</literal> is the
transitive closure of the <literal>eparent</literal>
relation (i.e. effective ancestor). Similarly, in
<programlisting>t-node $a:=[ echild{1,2} $b:=[] ]</programlisting>
the node <literal>$a</literal> must be an effective parent
or an effective grandparent of
<literal>$b</literal>. The queries <programlisting>t-node $a:=[ echild $b:=[] ]</programlisting>
and <programlisting>t-node $a:=[ echild{1,1} $b:=[] ]</programlisting>
are equivalent.</para>
</listitem>
</varlistentry>
</variablelist>
<para>All relation names except for PML-reference relations can be
followed by two colons in cases that could lead to syntactical
ambiguity (e.g. if they can be confused with node types).
PML-reference relations can be followed by an arrow (a dash and
greater-than characters) to avoid confusion with other relations,
node types, or keywords.</para>
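<para>The two disambiguation markers can be sketched as follows (the
placement of whitespace around the markers is an assumption; the grammar
above only fixes their position directly after the relation name). The
query
<programlisting>t-node $a := [ child:: t-node $b := [ ] ]</programlisting>
marks <literal>child</literal> explicitly as a relation name, and
<programlisting>t-node $t := [ a/lex.rf-&gt; a-node $x := [ ] ]</programlisting>
marks <literal>a/lex.rf</literal> explicitly as a PML-reference
relation.</para>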
</listitem>
</varlistentry>
<varlistentry>
<term>predicate</term>
<listitem>
<mediaobject id="rail_predicate">
<imageobject role="html">
<imagedata fileref="rail_predicate.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_predicate.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>predicate : binaryComparison | setPredicate;
binaryComparison : expression ( ('!'? ('=' | TILDA | TILDASTAR
)) | '&lt;' | '&gt;' | '&lt;=' | '&gt;=') expression ;
setPredicate : (expression '!'? 'in' LEFTBRACE (expression
+',') RIGHTBRACE) ;</phrase>
</textobject>
</mediaobject>
<para>A predicate is either a binary comparison of two expressions
(terms) or a set membership predicate applied to a term and a set
specified as an enumeration of other terms.</para>
<para>The binary comparison consists of two expressions and a
binary relation. The relations <literal>~</literal> and
<literal>~*</literal> perform case-sensitive and case-insensitive
regular expression matching, respectively. The expression on the
right must evaluate to a regular expression. For example,
<literal>afun ~ '^(Sb|Aux.*)$'</literal> is true if the value of
the attribute <literal>afun</literal> of the node matched by the current selector
either equals the string <literal>Sb</literal> or starts with
the string <literal>Aux</literal>.</para>
<para>A set predicate consists of an expression, the operator
<literal>in</literal>, and a comma-separated list of expressions
enclosed in braces. It is true if the value computed from the
expression on the left equals at least one of the values computed from the
expressions on the right of the operator.</para>
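<para>The set predicate and negated comparisons can be sketched as
follows (the attribute values are illustrative and assume PDT 2.0 data).
The selector
<programlisting>a-node [ afun in { 'Sb', 'Obj' } ]</programlisting>
matches nodes whose <literal>afun</literal> is either
<literal>Sb</literal> or <literal>Obj</literal>, while
<programlisting>a-node [ afun !in { 'AuxP', 'AuxC' }, m/tag !~ '^V' ]</programlisting>
matches nodes whose <literal>afun</literal> is neither
<literal>AuxP</literal> nor <literal>AuxC</literal> and whose
<literal>m/tag</literal> does not start with <literal>V</literal>.</para>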
</listitem>
</varlistentry>
<varlistentry>
<term>expression</term>
<listitem>
<mediaobject id="rail_expression">
<imageobject role="html">
<imagedata fileref="rail_expression.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_expression.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>expression : (literal | attributePath | function | '('
expression ')') + ('+' | '-' | '*' | 'div' | 'mod' | AMP)
;</phrase>
</textobject>
</mediaobject>
<para>Expressions are either literals (strings, integer or
floating-point numbers), attribute paths, functions, or any
combination of these obtained by applying the binary
string-concatenation operator ('&amp;') or the usual arithmetical
operations for addition ('+'), subtraction ('-'), multiplication
('*'), division ('div') and modulo ('mod'). Parentheses can be used
in the usual manner for grouping sub-expressions.</para>
<para>For example, <literal>afun &amp; "." &amp;
substr(m/tag,0,2)</literal> is an expression returning a
concatenation of the value of the attribute
<literal>afun</literal>, a dot and the first two characters from
the value of the attribute <literal>m/tag</literal>.</para>
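<para>Arithmetical expressions can be used on either side of a
comparison. As a sketch (assuming, as in the PDT 2.0 examples in this
document, that analytical nodes carry a numeric attribute
<literal>ord</literal>):
<programlisting>a-node $a := [ child a-node $b := [ ord = $a.ord + 1 ] ]</programlisting>
This selects a node <literal>$a</literal> together with a child
<literal>$b</literal> whose <literal>ord</literal> is greater by one than
the <literal>ord</literal> of <literal>$a</literal>.</para>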
</listitem>
</varlistentry>
<varlistentry>
<term>literal</term>
<listitem>
<mediaobject id="rail_literal">
<imageobject role="html">
<imagedata fileref="rail_literal.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_literal.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>literal : NUMBER | '"' STRING '"' | "'" STRING "'"
;</phrase>
</textobject>
</mediaobject>
<para>A literal is either a number in decimal notation
(integer or floating point, e.g. <literal>231</literal> or
<literal>-1.0032</literal>) or a string of characters enclosed in
either <literal>"</literal> or <literal>'</literal>. The backslash
character <literal>\</literal> is used as an escape character, for
example to insert a quote or an apostrophe.</para>
<para>For example, <literal>"Peter's"</literal> and
<literal>'Peter\'s'</literal> are both literals representing the
string <literal>Peter's</literal>. Similarly,
<literal>'\\\"\'\n\r\1'</literal> and
<literal>"\\\"\'\n\r\1"</literal> are literals representing the
six-character string <literal>\"'nr1</literal>. (Note that
<literal>\n</literal> does not currently denote the new-line
character as it does in many other languages; this, however, may change in
a future version, so it is advised to avoid escaping characters
other than quote and apostrophe.)</para>
</listitem>
</varlistentry>
<varlistentry id="attributePath">
<term>attributePath</term>
<listitem>
<mediaobject id="rail_attributePath">
<imageobject role="html">
<imagedata fileref="rail_attributePath.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_attributePath.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>attributePath : '*'? (variable '.' )? (XMLNAME |
'content()' | index | elementIndex ) + '/' ; index : '['
NUMBER ']'; elementIndex : XMLNAME index | index
XMLNAME;</phrase>
</textobject>
</mediaobject>
<para>An attribute path refers to a value of an attribute of a
treebank node matched by a certain selector. If the path starts
with a variable followed by a '.' (dot) character, then it
refers to an attribute of the node matched by the selector
associated with the variable. Otherwise it refers to an attribute of the node
matched by the current selector (i.e. the one within whose
constraints it occurs).</para>
<para>In the simplest form, an attribute path is just the name of an
attribute, e.g. <literal>functor</literal>. However, some node
attributes may have complex values, i.e. structures with
attributes of their own. In such a case one forms a slash-delimited
attribute path leading from the node to some nested atomic value.
Each component of the path (step) is one of the following:</para>
<variablelist>
<varlistentry>
<term>name</term>
<listitem>
<para>specifying a member in a structure or any element of
the given name in a sequence</para>
</listitem>
</varlistentry>
<varlistentry>
<term>the string <literal>'content()'</literal></term>
<listitem>
<para>used for obtaining the content of a container</para>
</listitem>
</varlistentry>
<varlistentry>
<term>index
(<literal>[<replaceable>n</replaceable>]</literal>)</term>
<listitem>
<para>specifying <replaceable>n</replaceable>-th element in
an ordered list</para>
</listitem>
</varlistentry>
<varlistentry>
<term>name followed by an index
(<literal><replaceable>foo</replaceable>[<replaceable>n</replaceable>]</literal>)</term>
<listitem>
<para>specifying <replaceable>n</replaceable>-th element of
a given name (<literal>foo</literal>) in a sequence</para>
</listitem>
</varlistentry>
<varlistentry>
<term>name preceded by an index
(<literal>[<replaceable>n</replaceable>]<replaceable>foo</replaceable></literal>)</term>
<listitem>
<para>specifying <replaceable>n</replaceable>-th element in
a sequence and asserting that the
<replaceable>n</replaceable>-th element name is as given
(<literal>foo</literal>)</para>
</listitem>
</varlistentry>
</variablelist>
<para>If a partial attribute path returns an object of a list
type, the next step in the path can either be an index or can be
omitted (in which case any list member matches); if it returns a
sequence, the next step must be an element name optionally
followed or preceded by an index. If it is a structure or
container, the next step must be a name of an attribute; if it
returns an alternative or an unordered list, the next step must be
a valid step for the members of the alternative/list; if it
returns an atomic value, there must be no further step.</para>
<para>For example, <literal>gram/sempos</literal> selects the
attribute <literal>sempos</literal> of a structure stored in the
attribute <literal>gram</literal> of the node matched by the
current selector. The path <literal>a/[2]/b</literal> may select
an attribute <literal>b</literal> of a structure stored in the 2nd member of an
ordered list stored in the attribute <literal>a</literal>; the
path <literal>a/b</literal> selects the attribute <literal>b</literal> of any member
of the list <literal>a</literal>.</para>
<para>If some part of an attribute path leads to an alternative,
list, or sequence, the path may match multiple values. Assume
<replaceable>R</replaceable> is some predicate containing such an
attribute path in one of its expressions. If the attribute path is
preceded by the <literal>*</literal> character (a primitive
universal quantifier), then the predicate is true if and only if
it is true for every value matched by the attribute path. If
the attribute path is not preceded by <literal>*</literal>, then an
existential quantifier is assumed: the predicate is true if and
only if there exists a value matched by the attribute path for
which the predicate is true. If the predicate contains several such
attribute paths, the quantifiers (universal and implicit
existential) are applied in the same order in which the attribute
paths occur in the predicate. See <xref linkend="quantifiers" />
for details.</para>
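<para>The two quantifiers can be contrasted with the following sketch. It
assumes that the values of the list attribute <literal>a/aux.rf</literal>
of a <literal>t-node</literal> can be compared as plain strings; the
pattern <literal>'^a-'</literal> is purely hypothetical. The selector
<programlisting>t-node [ a/aux.rf ~ '^a-' ]</programlisting>
is satisfied if at least one member of the list matches the pattern,
whereas
<programlisting>t-node [ *a/aux.rf ~ '^a-' ]</programlisting>
is satisfied only if every member of the list matches it.</para>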
<para>Example: <programlisting>a-node $p:= [ child a-node [ afun=$p.afun, afun~'^Aux' ] ]</programlisting>
The above query selects a node and its child node (both of type
<literal>a-node</literal>) that have the same value of the
attribute <literal>afun</literal> and the value starts with the
substring <literal>Aux</literal>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>function</term>
<listitem>
<mediaobject id="rail_function">
<imageobject role="html">
<imagedata fileref="rail_function.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_function.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>function : FUNCTION '(' (expression + ',' ) ')'
;</phrase>
</textobject>
</mediaobject>
<para>A function call is written as the function name followed by a
comma-separated list of its arguments in parentheses. The functions
currently supported by PML-TQ are listed in Section <xref
linkend="functions" />.</para>
</listitem>
</varlistentry>
<varlistentry id="outputFilter">
<term>outputFilter</term>
<listitem>
<mediaobject id="rail_outputFilter">
<imageobject role="html">
<imagedata fileref="rail_outputFilter.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_outputFilter.pdf" format="PDF" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>outputFilter : '&gt;&gt;' ('for' (columnExp + ',')
'give' | ('give')?) ('distinct')? (columnExp + ',') \\
(('&gt;&gt;')? 'sort by' ((DOLLAR NUMBER ('asc'|'desc')?) +
','))? \\ ('&gt;&gt;' 'filter' rowConstraint *); rowConstraint
: ( ('!')? (rowBinaryComparison | rowSetPredicate | '('
rowConstraint ')') ) | (rowConstraint ('and' | 'or' | ',' )
rowConstraint) ; rowBinaryComparison : columnExp ( ('!'? ('='
| TILDA | TILDASTAR )) | '&lt;' | '&gt;' | '&lt;=' | '&gt;=')
columnExp ; rowSetPredicate : (columnExp '!'? 'in' LEFTBRACE
(literal +',') RIGHTBRACE) ;</phrase>
</textobject>
</mediaobject>
<para>An output filter transforms its input (the result of the
selective part of the query or the output of the previous filter)
into a table of values arranged into columns computed according to
the filter specification.</para>
<para>The simplest form of a filter consists of a comma-separated
list of expressions, each of which computes a column value in an
output row from either the values of (some) columns of the input
row (output from the previous filter) or, in the case of the first
filter, the attributes of nodes matched by the selective part of
the query. For example: <programlisting>&gt;&gt; $x.m/tag, depth($y)
&gt;&gt; 'found ' &amp; $1 &amp; ' at depth ' &amp; $2</programlisting>
defines two output filters: the first returns a table whose rows
correspond directly to the matches of the query's selective part.
Its first column contains the value of the attribute
<literal>m/tag</literal> of the node matched by the node selector
<literal>$x</literal> and the second column is the depth of the
node matched by the node selector <literal>$y</literal>. The
second output filter produces a table with just one column with
values of the form <literal>found <replaceable>X</replaceable> at
depth <replaceable>Y</replaceable></literal>, where
<replaceable>X</replaceable> and <replaceable>Y</replaceable>
represent the two columns from the preceding filter.</para>
<para>The column expressions can also use group functions that do
not specify partitioning (using an <literal>over</literal> clause)
and therefore range over the whole input table. In that case, all
column expressions must result in constant values and the filter
will produce a single row.</para>
<para>For example, <programlisting>&gt;&gt;min($1), max($1), avg($1)</programlisting>
is an output filter that computes the minimum, maximum and average
of the values in the first column of the preceding filter (and
returns a table with exactly one row having these three values in
its three columns).</para>
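<para>The grammar above also permits the keyword
<literal>distinct</literal> before the list of column expressions,
although it is not discussed further in this description. The
following sketch assumes that <literal>distinct</literal> suppresses
duplicate output rows, as the keyword of the same name does in SQL:
<programlisting>a-node $n := [ ];
&gt;&gt; distinct $n.afun</programlisting> Under that assumption, the
output would list each value of <literal>afun</literal> found in the
treebank exactly once.</para>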
<para>Each filter may be followed by a <literal>&gt;&gt; sort
by</literal> clause which specifies the order of rows in the
resulting table. The clause is followed by a comma-separated list
of column references. They refer to the columns of the table
produced by this filter and are used as the primary, secondary,
tertiary, etc. sorting keys. Each column reference can
be followed by the word <literal>asc</literal> or
<literal>desc</literal> to enforce ascending (default) or
descending ordering on the corresponding column.</para>
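<para>As an illustrative sketch (the choice of columns is arbitrary),
the following filter lists the <literal>afun</literal> and the depth of
every a-node, deepest nodes first; rows with equal depth are ordered by
<literal>afun</literal> in the default ascending order:
<programlisting>a-node $n := [ ];
&gt;&gt; $n.afun, depth($n)
&gt;&gt; sort by $2 desc, $1</programlisting> Here
<literal>$2</literal> serves as the primary and <literal>$1</literal>
as the secondary sorting key.</para>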
<para>The filter may optionally be preceded by a
<literal>for</literal> clause that partitions the input rows into
groups according to given keys and produces one output row for
each group; any group function occurring in the output column
expression will range over the current group.</para>
<para>The <literal>for</literal> clause consists of a set of
column expressions used to compute a vector of values from each
input row. This vector serves as a key by which the input rows are
partitioned into groups. The <literal>give</literal> clause now
produces one output row for each group. Note that the column
references occurring in the <literal>give</literal> clause are
interpreted as references to the columns of the key vector, except
for the case of column references occurring in arguments of group
functions (that do not declare their own partitioning using an
<literal>over</literal> clause), which refer to the columns of the
input rows within the current group.</para>
<para>For example: <programlisting>
a-node $a:= [ child a-node $b := [] ];
&gt;&gt; for $a.afun, $b.afun
give $1, $2, count()
sort by $3 desc</programlisting> is a query whose selective part returns pairs
of a-nodes in the parent-child relationship. The output filter
first computes a key, in this case a vector of two values: the
attribute <literal>afun</literal> of the parent and the child.
Then it groups the results according to this key and for each
group returns a row consisting of three values: the two columns
from the key vector (<literal>$1, $2</literal>) and the number of
elements in the corresponding group (<literal>count()</literal>).
The <literal>sort by</literal> clause reorders this output by the
third column in descending order. As a result we get the frequency
table of co-occurrence of <literal>afun</literal> attribute values
on an edge in the treebank with the most frequent
<literal>afun</literal> pairs first, which may look like this:
<programlisting>
Atr Atr 87306
AuxP Adv 56344
Sb Atr 56269
Obj Atr 51908
Adv Atr 48741
Pred Sb 44963
Pred AuxP 39125
Pred Obj 31744
AuxP Atr 31227
Coord Atr 30569
Coord Pred 24739
Pred Adv 24196
...</programlisting></para>
<para>A filter can further be followed by one or more
<literal>&gt;&gt; filter</literal> clauses which drop certain
output rows, leaving only those rows that satisfy a specific
constraint. For example, the following query <programlisting>a-node $n:= [ ]
# Filter 1
&gt;&gt; give lower($n.m/form)
&gt;&gt; filter $1 ~ '^[a-z]'
# Filter 2
&gt;&gt; for $1 give $1,count()
&gt;&gt; filter $2 &gt;= 1000
# Filter 3
&gt;&gt; give $1,$2 sort by $2 desc</programlisting> retrieves a list of frequent
word forms from a treebank. The selective part returns each node
<literal>$n</literal> of the type <literal>a-node</literal>. For
each such node, Filter 1 retrieves its lowercase word form from
the attribute <literal>m/form</literal> and drops those forms that
do not start with a letter. Filter 2 groups the rows with the same
form and counts the number of rows in each group, producing a
table with two columns: a form and the number of its occurrences; due
to the grouping, the first column now contains only unique values.
The <literal>filter</literal> clause of Filter 2 prunes this
table, preserving only rows where the number of occurrences is at
least 1000. Finally, Filter 3 copies both input columns and orders
the table by its second column in decreasing order, so that
the most frequent forms appear on top.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>columnExp</term>
<listitem>
<mediaobject id="rail_columnExp">
<imageobject role="html">
<imagedata fileref="rail_columnExp.png" format="PNG"
role="html" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="rail_columnExp.pdf" format="PDF"
width="100%" />
</imageobject>
<textobject role="syntax_diagram">
<phrase>columnExp : ((literal | inputValueRef | FUNCTION '('
(columnExp + ',' ) ')' | GROUPFUNCTION '(' (columnExp + ',' )
('over' (columnExp + ',' ) ('sort by' (columnRef + ',' ))?)?
')' | '(' columnExp ')') + ('+' | '-' | '*' | 'div' | 'mod' |
AMP)) ; inputValueRef: attributePath | columnRef ; columnRef:
DOLLAR NUMBER;</phrase>
</textobject>
</mediaobject>
<para>Column expressions are expressions that additionally allow
group functions and column references of the form
<literal>$<replaceable>N</replaceable></literal>, where
<literal><replaceable>N</replaceable></literal> is the number of the
referenced column. Note that in the
first output filter following the selective part of the query,
attribute paths must be used to refer to the input of the filter,
whereas everywhere else column references must be used.</para>
<para>The exact rules on usage of group functions and column
references in column expressions are detailed in the description
of output filters above.</para>
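<para>The following sketch illustrates the distinction between the two
kinds of input references; the node type and attributes are borrowed
from the examples above and the output format is arbitrary:
<programlisting>a-node $n := [ ];
&gt;&gt; $n.afun, depth($n) + 1
&gt;&gt; $1 &amp; ':' &amp; $2</programlisting> The first filter
directly follows the selective part, so it refers to its input through
the attribute path <literal>$n.afun</literal> and the node reference
<literal>$n</literal>. The second filter sees only the table produced
by the first one and therefore refers to its columns as
<literal>$1</literal> and <literal>$2</literal>.</para>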
</listitem>
</varlistentry>
</variablelist>
</section>
<section id="attribute_paths">
<title>Semantics of attribute paths</title>
<para>Nodes in PML are usually attribute-value structures with values
that can be of either atomic PML data types (integers, strings, etc.) or
complex PML data types (structures, lists, alternatives, sequences,
containers). This means that to get to an atomic value stored within a
node we sometimes have to traverse several nested complex data
structures, starting from the top-level structure which is the node
itself. PML-TQ uses attribute paths to describe the route of such a
traversal.</para>
<para>The syntax of an attribute path is described by the <link
linkend="attributePath">attributePath</link> production of the PML-TQ
grammar.</para>
<para>In this section we define how attribute paths are evaluated.
(Note: complex queries over nested attribute values can be performed by
combining attribute paths with the <literal>member</literal> selector,
see <xref linkend="member_selectors" />.)</para>
<para>An attribute path consists of a sequence of steps separated by
slashes. This sequence can be optionally preceded by a primitive
quantifier the semantics of which is described later in <xref
linkend="quantifiers" />. Each step is either a name, the string
'<literal>content()</literal>', an index, or an element index.</para>
<para>Let <phrase role="math">P</phrase> be an attribute path consisting
of steps <phrase
role="math">S<subscript>1</subscript>,…,S<subscript>m</subscript>
(m&gt;0)</phrase>. The result of evaluation of <phrase
role="math">P</phrase> on a node <phrase role="math">N</phrase> is a set
<phrase role="math">V<subscript>N,P</subscript></phrase> of atomic
values (integers, floats, or character strings). The evaluation proceeds
by evaluating the steps from 1 to <phrase role="math">m</phrase>; the
evaluation of the step <phrase
role="math">S<subscript>n</subscript></phrase> (<phrase
role="math">1≤n≤m</phrase>) takes as input a set of values <phrase
role="math">V<subscript>n</subscript></phrase> and results in a set of
values <phrase role="math">V<subscript>n+1</subscript></phrase>. For
each <phrase role="math">n</phrase> (<phrase
role="math">1≤n≤m+1</phrase>), all values in <phrase
role="math">V<subscript>n</subscript></phrase> are of equal data type
(which, if <phrase role="math">P</phrase> is a valid attribute path for
<phrase role="math">N</phrase>, is complex for <phrase
role="math">n≤m</phrase> and atomic for <phrase
role="math">n=m+1</phrase>).</para>
<para>The evaluation is initialized with a set <phrase
role="math">V<subscript>1</subscript>={N}</phrase> whose only element is
the node <phrase role="math">N</phrase>, which is either a structure or
a container. The result of the evaluation is the set <phrase
role="math">V<subscript>N,P</subscript>=V<subscript>m+1</subscript></phrase>.</para>
<para>Let <phrase role="math">V<subscript>n</subscript></phrase> be
given. We now describe the evaluation based on the syntactic type of the
<phrase role="math">n</phrase>-th step <phrase
role="math">S<subscript>n</subscript></phrase> and the data type of
values in <phrase role="math">V<subscript>n</subscript></phrase>. Let
<phrase role="math">t<subscript>n</subscript></phrase> denote the type
of the values in <phrase role="math">V<subscript>n</subscript></phrase>.
<itemizedlist>
<listitem>
<para>If <phrase role="math">t<subscript>n</subscript></phrase> is
an atomic type (and there is an <phrase role="math">n</phrase>-th
step), the query compiler reports an error and fails.</para>
</listitem>
<listitem>
<para>If <phrase role="math">t<subscript>n</subscript></phrase> is
a structure or container type and <phrase
role="math">S<subscript>n</subscript></phrase> is a name of a
valid member for <phrase
role="math">t<subscript>n</subscript></phrase> (according to the
declaration of <phrase
role="math">t<subscript>n</subscript></phrase> in the
corresponding PML schema) then for every element of <phrase
role="math">V<subscript>n</subscript></phrase>, <phrase
role="math">V<subscript>n+1</subscript></phrase> contains the
value of its member <phrase
role="math">S<subscript>n</subscript></phrase>, provided the value
is non-null; i.e. <informalequation id="eq-a1">
<alt role="tex">V_{n+1}=\{v.{S_n}\,;\,v\in
V_n\,\&amp;\,\neg\hbox{\sf is-null}(v.{S_n})\}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a1.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a1.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> If the step <phrase
role="math">S<subscript>n</subscript></phrase> is
'<literal>content()</literal>' and <phrase
role="math">t<subscript>n</subscript></phrase> is a container type
with non-void content (i.e. the container has some content type
declared in the PML schema), <phrase
role="math">V<subscript>n+1</subscript></phrase> consists of the
content value of each element of <phrase
role="math">V<subscript>n</subscript></phrase>, i.e.
<informalequation id="eq-a2">
<alt role="tex">V_{n+1}=\{c\,;\,c=\hbox{\sf
content-of}(v)\,\&amp;\,v\in V_n\,\&amp;\,\neg\hbox{\sf
is-null}(c)\}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a2.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a2.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> If <phrase
role="math">S<subscript>n</subscript></phrase> is anything else,
the query compiler reports an error and fails.</para>
</listitem>
<listitem>
<para>If <phrase role="math">t<subscript>n</subscript></phrase> is
an ordered list and <phrase
role="math">S<subscript>n</subscript></phrase> is an index of the
form <literal>[k]</literal>, where <phrase role="math">k</phrase>
is a non-negative integer number, then <phrase
role="math">V<subscript>n+1</subscript></phrase> consists of the
<phrase role="math">(k+1)</phrase>-th value (<phrase
role="math">v[k]</phrase>) of each list in <phrase
role="math">V<subscript>n</subscript></phrase> whose length is at
least <phrase role="math">k+1</phrase>, i.e. <informalequation
id="eq-a3">
<alt role="tex">V_{n+1}=\{v[k]\,;\,v\in V_n\,\&amp;\,\hbox{\sf
length}(v)&gt;k\,\&amp;\,\neg\hbox{\sf is-null}(v[k])\}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a3.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a3.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation></para>
</listitem>
<listitem>
<para>If <phrase role="math">t<subscript>n</subscript></phrase> is
a list or an alternative, and <phrase
role="math">S<subscript>n</subscript></phrase> is any step except
for the case covered above (where <phrase
role="math">t<subscript>n</subscript></phrase> was an ordered list
and <phrase role="math">S<subscript>n</subscript></phrase> an
index), let <phrase
role="math">V'<subscript>n</subscript></phrase> consist of all
non-null values that occur in some list or alternative from
<phrase role="math">V<subscript>n</subscript></phrase>, i.e.
<informalequation id="eq-a4">
<alt role="tex">V'_{n}=\{w\,;\,(\exists v\in V_n)(\exists
k&lt;\hbox{\sf length}(v))(w=v[k]\,\&amp;\,\neg\hbox{\sf
is-null}(w))\}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a3.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a3.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> Then <phrase
role="math">V<subscript>n+1</subscript></phrase> is obtained by
applying the step <phrase
role="math">S<subscript>n</subscript></phrase> to the set <phrase
role="math">V'<subscript>n</subscript></phrase> instead of <phrase
role="math">V<subscript>n</subscript></phrase>, according to these
rules.</para>
</listitem>
<listitem>
<para>If <phrase role="math">t<subscript>n</subscript></phrase> is
a sequence type and <phrase
role="math">S<subscript>n</subscript></phrase> is an element name
<phrase role="math">E</phrase> valid for the sequence type <phrase
role="math">t<subscript>n</subscript></phrase>, then <phrase
role="math">V<subscript>n+1</subscript></phrase> consists of all
non-null values of elements with a given name occurring in some
sequence from <phrase
role="math">V<subscript>n</subscript></phrase>, i.e.
<informalequation id="eq-a5">
<alt role="tex">V_{n+1}=\{w\,;\,(\exists v\in V_n)(\exists
k&lt;\hbox{\sf length}(v))(\hbox{\sf
name-of}(v[k])=E\,\&amp;\,w=\hbox{\sf
value-of}(v[k])\,\&amp;\,\neg\hbox{\sf is-null}(w))\}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a4.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a4.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation></para>
<para>If <phrase role="math">S<subscript>n</subscript></phrase> is
of the form '<literal>E[k]</literal>', where <phrase
role="math">E</phrase> is a valid element name for <phrase
role="math">t<subscript>n</subscript></phrase> and <phrase
role="math">k</phrase> is a non-negative integer number, then for
each sequence in <phrase
role="math">V<subscript>n</subscript></phrase> the <phrase
role="math">(k+1)</phrase>-st occurrence of an element named <phrase
role="math">E</phrase> (if it exists) is taken and the element's
value, if non-null, is added to <phrase
role="math">V<subscript>n+1</subscript></phrase>. Formally,
<informalequation id="eq-a6">
<?dbtex delims='no'?>
<alt role="tex">\begin{equation*} \begin{split}
V_{n+1}=\{w\,;\,(\exists v\in V_n)(\exists j&lt;\hbox{\sf
length}(v))( &amp;\#\{l&lt;j \,;\, \hbox{\sf name-of}(v[l])=E
\}=k\\ &amp;\,\&amp;\,\hbox{\sf name-of}(v[j])=E\\
&amp;\,\&amp;\,\hbox{\sf value-of}(v[j])=w
\,\&amp;\,\neg\hbox{\sf is-null}(w))\} \end{split}
\end{equation*}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a5.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a5.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation></para>
<para>If <phrase role="math">S<subscript>n</subscript></phrase> is
of the form '<literal>[k]E</literal>', where <phrase
role="math">E</phrase> is a valid element name for <phrase
role="math">t<subscript>n</subscript></phrase> and <phrase
role="math">k</phrase> is a non-negative integer number, then for
each sequence in <phrase
role="math">V<subscript>n</subscript></phrase> of length at least
<phrase role="math">k+1</phrase> the <phrase
role="math">(k+1)</phrase>-st element is taken and if its name is
<phrase role="math">E</phrase> and its value is non-null, the
value is added to <phrase
role="math">V<subscript>n+1</subscript></phrase>. Formally,
<informalequation id="eq-a7">
<?dbtex delims='no'?>
<alt role="tex">\begin{equation*} \begin{split}
V_{n+1}=\{w\,;\,(\exists v\in V_n)(&amp;\hbox{\sf
length}(v)&gt;k\,\&amp;\, \hbox{\sf name-of}(v[k])=E\\
&amp;\,\&amp;\,w=\hbox{\sf
value-of}(v[k])\,\&amp;\,\neg\hbox{\sf is-null}(w))\}
\end{split} \end{equation*}</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_a6.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_a6.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation></para>
</listitem>
</itemizedlist></para>
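<para>As a worked illustration of these rules, assume (as in the
example in <xref linkend="member_selectors" />) that nodes of the type
<literal>t-node</literal> have an attribute
<literal>bridging</literal> that is a list of structures, each with a
string member <literal>informal-type</literal>, and consider the
attribute path in the following query: <programlisting>t-node [ bridging/informal-type = "CONTRAST" ]</programlisting>
The evaluation on a matched node <phrase role="math">N</phrase> starts
with <phrase role="math">V<subscript>1</subscript>={N}</phrase>. The
node is a structure and <literal>bridging</literal> is the name of one
of its members, so <phrase
role="math">V<subscript>2</subscript></phrase> contains the list stored
in <literal>bridging</literal> (provided it is non-null). For the
second step, <phrase role="math">t<subscript>2</subscript></phrase> is
a list and the step is a name, so <phrase
role="math">V'<subscript>2</subscript></phrase> first collects all
non-null structures from that list and the name step is then applied to
them; <phrase role="math">V<subscript>3</subscript></phrase> thus
consists of the <literal>informal-type</literal> strings of all
elements of the list. If the list were declared as ordered, the path
<literal>bridging/[0]/informal-type</literal> would, by the index rule
above, consider only its first element.</para>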
</section>
<section id="quantifiers">
<title>Primitive quantifiers</title>
<para>As described above, evaluation of an attribute path on some node
may result in a set of values (usually empty or containing one element,
but if lists, alternatives or sequences are on the path, the set can be
of arbitrary finite cardinality). In this section we define how the
truth value of a predicate involving attribute paths is evaluated. We
start with the formal definition and then demonstrate it on a few
examples.</para>
<formalpara>
<title>Definition</title>
<para>Let <phrase role="math">R</phrase> be a predicate and let
<phrase
role="math">p<subscript>1</subscript>,…,p<subscript>n</subscript></phrase>
be an enumeration of all occurrences of attribute paths in expressions
(terms) contained in <phrase role="math">R</phrase> in the order of
occurrence and let <phrase
role="math">N<subscript>1</subscript>,…,N<subscript>n</subscript></phrase>
be the nodes to which the attribute paths relate. For each <phrase
role="math">i</phrase> (<phrase role="math">1≤i≤n</phrase>), let <phrase
role="math">Q<subscript>i</subscript></phrase> denote the universal
quantifier <inlineequation>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifier_for_all.png" format="PNG" />
</imageobject>
<textobject role="tex">
<phrase>\forall</phrase>
</textobject>
<textobject>
<phrase>∀</phrase>
</textobject>
</inlinemediaobject>
</inlineequation> if <phrase role="math">p<subscript><phrase
role="math">i</phrase></subscript></phrase> starts with a '<phrase
role="math">*</phrase>' and <inlineequation>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifier_exists.png" format="PNG" />
</imageobject>
<textobject role="tex">
<phrase>\exists</phrase>
</textobject>
<textobject>
<phrase>∃</phrase>
</textobject>
</inlinemediaobject>
</inlineequation> otherwise. Let <phrase
role="math">V<subscript>i</subscript></phrase> denote the set <phrase
role="math">V<subscript>N<subscript>i</subscript>,p<subscript>i</subscript></subscript></phrase>
of atomic values obtained by evaluating the attribute path <phrase
role="math">p<subscript>i</subscript></phrase> on the node <phrase
role="math">N<subscript>i</subscript></phrase>. Then the truth value
of the predicate <phrase role="math">R</phrase> for this assignment is
that of the formula <inlineequation id="eq-r1">
<alt role="tex">(Q_1 x_1 \in V_1)\ldots(Q_n x_n \in V_n)R'</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_r1.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_r1.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation> where <phrase role="math">R'</phrase> is an atomic
formula obtained from <phrase role="math">R</phrase> by replacing the
attribute paths <phrase
role="math">p<subscript>1</subscript>,…,p<subscript>n</subscript></phrase>
by variables <phrase
role="math">x<subscript>1</subscript>,…,x<subscript>n</subscript></phrase>.
(Note that each of the variables has exactly one occurrence in <phrase
role="math">R'</phrase>.)</para>
</formalpara>
<para>Now let us demonstrate the definition on some examples. Let
<phrase role="math">R</phrase> be the predicate <programlisting>*a != b/c + *$n.d/e/f</programlisting>
in the query <programlisting>z-node $n := [ child z-node $m := [ *a != b/c + *$n.d/e/f ] ]</programlisting>
where <literal>z-node</literal> is some node type, and
<literal>a</literal>, <literal>b/c</literal>, and <literal>d/e/f</literal>
are valid attribute paths for nodes of this type. Let <phrase
role="math">N</phrase> and <phrase role="math">M</phrase> be nodes from
the searched data assigned to the selectors <literal>$n</literal> and
<literal>$m</literal>.</para>
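<para>Let's assume that the attribute <literal>a</literal> is a list of
numbers and that <phrase
role="math">V<subscript>M,a</subscript></phrase> is the set of all
values of the attribute <literal>a</literal> for the node <phrase
role="math">M</phrase> (the path <literal>a</literal>, like
<literal>b/c</literal>, carries no selector prefix and therefore relates
to the node matched by <literal>$m</literal>). Similarly, let <phrase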
role="math">V<subscript>M,b/c</subscript></phrase> be the set of all
values of the attribute path <literal>b/c</literal> evaluated on the
node <phrase role="math">M</phrase> and <phrase
role="math">V<subscript>N,d/e/f</subscript></phrase> the set of all
values of the attribute path <literal>d/e/f</literal> evaluated on the
node <phrase role="math">N</phrase>.</para>
<para>The truth value of the predicate <phrase role="math">R</phrase> is
obtained by evaluating the following formula: <informaltable colsep="0"
frame="none" pgwide="1" rowsep="0">
<tgroup cols="2" colsep="0pt">
<colspec align="center" colnum="1" colwidth="95" />
<colspec align="right" colnum="2" colwidth="5" />
<tbody>
<row>
<entry><para> <informalequation id="eq-1">
<alt role="tex">(\forall x \in V_{M,a})(\exists y \in
V_{M,b/c})(\forall z \in V_{N,d/e/f}) x\neq y + z</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_f1.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_f1.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> </para></entry>
<entry>(1)</entry>
</row>
</tbody>
</tgroup>
</informaltable> Thus, the query matches the nodes <phrase
role="math">N</phrase>, <phrase role="math">M</phrase> if and only if
<phrase role="math">M</phrase> is a child of <phrase
role="math">N</phrase> and the above formula holds. Note that since the
attribute paths <literal>*a</literal> and <literal>*$n.d/e/f</literal>
start with the character '<literal>*</literal>' (called a primitive
universal quantifier) the corresponding variables <phrase
role="math">x</phrase> and <phrase role="math">z</phrase> in the formula
<link linkend="eq-1">(1)</link> are universally quantified, whereas
<phrase role="math">y</phrase>, corresponding to <literal>b/c</literal>,
is quantified existentially. Note also that the order of the
quantifiers in the formula <link linkend="eq-1">(1)</link> preserves the
order of the occurrence of the corresponding attribute paths in <phrase
role="math">R</phrase>.</para>
<para>If we had instead written <programlisting>! *a = b/c + *$n.d/e/f</programlisting>
moving the negation (<literal>!</literal>) out of the predicate, the
resulting formula would be completely different: <informaltable
colsep="0" frame="none" pgwide="1" rowsep="0">
<tgroup cols="2" colsep="0pt">
<colspec align="center" colnum="1" colwidth="95" />
<colspec align="right" colnum="2" colwidth="5" />
<tbody>
<row>
<entry><para> <informalequation id="eq-2">
<alt role="tex">\neg(\forall x \in V_{M,a})(\exists y \in
V_{M,b/c})(\forall z \in V_{N,d/e/f}) x= y + z</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_f2.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_f2.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> </para></entry>
<entry>(2)</entry>
</row>
</tbody>
</tgroup>
</informaltable> and equivalent to <informaltable colsep="0"
frame="none" pgwide="1" rowsep="0">
<tgroup cols="2" colsep="0pt">
<colspec align="center" colnum="1" colwidth="95" />
<colspec align="right" colnum="2" colwidth="5" />
<tbody>
<row>
<entry><para> <informalequation id="eq-3">
<alt role="tex">(\exists x \in V_{M,a})(\forall y \in
V_{M,b/c})(\exists z \in V_{N,d/e/f}) x\neq y + z</alt>
<mediaobject>
<imageobject role="html">
<imagedata align="center" fileref="quantifiers_f3.png"
format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata align="center" fileref="quantifiers_f3.pdf"
format="PDF" />
</imageobject>
</mediaobject>
</informalequation> </para></entry>
<entry>(3)</entry>
</row>
</tbody>
</tgroup>
</informaltable></para>
<para>Let us give some more practical examples. Assume nodes of the type
<literal>p-node</literal> have an attribute <literal>functions</literal>
that is an unordered list or an alternative of string values. Let us
consider the following constraints and the translations to formulae
(assuming <phrase role="math">V<subscript>N</subscript></phrase> is the
set of values of the attribute <literal>functions</literal> for a given p-node
<phrase role="math">N</phrase>):</para>
<orderedlist>
<listitem>
<para><literal>functions = functions</literal> asserts that the list
<literal>functions</literal> on <phrase role="math">N</phrase> is
non-empty, since the constraint translates to the formula
<inlineequation id="eq-5">
<alt role="tex">\qquad\qquad (\exists x \in V_N)(\exists y \in
V_N)x=y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f5.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f5.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation> which is equivalent to <inlineequation>
<alt role="tex">(\exists x \in V_N)x=x</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f5_1.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f5_1.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation> and to <inlineequation>
<alt role="tex">V_N\neq\emptyset</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f5_2.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f5_2.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>.</para>
</listitem>
<listitem>
<para><literal>functions != functions</literal> asserts that the
list <literal>functions</literal> on <phrase role="math">N</phrase>
contains (at least) two distinct strings, since it translates to
<inlineequation id="eq-6">
<alt role="tex">\qquad\qquad (\exists x \in V_N)(\exists y \in
V_N)x\neq y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f6.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f6.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>.</para>
</listitem>
<listitem>
<para><literal>!functions = functions</literal> means that the list
<literal>functions</literal> on <phrase role="math">N</phrase> is
empty, since it is the negation of
<literal>functions = functions</literal>.</para>
</listitem>
<listitem>
<para><literal>*functions = *functions</literal> asserts that the
list <literal>functions</literal> on <phrase role="math">N</phrase>
contains at most one unique value because it translates to
<inlineequation id="eq-7">
<alt role="tex">\qquad\qquad (\forall x \in V_N)(\forall y \in
V_N)x= y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f7.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f7.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>, which is satisfied if the set <phrase
role="math">V<subscript>N</subscript></phrase> is either empty or
contains exactly one element.</para>
</listitem>
<listitem>
<para><literal>*functions = functions</literal> is always satisfied,
because it translates to <inlineequation id="eq-8">
<alt role="tex">\qquad\qquad (\forall x \in V_N)(\exists y \in
V_N)x= y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f8.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f8.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>.</para>
</listitem>
<listitem>
<para><literal>functions = *functions</literal> asserts that the
list <literal>functions</literal> contains exactly one unique value,
since it translates to <inlineequation id="eq-9">
<alt role="tex">\qquad\qquad (\exists x \in V_N)(\forall y \in
V_N)x=y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f9.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f9.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>.</para>
</listitem>
<listitem>
<para><literal>*functions != functions</literal> asserts that the
list <literal>functions</literal> is either empty or contains at
least two unique values, since it translates to <inlineequation
id="eq-10">
<alt role="tex">\qquad\qquad (\forall x \in V_N)(\exists y \in
V_N)x\neq y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f10.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f10.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>.</para>
</listitem>
<listitem>
<para><literal>functions != *functions</literal> translates to
<inlineequation id="eq-11">
<alt role="tex">\qquad\qquad (\exists x \in V_N)(\forall y \in
V_N)x\neq y</alt>
<inlinemediaobject>
<imageobject role="html">
<imagedata fileref="quantifiers_f11.png" format="PNG" />
</imageobject>
<imageobject role="fo">
<imagedata fileref="quantifiers_f11.pdf" format="PDF" />
</imageobject>
</inlinemediaobject>
</inlineequation>, which is never satisfied (since x=x).</para>
</listitem>
</orderedlist>
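<para>The four quantified comparisons above can be checked mechanically.
The following is a minimal Python sketch, not part of PML-TQ: the function
names are hypothetical, and the value set is modelled as a Python
<literal>set</literal>. Each function mirrors one of the translated
formulas using nested <literal>all</literal> and <literal>any</literal>
calls.</para>
<programlisting language="python"># Hypothetical helper names; the value set V_N is modelled as a Python set.
def all_eq_some(values):   # *functions = functions
    return all(any(x == y for y in values) for x in values)

def some_eq_all(values):   # functions = *functions
    return any(all(x == y for y in values) for x in values)

def all_ne_some(values):   # *functions != functions
    return all(any(x != y for y in values) for x in values)

def some_ne_all(values):   # functions != *functions
    return any(all(x != y for y in values) for x in values)

# Evaluate all four forms on an empty, a one-element and a two-element set.
for values in (set(), {"ACT"}, {"ACT", "PAT"}):
    print(sorted(values), all_eq_some(values), some_eq_all(values),
          all_ne_some(values), some_ne_all(values))</programlisting>
<para>On these three sample sets, only the one-element set satisfies the
second form, only the empty and the two-element set satisfy the third, and
no set satisfies the fourth.</para>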
</section>
<section id="member_selectors">
<title>Member selectors</title>
<para>While primitive quantifiers allow one to quantify over atomic
values on a complete attribute path, sometimes it is useful to quantify
over elements on a partial attribute path while specifying
conditions on the quantified element. This can be done using the
relation <literal>member</literal>, which allows us to access data
structures stored in a node's attributes almost as if they were regular
nodes.</para>
<para>For example, assume each node contains an annotation of
bridging anaphora. This can be represented by an attribute called
<literal>bridging</literal> that is a list of structures with two
members, <literal>informal-type</literal> (a string value for the type
of the anaphora) and <literal>target-node.rf</literal> (a PML reference to
the node the anaphora refers to). So, each element of the
<literal>bridging</literal> list represents a labeled link to another
node. We might want to ask for nodes with
<literal>functor="ACT"</literal> that are pointed to from the current
node by a pointer with a specific <literal>informal-type</literal>, say
<literal>"CONTRAST"</literal>.</para>
<para>The following query will <emphasis>not work</emphasis>:
<programlisting>t-node $n := [
bridging/informal-type = "CONTRAST",
bridging/target-node.rf t-node [ functor="ACT" ]
]</programlisting> The problem with this query is that the
<literal>bridging</literal> attribute contains a list and the constraints on
its values are independent. So, <literal>$n</literal> can still match a
node whose <literal>bridging</literal> attribute is a list consisting of
two list members, one that points to an ACTor but whose
<literal>informal-type</literal> is not <literal>"CONTRAST"</literal>,
and one with <literal>informal-type="CONTRAST"</literal> but pointing
somewhere else.</para>
<para>In order to be able to say that these list members are the same,
we have to use the <literal>member</literal> selector:</para>
<programlisting>t-node $n:= [
member bridging [
informal-type = "CONTRAST",
target-node.rf t-node [ functor="ACT" ]
]
]</programlisting>
<para>This query says that one of the values matched by the attribute
path <literal>bridging</literal> (i.e. one of the members of the list
contained in the attribute <literal>bridging</literal> of the node
matched by <literal>$n</literal>) has
<literal>informal-type="CONTRAST"</literal> and points to a
<literal>t-node</literal> with <literal>functor="ACT"</literal>. Thus,
the <literal>member</literal> selector works as an existential
quantification over values of a list or alt attribute. We can also use
the <literal>member</literal> selector for sequence attributes, but the
attribute path that follows the <literal>member</literal> keyword must
specify a particular element name in the last step. For example, if
<literal>bridging</literal> was not a list but a sequence of elements
named e.g. <literal>terminal</literal>, <literal>non-terminal</literal>,
we must use either <programlisting>member bridging/terminal [...] </programlisting>
to quantify over all terminals in the <literal>bridging</literal>
sequence or <programlisting>member bridging/non-terminal [...] </programlisting>
to quantify over all non-terminals in the <literal>bridging</literal>
sequence.</para>
<para>Of course, we can use <literal>member</literal> selectors within
<literal>member</literal> selectors and (as we have seen) we can use
PML-reference relations within <literal>member</literal> selectors. But
since the objects represented by the <literal>member</literal> selector
are not regular nodes, one cannot use tree-structure selectors such
as <literal>child</literal> or <literal>parent</literal> within
<literal>member</literal> selectors.</para>
<para>A <literal>member</literal> selector can be used for a subquery,
i.e. prefixed by an occurrence clause or used in boolean expressions. In
particular, a negated or zero-times occurring <literal>member</literal>
selector with negated constraints can be used to formulate universal
quantification over values of a list, alt, or sequence.</para>
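<para>The difference between independent constraints on a list-valued
attribute and a <literal>member</literal> selector can be modelled outside
PML-TQ. The following Python sketch is only an illustration under stated
assumptions: the node is a plain dictionary, the referenced target node is
inlined instead of being a PML reference, the value
<literal>"PART"</literal> is made up, and all function names are
hypothetical.</para>
<programlisting language="python"># Hypothetical data: each bridging member is a dict; the target node is
# inlined as a nested dict instead of being a real PML reference.
node = {
    "bridging": [
        {"informal-type": "PART", "target": {"functor": "ACT"}},
        {"informal-type": "CONTRAST", "target": {"functor": "PAT"}},
    ]
}

def independent_constraints(n):
    # Each attribute path is quantified on its own, so the two
    # conditions may be satisfied by different list members.
    members = n["bridging"]
    return (any(m["informal-type"] == "CONTRAST" for m in members)
            and any(m["target"]["functor"] == "ACT" for m in members))

def member_selector(n):
    # Both conditions must hold for one and the same list member.
    return any(m["informal-type"] == "CONTRAST"
               and m["target"]["functor"] == "ACT"
               for m in n["bridging"])

def all_members_contrast(n):
    # Universal quantification: no member violates the condition.
    return not any(m["informal-type"] != "CONTRAST" for m in n["bridging"])

print(independent_constraints(node))  # True
print(member_selector(node))          # False
print(all_members_contrast(node))     # False</programlisting>
<para>The sample node is accepted by the independent constraints but
rejected by the <literal>member</literal>-style test. The last function
shows how a negated existential test with a negated condition expresses
universal quantification.</para>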
</section>
<section id="grouping_explained">
<title>Group functions explained</title>
<para>PML-TQ supports group functions in the output filters. These
functions compute their values based on a group of input rows, not just the
current row. For example, one can use them to compute a sum or maximum
over a group of input rows, or a rank of the current input row in some
particular ordering of the group of input rows it belongs to.</para>
<para>A filter can specify grouping of input rows in the following ways:
<itemizedlist>
<listitem>
<para>Considering all input rows as one group.</para>
<para>In this case the filter produces exactly one output row
whose columns are either constant or computed by applying a group
function.</para>
<para>Examples of such filters are <programlisting> &gt;&gt; count() </programlisting>
computing just the count of all input rows, or <programlisting> &gt;&gt; concat($1), sum($2)/count(), max($2)-min($2)</programlisting>
computing a row consisting of three columns: the first one
concatenating all values in column 1 of the input from the
preceding filter, the second computing the average value of column
2 of the input (which we could also write as
<literal>avg($2)</literal>), and the third the difference between
the maximum and minimum values in column 2 of the input.</para>
</listitem>
<listitem>
<para>Partitioning input rows into groups and evaluating group
functions over each group.</para>
<para>This method allows one to define a partition on the input
rows into groups that share some key value and then operate on
each of these groups individually, producing one output row per
partition group.</para>
<para>In the first step, the filter computes a vector of values
from each input row, called a <firstterm>key</firstterm>, and puts
all rows that produce the same key to one group. The filter then
produces for each group one output row, computed based on the
columns of the group key and group functions ranging over the
group.</para>
<para>For example, consider the following query:</para>
<programlisting>t-root $r:= [
descendant t-node $n []
]
&gt;&gt; $r.id, depth($n)
&gt;&gt; for $1 give max($2)
&gt;&gt; for $1 give $1, count() sort by $2 desc</programlisting>
<para>The main part of the query returns all pairs of nodes
<literal>($r,$n)</literal> where <literal>$r</literal> is a root node and
<literal>$n</literal> is any
of its descendants. For all such occurrences, the first filter
produces a row with two columns: the (unique) ID of
<literal>$r</literal> and the depth of the node
<literal>$n</literal> in the tree.</para>
<para>The second filter takes the first column of its input row
(i.e. the <literal>ID</literal>) as the key and groups the input
rows by this key; this results in as many groups as there were
unique root nodes in the matching set, and each group consists of
the input rows that belong to one root. This is done by the
<literal>for $1</literal> clause. The <literal>give
max($2)</literal> clause says that for each group, we want to
produce an output row with a single column whose value is
calculated as the maximum of the values in column 2 of the
input, i.e. the maximum of the depths of the nodes <literal>$n</literal>
for the given root <literal>$r</literal>. This is basically the
depth of the tree rooted in <literal>$r</literal>.</para>
<para>Finally, the third filter computes a distribution of tree
depths. Its input consists of the depths of individual trees. It groups the
input rows according to their first and only column (the depth).
Since there is just one input column, the key vector here is the
whole input row, so all rows within each group are identical. For
each group, the filter counts the number of rows in it and outputs a
row whose first column is the first column of the group key (in
our case the tree depth) and the second is the count (the number of
trees of the particular depth). Finally, the <literal>sort by $2
desc</literal> clause orders the output table by the second column in
descending order.</para>
<para>Note that there are two types of column references in the
<literal>give...</literal> clause. Those that occur within a
group function (such as <literal>$2</literal> in
<literal>max($2)</literal>) refer to columns of the rows in the
current group. Those that occur outside a group function (such as
<literal>$1</literal> in <literal>$1,count()</literal>) refer to
columns of the key vector of the current group (so it is
guaranteed that this value is constant over all rows in each
group).</para>
</listitem>
<listitem>
<para>On-line grouping within group functions.</para>
<para>PML-TQ also allows group functions to be computed over some
group of input rows without actually partitioning the input, i.e.
without reducing the number of output rows to one per group. Using
this method, each group function can specify its own way of
grouping and ordering of the input rows. Consider this example on
PDT 2.0:</para>
<programlisting>t-node $a := [
t_lemma = '#Forn',
t-node $b := [
functor = "FPHR"
]
];
&gt;&gt; $a.id, distinct concat($b.t_lemma, ' ' over $a sort by $b.deepord )
</programlisting>
<para>The selective part of the query finds a t-node
<literal>$a</literal> with <literal>t_lemma='#Forn'</literal>,
which is a technical node governing a flat list of words in a
foreign language, and its child node <literal>$b</literal> with
functor <literal>FPHR</literal> (foreign phrase), which matches
any of the words.</para>
<para>To reconstruct the foreign phrase, we need for each
<literal>$a</literal> to concatenate the
<literal>t_lemma</literal> of all corresponding
<literal>$b</literal>'s in the ordering of the tree (given by the
attribute <literal>deepord</literal>).</para>
<para>The input to our filter are the pairs of nodes
<literal>$a</literal> and <literal>$b</literal>. The concatenation
is performed by the group function <literal>concat</literal>. The
first argument says which value we want to extract from each
input, the second argument <literal>' '</literal> is optional and
specifies a separator of the concatenated values. The
<literal>over ...</literal> clause defines partitioning into
groups just as the <literal>for ...</literal> clause; the
difference is that this partitioning is particular to this group
function, not to the whole filter. In fact, given an input row,
the expression <literal>concat($b.t_lemma, ' ' over $a
...</literal> says that we want to select those input rows where
<literal>$a</literal> is the same as in the current row, and for
this selection concatenate the values
<literal>$b.t_lemma</literal>. The <literal>sort by...</literal>
clause says that the selected rows should be ordered by
<literal>$b.deepord</literal>.</para>
<para>Unlike partitioning using the <literal>for...</literal>
clause, this type of partitioning is done within the group
function, so the filter produces one output row for each input
row, not just one per group. Thus, we would get the same output
for each input pair <literal>($a,$b)</literal> where
<literal>$a</literal> is the same; we therefore use the keyword
<literal>distinct</literal> to filter out the duplicates from the
output.</para>
<para>Thus, on the output, we get the ID of <literal>$a</literal>
and the foreign phrase its subtree represents.</para>
<para>This type of grouping can also be combined with the
<literal>for...</literal> clause to define a partitioning on the
set of groups produced by the <literal>for...</literal> clause. In
that case, the column references within an
<literal>over...</literal> clause refer to the columns of the
group key of the <literal>for...</literal> clause.</para>
<para>The clause <literal>over all</literal> can be used as a
special variant of the <literal>over...</literal> clause creating
only one group which spans over all input rows. The same effect
can also be achieved using a constant key in the
<literal>over...</literal> clause, e.g. <literal>over
0</literal>.</para>
</listitem>
</itemizedlist></para>
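<para>The tree-depth example with two <literal>for ... give ...</literal>
filters can be traced by hand on a small input. The following Python
sketch models only the two grouping filters. The input rows and the
variable names are made up for illustration and the code is not part of
any PML-TQ implementation.</para>
<programlisting language="python">from collections import Counter, defaultdict

# Hypothetical output of the first filter: (root id, depth of $n).
rows = [("r1", 1), ("r1", 2), ("r2", 1), ("r3", 1), ("r3", 2), ("r3", 2)]

# Second filter, "for $1 give max($2)": one output row per distinct key.
groups = defaultdict(list)
for root_id, depth in rows:
    groups[root_id].append(depth)
tree_depths = [(max(depths),) for depths in groups.values()]

# Third filter, "for $1 give $1, count() sort by $2 desc".
counts = Counter(row[0] for row in tree_depths)
distribution = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(distribution)  # [(2, 2), (1, 1)]</programlisting>
<para>With this input, two trees have depth 2 and one tree has depth 1, so
the row for depth 2 comes first in the descending order.</para>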
<para>Let us consider an example that combines grouping based on
'<literal>for</literal>' and '<literal>over</literal>'.</para>
<para>Let's say that for each functor F we want to know the probability
that a child of a node with functor F has functor A, B, C, D, etc. In
other words, we want to compute the conditional distribution of the
functor of a child node given the functor of its parent. This is done by
the following query:</para>
<programlisting>t-node $p := [ t-node $c := [ ] ];
&gt;&gt; for $p.functor,$c.functor
give $1,$2,count() div sum(count() over $1)
sort by $1,$3 desc,$2
&gt;&gt; $1,$2,percnt($3,2) &amp; '%'</programlisting>
<para>The selective part selects pairs of t-nodes
<literal>($p,$c)</literal> where <literal>$p</literal> governs
<literal>$c</literal>.</para>
<para>We first group together pairs with the same pair of functors using
<literal>for $p.functor,$c.functor</literal>. This partitions the original
input rows into groups
<literal>g<subscript>1</subscript>,...,g<subscript>n</subscript></literal>,
each determined by a unique value of the grouping key
<literal>($p.functor,$c.functor)</literal>. For each group we pass to
the next filter this pair of functors and the probability of the second
functor given the first one. The filter thus produces one output row for
each group. Let us see how the output row is produced for a group
<literal>g<subscript>i</subscript></literal> whose key is, for example,
<literal>('PRED','ACT')</literal>.</para>
<para>The first two columns are just taken from the key vector, using
<literal>$1,$2</literal> and producing 'PRED','ACT'. The probability
that 'ACT' is governed by 'PRED' is computed using the following
expression:</para>
<programlisting>count()<co id="ex.joint.func.count.1"
linkends="ex.joint.func.count.1.desc" /> div sum(count() over $1)<co
id="ex.joint.func.sum" linkends="ex.joint.func.sum.desc" /></programlisting>
<calloutlist>
<callout arearefs="ex.joint.func.count.1"
id="ex.joint.func.count.1.desc">
<para>The first <literal>count()</literal> just computes the number
of rows in the current group
<literal>g<subscript>i</subscript></literal>, i.e. the number of
occurrences of 'PRED','ACT' among the selected pairs of
nodes.</para>
</callout>
<callout arearefs="ex.joint.func.sum" id="ex.joint.func.sum.desc">
<para>The <literal>over $1</literal> clause in the expression
<literal>sum(... over $1)</literal> partitions the set
<literal>g<subscript>1</subscript>,...,g<subscript>n</subscript></literal>
into groups of a higher level, putting together those groups that
share the first column of the group key.</para>
<para>So, our current group
<literal>g<subscript>i</subscript></literal>, representing ('PRED','ACT'),
will end up in one sack with all groups whose key
starts with 'PRED', e.g. ('PRED','PAT'), ('PRED','EFF'), etc. For
our group <literal>g<subscript>i</subscript></literal>, the sum will
compute its value over this sack by summing the number of elements
in each group that ended up in this sack. Thus the result is in fact
the number of all occurrences of 'PRED' dominating any
functor.</para>
</callout>
</calloutlist>
<para>The filter then orders the results primarily by the functor of the
parent node, secondarily by the probability in decreasing order, and
thirdly by the functor of the child. The second filter merely applies
the expression <literal>percnt($3,2) &amp; '%'</literal> to the third
column, multiplying it by 100, rounding the result to two decimal places,
and appending a percent sign.</para>
<para>The output looks like this (without the ellipses):</para>
<programlisting id="joint_functor_results">
ACMP RSTR 54.37%
ACMP PAT 10.77%
ACMP ACT 10.34%
ACMP APP 4.99%
...
ACT RSTR 49.56%
ACT PAT 9.75%
ACT APP 8.68%
...
ADDR RSTR 64.67%
ADDR APP 14.74%
ADDR PAT 4.24%
ADDR ID 3.17%
...
ADVS PRED 48.5%
ADVS ACT 13.18%
ADVS CM 9.09%
...</programlisting>
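<para>The arithmetic behind <literal>count() div sum(count() over
$1)</literal> can be reproduced on a handful of rows. The following Python
sketch uses made-up functor pairs and hypothetical variable names. It
models the grouping, the higher-level partitioning by the first key column,
the ordering, and the final conversion to percent.</para>
<programlisting language="python">from collections import Counter

# Hypothetical (parent functor, child functor) pairs matched by the query.
pairs = [("PRED", "ACT"), ("PRED", "ACT"), ("PRED", "PAT"),
         ("ACT", "RSTR"), ("ACT", "RSTR"), ("ACT", "RSTR"), ("ACT", "APP")]

# for $p.functor,$c.functor: one group per distinct key; count() is its size.
group_sizes = Counter(pairs)

# sum(count() over $1): add up the sizes of all groups sharing column 1.
parent_totals = Counter()
for (parent, _child), size in group_sizes.items():
    parent_totals[parent] += size

rows = [(parent, child, size / parent_totals[parent])
        for (parent, child), size in group_sizes.items()]
# sort by $1, $3 desc, $2
rows.sort(key=lambda row: (row[0], -row[2], row[1]))

# percnt($3,2) followed by '%': scale to percent, round to two places.
for parent, child, probability in rows:
    print(parent, child, str(round(100 * probability, 2)) + "%")</programlisting>
<para>For the sample pairs, the group ('PRED','ACT') has size 2 and all
groups starting with 'PRED' have a total size of 3, so its ratio is 2/3,
which is reported as 66.67%.</para>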
<para>The query just explained can also be rewritten as follows:</para>
<programlisting>
t-node $p := [ t-node $c := [ ] ];
&gt;&gt; $p.functor,$c.functor
&gt;&gt; distinct $1,$2, count(over $1,$2) div count(over $1)
sort by $1,$3 desc, $2
&gt;&gt; $1, $2, percnt($3,2) &amp; '%'</programlisting>
<!--
<para>
It first finds all pairs of nodes (<literal>$p,$c</literal>) in the relation of effective dependence
and extracts their functors.
Next, we apply the filter:
<literal>distinct $1,$2, count(over $1,$2), count(over $1)</literal>
produces four columns:
the two functors, the number of occurrences of this particular
pair of functors
(since <literal>count(over $1,$2)</literal> partitions
the input rows by the pair of functors and counts the number
of elements in the group to which the current input row belongs),
and the number of occurrences of the first functor
(<literal>count(over $1)</literal>). The
<literal>distinct</literal> keyword removes
any duplicates from the output.
So, for each distinct pair of functors occurring
on the effective dependency edge
we get the total number of occurrences of this pair
and the number of co-occurrences of the second of the functors
with the first one.
The third filters divides these two values
and reorders the output using the first functor as
the primary key and the computed ratio in descending order as the secondary key.
Finally, in the last filter,
we convert the ratio to percents
and rounds it to two decimal points by applying
the function <literal>percnt($3,2)</literal>
and appending the percent sign.
Note that if we did this already in a the third filter, the
<literal>sort by</literal> clause would consider
values in the third column as strings,
which would result in alphabetical rather than numerical ordering,
putting e.g. both <literal>2%</literal> and <literal>20%</literal> before <literal>10%</literal>.
</para>-->
<para>Another reformulation of the query uses the group function
<literal>ratio()</literal>:</para>
<programlisting>
t-node $p := [ echild t-node $c := [ ] ];
&gt;&gt; for $p.functor,$c.functor
give $1,$2,ratio(count() over $1)
sort by $1,$3 desc
&gt;&gt; $1, $2, percnt($3,2) &amp; '%'</programlisting>
<para>Finally, we may insert filters selecting just a few (say two) best
ranking rows for each functor in the first column:</para>
<programlisting>
t-node $p := [ echild t-node $c := [ ] ];
&gt;&gt; for $p.functor,$c.functor
give $1,$2,ratio(count() over $1)
sort by $1,$3 desc
&gt;&gt; $1,$2,$3, row_number(over $1)
&gt;&gt; filter $4&lt;=2
&gt;&gt; $1, $2, percnt($3,2) &amp; '%'</programlisting>
<para>to produce the following output (continuing after the final
ellipsis):</para>
<programlisting>
ACMP RSTR 54.37%
ACMP PAT 10.77%
ACT RSTR 49.56%
ACT PAT 9.75%
ADDR RSTR 64.67%
ADDR APP 14.74%
ADVS PRED 48.5%
ADVS ACT 13.18%
...</programlisting>
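<para>The effect of numbering rows within each group and then filtering on
that number can be sketched as follows. This is a Python illustration with
made-up rows and hypothetical variable names. It assumes that the input is
already ordered by the first column and by descending third column.</para>
<programlisting language="python">from itertools import groupby

# Hypothetical rows, already ordered by column 1 and descending column 3.
rows = [("ACMP", "RSTR", 0.54), ("ACMP", "PAT", 0.11), ("ACMP", "ACT", 0.10),
        ("ACT", "RSTR", 0.50), ("ACT", "PAT", 0.10), ("ACT", "APP", 0.09)]

# row_number(over $1): position of each row within its group, from 1.
numbered = []
for _key, group in groupby(rows, key=lambda row: row[0]):
    for number, row in enumerate(group, start=1):
        numbered.append(row + (number,))

# filter: keep only rows whose fourth column is at most 2.
best = [row[:3] for row in numbered if 2 >= row[3]]
print(best)</programlisting>
<para>Only the two highest-ranking rows of each first-column value remain
in <literal>best</literal>.</para>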
</section>
<section id="functions">
<title>Functions</title>
<section>
<title>Functions related to the tree structure</title>
<variablelist>
<varlistentry>
<term>name($var?)</term>
<listitem>
<para>Returns the name of a node matched by the given selector. If
no argument is given, the current selector is assumed. This
function only makes sense if the node is an element of a PML
sequence (of trees or child nodes).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>depth($var?)</term>
<listitem>
<para>Returns the depth in the tree (counting from 0) of a node
matched by a given selector. If no argument is given, the
current selector is assumed.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>descendants($var?)</term>
<listitem>
<para>Returns the number of descendants of the node matched by a
given selector. If no argument is given, the current selector is
assumed.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>lbrothers($var?)</term>
<listitem>
<para>Returns the number of left siblings of the node matched by
a given selector. If no argument is given, the current selector
is assumed.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>rbrothers($var?)</term>
<listitem>
<para>Returns the number of right siblings of the node matched
by a given selector. If no argument is given, the current
selector is assumed.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>sons($var?)</term>
<listitem>
<para>Returns the number of child nodes of the node matched by a
given selector. If no argument is given, the current selector is
assumed.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>depth_first_order($var?)</term>
<listitem>
<para>Returns the depth-first order of the node matched by the
given selector (counting from 0). If no argument is given, the
current selector is assumed.</para>
</listitem>
</varlistentry>
</variablelist>
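<para>To make the definitions above concrete, the following Python sketch
computes the same quantities on a small hand-made tree. The tree, the
dictionary representation and the reading of depth-first order as
pre-order position are assumptions made for this illustration. They do not
describe how either query engine stores or traverses trees.</para>
<programlisting language="python"># Hypothetical tree given as parent -> ordered list of children.
children = {"a": ["b", "c", "f"], "b": [], "c": ["d", "e"],
            "d": [], "e": [], "f": []}
parent = {c: p for p, cs in children.items() for c in cs}

def depth(n):
    # The root has depth 0.
    return 0 if n not in parent else 1 + depth(parent[n])

def sons(n):
    return len(children[n])

def descendants(n):
    return sum(1 + descendants(c) for c in children[n])

def lbrothers(n):
    # Number of siblings preceding n in its parent's child list.
    return children[parent[n]].index(n) if n in parent else 0

def rbrothers(n):
    return len(children[parent[n]]) - 1 - lbrothers(n) if n in parent else 0

def preorder(n):
    yield n
    for c in children[n]:
        yield from preorder(c)

# Assumption: depth-first order means pre-order position, counted from 0.
depth_first_order = {n: i for i, n in enumerate(preorder("a"))}
print(depth("d"), sons("a"), descendants("a"), lbrothers("c"),
      rbrothers("c"), depth_first_order["f"])</programlisting>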
</section>
<section>
<title>Functions related to corpus</title>
<variablelist>
<varlistentry>
<term>file($var?)</term>
<listitem>
<para>Returns the file name of a document in which the node
matched by a given (or current) selector occurs.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>tree_no($var?)</term>
<listitem>
<para>Returns the index of the tree in which the node matched by a
given (or current) selector occurs, i.e. the position of the
tree in its document.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>address($var?)</term>
<listitem>
<para>Returns a URL uniquely determining the node matched by a
given (or current) selector. The same value can also be computed
using the following expression: <literal>file($var) &amp; '##'
&amp; tree_no($var) &amp; '.' &amp;
depth_first_order($var)</literal>.</para>
</listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>String functions</title>
<variablelist>
<varlistentry>
<term>length(string)</term>
<listitem>
<para>Returns the string length of a given expression.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>substr(string,offset,length?)</term>
<listitem>
<para>Returns a substring of a given string starting at a
given offset (the first character in the string has offset 0) and
spanning a given length, or extending to the end of the string if
length is omitted or if the original string has fewer than
offset+length characters.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>match(string,regexp,flags?)</term>
<listitem>
<para>Returns the first substring of a given string matching a
given regular expression. The optional third argument can be a
string of flags modifying the behavior of the regular expression
matching procedure. The following flags are supported:</para>
<variablelist>
<varlistentry>
<term>i</term>
<listitem>
<para>case insensitive match</para>
</listitem>
</varlistentry>
<varlistentry>
<term>c</term>
<listitem>
<para>case sensitive match (default)</para>
</listitem>
</varlistentry>
<varlistentry>
<term>n</term>
<listitem>
<para>allows the period (.) to match the newline
character</para>
</listitem>
</varlistentry>
<varlistentry>
<term>m</term>
<listitem>
<para>Treat string as multiple lines. That is, change "^"
and "$" from matching the start or end of the string to
matching the start or end of any line anywhere within the
string.</para>
</listitem>
</varlistentry>
</variablelist>
</listitem>
</varlistentry>
<varlistentry>
<term>replace(string,substr,replacement)</term>
<listitem>
<para>Substitutes a given replacement for all (non-overlapping)
occurrences of a given substring in a given string and returns
the result. For example, <literal>replace('banana
ananas','ana','ANA')</literal> returns <literal>bANAna
ANAnas</literal>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>substitute(string,regexp,replacement,flags?)</term>
<listitem>
<para>Substitutes a given replacement for the first or all
(non-overlapping) substrings matching a given regular expression
in a given string and returns the result.</para>
<para>The default behavior is to replace just the first matching
substring. To replace all occurrences, use the flag 'g'.</para>
<para>The replacement string may contain references to
subexpressions of the matching regular expression
(subexpressions are parts of the expression enclosed in
parentheses). The string \N, where N is a digit from 1 to 9, is a
reference to the N-th subexpression (counting opening parentheses
from the left), and is substituted in the result by the
substring matched by that subexpression. For example, if the
regular expression was <literal>a(b(c))(d)</literal>, then
<literal>\1</literal>, <literal>\2</literal>, and
<literal>\3</literal> refer to the subexpressions
<literal>(b(c))</literal>, <literal>(c)</literal>, and
<literal>(d)</literal>, respectively.</para>
<para>Note: in PML-TQ string literals, backslash
(<literal>\</literal>) is used as an escape character;
therefore, if the replacement string is written as a string
literal, one must use e.g. <literal>\\2</literal> instead of
<literal>\2</literal>. Moreover, because of this special use of
backslash in the replacement string, a literal backslash has to be
written as <literal>\\</literal>, which, in a PML-TQ string
literal, becomes <literal>\\\\</literal>.</para>
<para>The optional fourth argument is a string of flags modifying
the behavior of the regular expression matching procedure. Any
of the flags described above for the function
<literal>match()</literal> can be used here, and additionally
<literal>substitute()</literal> supports the following
flag:</para>
<variablelist>
<varlistentry>
<term>g</term>
<listitem>
<para>global replace: replace all non-overlapping matches
of a given regular expression</para>
</listitem>
</varlistentry>
</variablelist>
<para>For example, <literal>substitute('banana
ananas','([^n])a','\\1@','g')</literal> returns the string
<literal>b@nana @nanas</literal>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>lower(string)</term>
<listitem>
<para>Returns the lowercase version of a given expression.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>upper(string)</term>
<listitem>
<para>Returns the uppercase version of a given expression.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>tr(string,characters_to_replace,replacement)</term>
<listitem>
<para>Replaces all occurrences of given characters in a string
with the corresponding characters in the replacement set, that is,
replaces all occurrences of the Nth character from
<replaceable>characters_to_replace</replaceable> with the Nth
character in the <replaceable>replacement</replaceable>. For
example, <literal>tr('122-34','24','ab')</literal> returns the
string <literal>1aa-3b</literal>.</para>
</listitem>
</varlistentry>
</variablelist>
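<para>The behavior described for <literal>substr</literal>,
<literal>replace</literal> and <literal>substitute</literal> can be
imitated with Python's standard library. The sketch below is an analogy
only: it assumes that Python's <literal>re</literal> syntax is close enough
to the engine's regular expressions for these simple patterns, it models
only the flags <literal>g</literal> and <literal>i</literal>, and the
Python function definitions are hypothetical stand-ins for the PML-TQ
functions of the same names.</para>
<programlisting language="python">import re

def substr(string, offset, length=None):
    # Offsets count from 0; slicing past the end just stops at the end.
    end = None if length is None else offset + length
    return string[offset:end]

def replace(string, old, new):
    # Replaces all non-overlapping occurrences of a fixed substring.
    return string.replace(old, new)

def substitute(string, regexp, replacement, flags=""):
    # Only the 'g' and 'i' flags are modelled in this sketch.
    count = 0 if "g" in flags else 1
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.sub(regexp, replacement, string, count=count, flags=re_flags)

print(substr("banana", 2, 3))                                # nan
print(replace("banana ananas", "ana", "ANA"))                # bANAna ANAnas
print(substitute("banana ananas", r"([^n])a", r"\1@", "g"))  # b@nana @nanas
print(substitute("banana ananas", r"([^n])a", r"\1@"))       # b@nana ananas</programlisting>
<para>The Python raw strings avoid the doubled backslash that a PML-TQ
string literal would need for the same replacement.</para>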
</section>
<section>
<title>Numerical functions</title>
<variablelist>
<varlistentry>
<term>ceil(number)</term>
<listitem>
<para>Returns the smallest integer value greater than or equal to
a given numerical argument.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>floor(number)</term>
<listitem>
<para>Returns the largest integer value less than or equal to a given
numerical argument.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>round(number,places?)</term>
<listitem>
<para>Returns a given number rounded to a specified number of
decimal places (0 if the second argument is not
specified).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>trunc(number,places?)</term>
<listitem>
<para>Returns a number truncated to a certain number of decimal
places (0 if second argument is not specified).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>exp(number)</term>
<listitem>
<para>Returns <emphasis>e</emphasis> raised to the power of a given
number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>ln(number)</term>
<listitem>
<para>Returns the natural logarithm of a given number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>power(base?,number)</term>
<listitem>
<para>Returns base raised to the power of a given number. If base is
not specified, base 10 is used.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>log(base?,number)</term>
<listitem>
<para>Returns the logarithm of a given number using a given base (or
10 if base is not specified).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>sqrt(number)</term>
<listitem>
<para>Returns the square root of a given number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>abs(number)</term>
<listitem>
<para>Returns the absolute value of a given number.</para>
</listitem>
</varlistentry>
<!--
<varlistentry>
<term>percnt(number,places?)</term>
<listitem><para>This is a convenience function, which returns
the same value as <literal>round(100*number,places)</literal>.</para></listitem>
</varlistentry>
-->
</variablelist>
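<para>Because <literal>power</literal> and <literal>log</literal> take
their optional base as the <emphasis>first</emphasis> argument, the
argument order is easy to get wrong. The following Python sketch mirrors
the signatures listed above. The Python definitions are hypothetical
stand-ins, and truncation toward zero in <literal>trunc</literal> is an
assumption of this sketch rather than documented behavior.</para>
<programlisting language="python">import math

def power(*args):
    # power(base?, number): the base comes first and defaults to 10.
    base, number = args if len(args) == 2 else (10, args[0])
    return base ** number

def log(*args):
    # log(base?, number): logarithm to the given base, 10 by default.
    base, number = args if len(args) == 2 else (10, args[0])
    return math.log(number, base)

def trunc(number, places=0):
    # Assumption: digits beyond the requested places are cut off
    # toward zero, without rounding.
    factor = 10 ** places
    return math.trunc(number * factor) / factor

print(power(3), power(2, 10))   # 1000 1024
print(log(2, 8))                # close to 3, up to floating-point error
print(trunc(3.14159, 2))        # 3.14</programlisting>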
</section>
<section id="agg_functions">
<title>Group Functions</title>
<para>A group function is a function that computes its value based on
a group of input rows, not just the current row. Group functions can
only be used in the <literal>give...</literal> clause of an output
filter and cannot be used in the selective part of the query. There
are several ways of using these functions:</para>
<orderedlist>
<listitem>
<para><literal><replaceable>GROUPFUNCTION</replaceable>(<replaceable>arguments...</replaceable>)</literal>.
If this form is used in the <literal>give</literal> part of a
<literal>for ... give ...</literal> clause, the function computes
its value by considering only those input rows that belong to the
current group. If used outside a <literal>for ... give
...</literal> clause, the function computes its value by
considering all input rows (i.e. the rows returned by the
preceding filter or the selective part of the query).</para>
</listitem>
<listitem>
<para><literal><replaceable>GROUPFUNCTION</replaceable>(<replaceable>arguments...</replaceable>
over <replaceable>columnExp, ...</replaceable>)</literal>. This
form creates a temporary partitioning of the current input
according to given column specifications following the
<literal>over</literal> keyword and computes its value by
considering input rows of the group to which the current input row
belongs. Unlike with <literal>for ... give ...</literal>, this
partitioning is only used to compute the value of the group
function and does not propagate to the output, so the filter still
returns one output row for each input row, rather than a row for
each group.</para>
</listitem>
<listitem>
<para><literal><replaceable>GROUPFUNCTION</replaceable>(<replaceable>arguments...</replaceable>
over all)</literal> computes its value by considering all input
rows (if used in the <literal>give ...</literal> clause of a
<literal>for ... give ...</literal> filter, this means not just
rows in the current group).</para>
</listitem>
<listitem>
<para><literal><replaceable>GROUPFUNCTION</replaceable>(<replaceable>arguments...</replaceable>
over <replaceable>columnExp,...</replaceable> sort by
<replaceable>columnRef,...</replaceable> )</literal> is a variant
of the previous grouping that forces an ordering on the rows
within a group; this is useful for the group function
<literal>concat()</literal> and the ranking functions
<literal>row_number()</literal>, <literal>rank()</literal>, and
<literal>dense_rank()</literal>.</para>
</listitem>
</orderedlist>
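<para>The form with <literal>over ... sort by ...</literal> behaves like a
window computation: every input row yields one output row, and the group
function is evaluated over the rows that share the key with it. The
following Python sketch models this for <literal>concat</literal>. The
input rows, the column layout and the function name
<literal>concat_over</literal> are made up for this illustration.</para>
<programlisting language="python"># Hypothetical input rows: (id of $a, t_lemma of $b, deepord of $b).
rows = [("a1", "York", 2), ("a1", "New", 1), ("a2", "Times", 1)]

def concat_over(rows, sep=" "):
    # Models concat($2, sep over $1 sort by $3): one result per input
    # row, computed from all rows that share column 1 with that row.
    result = []
    for key, _lemma, _order in rows:
        group = sorted((r for r in rows if r[0] == key), key=lambda r: r[2])
        result.append((key, sep.join(r[1] for r in group)))
    return result

windowed = concat_over(rows)
print(windowed)               # three rows, one per input row
print(sorted(set(windowed)))  # removing duplicates leaves one row per key</programlisting>
<para>The number of output rows equals the number of input rows, and the
two rows for <literal>a1</literal> carry the same concatenated value. This
is why such filters are often combined with
<literal>distinct</literal>.</para>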
<variablelist>
<varlistentry>
<term>count([ over ... ])</term>
<listitem>
<para>Returns the number of input rows considered.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>sum(<replaceable>expression</replaceable> [ over ...
])</term>
<listitem>
<para>For each row considered, computes the expression and
returns a sum of the results. The expression must evaluate to a
number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>concat(<replaceable>expression</replaceable>,
[<replaceable>"separator"</replaceable>] [ over ... [sort by
...]])</term>
<listitem>
<para>For each row considered, computes the expression and
returns a character string obtained by concatenating the
results; the optional second argument may be a literal string to
be used as a separator of the concatenated strings. The values
are concatenated in no particular order unless the <literal>over
... sort by ...</literal> clause is used.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>min(<replaceable>expression</replaceable> [ over ...
])</term>
<listitem>
<para>For each row considered, computes the expression and
returns the minimum of the resulting values. The expression must
evaluate to a number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>max(<replaceable>expression</replaceable> [ over ...
])</term>
<listitem>
<para>For each row considered, computes the expression and
returns the maximum of the resulting values. The expression must
evaluate to a number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>avg(<replaceable>expression</replaceable> [ over ...
])</term>
<listitem>
<para>For each row considered, computes the expression and
returns the average of the resulting values. The expression must
evaluate to a number.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>ratio(<replaceable>expression</replaceable> [ over ...
])</term>
<listitem>
<para>Returns the ratio of the value computed using the
expression to the sum of all values computed using this
expression over all rows considered.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>row_number([over ...] [sort by ...])</term>
<listitem>
<para>For each row considered, returns its number within its
group counting from 1. The partitioning can be specified by the
<literal>over ...</literal> clause and the rows in each group
can be ordered by the <literal>sort by ...</literal>
clause.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>rank([over ...] [sort by ...])</term>
<listitem>
<para>For each row considered, returns its relative rank with
gaps, counting from 1, with respect to the other rows in the same
group; the rank is based on ordering the group according to the
expressions in the <literal>sort by ...</literal> clause. Ties,
i.e. rows with equal values of the expressions in the
<literal>sort by ...</literal> clause, receive the same rank, and
the rank numbers that follow a tie are skipped: if two rows have
rank 1, the next rank is 3 (there is no rank 2). Thus, the returned value
for a row is the same as <literal>row_number()</literal> of the
first row in the group with the same sort key. For example, the
values in an (already sorted) group
<literal>1,1,1,3,7,7,20</literal> have the following ranks:
<literal>1,1,1,4,5,5,7</literal>. See also
<literal>dense_rank()</literal>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>dense_rank([over ...] [sort by ...])</term>
<listitem>
<para>For each row considered, returns its relative rank without
gaps, counting from 1, with respect to the other rows in the same
group; the rank is based on ordering the group according to the
expressions in the <literal>sort by ...</literal> clause. Ties,
i.e. rows with equal values of the expressions in the
<literal>sort by ...</literal> clause, receive the same rank, and
the next distinct value receives the next consecutive rank. For
example, the values in an (already sorted) group
<literal>1,1,1,3,7,7,20</literal> have the following ranks:
<literal>1,1,1,2,3,3,4</literal>. See also
<literal>rank()</literal>.</para>
</listitem>
</varlistentry>
</variablelist>
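<para>As a worked illustration of how the three ranking functions
differ, the following table lists the values each of them would return
for the sorted group used in the examples above. The sort key does not
determine the order of rows within a tie, so the
<literal>row_number()</literal> values inside a tie show only one
possible assignment.</para>
<programlisting>sort key      1  1  1  3  7  7  20
row_number()  1  2  3  4  5  6   7
rank()        1  1  1  4  5  5   7
dense_rank()  1  1  1  2  3  3   4</programlisting>
<para>Similarly, if the expression passed to
<literal>ratio()</literal> evaluates to 2, 3, and 5 on the three rows
considered, the function returns 0.2, 0.3, and 0.5 respectively, since
the sum of all values is 10.</para>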
</section>
</section>
<section>
<title>Examples</title>
<para></para>
</section>
</section>
<section>
<title>Installing PML-TQ server</title>
<para>To run PML-TQ servers, you first need to install an SQL
database server. Oracle
10g or 11g and <ulink url="http://www.postgresql.org/">PostgreSQL</ulink>
version 8.4.1 or later are fully supported.</para>
<section>
<title>Quick start</title>
<para>Installation in a few quick steps:</para>
<itemizedlist>
<listitem>
<para>Install PostgreSQL &gt;= 8.4 or Oracle &gt;= 10g database
(e.g. the free XE version)</para>
</listitem>
<listitem>
<para>Create a new directory for data conversion by copying the
contrib/prepare_data/sample/ directory</para>
<programlisting>mkdir my_treebank_pmltq
cd my_treebank_pmltq
cp -R contrib/prepare_data/sample/* .</programlisting>
</listitem>
<listitem>
<para>Convert your treebank into the PML format (depends on the
native format of your data)</para>
</listitem>
<listitem>
<para>Create a soft-link to your treebank data in my_treebank_pmltq
(just for convenience)</para>
<programlisting>ln -s /path/to/your/treebank/data/in/pml data</programlisting>
</listitem>
<listitem>
<para>Carefully edit the following configuration scripts in
the my_treebank_pmltq directory according to the comments provided
there:</para>
<programlisting>$EDITOR bin/config.sh
$EDITOR bin/convert_to_db.sh</programlisting>
</listitem>
<listitem>
<para>Run the following script on any machine:</para>
<programlisting>bin/convert_to_db.sh</programlisting>
</listitem>
<listitem>
<para>Run the following scripts on the machine running the SQL
server or a machine able to access the SQL server via the database
command-line client (sqlplus or psql)</para>
<programlisting>bin/create_db.sh
bin/load_to_db.sh</programlisting>
</listitem>
<listitem>
<para>Change to the pmltq installation directory:</para>
<programlisting>cd /path/to/pmltq</programlisting>
</listitem>
<listitem>
<para>Create or edit the configuration file
config/pmltq_cgi.conf:</para>
<programlisting>[ -f config/pmltq_cgi.conf ] || mv config/pmltq_cgi.conf.sample config/pmltq_cgi.conf
$EDITOR config/pmltq_cgi.conf</programlisting>
</listitem>
<listitem>
<para>Create or edit the configuration file run/run.conf:</para>
<programlisting>[ -f run/run.conf ] || mv run/run.conf.sample run/run.conf
$EDITOR run/run.conf</programlisting>
</listitem>
<listitem>
<para>Start the PML-TQ servers:</para>
<programlisting>run/run.sh --start</programlisting>
</listitem>
<listitem>
<para>Use them!</para>
<programlisting>./pmltq</programlisting>
<para>or</para>
<itemizedlist>
<listitem>
<para>open the web interface of your PML-TQ server
in your browser.</para>
</listitem>
<listitem>
<para>start TrEd, press <keycombo action="simul"><keysym>Shift</keysym><keysym>F3</keysym>
</keycombo>, configure the connection, create a
query, and press <keysym>Space</keysym>.</para>
</listitem>
</itemizedlist>
</listitem>
</itemizedlist>
</section>
<section id="prepare_data">
<title id="prepare_data.title">Preparing data</title>
<para>First, make sure your treebanks are in the <ulink
url="http://ufal.mff.cuni.cz/jazz/pml/">PML</ulink> format. If they are
not, you need to convert your data to PML first. (TODO: list
a few converters)</para>
<section>
<title>Converting PML data into SQL</title>
<para>Start with the following steps:</para>
<itemizedlist>
<listitem>
<para>Create a new directory for the data conversion by copying
the <filename>contrib/prepare_data/sample/</filename>
directory.<programlisting>mkdir my_treebank_pmltq
cd my_treebank_pmltq
cp -R contrib/prepare_data/sample/* .</programlisting></para>
</listitem>
<listitem>
<para>Create a soft-link to your treebank data in
my_treebank_pmltq (just for convenience and easy
access)<programlisting>ln -s /path/to/your/treebank/data/in/pml data</programlisting></para>
<para>Your directory now contains a <filename>bin</filename>
directory with a set of essential scripts.</para>
<variablelist>
<varlistentry>
<term>config.sh</term>
<listitem>
<para>This configuration file controls the data conversion
from PML to SQL/dumps and loading of the data to the SQL
database.</para>
<para>You will need to configure at least the basic treebank
information and the database access. See the comments inside the
file.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>convert_to_db.sh</term>
<listitem>
<para>Converts PML data to SQL dumps. If you need to
parallelize the conversion jobs, this is the place to do it.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>create_db.sh</term>
<listitem>
<para>Creates databases according to your
configuration.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>load_to_db.sh</term>
<listitem>
<para>Loads the SQL dumps into the database. You need to convert
the data and set up your configuration before running this
script.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>sql_shell.sh</term>
<listitem>
<para>Starts the SQL command-line client (sqlplus or psql)
according to your database configuration.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>generate_pmltq_cgi_conf.sh</term>
<listitem>
<para>Generates a fragment of
<filename>pmltq_cgi.conf</filename> for the current
treebank.</para>
</listitem>
</varlistentry>
</variablelist>
</listitem>
<listitem>
<para>Carefully edit the following configuration scripts in
the my_treebank_pmltq directory according to the comments provided
there:<programlisting>$EDITOR bin/config.sh
$EDITOR bin/convert_to_db.sh</programlisting></para>
</listitem>
<listitem>
<para>Run the following script on any machine:<programlisting>bin/convert_to_db.sh</programlisting></para>
</listitem>
</itemizedlist>
</section>
<section>
<title>Importing SQL data</title>
<para>The following scripts are supposed to be run on the machine
running the SQL server or a machine able to access the SQL server via
the database command-line client (sqlplus or psql).</para>
<itemizedlist>
<listitem>
<para>Run <filename>create_db.sh</filename> in case you need to
create a fresh new database. Be careful, as this script also
deletes any existing data in the database.<programlisting>bin/create_db.sh</programlisting></para>
</listitem>
<listitem>
<para>Run the following script to load the SQL dumps into the database.
Any existing data in the database will be deleted.<programlisting>bin/load_to_db.sh</programlisting></para>
</listitem>
</itemizedlist>
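<para>Because both scripts delete existing data, it may be worth
guarding them with a confirmation prompt. The following is only a
sketch: the function <literal>recreate_and_load</literal> is a
hypothetical helper, not part of PML-TQ, and it assumes that it is
called from the treebank directory containing
<filename>bin</filename>.</para>
<programlisting># Hypothetical wrapper (not shipped with PML-TQ): ask for confirmation
# before running the two destructive scripts described above.
recreate_and_load() {
    printf 'This deletes existing data in the database. Continue? [y/N] '
    read -r answer
    case "$answer" in
        y|Y)
            # Stop if the database could not be created.
            bin/create_db.sh || return 1
            bin/load_to_db.sh
            ;;
        *)
            echo "Aborted."
            return 1
            ;;
    esac
}</programlisting>
<para>Calling <literal>recreate_and_load</literal> and answering
anything other than <literal>y</literal> leaves the database
untouched.</para>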
</section>
</section>
<section>
<title>Setting up configuration files</title>
<para>There are two main configuration files you will need to set up.
Both of them are in the PML-TQ installation directory.</para>
<section id="pmltq_cgi">
<title id="pmltq_cgi.title">pmltq_cgi.conf</title>
<para>This XML file holds the list of treebanks, including their database
connection settings. It is also the place where you set the
limits for your PML-TQ server and, importantly, the URL of the print service.
For additional information, consult the XML schema in
<filename>resources/pmltq_cgi_conf_schema.xml</filename>.</para>
<para>Example config:</para>
<programlisting>&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;pmltq_cgi_config xmlns="http://ufal.mff.cuni.cz/pdt/pml/"&gt;
&lt;head&gt;
&lt;schema href="pmltq_cgi_conf_schema.xml" /&gt;
&lt;/head&gt;
&lt;limit&gt;500&lt;/limit&gt;
&lt;row_limit&gt;10000&lt;/row_limit&gt;
&lt;timeout&gt;30&lt;/timeout&gt;
&lt;configurations&gt;
&lt;dbi id="pdt20_mwe" public="0" featured="99"&gt;
&lt;description&gt;
&lt;title&gt;Prague Dependency Treebank 2.0 with Multi-Word Expressions&lt;/title&gt;
&lt;abstract&gt;This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data.&lt;/abstract&gt;
&lt;/description&gt;
&lt;driver&gt;Pg&lt;/driver&gt;
&lt;host&gt;localhost&lt;/host&gt;
&lt;port&gt;5432&lt;/port&gt;
&lt;database&gt;pdt20_mwe&lt;/database&gt;
&lt;username&gt;pmltq&lt;/username&gt;
&lt;password&gt;your_password&lt;/password&gt;
&lt;sources&gt;
&lt;AM schema="valency_lexicon"&gt;/work/vallex&lt;/AM&gt;
&lt;AM schema="adata"&gt;/work/data&lt;/AM&gt;
&lt;AM schema="tdata"&gt;/work/data&lt;/AM&gt;
&lt;/sources&gt;
&lt;/dbi&gt;
&lt;/configurations&gt;
&lt;/pmltq_cgi_config&gt;</programlisting>
<variablelist>
<varlistentry>
<term>&lt;limit&gt;</term>
<listitem>
<para>maximum number of results to return for node
queries</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;row_limit&gt;</term>
<listitem>
<para>maximum number of results for filter queries</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;timeout&gt;</term>
<listitem>
<para>timeout after which the search engine gives up (in
seconds)</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;tree_print_service&gt;</term>
<listitem>
<para>exact URL of the print service</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;configurations&gt;</term>
<listitem>
<para>list of treebanks</para>
<para>List items can be generated by running the
<filename>generate_pmltq_cgi_conf.sh</filename> script in the
treebank directory. See section <xref
endterm="prepare_data.title" linkend="prepare_data" />.</para>
<variablelist>
<varlistentry>
<term>&lt;dbi&gt;</term>
<listitem>
<para>list item holder</para>
<para>Attributes:</para>
<variablelist>
<varlistentry>
<term>id</term>
<listitem>
<para>treebank id; this is usually also a database
name</para>
</listitem>
</varlistentry>
<varlistentry>
<term>public</term>
<listitem>
<para>visible or hidden in public treebank
list</para>
</listitem>
</varlistentry>
<varlistentry>
<term>featured</term>
<listitem>
<para>popularity index</para>
</listitem>
</varlistentry>
</variablelist>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;description&gt;</term>
<listitem>
<para>detailed information about the treebank</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;driver&gt;</term>
<listitem>
<para><literal>Pg</literal> for Postgres,
<literal>Oracle</literal> for Oracle database
engine</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;host&gt;, &lt;port&gt;, &lt;username&gt;,
&lt;password&gt;</term>
<listitem>
<para>database connection information</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;database&gt;</term>
<listitem>
<para>database name</para>
</listitem>
</varlistentry>
<varlistentry>
<term>&lt;sources&gt;</term>
<listitem>
<para>data sources for each layer</para>
</listitem>
</varlistentry>
</variablelist>
</listitem>
</varlistentry>
</variablelist>
<important>
<para>We strongly recommend using
<filename>generate_pmltq_cgi_conf.sh</filename> to generate the treebank
configuration.</para>
</important>
</section>
<section>
<title id="run.conf">run.conf</title>
<para>Configuration file for <filename>run/run.sh</filename>, a script
that starts the PML-TQ HTTP servers. Follow the comments in the file.</para>
<para>Example config:</para>
<para><programlisting>#
# Configuration file for run.sh
#
auth_file="${pmltq_dir}/config/.authorization"
config_file="${pmltq_dir}/config/pmltq_cgi.conf"
btred_config_file="/path/to/btred.rc"
tred_dir=/path/to/tred
# list of print service extensions
extensions=pdt20,pdt_vallex,ptb,arabic_treebank,hydt
# if you want graphics in the WWW interface, setup to your tred installation here
# You'll also need Xvfb installed and the xvfb-run script
print_service="xvfb-run $tred_dir/bin/start_btred -c $btred_config_file -Z $tred_dir/resources -m $tred_dir/examples/print_srvr_simple.btred --enable-extensions $extensions"
on_print_service_stop="grep_kill '[X]vfb :99'"
# Treebank ports and number of http servers to fork.
# Format:
#
# CONFIG_ID=PORT,NUMBER_OF_INSTANCES
#
# e.g
# foo=1234,2
# bar=1235,7
#
# where CONFIG_ID refers to the id of &lt;dbi&gt; elements in $config_file
#
# List of CONFIG_ID's of services to start by default
pdt20_mwe=8082,1
all=(# &lt;-- names of treebanks to be run by default
pdt20_mwe
) </programlisting></para>
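<para>To serve more than one treebank, the format described in the
comments above suggests adding one port assignment per treebank and
listing each ID in the <literal>all</literal> array. The fragment below
is a sketch only: <literal>my_treebank</literal> is a hypothetical
<literal>&lt;dbi&gt;</literal> id, and the port 8083 and the instance
count 2 are arbitrary choices.</para>
<programlisting># CONFIG_ID=PORT,NUMBER_OF_INSTANCES
pdt20_mwe=8082,1
# hypothetical second treebank, served by two HTTP server instances
my_treebank=8083,2

all=(
pdt20_mwe
my_treebank
)</programlisting>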
</section>
</section>
<section>
<title>Setting up print service</title>
<para>If you want graphics in the WWW interface, you need to install
TrEd and Xvfb. To install TrEd, follow this <ulink
url="http://ufal.mff.cuni.cz/~pajas/tred/ar01s02.html#install-unix">link
to the TrEd manual</ulink>. Installation of Xvfb depends on your Linux
distribution; use your packaging system or see <ulink
url="http://www.x.org/wiki/">www.x.org</ulink>.<note>
<para>Some Xvfb installations do not include the
<filename>xvfb-run</filename> script. You can usually take the
<filename>xvfb-run</filename> script from another distribution, as
the scripts are usually the same.</para>
</note></para>
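<para>Before starting the print service, it may help to confirm that
both <filename>xvfb-run</filename> and the TrEd start script can be
found. The following is a minimal sketch: the function
<literal>check_print_service_deps</literal> is a hypothetical helper,
and its argument stands for the <literal>tred_dir</literal> value from
<filename>run/run.conf</filename>.</para>
<programlisting># Hypothetical helper (not shipped with PML-TQ): report missing
# prerequisites of the print service.
# $1 is the TrEd installation directory (tred_dir in run/run.conf).
check_print_service_deps() {
    tred_dir=$1
    missing=0
    # command -v prints the path of xvfb-run when it is found
    if ! command -v xvfb-run; then
        echo "xvfb-run not found in PATH"
        missing=1
    fi
    if [ ! -x "$tred_dir/bin/start_btred" ]; then
        echo "start_btred not found in $tred_dir/bin"
        missing=1
    fi
    return $missing
}</programlisting>
<para>For example, <literal>check_print_service_deps
/path/to/tred</literal> returns a non-zero status if either
prerequisite is missing.</para>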
</section>
</section>
<section>
<title>Tools</title>
<para></para>
</section>
</article>