vicheka oeun

Obase (work-in-progress)

Obase is an Ontology-based query rewrite tool.

What is a query?

a simple request for data from a database. There are many types of query languages and different data structures for each database.

Current problem

imagine there are 3 databases one using postgres for profiles, mongodb for interactions and elasticsearch for news articles. Now if we want to search for all data relating to a person we would have to manually write 3 queries. The issue with this is, the person writing the query needs to know schemas of each database, adding unnecessary overhead.

Why does this matter?

Real-world impact

Solution

Query rewriting is a technique where a system transforms a user query into an equivalent but more optimal query based on certain rules, constraints, or mappings.

Case study

There are different types of query rewrites:

  1. Materialized view rewrite - focuses on the view, i.e. joining or aggregating data on tables
  2. Semantic rewriting - focuses on rewriting to match the intent of the user
  3. Optimization rewriting - focuses improving performance of the query execution

Oracle’s query rewrite

Oracle uses query rewrite to optimize their data warehouse (storing data various sources).

They use a form of materialized view rewrite, which automatically transform a user’s query to utilize the materialized view (pre-compute summary of data) instead, allowing it to reduce the need for processing large datasets directly.

Doing this leads to significantly faster performance boost.

How they do this

  1. Automatic transformation - Oracle checks if the result can be retrieved from an existing materialized view instead of base tables.
  2. Transparency - Oracle does the modification instead of end-user
  3. Conditions for rewrite - Oracle sets various conditions, but essentially:
    1. if an materialized view exists
    2. either all or part of the result is obtainable from a materialized view
    3. many more

Azure query rewrite

Their version of query rewrite focuses on enhancing search accuracy by transforming user queries to more effective versions through correcting typos, expanding synonyms to refine search results.

How they do this

  1. User submits query - the query would include a “search” property that specifies a query rewrite to the system
  2. Generative Model enhances the query - the query is sent to a generative model that modifies it for synonyms, error, phrase formulation…etc.
  3. Retrieving the search result - both the original query and rewritten version are used to fetch search results.

An Ontology-based approach

First and foremost, what is an Ontology?

Example: an ontology of a human, would include many different things like face, body, emotion…etc. Ontologies are what used to define these properties of the subject human.

Real-life example

Palantir is a famous company that heavily uses ontology as part of their technology, specifically using to build mappings and links between various data source to objects allowing them to extract incredible insights on certain topics that drive data-driven decision making.

How Ontologies can be used for query rewrite

I got a lot of my information from reading this paper, feel free to check it out.

The Approach

  1. We begin with something called Tuple-Generating Dependency (TGD) which you can think of it as an abstract translation of the semantic knowledge of an ontology that works with database query. This is because ontologies are typically designed more as higher level concepts and TGD bridges that gap to practical database operations.
    • They define the semantic rules and relationships in your domain
    • These are built prior to the querying process

Without getting overly complicated in the details, the quick and simple explanation of what TGD’s are like is:

φ(X,Y) → ∃Z ψ(X,Z)

Where,

Essentially, if the if-part exists, some parts of the then-part must also exist

Example:

student(X, Y) → enrolled(X, Z)

This TGD states: "If X is a student with name Y, then X must be enrolled in some program Z”

  1. Here, the actual rewriting occurs
    1. Factorization - this process focuses on factorizing our TGD’s

      1. Checks if the atoms (building block of a TGD) is:

        1. Unifiable through substitution like matching constants or same predicate and compatible variables

          Example:

          q() ← takes(S, C), course(C, T, D), takes(S, C')

          can be factorized to:

          q() ← takes(S, C), course(C, T, D)

      2. We cannot factorize if there’s conflicting constants like: “Math” vs “Biology”

        Example: q() ← student(X, "CS"), takes(X, C), student(X, "Math")

    2. Backward-chaining - this process focuses on substituting existing TGD with other TGD’s that imply the same thing

      The process:

      • Start with the Goal Query: Begin with the query you want to answer against the ontology.

      • Work Backwards Using Rules: For each atom in your query, look for TGDs whose heads (right sides) match that atom.

      • Generate New Queries: Replace the matched atom with the body (left side) of the TGD, creating a new query that's one step "closer" to the database.

      • Repeat Until Database Level: Continue this process until you've generated all possible queries that could be evaluated directly against the database.

        Example:

        Original query:

        q(X) ← csStudent(X)
        

        Consider there’s existing TGD:

        TGD1: takes(S, C) ∧ course(C, "Programming") → csStudent(S)

        TGD2: major(S, "CS") → csStudent(S)

        Backward Chaining Process:

        1. Start with the original query: q(X) ← csStudent(X)

        2. Apply TGD1 backwards:

          • csStudent(X) matches csStudent(S) in TGD1's head
          • Replace csStudent(X) with TGD1's body
          • New query: q(X) ← takes(X, C) ∧ course(C, "Programming")
        3. Apply TGD2 backwards:

          • csStudent(X) matches csStudent(S) in TGD2's head
          • Replace csStudent(X) with TGD2's body
          • New query: q(X) ← major(X, "CS")
        4. Result - The rewritten query is the union of all generated queries:

          q(X) ← takes(X, C) ∧ course(C, "Programming")
          ∪
          q(X) ← major(X, "CS")
          

        This says: "A student is a CS student if they take a programming course OR if they have declared CS as their major."

    3. Optimization - this process focuses on identifying and removing redundant atoms from queries.

      General idea is:

      • Identifies Redundant Atoms: Finds atoms in a query that are logically implied by other atoms in the same query, given the set of TGDs.
      • Removes These Atoms: Eliminates the redundant atoms to create a simpler, equivalent query.
      • Uses Dependency Graph: Leverages the pre-computed graph of how values can propagate through positions in the ontology.

      How It Works:

      1. Atom Coverage: An atom A "covers" atom B if, whenever A is true, B must also be true according to the TGDs.
      2. Coverage Detection: The algorithm checks if there are paths in the dependency graph from positions in one atom to corresponding positions in another.
      3. Elimination Order: Processes atoms in reverse order, eliminating those that are covered by others.

      Example

      Given a query:

      q() ← professor(P, D), faculty(P), teaches(P, C)
      

      And a TGD:

      professor(X, Y) → faculty(X)
      

      The atom faculty(P) is covered by professor(P, D) and can be eliminated:

      q() ← professor(P, D), teaches(P, C)
      

      This optimization significantly reduces the size of the rewritten query without changing its meaning, making query execution much more efficient.

Architecture

Obase Architecture

Step 1: User Query Input

A user submits an SQL-like query that uses ontological terms rather than specific database tables. This is the entry point to the system.

For example, a user might query: SELECT * FROM CSStudent WHERE takenCourse = "Programming101"

Step 2: Rewrite Mechanism Processes Query

The Rewrite Mechanism receives the query and begins processing it. This is where most of the intelligence in the system resides.

Steps 2a & 2b: Ontology Consultation

Steps 3a & 3b: Schema Mapping Consultation

Step 4: Canonical Query Generation

Using all this information, the Rewrite Mechanism generates a canonical (standardized) query representation that's database-agnostic but contains all the logical operations needed.

Step 5: Schema Support for Canonical Query

The Schema Mappings provide additional information to the Canonical Query about specific field mappings, data types, and transformation rules needed to properly query each database.

Step 6: Database-Specific Query Execution

The Canonical Query is translated into database-specific queries and sent to the appropriate databases. For our example:

Step 7: Result Collection

The databases execute their respective queries and return results to the Results component, which combines and normalizes them.

Step 8: Results Returned to User

The combined results are sent back to the user in a unified format that maintains the ontological terminology they originally used.