Obase (work-in-progress)
Obase is an Ontology-based query rewrite tool.
What is a query?
a simple request for data from a database. There are many types of query languages and different data structures for each database.
- SQL (structured data)
- NoSQL (semi-structured data)
- Elasticsearch (unstructured text)
Current problem
imagine there are 3 databases one using postgres for profiles, mongodb for interactions and elasticsearch for news articles. Now if we want to search for all data relating to a person we would have to manually write 3 queries. The issue with this is, the person writing the query needs to know schemas of each database, adding unnecessary overhead.
Why does this matter?
- Data integration - this process requires copying data into a central warehouse which may lead to dupes, data lagging behind live data and if the source database changes the pipeline might break.
- Performance - querying multiple database in real-time can be slow and expensive
- Semantic - inconsistencies among databases because different naming conventions
Real-world impact
- Health care - doctors and researchers need cross-database queries for retrieval of patient records, medical history…etc.
- Business and Enterprise - large corporations store data across multiple systems.
- Academic research - researchers work with datasets across institutions
Solution
Query rewriting is a technique where a system transforms a user query into an equivalent but more optimal query based on certain rules, constraints, or mappings.
Case study
There are different types of query rewrites:
- Materialized view rewrite - focuses on the view, i.e. joining or aggregating data on tables
- Semantic rewriting - focuses on rewriting to match the intent of the user
- Optimization rewriting - focuses improving performance of the query execution
Oracle’s query rewrite
Oracle uses query rewrite to optimize their data warehouse (storing data various sources).
They use a form of materialized view rewrite, which automatically transform a user’s query to utilize the materialized view (pre-compute summary of data) instead, allowing it to reduce the need for processing large datasets directly.
In basic terms, it checks if there’s an existing table and rewrites the query to use that, instead of running through the entire database.
Example:
Original query:
SELECT region, SUM(revenue) FROM sales GROUP BY region;
Using Materialized view
/*create summary table*/ CREATE MATERIALIZED VIEW sales_summary AS SELECT region, SUM(revenue) FROM sales GROUP BY region; /*rewrite to use the summary table*/ SELECT * FROM sales_summary;
Doing this leads to significantly faster performance boost.
How they do this
- Automatic transformation - Oracle checks if the result can be retrieved from an existing materialized view instead of base tables.
- Transparency - Oracle does the modification instead of end-user
- Conditions for rewrite - Oracle sets various conditions, but essentially:
- if an materialized view exists
- either all or part of the result is obtainable from a materialized view
- many more
Azure query rewrite
Their version of query rewrite focuses on enhancing search accuracy by transforming user queries to more effective versions through correcting typos, expanding synonyms to refine search results.
How they do this
- User submits query - the query would include a “search” property that specifies a query rewrite to the system
- Generative Model enhances the query - the query is sent to a generative model that modifies it for synonyms, error, phrase formulation…etc.
- Retrieving the search result - both the original query and rewritten version are used to fetch search results.
An Ontology-based approach
First and foremost, what is an Ontology?
- It is a way of showing the properties of a subject area and how they are related by defining a set of concepts and categories that represent the subject.
Example: an ontology of a human, would include many different things like face, body, emotion…etc. Ontologies are what used to define these properties of the subject human.
Real-life example
Palantir is a famous company that heavily uses ontology as part of their technology, specifically using to build mappings and links between various data source to objects allowing them to extract incredible insights on certain topics that drive data-driven decision making.
How Ontologies can be used for query rewrite
I got a lot of my information from reading this paper, feel free to check it out.
The Approach
- We begin with something called Tuple-Generating Dependency (TGD) which you can think of it as an abstract translation of the semantic knowledge of an ontology that works with database query. This is because ontologies are typically designed more as higher level concepts and TGD bridges that gap to practical database operations.
- They define the semantic rules and relationships in your domain
- These are built prior to the querying process
Without getting overly complicated in the details, the quick and simple explanation of what TGD’s are like is:
φ(X,Y) → ∃Z ψ(X,Z)
Where,
- φ(X,Y) - represent the body (if-part)
- ψ(X,Z) - represent the head (then-part)
- X, Y, Z - sets of variables
Essentially, if the if-part exists, some parts of the then-part must also exist
Example:
student(X, Y) → enrolled(X, Z)
This TGD states: "If X is a student with name Y, then X must be enrolled in some program Z”
- Here, the actual rewriting occurs
Factorization - this process focuses on factorizing our TGD’s
Checks if the atoms (building block of a TGD) is:
Unifiable through substitution like matching constants or same predicate and compatible variables
Example:
q() ← takes(S, C), course(C, T, D), takes(S, C')can be factorized to:
q() ← takes(S, C), course(C, T, D)
We cannot factorize if there’s conflicting constants like: “Math” vs “Biology”
Example:
q() ← student(X, "CS"), takes(X, C), student(X, "Math")
Backward-chaining - this process focuses on substituting existing TGD with other TGD’s that imply the same thing
The process:
Start with the Goal Query: Begin with the query you want to answer against the ontology.
Work Backwards Using Rules: For each atom in your query, look for TGDs whose heads (right sides) match that atom.
Generate New Queries: Replace the matched atom with the body (left side) of the TGD, creating a new query that's one step "closer" to the database.
Repeat Until Database Level: Continue this process until you've generated all possible queries that could be evaluated directly against the database.
Example:
Original query:
q(X) ← csStudent(X)
Consider there’s existing TGD:
TGD1: takes(S, C) ∧ course(C, "Programming") → csStudent(S)TGD2: major(S, "CS") → csStudent(S)Backward Chaining Process:
Start with the original query:
q(X) ← csStudent(X)Apply TGD1 backwards:
csStudent(X)matchescsStudent(S)in TGD1's head- Replace
csStudent(X)with TGD1's body - New query:
q(X) ← takes(X, C) ∧ course(C, "Programming")
Apply TGD2 backwards:
csStudent(X)matchescsStudent(S)in TGD2's head- Replace
csStudent(X)with TGD2's body - New query:
q(X) ← major(X, "CS")
Result - The rewritten query is the union of all generated queries:
q(X) ← takes(X, C) ∧ course(C, "Programming") ∪ q(X) ← major(X, "CS")
This says: "A student is a CS student if they take a programming course OR if they have declared CS as their major."
Optimization - this process focuses on identifying and removing redundant atoms from queries.
General idea is:
- Identifies Redundant Atoms: Finds atoms in a query that are logically implied by other atoms in the same query, given the set of TGDs.
- Removes These Atoms: Eliminates the redundant atoms to create a simpler, equivalent query.
- Uses Dependency Graph: Leverages the pre-computed graph of how values can propagate through positions in the ontology.
How It Works:
- Atom Coverage: An atom A "covers" atom B if, whenever A is true, B must also be true according to the TGDs.
- Coverage Detection: The algorithm checks if there are paths in the dependency graph from positions in one atom to corresponding positions in another.
- Elimination Order: Processes atoms in reverse order, eliminating those that are covered by others.
Example
Given a query:
q() ← professor(P, D), faculty(P), teaches(P, C)
And a TGD:
professor(X, Y) → faculty(X)
The atom
faculty(P)is covered byprofessor(P, D)and can be eliminated:q() ← professor(P, D), teaches(P, C)
This optimization significantly reduces the size of the rewritten query without changing its meaning, making query execution much more efficient.
Architecture
Step 1: User Query Input
A user submits an SQL-like query that uses ontological terms rather than specific database tables. This is the entry point to the system.
For example, a user might query: SELECT * FROM CSStudent WHERE takenCourse = "Programming101"
Step 2: Rewrite Mechanism Processes Query
The Rewrite Mechanism receives the query and begins processing it. This is where most of the intelligence in the system resides.
Steps 2a & 2b: Ontology Consultation
- 2a: The Rewrite Mechanism sends a request to the Ontology component to understand what "CSStudent" means semantically.
- 2b: The Ontology responds with semantic information. For example, it might tell the system that a "CSStudent" is defined as either "a student with CS major" OR "a student who has taken programming courses."
Steps 3a & 3b: Schema Mapping Consultation
- 3a: The Rewrite Mechanism asks the Schema Mappings component how ontological concepts map to actual database structures.
- 3b: Schema Mappings responds with information about where to find the data. For example:
- "CSStudent" maps to
studentstable in PostgreSQL wheremajor='CS' - "takenCourse" maps to a relationship in MongoDB's
enrollmentscollection
- "CSStudent" maps to
Step 4: Canonical Query Generation
Using all this information, the Rewrite Mechanism generates a canonical (standardized) query representation that's database-agnostic but contains all the logical operations needed.
Step 5: Schema Support for Canonical Query
The Schema Mappings provide additional information to the Canonical Query about specific field mappings, data types, and transformation rules needed to properly query each database.
Step 6: Database-Specific Query Execution
The Canonical Query is translated into database-specific queries and sent to the appropriate databases. For our example:
- A SQL query to PostgreSQL:
SELECT * FROM students WHERE major='CS' AND id IN (...) - A MongoDB query to find relevant course enrolments
Step 7: Result Collection
The databases execute their respective queries and return results to the Results component, which combines and normalizes them.
Step 8: Results Returned to User
The combined results are sent back to the user in a unified format that maintains the ontological terminology they originally used.