Created by Anett Szabo, last modified by Malte Sussdorff 25 Jul 2007, at 05:57 PM
Data modeling is the hardest and most important activity in the RDBMS
world. If you get the data model wrong, your application might not do
what users need, it might be unreliable, it might fill up the database
with garbage. Why then do we start a SQL tutorial with the most
challenging part of the job? Because you can't do queries, inserts, and
updates until you've defined some tables. And defining tables is
data
modeling.
When data modeling, you are telling the RDBMS the following:
- what elements of the data you will store
- how large each element can be
- what kind of information each element can contain
- what elements may be left blank
- which elements are constrained to a fixed range
- whether and how various tables are to be linked
Three-Valued Logic
Programmers in most computer languages are familiar with Boolean logic.
A variable may be either true or false. Pervading SQL, however, is
the alien idea of
three-valued logic. A column can be true,
false, or NULL. When building the data model you must affirmatively
decide whether a NULL value will be permitted for a column and, if so,
what it means.
For example, consider a table for recording user-submitted comments to a
Web site. The publisher has made the following stipulations:
- comments won't go live until approved by an editor
- the admin pages will present editors with all comments that are
pending approval, i.e., have been submitted but neither approved nor
disapproved by an editor already
Here's the data model:
create table user_submitted_comments (
comment_id integer primary key,
user_id not null references users,
submission_time date default sysdate not null,
ip_address varchar(50) not null,
content clob,
approved_p char(1) check(approved_p in ('t','f'))
);
Implicit in this model is the assumption that
approved_p
can be NULL and that, if not explicitly set during the INSERT, that is
what it will default to. What about the check constraint? It would
seem to restrict
approved_p to values of "t" or "f". NULL,
however, is a special value and if we wanted to prevent
approved_p from taking on NULL we'd have to add an explicit
not null constraint.
How do NULLs work with queries? Let's fill
user_submitted_comments with some sample data and see:
insert into user_submitted_comments
(comment_id, user_id, ip_address, content)
values
(1, 23069, '18.30.2.68', 'This article reminds me of Hemingway');
1 row created.
SQL> select first_names, last_name, content, user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id;
FIRST_NAMES LAST_NAME CONTENT APPROVED_P
------------ --------------- ------------------------------------ ------------
Philip Greenspun This article reminds me of Hemingway
We've successfully JOINed the
user_submitted_comments and
users table to get both the comment content and the name of
the user who submitted it. Notice that in the select list we had to
explicitly request
user_submitted_comments.approved_p. This is because
the
users table also has an
approved_p
column.
When we inserted the comment row we did not specify a value for the
approved_p column. Thus we expect that the value would be
NULL and in fact that's what it seems to be. Oracle's SQL*Plus
application indicates a NULL value with white space.
For the administration page, we'll want to show only those
comments where the approved_p column is NULL:
SQL> select first_names, last_name, content, user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id
and user_submitted_comments.approved_p = NULL;
no rows selected
"No rows selected"? That's odd. We know for a fact that we have one
row in the comments table and that its approved_p column is
set to NULL. How to debug the query? The first thing to do is simplify
by removing the JOIN:
SQL> select * from user_submitted_comments where approved_p = NULL;
no rows selected
What is happening here is that any expression involving NULL evaluates
to NULL, including one that effectively looks like "NULL = NULL". The
WHERE clause is looking for expressions that evaluate to true. What you
need to use is the special test IS NULL:
SQL> select * from user_submitted_comments where approved_p is NULL;
COMMENT_ID USER_ID SUBMISSION_T IP_ADDRESS
---------- ---------- ------------ ----------
CONTENT APPROVED_P
------------------------------------ ------------
1 23069 2000-05-27 18.30.2.68
This article reminds me of Hemingway
An adage among SQL programmers is that the only time you can use
"= NULL" is in an UPDATE statement (to set a column's value to
NULL). It never makes sense to use "= NULL" in a WHERE clause.
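The adage is easy to verify outside Oracle too. Below is a small sketch using SQLite through Python's sqlite3 module (the DDL is trimmed and the syntax differs slightly from Oracle, but three-valued logic behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table user_submitted_comments (
        comment_id integer primary key,
        user_id    integer not null,
        ip_address varchar(50) not null,
        content    text,
        approved_p char(1) check (approved_p in ('t', 'f'))
    )""")
conn.execute("""
    insert into user_submitted_comments (comment_id, user_id, ip_address, content)
    values (1, 23069, '18.30.2.68', 'This article reminds me of Hemingway')""")

# "= NULL" never matches anything: the comparison evaluates to NULL, not true.
print(conn.execute(
    "select count(*) from user_submitted_comments where approved_p = NULL"
).fetchone()[0])   # 0

# "is NULL" is the special test that actually finds the pending comment.
print(conn.execute(
    "select count(*) from user_submitted_comments where approved_p is NULL"
).fetchone()[0])   # 1
```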
The bottom line is that as a data modeler you will have to decide which
columns can be NULL and what that value will mean.
Back to the Mailing List
Let's return to the mailing list data model from the introduction:
create table mailing_list (
email varchar(100) not null primary key,
name varchar(100)
);
create table phone_numbers (
email varchar(100) not null references mailing_list,
number_type varchar(15) check (number_type in ('work','home','cell','beeper')),
phone_number varchar(20) not null
);
This data model locks you into some realities:
- You will not be sending out any physical New Year's cards to folks
on your mailing list; you don't have any way to store their addresses.
- You will not be sending out any electronic mail to folks who work at
companies with elaborate Lotus Notes configurations; sometimes Lotus
Notes results in email addresses that are longer than 100 characters.
- You are running the risk of filling the database with garbage since
you have not constrained phone numbers in any way. American users could
add or delete digits by mistake. International users could mistype
country codes.
- You are running the risk of not being able to serve rich people
because the
number_type column may be too constrained.
Suppose William H. Gates the Third wishes to record some extra phone
numbers with types of "boat", "ranch", "island", and "private_jet". The
check (number_type in ('work','home','cell','beeper'))
constraint prevents Mr. Gates from doing this.
- You run the risk of having records in the database for people whose
name you don't know, since the
name column of
mailing_list is free to be NULL.
- Changing a user's email address won't be the simplest possible
operation. You're using
email as a key in two tables and
therefore will have to update both tables. The references
mailing_list keeps you from making the mistake of only updating
mailing_list and leaving orphaned rows in
phone_numbers. But if users changed their email addresses
frequently, you might not want to do things this way.
- Since you've no provision for storing a password or any other means
of authentication, if you allow users to update their information, you
run a minor risk of allowing a malicious change. (The risk isn't as
great as it seems because you probably won't be publishing the complete
mailing list; an attacker would have to guess the names of people on
your mailing list.)
These aren't necessarily bad realities in which to be locked. However,
a good data modeler recognizes that every line of code in the .sql file
has profound implications for the Web service.
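Some of the realities above can be softened with more SQL. For example, the two-table email-change problem shrinks if the foreign key cascades updates. Oracle's references clause has no on update cascade, but SQLite does; here is a sketch via Python's sqlite3 module, with hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")   # SQLite only enforces FKs when asked
conn.executescript("""
    create table mailing_list (
        email varchar(100) not null primary key,
        name  varchar(100)
    );
    create table phone_numbers (
        email varchar(100) not null
              references mailing_list on update cascade,
        number_type varchar(15) check (number_type in ('work','home','cell','beeper')),
        phone_number varchar(20) not null
    );
""")
conn.execute("insert into mailing_list values ('alice@example.com', 'Alice')")
conn.execute("insert into phone_numbers values ('alice@example.com', 'work', '617-555-1212')")

# One UPDATE on the parent table; the child row follows automatically.
conn.execute("update mailing_list set email = 'alice@new.example.com' "
             "where email = 'alice@example.com'")
print(conn.execute("select email from phone_numbers").fetchone()[0])
# alice@new.example.com
```

In Oracle you would instead update both tables inside one transaction, in child-then-parent or deferred-constraint order.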
For more information on how a simple data model for a discussion forum can evolve, read en:sql-wn-data_modeling-philip
Representing Web Site Core Content
Free-for-all Internet discussions can often be useful and occasionally
are compelling, but the anchor of a good Web site is usually a set of
carefully authored extended documents. Historically these have tended
to be stored in the Unix file system and they don't change too often.
Hence I refer to them as
static pages. Examples of static
pages on the photo.net server include this book chapter and the tutorial on
light for photographers at
http://www.photo.net/making-photographs/light.
We have some big goals to consider. We want the data in the database to
- help community experts figure out which articles need revision and
which new articles would be most valued by the community at large.
- help contributors work together on a draft article or a new version
of an old article.
- collect and organize reader comments and discussion, both for
presentation to other readers and to assist authors in keeping
content up-to-date.
- collect and organize reader-submitted suggestions of related content
out on the wider Internet (i.e., links).
- help point readers to new or new-to-them content that might interest
them, based on what they've read before or based on what kind of content
they've said is interesting.
The big goals lead to some more concrete objectives:
- We will need a table that holds the static pages themselves.
- Since there are potentially many comments per page, we need a
separate table to hold the user-submitted comments.
- Since there are potentially many related links per page, we need a
separate table to hold the user-submitted links.
- Since one page may have several authors and one author several
pages, we need a separate table to register the page-author
many-to-many relation.
- Considering the "help point readers to stuff that will interest
them" objective, it seems that we need to store the category or
categories under which a page falls. Since there are potentially many
categories for one page, we need a separate table to hold the mapping
between pages and categories.
create table static_pages (
page_id integer not null primary key,
url_stub varchar(400) not null unique,
original_author integer references users(user_id),
page_title varchar(4000),
page_body clob,
obsolete_p char(1) default 'f' check (obsolete_p in ('t','f')),
members_only_p char(1) default 'f' check (members_only_p in ('t','f')),
price number,
copyright_info varchar(4000),
accept_comments_p char(1) default 't' check (accept_comments_p in ('t','f')),
accept_links_p char(1) default 't' check (accept_links_p in ('t','f')),
last_updated date,
-- used to prevent minor changes from looking like new content
publish_date date
);
create table static_page_authors (
page_id integer not null references static_pages,
user_id integer not null references users,
notify_p char(1) default 't' check (notify_p in ('t','f')),
unique(page_id,user_id)
);
Note that we use a generated integer
page_id key for this
table. We could key the table by the
url_stub (filename),
but that would make it very difficult to reorganize files in the Unix
file system (something that should actually happen very seldom on a Web
server; it breaks links from foreign sites).
How to generate these unique integer keys when you have to insert a new
row into static_pages? You could
- lock the table
- find the maximum
page_id so far
- add one to create a new unique
page_id
- insert the row
- commit the transaction (releases the table lock)
Much better is to use Oracle's built-in sequence generation facility:
create sequence page_id_sequence start with 1;
Then we can get new page IDs by using
page_id_sequence.nextval in INSERT statements (see
the Transactions chapter for a fuller
discussion of sequences).
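SQLite has no sequences; its stand-in is that an integer primary key column is auto-assigned when you omit it from the INSERT, which serves the same purpose for experimentation. A trimmed sketch via Python's sqlite3 module (the url_stub values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table static_pages (
        page_id  integer primary key,   -- SQLite auto-assigns when omitted
        url_stub varchar(400) not null unique
    )""")
# Oracle would say page_id_sequence.nextval here; in SQLite we simply
# omit page_id and let the engine pick the next integer.
conn.execute("insert into static_pages (url_stub) values ('/about.html')")
conn.execute("insert into static_pages (url_stub) values ('/contact.html')")
print(conn.execute("select page_id, url_stub from static_pages order by page_id").fetchall())
# [(1, '/about.html'), (2, '/contact.html')]
```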
---
based on SQL for Web Nerds
A trigger is a fragment of code that you tell Oracle to run before or
after a table is modified. A trigger has the power to
- make sure that a column is filled in with default information
- make sure that an audit row is inserted into another table
- after finding that the new information is inconsistent with other
stuff in the database, raise an error that will cause the entire
transaction to be rolled back
Remember the mailing lists example from the beginning?
Suppose that you've been using the above data model to collect the names
of Web site readers who'd like to be alerted when you add new articles.
You haven't sent any notices for two months. You want to send everyone
who signed up in the last two months a "Welcome to my Web service;
thanks for signing up; here's what's new" message. You want to send the
older subscribers a simple "here's what's new" message. But you can't
do this because you didn't store a registration date. It is easy enough
to fix the table:
alter table mailing_list add (registration_date date);
But what if you have 15 different Web scripts that use this table? The
ones that query it aren't a problem. If they don't ask for the new
column, they won't get it and won't realize that the table has been
changed (this is one of the big selling features of the RDBMS). But
the scripts that update the table will all need to be changed. If you
miss a script, you're potentially stuck with a table where various
random rows are missing critical information.
Oracle has a solution to your problem: triggers. A trigger is a
way of telling Oracle "any time anyone touches this table, I want you to
execute the following little fragment of code". Here's how we define
the trigger mailing_list_registration_date:
create trigger mailing_list_registration_date
before insert on mailing_list
for each row
when (new.registration_date is null)
begin
:new.registration_date := sysdate;
end;
Note that the trigger only runs when someone is trying to insert a row
with a NULL registration date. If for some reason you need to copy over
records from another database and they have a registration date, you
don't want this trigger overwriting it with the date of the copy.
A second point to note about this trigger is that it runs for each
row. This is called a "row-level trigger", as opposed to a
"statement-level trigger", which runs once per statement and is usually
not what you want.
A third point is that we're using the magic Oracle procedure
sysdate, which will return the current time. The Oracle
date type is precise to the second even though the default
is to display only the day.
A fourth point is that, starting with Oracle 8, we could have done this
more cleanly by adding a default sysdate instruction to the
column's definition.
The final point worth noting is the :new. syntax. This
lets you refer to the new values being inserted. There is an analogous
:old. feature, which is useful for update triggers:
create or replace trigger mailing_list_update
before update on mailing_list
for each row
when (new.name <> old.name)
begin
-- user is changing his or her name
-- record the fact in an audit table
insert into mailing_list_name_changes
(old_name, new_name)
values
(:old.name, :new.name);
end;
/
show errors
This time we used the
create or replace syntax. This keeps
us from having to
drop trigger mailing_list_update if we
want to change the trigger definition. We added a comment using the SQL
comment shortcut "--". Notice that the when clause, which limits the
conditions under which the trigger fires, writes
new.name and old.name without colons.
Between the begin and the end, however, we're in
a PL/SQL block. This is Oracle's procedural language, described later,
in which a bare new.name would mean "the
name element from a record called new". So inside
the block you have to write :new.name instead.
It is obscurities like this that lead to competent Oracle
consultants being paid $200+ per hour.
The "/" and show errors at the end are instructions to
Oracle's SQL*Plus program. The slash says "I'm done typing this piece
of PL/SQL, please evaluate what I've typed." The "show errors" says "if
you found anything to object to in what I just typed, please tell me".
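The registration-date trigger can be imitated outside Oracle as well. A sketch in SQLite via Python's sqlite3 module: SQLite triggers cannot assign to :new the way PL/SQL does, so the usual workaround is an after insert trigger that patches the freshly inserted row; the firing condition is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table mailing_list (
        email varchar(100) not null primary key,
        name  varchar(100),
        registration_date date
    )""")
# AFTER INSERT + UPDATE is SQLite's substitute for Oracle's
# ":new.registration_date := sysdate" in a BEFORE INSERT trigger.
conn.execute("""
    create trigger mailing_list_registration_date
    after insert on mailing_list
    for each row
    when new.registration_date is null
    begin
        update mailing_list
           set registration_date = date('now')
         where email = new.email;
    end""")
conn.execute("insert into mailing_list (email, name) values ('alice@example.com', 'Alice')")
# An explicitly supplied date is left alone, just as in the Oracle version.
conn.execute("insert into mailing_list values ('bob@example.com', 'Bob', '1999-12-31')")
rows = conn.execute(
    "select email, registration_date from mailing_list order by email").fetchall()
print(rows[0][1] is not None)   # True -- the trigger filled it in
print(rows[1][1])               # 1999-12-31
```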
Also consider the general_comments table:
create table general_comments (
comment_id integer primary key,
on_what_id integer not null,
on_which_table varchar(50),
user_id not null references users,
comment_date date not null,
ip_address varchar(50) not null,
modified_date date not null,
content clob,
-- is the content in HTML or plain text (the default)
html_p char(1) default 'f' check(html_p in ('t','f')),
approved_p char(1) default 't' check(approved_p in ('t','f'))
);
Users and administrators are both able to edit comments. We want to
make sure that we know when a comment was last modified so that we can
offer the administrator a "recently modified comments page". Rather
than painstakingly go through all of our Web scripts that insert or
update comments, we can specify an invariant in Oracle that "after every
time someone touches the
general_comments table, make sure
that the
modified_date column is set equal to the current
date-time." Here's the trigger definition:
create trigger general_comments_modified
before insert or update on general_comments
for each row
begin
:new.modified_date := sysdate;
end;
/
show errors
We're using the PL/SQL programming language, discussed in
the procedural language chapter. In this
case, it is a simple
begin-end block that sets the
:new value of
modified_date to the result of
calling the
sysdate function.
When using SQL*Plus, you have to provide a / character to get the
program to evaluate a trigger or PL/SQL function definition. You then
have to say "show errors" if you want SQL*Plus to print out what went
wrong. Unless you expect to write perfect code all the time, it can be
convenient to leave these SQL*Plus incantations in your .sql files.
An Audit Table Example
The canonical trigger example is the stuffing of an audit table. For
example, in the data warehouse section of the ArsDigita Community
System, we keep a table of user queries. Normally the SQL code for a
query is kept in a
query_columns table. However, sometimes
a user might hand edit the generated SQL code, in which case we simply
store that in the query_sql column of the
queries table. The SQL code for a query might be very
important to a business and might have taken years to evolve. Even if
we have good RDBMS backups, we don't want it getting erased because of a
careless mouse click. So we add a
queries_audit table to
keep historical values of the
query_sql column:
create table queries (
query_id integer primary key,
query_name varchar(100) not null,
query_owner not null references users,
definition_time date not null,
-- if this is non-null, we just forget about all the query_columns
-- stuff; the user has hand edited the SQL
query_sql varchar(4000)
);
create table queries_audit (
query_id integer not null,
audit_time date not null,
query_sql varchar(4000)
);
Note first that
queries_audit has no primary key. If we
were to make
query_id the primary key, we'd only be able to
store one history item per query, which is not our intent.
How to keep this table filled? We could do it by making sure that every
Web script that might update the query_sql column inserts a
row in queries_audit when appropriate. But how to enforce
this after we've handed off our code to other programmers? Much better
to let the RDBMS enforce the auditing:
create or replace trigger queries_audit_sql
before update on queries
for each row
when (old.query_sql is not null and (new.query_sql is null or old.query_sql <> new.query_sql))
begin
insert into queries_audit (query_id, audit_time, query_sql)
values
(:old.query_id, sysdate, :old.query_sql);
end;
The structure of a row-level trigger is the following:
CREATE OR REPLACE TRIGGER ***trigger name***
***when*** ON ***which table***
FOR EACH ROW
***conditions for firing***
begin
***stuff to do***
end;
Let's go back and look at our trigger:
- It is named
queries_audit_sql; this is really of
no consequence so long as it doesn't conflict with the names of other
triggers.
- It will be run
before update, i.e., only when someone
is executing an SQL UPDATE statement.
- It will be run only when someone is updating the table
queries.
- It will be run only when the old value of
query_sql is
not null; we don't want to fill our audit table with NULLs.
- It will be run only when the new value of
query_sql is
different from the old value; we don't want to fill our audit table with
rows because someone happens to be updating another column in
queries. Note that SQL's three-valued logic forces us to
put in an extra test for new.query_sql is null because
old.query_sql <> new.query_sql will not evaluate to true
when new.query_sql is NULL (a user wiping out the custom
SQL altogether; a very important case to audit).
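For readers experimenting outside Oracle, the same audit trigger can be sketched in SQLite via Python's sqlite3 module, where :old and :new are written old and new throughout and sysdate becomes datetime('now'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table queries (
        query_id   integer primary key,
        query_name varchar(100) not null,
        query_sql  varchar(4000)
    );
    create table queries_audit (
        query_id   integer not null,
        audit_time date not null,
        query_sql  varchar(4000)
    );
    -- same firing conditions as the Oracle trigger above
    create trigger queries_audit_sql
    before update on queries
    for each row
    when old.query_sql is not null
         and (new.query_sql is null or old.query_sql <> new.query_sql)
    begin
        insert into queries_audit values (old.query_id, datetime('now'), old.query_sql);
    end;
""")
conn.execute("insert into queries values (1, 'best customers', 'select * from users')")
conn.execute("update queries set query_sql = 'select name from users' where query_id = 1")
conn.execute("update queries set query_sql = null where query_id = 1")  # wiping it out is audited too
print(conn.execute("select query_sql from queries_audit order by rowid").fetchall())
# [('select * from users',), ('select name from users',)]
```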
Creating More Elaborate Constraints with Triggers
The default Oracle mechanisms for constraining data are not always
adequate. For example, the ArsDigita Community System auction module
has a table called
au_categories. The
category_keyword column is a unique shorthand way of
referring to a category in a URL. However, this column may be NULL
because it is not the primary key to the table. The shorthand method of
referring to the category is optional.
create table au_categories (
category_id integer primary key,
-- shorthand for referring to this category,
-- e.g. "bridges", for use in URLs
category_keyword varchar(30),
-- human-readable name of this category,
-- e.g. "All types of bridges"
category_name varchar(128) not null
);
A plain UNIQUE constraint on the category_keyword
column is not quite the point here; in Oracle it would in fact tolerate
any number of NULL rows, since NULLs never compare equal. The point is
that we can enforce a rule of our own devising: a trigger can
execute an arbitrary PL/SQL expression and raise an error to prevent an
INSERT if necessary:
create or replace trigger au_category_unique_tr
before insert
on au_categories
for each row
declare
existing_count integer;
begin
select count(*) into existing_count from au_categories
where category_keyword = :new.category_keyword;
if existing_count > 0
then
raise_application_error(-20010, 'Category keywords must be unique if used');
end if;
end;
This trigger queries the table to find out if there are any matching
keywords already inserted. If there are, it calls the built-in Oracle
procedure raise_application_error to abort the transaction.
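SQLite users can approximate raise_application_error with the built-in RAISE(ABORT, ...) function inside a trigger body. A sketch via Python's sqlite3 module (this is SQLite syntax, not Oracle's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table au_categories (
        category_id      integer primary key,
        category_keyword varchar(30),
        category_name    varchar(128) not null
    );
    -- RAISE(ABORT, ...) aborts the statement, like raise_application_error
    create trigger au_category_unique_tr
    before insert on au_categories
    for each row
    when new.category_keyword is not null
    begin
        select raise(abort, 'Category keywords must be unique if used')
        where exists (select 1 from au_categories
                      where category_keyword = new.category_keyword);
    end;
""")
conn.execute("insert into au_categories values (1, 'bridges', 'All types of bridges')")
conn.execute("insert into au_categories values (2, null, 'Miscellaneous')")
conn.execute("insert into au_categories values (3, null, 'Also miscellaneous')")  # two NULLs are fine
try:
    conn.execute("insert into au_categories values (4, 'bridges', 'Bridges again')")
except sqlite3.IntegrityError as e:
    print(e)   # Category keywords must be unique if used
```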
Constraints
When you're defining a table, you can constrain single rows by adding
some magic words after the data type:
- not null: requires a value for this column
- unique: two rows can't have the same value in this
column (side effect in Oracle: creates an index)
- primary key: same as unique except that no
row can have a null value for this column and other tables can refer to
this column
- check: limits the range of values for a column, e.g.,
rating integer check(rating > 0 and rating <= 10)
- references: this column can only contain values present
in another table's primary key column, e.g.,
user_id not null references users in the
bboard table forces the user_id column to only
point to valid users. An interesting twist is that you don't have to
give a data type for user_id; Oracle assigns this column
whatever data type the foreign key has (in this case integer).
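A references constraint in action, sketched with SQLite via Python's sqlite3 module. Two departures from the Oracle description above: SQLite only enforces foreign keys after pragma foreign_keys = on, and it does not inherit the column's data type from the parent key, so we declare it explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")   # off by default in SQLite
conn.executescript("""
    create table users (
        user_id integer primary key,
        name    varchar(100) not null
    );
    create table bboard (
        msg_id  integer primary key,
        user_id integer not null references users,
        message text
    );
""")
conn.execute("insert into users values (1, 'Eve')")
conn.execute("insert into bboard values (10, 1, 'hello')")       # points at a real user: fine
try:
    conn.execute("insert into bboard values (11, 999, 'spam')")  # no such user
except sqlite3.IntegrityError as e:
    print(e)   # FOREIGN KEY constraint failed
```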
Constraints can apply to multiple columns:
create table static_page_authors (
page_id integer not null references static_pages,
user_id integer not null references users,
notify_p char(1) default 't' check (notify_p in ('t','f')),
unique(page_id,user_id)
);
Oracle will let us keep rows that have the same
page_id and
rows that have the same
user_id but not rows that have the
same value in both columns (which would not make sense; a person can't
be the author of a document more than once). Suppose that you run a
university distinguished lecture series. You want speakers who are
professors at other universities or at least PhDs. On the other hand,
if someone controls enough money, be it his own or his company's, he's
in. Oracle stands ready:
create table distinguished_lecturers (
lecturer_id integer primary key,
name_and_title varchar(100),
personal_wealth number,
corporate_wealth number,
check (instr(upper(name_and_title),'PHD') <> 0
or instr(upper(name_and_title),'PROFESSOR') <> 0
or (personal_wealth + corporate_wealth) > 1000000000)
);
insert into distinguished_lecturers
values
(1,'Professor Ellen Egghead',-10000,200000);
1 row created.
insert into distinguished_lecturers
values
(2,'Bill Gates, innovator',75000000000,18000000000);
1 row created.
insert into distinguished_lecturers
values
(3,'Joe Average',20000,0);
ORA-02290: check constraint (PHOTONET.SYS_C001819) violated
As desired, Oracle prevented us from inserting some random average loser
into the
distinguished_lecturers table, but the error
message was confusing in that it refers to a constraint given the name
of "SYS_C001819" and owned by the PHOTONET user. We can give our
constraint a name at definition time:
create table distinguished_lecturers (
lecturer_id integer primary key,
name_and_title varchar(100),
personal_wealth number,
corporate_wealth number,
constraint ensure_truly_distinguished
check (instr(upper(name_and_title),'PHD') <> 0
or instr(upper(name_and_title),'PROFESSOR') <> 0
or (personal_wealth + corporate_wealth) > 1000000000)
);
insert into distinguished_lecturers
values
(3,'Joe Average',20000,0);
ORA-02290: check constraint (PHOTONET.ENSURE_TRULY_DISTINGUISHED) violated
Now the error message is easier to understand by application programmers.
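The named-constraint trick carries over to other engines. A sketch in SQLite via Python's sqlite3 module; recent SQLite versions likewise report the constraint name in the error message:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table distinguished_lecturers (
        lecturer_id      integer primary key,
        name_and_title   varchar(100),
        personal_wealth  number,
        corporate_wealth number,
        constraint ensure_truly_distinguished
            check (instr(upper(name_and_title), 'PHD') <> 0
                   or instr(upper(name_and_title), 'PROFESSOR') <> 0
                   or (personal_wealth + corporate_wealth) > 1000000000)
    )""")
conn.execute("insert into distinguished_lecturers "
             "values (1, 'Professor Ellen Egghead', -10000, 200000)")   # passes
try:
    conn.execute("insert into distinguished_lecturers "
                 "values (3, 'Joe Average', 20000, 0)")                 # violates the check
except sqlite3.IntegrityError as e:
    print(e)   # the constraint name appears in the message
```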
THIS NEEDS TO BE AMENDED FOR POSTGRESQL
The basics:
CREATE TABLE your_table_name (
the_key_column key_data_type PRIMARY KEY,
a_regular_column a_data_type,
an_important_column a_data_type NOT NULL,
... up to 996 intervening columns in Oracle8 ...
the_last_column a_data_type
);
Even in a simple example such as the one above, there are a few items
worth noting. First, I like to define the key column(s) at the very
top. Second, the
primary key constraint has some powerful
effects. It forces
the_key_column to be non-null. It
causes the creation of an index on
the_key_column, which
will slow down updates to
your_table_name but improve the
speed of access when someone queries for a row with a particular value
of
the_key_column. Oracle checks this index when inserting
any new row and aborts the transaction if there is already a row with
the same value for
the_key_column. Third, note that there
is no comma following the definition of the last column. If you are
careless and leave the comma in, Oracle will give you a very confusing
error message.
If you didn't get it right the first time, you'll probably want to
alter table your_table_name add (new_column_name a_data_type any_constraints);
or
alter table your_table_name modify (existing_column_name new_data_type new_constraints);
In Oracle 8i you can drop a column:
alter table your_table_name drop column existing_column_name;
(see
http://www.oradoc.com/keyword/drop_column).
If you're still in the prototype stage, you'll probably
find it easier to simply
drop table your_table_name;
and recreate it. At any time, you can see what you've got defined in
the database by querying Oracle's
Data Dictionary:
SQL> select table_name from user_tables order by table_name;
TABLE_NAME
------------------------------
ADVS
ADV_CATEGORIES
ADV_GROUPS
ADV_GROUP_MAP
ADV_LOG
ADV_USER_MAP
AD_AUTHORIZED_MAINTAINERS
AD_CATEGORIES
AD_DOMAINS
AD_INTEGRITY_CHECKS
BBOARD
...
STATIC_CATEGORIES
STATIC_PAGES
STATIC_PAGE_AUTHORS
USERS
...
after which you will typically type
describe
table_name_of_interest in SQL*Plus:
SQL> describe users;
Name Null? Type
------------------------------- -------- ----
USER_ID NOT NULL NUMBER(38)
FIRST_NAMES NOT NULL VARCHAR2(100)
LAST_NAME NOT NULL VARCHAR2(100)
PRIV_NAME NUMBER(38)
EMAIL NOT NULL VARCHAR2(100)
PRIV_EMAIL NUMBER(38)
EMAIL_BOUNCING_P CHAR(1)
PASSWORD NOT NULL VARCHAR2(30)
URL VARCHAR2(200)
ON_VACATION_UNTIL DATE
LAST_VISIT DATE
SECOND_TO_LAST_VISIT DATE
REGISTRATION_DATE DATE
REGISTRATION_IP VARCHAR2(50)
ADMINISTRATOR_P CHAR(1)
DELETED_P CHAR(1)
BANNED_P CHAR(1)
BANNING_USER NUMBER(38)
BANNING_NOTE VARCHAR2(4000)
Note that Oracle displays its internal data types rather than the ones
you've given, e.g.,
number(38) rather than
integer and
varchar2 instead of the specified
varchar.
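If you are following along in SQLite rather than Oracle, the counterparts to user_tables and describe are the sqlite_master catalog and the table_info pragma. A sketch via Python's sqlite3 module, with a trimmed users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table users (
        user_id     integer primary key,
        first_names varchar(100) not null,
        last_name   varchar(100) not null
    )""")
# sqlite_master plays the role of Oracle's user_tables ...
print([r[0] for r in conn.execute(
    "select name from sqlite_master where type = 'table' order by name")])
# ['users']
# ... and "pragma table_info" stands in for SQL*Plus's "describe".
for cid, name, col_type, notnull, default, pk in conn.execute("pragma table_info(users)"):
    print(name, col_type, 'NOT NULL' if notnull else '')
```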
Data Types
For each column that you define for a table, you must specify the data
type of that column. Here are your options:
Character Data

char(n)
A fixed-length character string, e.g., char(200) will take
up 200 bytes regardless of how long the string actually is. This works
well when the data truly are of fixed size, e.g., when you are recording
a user's sex as "m" or "f". This works badly when the data are of
variable length. Not only does it waste space on the disk and in the
memory cache, but it makes comparisons fail. For example, suppose you
insert "rating" into a comment_type column of type
char(30) and then your Tcl program queries the database.
Oracle sends this column value back to procedural language clients
padded with enough spaces to make up 30 total characters. Thus if you
have a comparison within Tcl of whether $comment_type ==
"rating", the comparison will fail because
$comment_type is actually "rating" followed by 24 spaces.
The maximum length char in Oracle8 is 2000 bytes.
varchar(n)
A variable-length character string, up to 4000 bytes long in Oracle8.
These are stored in such a way as to minimize disk space usage, i.e., if
you only put one character into a column of type
varchar(4000), Oracle only consumes two bytes on disk.
The reason that you don't just make all the columns
varchar(4000) is that the Oracle indexing system is limited
to indexing keys of about 700 bytes.
clob
A variable-length character string, up to 4 gigabytes long in Oracle8.
The CLOB data type is useful for accepting user input from such
applications as discussion forums. Sadly, Oracle8 has tremendous
limitations on how CLOB data may be inserted, modified, and queried.
Use varchar(4000) if you can and
prepare to suffer if you can't.
In a spectacular demonstration of what happens when companies don't
follow the lessons of
The
Mythical Man Month, the regular string functions don't work
on CLOBs. You need to call identically named functions in the DBMS_LOB
package. These functions take the same arguments but in different
orders. You'll never be able to write a working line of code without
first reading
the
DBMS_LOB section of the
Oracle8 Server Application Developer's Guide.
nchar, nvarchar, nclob
The n prefix stands for "national character set". These work like char,
varchar, and clob but for multi-byte characters (e.g., Unicode; see
http://www.unicode.org).
Numeric Data

number
Oracle actually only has one internal data type that is used for storing
numbers. It can handle 38 digits of precision and exponents from -130
to +126. If you want to get fancy, you can specify precision and scale
limits. For example, number(3,0) says "round everything to an
integer [scale 0] and accept numbers that range from -999 to +999". If
you're American and commercially minded, number(9,2) will
probably work well for storing prices in dollars and cents (unless
you're selling stuff to Bill
Gates, in which case the billion dollar limit imposed by the
precision of 9 might prove constraining). If you are
Bill Gates, you might not want to get distracted by
insignificant numbers: Tell Oracle to round everything to the nearest
million with number(38,-6).
integer
In terms of storage consumed and behavior, this is not any different
from number(38) but I think it reads better and it is more
in line with ANSI SQL (which would be a standard if anyone actually
implemented it).
Dates and Date/Time Intervals (Version 9i and newer)

timestamp
A point in time, recorded with sub-second precision. When creating a
column you specify the number of digits of precision beyond one second
from 0 (single second precision) to 9 (nanosecond precision). Oracle's
calendar can handle dates between January 1, 4712 BC and
December 31, 9999 AD. You can put in values with the
to_timestamp function and query them out using the
to_char function. Oracle offers several variants of this
datatype for coping with data aggregated across multiple timezones.
interval year to month
An amount of time, expressed in years and months.
interval day to second
An amount of time, expressed in days, hours, minutes, and seconds. Can
be precise down to the nanosecond if desired.
Dates and Date/Time Intervals (Versions 8i and earlier)
date
Obsolete as of version 9i. A point in time,
recorded with one-second precision, between January 1, 4712 BC and
December 31, 4712 AD. You can put in values with the
to_date function and query them out using the
to_char function. If you don't use these functions, you're
limited to specifying the date with the default system format mask,
usually 'DD-MON-YY'. This is a good recipe for a Year 2000 bug since
January 23, 2000 would be '23-JAN-00'. On better-maintained systems,
this is often the ANSI default: 'YYYY-MM-DD', e.g., '2000-01-23' for
January 23, 2000.
number
Hey, isn't this a typo? What's number doing in the date
section? It is here because this is how Oracle versions prior to 9i
represented date-time intervals, though their docs never say this
explicitly. If you add numbers to dates, you get new dates. For
example, tomorrow at exactly this time is sysdate+1. To
query for stuff submitted in the last hour, you limit to
submitted_date > sysdate - 1/24.
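Since a pre-9i interval is just a number of days, the arithmetic maps directly onto Python's timedelta. A sketch (Python is a stand-in here, not part of the toolkit; the frozen sysdate value is invented so the results are visible):

```python
from datetime import datetime, timedelta

# a stand-in for Oracle's sysdate, frozen so the arithmetic is visible
sysdate = datetime(1999, 1, 14, 9, 30, 0)

tomorrow_same_time = sysdate + timedelta(days=1)   # sysdate + 1
one_hour_ago = sysdate - timedelta(days=1) / 24    # sysdate - 1/24

print(tomorrow_same_time)  # 1999-01-15 09:30:00
print(one_hour_ago)        # 1999-01-14 08:30:00
```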
Binary Data
blob
BLOB stands for "Binary Large OBject". It doesn't really have
to be all that large, though Oracle will let you store up to 4 GB. The
BLOB data type was set up to permit the storage of images, sound
recordings, and other inherently binary data. In practice, it also gets
used by fraudulent application software vendors. They spend a few years
kludging together some nasty format of their own. Their MBA executive
customers demand that the whole thing be RDBMS-based. The software
vendor learns enough about Oracle to "stuff everything into a BLOB".
Then all the marketing and sales folks are happy because the application
is now running from Oracle instead of from the file system. Sadly, the
programmers and users don't get much because you can't use SQL very
effectively to query or update what's inside a BLOB.
bfile
A binary file, stored by the operating system (typically Unix)
and kept track of by Oracle. These would be useful when you
need to get to information both from SQL (which is kept purposefully
ignorant about what goes on in the wider world) and from an application
that can only read from standard files (e.g., a typical Web server).
The bfile data type is pretty new but to my mind it is already
obsolete: Oracle 8.1 (8i) lets external applications view content in
the database as though it were a file on a Windows NT server. So why
not keep everything as a BLOB and enable Oracle's Internet File System?
Despite this plethora of data types, Oracle has some glaring holes that
torture developers. For example, there is no Boolean data type. A
developer who needs an
approved_p column is forced to use
char(1) check(this_column in ('t','f')) and then, instead
of the clean query
where approved_p, is forced into
where approved_p = 't'.
Oracle8 includes a limited ability to create your own data types.
Covering these is beyond the scope of this book. See
Oracle8 Server Concepts, User-Defined Datatypes.
Created by Malte Sussdorff, last modified by Malte Sussdorff 25 Jul 2007, at 05:50 PM
Back in 1995, I built a threaded discussion forum, described
ad nauseam in
http://philip.greenspun.com/wtr/dead-trees/53013.htm.
Here's how I stored the postings:
create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
email varchar(200),
name varchar(200),
one_line varchar(700),
message clob,
notify char(1) default 'f' check (notify in ('t','f')),
posting_time date,
sort_key varchar(600)
);
German order reigns inside the system itself: messages are uniquely
keyed with
msg_id, refer to each other (i.e., say "I'm a
response to msg X") with
refers_to, and a thread can be
displayed conveniently by using the
sort_key column.
Italian chaos is permitted in the email and
name columns; users could remain anonymous, masquerade as
"president@whitehouse.gov" or give any name.
This seemed like a good idea when I built the system. I was concerned
that it work reliably. I didn't care whether or not users put in bogus
content; the admin pages made it really easy to remove such postings
and, in any case, if someone had something interesting to say but needed
to remain anonymous, why should the system reject their posting?
One hundred thousand postings later, as the moderator of the
photo.net Q&A forum, I began to see the dimensions
of my data modeling mistakes.
First, anonymous postings and fake email addresses didn't come from
Microsoft employees revealing the dark truth about their evil bosses.
They came from complete losers trying and failing to be funny or
wishing to humiliate other readers. Some fake addresses came from
people scared by the rising tide of spam email (not a serious problem
back in 1995).
Second, I didn't realize how the combination of my email alert systems,
fake email addresses, and Unix mailers would result in my personal
mailbox filling up with messages that couldn't be delivered to
"asdf@asdf.com" or "duh@duh.net".
Although the solution involved changing some Web scripts, fundamentally
the fix was to add a column to store the IP address from which a post was
made:
alter table bboard add (originating_ip varchar(16));
Keeping these data enabled me to see that most of the anonymous posters
were people who'd been using the forum for some time, typically from the
same IP address. I just sent them mail and asked them to stop,
explaining the problem with bounced email.
After four years of operating the photo.net community, it became
apparent that we needed ways to
- display site history for users who had changed their email addresses
- discourage problem users from burdening the moderators and the
community
- carefully tie together user-contributed content in the various
subsystems of photo.net
The solution was obvious to any experienced database nerd: a canonical
users table and then content tables that reference it. Here's a
simplified version of the data model, taken from a toolkit for building
online communities, described in
http://philip.greenspun.com/panda/community:
create table users (
user_id integer not null primary key,
first_names varchar(100) not null,
last_name varchar(100) not null,
email varchar(100) not null unique,
...
);
create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
topic varchar(100) not null references bboard_topics,
category varchar(200), -- only used for categorized Q&A forums
originating_ip varchar(16), -- stored as string, separated by periods
user_id integer not null references users,
one_line varchar(700),
message clob,
-- html_p - is the message in html or not
html_p char(1) default 'f' check (html_p in ('t','f')),
...
);
create table classified_ads (
classified_ad_id integer not null primary key,
user_id integer not null references users,
...
);
Note that a contributor's name and email address no longer appear in the
bboard table. That doesn't mean we don't know who posted a
message. In fact, this data model can't even represent an anonymous
posting:
user_id integer not null references users
requires that each posting be associated with a user ID and that there
actually be a row in the
users table with that ID.
First, let's talk about how much fun it is to move a live-on-the-Web
600,000 hit/day service from one data model to another. In this case,
note that the original bboard data model had a single
name column. The community system has separate columns for
first and last names. A conversion script can easily split up "Joe
Smith" but what is it to do with William Henry Gates
III?
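A hypothetical conversion script's split logic, sketched in Python (not the actual migration code), shows exactly where it breaks:

```python
def split_name(full_name):
    """Naive conversion: everything but the last token becomes the first
    names, the last token becomes the last name."""
    parts = full_name.split()
    return " ".join(parts[:-1]), parts[-1]

print(split_name("Joe Smith"))                # ('Joe', 'Smith') -- fine
print(split_name("William Henry Gates III"))  # ('William Henry Gates', 'III') -- wrong
```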
How do we copy over anonymous postings? Remember that Oracle is not
flexible or intelligent. We said that we wanted every row in the
bboard table to reference a row in the users
table. Oracle will abort any transaction that would result in a
violation of this integrity constraint. So we either have to drop all
those anonymous postings (and any non-anonymous postings that refer to
them) or we have to create a user called "Anonymous" and assign all the
anonymous postings to that person. The technical term for this kind of
solution is kludge.
A more difficult problem than anonymous postings is presented by
long-time users who have difficulty typing and/or keeping a job.
Consider a user who has identified himself as
- Joe Smith; jsmith@ibm.com
- Jo Smith; jsmith@ibm.com (typo in name)
- Joseph Smith; jsmth@ibm.com (typo in email)
- Joe Smith; cantuseworkaddr@hotmail.com (new IBM policy)
- Joe Smith-Jones; joe_smithjones@hp.com (got married, changed name,
changed jobs)
- Joe Smith-Jones; jsmith@somedivision.hp.com (valid but not
canonical corporate email address)
- Josephina Smith; jsmith@somedivision.hp.com (sex change; divorce)
- Josephina Smith; josephina_smith@hp.com (new corporate address)
- Siddhartha Bodhisattva; josephina_smith@hp.com (change of philosophy)
- Siddhartha Bodhisattva; thinkwaitfast@hotmail.com (traveling for
awhile to find enlightenment)
Contemporary community members all recognize these postings as coming
from the same person but it would be very challenging even to build a
good semi-automated means of merging postings from this person into one
user record.
Once we've copied everything into this new normalized data
model, notice that we can't dig ourselves into the same hole again. If
a user has contributed 1000 postings, we don't have 1000 different
records of that person's name and email address. If a user changes
jobs, we need only update one column in one row in one table.
The html_p column in the new data model is worth
mentioning. In 1995, I didn't understand the problems of user-submitted
data. Some users will submit plain text, which seems simple, but in
fact you can't just spit this out as HTML. If user A typed < or >
characters, they might get swallowed by user B's Web browser. Does this
matter? Consider that "<g>" is interpreted in various online
circles as an abbreviation for "grin" but by Netscape Navigator as an
unrecognized (and therefore ignored) HTML tag. Compare the meaning of
"We shouldn't think it unfair that Bill Gates has more wealth than the
100 million poorest Americans combined. After all, he invented the
personal computer, the graphical user interface, and the Internet."
with
"We shouldn't think it unfair that Bill Gates has more wealth than the
100 million poorest Americans combined. After all, he invented the
personal computer, the graphical user interface, and the Internet. <g>"
It would have been easy enough for me to make sure that such characters
never got interpreted as markup. In fact, with AOLserver one can do it
with a single call to the built-in procedure ns_quotehtml.
However, consider the case where a nerd posts some HTML. Other users
would then see
"For more examples of my brilliant thinking and modesty, check out <a
href="http://philip.greenspun.com/">my home page</a>."
I discovered that the only real solution is to ask the user whether the
submission is an HTML fragment or plain text, show the user an approval
page where the content may be previewed, and then remember what the user
told us in an
html_p column in the database.
Is this data model perfect? Permanent? Absolutely. It will last for
at least... Whoa! Wait a minute. I didn't know that Dave Clark was
replacing his original Internet Protocol, which the world has been
running since around 1980, with IPv6
(http://www.faqs.org/rfcs/rfc2460.html). In the near future, we'll have IP
addresses that are 128 bits long. That's 16 bytes, each of which takes
two hex characters to represent. So we need 32 characters plus at least
7 more for the colons that separate the groups of hex digits. We might also need a
couple of characters in front to say "this is a hex representation".
Thus our brand new data model in fact has a crippling deficiency. How
easy is it to fix? In Oracle:
alter table bboard modify (originating_ip varchar(50));
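The length arithmetic checks out against Python's ipaddress module (a stand-in here, not part of the original toolkit): the fully expanded textual form is 8 groups of 4 hex digits plus 7 separators, 39 characters.

```python
import ipaddress

# the RFC 2460 documentation-style address, fully expanded:
# 8 groups of 4 hex digits separated by 7 colons = 39 characters
addr = ipaddress.IPv6Address("2001:db8:85a3::8a2e:370:7334")
print(addr.exploded)       # 2001:0db8:85a3:0000:0000:8a2e:0370:7334
print(len(addr.exploded))  # 39
```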
You won't always get off this easy. Oracle won't let you shrink a
column from a maximum of 50 characters to 16, even if no row has a value
longer than 16 characters. Oracle also makes it tough to add a column
that is constrained
not null.
Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 04:32 PM
In the introduction we covered some examples of inserting data into a
database by typing at SQL*Plus:
insert into mailing_list (name, email)
values ('Philip Greenspun','philg@mit.edu');
Generally, this is not how it is done. As a programmer, you write code
that gets executed every time a user submits a discussion forum posting
or classified ad. The structure of the SQL statement remains fixed but
not the string literals after the
values.
The simplest and most direct interface to a relational database involves
a procedural program in C, Java, Lisp, Perl, or Tcl putting together a
string of SQL that is then sent to the RDBMS. Here's how the
ArsDigita Community System constructs a new entry in the clickthrough
log:
insert into clickthrough_log
(local_url, foreign_url, entry_date, click_count)
values
('$local_url', '$foreign_url', trunc(sysdate), 1)
The INSERT statement adds one row, filling in the four listed columns.
Two of the values come from local variables set within the Web server,
$local_url and
$foreign_url. Because these
are strings, they must be surrounded by single quotes. One of the
values is dynamic and comes straight from Oracle:
trunc(sysdate). Recall that the
date data
type in Oracle is precise to the second. We only want one of these rows
per day of the year and hence truncate the date to midnight. Finally,
as this is the first clickthrough of the day, we insert a constant value
of 1 for
click_count.
Atomicity
Each SQL statement executes as an atomic transaction. For example,
suppose that you were to attempt to purge some old data with
delete from clickthrough_log where entry_date + 120 < sysdate;
(delete clickthrough records more than 120 days old) and that 3500 rows
in
clickthrough_log are older than 120 days. If your
computer failed halfway through the execution of this DELETE, i.e.,
before the transaction committed, you would find that none of the rows
had been deleted. Either all 3500 rows will disappear or none will.
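A crash can't be simulated portably, but the all-or-nothing behavior is easy to see with an explicit transaction. A sketch using SQLite via Python's sqlite3 as a stand-in for Oracle, with a raised exception playing the role of the failure before commit:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table clickthrough_log (local_url text, entry_date text)")
db.executemany("insert into clickthrough_log values (?, ?)",
               [("/foo", "1998-01-01")] * 3500)
db.commit()

try:
    with db:  # one transaction: commits on success, rolls back on exception
        db.execute("delete from clickthrough_log where entry_date < '1999-01-01'")
        raise RuntimeError("simulated failure before the transaction committed")
except RuntimeError:
    pass

# the DELETE vanished with the failed transaction: all 3500 rows survive
print(db.execute("select count(*) from clickthrough_log").fetchone()[0])  # 3500
```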
More interestingly, you can wrap a transaction around multiple SQL
statements. For example, when a user is editing a comment, the
ArsDigita Community System keeps a record of what was there before:
ns_db dml $db "begin transaction"
# insert into the audit table
ns_db dml $db "insert into general_comments_audit
(comment_id, user_id, ip_address, audit_entry_time, modified_date, content)
select comment_id,
user_id,
'[ns_conn peeraddr]',
sysdate,
modified_date,
content from general_comments
where comment_id = $comment_id"
# change the publicly viewable table
ns_db dml $db "update general_comments
set content = '$QQcontent',
html_p = '$html_p'
where comment_id = $comment_id"
# commit the transaction
ns_db dml $db "end transaction"
This is generally referred to in the database industry as
auditing. The database itself is used to keep track of what has
been changed and by whom.
Let's look at these sections piece by piece. We're looking at a Tcl
program calling AOLserver API procedures when it wants to talk to
Oracle. We've configured the system to reverse the normal Oracle world
order in which everything is within a transaction unless otherwise
committed. The begin transaction and end
transaction statements never get through to Oracle; they are
merely instructions to our Oracle driver to flip Oracle out and then
back into autocommit mode.
The transaction wrapper is imposed around two SQL statements. The first
statement inserts a row into general_comments_audit. We
could simply query the general_comments table from Tcl and
then use the returned data to create a standard-looking INSERT.
However, if what you're actually doing is moving data from one place
within the RDBMS to another, it is extremely bad taste to drag it all
the way out to an application program and then stuff it back in. Much
better to use the "INSERT ... SELECT" form.
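The audit pattern above can be sketched with SQLite via Python's sqlite3 standing in for Oracle and the AOLserver driver; the table and column names mirror the Tcl example, with the tables pared down to two columns:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table general_comments (comment_id integer primary key, content text);
create table general_comments_audit (comment_id integer, content text);
insert into general_comments values (1, 'original text');
""")

comment_id = 1
with db:  # the audit copy and the edit succeed or fail together
    # INSERT ... SELECT moves the old row into the audit table without
    # dragging the data out to the application program
    db.execute("""insert into general_comments_audit (comment_id, content)
                  select comment_id, content from general_comments
                  where comment_id = ?""", (comment_id,))
    db.execute("update general_comments set content = ? where comment_id = ?",
               ("edited text", comment_id))

print(db.execute("select content from general_comments_audit").fetchone()[0])
```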
Note that two of the columns we're querying from
general_comments don't exist in the table:
sysdate and '[ns_conn peeraddr]'. It is legal
in SQL to put function calls or constants in your select list, just
as you saw at the beginning of
the Queries chapter where we discussed Oracle's
one-row system table: dual. To refresh your memory:
select sysdate from dual;
SYSDATE
----------
1999-01-14
You can compute multiple values in a single query:
select sysdate, 2+2, atan2(0, -1) from dual;
SYSDATE 2+2 ATAN2(0,-1)
---------- ---------- -----------
1999-01-14 4 3.14159265
This approach is useful in the transaction above, where we combine
information from a table with constants and function calls. Here's a
simpler example:
select posting_time, 2+2
from bboard
where msg_id = '000KWj';
POSTING_TI 2+2
---------- ----------
1998-12-13 4
Let's get back to our comment editing transaction and look at the basic
structure:
- open a transaction
- insert into an audit table whatever comes back from a SELECT
statement on the comment table
- update the comment table
- close the transaction
Suppose that something goes wrong during the INSERT. The tablespace in
which the audit table resides is full and it isn't possible to add a
row. Putting the INSERT and UPDATE in the same RDBMS transaction
ensures that if there is a problem with one, the other won't be applied
to the database.
Consistency
Suppose that we've looked at a message on the bulletin board and decide
that its content is so offensive we wish to delete the user from our
system:
select user_id from bboard where msg_id = '000KWj';
USER_ID
----------
39685
delete from users where user_id = 39685;
*
ERROR at line 1:
ORA-02292: integrity constraint (PHOTONET.SYS_C001526) violated - child record
found
Oracle has stopped us from deleting user 39685 because to do so would
leave the database in an inconsistent state. Here's the definition of
the bboard table:
create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
...
user_id integer not null references users,
one_line varchar(700),
message clob,
...
);
The
user_id column is constrained to be not null.
Furthermore, the value in this column must correspond to some row in the
users table (
references users). By asking
Oracle to delete the author of msg_id 000KWj from the
users
table before we deleted all of his or her postings from the
bboard table, we were asking Oracle to leave the RDBMS in
an inconsistent state.
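SQLite (via Python's sqlite3, again standing in for Oracle) enforces the same rule once foreign keys are switched on; the error message differs but the refusal is the same. A sketch:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("pragma foreign_keys = on")  # SQLite enforces FKs only when asked
db.executescript("""
create table users (user_id integer primary key);
create table bboard (msg_id text primary key,
                     user_id integer not null references users);
insert into users values (39685);
insert into bboard values ('000KWj', 39685);
""")

try:
    db.execute("delete from users where user_id = 39685")
except sqlite3.IntegrityError as e:
    print("blocked:", e)  # a child row in bboard still points at this user
```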
Mutual Exclusion
When you have multiple simultaneously executing copies of the same
program, you have to think about
mutual exclusion. If a program
has to
- read a value from the database
- perform a computation based on that value
- update the value in the database based on the computation
Then you want to make sure only one copy of the program is executing at
a time through this segment.
The /bboard module of the ArsDigita Community System has to do this.
The sequence is
- read the last message ID from the
msg_id_generator table
- increment the message ID with a bizarre collection of Tcl scripts
- update the
last_msg_id column
in the msg_id_generator table
First, anything having to do with locks only makes sense when the three
operations are grouped together in a transaction. Second, to avoid
deadlocks a transaction must acquire all the resources (including locks)
that it needs at the start of the transaction. A SELECT in Oracle does
not acquire any locks but a SELECT .. FOR UPDATE does. Here's the
beginning of the transaction that inserts a message into the
bboard table (from /bboard/insert-msg.tcl):
select last_msg_id
from msg_id_generator
for update of last_msg_id
Mutual Exclusion (the Big Hammer)
The
for update clause isn't a panacea. For example, in the
Action Network (described in
Chapter 16 of Philip and Alex's
Guide to Web Publishing), we need to make sure that a
double-clicking user doesn't generate duplicate FAXes to politicians.
The test to see if the user has already responded is
select count(*) from an_alert_log
where member_id = $member_id
and entry_type = 'sent_response'
and alert_id = $alert_id
By default, Oracle locks one row at a time and doesn't want you to throw
a FOR UPDATE clause into a SELECT COUNT(*). The implication of that
would be Oracle recording locks on every row in the table. Much more
efficient is simply to start the transaction with
lock table an_alert_log in exclusive mode
This is a big hammer and you don't want to hold a table lock for more
than an instant. So the structure of a page that gets a table lock
should be
- open a transaction
- lock table
- select count(*)
- if the count was 0, insert a row to record the fact that the user
has responded
- commit the transaction (releases the table lock)
- proceed with the rest of the script
- ...
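The page structure above can be sketched with SQLite via Python's sqlite3, using BEGIN EXCLUSIVE as the analogue of Oracle's "lock table ... in exclusive mode" (the function name and schema details are invented for the sketch):

```python
import sqlite3

# manage transactions by hand so "begin exclusive" reaches SQLite untranslated
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("""create table an_alert_log
              (member_id integer, entry_type text, alert_id integer)""")

def record_response(member_id, alert_id):
    """Lock, check for a prior response, insert only if none. Returns True
    if a response was recorded (i.e., the FAX should actually go out)."""
    db.execute("begin exclusive")  # the big hammer: one writer at a time
    try:
        (n,) = db.execute(
            """select count(*) from an_alert_log
               where member_id = ? and entry_type = 'sent_response'
                 and alert_id = ?""", (member_id, alert_id)).fetchone()
        if n == 0:
            db.execute("insert into an_alert_log values (?, 'sent_response', ?)",
                       (member_id, alert_id))
        db.execute("commit")  # releases the lock
        return n == 0
    except Exception:
        db.execute("rollback")
        raise

print(record_response(42, 7))  # True: first click records and sends
print(record_response(42, 7))  # False: double click detected, nothing duplicated
```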
What if I just want some unique numbers?
Does it really have to be this hard? What if you just want some unique
integers, each of which will be used as a primary key? Consider a table
to hold news items for a Web site:
create table news (
title varchar(100) not null,
body varchar(4000) not null,
release_date date not null,
...
);
You might think you could use the
title column as a key,
but consider the following articles:
insert into news (title, body, release_date)
values
('French Air Traffic Controllers Strike',
'A walkout today by controllers left travelers stranded..',
'1995-12-14');
insert into news (title, body, release_date)
values
('French Air Traffic Controllers Strike',
'Passengers at Orly faced 400 canceled flights ...',
'1997-05-01');
insert into news (title, body, release_date)
values
('Bill Clinton Beats the Rap',
'Only 55 senators were convinced that President Clinton obstructed justice ...',
'1999-02-12');
insert into news (title, body, release_date)
values
('Bill Clinton Beats the Rap',
'The sexual harassment suit by Paula Jones was dismissed ...',
'1998-12-02');
It would seem that, at least as far as headlines are concerned,
little of what is reported is truly new. Could we add
primary key (title, release_date)
at the end of our table definition? Absolutely. But keying by title
and date would result in some unwieldy URLs for editing or approving
news articles. If your site allows public suggestions, you might find
submissions from multiple users colliding. If you accept comments on
news articles, a standard feature of the ArsDigita Community System,
each comment must reference a news article. You'd have to be
sure to update both the comments table and the news table if you needed
to correct a typo in the
title column or changed the
release_date.
The traditional database design that gets around all of these problems
is the use of a generated key. If you've been annoyed by having to
carry around your student ID at MIT or your patient ID at a hospital,
now you understand the reason why: the programmers are using generated
keys and making their lives a bit easier by exposing this part of their
software's innards.
Here's how the news module of the ArsDigita Community System works, in
an excerpt from http://software.arsdigita.com/www/doc/sql/news.sql:
create sequence news_id_sequence start with 1;
create table news (
news_id integer primary key,
title varchar(100) not null,
body varchar(4000) not null,
release_date date not null,
...
);
We're taking advantage of the nonstandard but very useful Oracle
sequence facility. In almost any Oracle SQL statement, you can
ask for a sequence's current value or next value.
SQL> create sequence foo_sequence;
Sequence created.
SQL> select foo_sequence.currval from dual;
ERROR at line 1:
ORA-08002: sequence FOO_SEQUENCE.CURRVAL is not yet defined in this session
Oops! Looks like we can't ask for the current value until we've
asked for at least one key in our current session with Oracle.
SQL> select foo_sequence.nextval from dual;
NEXTVAL
----------
1
SQL> select foo_sequence.nextval from dual;
NEXTVAL
----------
2
SQL> select foo_sequence.nextval from dual;
NEXTVAL
----------
3
SQL> select foo_sequence.currval from dual;
CURRVAL
----------
3
You can use the sequence generator directly in an insert, e.g.,
insert into news (news_id, title, body, release_date)
values
(news_id_sequence.nextval,
'Tuition Refund at MIT',
'Administrators were shocked and horrified ...',
'1998-03-12');
Background on this story: http://philip.greenspun.com/school/tuition-free-mit.html
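SQLite has no sequence facility; its autogenerated integer primary keys play the same role. A sketch via Python's sqlite3 (a stand-in, not the ArsDigita code) of why a generated key beats keying on the title:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# an INTEGER PRIMARY KEY column autogenerates keys on insert,
# standing in for news_id_sequence.nextval
db.execute("""create table news (news_id integer primary key,
                                 title text not null,
                                 release_date text not null)""")

cur = db.execute("insert into news (title, release_date) values (?, ?)",
                 ("French Air Traffic Controllers Strike", "1995-12-14"))
print(cur.lastrowid)  # 1
cur = db.execute("insert into news (title, release_date) values (?, ?)",
                 ("French Air Traffic Controllers Strike", "1997-05-01"))
print(cur.lastrowid)  # 2 -- duplicate headlines are fine; the key distinguishes them
```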
In the ArsDigita Community System implementation, the
news_id is actually generated in /news/post-new-2.tcl:
set news_id [database_to_tcl_string $db "select news_id_sequence.nextval from dual"]
This way the page that actually does the database insert,
/news/post-new-3.tcl, can be sure when the user has inadvertently hit
submit twice:
if [catch { ns_db dml $db "insert into news
(news_id, title, body, html_p, approved_p,
release_date, expiration_date, creation_date, creation_user,
creation_ip_address)
values
($news_id, '$QQtitle', '$QQbody', '$html_p', '$approved_p',
'$release_date', '$expiration_date', sysdate, $user_id,
'$creation_ip_address')" } errmsg] {
# insert failed; let's see if it was because of duplicate submission
if { [database_to_tcl_string $db "select count(*)
from news
where news_id = $news_id"] == 0 } {
# some error other than dupe submission
ad_return_error "Insert Failed" "The database ..."
return
}
# we don't bother to handle the cases where there is a dupe submission
# because the user should be thanked or redirected anyway
}
In our experience, the standard technique of generating the key at the
same time as the insert leads to a lot of duplicate information in the
database.
Sequence Caveats
Oracle sequences are optimized for speed. Hence they offer the minimum
guarantees that Oracle thinks are required for primary key generation
and no more.
If you ask for a few nextvals and roll back your transaction, the
sequence will not be rolled back.
You can't rely on sequence values to be, uh, sequential. They will be
unique. They will be monotonically increasing. But there might be gaps.
The gaps arise because Oracle pulls, by default, 20 sequence values into
memory and records those values as used on disk. This makes nextval
very fast since the new value need only be marked as used in RAM and not on
disk. But suppose that someone pulls the plug on your database server
after only two sequence values have been handed out. If your database
administrator and system administrator are working well together, the
computer will come back to life running Oracle. But there will be a gap
of 18 values in the sequence (e.g., from 2023 to 2041). That's because
Oracle recorded 20 values used on disk and only handed out 2.
---
based on SQL for Web Nerds
Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 04:31 PM
The relational database provides programmers with a high degree of
abstraction from the physical world of the computer. You can't tell
where on the disk the RDBMS is putting each row of a table. For all you
know, information in a single row might be split up and spread out
across multiple disk drives. The RDBMS lets you add a column to a
billion-row table. Is the new information for each row going to be
placed next to the pre-existing columns or will a big new block of disk
space be allocated to hold the new column value for all billion rows?
You can't know and shouldn't really care.
A view is a way of building even greater abstraction.
Suppose that Jane in marketing says that she wants to see a table
containing the following information:
- user_id
- email address
- number of static pages viewed
- number of bboard postings made
- number of comments made
This information is spread out among four tables. However, having read
the preceding chapters of this book, you're perfectly equipped to serve
Jane's needs with the following query:
select u.user_id,
u.email,
count(ucm.page_id) as n_pages,
count(bb.msg_id) as n_msgs,
count(c.comment_id) as n_comments
from users u, user_content_map ucm, bboard bb, comments c
where u.user_id = ucm.user_id(+)
and u.user_id = bb.user_id(+)
and u.user_id = c.user_id(+)
group by u.user_id, u.email
order by upper(email)
Then Jane adds "I want to see this every day, updated with the latest
information. I want to have a programmer write me some desktop software
that connects directly to the database and looks at this information; I
don't want my desktop software breaking if you reorganize the data
model."
create or replace view janes_marketing_view
as
select u.user_id,
u.email,
count(ucm.page_id) as n_pages,
count(bb.msg_id) as n_msgs,
count(c.comment_id) as n_comments
from users u, user_content_map ucm, bboard bb, comments c
where u.user_id = ucm.user_id(+)
and u.user_id = bb.user_id(+)
and u.user_id = c.user_id(+)
group by u.user_id, u.email
order by upper(u.email)
To Jane, this will look and act just like a table when she queries it:
select * from janes_marketing_view;
Why should she need to be aware that information is coming from four
tables? Or that you've reorganized the RDBMS so that the information
subsequently comes from six tables?
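A sketch of the same idea with SQLite via Python's sqlite3 (SQLite doesn't understand Oracle's (+) notation, so the view uses an ANSI left join; the tables are pared-down stand-ins for the real ones):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table users (user_id integer primary key, email text);
create table bboard (msg_id integer primary key, user_id integer);
insert into users values (1, 'jane@example.com');
insert into bboard values (100, 1), (101, 1);

-- the reporting query, frozen behind a name
create view janes_marketing_view as
  select u.user_id, u.email, count(bb.msg_id) as n_msgs
  from users u left join bboard bb on u.user_id = bb.user_id
  group by u.user_id, u.email;
""")

# to the consumer, the view looks and acts just like a table
print(db.execute("select * from janes_marketing_view").fetchall())
```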
Protecting Privacy with Views
A common use of views is protecting confidential data. For example,
suppose that all the people who work in a hospital collaborate by using
a relational database. Here is the data model:
create table patients (
patient_id integer primary key,
patient_name varchar(100),
hiv_positive_p char(1),
insurance_p char(1),
...
);
If a bunch of hippie idealists are running the hospital, they'll think
that the medical doctors shouldn't be aware of a patient's insurance
status. So when a doc is looking up a patient's medical record, the
looking is done through
create view patients_clinical
as
select patient_id, patient_name, hiv_positive_p from patients;
The folks over in accounting shouldn't get access to the patients'
medical records just because they're trying to squeeze money out of
them:
create view patients_accounting
as
select patient_id, patient_name, insurance_p from patients;
Relational databases have elaborate permission systems similar to those
on time-shared computer systems. Each person in a hospital has a unique
database user ID. Permission will be granted to view or modify certain
tables on a per-user or per-group-of-users basis. Generally the RDBMS
permissions facilities aren't very useful for Web applications. It is
the Web server that is talking to the database, not a user's desktop
computer. So the Web server is responsible for figuring out who is
requesting a page and how much to show in response.
Protecting Your Own Source Code
The ArsDigita Shoppe system, described in
http://philip.greenspun.com/panda/ecommerce,
represents all orders in one table, whether they were denied by the
credit card processor, returned by the user, or voided by the merchant.
This is fine for transaction processing but you don't want your
accounting or tax reports corrupted by the inclusion of failed orders.
You can make a decision in one place as to what constitutes a reportable
order and then have all of your report programs query the view:
create or replace view sh_orders_reportable
as
select * from sh_orders
where order_state not in ('confirmed','failed_authorization','void');
Note that in the privacy example (above) we were using the view to leave
unwanted columns behind whereas here we are using the view to leave
behind unwanted rows.
If we add some order states or otherwise change the data model, the
reporting programs need not be touched; we only have to keep this view
definition up to date. Note that you can define every view with
"create or replace view" rather than "create view"; this saves a
bit of typing when you have to edit the definition later.
If you've used select * to define a view and
subsequently alter any of the underlying tables, you have to redefine
the view. Otherwise, your view won't contain any of the new columns.
You might consider this a bug but Oracle has documented it,
thus turning the behavior into a feature.
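For example, suppose a column is added to sh_orders (coupon_code here is purely illustrative). The view's column list was frozen at definition time, so it must be re-issued to pick up the new column:

```sql
-- hypothetical new column
alter table sh_orders add (coupon_code varchar(20));

-- sh_orders_reportable still lacks coupon_code until we re-run the definition
create or replace view sh_orders_reportable
as
select * from sh_orders
where order_state not in ('confirmed','failed_authorization','void');
```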
Views-on-the-fly and OUTER JOIN
Let's return to our first OUTER JOIN example, from the simple queries
chapter:
select users.user_id, users.email, classified_ads.posted
from users, classified_ads
where users.user_id = classified_ads.user_id(+)
order by users.email, posted;
...
USER_ID EMAIL POSTED
---------- ----------------------------------- ----------
52790 dbrager@mindspring.com
37461 dbraun@scdt.intel.com
52791 dbrenner@flash.net
47177 dbronz@free.polbox.pl
37296 dbrouse@enter.net
47178 dbrown@cyberhighway.net
36985 dbrown@uniden.com 1998-03-05
36985 dbrown@uniden.com 1998-03-10
34283 dbs117@amaze.net
52792 dbsikorski@yahoo.com
...
The plus sign after classified_ads.user_id is our
instruction to Oracle to "add NULL rows if you can't meet this JOIN
constraint".
Suppose that this report has gotten very long and we're only interested
in users whose email addresses start with "db". We can add a WHERE
clause constraint on the email column of the
users table:
select users.user_id, users.email, classified_ads.posted
from users, classified_ads
where users.user_id = classified_ads.user_id(+)
and users.email like 'db%'
order by users.email, posted;
USER_ID EMAIL POSTED
---------- ------------------------------ ----------
71668 db-designs@emeraldnet.net
112295 db1@sisna.com
137640 db25@umail.umd.edu
35102 db44@aol.com 1999-12-23
59279 db4rs@aol.com
95190 db@astro.com.au
17474 db@hotmail.com
248220 db@indianhospitality.com
40134 db@spindelvision.com 1999-02-04
144420 db_chang@yahoo.com
15020 dbaaru@mindspring.com
...
Suppose that we decide we're only interested in classified ads since
January 1, 1999. Let's try the naive approach, adding another WHERE
clause constraint, this time on a column from the
classified_ads table:
select users.user_id, users.email, classified_ads.posted
from users, classified_ads
where users.user_id = classified_ads.user_id(+)
and users.email like 'db%'
and classified_ads.posted > '1999-01-01'
order by users.email, posted;
USER_ID EMAIL POSTED
---------- ------------------------------ ----------
35102 db44@aol.com 1999-12-23
40134 db@spindelvision.com 1999-02-04
16979 dbdors@ev1.net 2000-10-03
16979 dbdors@ev1.net 2000-10-26
235920 dbendo@mindspring.com 2000-08-03
258161 dbouchar@bell.mma.edu 2000-10-26
39921 dbp@agora.rdrop.com 1999-06-03
39921 dbp@agora.rdrop.com 1999-11-05
8 rows selected.
Hey! This completely wrecked our outer join! All of the rows where the
user had not posted any ads have now disappeared. Why? They didn't
meet the
and classified_ads.posted > '1999-01-01'
constraint. The outer join added NULLs to every column in the report
where there was no corresponding row in the
classified_ads
table. The new constraint is comparing NULL to January 1, 1999 and the
answer is... NULL. That's three-valued logic for you. Any computation
involving a NULL turns out NULL. Each WHERE clause constraint must
evaluate to true for a row to be kept in the result set of the SELECT.
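You can watch a NULL comparison silently discard a row from SQL*Plus, using Oracle's one-row dual table:

```sql
-- the WHERE clause is true; one row comes back
select 'kept' from dual where 1 < 2;

-- comparing NULL to a date yields NULL, not false -- but the row
-- is discarded just the same: "no rows selected"
select 'kept' from dual where null < sysdate;
```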
What's the solution? A "view on the fly". Instead of OUTER JOINing the
users table to the
classified_ads, we will
OUTER JOIN
users to a
view of
classified_ads that contains only those ads posted since
January 1, 1999:
select users.user_id, users.email, ad_view.posted
from
users,
(select *
from classified_ads
where posted > '1999-01-01') ad_view
where users.user_id = ad_view.user_id(+)
and users.email like 'db%'
order by users.email, ad_view.posted;
USER_ID EMAIL POSTED
---------- ------------------------------ ----------
71668 db-designs@emeraldnet.net
112295 db1@sisna.com
137640 db25@umail.umd.edu
35102 db44@aol.com 1999-12-23
59279 db4rs@aol.com
95190 db@astro.com.au
17474 db@hotmail.com
248220 db@indianhospitality.com
40134 db@spindelvision.com 1999-02-04
144420 db_chang@yahoo.com
15020 dbaaru@mindspring.com
...
174 rows selected.
Note that we've named our "view on the fly"
ad_view for the
duration of this query.
How Views Work
Programmers aren't supposed to have to think about how views work.
However, it is worth noting that the RDBMS merely stores the view
definition and not any of the data in a view. Querying against a view
versus the underlying tables does not change the way that data are
retrieved or cached. Standard RDBMS views exist to make programming
more convenient or to address security concerns, not to make data access
more efficient.
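If you're curious, you can see the stored definitions in Oracle's data dictionary; the text column of the user_views table holds the query behind each of your views:

```sql
-- one row per view you own; text contains the stored SELECT
select view_name, text
from user_views;
```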
How Materialized Views Work
Starting with Oracle 8.1.5, introduced in March 1999, you can have a
materialized view, also known as a
summary. Like a
regular view, a materialized view can be used to build a black-box
abstraction for the programmer. In other words, the view might be
created with a complicated JOIN, or an expensive GROUP BY with sums and
averages. With a regular view, this expensive operation would be done
every time you issued a query. With a materialized view, the expensive
operation is done when the view is created and thus an individual query
need not involve substantial computation.
Materialized views consume space because Oracle is keeping a copy of the
data or at least a copy of information derivable from the data. More
importantly, a materialized view does not contain up-to-the-minute
information. When you query a regular view, your results include
changes made up to the last committed transaction before your SELECT.
When you query a materialized view, you're getting results as of the
time that the view was created or refreshed. Note that Oracle lets you
specify a refresh interval at which the materialized view will
automatically be refreshed.
At this point, you'd expect an experienced Oracle user to say "Hey,
these aren't new. This is the old CREATE SNAPSHOT facility that we used
to keep semi-up-to-date copies of tables on machines across the
network!" What is new with materialized views is that you can create
them with the ENABLE QUERY REWRITE option. This authorizes the SQL
parser to look at a query involving aggregates or JOINs and go to the
materialized view instead. Consider the following query, from
the ArsDigita Community System's /admin/users/registration-history.tcl
page:
select
to_char(registration_date,'YYYYMM') as sort_key,
rtrim(to_char(registration_date,'Month')) as pretty_month,
to_char(registration_date,'YYYY') as pretty_year,
count(*) as n_new
from users
group by
to_char(registration_date,'YYYYMM'),
to_char(registration_date,'Month'),
to_char(registration_date,'YYYY')
order by 1;
SORT_K PRETTY_MO PRET N_NEW
------ --------- ---- ----------
199805 May 1998 898
199806 June 1998 806
199807 July 1998 972
199808 August 1998 849
199809 September 1998 1023
199810 October 1998 1089
199811 November 1998 1005
199812 December 1998 1059
199901 January 1999 1488
199902 February 1999 2148
For each month, we have a count of how many users registered at
photo.net. To execute the query, Oracle must sequentially scan the
users table. If the users table grew large and you wanted
the query to be instant, you'd sacrifice some timeliness in the stats
with
create materialized view users_by_month
enable query rewrite
refresh complete
start with '1999-03-28'
next sysdate + 1
as
select
to_char(registration_date,'YYYYMM') as sort_key,
rtrim(to_char(registration_date,'Month')) as pretty_month,
to_char(registration_date,'YYYY') as pretty_year,
count(*) as n_new
from users
group by
to_char(registration_date,'YYYYMM'),
to_char(registration_date,'Month'),
to_char(registration_date,'YYYY')
order by 1
Oracle will build this view just after midnight on March 28, 1999. The
view will be refreshed every 24 hours after that. Because of the
enable query rewrite clause, Oracle will feel free to grab
data from the view even when a user's query does not mention the view.
For example, given the query
select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'January'
and to_char(registration_date,'YYYY') = '1999'
Oracle would ignore the
users table altogether and pull
information from
users_by_month. This would give the same
result with much less work. Suppose that the current month is March
1999, though. The query
select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'March'
and to_char(registration_date,'YYYY') = '1999'
will also hit the materialized view rather than the
users
table and hence will miss anyone who has registered since midnight
(i.e., the query rewriting will cause a different result to be returned).
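If a particular page must see up-to-the-minute rows, one approach is to switch off rewriting for that database session so that queries go back to scanning the base table (a sketch, assuming the standard Oracle session parameter):

```sql
-- disable query rewrite for this session only; subsequent queries
-- will scan the real users table instead of users_by_month
alter session set query_rewrite_enabled = false;
```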
---
based on SQL for Web Nerds
Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 04:31 PM
Here's a familiar simple example from the
complex queries chapter:
select user_id,
count(*) as how_many from bboard
where not exists (select 1 from bboard_authorized_maintainers
bam where bam.user_id =
bboard.user_id) and
posting_time + 60 > sysdate group by user_id order
by how_many desc;
Doesn't seem so simple, eh? How about if we rewrite it:
select user_id, count(*) as how_many
from bboard
where not exists (select 1 from
bboard_authorized_maintainers bam
where bam.user_id = bboard.user_id)
and posting_time + 60 > sysdate
group by user_id
order by how_many desc;
If your code isn't properly indented then you will never be able to
debug it. How can we justify using the word "properly"? After all, the
SQL parser doesn't take extra spaces or newlines into account.
Software is indented properly when the structure of the software is
revealed and when the indentation style is familiar to a
community of programmers.
Rules for All Queries
If it fits on one line, it is okay to leave on one line:
select email from users where user_id = 34;
If it doesn't fit nicely on one line, give each clause a separate line:
select *
from news
where sysdate > expiration_date
and approved_p = 't'
order by release_date desc, creation_date desc
If the stuff for a particular clause won't fit on one line, put a
newline immediately after the keyword that opens the clause. Then
indent the items underneath. Here's an example from the ArsDigita
Community System's static .html page administration section. We're
querying the
static_pages table, which holds a copy of any
.html files in the Unix file system:
select
to_char(count(*),'999G999G999G999G999') as n_pages,
to_char(sum(dbms_lob.getlength(page_body)),'999G999G999G999G999') as n_bytes
from static_pages;
In this query, there are two items in the select list, a count of all
the rows and a sum of the bytes in the
page_body column (of
type CLOB, hence the requirement to use
dbms_lob.getlength
rather than simply
length). We want Oracle to format these
numbers with separation characters between every three digits. For
this, we have to use the
to_char function and a mask of
'999G999G999G999G999' (the "G" tells Oracle to use the
appropriate character depending on the country where it is installed,
e.g., comma in the U.S. and period in Europe). Then we have to give the
results correlation names so that they will be easy to use as Tcl
variables. By the time we're done with all of this, it would be
confusing to put both items on the same line.
Here's another example, this time from the top-level comment
administration page for the ArsDigita Community System. We're going to
get back a single row with a count of each type of user-submitted
comment:
select
count(*) as n_total,
sum(decode(comment_type,'alternative_perspective',1,0)) as n_alternative_perspectives,
sum(decode(comment_type,'rating',1,0)) as n_ratings,
sum(decode(comment_type,'unanswered_question',1,0)) as n_unanswered_questions,
sum(decode(comment_type,'private_message_to_page_authors',1,0)) as n_private_messages
from comments
Notice the use of
sum(decode to count the number of each
type of comment. This gives us similar information to what we'd get
from a GROUP BY, but we get a sum total as well as category totals.
Also, the numbers come out with the column names of our choice. Of
course, this kind of query only works when you know in advance the
possible values of
comment_type.
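When you don't know the possible values in advance, the GROUP BY alternative returns one row per type instead of one wide row, and you lose the custom column names:

```sql
-- one row per distinct comment_type; no grand total row
select comment_type, count(*) as n_comments
from comments
group by comment_type;
```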
Rules for GROUP BY queries
When you're doing a GROUP BY, put the columns that determine the group
identity first in the select list. Put the aggregate columns that
compute a function for that group afterwards:
select links.user_id, first_names, last_name, count(links.page_id) as n_links
from links, users
where links.user_id = users.user_id
group by links.user_id, first_names, last_name
order by n_links desc
---
based on SQL for Web Nerds
Created by Anett Szabo, last modified by Anett Szabo 25 Jul 2007, at 04:29 PM
On its face, the relational database management system would appear to
be a very poor tool for representing and manipulating trees. This
chapter is designed to accomplish the following things:
- show you that a row in an SQL database can be thought of as an
object
- show you that a pointer from one object to another can be represented
by storing an integer key in a regular database column
- demonstrate the Oracle tree extensions (CONNECT BY ... PRIOR)
- show you how to work around the limitations of CONNECT BY with PL/SQL
The canonical example of trees in Oracle is the org chart.
create table corporate_slaves (
slave_id integer primary key,
supervisor_id references corporate_slaves,
name varchar(100)
);
insert into corporate_slaves values (1, NULL, 'Big Boss Man');
insert into corporate_slaves values (2, 1, 'VP Marketing');
insert into corporate_slaves values (3, 1, 'VP Sales');
insert into corporate_slaves values (4, 3, 'Joe Sales Guy');
insert into corporate_slaves values (5, 4, 'Bill Sales Assistant');
insert into corporate_slaves values (6, 1, 'VP Engineering');
insert into corporate_slaves values (7, 6, 'Jane Nerd');
insert into corporate_slaves values (8, 6, 'Bob Nerd');
SQL> column name format a20
SQL> select * from corporate_slaves;
SLAVE_ID SUPERVISOR_ID NAME
---------- ------------- --------------------
1 Big Boss Man
2 1 VP Marketing
3 1 VP Sales
4 3 Joe Sales Guy
6 1 VP Engineering
7 6 Jane Nerd
8 6 Bob Nerd
5 4 Bill Sales Assistant
8 rows selected.
The integers in the supervisor_id column are actually pointers to
other rows in the corporate_slaves table. Need to display
an org chart? With only standard SQL available, you'd write a program
in the client language (e.g., C, Lisp, Perl, or Tcl) to do the
following:
- query Oracle to find the employee
where supervisor_id is
null, call this $big_kahuna_id
- query Oracle to find those employees whose
supervisor_id = $big_kahuna_id
- for each subordinate, query Oracle again to find their subordinates.
- repeat until no subordinates found, then back up one level
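The per-step queries behind that client-side loop would look something like this (the bind-variable name is illustrative; your client API's placeholder syntax may differ):

```sql
-- step 1: find the root(s) of the org chart
select slave_id
from corporate_slaves
where supervisor_id is null;

-- steps 2 and 3, repeated once per node visited:
-- find the direct reports of the current boss
select slave_id, name
from corporate_slaves
where supervisor_id = :current_boss_id;
```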
With the Oracle CONNECT BY clause, you can get all the rows out at once:
select name, slave_id, supervisor_id
from corporate_slaves
connect by prior slave_id = supervisor_id;
NAME SLAVE_ID SUPERVISOR_ID
-------------------- ---------- -------------
Big Boss Man 1
VP Marketing 2 1
VP Sales 3 1
Joe Sales Guy 4 3
Bill Sales Assistant 5 4
VP Engineering 6 1
Jane Nerd 7 6
Bob Nerd 8 6
VP Marketing 2 1
VP Sales 3 1
Joe Sales Guy 4 3
Bill Sales Assistant 5 4
Joe Sales Guy 4 3
Bill Sales Assistant 5 4
VP Engineering 6 1
Jane Nerd 7 6
Bob Nerd 8 6
Jane Nerd 7 6
Bob Nerd 8 6
Bill Sales Assistant 5 4
20 rows selected.
This seems a little strange. It looks as though Oracle has produced all
possible trees and subtrees. Let's add a START WITH clause:
select name, slave_id, supervisor_id
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id in (select slave_id
from corporate_slaves
where supervisor_id is null);
NAME SLAVE_ID SUPERVISOR_ID
-------------------- ---------- -------------
Big Boss Man 1
VP Marketing 2 1
VP Sales 3 1
Joe Sales Guy 4 3
Bill Sales Assistant 5 4
VP Engineering 6 1
Jane Nerd 7 6
Bob Nerd 8 6
8 rows selected.
Notice that we've used a subquery in the START WITH clause to find out
who is/are the big kahuna(s). For the rest of this example, we'll just
hard-code in the
slave_id 1 for brevity.
Though these folks are in the correct order, it is kind of tough to tell
from the preceding report who works for whom. Oracle provides a magic
pseudo-column that is meaningful only when a query includes a CONNECT
BY. The pseudo-column is level:
select name, slave_id, supervisor_id, level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;
NAME SLAVE_ID SUPERVISOR_ID LEVEL
-------------------- ---------- ------------- ----------
Big Boss Man 1 1
VP Marketing 2 1 2
VP Sales 3 1 2
Joe Sales Guy 4 3 3
Bill Sales Assistant 5 4 4
VP Engineering 6 1 2
Jane Nerd 7 6 3
Bob Nerd 8 6 3
8 rows selected.
The
level column can be used for indentation. Here we will
use the concatenation operator (
||) to add spaces in front
of the name column:
column padded_name format a30
select
lpad(' ', (level - 1) * 2) || name as padded_name,
slave_id,
supervisor_id,
level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;
PADDED_NAME SLAVE_ID SUPERVISOR_ID LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man 1 1
VP Marketing 2 1 2
VP Sales 3 1 2
Joe Sales Guy 4 3 3
Bill Sales Assistant 5 4 4
VP Engineering 6 1 2
Jane Nerd 7 6 3
Bob Nerd 8 6 3
8 rows selected.
If you want to limit your report, you can use standard WHERE clauses:
select
lpad(' ', (level - 1) * 2) || name as padded_name,
slave_id,
supervisor_id,
level
from corporate_slaves
where level <= 3
connect by prior slave_id = supervisor_id
start with slave_id = 1;
PADDED_NAME SLAVE_ID SUPERVISOR_ID LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man 1 1
VP Marketing 2 1 2
VP Sales 3 1 2
Joe Sales Guy 4 3 3
VP Engineering 6 1 2
Jane Nerd 7 6 3
Bob Nerd 8 6 3
7 rows selected.
Suppose that you want people at the same level to sort alphabetically.
Sadly, the ORDER BY clause doesn't work so great in conjunction with
CONNECT BY:
select
lpad(' ', (level - 1) * 2) || name as padded_name,
slave_id,
supervisor_id,
level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1
order by level, name;
PADDED_NAME SLAVE_ID SUPERVISOR_ID LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man 1 1
VP Engineering 6 1 2
VP Marketing 2 1 2
VP Sales 3 1 2
Bob Nerd 8 6 3
Jane Nerd 7 6 3
Joe Sales Guy 4 3 3
Bill Sales Assistant 5 4 4
select
lpad(' ', (level - 1) * 2) || name as padded_name,
slave_id,
supervisor_id,
level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1
order by name;
PADDED_NAME SLAVE_ID SUPERVISOR_ID LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man 1 1
Bill Sales Assistant 5 4 4
Bob Nerd 8 6 3
Jane Nerd 7 6 3
Joe Sales Guy 4 3 3
VP Engineering 6 1 2
VP Marketing 2 1 2
VP Sales 3 1 2
SQL is a set-oriented language, in which the order of rows normally
carries no meaning. In the result of a CONNECT BY query, however, it
is precisely the order that conveys the tree structure, so sorting
with a conventional ORDER BY clause destroys the very information the
query was built to produce.
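In later Oracle releases (9i and up), the ORDER SIBLINGS BY clause sorts rows within each parent while preserving the hierarchy, which is usually what was wanted here; this wasn't available in the 8i-era examples above:

```sql
-- Oracle 9i and later only: keeps the tree shape,
-- alphabetizes among siblings at each level
select lpad(' ', (level - 1) * 2) || name as padded_name
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1
order siblings by name;
```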
JOIN doesn't work with CONNECT BY
If we try to build a report showing each employee and his or her
supervisor's name, we are treated to one of Oracle's few informative
error messages:
select
lpad(' ', (level - 1) * 2) || cs1.name as padded_name,
cs2.name as supervisor_name
from corporate_slaves cs1, corporate_slaves cs2
where cs1.supervisor_id = cs2.slave_id(+)
connect by prior cs1.slave_id = cs1.supervisor_id
start with cs1.slave_id = 1;
ERROR at line 4:
ORA-01437: cannot have join with CONNECT BY
We can work around this particular problem by creating a view:
create or replace view connected_slaves
as
select
lpad(' ', (level - 1) * 2) || name as padded_name,
slave_id,
supervisor_id,
level as the_level
from corporate_slaves
connect by prior slave_id = supervisor_id
start with slave_id = 1;
Notice that we've had to rename
level so that we didn't end
up with a view column named after a reserved word. The view works just
like the raw query:
select * from connected_slaves;
PADDED_NAME SLAVE_ID SUPERVISOR_ID THE_LEVEL
------------------------------ ---------- ------------- ----------
Big Boss Man 1 1
VP Marketing 2 1 2
VP Sales 3 1 2
Joe Sales Guy 4 3 3
Bill Sales Assistant 5 4 4
VP Engineering 6 1 2
Jane Nerd 7 6 3
Bob Nerd 8 6 3
8 rows selected.
but now we can JOIN it to the corporate_slaves table:
select padded_name, corporate_slaves.name as supervisor_name
from connected_slaves, corporate_slaves
where connected_slaves.supervisor_id = corporate_slaves.slave_id(+);
PADDED_NAME SUPERVISOR_NAME
------------------------------ --------------------
Big Boss Man
VP Marketing Big Boss Man
VP Sales Big Boss Man
Joe Sales Guy VP Sales
Bill Sales Assistant Joe Sales Guy
VP Engineering Big Boss Man
Jane Nerd VP Engineering
Bob Nerd VP Engineering
8 rows selected.
If you have sharp eyes, you'll notice that we've actually OUTER JOINed
so that our results don't exclude the big boss.
Select-list subqueries do work with CONNECT BY
Instead of the VIEW and JOIN, we could have added a subquery to the
select list:
select
lpad(' ', (level - 1) * 2) || name as padded_name,
(select name
from corporate_slaves cs2
where cs2.slave_id = cs1.supervisor_id) as supervisor_name
from corporate_slaves cs1
connect by prior slave_id = supervisor_id
start with slave_id = 1;
PADDED_NAME SUPERVISOR_NAME
------------------------------ --------------------
Big Boss Man
VP Marketing Big Boss Man
VP Sales Big Boss Man
Joe Sales Guy VP Sales
Bill Sales Assistant Joe Sales Guy
VP Engineering Big Boss Man
Jane Nerd VP Engineering
Bob Nerd VP Engineering
8 rows selected.
The general rule in Oracle is that you can have a subquery that returns
a single row anywhere in the select list.
Does this person work for me?
Suppose that you've built an intranet Web service. There are things
that your software should show to an employee's boss (or boss's boss)
that it shouldn't show to a subordinate or peer. Here we try to figure
out if the VP Marketing (#2) has supervisory authority over Jane Nerd
(#7):
select count(*)
from corporate_slaves
where slave_id = 7
and level > 1
start with slave_id = 2
connect by prior slave_id = supervisor_id;
COUNT(*)
----------
0
Apparently not. Notice that we start with the VP Marketing (#2) and
stipulate
level > 1 to be sure that we will never
conclude that someone supervises him or herself. Let's ask if the Big
Boss Man (#1) has authority over Jane Nerd:
select count(*)
from corporate_slaves
where slave_id = 7
and level > 1
start with slave_id = 1
connect by prior slave_id = supervisor_id;
COUNT(*)
----------
1
Even though Big Boss Man isn't Jane Nerd's direct supervisor, asking
Oracle to compute the relevant subtree yields us the correct result. In
the ArsDigita Community System Intranet module, we decided that this
computation was too important to be left as a query in individual Web
pages. We centralized the question in a PL/SQL procedure:
create or replace function intranet_supervises_p
(query_supervisor IN integer, query_user_id IN integer)
return varchar
is
n_rows_found integer;
BEGIN
select count(*) into n_rows_found
from intranet_users
where user_id = query_user_id
and level > 1
start with user_id = query_supervisor
connect by supervisor = PRIOR user_id;
if n_rows_found > 0 then
return 't';
else
return 'f';
end if;
END intranet_supervises_p;
/
show errors
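A Web page script can then ask the question in one short query; e.g., to check whether user #2 has supervisory authority over user #7:

```sql
-- returns 't' or 'f'
select intranet_supervises_p(2, 7) from dual;
```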
Family trees
What if the graph is a little more complicated than employee-supervisor?
For example, suppose that you are representing a family tree. Even
without allowing for divorce and remarriage, exotic South African
fertility clinics, etc., we still need more than one pointer for each
node:
create table family_relatives (
relative_id integer primary key,
spouse references family_relatives,
mother references family_relatives,
father references family_relatives,
-- in case they don't know the exact birthdate
birthyear integer,
birthday date,
-- sadly, not everyone is still with us
deathyear integer,
first_names varchar(100) not null,
last_name varchar(100) not null,
sex char(1) check (sex in ('m','f')),
-- note the use of multi-column check constraints
check ( birthyear is not null or birthday is not null)
);
-- some test data
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(1, 'Nick', 'Gittes', 'm', NULL, NULL, NULL, 1902);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(2, 'Cecile', 'Kaplan', 'f', 1, NULL, NULL, 1910);
update family_relatives
set spouse = 2
where relative_id = 1;
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(3, 'Regina', 'Gittes', 'f', NULL, 2, 1, 1934);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(4, 'Marjorie', 'Gittes', 'f', NULL, 2, 1, 1936);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(5, 'Shirley', 'Greenspun', 'f', NULL, NULL, NULL, 1901);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(6, 'Jack', 'Greenspun', 'm', 5, NULL, NULL, 1900);
update family_relatives
set spouse = 6
where relative_id = 5;
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(7, 'Nathaniel', 'Greenspun', 'm', 3, 5, 6, 1930);
update family_relatives
set spouse = 7
where relative_id = 3;
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(8, 'Suzanne', 'Greenspun', 'f', NULL, 3, 7, 1961);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(9, 'Philip', 'Greenspun', 'm', NULL, 3, 7, 1963);
insert into family_relatives
(relative_id, first_names, last_name, sex, spouse, mother, father, birthyear)
values
(10, 'Harry', 'Greenspun', 'm', NULL, 3, 7, 1965);
In applying the lessons from the employee examples, the most obvious
problem that we face now is whether to follow the mother or the father
pointers:
column full_name format a25
-- follow patrilineal (start with my mom's father)
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives
connect by prior relative_id = father
start with relative_id = 1;
FULL_NAME
-------------------------
Nick Gittes
Regina Gittes
Marjorie Gittes
-- follow matrilineal (start with my mom's mother)
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives
connect by prior relative_id = mother
start with relative_id = 2;
FULL_NAME
-------------------------
Cecile Kaplan
Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Here's what the official Oracle docs have to say about CONNECT BY:
specifies the relationship between parent rows and child rows of the
hierarchy. condition can be any condition as described in
"Conditions". However, some part of the condition must use the
PRIOR operator to refer to the parent row. The part of the condition
containing the PRIOR operator must have one of the following
forms:
PRIOR expr comparison_operator expr
expr comparison_operator PRIOR expr
There is nothing that says
comparison_operator has to be
merely the equals sign. Let's start again with my mom's father but
CONNECT BY more than one column:
-- follow both
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives
connect by prior relative_id in (mother, father)
start with relative_id = 1;
FULL_NAME
-------------------------
Nick Gittes
Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Instead of arbitrarily starting with Grandpa Nick, let's ask Oracle to
show us all the trees that start with a person whose parents are
unknown:
select lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name
from family_relatives
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
where mother is null
and father is null);
FULL_NAME
-------------------------
Nick Gittes
Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Cecile Kaplan
Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Shirley Greenspun
Nathaniel Greenspun
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Jack Greenspun
Nathaniel Greenspun
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
22 rows selected.
PL/SQL instead of JOIN
The preceding report is interesting but confusing because it is hard to
tell where the trees meet in marriage. As noted above, you can't do a
JOIN with a CONNECT BY. We demonstrated the workaround of burying the
CONNECT BY in a view. A more general workaround is using PL/SQL:
create or replace function family_spouse_name
(v_relative_id family_relatives.relative_id%TYPE)
return varchar
is
v_spouse_id integer;
spouse_name varchar(500);
BEGIN
select spouse into v_spouse_id
from family_relatives
where relative_id = v_relative_id;
if v_spouse_id is null then
return null;
else
select (first_names || ' ' || last_name) into spouse_name
from family_relatives
where relative_id = v_spouse_id;
return spouse_name;
end if;
END family_spouse_name;
/
show errors
column spouse format a20
select
lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
family_spouse_name(relative_id) as spouse
from family_relatives
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
where mother is null
and father is null);
FULL_NAME SPOUSE
------------------------- --------------------
Nick Gittes Cecile Kaplan
Regina Gittes Nathaniel Greenspun
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Cecile Kaplan Nick Gittes
Regina Gittes Nathaniel Greenspun
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Marjorie Gittes
Shirley Greenspun Jack Greenspun
Nathaniel Greenspun Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
Jack Greenspun Shirley Greenspun
Nathaniel Greenspun Regina Gittes
Suzanne Greenspun
Philip Greenspun
Harry Greenspun
PL/SQL instead of JOIN and GROUP BY
Suppose that in addition to displaying the family tree in a Web page, we
also want to show a flag when a story about a family member is
available. First we need a way to represent stories:
create table family_stories (
family_story_id integer primary key,
story clob not null,
item_date date,
item_year integer,
access_control varchar(20)
check (access_control in ('public', 'family', 'designated')),
check (item_date is not null or item_year is not null)
);
-- a story might be about more than one person
create table family_story_relative_map (
family_story_id references family_stories,
relative_id references family_relatives,
primary key (relative_id, family_story_id)
);
-- put in a test story
insert into family_stories
(family_story_id, story, item_year, access_control)
values
(1, 'After we were born, our parents stuck the Wedgwood in a cabinet
and bought indestructible china. Philip and his father were sitting at
the breakfast table one morning. Suzanne came downstairs and, without
saying a word, took a cereal bowl from the cupboard, walked over to
Philip and broke the bowl over his head. Their father immediately
started laughing hysterically.', 1971, 'public');
insert into family_story_relative_map
(family_story_id, relative_id)
values
(1, 8);
insert into family_story_relative_map
(family_story_id, relative_id)
values
(1, 9);
insert into family_story_relative_map
(family_story_id, relative_id)
values
(1, 7);
To show the number of stories alongside a family member's listing, we
would typically do an OUTER JOIN and then GROUP BY the columns other
than the
count(family_story_id). In order not to disturb
the CONNECT BY, however, we create another PL/SQL function:
create or replace function family_n_stories (v_relative_id family_relatives.relative_id%TYPE)
return integer
is
        n_stories integer;
begin
        select count(*) into n_stories
        from family_story_relative_map
        where relative_id = v_relative_id;
        return n_stories;
end family_n_stories;
/
show errors
select
lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
family_n_stories(relative_id) as n_stories
from family_relatives
connect by prior relative_id in (mother, father)
start with relative_id in (select relative_id from family_relatives
where mother is null
and father is null);
FULL_NAME N_STORIES
------------------------- ----------
Nick Gittes 0
...
Shirley Greenspun 0
Nathaniel Greenspun 1
Suzanne Greenspun 1
Philip Greenspun 1
Harry Greenspun 0
...
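For comparison, here is the OUTER JOIN / GROUP BY approach the PL/SQL function replaces, sketched with Python's sqlite3 standing in for Oracle. The tables follow the text; the sample rows are a minimal stand-in (Harry's id of 10 is hypothetical; ids 7, 8, and 9 follow the map-table inserts above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table family_relatives (
    relative_id integer primary key,
    first_names text,
    last_name   text
);
create table family_story_relative_map (
    family_story_id integer,
    relative_id     integer references family_relatives,
    primary key (relative_id, family_story_id)
);
insert into family_relatives values (7,  'Nathaniel', 'Greenspun');
insert into family_relatives values (8,  'Suzanne',   'Greenspun');
insert into family_relatives values (9,  'Philip',    'Greenspun');
insert into family_relatives values (10, 'Harry',     'Greenspun');
insert into family_story_relative_map values (1, 7);
insert into family_story_relative_map values (1, 8);
insert into family_story_relative_map values (1, 9);
""")

# The outer join keeps relatives with no stories; count(m.family_story_id)
# counts only matched rows, so story-less relatives get 0 instead of
# disappearing from the result.
rows = cur.execute("""
select r.first_names || ' ' || r.last_name as full_name,
       count(m.family_story_id) as n_stories
from family_relatives r
left outer join family_story_relative_map m
     on r.relative_id = m.relative_id
group by r.relative_id, r.first_names, r.last_name
order by r.relative_id
""").fetchall()
for full_name, n_stories in rows:
    print(full_name, n_stories)
```

This reproduces the counts shown above (1 for Nathaniel, Suzanne, and Philip; 0 for Harry); the catch, as the text notes, is mixing a GROUP BY like this with CONNECT BY in the same query.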
Working Backwards
What does it look like to start at the youngest generation and work back?
select
lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
family_spouse_name(relative_id) as spouse
from family_relatives
connect by relative_id in (prior mother, prior father)
start with relative_id = 9;
FULL_NAME SPOUSE
------------------------- --------------------
Philip Greenspun
Regina Gittes Nathaniel Greenspun
Nick Gittes Cecile Kaplan
Cecile Kaplan Nick Gittes
Nathaniel Greenspun Regina Gittes
Shirley Greenspun Jack Greenspun
Jack Greenspun Shirley Greenspun
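CONNECT BY is Oracle-specific; the same ancestor walk can be written as a standard recursive common table expression, which most other databases (and modern Oracle) accept. A sketch via Python's sqlite3, against a cut-down family_relatives table with hypothetical ids (only Philip's id of 9 comes from the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table family_relatives (
    relative_id integer primary key,
    name   text,
    mother integer references family_relatives,
    father integer references family_relatives
);
insert into family_relatives values (1, 'Cecile Kaplan', null, null);
insert into family_relatives values (2, 'Nick Gittes', null, null);
insert into family_relatives values (3, 'Shirley Greenspun', null, null);
insert into family_relatives values (4, 'Jack Greenspun', null, null);
insert into family_relatives values (5, 'Regina Gittes', 1, 2);
insert into family_relatives values (6, 'Nathaniel Greenspun', 3, 4);
insert into family_relatives values (9, 'Philip Greenspun', 5, 6);
""")

# Base case: Philip himself.  Recursive step: join each person already
# found to the rows for his or her mother and father, mirroring
# connect by relative_id in (prior mother, prior father).
rows = cur.execute("""
with recursive ancestors(relative_id, name, mother, father, depth) as (
    select relative_id, name, mother, father, 0
    from family_relatives
    where relative_id = 9
  union all
    select r.relative_id, r.name, r.mother, r.father, a.depth + 1
    from family_relatives r
    join ancestors a on r.relative_id in (a.mother, a.father)
)
select name, depth from ancestors
order by depth, relative_id
""").fetchall()
for name, depth in rows:
    print('  ' * depth + name)
```

The depth column plays the role of Oracle's level pseudo-column for indenting the listing.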
We ought to be able to view all the trees starting from all the leaves
but Oracle seems to be exhibiting strange behavior:
select
lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
family_spouse_name(relative_id) as spouse
from family_relatives
connect by relative_id in (prior mother, prior father)
start with relative_id not in (select mother from family_relatives
union
select father from family_relatives);
no rows selected
What's wrong? If we try the subquery by itself, we get a reasonable result. Here are all the relative_ids that appear in the mother or father column at least once:
select mother from family_relatives
union
select father from family_relatives
MOTHER
----------
1
2
3
5
6
7

7 rows selected.
The answer lies in that extra blank line at the bottom. There is a NULL
in this result set. Experimentation reveals that Oracle behaves
asymmetrically with NULLs and IN and NOT IN:
SQL> select * from dual where 1 in (1,2,3,NULL);
D
-
X
SQL> select * from dual where 1 not in (2,3,NULL);
no rows selected
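This asymmetry is not an Oracle quirk; it follows from SQL's three-valued logic, so other engines behave the same way. A quick check with Python's sqlite3:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# 1 matches a member of the list, so IN is true and the row survives
in_rows = cur.execute("select 1 where 1 in (1, 2, 3, null)").fetchall()
print(in_rows)

# no member equals 1, but the comparison 1 <> null is unknown, so the
# whole NOT IN predicate is unknown rather than true, and the row is
# filtered out
not_in_rows = cur.execute("select 1 where 1 not in (2, 3, null)").fetchall()
print(not_in_rows)
```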
The answer is buried in the Oracle documentation of NOT IN: "Evaluates
to FALSE if any member of the set is NULL." The correct query in this
case?
select
lpad(' ', (level - 1) * 2) || first_names || ' ' || last_name as full_name,
family_spouse_name(relative_id) as spouse
from family_relatives
connect by relative_id in (prior mother, prior father)
start with relative_id not in (select mother
from family_relatives
where mother is not null
union
select father
from family_relatives
where father is not null);
FULL_NAME SPOUSE
------------------------- --------------------
Marjorie Gittes
Nick Gittes Cecile Kaplan
Cecile Kaplan Nick Gittes
Suzanne Greenspun
Regina Gittes Nathaniel Greenspun
Nick Gittes Cecile Kaplan
Cecile Kaplan Nick Gittes
Nathaniel Greenspun Regina Gittes
Shirley Greenspun Jack Greenspun
Jack Greenspun Shirley Greenspun
Philip Greenspun
Regina Gittes Nathaniel Greenspun
Nick Gittes Cecile Kaplan
Cecile Kaplan Nick Gittes
Nathaniel Greenspun Regina Gittes
Shirley Greenspun Jack Greenspun
Jack Greenspun Shirley Greenspun
Harry Greenspun
Regina Gittes Nathaniel Greenspun
Nick Gittes Cecile Kaplan
Cecile Kaplan Nick Gittes
Nathaniel Greenspun Regina Gittes
Shirley Greenspun Jack Greenspun
Jack Greenspun Shirley Greenspun
24 rows selected.
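Another way to find the childless relatives without worrying about NULLs at all is NOT EXISTS, whose correlated subquery either finds a matching row or doesn't. A sketch via sqlite3, with a hypothetical miniature of the table in which Philip is the only person who never appears in a mother or father column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table family_relatives (
    relative_id integer primary key,
    name   text,
    mother integer,
    father integer
);
insert into family_relatives values (1, 'Cecile Kaplan', null, null);
insert into family_relatives values (2, 'Nick Gittes', null, null);
insert into family_relatives values (5, 'Regina Gittes', 1, 2);
insert into family_relatives values (9, 'Philip Greenspun', 5, null);
""")

# The subquery either finds a child row naming this person as a parent
# or it does not; a NULL parent column simply fails the equality test,
# so no "is not null" guards are needed.
leaves = [name for (name,) in cur.execute("""
select name from family_relatives r
where not exists (select 1 from family_relatives c
                  where c.mother = r.relative_id
                     or c.father = r.relative_id)
order by r.relative_id
""")]
print(leaves)
```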
Performance and Tuning
Oracle is not getting any help from the Tree Fairy in producing results
from a CONNECT BY. If you don't want tree queries to take O(N^2) time,
you need to build indices that let Oracle very quickly answer questions
of the form "What are all the children of Parent X?"
For the corporate slaves table, you'd want two concatenated indices:
create index corporate_slaves_idx1
on corporate_slaves (slave_id, supervisor_id);
create index corporate_slaves_idx2
on corporate_slaves (supervisor_id, slave_id);
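You can confirm that such an index actually gets used by asking the query planner. A sketch via sqlite3 (its EXPLAIN QUERY PLAN stands in for Oracle's EXPLAIN PLAN FOR; the cut-down corporate_slaves table here is a stand-in for the one defined earlier in the book):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table corporate_slaves (
    slave_id      integer primary key,
    supervisor_id integer references corporate_slaves,
    name          text
);
create index corporate_slaves_idx2
    on corporate_slaves (supervisor_id, slave_id);
""")

# ask the planner how it would find the children of supervisor 1
plan = cur.execute("""
explain query plan
select slave_id from corporate_slaves where supervisor_id = 1
""").fetchall()
print(plan)
```

The plan output should mention corporate_slaves_idx2, i.e. a "who reports to X?" question is answered by an index search rather than a full scan of the table at every level of the tree.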
Reference
---
Based on "SQL for Web Nerds" by Philip Greenspun