Sunday, October 26, 2008

Crew vs. DC, playoff thoughts

Today's match against D.C. United couldn't have been scripted any better. The same northwest goal post that denied two United attempts happily tucked a Brad Evans long-range shot into the back of the net for the only goal of the match. The massive supporters' groups got to enjoy the Supporters' Shield trophy, and the Columbus Crew knocked a hated rival with a long history of knocking us out of the playoffs, well, out of the playoffs. How sweet it is. Now it is on to the playoffs.

The first round of the MLS playoffs, the conference semifinals, is a home-and-away series decided on aggregate goals. The lower seed hosts the first game and the higher seed hosts the second. In my opinion, this does not give the advantage to the team that did the best during the regular season. At best the two teams are even, or a slight advantage goes to the lower-seeded team. Here is why.

The lower-seeded team can come out hard and early and usually get away with it. New England did this to us in 2004 by delivering some hard fouls early in the match to our playmakers and midfield anchors. Most notable was a Shalrie Joseph tackle on Simon Elliott that was worthy of an ejection but didn't even draw a caution; Elliott wasn't the same player after that tackle. (Joseph did the same thing to Jaime Moreno when they played D.C., by the way.)

With home-field advantage for the first game, the lower-seeded team has more control over the flow of the game, since the visiting (higher-seeded) team must usually sit back and feel the match out. The home-and-away series effectively turns into one two-half match, one half at home and one away. So if the home team can squeeze out a victory by a goal (or more), it still controls the flow of the series, because the second leg forces the higher-seeded team to score that many goals plus one (to avoid penalty kicks). The higher-seeded team will need to press for goals, opening itself up to counterattacks. I think this format benefits the "boring," defensive teams that rely on two or three highly skilled players up top, typically running a quick counterattack three on three or four.

The first round will feature Columbus vs. Kansas City and Chicago vs. New England in the east, and Houston vs. New York and Chivas USA vs. Real Salt Lake in the west. If I were New York I'd come out full throttle for the full 90 minutes and rack up as many goals as possible on that crap AstroTurf surface; otherwise I still think Houston will take that one in the long run. I would like to see Real Salt Lake get by Chivas USA, just to give Jason Kreis a pat on the back for his years of service to MLS, give the fans a pat on the back for being patient and getting a new stadium, and keep California out of the playoffs. In the east I think New England is too volatile and weakened to go up against a surging Chicago team. I also think the Crew will prevail against Kansas City, though we might have to do so on home turf.

Wednesday, October 08, 2008

Why SQL generators suck

Assume the following table:

CREATE TABLE craps
( craps_id NUMBER
, crap_size NUMBER
, crap_color VARCHAR2(16)
, crap_when_taken DATE DEFAULT SYSDATE
, crap_when_flushed DATE
, crap_rating NUMBER
, CONSTRAINT craps_pk PRIMARY KEY ( craps_id )
) storage_clauses, logging_clauses, santa_clauses ....

To get a list of craps with their size and color for a time interval, keyed by the time each was flushed or, if that is null (i.e. it was a proud poop that someone wanted everyone to see), the time it hit the bowl, the efficient SQL would be:

SELECT crap_size
, crap_color
, NVL(crap_when_flushed,crap_when_taken) crap_flushed_taken
FROM craps
WHERE NVL(crap_when_flushed,crap_when_taken) BETWEEN :1 and :2
ORDER BY crap_flushed_taken ASC NULLS FIRST;

Oracle would do a full table scan with efficient blocked I/O and then poop out the results. Oh, wait. We're using a canned, record-based, RDBMS-independent system that generates SQL that selects a list of primary keys and then reads each record by primary key. So we have something more like this:

SELECT craps_id
FROM craps
WHERE NVL(crap_when_flushed,crap_when_taken) BETWEEN :1 and :2
ORDER BY NVL(crap_when_flushed,crap_when_taken);

and for each selected primary key value:

SELECT crap_size
, crap_color
, NVL(crap_when_flushed,crap_when_taken) crap_flushed_taken
FROM craps
WHERE craps_id = :1;
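To make the cost difference concrete, here's a small sketch of the two access patterns using SQLite from Python (SQLite's COALESCE standing in for Oracle's NVL; the sample rows and date values are made up for illustration). Both paths return the same rows, but the generated pattern pays one round trip per key plus one for the key list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE craps (
    craps_id INTEGER PRIMARY KEY,
    crap_size INTEGER,
    crap_color TEXT,
    crap_when_taken TEXT,
    crap_when_flushed TEXT)""")
conn.executemany("INSERT INTO craps VALUES (?,?,?,?,?)", [
    (1, 5, "brown", "2008-10-01", "2008-10-02"),
    (2, 3, "green", "2008-10-03", None),        # proud poop, never flushed
    (3, 8, "brown", "2008-10-05", "2008-10-06"),
])

lo, hi = "2008-10-01", "2008-10-04"

# One round trip: the set-based query does all the work in the database.
set_based = conn.execute("""
    SELECT crap_size, crap_color,
           COALESCE(crap_when_flushed, crap_when_taken) AS crap_flushed_taken
    FROM craps
    WHERE COALESCE(crap_when_flushed, crap_when_taken) BETWEEN ? AND ?
    ORDER BY crap_flushed_taken""", (lo, hi)).fetchall()

# N+1 round trips: first the key list, then one query per key.
ids = [r[0] for r in conn.execute("""
    SELECT craps_id FROM craps
    WHERE COALESCE(crap_when_flushed, crap_when_taken) BETWEEN ? AND ?
    ORDER BY COALESCE(crap_when_flushed, crap_when_taken)""", (lo, hi))]
n_plus_one = [conn.execute("""
    SELECT crap_size, crap_color,
           COALESCE(crap_when_flushed, crap_when_taken)
    FROM craps WHERE craps_id = ?""", (i,)).fetchone() for i in ids]

print(len(ids) + 1, "round trips instead of 1")  # 3 round trips instead of 1
assert set_based == n_plus_one                   # same rows either way
```

The same rows come back either way; the difference is purely how many statements the database has to parse and execute to produce them.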

Oh, wait. That NVL on those two fields is business logic that needs to be separated from the data layer, so we create a function:

CREATE OR REPLACE FUNCTION 
crap_flushed_taken (i_crap_when_flushed DATE, i_crap_when_taken DATE)
RETURN DATE AS
BEGIN
RETURN NVL(i_crap_when_flushed,i_crap_when_taken);
END;

No, no, no, stupid. That layer shouldn't need to know to pass two dates in; it should work off the record/object primary key identifier:

CREATE OR REPLACE FUNCTION 
crap_flushed_taken (i_craps_id NUMBER)
RETURN DATE
AS
v_crap_when_flushed DATE;
v_crap_when_taken DATE;
BEGIN
SELECT crap_when_flushed, crap_when_taken
INTO v_crap_when_flushed, v_crap_when_taken
FROM craps
WHERE craps_id = i_craps_id;
RETURN NVL(v_crap_when_flushed,v_crap_when_taken);
END;
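Here's a hedged sketch, using SQLite from Python as a stand-in for the Oracle setup, of what happens once a key-based lookup function like this lands in a WHERE clause: the database has no choice but to evaluate it for every row. A dict lookup stands in for the function's hidden single-row SELECT, and a counter shows how many times it fires (the 100-row table and its date values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE craps (craps_id INTEGER PRIMARY KEY, "
             "crap_when_taken TEXT, crap_when_flushed TEXT)")
conn.executemany("INSERT INTO craps VALUES (?, ?, NULL)",
                 [(i, "2008-10-%02d" % (i % 28 + 1)) for i in range(1, 101)])

# Stand-in for the per-call "SELECT ... WHERE craps_id = :1" the real
# function issued; each dict hit represents one hidden single-row read.
row_cache = {cid: (fl, tk) for cid, tk, fl in
             conn.execute("SELECT * FROM craps")}

calls = 0
def crap_flushed_taken(craps_id):
    global calls
    calls += 1                      # every row pays for one hidden read
    flushed, taken = row_cache[craps_id]
    return flushed if flushed is not None else taken

conn.create_function("crap_flushed_taken", 1, crap_flushed_taken)

hits = conn.execute(
    "SELECT craps_id FROM craps "
    "WHERE crap_flushed_taken(craps_id) "
    "BETWEEN '2008-10-01' AND '2008-10-04'").fetchall()

print(calls, "function calls to filter", len(hits), "rows")
```

Because the optimizer can't see inside the function, it must call it once per row in the table just to find the handful of rows that match, and each of those calls hides a full single-row read.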

Actually, this is how they did it

CREATE OR REPLACE FUNCTION crap_flushed_taken (i_craps_id NUMBER)
RETURN DATE
AS LANGUAGE JAVA
NAME 'com.crap.name.name1.CrapFlushedTaken(java.lang.Integer) return java.lang.Date';

That's a Java class that basically does the exact same thing that little slice of PL/SQL was doing. The final, generated SQL statement went something along the lines of this:

SELECT craps_id
FROM (
SELECT q1.craps_id
FROM craps q1
WHERE NVL(crap_flushed_taken(q1.craps_id),TO_DATE(1,'YYYY')) >= TO_DATE('????????','YYYYMMDD')
AND NVL(crap_flushed_taken(q1.craps_id),TO_DATE(1,'YYYY')) <= TO_DATE('????????','YYYYMMDD')
) t0
LEFT JOIN craps t1 ON t0.craps_id = t1.craps_id
ORDER BY crap_flushed_taken(t1.craps_id) ASC NULLS FIRST, t1.craps_id ASC NULLS FIRST;

Then the list program read each one of the selected primary key values and did this:

SELECT crap_size
, crap_color
, crap_when_flushed
, crap_when_taken
FROM craps
WHERE craps_id = :1;

Holy slang for feces! Oracle does an index fast full scan on the ID, which it sees as the most efficient approach, because in the optimizer's sane mind nobody would ever do a single-row read of the same table inside a function. So for each primary key in the table (and since we can't predict the function's result, there's no way to limit the working set) we are doing at least three (the select, the order by, and the list program) random, entire-table reads through an index: read an index block, then look up the corresponding table block, with record locations that likely won't match the index's maintained order. Poop on me and call me a sundae...
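A back-of-the-envelope calculation makes the blowup concrete. The numbers here are assumptions for illustration, not measurements from the real system: a million-row table, 100 rows packed per data block, and two block reads (index plus table) per single-row lookup:

```python
# Assumed sizes -- illustrative only, not from the real system.
rows = 1_000_000          # rows in the craps table
rows_per_block = 100      # rows packed per data block

# Set-based plan: one full scan reads each block once, sequentially.
full_scan_reads = rows // rows_per_block

# Generated plan: every row costs an index block read plus a table block
# read, and the whole table is walked three times (select, order by,
# list program).
per_row_reads = 2
generated_reads = 3 * rows * per_row_reads

print(full_scan_reads)                        # 10000 sequential block reads
print(generated_reads)                        # 6000000 random block reads
print(generated_reads // full_scan_reads)     # 600x more I/O
```

And those six million reads are random single-block reads, which are far slower per block than the sequential multi-block reads of a full scan, so the real-world gap is even worse than the 600x raw count suggests.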

I tried to simulate the above with inline select statements, but the 10g optimizer was smart enough to figure it all out and still came up with a plan costed at five times the optimized plan for the proper, hand-tuned SQL. I could run the real thing to get actual statistics, but unfortunately it runs for nine hours. My efficient SQL above runs in 30 seconds.

There are options available. The first would be to rewrite the canned code and do the selection process with the optimized, hand-written SQL. Another option would be to create a function-based index (I haven't tried it yet); theoretically Oracle would then see that the SELECT matches the index and use it instead of reading the whole table one row at a time through the primary key index. (For Oracle to allow that, the function would have to be declared DETERMINISTIC.)
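SQLite happens to support the same idea as indexes on expressions, so here's a quick sketch of the principle (not Oracle syntax; in Oracle it would be a function-based index on a DETERMINISTIC function). Once the index on the COALESCE expression exists, EXPLAIN QUERY PLAN reports a search using the index rather than a full scan that evaluates the expression per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE craps (craps_id INTEGER PRIMARY KEY, "
             "crap_when_taken TEXT, crap_when_flushed TEXT)")

# Index the expression itself -- the analogue of a function-based index.
conn.execute("CREATE INDEX craps_flushed_taken_ix ON craps "
             "(COALESCE(crap_when_flushed, crap_when_taken))")

# The predicate must use the exact indexed expression for the planner
# to match it to the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT craps_id FROM craps "
    "WHERE COALESCE(crap_when_flushed, crap_when_taken) "
    "BETWEEN '2008-10-01' AND '2008-10-04'").fetchall()

for row in plan:
    print(row[-1])   # expect a SEARCH ... USING INDEX craps_flushed_taken_ix
```

The catch in Oracle is the same as SQLite's: the indexed expression (or function) must be deterministic, and the query's predicate has to match it exactly, or the optimizer falls back to evaluating it row by row.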

I know this is probably an extreme example (or one of many), but I think the same thoughts can be applied to code generators in general and to the object-oriented purists who believe no business logic whatsoever belongs in the data tier.