[SQL-Fu] group expansion, does such a feature exist?

Rolf N

Recurring Membmare
Veteran
Suppose I have a DB table with three columns: transaction id, customer id, and date (the actual implementation has 30+ columns, but the further details aren't relevant).

I'm displaying this data in a group ordering. A group contains all transactions with the same customer id, ordered by date.
But I'm also sorting the groups. For the date sort, I take the min/max of the date column. This is easily accomplished with SQL's GROUP BY clause and aggregates. So far so good, I thought.

However, a SELECT MAX(date),transaction_id,(...) FROM table GROUP BY customer_id collapses all the rows within a group into one. Multiple transactions per customer are no longer displayed, and the one row that is returned is picked effectively at random from within the group. Of course this won't do!

My first stab was to essentially run the same query twice in one SELECT statement, once as a subquery with a GROUP BY clause, and again as the outer query without GROUP BY. Then I LEFT JOIN the query with itself ... which sounds insane, until you realize that this join re-attaches the aggregate to all the rows that the GROUP BY clause collapsed in the inner query.
As a first "optimization" I hoisted some of the per-group constants (anything in the GROUP BY clause, basically) out of the outer query into the subquery, because the subquery is the one of the two joined tables that changes at the lower frequency.

Something like this:
PHP:
$qc="SELECT ang.cust_id,ang.id,ang.datum,pd.name,pd.vorname,ang.type_code,sq.gtstamp".
	" FROM".
		" ((angebote AS ang LEFT JOIN personendaten AS pd ON pd.ID=ang.cust_id)".
		" LEFT JOIN".
			" (SELECT UNIX_TIMESTAMP(MAX(datum)) AS gtstamp,cust_id,type_code".
			" FROM angebote".
			" GROUP BY cust_id,type_code AS sq".
		" ON sq.cust_id=ang.cust_id AND sq.type_code=ang.type_code".
	" WHERE ".implode(" AND ",array_prepend($filter_crit,'ang.')).
	" ORDER BY ".implode(",",$order_crit);
(code may be syntactically unsound, because I've tried to simplify/change identifiers to protect the identity of the database holder)

You see that the inner query SELECTs cust_id and type_code, which are also SELECTed by the outer query, both times from the same database table, only to be able to JOIN inner and outer using these two fields. Again, the only thing this subquery and join accomplish is smearing the single aggregate MAX across all rows that the outer query would already produce by itself.
I find this approach messy, and inefficient.

To restate the goal: I'm looking for a way to insert aggregate values into each row of a database result, without reducing the number of rows.

The database runs on MySQL 5.1.somethingsomething. MySQL-specific extensions are fair game.
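If such a feature existed, I'd imagine the syntax looking something like this. Purely a hypothetical sketch of the ideal, mind you; nothing I know of in MySQL 5.1 does this:
Code:
SELECT transaction_id,
       customer_id,
       date,
       MAX(date) OVER (PARTITION BY customer_id) AS max_date -- per-group max on every row, no collapsing
FROM table
ORDER BY customer_id, date;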

Anyone?
 
It's too complex for me to grasp immediately without playing with the data, but what I generally do in these instances is take the problem apart and solve each smaller problem in a separate view first. Then I perform queries on these views to get closer to my goal. That usually works out quickly enough.
 
Basically, from this database table:
Code:
cust_id      date        other_stuff
   1      Aug-10-2010      XY
   1      Aug-20-2010      ABC
   1      Aug-05-2010     (unique per row)
   2      Sep-10-2010      ...
   3      Jul-28-2010      ...
   3      Jul-19-2010      ...

... I want to fetch a result set that looks like this:
Code:
   \     cust_id      date       MAX(date)     other stuff
1           1      Aug-10-2010  Aug-20-2010      XY
2           1      Aug-20-2010  Aug-20-2010      ABC
3           1      Aug-05-2010  Aug-20-2010     (unique per row)
4           2      Sep-10-2010  Sep-10-2010      ...
5           3      Jul-28-2010  Jul-28-2010      ...
6           3      Jul-19-2010  Jul-28-2010      ...
The MAX(date) column should be generated on the fly by the SQL server.

A straightforward "SELECT *,MAX(date) FROM table GROUP BY cust_id" would yield rows 1, 4 and 5 in the right format, but omit the others.
GROUP BY is necessary for the MAX function to work and can't be omitted.
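For completeness, the join-back construction from my opening post, boiled down to this toy table, would be something like the following sketch (identifiers simplified to match the example above):
Code:
SELECT t.cust_id, t.date, g.max_date, t.other_stuff
FROM table AS t
LEFT JOIN (SELECT cust_id, MAX(date) AS max_date -- one row per group
           FROM table
           GROUP BY cust_id) AS g
       ON g.cust_id = t.cust_id                  -- smears the MAX onto every row
ORDER BY t.cust_id, t.date;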
 
What does the MAX function do?
It picks the maximum ... from a group of rows. What constitutes a group is controlled by the GROUP BY clause. Look at post #3: in that theoretical result table, everything with the same cust_id is supposed to be a group (rows 1-3 are one group, row 4 is its own group, rows 5 & 6 are the third group). MAX(date) selects the highest (latest) date from each of those groups.
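Applied to the toy table from post #3, a minimal example would be the following, which produces one output row per group (exactly the collapsing I'm fighting):
Code:
SELECT cust_id, MAX(date) AS latest -- latest date within each cust_id group
FROM table
GROUP BY cust_id;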
 
I'm not really familiar with MySQL SQL syntax, as my last encounter was some six years ago, but from the result you posted it looks to me like you need the grouping clause only to get the max offer date. Can't MySQL do subqueries like this:
Code:
SELECT
  ang.cust_id,
  ang.datum,
  (SELECT TOP 1 datum FROM angebote WHERE cust_id = ang.cust_id ORDER BY datum DESC) AS max_datum
FROM
  angebote ang,
  personendaten pd
WHERE
  pd.id = ang.cust_id
ORDER BY ...
 
MySQL doesn't understand the top keyword, but I also didn't know you could refer to outer-query result columns in the inner-query where clause. That's ... interesting.
So the trick is to limit the inner query to exactly one row each, and avoid the join.
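In MySQL that would presumably be spelled with LIMIT instead of TOP; or, since no explicit ordering is needed then, with a plain correlated MAX. An untested sketch:
Code:
SELECT ang.cust_id,
       ang.datum,
       (SELECT datum FROM angebote
        WHERE cust_id = ang.cust_id
        ORDER BY datum DESC
        LIMIT 1) AS max_datum            -- LIMIT 1 in place of TOP 1
FROM angebote AS ang
LEFT JOIN personendaten AS pd ON pd.ID = ang.cust_id
ORDER BY ang.cust_id, ang.datum;         -- stand-in for the real sort criteria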

Then again, this essentially starts a new tiny sub-query for each row of the outer query, whereas the join variant (inner query independent of the outer query, many rows in the inner result) consists of two bigger queries fused into one. I'll have to profile this to see which is easier for the server to do.

Thanks :)
 
Glad I could help a bit. Of course it won't be blazingly fast, but that depends on the amount of data you need to display and the frequency of the queries. Apart from that, it's usually a good idea, I think, to make it work first and then make it fast.
 
So in this case I would first have made a view that selects the max date per customer id, and then select from it / join it to the first table. This can also be more efficient, and it is basically what N00b does in the subquery (which is a fine way of doing this, incidentally, and probably what I would have done first had I tried to put it into one query).
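Something along these lines; the view name max_dates is made up for illustration:
Code:
CREATE VIEW max_dates AS
  SELECT cust_id, MAX(datum) AS max_datum -- one row per customer
  FROM angebote
  GROUP BY cust_id;

SELECT ang.*, md.max_datum                -- every original row, plus its group's max
FROM angebote AS ang
LEFT JOIN max_dates AS md ON md.cust_id = ang.cust_id;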

Of course, depending on how you process this information you can optimise further: if you order the results from your original table by customer id, date desc, then the first date you process for each customer is also your max date ...
 
Turns out that the straight single-row subquery (w/o join) is much, much slower than what I had been using so far. The query takes around 200 times longer on the production dataset (a few thousand rows), at which point the page takes more than a minute to load. It seems the sheer number of elemental queries is the overriding factor here -- the MySQL optimizer doesn't know how to detect and fold this case.

Arwin,
exploring what can be done here with views seems to be a great idea. As I understand it, those are like proper state machines, where updates are resolved only at the time of a request with dirty source data, i.e. at a much lower frequency than what I have now, where I essentially nest a work-alike into my query every time. The view would give me the same data, but in a persistent, reusable way. Never used one before, but now I will certainly try ;)

Thanks for the pointer.

edit: oh, and as for the sorting by date, I can't rely on that, because the page allows the user to sort by any number of criteria. I could optimize the [strike]one case where date is the primary sort[/strike] cases where date is the lowest-order sort key, but I'd rather not clutter the thing with edge cases while it's not even working properly. Premature optimization and all that.
Consider the case where one customer has activities in January and April, and the other in February and March. If I sort by straight "date,cust_id", the second customer's rows will pop out in between the first one's, disrupting the group order. Only if I sort by "cust_id,[maybe other keys,]date" can I make this work.
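This is where the smeared-in MAX earns its keep for sorting: order by the per-group aggregate first, then by the grouping key to keep each group contiguous, then by date within the group. A sketch using the identifiers from my query earlier (derived table simplified to cust_id only):
Code:
SELECT ang.cust_id, ang.datum, sq.gtstamp
FROM angebote AS ang
LEFT JOIN (SELECT cust_id, MAX(datum) AS gtstamp
           FROM angebote
           GROUP BY cust_id) AS sq ON sq.cust_id = ang.cust_id
ORDER BY sq.gtstamp DESC, -- groups sorted by their latest date
         ang.cust_id,     -- keeps each customer's rows contiguous
         ang.datum;       -- rows within a group ordered by date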
 
Last edited by a moderator:
Conclusion: setting up a view that's fully equivalent to the original sub-query turns out slower, which seems counter-intuitive to me. Given the potential for caching and reusing the view's data as long as the underlying source table does not change, it should have been faster, IMO.

I've found much more promising results tuning the database indices. The one existing naive single-column index doesn't help with any query in the entire project as far as I can see. It's never used for joins, nor for ORDER BYs.
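"Tuned indexing" below means composite indices matching the join and sort columns. A hypothetical example of the kind of index I mean (the actual column choice depends on the workload):
Code:
-- composite index covering the grouping/join columns plus the date
-- (column choice is an assumption for illustration)
ALTER TABLE angebote ADD INDEX idx_cust_type_datum (cust_id, type_code, datum);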

Here are the times I took:
Code:
                 | naive indexing | tuned indexing
-----------------+----------------+----------------
 joined subquery |     0.89s      |     0.59s
 joined view     |     1.04s      |     0.82s
 inline subquery |  minutes ...   |
Edit:
Well duh! I've replicated the WHERE clause from the outer query into the joined subquery, and it's now down to 0.0375 seconds with the tuned index setup ... a nice 15x speedup. That's good enough now IMO.
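Concretely, the derived table now pre-filters on the same conditions as the outer query, so it aggregates far fewer rows. Schematically, with a made-up type_code filter standing in for my real $filter_crit:
Code:
SELECT ang.cust_id, ang.datum, sq.gtstamp
FROM angebote AS ang
LEFT JOIN (SELECT UNIX_TIMESTAMP(MAX(datum)) AS gtstamp, cust_id, type_code
           FROM angebote
           WHERE type_code = 'X'   -- replicated outer filter (example value)
           GROUP BY cust_id, type_code) AS sq
       ON sq.cust_id = ang.cust_id AND sq.type_code = ang.type_code
WHERE ang.type_code = 'X'          -- the original outer filter
ORDER BY ang.cust_id, ang.datum;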
 
Oh yes, indexes are always the first thing to look at for serious speed increases. Just a little while ago I noticed a colleague of mine had split a problem up into several tables to keep it fast, whereas when I faced the same problem at another customer, adding just a single index fixed the issue in 30 seconds, with no code rewriting or reconfiguration necessary.

But views work as you say, and can be a big help in keeping things both transparent and efficient. I like them because they are great for testing and development: you can view their results, query them, and combine them with other views or queries very easily.

 
I admire your dedication, but no, I'm well past this particular solution :)
What you've posted is one part of my original approach, often referred to in this thread as the subquery, or inner query.

The issue is that, again, this type of query reduces the number of rows in the result. You only get "the one" row with the most recent date from each group, while all the other rows, which hold interesting data as well, are omitted.

The question was whether or not there's a way to not omit these rows in the first place, instead of the roundabout way of reconstructing them by external means. I'm now convinced that the answer is no: these omissions are automatic and unavoidable. Reconstructing the remaining data as efficiently as possible is the best one can achieve in this class of problem.
 
Can you get all the transactions per customer_id and then order them by date, negating the use of the MAX function?

totally random guess
SELECT transaction_id,(...) FROM table GROUP BY customer_id, MAX(date)

edit:
ah, brain wave: you're only getting 1 transaction per customer because you are saying "only display the last transaction from each customer, and order them by customer"

( MAX(date) ) will only return the last transaction, not all transactions up until the max date

can you do something like:

group by customer_id, sort by date

or, if you want all transactions that happened on the max date, can you do something like
get max date
if date = max date then display the record
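In SQL that last idea would presumably look something like the sketch below (untested); note it filters down to only the max-date rows, rather than keeping all rows with the max attached:
Code:
SELECT t.*
FROM angebote AS t
WHERE t.datum = (SELECT MAX(datum)       -- this customer's latest date
                 FROM angebote
                 WHERE cust_id = t.cust_id);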
 