[Previous] damn title field | Home | [Next] neat cs trick

Comments (10)

Keyset pagination vs offset pagination in SQL

http://use-the-index-luke.com/no-offset

This article describes some performance problems associated with using the OFFSET SQL keyword for pagination. It also describes a more efficient, but more complicated, approach which the author calls "keyset pagination". Keyset pagination involves selecting only the items that are come after the last item shown on the previous page (in whatever sort order you are using). For example, instead of *OFFSET 30*, where 29 is the item number of the last result on the previous page, you would write *where (sale.date, sale.id) > (?, ?)*, in which case the placeholders represent the date and id of the last sale on the previous page.

Some things to be aware of:

- The SQL for keyset pagination is more complicated than the SQL for offset pagination.

- Most frameworks support only offset pagination, not keyset pagination.

- A SQL query for keyset pagination must be deterministic; that is, it must return the same rows no matter what execution plan is used. This can be achieved by including some unique identifier (e.g. sale.id) in the comparison.

- To simplify the comparison for keyset pagination, you can use a neat SQL feature called "row values" (supported in Postgres and SQLite, among other databases). A row value lets you combine multiple values into a single thing that can be compared like any other value. For instance you can write: *where (sale.date, sale.id) <= (?, ?)*. This combines sale.date and sale.id into a single tuple that can be compared to the given variables. (5, 7) <= (5, 8), but (5, 13) > (5,8).

- To paginate backward you need to change the order of the comparison operator(s).

- You can't easily fetch arbitrary pages with keyset pagination the way you can with offset pagination.

Josh Jordan at 4:26 PM on November 28, 2016 | #7753
why would you WHERE on *both* date and id?

Anonymous at 7:07 PM on November 28, 2016 | #7754

Removing watermarks from PDF documents under OS X

I wrote a script to remove watermarks from PDF documents. The exact code for the removal depends on the way the watermark is done, but the general outline may be helpful.

http://pastebin.com/YngtwwXK

Alisa at 2:36 AM on November 29, 2016 | #7755

Keyset pagination vs offset pagination in SQL

> why would you WHERE on *both* date and id?

In my example, id is included because the pagination boundary isn't guaranteed to exactly correspond to changes in the date.

I was assuming that you're paginating through sales in date order, and that the date field was just a date (no time). If you have more fine-grained timestamps that are unique (i.e. no two sales have the exact same timestamp), then you could just use that instead.

To flesh out my example a bit, say your first sale was on 2016-10-01. Suppose you had 15 sales that day, and you're showing 10 sales per page. The SQL query for the first page would have no filter (i.e. no "where" clause), it'd just be:

select * from sale order by date, id limit 10

Say the ID of the 10th sale (the last one) returned from that query is 12345. That ID gets fed back in to the query for page 2, so that page 2 includes the last 5 sales on 2016-10-01 as well as the next 5 sales made after that. The query for the second page looks like

select * from sale where (date,id) > ('2016-10-01', 12345) order by date, id limit 10.

Josh Jordan at 2:45 AM on November 29, 2016 | #7756
ok i get why not to do date only if it's just a date not a datetime. but if they are in the correct order by ID, why not just WHERE on ID without date?

Anonymous at 11:57 AM on November 29, 2016 | #7758

Keyset pagination vs offset pagination in SQL

> ok i get why not to do date only if it's just a date not a datetime. but if they are in the correct order by ID, why not just WHERE on ID without date?

If "they are in the correct order by ID" means that id1 >= id2 just when date1 >= date2, then yes, your idea would work. However, that wasn't assumed in my example and it might not be the case. For example suppose that ...

- you are paginating sales in date order, with 10 items per page

- sale.ID is an auto-incrementing integer

- you insert 10 sales for the date 2016-10-02, and then subsequently you insert 10 sales for the date 2016-10-01

In this case, the 2016-10-01 sales would all have higher IDs than the 2016-10-02 sales, so just filtering with WHERE ID > ? would incorrectly exclude the 2016-10-02 sales from showing up on page 2.

Josh Jordan at 4:15 PM on November 29, 2016 | #7760
i think the case you're talking about (IDs out of order, but never out of order across day boundaries) is rare and shouldn't be assumed when trying to present an approach for general use.

to take this blog as a typical example, IDs are occasionally out of chronological order and when they are it can be by multiple days. and even if the IDs are only out of order by a few minutes, those minutes can go over a day boundary.

Anonymous at 4:49 PM on November 29, 2016 | #7761
> i think the case you're talking about (IDs out of order, but never out of order across day boundaries) is rare and shouldn't be assumed when trying to present an approach for general use.

I agree. Maybe date wasn't a good example to show why ID is necessary. How about this: people want to paginate on all sorts of things that won't correspond to ID. What if you view your sales in order by amount? The amount won't correspond to the ID, and yet you'll still need to include ID in the where clause to break ties. Another example would be when you want to order by multiple fields.

The general principle is that for keyset pagination, your query needs to be deterministic. That is, it needs to return the same results under all possible execution plans. One way to make the query deterministic is to add a column that makes the results unique and include that column in the sort order and where clause.

Anonymous at 5:28 PM on November 29, 2016 | #7762
i'd rather break ties on the purchase amounts by datetime, not ID. but yes i get the concept of using ID as basically a random but unchanging (never rerolled) tiebreaker. the original example involving chronology confuses this.

Anonymous at 5:31 PM on November 29, 2016 | #7763

Keyset pagination vs offset pagination in SQL

> The general principle is that for keyset pagination, your query needs to be deterministic.

Actually, this is true for OFFSET pagination as well.

Josh Jordan at 11:04 AM on November 30, 2016 | #7765

What do you think?

(This is a free speech zone!)