Paginating Web Data with Twitter-Style Cursors


Tuesday, Jan 19th, 2010 by Tim | Tags: mysql, performance, twitter

This post looks at how different ways of implementing pagination in a web application differ in performance.

Taking MySQL as an example, the paging feature shown in the original post's figure is implemented like this:

select * from msgs where thread_id = ? limit ?, ?  -- bind offset = page * count, then row count = count
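
As a minimal sketch of the page/count style, the query can be wrapped as below. It uses Python's built-in sqlite3 in place of MySQL so that it runs anywhere; the `body` column, the `make_demo_db` helper, and the 0-based `page` convention are assumptions made for illustration, not part of the original post.

```python
import sqlite3

def make_demo_db(rows=25):
    """Build a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE msgs (id INTEGER PRIMARY KEY, thread_id INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO msgs (id, thread_id, body) VALUES (?, ?, ?)",
        [(i, 1, f"msg {i}") for i in range(1, rows + 1)],
    )
    return conn

def fetch_page(conn, thread_id, page, count):
    """Return one page of a thread using LIMIT/OFFSET (page is 0-based)."""
    offset = page * count  # computed in the application, then bound as a parameter
    return conn.execute(
        "SELECT id, body FROM msgs WHERE thread_id = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (thread_id, count, offset),
    ).fetchall()

if __name__ == "__main__":
    conn = make_demo_db()
    print(fetch_page(conn, thread_id=1, page=2, count=10))
```

The offset grows with the page number, so the database has to step over more and more rows before it can return the ones requested.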

When reading the Twitter API, however, we find that many endpoints use a cursor instead of the more intuitive page and count parameters. The followers ids endpoint is one example:

URL:

http://twitter.com/followers/ids.format

Returns an array of numeric IDs for every user following the specified user.

Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned to in the response body’s next_cursor and previous_cursor attributes to page back and forth in the list.
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
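
From a client's point of view, the paging loop described above might look like the following sketch. It makes no network calls: `fake_fetch` is a hypothetical stand-in for the HTTP request, the `ids` key is an assumed response field, and treating a `next_cursor` of 0 as "no more pages" is an assumption that the quoted documentation does not state.

```python
def collect_ids(fetch_page):
    """Follow next_cursor from -1 until the server signals the end."""
    ids = []
    cursor = -1  # -1 begins paging, per the documentation quoted above
    while True:
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # assumption: 0 means there is no further page
            return ids

def fake_fetch(cursor):
    """Hypothetical stand-in for a call to followers/ids with ?cursor=..."""
    pages = {
        -1: {"ids": [11, 12], "next_cursor": 1300, "previous_cursor": 0},
        1300: {"ids": [13], "next_cursor": 0, "previous_cursor": -1300},
    }
    return pages[cursor]

if __name__ == "__main__":
    print(collect_ids(fake_fetch))
```

The client never interprets the cursor value; it only hands back whatever the previous response returned.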


As the description shows, the http://twitter.com/followers/ids.xml call takes a cursor parameter for paging rather than the traditional url?page=n&count=n form. What is the advantage? Does each cursor keep a snapshot of the data set as it was at that moment, so that a result set changing in real time cannot produce duplicate entries across pages?
In the Google Groups discussion "Cursor Expiration", Twitter architect John Kalucki wrote:

A cursor is an opaque deletion-tolerant index into a Btree keyed by source
userid and modification time. It brings you to a point in time in the
reverse chron sorted list. So, since you can’t change the past, other than
erasing it, it’s effectively stable. (Modifications bubble to the top.) But
you have to deal with additions at the list head and also block shrinkage
due to deletions, so your blocks begin to overlap quite a bit as the data
ages. (If you cache cursors and read much later, you’ll see the first few
rows of cursor[n+1]’s block as duplicates of the last rows of cursor[n]’s
block. The intersection cardinality is equal to the number of deletions in
cursor[n]’s block). Still, there may be value in caching these cursors and
then heuristically rebalancing them when the overlap proportion crosses some
threshold.

In another thread, "new cursor-based pagination not multithread-friendly", John added:

The page based approach does not scale with large sets. We can no
longer support this kind of API without throwing a painful number of
503s.

Working with row-counts forces the data store to recount rows in an O
(n^2) manner. Cursors avoid this issue by allowing practically
constant time access to the next block. The cost becomes O(n/
block_size) which, yes, is O(n), but a graceful one given n < 10^7 and
a block_size of 5000. The cursor approach provides a more complete and
consistent result set.

Proportionally, very few users require multiple page fetches with a
page size of 5,000.

Also, scraping the social graph repeatedly at high speed is could
often be considered a low-value, borderline abusive use of the social
graph API.
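
The figures in the quote can be checked with a rough cost model. The sketch below assumes that an offset-based store walks past every skipped row on each request, while a cursor-based store seeks directly to the start of the next block; both are simplifications, and the function names are made up for illustration.

```python
def rows_touched_offset(n, block_size):
    """Rows walked when every page is fetched with an offset and a block size."""
    total = 0
    for offset in range(0, n, block_size):
        # the store skips `offset` rows, then reads up to block_size more
        total += offset + min(block_size, n - offset)
    return total

def rows_touched_cursor(n, block_size):
    """Rows read when each request seeks straight to the cursor position."""
    return n

def block_fetches(n, block_size):
    """Number of requests needed to read the whole list."""
    return -(-n // block_size)  # ceiling division

if __name__ == "__main__":
    n, block_size = 10**7, 5000
    print(rows_touched_offset(n, block_size))
    print(rows_touched_cursor(n, block_size))
    print(block_fetches(n, block_size))
```

With n = 10^7 and a block size of 5000, this model gives 10,005,000,000 rows walked for offset paging against 10,000,000 for cursor paging, over 2,000 block fetches in both cases.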

These two passages make it clear that, for large result sets, the main purpose of the cursor approach is a large gain in performance. Taking MySQL as the example again: when paging to row 100,000 without a cursor, the SQL is

select * from msgs limit 100000, 100

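On a table with a million rows, the first execution of this statement takes more than 5 seconds.
If we use the value of the table's primary key as cursor_id, the SQL for cursor-based paging can be optimized to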

select * from msgs where id > cursor_id order by id limit 100;

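On the same table this usually takes under 100 ms, which is dozens of times faster. For more on the performance differences of MySQL LIMIT, see a rather immature article I wrote three years ago, "The Performance Problem with MySQL LIMIT".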
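
A minimal sketch of cursor paging on the primary key follows. It uses Python's built-in sqlite3 rather than MySQL, and the `body` column, the demo data with deleted rows, and the convention that ids are positive (so 0 can start the scan) are all assumptions for illustration.

```python
import sqlite3

def make_demo_db():
    """In-memory table with gaps in the id sequence (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO msgs (id, body) VALUES (?, ?)",
        [(i, f"msg {i}") for i in range(1, 251)],
    )
    conn.execute("DELETE FROM msgs WHERE id % 7 = 0")  # simulate deletions
    return conn

def fetch_after(conn, cursor_id, limit=100):
    """Return (rows, next_cursor); next_cursor is None after a short page."""
    rows = conn.execute(
        "SELECT id, body FROM msgs WHERE id > ? ORDER BY id LIMIT ?",
        (cursor_id, limit),
    ).fetchall()
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor

def iterate_all(conn, limit=100):
    """Walk the whole table page by page, carrying the cursor forward."""
    cursor_id = 0  # assumption: ids are positive integers
    while cursor_id is not None:
        rows, cursor_id = fetch_after(conn, cursor_id, limit)
        yield from rows

if __name__ == "__main__":
    print(sum(1 for _ in iterate_all(make_demo_db())))
```

Because each request starts from the last id seen rather than from a row count, deleted rows do not shift the later pages.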

Conclusion

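For paging through large data sets in a web application, the cursor approach is worth adopting. Its drawback is that pages must be read in sequence; you cannot jump straight to an arbitrary page.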

pi1ot says:

Jan 19th 2010 at 23:22

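In real applications the trouble usually comes from the filter conditions that sit between where and limit, such as status = pass. Once the matching rows are no longer contiguous, the cursor does not work so well.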

fff says:

Jan 20th 2010 at 13:26

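The commenter above probably means order by, that is, the case where the rows are not sorted by id.
To jump to a page, you could add one extra computation that fetches the appropriate cursor; that is probably still relatively simple.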
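
One possible reading of this suggestion is sketched below, using Python's built-in sqlite3: a first query looks up the id of the last row before the target page, and the result is then used as the cursor. The demo data, the 0-based `page`, and the convention that 0 precedes every id are assumptions; the lookup still skips rows, but it only has to read primary key values.

```python
import sqlite3

def make_demo_db():
    """In-memory table whose ids are 10, 20, ..., 500 (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO msgs (id, body) VALUES (?, ?)",
        [(i * 10, f"msg {i}") for i in range(1, 51)],
    )
    return conn

def cursor_for_page(conn, page, count):
    """Return the cursor_id for a 0-based page, or None if the page is out of range."""
    if page == 0:
        return 0  # assumption: ids are positive, so 0 precedes every row
    row = conn.execute(
        "SELECT id FROM msgs ORDER BY id LIMIT 1 OFFSET ?",
        (page * count - 1,),  # last row of the previous page
    ).fetchone()
    return row[0] if row else None

def fetch_after(conn, cursor_id, count):
    """Fetch one page of ids that follow the given cursor."""
    rows = conn.execute(
        "SELECT id FROM msgs WHERE id > ? ORDER BY id LIMIT ?",
        (cursor_id, count),
    ).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = make_demo_db()
    start = cursor_for_page(conn, page=3, count=10)
    print(start, fetch_after(conn, start, 10))
```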

超群.com says:

Jan 20th 2010 at 16:58

You can take a look at one of my blog posts: http://www./2009/04/efficient- pagination-using-mysql/

It uses a cursor and still lets you jump to any page.

gen says:

Jan 22nd 2010 at 09:06

What should be done if the rows are not sorted by the primary key id?
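
One common answer, offered here only as a sketch and not as something the original post proposes, is to make the cursor a pair: the sort column plus the id as a tie-breaker. The `created` column, its index, and the starting values below are assumptions for illustration; the example uses Python's built-in sqlite3.

```python
import sqlite3

def make_demo_db():
    """In-memory table with duplicate sort values (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, created INTEGER)")
    conn.execute("CREATE INDEX msgs_created_id ON msgs (created, id)")
    conn.executemany(
        "INSERT INTO msgs (id, created) VALUES (?, ?)",
        [(1, 5), (2, 3), (3, 3), (4, 9), (5, 5), (6, 1)],
    )
    return conn

def fetch_after(conn, last_created, last_id, limit):
    """Return rows ordered by (created, id) that come after the given pair."""
    return conn.execute(
        "SELECT id, created FROM msgs "
        "WHERE created > ? OR (created = ? AND id > ?) "
        "ORDER BY created, id LIMIT ?",
        (last_created, last_created, last_id, limit),
    ).fetchall()

def iterate_all(conn, limit):
    """Walk the table in (created, id) order, one page at a time."""
    last_created, last_id = -1, 0  # assumption: created >= 0 and id > 0
    while True:
        rows = fetch_after(conn, last_created, last_id, limit)
        if not rows:
            return
        yield from rows
        last_id, last_created = rows[-1]  # rows are (id, created) tuples

if __name__ == "__main__":
    print([row[0] for row in iterate_all(make_demo_db(), limit=2)])
```

The id tie-breaker matters when a page boundary falls between two rows that share the same sort value; without it, such rows could be skipped or repeated.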
