[Repost] Selecting Friends

Found another Aigoo post from here. Repost it as my documentation >>>

***

If you are building a Yet Another Great Social Network Service to beat MySpace, you’ll certainly need to keep a list of friends there, so that Alice may communicate in private with Bob, and they both can show pictures toChris, and Eve cannot eavesdrop on them and the rest of them can do all these kinds of things these people are supposed to do.

On most networks, friendship is an irreflexive symmetric binary relation:

  • Symmetric means that if Alice is a friend of Bob, then Bob is a friend of Alice too.
  • Irreflexive means that Alice is never a friend to herself.

As it’s a many-to-many relation, we sure need a separate table for it.

But how will we keep it? Should we keep the relation in the table as is (i. e. two separate rows for Alice/Boband Bob/Alice), or keep just one row and reconstruct the relation using the set operators?

Let’s check.

We’ll create 50,000 users and randomly assign 20 friends to each of these users. Then we’ll try to select list of friends of each of these users in a loop.

Table creation details

In the first table, t_fullfriends, we will keep two rows for each pair of friends. This is a violation of DKNFand occupies twice as much space, but the queries themselves are more simple.

In the second table, t_semifriends, we will keep only one row for each pair: least user id first, greatestuser id second. This is more compact form and it doesn’t violate DKNF, but we’ll need more complex queries to fetch the relation out of the table.

In a loop, we will try to select a list of friends for each user.

In the first case, we will use:

SELECT ffr_friend
FROM t_user
JOIN t_fullfriend
ON ffr_user = usr_id
WHERE usr_id = ?

, which is explained as:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_user const PRIMARY PRIMARY 4 const 1 100.00 Using index
1 SIMPLE t_fullfriend ref PRIMARY PRIMARY 4 const 41 100.00 Using index

It’s quite straightforward: a NESTED LOOP using a PRIMARY INDEX access for both tables.

The second query is a bit more complex:

SELECT IF(usr_id = sfr_least, sfr_greatest, sfr_least)
FROM t_user
JOIN t_semifriend
ON sfr_least = usr_id OR sfr_greatest = usr_id
WHERE usr_id = ?

, and explained as

id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t_user const PRIMARY PRIMARY 4 const 1 100.00 Using index
1 SIMPLE t_semifriend index_merge PRIMARY,ix_semifriend_greatest PRIMARY,ix_semifriend_greatest 4,4 34 100.00 Using union(PRIMARY,ix_semifriend_greatest); Using where

Depending on which join condition had fired, we select either sfr_least or sfr_greatest as a friend id. The join condition uses an INDEX_MERGE access path: it’s two ref‘s on the corresponding indices, combined together as with UNION operator.

The following code:

<?php
        $argc < 4 and die ("Usage: getfriends.php query_file host username [password]\n");
        $db = new mysqli($argv[2], $argv[3], isset($argv[4]) ? $argv[4] : NULL, "20090307_friends") or die ("Cannot connect: " . mysqli_connect_error() . "\n");
        $query = file_get_contents($argv[1]);
        $stmt = $db->stmt_init() or die("Cannot initialize statement " . $db->error . "\n");
        $stmt->prepare($query) or die ("Cannot prepare statement: " . $stmt->error . "\n");
        $s = 0;
        $res = 0;
        $stmt->bind_result($s);
        for ($i = 1; $i <= 50000; $i++) {
                switch ($stmt->param_count) {
                case 1:
                        $stmt->bind_param('i', $i);
                        break;
                case 2:
                        $stmt->bind_param('ii', $i, $i);
                        break;
                default:
                        die ("Wrong parameter count\n");
                }
                $stmt->execute();
                while($stmt->fetch())
                        $res += $s;
        }
        print "$res\n";
?>

proceduces the following results:

[20090307_friends]$ time php getfriends.php fullfriends.sql hostname username
50000739300

real    0m26.419s
user    0m1.571s
sys     0m0.976s

[20090307_friends]$ time php getfriends.php semifriends.sql hostname username
50000739300

real    0m56.727s
user    0m1.767s
sys     0m0.972s

That means that the denormalized t_fullfriend is twice as fast as the normalized t_semifriend.

But the query plan for t_semifriend doesn’t look so terrible, and the table itself contains half as many records as t_fullfriend.

Why such a performance degradation?

Two reasons come to mind:

  1. INDEX_MERGE is less performant than plain UNION ALL
  2. When joining by sft_least, we can take all values right from the index. But when joining onsft_greatest, we need table access to t_semifriend to get sft_least

Let’s check these out.

If we change the query againt t_semifriends to:

SELECT  sfr_greatest
FROM    t_user
JOIN    t_semifriend
ON      sfr_least = usr_id
WHERE   usr_id = ?
UNION ALL
SELECT  sfr_least
FROM    t_user
JOIN    t_semifriend
ON      sfr_greatest = usr_id
WHERE   usr_id = ?

, we get:

[20090307_friends]$ time php getfriends.php semifriends_union.sql
50000739300

real    0m34.233s
user    0m1.560s
sys     0m0.999s

This is almost as fast as t_fullfriends. Almost. But still not as fast.

Let’s check another assumption:

CREATE UNIQUE INDEX ux_semifriends_greatest_least
ON t_semifriend(sfr_greatest, sfr_least)

[20090307_friends]$ time php getfriends.php semifriends.sql hostname username
50000739300

real 0m53.436s
user 0m1.676s
sys 0m1.032s

[20090307_friends]$ time php getfriends.php semifriends_union.sql hostname username
50000739300

real 0m33.697s
user 0m1.658s
sys 0m1.031s

Nope.

So, from the experiments above we can conclude that:

  1. For the tables described above, INDEX_MERGE is less efficient than a plain UNION ALL
  2. Keeping two records for each pair instead of one increases performance

The best way to keep the friends relationship in a MySQL database is keeping two separate rows for each pair of friends.

This is both most efficient and most portable. If friendship is not a symmeric relation in your social model (as on LiveJournal), then you don’t have to change anything.

Tinggalkan Balasan

Isikan data di bawah atau klik salah satu ikon untuk log in:

Logo WordPress.com

You are commenting using your WordPress.com account. Logout / Ubah )

Gambar Twitter

You are commenting using your Twitter account. Logout / Ubah )

Foto Facebook

You are commenting using your Facebook account. Logout / Ubah )

Foto Google+

You are commenting using your Google+ account. Logout / Ubah )

Connecting to %s