一般的協同過濾演算法,首先是收集使用者對項目(產品)的評分情況,一種直接對某本書,或者某個歌曲評分,另種是隱性的評分,比如商務系統中,購買了表示打2分,流覽了打1分,其他的0分。我比較看好隱性評分,因為直接評分需要使用者的參與程度比較高,很多網站都在內容頁中留一個評分的按鈕,從1~5選一個,我可能喜歡這篇文章,可我哪裡知道我喜歡的程度是幾分啊,還要我去思考,而網站設計中一條很重要的原則是:Do not let me think!,於是我就胡打一個分數或者不打,而隱性的評分則不同,只有你喜歡的圖書你才會購買,只有你喜歡的歌曲才會聽多次。

收集好用戶的評分之後,通過最近鄰搜索到和某個項目或者某個人特徵或者興趣相近的其他項目或者人,最近鄰搜索演算法一般是皮爾森相關係數(Person Correlation Coefficient)、余弦相似性(Cosine-based Similarity)以及調整余弦相似性(Adjusted Cosine Similarity)。關於余弦定理在資料採擷中的應用,google黑白報有過介紹,可以參考數學之美系列 12 – 余弦定理和新聞的分類

剩下的工作就是根據最近鄰集進行推薦了。

最近鄰集的運算相對來說成本比較高,尤其是大量資料的時候,今天和大家分享的是一種簡單高效的協同過濾演算法:Slope one

基本原理

用戶         

對項目A評分

項目B

A-B
(差異)

3

4

-1

2

4

-2

4

A+1.5

 

用戶對項目B的評分可能是多少呢?
Slope one演算法認為:平均值也可以代替某兩個未知個體之間的分差異。項目A對項目B的平均差是:((3 – 4) + (2 – 4)) / 2 = -1.5,也就是說人們對B的分一般比A的分要高1.5;於是Slope one演算法就預測對B的評分是:4 + 1.5 = 5.5

加權演算法

有n個人對項目A和項目B評分了,R(A->B)表示這n個人對A和對B評分的平均差(A-B),有m個人對項目B和項目C評分了,R(C->B)表示這m個人對C和對B評分的平均差(C-B),注意都是平均差而不是平方差,現在某個用戶對A的評分是ra,對C的評分是rc,那麼A對B的評分可能是:

rb = (n * (ra – R(A->B)) + m * (rc - R(C->B)))/(m+n)

Slope One 算法是由 Daniel Lemire 教授在 2005 年提出,詳閱簡介(並有論文原文下載);

開源的Slope one的套裝程式

  • Python

http://www.serpentine.com/blog/2006/12/12/collaborative-filtering-made-easy/

  • Java

http://taste.sourceforge.net/

http://www.daniel-lemire.com/fr/documents/publications/SlopeOne.java

http://www.nongnu.org/cofi/

  • PHP

http://sourceforge.net/projects/vogoo

http://www.drupal.org/project/cre

http://www.daniel-lemire.com/fr/documents/publications/webpaper.txt Slope one演算法作者寫的,簡單明瞭,強烈推薦。

還有一些其他語言的版本,請參考http://zh.wikipedia.org/wiki/Slope_onehttp://code.google.com/p/openslopeone/

-------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace SlopeOne
    {
        public class Rating
        {
            public float Value { get; set; }
            public int Freq { get; set; }

            public float AverageValue
            {
                get { return Value / Freq; }
            }
        }

        public class RatingDifferenceCollection : Dictionary<string, Rating>
        {
            private string GetKey(int Item1Id, int Item2Id)
            {
                return (Item1Id < Item2Id) ? Item1Id + "/" + Item2Id : Item2Id + "/" + Item1Id ;
            }

            public bool Contains(int Item1Id, int Item2Id)
            {
                return this.Keys.Contains<string>(GetKey(Item1Id, Item2Id));
            }

            public Rating this[int Item1Id, int Item2Id]
            {
                get {
                        return this[this.GetKey(Item1Id, Item2Id)];
                }
                set { this[this.GetKey(Item1Id, Item2Id)] = value; }
            }
        }

         public class SlopeOne
        {        
            public RatingDifferenceCollection _DiffMarix = new RatingDifferenceCollection();  // The dictionary to keep the diff matrix
            public HashSet<int> _Items = new HashSet<int>();  // Tracking how many items totally

            public void AddUserRatings(IDictionary<int, float> userRatings)
            {
                foreach (var item1 in userRatings)
                {
                    int item1Id = item1.Key;
                    float item1Rating = item1.Value;
                    _Items.Add(item1.Key);

                    foreach (var item2 in userRatings)
                    {
                        if (item2.Key <= item1Id) continue; // Eliminate redundancy
                        int item2Id = item2.Key;
                        float item2Rating = item2.Value;

                        Rating ratingDiff;
                        if (_DiffMarix.Contains(item1Id, item2Id))
                        {
                            ratingDiff = _DiffMarix[item1Id, item2Id];
                        }
                        else
                        {
                            ratingDiff = new Rating();
                            _DiffMarix[item1Id, item2Id] = ratingDiff;
                        }

                        ratingDiff.Value += item1Rating - item2Rating;
                        ratingDiff.Freq += 1;
                    }
                }
            }

            // Input ratings of all users
            public void AddUerRatings(IList<IDictionary<int, float>> Ratings)
            {
                foreach(var userRatings in Ratings)
                {
                    AddUserRatings(userRatings);
                }
            }

            public IDictionary<int, float> Predict(IDictionary<int, float> userRatings)
            {
                Dictionary<int, float> Predictions = new Dictionary<int, float>();
                foreach (var itemId in this._Items)
                {
                    if (userRatings.Keys.Contains(itemId))    continue; // User has rated this item, just skip it

                    Rating itemRating = new Rating();

                    foreach (var userRating in userRatings)
                    {
                        if (userRating.Key == itemId) continue;
                        int inputItemId = userRating.Key;
                        if (_DiffMarix.Contains(itemId, inputItemId))
                        {
                            Rating diff = _DiffMarix[itemId, inputItemId];
                            itemRating.Value += diff.Freq * (userRating.Value + diff.AverageValue * ((itemId < inputItemId) ? 1 : -1));
                            itemRating.Freq += diff.Freq;
                        }
                    }
                    Predictions.Add(itemId, itemRating.AverageValue);                
                }
                return Predictions;
            }

            public static void Test()
            {
                SlopeOne test = new SlopeOne();

                Dictionary<int, float> userRating = new Dictionary<int, float>();
                userRating.Add(1, 5);
                userRating.Add(2, 4);
                userRating.Add(3, 4);
                test.AddUserRatings(userRating);

                userRating = new Dictionary<int, float>();
                userRating.Add(1, 4);
                userRating.Add(2, 5);
                userRating.Add(3, 3);
                userRating.Add(4, 5);
                test.AddUserRatings(userRating);

                userRating = new Dictionary<int, float>();
                userRating.Add(1, 4);
                userRating.Add(2, 4);
                userRating.Add(4, 5);
                test.AddUserRatings(userRating);

                userRating = new Dictionary<int, float>();
                userRating.Add(1, 5);
                userRating.Add(3, 4);

                IDictionary<int, float> Predictions = test.Predict(userRating);
                foreach (var rating in Predictions)
                {
                    Console.WriteLine("Item " + rating.Key + " Rating: " + rating.Value);
                }
            }
        }
    }
    --------------------------------------------------------------

    介绍了Slope One算法的概念, 这次介绍C#实现
    使用基于Slope One算法的推荐需要以下数据: 
    1. 有一组用户
    2. 有一组Items(文章, 商品等)
    3. 用户会对其中某些项目評分(Rating)表达他们的喜好
    Slope One算法要解决的问题是, 对某个用户, 已知道他对其中一些Item的Rating了, 向他推荐一些他还没有Rating的Items, 以增加销售机会. :-)

    一个推荐系统的实现包括以下三步:
    1. 计算出任意两个Item之间Rating的差值
    2. 输入某个用户的Rating记录, 推算出对其它Items的可能Rating值
    3. 根据Rating的值排序, 给出Top Items;

    第一步:例如我们有三个用户和4个Items, 用户評分的情况如下表.

    Ratings User1 User2 User3
    Item1 5 4 4
    Item2 4 5 4
    Item3 4 3 N/A
    Item4 N/A 5 5


    在第一步中我们的工作就是计算出Item之间两两的評分之差, 也就是使说计算出以下矩阵:

      Item1 Item2 Item3 Item4
    Item1 N/A 0/3 2/2 -2/2
    Item2 0/3 N/A 2/2 -1/2
    Item3 -2/2 -2/2 N/A -2/1
    Item4 2/2 1/2 2/1 N/A

    考虑到加权算法, 还要记录有多少人对这两项打了分(Freq), 我们先定义一个结构来保存Rating:
        public class Rating
        {
            public float Value { get; set; }
            public int Freq { get; set; }

            public float AverageValue
            {
                get {return Value / Freq;}
            }
        }
    我决定用一个Dictionary来保存这个结果矩阵:
        public class RatingDifferenceCollection : Dictionary<string, Rating>
        {
            private string GetKey(int Item1Id, int Item2Id)
            {
                return Item1Id + "/" + Item2Id;
            }

            public bool Contains(int Item1Id, int Item2Id)
            {
                return this.Keys.Contains<string>(GetKey(Item1Id, Item2Id));
            }

            public Rating this[int Item1Id, int Item2Id]
            {
                get {
                        return this[this.GetKey(Item1Id, Item2Id)];
                }
                set { this[this.GetKey(Item1Id, Item2Id)] = value; }
            }
        }

    接下来我们来实现SlopeOne类. 首先创建一个RatingDifferenceCollection来保存矩阵, 还要创建HashSet来保持系统中总共有哪些Items:
        public class SlopeOne
        {        
            public RatingDifferenceCollection _DiffMarix = new RatingDifferenceCollection();  // The dictionary to keep the diff matrix
            public HashSet<int> _Items = new HashSet<int>();  // Tracking how many items totally

    方法AddUserRatings接收一个用户的評分记录(Item-Rating): public void AddUserRatings(IDictionary<int, float> userRatings)
    AddUserRatings中有两重循环, 外层循环遍历输入中的所有Item, 内层循环再遍历一次, 计算出一对Item之间Rating的差存入_DiffMarix, 记得Freq加1, 以记录我们又碰到这一对Items一次:
        Rating ratingDiff = _DiffMarix[item1Id, item2Id];
        ratingDiff.Value += item1Rating - item2Rating;
        ratingDiff.Freq += 1;

    对每个用户调用AddUserRatings后, 建立起矩阵. 但我们的矩阵是以表的形式保存:

      Rating Dif Freq
    Item1-2 0 3
    Item1-3 1 2
    Item2-1 0 3
    Item2-3 1 2
    Item3-1 -1 2
    Item3-2 -1 2
    Item1-4 -1 2
    Item2-4 -0.5 2
    Item3-4 -2 1
    Item4-1 1 2
    Item4-2 0.5 2
    Item4-3 2 1


    第二步:输入某个用户的Rating记录, 推算出对其它Items的可能Rating值:
    public IDictionary<int, float> Predict(IDictionary<int, float> userRatings)
    也是两重循环, 外层循环遍历_Items中所有的Items; 内层遍历userRatings, 用此用户的ratings结合第一步得到的矩阵, 推算此用户对系统中每个项目的Rating:
        Rating itemRating = new Rating(); // Prediction of this user's rating 
        ...
        Rating diff = _DiffMarix[itemId, inputItemId]:
        itemRating.Value += diff.Freq * (diff.AverageValue + userRating.Value);
        itemRating.Freq += diff.Freq;

    第三步:得到用户的Rating预测后,就可以按rating排序, 向用户推荐了. 测试一下:
        Dictionary<int, float> userRating userRating = new Dictionary<int, float>();
        userRating.Add(1, 5);
        userRating.Add(3, 4);
        IDictionary<int, float> Predictions = test.Predict(userRating);
        foreach (var rating in Predictions)
        {
            Console.WriteLine("Item " + rating.Key + " Rating: " + rating.Value);
        }    
    输出:
    Item 2 Rating: 5
    Item 4 Rating: 6

    改进:
    观察之前产生的矩阵可以发现, 其中有很多浪费的空间; 例如: 对角线上永远是不会有值的. 因为我们是用线性表保存矩阵值, 已经避免了这个问题;
    对角线下方的值和对角线上方的值非常对称,下方的值等于上方的值乘以-1; 在数据量很大的时候是很大的浪费. 我们可以通过修改RatingDifferenceCollection来完善. 可以修改GetKey方法, 用Item Pair来作为Key:
        private string GetKey(int Item1Id, int Item2Id) {
            return (Item1Id < Item2Id) ? Item1Id + "/" + Item2Id : Item2Id + "/" + Item1Id ;;
        }
    完整代码在这里,在.net 3.5上调试通过;

    ----------------------------------------------------------------------------

    Implementing Slope-One in T-SQL

    Brief process summary

    Define the fact table as user data.
    Calculating intermediate matrix(FreqDiff). The information about users is eliminated and frequency/ score differences data between items is produced.
    Predicting from user input score with the intermediate data.
    Data schema

    The UserData is fact table of business transactions. I use an view to wrap it for switching between testing data and working data.
    The Freq&Diff matrix is square and sparse. Only non-zero values is meaningful and stored. And a half-matrix triangle holds full information about the matrix.
    There created two indices to avoid heavily bookmark-lookup.

    create table UserData (                  -- fact table    userid   varchar(50) not null,    itemid   varchar(50) not null,    rating   float not null default 0,    updtime  datetime default getdate(),    primary key (userid, itemid))GOcreate table FreqDiff (                  -- Freqs and Diffs    itemid1  varchar(50),    itemid2  varchar(50),    freq     float not null default 0,    diff     float not null default 0,    updtime  datetime default getdate(),    primary key (itemid1, itemid2))GOcreate index idx_freqdiff_itemid1 on FreqDiff(itemid1, freq, diff, itemid2)create index idx_freqdiff_itemid2 on FreqDiff(itemid2, freq, diff, itemid1)/* * The matrix FreqDiff is *almost* symmetric,  * so only half of the data need to be stored. * There would be huge of space (50%) saved for large dataset. */alter view vw_freqdiff asselect itemid1 as itemid1, itemid2 as itemid2, freq,     diff from FreqDiff fdunion allselect itemid2 as itemid1, itemid1 as itemid2, freq, -1* diff from FreqDiff fdGO/* * Wrap for userdata,  * switch from one model to another easily. */alter view vw_userdata as select * from userdataGO

    Testing data

    Same as Bryan’s but names changed for easily debugging print.

    -- init userdata, Bryan O'Sullivan's sample data is used

    -- init userdata, Bryan O'Sullivan's sample data is usedinsert into UserData values ( 'u1', 'i1',  1, getdate() )insert into UserData values ( 'u1', 'i2', .5, getdate() )insert into UserData values ( 'u1', 'i3', .2, getdate() )insert into UserData values ( 'u2', 'i1',  1, getdate() )insert into UserData values ( 'u2', 'i3', .5, getdate() )insert into UserData values ( 'u2', 'i4', .2, getdate() )insert into UserData values ( 'u3', 'i1', .2, getdate() )insert into UserData values ( 'u3', 'i2', .4, getdate() )insert into UserData values ( 'u3', 'i3',  1, getdate() )insert into UserData values ( 'u3', 'i4', .4, getdate() )insert into UserData values ( 'u4', 'i2', .9, getdate() )insert into UserData values ( 'u4', 'i3', .4, getdate() )insert into UserData values ( 'u4', 'i4', .5, getdate() )GO


    Processing the intermediate table

    -- update processdelete FreqDiffinsert into FreqDiffselect     ud1.itemid, ud2.itemid, count(*), (sum(ud1.rating - ud2.rating))/count(*), getdate()from     vw_userdata ud1     join vw_userdata ud2 on             ud1.userid = ud2.userid        and ud1.itemid > ud2.itemidgroup by ud1.itemid, ud2.itemid

    Predicting

    -- predict processdeclare @pref table(itemid varchar(50), rating float)insert into @pref values('i1', 0.4)select -- distinct top 10    itemid1,    sum(freq)                               as freq,    sum(freq*(diff + rating))            as pref,    sum(freq*(diff + rating)) /sum(freq) as ratingfrom     vw_freqdiff fd    join @pref p on fd.itemid2 = p.itemidwhere itemid1 not in( select itemid from @pref )group by itemid1

     

    【出處】http://www.cnblogs.com/kuber/archive/2008/06/18/1224725.html

     

    【出處】http://www.fuchaoqun.com/2008/09/slope_one/

    【出處】http://blog.charliezhu.com/2008/07/21/implementing-slope-one-in-t-sql/

    【出處】http://my.oschina.net/liangtee/blog?catalog=246307

    創作者介紹
    創作者 The Dance of Disorder (Fluctuations of Entropy) 的頭像
    Jason

    The Dance of Disorder (Fluctuations of Entropy)

    Jason 發表在 痞客邦 留言(0) 人氣( 0 )