Collaborative filtering(協同過濾)

Collaborative filtering (CF) makes predictions of the user's unknown preference rely on a given users' preferences.
CF methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities, or preferences -- and predicting what users will like based on their similarity to other users(user-based model). A key advantage of the CF approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items (such as movies) without requiring an "understanding" of the item itself. Though CF tasks have their main challenges, such as data sparsity, scalability.

Recommendation systems seek to predict the 'rating' or 'preference' that a user would give to an item (such as music, books, or movies) or social element (e.g. people or groups) they had not yet considered. Some techniques work well with explicit ratings e.g. movies, music while others do well with implicit ratings e.g. page views, clicks, etc.

First, there was 'Taste', a simple collaborative filtering engine that could predict what a user would like next -- be it a movie, book, or a product.
Later, Taste was incorporated into a learning. For example,
Apache Mahout, Apache Software Foundation旗下的一個開放原始碼的專案，其專案的目的，在於提供具規模可擴充性的機器學習)As Apache Mahout of the original item-item, user-user and 【slope-one】 recommender as well as the distributed versions of item-item CF.

The following images demonstrate how Mahout's recommendation engine work:

S is the similarity matrix between items, U is the user’s preferences for items
R is the predicted recommendations

In contrast to CF algorithms, a very successful recommendation, called latent factor models, using Matrix Factorization to build a model by recognizing patterns in the training data. After the learning phase, they use the model to predict ratings of given queries. Many recent achievement of precise prediction used this approach.

Specifically, these methods factorize the rating matrix into two low-rank matrices: user profile and item profile, i.e. model the users and items as points in a k-dimensional feature space. An unknown rating can than simply be estimated by taking the dot product between the corresponding user and item feature vectors. They tend to take longer time compared to CF, but claimed to achieve better accuracy.
Mathematically spoken, decompose recommendation(prediction) A into two other matrices U and M whose combination is a good approximation of A.
The matrix A can be very large and the challenge is to find the decomposition and various methods exist to calculate it.
Few published techniques in matrix factorization include:

Regularized SVD - Regularized SVD (Singular Value Decomposition) minimizes squared error between actual ratings and predicted estimations for all available votes. In order to control overfitting issue, it adds regularization terms both for user and item profiles. For minimization process, it uses gradient descent. With a proper choice of parameters, this algorithm is known to achieve good accuracy.
Non-negative Matrix Factorization (NMF) - NMF also factorizes the rating matrix into user and item profiles, but it has one more restriction: both low-rank profile matrices should have only positive values in them. This method uses multiplicative update rules for minimizing Euclidean distance or Kullback-Leibler divergence between the actual ratings and estimation.
Probabilistic Matrix Factorization (PMF) - PMF adopts a probabilistic linear model with Gaussian observation noise for representing latent features both for users and items.
Bayesian Probabilistic Matrix Factorization (BPMF) - This algorithm applies a fully Bayesian treatment of PMF model in which model capacity is controlled automatically by integrating over all model parameters and hyperparameters.
Non-linear Probabilistic Matrix Factorization (NLPMF) - This algorithm develops a non-linear probabilistic matrix factorization using Gaussian process latent variable models. The model is optimized using stochastic gradient descent method, allowing to apply Gaussian processes to data sets with millions of observations without approximate methods.

Based on feedback from Netflix participants, a straightforward matrix factorization with stochastic gradient descent training worked well for predicting ratings. For item recommendation, weighted regularized matrix factorization (WR-MF), which is called weighted-alternating least squares and BPR-MF (Bayesian personalized ranking), which optimizes for a ranking loss (AUC), does well.
Mahout uses Alternating Least Squares with Weighted Lambda-Regularization to find the decomposition, which is an iterative algorithm and currently show poor performance on Hadoop caused by the enormous overhead of scheduling and check-pointing each single iteration.
The next wave of innovative techniques were brought out by KDD 2011 competition, similar in nature to the Netflix competition but offered a fraction of the prize money! The winners used a combination of techniques (ensemble) including ALS, KNN, SGD, SVD++. More on it in next blog article.

協同過濾（Collaborative Filtering）是利用與特定使用者擁有類似經驗之群體的所顯示的偏好，來預測使用者未呈現的偏好資訊。如：網路書店的電子商務推薦系統，在顧客選擇一本書籍後，在下方也會顯示「Customer Who Bought This Item Also Bought」，就是CF經典的範例。

CF的優點：

不需進行內容分析。
能夠對複雜的、難以表述的概念（如資訊品質、個人品味）進行推薦
有推薦新資訊的能力，可以發現使用者潛在的的偏好
能做個人化推薦
自動化程度高

CF的缺點：

Cold start 問題。新使用者及新項目剛出現時，CF系統的推薦品質較差
稀疏性問題（Sparsity）
系統延伸性問題（Scalability）

CF主要的3種方法：

(1) User-Based CF

用相似統計的方法得到具有相似愛好或者興趣的相鄰使用者，所以稱之為以使用者為基礎（User-based）的協同過濾或基於鄰居的協同過濾(Neighbor-based Collaborative Filtering)。

步驟如下：

1.收集使用者資訊

收集可以代表使用者興趣的資訊。一般的網站系統使用評分的方式或是給予評價，這種方式被稱為「主動評分」。另外一種是「被動評分」，是根據使用者的行為模式由系統代替使用者完成評價，不需要使用者直接打分或輸入評價資料。電子商務網站在被動評分的資料獲取上有其優勢，使用者購買的商品記錄是相當有用的資料。

2.最近鄰搜索(Nearest neighbor search, NNS)

以使用者為基礎（User-based）的協同過濾的出發點是與使用者興趣愛好相同的另一組使用者，就是計算兩個使用者的相似度。例如：尋找n個和A有相似興趣使用者，把他們對M的評分作為A對M的評分預測。一般會根據資料的不同選擇不同的演算法，目前較多使用的相似度演算法有Person Correlation Coefficient、Cosine-based Similarity、Adjusted Cosine Similarity。

3.產生推薦結果

有了最近鄰集合，就可以對目標使用者的興趣進行預測，產生推薦結果。依據推薦目的的不同進行不同形式的推薦，較常見的推薦結果有Top-N 推薦和關聯推薦。Top-N 推薦是針對個體使用者產生，對每個人產生不一樣的結果，例如：透過對A使用者的最近鄰使用者進行統計，選擇出現頻率高且在A使用者的評分項目中不存在的，作為推薦結果。關聯推薦是對最近鄰使用者的記錄進行關聯規則(association rules)挖掘。

【註】推薦系統的另一種做法是：利用【關聯法則】 ARM

Another approach is to use association rule mining (market basket analysis) to compute interesting recommendations. But this technique does not generate personalized recommendations. Some researchers have combined ARM and CF to provide personalized recommendations.關聯法則 ARM無法提供個人化的推薦，而有結合ARM與CF來做個人化的推薦。

(2) Item-Based CF

以使用者為基礎的協同推薦演算法隨著使用者數量的增多，計算的時間就會變長；所以在2001年Sarwar提出了基於項目的協同過濾推薦演算法(Item-based Collaborative Filtering Algorithms)。以項目為基礎的協同過濾方法有一個基本的【假設】：「能夠引起使用者興趣的項目，必定與其之前評分高的項目相似」，透過計算項目之間的相似性來代替使用者之間的相似性。方法步驟：

1.收集使用者資訊

同以使用者為基礎（User-based）的協同過濾。

2.針對項目的最近鄰搜索

先計算己評項目與待測項目間的相似度，並以相似度為權重，加權各已評價項目的分數，得到待測項目的預測值。例如：要對項目 A 和項目 B 進行相似性計算，要先找出同時對 A 和 B 打過分的組合，對這些組合進行相似度計算，常用的演算法同以使用者為基礎（User-based）的協同過濾。

3.產生推薦結果

以項目為基礎的協同過濾不用考慮使用者間的差別，所以精度比較差。但是卻不需要使用者的歷史資料，或是進行使用者識別。對於項目來講，它們之間的相似性要穩定很多，因此可以離線完成工作量最大的相似性計算步驟，從而降低了線上計算量，提高推薦效率，尤其是在使用者多於項目的情形下尤為顯著。

(3) Content-Based Recommendation(CBF) (定義來說，此非 CF 法)

從 Robin Burke 的 Hybrid Recommender Systems: Survey and Experiments 一文中有定義如下：

A content-based recommender learns a profile of the user’s interests based on the features present in objects the user has rated

所以當使用者 profile 是經由分析文章所建立的時候，而 item 或 user 相似度是經由 cosine sim 算的時候，都算是用到了 content filtering 的概念

Collaborative Filtering , Content-based filtering 優缺點比較( Burke, 2002)

content-based 必須從同類別先去 training，所以其他類別的 item 就完全不可能被推薦，但是 CF 法的話，不同類別的 item 還是可能被推薦的
完全從內容著手，不需要該領域的專業知識
當資料越來越多時，推薦也會越準確
可以自行定義隱藏的變數，例如如果 user 停留在某個頁面很久，可以定義為它很喜歡 (rating ↑)
新的 user 紀錄太少 (此user 對所有 item 的 rating)，造成推薦不準確的問題
新的 item 紀錄太少 (所有user 對此 item 的 rating) ，造成推薦不準確的問題
特立獨行者，沒有明顯喜好者，很難準確推薦
歷史資料要夠大才可以分析，所以新系統似乎不適用
指的是當分析完使用者的偏好後，如果使用者品味突然改變，很難再去預測，必須使用新的歷史資料重新分析，推薦很僵化很麻煩

CF Case 案例說明：

(1) Case 背景

下面這張圖是CF的典型範例；一個有 48 萬名 user的網站，收錄了 18000 部電影，每個 user 對部份電影有給予評分。

第一列的 user A 對於 movie 2 的評價是 1 分，對於 movie 3 的評價也是 1 分，但是 user A 還沒有對 movie1 和 movie 4 做評價。要如何用CF預測 user A 對 1,4 的評價呢?

(2) Case 概念說明

基礎共分為兩種概念解決，分述如下

(a) User-Based CF

第一種是 user-based CF ，圖示如下 (JUN WANG,2008)

如果要預測 user A 對 4 的評價，因為 user B 對 4 的評價是 5 分， user F 對於 4 的評價是 1 ，則

user A → item 4的user-based CF分數 = 4 * similarities (user A, user B) + 5 * similarities (user A, user F)，再除以總相似度

【概念：取所有對item4的評價，以user A與其評價者間的相似度做加權平均】

即：取所有對item 4 做評價的user ，將其對item 4 的評價，乘上user A與這些user的相似度，加總，除以總相似度。

要預測 user u 對於 item i 的分數，需考量所有行為。像是 u 的其他 user v ( ex. 共同 rating 過 item i)，再除上所有相似度的加總，意義是將算出來的結果 mapping 到 user u 平常的 rating 區間

(b) Item-Based CF

第二種是 item-based CF ，則正好相反，圖示如下 (JUN WANG,2008)

例如：要預測 user A 對 4 的評價，因為 user A 對 2 和 3 的評價都是 1 ，則

user A = 1 * similarities (item 4, item 2) 加上 1 * similarities (item 4, item 3)

即 user A 可以從相似 item 2 和 item 3 去看他們與 item 4 的相似度，再算回去 user A 的評價

要預測 user u 對於 item i 的分數的話

要考量與 i 相似的其他 item j ，且 u 也對 j 有 rating ，並乘上 i 與 j 的相似度

再除上所有相似度的加總，意義是將算出來的結果 mapping 到 user u 平常的 rating 區間

(the sum of the similarity terms to make sure the prediction is within the predefined range)

(Badrul Sarwar, 2001)

CF 之中相似度算法：

CF 用到的相似度算法也有很多種，分述如下：

(1) Jaccard similarites (需要分析 content 建立成 tag set)

i,j 是 item 名稱，要算 item i 與 item j 的相似度，先把 i,j 分別建立成一個 tag set。例如 ( keyword1 , keyword 2 , keyword 3 , keyword 4 ….etc)

再用以下公式，分母為 item A 與 item B 所有有出現過的 tag 數，分子為 item A 與 item B 共同出現過的 tag 數

(2) Cosine similarities (需要分析 content 建立成 vector)

i,j 是 item 名稱，要算 item i 與 item j 的相似度，先把 i,j 建立成一個 vector (可以看成有權重的 tag set)。例如 ( keyword1 : 0.67 , keyword 2 : 0.88 …. keyword n : 0.32 )。如此一來就可以計算兩兩的相似度，如果 keyword 只有 2 個，則可以看成二維平面，example 如下：

item A 有兩個 tag x1 與 y1 ，權重為 x1 : 0.22 , y1 : 0.33

item B 有兩個 tag x2 與 y2 ，權重要 x2 : 0.44 , y2 : 0.55

sim (item A , item B) 的圖示如下：

sim (item A , item B) = cos ( x , y )

= ( 0.22 * 0.44 + 0.33 * 0.55 ) / ( 0.22 ^ 2 + 0.44 ^ 2) ^ 0.5 + ( 0.33 ^ 2 + 0.55 ^ 2) ^ 0.5

(3) Pearson correlation-based similarity (需要有 user 對於某些 item 的 rating)

ps. 這公式是 item-based

R(u,i) 指 user i 對於 item i 的 rating 值，

U 指有對 item i co-rating 的 user

R(u) bar 是該 user 對於所有文章的平均值

針對 item i , 共同有 co-rating 的 U 們的 rating 減掉 item i 平均得到的 rating 數，不考慮 user 同時 rating 不同 item 的因素，只考慮 item 本身屬性，可以判斷這些 item 彼此是不是像的。

(3) Adjusted cosine similarities (需要有 user 對於某些 item 的 rating)

ps. 這公式是 item-based

Cosine similarities 可以針對內容相似度去做計算，但如果

R(u,i) 指 user i 對於 item i 的 rating 值，

U 指有對 item i co-rating 的 user

R(u) bar 是該 user 對於所有文章的平均值

針對同個 item，不同 user 所給予的不同分數，但是 adjusted cosine 為了考量每個 user 的 scale 不同的問題 (ex. user A 平均都給 2 分 , user B 給 5 分 , 就不公平) 因此把每筆 rating 都扣掉一個該 user 給分的平均值，來避免此種情況發生。

ICF has three obvious advantages over UCF.
First, ICF is more explanatory, the recommendation results of UCF based on the
favorite items of users that have the same preference with u. But the ICF recommendation results are based on your past purchase preferences, which is very convincing.
Second, the user embedding model of ICF encodes all the items which users have interacted with in the past, so it can capture dynamic user preferences and achieve more accurate recommendation.
Third, since ICF estimates the similarity between items offline and then recommends items with high similarity online, Therefore, when users interact with new items, they only need to find items with high similarity to new items and add them to the list of recommended candidates, which makes online real-time recommendation more efficient. [Ref-1]