갈루아의 반서재



Building a Recommendation Engine with Python, the Surprise Library, and Collaborative Filtering


There are two main approaches to recommender systems: collaborative filtering and content-based recommendations. This post focuses on the collaborative filtering approach: put simply, predicting a user's ratings based on similarities between users.

We will use the Book-Crossing dataset and the Surprise library, developed by Nicolas Hug, to build the recommendation algorithms. First, import the required libraries.

import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

If Surprise is not installed, the import fails with the error below. Surprise is a Python scikit for building and analyzing recommender systems.


ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-bc06588669df> in <module>
      1 import pandas as pd
----> 2 from surprise import Reader
      3 from surprise import Dataset
      4 from surprise.model_selection import cross_validate
      5 from surprise import NormalPredictor

ModuleNotFoundError: No module named 'surprise'


In an Anaconda environment, install it as follows.

$ conda install -c conda-forge scikit-surprise
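
If you are not using conda, the same package can also be installed from PyPI (assuming a standard pip setup):

$ pip install scikit-surprise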


The Data

The Book-Crossing data consists of two DataFrames, a user table and a ratings table. The data needed for this exercise can be downloaded from the link below.

http://www2.informatik.uni-freiburg.de/~cziegler/BX/

# The raw CSV files are ';'-separated and latin-1 encoded; skip the few malformed lines
user = pd.read_csv('BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
user.columns = ['userID', 'Location', 'Age']
rating = pd.read_csv('BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
rating.columns = ['userID', 'ISBN', 'bookRating']

Let's briefly look at each DataFrame, starting with the user data.

user.head()

   userID                            Location   Age
0       1                  nyc, new york, usa   NaN
1       2          stockton, california, usa  18.0
2       3     moscow, yukon territory, russia   NaN
3       4           porto, v.n.gaia, portugal  17.0
4       5  farnborough, hants, united kingdom   NaN


Next, the ratings data.

rating.head()

   userID        ISBN  bookRating
0  276725  034545104X           0
1  276726  0155061224           5
2  276727  0446520802           0
3  276729  052165615X           3
4  276729  0521795028           6


Now merge the two DataFrames.

df = pd.merge(user, rating, on='userID', how='inner')
df.drop(['Location', 'Age'], axis=1, inplace=True)

df.head()


   userID        ISBN  bookRating
0       2  0195153448           0
1       7   034542252           0
2       8  0002005018           5
3       8  0060973129           0
4       8  0374157065           0

The key characteristics of this DataFrame are as follows.

df.shape

(1149780, 3)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 35.1+ MB

print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::200000, :])

Dataset shape: (1149780, 3)
-Dataset examples-
         userID        ISBN  bookRating
0             2  0195153448           0
200000    48494  0871233428           0
400000    98391  0670032549          10
600000   147513  0470832525           5
800000   196502  0590431862           0
1000000  242157  0732275865           0


EDA

Ratings Distribution

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

data = df['bookRating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} book-ratings'.format(df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

If plotly is not installed, the import above fails in the same way:

ModuleNotFoundError: No module named 'plotly'

In that case, install it first:

$ conda install -c plotly plotly

Looking at the distribution of the 1,149,780 book ratings, more than 62% of them are 0, and very few ratings are 1, 2, or 3.


Ratings Distribution By Book

# Number of ratings per book
data = df.groupby('ISBN')['bookRating'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Book (Clipped at 50)',
                   xaxis = dict(title = 'Number of Ratings Per Book'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

df.groupby('ISBN')['bookRating'].count().reset_index().sort_values('bookRating', ascending=False)[:10]
              ISBN  bookRating
247408  0971880107        2502
47371   0316666343        1295
83359   0385504209         883
9637    0060928336         732
41007   0312195516         723
10167   044023722X         647
166705  0679781587         639
28153   0142001740         615
166434  067976402X         614
153620  0671027360         586

Most books received fewer than 5 ratings, while the most-rated book received 2,502 ratings.


Ratings Distribution By User

# Number of ratings per user
data = df.groupby('userID')['bookRating'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User (Clipped at 50)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

df.groupby('userID')['bookRating'].count().reset_index().sort_values('bookRating', ascending=False)[:10]
        userID  bookRating
4213     11676       13602
74815   198711        7550
58113   153662        6109
37356    98391        5891
13576    35859        5850
80185   212898        4785
105111  278418        4533
28884    76352        3367
42037   110973        3100
88584   235105        3067

The per-user picture is similar: most users gave fewer than 5 ratings, while the most active user gave 13,602 ratings. Both distributions decay exponentially. To reduce the dimensionality of the dataset and avoid running out of memory, let's drop the rarely rated books and the rarely rating users.

min_book_ratings = 50
filter_books = df['ISBN'].value_counts() > min_book_ratings
filter_books = filter_books[filter_books].index.tolist()

min_user_ratings = 50
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['ISBN'].isin(filter_books)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(1149780, 3)
The new data frame shape:	(140516, 3)


Surprise

To load a dataset from the pandas DataFrame above, we use the load_from_df() method. A Reader object is required, and its rating_scale parameter must be specified. The DataFrame must have three columns corresponding to the user ids, the item ids, and the ratings, in that order.

reader = Reader(rating_scale=(0, 9))
data = Dataset.load_from_df(df_new[['userID', 'ISBN', 'bookRating']], reader)

Let's benchmark the following algorithms with the Surprise library. Detailed descriptions of each algorithm are available at the link below.

prediction_algorithms package


Basic algorithms

NormalPredictor - NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms that do not do much work.

BaselineOnly - BaselineOnly algorithm predicts the baseline estimate for given user and item.
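
To make the baseline idea concrete, the baseline estimate is the global mean rating plus a user bias and an item bias. The following is a minimal illustrative sketch with made-up numbers, not Surprise's actual implementation:

# Baseline estimate: r_hat(u, i) = mu + b_u + b_i  (illustrative toy values)
mu = 2.8                # global mean of all ratings in the trainset
user_bias = {0: -1.5}   # user 0 tends to rate 1.5 below the mean
item_bias = {'A': 0.7}  # item 'A' tends to be rated 0.7 above the mean

def baseline_estimate(u, i):
    # Unknown users or items simply fall back to the global mean
    return mu + user_bias.get(u, 0.0) + item_bias.get(i, 0.0)

print(baseline_estimate(0, 'A'))  # 2.8 - 1.5 + 0.7 = 2.0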


k-NN algorithms

KNNBasic - KNNBasic is a basic collaborative filtering algorithm.

KNNWithMeans - KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

KNNWithZScore - KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

KNNBaseline - KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.


Matrix Factorization-based algorithms

SVD - The SVD algorithm is equivalent to Probabilistic Matrix Factorization.

SVDpp - The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

NMF - NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

Slope One - SlopeOne is a straightforward implementation of the SlopeOne algorithm.

Co-clustering - CoClustering is a collaborative filtering algorithm based on co-clustering.
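
For reference, the matrix factorization models extend the baseline with latent factors: SVD predicts r_hat(u, i) = mu + b_u + b_i + q_i . p_u. A minimal sketch with toy factor vectors (again, only an illustration of the formula, not the library's code):

import numpy as np

# SVD prediction rule: r_hat(u, i) = mu + b_u + b_i + q_i . p_u
mu, b_u, b_i = 2.8, -1.5, 0.7            # biases, as in the baseline sketch above
p_u = np.array([0.3, -0.1, 0.5])         # latent factors of user u (toy values)
q_i = np.array([0.2,  0.4, -0.3])        # latent factors of item i (toy values)

r_hat = mu + b_u + b_i + q_i.dot(p_u)
print(r_hat)  # 2.0 + (-0.13) = 1.87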


Here we use RMSE as the accuracy metric for the predictions.
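
RMSE is simply the square root of the mean squared difference between the predicted and the actual ratings; conceptually it amounts to the following (Surprise computes this internally from its Prediction objects):

import numpy as np

# RMSE over a handful of toy (actual, predicted) rating pairs
actual    = np.array([0, 8, 10, 5])
predicted = np.array([2, 6,  7, 5])
print(np.sqrt(np.mean((predicted - actual) ** 2)))  # ~2.06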

benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
...
(similar log lines are printed for each algorithm and each of the 3 folds)

                  test_rmse    fit_time  test_time
Algorithm
BaselineOnly       3.378459    0.531255   0.483405
CoClustering       3.466500    2.804150   0.507137
SlopeOne           3.476148    1.145189   4.673109
KNNWithMeans       3.480589    1.223362   5.777882
KNNBaseline        3.495915    2.179070   8.162395
KNNWithZScore      3.504182    1.347703   6.161966
SVD                3.542879    5.857378   0.844189
KNNBasic           3.721986    1.500139   8.031263
SVDpp              3.791743  138.612237   6.063440
NMF                3.833076    6.882946   0.533782
NormalPredictor    4.664079    0.158003   0.483223

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')




Train and Predict

The BaselineOnly algorithm gave the best RMSE, so we will train and predict with BaselineOnly, using Alternating Least Squares (ALS) to estimate the baselines.

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...

{'test_rmse': array([3.37841046, 3.36613712, 3.37813444]),
 'fit_time': (0.22569656372070312, 0.26633763313293457, 0.27385568618774414),
 'test_time': (0.465076208114624, 0.4199976921081543, 0.435945987701416)}
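
For comparison, the baselines can also be estimated with stochastic gradient descent instead of ALS. A sketch using the bsl_options keys documented by Surprise (the learning rate here is only an example value):

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)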

We use train_test_split() to sample a trainset and a testset, and again use RMSE as the accuracy metric. The fit() method trains the algorithm on the trainset, and the test() method returns the predictions made on the testset.

trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 3.3708
3.370803268319106
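
Once the algorithm has been fit, a single prediction can also be queried directly with predict(), passing the raw user id and ISBN. For example, for the user/book pair that comes up again in the worst-predictions discussion below:

# Predict how user 227447 would rate ISBN '055356773X'
pred = algo.predict(227447, '055356773X')
print(pred.est)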

# Optionally persist and reload the predictions and the trained model
# (requires `from surprise import dump`)
# dump.dump('./dump_file', predictions, algo)
# predictions, algo = dump.load('./dump_file')

trainset = algo.trainset
print(algo.__class__.__name__)

BaselineOnly


To inspect the predictions in detail, let's build a DataFrame with all the predictions. The following code is largely taken from this notebook.

def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)

df.head()

      uid         iid   rui       est                    details  Iu   Ui       err
0  125878  0140298479  10.0  2.108557  {'was_impossible': False}  17  101  7.891443
1  242409  0156027321   0.0  4.008031  {'was_impossible': False}   9  188  4.008031
2   80945  0312963297   0.0  2.955063  {'was_impossible': False}  36   30  2.955063
3  264996  0684874350   0.0  3.999109  {'was_impossible': False}   9  111  3.999109
4  128696  0553279912   8.0  1.778143  {'was_impossible': False}  70  160  6.221857

best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

best_predictions

          uid         iid  rui  est                    details   Iu   Ui  err
33184  227447  0553282476  0.0  0.0  {'was_impossible': False}  316   61  0.0
31995  106225  051511264X  0.0  0.0  {'was_impossible': False}  204   29  0.0
34121   87746  0671867091  0.0  0.0  {'was_impossible': False}  175   66  0.0
9981   198711  0425114236  0.0  0.0  {'was_impossible': False}  353   33  0.0
22118  179733  0440236851  0.0  0.0  {'was_impossible': False}   91   42  0.0
29802   34623  0375702709  0.0  0.0  {'was_impossible': False}  250   78  0.0
34907  145451  0553285785  0.0  0.0  {'was_impossible': False}  159   46  0.0
29752   25810  0446354678  0.0  0.0  {'was_impossible': False}  217   32  0.0
33361  127429  0446525731  0.0  0.0  {'was_impossible': False}  121   24  0.0
29903  210792  0425147363  0.0  0.0  {'was_impossible': False}   49   67  0.0

The above are the best predictions.

worst_predictions

          uid         iid   rui       est                    details   Iu   Ui        err
30262  245827  0451183665  10.0  0.168424  {'was_impossible': False}  127   80   9.831576
29129  241548  0440237025  10.0  0.124767  {'was_impossible': False}   67   29   9.875233
23357    3394  0345387651  10.0  0.062455  {'was_impossible': False}  234  114   9.937545
15725  172030  0425125467  10.0  0.000000  {'was_impossible': False}  102   29  10.000000
12325  115490  081297106X  10.0  0.000000  {'was_impossible': False}  161   50  10.000000
34400  238781  0345443284  10.0  0.000000  {'was_impossible': False}  192  146  10.000000
9524    24921  0440236665  10.0  0.000000  {'was_impossible': False}   93   28  10.000000
5722   263460  0440236851  10.0  0.000000  {'was_impossible': False}   58   42  10.000000
29933   26544  0515128600  10.0  0.000000  {'was_impossible': False}  196   32  10.000000
24357  227447  055356773X  10.0  0.000000  {'was_impossible': False}  316   44  10.000000

The worst predictions are rather surprising. Take the last one: the book with ISBN "055356773X" was rated by 44 users, and user "227447" gave it a 10, yet the BaselineOnly algorithm predicted 0.

df_new.loc[df_new['ISBN'] == '055358264X']['bookRating'].describe()
count    60.000000
mean      1.283333
std       2.969287
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max      10.000000
Name: bookRating, dtype: float64

import matplotlib.pyplot as plt
%matplotlib notebook
df_new.loc[df_new['ISBN'] == '055356773X']['bookRating'].hist()
plt.xlabel('rating')
plt.ylabel('Number of ratings')
plt.title('Number of ratings book ISBN 055356773X has received')
plt.show();



For the book examined above, most of the ratings were 0; that is, most users gave it a 0, and only a handful of users gave it a high score such as 10. This is consistent with the other entries in the "worst predictions" list.


[Read the original] https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b