Building and Testing Recommender Systems With Surprise, Step-By-Step

source https://www.pexels.com/photo/person-doing-thumbs-up-193821/

Building a recommendation engine with Python, the Surprise library, and collaborative filtering

There are two main approaches to recommender systems: collaborative filtering and content-based recommendation. This post focuses on collaborative filtering. Put simply, it predicts how a user would rate an item from the ratings of similar users.
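As a rough illustration of the idea, here is a minimal, dependency-free sketch of user-based collaborative filtering. The toy ratings, the user and book names, and the choice of cosine similarity are all assumptions made for this example; it is not how Surprise implements its algorithms.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "ann": {"book_a": 8, "book_b": 2, "book_c": 9},
    "bob": {"book_a": 7, "book_b": 3, "book_c": 8, "book_d": 9},
    "cat": {"book_a": 2, "book_b": 9, "book_d": 3},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    neighbours = [(cosine_sim(user, v), r[item])
                  for v, r in ratings.items() if v != user and item in r]
    total = sum(s for s, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour rated this item
    return sum(s * r for s, r in neighbours) / total

print(predict("ann", "book_d"))
```

Because "ann" rates books much more like "bob" than like "cat", the estimate for book_d lands closer to bob's 9 than to cat's 3.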

To build the recommender algorithms we use the Book-Crossing dataset and the Surprise library, developed by Nicolas Hug. First, import the required libraries.

import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

If Surprise is not installed, the import fails with the error below. Surprise is a Python library for recommender systems.

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-bc06588669df> in <module>
      1 import pandas as pd
----> 2 from surprise import Reader
      3 from surprise import Dataset
      4 from surprise.model_selection import cross_validate
      5 from surprise import NormalPredictor

ModuleNotFoundError: No module named 'surprise'

In an Anaconda environment, install it as follows.

$ conda install -c conda-forge scikit-surprise
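To confirm the install before running the rest of the notebook, a small check like the one below may help. Note that the package is installed as scikit-surprise but imported as surprise; the helper name is_installed is made up for this sketch.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Installed as "scikit-surprise", imported as "surprise"
if not is_installed("surprise"):
    print("surprise is missing; install the scikit-surprise package first")
```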

The Data

The Book-Crossing data used here consists of two data frames: a users table and a ratings table. The data can be downloaded from the link below.

http://www2.informatik.uni-freiburg.de/~cziegler/BX/

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='skip' instead.
user = pd.read_csv('BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
user.columns = ['userID', 'Location', 'Age']
rating = pd.read_csv('BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
rating.columns = ['userID', 'ISBN', 'bookRating']

Let's take a quick look at each data frame. First, the user data.

user.head()

	userID	Location	Age
0	1	nyc, new york, usa	NaN
1	2	stockton, california, usa	18.0
2	3	moscow, yukon territory, russia	NaN
3	4	porto, v.n.gaia, portugal	17.0
4	5	farnborough, hants, united kingdom	NaN

Next, the ratings data.

rating.head()

	userID	ISBN	bookRating
0	276725	034545104X	0
1	276726	0155061224	5
2	276727	0446520802	0
3	276729	052165615X	3
4	276729	0521795028	6

Now we merge the two data frames.

df = pd.merge(user, rating, on='userID', how='inner')
df.drop(['Location', 'Age'], axis=1, inplace=True)

df.head()

	userID	ISBN	bookRating
0	2	0195153448	0
1	7	034542252	0
2	8	0002005018	5
3	8	0060973129	0
4	8	0374157065	0

The key facts about the merged data frame are as follows.

df.shape

(1149780, 3)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 35.1+ MB

print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::200000, :])

Dataset shape: (1149780, 3)
-Dataset examples-
         userID        ISBN  bookRating
0             2  0195153448           0
200000    48494  0871233428           0
400000    98391  0670032549          10
600000   147513  0470832525           5
800000   196502  0590431862           0
1000000  242157  0732275865           0

EDA

Ratings Distribution

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

data = df['bookRating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} book-ratings'.format(df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-20-b5ed23cad8a8> in <module>
----> 1 from plotly.offline import init_notebook_mode, plot, iplot
      2 import plotly.graph_objs as go
      3 init_notebook_mode(connected=True)
      4 
      5 data = df['bookRating'].value_counts().sort_index(ascending=False)

ModuleNotFoundError: No module named 'plotly'

(tfKeras) founder@hilbert:~/tfKeras$ conda install -c plotly plotly

Downloading and Extracting Packages
ca-certificates-2019 | 126 KB    | ################################################################################## | 100%
retrying-1.3.3       | 15 KB     | ################################################################################## | 100%
plotly-3.6.1         | 28.0 MB   | ################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Looking at the distribution of the 1,149,780 book ratings, over 62% are 0, and very few are low scores such as 1, 2, or 3.
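The percentage labels on the bar chart come from dividing each rating's count by the total number of rows. The same arithmetic, shown on a small made-up list of ratings rather than the real data, looks like this:

```python
from collections import Counter

# Hypothetical toy ratings standing in for df['bookRating']
toy_ratings = [0, 0, 0, 0, 0, 0, 5, 8, 8, 10]

counts = Counter(toy_ratings)
total = len(toy_ratings)

# Highest rating first, mirroring sort_index(ascending=False)
for rating in sorted(counts, reverse=True):
    share = counts[rating] / total * 100
    print('{:>2}: {:.1f} %'.format(rating, share))
```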

Ratings Distribution By Book

# Number of ratings per book
data = df.groupby('ISBN')['bookRating'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Book (Clipped at 50)',
                   xaxis = dict(title = 'Number of Ratings Per Book'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

df.groupby('ISBN')['bookRating'].count().reset_index().sort_values('bookRating', ascending=False)[:10]

	ISBN	bookRating
247408	0971880107	2502
47371	0316666343	1295
83359	0385504209	883
9637	0060928336	732
41007	0312195516	723
101670	044023722X	647
166705	0679781587	639
28153	0142001740	615
166434	067976402X	614
153620	0671027360	586

Most books received 5 or fewer ratings. The most-rated book received 2,502 ratings.

Ratings Distribution By User

# Number of ratings per user
data = df.groupby('userID')['bookRating'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User (Clipped at 50)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

df.groupby('userID')['bookRating'].count().reset_index().sort_values('bookRating', ascending=False)[:10]

	userID	bookRating
4213	11676	13602
74815	198711	7550
58113	153662	6109
37356	98391	5891
13576	35859	5850
80185	212898	4785
105111	278418	4533
28884	76352	3367
42037	110973	3100
88584	235105	3067

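The picture is the same per user: most users gave 5 or fewer ratings, while the most active user gave 13,602. Both distributions decay exponentially. To reduce the dimensionality of the dataset and avoid memory errors, we filter out rarely rated books and users who rated only a few books.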

min_book_ratings = 50
filter_books = df['ISBN'].value_counts() > min_book_ratings
filter_books = filter_books[filter_books].index.tolist()

min_user_ratings = 50
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['ISBN'].isin(filter_books)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(1149780, 3)
The new data frame shape:	(140516, 3)
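The filter above uses a strict ">" comparison, so a book or user with exactly 50 ratings is dropped. The following dependency-free sketch replays the same logic on a handful of made-up rows with a threshold of 2, to make that boundary visible; the rows and thresholds are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy (userID, ISBN, bookRating) rows
rows = [
    (1, "A", 5), (1, "B", 0), (1, "C", 7),
    (2, "A", 8), (2, "B", 0), (2, "C", 6),
    (3, "A", 0),
]
min_book_ratings = 2
min_user_ratings = 2

book_counts = Counter(isbn for _, isbn, _ in rows)
user_counts = Counter(user for user, _, _ in rows)

# Strict ">": an entity with exactly the threshold number of ratings is dropped
keep_books = {b for b, n in book_counts.items() if n > min_book_ratings}
keep_users = {u for u, n in user_counts.items() if n > min_user_ratings}

filtered = [r for r in rows if r[1] in keep_books and r[0] in keep_users]
print(len(rows), '->', len(filtered))
```

Books "B" and "C" have exactly 2 ratings each and are therefore removed, which is the same boundary behaviour as the data frame filter.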

Surprise

To load a dataset from the pandas data frame above, we use the load_from_df() method. It needs a Reader object with the rating_scale parameter specified, and the data frame must have three columns corresponding to the user ids, the item ids, and the ratings, in that order.

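# Note: bookRating runs from 0 to 10 in this data, so rating_scale=(0, 10) would match it;
# the results shown below were produced with the (0, 9) setting used here.
reader = Reader(rating_scale=(0, 9))
data = Dataset.load_from_df(df_new[['userID', 'ISBN', 'bookRating']], reader)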

Let's benchmark the following algorithms with the Surprise library. A detailed description of each algorithm is available at the link below.

prediction_algorithms package

Basic algorithms

NormalPredictor - NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms that do not do much work.

BaselineOnly - BaselineOnly algorithm predicts the baseline estimate for given user and item.
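To make the NormalPredictor description concrete, here is a small sketch of the idea: fit a mean and standard deviation to the training ratings, then draw a random rating from that normal distribution. The function name, the toy ratings, and the clipping to a (0, 10) scale are assumptions for this example, not the library's code.

```python
import random
from statistics import mean, pstdev

def normal_predictor(train_ratings, rating_scale=(0, 10), rng=random):
    """Draw one rating from N(mu, sigma^2) fitted to the training ratings."""
    mu = mean(train_ratings)
    sigma = pstdev(train_ratings)  # population std, the maximum-likelihood estimate
    low, high = rating_scale
    # Clip the draw so it stays on the rating scale
    return min(high, max(low, rng.gauss(mu, sigma)))

rng = random.Random(0)
toy = [0, 0, 0, 5, 7, 8, 10]  # hypothetical training ratings
samples = [normal_predictor(toy, rng=rng) for _ in range(5)]
print(samples)
```

Since the prediction ignores both the user and the item, this kind of predictor mainly serves as a floor against which the other algorithms can be compared.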

k-NN algorithms

KNNBasic - KNNBasic is a basic collaborative filtering algorithm.

KNNWithMeans - KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

KNNWithZScore - KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

KNNBaseline - KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.
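The k-NN algorithms rely on a similarity measure between users (or items). The log output further down mentions an "msd similarity matrix"; as far as I understand it, that refers to a similarity based on the mean squared difference of ratings on commonly rated items. A minimal sketch, with made-up users and a hypothetical function name:

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two users given as {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no shared items: treat as dissimilar
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # The +1 avoids division by zero when the users agree exactly
    return 1.0 / (msd + 1.0)

u = {"book_a": 8, "book_b": 2}
v = {"book_a": 7, "book_b": 3, "book_c": 9}
print(msd_similarity(u, v))
```

Users who agree exactly get a similarity of 1, and the value shrinks toward 0 as their ratings diverge.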

Matrix Factorization-based algorithms

SVD - When baselines are not used, the SVD algorithm is equivalent to Probabilistic Matrix Factorization.

SVDpp - The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

NMF - NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

Slope One - SlopeOne is a straightforward implementation of the SlopeOne algorithm.

Co-clustering - Coclustering is a collaborative filtering algorithm based on co-clustering.
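For the SVD entry above, the following sketch shows the general shape of a biased matrix factorization model: a prediction built from a global mean, a user bias, an item bias, and a dot product of latent factors, plus one stochastic gradient descent update. The starting values, learning rate, and regularization constant are hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + dot(q_i, p_u)."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient descent update for a single known rating."""
    err = r_ui - svd_predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical starting values for one (user, item) pair with two latent factors
mu, b_u, b_i = 3.0, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
before = svd_predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(8.0, mu, b_u, b_i, p_u, q_i)
after = svd_predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After the update, the prediction for this pair moves a little closer to the observed rating of 8; training repeats such steps over all known ratings for several epochs.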

We use "rmse" as the accuracy metric for the predictions.
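For reference, RMSE (root mean squared error) is the square root of the average squared difference between true and estimated ratings, so lower is better. A minimal sketch on made-up pairs:

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) pairs
pairs = [(10, 7), (0, 4), (5, 5)]
print(rmse(pairs))
```

Because the errors are squared before averaging, a few large misses (such as predicting 0 for a true 10) weigh heavily on the score.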

benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat works on both old and new versions
    tmp = pd.concat([tmp, pd.Series([algorithm.__class__.__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...

	test_rmse	fit_time	test_time
Algorithm
BaselineOnly	3.378459	0.531255	0.483405
CoClustering	3.466500	2.804150	0.507137
SlopeOne	3.476148	1.145189	4.673109
KNNWithMeans	3.480589	1.223362	5.777882
KNNBaseline	3.495915	2.179070	8.162395
KNNWithZScore	3.504182	1.347703	6.161966
SVD	3.542879	5.857378	0.844189
KNNBasic	3.721986	1.500139	8.031263
SVDpp	3.791743	138.612237	6.063440
NMF	3.833076	6.882946	0.533782
NormalPredictor	4.664079	0.158003	0.483223

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

surprise_results

	test_rmse	fit_time	test_time
Algorithm
BaselineOnly	3.378459	0.531255	0.483405
CoClustering	3.466500	2.804150	0.507137
SlopeOne	3.476148	1.145189	4.673109
KNNWithMeans	3.480589	1.223362	5.777882
KNNBaseline	3.495915	2.179070	8.162395
KNNWithZScore	3.504182	1.347703	6.161966
SVD	3.542879	5.857378	0.844189
KNNBasic	3.721986	1.500139	8.031263
SVDpp	3.791743	138.612237	6.063440
NMF	3.833076	6.882946	0.533782
NormalPredictor	4.664079	0.158003	0.483223
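The benchmark calls cross_validate with cv=3, and the table reports the mean over the three folds. The sketch below shows what a 3-fold split means in general: the data is cut into three parts, and each part serves once as the test set while the other two are used for training. The function name and the way folds are formed are assumptions for illustration, not the library's internals.

```python
import random

def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(test_idx))
```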

Train and Predict

The BaselineOnly algorithm gave the best rmse, so we train and predict with BaselineOnly, using Alternating Least Squares (ALS).

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...

{'test_rmse': array([3.37841046, 3.36613712, 3.37813444]),
 'fit_time': (0.22569656372070312, 0.26633763313293457, 0.27385568618774414),
 'test_time': (0.465076208114624, 0.4199976921081543, 0.435945987701416)}
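To give a feel for what the bsl_options above control, here is a dependency-free sketch of baseline estimation by alternating least squares: the estimate for a (user, item) pair is the global mean plus a user bias plus an item bias, and the two sets of biases are re-solved in turn, each damped by its regularization term. The toy ratings are made up, and this is a simplified reading of the method rather than the library's code.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate mu, user biases and item biases by alternating least squares.

    ratings: list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    for _ in range(n_epochs):
        # Fix the user biases and solve for each item bias
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Fix the item biases and solve for each user bias
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

# Hypothetical toy ratings
toy = [("u1", "a", 10), ("u1", "b", 8), ("u2", "a", 0), ("u2", "b", 0), ("u3", "a", 6)]
mu, b_u, b_i = als_baselines(toy)
print(mu + b_u["u1"] + b_i["a"])  # baseline estimate for (u1, a)
```

Larger reg_u and reg_i values pull the biases of users and items with few ratings toward zero, so their estimates stay close to the global mean.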

We use train_test_split() to sample a training set and a test set, and again use rmse as the accuracy metric. The fit() method trains the algorithm on the training set, and the test() method returns the predictions made for the test set.

trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 3.3708

3.370803268319106
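Conceptually, a 75/25 split shuffles the rating rows and holds out a quarter of them. The sketch below does this on made-up rows with a fixed seed; the function name and rows are hypothetical. The call above does not fix a seed, so the RMSE will vary slightly from run to run (train_test_split accepts a random_state argument for reproducibility).

```python
import random

def split_ratings(rows, test_size=0.25, seed=0):
    """Shuffle the rows and hold out a `test_size` fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = [(user, "isbn_%d" % user, user % 11) for user in range(20)]  # hypothetical rows
train_rows, test_rows = split_ratings(rows)
print(len(train_rows), len(test_rows))
```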

# dump.dump('./dump_file', predictions, algo)
# predictions, algo = dump.load('./dump_file')

trainset = algo.trainset
print(algo.__class__.__name__)

BaselineOnly

To inspect the predictions in detail, we build a data frame containing all of them. Most of the following code was taken from this notebook.

def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)

df.head()

	uid	iid	rui	est	details	Iu	Ui	err
0	125878	0140298479	10.0	2.108557	{'was_impossible': False}	17	101	7.891443
1	242409	0156027321	0.0	4.008031	{'was_impossible': False}	9	188	4.008031
2	80945	0312963297	0.0	2.955063	{'was_impossible': False}	36	30	2.955063
3	264996	0684874350	0.0	3.999109	{'was_impossible': False}	9	111	3.999109
4	128696	0553279912	8.0	1.778143	{'was_impossible': False}	70	160	6.221857
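The columns Iu and Ui count how many items a user rated and how many users rated an item in the training set, with 0 for users or items that were not in it; err is the absolute difference between the estimate and the true rating. The same bookkeeping on a few made-up rows, without pandas or Surprise, might look like this:

```python
from collections import Counter

# Hypothetical training rows (uid, iid, rating) and test predictions (uid, iid, true, est)
train_rows = [("u1", "a", 8), ("u1", "b", 0), ("u2", "a", 10), ("u3", "b", 5)]
preds = [("u1", "c", 10.0, 2.0), ("u2", "b", 0.0, 0.5), ("u4", "a", 7.0, 6.0)]

items_per_user = Counter(uid for uid, _, _ in train_rows)   # plays the role of Iu
users_per_item = Counter(iid for _, iid, _ in train_rows)   # plays the role of Ui

table = [
    {"uid": uid, "iid": iid, "Iu": items_per_user[uid], "Ui": users_per_item[iid],
     "err": abs(est - true)}
    for uid, iid, true, est in preds
]
table.sort(key=lambda row: row["err"])  # smallest error first
for row in table:
    print(row)
```

A Counter returns 0 for keys it has never seen, which mirrors the ValueError fallback in get_Iu and get_Ui.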

best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

best_predictions

	uid	iid	details	Iu	Ui
33184	227447	0553282476	{'was_impossible': False}	316	61
31995	106225	051511264X	{'was_impossible': False}	204	29
34121	87746	0671867091	{'was_impossible': False}	175	66
9981	198711	0425114236	{'was_impossible': False}	353	33
22118	179733	0440236851	{'was_impossible': False}	91	42
2980	234623	0375702709	{'was_impossible': False}	250	78
34907	145451	0553285785	{'was_impossible': False}	159	46
2975	225810	0446354678	{'was_impossible': False}	217	32
33361	127429	0446525731	{'was_impossible': False}	121	24
29903	210792	0425147363	{'was_impossible': False}	49	67

The above are the best predictions.

worst_predictions

	uid	iid	rui	est	details	Iu	Ui	err
30262	245827	0451183665	10.0	0.168424	{'was_impossible': False}	127	80	9.831576
29129	241548	0440237025	10.0	0.124767	{'was_impossible': False}	67	29	9.875233
2335	73394	0345387651	10.0	0.062455	{'was_impossible': False}	234	114	9.937545
15725	172030	0425125467	10.0	0.000000	{'was_impossible': False}	102	29	10.000000
12325	115490	081297106X	10.0	0.000000	{'was_impossible': False}	161	50	10.000000
34400	238781	0345443284	10.0	0.000000	{'was_impossible': False}	192	146	10.000000
9524	24921	0440236665	10.0	0.000000	{'was_impossible': False}	93	28	10.000000
5722	263460	0440236851	10.0	0.000000	{'was_impossible': False}	58	42	10.000000
29933	26544	0515128600	10.0	0.000000	{'was_impossible': False}	196	32	10.000000
24357	227447	055356773X	10.0	0.000000	{'was_impossible': False}	316	44	10.000000

The worst predictions look quite surprising. Take the last one, ISBN "055356773X": the book was rated by 44 users, and user "227447" gave it a 10, yet the BaselineOnly algorithm predicted 0.

df_new.loc[df_new['ISBN'] == '055358264X']['bookRating'].describe()

count    60.000000
mean      1.283333
std       2.969287
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max      10.000000
Name: bookRating, dtype: float64

import matplotlib.pyplot as plt
%matplotlib notebook
df_new.loc[df_new['ISBN'] == '055356773X']['bookRating'].hist()
plt.xlabel('rating')
plt.ylabel('Number of ratings')
plt.title('Number of ratings book ISBN 055356773X has received')
plt.show();

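For ISBN 055358264X, whose summary statistics are shown above, most of the ratings were 0. In other words, most users rated the book 0 and only a few gave it a score such as 10. The same pattern holds for the other entries in the "worst predictions" list.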
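Several of the worst predictions have an estimate of exactly 0.000000. One plausible explanation, assuming estimates are clipped to the rating_scale given to the Reader (which is how I understand the library's default prediction behaviour), is that the raw baseline fell below the scale and was reported as the lower bound. A sketch with hypothetical raw values:

```python
def clip_estimate(raw_estimate, rating_scale=(0, 9)):
    """Clamp a raw estimate to the rating scale given to the Reader."""
    low, high = rating_scale
    return min(high, max(low, raw_estimate))

# Hypothetical raw baseline estimates: mu + b_u + b_i
print(clip_estimate(-0.8))   # below the scale, reported as the lower bound
print(clip_estimate(11.2))   # above the scale, reported as the upper bound
print(clip_estimate(3.4))    # inside the scale, unchanged
```

When a user and a book both have strongly negative biases because most of their ratings are 0, the sum can easily drop below 0 before clipping.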

[Original article] https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b

License: Attribution-NonCommercial-NoDerivatives
