Abstract¶
Implicit Matrix Factorization¶
The Implicit Matrix Factorization (IMF) algorithm as presented by Hu, Koren and Volinsky is a widely popular, effective method for recommending items to users. This approach was born from necessity: while explicit feedback as to our users' tastes - a questionnaire completed at user signup, for example - makes building a recommendation engine straightforward, we often have nothing more than implicit feedback data - view count, click count, time spent on page, for example - which serve instead as a proxy for these preferences. Crucially, the latter feedback is asymmetric: while a high view count might indicate positive preference for a given item, a low view count cannot be said to indicate the opposite. Perhaps the user simply didn't know the item was there.
IMF begins with a ratings matrix $R$, where $R_{u, i}$ gives the implicit feedback value observed for user $u$ and item $i$. Next, it constructs two other matrices defined as follows:
$$ p_{u, i} = \begin{cases} 1 & r_{u, i} \gt 0\\ 0 & r_{u, i} = 0 \end{cases}\\ c_{u, i} = 1 + \alpha r_{u, i} $$$P$ gives a binary matrix indicating our belief in the existence of each user's preference for each item. $C$ gives our confidence in the existence of each user's preference for each item where, trivially, larger values of $r_{u, i}$ give us higher confidence that user $u$ indeed likes item $i$.
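For concreteness, here's a minimal sketch of these two constructions, assuming R is a NumPy array of raw feedback values (the names R, P, C and alpha simply mirror the notation above):
import numpy as np
def preference_and_confidence(R, alpha=40.):
    # p_ui = 1 wherever any feedback was observed, 0 otherwise.
    P = (R > 0).astype(float)
    # c_ui grows linearly with the magnitude of the feedback; alpha is a tunable constant.
    C = 1. + alpha*R
    return P, C
# Toy example: 2 users x 3 items of raw view counts.
R = np.array([[0., 3., 1.],
              [5., 0., 0.]])
P, C = preference_and_confidence(R)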
Next, IMF outlines its goal: let's embed each user and item into $\mathbb{R}^f$ such that their dot product approximates the former's true preference for the latter. Finally, and naming user vectors $x_u \in \mathbb{R}^f$ and item vectors $y_i \in \mathbb{R}^f$, we compute the argmin of the following objective:
$$ \underset{x_{*}, y_{*}}{\arg\min}\sum\limits_{u, i}c_{u, i}\big(p_{u, i} - x_u^Ty_i\big)^2 + \lambda\bigg(\sum\limits_u\|x_u\|^2 + \sum\limits_i\|y_i\|^2\bigg) $$
Once sufficiently minimized, we can compute expected preferences $\hat{p}_{u, i} = x_u^Ty_i$ for unobserved $\text{(user, item)}$ pairs; recommendation then becomes (a brief sketch in code follows the list):
- For a given user $u$, compute predicted preferences $\hat{p}_{u, i} = x_u^Ty_i$ for all items $i$.
- Sort the list in descending order.
- Return the top $n$ items.
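As a minimal sketch of this procedure, assuming the learned factors live in NumPy arrays X (users x f) and Y (items x f) - illustrative names, not from any particular library:
import numpy as np
def recommend_top_n(X, Y, user_index, n=10):
    # \hat{p}_{u, i} = x_u^T y_i for every item i; sort descending and keep the n best.
    predicted_preferences = X[user_index].dot(Y.T)
    return np.argsort(predicted_preferences)[::-1][:n]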
Shallow neural networks¶
IMF effectively gives a function $f: u, i \rightarrow \hat{p}_{u, i}$. As before, our goal is to minimize the function above, which we can now rewrite as:
$$ \underset{x_{*}, y_{*}}{\arg\min}\sum\limits_{u, i}c_{u, i}\big(p_{u, i} - f(u, i)\big)^2 + \lambda\bigg(\sum\limits_u\|x_u\|^2 + \sum\limits_i\|y_i\|^2\bigg) $$
To approximate this function, I turn to our favorite universal function approximator: neural networks optimized with stochastic gradient descent.
This work is built around a toy web application I authored long ago: dotify. At present, dotify:
- Pulls data nightly from Spotify Charts. These data contain the number of streams for that day's top 200 songs for each of 55 countries.
- Computes an implicit matrix factorization nightly, giving vectors for both countries and songs.
- Allows the user to input a "country-arithmetic" expression, e.g. "I want music like Colombia x Turkey - Germany." It then performs this arithmetic with the chosen country vectors and recommends songs for the resulting composite.
In this work, I first fit and cross-validate an implicit matrix factorization model, establishing the three requisite parameters: $f$, the dimensionality of the latent vectors; $\alpha$, the scalar multiple used in computing $C$; and $\lambda$, the regularization strength used in our loss function.
Network architectures¶
Next, I explore five different shallow neural network architectures in an attempt to improve upon the observed results. These architectures are as follows:
- A trivially "Siamese" network which first embeds each country and song index into $\mathbb{R}^f$ in parallel then computes a dot-product of the embeddings. This is roughly equivalent to what is being done by IMF.
- Same as previous, but with a bias embedding for each set, in $\mathbb{R}$, added to the dot-product.
- Same as #1, but concatenate the vectors instead. Then, stack 3 fully-connected layers with ReLU activations, batch normalization after each, and dropout after the first. Finally, add a 1-unit dense layer on the end, and add bias embeddings to the result. (NB: I wanted to add the bias embeddings to the respective $\mathbb{R}^f$ embeddings at the outset, but couldn't figure out how to do this in Keras.)
- Same as #2, except feed in the song title text as well. This text is first tokenized, then padded to a maximum sequence length, then embedded into a fixed-length vector by an LSTM, then reduced to a single value by a dense layer with a ReLU activation. Finally, this scalar is concatenated to the scalar output that #2 would produce, and the result is fed into a final dense layer with a linear activation - i.e. a linear combination of the two.
- Same as #4, except feed in the song artist index as well. This index is first embedded into a vector, then reduced to a scalar by a dense layer with a ReLU activation. Finally, this scalar is concatenated with the two scalars produced in the second-to-last layer of #4, then fed into a final dense layer with a linear activation. Like the previous, this is a linear combination of the three inputs.
Results¶
While the various network architectures are intriguing for both the ease with which we can incorporate multiple, disparate inputs into our function, and the ability to then train this function end-to-end with respect to our main minimization objective, we find no reason to prefer them over the time-worn Implicit Matrix Factorization algorithm.
Data preparation¶
To construct our ratings matrix, I take the sum of total streams for each $\text{(country, song)}$ pair. The data are limited to a given number of "top songs," where a top song is any song that appeared on Spotify Charts on a given date.
Because the values exist on wildly different orders of magnitude, I scale the results as follows:
$$\tilde{r}_{u, i} = \log{\bigg(\frac{1 + r_{u, i}}{\epsilon}\bigg)}$$
To start, let's build a ratings matrix for a small sample of the data.
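Before doing so, here's a rough sketch of what this construction might look like, assuming a raw streams_df with country_id, song_id and streams columns (hypothetical names; the actual logic lives inside the RatingsMatrix class):
import numpy as np
import pandas as pd
eps = 1e3
# Hypothetical raw chart data: one row per (country, song, date) observation.
streams_df = pd.DataFrame({
    'country_id': [1, 1, 2, 2, 2],
    'song_id':    [10, 11, 10, 10, 12],
    'streams':    [120000, 30000, 450000, 50000, 8000],
})
# Sum streams per (country, song) pair, then pivot to a wide country x song matrix.
R_ui = streams_df.groupby(['country_id', 'song_id'])['streams'].sum().unstack(fill_value=0)
# Apply the log scaling to observed entries only, leaving unobserved pairs at 0
# (an assumption about RatingsMatrix's behavior, since zeros are treated as "unrated" below).
R_ui = R_ui.where(R_ui == 0, np.log((1 + R_ui)/eps))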
ratings_matrix = RatingsMatrix(n_top_songs=10000, eps=1e3)
ratings_matrix.R_ui.head()
Next, let's get an idea of the sparsity of our data and how many songs each country has streamed.
sparsity = (ratings_matrix.R_ui == 0).mean().mean()
non_zero_entries_percent = 100*np.round(1 - sparsity, 4)
print('Our ratings matrix has {}% non-zero entries.'.format(non_zero_entries_percent))
print('Our ratings matrix contains {} countries and {} unique songs.'.format(*ratings_matrix.R_ui.shape))
plt.figure(figsize=(10, 6))
songs_rated_by_country = (ratings_matrix.R_ui > 0).sum(axis=1)
plt.hist(songs_rated_by_country)
plt.title('Distribution of # of Unique Songs Streamed per Country', fontsize=15)
plt.xlabel('Unique Songs Streamed', fontsize=12)
_ = plt.ylabel('# of Countries', fontsize=12)
Construct training, validation sets¶
In reviewing literature on recommender engine evaluation, it seems common to create training and validation sets as follows:
- Filter your ratings matrix for users (countries) that meet some criterion, e.g. they've streamed more than a certain number of songs.
- Select a random $x\%$ of their items (songs).
- "Move" these items into a validation matrix; set them to 0 in the training matrix.
To construct this split, I first compute a sensible threshold for Step #1, then "move" a random 20% of songs from the training matrix to the validation matrix.
print('The 15th percentile of songs rated by country is {}.'.format(
np.percentile(songs_rated_by_country, 15)
))
Let's choose 50 songs as our cutoff, and move 20% of the songs in qualifying rows to a validation set.
NB: The actual ratings matrix is located at RatingsMatrix.R_ui. This reflects an API choice I made when first starting this project.
def split_ratings_matrix_into_training_and_validation(ratings_matrix, eligibility_mask, fraction_to_drop=.2):
    # NB: the input ratings_matrix is mutated in place and returned as the training matrix.
    training_matrix = ratings_matrix
    validation_matrix = deepcopy(training_matrix)
    validation_matrix.R_ui = pd.DataFrame(0., index=training_matrix.R_ui.index, columns=training_matrix.R_ui.columns)
    for country_id, ratings in training_matrix.R_ui[eligibility_mask].iterrows():
        # For each eligible country, pick a random fraction of its rated songs...
        rated_songs_mask = training_matrix.R_ui.ix[country_id] > 0
        rated_songs = training_matrix.R_ui.ix[country_id][rated_songs_mask].index.tolist()
        n_songs_to_drop = int( len(rated_songs)*fraction_to_drop )
        songs_to_drop = set( random.sample(rated_songs, n_songs_to_drop) )
        # ...then "move" their ratings into the validation matrix and zero them out in training.
        validation_matrix.R_ui.ix[country_id][songs_to_drop] = training_matrix.R_ui.ix[country_id][songs_to_drop]
        training_matrix.R_ui.ix[country_id][songs_to_drop] = 0.
    return training_matrix, validation_matrix
more_than_50_ratings_mask = songs_rated_by_country > 50
training_matrix, validation_matrix = split_ratings_matrix_into_training_and_validation(
ratings_matrix=ratings_matrix,
eligibility_mask=more_than_50_ratings_mask
)
Evaluation¶
Evaluating recommender systems is an inexact science because there is no "right" answer. In production, this evaluation is often done via an A/B test on a proxy metric important to the business - revenue, for example. In training, the process is less clear. To this end, the authors of the IMF paper offer the following:
Evaluation of implicit-feedback recommender requires appropriate measures. In the traditional setting where a user is specifying a numeric score, there are clear metrics such as mean squared error to measure success in prediction. However with implicit models we have to take into account availability of the item, competition for the item with other items, and repeat feedback. For example, if we gather data on television viewing, it is unclear how to evaluate a show that has been watched more than once, or how to compare two shows that are on at the same time, and hence cannot both be watched by the user.
Additionally, they state:
It is important to realize that we do not have a reliable feedback regarding which programs are unloved, as not watching a program can stem from multiple different reasons. In addition, we are currently unable to track user reactions to our recommendations. Thus, precision based metrics are not very appropriate, as they require knowing which programs are undesired to a user. However, watching a program is an indication of liking it, making recall-oriented measures applicable.
As a solution, they propose evaluating the "expected percentile ranking," defined as follows:
$$ \overline{\text{rank}} = \frac{\sum_{u, i}\tilde{r}_{u, i}^t\text{rank}_{u, i}}{\sum_{u, i}\tilde{r}_{u, i}^t} $$
Here, $\text{rank}_{u, i}$ gives the percentile ranking of item $i$ within user $u$'s list of predicted preferences: if $\hat{p}_{u = 17, i = 34}$ is the largest of user 17's predicted preferences, then $\text{rank}_{u = 17, i = 34} = 0\%$; the item with the smallest predicted preference, i.e. the last on the list, gets $100\%$.
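As a small worked example with made-up numbers: suppose a country has scaled ratings $\tilde{r}^t = 2, 1, 1$ for three streamed songs, and the model places those songs at percentile ranks $10\%$, $50\%$ and $90\%$ respectively. Then
$$ \overline{\text{rank}} = \frac{2(0.10) + 1(0.50) + 1(0.90)}{2 + 1 + 1} = \frac{1.6}{4} = 40\% $$
Lower is better: a model that ranks heavily-streamed songs near the top of each country's list pushes this number toward $0\%$, while random predictions hover around $50\%$.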
The following class accepts a training matrix, validation matrix and a matrix of predicted preferences. It then exposes the mean expected percentile ranking for both training and validation sets as properties on the instance.
class ExpectedPercentileRankingsEvaluator:
def __init__(self, training_matrix, validation_matrix, eligibility_mask, predicted_preferences):
self.training_matrix = training_matrix
self.validation_matrix = validation_matrix
self.eligibility_mask = eligibility_mask
self.predicted_preferences = predicted_preferences
self._expected_percentile_rankings_train = []
self._expected_percentile_rankings_validation = []
def run(self):
self._evaluate_train()
self._evaluate_validation()
def _evaluate_train(self):
self._expected_percentile_rankings_train = self._evaluate(matrix=self.training_matrix)
def _evaluate_validation(self):
self._expected_percentile_rankings_validation = self._evaluate(matrix=self.validation_matrix)
def _evaluate(self, matrix):
expected_percentile_rankings = []
for country_id, preferences in self.predicted_preferences[self.eligibility_mask].iterrows():
predictions = pd.DataFrame({
'predicted_preference': preferences.sort_values(ascending=False),
'rank': np.arange( len(preferences) ),
'percentile_rank': np.arange( len(preferences) ) / len(preferences)
})
ground_truth = matrix.R_ui.ix[country_id][ matrix.R_ui.ix[country_id] > 0 ]
numerator = (ground_truth * predictions['percentile_rank'][ground_truth.index]).sum()
denominator = ground_truth.sum()
expected_percentile_rankings.append( numerator / denominator )
return expected_percentile_rankings
@property
def mean_expected_percentile_rankings_train(self):
return np.mean(self._expected_percentile_rankings_train)
@property
def mean_expected_percentile_rankings_validation(self):
return np.mean(self._expected_percentile_rankings_validation)
Grid search¶
Next, we perform a basic grid search to find reasonable values for $\alpha$ and $\lambda$.
F = 30
grid_search_results = {}
result = namedtuple('Result', 'alpha lmbda')
alpha_values = [1e-1, 1e0, 1e1, 1e2]
lmbda_values = [1e-1, 1e0, 1e1, 1e2]
for alpha in alpha_values:
for lmbda in lmbda_values:
implicit_mf = ImplicitMF(ratings_matrix=training_matrix, f=F, alpha=alpha, lmbda=lmbda)
implicit_mf.run()
predicted_preferences = implicit_mf.country_vectors.vectors.dot( implicit_mf.song_vectors.vectors.T )
evaluator = ExpectedPercentileRankingsEvaluator(
training_matrix=training_matrix,
validation_matrix=validation_matrix,
eligibility_mask=more_than_50_ratings_mask,
predicted_preferences=predicted_preferences
)
evaluator.run()
grid_search_results[result(alpha=alpha, lmbda=lmbda)] = {
'train': evaluator.mean_expected_percentile_rankings_train,
'validation': evaluator.mean_expected_percentile_rankings_validation
}
print(grid_search_results)
Let's visualize the results for clarity. I plot the negative of the validation score so that the parameters corresponding to the darkest square are the most favorable.
grid_search_items = [(params, results) for params, results in grid_search_results.items()]
grid_search_items.sort(key=lambda tup: tup[0].alpha)
params, results = zip(*grid_search_items)
validation_results = np.array([result['validation'] for result in results]).reshape(4, 4)
plt.figure(figsize=(15, 10))
plt.xticks([])
plt.yticks([])
plt.pcolormesh(-validation_results, cmap='Reds')
plt.colorbar()
for alpha_index, alpha in enumerate(alpha_values):
for lbmda_index, lmbda in enumerate(lmbda_values):
plt.text(
x = lbmda_index + 0.5,
y = alpha_index + 0.5,
s = r'$\alpha$ = {}'.format(alpha) + '\n' + r'$\lambda$ = {}'.format(lmbda),
ha='center',
va='center',
size=15,
color='w'
)
plt.title('Grid Search Results', fontsize=15)
plt.xlabel(r'$\lambda$ Values', fontsize=14)
_ = plt.ylabel(r'$\alpha$ Values', fontsize=14)
best_params = min(grid_search_results, key=lambda key: grid_search_results.get(key)['validation'])
print('The best parameters were found to be: {}.'.format(best_params))
Train a final model¶
Using these parameters, let's train a final implicit matrix factorization model using 1,000,000 top songs. Additionally, we'll choose $\epsilon = 1,000$ which seemed to work best in simple experimentation. This experimentation is not shown here.
ratings_matrix = RatingsMatrix(n_top_songs=1000000, eps=1e3)
print('Our ratings matrix contains {} countries and {} unique songs.'.format(*ratings_matrix.R_ui.shape))
implicit_mf = ImplicitMF(ratings_matrix=ratings_matrix, f=F, alpha=best_params.alpha, lmbda=best_params.lmbda)
implicit_mf.run()
Visualize¶
Let's plot the cosine similarities between all pairs of countries to confirm that things make sense. Intuitively, I'd think that countries in the following groups should be similar:
- United States, United Kingdom, Canada, Australia, New Zealand
- Latin American countries
- Sweden, Finland, Norway, Denmark
First, let's replace the index of our vectors so that it contains the country names.
country_vectors_df = implicit_mf.country_vectors.vectors.copy()
country_id_to_name = {countries_lookup[name]['id']: name for name in countries_lookup}
country_ids = country_vectors_df.index
country_names = pd.Index([country_id_to_name[c_id] for c_id in country_ids], name='country_name')
country_vectors_df.index = country_names
country_vectors_df.head()
Then, we'll plot the cosine similarities.
sns.set(style="white")
def plot_cosine_similarities(country_vectors_df):
similarities_df = pd.DataFrame(
data=cosine_similarity(country_vectors_df),
index=country_vectors_df.index,
columns=country_vectors_df.index
)
lower_triangle_mask = np.zeros_like(similarities_df, dtype=np.bool)
lower_triangle_mask[np.triu_indices_from(lower_triangle_mask)] = True
f, ax = plt.subplots(figsize=(21, 21))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(
similarities_df,
mask=lower_triangle_mask,
cmap=cmap,
vmax=.5,
square=True,
xticklabels=True,
yticklabels=True,
linewidths=1,
cbar_kws={"shrink": .5},
ax=ax,
)
ax.set_title('Cosine Similarity Matrix', fontsize=20)
plot_cosine_similarities(country_vectors_df)
I first turn to my favorite country - Colombia - to inspect its similarity with other Latin American countries. The numbers are high for: Peru, Paraguay, Panama, Nicaragua, Guatemala, Ecuador, Dominican Republic and Costa Rica.
Next, I turn to the United Kingdom: it is most similar to Switzerland, New Zealand, Ireland, Hungary, Belgium and Australia.
Finally, we see that Sweden is reasonably similar to Norway, yet not to Finland (which, incidentally, seems to have little in common with anyone at all).
Overall, this looks very good to me.
Visualize with t-SNE¶
To further inspect similarities between countries, let's explore a 2-dimensional t-SNE embedding of the country vectors.
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, random_state=12345)
country_embeddings = pd.DataFrame(
data=tsne.fit_transform(country_vectors_df),
index=country_vectors_df.index,
columns=['dim_1', 'dim_2']
)
def plot_tsne_embeddings(country_embeddings):
plt.figure(figsize=(15,15))
for country_name, country_embedding in country_embeddings.iterrows():
dim_1, dim_2 = country_embedding
plt.scatter(dim_1, dim_2)
plt.annotate(country_name, xy=(dim_1, dim_2), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.title('Two-Dimensional TSNE Embeddings of Latent Country Vectors', fontsize=16)
plt.xlabel('Dimension 1', fontsize=12)
plt.ylabel('Dimension 2', fontsize=12)
plt.show()
plot_tsne_embeddings(country_embeddings)
This is not particularly useful. Furthermore, and perhaps most importantly, the plot varies immensely when changing the random seed. This is a worthwhile step nonetheless.
Inspect predicted preferences¶
Crucially, we'll now inspect predicted preferences for select countries and confirm that they make sense. I first normalize each of the country and song vectors, then compute predicted preferences via the dot product.
SONG_METADATA_QUERY = """
SELECT
songs.title as song_title,
songs.artist as song_artist,
songs.id as song_id
FROM songs
"""
song_metadata_df = pd.read_sql(SONG_METADATA_QUERY, ENGINE, index_col=['song_id'])
song_vectors_df = song_metadata_df.join(implicit_mf.song_vectors.vectors, how='inner')\
.set_index(['song_title', 'song_artist'])
song_vectors_df_norm = song_vectors_df.apply(lambda vec: vec / np.linalg.norm(vec), axis=1)
country_vectors_df_norm = country_vectors_df.apply(lambda vec: vec / np.linalg.norm(vec), axis=1)
# Sanity check: every normalized vector should have unit length (up to floating-point error).
assert np.allclose(song_vectors_df_norm.apply(np.linalg.norm, axis=1), 1.)
assert np.allclose(country_vectors_df_norm.apply(np.linalg.norm, axis=1), 1.)
United States¶
pd.set_option('display.max_colwidth', 100)
country_vec = country_vectors_df_norm.ix['United States']
song_vectors_df_norm.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
Colombia¶
country_vec = country_vectors_df_norm.ix['Colombia']
song_vectors_df_norm.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
Turkey¶
country_vec = country_vectors_df_norm.ix['Turkey']
song_vectors_df_norm.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
Germany¶
country_vec = country_vectors_df_norm.ix['Germany']
song_vectors_df_norm.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
Taiwan¶
country_vec = country_vectors_df_norm.ix['Taiwan']
song_vectors_df_norm.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
These look great.
One interesting thing to note is the significance of the length of the vectors. To this point, Schakel and Wilson offer the following:
A word that is consistently used in a similar context will be represented by a longer vector than a word of the same frequency that is used in different contexts.
Not only the direction, but also the length of word vectors carries important information.
Word vector length furnishes, in combination with term frequency, a useful measure of word significance.
In our case, my guess is that length indicates something like popularity - at least for songs. Let's see which songs and countries have the largest L2 norms.
song_vectors_df.apply(np.linalg.norm, axis=1).sort_values(ascending=False).reset_index().head(10)
country_vectors_df.apply(np.linalg.norm, axis=1).sort_values(ascending=False).reset_index().head(10)
Returning to Taiwan and Colombia, let's inspect predictions using unnormalized vectors.
country_vec = country_vectors_df.ix['Taiwan']
song_vectors_df.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
country_vec = country_vectors_df.ix['Colombia']
song_vectors_df.dot(country_vec).sort_values(ascending=False).reset_index().head(10)
Neither set of recommendations contains a single song with Chinese or Spanish characters, respectively. Additionally, it seems the world is fond of Ed Sheeran. It is clear that the global popularity of songs dominates our recommendations when computing the dot product with unnormalized vectors.
When we do normalize, our predicted preferences can be defined thus:
$$ \hat{x}_u = \frac{x_u}{\|x_u\|} \quad \hat{y}_i = \frac{y_i}{\|y_i\|}\\ \hat{p}_{u, i} = \hat{x}_u \cdot \hat{y}_i $$One idea might be to scale the song vectors by their respective lengths, leaving our predicted preferences as:
$$ \begin{align*} \hat{p}_{u, i} &= \hat{x}_u \cdot \big(c\|y_i\|\,\hat{y}_i\big)\\ &= \hat{x}_u \cdot \Bigg(c\|y_i\|\frac{y_i}{\|y_i\|}\Bigg)\\ &= \hat{x}_u \cdot c\,y_i \end{align*} $$
Let's briefly see what this looks like.
C = .2
country_vec = country_vectors_df_norm.ix['Taiwan']
(C * song_vectors_df).dot(country_vec).sort_values(ascending=False).reset_index().head(10)
About the same. Nevertheless, the lengths of our vectors can likely offer some value when balancing local recommendations against globally popular songs. I leave this to the reader to explore further for now.
Approximating IMF with neural networks¶
Again, IMF gives a function $f: u, i \rightarrow \hat{p}_{u, i}$ and the following objective function to minimize:
$$ \underset{x_{*}, y_{*}}{\arg\min}\sum\limits_{u, i}c_{u, i}\big(p_{u, i} - f(u, i)\big)^2 + \lambda\bigg(\sum\limits_u\|x_u\|^2 + \sum\limits_i\|y_i\|^2\bigg) $$
Let's try to do this with neural networks. I'll use the $\alpha$ and $\lambda$ parameters found above, and keep the same dimensionality, $F = 30$, for our latent vectors. This number was chosen in separate experimentation not shown here.
from itertools import product
from keras import backend as K
from keras.layers import Input, Embedding, Flatten, Lambda, Dense, BatchNormalization, Dropout, LSTM
from keras.layers.merge import add, dot, concatenate
from keras.models import Model
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.regularizers import l2
from keras.utils import plot_model
songs_rated_by_country = (ratings_matrix.R_ui > 0).sum(axis=1)
plt.hist(songs_rated_by_country)
plt.title('Distribution of # of Unique Songs Streamed per Country', fontsize=15)
plt.xlabel('Unique Songs Streamed', fontsize=12)
_ = plt.ylabel('# of Countries', fontsize=12)
print('The 15th percentile of songs rated by country is {}.'.format(
np.percentile(songs_rated_by_country, 15)
))
Let's set our eligibility cutoff at 200 songs and create training and validation matrices.
more_than_200_ratings_mask = songs_rated_by_country > 200
training_matrix, validation_matrix = split_ratings_matrix_into_training_and_validation(
ratings_matrix=ratings_matrix,
eligibility_mask=more_than_200_ratings_mask
)
Arrange our data¶
To interact with Keras, we'll need to change our data from wide format to long. Additionally, so as to create our embeddings, we'll need to give each country and song a unique, contiguous index. Finally, I'll do the same with the song artist, as well as tokenize the song title, for later use.
song_metadata_df = song_metadata_df.ix[ratings_matrix.R_ui.columns]
song_metadata_df.head()
country_id_to_index = {c_id: index for index, c_id in enumerate(ratings_matrix.R_ui.index)}
song_id_to_index = {s_id: index for index, s_id in enumerate(ratings_matrix.R_ui.columns)}
song_artist_to_index = {artist: index for index, artist in enumerate(song_metadata_df['song_artist'].unique())}
song_metadata_df['song_index'] = song_metadata_df.index.map(song_id_to_index.get)
song_metadata_df['song_artist_index'] = song_metadata_df['song_artist'].map(song_artist_to_index.get).astype(int)
song_metadata_df.tail()
Tokenize song title¶
NUM_WORDS = 5000
tokenizer = Tokenizer(num_words=NUM_WORDS, lower=True, split=" ")
tokenizer.fit_on_texts(song_metadata_df['song_title'])
song_metadata_df['title_sequence'] = tokenizer.texts_to_sequences(song_metadata_df['song_title'])
Convert ratings to long format¶
input_tuples = list( product(ratings_matrix.R_ui.index, ratings_matrix.R_ui.columns) )
country_ids, song_ids = zip(*input_tuples)
ratings_df = pd.DataFrame({
'country_id': country_ids,
'song_id': song_ids,
'train_rating': training_matrix.R_ui.values.flatten(),
'validation_rating': validation_matrix.R_ui.values.flatten(),
})
ratings_df = ratings_df.join(song_metadata_df, on='song_id', how='inner').sort_index()
ratings_df['country_index'] = ratings_df['country_id'].map(country_id_to_index.get)
Define network parameters¶
n_countries = ratings_df['country_index'].nunique()
n_songs = ratings_df['song_index'].nunique()
n_artists = ratings_df['song_artist_index'].nunique()
alpha = best_params.alpha
lmbda = best_params.lmbda
F = 30
LEARNING_RATE = .001
Define a loss function¶
This is simply the IMF objective we'd like to minimize. (The $\lambda$ regularization terms are handled separately, via the embeddings_regularizer argument on each embedding layer.)
def implicit_cf_loss(y_true, y_pred, alpha=alpha):
    # P and C mirror the IMF definitions above: binary preference and confidence.
    P = K.cast(y_true > 0, K.floatx())
    C = 1 + alpha*y_true
    return K.mean( C*(P - y_pred)**2 )
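As a tiny sanity check of the weighting (toy numbers, not part of the original analysis): a streamed song (y_true > 0) contributes its squared error scaled by its confidence, while an unstreamed song contributes it with weight 1.
toy_true = K.constant([[2.], [0.]])
toy_pred = K.constant([[.9], [.1]])
# 21*(1 - .9)^2 = .21 for the streamed song, 1*(0 - .1)^2 = .01 for the unstreamed one; mean = .11
K.eval(implicit_cf_loss(toy_true, toy_pred, alpha=10.))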
Define functions to generate, evaluate and visualize predictions¶
def generate_predictions(model, inputs, ratings_df=ratings_df):
predictions = model.predict(x=inputs)
return pd.DataFrame({
'prediction': predictions.flatten(),
'country_id': ratings_df['country_id'],
'song_id': ratings_df['song_id']
}).pivot('country_id', 'song_id', 'prediction')
def evaluate_predictions(predictions, eligibility_mask=more_than_200_ratings_mask,
training_matrix=training_matrix, validation_matrix=validation_matrix):
evaluator = ExpectedPercentileRankingsEvaluator(
training_matrix=training_matrix,
validation_matrix=validation_matrix,
eligibility_mask=eligibility_mask,
predicted_preferences=predictions
)
evaluator.run()
print('Train: {}'.format(evaluator.mean_expected_percentile_rankings_train))
print('Validation: {}'.format(evaluator.mean_expected_percentile_rankings_validation))
def visualize_predictions(predictions):
    plt.figure(figsize=(10, 6))
    plt.hist(predictions.values.flatten(), bins=25)
    plt.title('Predicted Preferences Histogram', fontsize=15)
    plt.xlabel('Predicted Preference', fontsize=12)
    _ = plt.ylabel('Count', fontsize=12)
Network #1¶
A trivially "Siamese" network which first embeds each country and song index into $\mathbb{R}^f$ in parallel then computes a dot-product of the embeddings. This is roughly equivalent to what is being done by IMF.
country_input = Input(shape=(1,), dtype='int64', name='country')
song_input = Input(shape=(1,), dtype='int64', name='song')
country_embedding = Embedding(input_dim=n_countries, output_dim=F, embeddings_regularizer=l2(lmbda))(country_input)
song_embedding = Embedding(input_dim=n_songs, output_dim=F, embeddings_regularizer=l2(lmbda))(song_input)
predicted_preference = dot(inputs=[country_embedding, song_embedding], axes=2)
predicted_preference = Flatten()(predicted_preference)
model = Model(inputs=[country_input, song_input], outputs=predicted_preference)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
plot_model(model, to_file='figures/network_1.png')
inputs = [ratings_df['country_index'], ratings_df['song_index']]
network_1 = model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
For reference, let's recall the results we computed when validating our model with best_params.
grid_search_results[best_params]
Additionally, a random model should return an expected percentile ranking of ~50%, as noted in Hu, Koren and Volinsky. So, this is really bad. Let's visualize the distribution of predictions before moving on to another model.
visualize_predictions(predictions)
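As a quick sanity check of that ~50% figure, we can score uniformly random predicted preferences with the same evaluator (a brief sketch; the exact number will vary from run to run):
random_predictions = pd.DataFrame(
    data=np.random.rand(*training_matrix.R_ui.shape),
    index=training_matrix.R_ui.index,
    columns=training_matrix.R_ui.columns
)
evaluate_predictions(random_predictions)  # both metrics should land near .5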
Network #2¶
Same as previous, but with a bias embedding for each set, in $\mathbb{R}$, added to the dot-product.
country_bias = Embedding(input_dim=n_countries, output_dim=1, name='country_bias', input_length=1)(country_input)
song_bias = Embedding(input_dim=n_songs, output_dim=1, name='song_bias', input_length=1)(song_input)
predicted_preference = add(inputs=[predicted_preference, song_bias, country_bias])
predicted_preference = Flatten()(predicted_preference)
model = Model(inputs=[country_input, song_input], outputs=predicted_preference)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
plot_model(model, to_file='figures/network_2.png')
inputs = [ratings_df['country_index'], ratings_df['song_index']]
network_2 = model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
This looks a lot better. Now let's visualize our ground-truth and predictions side-by-side.
_ = pd.DataFrame({
'predictions': predictions.values.flatten(),
'ratings': ratings_df['train_rating'].values.flatten()
}).hist(figsize=(16, 6))
This model approximates the training distribution much better than the others. Key to this task is distinguishing between items with 0 ratings and those with high ratings. Remember, we first down-scaled our ratings with the following transformation to shrink the domain:
$$\tilde{r}_{u, i} = \log{\bigg(\frac{1 + r_{u, i}}{\epsilon}\bigg)}$$
Furthermore, the model scores worse on the training set, yet similarly on the validation set. The bias terms were clearly helpful.
Let's try a similar model, but concatenate the country and song embeddings, then stack fully connected layers, then compute the final linear combination as before.
Network #3¶
Same as #1, but concatenate the latent vectors instead. Then, stack 3 fully-connected layers with ReLU activations, batch normalization after each, and dropout after the first. Finally, add a 1-unit dense layer on the end, and add bias embeddings to the result. (NB: I wanted to add the bias embeddings to the respective $\mathbb{R}^f$ embeddings at the outset, but couldn't figure out how to do this in Keras.)
country_input = Input(shape=(1,), dtype='int64', name='country')
song_input = Input(shape=(1,), dtype='int64', name='song')
country_embedding = Embedding(input_dim=n_countries, output_dim=F, embeddings_regularizer=l2(lmbda))(country_input)
song_embedding = Embedding(input_dim=n_songs, output_dim=F, embeddings_regularizer=l2(lmbda))(song_input)
country_bias = Embedding(input_dim=n_countries, output_dim=1, name='country_bias', input_length=1)(country_input)
song_bias = Embedding(input_dim=n_songs, output_dim=1, name='song_bias', input_length=1)(song_input)
concatenation = concatenate([country_embedding, song_embedding])
dense_layer = Dense(activation='relu', units=10)(concatenation)
dropout = Dropout(.5)(dense_layer)
batch_norm = BatchNormalization()(dropout)
dense_layer = Dense(activation='relu', units=10)(batch_norm)
batch_norm = BatchNormalization()(dense_layer)
dense_layer = Dense(activation='relu', units=10)(batch_norm)
batch_norm = BatchNormalization()(dense_layer)
predicted_preference = Dense(units=1, name='predicted_preference')(batch_norm)
predicted_preference = Flatten()(predicted_preference)
predicted_preference = add(inputs=[predicted_preference, country_bias, song_bias])
predicted_preference = Flatten()(predicted_preference)
model = Model(inputs=[country_input, song_input], outputs=predicted_preference)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
plot_model(model, to_file='figures/network_3.png')
inputs = [ratings_df['country_index'], ratings_df['song_index']]
network_3 = model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
These results are roughly identical to the previous. Choosing between the models, we'd certainly go with the simpler of the two.
Now, let's try some architectures that benefit from other data sources altogether: the song title, and the song artist.
Network #4¶
Same as #2, except feed in the song title text as well. This text is first tokenized, then padded to a maximum sequence length, then embedded into a fixed-length vector by an LSTM, then reduced to a single value by a dense layer with a ReLU activation. Finally, this scalar is concatenated to the scalar output that #2 would produce, and the result is fed into a final dense layer with a linear activation - i.e. a linear combination of the two.
MAX_SEQUENCE_LENGTH = ratings_df['title_sequence'].map(len).max()
padded_title_sequences = pad_sequences(sequences=ratings_df['title_sequence'], maxlen=MAX_SEQUENCE_LENGTH)
country_input = Input(shape=(1,), dtype='int64', name='country')
song_input = Input(shape=(1,), dtype='int64', name='song')
title_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='song_title_sequence')
country_embedding = Embedding(input_dim=n_countries, output_dim=F, embeddings_regularizer=l2(lmbda))(country_input)
song_embedding = Embedding(input_dim=n_songs, output_dim=F, embeddings_regularizer=l2(lmbda))(song_input)
title_embedding = Embedding(output_dim=F, input_dim=NUM_WORDS, input_length=MAX_SEQUENCE_LENGTH)(title_input)
country_bias = Embedding(input_dim=n_countries, output_dim=1, name='country_bias', input_length=1)(country_input)
song_bias = Embedding(input_dim=n_songs, output_dim=1, name='song_bias', input_length=1)(song_input)
predicted_preference = dot(inputs=[country_embedding, song_embedding], axes=2)
predicted_preference = add(inputs=[predicted_preference, country_bias, song_bias])
predicted_preference = Flatten()(predicted_preference)
title_lstm = LSTM(F)(title_embedding)
dense_title_lstm = Dense(units=1, activation='relu')(title_lstm)
predicted_preference_merge = concatenate(inputs=[predicted_preference, dense_title_lstm])
final_output = Dense(activation='linear', units=1)(predicted_preference_merge)
model = Model(inputs=[country_input, song_input, title_input], outputs=final_output)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
plot_model(model, to_file='figures/network_4.png')
inputs = [ratings_df['country_index'], ratings_df['song_index'], padded_title_sequences]
network_4 = model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
We seem to get a pinch of lift from the song title, though we're still underfitting our training set. Perhaps we could make our model bigger, or run it for more epochs.
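One low-effort option (a sketch only; not run here) would be to train for many more epochs and let an early-stopping callback decide when to stop:
from keras.callbacks import EarlyStopping
# Hypothetical longer run; 'val_loss' is the loss on the held-out validation ratings.
early_stopping = EarlyStopping(monitor='val_loss', patience=3)
network_4_longer = model.fit(
    x=inputs,
    y=ratings_df['train_rating'],
    batch_size=1024,
    epochs=50,
    validation_data=(inputs, ratings_df['validation_rating']),
    callbacks=[early_stopping]
)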
Let's plot our ground-truth vs. prediction distributions side-by-side just to make sure we're still on the right track.
_ = pd.DataFrame({
'predictions': predictions.values.flatten(),
'ratings': ratings_df['train_rating'].values.flatten()
}).hist(figsize=(16, 6))
Finally, let's add the artist.
Network #5¶
Same as #4, except feed in the song artist index as well. This index is first embedded into a vector, then reduced to a scalar by a dense layer with a ReLU activation. Finally, this scalar is concatenated with the two scalars produced in the second-to-last layer of #4, then fed into a final dense layer with a linear activation. Like the previous, this is a linear combination of the three inputs.
artist_input = Input(shape=(1,), dtype='int64', name='artist')
artist_embedding = Embedding(input_dim=n_artists, output_dim=F, embeddings_regularizer=l2(lmbda))(artist_input)
dense_artist_embedding = Dense(units=1, activation='relu')(artist_embedding)
dense_artist_embedding = Flatten()(dense_artist_embedding)
predicted_preference_merge = concatenate(inputs=[predicted_preference, dense_title_lstm, dense_artist_embedding])
final_output = Dense(activation='linear', units=1)(predicted_preference_merge)
model = Model(inputs=[country_input, song_input, title_input, artist_input], outputs=final_output)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
plot_model(model, to_file='figures/network_5.png')
inputs = [
ratings_df['country_index'],
ratings_df['song_index'],
padded_title_sequences,
ratings_df['song_artist_index']
]
model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
This seems a bit worse than the last, though similar all the same. I'm a bit surprised we didn't get more lift from artists. Nevertheless, we're still underfitting; a bigger model, more epochs, etc. remain options. Finally, let's retrain the Network #2 architecture and inspect its per-country predictions.
country_input = Input(shape=(1,), dtype='int64', name='country')
song_input = Input(shape=(1,), dtype='int64', name='song')
country_embedding = Embedding(input_dim=n_countries, output_dim=F, embeddings_regularizer=l2(lmbda))(country_input)
song_embedding = Embedding(input_dim=n_songs, output_dim=F, embeddings_regularizer=l2(lmbda))(song_input)
country_bias = Embedding(input_dim=n_countries, output_dim=1, name='country_bias', input_length=1)(country_input)
song_bias = Embedding(input_dim=n_songs, output_dim=1, name='song_bias', input_length=1)(song_input)
predicted_preference = dot(inputs=[country_embedding, song_embedding], axes=2)
predicted_preference = Flatten()(predicted_preference)
predicted_preference = add(inputs=[predicted_preference, song_bias, country_bias])
predicted_preference = Flatten()(predicted_preference)
model = Model(inputs=[country_input, song_input], outputs=predicted_preference)
model.compile(loss=implicit_cf_loss, optimizer=Adam(lr=LEARNING_RATE))
inputs = [ratings_df['country_index'], ratings_df['song_index']]
network_2 = model.fit(
x=inputs,
y=ratings_df['train_rating'],
batch_size=1024,
epochs=5,
validation_data=(inputs, ratings_df['validation_rating'])
)
predictions = generate_predictions(model=model, inputs=inputs)
evaluate_predictions(predictions)
predictions_unstack = predictions.unstack()
predictions_unstack = pd.DataFrame({
'song_title': song_metadata_df.ix[predictions_unstack.index.get_level_values('song_id')]['song_title'],
'song_artist': song_metadata_df.ix[predictions_unstack.index.get_level_values('song_id')]['song_artist'],
'country': predictions_unstack.index.get_level_values('country_id').map(country_id_to_name.get),
'rating': predictions_unstack.values
})
Colombia¶
predictions_unstack[predictions_unstack['country'] == 'Colombia'].sort_values(by='rating', ascending=False).head(10)
United States¶
predictions_unstack[predictions_unstack['country'] == 'United States'].sort_values(by='rating', ascending=False).head(10)
Turkey¶
predictions_unstack[predictions_unstack['country'] == 'Turkey'].sort_values(by='rating', ascending=False).head(10)
Taiwan¶
predictions_unstack[predictions_unstack['country'] == 'Taiwan'].sort_values(by='rating', ascending=False).head(10)
As is abundantly clear, the model is simply converging to a popularity baseline. I tried extracting the vectors outright, normalizing as before then computing dot products, but the results made little sense.
Future work¶
At this stage, Implicit Matrix Factorization is the clear winner. The algorithm is popular for several good reasons - it's simple, relatively scalable, intuitive (especially given the significance of the interplay between $C$ and $P$, which I did not expressly discuss) and effective. In this work, I did not produce a neural network that approximates this algorithm with superior results.
My intuition tells me I'm missing something simple; why shouldn't a non-linear function approximator be able to approximate $f(u, i)$ better than a linear matrix decomposition? At this stage, I'm not sure. My best guess is that we should seek to optimize a different loss function. Here, we've mirrored that of IMF; why not optimize a ranking loss function instead, since that's what's ultimately important? The IMF objective makes the actual optimization process easy for ratings matrices with billions of items; in our case, with far fewer items, we're simply using stochastic gradient descent. To this effect, I did experiment with some composite loss functions: on one side, implicit_cf_loss, and on the other, a cutoff_loss, which seeks to enforce that songs which received a rating $\tilde{r}_{u, i}$ above a certain cutoff are indeed predicted above that cutoff. It's a crude, binary absolute error. This was unsuccessful. (A sketch of what such a composite loss might look like follows.)
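For the curious, here is a hedged reconstruction of what such a composite loss might have looked like in this notebook's Keras setup; the cutoff and weighting below are illustrative guesses, not the values actually tried:
CUTOFF = .5  # hypothetical cutoff on the scaled-rating axis
def cutoff_loss(y_true, y_pred):
    # A 0/1 penalty for songs rated above the cutoff yet predicted below it.
    # NB: this hard indicator has zero gradient almost everywhere, which may partly
    # explain why the experiment was unsuccessful.
    rated_above = K.cast(y_true > CUTOFF, K.floatx())
    predicted_below = K.cast(y_pred < CUTOFF, K.floatx())
    return K.mean(rated_above * predicted_below)
def composite_loss(y_true, y_pred, weight=1.):
    # One possible combination of the IMF-style loss and the crude ranking penalty.
    return implicit_cf_loss(y_true, y_pred) + weight*cutoff_loss(y_true, y_pred)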
Finally, if there is in fact some value in using neural networks for this type of problem, I'm extremely intrigued by the ease and intuitiveness with which we can incorporate multiple inputs into our function. Encoding our song title with an LSTM, and training that LSTM end-to-end with our main minimization objective, is one example of this. Furthermore, Keras makes the construction of these networks straightforward.
Thanks for reading this work. Your feedback is welcome, and I do hope this serves as a point of inspiration, trial and error should this question interest you.