I am a software/firmware developer, working in Dresden, Germany. In my free time, I spent some time on developing apps, presenting some interesting findings here in my blog.
Intelligent Playlists - Part 2
OK, so after the more general stuff in part 1, I now want to write about the theoretical side of the task. Besides “theoretical” might sound boring - it is not. Especially as this topic is a pretty interesting one: How to determine similarity and - more complex to say - how to calculate it given some rather strange chunk of data?
Comparing - but on what?
This is the first thing to think about. Given a set of “things”, how to compare them? There are easy examples: If you have a list of numbers or single characters, determining at last equality is easy. But here also determining similarity is another thing. Similarity of numbers? Well, as strange at this might sound the first moment, imagine that: You have a case base with data about patients who got care in e.g. a hospital. Now you have a new patient and - to decide what medicine to use for his case - you want to find similar cases. Beneath other data - as gender, body size and of course data about his or her disease, you usually store the patients’ age. Now, when searching for a similar case of a - let’s say - 26 year old, you certainly don’t want to find just patients of the age of 26! Here, a “fuzzy match” where e.g. patients aged from 20 to 34 years are returned is clearly preferred.
The Music Library
So in my case, the case base will be a library holding several music files. Now, what I search is a function, that…
- takes a song, for which I want to find similar ones and…
- another song from the library …
- which returns a (normalized) value between 0 and 1 indicating how similar both are (where 0 means not similar and 1 very similar/identical).
Populating a play list could then look like this:
INPUT track;
INPUT threshold;
FOR entry IN collection DO
IF sim(track, entry) > threshold THEN
playlist.add(entry);
END;
END;
Apart from that this algorithm neither is intelligent (it would run through the complete collection each time it is called and maybe we don’t want this much songs to be added at once) nor efficient (depending on the size of the collection, a complete run through might be a matter not just of a few seconds): The most interesting thing still has not been revealed - the sim function.
Comparing tracks
Now we finally arrived, where the life is: How to compare two given music files? As already mentioned the last time, we basically have two options here: We could use meta information which can be taken from the tags assigned to each song or we can analyze the music file itself. Let’s start with the former one:
Comparing using Meta Information
A single track can be seen as a structure holding meta information (in this scenario we just forget that it is “music”):
DEFINE STRUCTURE Track
title : String;
artist : String;
composer : String;
album : String;
year : Integer;
length : Integer;
playCount : Integer;
rating : Integer;
END;
This structure would be OK when we want to compare two tracks. Please note, that not all of these information come directly from the music file (playCount and rating are not stored inside the file but in a database which the music management application usually cares about). Now, the sim function could look like this:
FUNCTION sim(track1 : Track; track2 : Track)
Double result = 0.0;
// compareString() and compareInteger() return values between 0.0 and 1.0
result = compareString(track1.title, track2.title) +
compareString(track1.artist, track2.artist) +
compareString(track1.composer, track2.composer) +
compareString(track1.album, track2.album) +
compareInteger(track1.year, track2.year) +
compareInteger(track1.length, track2.length) +
compareInteger(track1.playCount, track2.playCount) +
compareInteger(track1.rating, track2.rating);
return result / 8.0;
END;
Looks reasonable, doesn’t it? However, when you think a bit about it, two questions should come up:
- How does compareString() internally work, i.e. is it OK to return how similar two strings are or wouldn’t it be better to use only exact matching here (where the result is an element from {0.0, 1.0} and not [0.0 .. 1.0])?
- Doesn’t the compareInteger() function need further parameters (e.g. when track A has been released 10 years after track B, they might be still similar (compared e.g. to a song composed in the middle age), but when track A has a user rating of 0 (lowest) and B of 10 (highest possible value) then this cannot be considered similar)?
Exact vs. Fuzzy matching
First, let’s have a look at the compareString() function. As already noted, we basically have two choices here to implement it: One (the easier to implement) way would be to make only exact matches, e.i. the function would probably look like this:
FUNCTION compareString(s1 : String, s2 : String)
IF s1 = s2 THEN
RETURN 1.0;
ELSE
RETURN 0.0;
END;
END;
Well, you might say that this has not to do much with similarity, but consider, that this function is called in a context (the sim() function) and together in a bunch of such comparisons it still would serve our needs. The second way would be, to let the function internally make a real string similarity calculation. As the algorithms are a bit more complex, I just want to redirect you to Wikipedia here, where you can find more about specific algorithms. What I want to do here is rather to discuss the pros and cons of both methods when used in the search: The fuzzy method makes one assumption: That similarity in a name also means similarity of the music itself. Despite you might recognize some common patterns when looking through a magazine advertising Metal, Rock or Jazz, this method isn’t 100% reliable. What really would be needed when analyzing names is a semantic match, and not a test if the strings themselves are to some degree equal. Even as this might be possible: I’m afraid this would not fit my timetable for this semester, so I discard this for now. However, in some situations the fuzzy matching can also be really useful: For example, if you look at the genres. Usually, you can group music roughly up into big bunches (Rock, Metal, Jazz, Classic, …) However, you also can use more specialized genre names. As a matter of fact, these often contain the “meta” genre in their name (e.g. Black Metal, Heavy Metal, Powermetal and so forth). Here, fuzzy matching would work pretty fine, as songs that are exactly in the same sub-genre get pretty high similarity, those in the same meta genre are still somehow similar which tracks that are in completely different genres (e.g. Metal and Jazz) are not considered similar. Beneath, the fuzzy matching is also great if you take the human laziness into account: A lot of collections I have seen in the past contain wrongly tagged tracks, often with typos in the artist/album names, often use of capital letters is not consistent and so on. Here, fuzzy matching could also help to find similar songs. Well, after so much pros for the fuzzy algorithm, what about exact matching? It could be preferred when the collection is well tagged and meta information is consistent. Here, a fuzzy match could - under certain circumstances - even cause strange decisions. Another good point for exact matching: It is faster. While string similarity functions usually use matrices internally, an exact match can be done by simply comparing the two strings characterwise (which - if I’m not mistaken, can be done with a time complexity of O(min(strlen(string1), strlen(string2))) which already is worst case). Now, what method to use? When thinking about pros and cons of both methods, the one and only reasonable decision could be: Make it user configurable! Of course, the defaults chosen shall be reasonable (e.g. fuzzy matching will not hurt too much in most cases), but if the user wants he shall get the possibility to change the methods used.
Comparing numbers
This is another important point. The sim() function presented above uses a pretty easy variant for comparing numbers: It just delegates two numbers to the compareInteger() function and expects it to know what to do with them. Now, this cannot be successful. The reason here: While strings usually represents names - and names are either similar or not - numbers usually must be interpreted; they bear meaning. So, when inspecting the year when a track has been written/released you have to use other parameters as when comparing the length of a track or the rating. The other thing is: Do we really want/need to compare numbers assigned to tracks? Well, e.g. the year seems reasonable (songs written in the time might somehow be similar) but what about the number of times a song already has been played or the user rating? Now, this point is interesting: Depending on whether we take these into account or not changes the behavior of the algorithm. If we ignore those used generated input, the algorithm behaves neutral compared to what the user likes. That means: It will generally spit out songs regardless the user likes them or not or has heart them recently or not. This would in the end result in an algorithm that suggests the user to hear songs he maybe have not heart for a long time now (“rediscover your music” ;) Got it?) However, when taking these things into account, the algorithm would behave like the user himself: He would preferably choose songs which he likes (which is indicated by both good rating and high play counts). On the other side, once the used drops songs with not (yet) good ratings or low play counts to the play list, the algorithm would also choose those songs. So, both methods would make sense, right? Another way is the golden path between both, but for this, read on this article (I will cover the topic soon).
Comparing the Music Itself
Now, beneath the possibility to compare two tracks by using their meta information, we also can compare by analyzing the music itself. Now, if you read my last blog entry about this topic you might remember I wrote that I probably won’t take this into account. Now, things changed. While indeed I would not compare the music files themselves - I simply don’t have enough time for this now - I still can reuse work that already has been done. So, indeed there is a neat little tool that analyzes music files: The moodbar program. For further and more detailed information, please consult the paper written by Gavin Wood and Simon O’Keefe. Just to give a short overview: Moodbar analyzes the music data and produces graphical output for it. Currently, this is used for in-track navigation, as the output visualizes where the given track is interesting or special. I assume, the moodbars can also be used for calculating similarity, so the sim() function will be extended by another call to a function that can be called e.g. compareMoodbar().
Improving the Algorithm - Learning by Doing
So, the basics are covered, but… is there a way to improve the presented algorithm? Well, there is. One great feature of intelligent algorithms often is, that they can - in some way - learn. And our sim() function already is a good place where basics for learning can be placed. Have a look at this - slightly changed - variant:
FUNCTION sim(track1 : Track; track2 : Track)
Double result = 0.0;
result = c1 * compareString(track1.title, track2.title) +
c2 * compareString(track1.artist, track2.artist) +
c3 * compareString(track1.composer, track2.composer) +
c4 * compareString(track1.album, track2.album) +
c5 * compareInteger(track1.year, track2.year) +
c6 * compareInteger(track1.length, track2.length) +
c7 * compareInteger(track1.playCount, track2.playCount) +
c8 * compareInteger(track1.rating, track2.rating) +
c9 * compareMoodbar(track1.moodbar, track2.moodbar);
return result / (c1 + c2 + c3 + c4 + c5 + c6 + c7 + c8 + c9);
END;
Well, beneath the addition of the moodbars, you shall see that we now use coefficients. Indeed, when all c’s (I will refer to them as vector c now) are set to 1.0, the algorithm would behave exactly as the first version! Now, we can also change the values of the components of c, which would result in different importance of single attributes. For example, if we set c = (0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.5, 1.0) the algorithm would return only tracks with similar artist/album, as well as similar rating (where the rating is not as important as the other components) and tracks that “sound” similar. Indeed, c could be any vector from R+9 as long as the sum of the components is greater than zero. Now, we can use this extended version of sim() to “learn” from user input! For example, we could the user provide with a little widget where we present him songs that we currently assume to be equal to the currently playing one and ask him, whether he really things that they are equal. If he agrees, we would increase the value of those components of c, where the similarity of the attributes are high, and decrease those, where the similarity is rather low. So, over time, the user would get an personalized algorithm - or agent - that selects tracks as he likes. Another - less intriguing - way for getting user feedback would be to watch what the user does with the playlist: If he tends to skip files selected by the agent, we could assume this as “bad selection” and accordingly rate down our vector c. On the other side, when the user hears the song until the very end or even turns the volume louder and rates the track up, we can assume it as a good choice. (However, volume interpretation requires the collection to be normalized, as otherwise the act of turning louder could just be a matter of a track with lower volume degree).
Another Way of Determining Similarity: Partitioning
Yeah, this is mathematics in it’s pure form: Given a set and a relation, one can partition the set into disjoint subsets; this is called an equivalence relation. An example would be the set of natural numbers and a relation that assumes two numbers a and b to be equal, when a mod 5 = b mod 5. (Easy spoken, this partitions the natural numbers into 5 disjoint subsets where all numbers in a given subset have the same remained when divided by 5). Equivalence relations have 3 interesting characteristics:
- They are reflexive (i.e. an element is always equal to itself).
- They are symmetric (i.e. if a is equal to b, b is equal to a).
- Last but not least: They are transitive (i.e. if a equals to b and b equals to c, then a also equals to c).
Now, what has this to do with our problem? Imagine, we would provide the user with the possibility to explicitly mark two songs as similar. Now, given a track, we could simply search for a track the user explicitly marked. This is - hypothetically - possible. However, imagine a collection of 1000 tracks and more (and this would still mean a small collection!): Would you like to explicitly declare songs as equal? Certainly not! However, this is where we use math magic: The user does not necessarily need to mark all tracks (which requires a set where all possible pairs of tracks from the collection would be stored) but only some. If you take some moments to think about it: This “song A is similar to song B” relation is an equivalence relation! Why? Of course, a song is always similar (or better: equal) to itself, and when the user says, that song A is equal to B, we can safely assume that B also is equal to A. And what about transitivity? Well, if A is similar to B and B similar to C, there is a (slight) chance that the user would find A being also similar to C. When we recall the widget idea from above, we could change it to:
- present the user with songs we assume (calculated) to be similar
- let him rate the choice (aka “Yes, similar” and “No, what do you thing?!”)
This way, we could combine the possibilities for getting user feedback and letting the user rate similarity.
Conclusion
Well, you see that there are in deed a lot of things one could consider to calculate similarity. The next step now is to implement the similarity search. As hinted in part 1: I don’t know whether I will implement everything presented here. A lot of the presented ideas are certainly cool and can be done - but they require time and time is currently a bit rare :( Nevertheless, let’s see what I can finish until the end of this semester ;)