True song deduplication #3353

troycarpenter · 2022-09-14T19:14:03Z

I was thinking of this while exercising today when a duplicate song came up in my playlist. Let me start by saying that I don't know all the implications of this, and I doubt it would ever be done in Ampache, but here's the idea. This would be true song deduplication for an Ampache-only music distribution model, and it involves using the MusicBrainz MBID for a recording.

Since Ampache is database driven for the most part, just about all metadata is gathered from files, either from filenames or tags, or from an external source. All objects are assigned an ID for their particular object type (song, artist, album and so on), with database relations between those objects.

It occurs to me that for some artists, particularly of the older variety, there are really just a set number of songs by that artist. In fact, for any given artist I may have three exact copies of the same song, with the exception of the embedded metadata, in different directories (divided by albums): the original version from a studio album, and any number of copies on compilation albums. As I understand the MBID for a recording, those should all have the same recording ID, no matter what album the recording appears on (see https://musicbrainz.org/doc/Recording and the referenced examples on that page).

Ampache is already tracking the MBID for tracks, but the song table also holds other album-specific data. In fact, for Ampache, it appears that the song is the basic unit of data. That means that in the current schema, a 'song' object cannot appear on more than one album. Supporting that would require another database layer/table, a concept of a 'track', which would be the glue between a 'recording' (or song) and a position on an album. An album would then be composed of tracks, and each track would point to a song record that holds information about that song, like its location on the filesystem and many of the other fields that already exist in the song table. The track entry would hold other data migrated from the song table, like which album it's on and, importantly, which catalog it belongs to for proper filtering.
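To make the song/track split concrete, here is a minimal sketch using Python's built-in sqlite3 module. Every table and column name below is hypothetical and chosen for illustration only; this is not Ampache's actual schema, just one way the idea could look.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Ampache's real tables.
SCHEMA = """
CREATE TABLE song (              -- one row per physical recording
    id    INTEGER PRIMARY KEY,
    mbid  TEXT,                  -- MusicBrainz recording ID, may be NULL
    title TEXT NOT NULL,
    file  TEXT NOT NULL          -- location on the filesystem
);
CREATE TABLE album (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE track (             -- glue between an album position and a song
    id       INTEGER PRIMARY KEY,
    album    INTEGER NOT NULL REFERENCES album(id),
    song     INTEGER NOT NULL REFERENCES song(id),
    position INTEGER NOT NULL,
    catalog  INTEGER NOT NULL    -- kept per track for catalog filtering
);
"""


def build_demo():
    """Create an in-memory database where one song appears on two albums."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO song VALUES (1, 'rec-1', 'Example Song', '/music/studio/01.flac')"
    )
    conn.executemany(
        "INSERT INTO album VALUES (?, ?)", [(1, "Studio Album"), (2, "Best Of")]
    )
    conn.executemany(
        "INSERT INTO track (album, song, position, catalog) VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 1), (2, 1, 7, 1)],
    )
    return conn


def albums_for_song(conn, song_id):
    """Return (album name, track position) for every album that uses the song."""
    rows = conn.execute(
        "SELECT album.name, track.position FROM track "
        "JOIN album ON album.id = track.album "
        "WHERE track.song = ? ORDER BY album.id",
        (song_id,),
    )
    return rows.fetchall()
```

With this layout a single file on disk backs both the studio album and the compilation, and `albums_for_song(build_demo(), 1)` lists both appearances.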

Ampache could then glean all of the information during a file scan to add new content, as it does today. A new song management screen could then present songs with the same MBID to the admin, and the admin could choose to eliminate duplicate files and adjust the track table so that there is only one copy of a recording on the filesystem. Or the admin could choose not to do anything and keep the duplicates.
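The grouping such a management screen would need might look roughly like the sketch below. The function name and the `(mbid, path)` input shape are assumptions made for illustration; nothing here reflects existing Ampache code.

```python
from collections import defaultdict


def duplicate_groups(songs):
    """Group (mbid, path) pairs by MBID and keep MBIDs with more than one file.

    `songs` is a hypothetical iterable of (mbid, path) tuples, e.g. the
    result of a file scan. Songs without an MBID are skipped because they
    cannot be matched this way.
    """
    by_mbid = defaultdict(list)
    for mbid, path in songs:
        if mbid:
            by_mbid[mbid].append(path)
    return {
        mbid: sorted(paths) for mbid, paths in by_mbid.items() if len(paths) > 1
    }
```

The result only proposes candidates. Deciding which copy to keep, or whether to delete anything at all, stays with the admin.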

What remains to be seen is whether the effort for such a feature and/or code rework is worth it. With storage space relatively inexpensive, it may not be a big problem to keep multiple copies of the same song. Also, other non-Ampache systems that build their databases from a file hierarchy (used in conjunction with Ampache, obviously) would seem to have files missing.

You can check your own database for duplicate MBIDs with this:

SELECT MIN(artist.name) AS artist, MIN(song.title) AS title, song.mbid
FROM song
LEFT JOIN artist ON artist.id = song.artist
GROUP BY song.mbid
HAVING COUNT(song.mbid) > 1;

For me that shows 590 records, each representing a recording that has at least two copies on my file system. Because the query returns one row per MBID, each row stands for at least one redundant file. Assuming exactly one duplicate per recording and an average file size of 5 MB per song, that adds up to almost 3 GB of reclaimable space (590 * 5 MB = 2950 MB).

Like I said, with people having TBs of disk space, and the amount of work that would be required, I think this is an interesting feature idea that in the end doesn't pass the bang for buck test unless a massive rewrite is in the works. However, if there is interest, the database changes to separate out "track" data from "song" data could be made and the de-duplication work added on later.

mitchray · 2022-09-15T07:55:07Z · Collaborator

Interestingly, when I run that query I get 438 results, but only 164 (effectively 82, since they are listed as pairs) from a Possible Duplicates search.

troycarpenter Sep 15, 2022
Author

The query was quickly thrown together (with some internet help). I added the artist name to make it easier to check visually whether the results made sense, and follow-up queries showed the list to be pretty accurate. However, it does depend on the MBID for each song being correct, meaning that as far as MusicBrainz is concerned, they are the same recording. What doesn't get caught is any song that is NOT properly tagged with an MBID; those would need a secondary matching algorithm (like matching artist and song title) with manual verification.
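As a rough illustration of such a secondary match, the sketch below builds a normalized (artist, title) key and groups files that share it. The normalization rules and function names are my own assumptions, not anything Ampache does, and the key is deliberately crude: it only handles Latin-script text and will happily pair a live version with a studio version, which is why manual verification is still needed.

```python
import re
import unicodedata
from collections import defaultdict


def match_key(artist, title):
    """Build a loose comparison key: lowercase, no accents, no punctuation."""

    def norm(text):
        text = unicodedata.normalize("NFKD", text).casefold()
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return re.sub(r"[^a-z0-9]+", " ", text).strip()

    return (norm(artist), norm(title))


def candidate_duplicates(songs):
    """Group hypothetical (artist, title, path) tuples that share a key.

    Returns a list of path lists; each list is only a candidate set that
    a human still has to confirm.
    """
    groups = defaultdict(list)
    for artist, title, path in songs:
        key = match_key(artist, title)
        if all(key):  # skip entries whose artist or title normalizes to nothing
            groups[key].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```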

Like I said, lots of work for seemingly minimal benefit. I'm also not sure I would want the system removing songs from the filesystem just in the name of cleaning up storage space. I haven't given any thought to disaster recovery, where a file scan would be necessary to rebuild the library databases. All the MBIDs for the various components (album, artist, artist_id, track and recording) are in the tags of the song. If only one copy of the song physically exists, a true rebuild of the database from tags wouldn't be possible without Ampache querying MB to find the album and track IDs.

troycarpenter · 2022-09-15T13:05:32Z · Author

Wow, a thought just occurred to me: purely from that metadata, Ampache could actually replicate compilation (or specifically "best of") albums for an artist using only the metadata for those compilations, provided the actual recordings the compilations are made from exist in the library. That's easily done if someone is a fan and has all of that artist's studio albums, since "best of" albums are usually just combinations of songs from those studio albums.
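A minimal sketch of that idea, assuming the compilation's track list is available as an ordered list of recording MBIDs and the library as a mapping from MBID to file path. Both structures and the function name are hypothetical; where the track list would come from (tags, MusicBrainz, or elsewhere) is left open.

```python
def resolve_compilation(tracklist, library):
    """Map a compilation's ordered recording MBIDs onto files already owned.

    tracklist: ordered list of recording MBIDs (hypothetical input).
    library:   dict of recording MBID -> file path (hypothetical input).
    Returns (resolved, missing), where resolved holds
    (position, mbid, path) tuples and missing lists unmatched MBIDs.
    """
    resolved = [
        (position, mbid, library[mbid])
        for position, mbid in enumerate(tracklist, start=1)
        if mbid in library
    ]
    missing = [mbid for mbid in tracklist if mbid not in library]
    return resolved, missing
```

If `missing` comes back empty, the whole compilation can be presented without a single extra file on disk.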

kuzi-moto · 2022-09-15T14:44:23Z · Collaborator

I have thought about this in the past. If you have a FLAC library, then your potential space savings could increase about 3x.

I think that an application like beets would be better suited, since its primary function is to manage music. Plus, Ampache has beets catalog functionality, though for some reason I've never been able to get it working.

Ampache is great, but its strengths are definitely more in playing music and less in managing it.
