True song deduplication #3353
Replies: 3 comments 1 reply
-
Interestingly, when I run that query I get 438 results, but only 164 (effectively 82, since they come in pairs) from a Possible Duplicates search.
-
Wow, a thought just occurred to me: purely from that metadata, Ampache could actually replicate compilation (or specifically "best of") albums for an artist using only the metadata for those compilations, provided the recordings the compilations are made from already exist in the library. That would be easy for a fan who has all of an artist's studio albums, since "best of" albums are usually just combinations of songs from those studio albums.
-
I have thought about this in the past. If you have a FLAC library, then your potential space savings could increase about 3x. I think that an application like beets would be better suited, since its primary function is to manage music. Plus, Ampache has beets catalog functionality, though for some reason I've never been able to get it working. Ampache is great, but its strengths are definitely more in playing music and less in managing it.
-
I was thinking of this while exercising today when a duplicate song came up in my playlist. Let me start out by saying that I don't know all the implications of this, and I doubt it would ever be done in Ampache, but here's the idea. This would be true song deduplication for an Ampache-only music distribution model, and it involves using the MusicBrainz MBID for a recording.
Since Ampache is database driven for the most part, just about all metadata is gathered from files, either from filenames or tags, or from an external source. In this way, every object is assigned an ID for its particular object type (song, artist, album and so on), with database relations between those objects.
It occurs to me that for some artists, particularly of the older variety, there are really just a set number of songs by that artist. In fact, for any given artist I may have three exact copies of the same song, with the exception of the embedded metadata, in different directories (divided by albums): the original version from a studio album, and any number of copies on compilation albums. As I understand the MBID for a recording, those should all have the same recording ID, no matter what album the recording appears on (see https://musicbrainz.org/doc/Recording and the referenced examples on that page).
Ampache is already tracking the MBID for tracks, but the song table also holds other album-specific data. In fact, for Ampache, it appears that the song is the basic unit of data. That means that in the current schema, a 'song' object cannot appear on more than one album. Changing this would require another database layer/table, a concept of a 'track', which would be the glue between a 'recording' (or song) and a position on an album. An album would then be composed of tracks, and each track would point to a song record that holds information about that song, like its location on the filesystem and many of the other fields that already exist in the song table. The track entry would hold other data migrated from the song table, like which album it's on and, importantly, which catalog it belongs to for proper filtering.
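To make the proposed split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`album`, `song`, `track`, `catalog_id`, `track_number`) are illustrative assumptions, not Ampache's actual schema, and Ampache itself runs on MySQL/MariaDB rather than SQLite.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Ampache's real tables.
SCHEMA = """
CREATE TABLE album (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE song (              -- one row per recording / file on disk
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    mbid  TEXT,                  -- MusicBrainz recording ID
    file  TEXT NOT NULL
);
CREATE TABLE track (             -- glue between an album slot and a song
    id           INTEGER PRIMARY KEY,
    album_id     INTEGER NOT NULL REFERENCES album(id),
    song_id      INTEGER NOT NULL REFERENCES song(id),
    catalog_id   INTEGER NOT NULL,
    track_number INTEGER NOT NULL
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database where one song sits on two albums."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO album (id, name) VALUES (?, ?)",
        [(1, "Studio Album"), (2, "Best Of")],
    )
    conn.execute(
        "INSERT INTO song (id, title, mbid, file) VALUES (?, ?, ?, ?)",
        (1, "Example Song", "mbid-0001", "/music/studio/03.flac"),
    )
    # Two track rows, one per album, both pointing at the same song row.
    conn.executemany(
        "INSERT INTO track (album_id, song_id, catalog_id, track_number) "
        "VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 3), (2, 1, 1, 7)],
    )
    conn.commit()
    return conn


def albums_for_song(conn: sqlite3.Connection, song_id: int) -> list[str]:
    """Return the names of every album that includes the given song."""
    rows = conn.execute(
        "SELECT album.name FROM track "
        "JOIN album ON album.id = track.album_id "
        "WHERE track.song_id = ? ORDER BY album.id",
        (song_id,),
    )
    return [row[0] for row in rows]
```

With this layout the file path lives only on the `song` row, so a compilation costs one extra `track` row rather than another copy of the file.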
Ampache could then glean all of this information during a file scan and add new content as it does today. A new song management screen could present songs with the same MBID to the admin, who could choose to eliminate duplicate files and adjust the track table so that there is only one copy of a recording on the filesystem. Or the admin could choose to do nothing and keep the duplicates.
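The admin action described above could look roughly like the following sketch. It assumes a hypothetical `song` table (`id`, `mbid`, `file`) and a hypothetical `track` table with a `song_id` column; neither is Ampache's real schema, and `merge_duplicates` is an invented helper name. The function only updates the database and reports which files are no longer referenced; deleting them from disk is left to the admin.

```python
import sqlite3


def merge_duplicates(conn: sqlite3.Connection, mbid: str) -> list[str]:
    """Keep the lowest-id song for an MBID and repoint tracks to it.

    Returns the file paths of the song rows that were removed, so the
    caller can decide whether to delete those files.
    """
    rows = conn.execute(
        "SELECT id, file FROM song WHERE mbid = ? ORDER BY id", (mbid,)
    ).fetchall()
    if len(rows) < 2:
        return []  # nothing to merge
    keep_id = rows[0][0]
    redundant = rows[1:]
    with conn:  # one transaction: commit on success, roll back on error
        for song_id, _file in redundant:
            conn.execute(
                "UPDATE track SET song_id = ? WHERE song_id = ?",
                (keep_id, song_id),
            )
            conn.execute("DELETE FROM song WHERE id = ?", (song_id,))
    return [file for _song_id, file in redundant]


def build_demo() -> sqlite3.Connection:
    """In-memory database with two copies of one recording."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE song (id INTEGER PRIMARY KEY, mbid TEXT, file TEXT);
        CREATE TABLE track (
            id INTEGER PRIMARY KEY,
            album_id INTEGER NOT NULL,
            song_id INTEGER NOT NULL REFERENCES song(id)
        );
        INSERT INTO song VALUES (1, 'mbid-0001', '/music/studio/03.flac');
        INSERT INTO song VALUES (2, 'mbid-0001', '/music/best_of/07.flac');
        INSERT INTO track (album_id, song_id) VALUES (1, 1), (2, 2);
        """
    )
    return conn
```

Keeping the lowest ID is an arbitrary choice for the sketch; a real management screen would more likely let the admin pick which copy to keep, for example the one with the highest bitrate.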
What remains to be seen is whether the effort for such a feature and/or code rework is worth it. With storage space relatively inexpensive, it may not be a big problem to keep multiple copies of the same song. Also, other non-Ampache systems that build their databases from the file hierarchy (when used alongside Ampache on the same library) would appear to have files missing.
You can check your own database for duplicate MBIDs with this:
For me that shows 590 records, each a song that has at least one other copy on my file system. Assuming each recording has only one duplicate, that is 295 redundant files; at an average file size of 5 MB per song, that adds up to almost 1.5 GB of space (590 / 2 * 5 MB = 1,475 MB).
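A check of this kind, together with the arithmetic above, could be sketched as follows. This is an assumption-laden illustration rather than the query referred to above: it assumes a `song` table with `id`, `title` and `mbid` columns, runs against an in-memory SQLite database instead of Ampache's MySQL/MariaDB database, and the helper names are invented.

```python
import sqlite3

# Return every song row whose MBID is shared with at least one other row.
# Table and column names are assumptions for this sketch.
DUPLICATE_SONGS = """
SELECT s.id, s.title, s.mbid
FROM song AS s
JOIN (
    SELECT mbid
    FROM song
    WHERE mbid IS NOT NULL AND mbid <> ''
    GROUP BY mbid
    HAVING COUNT(*) > 1
) AS d ON d.mbid = s.mbid
ORDER BY s.mbid, s.id
"""


def find_duplicate_songs(conn: sqlite3.Connection) -> list[tuple]:
    """List songs that share an MBID with another song."""
    return conn.execute(DUPLICATE_SONGS).fetchall()


def estimate_savings_mb(duplicate_rows: int, avg_mb: float = 5.0) -> float:
    """Estimate reclaimable space, assuming exactly two copies per recording."""
    return duplicate_rows / 2 * avg_mb


def build_demo() -> sqlite3.Connection:
    """In-memory database: one duplicated recording, two unique songs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE song (id INTEGER PRIMARY KEY, title TEXT, mbid TEXT)"
    )
    conn.executemany(
        "INSERT INTO song (id, title, mbid) VALUES (?, ?, ?)",
        [
            (1, "Example Song", "mbid-0001"),
            (2, "Example Song", "mbid-0001"),  # same recording, other album
            (3, "Other Song", "mbid-0002"),
            (4, "Untagged Song", None),        # no MBID, ignored by the check
        ],
    )
    conn.commit()
    return conn
```

Calling `estimate_savings_mb(590)` gives 1475.0, matching the figure of almost 1.5 GB. Songs without an MBID are excluded, since a shared empty value says nothing about whether two files hold the same recording.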
Like I said, with people having TBs of disk space, and the amount of work that would be required, I think this is an interesting feature idea that in the end doesn't pass the bang for buck test unless a massive rewrite is in the works. However, if there is interest, the database changes to separate out "track" data from "song" data could be made and the de-duplication work added on later.