This could possibly be considered a sub-issue of #5809, but it's not directly discussed there. I'd like to work on this but could use some guidance on the right approach before submitting any PRs.
In exploring ways to re-validate CHDs in CD-media software lists to address the CHD hash issue (#2517), I've found that the main challenge is source references. Nearly half of the CD-media software lists don't contain source references, and those that have them often don't record them consistently. Many sources are also outdated/bad dumps or from defunct groups, in some cases with questionable ripping methods, and could be updated to redump (redump seems to be the preference from what I can see in other conversations, but TOSEC/No-Intro can also be used where needed, or for unique discs tracked in those projects).
The most common way to handle the source reference is an XML comment at the beginning of the hash node. This can usually be parsed easily enough for single-disk entries (though it's challenging when a submitter mixes freeform comments with the rom sources, especially if they reference alternative files). Multi-disk entries cause more issues: it becomes difficult to programmatically and reliably determine which source comment lines belong with which softlist entry part. Example entry from segacd:
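A hypothetical reconstruction of the pattern (the software name and CHD sha1s here are invented for illustration; the per-track source sha1s are the ones reused in the fingerprint example below):

```xml
<software name="examplegame">
	<description>Example Game</description>
	<!-- Sourced from redump:
	     Example Game (USA) (Disc 1).cue
	     Example Game (USA) (Disc 1) (Track 1).bin sha1 927b6d0525010265059ab1ee4d23c6a0598b7fb8
	     Example Game (USA) (Disc 1) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e
	     Example Game (USA) (Disc 2).cue
	     Example Game (USA) (Disc 2) (Track 1).bin sha1 f6a8c90a517285d7763c72b881a5d2ace99d4403
	     Example Game (USA) (Disc 2) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<part name="cdrom1" interface="scd_cdrom">
		<diskarea name="cdrom">
			<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
		</diskarea>
	</part>
	<part name="cdrom2" interface="scd_cdrom">
		<diskarea name="cdrom">
			<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
		</diskarea>
	</part>
</software>
```

Note how the single comment block at the top of the node has to cover both parts, which is what makes attribution ambiguous.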
In this specific case the comment can be parsed easily enough by using the cue as the delimiter, but .cue/.toc lines can't be relied on in general: sometimes the TOC file is the first entry, sometimes the last, occasionally somewhere in the middle, and in rare cases it's excluded from the source reference entirely. There is also a risk that the comment itself lists the discs out of order.
Option 1
The Amiga CDTV softlist uses one possible solution, which is to move the source comments into the disk parts, so the above could look like this:
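A sketch of that per-part layout, again with invented names and CHD hashes (the comment format follows the general idea rather than cdtv.xml verbatim):

```xml
<part name="cdrom1" interface="scd_cdrom">
	<!-- Example Game (USA) (Disc 1) (Track 1).bin sha1 927b6d0525010265059ab1ee4d23c6a0598b7fb8 -->
	<!-- Example Game (USA) (Disc 1) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
	</diskarea>
</part>
<part name="cdrom2" interface="scd_cdrom">
	<!-- Example Game (USA) (Disc 2) (Track 1).bin sha1 f6a8c90a517285d7763c72b881a5d2ace99d4403 -->
	<!-- Example Game (USA) (Disc 2) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
	</diskarea>
</part>
```

With the comments inside each part, there is no longer any question about which source lines belong to which disc.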
This approach is workable, as it would solve the main issue of properly attributing source lines to a soft entry part. That said, copying/pasting the entire DAT entry isn't particularly valuable from a data perspective: the TOC files can and do change over time as source groups change their standards or filenames, as do the individual track filenames. The only reliable bits of 'data' in the source comments are the binary track data itself: the hashes, file order, file sizes, and total number of files. These only change if a bad dump is replaced.
Option 2
In trying to determine how best to use source comments to match a source in a DAT, the best solution I've found is to use the comments to create a sort of 'fingerprint' from the source track data: concatenating the binary track hashes and then hashing that concatenated string creates a unique key that can be used to search a lookup table built from the source DATs using the same method. (As noted above, cue/gdi TOC files are ignored, since they can't be trusted to remain static.)
Bash example: Get a set of hashes for each disc (concatenate sha1 of tracks 1 and 2 of the above entry):
Disk 1:
% echo -n 927b6d0525010265059ab1ee4d23c6a0598b7fb8ea3b156235b3554b5d522054f83c811a3fc3188e | shasum
869f4fbc8f0aae745d455bd58a3b11f4e5adc2f5 -
Disk 2:
% echo -n f6a8c90a517285d7763c72b881a5d2ace99d4403ea3b156235b3554b5d522054f83c811a3fc3188e | shasum
c702c39706fd779dad82d5189931d95697976d71 -
Using this value as the source reference, the same softlist entry could then look something like this:
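For example (entry details invented; the comment key name is just a suggestion), using the per-disc fingerprints computed above:

```xml
<part name="cdrom1" interface="scd_cdrom">
	<!-- source_fingerprint: 869f4fbc8f0aae745d455bd58a3b11f4e5adc2f5 -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
	</diskarea>
</part>
<part name="cdrom2" interface="scd_cdrom">
	<!-- source_fingerprint: c702c39706fd779dad82d5189931d95697976d71 -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
	</diskarea>
</part>
```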
This is enough info to find a source entry in a redump DAT file, and it's recorded in a way that addresses some of #5809 by providing a standardized way to report specifically on source stats.
Note that the hash is the only reference I've included; in practice I haven't seen any issues with collisions. This hash inherently encodes the track order, but additional fields such as source_filecount and source_total_binarysize could be added to further reduce the possibility of collisions if that is a concern.
In one of the other conversations it was asked whether the redump URL itself would be enough in lieu of the more detailed source reference. The challenge with that is that it requires scraping the redump website to get the actual hashes, and relying on that site staying online doesn't seem like a good idea. The 'fingerprint' idea works fully offline and survives DAT group / webserver changes as well.
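As a rough sketch of the lookup-table side, the following builds a fingerprint table from a redump-style DAT. The sample DAT embedded here is invented for illustration (it reuses the disc 1 track sha1s from the example above); real redump DATs share the same game/rom structure, and .cue/.gdi entries are skipped as discussed:

```shell
#!/bin/sh
# Sketch: build a fingerprint -> title lookup table from a redump-style DAT.
# The sample DAT is hypothetical; it reuses the disc 1 track sha1s from above.
cat > /tmp/sample.dat <<'EOF'
<datafile>
  <game name="Example Game (USA)">
    <rom name="Example Game (USA).cue" size="120" sha1="0000000000000000000000000000000000000000"/>
    <rom name="Example Game (USA) (Track 1).bin" size="352800" sha1="927b6d0525010265059ab1ee4d23c6a0598b7fb8"/>
    <rom name="Example Game (USA) (Track 2).bin" size="529200" sha1="ea3b156235b3554b5d522054f83c811a3fc3188e"/>
  </game>
</datafile>
EOF

# shasum on macOS, sha1sum on most Linux systems
SHA1=$(command -v shasum || command -v sha1sum)

awk -v sha1="$SHA1" '
  /<game /  { match($0, /name="[^"]*"/); game = substr($0, RSTART+6, RLENGTH-7); hashes = "" }
  /<rom /   {
              # skip TOC files - .cue/.gdi contents cannot be trusted to stay static
              if ($0 ~ /name="[^"]*\.(cue|gdi)"/) next
              match($0, /sha1="[0-9a-f]*"/)
              hashes = hashes substr($0, RSTART+6, RLENGTH-7)
            }
  /<\/game>/ {
              # fingerprint = sha1 of the concatenated binary track sha1s
              cmd = "printf %s " hashes " | " sha1
              cmd | getline line; close(cmd)
              split(line, f, " ")
              print f[1] "\t" game
            }
' /tmp/sample.dat > /tmp/fingerprints.tsv

cat /tmp/fingerprints.tsv
```

For the sample this prints the disc 1 fingerprint from above (869f4fbc...) alongside the title; a softlist fingerprint can then be resolved offline with a simple grep against the generated table.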
I only see two downsides with this approach:
- It's a bit more cumbersome for submitters working by hand: copy/paste from the source DAT is easier than manually generating a hash, as is manually searching for individual hashes from the original sources (though I don't believe that last bit is a common use case).
- Finding the replacement for a bad dump would be harder: retaining each individual hash would allow some automation, as typically only one track hash changes when a dump is updated. Note this could be largely mitigated if a goal were to explicitly include every redump entry in the softlists, as it would just be a matter of periodically finding the mismatches.
Let me know if either of these approaches (or some mix) would be acceptable, or if there are other suggestions to help address this issue. I'd like to start updating source info for some of the software lists below (adding sources where they don't exist today, shifting existing entries to the latest redump where available, etc.).
Source Coverage Reference Info
This data is based on the script I'm using to parse the softlists. I'm confident in the DAT matches it finds, but it isn't able to parse 100% of the source references because of the challenges above; getting these sorts of stats in a reliable and consistent way would be one benefit of implementing some of these ideas. I only mapped these against redump, TOSEC, and No-Intro, as I'm unable to find DATs for any of the defunct groups that are sometimes used, e.g. trurip.
Softlists with source references:
cdi.xml
cdtv.xml - partially documented as per option 1 above (rom info in disc part)
psx.xml
dc.xml - (would like to standardize more of this on redump based on 3410)
fm_towns_cd.xml
pc98_cd.xml
megacd.xml - 75% in current redump/tosec
megacdj.xml
segacd.xml
Softlists without source references: