This could possibly be considered a sub-issue of #5809, but it's not directly discussed there. I'd like to work on this but could use some guidance on the right approach before submitting any PRs.
In exploring ways to re-validate CHDs in CD-media software lists to address the CHD hash issue (#2517), I've found that the main challenge is source references. Nearly half of the CD-media software lists don't contain source references, and those that have them often don't record them consistently. Many sources are also outdated/bad dumps or from defunct groups, in some cases with questionable ripping methods, and could be updated to redump (redump seems to be the preference from what I can see in other conversations, but TOSEC/No-Intro can also be used where needed, or for unique discs tracked in those projects).
The most common way to handle the source reference is an XML comment at the beginning of the hash node. This can usually be parsed easily enough for single-disk entries (though it's challenging when a submitter mixes freeform comments with the rom sources, especially if they reference alternative files). Multi-disk entries cause more issues: it becomes difficult to programmatically and reliably determine which source comment lines belong with which softlist entry part. Example entry from segacd:
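A hypothetical reconstruction of the pattern (the software name and CHD sha1s here are invented for illustration; the per-track source sha1s are the ones reused in the fingerprint example below):

```xml
<software name="examplegame">
	<description>Example Game</description>
	<!-- Sourced from redump:
	     Example Game (USA) (Disc 1).cue
	     Example Game (USA) (Disc 1) (Track 1).bin sha1 927b6d0525010265059ab1ee4d23c6a0598b7fb8
	     Example Game (USA) (Disc 1) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e
	     Example Game (USA) (Disc 2).cue
	     Example Game (USA) (Disc 2) (Track 1).bin sha1 f6a8c90a517285d7763c72b881a5d2ace99d4403
	     Example Game (USA) (Disc 2) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<part name="cdrom1" interface="scd_cdrom">
		<diskarea name="cdrom">
			<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
		</diskarea>
	</part>
	<part name="cdrom2" interface="scd_cdrom">
		<diskarea name="cdrom">
			<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
		</diskarea>
	</part>
</software>
```

Note how the single comment block at the top of the node has to cover both parts, which is what makes attribution ambiguous.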
In this specific case the comment can be parsed easily enough by using the cue as the delimiter, but .cue/.toc lines can't be relied on in general: sometimes the TOC file is the first entry, sometimes the last, occasionally somewhere in the middle, and in rare cases it's excluded from the source reference entirely. There is also a risk that the comment itself lists the discs out of order.
Option 1
The Amiga CDTV softlist uses one possible solution, which is to move the source comments into the disk parts, so the above could look like this:
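A sketch of that per-part layout, again with invented names and CHD hashes (the comment format follows the general idea rather than cdtv.xml verbatim):

```xml
<part name="cdrom1" interface="scd_cdrom">
	<!-- Example Game (USA) (Disc 1) (Track 1).bin sha1 927b6d0525010265059ab1ee4d23c6a0598b7fb8 -->
	<!-- Example Game (USA) (Disc 1) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
	</diskarea>
</part>
<part name="cdrom2" interface="scd_cdrom">
	<!-- Example Game (USA) (Disc 2) (Track 1).bin sha1 f6a8c90a517285d7763c72b881a5d2ace99d4403 -->
	<!-- Example Game (USA) (Disc 2) (Track 2).bin sha1 ea3b156235b3554b5d522054f83c811a3fc3188e -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
	</diskarea>
</part>
```

With the comments inside each part, there is no longer any question about which source lines belong to which disc.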
This approach is workable, as it would solve the main issue of properly attributing source lines to a soft entry part. That said, copying/pasting the entire DAT entry isn't particularly valuable from a data perspective: the TOC files can and do change over time as source groups change their standards or filenames, as do the individual track filenames. The only reliable bits of 'data' in the source comments are the binary track data itself: the hashes, file order, file sizes, and total number of files. These only change if a bad dump is replaced.
Option 2
In trying to determine how best to use source comments to match a source in a DAT, the best solution I've found is to use the comments to create a sort of 'fingerprint' from the source track data: concatenating the binary track hashes and then hashing that concatenated string creates a unique key that can be used to search a lookup table built from the source DATs using the same method. (As noted above, cue/gdi TOC files are ignored, since they can't be trusted to remain static.)
Bash example: Get a set of hashes for each disc (concatenate sha1 of tracks 1 and 2 of the above entry):
Disk 1:
% echo -n 927b6d0525010265059ab1ee4d23c6a0598b7fb8ea3b156235b3554b5d522054f83c811a3fc3188e | shasum
869f4fbc8f0aae745d455bd58a3b11f4e5adc2f5 -
Disk 2:
% echo -n f6a8c90a517285d7763c72b881a5d2ace99d4403ea3b156235b3554b5d522054f83c811a3fc3188e | shasum
c702c39706fd779dad82d5189931d95697976d71 -
Using this value as the source reference, the same softlist entry could then look something like this:
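For example (entry details invented; the comment key name is just a suggestion), using the per-disc fingerprints computed above:

```xml
<part name="cdrom1" interface="scd_cdrom">
	<!-- source_fingerprint: 869f4fbc8f0aae745d455bd58a3b11f4e5adc2f5 -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 1)" sha1="1111111111111111111111111111111111111111"/>
	</diskarea>
</part>
<part name="cdrom2" interface="scd_cdrom">
	<!-- source_fingerprint: c702c39706fd779dad82d5189931d95697976d71 -->
	<diskarea name="cdrom">
		<disk name="example game (usa) (disc 2)" sha1="2222222222222222222222222222222222222222"/>
	</diskarea>
</part>
```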
This is enough info to find a source entry in a redump DAT file, and it's recorded in a way that addresses some of #5809 by providing a standardized way to report specifically on source stats.
Note that the hash is the only reference I've included; in practice I haven't seen any issues with collisions. This hash inherently encodes the track order, but additional fields such as source_filecount and source_total_binarysize could be added to further reduce the possibility of collisions if that is a concern.
In one of the other conversations it was asked whether the redump URL itself would be enough in lieu of the more detailed source reference. The challenge with that is that it requires scraping the redump website to get the actual hashes, and relying on that site staying online doesn't seem like a good idea. The 'fingerprint' idea works fully offline and survives DAT group / webserver changes as well.
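As a rough sketch of the lookup-table side, the following builds a fingerprint table from a redump-style DAT. The sample DAT embedded here is invented for illustration (it reuses the disc 1 track sha1s from the example above); real redump DATs share the same game/rom structure, and .cue/.gdi entries are skipped as discussed:

```shell
#!/bin/sh
# Sketch: build a fingerprint -> title lookup table from a redump-style DAT.
# The sample DAT is hypothetical; it reuses the disc 1 track sha1s from above.
cat > /tmp/sample.dat <<'EOF'
<datafile>
  <game name="Example Game (USA)">
    <rom name="Example Game (USA).cue" size="120" sha1="0000000000000000000000000000000000000000"/>
    <rom name="Example Game (USA) (Track 1).bin" size="352800" sha1="927b6d0525010265059ab1ee4d23c6a0598b7fb8"/>
    <rom name="Example Game (USA) (Track 2).bin" size="529200" sha1="ea3b156235b3554b5d522054f83c811a3fc3188e"/>
  </game>
</datafile>
EOF

# shasum on macOS, sha1sum on most Linux systems
SHA1=$(command -v shasum || command -v sha1sum)

awk -v sha1="$SHA1" '
  /<game /  { match($0, /name="[^"]*"/); game = substr($0, RSTART+6, RLENGTH-7); hashes = "" }
  /<rom /   {
              # skip TOC files - .cue/.gdi contents cannot be trusted to stay static
              if ($0 ~ /name="[^"]*\.(cue|gdi)"/) next
              match($0, /sha1="[0-9a-f]*"/)
              hashes = hashes substr($0, RSTART+6, RLENGTH-7)
            }
  /<\/game>/ {
              # fingerprint = sha1 of the concatenated binary track sha1s
              cmd = "printf %s " hashes " | " sha1
              cmd | getline line; close(cmd)
              split(line, f, " ")
              print f[1] "\t" game
            }
' /tmp/sample.dat > /tmp/fingerprints.tsv

cat /tmp/fingerprints.tsv
```

For the sample this prints the disc 1 fingerprint from above (869f4fbc...) alongside the title; a softlist fingerprint can then be resolved offline with a simple grep against the generated table.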
I only see two downsides with this approach:
- It's a bit more cumbersome for submitters working by hand: copy/paste from the source DAT is easier than manually generating a hash, as is manually searching for individual hashes from the original sources (though I don't believe that last bit is a common use case).
- Finding the replacement for a bad dump would be harder: retaining each individual hash would allow some automation, as typically only one track hash changes when a dump is updated. Note this could be largely mitigated if a goal were to explicitly include every redump entry in the softlists, as it would just be a matter of periodically finding the mismatches.
Let me know if either of these approaches (or some mix) would be acceptable, or if there are other suggestions to help address this issue. I'd like to start updating source info for some of the software lists below (adding sources where they don't exist today, shifting existing entries to the latest redump where available, etc.).
Source Coverage Reference Info
This data is based on the script I'm using to parse the softlists. I'm confident in the DAT matches it finds, but it isn't able to parse 100% of the source references because of the challenges above; getting these sorts of stats in a reliable and consistent way would be one benefit of implementing some of these ideas. I only mapped these against redump, TOSEC, and No-Intro, as I'm unable to find DATs for any of the defunct groups that are sometimes used, e.g. trurip.
Softlists with source references:
cdi.xml
cdtv.xml - partially documented as per option 1 above (rom info in disc part)
psx.xml
dc.xml - (would like to standardize more of this on redump based on 3410)
fm_towns_cd.xml
pc98_cd.xml
megacd.xml - 75% in current redump/tosec
megacdj.xml
segacd.xml
Softlists without source references: