A LiveCode function to return the offsets of a character in a string. Uses items. Is much faster than the built-in offset function with chars-to-skip.

gcanyon/alloffsets

Description

AllOffsets is a LiveCode function that returns a comma-delimited list of all the offsets of a character in a string. For large Unicode strings with many hits, the obvious solution, calling offset repeatedly with the characters-to-skip parameter (or codepointOffset with codepoints-to-skip), scales poorly. Two versions are included: allOffsetsItems and allOffsetsUTF32.

allOffsetsItems

Works by setting the item delimiter to stringToFind and parsing the items of stringToSearch. Case sensitivity is handled by setting caseSensitive, and overlapping results are handled by specific code within the repeat-for-each-item loop. allOffsetsItems seems to be immune, or at least resistant, to the issues listed below. It is substantially faster than offset for strings that include multibyte characters, in some cases by hundreds of times: in testing on a 50,000-character string, it was 300x faster than simply using offset.
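
The splitting idea can be sketched outside LiveCode. The following Python function is only an illustration of the technique, not the repository's code: the name all_offsets_items, its parameters, and in particular the way overlaps are detected are assumptions made for this sketch, and case handling is left out. Each item produced by the split ends where a match begins, so a running position yields the offsets without rescanning the string.

```python
def all_offsets_items(string_to_find, string_to_search, overlaps=False):
    """Return the 1-based offsets of string_to_find, found by splitting on it."""
    if not string_to_find:
        return []
    needle_length = len(string_to_find)
    offsets = []
    position = 1  # 1-based offset at which the current item starts
    items = string_to_search.split(string_to_find)
    for item in items[:-1]:  # the last item is not followed by a match
        match = position + len(item)
        offsets.append(match)
        if overlaps:
            # Assumed overlap handling: a further match can only begin
            # inside the match that the split just consumed.
            for extra in range(match + 1, match + needle_length):
                if string_to_search.startswith(string_to_find, extra - 1):
                    offsets.append(extra)
        position = match + needle_length
    return offsets


print(",".join(str(n) for n in all_offsets_items("a", "banana")))
```

Splitting naturally yields non-overlapping matches, which is why the overlapping case needs the extra check inside the loop.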

allOffsetsUTF32

Converts stringToFind and stringToSearch to UTF-32, i.e. fixed-width 4-byte values. This allows byteOffset to be used, which is efficient because every character occupies the same number of bytes. Case-insensitivity is handled by applying toUpper to both strings before conversion. Overlapping results are a natural result of the algorithm; non-overlapping results are generated by a special case in the loop. allOffsetsUTF32 is designed to handle the 4-byte boundary correctly, meaning that it should never return an incorrect result caused by a four-byte match that straddles two characters. allOffsetsUTF32 appends "せ" to both strings before converting to UTF-32 and then strips the last 4 bytes, because this seems to address some or all of the issues listed below. allOffsetsUTF32 scales similarly to allOffsetsItems, but appears to be about 3x faster in general.
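
A minimal Python sketch of the fixed-width byte search follows. It is an illustration under stated assumptions, not the LiveCode implementation: the function name all_offsets_utf32 and its parameters are hypothetical, big-endian UTF-32 without a byte-order mark is chosen arbitrarily, and the "せ" suffix and the toUpper step are not reproduced. It shows the two points described above: byte hits are accepted only on a 4-byte boundary, and overlapping matches fall out of advancing one character at a time.

```python
def all_offsets_utf32(string_to_find, string_to_search, overlaps=True):
    """Return 1-based character offsets using a byte search over UTF-32."""
    if not string_to_find:
        return []
    # "utf-32-be" writes exactly 4 bytes per code point and no byte-order mark.
    needle = string_to_find.encode("utf-32-be")
    haystack = string_to_search.encode("utf-32-be")
    offsets = []
    start = 0
    while True:
        hit = haystack.find(needle, start)
        if hit == -1:
            break
        if hit % 4 == 0:
            # Aligned hit: convert the byte position to a 1-based offset.
            offsets.append(hit // 4 + 1)
            # Advance one character for overlapping results, or past the
            # whole match for non-overlapping results.
            start = hit + (4 if overlaps else len(needle))
        else:
            # Misaligned hit: the bytes straddle two characters, so skip it.
            start = hit + 1
    return offsets


# U+0001 followed by U+0000 contains the bytes of U+0100 at byte offset 1,
# which is not a character boundary and must not be reported.
print(all_offsets_utf32("\u0100", "\u0001\u0000\u0100"))
```

The example input is contrived, but it demonstrates the kind of cross-character match that the boundary check exists to reject.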

Issues

Unicode is funky. As I understand it, í (for example) can be represented two ways: as the single precomposed character í (U+00ED), or as i followed by a combining acute accent (U+0301), meaning two code points rendered as one character. This can (does?) have an impact on search algorithms: a search for "í" might not find the combined form (i with ´), or an incorrect hit might be returned when searching for "i" in a stringToSearch that contains (i with ´). At this point both allOffsetsItems and allOffsetsUTF32 as implemented seem to return correct results, but both depend on underlying aspects of the LC engine that might change and break them. I think allOffsetsItems is a bit safer and less likely to break, but it is a more involved implementation and a little slower, although both functions scale well.
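
The two representations, and both failure modes, can be reproduced with Python's standard unicodedata module. This is only a demonstration of the Unicode behavior at the code point level; it says nothing about how the LiveCode engine compares strings, and the variable names are made up for the example. Normalizing both strings to the same form (NFC here) before searching is one common way to make the comparison consistent.

```python
import unicodedata

precomposed = "\u00ed"   # í as a single code point
decomposed = "i\u0301"   # i followed by COMBINING ACUTE ACCENT

text = "s" + decomposed  # "sí" stored in decomposed form

# Without normalization, a code point search misses the accented
# character and reports a hit for the bare "i".
missed_hit = precomposed in text
false_hit = "i" in text


def nfc(s):
    """Compose base characters and combining marks where possible."""
    return unicodedata.normalize("NFC", s)


# After normalizing both sides, the accented character is found and
# the bare "i" no longer matches.
found_after_nfc = nfc(precomposed) in nfc(text)
false_hit_after_nfc = "i" in nfc(text)

print(missed_hit, false_hit, found_after_nfc, false_hit_after_nfc)
```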
