Clustering dialect varieties based on historical sound correspondences

Can we (meaningfully) cluster dialects based on sound correspondences? Wieling and Nerbonne (2011), for example, align phonetically transcribed doculect data with data from a reference doculect and cluster the doculects based on the presence/absence of sound segment alignments. The results are doculect clusters as well as analyses of the correlations between segment alignments and clusters.

Using data from Heggarty (2018), I follow a similar approach, but use Proto-Germanic data as the reference doculect for a set of (continental) West Germanic doculects to explore how historical sound shifts are associated with the resulting clusters.
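The presence/absence representation described above can be illustrated with a small sketch. It is not taken from the repository's scripts: the doculect names and correspondence sets are invented toy data, the function names are hypothetical, and the Jaccard distance is only one plausible choice for comparing binary vectors (the thesis may use a different measure).

```python
def presence_matrix(correspondences):
    """Map each doculect to a 0/1 vector over all observed correspondences."""
    features = sorted({c for cs in correspondences.values() for c in cs})
    matrix = {
        doculect: [1 if f in cs else 0 for f in features]
        for doculect, cs in correspondences.items()
    }
    return features, matrix


def jaccard_distance(u, v):
    """Return 1 - |intersection| / |union| for two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


# Toy, invented data: (reference segment, doculect segment) pairs.
toy = {
    "doculect_A": {("p", "pf"), ("t", "ts"), ("k", "k")},
    "doculect_B": {("p", "pf"), ("t", "ts"), ("k", "x")},
    "doculect_C": {("p", "p"), ("t", "t"), ("k", "k")},
}

features, matrix = presence_matrix(toy)
for name, row in matrix.items():
    print(name, row)
print(jaccard_distance(matrix["doculect_A"], matrix["doculect_B"]))
```

Each column of the matrix stands for one sound correspondence, so doculects that share many correspondences with the reference end up close to each other.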

Details on this are in my Bachelor's thesis, which can be found here. A summary and a presentation are also available.

Abstract

While information on historical sound shifts plays an important role for examining the relationships between related language varieties, it has rarely been used for computational dialectology. This thesis explores the performance of two algorithms for clustering language varieties based on sound correspondences between Proto-Germanic and modern continental West Germanic dialects. Our experiments suggest that the results of agglomerative clustering match common dialect groupings more closely than the results of (divisive) bipartite spectral graph co-clustering. We also observe that adding phonetic context information to the sound correspondences yields clusters that are more frequently associated with representative and distinctive sound correspondences.
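As a rough illustration of the agglomerative approach mentioned in the abstract, the following pure-Python sketch merges the two closest clusters until a target number of clusters remains. It is not the code in cluster.py: the average-linkage criterion, the function name and the distance values are assumptions chosen for the example.

```python
def agglomerative(names, dist, n_clusters):
    """Average-linkage agglomerative clustering (illustrative sketch).

    names: list of item labels
    dist: dict mapping frozenset({a, b}) to the distance between a and b
    n_clusters: number of clusters at which merging stops
    """
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented distances between four hypothetical doculects.
toy_dist = {
    frozenset(("A", "B")): 0.2,
    frozenset(("C", "D")): 0.3,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.9,
    frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.9,
}

result = agglomerative(["A", "B", "C", "D"], toy_dist, n_clusters=2)
print(result)
```

Because the procedure starts from single doculects and merges upwards, it is the bottom-up counterpart to the divisive co-clustering method that the abstract compares it with.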

Errata

The last sentence of section 4.3.2 Bipartite Spectral Graph Co-clustering (p. 13) should read "The results from this method are hereafter referred to as BSGC-context and BSGC-nocontext."

Running the software

To run the scripts on Windows, start run.bat. It sets the scripts' IO encodings and a hash seed for Python (to get consistent results across runs), then runs cluster.py, which calls the other Python scripts as needed.

On UNIX, run:

export PYTHONHASHSEED=123
python3 cluster.py
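The hash seed only takes effect if it is in the environment before the Python interpreter starts. As a platform-independent alternative to the shell commands, a small launcher along these lines could be used. This is a sketch and not part of the repository: the function name and the `capture` parameter are hypothetical, and the seed value 123 is simply the one used above.

```python
import os
import subprocess
import sys


def run_with_hash_seed(args, seed=123, capture=False):
    """Run a Python child process with PYTHONHASHSEED fixed to `seed`.

    The variable is read at interpreter start-up, so it is passed through
    the child's environment instead of being changed in the running script.
    """
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(seed)
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=capture,
        text=True,
        check=True,
    )


# Hypothetical usage from the repository root:
# run_with_hash_seed(["cluster.py"])
```

With the same seed, string hashes (and therefore the iteration order of sets of strings) are identical across runs, which is what makes the results reproducible.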


verenablaschke/dialect-clustering

Folders and files

Latest commit

History

Repository files navigation

Clustering dialect varieties based on historical sound correspondences

Abstract

Errata

Running the software

About

Topics

Resources

Stars

Watchers

Forks

Languages