make_wikipedia.py: long running time #121

Open
chschroeder opened this issue Feb 13, 2024 · 3 comments

@chschroeder

Hi, thank you for sharing this outstanding repository!

I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:

python scripts/make_wikipedia.py --output wikipedia --lang de  --date 20240201 --processes 16

Unfortunately, it has been running for several days and, if I interpret the output correctly, it has made very little progress:

[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]

At this speed, it would take weeks to complete. Using htop, I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58); I am also running it on a Linux machine.

This is likely a problem in the underlying wikiextractor library, but since there seems to be little to no activity there, I am interested in your experience with this script. Is it normal for it to take so long?
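
For anyone trying to gauge progress from warnings like the ones quoted above, a rough tally can help. The following is a minimal sketch, not part of the repository: the regular expression is inferred only from the five sample lines in this issue, and the function names `parse_warning` and `summarize` are hypothetical. The thread does not say what the three recursion counters mean, so the sketch only counts them.

```python
import re
from collections import Counter

# Pattern inferred from the sample warnings in this issue (an assumption,
# not a documented log format).
WARNING_RE = re.compile(
    r"^WARNING:root:Template errors in article '(?P<title>.*)' \((?P<id>\d+)\): "
    r"title\((?P<title_errs>\d+)\) recursion\((?P<recursion>[\d, ]+)\)$"
)


def parse_warning(line):
    """Return a dict for a matching warning line, or None for any other line."""
    match = WARNING_RE.match(line.strip())
    if match is None:
        return None
    recursion = tuple(int(part) for part in match.group("recursion").split(","))
    return {
        "title": match.group("title"),
        "id": int(match.group("id")),
        "recursion": recursion,
    }


def summarize(lines):
    """Count warned articles and tally the first recursion counter."""
    records = [rec for rec in map(parse_warning, lines) if rec is not None]
    return {
        "articles": len(records),
        "max_id": max((rec["id"] for rec in records), default=None),
        "recursion_counts": Counter(rec["recursion"][0] for rec in records),
    }
```

Feeding the captured log to `summarize` at two points in time and comparing the `articles` counts gives a lower bound on how many articles were handled in between, since only articles that triggered a warning are counted.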

@soldni
Member

soldni commented Feb 21, 2024

Mmmh, I have not tried to process DE Wikipedia in a while, but when I did it last year I did not run into this issue. I've heard good things about the MediaWiki Parser from Hell (mwparserfromhell), which is luckily still in active development.
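
As a rough illustration of what switching parsers could look like, here is a minimal sketch that turns one article's wikitext into plain text with mwparserfromhell. It assumes the package is installed (`pip install mwparserfromhell`); the helper name `wikitext_to_plain` is hypothetical, and this is not how make_wikipedia.py currently works.

```python
def wikitext_to_plain(wikitext):
    """Strip templates, links, and formatting markup from a wikitext string."""
    # Imported lazily so this module still loads where the package is missing.
    import mwparserfromhell

    wikicode = mwparserfromhell.parse(wikitext)
    # strip_code() drops templates and keeps the visible text of links and tags.
    return wikicode.strip_code()
```

This only covers a single article string. Reading the dump XML and parallelizing over articles would still have to be handled separately.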

@chschroeder
Author

Thank you for the tip! In the meantime, I have discovered a pre-parsed dataset on the Hugging Face Hub: wikimedia/wikipedia. They also seem to use this parser, so I will try using that dataset for now.
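
For reference, a minimal sketch of reading that dataset with the Hugging Face `datasets` library in streaming mode, so nothing has to be downloaded up front. The config name `20231101.de` and the `title` field are assumptions here and should be checked against the dataset card; the function names are hypothetical.

```python
from itertools import islice


def open_german_wikipedia(config="20231101.de"):
    """Open the pre-parsed dump as a stream (config name is an assumption)."""
    # Imported lazily; requires `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset("wikimedia/wikipedia", config, split="train", streaming=True)


def first_titles(records, n):
    """Return the titles of the first n records from any iterable of dicts."""
    return [record["title"] for record in islice(records, n)]
```

A quick check would be `first_titles(open_german_wikipedia(), 5)`, which pulls only the first few records from the stream.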

@huangwei2913

It took a very long time to process the dataset. My computer has 16 virtual processors; the last entry in /proc/cpuinfo is as follows:
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 151
model name : 12th Gen Intel(R) Core(TM) i5-12600KF
stepping : 2
microcode : 0x32
cpu MHz : 3700.000
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 31
cpu cores : 10
apicid : 62
initial apicid : 62
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 7372.80
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
How can we accelerate the extraction speed?
