File export crashed for non utf-8 characters #647

Open
florian-huber opened this issue May 15, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@florian-huber
Collaborator

I imported recent MoNA data and tried to export it as .mgf.
The export failed with the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2070' in position 102: character maps to <undefined>

I believe it was triggered by the spectrum below, which was imported from an MSP file:

Name: methyl 8-hydroxy-4,5,7,10,14,14-hexamethyl-6,17-dioxo-16-oxapentacyclo[13.2.2.0¹,¹³.0²,¹⁰.0⁵,⁹]nonadeca-3,7-diene-9-carboxylate
Synon: $:00in-source
DB#: VF-NPL-QTOF008906
InChIKey: LISLGICORSJAKO-UHFFFAOYSA-N
Precursor_type: [M+H]+
Spectrum_type: MS2
PrecursorMZ: 443.2423
Instrument_type: LC-ESI-QTOF
Instrument: Agilent 6530 Q-TOF
Ion_mode: P
Collision_energy: 40 V
Formula: C26H34O6
MW: 442
ExactMass: 442.235538808
Comments: "computed SMILES=O=C(OC)C12C(O)=C(C(=O)C2(C(=CC3C45C(=O)OC(CC4)C(C)(C)C5CCC31C)C)C)C" "computed InChI=InChI=1S/C26H34O6/c1-13-12-16-23(5,26(21(30)31-7)19(28)14(2)18(27)24(13,26)6)10-8-15-22(3,4)17-9-11-25(15,16)20(29)32-17/h12,15-17,28H,8-11H2,1-7H3" "computed [2M+H-H2O]+=867.463707616" "computed [2M+K]+=923.569377616" "computed [M+H]+=443.243508808" "computed [2M+Na]+=907.460847616" "computed [2M+Cl]-=919.924077616" "computed [2M+HAc-H]-=943.515107616" "computed [M-H20-H]-=423.217148808" "computed [M-H]-=441.228262808" "computed [2M-H]-=883.463801616" "computed [M+Na]+=465.225308808" "computed [M+H-H2O]+=425.228168808" "computed [2M+NH4]+=902.509657616" "computed [M+Cl]-=477.688538808" "computed [M+NH4]+=460.274118808" "computed [M+K]+=481.333838808" "computed [2M+H]+=885.479047616" "computed [M+HAc-H]-=501.279568808" "author=Arpana Vaniya and Alberto Valdes Tabernero" "compound id=MX_UC_501" "retention time=5.482" "computed spectral entropy=4.0086388293454975" "computed normalized entropy=0.882319774844684" "computed mass accuracy=2.7271945840936818" "computed mass error=-0.001208808000001227" "SPLASH=splash10-066r-0930000000-aa2842f96cdb495cc487" "submitter=submitter = Arpana Vaniya (University of California, Davis)" "MoNA Rating=4.615384615384616"
Num Peaks: 100
105.0696 17.901902
107.0839 18.174174
109.1003 2.618619
117.0648 2.319319
119.0845 100.000000
121.0612 3.602603
121.0993 10.780781
123.0426 2.351351
123.1107 1.989990
125.0635 2.232232
129.0696 3.359359
131.0837 21.209209
133.1011 8.746747
135.0548 6.297297
137.0587 2.276276
137.0921 1.988989
141.0667 2.016016
143.0842 4.709710
145.0996 37.784785
147.1172 3.839840
149.0587 4.772773
149.0930 2.857858
151.0416 7.238238
151.0695 5.746747
153.0520 2.747748
156.0939 2.912913
157.1000 15.731732
159.1159 20.935936
161.0904 2.071071
161.1326 4.305305
163.0773 10.517518
163.1370 3.946947
165.0899 6.880881
169.1031 3.968969
171.1169 16.634635
173.1296 24.457457
175.1438 3.849850
177.0884 5.374374
179.0698 12.842843
181.1073 2.746747
183.1128 7.426426
185.1288 7.831832
187.1285 8.412412
189.0906 9.244244
189.1566 2.430430
191.0893 3.555556
193.0851 7.036036
195.1125 2.015015
196.0878 2.140140
197.1324 3.412412
198.1362 2.446446
199.1027 3.406406
199.1383 3.501502
201.1246 3.592593
207.1137 2.091091
211.1441 7.317317
213.1612 30.044044
215.1215 4.165165
215.1766 7.812813
217.1161 1.994995
218.0908 2.883884
219.0990 2.572573
221.1376 2.867868
223.1408 4.390390
225.1290 6.897898
227.1372 4.367367
229.1244 2.933934
231.1654 2.025025
237.1515 2.891892
239.1456 6.199199
241.1410 3.310310
243.1398 4.755756
245.1092 3.384384
247.1448 2.824825
249.1384 2.855856
253.1222 5.973974
255.1364 4.081081
259.1476 4.540541
263.1476 2.872873
267.1350 6.793794
271.1293 11.623624
275.1654 2.109109
281.1890 5.641642
289.1529 1.986987
291.1944 2.365365
293.1572 2.438438
295.1599 6.584585
309.1850 3.061061
319.2018 2.540541
323.1708 2.405405
327.1907 2.597598
329.1914 2.297297
337.2133 8.057057
347.1978 4.383383
355.1880 2.322322
365.2033 8.666667
367.2289 2.190190
383.2175 2.862863
393.2009 5.143143
411.2106 7.419419
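
The error message can be checked against the compound name alone. The snippet below is a minimal sketch, assuming the export ran with a platform default encoding of cp1252 (the thread does not confirm this, but cp1252 is one of the encodings Python reports as 'charmap'). The helper `unencodable` is hypothetical and not part of matchms.

```python
# Superscript digits copied from the compound name in the spectrum above,
# written as escapes so the code points are unambiguous.
name = "pentacyclo[13.2.2.0\u00b9,\u00b9\u00b3.0\u00b2,\u00b9\u2070.0\u2075,\u2079]"


def unencodable(text, encoding):
    """Return the characters in `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# cp1252 is an assumption about the platform default, not something the
# traceback states directly.
print(unencodable(name, "cp1252"))  # ['⁰', '⁵', '⁹']
print(unencodable(name, "utf-8"))   # []
```

Under that assumption the characters are valid UTF-8 and only the encoding used for writing rejects them, so opening the output file with an explicit encoding="utf-8" may be enough to avoid the crash.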
florian-huber added the bug label on May 15, 2024
@florian-huber
Collaborator Author

Actually, I am no longer sure where this came from. It might also have been introduced during cleaning, maybe by the PubChem lookup?

I will check where this occurred so that we can reproduce it reliably, and then see whether we need to introduce something like x.encode('utf-8','ignore').decode("utf-8") anywhere.
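
One caveat about the round trip suggested above: '\u2070' is a valid Unicode character, so encoding to UTF-8 with 'ignore' leaves it untouched. The sketch below compares that round trip with two alternatives. The variable names are only illustrative, and NFKC normalization is a swapped-in technique that was not proposed in this thread.

```python
import unicodedata

# Fragment of the compound name: "0", superscript one, ",", superscript one,
# superscript zero.
text = "0\u00b9,\u00b9\u2070"

# UTF-8 can encode every character here, so nothing is dropped.
roundtrip_utf8 = text.encode("utf-8", "ignore").decode("utf-8")

# ASCII cannot encode the superscripts, so 'ignore' silently removes them.
roundtrip_ascii = text.encode("ascii", "ignore").decode("ascii")

# NFKC maps superscript digits to plain digits instead of deleting them.
normalized = unicodedata.normalize("NFKC", text)

print(roundtrip_utf8 == text)  # True
print(roundtrip_ascii)         # 0,
print(normalized)              # 01,10
```

Dropping or normalizing the characters both change the compound name, so which option is acceptable depends on how the name is used downstream.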

@niekdejonge
Collaborator

A similar issue was mentioned before in #608.

For loading MSP files we could change:
with open(filename, 'r', encoding='utf-8') as f:
in line 58 to
with open(filename, 'r', encoding='utf-8', errors="ignore") as f:
to just remove these characters.
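
As a rough illustration of what errors="ignore" does when reading, the sketch below writes a small throwaway file (the file name and contents are made up for this example) that contains one byte that is not valid UTF-8 and one correctly encoded superscript zero.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.msp"
    # b"\xb9" is superscript one in Latin-1 but not valid UTF-8 on its own;
    # b"\xe2\x81\xb0" is the UTF-8 encoding of '\u2070' (superscript zero).
    path.write_bytes(b"Name: 0\xb9,\xe2\x81\xb0\n")

    # Strict decoding (the default) fails on the stray Latin-1 byte.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="ignore" drops only the undecodable byte.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        line = f.read()

print(strict_failed)  # True
print(repr(line))
```

Here only the invalid byte is removed; the valid '\u2070' is still read in, so this change on its own would not necessarily prevent the UnicodeEncodeError during export.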

However, I believe MGF file loading happens in another package, so I am not sure how we could fix that.
