Encountered error of preprocess data #127

yingdehuijin · 2022-06-30T16:08:52Z

Hi Uri,
I am running code2seq on the newest EMSE-DeepCom datasets (https://github.com/xing-hu/EMSE-DeepCom). I followed your suggestions and ran the preprocess.sh script, but I encountered errors on the test/val/train datasets. error_log.txt and stdout show the following:
b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"\n at line 2, column 407.\n\nWas expecting one of:\n\n
The number of examples also decreased: 20000 test methods dropped to 17060, 20000 validation methods to 17043, and 480000 training methods to 380001. Is there something wrong with the datasets?
Looking forward to your reply!
Wcc

urialon · 2022-07-03T02:48:33Z

Hi @yingdehuijin ,
Thank you for your interest in our work!

I don't know if there is anything wrong with this dataset; I have never used it.

However, it does seem that the files there do not parse. Are they raw Java files? Maybe they have a different format? Our preprocessing pipeline expects raw Java files.

Can you provide a single example from the dataset?

Best,
Uri

yingdehuijin · 2022-07-03T02:59:52Z

Thank you for your reply.
A single example from the dataset looks like this:
code:
public static DecomposableMatchBuilder1 < Float , Float > caseFloat ( MatchesAny f ) { List < Matcher < Object > > matchers = new ArrayList < > ( ) ; matchers . add ( any ( ) ) ; return new DecomposableMatchBuilder1 < > ( matchers , NUM_ , new PrimitiveFieldExtractor < > ( Float . class ) ) ; }
nl:
matches a float .

urialon · 2022-07-14T02:14:51Z

Is the "nl: matches a float ." part in the same file as the code?
Our JavaExtractor expects pure Java files and extracts the method names as the labels.
You can replace the existing method name (caseFloat in your example) with a unique ID, remove the "nl: matches a float ." part, and later replace the unique ID in the processed files with the natural language sequence that you wish to generate.
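
The ID swap described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from this repository. The function names (`make_id`, `replace_method_name`, `restore_label`) are hypothetical. The method name is located with a simple regex heuristic: the first identifier directly followed by `(`, which is `caseFloat` in the sample above and which would misfire on a leading annotation with arguments. Each processed line is assumed to start with the label, followed by space-separated contexts, with target sub-tokens joined by `|`. The IDs use lowercase letters only, on the assumption that the extractor may split or drop digits and capitals when it normalizes names.

```python
import re
import string
from typing import Mapping

# Heuristic (assumption): the first identifier directly followed by "("
# is the method name. Annotations with arguments would break this.
_METHOD_NAME = re.compile(r"\b([A-Za-z_$][\w$]*)\s*\(")


def make_id(index: int, width: int = 6) -> str:
    """Encode a non-negative index as a fixed-width, lowercase-only ID."""
    if not 0 <= index < 26 ** width:
        raise ValueError("index out of range for the chosen width")
    letters = []
    for _ in range(width):
        index, rem = divmod(index, 26)
        letters.append(string.ascii_lowercase[rem])
    # Prefix with a letter so the result is always a valid Java identifier.
    return "m" + "".join(reversed(letters))


def replace_method_name(code: str, unique_id: str) -> str:
    """Replace the (heuristically detected) method name with unique_id."""
    match = _METHOD_NAME.search(code)
    if match is None:
        raise ValueError("no method name found")
    start, end = match.span(1)
    return code[:start] + unique_id + code[end:]


def restore_label(line: str, id_to_nl: Mapping[str, str]) -> str:
    """Swap the ID label of one processed line for its natural language target."""
    label, sep, contexts = line.partition(" ")
    target = "|".join(id_to_nl[label].split())
    return target + sep + contexts
```

Each method would be written to the extractor's input with its own ID, the ID-to-comment mapping kept on the side, and `restore_label` applied to every line of the processed output before training.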

See also: #45

Best,
Uri

lidiancracy · 2023-09-17T13:31:24Z

Hello, I encountered the same issue while preprocessing the files. Does the original JAR package handle exceptions, such as skipping files that do not meet the format requirements without preprocessing them? I'm using it to process my own dataset, but it's throwing errors. I'm not sure if it will keep getting stuck there.

urialon · 2023-09-17T15:46:46Z

Hi @lidiancracy ,
Thank you for your interest in our work.

The truth is that I don't remember; this code was written about 5 years ago. If you wish to debug it, go ahead: the entire Java code is available in this repo.

But I recommend using newer models such as PolyCoder:
https://github.com/VHellendoorn/Code-LMs
https://arxiv.org/pdf/2202.13169.pdf

Best,
Uri

lidiancracy · 2023-09-18T01:38:04Z

@urialon Thank you for your timely reply. My .sh file now terminates normally and has produced 4 files with the .c2s extension, so I think the logic in the JAR package is probably fine. By the way, can I continue training an already well-trained model on a new dataset, similar to transfer learning or incremental training? I did not find any relevant information in the README. Did I miss something? Thank you in advance.

lidiancracy · 2023-09-19T05:11:29Z

Sorry to bother you. I trained the model using the default parameters, but now only the dictionary remains, as shown in the picture. Is this normal?

urialon · 2023-09-19T11:12:04Z

Yes
