Preserve horizontal space in code blocks #553

Open
mittsommer opened this issue Apr 9, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@mittsommer

Hello,
thanks for your continuous work on trafilatura.
Recently, while using trafilatura to extract content that mixes code and text, we noticed that the sanitize function removes all whitespace and tabs, even inside code blocks, when using the txt output format.
We think the problem is here, where preserve_space=False is the default:
https://github.com/adbar/trafilatura/blob/2c9f20296c1c5ce9a23715a07df5b623f3016b65/trafilatura/xml.py#L315C5-L315C51
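
To illustrate the suspected behavior, here is a minimal sketch of a line-wise sanitizer with a preserve_space switch. The function name sanitize_sketch and its logic are assumptions made for illustration only; this is not trafilatura's actual implementation, and only the parameter name preserve_space comes from the report above.

```python
import re


def sanitize_sketch(text: str, preserve_space: bool = False) -> str:
    """Hypothetical stand-in for a line-wise sanitizer (not trafilatura's code)."""
    lines = []
    for line in text.splitlines():
        if preserve_space:
            # Drop only trailing whitespace so indentation survives.
            line = line.rstrip()
        else:
            # Collapse runs of spaces/tabs, then strip both ends of the line.
            line = re.sub(r"[ \t]+", " ", line).strip()
        lines.append(line)
    return "\n".join(lines)


snippet = "func main() {\n\treturn nil\n}"
collapsed = sanitize_sketch(snippet)                       # indentation lost
preserved = sanitize_sketch(snippet, preserve_space=True)  # indentation kept
```

With the default setting the tab before `return nil` disappears, which matches the symptom described; with preserve_space=True the snippet comes back unchanged.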

@adbar adbar added the question Further information is requested label Apr 9, 2024
@adbar
Owner

adbar commented Apr 9, 2024

Do you mean space before the code or space in general? Could you provide a concrete example of a code block?

@mittsommer
Author

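Guten Tag,
thank you for your reply.
We are working on outputting articles that contain code in Markdown format. Here is an example:

This generates the demo API service in the current directory.
The figure below shows the generated project directory structure:
Write the logic in demologic.go under logic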

func (l *DemoLogic) Demo(req *types.Request) (resp *types.Response, err error) {
// todo: add your logic here and delete this line
return &types.Response{
Message: "hello world",
}, nil
}

In this case, all whitespace before each line in the code block was removed, which is unexpected and unfriendly for LLM training.
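
One possible way to get the expected behavior, sketched with the Python standard library only: treat text inside `<pre>` verbatim and normalize whitespace everywhere else. The class PreAwareText and the helper html_to_text are hypothetical names, and the sample page is invented; this is an illustration of the idea, not a patch for trafilatura.

```python
from html.parser import HTMLParser


class PreAwareText(HTMLParser):
    """Collects text, keeping whitespace verbatim inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0  # > 0 while inside one or more <pre> tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            # Code block: keep indentation and newlines untouched.
            self.parts.append(data)
        else:
            # Regular prose: collapse whitespace, one line per text chunk.
            collapsed = " ".join(data.split())
            if collapsed:
                self.parts.append(collapsed + "\n")


def html_to_text(html: str) -> str:
    parser = PreAwareText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = "<p>Write   the logic:</p><pre>func Demo() {\n    return nil\n}</pre>"
text = html_to_text(page)
```

Here the prose is normalized while the four-space indent inside the code block is kept as written.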

By the way, here is another possible bug: when extracting inline code, a redundant '\n' is added after the inline code span.
Current result:

1.2. Implement the WebMvcConfigurer

interface and register the interceptor
which is supposed to be:

1.2. Implement the WebMvcConfigurer interface and register the interceptor
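
The expected result amounts to adding a line break only after block-level elements and never after inline ones. A minimal sketch of that rule, using the standard library's ElementTree; the BLOCK_TAGS set, the function element_text, and the sample markup are assumptions for illustration and do not reflect trafilatura's internal tag names or logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of block-level tags; inline <code> is deliberately absent.
BLOCK_TAGS = {"p", "pre"}


def element_text(elem: ET.Element) -> str:
    """Flatten an element to text, breaking lines only after block children."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(element_text(child))
        if child.tag in BLOCK_TAGS:
            parts.append("\n")
        # Tail text follows the child directly, so inline code stays in the line.
        parts.append(child.tail or "")
    return "".join(parts)


heading = ET.fromstring(
    "<p>1.2 Implement the <code>WebMvcConfigurer</code> interface</p>"
)
line = element_text(heading)
```

Because `code` is not treated as a block element, its tail text is appended on the same line and no extra '\n' appears.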

thank you

@adbar adbar added enhancement New feature or request and removed question Further information is requested labels Apr 19, 2024
@adbar
Owner

adbar commented Apr 19, 2024

Yes, spacing is not necessarily preserved in code blocks; this can be improved.

@adbar adbar changed the title santize func remove all white space \ table even in code block when using txt outpput formating Preserve horizontal space in code blocks Apr 19, 2024

2 participants