
How to achieve the effect of BeautifulSoup get_text? #443

Open
chushuai opened this issue Apr 13, 2023 · 5 comments

Comments

@chushuai

How to achieve the effect of BeautifulSoup get_text?

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return text
@mna
Member

mna commented Apr 15, 2023

Hello,

I'm not familiar with BeautifulSoup; what does this achieve? It seems like it would be something like https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Text, but with some text handling applied (space normalization or something)?

Martin

@chushuai
Author

chushuai commented Apr 17, 2023

@mna Thank you for your reply. I want to do data extraction, but with goquery I cannot achieve an effect similar to BeautifulSoup's. I'll give a comparison of the two below.
Example address: https://host7.bienvenidohosting.com:2096/

This is the output from goquery, which contains a lot of whitespace and JS code:
[screenshot of goquery output]

This is the output from BeautifulSoup, very simple and clean:
[screenshot of BeautifulSoup output]

@chushuai
Author

BeautifulSoup get_text definition

def get_text(self, separator="", strip=False,
             types=default):
    """Get all child strings of this PageElement, concatenated using the
    given separator.

    :param separator: Strings will be concatenated using this separator.

    :param strip: If True, strings will be stripped before being
        concatenated.

    :param types: A tuple of NavigableString subclasses. Any
        strings of a subclass not found in this list will be
        ignored. Although there are exceptions, the default
        behavior in most cases is to consider only NavigableString
        and CData objects. That means no comments, processing
        instructions, etc.

    :return: A string.
    """
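To illustrate what this docstring describes (trim each text node, skip comments, join with a separator), here is a rough stdlib-only sketch using Python's `html.parser` — it is not part of BeautifulSoup, and it additionally drops script/style contents, which is the behavior this thread is actually after (stock `get_text` does not skip script/style by itself):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects stripped text nodes, skipping <script>/<style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments arrive via handle_comment, not here, so they are
        # ignored automatically, matching get_text's default behavior.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)

def get_text_like(html: str, separator: str = " ") -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return separator.join(parser.parts)

print(get_text_like('<div>Hello <script>var x = 1;</script><b>world</b><!-- hi --></div>'))
# prints: Hello world
```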

@mna
Member

mna commented Apr 17, 2023

It's hard to tell from those screenshots but it looks like (and the function documentation seems to confirm this) it optionally trims each text node and concatenates them using the provided separator, and it ignores comments and some other nodes ("processing instructions", not sure what that means in this context).

Based on your screenshots, it looks like doing this would indeed get you something similar.

This is not supported in goquery out of the box, but it should be doable relatively easily using Contents(), Map(), strings.TrimSpace and strings.Join.

I wouldn't be opposed to adding a top-level function (i.e. not a Selection method, as those are reserved for jquery API compatibility) that would do something similar to BeautifulSoup, if anyone was interested in providing a PR. It should be general enough (i.e. support similar args to trim, join, and maybe a filter function to decide whether a node's text is included or not). There are probably finer details to figure out.

But yeah, to answer your initial question, there's nothing equivalent but it should be possible using the methods I linked above.

Hope this helps,
Martin

@chushuai
Author

@mna Thank you for your reply. This is the code I wrote; could you help me see why many nested nodes, such as div and form, do not have their child nodes parsed out?

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

// TextAll returns the trimmed text contents of all the nodes in the document, joined using the provided separator.
// It ignores comments and some other nodes ("processing instructions"), as well as script and style elements.
func TextAll(doc *goquery.Document, sep string) string {
	var texts []string
	// Slightly optimized vs calling Each: no single selection object created
	var f func(*html.Node)
	f = func(n *html.Node) {
		// Ignore script and style nodes
		if n.Type == html.ElementNode {
			switch n.Data {
			case "style", "script":
				return
			}
		}
		if n.Type == html.TextNode {
			if n.FirstChild == nil {
				text := strings.TrimSpace(n.Data)
				if len(text) > 0 {
					texts = append(texts, text)
					// Debugging information
					fmt.Println("+n.Type:", n.Type)
					fmt.Println("+n.FirstChild != nil:", n.FirstChild != nil)
					fmt.Println("+n.text:", text)
				}
			}
		} else {
			// Debugging information
			fmt.Println("-n.Type:", n.Type)
			fmt.Println("-n.text:", n.Data)
			fmt.Println("-n.FirstChild != nil:", n.FirstChild != nil)
		}
		// Recursively process child nodes
		if n.FirstChild != nil {
			for c := n.FirstChild; c != nil; c = c.NextSibling {
				f(c)
			}
		}
	}

	// Iterate over all nodes in the selection
	for _, n := range doc.Nodes {
		f(n)
	}
	// Join the texts slice using the provided separator
	return strings.Join(texts, sep)
}

[screenshots of debug output]
