
How to achieve the effect of BeautifulSoup get_text? #443

Open
chushuai opened this issue Apr 13, 2023 · 5 comments

Comments

@chushuai

How to achieve the effect of BeautifulSoup get_text?

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return text
@mna
Member

mna commented Apr 15, 2023

Hello,

I'm not familiar with BeautifulSoup; what does this achieve? It seems like it would be something like https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Text, but with some text handling applied (space normalization or something)?

Martin

@chushuai
Author

chushuai commented Apr 17, 2023

@mna Thank you for your reply. I want to do data extraction, but with goquery I cannot achieve an effect similar to BeautifulSoup's. I'll give a comparison of the two below.
Example address: https://host7.bienvenidohosting.com:2096/

This is the output from goquery, which contains a lot of whitespace and JS code:
[screenshot of goquery output]

This is the output from BeautifulSoup, very simple and clean:
[screenshot of BeautifulSoup output]

@chushuai
Author

BeautifulSoup get_text definition

def get_text(self, separator="", strip=False,
             types=default):
    """Get all child strings of this PageElement, concatenated using the
    given separator.

    :param separator: Strings will be concatenated using this separator.

    :param strip: If True, strings will be stripped before being
        concatenated.

    :param types: A tuple of NavigableString subclasses. Any
        strings of a subclass not found in this list will be
        ignored. Although there are exceptions, the default
        behavior in most cases is to consider only NavigableString
        and CData objects. That means no comments, processing
        instructions, etc.

    :return: A string.
    """
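To illustrate what this docstring describes (trim each text node, skip comments, join with a separator), here is a rough stdlib-only sketch using Python's `html.parser` — it is not part of BeautifulSoup, and it additionally drops script/style contents, which is the behavior this thread is actually after (stock `get_text` does not skip script/style by itself):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects stripped text nodes, skipping <script>/<style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments arrive via handle_comment, not here, so they are
        # ignored automatically, matching get_text's default behavior.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)

def get_text_like(html: str, separator: str = " ") -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return separator.join(parser.parts)

print(get_text_like('<div>Hello <script>var x = 1;</script><b>world</b><!-- hi --></div>'))
# prints: Hello world
```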

@mna
Member

mna commented Apr 17, 2023

It's hard to tell from those screenshots but it looks like (and the function documentation seems to confirm this) it optionally trims each text node and concatenates them using the provided separator, and it ignores comments and some other nodes ("processing instructions", not sure what that means in this context).

Based on your screenshots, it looks like doing this would indeed get you something similar.

This is not supported in goquery out of the box, but it should be doable relatively easily using Contents(), Map(), strings.TrimSpace and strings.Join.

I wouldn't be opposed to adding a top-level function (i.e. not a Selection method, as those are reserved for jquery API compatibility) that would do something similar to BeautifulSoup, if anyone was interested in providing a PR. It should be general enough (i.e. support similar args to trim, join, and maybe a filter function to decide whether a node's text is included or not). There are probably finer details to figure out.

But yeah, to answer your initial question, there's nothing equivalent but it should be possible using the methods I linked above.

Hope this helps,
Martin

@chushuai
Author

@mna Thank you for your reply. This is the code I wrote; could you help me see why many nested nodes, such as div and form, do not have their child nodes parsed out?

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

// TextAll returns the trimmed text contents of all the nodes in the document, joined using the provided separator.
// It ignores comments and some other nodes ("processing instructions"), as well as script and style elements.
func TextAll(doc *goquery.Document, sep string) string {
	var texts []string
	// Slightly optimized vs calling Each: no single selection object created
	var f func(*html.Node)
	f = func(n *html.Node) {
		// Ignore script and style nodes
		if n.Type == html.ElementNode {
			switch n.Data {
			case "style", "script":
				return
			}
		}
		if n.Type == html.TextNode {
			if n.FirstChild == nil {
				text := strings.TrimSpace(n.Data)
				if len(text) > 0 {
					texts = append(texts, text)
					// Debugging information
					fmt.Println("+n.Type:", n.Type)
					fmt.Println("+n.FirstChild != nil:", n.FirstChild != nil)
					fmt.Println("+n.text:", text)
				}
			}
		} else {
			// Debugging information
			fmt.Println("-n.Type:", n.Type)
			fmt.Println("-n.text:", n.Data)
			fmt.Println("-n.FirstChild != nil:", n.FirstChild != nil)
		}
		// Recursively process child nodes
		if n.FirstChild != nil {
			for c := n.FirstChild; c != nil; c = c.NextSibling {
				f(c)
			}
		}
	}

	// Iterate over all nodes in the selection
	for _, n := range doc.Nodes {
		f(n)
	}
	// Join the texts slice using the provided separator
	return strings.Join(texts, sep)
}

[screenshots of debug output]
