HTML encoding is not autodetected properly #777

Open
Dinver opened this issue Aug 23, 2023 · 6 comments
@Dinver

Dinver commented Aug 23, 2023

Hi! When I scrape sites encoded in windows-1251 with charset detection enabled, I get garbled text:
2023/08/23 21:45:10 ÑÄÎ «Ïðîìåòåé» | ÎÎÎ «Âèðòóàëüíûå òåõíîëîãèè â îáðàçîâàíèè»
2023/08/23 21:45:10 Ýëåêòðîííûå êóðñû
2023/08/23 21:45:10 Ïðîäóêòû

Example:

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.DetectCharset(),
		colly.Async(true),
	)
	c.OnHTML("title", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		title := e.Text
		log.Println(title)
	})

	c.OnHTML("img", func(e *colly.HTMLElement) {
		title := e.Attr("alt")
		log.Println(title)
	})

	c.Visit("https://prometeus.ru/")
	c.Wait()
}

Neither colly.DetectCharset() nor c.DetectCharset = true works.

@blagoySimandov

Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support cyrillic output. If you put everything in a database everything should look fine (I've crawled cyrillic pages before and I know that it works). But in case you really need to have the output displayed in the terminal try using something like Windows PowerShell ISE - it has a fairly good support for displaying Unicode.

@Dinver
Author

Dinver commented Aug 28, 2023

Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support cyrillic output. If you put everything in a database everything should look fine (I've crawled cyrillic pages before and I know that it works). But in case you really need to have the output displayed in the terminal try using something like Windows PowerShell ISE - it has a fairly good support for displaying Unicode.

It's not about the terminal; this example is just to reproduce the error. The data is also sent to the API incorrectly.

@WGH-
Collaborator

WGH- commented Aug 28, 2023

Yeah, I can reproduce it with colly/v2, too

@WGH- WGH- added the bug label Aug 28, 2023
@Dinver
Author

Dinver commented Sep 1, 2023

I solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body, used when the Content-Type header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem.

response.go:

package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}

func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.

		return nil
	}

	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			contentType = contentTypeBody
		}
	}

	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		r, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + r.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}

func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

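// checkContentTypeInBody returns the content attribute of the first
// <meta http-equiv="Content-Type"> element in the document, or "" if the
// document cannot be parsed or has no such element.
func checkContentTypeInBody(b string) string {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(b))
	if err != nil {
		// doc is nil here, so bail out instead of dereferencing it.
		return ""
	}
	metaContent, exists := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	if !exists {
		return ""
	}
	return metaContent
}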

@WGH-
Collaborator

WGH- commented Sep 3, 2023

There's a specific algorithm for detecting the encoding of an HTML document defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles <meta> tags.

It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding

There's even a recipe for integrating it into goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348

We really should incorporate it into Colly.

@WGH- WGH- changed the title from "Bug encoding cyrillic windows-1251" to "HTML encoding is not autodetected properly" Sep 3, 2023
@blagoySimandov

Just did some testing. Apparently the default Colly charset detection thinks the encoding is actually ISO-8859-1. I checked that by having the fixCharset function in response.go print out the detected encoding. Maybe we can implement a different kind of encoding detection, or fix whatever bugs there are in the current one?
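
A misdetection as ISO-8859-1 is consistent with the garbled output in the original report. The sketch below uses only the standard library to decode the same windows-1251 bytes two ways. Both helper names are made up for this example, and decodeCP1251Letters is deliberately partial: it covers only ASCII and the basic Cyrillic letters at 0xC0-0xFF, not the full code page.

```go
package main

import "fmt"

// decodeLatin1 interprets each byte as the Unicode code point with the
// same value, which is how ISO-8859-1 is defined.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// decodeCP1251Letters is a partial, illustrative windows-1251 decoder:
// bytes 0xC0-0xFF map to U+0410-U+044F (А-я), ASCII passes through, and
// anything else becomes U+FFFD.
func decodeCP1251Letters(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		switch {
		case c >= 0xC0:
			runes[i] = 0x0410 + rune(c-0xC0)
		case c < 0x80:
			runes[i] = rune(c)
		default:
			runes[i] = '\uFFFD'
		}
	}
	return string(runes)
}

func main() {
	// "Продукты" as windows-1251 bytes.
	raw := []byte{0xCF, 0xF0, 0xEE, 0xE4, 0xF3, 0xEA, 0xF2, 0xFB}

	fmt.Println(decodeLatin1(raw))        // prints "Ïðîäóêòû", as in the report
	fmt.Println(decodeCP1251Letters(raw)) // prints "Продукты"
}
```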
