[RFC] Metascraper for e-commerce #412

Open
Kikobeats opened this issue May 24, 2021 · 4 comments
@Kikobeats
Member

Kikobeats commented May 24, 2021

The idea behind this issue is to determine what kind of data can be extracted and normalized across e-commerce URLs.

Examples of e-commerce sites:

(Not an exhaustive list; we need a lot more!)

Kikobeats changed the title from "[RFC] Metascraper for ecommerce" to "[RFC] Metascraper for e-commerce" on May 24, 2021
@theetrain

theetrain commented May 28, 2021

Thanks for helping to build easier e-commerce data extraction.

Overall, the e-commerce sites I've tested that use ld+json tend to contain brand, product name, and sku information consistently and in a predictable manner. Sites that opt for structured microdata without ld+json tend to be more inconsistent in how they represent brand information: some use an element with itemprop="name|brand|sku", some nest it inside an itemtype="http://schema.org/Thing" element, and others use some yet-to-be-discovered pattern.
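To make the two microdata patterns above concrete, here is a minimal sketch of a parser that handles both a direct itemprop and a brand nested inside its own itemscope. It is written in Python with only the standard library for brevity (metascraper itself is a Node.js library), and the names `ProductMicrodataParser` and `extract_microdata_product` are hypothetical, not part of any existing package. It assumes each itemprop holds a single property name and ignores itemref.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}
WANTED = {"name", "brand", "sku"}


class ProductMicrodataParser(HTMLParser):
    """Collects name, brand, and sku from schema.org Product microdata."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per currently open element
        self.product = {}  # first value found for each wanted property

    def _enclosing_scope(self):
        # Nearest open ancestor that carries an itemscope attribute.
        for entry in reversed(self.stack):
            if entry["scope"]:
                return entry
        return None

    def _record(self, prop, value):
        scope = self._enclosing_scope()
        if not value or scope is None:
            return
        if scope["prop"] == "brand" and prop == "name":
            # Nested pattern: <div itemprop="brand" itemscope><span itemprop="name">
            self.product.setdefault("brand", value)
        elif scope["type"].endswith("/Product") and prop in WANTED:
            self.product.setdefault(prop, value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        scope = "itemscope" in attrs
        content = attrs.get("content")
        if prop and content:
            # e.g. <meta itemprop="sku" content="...">
            self._record(prop, content)
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so do not push them
        self.stack.append({
            "tag": tag,
            "prop": prop,
            "scope": scope,
            "type": attrs.get("itemtype") or "",
            "capture": bool(prop) and not scope and not content,
            "text": [],
        })

    def handle_data(self, data):
        # Text belongs to every open element that is capturing a property.
        for entry in self.stack:
            if entry["capture"]:
                entry["text"].append(data)

    def handle_endtag(self, tag):
        if not any(entry["tag"] == tag for entry in self.stack):
            return  # stray end tag
        while self.stack:
            entry = self.stack.pop()
            if entry["capture"]:
                text = " ".join("".join(entry["text"]).split())
                self._record(entry["prop"], text)
            if entry["tag"] == tag:
                break


def extract_microdata_product(html):
    parser = ProductMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.product
```

Because only text nodes are collected, inner markup such as `<b>` inside a product name is dropped, which is in the spirit of the innerText request. Properties whose nearest itemscope is neither the Product nor its brand (for example a name inside an offer or review) are deliberately ignored.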

As of today, the critical e-commerce data I'm seeking includes product name, product brand, and product sku. In the near future, I may have a need for product pricing, variants, and accessories as defined in https://schema.org/Product.

Some data-gathering strategies I intend to use for products include:

  • Parse and return data from ld+json objects that use schema.org @type: 'Product'
  • Come up with schema.org microdata parsing and fallback strategies to cover as many e-commerce sites as possible, since some websites do not structure their data consistently
  • (feature request) Conditionally retry page parsing every second, up to 5 seconds, if no products can be found. Some e-commerce sites that use client-side rendering take a while to display ld+json or microdata
  • (feature request) Have an option to parse page elements and return their innerText so that redundant inner HTML gets excluded
  • Parse and return multiple products based on offers https://schema.org/offers
  • Support RDFa parsing, though I have yet to come across a site that uses RDFa, so this could be a low priority
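As an illustration of the first and fifth strategies, the sketch below pulls every ld+json block out of a page, walks the parsed JSON (including `@graph` containers), and returns the name, brand, sku, and offers of each node typed as `Product`. It is written in Python with only the standard library for brevity (metascraper itself is a Node.js library); `extract_products` and the other helper names are hypothetical, and the shape of the returned dictionaries is an assumption rather than anything agreed in this thread.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def as_list(value):
    """JSON-LD allows a single value or a list in most places; normalize."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def iter_nodes(value):
    """Yield every object in a parsed JSON-LD value, including @graph members."""
    if isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        yield value
        yield from iter_nodes(value.get("@graph", []))


def brand_name(brand):
    # brand may be a plain string or a nested Brand/Organization object
    return brand.get("name") if isinstance(brand, dict) else brand


def extract_products(html):
    collector = LdJsonCollector()
    collector.feed(html)
    products = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip a malformed block instead of failing the whole page
        for node in iter_nodes(data):
            if "Product" not in as_list(node.get("@type")):
                continue
            products.append({
                "name": node.get("name"),
                "brand": brand_name(node.get("brand")),
                "sku": node.get("sku"),
                "offers": [
                    {"sku": offer.get("sku"), "price": offer.get("price")}
                    for offer in as_list(node.get("offers"))
                    if isinstance(offer, dict)
                ],
            })
    return products
```

Calling `extract_products(html)` on a page's HTML returns one dictionary per Product node, with an empty `offers` list when the node declares none. Whether each offer should be reported as a separate product, as the fifth strategy suggests, is left open here.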

Based on current Microlink features, I am able to extract product data using the prerender and waitForTimeout options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0
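A fixed timeout waits the full duration on every page, even when the data is already there. The retry feature request in the list above could instead be sketched as a polling loop like the one below. This is only an illustration in Python, not a Microlink or metascraper API: `retry_until_found` is a hypothetical name, and `extract` stands for any caller-supplied function that re-reads the rendered page and returns the products found so far.

```python
import time


def retry_until_found(extract, interval=1.0, max_wait=5.0, sleep=time.sleep):
    """Call extract() every `interval` seconds until it returns something
    truthy or `max_wait` seconds of waiting have been used up."""
    waited = 0.0
    while True:
        result = extract()
        if result or waited >= max_wait:
            return result
        sleep(interval)  # give client-side rendering time to inject the data
        waited += interval
```

With the defaults this tries immediately and then once per second, for at most six attempts. The `sleep` parameter exists only so the delay can be replaced in tests.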

Product pages I have tested:

@adentranter

Has this moved anywhere in the past few years? Or are you using add-ons like https://github.com/samirrayani/metascraper-shopping?

Very keen to know more about this.

@adentranter

Thanks for helping to build easier e-commerce data extraction.

Overall, the e-commerce sites I've tested that use ld+json tend to contain brand, product name, and sku information consistently and in a predictable manner. Sites that opt for structured microdata without ld+json tend to be more inconsistent in how they represent brand information: some use an element with itemprop="name|brand|sku", some nest it inside an itemtype="http://schema.org/Thing" element, and others use some yet-to-be-discovered pattern.

As of today, the critical e-commerce data I'm seeking includes product name, product brand, and product sku. In the near future, I may have a need for product pricing, variants, and accessories as defined in https://schema.org/Product.

Some data-gathering strategies I intend to use for products include:

* [x]  Parse and return data from ld+json objects that use schema.org `@type: 'Product'`

* [ ]  Come up with schema.org microdata parsing and fallback strategies to cover as many e-commerce sites as possible, since some websites do not structure their data consistently

* [ ]  (feature request) Conditionally retry page parsing every second, up to 5 seconds, if no products can be found. Some e-commerce sites that use client-side rendering take a while to display ld+json or microdata

* [ ]  (feature request) Have an option to parse page elements and return their [`innerText`](https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText) so that redundant inner HTML gets excluded

* [ ]  Parse and return multiple products based on offers https://schema.org/offers

* [ ]  Support [RDFa](https://www.w3.org/MarkUp/2009/rdfa-for-html-authors) parsing, though I have yet to come across a site that uses RDFa, so this could be a low priority

Based on current Microlink features, I am able to extract product data using the prerender and waitForTimeout options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0

Product pages I have tested:

* https://www.walmart.com/ip/Miracle-Gro-Garden-Soil-Vegetables-and-Herbs-1-5-cu-ft/46928865?athcpid=46928865&athpgid=athenaHomepage&athcgid=dealspage-home-2524396&athznid=ItemCarouselType_BestInDeals&athieid=v1&athstid=CS020&athguid=466001f5-9a18a716-46880cef9f15260d&athancid=null&athena=true

* https://www.garnier.ca/en-ca/about-our-brands/hair-care/fructis/hair-treats/garnier-fructis-nourishing-treat-with-coconut-extract-400-ml

* https://www.kerastase.ca/en/collections/nutritive/3474636721832.html

* https://www.lorealparis.ca/en-ca/excellence-creme/excellence-creme-f-medium-brown

* https://www.staples.ca/products/2735027-en-brother-tn760-black-toner-cartridge-high-yield

* https://thelionchain.com/collections/exclusive-promotions/products/the-gold-edition-trap-set

* https://shop.3dtotal.com/anatomy-figure/3dtotal-anatomy-3-piece-set-of-animal-figures

* https://hellostella.myshopify.com/collections/rustic-stella/products/highland-fingering-posy

* https://www.toysrus.ca/en/Hot-Wheels-Sky-Crash-Tower-Track-Set/242C6973.html

* https://www.homedepot.com/p/RYOBI-18-Volt-ONE-Cordless-AirStrike-18-Gauge-Brad-Nailer-Tool-Only-with-Sample-Nails-P320/203810823?MERCH=REC-_-pnf-_-312306957-_-203810823-_-N&

* https://thewhiteelephantdesigns.com/collections/the-baby-shop/products/chicken-dress

https://github.com/zbicin/metascraper-shopping might have some of the goods that you are looking for.

@Kikobeats
Member Author


3 participants