Generic HTML Price/Value Scraper #3968

Open
jat255 opened this issue Apr 29, 2024 · 2 comments

jat255 commented Apr 29, 2024

Is your feature request related to a problem? Please describe.

Some assets have information about their value available online, but often it is not in the nicely structured JSON or HTML table formats that PP currently supports, or there is no public API available. Examples include the current value of a piece of real estate (Zillow, Redfin, etc.), arbitrary funds that are not exchange-traded, or other physical assets (collectibles, etc.).

Describe the solution you'd like

A useful addition to PP would be an "HTML parser/scraper" quote feed that takes a URL to fetch and a "selector" string (similar to how the JSON parser works). Possible query languages could be XPath, CSS selectors, or maybe something else. Since this would be primarily useful for ongoing quote fetching, I would expect the date to be set to the current date when the quote price is fetched (as opposed to the HTML table tool, which requires that the date be explicitly stated).
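
To make the "current date" part concrete, here is a rough sketch of the step that would follow the selector match: take the matched text, pull out a number, and stamp it with the fetch date. Everything in it is an assumption for illustration (the class name `ScrapedQuoteSketch`, the `Quote` record, the US-style number format); it is not PP's API, and it leaves the actual HTML fetching and selecting out entirely.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PP: turns scraped text into a dated quote. */
final class ScrapedQuoteSketch {

    /** A price paired with the date on which it was fetched. */
    record Quote(LocalDate date, BigDecimal price) {}

    // First run of digits with optional thousands commas and an optional decimal part.
    // Assumes US-style formatting ("1,234.56"); other locales would need different handling.
    private static final Pattern NUMBER = Pattern.compile("\\d[\\d,]*(?:\\.\\d+)?");

    /** Extracts the first number from the scraped text, ignoring currency symbols and labels. */
    static BigDecimal parsePrice(String scrapedText) {
        Matcher m = NUMBER.matcher(scrapedText);
        if (!m.find()) {
            throw new IllegalArgumentException("No number found in: " + scrapedText);
        }
        return new BigDecimal(m.group().replace(",", ""));
    }

    /** Pairs the parsed price with the date the page was fetched. */
    static Quote toQuote(String scrapedText, LocalDate fetchedOn) {
        return new Quote(fetchedOn, parsePrice(scrapedText));
    }

    public static void main(String[] args) {
        // In a real feed this text would come from the selector match on the fetched page.
        Quote q = toQuote("Estimated value: $412,500", LocalDate.now());
        System.out.println(q.date() + " " + q.price());
    }
}
```

Passing the date in as a parameter, rather than calling `LocalDate.now()` inside the helper, keeps the parsing step easy to test; the caller decides what "current date" means.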

Additional context

I am not an experienced Java programmer, but after a quick look, it appears there are some libraries that might provide this "HTML parsing" functionality:

This functionality is available in Ghostfolio, which uses the cheerio JavaScript library to accomplish this.

Contributor

Morpheus1w3 commented Apr 30, 2024

I like the idea of selecting the data with a selector such as "table:eq(2) > tr > td:eq(1)" when the second column of the third table is requested. Also, jsoup is already in use in Portfolio Performance.

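import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class TableCellSelector {
    public static void main(String[] args) {
        // Example URL of a page containing a table
        String url = "http://example.com/table.html";

        try {
            // Fetch the page and parse it into a DOM with jsoup
            Document doc = Jsoup.connect(url).get();

            // :eq(n) is a zero-based sibling index: table:eq(2) matches a table that is the
            // third child of its parent, and td:eq(1) the second cell in a row.
            // jsoup's parser wraps rows in an implicit <tbody>, so "table > tr" would match
            // nothing; a descendant combinator is used between table and tr instead.
            Elements cells = selectTableCellsInColumn(doc, "table:eq(2) tr > td:eq(1)");
            for (Element cell : cells) {
                System.out.println(cell.text());
            }
        } catch (IOException e) {
            // Network or HTTP failure while fetching the page
            e.printStackTrace();
        }
    }

    /** Returns all elements in the document that match the given CSS selector. */
    public static Elements selectTableCellsInColumn(Document doc, String selector) {
        return doc.select(selector);
    }
}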

Author

jat255 commented Apr 30, 2024

@Morpheus1w3 It's good to hear that a library already in use could do this. I worked on setting up a development environment to see if I could hack something together, but this might be beyond my current skill set.
