List element inside a table is lost #531

Open
mikhainin opened this issue Mar 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@mikhainin
Contributor

source.html

```html
    <div class="biz-details-page-container-outer__09f24__pZBzx css-1qn0b6x">
        <div class="biz-details-page-container-inner__09f24__L9S07 css-1qn0b6x">
            <section class="css-1wrui0y"></section>
            <section class="css-1wrui0y"></section>
            <section class="css-afumm9"></section>
            <div class="css-s97lou">
                <div class="css-kmgt1v" data-testid="main-content">
                    <main class=" css-1xykegj" id="main-content">

                        <div class=" css-1qn0b6x" id="location-and-hours">
                            <section class=" css-ufd2i" aria-label="Location &amp; Hours">

                                <div class="arrange__09f24__LDfbs gutter-4__09f24__dajdg css-1qn0b6x">

                                    <div class="arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x">
                                        <div class=" css-1qn0b6x">
                                            <div class=" css-1qn0b6x">
                                                <table class="hours-table__09f24__KR8wh css-n604h6">
                                                    <tbody class="">
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-ux5mu6"
                                                                data-font-weight="bold">Mon</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">10:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"><span
                                                                class="open-status__09f24__YH9PK no-wrap__09f24__c3plq css-syqjjh"
                                                                data-font-weight="semibold">Closed now</span></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Tue</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">Closed</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Wed</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">10:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Thu</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">10:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Fri</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">10:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Sat</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">10:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    <tr class="hours-table-row-space__09f24__chJx9 css-29kerx"></tr>
                                                    <tr class=" css-29kerx">
                                                        <th class="" scope="col"><p
                                                                class="day-of-the-week__09f24__JJea_ css-1p9ibgf"
                                                                data-font-weight="semibold">Sun</p></th>
                                                        <td class="css-1hgawz4">
                                                            <ul class=" list__09f24__ynIEd">
                                                                <li class=" css-1qn0b6x"><p
                                                                        class="no-wrap__09f24__c3plq css-1p9ibgf"
                                                                        data-font-weight="semibold">11:00 AM - 8:00
                                                                    PM</p></li>
                                                            </ul>
                                                        </td>
                                                        <td class="css-1hgawz4"></td>
                                                    </tr>
                                                    </tbody>
                                                </table>
                                            </div>
                                        </div>
                                    </div>
                                </div>
                            </section>
                        </div>


                    </main>
                </div>

            </div>
        </div>
    </div>
```

The command below returns only the days (first column); the important information (the timestamps) is missing.

python3 -c "import trafilatura; print(trafilatura.extract(open('source.html', 'r').read()))"
|
Mon
|
|Closed now
|
Tue
|
|
Wed
|
|
Thu
|
|
Fri
|
|
Sat
|
|
Sun
|

However, if I supply include_tables=False, I can see the timestamps:

python3 -c "import trafilatura; print(trafilatura.extract(open('source.html', 'r').read(), include_tables=False))"
Mon
10:00 AM - 8:00 PM
Closed now
Tue
Closed
Wed
10:00 AM - 8:00 PM
Thu
10:00 AM - 8:00 PM
Fri
10:00 AM - 8:00 PM
Sat
10:00 AM - 8:00 PM
Sun
11:00 AM - 8:00 PM
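
To see exactly which text the table-aware extraction drops, the two outputs can be compared line by line. The following is only a sketch: `missing_lines` is a hypothetical helper (not part of trafilatura), and the sample strings are shortened copies of the two outputs shown above.

```python
def missing_lines(with_tables: str, without_tables: str) -> list[str]:
    """Return lines found in the plain output but absent from the table output."""
    # The table output marks cells with '|', so strip that marker and whitespace.
    seen = {line.strip("| \t") for line in with_tables.splitlines()}
    return [
        line.strip()
        for line in without_tables.splitlines()
        if line.strip() and line.strip() not in seen
    ]


# Shortened copies of the outputs from this report (Mon and Tue rows only).
with_tables = "|\nMon\n|\n|Closed now\n|\nTue\n|"
without_tables = "Mon\n10:00 AM - 8:00 PM\nClosed now\nTue\nClosed"

lost = missing_lines(with_tables, without_tables)
print(lost)
```

In a real check, the two strings would come from calling `trafilatura.extract` with and without `include_tables=False`, as in the commands above.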
@mikhainin
Contributor Author

I was able to fix it this way:

diff --git a/trafilatura/core.py b/trafilatura/core.py
index 63699a4..1970c25 100644
--- a/trafilatura/core.py
+++ b/trafilatura/core.py
@@ -397,7 +397,7 @@ def handle_table(table_elem, potential_tags, options):
                     # add child element to processed_element
                     if processed_subchild is not None:
                         subchildelem = SubElement(newchildelem, processed_subchild.tag)
-                        subchildelem.text, subchildelem.tail = processed_subchild.text, processed_subchild.tail
+                        subchildelem.text, subchildelem.tail = ''.join(processed_subchild.itertext()), processed_subchild.tail
                     child.tag = 'done'
             # add to tree
             if newchildelem.text or len(newchildelem) > 0:

But I'm not sure whether this is the correct solution.
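
For context, here is a minimal sketch of why the one-line change makes a difference. It uses the standard library's `xml.etree.ElementTree` rather than trafilatura's own tree handling, and the element names (`cell`, `list`, `item`) are assumptions chosen for illustration. An element's `.text` holds only the text that comes directly before its first child, so a list whose content sits inside nested elements has no `.text` of its own, while `itertext()` walks all descendants.

```python
import xml.etree.ElementTree as ET

# A table cell that wraps its content in a nested list, similar to the HTML in this report.
cell = ET.fromstring(
    "<cell><list><item><p>10:00 AM - 8:00 PM</p></item></list></cell>"
)
processed_subchild = cell[0]  # the <list> element

# .text only covers text directly inside <list>, before its first child: nothing here.
text_only = processed_subchild.text

# itertext() yields the text of every descendant, in document order.
joined = "".join(processed_subchild.itertext())

print(repr(text_only))
print(repr(joined))
```

Note that joining with an empty string flattens everything into one string, so a list with several items would lose its item boundaries. That trade-off may be worth checking before relying on this approach.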

@adbar added the bug label (Something isn't working) on Apr 2, 2024
@adbar
Owner

adbar commented Apr 2, 2024

@mikhainin Thank you for reporting the bug and proposing a solution. Could you please draft a PR with it? If the tests pass, I would integrate it.

@mikhainin
Contributor Author

mikhainin commented Apr 2, 2024

Sure, I just filed #534

@adbar
Owner

adbar commented Apr 19, 2024

Note: the issue is now fixed when the recall option is on.

@alroythalus

alroythalus commented Apr 20, 2024

Try it on Spotify: https://www.spotify.com/in-en/legal/privacy-policy/
The lists in the tables aren't being captured yet.
@adbar @mikhainin
