.extract() is unable to get data properly from sparse tables #192

Open
shubham-MLwiz opened this issue May 28, 2020 · 5 comments

Comments

@shubham-MLwiz

I created a minimal table by hand to reproduce the bug I am facing:

<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
   <thead>
      <tr>
        <th class="">Mar 2008</th>
        <th class="">Mar 2009</th>
        <th class="">Mar 2010</th>
      </tr>
   </thead>
   <tbody>
      <tr>
        <td class="">8,626</td>
        <td class="">8,427</td>
        <td class="">11,525</td>
      </tr>
      <tr>
        <td class="">16,408</td>
        <td class="">19,582</td>
        <td class=""></td>
      </tr>
      <tr>        
        <td class=""></td>
        <td class="">22,574</td>
        <td class="">21,755</td> 
      </tr>
   </tbody>
</table>

When I run the code below against the HTML above, this is the output I get:

>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']

As you can see, it does not produce any output for the empty data cells. All empty values are skipped, which looks like a bug.

Similarly, the code below gives inconsistent counts, which I did not expect:

>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3

Both .getall() and .extract() show the same behavior.

@elacuesta transferred this issue from scrapy/scrapy May 28, 2020
@elacuesta
Member

AFAICT, this is expected. "td::text" selects text nodes, and an empty <td> has none, so nothing is returned for that cell. That is why len(rows[2].css("td")) != len(rows[2].css("td::text")).

Were you expecting some other value, None for instance?

PS: to reproduce in parsel:

In [1]: html = """<!DOCTYPE html> 
   ...: <html lang="en"> 
   ...: <table class="manual_table"> 
   ...:    <thead> 
   ...:       <tr> 
   ...:         <th class="">Mar 2008</th> 
   ...:         <th class="">Mar 2009</th> 
   ...:         <th class="">Mar 2010</th> 
   ...:       </tr> 
   ...:    </thead> 
   ...:    <tbody> 
   ...:       <tr> 
   ...:         <td class="">8,626</td> 
   ...:         <td class="">8,427</td> 
   ...:         <td class="">11,525</td> 
   ...:       </tr> 
   ...:       <tr> 
   ...:         <td class="">16,408</td> 
   ...:         <td class="">19,582</td> 
   ...:         <td class=""></td> 
   ...:       </tr> 
   ...:       <tr>         
   ...:         <td class=""></td> 
   ...:         <td class="">22,574</td> 
   ...:         <td class="">21,755</td>  
   ...:       </tr> 
   ...:    </tbody> 
   ...: </table>"""

In [2]: from parsel import Selector

In [3]: s = Selector(text=html)

In [4]: rows = s.css(".manual_table tbody tr")

In [5]: rows[0].css("td::text").extract()
Out[5]: ['8,626', '8,427', '11,525']

In [6]: rows[1].css("td::text").extract()
Out[6]: ['16,408', '19,582']

In [7]: rows[2].css("td::text").extract()
Out[7]: ['22,574', '21,755']

@shubham-MLwiz
Author

shubham-MLwiz commented May 28, 2020

Thanks for the clarification.
But I still think that when I am scraping a table, I should be able to get every td value, empty cells included.
Currently I work around it by looping over the cells and calling .get() with a default argument:

rows = response.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

Is there a better way to parse a sparse table other than the looping method?

My suggestion is that .getall() and .extract() should accept a similar default argument. Then, if a tag is present but has no corresponding "::text", we could assign a default value to it rather than dropping it entirely.

@shubham-MLwiz
Author

Is anyone looking into this?

@Gallaecio
Member

> Is there a better way to parse a sparse table other than the looping method?

I believe that is the right way to do it with Parsel.

@ilyazub

ilyazub commented Feb 16, 2022

@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string for each empty data cell, unlike text(), which returns nothing for them.

>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

Full code

from parsel import Selector

html = """<!DOCTYPE html> 
<html lang="en"> 
<table class="manual_table"> 
  <thead> 
    <tr> 
      <th class="">Mar 2008</th> 
      <th class="">Mar 2009</th> 
      <th class="">Mar 2010</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td class="">8,626</td> 
      <td class="">8,427</td> 
      <td class="">11,525</td> 
    </tr> 
    <tr> 
      <td class="">16,408</td> 
      <td class="">19,582</td> 
      <td class=""></td> 
    </tr> 
    <tr>         
      <td class=""></td> 
      <td class="">22,574</td> 
      <td class="">21,755</td>  
    </tr> 
  </tbody> 
</table>
</html>"""

s = Selector(text=html)

rows = s.css(".manual_table tbody tr")

dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

print("Loop:", dt)

dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()

print("One-liner:", dt2)

Output

Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

I'm commenting on this old issue because I ran into it today.
