Getting data from table #125

Open
SalvatoreRa opened this issue Dec 15, 2022 · 10 comments

@SalvatoreRa

Very nice package.

I am trying to write a script that extracts the episode summaries for each season of a TV series:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')
p.sections
p.content

The content does not include the episode text (on the page it is inside a table). I have also tried
p.table_of_contents['Episodes']['Season 1 (2022)']

which returns an empty structure.

Thank you very much for your help

@barrust
Owner

barrust commented Dec 15, 2022

I am glad that you find the package useful! I haven't been able to find a way to pull table data directly from the MediaWiki API, but you could use BeautifulSoup to parse the HTML directly.

Something like:

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')

soup = BeautifulSoup(p.html, "html.parser")
episodes = soup.find("table", {"class": "wikiepisodetable"})

# Do something to parse the table as per the documentation on bs4

I hope this is helpful!

@SalvatoreRa
Author

Thank you for your reply,

I have used beautifulsoup:

import requests
from bs4 import BeautifulSoup

def text_recovery(url):
    # Make a request to the URL
    response = requests.get(str(url))

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the first table that contains the episode summaries
    table = soup.find('table', {'class': 'wikiepisodetable'})
    text = []

    # Pages without an episode table yield an empty list
    if table is None:
        return text

    # Iterate over the rows in the table
    for row in table.find_all('tr'):
        # Find the cells in each row
        cells = row.find_all('td')

        # A row with a single cell holds the episode summary
        if len(cells) == 1:
            # Collect the summary text
            text.append(cells[0].text)
    return text

This works for the Andor page. However, I have realized that not all pages are structured the same way, and I was wondering if there is a way to extract the same information in a page-agnostic way: something that, given a series X, provides the text of its episode descriptions.

@barrust
Owner

barrust commented Dec 16, 2022

Sadly, not that I know of, as I haven't been able to find a MediaWiki API that can help with that.

I will have to look at whether the content or wikitext output could help.

@SalvatoreRa
Author

Is there maybe a way to interact with the database directly? Is the information stored in some sort of SQL or GraphQL database behind the wiki?

@barrust
Owner

barrust commented Dec 16, 2022

Not through this Python package, as it is just a wrapper for the API and doesn't have access to the back-end system, only to what the API provides.

@SalvatoreRa
Author

I understand, thank you for your help

@barrust
Owner

barrust commented Dec 16, 2022

The p.wikitext property might be helpful as it has this type of information:

===Season 1 (2022)===
{{Episode table |background=#804A41 |overall= |title= |director= |writer= |airdate= |released=y |episodes=
{{Episode list
 |EpisodeNumber   = 1
 |Title           = Kassa 
 |DirectedBy      = [[Toby Haynes]]
 |WrittenBy       = [[Tony Gilroy]]
 |OriginalAirDate = {{Start date|2022|9|21}}
 |ShortSummary    = Five years before the Battle of Yavin, Cassian Andor looks for his missing sister in the industrial planet of Morlana One. While investigating, Cassian is antagonized by two officers. An altercation ensues, leading to Cassian accidentally killing one officer and murdering the other. He flees to the planet Ferrix and attempts to hide his involvement by convincing his adopted mother Maarva's droid, B2EMO, and his friend, Brasso, to cover for him. Having a Starpath Unit (a valuable piece of Imperial navigation technology), Cassian asks his friend Bix to connect him with a black market buyer. Bix agrees and contacts the buyer. Meanwhile, Bix's boyfriend, Timm, is suspicious of Andor. To improve his report to the Imperial authorities, Morlana One's chief inspector of security elects to cover up the murders. However, his deputy, the dutiful Syril Karn, is determined to solve the case. He identifies Cassian's ship, traces it to Ferrix and learns that the fugitive is from the planet Kenari. In a flashback, a younger Cassian, known as Kassa, and his tribe on Kenari decide to investigate a crashed ship. Kassa rebuffs his younger sister's efforts to join them, leaving her behind to guard their encampment. 
 |LineColor       = 804A41
}}
...
}}

This means the wikitext could also be parsed to get the text; I still haven't seen an API endpoint that pulls tables directly.
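
A minimal sketch of parsing that wikitext using only the standard library. The helper names `find_templates` and `parse_params` are hypothetical (they are not part of the mediawiki package), and the sample text is a shortened version of the snippet above. It assumes the page uses the `Episode list` template with a `ShortSummary` field, which will not hold for every page.

```python
def find_templates(wikitext, name):
    """Return the body of every {{name ...}} template, honoring nested braces."""
    bodies = []
    marker = "{{" + name
    start = wikitext.find(marker)
    while start != -1:
        depth, i = 0, start
        while i < len(wikitext) - 1:
            pair = wikitext[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        # If the template is never closed, keep everything up to the end
        end = i - 2 if depth == 0 else len(wikitext)
        bodies.append(wikitext[start + len(marker):end])
        start = wikitext.find(marker, i)
    return bodies


def parse_params(body):
    """Split a template body on top-level pipes and return its named parameters."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    params = {}
    for part in parts:
        key, sep, value = part.partition("=")
        if sep:  # skip positional (unnamed) parameters
            params[key.strip()] = value.strip()
    return params


# Shortened sample; with a real page this would be p.wikitext
sample = """===Season 1 (2022)===
{{Episode table |background=#804A41 |overall= |title= |episodes=
{{Episode list
 |EpisodeNumber   = 1
 |Title           = Kassa
 |DirectedBy      = [[Toby Haynes]]
 |OriginalAirDate = {{Start date|2022|9|21}}
 |ShortSummary    = Cassian Andor looks for his missing sister.
 |LineColor       = 804A41
}}
}}"""

episodes = [parse_params(body) for body in find_templates(sample, "Episode list")]
summaries = [ep.get("ShortSummary", "") for ep in episodes]
```

Pipes inside nested templates and links (such as `{{Start date|2022|9|21}}`) are not treated as separators, so those values come back as raw wikitext and may need further cleanup.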

@SalvatoreRa
Author

I will try with p.wikitext!

However, I still have to find a way to handle cases where the episodes (and the table) are on another page. The problem with wiki pages is that the format is not uniform.
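
One case that can sometimes be detected from the wikitext: MediaWiki lets a page transclude another page with the `{{:Page title}}` syntax, and some series articles pull in their episode table this way. A rough sketch, assuming that syntax is what the page uses; `transcluded_pages` is a hypothetical helper and the sample titles are made up.

```python
import re

# Matches {{:Page title}} or {{:Page title|args}}; the leading colon marks
# a transclusion of a page from the main namespace
TRANSCLUSION = re.compile(r"\{\{\s*:\s*([^|{}]+?)\s*(?:\||\}\})")


def transcluded_pages(wikitext):
    """Return titles transcluded with {{:Title}}, in order, without duplicates."""
    titles = []
    for match in TRANSCLUSION.finditer(wikitext):
        title = match.group(1).strip()
        if title not in titles:
            titles.append(title)
    return titles


# Made-up wikitext for illustration only
sample_page = (
    "==Episodes==\n"
    "{{:List of Example episodes}}\n"
    "{{Episode table}}\n"
    "{{:Example (season 2)|only=summary}}\n"
)

linked = transcluded_pages(sample_page)
```

Each returned title could then be loaded with `wikipedia.page(title)` and its `wikitext` inspected in turn. Pages that only link to the episode list, rather than transcluding it, would not be caught by this.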

@barrust
Owner

barrust commented Dec 16, 2022

Yes, that is the one drawback: it isn't always standardized.

Good luck!

@SalvatoreRa
Author

Yes, it is a pity, since there is so much interesting information on wikis for model training or building apps.

Thank you very much for your help!
