Skip to content

pheonixchen/COSMIC-cell-lines-webcrawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

COSMIC-cell-lines-webcrawler

为了使用oncopredict来预测肿瘤药物,首先我需要详细了解COSMIC数据库中的所有细胞系信息。为了实现这一目标,我采用了Python编程语言,并结合requests、lxml、pandas和BeautifulSoup4等库来进行数据爬取和处理。

在开始爬取之前,我首先分析了COSMIC数据库的网页结构,找到了包含细胞系信息的链接。随后,我使用requests库发送HTTP/GET请求,获取这些链接的网页内容。

使用lxml库,我能够从网页内容中提取所需的细胞系详情数据。将数据存储在pandas的数据框架中,使其便于进行后续处理和分析。

由于细胞系数量较大,我编写了批量爬取的代码,自动处理所有细胞系的链接。这样一来,我能够高效地获取大量数据,为后续预测工作打下了坚实的基础。

经过数据爬取和处理,最终我成功获得了1020个细胞系的详细数据(CSV),包括与疾病相关的信息,这些数据将成为使用oncopredict进行肿瘤药物预测的重要依据。

To utilize oncopredict for tumor drug prediction, I needed to acquire detailed information about all cell lines in the COSMIC database.

I employed Python, along with the requests, lxml, pandas, and BeautifulSoup4 libraries, to scrape the required links for all cell lines.

The links were crawled in bulk, using a batch processing approach, to efficiently gather the data.

In the end, I successfully obtained detailed data on 1020 cell lines, including their associated diseases and other relevant information. These data serve as a crucial foundation for conducting tumor drug prediction with oncopredict.

Releases

No releases published

Packages

No packages published

Languages