
A Simple Way to Parse HTML Tables in Python

Author: Anonymous  Editor: admin  Updated: 2022-06-22

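This article shows, by example, a simple way to parse HTML tables in Python. The details are as follows:

The code depends on libxml2dom, so make sure it is installed first. Import the code into your script and call the parse_tables() function.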

parse_tables() takes three arguments:

1. source = a string containing the HTML source code. You can pass in just the table or the code of the entire page.

2. headers = a list of ints OR a list of strings.

If the headers are ints, the table is treated as having no header row. List the 0-based indexes, within each row, of the cells you want to extract.

If the headers are strings, the table is treated as having header columns (marked up with header tags), and the function pulls the information from the columns you name.

3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the one you want to parse is the third table in the code, pass in 2 here.

The function returns a list of lists. Each inner list holds the data extracted from one row.
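The interface described above can be sketched without libxml2dom. The following is a minimal illustration, not the article's implementation: it swaps in the standard library's html.parser, assumes Python 3, assumes tables are not nested, and assumes the first row of a table is its header row. The names _TableCollector and parse_tables_sketch are hypothetical.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collect every table as a list of rows; each row is a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables in document order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables_sketch(source, headers, table_index):
    """Return the selected columns of one table as a list of lists."""
    collector = _TableCollector()
    collector.feed(source)
    collector.close()
    if not headers or not 0 <= table_index < len(collector.tables):
        return None
    rows = collector.tables[table_index]
    if all(isinstance(h, int) for h in headers):
        # Int headers: use them directly as cell positions in every row.
        columns, body = list(headers), rows
    elif all(isinstance(h, str) for h in headers) and rows:
        # String headers: look them up in the first row (assumed header row).
        if any(h not in rows[0] for h in headers):
            return None
        columns, body = [rows[0].index(h) for h in headers], rows[1:]
    else:
        return None
    # Skip rows that are too short to contain every requested cell.
    return [[row[c] for c in columns] for row in body if len(row) > max(columns)]


page = "<table><tr><th>Name</th><th>Age</th></tr><tr><td>Ann</td><td>30</td></tr></table>"
print(parse_tables_sketch(page, ["Age"], 0))  # [['30']]
print(parse_tables_sketch(page, [0], 0))      # [['Name'], ['Ann']]
```

With string headers the header row itself is left out of the result. With int headers every row is returned, because the sketch has no way to tell a header row from a data row.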

The code is as follows:

# The goal of the table parser is to get specific information from specific
# columns in a table.
# Input: source code from a typical website
# Arguments: a list of headers the user wants to return
# Output: a list of lists of the data in each row
import libxml2dom


def parse_tables(source, headers, table_index):
    """parse_tables(string source, list headers, int table_index)

    headers may be a list of strings if the table has headers defined, or
    a list of ints if no headers are defined; in that case data is taken
    from those indexes within each row.

    This function returns a list of lists.
    """
    # Determine whether the headers list holds strings or ints. Only the
    # first element is checked, so the caller must keep the types uniform.
    print('Printing headers: %s' % (headers,))
    # Route to the correct function.
    if isinstance(headers[0], int):
        # Int headers: run the no_header function.
        return no_header(source, headers, table_index)
    elif isinstance(headers[0], str):
        # String headers: run the header_given function.
        return header_given(source, headers, table_index)
    else:
        # Return None if the headers aren't of a supported type.
        return None


# This function takes the source code of the whole page, a list of header
# strings, and the index of the table on the page. It returns a list of
# lists with the scraped information.
def header_given(source, headers, table_index):
    # Initialise the list that holds the return value.
    return_list = []
    # Initialise a list to hold the index numbers of the data in the rows.
    header_index = []
    # Get a document object out of the source code.
    doc = libxml2dom.parseString(source, html=1)
    # Get the tables from the document.
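The comment in parse_tables() says the headers should all be the same type, but the listing as shown inspects only headers[0] and would raise an IndexError on an empty list. One way that check could look is sketched below, assuming Python 3. The helper name header_kind and its return values are hypothetical and not part of the original code.

```python
def header_kind(headers):
    """Classify a headers argument as 'index', 'name', or None if invalid."""
    if not headers:
        # An empty list gives nothing to extract.
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(h, int) and not isinstance(h, bool) for h in headers):
        return "index"
    if all(isinstance(h, str) for h in headers):
        return "name"
    # Mixed or unsupported element types.
    return None
```

A caller could then route on the result: 'index' to the no-header path, 'name' to the header path, and None to an early return.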
