Remove escape sequences from parsed HTML
I'm using the Python mechanize module to submit a simple query to a
website, then break down the returned elements to get the data I need. But
I can't seem to correctly handle the escape sequences that come back in
the response. Here's my code:
    import mechanize
    import re

    def stripEscape(string):  # credit goes to sarnold
        delete = ""
        i = 1
        while i < 0x20:
            delete += chr(i)
            i += 1
        t = string.translate(None, delete)
        return t
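As an aside, string.translate(None, delete) is the Python 2 form; in Python 3, str.translate takes a mapping table rather than a deletechars argument. A minimal Python 3 equivalent of stripEscape (assuming the goal is simply deleting control characters 0x01-0x1F) could be sketched as:

```python
def strip_escape(s):
    # Build a table mapping each control character 0x01-0x1F to None;
    # str.translate interprets None as "delete this character".
    table = str.maketrans('', '', ''.join(chr(i) for i in range(1, 0x20)))
    return s.translate(table)

print(strip_escape('\r\n</table>\r\n'))  # prints '</table>'
```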
    def getHTML(metID):
        br = mechanize.Browser()
        response = br.open("http://urlgoeshere.com")
        br.form = list(br.forms())[0]
        br["PROMPT12"] = metID
        response = br.submit()
        htmlText = response.read()
        parseHTML(htmlText)
    def parseHTML(htmlText):
        htmlText.index('table')
        arr = re.split(r'(</?\w{2}>)', htmlText)  # everything after background tag
        logFile = open('Log.txt', 'wb')
        for ele in arr:
            ele = stripEscape(ele)
            if ele == '':
                arr.remove(ele)
        for ele in arr:
            logFile.write("ele: " + ele + '\n')
            if re.match('/table', ele):
                logFile.write("END OF TABLE FOUND")
                logFile.write("\nele: " + ele + '\n')
                break
            # other element filters
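Separately from the escape-sequence question, a likely contributing factor: the first loop rebinds the local name ele, so the stripped values are never stored back into arr, and calling arr.remove() while iterating over arr skips elements. A rebuilt-list sketch (using a hypothetical strip_ctrl helper in place of stripEscape) would avoid both problems:

```python
def strip_ctrl(s):
    # Hypothetical stand-in for stripEscape: delete control chars 0x01-0x1F.
    return s.translate({i: None for i in range(1, 0x20)})

arr = ['\r\nnormal', '', '\r\n</table>\r\n', 'stuff']
arr = [strip_ctrl(ele) for ele in arr]   # keep the stripped values
arr = [ele for ele in arr if ele != '']  # drop empties without mutating mid-loop
print(arr)  # ['normal', '</table>', 'stuff']
```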
The stripEscape function works just fine when I pass it arguments in the
interactive shell, but one of the array elements from the website is
\r\n</table>\r\n, and the embedded escape sequences slip past my filters.
It gets written to my log file like so:
    ele: normal
    ele: stuff
    ele: 
    </table>
    ele: more
    ele: normal
Because the closing table tag bypasses the filter, all my subsequent
filters go haywire. Is there a better way of handling escape sequences?
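One detail worth isolating: re.match only matches at the very start of the string, so any leading characters defeat a pattern like '/table', whereas re.search scans the whole string. This may be part of why the tag slips through, and can be reproduced on its own:

```python
import re

ele = '\r\n</table>\r\n'
print(bool(re.match('/table', ele)))   # False: match is anchored at index 0
print(bool(re.search('/table', ele)))  # True: search scans the whole string
```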