Monday, 29 June 2015

Python extract info from a local html file [duplicate]

I have a local HTML file and I want to extract the line that follows each occurrence of

<font color='000000'> <u>PATTERN:</font>

Here is the page source; it is the output of the program ApproxMAP, hosted on Google Code:

<! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
<HTML><font size=5 face='Helvetica-Narrow'><b>
<font color='000000'> Cluster Support= [Pattern=</font>
<font color='000000'> 50</font>
<font color='000000'> % : Variation=</font>
<font color='000000'> 20</font>
<font color='000000'> %]; Database Support= [Min= </font>
<font color='000000'> 1</font>
<font color='000000'>  seq: Max=</font>
<font color='000000'> 50</font>
<font color='000000'> %]</font>
<BR>
<font color='a9a9a9'> cluster=0 size=3</font>
<font color='000000'>   =<100:</font>
<font color='434343'> 85:</font>
<font color='767676'> 70:</font>
<font color='a9a9a9'> 50:</font>
<font color='c8c8c8'> 35:</font>
<font color='e1e1e1'> 20></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,} 
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='000000'> 1</font>
<font color='cbcbcb'> 12</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 24</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='7f7f7f'> 2</font>
<font color='7f7f7f'> 3</font>
<font color='cbcbcb'> 25</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 1</font>
<font color='7f7f7f'> 4</font>
<font color='7f7f7f'> 5</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 26</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,} 
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='717171'> 9</font>
<font color='989898'> 10</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='d3d3d3'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 12</font>
<font color='989898'> 13</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> TOTAL LEN=</font>
<font color='000000'> 10</font>
<BR>
<BR>
</b></font></html>

In this case, I want to extract the following:

{1,} {2,3,} {4,5,} 
{9,10,} {11,} {12,13,} 

Here is the code I tried; none of it worked:

# First try
soup = BeautifulSoup('file:///H:/Approx_google_code/tiny20.html')
soup.findall('PATTERN:')

# Second try
re.search( "PATTERN:", 'file:///H:/Approx_google_code/tiny20.html')

# Third try
soup.body.findAll(text='PATTERN:')

# Fourth try
soup.body.findAll(text=re.compile('PATTERN:'))
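One likely reason the first two attempts find nothing is that they pass the location string itself, so the search runs over the text 'file:///H:/...' rather than over the HTML stored in that file. A minimal sketch of the alternative, assuming the file contents have already been read into a string (the name `extract_patterns` and the shortened `html` sample are illustrative, not part of the original code):

```python
import re

# Shortened excerpt of the page source shown above. In practice the string
# would come from reading the file, e.g. (assumed local path):
#   with open(r"H:\Approx_google_code\tiny20.html") as f:
#       html = f.read()
html = """<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
<font color='000000'> =</font>
"""

# Match the PATTERN: label, then capture the text of the <font> element
# that immediately follows it, without surrounding whitespace.
PATTERN_RE = re.compile(
    r"<u>PATTERN:</font>\s*<font[^>]*>\s*(.*?)\s*</font>",
    re.DOTALL,
)


def extract_patterns(html_text):
    """Return the text that follows each PATTERN: label (hypothetical helper)."""
    return PATTERN_RE.findall(html_text)


for pattern in extract_patterns(html):
    print(pattern)
```

The regular expression is tied to the exact markup ApproxMAP emits here, so it is only a sketch for this file layout, not a general HTML solution.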

I've been stuck on this seemingly easy problem for so long that I've started to wonder whether BeautifulSoup is the right tool. I'm completely new to HTML, so any simple explanations or suggestions are welcome. Thanks.
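If BeautifulSoup is not a requirement, one possible direction is the standard library's `html.parser`, which reports each piece of text between tags in document order. The sketch below keeps the piece that comes right after every `PATTERN:` label; the class name `PatternCollector` and the shortened `sample` string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PatternCollector(HTMLParser):
    """Collect the text chunk that follows each 'PATTERN:' label."""

    def __init__(self):
        super().__init__()
        self.patterns = []
        self._after_label = False  # True once a PATTERN: label was just seen

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only chunks between tags
        if self._after_label:
            self.patterns.append(text)
            self._after_label = False
        elif text == "PATTERN:":
            self._after_label = True


# Shortened excerpt of the page source; the real input would be the
# contents of tiny20.html read from disk.
sample = """<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
"""

collector = PatternCollector()
collector.feed(sample)
collector.close()
print(collector.patterns)
```

Because the parser only reacts to text chunks, it does not depend on the unbalanced `<u>` and `<font>` tags in the ApproxMAP output being well formed.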
