I have a local HTML page and I want to extract the line that follows each occurrence of
<font color='000000'> <u>PATTERN:</font>
Here is the page source. It is output from the program ApproxMAP, hosted on Google Code:
<! Created by program ApproxMAP by Hye-Chung(Monica) Kum>
<HTML><font size=5 face='Helvetica-Narrow'><b>
<font color='000000'> Cluster Support= [Pattern=</font>
<font color='000000'> 50</font>
<font color='000000'> % : Variation=</font>
<font color='000000'> 20</font>
<font color='000000'> %]; Database Support= [Min= </font>
<font color='000000'> 1</font>
<font color='000000'> seq: Max=</font>
<font color='000000'> 50</font>
<font color='000000'> %]</font>
<BR>
<font color='a9a9a9'> cluster=0 size=3</font>
<font color='000000'> =<100:</font>
<font color='434343'> 85:</font>
<font color='767676'> 70:</font>
<font color='a9a9a9'> 50:</font>
<font color='c8c8c8'> 35:</font>
<font color='e1e1e1'> 20></font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='000000'> 1</font>
<font color='cbcbcb'> 12</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 24</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='7f7f7f'> 2</font>
<font color='7f7f7f'> 3</font>
<font color='cbcbcb'> 25</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 1</font>
<font color='7f7f7f'> 4</font>
<font color='7f7f7f'> 5</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='cbcbcb'> 26</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
<font color='000000'> =</font>
<font color='000000'> 5</font>
<font color='000000'> </u></font>
<BR>
<font color='000000'> {</font>
<font color='717171'> 9</font>
<font color='989898'> 10</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='d3d3d3'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 11</font>
<font color='000000'> }</font>
<font color='000000'> {</font>
<font color='404040'> 12</font>
<font color='989898'> 13</font>
<font color='000000'> }</font>
<BR>
<font color='000000'> TOTAL LEN=</font>
<font color='000000'> 10</font>
<BR>
<BR>
</b></font></html>
In this case, I want to extract the following:
{1,} {2,3,} {4,5,}
{9,10,} {11,} {12,13,}
Here is some code I tried, but none of it worked:
# First try
soup = BeautifulSoup('file:///H:/Approx_google_code/tiny20.html')
soup.findall('PATTERN:')
# Second try
re.search( "PATTERN:", 'file:///H:/Approx_google_code/tiny20.html')
# Third try
soup.body.findAll(text='PATTERN:')
# Fourth try
soup.body.findAll(text=re.compile('PATTERN:'))
I've been stuck on this seemingly easy problem for so long that I've started to wonder whether BeautifulSoup is the right direction. I'm totally new to HTML, so any simple explanations or suggestions are welcome. Thanks.
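One thing worth noting about the attempts above: both BeautifulSoup and re.search work on the string they are handed, so passing 'file:///H:/...' means the URL text itself gets parsed or searched, not the contents of the file. The following is only a minimal sketch of one possible approach, using the standard-library re module instead of BeautifulSoup. It assumes every PATTERN: label is closed by </font> and is immediately followed by one <font> element holding the wanted line, as in the source shown above. The names extract_patterns, PATTERN_RE and SAMPLE are made up for illustration.

```python
import re

# Matches the <font> element that immediately follows a "PATTERN:" label
# and captures its text content (assumes the layout shown in the question).
PATTERN_RE = re.compile(
    r"PATTERN:</font>\s*<font[^>]*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)


def extract_patterns(html):
    """Return the text of each <font> block that follows a PATTERN: label."""
    return [match.strip() for match in PATTERN_RE.findall(html)]


# Shortened copy of the page source from the question.
SAMPLE = """<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {1,} {2,3,} {4,5,}
</font>
<font color='000000'> =</font>
<BR>
<font color='000000'> <u>PATTERN:</font>
<font color='000000'> {9,10,} {11,} {12,13,}
</font>
<font color='000000'> =</font>
"""

for line in extract_patterns(SAMPLE):
    print(line)
```

On the shortened sample this prints the two lines {1,} {2,3,} {4,5,} and {9,10,} {11,} {12,13,}. To run it against the real page, the file would first have to be read into a string, for example with open() and .read() on a path such as H:/Approx_google_code/tiny20.html (inferred from the file:// URL above), and that string passed to extract_patterns. A regular expression is fragile if the generated markup ever changes, so this is a shortcut for this particular output format rather than a general HTML parsing technique.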