Ñò ÁOyHc*@sRdZdZddddddgZdd kZeid dUjo d d Undd kZdd kZdd kZdd kZdd k Z dd k Z dVdWgZ dZ dZ dZdZd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd dXd!„ƒYZd"dYd#„ƒYZd$d%d&d'd(d)d*d+d,d-d.d/d0d1d2d3d4d5d6d7d8d9d:d;d<d=d>d?d@dAdBdCdDdEg"ZedF„eƒZdG„ZdH„ZdI„Z d dJdK„Z"dL„Z#dM„Z$dN„Z%dO„Z&ddZdP„ƒYZ'dQ„Z(dR„Z)dS„Z*e+dTjo e*ƒnd S([s› Manipulate HTML or XHTML documents. Version 1.0.7. This source code has been placed in the public domain by Connelly Barnes. Features: - Translate HTML back and forth to data structures. This allows you to read and write HTML documents programmably, with much flexibility. - Extract and modify URLs in an HTML document. - Compatible with Python 2.0 - 2.4. See the L{examples} for a quick start. s1.0.7texamplest tagextractttagjoint urlextractturljointURLMatchiÿÿÿÿNiiisTrue = 1; False = 0tscripts/scripttstyles/styless cCstt|ƒ}xatt|ƒƒD]M}t||tƒo||i||>> tagextract('hifoo

') [('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo', ('br', {}), ('br/', {}), ('/body', {})] Text between C{'') [('script', {'type': 'a'}), 'var x; ', ('/script', {})] Comment strings and XML directives are rendered as a single long tag with no attributes. The case of the tag "name" is not changed: >>> tagextract('') [('!-- blah --', {})] >>> tagextract('') [('?xml version="1.0" encoding="utf-8" ?', {})] >>> tagextract('') [('!DOCTYPE html PUBLIC etc...', {})] Greater-than and less-than characters occuring inside comments or CDATA blocks are correctly kept as part of the block: >>> tagextract('') [('!-- <><><><>>..> --', {})] >>> tagextract('<>><>]<> ]]>') [('!CDATA[[><>><>]<> ]]', {})] Note that if one modifies these tags, it is important to retain the C{"--"} (for comments) or C{"]]"} (for C{CDATA}) at the end of the tag name, so that output from L{tagjoin} will be correct HTML/XHTML. (t_full_tag_extracttrangetlent isinstancet_TextTagttexttnametattrs(tdoctLti((smodules/htmldata.pyR8s/ "c Cs—t|tiƒptdƒ‚ng}x^|D]V}t|tiƒo|i|ƒq0|ddjo|idƒq0|ddjo|idƒq0|\}}|ddjod }|d }nd }g}|iƒ}|iƒxK|D]C\}} | djo|i|d | d ƒqø|i|ƒqøWd i |ƒ}|d jod |}n|id|||dƒq0Wd i |ƒS(sæ Convert data structure back to HTML. This reverses the L{tagextract} function. More precisely, if an HTML string is turned into a data structure, then back into HTML, the resulting string will be functionally equivalent to the original HTML. >>> tagjoin(tagextract(s)) (string that is functionally equivalent to s) Three changes are made to the HTML by L{tagjoin}: tags are lowercased, C{key=value} pairs are sorted, and values are placed in double-quotes. sexpected list argumentis--s-->s!--s' + ... ' end') [' blah', '', '', ' ', '', 'end'] iRRi( R,R t_BEGIN_COMMENTt startswithtfindt _END_COMMENTRt _BEGIN_CDATAt _END_CDATAR1R+(R-ts_lowerRRtcti2torig_ittagi((smodules/htmldata.pyt _html_split¹sV   ! !       cCs½g}d}xª|t|ƒjo–||}|tijoyx6t|t|ƒƒD]}||tijoPqRqRW||tijo|d7}n|i|||!ƒ|}qtidƒ}|i||ƒ}|o-|i|||iƒ!ƒ|iƒ}qntidƒ}|i||ƒ}|o-|i|||iƒ!ƒ|iƒ}qntidƒ}|i||ƒ}|o-|i|||iƒ!ƒ|iƒ}qqqW|S(s@ Like C{shlex.split}, but reversible, and for HTML. Splits a string into a list C{L} of strings. List elements contain either an HTML tag C{name=value} pair, an HTML name singleton (eg C{"checked"}), or whitespace. The identity C{''.join(L) == s} is always satisfied. >>> _shlex_split('a=5 b="15" name="Georgette A"') ['a=5', ' ', 'b="15"', ' ', 'name="Georgette A"'] >>> _shlex_split('a = a5 b=#b19 name="foo bar" q="hi"') ['a = a5', ' ', 'b=#b19', ' ', 'name="foo bar"', ' ', 'q="hi"'] >>> _shlex_split('a="9"b="15"') ['a="9"', 'b="15"'] iis[^ \t\n\r\f\v"]+\s*\=\s*"[^"]*"s\S+\s*\=\s*\S*s\S+( R tstringt whitespaceR Rtretcompiletmatchtend(R-R"RR9R:tm((smodules/htmldata.pyt _shlex_splitsB       cCsQtdƒgjpt‚tdƒdgjpt‚tdƒdddddgjpt‚tdƒddd d d d d ddg jpt‚tddƒddddddddddd dddddddddd dgjpt‚tdƒdddddgjpt‚tdƒddddddd gjpt‚d!S("s$ Unit test for L{_shlex_split}. RRsa=5 b="15" name="Georgette A"sa=5sb="15"sname="Georgette A"s"a=cvn b=32vsd c= 234jk e d ="hi"sa=cvnsb=32vsds sc= 234jks tesd ="hi"s$ a b c d=e f g h i="jk" l mno = p s qr = "st"R/R0R9sd=etftgthsi="jk"tlsmno = psa=5 b="9"c="15 dfkdfkj "d="25"sb="9"sc="15 dfkdfkj "sd="25"s"a=5 b="9"c="15 dfkdfkj "d="25" e=4se=4N(REtAssertionError(((smodules/htmldata.pyt_test_shlex_splitQs"   !   cCsct|ƒ}h}h}h}d}x/|D]'}|t|ƒ}|idƒ}|djo¢|||} } ||d|t|ƒ} } x0| | jo"|| tijo| d7} qWx4| | jo&|| dtijo| d8} qÃWx0| | jo"|| tijo| d7} qúWx4| | jo&|| dtijo| d8} q-W| | djo>|| djo-|| ddjo| d7} | d8} n|| | !iƒ|| | !} }||| <| | f|| <| | f|| >> _tag_dict('bgcolor=#ffffff text="#000000" blink') ({'bgcolor':'#ffffff', 'text':'#000000', 'blink': None}, {'bgcolor':(0,7), 'text':(16,20), 'blink':(31,36)}, {'bgcolor':(8,15), 'text':(22,29), 'blink':(36,36)}) Returns a 3-tuple. First element is a dict of C{(key, value)} pairs from the HTML tag. Second element is a dict mapping keys to C{(start, end)} indices of the key in the text. Third element maps keys to C{(start, end)} indices of the value in the text. Names are lowercased. Raises C{ValueError} for unmatched quotes and other errors. it=iRN(RER R4R>R?R,tsplitR (R-R$Rtkey_post value_poststartR#RCtequalstk1tk2tv1tv2R'R(((smodules/htmldata.pyt _tag_dicthsN  !%!%7 !    cCs}tdƒhhhfjpt‚tdƒhhhfjpt‚tdƒhdd6dd6dd6hdd6dd6dd6hdd6d d6d!d6fjpt‚d}t|ƒ\}}}|hdd6dd6dd6dd6jpt‚xƒ|iƒD]u}|||d ||d!|jpt‚||djo3|||d ||d!||jpt‚qqWdS("s! Unit test for L{_tag_dict}. Rs s$bgcolor=#ffffff text="#000000" blinks#fffffftbgcolors#000000R tblinkiiiiii$iiiis, bg = val text = "hi you" name e="5" shi youtvaltbgt5RFRiN(ii(ii(ii$(ii(ii(i$i$(RWRKR tkeys(R-R/R0R9R'((smodules/htmldata.pyt_test_tag_dict¥s## )0 +cCs:t|ƒ}dgt|ƒ}xAtdt|ƒƒD]*}||dt||dƒ||>> tag = _full_tag_extract('')[0] >>> tag.attrs {'href': 'd.com'} Surrounding quotes are stripped from the values. @ivar key_pos: Key position dict. Maps the name of a tag attribute to C{(start, end)} indices for the key string in the C{"key=value"} HTML pair. Indices are absolute, where 0 is the start of the HTML document. Example: >>> tag = _full_tag_extract('')[0] >>> tag.key_pos['href'] (3, 7) >>> ''[3:7] 'href' @ivar value_pos: Value position dict. Maps the name of a tag attribute to C{(start, end)} indices for the value in the HTML document string. Surrounding quotes are excluded from this range. Indices are absolute, where 0 is the start of the HTML document. Example: >>> tag = _full_tag_extract('')[0] >>> tag.value_pos['href'] (9, 14) >>> ''[9:14] 'd.com' cCs1||_||_||_||_||_dS(s$ Create an _HTMLTag object. N(RmRRRORP(tselfRmRRRORP((smodules/htmldata.pyt__init__Ms     (R`Rat__doc__Rt(((smodules/htmldata.pyRgs/R cBseZdZd„ZRS(s™ Text extracted from an HTML document by L{_full_tag_extract}. @ivar text: Extracted text. @ivar pos: C{(start, end)} indices of the text. cCs||_||_dS(s# Create a _TextTag object. N(RmR (RsRmR ((smodules/htmldata.pyRt_s (R`RaRuRt(((smodules/htmldata.pyR Wssa hrefsapplet archives applet codesapplet codebases area hrefs base hrefsblockquote citesbody backgroundsdel cites form actionsframe longdescs frame srcs head profiles iframe srcsiframe longdescsimg srcs img ismaps img longdescs img usemaps input srcsins cites link hrefsobject archivesobject codebases object datas object usemaps script srcstable backgroundstbody backgrounds td backgroundstfoot backgrounds th backgroundsthead backgrounds tr backgroundcCst|iƒƒS((ttupleRN(R-((smodules/htmldata.pytvscCs¡ti|ƒ}g}d}xtow|i||ƒ}|o|i|ƒn|S|i|iƒ}|i|iƒ}||jo |}q|d7}qWdS(så Like C{re.finditer}, provided for compatibility with Python < 2.3. Returns a list instead of an iterator. Otherwise the return format is identical to C{re.finditer} (except possibly in the details of empty matches). iiN(R@RARftsearchRRQt lastindexRC(tpatternR>tcompiledR"RQRDtm_starttm_end((smodules/htmldata.pyt _finditerys  cCsËg}d}x¯to§|id|ƒ}|djo|||g7}Pn||||!g7}|id|dƒ}|djot|ƒd}n|d||dg7}|d}qWdi|ƒS(sF Replaces commented out characters with spaces in a CSS document. is/*s*/iiRR(RfR4R R!(RR"RR:ti3((smodules/htmldata.pyt_remove_comments“s  cCsÄdd}tt|ƒƒt|ƒjpt‚dddd}tt|ƒƒt|ƒjpt‚dddd}tt|ƒƒt|ƒjpt‚d }t|ƒd jpt‚d S( s( Unit test for L{_remove_comments}. s$/*d s kjlsdf */*//*/*//**/**/*//**/ai2s/**/s/*5845*/*/*//*/**/dfds/*//**//sa/**/s/**//**/////***/****/*//**//*/is+hi /* foo */ hello /* bar!!!!! */ there!s+hi hello there!N(R R€RK(R-((smodules/htmldata.pyt_test_remove_comments§s &&& s text/htmlcCs‹|iƒ}|djoÛt|ƒ}tdddddd|dƒ}g}|D]+}||i|iƒ|i|iƒfqW~}g}x“|D]T\}}t|t|ƒƒ}||jo&|it ||||t t ƒƒq˜q˜Wn4g}t |ƒ}d } xtt|ƒƒD]} | } || } t| tƒo˜t| tƒo„| id jott| i|dƒ} xNtt| ƒƒD]:} | | i| id 7_| | i| id 7_qW|| 7}q#q| iid ƒo€t| id |dƒ} xVtt| ƒƒD]B} | | i| id d 7_| | i| id d 7_qW|| 7}nx°tD]¨\}}| ii|ƒo‰|| iiƒjos| i|}| i|\}}|}|}| i}| }t ||||t t ||||ƒ }|i|ƒqwqwWqWh}g}xQ|D]I} |i| i| ifƒp'd || i| if<|i| ƒq:q:W|S( s¥ Extract URLs from HTML or stylesheet. Extracts only URLs that are linked to or embedded in the document. Ignores plain text URLs that occur in the non-HTML part of the document. Returns a list of L{URLMatch} objects. >>> L = urlextract('') >>> L[0].url 'a.gif' >>> L[1].url 'www.google.com' If C{siteurl} is specified, all URLs are made into absolute URLs by assuming that C{doc} is located at the URL C{siteurl}. >>> doc = '' >>> L = urlextract(doc, 'http://www.python.org/~guido/') >>> L[0].url 'http://www.python.org/~guido/a.gif' >>> L[1].url 'http://www.python.org/b.html' If C{mimetype} is C{"text/css"}, the document will be parsed as a stylesheet. If a stylesheet is embedded inside an HTML document, then C{urlextract} will extract the URLs from both the HTML and the stylesheet. stext/csssurl\s*\(([^\r\n\("']*?)\)|surl\s*\(\s*"([^\r\n]*?)"\s*\)|surl\s*\(\s*'([^\r\n]*?)'\s*\)|s5@import\s+([^ \t\r\n"';@\(\)]+)[^\r\n;@\(\)]*[\r\n;]|s7@import\s+'([^ \t\r\n"';@\(\)]+)'[^\r\n;@\(\)]*[\r\n;]|s6@import\s+"([^ \t\r\n"';\(\)']+)"[^\r\n;@\(\)]*[\r\n;]s; RiN(R,R€R~RQRyRCtminR RRReRfRR R R R RgRRR RmRthas_keyRPt _URL_TAGSR3R](RtsiteurltmimetypeRt_[1]txR"R-RFR#Rt prev_itemttempR.R/R0turlRQRCttag_namettag_attrt tag_attrst tag_indexttagt start_end_mapt filtered_ans((smodules/htmldata.pyRµst!    ?  .   " )  c CsÑg}|}|iƒt|ƒt|ƒjotdƒ‚nxÞtt|ƒdƒD]Æ}||d||ddjotdƒ‚n||d||djotdƒ‚nt||d||dƒdjp,t||d||dƒt|ƒjotdƒ‚qWqWWd}d}xƒtt|ƒƒD]o}||d||d}t||ƒ}|i||||d|!ƒ|i||ƒ||d}q@W|i||ƒdi|ƒS(s# Replace slices of a string with new substrings. Given a list of slice tuples in C{Lindices}, replace each slice in C{s} with the corresponding replacement substring from C{Lreplace}. Example: >>> _tuple_replace('0123456789',[(4,5),(6,9)],['abc', 'def']) '0123abc5def9' slists differ in lengthiistuples overlaps invalid tuples bad indexR(RR RR R‚tmaxRR!( R-tLindicestLreplaceR"RR.toffsettlen1tlen2((smodules/htmldata.pyt_tuple_replace%s2  !&, cCs¢tdggƒdjpt‚tdggƒdjpt‚tdddgddgƒd jpt‚td dddgdddgƒdjpt‚dS(s& Unit test for L{_tuple_replace}. Rt 0123456789iiii tabctdeft 0123abc5def9t01234567890123456789ii iiitabcdtefgthijkt0abcd9012efg45hijk89N(ii(ii (ii (i i(ii(R™RK(((smodules/htmldata.pyt_test_tuple_replaceNs  c CsUt|g}|D]}||i|ifq~g}|D]}||iq;~ƒS(s0 Write back document with modified URLs (reverses L{urlextract}). Given a list C{L} of L{URLMatch} objects obtained from L{urlextract}, substitutes changed URLs into the original document C{s}, and returns the modified document. One should only modify the C{.url} attribute of the L{URLMatch} objects. The ordering of the URLs in the list is not important. >>> doc = '' >>> L = urlextract(doc) >>> L[0].url = 'foo' >>> L[1].url = 'bar' >>> urljoin(doc, L) '' (R™RQRCR‹(R-RR‡Rˆt_[2]((smodules/htmldata.pyRZs0cCs tiGHdS(s£ Examples of the C{htmldata} module. Example 1: Print all absolutized URLs from Google. Here we use L{urlextract} to obtain all URLs in the document. >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> for u in htmldata.urlextract(contents, url): ... print u.url ... http://www.google.com/images/logo.gif http://www.google.com/search (More output) Note that the second argument to L{urlextract} causes the URLs to be made absolute with respect to that base URL. Example 2: Print all image URLs from Google in relative form. >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> for u in htmldata.urlextract(contents): ... if u.tag_name == 'img': ... print u.url ... /images/logo.gif Equivalently, one can use L{tagextract}, and look for occurrences of C{} tags. The L{urlextract} function is mostly a convenience function for when one wants to extract and/or modify all URLs in a document. Example 3: Replace all C{} links on Google with the Microsoft web page. Here we use L{tagextract} to turn the HTML into a data structure, and then loop over the in-order list of tags (items which are not tuples are plain text, which is ignored). >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> L = htmldata.tagextract(contents) >>> for item in L: ... if isinstance(item, tuple) and item[0] == 'a': ... # It's an HTML tag! Give it an href=. ... item[1]['href'] = 'http://www.microsoft.com/' ... >>> htmldata.tagjoin(L) (Microsoftized version of Google) Example 4: Make all URLs on an HTML document be absolute. >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> htmldata.urljoin(htmldata.urlextract(contents, url)) (Google HTML page with absolute URLs) Example 5: Properly quote all HTML tag values for pedants. >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> htmldata.tagjoin(htmldata.tagextract(contents)) (Properly quoted version of the original HTML) Example 6: Modify all URLs in a document so that they are appended to our proxy CGI script C{http://mysite.com/proxy.cgi}. >>> import urllib2, htmldata >>> url = 'http://www.google.com/' >>> contents = urllib2.urlopen(url).read() >>> proxy_url = 'http://mysite.com/proxy.cgi?url=' >>> L = htmldata.urlextract(contents) >>> for u in L: ... u.url = proxy_url + u.url ... >>> htmldata.urljoin(L) (Document with all URLs wrapped in our proxy script) Example 7: Download all images from a website. >>> import urllib, htmldata, time >>> url = 'http://www.google.com/' >>> contents = urllib.urlopen(url).read() >>> for u in htmldata.urlextract(contents, url): ... if u.tag_name == 'img': ... filename = urllib.quote_plus(u.url) ... urllib.urlretrieve(u.url, filename) ... time.sleep(0.5) ... (Images are downloaded to the current directory) Many sites will protect against bandwidth-draining robots by checking the HTTP C{Referer} [sic] and C{User-Agent} fields. To circumvent this, one can create a C{urllib2.Request} object with a legitimate C{Referer} and a C{User-Agent} such as C{"Mozilla/4.0 (compatible; MSIE 5.5)"}. Then use C{urllib2.urlopen} to download the content. Be warned that some website operators will respond to rapid robot requests by banning the offending IP address. N(RRu(((smodules/htmldata.pyRqsscBs#eZdZddddd„ZRS(s< A matched URL inside an HTML document or stylesheet. A list of C{URLMatch} objects is returned by L{urlextract}. @ivar url: URL extracted. @ivar start: Starting character index. @ivar end: End character index. @ivar in_html: C{True} if URL occurs within an HTML tag. @ivar in_css: C{True} if URL occurs within a stylesheet. @ivar tag_attr: Specific tag attribute in which URL occurs. Example: C{'href'}. C{None} if the URL does not occur within an HTML tag. @ivar tag_attrs: Dictionary of all tag attributes and values. Example: C{{'src':'http://X','alt':'Img'}}. C{None} if the URL does not occur within an HTML tag. @ivar tag_index: Index of the tag in the list that would be generated by a call to L{tagextract}. @ivar tag_name: HTML tag name in which URL occurs. Example: C{'img'}. C{None} if the URL does not occur within an HTML tag. c CsŽ||_||_||_|||!|_||_||_|djoti||iƒ|_n||_ ||_ | |_ | |_ dS(s# Create a URLMatch object. N( RRQRCR‹tin_htmltin_cssR turlparseRRRŽRRŒ( RsRRQRCR…R¥R¦RRŽRRŒ((smodules/htmldata.pyRts         N(R`RaRuR Rt(((smodules/htmldata.pyRæs cCs‰ddd}dddd}dd }d d d }d }|}|dit|ƒƒjpt‚t|ƒdddddddddddddddgjpt‚|}|dit|ƒƒjpt‚dd}|dit|ƒƒjpt‚t|ƒd d!dd"dgjpt‚d#d$}|dit|ƒƒjpt‚t|ƒd%d&d'd(d)d*d+d,d-d.d/g jpt‚d0d1d2}|dit|ƒƒjpt‚t|ƒd3d4d5d6d!d)d7d+d!d8d!d9d:d+d2gjpt‚|}|dit|ƒƒjpt‚t|ƒd;d<d=d>d?d@dAdBgjpt‚|}tdƒgjpt‚t|ƒddChfdDhdEdF6fddGhfddHhfdIhfdJhfdKhdLdM6fdNhdŸdO6fdPhdQdR6fdShfdThfdgjpt‚dUdVd}tt|ƒƒ|jpt‚|}dddd}||jpt‚|}t|ƒdWdChfdXhfdYhfdZhfdRhd[d\6d]dN6fd^hd_d`6fdahfdbhfdcddhfdZhd]de6fdfdghfd!dhhdidj6fdkhfdlhfgjpt‚tt|ƒƒdmdndodjpt‚|}t|ƒd/hdpdq6drds6dtdu6fd/hdvdu6dwdx6fgjpt‚tt|ƒƒdyjpt‚xõ|||gD]ä}t|ƒ}xÏt|ƒD]Á\} } t| t ƒo¥x¢| i i ƒD]} || i | dz| i | d{!i ƒ| jpt‚| i | dŸjo<|| i| dz| i| d{!| i | jpt‚qPqPWq$q$WqWd|} d}| }t|ƒ}t|ƒ| jpt‚xNt| ƒD]@} || d~hd]d6d€d6d‚d/6dƒd„6fjpt‚q0Wd…d†d‡dˆd‰dŠd‹}dŒddŽdd} d/d‘hfd„d’d‡d“hfd”dChd„d/6fd•hfd–d—hfd˜d™hfdšg }td›ƒdœhfgjpt‚tdƒdžhfgjpt‚t|ƒ|jpt‚tt|ƒƒ| jpt‚dŸS( s2 Unit tests for L{tagextract} and L{tagjoin}. s/ Hi

Ho


s)
s' Bye! s1 s.s8end s8s, < html >< tag> s1s/s3 ] ][]]>s
Rs sstHis

tHos

s
s
ss ssss Bye! s4

Headers

s)RtHeaders: blah ok whatass blah ok ss whatssR/s;! -ss Rbss sst-s ssss' ] ][]]>s ssHi

Ho


s)
s s!-- Comment
--thiyatfoot6tcontentR\tisRetbrokentyayRs<><>>!>!_-s/scripts7 end s foobar/ =threft10tbaset15Rˆt9t20tts6iiiès7stag/Rs6afdjherknc4 cdk jR(t7t8R0sasbcs#szrxs%tts9abczrxtts?xml version="1.0"?s !DOCTYPE htmls9"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"R9s!-- Comment <><> hi! --tzs![CDATA[ some content ]]trxs!![C[DATA[ more and weirder ] ][]]ttts's%?xml version="1.0" encoding="utf-8" ?ss!DOCTYPE html PUBLIC etc...N(R!R=RKRR RRR*R RgRR]ROR,RPR R (tdoc1tdoc2tdoc3tdoc4tdoc5R-ts2tdoc2oldRRR#R'tntdoc1jointans1((smodules/htmldata.pyt_test_tagextractsè #   ## #   #   #   ! &   %   (  & ,   #     c Csdd}dddddd}d d d d d }ddd}dd}|}t|ddƒ}g}|D]}||iqu~} | ddddgjpt‚g} |D]#}| ||i|i!|ijq¹~ itƒdjpt‚|}t|ddƒ}g} |D]}| |iq~ } | ddddddd d!d"d#d$g jpt‚g} |D]#}| ||i|i!|ijqx~ itƒdjpt‚|}t|ddƒ}g} |D]}| |iqÞ~ } | dddd%dgjpt‚g}|D]#}|||i|i!|ijq%~itƒdjpt‚|}t|ƒ}g}|D]}||iq…~} d&d'd(d)d*d+d&d,g}| |jpt‚xJtt|ƒƒD]6}|||i||i!||ijpt‚qãWd-}||}t|ƒ}g}|D]}||iqD~}|| |jpt‚xJtt|ƒƒD]6}|||i||i!||ijpt‚qˆWd.}t||ƒ}g}|D]}||iqâ~} | g}|D]}|t i ||ƒq ~jpt‚t |t|ddƒƒ|jpt‚t |t|ƒƒ|jpt‚|}t|ƒ}d/|d0_d1|d2_d3|d4_t ||ƒd5d6d7d8d9jpt‚|}t|ƒ}g}|D]}||iq~} | dd&d:d'gjpt‚g}|D]#}|||i|i!|ijqI~itƒdjpt‚d;S(<s2 Unit tests for L{urlextract} and L{urljoin}. s-urlblah, url ( blah2, url( blah3) url(blah4) s:url("blah5") hum("blah6") url)"blah7"( url ( " blah8 " );;s5bs9http://www.ignore.us/s* http://www.nowhere.com cs-@import foo; @import bar @import url('foo2');s/@import url('http://bar2') @import url("foo!");s,@import 'foo3' @import "bar3"; @importfails;s.@import;@import ;url('howdy!') @import foo5 ;s!@import 'foo6' @import "foo7";s-@import foo handheld; @import 'bar' handheld s3@import url('foo2') handheld; @import url(bar2) ha s@import url("foo3") handheld s2bs4R†stext/csss blah3tblah4tblah5s blah8 iRµR»tfoo2s http://bar2sfoo!tfoo3tbar3showdy!tfoo5tfoo6tfoo7tbar2sa.gifsb.htmls./c.pngshttp://www.abc.edu/d.tgash.gifshttp://www.testdomain.com/s/i.pngidshttp://www.python.org/~guido/tFOOitBARisF00!is9bs5s<http://www.ignore.us/ http://www.nowhere.com s6s)csbar.cssN( RR‹RKRQRCtcountReR R R§R(RÊRËRÌRÍRÎR-RR‡RˆtL2R¤t_[3]t_[4]t_[5]t_[6]t_[7]R"RRÑRÏtL3t_[8]tL4RÀt_[9]t_[10]t_[11]t_[12]((smodules/htmldata.pyt_test_urlextractÕs~$ N$#N$#N $ 4  $4$;)#      $ cCs[dGHtƒdGHtƒdGHtƒdGHtƒdGHtƒdGHdGHtƒdGHd GHd S( s Unit test main routine. s Unit tests:s _remove_comments: OKs _shlex_split: OKs _tag_dict: OKs _tuple_replace: OKs tagextract: OKs tagjoin: OKs urlextract: OKs urljoin: OKN(RRLR^R£RÔRî(((smodules/htmldata.pyt_test:st__main__(iii(sscripts/script(sstyles/style((((,Rut __version__t__all__tsyst version_infoR@tshlexR>turllibR§RR+R2R5R6R7RRR*R1R=RERLRWR^RRgR R„tmapR~R€RR RR™R£RRRRÔRîRïR`(((smodules/htmldata.pytsp            9 4  \ <  =  e;              p )  u8 · e