Class Webcrawler.Crawler.HTMLParser
java.lang.Object
|
+----Webcrawler.Crawler.HTMLParser
- public class HTMLParser
- extends Object
An HTMLParser object scans a local file and extracts HTML-specific information: the tags
(with their attributes and values) and the text between the tags. If you are interested only in
the tags, call readTag() repeatedly; it skips the text between the tags. readInfo()
works the same way but skips the tags. If you need the entire HTML file, call
comesTag() and comesInfo() to find out what comes next, then call the corresponding readTag() or readInfo().
A tag can consist of several attributes (with their values). To scan the attributes of the
tag just read, call getCurrAttributes() and cast the elements of the enumeration to
HTMLAttribute. The first item in a tag is the element name (e.g. IMG). To get it,
call getCurrElement() after calling readTag(). If readTag() finds that the next thing to be
read is a comment, it reads the entire comment and stores it in the currComment field. In that case
currElement is null; otherwise currComment is null.
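The calling pattern described above can be sketched as follows. Because the HTMLParser class itself is not reproduced on this page, the example uses a hypothetical in-memory stand-in, MiniScanner, that scans a String and mimics only the documented contract of comesTag(), comesInfo(), readTag(), readInfo() and the getCurr...() accessors. Its parsing logic is an assumption and far simpler than a real parser.

```java
import java.util.ArrayList;
import java.util.List;

class ScanLoopSketch {

    /** Hypothetical stand-in for HTMLParser; scans a String instead of a file. */
    static class MiniScanner {
        private final String src;
        private int pos = 0;
        private String currElement, currComment, currInfo;

        MiniScanner(String src) { this.src = src; }

        /** True if the next thing to be read is a tag. */
        boolean comesTag() { return pos < src.length() && src.charAt(pos) == '<'; }

        /** True if the next thing to be read is text between tags. */
        boolean comesInfo() { return pos < src.length() && src.charAt(pos) != '<'; }

        /** Reads the next tag or comment, skipping any text in front of it. */
        boolean readTag() {
            int open = src.indexOf('<', pos);
            if (open < 0) { pos = src.length(); return false; }
            currElement = null;
            currComment = null;
            if (src.startsWith("<!--", open)) {
                int end = src.indexOf("-->", open + 4);
                if (end < 0) { pos = src.length(); return false; }
                currComment = src.substring(open + 4, end);
                pos = end + 3;
                return true;
            }
            int close = src.indexOf('>', open);
            if (close < 0) { pos = src.length(); return false; }
            // The first token inside the tag is the element name.
            currElement = src.substring(open + 1, close).trim().split("\\s+")[0];
            pos = close + 1;
            return true;
        }

        /** Reads the next text between tags, skipping any tags in front of it. */
        boolean readInfo() {
            while (comesTag()) {
                if (!readTag()) return false;
            }
            if (pos >= src.length()) return false;
            int open = src.indexOf('<', pos);
            int end = (open < 0) ? src.length() : open;
            currInfo = src.substring(pos, end);
            pos = end;
            return true;
        }

        String getCurrElement() { return currElement; }
        String getCurrComment() { return currComment; }
        String getCurrInfo() { return currInfo; }
    }

    /** Walks the whole input, alternating between tags and text as they come. */
    static List<String> walk(String html) {
        MiniScanner p = new MiniScanner(html);
        List<String> events = new ArrayList<>();
        while (true) {
            if (p.comesTag()) {
                if (!p.readTag()) break;
                if (p.getCurrElement() != null) {
                    events.add("TAG " + p.getCurrElement());
                } else {
                    // A null element means a comment was read instead.
                    events.add("COMMENT " + p.getCurrComment());
                }
            } else if (p.comesInfo()) {
                if (!p.readInfo()) break;
                events.add("INFO " + p.getCurrInfo());
            } else {
                break; // end of input
            }
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : walk("<P>Hi<!--c--><B>x</B>")) {
            System.out.println(event);
        }
    }
}
```

With the real class, the same loop would presumably be driven by a parser created with new HTMLParser(fileName) and finished with close(); only the loop structure carries over from this sketch.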
-
currAttributes
- Stores the currently read attributes (=list of HTMLAttributes)
-
currChar
-
-
currComment
- Stores the currently read comment (<!-- ... -->)
-
currElement
- Stores the currently read Element (e.g: "BODY")
-
currInfo
- Stores the currently read info (=text between tags)
-
in
-
-
parsedCharsNum
-
-
success
-
-
HTMLParser(String)
- Creates a new Parser on the local file htmlFile
Use readTag() and readInfo() to read from the file, then use getCurrElement(),
getCurrAttributes() and getCurrInfo() to find out what was read.
-
afterBlanks()
- Skips blanks, carriage returns, newlines and tabs
-
close()
-
-
comesInfo()
-
-
comesTag()
-
-
finalize()
-
-
getCurrAttributes()
-
-
getCurrComment()
-
-
getCurrElement()
-
-
getCurrInfo()
-
-
getParsedCharsNum()
-
-
isWhiteSpace(char)
-
-
read()
-
-
readAttribute()
-
-
readComment(String)
-
-
readElement()
-
-
readInfo()
- Reads the next info between tags.
-
readTag()
- Reads the next tag.
-
readValue()
-
-
success()
-
in
protected FileInputStream in
success
protected boolean success
currChar
protected char currChar
parsedCharsNum
protected int parsedCharsNum
currElement
protected String currElement
- Stores the currently read Element (e.g: "BODY")
currComment
protected String currComment
- Stores the currently read comment (<!-- ... -->)
currAttributes
protected Vector currAttributes
- Stores the currently read attributes (=list of HTMLAttributes)
- See Also:
- HTMLAttribute
currInfo
protected String currInfo
- Stores the currently read info (=text between tags)
HTMLParser
public HTMLParser(String htmlFile)
- Creates a new Parser on the local file htmlFile
Use readTag() and readInfo() to read from the file, then use getCurrElement(),
getCurrAttributes() and getCurrInfo() to find out what was read.
finalize
protected void finalize() throws Throwable
- Overrides:
- finalize in class Object
close
public void close()
read
private char read()
isWhiteSpace
private boolean isWhiteSpace(char ch)
afterBlanks
private char afterBlanks()
- Skips blanks, carriage returns, newlines and tabs
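As a rough illustration of the white-space handling described for isWhiteSpace(char) and afterBlanks(), here is a minimal sketch. The real methods are private and work on the file stream; this version is an assumption that operates on a String and an index instead, and the class name SkipBlanksSketch is hypothetical.

```java
class SkipBlanksSketch {

    /** Blanks, carriage returns, newlines and tabs count as white space. */
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\r' || ch == '\n' || ch == '\t';
    }

    /**
     * Returns the index of the first non-white-space character at or after pos,
     * or text.length() if only white space remains.
     */
    static int afterBlanks(String text, int pos) {
        while (pos < text.length() && isWhiteSpace(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = " \t\r\n<BODY>";
        int i = afterBlanks(text, 0);
        System.out.println("first non-blank index: " + i);
    }
}
```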
readTag
public boolean readTag()
- Reads the next tag.
If the next thing to be read is an info (text between tags), it is skipped.
Afterwards currChar contains the closing > of the tag.
Get the element name of the tag using getCurrElement().
Get the attributes using getCurrAttributes(), which returns an Enumeration of HTMLAttribute objects.
Attribute values keep their quotation marks, if there are any.
- Returns:
- true on success; false on error or end of file
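Since attribute values are returned with their quotation marks, callers usually have to strip them. The sketch below shows one way a tag body could be split into the element name and its attributes while keeping quoted values intact, plus a helper that removes the surrounding quotes. It is an illustration only: the class and method names (TagSplitSketch, split, value, stripQuotes) are hypothetical and not part of HTMLParser.

```java
import java.util.ArrayList;
import java.util.List;

class TagSplitSketch {

    /**
     * Splits the inside of a tag, e.g. IMG SRC="my pic.gif" ALT=logo,
     * into the element name followed by name=value tokens.
     * White space inside quotes does not end a token.
     */
    static List<String> split(String tagBody) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        int n = tagBody.length();
        while (i < n) {
            // Skip white space between tokens.
            while (i < n && Character.isWhitespace(tagBody.charAt(i))) i++;
            if (i >= n) break;
            int start = i;
            char quote = 0; // 0 means "not inside a quoted value"
            while (i < n) {
                char c = tagBody.charAt(i);
                if (quote != 0) {
                    if (c == quote) quote = 0;
                } else if (c == '"' || c == '\'') {
                    quote = c;
                } else if (Character.isWhitespace(c)) {
                    break;
                }
                i++;
            }
            parts.add(tagBody.substring(start, i));
        }
        return parts;
    }

    /** Returns the part after the first '=', quotes included, or null if there is none. */
    static String value(String attribute) {
        int eq = attribute.indexOf('=');
        return eq < 0 ? null : attribute.substring(eq + 1);
    }

    /** Removes one pair of matching surrounding quotes, if present. */
    static String stripQuotes(String v) {
        if (v != null && v.length() >= 2) {
            char first = v.charAt(0);
            if ((first == '"' || first == '\'') && v.charAt(v.length() - 1) == first) {
                return v.substring(1, v.length() - 1);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        for (String part : split("IMG SRC=\"my pic.gif\" ALT=logo")) {
            System.out.println(part);
        }
    }
}
```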
readElement
private String readElement()
readComment
private String readComment(String element)
readAttribute
private String readAttribute()
readValue
private String readValue()
readInfo
public boolean readInfo()
- Reads the next info (text between tags).
If the next thing to be read is a tag, it is skipped.
Afterwards currChar contains the < of the next tag.
Get the text that was read using getCurrInfo().
- Returns:
- true on success; false on error or end of file
comesTag
public boolean comesTag()
- Returns:
- true if the next thing to be read is a tag
comesInfo
public boolean comesInfo()
- Returns:
- true if the next thing to be read is an info (text between tags)
success
public boolean success()
- Returns:
- true if the last read succeeded (the same value that readTag() and readInfo() return)
getParsedCharsNum
public int getParsedCharsNum()
getCurrElement
public String getCurrElement()
- Returns:
- the Element-name (e.g: BODY) of the currently read tag
getCurrComment
public String getCurrComment()
- Returns:
- the comment that was read; a comment is available only when getCurrElement() returns null
getCurrAttributes
public Enumeration getCurrAttributes()
- Returns:
- an Enumeration of the attributes of the most recently read HTML tag
- See Also:
- HTMLAttribute
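Because getCurrAttributes() returns an untyped Enumeration, each element has to be cast before use. The following sketch shows that pattern with java.util.Vector and java.util.Enumeration. The nested Attr class is a hypothetical stand-in for HTMLAttribute, whose actual accessors are documented on its own page; its name and value fields are assumptions.

```java
import java.util.Enumeration;
import java.util.Vector;

class AttributeEnumSketch {

    /** Hypothetical stand-in for HTMLAttribute; field names are assumptions. */
    static class Attr {
        final String name;
        final String value;

        Attr(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Walks an untyped Enumeration and formats each attribute as name=value. */
    static String describe(Enumeration<?> attributes) {
        StringBuilder sb = new StringBuilder();
        while (attributes.hasMoreElements()) {
            // Elements arrive as Object and must be cast to the attribute type.
            Attr a = (Attr) attributes.nextElement();
            if (sb.length() > 0) sb.append(", ");
            sb.append(a.name).append('=').append(a.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mirrors the documented currAttributes field, which is a Vector.
        Vector<Object> currAttributes = new Vector<>();
        currAttributes.addElement(new Attr("SRC", "\"logo.gif\""));
        currAttributes.addElement(new Attr("WIDTH", "40"));
        System.out.println(describe(currAttributes.elements()));
    }
}
```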
getCurrInfo
public String getCurrInfo()
- Returns:
- The info/text between HTML-tags (=the "real" text of the page)