WebDataKit Help Documentation © LOTONtech Limited 2000-2002

GUI Assistant  SQL Syntax  WDBC API  Searching API  Custom HTTP Clients  HTTPS

GUI Assistant

The GUI Assistant can be run by by double-clicking the jwebdata.jar file or by typing at the command line java com.lotontech.jwebdata.gui.Assistant.

The Assistant window has four panes: a URL Submission Pane (top-left), a Submission Results Pane (bottom-left), a SQL Statement Pane (top-right) and a SQL Results Pane (bottom-right).

Into the URL Submission Pane you may enter a URL request and any parameters to be sent using get or post. You can also specify the resulting view to be as a tree or a list, and you submit the request by pressing the Submit button.

The Submission Results Pane (above) shows the structure of the HTML or XML elements comprising the source page, each having a unique key and being colored according to type. You can double-click these elements in order to transfer their identifying keys into the SQL Statement Pane (below). When you have completed the SQL statement by specifying wildcards * and ? along with the column aliases you can press the Run button to see the results in the SQL Results pane:

Tips

SQL Syntax

The WebDataKit supports a version of SQL that is designed for extracting HTML and XML data from live web sources. Here is an example SQL statement  showing most of the supported syntax, with keywords shown in bold, table aliases in green and column aliases in red:

SELECT prices.table[0].tr[*].td[0].text[0] AS prices.exchange,
prices.table[0].tr[*].td[2].table[0].tr[*].td[0].text[0] AS share,
prices.table[0].tr[*].td[2].table[0].tr[*].td[1].text[0] AS price,
locations.table[0].tr[*].td[1].text[0] AS location,
locations.table[0].tr[*].td[0].text[0] AS locations.exchange
FROM file:data/html/ShareData.html AS prices, file:data/html/ShareData.html AS locations
WHERE price > 100 AND price < 400 AND prices.exchange=locations.exchange
ORDER BY locations.exchange, price DESC

The result of that statement when run against the source data at ../data/html/ShareData.html is:

Taking each part in turn....

SELECT prices.table[0].tr[*].td[0].text[0] AS prices.exchange,

means "SELECT the 0th text element (text[0]) of the 0th table data element (td[0]) of any/each row (tr[*]) of the 0th table (table[0]) of the HTML page whose table alias is prices. Label the results in this column the alias prices.exchange".

FROM file:data/html/ShareData.html AS prices, file:data/html/ShareData.html AS locations

means "take data FROM web page file:data/html/ShareData.html and refer to this data with table alias prices. Also take data from another web page (but in this case the same one) and refer to this data with table alias locations."

WHERE price > 100 AND price <400 AND prices.exchange=locations.exchange

means "include rows WHERE the value in the price column is >100 and <400 and where the value in the prices.exchange column is the same as the value in the locations.exchange column." This latter WHERE clause joins the two tables using an inner join, which in this case means a self-join.

ORDER BY locations.exchange, price DESC

means "ORDER (sort) the results according to values in the locations.exchange column, within which order the results by descending values in the price column."

SELECT Specifications

SELECT specifications are written in terms of hierarchical HTML elements, which you can view using the WDBC Assistant. You can include wildcards so as to select multiple elements.

Wildcard '*'

Where wildcard *s are used, these provide the context by which hierarchical elements are arranged rows and columns. Given the SELECT Specifications...

prices.table[0].tr[*].td[0].text[0] AS prices.exchange,
prices.table[0].tr[*].td[2].table[0].tr[*].td[0].text[0] AS share,

these elements will be put into the same row (because the * values match)...

row 1
prices.table[0].tr[2].td[0].text[0] AS prices.exchange,
prices.table[0].tr[2].td[2].table[0].tr[1].td[0].text[0] AS share,

row 2
prices.table[0].tr[2].td[0].text[0] AS prices.exchange, (duplicated)
prices.table[0].tr[2].td[2].table[0].tr[2].td[0].text[0] AS share,

but these won't (because the * values don't match)...

prices.table[0].tr[2].td[0].text[0] AS prices.exchange,
prices.table[0].tr[3].td[2].table[0].tr[1].td[0].text[0] AS share,

Wildcard '?'

You can use a wildcard ? to vary some elements while fixing others, without the strict context matching. For example, this statement...

SELECT prices.table[0].tr[*].td[2].table[0].tr[?].td[0].text[0] AS share
,prices.table[0].tr[*].td[2].table[0].tr[0].td[1].text[0] AS price
FROM file:data/html/ShareData.html AS prices

will vary the share name row (.tr[?]) while fixing the share price at row 0 (.tr[0]), all within the same context for exchange (.tr[*] in both cases) to give:

GHI Financial|247|
DEF Technology|247|
ABC Industries|247|
JKL Publishing|66|
MNO Financial|66|
YZ Computers|444|
VWX Technology|444|
STU Bio|444|
PQR SuperTech|444|

META-DATA

You can select meta-data from a data source to find out about the structure and content of a page that is not known in advance:

SELECT ._key[*] AS key, ._content[*] AS content
FROM file:data/html/ShareData.html

This statement selects items that are not in a table within a table, and which are not HTML attributes:

SELECT ._key[*] AS key, ._content[*] AS content
FROM file:data/html/ShareData.html
WHERE NOT key MATCHES '*.table[*]*.table[*]'
AND NOT key MATCHES '*@*'

to give:

WITHGET and WITHPOST

You can simulate form submissions using GET and POST methods, thus making the full syntax for a FROM clause...

FROM http://www.asharesite.com WITHPOST symbol=BTA WITHPOST symbol=MKS WITHPOST symbol=TWT

The parameters that you specify in WITHGET or WITHPOST clauses correspond with the parameters that follow the question mark (?) in the URLs of sites that handle form submissions. For example, the statement above is the same as pointing your browser at these four URLs in turn:

http://www.asharesite.com?symbol=BTA
http://www.asharesite.com?symbol=MKS
http://www.asharesite.com?symbol=TWT

Use table aliases to distinguish the form submissions as separate tables and thus allow joins between the result sets. For example...

SELECT SGC.text[000] AS sgcPrice, MKS.text[000] AS mksPrice, TWT.text[000] AS twtPrice
FROM http://www.asharesite.com WITHPOST symbol=SGC AS SGC WITHPOST symbol=MKS AS MKS WITHPOST symbol=TWT AS TWT

WHERE Specifications

WHERE Specifications are always phrased in terms of column aliases. You can specify a column alias, an operator ( <, >, <=, = >, =, MATCHES ) and a literal value or another column alias to compare against. For example:

WHERE price > 100
or
WHERE name MATCHES '*Technology*'
or
WHERE price = yearLow

Literal values can be enclosed in single quotes (essential if they contain spaces or clash with column alias names), and you can combine WHERE clauses with AND / OR, but currently the order of evaluation is strictly sequential.

You can reverse the sense by using WHERE NOT, AND NOT or OR NOT like this: WHERE price > 100 AND NOT name MATCHES *Technology*

"UPPER" Function

You can use the UPPER function for case-insensitive matches, e.g. WHERE UPPER(name) MATCHES '*JAVA*'

WDBC / SQL API

(for Searching APIclickhere or scroll down)

SQL statements may be executed from within a Java application, applet or servlet via the Web DataBase Connectivity (WDBC) API. If you're familiar with JDBC you will be comfortable with WDBC.

Here is an example WDBC Client application that is provided in the examples directory:

package examples;

import com.lotontech.jwebdata.wdbc.*;

public class WDBCClient
{
  public static void main(String args[])
  {
    String sqlStatement
    ="SELECT .table[0].tr[0].td[1].table[0].tr[*].td[*].a[0].text[0] AS linkText"
    +", .table[0].tr[0].td[1].table[0].tr[*].td[*].a[0].@href[0] AS link"
    +" FROM http://www.btinternet.com/~lotontech/nav.html";

    try
    {
      ResultSet results=SqlEngine.executeQuery(sqlStatement);

      System.out.println("\nSQL STATEMENT:\n");
      System.out.println(sqlStatement);

      System.out.println("\nRESULTS:\n");

      while (results.next())
      {
        int columnCount=results.getColumnCount();
        for (int c=0; c<columnCount; c++)
        {
          System.out.println("COLUMN "+(c+1)+"="+results.getString(c+1));
        }
        System.out.println();
      }

    }
    catch (Exception ex)
    {
      ex.printStackTrace();
    }
  }
}

Results are returned in the form of a JDBC-like ResultSet having the following methods:

public boolean next() - move the cursor to the next row of the ResultSet.

public boolean previous() - move the cursor to the previous row of the ResultSet.

public boolean absolute(int row) - move the cursor to an absolute position of the ResultSet.

public int getRowCount() - find out how many rows the ResultSet has.

public int getColumnCount() - find out how many columns the ResultSet has.

public void beforeFirst() - move to the beginning of the ResultSet so that you can start a while (rs.next()) loop.

public void afterLast() - move to the end of the ResultSet so that you can start a while (rs.previous()) loop.

public java.lang.String getString(int col) - get the String value contained in column col at the current cursor position.

public java.lang.String getString(java.lang.String name) - get the String value contained in column name at the current cursor position.

public java.lang.String findColumnName(int col) - get the name of the column with number col.

Searching API

You can make use of the core Search Engine my implementing the SearchHandler interface. The example - also in the examples directory - searches news groups for the phrase "web mining":

package examples;

import com.lotontech.jwebdata.search.*;
import com.lotontech.jwebdata.parsing.*;

public class NewsFinder implements SearchHandler
{
  public String lookFor="";

  /**
   Start a search within the main method.
   */
  public static void main(String args[])
  {
    // Set up this class as the handler for discovered data.

    NewsFinder thisHandler=new NewsFinder();
    String startPoint="http://groups.google.com/groups?site=groups&q=web+mining";

    // Start the Search Engine.

    SearchEngine thisEngine=new SearchEngine();
    thisEngine.startSearch(startPoint,thisHandler);
    while (thisEngine.searchInProgress());
  }

  /**
   Customize the search behavior and the action to take on each hit.
   */
  public boolean handleElement(String url, String key, String content)
  {
    boolean goodHit=false;

    int foundPos=content.toLowerCase().indexOf("web mining");
    if ( (foundPos>=0) && !content.startsWith("http://"))
    {
      int sampleStart=0;
      int sampleEnd=content.length();
      if ((foundPos-30)>0) sampleStart=foundPos-30;
      if ((foundPos+30)<content.length()) sampleEnd=foundPos+30;

      System.out.println("* FOUND ["+content.substring(sampleStart,sampleEnd)+"]");
      System.out.println("* AT ["+url+"]");
      System.out.println();

      // The engine will only follow outgoing links from this page if you
      // tell it that there was at least one hit here.
      goodHit=true;
    }

    return goodHit;
  }
}

Custom HTTP Clients

You may use your own custom HTTP client class to fetch data for parsing and querying by our engine. An example would be your own class to fetch secure data via HTTPS / SSL.

HttpClient Interface

Your custom HTTP client must implement interface com.lotontech.jwebdata.parsing.HttpClient, which has the following declaration:

package com.lotontech.jwebdata.parsing;

public interface HttpClient
{
    public HttpChannel getHttpChannel(String baseURL, String params, String method) throws Exception;
}

Whenever data is required by the SqlEngine, your implementation of the getHttpChannel(...) method will be invoked with these parameters:

baseURL: The URL of the data, beginning http:// or file://

params: An optional list of parameter values to be submitted by your implementation  using GET or POST, in the form company=ABC-Industries&amount=100 (note there will be no leading question mark)

method: The method by which parameter data should be submitted, which will be WITHGET or WITHPOST.

HttpChannel Class

Upon establishing a HTTP connection, your implementation must return an instance of com.lotontech.jwebdata.parsing.HttpChannel containing the content type (e.g. text/html) and an InputStream obtained from the connection. Like this...

return new com.lotontech.jwebdata.parsing.HttpChannel(contentType, inputStream);

Using the Custom HTTP Client

You can invoke the SqlEngine with an instance of your custom HTTP client class like this:

com.lotontech.jwebdata.parsing.HttpClient yourHttpClient=new YourHttpClient(yourParameters);
ResultSet results=SqlEngine.executeQuery(sqlStatement, yourHttpClient);

You can also run the GUI Assistant with your HTTP client class like this:

java com.lotontech.jwebdata.gui.Assistant yourpackage.YourHttpClientClass

In this case the class that you supply will be instantiated via a parameterless constructor, so any configuration information must be hard-coded within your implementation of the parameterless constructor. We suggest that you create a special version of your HTTP client class for use with the GUI Assistant at design time.

HTTPS

The J2SE SDK v1.4 allows you to access secure sites (those beginning https://) by specifying a Java property on the command line. For example, to run the GUI Assistant with HTTPS support you would type:

java -Djava.protocol.hanler.pkgs=com.sun.net.ssl.internal.www.protocol com.lotontech.jwebdata.gui.Assistant

You can also enable HTTPS support from within your programs like this:

String PROTOCOL_HANDLERS = "com.sun.net.ssl.internal.www.protocol";

System.setProperty("java.protocol.handler.pkgs", PROTOCOL_HANDLERS);