I know that most people shudder when they hear the words ‘Screen Scraping’ combined with the word ‘Database’. In MOST screen scraping environments, performance is the ULTIMATE driver of all decisions. I don’t see that to be true.

If you know how to use SQL Server, take a look at these stored procedures. I’m going to link the actual code rather than just pointing to sourcforge, because I don’t LIKE going to SourceForge, and I frequently have a bad experience there.

About SourceForge, even though they CLAIM that their downloads are only delayed by FIVE seconds, I frequently have problems when it takes more like 90 seconds for the download to start. When a website makes one promise, and then their results are 20 times worse, I don’t want to RELY on seeing them in the market indefinitely. Further questions:

Why aren’t downloads available on a Mobile Device? More importantly, I HATE the ‘wait 5 seconds before download thing’. I don’t think that webpages are here to HARRASS you, I think that providing a simple download link is MUCH safer, and more reliable. In MOST situations, I disable JavaScript. I don’t think that most pages need it, and I’ll selectively enable domains and apps that I trust.

OVERVIEW

SQLDOM is an easy and robust way to parse HTML directly into SQL tables, manipulate DOM nodes in a JQuery-like manner, and to render HTML from the SQL-based DOM.

SQLDOM is written entirely in native T-SQL, and uses only temporary database objects (tempdb). No changes to user databases are required.

Features

  • Screen scraping HTML pages into structured data
  • Generation of HTML in SQL from templates (i.e. merge templates with data)
  • HTML checker / LINT-like analysis
  • Easy node-based search-and-replace
  • Other HTML analysis in support of optimization

FILES

SQLDOM_core_persist_927.sql – 2013-03-21 91.6 kB

SQLDOM_core_927.sql – 2013-03-21 83.8 kB

DETAILS FROM BOTTOM OF THE FILES TAB

SQLDOM HTML parser and DOM tools for MSSQL.
https://sourceforge.net/projects/sqldom/

Parses HTML from a string or from a URL into a DOM (document object model) implemented with SQL tables. Provides routines to manipulate the DOM data and to render the DOM data back to HTML.

You may safely run this entire script: it does not make any changes to any SQL user databases. It only creates some local temporary tables and temporary stored procedures, and prints out a string with some instructions.

Requires Microsoft SQL 2005 or later.

Copyright (C) 2012 David B. Rueter (drueter@assyst.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Source: README.txt, updated 2012-02-19

WIKI

SQLDOM is an easy and robust way to parse HTML directly into SQL tables, manipulate DOM nodes in a JQuery-like manner, and to render HTML from the SQL-based DOM.

SQLDOM is useful in three main categories:

  1. Consumption of HTML (turn HTML into structured data)
  2. Creation and modification of HTML
  3. Analysis of HTML

EXAMPLES

--Example 1: Simple parse of string

EXEC #spactDOMLoad
  @HTML = 'Hello World.

SQLDOM ROCKS!

' EXEC #spgetDOM ~~~~~ ------- **--Example 2:** *Render HTML from DOM (that we parsed in Example 1 above)* :::sql EXEC #spgetDOMHTML @PrettyWhitespace=1, @PrintHTML = 1 ------- **--Example 3:** *Parse and re-render from a URL* :::sql DECLARE @HTML varchar(MAX) EXEC #sputilGetHTTP @URL = 'http://www.google.com', @ResponseText = @HTML OUTPUT, @SuppressResultset = 1 EXEC #spactDOMLoad @HTML=@HTML EXEC #spgetDOM EXEC #spgetDOMHTML @PrettyWhitespace=1, @PrintHTML = 1 ------- **--Example 4:** *Parse from a string, modify the DOM, render resulting HTML* :::sql EXEC #spactDOMLoad @HTML = 'Hello World. Future content goes here ' EXEC #spactDOMLoad @HTML = ' Here is some neat stuff about SQLDOM ', @Selector = '.myContent' EXEC #spgetDOM EXEC #spgetDOMHTML @PrettyWhitespace=1, @PrintHTML = 1

JUSTIFICATION

SQLDOM is useful in three main categories:

  1. Creation of HTML
  2. Consumption of HTML
  3. Analysis of HTML

Sometimes a data-centric, server-side approach to HTML creation, manipulation and consumption is appropriate.

Poplular Javascript libraries like JQuery and YUI suport “set-based” manipulation of HTML via selectors. This is done primarily in a client-side browser.

Of late, server-side javascript (SSJS) has become popular with projects like Node.js, Ringo and others. See: http://zope.cetis.ac.uk/members/scott/blogview?entry=20110124133546

While SSJS is interesting, not all implementations include a DOM. Furthermore, each implementation imposes limitations as to web server, languages, and other platform-specific considerations.

SQLDOM was originally born out of a desire to perform DOM-based creation of HTML directly within Microsoft SQL Server, with no external dependencies. While SQL Server does include a good XML parser, it is not terribly useful for HTML.

Imagine having HTML templates–either in a SQL table or passed in–and being able to directly merge those templates with live SQL data without relying on a web server scripting environment like PHP or ASP.NET. Such a capability would allow for generation of clean standards-based HTML, and would support best-practices in terms of separation of layout from content. But, such a capability would not require an added scripting layer such as PHP or ASP.NET.

This would reduce the number of moving parts in a web application, and would potentially increase security and reliability while simplifying support and potentially increasing both performance and flexibility.

USE CASES

Possible uses of SQLDOM include:

  1. Screen scraping HTML pages into structured data
  2. Generation of HTML from templates in SQL (i.e. merge templates with data)
  3. HTML checker / LINT-like analysis
  4. Easy node-based search-and-replace of HTML content
  5. HTML analysis in support of optimization and the like
  6. Structured searchable archiving of HTML

IMPLEMENTATION

SQLDOM was created for Microsoft SQL Server (version 2005 or later).

It does not make use of CLR. It is written entirely in T-SQL.

SQLDOM is implemented entirly with temporary objects. No permanent SQL objects or modifications to user databases are required. (i.e. SQLDOM is set up to run entirely within tempdb.) If desired, users could modify or incorporate SQLDOM into other databases, but there are good reasons to let SQLDOM exist in tempdb.

Performance is good, though not as fast as a parser written in C++. For example, parsing the Google home page HTML takes roughly 150ms on modest hardware. Rendering HTML is slightly faster–roughly 120ms to render the HTML of the Google home page from the DOM.

When considering performance, remember that while an external parser may be somewhat faster at the raw work of parsing and rendering HTML, real-world implementations of other solutions (ASP.NET, PHP) introduce other performance-limiting factors and introduce additional complexity. In other words, it is possible that performance of a SQLDOM-based solution will outperform other solutions because of performance gains from the streamlining of the architecture.

SQLDOM implements 4 tables to store the DOM:

#tblDOM

Has a row for each node (HTML tag). Text nodes are stored in their own row (because they are technically a separate node).

#tblDOMAttribs

Has a row for each attribute name/value pair of a node

#tblDOMStyles

Has a row for each CSS style of a node. These are concatenated together to make up the style attribute of the node.

#tblDOMDocs

If multiple HTML documents are needed simutaneously, each document has a row in this table to allow DOM elements from multiple documents to be stored in #tblDOM

Possible uses of SQLDOM include:

  1. Screen scraping HTML pages into structured data
  2. Generation of HTML from templates in SQL (i.e. merge templates with data)
  3. HTML checker / LINT-like analysis
  4. Easy node-based search-and-replace of HTML content
  5. HTML analysis in support of optimization and the like
  6. Structured searchable archiving of HTML

FORUM

There is a limited forum dedicated to these stored procedures, and it gives some details on options and capabilities. Man, I hate waiting 20 minutes for SourceForge to load today. I literally think that these discussion pages took 20 minutes to load and it’s SUCH a simple page. It blows my mind that SourceForge is still around.

Running SQLDom on the server

https://sourceforge.net/p/sqldom/discussion/general/thread/56f1383f/

Russell Shilling2014-12-31

If you wish to run SQLDOM as a job on the server, there is different behavior than running on SSMS that must be compensated for. The default setting will truncate the results from the HTTP GET (and POST, I presume) at 512 characters.

To compensate, add the following statement at the beginning of your batch:

SET TEXTSIZE 2147483647;

  • Last edit: Russell Shilling 2014-12-31
  • William Mercer Morris William Mercer Morris2017-04-08 Russel; I am having this issue where only 512 bytes are being returned when I run the SQLDOM executing a Stored Procedure on SQL Agent. I have added the set TEXTSIZE before and I still am getting 512 bytes? It does not matter which website I retrieve the data from. SET TEXTSIZE 10000 DECLARE @HTML varchar(MAX) EXEC master.[sdom].[sputilGetHTTP]
    –@URL = ‘http://quotes.freerealtime.com/dl/frt/M?IM=stats&exch=O&stat=3’,
    –@URL = ‘http://www.finviz.com/’,
    @URL = ‘http://www.yahoo.com/’,
    @ResponseText = @HTML OUTPUT,
    @SuppressResultset = 0 <html id=”atomic” lang=”en-US” class=”atomic my3columns l-out Pos-r https fp fp-v2 rc1 fp-default mini-uh-on viewer-right two-col ntk-wide ltr desktop Desktop bktCG004,fpdmcntrl,FP131,SR014″> <head> <meta http-equiv=”x-dns-prefetch-control” content=”on”><link rel=”dns-prefetch” href=”//s.yimg.com”><link rel=”preconnect” href=”//s.yimg.com”><link rel=”dns-prefetch” href=”//search.yahoo.com”><link rel=”preconnect” href=”//search.yahoo.com”> 
  • William Mercer Morris William Mercer Morris2017-04-08 The Bytes returned are always 512……Do you have any idea what I am doing wrong? When I execute the SP from MMS it returns the correct results, SQL Agent is it always 512. I am running SQL 2016

Retrieving Hidden Input Values

Tim Deagan2017-04-06

I need to run a job that finds a hidden input’s default value and stores it in a database. SQLDOM is very close to what I need, but the returned DOM in a table strips out the hidden attribute and the default value from inputs. I’m slowly workign my way through spactDOMLoad, but I’m getting lost. Any suggestions on mods I could make to include the default value?

Thanks,
–Tim

  • David Rueter David Rueter2017-04-07 Hi, Tim. Good to hear from you. The HTML attributes should be parsed out and stored in #tblDOMAttribs This should let you JOIN the tag to the attribute, something like this: SELECT d.Tag, d.ID, d.Name AS TagName, d.TextData, a.Name AS AttribName, a.Value AS AttribValue FROM #tblDOMHierarchy d –or you can use #tblDOM LEFT JOIN #tblDOMAttribs a ON d.DEID = a.DEID In other words, the default value should be in an Attribute named
    “placeholder”, and thus #tbllDOMAttribs.Value should contain the actual
    default value itself. Let me know if you have further questions. FYI, I’m in the process of moving SQLDOM to GitHub.com, and as I do so I
    will be providing a few updates. (I’ve kind of neglected the SourceForge
    repository.) Sincerely, David Rueter drueter@assyst.com From: Tim Deagan [mailto:tdeagan@users.sf.net]
    Sent: Thursday, April 06, 2017 11:05 AM
    To: [sqldom:discussion] general@discussion.sqldom.p.re.sf.net
    Subject: [sqldom:discussion] Retrieving Hidden Input Values I need to run a job that finds a hidden input’s default value and stores it
    in a database. SQLDOM is very close to what I need, but the returned DOM in
    a table strips out the hidden attribute and the default value from inputs.
    I’m slowly workign my way through spactDOMLoad, but I’m getting lost. Any
    suggestions on mods I could make to include the default value? Thanks,
    –Tim

SQLDOM Adding Spaces To Markup

Gavin Paolucci-Kleinow
Gavin Paolucci-Kleinow - 2012-09-13

I'm using SQLDOM in a script, and am getting odd results. Two spaces are added before and after every tag. Is this a known issue with SQLDOM?
 

    David Rueter
    David Rueter - 2012-09-13

    Hi Gavin. Which routine are you calling when you see this behavior?

    If you mean when you do something like this: EXEC #spgetDOMHTML @PrettyWhitespace=1, @PrintHTML = 1, try setting @PrettyWhitespace = 0.

    If that isn't what you mean / that doesn't solve your problem, provide me with some more details. I'm happy to help, and will be happy to fix a problem in the code if one exists.

Here is the source:

https://sourceforge.net/projects/sqldom/

%d bloggers like this: