Html Provider + Canopy. · Colin Bull

So recently this tweet came across my timeline, and indeed the article is definitely worth a read. However, I have recently been using both canopy and the HTML Provider together to extract auction price data from http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table and thought it might be worth sharing some of the code I have been using. The problem with using just the HTML Provider to scrape this page is that the table is only generated once the JavaScript on the page has executed, and the HTML Provider doesn't execute JavaScript. Maybe this is something worth adding?

However, using canopy with phantomjs we can get the JavaScript to execute, so that the table is generated in the resulting HTML and is therefore available to the HTML Provider. So how do we do this? Well, first of all we need to find out which elements we need, and then write a function that uses canopy to execute the page:

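// Assumed imports: the listing relies on canopy and on Selenium's SelectElement
open System
open canopy
open canopy.configuration
open OpenQA.Selenium.Support.UI

let getN2EXPage phantomJsDir targetUrl units withSource = 
    // Point canopy at the phantomjs binaries and load the page
    phantomJSDir <- phantomJsDir
    start phantomJS
    url targetUrl
    // Wait for the javascript-generated table to appear
    waitForElement "#datatable"

    if not(String.IsNullOrWhiteSpace(units))
    then 
        // Switch currency, then poll until the unit label reflects the change
        let currencySelector = new SelectElement(element "#data-currency-select")
        currencySelector.SelectByText(units)
        let unitDisplay = (element "div .dashboard-table-unit")
        printfn "%A" unitDisplay.Text
        while not(unitDisplay.Text.Contains(units)) do
            printfn "%A" unitDisplay.Text
            sleep 0.5
        printfn "%A" unitDisplay.Text
    // Hand the rendered HTML to the caller before closing the browser
    let source = withSource browser.PageSource
    quit()
    source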
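
One thing to be aware of with the polling loop above is that it will spin forever if the unit label never changes. A minimal sketch of a bounded wait in plain F# might look like the following. Note that `waitUntil` is a hypothetical helper of my own naming, not part of canopy, and the `labelShowsUnits` function is just a stand-in for checking the page.

```fsharp
open System
open System.Diagnostics
open System.Threading

/// Hypothetical helper (not part of canopy): polls `condition` until it
/// returns true or `timeout` elapses. Returns true if the condition was met.
let waitUntil (timeout: TimeSpan) (pollInterval: TimeSpan) (condition: unit -> bool) =
    let clock = Stopwatch.StartNew()
    let rec loop () =
        if condition () then true
        elif clock.Elapsed >= timeout then false
        else
            Thread.Sleep pollInterval
            loop ()
    loop ()

// Stand-in for the page state: the "unit label" reports the new units
// on the third poll.
let polls = ref 0
let labelShowsUnits () =
    polls.Value <- polls.Value + 1
    polls.Value >= 3

let updated =
    waitUntil (TimeSpan.FromSeconds 2.0) (TimeSpan.FromMilliseconds 10.0) labelShowsUnits
```

Inside getN2EXPage the condition would presumably be something like `fun () -> unitDisplay.Text.Contains(units)`, with the caller deciding what to do when the result is false.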

With this function we can now do a couple of things. First, we can create a snapshot of the page and dump it to a file.

let toolPath = 
    Path.GetFullPath(__SOURCE_DIRECTORY__ + "/Tools/phantomjs/bin")

let writePage path content = 
    if File.Exists(path) then File.Delete path
    File.WriteAllText(path, content)

getN2EXPage toolPath "http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table" "GBP" (writePage "code/data/n2ex_auction_prices.html")

Once we have executed the above function we have a template file that we can use in the type provider to generate our type space.

type N2EX = HtmlProvider<"data/n2ex_auction_prices.html">

let getAuctionPriceData() = 
    let page = getN2EXPage toolPath "http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table" "GBP" (fun data -> N2EX.Parse(data))
    page.Tables.Datatable.Rows

At this point we can use the HTML Provider as we normally would.

let data = 
    getAuctionPriceData() 
    |> Seq.map (fun x -> x.``UK time``, x.``30-11-2016``)

Finally, I think it is worth noting what happens when the headers on the page change, which they will, because the table is a rolling 9 day window. At runtime this code will carry on working as expected, because the code behind it will still be accessing the 1st and 3rd columns in the table, even though the headers have changed. At compile time, however, the code will fail :( because the headers, and therefore the property names on the generated types, have changed. All is not lost when this occurs, since the underlying type is erased to a tuple. So we could just do the following:

let dataAsTuple = 
    getAuctionPriceData() 
    |> Seq.map (fun x -> 
        let (ukTime, _, firstData,_,_,_,_,_,_,_) = x |> box |> unbox<string * string * string * string * string * string * string * string * string * string>
        ukTime, firstData
    )

A little verbose but, hey, it's another option...
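
If the ten element pattern is a bit much, another possibility is to read the tuple fields by position using `FSharpValue.GetTupleFields` from FSharp.Core. The following is only a sketch: `pickColumns` is a hypothetical helper, and the row is a made-up boxed tuple of strings standing in for an erased row, so it runs without the type provider.

```fsharp
open Microsoft.FSharp.Reflection

// Made-up stand-in for one erased row: UK time followed by nine daily prices.
let row : obj =
    box ("00 - 01", "40.10", "38.52", "41.00", "39.75",
         "42.30", "37.90", "40.05", "41.20", "38.00")

/// Hypothetical helper: reads the tuple fields by zero-based position,
/// so it does not depend on the header names at all.
let pickColumns (indices: int list) (row: obj) =
    let fields = FSharpValue.GetTupleFields row
    indices |> List.map (fun i -> string fields.[i])

// The 1st and 3rd columns, as in dataAsTuple above.
let picked = pickColumns [0; 2] row
// picked = ["00 - 01"; "38.52"]
```

This trades the compile time check on the tuple shape for a runtime lookup, so an out of range index will only show up when the code runs.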
