Screen-scraping tools have been around for some time. They enable an app to download HTML source and extract specific entities from the page. Given that an HTML table is structured data, it shouldn't be hard to extract one programmatically. It could be done with LINQ queries over the page text. This post uses the HtmlAgilityPack's HtmlDocument in a Universal Windows app to extract an HTML table directly into a list of a matching class.

 

A Web Page with a Table

<html>
....
....
<body>
....
<table Id="clientTable">
    <tr><th>Id</th><th>Name</th><th>Address</th><th>Category</th></tr>
     <tr><td>"1"</td>
         <td>Joe Blogs</td>
         <td>123 Main St</td>
         <td>St Kilda</td>
     </tr>
     <tr><td>"2"</td>
         <td>Sue Ficher</td>
         <td>23 Rain Ave</td>
         <td>Moone Ponds</td>
     </tr>
     <tr><td>"3"</td>
         <td>Fred Nurk</td>
         <td>999 Alexander Rd</td>
         <td>Brunswick</td>
     </tr>
</table>
....
</body>
</html>

 

The UWP App Project

Create a new Universal Windows App project (C#).
Add the HtmlAgilityPack package to the project through NuGet.

Add a using directive for it:

using HtmlAgilityPack;

 

A Class and a List thereof to capture the table data.

        public class Client
        {
            public string ID { get; set; }
            public string Name { get; set; }
            public string Address { get; set; }
            public string Suburb { get; set; }
        }

        // Initialised to an empty list so that Clients.Clear() does not throw
        public List<Client> Clients { get; set; } = new List<Client>();

 

 

The Screen Scraping Code

        // Also requires: using System.Collections.Generic; using System.Linq;
        // Event handlers must return void, hence async void rather than async Task
        private async void button_Click(object sender, RoutedEventArgs e)
        {
            HtmlWeb web = new HtmlWeb();
            string url = "http://clients.com.au/clients.html";
            HtmlDocument htmlDoc = await web.LoadFromWebAsync(url);

            // Get the table; the id must match the one used in the HTML
            HtmlNode table = htmlDoc.GetElementbyId("clientTable");
            if (table == null)
                return; // No element with that id on the page
            List<HtmlNode> tbl = table.Elements("tr").ToList();

            Clients.Clear();

            // Iterate through the rows in the table, ignoring the header row
            foreach (HtmlNode node in tbl.Skip(1))
            {
                // Get the column data in the row, stripped of line breaks and padding
                string[] vals = node.Elements("td")
                    .Select(d => d.InnerText.Replace("\r\n", "").Trim())
                    .ToArray();

                if (vals.Length > 3)
                {
                    // Instantiate an object for the row
                    Clients.Add(new Client()
                    {
                        ID = vals[0],
                        Name = vals[1],
                        Address = vals[2],
                        Suburb = vals[3]
                    });
                }
            }
        }

 

Getting the table

Note that htmlDoc.GetElementbyId() is used to get the table. Outside Universal Windows there are other GetElement options. A ByTagName lookup, where available, could be used to get the <table> and its inner text. But only the by-id lookup is available with UWP, so the table must have an id (clientTable) that matches in both the HTML and the C# code.
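
Where adding an id to the page is not an option, a tag-name lookup can perhaps be approximated with LINQ over the parsed nodes. The following is only a sketch: it assumes that HtmlAgilityPack's Descendants method is present in the UWP build of the package (not verified here), and TableFinder and FirstTableRows are hypothetical names that are not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableFinder
{
    // Hypothetical helper: returns the <tr> rows of the first <table> in
    // the markup, or an empty list when the markup contains no table.
    public static List<HtmlNode> FirstTableRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: Descendants(name) is available on this platform.
        HtmlNode table = doc.DocumentNode.Descendants("table").FirstOrDefault();

        return table == null
            ? new List<HtmlNode>()
            : table.Elements("tr").ToList();
    }
}
```

The returned rows can be processed in the same way as the rows obtained by id in the click handler. Taking the first table is fragile on pages with more than one table, which is why a matching id remains the safer choice.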

 

Further

The code could be extended to make use of the header to align the class properties with the column names and could even use Reflection to generate a generic class using those column headings.
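
As a rough illustration of the first idea, the sketch below matches header text to property names with Reflection and fills an existing class, rather than generating a new one. RowMapper and MapRow are hypothetical names, the matching is case-insensitive by assumption, and only string properties are handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public class Client
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
    public string Suburb { get; set; }
}

public static class RowMapper
{
    // Hypothetical helper: builds a T from one table row by matching each
    // header to a public property of the same name, ignoring case.
    public static T MapRow<T>(IList<string> headers, IList<string> cells) where T : new()
    {
        T item = new T();
        int count = Math.Min(headers.Count, cells.Count);

        for (int i = 0; i < count; i++)
        {
            string header = headers[i].Trim();
            PropertyInfo prop = typeof(T).GetRuntimeProperties()
                .FirstOrDefault(p => string.Equals(p.Name, header, StringComparison.OrdinalIgnoreCase));

            // Columns with no matching writable string property are skipped
            if (prop != null && prop.CanWrite && prop.PropertyType == typeof(string))
                prop.SetValue(item, cells[i].Trim());
        }
        return item;
    }
}
```

A call might look like this:

```csharp
var headers = new[] { "Id", "Name", "Address", "Suburb" };
var cells = new[] { "1", "Joe Blogs", "123 Main St", "St Kilda" };
Client client = RowMapper.MapRow<Client>(headers, cells);
```

Note that the last header in the sample page reads Category, not Suburb, so with this approach it would have to be renamed in the HTML or mapped explicitly before the Suburb property would be filled.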