Screen-scraping tools have been around for some time. They let an app download a page’s HTML source and extract specific entities from it. Given that an HTML table is structured data, it shouldn’t be hard to extract one programmatically. It could be done with LINQ queries over the page text. This post instead uses HtmlAgilityPack’s HtmlDocument in a Universal Windows app to extract an HTML table directly into a list of objects of a matching class. The example page contains a table like this:
    <html>
    ....
    ....
    <body>
    ....
    <table id="clientTable">
      <tr><th>Id</th><th>Name</th><th>Address</th><th>Suburb</th></tr>
      <tr><td>"1"</td><td>Joe Blogs</td><td>123 Main St</td><td>St Kilda</td></tr>
      <tr><td>"2"</td><td>Sue Ficher</td><td>23 Rain Ave</td><td>Moone Ponds</td></tr>
      <tr><td>"3"</td><td>Fred Nurk</td><td>999 Alexander Rd</td><td>Brunswick</td></tr>
    </table>
    ....
    </body>
    </html>
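Create a new Universal Windows App project (C#). You need to add the HtmlAgilityPack package to the project through NuGet.
Add a using directive for it, along with the other namespaces the code below relies on:
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;
    using Windows.UI.Xaml;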
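Define a class whose properties match the table columns, and a list to hold one object per row:
    public class Client
    {
        public string ID { get; set; }
        public string Name { get; set; }
        public string Address { get; set; }
        public string Suburb { get; set; }
    }

    // Initialised to an empty list so it can be used before the first download
    public List<Client> Clients { get; set; } = new List<Client>();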
Then download the page and populate the list in a button click handler:
    private async void button_Click(object sender, RoutedEventArgs e)
    {
        HtmlWeb web = new HtmlWeb();
        string url = "http://clients.com.au/clients.html";
        HtmlDocument htmlDoc = await web.LoadFromWebAsync(url);

        // Get the table by its id; stop if the page has no such element
        HtmlNode table = htmlDoc.GetElementbyId("clientTable");
        if (table == null) return;
        List<HtmlNode> tbl = table.Elements("tr").ToList();

        // Start from an empty list on every click
        Clients = new List<Client>();

        bool isHeader = true; // Ignore the header row

        // Iterate through rows in the table
        foreach (HtmlNode node in tbl)
        {
            if (isHeader)
            {
                isHeader = false;
                continue;
            }

            // Get column data in row, stripping line breaks and surrounding whitespace
            string[] vals = node.Elements("td")
                .Select(d => d.InnerText.Replace("\r\n", "").Trim())
                .ToArray();

            if (vals.Length > 3)
            {
                // Instantiate object for the row
                Clients.Add(new Client
                {
                    ID = vals[0],
                    Name = vals[1],
                    Address = vals[2],
                    Suburb = vals[3]
                });
            }
        }
    }
Note that htmlDoc.GetElementbyId() is used to get the table. Outside Universal Windows there are other GetElement options: a by-tag-name lookup, where available, could fetch the <table> and its inner text directly. With UWP only the by-id lookup is available, so the table must carry an id (clientTable here) that matches in both the HTML and the C# code.
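If the table has no id, one possible workaround is to walk the parsed node tree rather than rely on a GetElement method. The sketch below assumes that HtmlNode.Descendants(name) is exposed by the HtmlAgilityPack build in use. TableFinder and FindFirstTable are hypothetical names for illustration, not part of the library.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class TableFinder
{
    // Hypothetical helper: returns the first <table> node found in the markup,
    // or null when the markup contains no table at all.
    public static HtmlNode FindFirstTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants walks the whole tree below the document root,
        // so the table does not need an id attribute.
        return doc.DocumentNode.Descendants("table").FirstOrDefault();
    }
}
```

The returned node can be handed to the same row loop as before by calling Elements("tr") on it. Matching by tag name picks the first table on the page, so it only suits pages where that is the table you want.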
The code could be extended to use the header row to align the class properties with the column names, and could even use reflection to build a generic class from those column headings.
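As a minimal sketch of the first idea, header text can be matched to property names through reflection, so the column order no longer has to be hard-coded. RowMapper and Map are hypothetical names, and the sketch assumes a target where Type.GetProperty with BindingFlags is available (for example .NET Standard 2.0). It works on plain strings, so the header and cell text are assumed to have been pulled out of the th and td nodes already.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Mirrors the Client class used earlier in the post.
public class Client
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
    public string Suburb { get; set; }
}

public static class RowMapper
{
    // Builds one T per data row, matching header text to public
    // string properties of T by name (case-insensitive).
    public static List<T> Map<T>(IList<string> headers, IEnumerable<IList<string>> rows)
        where T : new()
    {
        // Resolve each column to a property once; null means "no match, skip this column".
        var props = new PropertyInfo[headers.Count];
        for (int i = 0; i < headers.Count; i++)
        {
            PropertyInfo p = typeof(T).GetProperty(
                headers[i].Trim(),
                BindingFlags.Public | BindingFlags.Instance | BindingFlags.IgnoreCase);
            props[i] = (p != null && p.CanWrite && p.PropertyType == typeof(string)) ? p : null;
        }

        var result = new List<T>();
        foreach (IList<string> row in rows)
        {
            var item = new T();
            // Short rows are tolerated: only the cells that exist are assigned.
            for (int i = 0; i < props.Length && i < row.Count; i++)
            {
                if (props[i] != null) props[i].SetValue(item, row[i]);
            }
            result.Add(item);
        }
        return result;
    }
}
```

A call might look like RowMapper.Map<Client>(headers, rows), where headers holds the th text and each row holds the td text. Columns whose heading matches no property are skipped rather than treated as errors, which keeps the mapper usable when a page adds extra columns.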