Universal Windows 10: Screen Scraping an HTML Table into a List

Screen scraping tools have been around for some time. They enable an app to download HTML source and extract specific entities from the page. Given that an HTML table is structured data, it shouldn’t be hard to extract one programmatically. It could be done with LINQ queries against the page text. This post instead uses the HtmlAgilityPack’s HtmlDocument in a Universal Windows app to extract an HTML table directly into a list of a matching class.

A Web Page with a Table

<html>
....
....
<body>
....
<table Id="clientTable">
    <tr><th>Id</th><th>Name</th><th>Address</th><th>Category</th></tr>
     <tr><td>"1"</td>
         <td>Joe Blogs</td>
         <td>123 Main St</td>
         <td>St Kilda</td>
     </tr>
     <tr><td>"2"</td>
         <td>Sue Ficher</td>
         <td>23 Rain Ave</td>
         <td>Moone Ponds</td>
     </tr>
     <tr><td>"3"</td>
         <td>Fred Nurk</td>
         <td>999 Alexander Rd</td>
         <td>Brunswick</td>
     </tr>
</table>
....
</body>
</html>

The UW App Project

Create a new Universal Windows App project (C#), then add the HtmlAgilityPack package to the project through NuGet.

Add a using directive for it:

using HtmlAgilityPack;

A Class and a List thereof to capture the table data.

        public class Client
        {
            public string ID { get; set; }
            public string Name { get; set; }
            public string Address { get; set; }
            public string Suburb { get; set; }
        }

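        // Initialised to an empty list so that Clients.Clear() and Clients.Add() never hit a null reference.
        public List<Client> Clients { get; set; } = new List<Client>();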

The Screen Scraping Code

        // Needs System.Collections.Generic, System.Linq and Windows.UI.Xaml as well as HtmlAgilityPack.
        // A XAML Click handler must return void, hence async void rather than async Task.
        private async void button_Click(object sender, RoutedEventArgs e)
        {
            HtmlWeb web = new HtmlWeb();
            string url = "http://clients.com.au/clients.html";
            HtmlDocument htmlDoc = await web.LoadFromWebAsync(url);
            //Get the rows of the table
            List<HtmlNode> tbl = htmlDoc.GetElementbyId("clientTable").Elements("tr").ToList();

            //Start from an empty list
            if (Clients == null)
                Clients = new List<Client>();
            Clients.Clear();

            bool isHeader = true; //Ignore the header
            //Iterate through rows in the table
            foreach (HtmlNode node in tbl)
            {
                if (isHeader)
                {
                    isHeader = false;
                    continue;
                }
                //Get column data in row
                List<HtmlNode> s = node.Elements("td").ToList();
                var vals0 = from d in s select d.InnerText;
                //Strip line breaks and surrounding whitespace from each cell
                var vals1 = from v in vals0 select v.Replace("\r\n", "");
                var vals2 = from z in vals1 select z.Trim();
                string[] vals = vals2.ToArray();
                //Only rows with all four columns are used
                if (vals.Length > 3)
                {
                    string id = vals[0];
                    string name = vals[1];
                    string address = vals[2];
                    string suburb = vals[3];
                    //Instantiate object for the row.
                    Clients.Add(new Client()
                    {
                        ID = id,
                        Name = name,
                        Address = address,
                        Suburb = suburb,
                    });
                }
            }
        }
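
The row-parsing logic above can be tried without a web server by loading the HTML from a string. What follows is only a minimal sketch, assuming HtmlAgilityPack's HtmlDocument.LoadHtml is used in place of the download; the TableParser class and ParseRows method are hypothetical names, not part of the original app.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableParser
{
    // Hypothetical helper: returns the trimmed cell text of every data row
    // in the table with the given id, skipping the header row.
    public static List<string[]> ParseRows(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of downloading

        HtmlNode table = doc.GetElementbyId(tableId);
        if (table == null)
            return new List<string[]>(); // no element with that id

        return table.Elements("tr")
            .Skip(1) // ignore the header row
            .Select(row => row.Elements("td")
                .Select(td => td.InnerText.Replace("\r\n", "").Trim())
                .ToArray())
            .Where(cells => cells.Length > 3) // keep only complete rows
            .ToList();
    }
}
```

Calling TableParser.ParseRows(html, "clientTable") on the sample page should give one string array per client. Note that the Id cells in the sample contain literal quote characters, so they come through in InnerText as written; a Trim('"') on that cell would remove them if they are not wanted.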

Getting the table

Note that htmlDoc.GetElementbyId() is used to get the table. Outside Universal Windows, the Windows Forms HtmlDocument class (System.Windows.Forms) offers several ways to get elements:

GetElementById(String): Retrieves a single HtmlElement using the element's ID attribute as a search key.
GetElementFromPoint(Point): Retrieves the HTML element located at the specified client coordinates.
GetElementsByTagName(String): Retrieves a collection of elements with the specified HTML tag.

GetElementsByTagName could be used, where available, to get the <table> element. The HtmlAgilityPack HtmlDocument used in this UWP app has a counterpart only for the by-Id lookup (note the lowercase "b" in GetElementbyId), so the table must have the same Id (clientTable) in both the HTML and the C# code.
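
Where a page's table has no Id, walking the node tree is one possible workaround. The sketch below assumes the HtmlAgilityPack build in use exposes HtmlNode.Descendants(string), which has not been verified for every UWP package version; TableFinder and FirstTable are hypothetical names.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class TableFinder
{
    // Hypothetical helper: returns the first <table> element in the page,
    // or null when the page contains no table at all.
    public static HtmlNode FirstTable(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants("table").FirstOrDefault();
    }
}
```

This picks a table by position rather than by name, so it is more fragile than the Id lookup if the page layout changes.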

Further

The code could be extended to make use of the header to align the class properties with the column names and could even use Reflection to generate a generic class using those column headings.
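
As a rough illustration of the header-driven idea, the sketch below matches each column heading to a public string property of the same name and sets it through Reflection. It is a sketch under assumptions, not a finished design: RowMapper and Map are hypothetical names, only string properties are handled, and names are compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class RowMapper
{
    // Hypothetical helper: builds a T from one table row by matching each
    // header to a writable public string property with the same name.
    public static T Map<T>(IList<string> headers, IList<string> cells)
        where T : class, new()
    {
        var item = new T();
        int count = Math.Min(headers.Count, cells.Count);
        for (int i = 0; i < count; i++)
        {
            string header = headers[i];
            PropertyInfo prop = typeof(T).GetRuntimeProperties()
                .FirstOrDefault(p =>
                    string.Equals(p.Name, header, StringComparison.OrdinalIgnoreCase));

            // Skip columns with no matching, publicly settable string property.
            if (prop == null || prop.PropertyType != typeof(string))
                continue;
            if (prop.SetMethod == null || !prop.SetMethod.IsPublic || prop.SetMethod.IsStatic)
                continue;

            prop.SetValue(item, cells[i]);
        }
        return item;
    }
}
```

With the sample page, the "Id" heading would match the ID property, but the "Category" heading has no counterpart in Client, so Suburb would stay unset unless the heading or the property is renamed.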

David Jone's blog