Search
  • Reviewed by Maslows

Import & Combine multiple PDF files from WHO from DataChant

In the last tutorial of the COVID-19 dashboard series here, you learned how to import a single COVID-19 situation report directly from WHO website using Power Query PDF connector.

In today’s tutorial, you will earn how to combine multiple PDF files from WHO into a single table. If you are in a hurry, you can skip the last tutorial and start with this PBIX as your starting point. I will present this tutorial live next week in one of two webinars that will be hosted by the Microsoft Power BI team (details below). If you want to see any other techniques that I used in the COVID-19 series, please share it in the comments below.

Follow @gilra

Background

In the last tutorial, we imported from here the WHO COVID-19 Situation report – 177.

In today’s session, we will import all the reports since May 2020 and implement a solution that will combine data from new PDF files as they are been published on the WHO website.

This tutorial is divided into several phases with varying degrees of complexity and depth of explanation. If you cannot follow a certain step, you can download the final pbix file here, review it and try to work your way towards the solution.

Here are the phases we will cover in this tutorial:

  1. Optimize the query logic of a single PDF file

  2. Generate an inventory of PDF filenames

  3. Create the custom function

  4. Get the inventory of PDF filenames using web scraping

  5. Find mismatching PDF files and optimize the import query to support them

Optimize the the query logic of a single PDF File

If you followed the last article, and open Power Query Advanced Editor you can see that the query refers to few columns with very strange names like “Total confirmed#(lf)cases” and “Total confirmed#(lf)new cases“.

let
    Source = Pdf.Tables(Web.Contents("https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200715-covid-19-sitrep-177.pdf"), [Implementation="1.1"]),
    Table002 = Source{[Id="Table002"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(Table002, [PromoteAllScalars=true]),
    #"Filtered Rows" = Table.SelectRows(#"Promoted Headers", each [#"Total confirmed#(lf)cases"] <> null and [#"Total confirmed#(lf)cases"] <> ""),
    #"Replaced Value" = Table.ReplaceValue(#"Filtered Rows"," ","",Replacer.ReplaceText,{"Total confirmed#(lf)cases", "Total confirmed#(lf)new cases", "Total deaths", "Total new deaths"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Replaced Value",{{"Total confirmed#(lf)cases", Int64.Type}, {"Total confirmed#(lf)new cases", Int64.Type}, {"Total deaths", Int64.Type}, {"Total new deaths", Int64.Type}, {"Days since last#(lf)reported case", Int64.Type}}),
    #"Removed Bottom Rows" = Table.RemoveLastN(#"Changed Type",3)
in
    #"Removed Bottom Rows"

The #(lf) expression in the column names is used to refer to column names with a line-feed character. Having such column names in your model is not a good practice because you may find out over time that later versions of the PDF files may not preserve the line-feed in the column names. It also makes your M code difficult to read.

Table.SelectRows(#”Promoted Headers”, each [#”Total confirmed#(lf)cases”] <> null and [#”Total confirmed#(lf)cases”] <> “”)

To avoid errors in reading multiple PDF files with minor variations in format, we should try to optimize the query import and re-write the transformations in a way that will not be dependent on the original column names you have in a single PDF.

To optimize the query I follow these steps:

  1. I didn’t promote the first row as headers as I did in the last tutorial. Instead, with column names like “Column1” and “Column2” you have better chances to achieve a robust query than with column names that have line-feeds.

  2. Instead of renaming “Column1”, “Column2” explicitly, I wrote a generic M code that uses List.Zip to apply the renaming. You can read more about it in Chapter 10 of my book or the article here.

  3. I trimmed problematic columns. Sometimes you will get inconsistent whitespace characters in your data. This is especially important to secure the success of the next two filter steps.

  4. To remove the header from the first row, I filtered out all rows that start with “Reporting Country” in the first column. This is a better way to omit this row than the promote-first-row-as-headers action we used in the last tutorial. You can never know how many blank rows you may have at the beginning of your table.

  5. To remove the totals and sub-totals I filtered out “Grand total”, “Other*” and “Subtotal for all regions” values form the first column. This step is more resilient to edge-cases than removing the bottom 3 rows.

If you cannot follow the recommendations above, copy the following M expression and paste it in the Advanced Editor to replace the current code of the COVID-19 query.

let
    Source = Pdf.Tables(Web.Contents("https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200715-covid-19-sitrep-177.pdf"), [Implementation="1.1"]),
    #"Filtered Rows" = Table.SelectRows(Source, each (Text.Contains([Name],"-")))[Data]{0},
    ColumnNames = Table.ColumnNames(#"Filtered Rows"),
    NewColumnNames = {"Location", "Total confirmed cases", "Total confirmed new cases","Total deaths", "Total new deaths", "Classification", "Days since last report"},

    #"Renamed Columns" = Table.RenameColumns(#"Filtered Rows", List.Zip({ColumnNames, NewColumnNames})),
    #"Trimmed Text" = Table.TransformColumns(#"Renamed Columns",{{"Location", Text.Trim, type text}}),
    #"Filtered Rows1" = Table.SelectRows(#"Trimmed Text", each (not Text.Contains([Location],"Reporting Country"))),
    #"Filtered Rows2" = Table.SelectRows(#"Filtered Rows1", each [Total confirmed cases] <> null and [Total confirmed cases] <> ""),
    #"Filtered Rows3" = Table.SelectRows(#"Filtered Rows2", each ([Location] <> "Grand total" and [Location] <> "Other*" and [Location] <> "Subtotal for all regions")),
    #"Replaced Value" = Table.ReplaceValue(#"Filtered Rows3"," ","",Replacer.ReplaceText,{"Total confirmed cases", "Total confirmed new cases", "Total deaths", "Total new deaths", "Days since last report"})
in
    #"Replaced Value"

Generate an inventory of PDF filenames

Now that we have a good and robust query, it’s time to figure out how we can bring multiple reports from the website without preparing in advance the list of URLs.

Go to the WHO website here and click on the latest situational report. At the time I wrote this tutorial, the latest report was Situational report -204 from 11 August 2020.

Click on the report and inspect the URL in your browser. After you remove the trailing parameters (anything after the question mark), you will find the following URL:

https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200811-covid-19-sitrep-204.pdf

You can see that the PDF filename includes the following format: YYYYMMDD-covid19-siterep-zzz.pdf where YYYY is the year, MM is the month (in 2 digit representation), DD is the day of the month, and zzz is the number of the report.

You can also go back to the previous tutorial to confirm that the format of the URL for July 15 was the same:

https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200715-covid-19-sitrep-177.pdf

While there are few edge-cases in WHO website and not all the PDFs will have this format, I thought it would be useful to share in this tutorial the technique to generate the list of filenames.

Our objective is to create a list of filenames starting from:

20200715-covid-19-sitrep-177.pdf

And ending with:

20200811-covid-19-sitrep-204.pdf

Note: I deliberately start with a smaller range of dates. But you can reapply this technique on any range of dates.

Here we go.

Download the last tutorial’s PBIX file as your starting point and open it with Power BI Desktop, then click Transform data. Inside Power Query Editor, you may need to click Edit Credentials, select Anonymous in Access Web content dialog box and click Connect.

In Home tab, select Manage Parameters, New Parameter. In the Manage Parameters dialog box, enter StartDate in Name, select Date/Time in Type, and enter 7/15/2020 12:00:00 AM in Current Value, then click OK. Note, in your locale, you may enter the date and time differently.

In the Queries pane, right-click on the canvas and select New Query, Blank Query. Then, in the formula bar, enter the following formula:

= List.Dates(Date.From(StartDate), Duration.TotalDays(DateTime.LocalNow() - StartDate), #duration(1,0,0,0))

This formula is extremely useful to create calendar tables in Power BI and work with time intelligence in Power Query. The formula creates a dynamic list of dates starting from your StartDate parameter and ending with today. I wish you would have a simpler syntax to generate such a list. Let’s explain it.

List.Dates function in M accepts three arguments. The starting date, the total number of items in the list, and the increment. In our scenario, we would like to start with the current DateTime value that we have as a parameter. Since the first argument requires a date value and not datetime, we use Date.From to convert the value to Date.

The second argument of List.Dates function is the total number of items we want to have on the list. To get the number of days from today to StartDate we apply several steps. First, we get the current time using DateTime.LocalNow. Next, we subtract the StartDate From the current time using the minus operator. In M, the result of subtracting two datetime values is a duration object. To convert the duration into a number of days we apply Duration.TotalDays.

The last argumen is the increment which should be a duration object. The expression #duration(1,0,0,0) returns a duration object of a single day (first argument is days, second argument is hours, then minutes, then seconds).

To start transforming our list of dates into filenames, we can now convert the list into a table by clicking To Table in the List Tools, Transform tab.

Rename the column Date and change its type to Date as shown in this screenshot.

Our next step is to convert the dates in the table into the format YYYYMMDD which is used in the filenames.

Select the Date column, then in Add Column tab, select Date, Year, Year.

As a second column, you got a new column with the values of the years. Let’s create a column for the month. Select the Date column, then in Add Column tab, select Date, Month, Month.

Let’s create a column for the day of the month. Select the Date column, then in Add Column tab, select Date, Day, Day.

Convert the type of Year, Month and Day columns to Text.

You may notice that the Month and Day columns may have a single digit value, but the filenames are formatted as two digits. For months before October (10) and days before 10, we would need to add a leading zero character. Here is how you can do it.

Using the CTRL key select the columns Month and Day. Right-click on the column header of Month or Day and select Transform, then select Length.

This previous step was used to generate for us the code that will transform these columns. In the formula bar you can see the following M expression:

= Table.TransformColumns(#"Changed Type1",{{"Month", Text.Length, Int64.Type}, {"Day", Text.Length, Int64.Type}})

We will use it as a base line, but will change the two instances of the code

Text.Length, Int64.Type

into a new code:

each if Text.Length(_) = 1 then “0” & _ else _

Instead of returning the Text.Length function on the value, we check if the value’s length is 1, and if it is we add a zero, otherwise we return the original value. The combination of each and underscore are used here as a sugar-syntax for a function. I will not elaborate about it here. You can learn more on many blogs and in Chapter 9 of my book.

Here is the complete modified expression you would need to have in the formula bar when you complete this step:

= Table.TransformColumns(#"Changed Type1",{{"Month", each if Text.Length(_) = 1 then "0" & _ else _}, {"Day", each if Text.Length(_) = 1 then "0" & _ else _}})

Now we can merge the Year, Month and Day columns. Using the CTRL key select the columns Year, Month and Day. Right-click on one of the selected column headers and select Merge Columns.

In the Merge Columns dialog box, keep separator as None and click OK.

In the Merged column, we have the prefix needed to compile the filename. It’s time to take care of the suffix which is a running number. In this section, we started with the file in July 15, you can see that its correpsonding suffix is the number 177. Therefore, we can create now a running index starting from 177.

In Add Column tab, click Index Column and select Custom in the drop-down menu. In the Add Index Column dialog box, enter 177 as Starting Index and 1 as Increment. Then click OK.

Transform Index column type to Text.

We now have all the components to build the filename. In Add Custom tab, select Custom Column. In the Custom Column dialog box, enter URL as New column name. Then, enter the following expression inside the Custom column formula and click OK.

"docs/default-source/coronaviruse/situation-reports/" &
[Merged] &
"-covid-19-sitrep-" &
[Index] &
".pdf"

The expression above concatenates the static and dynamic elements of the URL relative path. For example, in bold you can find the dynamic elements:

docs/default-source/coronaviruse/situation-reports/20200811-covid-19-sitrep-204.pdf

Remove Merged and Index columns and name the query Filenames. It’s time to move to the next phase of the tutorial and create the custom function.

Create the custom function

Now when we have the filenames with their relative path we can move to the next step and create a custom function that will receive a filename and import a specific table. Power Query allows you to convert a query into a custom function. The first step is to create a parameter and use it in the query instead of the fixed value you currently use.

In Home tab, select Manage Parameters, New Parameter. In the Manage Parameters dialog box, enter Filename as Name. Select Text in the drop-down menu of Type, and paste one of the relative path filenames into Current Value. Then, click OK.

For example, you can copy this relative path filename: docs/default-source/coronaviruse/situation-reports/20200811-covid-19-sitrep-204.pdf

Select the COVID-19 query in the Queries pane and select Source in Applied Steps.

You can see this expression in the formula bar:

= Pdf.Tables(Web.Contents("https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200715-covid-19-sitrep-177.pdf"), [Implementation="1.1"])

We will now need to change the fixed URL in the formula into our generic parameter. You can do it by a simple concatenation as follows:

= Pdf.Tables(Web.Contents("https://www.who.int/" & Filename), [Implementation="1.1"])

Notice that these change will show you the same results in the preview dialog box. Which will mean you have done the change correctly. But there is a better way to implement this step using the RelativePath field in Web.Contents. Here is the new formula:

= Pdf.Tables(Web.Contents("https://www.who.int", [RelativePath = Filename]), [Implementation="1.1"])

Why is the new formula better? By defining the relative path, you are telling Power Query to identify the data source as the hostname and ignore all the permutations in the relative path. This way, you will have a better-managed dataset when you publish this solution to Power BI service.

In Queries pane, right-click on COVID-19 and select Create Function.

In Create Function dialog box, enter FnImportTable click OK. This will be our function name.

Depending on your Privacy settings in Options, you may now notice the following data privacy notification. Click Continue.

In Privacy levels dialog box, set Public as the privacy level and click Save.

After Power Query will load the PDF files you will see Table objects in the new column. Before you expand this column, notice the Errors values as highlighted below. These will give you valuable information about inconsistencies.

You can see that two filenames were not found on the WHO website. This is because WHO published different URLs for these two PDFs. TO avoid getting these errors you can apply one of the following steps:

  1. Apply the Remove Errors step on the last column.

  2. Apply Replace Values on the problemative URLs and provide the correct URLs.

  3. Proceed to the next phase where I will share a different technique using web scraping to load the actual URLs from the web page instead of the generation of the URLs by the dates and index numbers.

Before we proceed to the next phase, you have two more steps to go. Click on the expand control in FnImportTable column, and select all fields. Uncheck Use original column name as prefix and click OK. Next, apply the correct column types and load the data to Power BI.

Extract the filenames using web scraping

From here, I will move fast and will skip the step-by-step details. If you are interested in a detail tutorial for the next two sections, please share it in the comments below.

Since the WHO website contains URLs that have mismatching formats, I wanted to tackle this problem differently and extract the hyperlinks directly from the web page. I didn’t find a way to extract the hyperlinks using the HTML built-in capabilities. Instead, I loaded the web page into text lines using the technique I published here.

Next, I filtered the lines that contain “.pdf” and split them to extract the URL from the href section.

To follow my steps create a blank query and copy the following code into Advanced Editor.

let
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/"), null, null, 65001)}),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], ".pdf")),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Filtered Rows", {{"Column1", Splitter.SplitTextByDelimiter("href=", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Split Column by Delimiter", "Column1", Splitter.SplitTextByEachDelimiter({"?"}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Filtered Rows1" = Table.SelectRows(#"Split Column by Delimiter1", each [Column1.2] <> null and [Column1.2] <> ""),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows1",{"Column1.2"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Column1.1", "Filename"}}),
    #"Added Custom Column" = Table.AddColumn(#"Renamed Columns", "Custom", each let splitFilename = List.Reverse(Splitter.SplitTextByDelimiter("/", QuoteStyle.None)([Filename])) in Text.Start(splitFilename{0}?, 8)),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom Column",{{"Custom", Int64.Type}}),
    #"Filtered Rows2" = Table.SelectRows(#"Changed Type", each [Custom] >= 20200501),
    #"Inserted Date" = Table.AddColumn(#"Filtered Rows2", "Date", each Date.From(Text.From([Custom], "en-US")), type date),
    #"Removed Columns1" = Table.RemoveColumns(#"Inserted Date",{"Custom"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Removed Columns1", "FnImportTable", each FnImportTable([Filename])),
    #"Expanded FnImportTable" = Table.ExpandTableColumn(#"Invoked Custom Function", "FnImportTable", {"Location", "Total confirmed cases", "Total confirmed new cases", "Total deaths", "Total new deaths", "Classification", "Days since last report"}, {"Location", "Total confirmed cases", "Total confirmed new cases", "Total deaths", "Total new deaths", "Classification", "Days since last report"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Expanded FnImportTable",{{"Total confirmed cases", Int64.Type}, {"Total confirmed new cases", Int64.Type}, {"Total deaths", Int64.Type}, {"Total new deaths", Int64.Type}, {"Days since last report", Int64.Type}}),
    #"Removed Columns2" = Table.RemoveColumns(#"Changed Type1",{"Filename"}),
    #"Removed Duplicates" = Table.Distinct(#"Removed Columns2", {"Date", "Location", "Classification"})
in
    #"Removed Duplicates"

You will also notice that the query above also extracts the dates from the URLs. Having dates are crucial to preserve the context of the date in each PDF file.

Find mismatching PDF files and optimize the import query

Using the query above you can step up your game and try to import as many PDF files from the WHO website using the following approach:



  1. In Applied Steps go to Invoked Custom Function.

  2. In the Preview section, look for Error objects in the last column.

  1. Copy the filename from the row with the error.

  2. In the Queries pane, select the Filename parameter and paste the filename from the step above in Current Value. Then, click on the COVID-19 query under FnImportTable group.

  1. Detect the error and find the best way to handle it without breaking the same transformations of the other PDF files. Repeat Step 1-5 until you resolve all errors.

Following the steps above, here is one of the error you will notice.

It originates by an extra empty column in the specific PDF. You will find out additional PDF files with additional empty columns.

To fix the empty columns and other edge cases, you can copy the M expression below and paste it in the Advanced Editor of the COVID-19 query.

let
    Source = Pdf.Tables(Web.Contents("https://www.who.int", [ RelativePath = Filename ]), [Implementation="1.1"]),
    #"Filtered Rows" = Table.SelectRows(Source, each (Text.Contains([Name],"-")))[Data]{0},
    CheckColumn2 = if #"Filtered Rows"{0}[Column2] = null or #"Filtered Rows"{0}[Column2] = "*" then Table.RemoveColumns(#"Filtered Rows", {"Column2"}) else #"Filtered Rows",
    CheckColumn3 = if CheckColumn2{0}[Column3] = null or CheckColumn2{0}[Column3] = "*" then Table.RemoveColumns(CheckColumn2, {"Column3"}) else CheckColumn2,
    CheckColumn4 = if CheckColumn3{0}[Column4] = null or CheckColumn3{0}[Column4] = "*" then Table.RemoveColumns(CheckColumn3, {"Column4"}) else CheckColumn3,
    CheckColumn5 = if CheckColumn4{0}[Column5] = null or CheckColumn4{0}[Column5] = "*" then Table.RemoveColumns(CheckColumn4, {"Column5"}) else CheckColumn4,
    CheckColumn6 = if CheckColumn5{0}[Column6] = null or CheckColumn5{0}[Column6] = "*" then Table.RemoveColumns(CheckColumn5, {"Column6"}) else CheckColumn5,
    CheckColumn7 = if CheckColumn6{0}[Column7] = null or CheckColumn6{0}[Column7] = "*" then Table.RemoveColumns(CheckColumn6, {"Column7"}) else CheckColumn6,
    CheckColumn8 = if Table.HasColumns(CheckColumn7, {"Column8"}) then if  CheckColumn7{0}[Column8]? = null or CheckColumn7{0}[Column8]? = "*" then Table.RemoveColumns(CheckColumn7, {"Column8"}) else CheckColumn7 else CheckColumn7,
    
    ColumnNames = Table.ColumnNames(CheckColumn8),
    NewColumnNames = {"Location", "Total confirmed cases", "Total confirmed new cases","Total deaths", "Total new deaths", "Classification", "Days since last report"},

    #"Renamed Columns" = Table.RenameColumns(CheckColumn5, List.Zip({ColumnNames, NewColumnNames})),
    #"Trimmed Text" = Table.TransformColumns(#"Renamed Columns",{{"Location", Text.Trim, type text}}),
    #"Filtered Rows1" = Table.SelectRows(#"Trimmed Text", each (not Text.Contains([Location],"Reporting Country"))),
    #"Filtered Rows2" = Table.SelectRows(#"Filtered Rows1", each [Total confirmed cases] <> null and [Total confirmed cases] <> ""),
    #"Trimmed Text1" = Table.TransformColumns(#"Filtered Rows2",{{"Location", Text.Trim, type text}}),
    #"Filtered Rows3" = Table.SelectRows(#"Trimmed Text1", each ([Location] <> "Grand total" and [Location] <> "Other*" and [Location] <> "Subtotal for all regions")),
    #"Replaced Value" = Table.ReplaceValue(#"Filtered Rows3"," ","",Replacer.ReplaceText,{"Total confirmed cases", "Total confirmed new cases", "Total deaths", "Total new deaths", "Days since last report"}),
    #"Trimmed Text2" = Table.TransformColumns(#"Replaced Value",{{"Days since last report", Text.Trim, type text}}),
    #"Replaced Value1" = Table.ReplaceValue(#"Trimmed Text2","-",null,Replacer.ReplaceValue,{"Days since last report"})
in
    #"Replaced Value1"

You are now ready to import all the files since May 1, 2020 into a single table in Power BI. Download this pbix file for the full solution.

What’s next?

If you like my recent tutorials on COVID-19 in Power BI, I will be glad to meet you in my next two webinars that will be hosted by the Microsoft Power BI team – Soon to be announced on Microsoft Power BI blog here. The links below will take you to the Teams Live Event at the time of the webinars.

Analyze the COVID-19 pandemic in Power BIhttps://aka.ms/PBIWebinar082020208/20/2020 2PM ETAnalyze the COVID-19 pandemic in Power BI Part 2https://aka.ms/PBIWebinar092020209/01/2020 2PM ET

Follow @gilra

The post Import & Combine multiple PDF files from WHO appeared first on DataChant.

0 views