DuckDB - a couple of learnings
I’ve been updating some slides for emissions in the West of England using data from DESNZ.
R and RStudio are unwell so I resolved to use duckdb to derive the proportions of emissions by sector for the latest year and the 10 year change in emissions. Nothing dramatic. I did learn a couple of useful things.
-
Renaming all columns in a table to snake case doesn’t seem that easy in SQL compared to the wonderful
janitor::clean_names()function in R. However there is a nice workaround if you are importing from a csv. Simply use thenormalize_names = trueparameter in the read_csv() function. From the docs: Normalize column names. This removes any non-alphanumeric characters from them. Column names that are reserved SQL keywords are prefixed with an underscore character (_). -
Calculating the proportions of values compared to the totals of a column. When I wanted to work out the proportion of emissions by sector in the same table as the percentage change over 10 years I got a bit stuck as I didn’t want to do a GROUP BY. The solution is to use the OVER() clause in the column calculation like this:
SELECT *, ("2023_LEP" - "2014_LEP") * 100 / "2014_LEP" lep_pc_change, ("2023_UK" - "2014_UK") * 100 / "2014_UK" uk_pc_change, "2023_LEP" * 100 / SUM("2023_LEP") OVER() lep_pc_total, "2023_UK" * 100 / SUM("2023_UK") OVER() uk_pc_total FROM year_pivot_tbl WHERE la_ghg_sector != 'LULUCF';The OVER () clause turns the aggregation into a window function. An empty () specifies that the window for the sum is the entire result set.