Finding duplicate PKs in a big CSV file
The challenge :)
A few days ago at work, young $PADAWAN-COLLEAGUE had the task of extracting, from a big CSV file (~1.5 million lines), all ids that were used more than once. These ids should be unique, like PKs in a database; duplicates are an error, hence the task of finding them.
Surprise - despite my hint - the highly valued $PADAWAN-COLLEAGUE started with Excel. These young guys… As expected, MS Office 365 Excel also has a limit on the number of rows it can handle. IIRC once(tm) it was about 64k rows; today that has grown to about 1 million rows.
$PADAWAN-COLLEAGUE thought of splitting the CSV into chunks of less than 1 million lines each, but that's a bad idea: a duplicate whose occurrences land in different chunks would never be detected…
How to proceed? Import it into a database and run some subselect?
Time for the $JEDI (yep, that's me) to jump in.
The salvation
Instead of awk or something like that, I chose the deadly weapon: J.
First, import the CSV file and isolate the column with the PKs:
require 'tables/dsv'
input=: (';';'"') readdsv_pdsv_ 'file.csv'   NB. our CSV file uses ';' as delimiter
input=: > {:"1 input                         NB. take the last column and get rid of the boxes
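A quick sanity check is a good idea before going further - just a sketch, the actual numbers depend on your real file ('file.csv' above is only a stand-in name):

   $ input              NB. shape: number of ids by length of the longest id (character matrix)
   3 {. input           NB. peek at the first three ids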
And now, real J magic: mark the first occurrence of each PK, negate the result, use that mask to keep only the duplicates, and show each duplicate just once:
~. (-. ~: input) # input NB. nub sieve input, negate it, copy only the duplicates, nub result
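To make the chain a bit less magic for the non-enlightened, here is the same expression step by step on a toy key list (the numbers are made-up example data, not from the real file):

   pk =. 7 3 7 1 3 7 5
   ~: pk                          NB. nub sieve: 1 marks the first occurrence of each key
1 1 0 1 0 0 1
   -. ~: pk                       NB. negated: 1 marks every repeated occurrence
0 0 1 0 1 1 0
   (-. ~: pk) # pk                NB. copy keeps only the repeated occurrences
7 3 7
   ~. (-. ~: pk) # pk             NB. nub: each duplicated key reported once
7 3

The mask says "this key has been seen before", the copy pulls out exactly those occurrences, and the final nub collapses the key 7, which occurs three times, into a single report.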
Voilà - that's it :).
J - as always - offers an elegant solution, with minimal typing effort, but maximal confusion for the non-enlightened (less nerdy) colleagues.
Related Links
- Jsoftware - home base of J
- Wikipedia about APL: J is an APL dialect, so read up on APL.