How to remove duplicate lines from text files > 1 GB?

EDN Admin

Hey everyone:
We have an issue here that we didn't see a clear answer to in other, similar threads.
Our system reads comma-delimited lines from text files that are sometimes larger than 1,347,545 KB (roughly 1.3 GB).
Currently, our process does a File.ReadAllLines into a string[], converts the array into a ConcurrentDictionary via Parallel.For loops to remove the duplicates, then writes all the lines back out.
This works great until the data files get HUGE, at which point both the string[] and the dictionary exceed the 2 GB per-object limit that .NET imposes.
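
For reference, here is a stripped-down sketch of what we do today; the file paths are just placeholders, not our real ones:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class DedupInMemory
{
    static void Main()
    {
        // Pulls the whole file into memory at once; this is where we blow
        // past the 2 GB per-object limit on very large inputs.
        string[] lines = File.ReadAllLines(@"C:\data\input.txt");

        // TryAdd ignores keys that are already present, which drops duplicates.
        var unique = new ConcurrentDictionary<string, byte>();
        Parallel.For(0, lines.Length, i => unique.TryAdd(lines[i], 0));

        File.WriteAllLines(@"C:\data\output.txt", unique.Keys);
    }
}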
So here we are, looking for ideas on how to tackle this problem. Each line in our data file is a comma-separated list of 2-digit numbers, and every line has exactly the same number of numbers. The numbers on each line are sorted in ascending order, but the lines in the file are not.
We need a process to remove duplicate lines: there could be one, there could be thousands, all scattered randomly around the file.

Bonus if the resulting output, with the duplicates removed, can also be sorted, but that's not a requirement.

We are working with .NET 4.0. We could potentially upgrade to .NET 4.5, since as I understand it Microsoft lifted the 2 GB limit on array size there. But we're not keen to do that unless there's a solid technique for removing duplicates that can't be done in .NET 4.0.
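(For what it's worth, my understanding is that the 4.5 change is opt-in via app.config and only applies to 64-bit processes, along these lines:)

<configuration>
  <runtime>
    <!-- Opt in to arrays larger than 2 GB (64-bit, .NET 4.5+) -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>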
At this time using a database is NOT an option.
Anyone have any ideas?
Thanks!


