January 14, 2013
The dreaded custom framework.
Using pipelining for document normalization, article extraction, publication date extraction, link extraction, language detection and deduplication. The problem we're trying to solve is doing all of the above in a high-throughput, scalable fashion without it falling apart when I'm asleep.
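To make the pipelining idea concrete, here is a minimal sketch in Python. It is not the framework itself: the stage names, the dict-based document record and the `run_pipeline` helper are all hypothetical, and only three of the stages listed above are shown, each as a generator that consumes and yields documents.

```python
import hashlib
import re
import unicodedata

# Hypothetical document record: a plain dict with "url" and "text" keys.
LINK_RE = re.compile(r'https?://[^\s<>"]+')


def normalize(docs):
    """Normalize Unicode and collapse runs of whitespace."""
    for doc in docs:
        text = unicodedata.normalize("NFKC", doc["text"])
        yield {**doc, "text": " ".join(text.split())}


def extract_links(docs):
    """Attach every http(s) URL found in the text (a naive regex, for illustration)."""
    for doc in docs:
        yield {**doc, "links": LINK_RE.findall(doc["text"])}


def deduplicate(docs):
    """Drop documents whose normalized text has already been seen (exact match only)."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs, stages):
    """Chain the stages lazily so documents stream through one at a time."""
    stream = iter(docs)
    for stage in stages:
        stream = stage(stream)
    return stream


docs = [
    {"url": "a", "text": "Hello   world  http://example.com/x"},
    {"url": "b", "text": "Hello world http://example.com/x"},
    {"url": "c", "text": "Something else"},
]
results = list(run_pipeline(docs, [normalize, extract_links, deduplicate]))
```

Because each stage is a generator, nothing is held in memory beyond the document in flight and the set of hashes in `deduplicate`. A real deployment would presumably put queues and workers between stages; that part is left out here.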
I work at Arachnys, gathering business intelligence data from every pathologically made site you could think of. This includes on-demand, user-simulated searches across business registries, publication date extraction from new articles, dirty data, etc.