Some thoughts about the threading:
1. From what I understand, OptiFine has had some success spreading chunk [something] across multiple ticks and multiple cores. However, I'm not sure whether this is chunk generation or chunk rendering. If it's the former, that might be something to look at.
2. Even with all the complicated mod injections into the chunk generation code, I still believe it should be possible to move world generation onto its own separate thread, instead of splitting it across multiple threads. The same could be done for other things like mob pathfinding, machines from mods, etc.
3. Finally, this is probably the most complicated, but there are commercial and server programs that can take a task that is normally single-threaded and divide it into multiple workloads, then assign each to a core/thread. Instead of intercepting the original chunk generation code, it might be possible to take the math and calculations and assign them to multiple cores.
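To make ideas 2 and 3 a bit more concrete, here is a rough Java sketch. Everything in it is made up for illustration: `ChunkThreadingSketch`, `heightAt`, `generate` and `generateOffThread` are hypothetical names, not anything from Minecraft, Forge, FastCraft or KCauldron, and the "terrain math" is just a placeholder hash. It assumes the calculation is a pure function of the seed and coordinates, which real world generation with mod hooks touching shared world state almost certainly is not.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

class ChunkThreadingSketch {
    static final int SIZE = 16; // columns along one side of a chunk

    // Hypothetical stand-in for the terrain math: depends only on its inputs,
    // so it is safe to call from any thread.
    static int heightAt(long seed, int x, int z) {
        long h = seed * 31L + x * 341873128712L + z * 132897987541L;
        h ^= (h >>> 29);
        return (int) Math.floorMod(h, 64L) + 64;
    }

    // Idea 3: every column is independent, so the math can be spread over cores.
    // With parallel == false this is the ordinary single-threaded loop.
    static int[] generate(long seed, int chunkX, int chunkZ, boolean parallel) {
        IntStream columns = IntStream.range(0, SIZE * SIZE);
        if (parallel) {
            columns = columns.parallel();
        }
        return columns
                .map(i -> heightAt(seed, chunkX * SIZE + i % SIZE, chunkZ * SIZE + i / SIZE))
                .toArray(); // toArray keeps the column order even when parallel
    }

    // Idea 2: leave generation single-threaded, but run it on a dedicated
    // worker thread so the main (tick) thread is not blocked while it runs.
    static CompletableFuture<int[]> generateOffThread(
            ExecutorService worldGen, long seed, int chunkX, int chunkZ) {
        return CompletableFuture.supplyAsync(
                () -> generate(seed, chunkX, chunkZ, false), worldGen);
    }

    public static void main(String[] args) {
        ExecutorService worldGen = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<int[]> pending = generateOffThread(worldGen, 42L, 0, 0);
            // The main thread could keep ticking here while the chunk is built.
            int[] offThread = pending.join();
            int[] multiCore = generate(42L, 0, 0, true);
            System.out.println(offThread.length + " columns, results match: "
                    + Arrays.equals(offThread, multiCore));
        } finally {
            worldGen.shutdown();
        }
    }
}
```

The hard part in practice would be everything this sketch leaves out: the result has to be handed back to the main thread safely, and any mod code that reads or writes the world during generation would need the same treatment.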
Please correct me. I'm sure I'm mostly babbling and wrong, and I don't want to remain the idiot. These are some ideas, though.
Edit: Thought I'd share some results with the last FastCraft build on KCauldron: the world took 36 seconds to generate before FastCraft and 24 after (about a third less time), in a low-RAM environment on a modpack with ~150 mods. While KCauldron may improve chunk generation, FastCraft is certainly NOT useless with KCauldron!