I am working on an open-source tool focused on a simple question:
why is this PyTorch training run slower than it should be, and what is actually bottlenecking it?
I want to make this easier to answer for ML engineers and researchers, without requiring them to jump straight into heavy profilers or stitch together multiple low-level tools. I would really value input from people running real workloads:
- what is missing from current tooling?
- what part of the debugging workflow is still too manual or unclear?
- what would make a tool like this genuinely useful to you?
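To make "too manual" concrete, here is a minimal sketch of the kind of hand-rolled phase timing people often write today before reaching for a full profiler. All names and the simulated step costs are illustrative, not traceml's API:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical hand-rolled phase timer: the ad-hoc instrumentation
# people typically write before reaching for torch.profiler.
phase_totals = defaultdict(float)

@contextmanager
def timed(phase):
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start

# Stand-in "training steps": sleeps model the relative cost of each phase.
for _ in range(3):
    with timed("data_loading"):
        time.sleep(0.02)   # pretend the dataloader is the bottleneck
    with timed("forward_backward"):
        time.sleep(0.005)
    with timed("optimizer_step"):
        time.sleep(0.001)

slowest = max(phase_totals, key=phase_totals.get)
print(f"slowest phase: {slowest}")
```

This works, but it has to be threaded through the training loop by hand, redone for every project, and it says nothing about GPU utilization or overlap; that gap is roughly what the tool is trying to close.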
I am also open to collaborating with people who care deeply about this problem and may want to contribute to the project over time. My main goal right now is to learn from real users and shape the tool around actual pain points rather than assumptions.
Repo: https://github.com/traceopt-ai/traceml (find why PyTorch training is slow while it's still running)