62M checkpoint appears incomplete/truncated despite matching LFS sha256

by KantaHayashiAI - opened 19 days ago

Hi NVIDIA team,

I may have found an issue with the 62M proxy checkpoint artifact:

nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt

The downloaded file matches the Hugging Face LFS metadata exactly, so this does not look like local download or cache corruption:

file size: 770,527,232 bytes
LFS sha256 / local sha256: f339e80c501ead58cdd067442a68e3930fd3438ac47f9b95d555ea532c84ca01
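For anyone who wants to repeat the size and hash comparison, here is a minimal sketch using only the standard library. The helper name `file_size_and_sha256` is made up for illustration; the path and expected values are the ones quoted above.

```python
import hashlib
import os


def file_size_and_sha256(path, chunk_size=1 << 20):
    """Return (size_in_bytes, hex_sha256) for the file at `path`.

    The file is hashed in chunks so a multi-hundred-MB checkpoint
    does not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


# Values reported in this thread for the 62M artifact.
path = "nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt"
expected_size = 770_527_232
expected_sha256 = "f339e80c501ead58cdd067442a68e3930fd3438ac47f9b95d555ea532c84ca01"

if os.path.exists(path):  # only runs if the file was downloaded to this relative path
    size, sha = file_size_and_sha256(path)
    print("size matches:  ", size == expected_size)
    print("sha256 matches:", sha == expected_sha256)
```

If both checks print True, the local copy is byte-identical to what the LFS pointer describes, which rules out a download problem but says nothing about whether the uploaded file itself is complete.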

However, the checkpoint cannot be loaded as a PyTorch checkpoint:

PytorchStreamReader failed reading zip archive: failed finding central directory

I also inspected the zip/PyTorch-serialization structure:

the file starts with a valid local zip header (PK\x03\x04)
the tail contains no EOCD (PK\x05\x06)
the tail contains no ZIP64 EOCD (PK\x06\x06)
the tail contains no central directory headers (PK\x01\x02)
the last stream-local entry I can find is model_optim_rng/data/798
expected later records such as model_optim_rng/data/8, model_optim_rng/version, and model_optim_rng/.data/serialization_id are absent
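The signature checks listed above can be approximated with a short script. This is a rough sketch, not the exact tool I used: `inspect_zip_markers` and the 1 MiB tail window are arbitrary choices, and a plain byte search can in principle hit the same four bytes inside tensor data.

```python
import os

# Zip record signatures looked for in the tail of the file.
TAIL_SIGNATURES = {
    "EOCD": b"PK\x05\x06",
    "ZIP64 EOCD": b"PK\x06\x06",
    "central directory header": b"PK\x01\x02",
}


def inspect_zip_markers(path, tail_bytes=1 << 20):
    """Report which zip markers are present at the start and in the tail."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    report = {"local header at start": head == b"PK\x03\x04"}
    for name, signature in TAIL_SIGNATURES.items():
        report[name] = signature in tail
    return report
```

On a complete PyTorch zip checkpoint I would expect the local header, EOCD, and central directory entries to be reported as present; on the 62M file described here only the first one is.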

As a repairability check, I rebuilt a zip central directory from the surviving local file headers and added version / .data/serialization_id. The rebuilt archive became valid as a zip file, but torch.load still failed:

PytorchStreamReader failed locating file data/8: file not found

So this appears to be more than a missing central directory; some PyTorch storage records themselves seem to be missing from the published 62M artifact.
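To see which records survive in a truncated archive, one option is to scan for local file header signatures and read the entry name that follows each one. The sketch below is a simplified version of that idea under a few assumptions: `scan_local_header_names` is a hypothetical helper, it does not validate sizes or CRCs, and a signature that happens to occur inside tensor bytes would show up as a bogus entry.

```python
import mmap

LOCAL_HEADER_SIG = b"PK\x03\x04"
LOCAL_HEADER_FIXED_LEN = 30  # fixed part of a zip local file header
NAME_LEN_OFFSET = 26         # 2-byte little-endian file-name length


def scan_local_header_names(path):
    """Return [(offset, entry_name), ...] for every local header signature found."""
    found = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        pos = m.find(LOCAL_HEADER_SIG)
        while pos != -1:
            name_start = pos + LOCAL_HEADER_FIXED_LEN
            if name_start <= len(m):
                name_len = int.from_bytes(
                    m[pos + NAME_LEN_OFFSET : pos + NAME_LEN_OFFSET + 2], "little"
                )
                raw_name = m[name_start : name_start + name_len]
                if len(raw_name) == name_len:  # skip headers cut off by truncation
                    found.append((pos, raw_name.decode("utf-8", errors="replace")))
            pos = m.find(LOCAL_HEADER_SIG, pos + 4)
    return found
```

Comparing the returned names against the storage keys the pickle refers to (for example `data/8`) shows whether a record is genuinely absent rather than merely unindexed.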

As a control, the 350M checkpoint:

nemotron_climb_proxy_model_350m/iter_2384053/mp_rank_00/model_optim_rng.pt

does load correctly in the same environment, and its zip structure includes the expected central directory and EOCD records.

Could you please verify whether the 62M model_optim_rng.pt upload is complete, or re-upload the 62M checkpoint? If the checkpoint was intended to be split or there is an alternate 62M checkpoint source, it would be helpful to document that as well.

Thanks for releasing these proxy models.

sarahyurick

NVIDIA org 19 days ago

Hi @KantaHayashiAI, thanks for reporting! I am able to reproduce the error too. Let me see about uploading a fix soon.

For documentation purposes, my code to reproduce it is just:

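import torch

# Path is relative to a local copy of the model repo.
path = "nemotron_climb_proxy_model_62m/iter_2500000/mp_rank_00/model_optim_rng.pt"
# weights_only=False unpickles arbitrary objects, so only use it on trusted files.
torch.load(path, map_location="cpu", weights_only=False)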

sarahyurick

NVIDIA org 19 days ago

Hi @KantaHayashiAI, I have uploaded another file which works on my end. Please confirm whether it works for you too: nemotron_climb_proxy_model_62m/iter_2499000/mp_rank_00/model_optim_rng.pt

Thanks again for reporting this issue!

KantaHayashiAI

19 days ago

Hi @sarahyurick, thank you very much for the quick fix.

I can confirm that the newly uploaded file loads successfully on my side:

nemotron_climb_proxy_model_62m/iter_2499000/mp_rank_00/model_optim_rng.pt

However, I found one remaining issue that may be important for Megatron loading/training.

The args object inside the 62M checkpoint appears to contain the same architecture values as the 350M checkpoint. For example, in both checkpoints I see:

num_layers = 12
hidden_size = 1344
ffn_hidden_size = 5376
num_attention_heads = 12
padded_vocab_size = 32000

But the 62M tensor shapes indicate a much smaller model. For example:

62M: final_layernorm.weight shape is [384], total model tensor numel ~74,932,608
350M: final_layernorm.weight shape is [960], total model tensor numel ~376,075,200

Because of this, when loading the 62M checkpoint with Megatron using --use-checkpoint-args, Megatron builds a ~300M model instead of the small 62M model. It then reports around 0.30B parameters during training.
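The ~0.30B figure is consistent with a back-of-the-envelope count from the args values listed above. The sketch below assumes a plain GPT-style layer (four hidden-by-hidden attention projections and a two-matrix, non-gated MLP), one embedding matrix, and ignores biases, layernorm weights, and position embeddings; `approx_gpt_params` is just an illustrative helper, not anything from Megatron.

```python
def approx_gpt_params(num_layers, hidden_size, ffn_hidden_size, padded_vocab_size):
    """Rough weight count for a GPT-style model (weight matrices only)."""
    attention = 4 * hidden_size * hidden_size   # Q, K, V and output projections
    mlp = 2 * hidden_size * ffn_hidden_size     # up and down projections
    embedding = padded_vocab_size * hidden_size
    return num_layers * (attention + mlp) + embedding


# Values found in the args object of the 62M checkpoint.
total = approx_gpt_params(
    num_layers=12,
    hidden_size=1344,
    ffn_hidden_size=5376,
    padded_vocab_size=32000,
)
print(f"{total:,}")  # 303,120,384, i.e. roughly 0.30B
```

So the parameter count Megatron reports is what you would expect from the stored args, not from the tensors in the file.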

So the file corruption issue seems fixed, but the checkpoint metadata/args for the 62M checkpoint may still be inconsistent with the tensors.
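In case it helps with verifying the re-upload, here is a small sketch of the kind of consistency check I mean. It walks a nested state dict, finds any `final_layernorm.weight`, and compares its shape to `args.hidden_size`. The helper names are made up, and the exact nesting of keys inside the checkpoint is an assumption since it varies between Megatron versions.

```python
def iter_tensors(obj, prefix=""):
    """Yield (dotted_name, value) for every leaf that has a .shape attribute."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_tensors(value, name)
    elif hasattr(obj, "shape"):
        yield prefix, obj


def check_hidden_size(args, model_state):
    """Return [(name, actual_shape, args.hidden_size)] for each mismatch."""
    mismatches = []
    for name, tensor in iter_tensors(model_state):
        if name.endswith("final_layernorm.weight"):
            actual = tuple(tensor.shape)
            if actual != (args.hidden_size,):
                mismatches.append((name, actual, args.hidden_size))
    return mismatches


# Example usage (assumes the checkpoint dict has "args" and "model" keys):
# ckpt = torch.load(path, map_location="cpu", weights_only=False)
# print(check_hidden_size(ckpt["args"], ckpt["model"]))
```

An empty list means the layernorm width agrees with the stored hidden size; for the 62M file described above it would report [384] against 1344.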

Thanks again for the fast turnaround!

sarahyurick

NVIDIA org 19 days ago

Ah, good catch, and sorry for the back and forth, @KantaHayashiAI. I have re-uploaded both checkpoints. Please let me know if there are any remaining issues.

sarahyurick

NVIDIA org 7 days ago

Closing for now, thank you. Please feel free to re-open as needed.

sarahyurick changed discussion status to closed 7 days ago
