Resume training in dataloader #18
Comments
I made changes to the CustomDataset class. These changes now allow us to get the current batch and other information so that our training performs better. By the way, the checkpoint main is just a temporary main where I added an example of the usage; the same is the case with the model checkpoint version, which is also a usage example. I recommend that the best practice is to add this change to the current version, but I cannot evaluate it over the dataset, so I created this development version to evaluate it manually.
Co-Authored-By: Dong Han <dong.han@uconn.edu>
Co-Authored-By: Darren Chen <darren.3.chen@uconn.edu>
If we cannot resume the dataloader, I will create several smaller dataloaders, each containing several UIDs from the large unlabeled data, and then load them for labeling/testing in the future.
Dear @lrm22005 ,
According to the official PyTorch code, I guess that when num_workers > 0, we cannot gracefully fetch the segment index. Have you checked this answer? https://stackoverflow.com/questions/58834338/how-does-the-getitem-s-idx-work-within-pytorchs-dataloader
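To illustrate the concern, here is a minimal sketch (not from the repo; the class and attribute names are mine): any attribute we set inside `__getitem__` lives in the worker process when `num_workers > 0`, so the copy of the dataset held by the main process never sees the update.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IndexTrackingDataset(Dataset):
    """Toy dataset that tries to remember the last index it served."""
    def __init__(self, n=16):
        self.data = torch.arange(n, dtype=torch.float32)
        self.last_index = None  # we hope to read this back in the training loop

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Only updates the *worker's* copy of the dataset when num_workers > 0.
        self.last_index = idx
        return self.data[idx]

if __name__ == "__main__":
    dataset = IndexTrackingDataset()
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for _ in loader:
        pass
    # Prints None: each worker mutated its own copy, not the main process's copy.
    print("last_index seen by main process:", dataset.last_index)
```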
Dear @lrm22005 , I think our current way of blindly resetting/adding value to

My suggestion is, as the Stack Overflow answer mentioned, to create our own mapper function, which can also prepare our dataloader for the

My proposed steps (see the sketch below):
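One possible shape for that mapper, as a rough sketch (the class name and the `start_index` argument are mine, not from the repo): a custom sampler that skips the indices already consumed before the checkpoint, so the DataLoader picks up where it left off even with `num_workers > 0`.

```python
import torch
from torch.utils.data import Sampler, DataLoader, TensorDataset

class ResumableSequentialSampler(Sampler):
    """Sequential sampler that starts from a saved position instead of index 0."""
    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index

    def __iter__(self):
        return iter(range(self.start_index, len(self.data_source)))

    def __len__(self):
        return len(self.data_source) - self.start_index

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(100).float())
    batch_size = 10
    resumed_batch = 7  # e.g., read from the dataloader checkpoint

    sampler = ResumableSequentialSampler(dataset, start_index=resumed_batch * batch_size)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=2)

    for batch_idx, (x,) in enumerate(loader, start=resumed_batch):
        # The first resumed batch starts at sample 70.
        print(batch_idx, x[:3])
```

Because the skipping happens in the sampler (main process), this works regardless of how many workers are used.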
Dear @lrm22005 , I tested the resumed training, and the segment names look correct.

The version of code I used to test the dataloader training resume:

The Jupyter notebook I wrote to check the files before resuming and after resuming:

I have a small bug that I have not finished fixing in the code.
Almost done with the resume training data loader!
Dear @lrm22005 , I tested the resume training on the large unlabeled data, and it worked (although I did not have time to compare the file names across all unlabeled data vs. before vs. after resume training).

Version of code:

Colab output for resume training:

I think we should be done with the resume training data loader now. The above three bugs have been fixed by now. Remember to switch back to using the training UIDs to train the model; right now I am using the unlabeled UIDs for debugging purposes.

Reminder: @dac20022 If you have time this week or next week, we can go over the changes together with Luis.

Another notice:
Dear Luis @lrm22005 , Since we are saving the checkpoint after every batch and every epoch, I was not saving the best model. Therefore, I should assign the best model to new variables and return them after the

After this line, I will write:

best_model_state = copy.deepcopy(model.state_dict())
best_likelihood_state = copy.deepcopy(likelihood.state_dict())
best_metrics = metrics

and when returning, I will return

I read from the PyTorch tutorial that I need to deep copy the model state for the best model. However, does it apply to the

Or is the simplest way still to save the best model with a checkpoint?
Dear @lrm22005 , Another quick question: why did you not return the optimizer from the train_gp_model function? Is it because when starting the active learning, you do not need to resume the training from where train_gp_model left off? Thanks!
Your approach looks correct. Deep copying the model state and the likelihood state is the right way to save the best model. Here's a quick check: after line 225, you can add:

best_model_state = copy.deepcopy(model.state_dict())

And when returning, you can do:

return best_model_state, best_likelihood_state, best_metrics

It looks like you've got it right for both the model and likelihood states; state_dict is what you need to deep copy for both. Alternatively, saving the best model under a different checkpoint file name is also a valid and often simpler approach. It's good to have both options depending on your preference and specific use case. The checkpoint method you mentioned from my code should work just fine.
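For reference, a compact sketch of that pattern (the placeholder modules and the `validate` helper below are illustrative, not the project's actual code):

```python
import copy
import torch

model = torch.nn.Linear(4, 2)       # placeholder for the GP model
likelihood = torch.nn.Identity()    # placeholder for the GPyTorch likelihood

def validate(model, likelihood):
    # Placeholder validation metric: lower is better.
    return torch.rand(1).item()

def train_with_best_tracking(model, likelihood, num_epochs=5):
    best_loss = float("inf")
    best_model_state = best_likelihood_state = best_metrics = None
    for epoch in range(num_epochs):
        # ... one epoch of training would go here ...
        val_loss = validate(model, likelihood)
        if val_loss < best_loss:
            best_loss = val_loss
            # Deep-copy both state dicts so later training steps
            # do not overwrite the stored "best" weights.
            best_model_state = copy.deepcopy(model.state_dict())
            best_likelihood_state = copy.deepcopy(likelihood.state_dict())
            best_metrics = {"val_loss": val_loss, "epoch": epoch}
    return best_model_state, best_likelihood_state, best_metrics

best_model_state, best_likelihood_state, best_metrics = train_with_best_tracking(model, likelihood)
# To restore later: model.load_state_dict(best_model_state)
```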
The reason I didn't return the optimizer from the train_gp_model function is indeed because of the workflow we're following. When starting the active learning phase, we typically don't need to resume training from where the train_gp_model left off. Instead, we often start a new training process, which is why returning the optimizer state isn't necessary. The optimizer state is saved in the checkpoint mainly for scenarios where you want to resume training exactly from a certain point, like in the middle of regular training, and not necessarily for active learning. In active learning, we often retrain the model with new data, making it less crucial to retain the optimizer state from the initial training.
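As a side note, a typical checkpoint that supports exact resumption bundles the optimizer state with the model state; something like the following (the file name and dictionary keys are just an example, not necessarily what the repo uses):

```python
import torch

model = torch.nn.Linear(4, 2)                     # placeholder model
optimizer = torch.optim.Adam(model.parameters())  # placeholder optimizer

# Saving: include the optimizer state so training can resume mid-run.
torch.save(
    {
        "epoch": 5,
        "batch_index": 7,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint_example.pt",
)

# Resuming regular training: restore both model and optimizer.
ckpt = torch.load("checkpoint_example.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])

# Starting active learning: typically only the model weights are reloaded,
# and a fresh optimizer is created for the new round of training.
```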
Thanks, I will sync my latest code once it passes the debug stage on my CentOS system. I will leave it running on Google Colab for the next few weeks.
Dear Luis @lrm22005 , Regarding getting the predicted label in PyTorch, may I ask why you used "

I saw that the PyTorch Quickstart tutorial uses this way for calculating the testing accuracy (https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html):

Or PyTorch uses this way for calculating the training accuracy ():

I did not see PyTorch use the mean of argmax. Is it because you are calculating the Gaussian Process likelihood?
Hey Dr. Cassey, The reason I'm using it is tied to how Gaussian Processes make predictions. In a standard neural network, you typically take the argmax of the output directly. However, with GPs, particularly in the GPyTorch framework, the output is a predictive distribution rather than a vector of logits. This approach integrates the probabilistic nature of GPs, where each prediction is a distribution rather than a single point estimate. By taking the mean of the distribution and then the argmax, we are essentially finding the most likely class in a probabilistically informed way.
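A small illustration of the two conventions (the tensors here are random stand-ins; the second block only mimics the shape of a GP predictive mean, it is not real GPyTorch output):

```python
import torch

# Standard classifier: the network outputs logits, and the predicted class
# is the argmax over the class dimension (as in the PyTorch Quickstart tutorial).
logits = torch.randn(8, 4)           # batch of 8 samples, 4 classes
nn_pred = logits.argmax(dim=1)

# GP-style classifier: the model outputs a distribution per class.
# First reduce the distribution to its mean, then take the argmax.
predictive_mean = torch.randn(8, 4)  # stand-in for the predictive distribution's mean
gp_pred = predictive_mean.argmax(dim=1)

print(nn_pred, gp_pred)
```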
How to resume the dataloader from a certain batch
0. Replicate this issue
The version of code I made to replicate this issue:
https://github.com/Cassey2016/Pulsewatch_labeling/tree/6409a725bcc339b1135c6be15c223c2cd19dec12
Checkpoint saved:
Model checkpoint and dataloader checkpoint. (Shared with lrmercadod@gmail.com)
https://drive.google.com/drive/folders/1BIdOh3lICu__Jj4EZriln4y7bJ_ZaYEj?usp=drive_link
Luis' code version:
https://github.uconn.edu/lrm22005/B_ML_Project/tree/528c4be6ef96b0a42b2ae0a1782b89609e8c8f9d
1. Output
1. Run from the beginning
I ran ss_main.py: Open my Colab Notebook on Github.com.

You can see the line Debug: current_batch_index None. Its value was never changed inside the data loading process. I stopped the run at epoch 5, batch 7:
2. Resume from checkpoints
The epoch started from 5, but the batch index did not. It started from 0 again in epoch 5.
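One simple way to make the batch index resume as well (a sketch of the general idea, not the repo's actual fix) is to store both the epoch and the batch index in the checkpoint and skip the already-processed batches of the interrupted epoch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def run_epochs(loader, num_epochs, start_epoch=0, start_batch=0):
    """Resume at (start_epoch, start_batch) instead of restarting the epoch at batch 0."""
    for epoch in range(start_epoch, num_epochs):
        for batch_idx, batch in enumerate(loader):
            # Skip batches of the interrupted epoch that were already processed.
            if epoch == start_epoch and batch_idx < start_batch:
                continue
            # ... training step and periodic checkpoint saving would go here ...
            print(f"epoch {epoch}, batch {batch_idx}")

# Example: the checkpoint says training stopped at epoch 5, batch 7.
loader = DataLoader(TensorDataset(torch.arange(100).float()), batch_size=10)
run_epochs(loader, num_epochs=6, start_epoch=5, start_batch=7)
```

Note that this still loads the skipped batches before discarding them, so for a large dataset the resumable-sampler approach sketched earlier in this thread is cheaper.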