DPGR on AWS
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000009-000010.266.out
[2025-12-12T06:35:35.017+00:00] error: *** JOB 266 ON gpu-a100-2 CANCELLED AT 2025-12-12T06:35:35 DUE TO TIME LIMIT ***
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull
Warning: Permanently added 'github.com,140.82.114.3' (ECDSA) to the list of known hosts.
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (5/5), 2.71 KiB | 23.00 KiB/s, done.
From github.com:QinLab/dpgr_build_training_data
118ab5b..f0259cf main -> origin/main
* [new branch] codex/remove-time-limit-for-slurm-job -> origin/codex/remove-time-limit-for-slurm-job
Updating 118ab5b..f0259cf
Fast-forward
scripts/debug_submit_slice_jobs.sh | 2 --
1 file changed, 2 deletions(-)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh scripts/debug_submit_slice_jobs.sh
Prepared 33 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
Submitted dpgrRows000001-000002: Submitted batch job 268
Submitted dpgrRows000003-000004: Submitted batch job 269
Submitted dpgrRows000005-000006: Submitted batch job 270
Submitted dpgrRows000007-000008: Submitted batch job 271
Submitted dpgrRows000009-000010: Submitted batch job 272
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
10:01
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
270 gpu-a100- dpgrRows hqin R 50:17 1 gpu-a100-2
271 gpu-a100- dpgrRows hqin R 50:17 1 gpu-a100-2
272 gpu-a100- dpgrRows hqin R 50:17 1 gpu-a100-2
268 gpu-a100- dpgrRows hqin R 50:18 1 gpu-a100-2
269 gpu-a100- dpgrRows hqin R 50:18 1 gpu-a100-2
267 gpu-a100- fine_lr1 malam007 R 6:22:04 1 gpu-a100-2
210 gpu-a100- fine_lr5 malam007 R 1-05:50:04 1 gpu-a100-1
211 gpu-a100- fine_lr5 malam007 R 1-05:50:04 1 gpu-a100-1
212 gpu-a100- fine_lr1 malam007 R 1-05:50:04 1 gpu-a100-1
213 gpu-a100- fine_lr1 malam007 R 1-05:50:04 1 gpu-a100-1
214 gpu-a100- fine_lr2 malam007 R 1-05:50:04 1 gpu-a100-1
215 gpu-a100- fine_lr2 malam007 R 1-05:50:04 1 gpu-a100-1
199 gpu-h100- h100-1Ma hqin PD 0:00 1 (BeginTime)
198 gpu-h100- dpgrVari hqin PD 0:00 1 (BeginTime)
10:39am. 100 jobs submitted.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull
Warning: Permanently added 'github.com,140.82.114.4' (ECDSA) to the list of known hosts.
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (6/6), 5.84 KiB | 45.00 KiB/s, done.
From github.com:QinLab/dpgr_build_training_data
41290d9..153021a main -> origin/main
* [new branch] codex/fix-slurm-job-submission-issue -> origin/codex/fix-slurm-job-submission-issue
Updating 41290d9..153021a
Fast-forward
README.md | 3 ++-
scripts/submit_slice_jobs_100x100.sh | 20 +++++++++++++++++++-
2 files changed, 21 insertions(+), 2 deletions(-)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/submit_slice_jobs_100x100.sh &
[1] 29301
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ Set PARTITION to a valid Slurm queue before running this helper.
Examples:
PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh
DRY_RUN=1 PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh
Tip: avoid calling the script with `sh`—use `bash` so the environment check works.
[1]+ Exit 1 bash scripts/submit_slice_jobs_100x100.sh
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh
Prepared 1700 slices in tmp/variant_pair_slices_100x100
Submitting only first 100 slice(s) due to --max-jobs
Submitted dpgrRows000001-000100: Submitted batch job 273
Submitted dpgrRows000101-000200: Submitted batch job 274
Submitted dpgrRows000201-000300: Submitted batch job 275
Submitted dpgrRows000301-000400: Submitted batch job 276
Submitted dpgrRows000401-000500: Submitted batch job 277
Submitted dpgrRows000501-000600: Submitted batch job 278
.....
Submitted dpgrRows008801-008900: Submitted batch job 361
Submitted dpgrRows008901-009000: Submitted batch job 362
Submitted dpgrRows009001-009100: Submitted batch job 363
Submitted dpgrRows009101-009200: Submitted batch job 364
Submitted dpgrRows009201-009300: Submitted batch job 365
Submitted dpgrRows009301-009400: Submitted batch job 366
Submitted dpgrRows009401-009500: Submitted batch job 367
Submitted dpgrRows009501-009600: Submitted batch job 368
Submitted dpgrRows009601-009700: Submitted batch job 369
Submitted dpgrRows009701-009800: Submitted batch job 370
Submitted dpgrRows009801-009900: Submitted batch job 371
Submitted dpgrRows009901-010000: Submitted batch job 372
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_100x100/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_100x100/dpgr_pair_labels_all.tsv
10:45. Found a bug: the Python code uses the EPI ID to parse the FASTA sequences, which is wrong. Aborted all Slurm jobs.
Then corrected the Python code and reran a small test.
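The corrected approach, roughly: resolve each EPI accession to the virus name that actually appears in the FASTA headers, instead of searching the headers for the accession itself. A minimal sketch, assuming a two-column tab-separated metadata file (the path and column names here are my guesses, not the script's actual layout):

import csv

def load_accession_to_virus(tsv_path):
    # Map EPI accession IDs to the virus names used in FASTA headers.
    # Assumes 'accession' and 'virus_name' columns; the real metadata
    # file may use different names.
    mapping = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            mapping[row["accession"]] = row["virus_name"]
    return mapping

# Look up the header key for a requested accession instead of
# grepping FASTA headers for the EPI ID itself.
accession_to_virus = load_accession_to_virus("tmp/virus_name_accessions.tsv")
header_key = accession_to_virus.get("EPI_ISL_15266364")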
1:58pm. Retrying the slice test run.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh
Prepared 33 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
Submitted dpgrRows000001-000002: Submitted batch job 375
Submitted dpgrRows000003-000004: Submitted batch job 376
Submitted dpgrRows000005-000006: Submitted batch job 377
Submitted dpgrRows000007-000008: Submitted batch job 378
Submitted dpgrRows000009-000010: Submitted batch job 379
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue -u hqin
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
373 gpu-A10G dpgrPair hqin R 1:46:20 1 gpu-A10G-1
379 gpu-a100- dpgrRows hqin R 0:52 1 gpu-a100-1
375 gpu-a100- dpgrRows hqin R 0:53 1 gpu-a100-1
376 gpu-a100- dpgrRows hqin R 0:53 1 gpu-a100-1
377 gpu-a100- dpgrRows hqin R 0:53 1 gpu-a100-1
378 gpu-a100- dpgrRows hqin R 0:53 1 gpu-a100-1
*** Need to come back and check whether the EPI accession IDs have been mapped to the right FASTA sequence headers.
2:20pm. Same mistake: the EPI accession IDs are not mapped to the FASTA headers.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_and_dpgr_debug.373.out
Node: ip-10-3-135-132.ec2.internal
Job ID: 373
Starting DPGR FASTA/DPGR pairing (debug pilot) at: Fri Dec 12 17:13:07 UTC 2025
Python in use: /home/hqin/miniforge3/envs/dpgr310/bin/python
Python 3.10.19
Sampled 10 random pairs to tmp/debug_random_pairs_10.csv.gz
Built FASTA index with 15971302 entries from data/raw/sequences.fasta.
FASTA index lookup failed; falling back to sequential scan for requested accessions (Missing FASTA index entries for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_18352003, EPI_ISL_18465521, EPI_ISL_19009958, EPI_ISL_19033150).).
Traceback (most recent call last):
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 668, in <module>
main()
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 657, in main
shard_stats = write_records_and_labels(
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 431, in write_records_and_labels
for idx, record in enumerate(records, start=1):
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 637, in indexed_records
sequences = resolve_sequences(batch, requested)
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 580, in resolve_sequences
return load_sequences_from_fasta_file(args.sequences_path, requested, accession_to_virus)
File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 207, in load_sequences_from_fasta_file
raise PairAssemblyError(
__main__.PairAssemblyError: Missing FASTA records for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_15586069, EPI_ISL_17229579, EPI_ISL_17413288, EPI_ISL_18057614).
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
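For reference, the failure pattern in the traceback (index lookup first, sequential scan as fallback, then an error naming a sample of the missing accessions) boils down to something like the sketch below; every name here is hypothetical, not lifted from build_fasta_pair_and_dpgr.py:

class PairAssemblyError(Exception):
    # Raised when requested accessions cannot be matched to FASTA records.
    pass

def resolve_from_index(index, requested):
    # index: dict mapping accession -> sequence; requested: list of accessions.
    found = {acc: index[acc] for acc in requested if acc in index}
    missing = sorted(set(requested) - set(found))
    if missing:
        sample = ", ".join(missing[:5])
        raise PairAssemblyError(
            f"Missing FASTA records for {len(missing)} accession IDs (sample: {sample})."
        )
    return found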
For fast prototyping, I need to generate synthetic data sets, with the FASTA sequence headers mapped to EPI IDs, and then generate synthetic pairs with fake DPGR rates.
These were done with Codex in the Codex environment.
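A minimal sketch of that generator (all file names, the header format, and the rate range are invented for illustration):

import csv
import random

random.seed(0)  # reproducible fake data

accessions = [f"EPI_ISL_{900000 + i}" for i in range(20)]

# Synthetic FASTA: each header carries the EPI accession directly,
# so the pairing code can resolve every requested ID.
with open("tmp/synthetic_sequences.fasta", "w") as fasta:
    for acc in accessions:
        seq = "".join(random.choice("ACGT") for _ in range(120))
        fasta.write(f">{acc}\n{seq}\n")

# Synthetic pairs with fake DPGR rates for prototyping.
with open("tmp/synthetic_pairs.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["accession_a", "accession_b", "dpgr_rate"])
    for _ in range(10):
        a, b = random.sample(accessions, 2)
        writer.writerow([a, b, round(random.uniform(-1.0, 1.0), 4)])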
5:47pm. Back on AWS to test the revised code.
create mode 100644 tmp/virus_name_accessions_20252016.tsv
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch build_fasta_pair_and_dpgr_debug.sbatch
Submitted batch job 380
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
380 gpu-A10G dpgrPair hqin CF 0:04 1 gpu-A10G-1
267 gpu-a100- fine_lr1 malam007 R 14:03:34 1 gpu-a100-2
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
6:15pm
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull
Warning: Permanently added 'github.com,140.82.113.3' (ECDSA) to the list of known hosts.
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (5/5), 2.90 KiB | 19.00 KiB/s, done.
From github.com:QinLab/dpgr_build_training_data
f3e7958..8dbe9c5 main -> origin/main
* [new branch] codex/update-debug_submit_slice_jobs.sh-to-process-new-data -> origin/codex/update-debug_submit_slice_jobs.sh-to-process-new-data
Updating f3e7958..8dbe9c5
Fast-forward
scripts/debug_submit_slice_jobs.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh
Prepared 33 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
Submitted dpgrRows000001-000002: Submitted batch job 381
Submitted dpgrRows000003-000004: Submitted batch job 382
Submitted dpgrRows000005-000006: Submitted batch job 383
Submitted dpgrRows000007-000008: Submitted batch job 384
Submitted dpgrRows000009-000010: Submitted batch job 385
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
Nothing happened. Switched back to data/raw/sequences.fasta.
Somehow, the build Python code does not work anymore.
The previous version worked on two test pairs:
https://github.com/QinLab/dpgr_build_training_data/blob/cfadceeefb323ca76249361e3d16ecfb7abdefea/scripts/build_fasta_pair_and_dpgr.py
Further verification shows that the test data do not match.
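One quick check is to compare the accession IDs the pairs table requests against what the FASTA headers actually contain. A hedged sketch (paths, column names, and the header-parsing rule are all assumptions):

import csv

# Collect every accession the pairs table asks for.
requested = set()
with open("tmp/debug_random_pairs.csv") as handle:
    for row in csv.DictReader(handle):
        requested.add(row["accession_a"])
        requested.add(row["accession_b"])

# Collect the first token of each FASTA header.
headers = set()
with open("data/raw/sequences.fasta") as fasta:
    for line in fasta:
        if line.startswith(">"):
            headers.add(line[1:].strip().split()[0])

missing = requested - headers
print(f"{len(missing)} of {len(requested)} requested accessions are absent from the FASTA headers")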