Friday, December 12, 2025

dpgr_build_training_data: building DPGR training data on AWS.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000009-000010.266.out

[2025-12-12T06:35:35.017+00:00] error: *** JOB 266 ON gpu-a100-2 CANCELLED AT 2025-12-12T06:35:35 DUE TO TIME LIMIT ***

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.114.3' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 2.71 KiB | 23.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   118ab5b..f0259cf  main                                  -> origin/main

 * [new branch]      codex/remove-time-limit-for-slurm-job -> origin/codex/remove-time-limit-for-slurm-job

Updating 118ab5b..f0259cf

Fast-forward

 scripts/debug_submit_slice_jobs.sh | 2 --

 1 file changed, 2 deletions(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh scripts/debug_submit_slice_jobs.sh

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 268

Submitted dpgrRows000003-000004: Submitted batch job 269

Submitted dpgrRows000005-000006: Submitted batch job 270

Submitted dpgrRows000007-000008: Submitted batch job 271

Submitted dpgrRows000009-000010: Submitted batch job 272


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

10:01

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               270 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               271 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               272 gpu-a100- dpgrRows     hqin  R      50:17      1 gpu-a100-2

               268 gpu-a100- dpgrRows     hqin  R      50:18      1 gpu-a100-2

               269 gpu-a100- dpgrRows     hqin  R      50:18      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007  R    6:22:04      1 gpu-a100-2

               210 gpu-a100- fine_lr5 malam007  R 1-05:50:04      1 gpu-a100-1

               211 gpu-a100- fine_lr5 malam007  R 1-05:50:04      1 gpu-a100-1

               212 gpu-a100- fine_lr1 malam007  R 1-05:50:04      1 gpu-a100-1

               213 gpu-a100- fine_lr1 malam007  R 1-05:50:04      1 gpu-a100-1

               214 gpu-a100- fine_lr2 malam007  R 1-05:50:04      1 gpu-a100-1

               215 gpu-a100- fine_lr2 malam007  R 1-05:50:04      1 gpu-a100-1

               199 gpu-h100- h100-1Ma     hqin PD       0:00      1 (BeginTime)

               198 gpu-h100- dpgrVari     hqin PD       0:00      1 (BeginTime)

10:39am. 100 jobs submitted.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.114.4' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 6, done.

remote: Counting objects: 100% (6/6), done.

remote: Compressing objects: 100% (6/6), done.

remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (6/6), 5.84 KiB | 45.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   41290d9..153021a  main                                 -> origin/main

 * [new branch]      codex/fix-slurm-job-submission-issue -> origin/codex/fix-slurm-job-submission-issue

Updating 41290d9..153021a

Fast-forward

 README.md                            |  3 ++-

 scripts/submit_slice_jobs_100x100.sh | 20 +++++++++++++++++++-

 2 files changed, 21 insertions(+), 2 deletions(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/submit_slice_jobs_100x100.sh &

[1] 29301

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ Set PARTITION to a valid Slurm queue before running this helper.


Examples:

  PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh

  DRY_RUN=1 PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh


Tip: avoid calling the script with `sh`—use `bash` so the environment check works.


[1]+  Exit 1                  bash scripts/submit_slice_jobs_100x100.sh

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ PARTITION=gpu-a100-8 bash scripts/submit_slice_jobs_100x100.sh

Prepared 1700 slices in tmp/variant_pair_slices_100x100

Submitting only first 100 slice(s) due to --max-jobs

Submitted dpgrRows000001-000100: Submitted batch job 273

Submitted dpgrRows000101-000200: Submitted batch job 274

Submitted dpgrRows000201-000300: Submitted batch job 275

Submitted dpgrRows000301-000400: Submitted batch job 276

Submitted dpgrRows000401-000500: Submitted batch job 277

Submitted dpgrRows000501-000600: Submitted batch job 278

.....

Submitted dpgrRows008801-008900: Submitted batch job 361

Submitted dpgrRows008901-009000: Submitted batch job 362

Submitted dpgrRows009001-009100: Submitted batch job 363

Submitted dpgrRows009101-009200: Submitted batch job 364

Submitted dpgrRows009201-009300: Submitted batch job 365

Submitted dpgrRows009301-009400: Submitted batch job 366

Submitted dpgrRows009401-009500: Submitted batch job 367

Submitted dpgrRows009501-009600: Submitted batch job 368

Submitted dpgrRows009601-009700: Submitted batch job 369

Submitted dpgrRows009701-009800: Submitted batch job 370

Submitted dpgrRows009801-009900: Submitted batch job 371

Submitted dpgrRows009901-010000: Submitted batch job 372


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_100x100/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_100x100/dpgr_pair_labels_all.tsv

10:45am. Found that the Python code uses the EPI ID to parse the FASTA sequences, which is wrong. Aborted all Slurm jobs. Bug found.

Then corrected the Python code and reran a small test.
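For the record, a minimal sketch of the kind of fix involved, assuming GISAID-style pipe-delimited FASTA headers that embed the EPI_ISL accession as a field rather than as the leading token (the actual parsing in `build_fasta_pair_and_dpgr.py` may differ):

```python
import re

# Assumed header shape: >hCoV-19/some/name/2021|EPI_ISL_1234567|2021-01-01
# The EPI accession must be extracted from the header, not assumed to BE the header.
EPI_RE = re.compile(r"(EPI_ISL_\d+)")

def index_fasta_by_accession(path):
    """Map each EPI accession to the byte offset of its record in the FASTA file."""
    index = {}
    offset = 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                m = EPI_RE.search(line.decode("ascii", "replace"))
                if m:
                    index[m.group(1)] = offset
            offset += len(line)
    return index
```

Keying the index on the extracted `EPI_ISL_*` field instead of the whole header is what lets later lookups by accession ID succeed.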


1:58pm. Retried the slice test run.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 375

Submitted dpgrRows000003-000004: Submitted batch job 376

Submitted dpgrRows000005-000006: Submitted batch job 377

Submitted dpgrRows000007-000008: Submitted batch job 378

Submitted dpgrRows000009-000010: Submitted batch job 379


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue -u hqin

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               373  gpu-A10G dpgrPair     hqin  R    1:46:20      1 gpu-A10G-1

               379 gpu-a100- dpgrRows     hqin  R       0:52      1 gpu-a100-1

               375 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               376 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               377 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1

               378 gpu-a100- dpgrRows     hqin  R       0:53      1 gpu-a100-1


*** Need to come back to check whether the EPI accession IDs have been mapped to the right FASTA sequence headers.

2:20pm. Same mistake. EPI accession IDs not mapped to FASTA headers.
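A preflight coverage check like the sketch below (a hypothetical helper, not part of the current scripts) would catch this mismatch before any Slurm job is submitted, producing the same kind of "Missing FASTA index entries" message the pipeline logs:

```python
def check_accession_coverage(requested, fasta_index, sample_size=5):
    """Fail fast if any requested accession is absent from the FASTA index.

    Returns an empty list when coverage is complete; raises with a small
    sample of missing IDs otherwise.
    """
    missing = sorted(a for a in requested if a not in fasta_index)
    if missing:
        sample = ", ".join(missing[:sample_size])
        raise ValueError(
            f"Missing FASTA index entries for {len(missing)} accession IDs "
            f"(sample: {sample})."
        )
    return missing
```

Running this against the pair table right after building the index would have aborted the run at submission time instead of two hours into job 373.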

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_and_dpgr_debug.373.out

Node: ip-10-3-135-132.ec2.internal

Job ID: 373

Starting DPGR FASTA/DPGR pairing (debug pilot) at: Fri Dec 12 17:13:07 UTC 2025

Python in use: /home/hqin/miniforge3/envs/dpgr310/bin/python

Python 3.10.19

Sampled 10 random pairs to tmp/debug_random_pairs_10.csv.gz

Built FASTA index with 15971302 entries from data/raw/sequences.fasta.

FASTA index lookup failed; falling back to sequential scan for requested accessions (Missing FASTA index entries for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_18352003, EPI_ISL_18465521, EPI_ISL_19009958, EPI_ISL_19033150).).

Traceback (most recent call last):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 668, in <module>

    main()

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 657, in main

    shard_stats = write_records_and_labels(

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 431, in write_records_and_labels

    for idx, record in enumerate(records, start=1):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 637, in indexed_records

    sequences = resolve_sequences(batch, requested)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 580, in resolve_sequences

    return load_sequences_from_fasta_file(args.sequences_path, requested, accession_to_virus)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 207, in load_sequences_from_fasta_file

    raise PairAssemblyError(

__main__.PairAssemblyError: Missing FASTA records for 20 accession IDs (sample: EPI_ISL_15266364, EPI_ISL_15586069, EPI_ISL_17229579, EPI_ISL_17413288, EPI_ISL_18057614).

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

For fast prototyping, I need to generate synthetic data sets with FASTA sequence headers mapped to EPI IDs, and then generate synthetic pairs with fake DPGR rates.

This was done with Codex in the Codex environment.
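A minimal sketch of such a synthetic-data generator, assuming the same GISAID-style headers and a simple three-column pair table (file layout and column names here are illustrative, not the repo's actual schema):

```python
import random

def make_synthetic_dataset(fasta_path, pairs_path, n_seqs=10, n_pairs=5,
                           seq_len=60, seed=0):
    """Write a toy FASTA keyed by fake EPI accessions, plus random
    accession pairs labeled with fake DPGR rates."""
    rng = random.Random(seed)
    accs = [f"EPI_ISL_{i:07d}" for i in range(1, n_seqs + 1)]
    with open(fasta_path, "w") as fa:
        for acc in accs:
            seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
            # Header embeds the accession so the real parser can find it.
            fa.write(f">hCoV-19/synthetic/{acc}|{acc}|2025-01-01\n{seq}\n")
    with open(pairs_path, "w") as out:
        out.write("accession_a\taccession_b\tdpgr\n")
        for _ in range(n_pairs):
            a, b = rng.sample(accs, 2)
            out.write(f"{a}\t{b}\t{rng.uniform(-1.0, 1.0):.4f}\n")
    return accs
```

Because every synthetic header is guaranteed to contain its EPI ID, any remaining lookup failures on this data point at the parser rather than the data.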

5:47pm. Back at AWS to test the revised code.

 create mode 100644 tmp/virus_name_accessions_20252016.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch build_fasta_pair_and_dpgr_debug.sbatch

Submitted batch job 380

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               380  gpu-A10G dpgrPair     hqin CF       0:04      1 gpu-A10G-1

               267 gpu-a100- fine_lr1 malam007  R   14:03:34      1 gpu-a100-2

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

6:15pm

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.113.3' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 2.90 KiB | 19.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   f3e7958..8dbe9c5  main                                                        -> origin/main

 * [new branch]      codex/update-debug_submit_slice_jobs.sh-to-process-new-data -> origin/codex/update-debug_submit_slice_jobs.sh-to-process-new-data

Updating f3e7958..8dbe9c5

Fast-forward

 scripts/debug_submit_slice_jobs.sh | 4 +++-

 1 file changed, 3 insertions(+), 1 deletion(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash scripts/debug_submit_slice_jobs.sh 

Prepared 33 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 381

Submitted dpgrRows000003-000004: Submitted batch job 382

Submitted dpgrRows000005-000006: Submitted batch job 383

Submitted dpgrRows000007-000008: Submitted batch job 384

Submitted dpgrRows000009-000010: Submitted batch job 385


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

Nothing happened. Switched back to data/raw/sequences.fasta.

Somehow, the build Python code no longer works.

The previous version worked on two test pairs:
https://github.com/QinLab/dpgr_build_training_data/blob/cfadceeefb323ca76249361e3d16ecfb7abdefea/scripts/build_fasta_pair_and_dpgr.py

Further verification shows that the testing data do not match.
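One way to pin down where two runs diverge is a row-level diff of the label tables, keyed on the accession pair (a hypothetical helper; column names are assumed from the pipeline's TSV output):

```python
import csv

def diff_label_tables(path_a, path_b, key_cols=("accession_a", "accession_b")):
    """Compare two label TSVs keyed on accession pairs.

    Returns (keys only in A, keys only in B), each sorted.
    """
    def load(path):
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            return {tuple(row[c] for c in key_cols): row for row in reader}
    a, b = load(path_a), load(path_b)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))
```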






 
