aws
9:30am -> Worked on downsampling DPGR samples in order to build the training data in a reasonable time.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch sample_pair_accuracy.sbatch
Submitted batch job 241
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
241 gpu-a100- dpgrSamp hqin R 0:03 1 gpu-a100-2
210 gpu-a100- fine_lr5 malam007 R 8:56:48 1 gpu-a100-1
211 gpu-a100- fine_lr5 malam007 R 8:56:48 1 gpu-a100-1
212 gpu-a100- fine_lr1 malam007 R 8:56:48 1 gpu-a100-1
213 gpu-a100- fine_lr1 malam007 R 8:56:48 1 gpu-a100-1
214 gpu-a100- fine_lr2 malam007 R 8:56:48 1 gpu-a100-1
215 gpu-a100- fine_lr2 malam007 R 8:56:48 1 gpu-a100-1
199 gpu-h100- h100-1Ma hqin CF 0:48 1 gpu-h100-1-1
198 gpu-h100- dpgrVari hqin CF 0:48 1 gpu-h100-8-1
13:08, ran the sampling accuracy check to verify the (location, time window, DPGR rate) tuples in the mapping csv.gz file.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 2066 tuples from the DPGR summary were not found in the pairs file.
Checked 100 sampled rows.
No issues found in the sampled subset.
Wrote per-row validation results to sample_pair_validation_results.csv
Wrote tuple comparison summary to logs/sample_pair_tuple_check.241.txt
Finished at: Thu Dec 11 18:25:36 UTC 2025
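The tuple check compares the unique (Location, Time Window, DPGR (Slope)) triples found in the pairs file against those in the DPGR summary. A minimal sketch of that comparison, assuming pandas and that both files carry columns with exactly those names (the real sample_pair_accuracy.py may differ):

# Sketch only; file paths and column names are assumptions taken from the log above.
import pandas as pd

cols = ["Location", "Time Window", "DPGR (Slope)"]
pairs = pd.read_csv("dpgr_variant_pairs-2025-12-09.csv.gz", usecols=cols)
summary = pd.read_csv("dpgr_analysis_summary.csv", usecols=cols)

pair_tuples = set(pairs[cols].itertuples(index=False, name=None))
summary_tuples = set(summary[cols].itertuples(index=False, name=None))
missing = summary_tuples - pair_tuples

print(f"Found {len(pair_tuples):,} unique tuples in the pairs file.")
print(f"DPGR summary contains {len(summary_tuples):,} unique tuples.")
print(f"Warning: {len(missing)} summary tuples were not found in the pairs file.")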
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.241.txt
Found 2 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-09.csv.gz.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 2066 tuples from the DPGR summary were not found in the pairs file.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git add logs/sample_pair_tuple_check.241.txt
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git commit -m 'only 2 tuples in mapping results, error found'
[main f68e0a6] only 2 tuples in mapping results, error found
1 file changed, 3 insertions(+)
create mode 100644 logs/sample_pair_tuple_check.241.txt
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git push
13:35. Found that the Dec 9 mapping output was wrong: only 2 tuples of (location, time window, DPGR) were present. So this is a case where my quality check and the feedback I gave Codex were incomplete.
So, I need to redo the mapping.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/sample_pair_accuracy_small.sbatch
Submitted batch job 242
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
242 gpu-a100- dpgrSamp hqin CF 0:02 1 gpu-a100-2
210 gpu-a100- fine_lr5 malam007 R 9:42:42 1 gpu-a100-1
211 gpu-a100- fine_lr5 malam007 R 9:42:42 1 gpu-a100-1
212 gpu-a100- fine_lr1 malam007 R 9:42:42 1 gpu-a100-1
213 gpu-a100- fine_lr1 malam007 R 9:42:42 1 gpu-a100-1
214 gpu-a100- fine_lr2 malam007 R 9:42:42 1 gpu-a100-1
215 gpu-a100- fine_lr2 malam007 R 9:42:42 1 gpu-a100-1
199 gpu-h100- h100-1Ma hqin CF 1:42 1 gpu-h100-1-1
198 gpu-h100- dpgrVari hqin CF 1:42 1 gpu-h100-8-1
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
13:54, submitted the above Slurm job to test the revised mapping code.
14:18, revised the Slurm job and resubmitted.
Noticed that the default Python is now 3.12.12, so my Python 3.12 venv is unnecessary now.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat scripts/sample_pair_accuracy_small.sbatch
#!/bin/bash
#SBATCH --job-name=dpgrSampleAccuracySmall
#SBATCH --time=00:20:00
#SBATCH --partition=gpu-a100-8
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --output=logs/sample_pair_accuracy.small.%j.out
# Small-scale Slurm job to generate a tiny mapping and sample-check it.
set -euo pipefail
# Default to the submission directory so the script works across hosts.
WORKDIR="${SLURM_SUBMIT_DIR:-$(pwd)}"
# Fall back to a home-relative clone if the submission directory is missing
# (e.g., when submitting from a container-only path like /workspace).
if [ ! -d "$WORKDIR" ] && [ -d "$HOME/dpgr_build_training_data" ]; then
    WORKDIR="$HOME/dpgr_build_training_data"
fi
cd "$WORKDIR"
# Optional: activate your environment (uncomment and adjust as needed).
# source ~/miniforge3/etc/profile.d/conda.sh
# conda activate dpgr310
PAIRS_DATE=$(date +%Y-%m-%d)
PAIRS_FILE="dpgr_variant_pairs-${PAIRS_DATE}.csv"
python - <<'PY'
import generate_dpgr_variant_mapping as g
g.MAX_CANDIDATES_PER_VARIANT = 500
g.MAX_TOTAL_PAIRS = 2000
g.debug = 1
g.main()
PY
mkdir -p logs
python scripts/sample_pair_accuracy.py \
    --pairs "${PAIRS_FILE}" \
    --dpgr dpgr_analysis_summary.csv \
    --metadata metadata1K.tsv \
    --sample-size 25 \
    --random-state 7 \
    --tuple-output "logs/sample_pair_tuple_check.$SLURM_JOB_ID.txt"
14:26, ran the above Slurm job.
14:33, un-commented the virtual environment activation and ran sample_pair_accuracy_small.sbatch again.
16:31. Deleted the old output and ran sbatch scripts/generate_dpgr_variant_mapping_full.sbatch.
17:34, discovered global parameters in the Python file that set the limits:
debug = 0
MAX_CANDIDATES_PER_VARIANT = 50_000
MAX_TOTAL_PAIRS = 500_000
MAX_PAIRS_PER_DPGR = 250
Set MAX_PAIRS_PER_DPGR = 100, MAX_TOTAL_PAIRS = 1,000,000, and MAX_CANDIDATES_PER_VARIANT = 1,000.
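The mapping code itself is not pasted here, so the following is only a hypothetical sketch of how caps like these are commonly applied when sampling pairs; the names mirror the globals above, but the loop structure is my assumption, not the actual generate_dpgr_variant_mapping.py logic.

# Hypothetical sketch: per-candidate-pool, per-DPGR, and global caps gating pair sampling.
# Not the actual generate_dpgr_variant_mapping.py implementation.
import random

MAX_CANDIDATES_PER_VARIANT = 1_000
MAX_TOTAL_PAIRS = 1_000_000
MAX_PAIRS_PER_DPGR = 100

def sample_pairs(dpgr_groups, rng=None):
    """dpgr_groups: dict mapping a (Location, Time Window, DPGR) key -> list of candidate pairs."""
    rng = rng or random.Random(7)
    sampled = []
    for dpgr_key, candidates in dpgr_groups.items():
        pool = candidates[:MAX_CANDIDATES_PER_VARIANT]   # cap the candidate pool
        take = min(MAX_PAIRS_PER_DPGR, len(pool))        # cap pairs kept per DPGR tuple
        sampled.extend((dpgr_key, pair) for pair in rng.sample(pool, take))
        if len(sampled) >= MAX_TOTAL_PAIRS:              # stop at the global budget
            break
    return sampled[:MAX_TOTAL_PAIRS]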
17:39, (base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_full.sbatch
Submitted batch job 250
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ scontrol show job 250
JobId=250 JobName=dpgrVariantMapFull
UserId=hqin(1004) GroupId=hqin(1004) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:05:25 TimeLimit=08:00:00 TimeMin=N/A
SubmitTime=2025-12-11T22:39:06 EligibleTime=2025-12-11T22:39:06
AccrueTime=2025-12-11T22:39:06
StartTime=2025-12-11T22:46:35 EndTime=2025-12-12T06:46:35 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-11T22:39:06 Scheduler=Backfill
Partition=gpu-a100-8 AllocNode:Sid=ip-10-3-4-198:31406
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu-a100-4
BatchHost=gpu-a100-4
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=8,mem=1120665M,node=1,billing=8
AllocTRES=cpu=8,node=1,billing=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=/home/hqin/dpgr_build_training_data/scripts/generate_dpgr_variant_mapping_full.sbatch
WorkDir=/home/hqin/dpgr_build_training_data
StdErr=/home/hqin/dpgr_build_training_data/logs/dpgr_variant_mapping.full.250.out
StdIn=/dev/null
StdOut=/home/hqin/dpgr_build_training_data/logs/dpgr_variant_mapping.full.250.out
TresPerTask=cpu=8
17:55
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cut -d ',' -f 2 /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv | tail -n +2 | sort | uniq
Africa
Asia
Europe
North America
Oceania
South America
17:57, it seems there are 1,639 unique (Location, Time Window) tuples:
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cut -d ',' -f 2,3 /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv | tail -n +2 | sort | uniq | wc -l
1639
The Python check on the full triples found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.full.250.out
Saved 169942 sampled pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-2025-12-11.csv
Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-11.csv.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 35 tuples from the DPGR summary were not found in the pairs file.
Checked 200 sampled rows.
Found 1 rows with potential issues.
Issue breakdown:
- Metadata collection date missing or invalid: 1
- Collection date could not be parsed: 1
Examples (up to 10):
DPGR row 13 (EPI_ISL_1501662 / EPI_ISL_2163883) -> Metadata collection date missing or invalid, Collection date could not be parsed
Wrote per-row validation results to logs/sample_pair_validation_results.full.250.csv
Wrote tuple comparison summary to logs/sample_pair_tuple_check.full.250.txt
no changes added to commit (use "git add" and/or "git commit -a")
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.full.250.txt
Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-2025-12-11.csv.
DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.
Warning: 35 tuples from the DPGR summary were not found in the pairs file.
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
Of the 200 sampled entries, one row has an irregular collection date, so the potential error rate is 1/200 ≈ 0.5%.
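For a rough sense of the uncertainty around that 0.5% point estimate, here is a small Wilson-interval sketch; this is my own back-of-the-envelope addition, not part of the pipeline.

# Wilson 95% confidence interval for 1 flagged row out of 200 sampled rows.
from math import sqrt

def wilson_interval(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

low, high = wilson_interval(1, 200)
print(f"point estimate: {1/200:.3%}, 95% CI: {low:.2%} to {high:.2%}")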
20:51, committed.
22:30
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue -u hqin
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
259 gpu-A10G dpgrPair hqin CF 1:04 1 gpu-A10G-1
199 gpu-h100- h100-1Ma hqin PD 0:00 1 (BeginTime)
198 gpu-h100- dpgrVari hqin PD 0:00 1 (BeginTime)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh scripts/debug_submit_slice_jobs.sh
Prepared 33 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
DRY RUN: sbatch --job-name dpgrRows000001-000002 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000001-000002.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000001-000002.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000001-000002 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000001-000002.tsv --pair-prefix ROWS000001-000002_ --shard-size 500 --chunk-size 1000'
DRY RUN: sbatch --job-name dpgrRows000003-000004 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000003-000004.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000003-000004.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000003-000004 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000003-000004.tsv --pair-prefix ROWS000003-000004_ --shard-size 500 --chunk-size 1000'
DRY RUN: sbatch --job-name dpgrRows000005-000006 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000005-000006.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000005-000006.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000005-000006 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000005-000006.tsv --pair-prefix ROWS000005-000006_ --shard-size 500 --chunk-size 1000'
DRY RUN: sbatch --job-name dpgrRows000007-000008 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000007-000008.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000007-000008.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000007-000008 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000007-000008.tsv --pair-prefix ROWS000007-000008_ --shard-size 500 --chunk-size 1000'
DRY RUN: sbatch --job-name dpgrRows000009-000010 --partition gpu-a100-8 --cpus-per-task 2 --time 00:15:00 --output logs/build_fasta_pair_000009-000010.%j.out --wrap 'cd /home/hqin/dpgr_build_training_data && source ~/miniforge3/etc/profile.d/conda.sh && conda activate dpgr310 && python scripts/build_fasta_pair_and_dpgr.py --pairs-path tmp/variant_pair_slices_debug/dpgr_variant_pairs-2025-12-11-small_rows000009-000010.csv --sequences-path data/raw/sequences.fasta --metadata-path metadata1K.tsv --output-dir data/processed/row_slices_debug/rows000009-000010 --label-path data/labels/row_slices_debug/dpgr_pair_labels_rows000009-000010.tsv --pair-prefix ROWS000009-000010_ --shard-size 500 --chunk-size 1000'
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$
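debug_submit_slice_jobs.sh prepares 33 two-row slices of the pairs file and then, in dry-run mode, prints the sbatch commands it would submit. A rough Python sketch of the slicing half; the paths and the 2-rows-per-slice size are assumptions read off the dry-run output above, not taken from the script itself.

# Hypothetical sketch of writing fixed-size row slices of the pairs CSV.
import os
import pandas as pd

pairs_path = "dpgr_variant_pairs-2025-12-11-small.csv"
out_dir = "tmp/variant_pair_slices_debug"
rows_per_slice = 2

os.makedirs(out_dir, exist_ok=True)
df = pd.read_csv(pairs_path)
for start in range(0, len(df), rows_per_slice):
    end = min(start + rows_per_slice, len(df))
    name = f"dpgr_variant_pairs-2025-12-11-small_rows{start + 1:06d}-{end:06d}.csv"
    df.iloc[start:end].to_csv(os.path.join(out_dir, name), index=False)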
1:13am
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull
Warning: Permanently added 'github.com,140.82.112.3' (ECDSA) to the list of known hosts.
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (6/6), 5.10 KiB | 30.00 KiB/s, done.
From github.com:QinLab/dpgr_build_training_data
ee52044..118ab5b main -> origin/main
* [new branch] codex/turn-off-dry-run-in-debug_submit_slice_jobs.sh -> origin/codex/turn-off-dry-run-in-debug_submit_slice_jobs.sh
Updating ee52044..118ab5b
Fast-forward
README.md | 4 ++--
scripts/debug_submit_slice_jobs.sh | 10 +++++-----
2 files changed, 7 insertions(+), 7 deletions(-)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sh scripts/debug_submit_slice_jobs.sh
Prepared 33 slices in tmp/variant_pair_slices_debug
Submitting only first 5 slice(s) due to --max-jobs
Submitted dpgrRows000001-000002: Submitted batch job 262
Submitted dpgrRows000003-000004: Submitted batch job 263
Submitted dpgrRows000005-000006: Submitted batch job 264
Submitted dpgrRows000007-000008: Submitted batch job 265
Submitted dpgrRows000009-000010: Submitted batch job 266
Concatenate label outputs after jobs complete:
cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv
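One thing to watch with the suggested cat concatenation: if each per-slice label TSV carries its own header row, plain cat will repeat that header in the merged file. A small pandas sketch that keeps a single header, assuming the per-slice files are header-bearing TSVs:

# Merge per-slice label TSVs into one file with a single header row.
# Assumes each slice file has its own header; adjust if they are headerless.
import glob
import pandas as pd

parts = sorted(glob.glob("data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv"))
merged = pd.concat((pd.read_csv(p, sep="\t") for p in parts), ignore_index=True)
merged.to_csv("data/labels/row_slices_debug/dpgr_pair_labels_all.tsv", sep="\t", index=False)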
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
261 gpu-A10G dpgrPair hqin R 15:58 1 gpu-A10G-1
262 gpu-a100- dpgrRows hqin CF 0:07 1 gpu-a100-2
263 gpu-a100- dpgrRows hqin CF 0:07 1 gpu-a100-2
264 gpu-a100- dpgrRows hqin CF 0:07 1 gpu-a100-2
265 gpu-a100- dpgrRows hqin CF 0:07 1 gpu-a100-2
266 gpu-a100- dpgrRows hqin CF 0:07 1 gpu-a100-2
210 gpu-a100- fine_lr5 malam007 R 21:00:58 1 gpu-a100-1
211 gpu-a100- fine_lr5 malam007 R 21:00:58 1 gpu-a100-1
212 gpu-a100- fine_lr1 malam007 R 21:00:58 1 gpu-a100-1
213 gpu-a100- fine_lr1 malam007 R 21:00:58 1 gpu-a100-1
214 gpu-a100- fine_lr2 malam007 R 21:00:58 1 gpu-a100-1
215 gpu-a100- fine_lr2 malam007 R 21:00:58 1 gpu-a100-1
199 gpu-h100- h100-1Ma hqin PD 0:00 1 (BeginTime)
198 gpu-h100- dpgrVari hqin PD 0:00 1 (BeginTime)
(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$