Sunday, December 14, 2025

dpgr, trim the sequence.fasta input

in dpgr_variant_pairs-full-tuples-2025-12-14.csv 


Its columns are: 

pair_id,DPGR Row,DPGR Location,DPGR Time Window,DPGR (Slope),Variant 1,Variant 2,Variant 1 Accession ID,Variant 2 Accession ID,Variant 1 Virus name,Variant 2 Virus name,Variant 1 Collection date,Variant 2 Collection date,Variant 1 Location,Variant 2 Location,Variant 1 Sequence length,Variant 2 Sequence length


It contains pairs of virus names in the 'Variant 1 Virus name' and 'Variant 2 Virus name' columns.


Write a shell script to find the unique 'Virus name' values and write them to a CSV file named "uniq_virus_names_from_dpgr_variant_pairs-full-tuples-2025-12-14.csv".


#!/bin/bash


INPUT_FILE="dpgr_variant_pairs-full-tuples-2025-12-14.csv"

OUTPUT_FILE="uniq_virus_names_from_dpgr_variant_pairs-full-tuples-2025-12-14.csv"


# Check if the input file exists

if [ ! -f "$INPUT_FILE" ]; then

    echo "Error: Input file '$INPUT_FILE' not found."

    exit 1

fi


# Use awk to select the 10th (Variant 1 Virus name) and 11th (Variant 2 Virus name) columns,

# print each name on its own line, sort them, and keep only the unique names.


# The -F, sets the field delimiter to a comma. This assumes no field before column 11

# contains a quoted comma; otherwise the column positions shift.

# The 'NR>1' condition skips the header row of the input file.

# The echo inside the braces writes a descriptive header line for the output.

# The combined output of the braces is redirected to the output CSV file.


{

    # Print the header (just a descriptive column title for the output)

    echo "Unique Virus Name"


    # Process the data

    awk -F, 'NR>1 {print $10; print $11}' "$INPUT_FILE" |

    sort |

    uniq


} > "$OUTPUT_FILE"


echo "✅ Success: Unique virus names extracted from '$INPUT_FILE' and saved to '$OUTPUT_FILE'."

echo "---"

echo "The file '$OUTPUT_FILE' contains:"

head "$OUTPUT_FILE"

echo "..."
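A Python variant of the same extraction, as a minimal sketch. It reads the two columns by name with the standard csv module, so a quoted comma in an earlier field would not shift the columns the way `awk -F,` can. The function names `unique_virus_names` and `write_names` are made up for this note and are not part of the repo.

```python
import csv


def unique_virus_names(pairs_csv_path):
    """Return the sorted unique names found in both virus-name columns."""
    names = set()
    with open(pairs_csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for column in ("Variant 1 Virus name", "Variant 2 Virus name"):
                name = (row.get(column) or "").strip()
                if name:
                    names.add(name)
    return sorted(names)


def write_names(names, output_path, header="Unique Virus Name"):
    """Write one name per row under a single descriptive header."""
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        writer.writerows([name] for name in names)
```

Usage would be something like `write_names(unique_virus_names("dpgr_variant_pairs-full-tuples-2025-12-14.csv"), "uniq_virus_names_from_dpgr_variant_pairs-full-tuples-2025-12-14.csv")`.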

0:20am. aws

(dpgr310) [hqin@ip-10-3-4-198 dpgr_build_training_data]$  python scripts/filter_fasta_by_virus_names.py --names-csv uniq_virus_names_from_dpgr_variant_pairs-full-tuples-2025-12-14.csv --fasta data/raw/sequences.fasta --output /tmp/output.fasta --missing-output /tmp/missing.txt &

(dpgr310) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ grep ">" /tmp/output.fasta | head

>hCoV-19/USA/CA-CDC-QDX47542243/2023|2023-03-03|2023-03-20

>hCoV-19/USA/CA-CDPH-FS48082135/2022|2022-12-16|2023-01-30

>hCoV-19/Japan/PG-77665/2021|2021-07-15|2021-09-02

>hCoV-19/USA/CT-Yale-11825/2021|2021-10-05|2022-10-29

>hCoV-19/Israel/CVL-18032/2021|2021-08-11|2021-10-29

>hCoV-19/Denmark/DCGC-569088/2022|2022-08-21|2022-08-29

>hCoV-19/Indonesia/JK-NIHRD-WGS-22-16699/2022|2022-08-09|2022-08-23

>hCoV-19/USA/FL-BPHL-1692/2021|2021-03-01|2021-04-22

>hCoV-19/Russia/PRI-5829/2021|2021-01-18|2023-01-25

>hCoV-19/South Korea/KDCA25825/2021|2021-12-13|2022-01-12

(dpgr310) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ grep hCoV-19/USA/CA-CDC-QDX47542243/2023 *csv

dpgr_variant_pairs-full-tuples-2025-12-14.csv:PAIR028262,463,North America,2023-02-13 to 2023-04-17,0.0061116162065131,XBB.1.5.1,XBB.1.5.66,EPI_ISL_18883356,EPI_ISL_17250301,hCoV-19/USA/NY-UB-ECMC-00747/2023hCoV-19/USA/CA-CDC-QDX47542243/2023,2023-03-20,2023-03-03,North America / USA / New York / Erie County,North America / USA / California,29837,29721

dpgr_variant_pairs-full-tuples-2025-12-14-small.csv:PAIR028262,463,North America,2023-02-13 to 2023-04-17,0.0061116162065131,XBB.1.5.1,XBB.1.5.66,EPI_ISL_18883356,EPI_ISL_17250301,hCoV-19/USA/NY-UB-ECMC-00747/2023,hCoV-19/USA/CA-CDC-QDX47542243/2023,2023-03-20,2023-03-03,North America / USA / New York / Erie County,North America / USA / California,29837,29721

sample_pair_validation_results.csv:463,EPI_ISL_18748636,EPI_ISL_17250301,"Virus name: hCoV-19/USA/IL-S23WGS0739/2023; Passage details/history: Original; Type: betacoronavirus; Accession ID: EPI_ISL_18748636; Collection date: 2023-02-26; Location: North America / USA / Illinois / Jefferson; Sequence length: 29752; Host: Human; Patient age: 80; Gender: Male; Clade: GRA; Pango lineage: XBB.1.5.1; Pango version: consensus call; Variant: Former VOI (XBB.1.5+XBB.1.5.*); AA Substitutions: (NSP5_P132H,NSP16_A168V,NSP12_G671S,NSP3_G489S,Spike_L24del,NSP4_T327I,Spike_N969K,Spike_H655Y,Spike_G142D,Spike_A27S,Spike_Q954H,N_P13L,Spike_N501Y,Spike_P25del,N_R32del,Spike_V213E,NS3_T223I,Spike_T19I,Spike_H146Q,M_Q19E,Spike_N440K,NSP4_T492I,Spike_N460K,Spike_N679K,Spike_N764K,E_T11A,NSP6_G107del,Spike_Y505H,NSP14_M58I,Spike_D796Y,Spike_T478K,M_A63T,Spike_R346T,Spike_S371F,Spike_K417N,NSP13_R392C,Spike_L368I,Spike_T376A,NSP6_S106del,Spike_F490S,Spike_R408S,NSP4_L438F,Spike_G339H,NSP14_I42V,NSP4_L264F,Spike_P681H,Spike_Y144del,Spike_V83A,NSP3_T24I,N_S33del,NSP1_S135R,Spike_S375F,Spike_D405N,Spike_Q498R,NSP13_S36P,Spike_Q183E,Spike_S477N,N_E31del,NSP15_T112I,NSP6_F108del,Spike_T573I,E_T9I,NSP1_K47R,Spike_P26del,NSP12_P323L,Spike_D614G,Spike_G252V); Submission date: 2024-01-11; Is complete?: True; N-Content: 0.0254772788384; GC-Content: 0.378940470442; Region: North America","Virus name: hCoV-19/USA/CA-CDC-QDX47542243/2023; Passage details/history: Original; Type: betacoronavirus; Accession ID: EPI_ISL_17250301; Collection date: 2023-03-03; Location: North America / USA / California; Sequence length: 29721; Host: Human; Patient age: 55; Gender: Female; Clade: GRA; Pango lineage: XBB.1.5.66; Pango version: PANGO-v1.23; Variant: Former VOI (XBB.1.5+XBB.1.5.*); AA Substitutions: 
(NSP5_P132H,NSP12_G671S,NSP3_G489S,Spike_L24del,NSP4_T327I,Spike_S373P,Spike_N969K,Spike_H655Y,N_R203K,NSP2_G339S,Spike_G142D,Spike_A27S,Spike_Q954H,N_P13L,Spike_N501Y,Spike_P25del,N_R32del,Spike_V213E,NS3_T223I,Spike_T19I,Spike_H146Q,M_Q19E,Spike_N440K,NSP4_T492I,Spike_N460K,Spike_N679K,Spike_N764K,E_T11A,NSP6_G107del,Spike_Y505H,Spike_D796Y,N_G204R,Spike_T478K,N_S413R,M_A63T,Spike_R346T,NSP6_M143I,Spike_S371F,Spike_V445P,NSP13_R392C,Spike_K417N,Spike_L368I,Spike_T376A,NSP6_S106del,NS8_G8stop,Spike_F490S,Spike_F486P,Spike_R408S,NSP4_L438F,Spike_G339H,NSP14_I42V,NSP4_L264F,Spike_P681H,Spike_Y144del,Spike_V83A,NSP3_T24I,N_S33del,NSP1_S135R,Spike_S375F,Spike_D405N,Spike_Q498R,Spike_G446S,NSP13_S36P,Spike_Q183E,Spike_S477N,N_E31del,NSP15_T112I,NSP6_F108del,Spike_E484A,E_T9I,NS7a_Q94H,NSP1_K47R,Spike_P26del,NSP12_P323L,Spike_D614G,Spike_G252V); Submission date: 2023-03-20; Is complete?: True; GC-Content: 0.379125870597; Region: North America",Correct,


uniq_virus_names_from_dpgr_variant_pairs-full-tuples-2025-12-14.csv:hCoV-19/USA/CA-CDC-QDX47542243/2023
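For reference, a minimal sketch of the filtering idea, assuming the header layout seen above (`>virus name|collection date|submission date`). This is not the code in scripts/filter_fasta_by_virus_names.py; `header_to_virus_name` and `filter_fasta` are hypothetical names used only to illustrate the matching step.

```python
def header_to_virus_name(header_line):
    """Turn '>name|date|date' into 'name' (assumed header layout)."""
    return header_line.lstrip(">").split("|", 1)[0].strip()


def filter_fasta(fasta_handle, wanted_names, out_handle):
    """Copy records whose virus name is wanted; return the names never seen."""
    wanted = set(wanted_names)
    found = set()
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            # A header line starts a new record; decide whether to keep it.
            name = header_to_virus_name(line)
            keep = name in wanted
            if keep:
                found.add(name)
        if keep:
            out_handle.write(line)
    return sorted(wanted - found)
```

The input is streamed line by line, so the full sequences.fasta never has to fit in memory. The returned list plays the role of the --missing-output file.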












Saturday, December 13, 2025

dpgr, build pairs

The previous 100,000 entries cover only 1098 tuples, missing 998 tuples? So, try 200,000 entries.

(base) [hqin@ip-10-3-4-198 ~]$ 

(base) [hqin@ip-10-3-4-198 ~]$ cd dpgr_build_training_data/

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls

AGENTS.md  data  dpgr_analysis_summary.csv  dpgr_variant_pairs-full-tuples-2025-12-13.csv  logs  metadata  models  notebooks  __pycache__  README.md  sample_pair_validation_results.csv  scripts  tmp

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git pull

Warning: Permanently added 'github.com,140.82.113.4' (ECDSA) to the list of known hosts.

remote: Enumerating objects: 5, done.

remote: Counting objects: 100% (5/5), done.

remote: Compressing objects: 100% (5/5), done.

remote: Total 5 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)

Unpacking objects: 100% (5/5), 3.78 KiB | 34.00 KiB/s, done.

From github.com:QinLab/dpgr_build_training_data

   b225cc2..816094a  main                                     -> origin/main

 * [new branch]      codex/increase-total-pair-output-to-200k -> origin/codex/increase-total-pair-output-to-200k

Updating b225cc2..816094a

Fast-forward

 scripts/generate_dpgr_variant_mapping_all_tuples.sbatch | 2 +-

 1 file changed, 1 insertion(+), 1 deletion(-)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch scripts/generate_dpgr_variant_mapping_all_tuples.sbatch

Submitted batch job 410


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status

On branch main

Your branch is up to date with 'origin/main'.


Untracked files:

  (use "git add <file>..." to include in what will be committed)

dpgr_variant_pairs-full-tuples-2025-12-14.csv

logs/dpgr_variant_mapping.all_tuples.410.out


nothing added to commit but untracked files present (use "git add" to track)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ wc -l *csv

    2069 dpgr_analysis_summary.csv

  100001 dpgr_variant_pairs-full-tuples-2025-12-13.csv

  154556 dpgr_variant_pairs-full-tuples-2025-12-14.csv

      26 sample_pair_validation_results.csv

  256652 total

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out

Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/sample_pair_tuple_check.all_tuples.410.txt

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/dpgr_variant_mapping.all_tuples.410.out

Saved 154555 full-coverage pairs to /home/hqin/dpgr_build_training_data/dpgr_variant_pairs-full-tuples-2025-12-14.csv

Found 2,033 unique (Location, Time Window, DPGR (Slope)) tuples in dpgr_variant_pairs-full-tuples-2025-12-14.csv.

DPGR summary contains 2,068 unique (Location, Time Window, DPGR (Slope)) tuples.

Warning: 35 tuples from the DPGR summary were not found in the pairs file.

Checked 500 sampled rows.

No issues found in the sampled subset.

Wrote per-row validation results to logs/sample_pair_validation_results.all_tuples.410.csv

Wrote tuple comparison summary to logs/sample_pair_tuple_check.all_tuples.410.txt

\nDone. Results are in:

- Pairs: dpgr_variant_pairs-full-tuples-2025-12-14.csv

- Sample tuples: logs/sample_pair_tuple_check.all_tuples.410.txt

- Sample metrics: logs/sample_pair_validation_results.all_tuples.410.csv
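To list which 35 tuples are missing (2,068 in the summary minus 2,033 in the pairs file), a set difference is enough. This is a rough sketch, not the repo's checker. The pairs-file column names come from the CSV header; the summary column names ("Location", "Time Window", "DPGR (Slope)") are an assumption taken from the log wording. Slopes are compared as strings, which assumes both files print them identically.

```python
def tuple_set(rows, columns):
    """Collect the unique tuples formed by the given columns."""
    return {tuple(row[column].strip() for column in columns) for row in rows}


def missing_tuples(
    summary_rows,
    pair_rows,
    summary_cols=("Location", "Time Window", "DPGR (Slope)"),  # assumed names
    pair_cols=("DPGR Location", "DPGR Time Window", "DPGR (Slope)"),
):
    """Return summary tuples that never appear in the pairs file, sorted."""
    return sorted(tuple_set(summary_rows, summary_cols) - tuple_set(pair_rows, pair_cols))
```

Both arguments can be `csv.DictReader` objects opened on dpgr_analysis_summary.csv and the pairs CSV.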


(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ ls -lh *csv

-rw-rw-r-- 1 hqin hqin 230K Dec  2 05:02 dpgr_analysis_summary.csv

-rw-rw-r-- 1 hqin hqin  28M Dec 13 04:25 dpgr_variant_pairs-full-tuples-2025-12-13.csv

-rw-rw-r-- 1 hqin hqin  43M Dec 14 02:34 dpgr_variant_pairs-full-tuples-2025-12-14.csv

-rw-rw-r-- 1 hqin hqin  54K Dec 13 03:02 sample_pair_validation_results.csv

So, we built the pairs. The file is dpgr_variant_pairs-full-tuples-2025-12-14.csv

10:41pm, small-scale test to build FASTA sequence pairs.

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ bash  scripts/debug_submit_slice_jobs.sh

Prepared 5 slices in tmp/variant_pair_slices_debug

Submitting only first 5 slice(s) due to --max-jobs

Submitted dpgrRows000001-000002: Submitted batch job 411

Submitted dpgrRows000003-000004: Submitted batch job 412

Submitted dpgrRows000005-000006: Submitted batch job 413

Submitted dpgrRows000007-000008: Submitted batch job 414

Submitted dpgrRows000009-000009: Submitted batch job 415


Concatenate label outputs after jobs complete:

cat data/labels/row_slices_debug/dpgr_pair_labels_rows*.tsv > data/labels/row_slices_debug/dpgr_pair_labels_all.tsv

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               409 gpu-a100- fine_lr1 malam007 PD       0:00      1 (Resources)

               415 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               414 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               413 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               412 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

               411 gpu-a100- dpgrRows     hqin PD       0:00      1 (Priority)

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ cat logs/build_fasta_pair_000001-000002.411.out

Traceback (most recent call last):

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 843, in <module>

    main()

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 678, in main

    accession_to_virus, virus_to_accession = load_metadata_maps(metadata_path)

  File "/home/hqin/dpgr_build_training_data/scripts/build_fasta_pair_and_dpgr.py", line 195, in load_metadata_maps

    with metadata_path.open("r", encoding="utf-8", newline="") as handle:

  File "/home/hqin/miniforge3/envs/dpgr310/lib/python3.10/pathlib.py", line 1119, in open

    return self._accessor.open(self, mode, buffering, encoding, errors,

FileNotFoundError: [Errno 2] No such file or directory: 'tmp/metadata1K_downsampled.tsv'

Realized that the previous Python code went back to the metadata file to look up FASTA headers. With the updated CSV, this step should be skipped.

So, revise the build_fasta_pair_and_dpgr.py


BACK TO DRAWING BOARD


As a first-principles approach, revise the mapping code to parse the Virus name, which is how FASTA records are keyed.

Revise https://github.com/QinLab/dpgr_build_training_data/blob/main/scripts/generate_dpgr_variant_mapping.py

to parse both Accession IDs and Virus name (which is in the FASTA Headers) for Variant 1 and Variant 2. 

First, the local test on the MacBook passed (scripts/local_small_scale_test.sh).

10:03pm, test on aws. 

(base) dpgr_build_training_data]$ sbatch scripts/local_small_scale_test.sbatch

Submitted batch job 397

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               397 gpu-a100- dpgrLoca     hqin        0:04      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007    18:22:39      1 gpu-a100-2

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ git status

On branch main

Your branch is up to date with 'origin/main'.


Changes to be committed:

  (use "git restore --staged <file>..." to unstage)

new file:   dpgr_variant_pairs-2025-12-13.csv

new file:   logs/local_small_scale_test.397.out

new file:   logs/sample_pair_tuple_check.397.txt


10:37pm

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ sbatch  scripts/generate_dpgr_variant_mapping_all_tuples.sbatch 

Submitted batch job 399

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ 

(base) [hqin@ip-10-3-4-198 dpgr_build_training_data]$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

               399 gpu-a100- dpgrAllT     hqin        0:01      1 gpu-a100-2

               267 gpu-a100- fine_lr1 malam007    18:58:06      1 gpu-a100-2

11:40pm, 

Prompt (did not lead to correct Python code, but the code runs)

Revise build_fasta_pair_and_dpgr.py to use dpgr_variant_pairs-full-tuples-2025-12-14-small.csv (for debug) and sequences50K.fasta (debug mode).

dpgr_variant_pairs-full-tuples-2025-12-14-small.csv already contains the "Virus names" from the metadata file, so we do not need the metadata to match pairs with FASTA headers. The FASTA header information is in dpgr_variant_pairs-full-tuples-2025-12-14-small.csv.

Virus names are matched to FASTA headers so that "Variant 1 Virus name" or "Variant 2 Virus name" can be paired with headers that only contain the virus name (e.g., ``>hCoV-19/USA/FL-BPHL-5466/2022``), so remove the extra yyyy-mm-dd fields from the FASTA headers. In debug mode, only process the first 10 rows in dpgr_variant_pairs-full-tuples-2025-12-14-small.csv. In verbose mode, print the Virus names and Accession IDs at each step, and whether their FASTA matches are found or not.

No FASTA sequences were extracted.


1:21am. Verified that the FASTA headers contain the first pair.

hqin@Hong-MBP2 dpgr_build_training_data % gunzip data/raw/fasta_headers.txt.gz 

hqin@Hong-MBP2 dpgr_build_training_data % grep hCoV-19/USA/WA-PHL-028575/2020  data/raw/fasta_headers.txt 

>hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09

hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/WA-PHL-028575/2020

3154745 >hCoV-19/USA/WA-PHL-028575/2020|2020-03-02|2022-11-09

hqin@Hong-MBP2 dpgr_build_training_data % 

hqin@Hong-MBP2 dpgr_build_training_data % cat data/raw/fasta_headers.txt | nl | grep hCoV-19/USA/NY-WCM-0632-1-P/2020

16249197 >hCoV-19/USA/NY-WCM-0632-1-P/2020|2020-03-15|2021-01-19

hqin@Hong-MBP2 dpgr_build_training_data % 
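The same lookup can be done for many names in one pass over fasta_headers.txt, instead of one `nl | grep` per name. A minimal sketch, again assuming the `>name|date|date` header layout; `locate_names` is a hypothetical helper, not repo code. Unlike grep, it requires an exact match on the virus name, so a name that is a prefix of a longer name does not produce a false hit.

```python
def locate_names(header_lines, names):
    """Map each virus name to the 1-based line number of its first header.

    Names that never appear map to None.
    """
    positions = {name: None for name in names}
    for line_number, line in enumerate(header_lines, start=1):
        # Keep only the part before the first '|' and drop the leading '>'.
        name = line.lstrip(">").split("|", 1)[0].strip()
        if name in positions and positions[name] is None:
            positions[name] = line_number
    return positions
```

Passing an open file handle for data/raw/fasta_headers.txt as `header_lines` keeps memory use flat, since the file is read line by line.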

1:51am, realized that sequence5K.fasta may contain too few entries to test the pairs CSV.

Used ChatGPT to filter the pair input.

2:03am. The following command worked.

python scripts/build_fasta_pair_and_dpgr.py --pairs-path dpgr_variant_pairs-full-tuples-2025-12-14-small.csv --sequences-path data/raw/sequences50K.fasta --debug --verbose --use-pair-virus-names --allow-missing-fasta --limit 10

Called it a day.

TODO: check for duplicate sequence pairs that have the same DPGR rate. If a duplicated sequence pair has different DPGR rates, it should be kept because it increases the diversity of the training data.
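One possible shape for this check, as a sketch only. It assumes a duplicate means the same (Variant 1 Accession ID, Variant 2 Accession ID, DPGR (Slope)) triple, with pair order treated as significant. Keying on the accession IDs is a stand-in; keying on the sequences themselves (e.g. a hash of each sequence) would be a stricter alternative. `drop_duplicate_pairs` is a hypothetical name.

```python
def drop_duplicate_pairs(rows):
    """Keep the first row for each (accession 1, accession 2, slope) triple.

    Rows that repeat a pair but carry a different slope are kept.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (
            row["Variant 1 Accession ID"],
            row["Variant 2 Accession ID"],
            row["DPGR (Slope)"],
        )
        if key in seen:
            continue  # same pair, same DPGR rate: a true duplicate
        seen.add(key)
        kept.append(row)
    return kept
```

Comparing `len(rows)` before and after gives the number of duplicates that would be removed.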