Sunori S. K, Jain S, Jethi G. S, Juneja P. Predictive Modelling of Hepatitis C Virus Disease Progression Using PCA and Machine Learning. Biomed Pharmacol J 2026;19(2).
Manuscript received on :28-11-2025
Manuscript accepted on :12-02-2026
Published online on: 06-05-2026
Plagiarism Check: Yes
Reviewed by: Dr. Yerbolat Iztleuov
Second Review by: Dr. Karthikeyan
Final Approval by: Dr. Anton R Keslav


Sandeep Kumar Sunori1*, Shilpa Jain2, Govind Singh Jethi2 and Pradeep Juneja1

1Department of ECE, Graphic Era Hill University, Bhimtal Campus, India

2Department of CSE, Graphic Era Hill University, Bhimtal Campus, India

Corresponding Author E-mail: sksunori@gehu.ac.in

Abstract

Chronic hepatitis C virus (HCV) infection progresses through hepatitis and fibrosis to cirrhosis, and disease staging must be non-invasive to be useful in routine clinical practice. This article develops a computational algorithm for multi-class HCV staging from standard serum laboratory biomarkers. A dataset of 12 clinical biomarkers and demographic attributes from 615 subjects was used. Because the biomarker panel is high-dimensional and intrinsically correlated, Principal Component Analysis (PCA) was applied as an essential feature-engineering step, retaining 95% of the total data variance. Three supervised machine learning classifiers, Naive Bayes (NB), K-Nearest Neighbors (KNN, k=5), and a multi-class Support Vector Machine (SVM) built with the Error-Correcting Output Codes (ECOC) wrapper and a linear kernel, were trained and compared on the low-dimensional feature set obtained through PCA. The SVM-ECOC model showed the best overall predictive performance, with the highest Accuracy (91%), Macro-Averaged Precision (0.745), and Macro-Averaged Recall (Sensitivity) of 0.61. The translational usefulness of the SVM model was further assessed with the Multi-Category Net Reclassification Improvement (MCNRI) measure, which reported a net improvement in correct risk stratification of 8.13% over Naive Bayes and 4.88% over K-Nearest Neighbors. This performance supports the use of PCA to reduce multidimensional biological data to a linearly separable feature space, which substantially improves classification. Nevertheless, the study also exposes a significant limitation: the gap between the high overall accuracy and the moderate Macro-Averaged Recall indicates insensitivity (a high False Negative Rate) to the minority disease classes (Hepatitis, Fibrosis, Cirrhosis), caused by the imbalance in the dataset.
All models were simulated in MATLAB. Future research should apply data-level methods, such as oversampling, to reduce class bias and achieve reliable diagnostic sensitivity across all stages of HCV progression for clinical applicability.

Keywords

HCV (Hepatitis C Virus); Liver Fibrosis; MCNRI (Multi-Category Net Reclassification Improvement); Multi-class Classification; PCA (Principal Component Analysis); Predictive modelling; SVM (Support Vector Machine); Serum Biomarkers


Introduction

Background on Hepatitis C Virus (HCV) and Disease Progression

Hepatitis C Virus (HCV) infection is a persistent and major challenge to global population health, mainly because of its high rate of chronicity and progression to severe liver disease.16 The clinical course of chronic HCV is a pathological continuum, progressing from acute and chronic hepatitis to liver fibrosis and, finally, cirrhosis. Cirrhosis is an irreversible disorder characterized by massive hepatic scarring and functional impairment, and it is a severe risk factor for life-threatening events such as hepatocellular carcinoma and end-stage liver failure. The classic technique for measuring the extent of liver injury, especially the degree of fibrosis and cirrhosis, has been the invasive liver biopsy. Although traditionally viewed as the diagnostic gold standard, biopsy suffers from high cost, procedural complications, sampling error, and patient discomfort.17 These significant weaknesses motivate the creation and validation of powerful, non-invasive substitutes.

The Role of Serum Biomarkers in Non-Invasive Diagnosis

Modern non-invasive staging is based on the interpretation of readily available serum biomarkers, or Liver Function Tests (LFTs), and related panels. The dataset25 examined in this paper includes ten important laboratory biomarkers, Albumin (ALB), Alkaline Phosphatase (ALP), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Bilirubin (BIL), Cholinesterase (CHE), Cholesterol (CHOL), Creatinine (CREA), Gamma-Glutamyl Transferase (GGT), and Total Protein (PROT), in combination with demographic attributes (Age, Sex), which together represent the overall physiological health of the liver. Although these biomarkers are clinically informative, their combined predictive strength is easily obscured by the complexity and high dimensionality of the data. For example, AST and ALT frequently rise together in acute damage, and ALP is strongly related to GGT in the cholestatic pattern.18 Reliably identifying the subtle biochemical patterns that define the five diagnostic categories, Blood Donor, Suspect Donor, Hepatitis, Fibrosis, and Cirrhosis, therefore requires computational schemes able to manage these relationships and extract the latent patterns associated with the diagnosis.

Combining Dimensionality Reduction and Classification of Diagnostic Modelling

High feature correlation and dimensionality present significant challenges, which this study addressed with Principal Component Analysis (PCA) as an upstream feature-engineering step. PCA is a linear transformation of the original set of correlated biomarker features into a new orthogonal coordinate system in which a set of lower-dimensional principal components (PCs) captures the highest possible amount of the original data variance.19 This dimension reduction ensures computational efficiency and avoids redundancy in the features presented to the classifiers.20

The PCA-reduced feature set was then used to evaluate three different supervised machine learning paradigms: Naïve Bayes (NB), a probabilistic approach; K-Nearest Neighbors (KNN), an instance-based, non-parametric approach; and the Support Vector Machine (SVM), a maximum-margin geometric classifier.21 A rigorous comparison of these algorithms is crucial for discovering the most appropriate model architecture for multi-class classification in the complicated setting of HCV disease staging.

Hypothesis and Objectives

The basic hypothesis of this research is that applying PCA to standardize and dimensionally reduce the HCV biomarker data will produce a low-dimensional feature representation that significantly improves the predictive capability of supervised machine learning classifiers. Moreover, the SVM model is expected to outperform NB and KNN in predicting the stages of HCV progression because of its strong generalization ability and geometric separation.

The research objectives of this work are:

To apply PCA and choose the smallest number of principal components necessary to explain 95% of the total variance in the standardized biomarker data.

To train, tune and test the relative performance of the NB, KNN and SVM models on the PCA-reduced set of features.

To ascertain the most effective classification algorithm overall through a critical analysis of multi-class measures, namely Accuracy, Macro-Averaged Precision, Macro-Averaged Recall (Sensitivity) and F1-Score.

To conduct an MCNRI analysis of the developed models.
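The MCNRI estimator is not derived in this paper; as an illustrative simplification (the paper's exact estimator may differ), the multi-category NRI of a new model over a reference model can be read as the net fraction of cases reclassified from an incorrect category to the correct one:

```python
def mcnri(y_true, pred_ref, pred_new):
    """Simplified multi-category net reclassification improvement:
    the fraction of cases the new model moves from wrong to correct,
    minus the fraction it moves from correct to wrong."""
    up = sum(1 for t, r, n in zip(y_true, pred_ref, pred_new)
             if r != t and n == t)
    down = sum(1 for t, r, n in zip(y_true, pred_ref, pred_new)
               if r == t and n != t)
    return (up - down) / len(y_true)

# illustrative labels only -- not the study's data
y_true   = ['0', '1', '2', '1']
pred_ref = ['0', '0', '2', '0']   # hypothetical reference model (e.g. NB)
pred_new = ['0', '1', '2', '0']   # hypothetical new model (e.g. SVM)
improvement = mcnri(y_true, pred_ref, pred_new)  # net gain of 1 case in 4
```

Under this reading, a positive MCNRI means the new model corrects more stage assignments than it breaks, which matches the paper's interpretation of the SVM's 8.13% and 4.88% gains.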

Literature Review

Clinical Interpretation of Liver Biomarkers in HCV Staging

Patterns and ratios of LFTs have become essential in the interpretation of liver disease in clinical hepatology and cannot be interpreted in isolation. ALT and AST are enzyme indicators of cellular integrity, and elevated levels characterize the hepatocellular damage typical of hepatitis. An important discriminating test is the AST:ALT (De Ritis) ratio. A ratio of 2:1 or higher strongly indicates alcoholic liver disease, reflecting the release of mitochondrial AST. Conversely, in cases of cholestasis, a lower AST:ALT ratio of < 1.5 tends to indicate extrahepatic obstruction, usually with ALT significantly higher than AST. Elevated Alkaline Phosphatase (ALP) and Gamma-Glutamyl Transferase (GGT), often accompanied by a rise in Bilirubin (BIL), constitute the cholestatic profile produced by biliary tract obstruction. GGT is especially sensitive to enzyme induction by chronic alcohol intake or certain medications.18 Markers of synthetic capacity, mainly Albumin (ALB), Total Protein (PROT) and Cholinesterase (CHE), decrease with disease severity. Reduced ALB is a strong indicator of chronic liver failure and normally accompanies progressive fibrosis and cirrhosis. HCV staging is summarized in the flowchart of Fig. 1, which outlines the natural history and clinical course of hepatitis C. Once infected, individuals pass through an acute HCV phase (0-6 months) in which 15-45% achieve spontaneous viral clearance. Without viral clearance, the infection becomes chronic HCV, which causes progressive liver damage. Progression of the chronic infection is described by the METAVIR fibrosis stages (F0-F4), from no fibrosis to cirrhosis. Direct-acting antivirals (DAAs) can be administered at any fibrosis stage and usually lead to a sustained virologic response (SVR, or viral cure), halting progression and potentially partially reversing fibrosis.
Once fibrosis reaches F4, the patient has cirrhosis, which may remain compensated for a period and then degenerate into decompensated cirrhosis, accompanied by jaundice, ascites, and hepatic encephalopathy. After decompensation, patients are predisposed to hepatocellular carcinoma (HCC) and end-stage liver disease (ESLD). Advanced disease, whether HCC or ESLD, may eventually necessitate liver transplantation.
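The ratio heuristics above can be expressed as a small helper. This is only a sketch of the rules of thumb stated in the text; the thresholds mirror the prose and are not validated diagnostic cutoffs:

```python
def de_ritis_pattern(ast, alt):
    """Interpret the AST:ALT (De Ritis) ratio using the rough clinical
    heuristics described in the text (illustrative, not diagnostic)."""
    ratio = ast / alt
    if ratio >= 2.0:
        # AST-predominant: pattern the text links to alcoholic liver disease
        return ratio, 'alcoholic pattern'
    if ratio < 1.5:
        # ALT-predominant: pattern the text links to extrahepatic obstruction
        return ratio, 'obstructive pattern'
    return ratio, 'indeterminate'

ratio, pattern = de_ritis_pattern(ast=120.0, alt=40.0)  # ratio 3.0, alcoholic pattern
```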

Methodological Precedents for Dimensionality Reduction (PCA)

PCA is a framework method for high-dimensional (HD) clinical data analysis in health science applications, facilitating the discovery of latent structure and the compression of computational requirements. The approach used here imposed a strict inclusion criterion: selection of the minimum number of PCs that explain 95% of the cumulative variance. This 95% threshold is deliberately chosen in health informatics because it offers better stability and reliability than less deterministic methods such as the Kaiser-Guttman criterion or Cattell's Scree Test. The dimensionality-reduction step is important because it ensures that most of the variability in the original data (95%) is retained, so that important diagnostic information related to subtle metabolic patterns, which are critical for determining disease stages, is preserved and good methodological rigor is established.

Leveraging Advanced Diagnostic Techniques in Liver Disease

The growing use of computational intelligence, biomedical data analytics, and sophisticated diagnostic technology has changed modern studies of hepatitis viruses, liver disease progression, and classification science. At the methodological level, the accuracy of classification results strongly depends on the choice of appropriate loss and accuracy metrics, especially in data-imbalanced situations, where metric bias and misinterpretation can be a major determinant of the reported model performance.1 These issues have gained even greater importance with the adoption of machine learning frameworks in virology, epidemiology and clinical decision support. For example, to illustrate how computational methods can uncover mechanistic insights into hepatitis C virus (HCV) biology, rotation-forest-based classifiers combining position-specific scoring matrices and two-dimensional PCA have been shown to yield high performance in predicting human-viral protein interactions.2 At the same time, individualized diagnostic modeling has become a highly promising direction, with patient-specific machine learning systems achieving higher diagnostic accuracy and recall for HCV than standardized, generalized models.3

Beyond traditional statistical learning, new reasoning approaches have been suggested, including neutrosophic hypersoft mapping to resolve uncertainties in symptom assessment and treatment distribution, allowing more consistent mapping of hepatitis symptoms to treatment decisions in clinically ambiguous scenarios.4 Ensemble learning methods have also shown good predictive ability in differentiating hepatitis C from cirrhosis, with ALT and AST identified as the major predictive biomarkers.5 Other works have proposed multiclass HCV detection pipelines using random forests, logistic regression, and oversampling methods to achieve better early diagnosis from imbalanced clinical data.6

Alongside patient-level modeling, population-based modeling remains a significant aspect of studying hepatitis virus transmission. Deterministic compartmental formulations have helped explain key HCV epidemiological thresholds, including the basic reproductive number, which determines whether HCV is persistently controlled or eliminated within high-risk communities.7 To enable large-scale simulation studies, agent-based models accelerated by parallel sliding-region algorithms make country-level epidemic models based on real demographic data possible without significant computational cost.8 Furthermore, systems-level studies based on protein interaction networks have offered new insights into the mechanisms of fibrosis formation, interference with immune pathways, and biomarker identification in hepatitis B and C.9

Machine learning has also been used in clinical risk stratification, where decision trees, regression models, genetic algorithms, and particle swarm optimization have demonstrated significant predictive power for advanced fibrosis, particularly in large patient populations.10 Alongside these diagnostic innovations, research in biosensing technologies has produced portable electrochemical systems capable of rapid, ultrasensitive HCV immunodetection, opening up possibilities for decentralized and resource-limited screening.11 At the molecular level, Bayesian inference has gained a significant role in defining drug-resistance mechanisms, especially the identification of resistance-associated mutations in the NS5A region of genotype 1a HCV.12 More general Bayesian methodology has likewise been used to detect intricate mutations in hepatitis B, hepatitis C and HIV, giving a more precise picture of how high-dimensional genomic interactions affect phenotypes.13

These statistical models are further augmented by sophisticated data-mining strategies. Complete HBV genome sequence studies have applied clustering, evolutionary analysis, fuzzy-measure-based classification and information-gain-driven feature selection to determine potential oncogenic markers in the development of hepatocellular carcinoma.14 Developments in pharmaceutical analytics, including carbon-nanotube-enhanced electrochemical sensors, have facilitated the sensitive detection of antiviral drugs such as daclatasvir, which can be used to monitor dosage and treatment effects in HCV therapy.15 Machine learning algorithms have also been used to generate decision trees from laboratory data.26

Collectively, these diverse streams of study point to the increasingly multidisciplinary nature of the present-day landscape of hepatitis virus research, in which state-of-the-art computational, statistical, epidemiological and biosensing advances intersect to improve disease knowledge, diagnosis, and treatment optimization. Table 1 summarizes the performance of several existing ML models in hepatitis research.

Comparative research on hepatology diagnostics usually demonstrates mixed outcomes, yet tends to conclude that more sophisticated machine learning algorithms are effective. Of special interest is the comparison of Naive Bayes (NB) with SVM. NB assumes feature independence, which usually does not hold in complex biological systems where biomarkers interact extensively. SVM, on the other hand, projects the data into a high-dimensional space in order to find the best maximum-margin separating hyperplane. Since non-linear interactions among biomarkers are likely, the SVM framework is theoretically better positioned to exploit the geometric distance between disease groups than the purely probabilistic NB, in agreement with existing literature showing that SVM often yields better results on complex biomedical data sets.

In the five-category multi-class staging problem of HCV progression, the SVM was trained with the Error-Correcting Output Codes (ECOC) wrapper. ECOC transforms the multi-class problem of interest into multiple robust binary classification sub-problems to boost the stability and generalization of the resulting model. The linear kernel used in the SVM implementation, applied after PCA, allows the disease classes to be separated linearly in the reduced feature space, a result commonly characteristic of effective feature engineering and a linearly separable data set.

Figure 1: HCV staging


Table 1: Summary of literature review on application of ML in hepatitis research

Authors Research Area Model/Method Results Novelty
L. Chen et al.3 HCV Diagnosis (Patient-Specific) Customized ML Model Over 99% accuracy and 94% recall Demonstrates the strength of a patient-specific model over a general-purpose model.
X. Liu et al.2 HCV-Human PPI Prediction RF-PSSM (Rotation Forest-Position-Specific Scoring Matrix) Accuracy 93.74%; AUC 94.29% Exhibits cutting-edge molecular feature utilization.
T. -H. S. Li et al.6 HCV Multiclass Detection Cascade RF-LR (with SMOTE) Improved performance than that of latest algorithms Addresses challenges pertaining to data imbalance.
S. Hashem et al.10 Advanced Liver Fibrosis Prediction ML Models (DT, GA, PSO, MLR) AUROC 0.73–0.76; Accuracy 66.3–84.4% Establishes the performance benchmark for non-invasive clinical prognosis.
D. Chicco and G. Jurman5 HCV and Cirrhosis Diagnosis Ensemble Learning (Random Forest) Outperforms AST/ALT ratio Validates efficacy of ensemble machine learning against traditional clinical scores.

Materials and Methods

Data Source and Cohort Description

The dataset used in the analysis was the HCV dataset,25 obtained from the UC Irvine Machine Learning Repository (https://doi.org/10.24432/C5D612).

It contains 615 observational cases of blood donors and HCV patients. The dataset holds 14 attributes, of which 12 were used as predictor variables (Age, Sex, and the ten core laboratory values: ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT). The response variable, the diagnostic outcome, was the multi-class Category comprising five stages of liver health and disease: 0=Blood Donor, 0s=Suspect Blood Donor, 1=Hepatitis, 2=Fibrosis, and 3=Cirrhosis.

Data Preprocessing and Feature Engineering 

PCA requires the predictor matrix to be fully numeric. Therefore, the categorical variable Sex (f/m) was converted to a numeric index with MATLAB's grp2idx function. To ensure that no predictor dominated the covariance matrix calculation (so that variables with large intrinsic ranges, such as GGT or ALP, would not dominate the PCA), the whole numeric predictor matrix (X) was standardized. In this normalization step the data were scaled to zero mean and unit variance by the normalize(X) function. Preliminary inspection of a data snippet revealed high completeness overall, but the full data contain scattered blank entries in the laboratory values. The subsequent calculation used a cleaned variant of the data, implying that incomplete instances were either removed by list-wise deletion or implicitly imputed before loading; this specific mechanism, however, was not explicitly coded.
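The standardization step uses MATLAB's normalize(X) in the paper; an equivalent z-score transform can be sketched in plain Python (the toy values below are illustrative, not from the dataset):

```python
import math

def zscore_columns(X):
    """Scale each column of X (a list of rows) to zero mean and unit variance,
    using the sample (N-1) standard deviation, as MATLAB's normalize does
    by default."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

# two toy columns with very different ranges, like ALB versus GGT
Z = zscore_columns([[40.0, 10.0], [45.0, 200.0], [35.0, 390.0]])
```

After this transform every column contributes on the same scale to the covariance matrix, which is the property the paper relies on before running PCA.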

Principal Component Analysis (PCA) Implementation

PCA was performed on the standardized matrix (X) to convert the 12 potentially correlated biomarkers into a collection of orthogonal principal components. The number of retained components was strictly determined as the smallest number explaining 95% of the cumulative variance. Visualization tools, namely the PCA Scree Plot and the Cumulative Variance Plot, supported this procedure, since the dimensionality was to be minimized while retaining maximum informational content, in keeping with stable methods of PC selection in health research. The scores of the retained components were used as the input features in all later stages of model training and testing.
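The 95% retention rule reduces, in code, to picking the smallest k whose cumulative explained-variance ratio crosses the threshold. A minimal sketch over a hypothetical eigenvalue spectrum (not the paper's actual values):

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches the given threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# hypothetical covariance eigenvalues for a 7-feature panel
k = components_for_variance([6.0, 2.5, 1.5, 0.9, 0.6, 0.3, 0.2])  # -> 5
```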

Experimental Design and Model Training

A stratified holdout approach was used to split the data set into two parts: 80% of the data were used for training and 20% for validation as a holdout test set. This split ensures that the assessment metrics reflect the model's actual generalization to unseen data, while the stratified methodology maintains proportional representation of all five diagnostic classes.
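A stratified holdout of this kind can be sketched in plain Python (class labels below are illustrative): each class contributes roughly the same test fraction of its members, so class proportions are preserved on both sides of the split.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with each class contributing
    ~test_fraction of its members to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                              # random within each class
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# imbalanced toy cohort: 40 donors, 10 cirrhosis cases
labels = ['donor'] * 40 + ['cirrhosis'] * 10
train_idx, test_idx = stratified_holdout(labels)
```

With a plain (unstratified) 20% split, a rare class could vanish from the test set entirely; stratification guarantees the toy cirrhosis class keeps exactly two of its ten cases for testing.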

The following models were trained in MATLAB with the specified parameters:

Naïve Bayes (NB) model: The Naïve Bayes classifier is a probabilistic model centred on Bayes' theorem with an assumption of conditional independence among predictors.

For an input feature vector z = (z1, z2, …, zn), the posterior probability for class c is:

$$P(c \mid z) \;\propto\; P(c)\prod_{i=1}^{n} P(z_i \mid c)$$

MATLAB's fitcnb function uses the Gaussian likelihood for continuous predictors:

$$P(z_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{i,c}^{2}}}\,\exp\!\left(-\frac{(z_i-\mu_{i,c})^{2}}{2\sigma_{i,c}^{2}}\right)$$

The predicted class is:

$$\hat{c} = \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(z_i \mid c)$$
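The Gaussian Naïve Bayes fit-and-predict procedure (done with MATLAB's fitcnb in the paper) can be sketched in plain Python; the data below are toy values, not the study's cohort:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n, m = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(m)]
        # small floor avoids division by zero for constant features
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9
                     for j in range(m)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, z):
    """Pick the class with the highest log-posterior (logs avoid underflow)."""
    def log_post(prior, means, variances):
        lp = math.log(prior)
        for zj, mu, var in zip(z, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (zj - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=lambda c: log_post(*model[c]))

# toy 2-feature data with two well-separated classes
nb = fit_gaussian_nb([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]],
                     ['donor', 'donor', 'cirrhosis', 'cirrhosis'])
```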

K-Nearest Neighbors (KNN) model: KNN is a non-parametric, instance-based model that classifies a sample according to the labels of its nearest training points. The distance from a test point z to a training point z_i is calculated as:

$$d(z, z_i) = \lVert z - z_i \rVert_2 = \sqrt{\sum_{j=1}^{n} (z_j - z_{i,j})^{2}}$$

The set of the k nearest neighbours is:

$$N_k(z) = \{\, i \,:\, d(z, z_i) \text{ is among the } k \text{ smallest distances} \,\}$$

Classification is done by majority voting:

$$\hat{c} = \arg\max_{c} \sum_{i \in N_k(z)} \mathbb{1}(y_i = c)$$

In this framework, k = 5 with the Euclidean distance.
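A minimal KNN classifier matching this description (k = 5, Euclidean distance; the paper implements it in MATLAB) might look like this in plain Python with toy points:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, z, k=5):
    """Classify z by majority vote among its k Euclidean-nearest training points."""
    dists = sorted(
        (math.dist(z, x), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D training set with two clusters
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [6, 6]]
train_y = ['donor', 'donor', 'donor',
           'fibrosis', 'fibrosis', 'fibrosis', 'fibrosis']
```

Note that with k = 5 and only three 'donor' points, a query near the donor cluster still wins 3 votes to 2, illustrating how a small minority class can be outvoted as k grows, one mechanism behind the low KNN recall reported later.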

Support Vector Machine (SVM): SVM seeks a hyperplane that maximizes the margin between classes.

For binary classification with training pairs (z_i, y_i), y_i ∈ {−1, +1}, SVM solves:

$$\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{N} \xi_i$$

subject to:

$$y_i\,(w^{\top} z_i + b) \;\geq\; 1 - \xi_i, \qquad \xi_i \geq 0$$

A linear kernel has been used in this model:

$$K(z_i, z_j) = z_i^{\top} z_j$$

As SVM is inherently binary, MATLAB employs Error-Correcting Output Codes (ECOC) through the fitcecoc function to extend SVM to multiclass settings. It trains a set of binary classifiers and assigns the class whose codeword is closest to the vector of binary outputs:

$$\hat{c} = \arg\min_{c}\; d(p_c, \hat{p})$$

where,

p_c = the ECOC codeword of class c,

p̂ = the vector of outputs of the binary SVMs.
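ECOC decoding thus reduces to nearest-codeword matching: each class has a codeword, each binary learner emits one bit, and the predicted class is the one whose codeword is closest to the observed bit vector. A minimal sketch with a hypothetical one-vs-all style codebook for three of the five stages (the actual MATLAB coding design may differ):

```python
def ecoc_decode(codebook, binary_outputs):
    """Assign the class whose codeword has the smallest Hamming distance
    to the vector of binary-classifier outputs."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda c: hamming(codebook[c], binary_outputs))

# hypothetical one-vs-all codebook: one bit per binary learner
codebook = {
    'hepatitis': [1, 0, 0],
    'fibrosis':  [0, 1, 0],
    'cirrhosis': [0, 0, 1],
}
```

The error-correcting property comes from redundancy: with a richer codebook, a single misfiring binary learner still leaves the correct codeword closest overall.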

A theoretical comparison among these three classifiers is presented in Table 2.

Table 2: Basic comparison among adopted techniques 

Model Type Mathematical basis Strengths
Naïve Bayes Probabilistic Bayes’ theorem with conditional independence Fast, interpretable, handles small data
KNN Instance-based Distance computation & majority vote Captures local structure, simple
SVM (ECOC) Large-margin classifier Quadratic optimization, hinge loss High accuracy, robust in high-dimensional PCA features

Performance Evaluation Metrics 

Evaluation was based on a comprehensive set of multi-class measures required for diagnostic assessment, guaranteeing a sound critique that goes beyond mere accuracy. These metrics included:

Accuracy: the overall fraction of correct predictions.

Macro-Averaged Precision: the mean precision over all five classes, which values the correctness of positive predictions.

Macro-Averaged Recall (Sensitivity): the average true positive rate over all five classes, an important aspect of clinical diagnostics because it quantifies the ability to reduce the number of False Negatives (missed diagnoses).

Macro-Averaged F1-Score: the harmonic mean of Precision and Recall, which gives an important balanced view of performance, particularly in the presence of class imbalance.

Confusion matrices and ROC curves were used to provide more detailed, class-level insight into each model's diagnostic performance.
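The macro-averaged metrics above can be computed directly from a confusion matrix. A plain-Python sketch with an illustrative 3-class matrix (not the paper's results):

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall, and F1 from a square confusion
    matrix (rows = true class, columns = predicted class)."""
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, wrongly
        fn = sum(confusion[c]) - tp                        # true c, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    avg = lambda values: sum(values) / n
    return avg(precisions), avg(recalls), avg(f1s)

# toy matrix: a dominant majority class and two under-detected minority classes
cm = [[50, 0, 0],
      [5, 5, 0],
      [5, 0, 5]]
p, r, f = macro_metrics(cm)
```

The toy matrix shows the effect discussed throughout the Results: accuracy is 60/70 ≈ 86%, yet macro recall is only 2/3, because each class counts equally in the macro average regardless of its size.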

Results

Dimensionality Assessment and Feature Space Visualization

The PCA analysis was effective in dealing with the dimensionality of the biomarker panel. The number of principal components required to meet the strict criterion of retaining 95% of cumulative variance was determined. Visual inspection of the Scree Plot (Fig. 2) showed that the variance explained by successive PCs fell off sharply, indicating that the first few PCs captured most of the informational content in the data. The Cumulative Variance Plot (Fig. 3) confirmed that the 95% threshold was reached, demonstrating the suitability of the dimensional reduction in conserving the necessary data variability.

Figure 2: PCA scree plot   


Figure 3: Cumulative PCA variance


The projection of the data onto the first two principal components (PC1 vs PC2 scatter plot, Fig. 4) graphically confirmed that the dimensionality reduction produced significant separation between the classes. In particular, the healthy Blood Donor category (Class 0) formed a comparatively distinct cluster relative to the more serious disease stages (Fibrosis and Cirrhosis). This separation confirmed that PCA translated the original complex biological patterns into independent, linearly separable features, a requirement for effective classification.

Figure 4: PC1 vs PC2 scatter plot


Comparative Classification Performance Metrics

The three developed machine learning models exhibited high classification performance on the PCA-transformed feature set. The detailed performance metrics are summarized in Table 3 and visually compared in Fig. 5.

Figure 5: Performance comparison


Table 3: Classifier Performance Metrics for HCV Staging

Model Accuracy Precision Recall F1
Naïve Bayes 87.8 % 0.494 0.399 0.646
KNN 89.4 % 0.718 0.34 0.612
SVM 91 % 0.745 0.61 0.594

The Support Vector Machine (SVM) model was the most successful, with the highest overall Accuracy of 91% and the highest Macro-Averaged Precision of 0.745. The KNN model achieved 89.4% accuracy and 0.718 precision, while the Naive Bayes model had the lowest accuracy of 87.8% and the lowest precision of 0.494.

However, the comparison of Recall and F1-Score revealed some important aspects. The SVM also obtained the best Macro-Averaged Recall (Sensitivity), at 0.61. In contrast, the KNN model was the least sensitive, with a Recall as low as 0.34. Naive Bayes had the highest Macro-Averaged F1-Score (0.646) despite its poor Accuracy and Precision, which suggests a complex trade-off between the measures in the multi-class setting.

Detailed Model Diagnostics and Class-Level Analysis

Support Vector Machine (SVM) Detailed Performance: The excellent performance of the SVM model validates the strategic choice of feature engineering. That a linear kernel reached a high Accuracy of 91% indicates that PCA had disentangled the complex relationships among the biomarkers so that the disease stages could be effectively separated by a simple linear boundary (hyperplane). The decision boundaries visualized in Fig. 6 illustrate this successful geometric separation. Although it has the highest sensitivity (Recall of 0.61), the moderate F1-Score (0.594) means the model still struggles with instances close to the decision boundaries, most likely because the minority classes contribute little data. The ROC curves (Fig. 7) and the Confusion Matrix (Fig. 8) identify the specific class-level errors behind the 9% overall inaccuracy.

Figure 6: SVM decision boundaries


Figure 7: SVM ROC curves                      


Figure 8: Confusion matrix for SVM model


K-Nearest Neighbors (KNN) Detailed Performance: The low Macro-Averaged Recall score of the KNN model (0.34) compared against its high Precision (0.718) reveals a significant diagnostic limitation: the model is highly conservative, making few incorrect positive predictions, but failing to identify the majority of true disease cases. This result suggests that local clustering in the k=5 neighbourhood metric (Fig. 9) is insufficient for distinguishing the sparsely represented disease classes from the majority population, resulting in a clinically unacceptable False Negative Rate as shown in Fig. 10. The Confusion Matrix (Fig. 11) confirms this high failure rate in detecting true positives.

Figure 9: KNN decision boundaries


Figure 10: KNN ROC curves 


Figure 11: Confusion matrix for KNN model


Naïve Bayes (NB) Detailed Performance: The NB model's low Precision (0.494) implies that it frequently issues false positive predictions when estimating class probabilities. This is clearly depicted in the visualization of model behaviour. The Naïve Bayes distribution boundaries (Fig. 12) show the non-linear, overlapping Gaussian-assumed decision boundaries in the PCA feature space. Even after PCA, the Naive Bayes assumption of feature independence is probably violated, owing to the inherent complexity of liver biomarkers, resulting in these poorly defined boundaries and frequent misclassification. Moreover, the Naïve Bayes ROC curves (Fig. 13) illustrate the class separability: the majority class (Blood Donor) exhibits a strong ROC profile (high AUC), whereas the curves for the minority classes (Hepatitis, Fibrosis, Cirrhosis) visually confirm the poor separability that leads to the low Macro-Averaged Recall (0.399). This weak performance is consistent with the model's low Precision, an unavoidable consequence when latent non-linear relationships cannot be captured by the simple probabilistic distribution models (Fig. 14 – Fig. 17) with which the classifier works.

Figure 12: Naïve Bayes distribution boundaries

Figure 13: Naïve Bayes ROC curves

Figure 14: Naïve Bayes distribution of PC1

Figure 15: Naïve Bayes distribution of PC2

Figure 16: Naïve Bayes distribution of PC3

Figure 17: Naïve Bayes distribution of PC4

Although the model's F1-Score (0.646) was numerically the highest, it should be interpreted with caution, since the F1-Score is highly sensitive to the averaging procedure in imbalanced multi-class scenarios and can mask poor results on critical classes. The source of the low Precision is evident in the Confusion Matrix (Fig. 18), which shows weak discrimination between particular disease stages.

Figure 18: Confusion matrix for Naïve Bayes model

Discussion

Interpretation of Optimal Feature Engineering via PCA

This research successfully leveraged PCA to handle the complexity of the biomarker data. By determining the number of principal components that explain 95 % of the variability in the data, the framework confirmed dimensionality reduction as a reliable means of improving classifier performance and stability. This is not merely a statistical optimization: it isolates latent, orthogonal elements of metabolic pathway dysfunction that are pertinent to HCV progression.

The variance-maximizing principal components act as composite, quantitative biomarkers. The dominant PC1 probably captures the major gradient of disease burden, a weighted sum of unfavourable LFT changes (e.g., elevated transaminases and low albumin) that separates normalcy from severe pathology. Subsequent PCs record secondary effects and may discriminate hepatocellular injury from cholestatic or synthetic failure. That a linear SVM attains 91 % accuracy on these reduced features strongly indicates that PCA successfully organized the data into a low-dimensional space and that the complex information needed for clinical distinction is encapsulated within this representation.
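The 95 %-variance selection step described above (performed in MATLAB in the paper) can be sketched in Python with NumPy; the helper name `pca_95` and its interface are our own illustration:

```python
import numpy as np

def pca_95(X, threshold=0.95):
    """Center the data, eigendecompose the covariance matrix, and keep
    the smallest number of components whose cumulative explained
    variance reaches `threshold` (95 % as in the study)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(ratios), threshold)) + 1
    scores = Xc @ eigvecs[:, :k]             # projected feature space
    return scores, ratios[:k]
```

On data whose 12 observed biomarkers are driven by a handful of latent factors, this reduces the feature space to those few orthogonal components while guaranteeing the retained-variance criterion.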

Analysis of Comparative Model Performance

The consistent superiority of the SVM-ECOC model in Accuracy and Precision confirms the strategic advantage of geometric, maximum-margin classification over probabilistic (NB) and local, instance-based (KNN) approaches for complex, high-dimensional biological data. The strong results of the SVM indicate that the principle of maximizing the margin between classes generalizes better to the patterns present in the PCA-reduced feature space.

The relative weakness of the KNN model, with its critically low Recall (0.34), means that the neighborhood distance metric, even in the optimized PC space, is insufficient to provide diagnostic sensitivity for the infrequent disease categories. This failure mode is especially important in medical applications, where the detection rate must be maximized. The shortcomings of both NB and KNN underscore the need for powerful global classifiers such as SVM in this domain.
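The ECOC wrapper that extends a binary maximum-margin learner to the five-class problem comes down, at prediction time, to Hamming decoding of the binary learners' outputs. Below is a minimal Python sketch of that decoding rule (the function name and the one-vs-all codebook are illustrative, not the paper's MATLAB implementation):

```python
def ecoc_decode(bit_predictions, codebook):
    """Given the ±1 outputs of the binary learners and each class's
    codeword, return the class whose codeword has minimum Hamming
    distance to the predicted bits (the ECOC decoding rule)."""
    def hamming(code):
        return sum(b != p for b, p in zip(code, bit_predictions))
    return min(codebook, key=lambda cls: hamming(codebook[cls]))
```

In an ECOC scheme, each binary SVM learns one bit of the codeword; decoding against the codebook gives the multi-class prediction and provides some error correction when individual binary learners are wrong.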

Addressing Classification Imbalance and Metric Discordance

The major limitation detected in this study is the severe class imbalance among the five diagnostic classes. This limitation is underscored by the disparity between the high overall Accuracy and the lower Macro-Averaged F1-Scores.

In clinical diagnostics, metric prioritization must reflect the cost of each error type. A False Negative, failing to detect a patient with Hepatitis, Fibrosis, or Cirrhosis, has serious ethical and clinical implications, potentially postponing life-saving treatment. Recall (Sensitivity) is therefore the most important metric. The maximum Recall obtained by the SVM model was 0.61, meaning the model carries a substantial False Negative Rate on the critical minority disease classes. This diagnostic inadequacy must be addressed, because a model with 91 % accuracy still misses nearly 40 % of disease cases. The high F1-Score of the NB model, despite its otherwise poor performance, should be viewed with caution, since macro-averaged F1-Scores in multi-class configurations can mask poor performance on individual classes by emphasizing the classifier's mean performance, giving an over-optimistic picture of how well it balances precision and recall across all diagnostic groups.
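The discordance between overall Accuracy and Macro-Averaged Recall described above can be reproduced in a few lines of Python (an illustrative sketch with our own toy cohort, not the paper's data): a classifier that labels nearly everyone as a donor still scores high Accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Recall computed per class, then averaged with equal class weight,
    so rare disease classes count as much as the donor majority."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(classes)

# Toy imbalanced cohort: 90 donors, 10 disease cases, only 4 detected.
y_true = ['donor'] * 90 + ['disease'] * 10
y_pred = ['donor'] * 90 + ['disease'] * 4 + ['donor'] * 6
```

Here Accuracy is 0.94 while Macro-Averaged Recall is only 0.70: the same pattern of metric discordance the study reports.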

Future mitigation may consider rebalancing the training distribution at the data level, using methods such as Random Undersampling and Random Oversampling,23 or more advanced techniques such as the Synthetic Minority Oversampling Technique (SMOTE).22 Methods that effectively couple data balancing with feature extraction remain an open requirement.24
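The simplest of these rebalancing schemes, Random Oversampling, can be sketched in Python as follows (the helper name is our own; SMOTE would instead synthesize interpolated minority samples rather than duplicate existing ones):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == cls]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out
```

Applied before training, this gives the classifier equal exposure to each diagnostic class, at the cost of repeated minority samples (the overfitting risk that motivates SMOTE's interpolation instead).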

Clinical and Ethical Implications of Diagnostic ML Models

The ability to attain 91 % accuracy from a panel of widely accessible LFTs points the way toward fully validated, non-invasive HCV staging instruments that would greatly improve clinical workflow efficiency. Nonetheless, ethical deployment requires strict scrutiny of the model's limits. The high False Negative Rate implied by the Recall score demands care: algorithmic bias linked to demographic characteristics (e.g., gender) has already been reported in liver disease prediction models and can produce unequal rates of misdiagnosis across patient groups. It is important to recognize that the sensitivity deficiency on the minority disease classes is the model's principal failure mode. Moreover, because diagnoses rest on abstract principal components rather than raw biomarker values, the interpretability of model decisions, a standard requirement in clinical settings, may be lower than that of conventional explicit scoring systems.

Analysis of Multi-Category Net Reclassification Improvement (MCNRI)

Because the outcome variable comprises five categories and MCNRI demands discrete, clinically actionable risk thresholds, the five categories must be condensed into a usable set of ordinal risk strata. This aggregation reflects standard clinical practice, in which management decisions hinge on exceeding predefined risk levels, such as ruling advanced fibrosis (F3-F4) in or out. The three ordinal risk strata used to calculate MCNRI are:

Low Risk (F0/F1): Aggregates '0=Blood Donor' and '0s=suspect Blood Donor'. These patients usually require little or no special hepatic monitoring.

Intermediate Risk (F1/F2): Represents the '1=Hepatitis' and potentially early '2=Fibrosis' cases. These demand continuous observation and are the most likely focus of treatment (e.g., antiviral therapy).

High Risk (F3/F4/Cirrhosis): Combines advanced '2=Fibrosis' and all '3=Cirrhosis' cases. These patients demand urgent specialized care, possibly aggressive treatment, and heightened attention to the prevention of complications such as esophageal varices or hepatocellular carcinoma.

The effectiveness of the refined SVM model in moving subjects across these critical boundaries was measured using MCNRI, demonstrating its clinical usefulness for diagnostic and therapeutic flowcharts. The MCNRI results estimate the net improvement in clinical decision-making achieved by the SVM model relative to the alternative ML methods. Unlike simple accuracy, MCNRI evaluates how often a subject's predicted risk status is reclassified appropriately (upward movement of true high-risk cases, downward movement of true low-risk cases) versus inappropriately (missed diagnoses or over-diagnosis). Table 4 presents the multi-category MCNRI results, with the SVM model as the New Model and NB or KNN as the Baseline Model.
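One common way to operationalize a multi-category NRI over ordinal strata is to count subjects whose predicted stratum moves toward, versus away from, the true stratum under the new model. The Python sketch below illustrates that idea; the function name and this particular formula are our own illustration, and the paper's exact MCNRI computation may differ.

```python
def mcnri(true_stratum, baseline_pred, new_pred):
    """Net proportion of subjects whose predicted risk stratum moves
    toward (vs. away from) the true stratum under the new model.
    Strata are ordinal integers, e.g. 0=Low, 1=Intermediate, 2=High."""
    improved = worsened = 0
    for t, old, new in zip(true_stratum, baseline_pred, new_pred):
        if abs(new - t) < abs(old - t):
            improved += 1    # reclassified closer to the truth
        elif abs(new - t) > abs(old - t):
            worsened += 1    # reclassified away from the truth
    return (improved - worsened) / len(true_stratum)
```

A positive value means the new model nets more clinically appropriate reclassifications than inappropriate ones, which is how the +0.0813 and +0.0488 figures in Table 4 are interpreted.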

Table 4: MCNRI analysis

Model compared with SVM | MCNRI value (per 100 subjects) | Interpretation of clinical gain
Naïve Bayes (NB) | +0.0813 | The SVM model correctly reclassifies a net 8.13 % more patients into clinically appropriate risk strata than the Naïve Bayes model.
K-Nearest Neighbours (KNN) | +0.0488 | The SVM model correctly reclassifies a net 4.88 % more patients into clinically appropriate risk strata than the K-Nearest Neighbors model.

The uniformly positive MCNRI values confirm the strategic benefit of the Support Vector Machine framework in this diagnostic task, even though other algorithms scored highly on other measures (such as the high F1-Score of NB).

Moving from the probabilistic Naive Bayes model to the geometric SVM model yields a Net Reclassification Improvement of +8.13 %, a substantial gain in clinical utility: for every 100 patients evaluated, the SVM keeps roughly 8 more patients in the correct risk category than NB would. This enhancement is critical for patient triage, as the SVM is far more effective at moving truly low-risk patients into a safe monitoring group and directing those at genuinely high risk (Fibrosis/Cirrhosis) toward aggressive treatment. The large MCNRI value is probably due to a severe weakness of the Naïve Bayes model, namely its assumption that features are independent. Because liver biomarkers are highly correlated (e.g., AST/ALT, ALP/GGT), the NB model cannot produce reliable probability estimates near decision boundaries, yielding many clinically inappropriate misclassifications that the SVM avoids through its margin-maximization principle.

Against the KNN model, the SVM achieves a Net Reclassification Improvement of +4.88 %. Although this margin is smaller than that over NB, it still demonstrates a tangible, net positive improvement in patient risk stratification: for every 100 patients, the SVM correctly steers nearly 5 more individuals into the appropriate management group. The weak performance of the KNN model was reflected in its low Macro-Averaged Recall (0.34), indicating a conservative classifier that missed individual disease cases. The MCNRI of +4.88 % shows that the SVM's higher sensitivity (Recall 0.61) reduced the number of dangerous False Negatives and moved more high-risk patients into a higher, clinically appropriate risk stratum. The MCNRI findings provide conclusive evidence that the SVM model is the one to implement: the differences in raw Accuracy (91 % vs 89.4 %) translate into strong, demonstrable gains in clinical utility, confirming the effectiveness of the SVM framework operating on the PCA-compressed feature space to optimize patient triage and risk management flowcharts. This clearly justifies the recommendation to continue developing the SVM architecture in future research.

Conclusion

This study derived and validated a robust computational pipeline for non-invasive multi-class staging of Hepatitis C Virus (HCV) disease progression using MATLAB. Rigorous use of Principal Component Analysis (PCA) to normalize the input features, under the strict criterion of capturing at least 95 % of total data variance, effectively converted the complex, correlated interrelationships of 12 clinical liver biomarkers into an optimal orthogonal feature set. This dimensionality reduction stage was crucial and formed the foundation of effective machine learning classification. The comparative analysis supported the underlying hypothesis of the statistical and geometric superiority of the Support Vector Machine (SVM) model. The SVM-ECOC classifier with a linear kernel delivered the best overall predictive performance, with an Accuracy of 91 % and a Macro-Averaged Precision of 0.745. This finding demonstrates that maximum-margin classification defines decision boundaries in the PCA-transformed feature space more successfully than the probabilistic Naive Bayes and instance-based K-Nearest Neighbors algorithms for this multi-class diagnostic problem.

From a translational perspective, the 91 % overall accuracy achieved with easily accessible serum biomarkers demonstrates fundamental translational potential rather than immediate clinical deployability. A validated, non-invasive measure of this kind could significantly reduce reliance on invasive liver biopsies, lowering diagnostic cost and patient risk and accelerating the diagnosis and staging of liver damage, especially in resource-limited environments or large-scale screening programs. The study also included a critical self-evaluation of the model's limitations, identifying the serious restriction posed by data imbalance. The Macro-Averaged Recall (Sensitivity) of 0.61 across the disease classes indicates an unacceptably high False Negative Rate for the clinically critical minority stages (Hepatitis, Fibrosis and Cirrhosis). Importantly, the Multi-Category Net Reclassification Improvement (MCNRI) analysis established the translational effectiveness of the SVM model over the alternatives, with an 8.13 % net improvement in correct patient risk stratification compared to Naive Bayes and 4.88 % compared to K-Nearest Neighbors.

This research recommends investigating and comparing advanced methods, including the Synthetic Minority Oversampling Technique (SMOTE) and its region-based counterparts (RSMOTE), to equalize the training distribution. Moreover, cost-sensitive algorithmic corrections should impose explicitly greater penalties on False Negatives for the most severe disease stages. It is also important to explore highly effective ensemble techniques, such as Random Forest, which has shown good performance in similar liver disease classification research. Effectively solving the class imbalance problem is not merely a statistical improvement but an ethical necessity, required for this computational tool to deliver credible, fair and life-saving diagnostics in real patient care.
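A simple starting point for the cost-sensitive correction recommended here is inverse-frequency class weighting. The Python sketch below (an illustrative helper of our own, not from the study) computes per-class weights that scale the loss penalty for errors on rare disease classes:

```python
from collections import Counter

def inverse_frequency_weights(y):
    """Per-class weights inversely proportional to class frequency, so a
    weighted loss penalises errors on rare (disease) classes more
    heavily than errors on the donor majority."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}
```

Such weights can be passed to a cost-sensitive training objective so that a missed Cirrhosis case costs several times more than a missed donor, directly targeting the False Negative problem identified above.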

Future work should consider non-linear SVM kernels and well-constructed ensemble techniques, rigorous ethical validation such as bias testing (e.g., for gender bias), and better model interpretability through Explainable AI (XAI) to make the computational tool clinically reliable and fair.

Acknowledgement

The authors thank Graphic Era Hill University, Bhimtal Campus, for providing the facilities, academic environment, and support that enabled this research. The institutional resources, infrastructure and administrative support allowed the study to proceed without hindrance, and we are grateful to the university for a stimulating and supportive atmosphere for research and academic development.

Funding Sources

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Conflict of Interest

The author(s) do not have any conflict of interest.

Data Availability Statement

The secondary data used in this work was acquired online from the UC Irvine Machine Learning Repository (https://doi.org/10.24432/C5D612).

Ethics Statement

This research did not involve human participants, animal subjects, or any material that requires ethical approval.

Informed Consent Statement

This study did not involve human participants, and therefore, informed consent was not required.

Clinical Trial Registration

This research does not involve any clinical trials.

Permission to reproduce material from other sources

Not Applicable

Author Contributions

  • Sandeep Kumar Sunori: Simulations and finding of results
  • Shilpa Jain: Data finding and Literature survey
  • Govind Singh Jethi:  Result Analysis
  • Pradeep Juneja: Complete drafting of paper and formatting

References

  1. Farhadpour S, Warner TA, Maxwell AE. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sensing. 2024;16(3):533.
    CrossRef
  2. Liu X, Lu Y, Wang L, Geng W, Shi X, Zhang X. RF-PSSM: A Combination of Rotation Forest Algorithm and Position-Specific Scoring Matrix for Improved Prediction of Protein-Protein Interactions Between Hepatitis C Virus and Human. Big Data Mining and Analytics. March 2023;6(1):21-31.
    CrossRef
  3. Chen L, Ji P, Ma Y. Machine Learning Model for Hepatitis C Diagnosis Customized to Each Patient. IEEE Access. 2022;10:106655-106672.
    CrossRef
  4. Saeed M, Ahsan M, Saeed MH, Mehmood A, Abdeljawad T. An Application of Neutrosophic Hypersoft Mapping to Diagnose Hepatitis and Propose Appropriate Treatment. IEEE Access. 2021;9:70455-70471.
    CrossRef
  5. Chicco D, Jurman G. An Ensemble Learning Approach for Enhanced Classification of Patients With Hepatitis and Cirrhosis. IEEE Access. 2021;9:24485-24498.
    CrossRef
  6. Li T-HS, Chiu H-J, Kuo P-H. Hepatitis C Virus Detection Model by Using Random Forest, Logistic-Regression and ABC Algorithm. IEEE Access. 2022;10:91045-91058.
    CrossRef
  7. Corson S, Greenhalgh D, Hutchinson S. Mathematically modelling the spread of hepatitis C in injecting drug users. Mathematical Medicine and Biology: A Journal of the IMA. Sept. 2012;29(3):205-230.
    CrossRef
  8. Wong WWL, Feng ZZ, Thein H-H. A Parallel Sliding Region Algorithm to Make Agent-Based Modeling Possible for a Large-Scale Simulation: Modeling Hepatitis C Epidemics in Canada. IEEE Journal of Biomedical and Health Informatics. Nov. 2016;20(6):1538-1544.
    CrossRef
  9. Simos T, Georgopoulou U, Thyphronitis G, Koskinas J, Papaloukas C. Analysis of Protein Interaction Networks for the Detection of Candidate Hepatitis B and C Biomarkers. IEEE Journal of Biomedical and Health Informatics. Jan. 2015;19(1):181-189.
    CrossRef
  10. Hashem S, Esmat G, Elakel W, et al. Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients. IEEE/ACM Transactions on Computational Biology and Bioinformatics. May-June 2018;15(3):861-868.
    CrossRef
  11. de Campos da Costa JP, Bastos WB, da Costa PI, Zaghete MA, Longo E, Carmo JP. Portable Laboratory Platform With Electrochemical Biosensors for Immunodiagnostic of Hepatitis C Virus. IEEE Sensors Journal. 15 Nov. 2019;19(22):10701-10709.
    CrossRef
  12. Fu Y, Chen G, Fu L, Zhang J. Investigating genotype 1a HCV drug resistance in NS5A region via Bayesian inference. Tsinghua Science and Technology. Oct. 2015;20(5):484-490.
    CrossRef
  13. Liu B, Feng S, Guo X, Zhang J. Bayesian analysis of complex mutations in HBV, HCV, and HIV studies. Big Data Mining and Analytics. September 2019;2(3):145-158.
    CrossRef
  14. Leung K, Lee K, Wang J et al. Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Mar-Apr 2011;8(2):428-40.
    CrossRef
  15. Derar AR, Hussien EM. Disposable Multiwall Carbon Nanotubes Based Screen Printed Electrochemical Sensor With Improved Sensitivity for the Assay of Daclatasvir: Hepatitis C Antiviral Drug. IEEE Sensors Journal. 1 March 2019;19(5):1626-1632.
    CrossRef
  16. El Atifi W, El Rhazouani O, Khan FM, Sekkat H. Optimizing ensemble machine learning models for accurate liver disease prediction in healthcare. PLoS One. 28 Aug 2025;20(8):e0330899.
    CrossRef
  17. Peng J, Jury EC, Dönnes P, Ciurtin C. Machine Learning Techniques for Personalised Medicine Approaches in Immune-Mediated Chronic Inflammatory Diseases: Applications and Challenges. Frontiers in Pharmacology. September 2021;12.
    CrossRef
  18. Hall P, Cash J. What is the real function of the liver ‘function’ tests? Ulster Med J. Jan 2012;81(1):30-6.
  19. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 13 Apr 2016;374(2065):20150202.
    CrossRef
  20. Sadegh-Zadeh SA, Sadeghzadeh N, Soleimani O, Shiry Ghidary S, Movahedi S, Mousavi SY. Comparative analysis of dimensionality reduction techniques for EEG-based emotional state classification. Am J Neurodegener Dis. 25 Oct 2024;13(4):23-33.
    CrossRef
  21. Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 21 Dec 2019;19(1):281.
    CrossRef
  22. Straw I, Wu H. Investigating for bias in healthcare algorithms: a sex-stratified analysis of supervised machine learning models in liver disease prediction. BMJ Health & Care Informatics. 2022;29:e100457.
    CrossRef
  23. Wah YB, Abd Rahman HA, He H, Bulgiba A. Handling imbalanced dataset using SVM and k-NN approach. AIP Conf. Proc. 21 June 2016;1750(1):020023.
    CrossRef
  24. Salim Y, Utami AP, Manga' AR, Azis H, Admojo FT. Optimal Strategy for Handling Unbalanced Medical Datasets: Performance Evaluation of K-NN Algorithm Using Sampling Techniques. Knowledge Engineering and Data Science (KEDS). December 2024;7(2):176-186.
    CrossRef
  25. Lichtinghagen R, Klawonn F, Hoffmann G. HCV data [Dataset]. UCI Machine Learning Repository. 2020.
  26. Hoffmann GF, Bietenbeck A, Lichtinghagen R, Klawonn F. Using machine learning techniques to generate laboratory diagnostic pathways—a case study. Journal of Laboratory and Precision Medicine. 2018.

This work is licensed under a Creative Commons Attribution 4.0 International License.