01_credit_risk_graph

Data Science Projects

Credit Risk with Graph Curvature using Home Credit Data

A Geometric Machine Learning Approach (PySpark)

Roberto SSoares - LfLngLrnng

in/roberto-dos-santos-soares
Portfolio: roberto-ssoares

" [+] Revenue,
[-] Cost,
[+] Quality of life "
"Mestre Bruno Jardim"

📌 Objective

This project investigates whether the relational geometry between customers can enrich credit risk modeling.
Instead of treating customers only as independent rows in a table, we build a similarity graph between customers from the Home Credit Default Risk dataset.
We then compute geometric measures on the graph, such as Ricci curvature, and incorporate them as features for predictive modeling.

📌 Hypothesis

Customers who are structurally similar in attribute space form relational patterns that may carry additional predictive signal for default.

📚 Installing and Loading the Packages

Objective:

Import the libraries needed for data manipulation, graph construction, curvature computation, and predictive modeling.

Actions performed:

Import analysis, ML, and graph libraries

Configure warnings

Technical rationale:

This project integrates tabular processing with graph modeling, so we need to combine classic data science libraries with network analysis libraries.

Expected results:

Environment ready for loading, preparation, graph construction, and modeling.

import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

import networkx as nx
from GraphRicciCurvature.OllivierRicci import OllivierRicci

✔️ 0. Initial Settings

Objective:

Define the project's global parameters.

Actions performed:

Define paths

Define key columns

Define initial hyperparameters

Technical rationale:

Centralizing parameters improves the notebook's reproducibility and maintainability.

Expected results:

A parameterized project that is easy to adjust.

RANDOM_STATE = 42
DIR_RAW = "../data/00-raw"
FILE = f"{DIR_RAW}/application_train.csv"

ID_COL = "SK_ID_CURR"
TARGET_COL = "TARGET"

SAMPLE_SIZE = 5000     # adjust to your machine's memory
K_NEIGHBORS = 8        # number of neighbors in the KNN graph

✔️ 1. Data Loading

Objective:

Load the main Home Credit table for analysis.

Actions performed:

Read the main CSV, application_train.csv

Technical rationale:

application_train is the central table of the default problem and contains the labeled customers with the target variable TARGET.

Expected results:

Raw DataFrame loaded in memory.

df_raw = pd.read_csv(FILE)
df_raw.head()

	SK_ID_CURR	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	...	var_41	var_42	var_43	var_44	var_45	var_46	var_47	var_48	var_49	var_50
0	247330	0	Cash loans	F	N	N	0	157500.0	706410.0	67072.5	...	0.824762	0.333516	0.293260	0.564878	0.115058	0.655605	0.415562	0.092643	0.723331	0.796523
1	425716	1	Cash loans	F	Y	Y	1	121500.0	545040.0	25407.0	...	0.416260	0.404293	0.137944	0.457971	0.303691	0.215059	0.838892	0.608335	0.585643	0.298456
2	331625	0	Cash loans	M	Y	Y	1	225000.0	942300.0	27679.5	...	0.037711	0.124465	0.091840	0.364601	0.978220	0.520309	0.594523	0.559650	0.361873	0.254804
3	455397	0	Revolving loans	F	N	Y	2	144000.0	180000.0	9000.0	...	0.784630	0.831403	0.210872	0.049639	0.814219	0.830179	0.755163	0.216664	0.603002	0.429001
4	449114	0	Cash loans	F	N	Y	0	112500.0	729792.0	37390.5	...	0.265381	0.655344	0.668705	0.171391	0.335702	0.585494	0.619551	0.686738	0.540449	0.343632

5 rows × 172 columns

✔️ 2. Initial Diagnosis

Objective:

Understand the size, data types, and initial quality of the dataset.

Actions performed:

Inspect the shape

Count nulls

Inspect the distribution of the target variable

Technical rationale:

This step corresponds to Data Understanding in CRISP-DM and guides decisions on sampling, imputation, and variable selection.

Expected results:

An overview of the dataset and of the TARGET class imbalance.

print("Dataset shape:", df_raw.shape)
print(" ")
print("\nTARGET distribution:")
print(df_raw[TARGET_COL].value_counts(dropna=False))
print(" ")
print("\nTARGET proportion:")
print(df_raw[TARGET_COL].value_counts(normalize=True).round(4))
print(" ")
print("\nPercentage of nulls per column (top 20):")
null_pct = (df_raw.isna().mean() * 100).sort_values(ascending=False)
print(null_pct.head(20))

Dataset shape: (215257, 172)
 

TARGET distribution:
TARGET
0    197845
1     17412
Name: count, dtype: int64
 

TARGET proportion:
TARGET
0    0.9191
1    0.0809
Name: proportion, dtype: float64
 

Percentage of nulls per column (top 20):
COMMONAREA_MODE             69.859284
COMMONAREA_MEDI             69.859284
COMMONAREA_AVG              69.859284
NONLIVINGAPARTMENTS_MODE    69.408660
NONLIVINGAPARTMENTS_AVG     69.408660
NONLIVINGAPARTMENTS_MEDI    69.408660
FONDKAPREMONT_MODE          68.375477
LIVINGAPARTMENTS_AVG        68.327162
LIVINGAPARTMENTS_MODE       68.327162
LIVINGAPARTMENTS_MEDI       68.327162
FLOORSMIN_MEDI              67.824043
FLOORSMIN_MODE              67.824043
FLOORSMIN_AVG               67.824043
YEARS_BUILD_MODE            66.496792
YEARS_BUILD_AVG             66.496792
YEARS_BUILD_MEDI            66.496792
OWN_CAR_AGE                 65.891469
LANDAREA_MEDI               59.374143
LANDAREA_MODE               59.374143
LANDAREA_AVG                59.374143
dtype: float64

df_raw.describe().transpose()

	count	mean	std	min	25%	50%	75%	max
SK_ID_CURR	215257.0	278236.387137	102885.029589	100003.000000	189025.000000	278215.000000	367388.000000	4.562550e+05
TARGET	215257.0	0.080889	0.272666	0.000000	0.000000	0.000000	0.000000	1.000000e+00
CNT_CHILDREN	215257.0	0.416637	0.719695	0.000000	0.000000	0.000000	1.000000	1.900000e+01
AMT_INCOME_TOTAL	215257.0	168556.848346	105855.718537	25650.000000	112500.000000	144000.000000	202500.000000	1.350000e+07
AMT_CREDIT	215257.0	599495.998425	402898.914406	45000.000000	270000.000000	514867.500000	808650.000000	4.050000e+06
...	...	...	...	...	...	...	...	...
var_46	215257.0	0.500602	0.288052	0.000004	0.251720	0.500931	0.750341	9.999807e-01
var_47	215257.0	0.499781	0.288411	0.000008	0.249971	0.499138	0.749475	9.999971e-01
var_48	215257.0	0.500618	0.288286	0.000010	0.250336	0.502229	0.749817	9.999943e-01
var_49	215257.0	0.499476	0.288685	0.000017	0.248878	0.499258	0.749368	9.999751e-01
var_50	215257.0	0.499538	0.288505	0.000006	0.249740	0.499035	0.748715	9.999967e-01

156 rows × 8 columns

✔️ 3. Controlled Sampling

Objective:

Reduce the data volume so that graph construction and curvature computation become feasible.

Actions performed:

Simple stratified sampling by TARGET

Technical rationale:

Computing curvature on graphs can be computationally expensive.

A controlled sample preserves the class proportions while remaining computationally feasible.

Using groupby().sample() avoids the structural problems that groupby().apply() can introduce, such as the grouping column moving into the index or dropping out of the result.

Expected results:

A smaller sampled DataFrame, better suited to the initial experiment.

sample_frac = SAMPLE_SIZE / len(df_raw)

df1_amostra = (
    df_raw
    .groupby(TARGET_COL, group_keys=False)
    .sample(frac=sample_frac, random_state=RANDOM_STATE)
    .reset_index(drop=True)
)

print("Sample shape:", df1_amostra.shape)
print()
print(df1_amostra[TARGET_COL].value_counts(normalize=True).round(4))

Sample shape: (5000, 172)

TARGET
0    0.9192
1    0.0808
Name: proportion, dtype: float64
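As a small illustration of the pattern above, the sketch below applies groupby().sample() to a toy DataFrame. The data and the names toy and toy_sample are made up for this example. It shows that the sample keeps both the grouping column and the class proportions.

```python
import pandas as pd

# Toy data (hypothetical): 1,000 rows with a 10% positive class.
toy = pd.DataFrame({
    "TARGET": [1] * 100 + [0] * 900,
    "x": range(1000),
})

# Sample 20% within each class; the result keeps the original columns.
toy_sample = (
    toy.groupby("TARGET", group_keys=False)
       .sample(frac=0.2, random_state=42)
       .reset_index(drop=True)
)

print(toy_sample.shape)
print(toy_sample["TARGET"].value_counts().to_dict())
```

Each class is sampled independently, so the 90/10 split of the toy data carries over to the sample.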

print("TARGET exists in df?", TARGET_COL in df1_amostra.columns)
print(" ")
print(df1_amostra.columns[:10].tolist())

TARGET exists in df? True
 
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY']

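✔️ 3.1. An Even More Explicit Alternative

If you want to control the sample size per class in a fully transparent way, the loop below does so. Because int() truncates each class quota, the total can fall short of SAMPLE_SIZE (4,999 rows here).
For this project, we keep the first version, which uses groupby().sample().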

n_total = SAMPLE_SIZE

class_counts = df_raw[TARGET_COL].value_counts()
class_props = class_counts / len(df_raw)

sample_parts = []

for cls, prop in class_props.items():
    n_cls = max(1, int(n_total * prop))
    part = df_raw[df_raw[TARGET_COL] == cls].sample(
        n=min(n_cls, len(df_raw[df_raw[TARGET_COL] == cls])),
        random_state=RANDOM_STATE
    )
    sample_parts.append(part)

df2_amostra = pd.concat(sample_parts, axis=0).sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

print("Sample shape:", df2_amostra.shape)
print()
print(df2_amostra[TARGET_COL].value_counts(normalize=True).round(4))

Sample shape: (4999, 172)

TARGET
0    0.9192
1    0.0808
Name: proportion, dtype: float64
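The explicit version returns 4,999 rows because int() truncates 4595.55 to 4595 and 404.45 to 404. One possible way to keep the total exact is a largest-remainder allocation, sketched below. The helper allocate_quotas is hypothetical and not part of the notebook. It only computes the per-class quotas, which could then be passed as n= to sample().

```python
import math

def allocate_quotas(class_counts, n_total):
    """Split n_total across classes proportionally (largest-remainder method).

    class_counts: dict mapping class label -> number of rows in that class.
    Hypothetical helper for illustration only.
    """
    total = sum(class_counts.values())
    exact = {c: n_total * k / total for c, k in class_counts.items()}
    quotas = {c: math.floor(v) for c, v in exact.items()}
    # Hand the rows lost to truncation to the classes with the largest remainders.
    leftover = n_total - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas

# Class counts reported in the initial diagnosis.
quotas = allocate_quotas({0: 197845, 1: 17412}, 5000)
print(quotas)
```

With these counts the quotas come out as 4,596 and 404, which matches what groupby().sample() produced through rounding.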

✔️ 4. Initial Feature Selection

Objective:

Select an initial set of more stable and interpretable features for the graph and for the baseline.

Actions performed:

Separate numeric and categorical columns

Remove columns with too many nulls

Technical rationale:

For the initial experiment, we favor a robust pipeline.

Columns with more than 40% missing values can add unnecessary noise in this first cycle.

Expected results:

Lists of numeric and categorical columns eligible for the model.

# drop columns with more than 40% missing values
missing_ratio = df1_amostra.isna().mean()
keep_cols = missing_ratio[missing_ratio <= 0.40].index.tolist()

# make sure the key columns are kept (note: set() does not preserve column order)
keep_cols = list(set(keep_cols + [ID_COL, TARGET_COL]))

df1_features = df1_amostra[keep_cols].copy()

num_cols = df1_features.select_dtypes(include=["number"]).columns.tolist()
cat_cols = df1_features.select_dtypes(include=["object"]).columns.tolist()

# remove the ID and the target from the numeric list
num_cols = [c for c in num_cols if c not in [ID_COL, TARGET_COL]]

print("Number of numeric columns:", len(num_cols))
print("Number of categorical columns:", len(cat_cols))
print("Shape after missing-value filter:", df1_features.shape)

Number of numeric columns: 109
Number of categorical columns: 12
Shape after missing-value filter: (5000, 123)

✔️ 5. Feature Selection for the Graph

Objective:

Define which numeric variables will be used to measure similarity between customers.

Actions performed:

Select a subset of relevant numeric variables

Fall back automatically to the first numeric columns if fewer than five of the preferred columns are available

Technical rationale:

The KNN graph will be built from distances in attribute space.

Using numeric variables makes standardization and neighborhood computation straightforward.

Expected results:

Final list of numeric columns for building the graph.

preferred_graph_cols = [
    "AMT_INCOME_TOTAL",
    "AMT_CREDIT",
    "AMT_ANNUITY",
    "AMT_GOODS_PRICE",
    "DAYS_BIRTH",
    "DAYS_EMPLOYED",
    "CNT_CHILDREN",
    "CNT_FAM_MEMBERS",
    "REGION_POPULATION_RELATIVE",
    "EXT_SOURCE_1",
    "EXT_SOURCE_2",
    "EXT_SOURCE_3"
]

graph_num_cols = [c for c in preferred_graph_cols if c in df1_features.columns]

# fallback so the experiment still runs if fewer than 5 preferred columns are available
if len(graph_num_cols) < 5:
    graph_num_cols = num_cols[:10]

print("Columns used in the graph:")
print(graph_num_cols)

Columns used in the graph:
['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'REGION_POPULATION_RELATIVE', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

✔️ 6. Preparing the Numeric Matrix for KNN

Objective:

Prepare the numeric matrix used to find the nearest neighbors.

Actions performed:

Impute missing numeric values with the median

Standardize the variables

Technical rationale:

KNN is sensitive to scale. Standardization keeps variables with larger magnitudes from dominating the distance computation.

Expected results:

A standardized numeric matrix ready for the neighborhood computation.

imputer_graph = SimpleImputer(strategy="median")
scaler_graph = StandardScaler()

X_graph_num = imputer_graph.fit_transform(df1_features[graph_num_cols])
X_graph_scaled = scaler_graph.fit_transform(X_graph_num)

print("Graph matrix shape:", X_graph_scaled.shape)

Graph matrix shape: (5000, 11)
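To make the scale argument concrete, the sketch below uses four made-up customers with an income column and a score column. The values and the helper nearest_to_first are illustrative assumptions. It shows the nearest neighbor of the first customer changing once the columns are standardized.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy customers (made-up values): [annual income, score in 0..1]
X = np.array([
    [100_000.0, 0.10],
    [100_500.0, 0.90],   # close in income, far in score
    [130_000.0, 0.12],   # far in income, close in score
    [160_000.0, 0.50],
])

def nearest_to_first(matrix):
    # n_neighbors=2: the query point itself plus its closest other point
    nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(matrix)
    _, idx = nn.kneighbors(matrix[:1])
    return int(idx[0, 1])

raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X))
print(raw_neighbor, scaled_neighbor)
```

On the raw matrix the income column decides everything, so customer 1 is the nearest. After standardization the score difference counts as well, and customer 2 becomes the nearest.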

✔️ 7. Building the KNN Graph

Objective:

Build a similarity graph between customers using k-nearest neighbors.

Actions performed:

Fit the NearestNeighbors algorithm

Create edges between similar customers

Record the distance between customers

Technical rationale:

Because the dataset has no explicit edges between customers, we model relationships through profile similarity. This creates a relational structure from which geometric signals can be extracted.

Expected results:

An undirected graph in which nodes are customers and edges represent similarity.

nbrs = NearestNeighbors(
    n_neighbors=K_NEIGHBORS + 1,   # +1 because each point's first neighbor is itself
    metric="euclidean"
)
nbrs.fit(X_graph_scaled)

distances, indices = nbrs.kneighbors(X_graph_scaled)

customer_ids = df1_features[ID_COL].tolist()

G = nx.Graph()

# add one node per customer
G.add_nodes_from(customer_ids)

# add edges from each customer to its k nearest neighbors
for i, source_id in enumerate(customer_ids):
    for j in range(1, K_NEIGHBORS + 1):   # skip column 0 (the point itself)
        target_id = customer_ids[indices[i, j]]
        dist = distances[i, j]

        if source_id != target_id:
            G.add_edge(
                source_id,
                target_id,
                weight=float(1 / (1 + dist)),   # similarity in (0, 1]
                distance=float(dist)            # Euclidean distance in standardized space
            )

print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())

Number of nodes: 5000
Number of edges: 27804
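The loop proposes 5,000 × 8 = 40,000 neighbor pairs, yet the undirected graph has 27,804 edges. When two customers list each other as neighbors, nx.Graph stores that pair once. The sketch below reproduces the effect on synthetic points. X_toy and G_toy are illustrative names and the data is random, not Home Credit data.

```python
import networkx as nx
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))   # synthetic points, not Home Credit data
k = 8

_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_toy).kneighbors(X_toy)

G_toy = nx.Graph()
G_toy.add_nodes_from(range(len(X_toy)))
# Column 0 is the point itself; columns 1..k are its k nearest neighbors.
G_toy.add_edges_from(
    (i, int(j)) for i, row in enumerate(idx) for j in row[1:]
)

n_edges = G_toy.number_of_edges()
min_degree = min(d for _, d in G_toy.degree())
print(n_edges, min_degree)
```

The edge count always falls between n·k/2 (every pair mutual) and n·k (no pair mutual), and every node keeps a degree of at least k. That is consistent with the minimum degree of 8 reported in the next section.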

✔️ 8. Structural Inspection of the Graph

Objective:

Observe basic properties of the graph.

Actions performed:

Compute the mean degree

Count connected components

Technical rationale:

This check shows whether the graph came out too sparse, too dense, or fragmented.

Expected results:

An initial structural diagnosis of the graph.

degrees = [deg for _, deg in G.degree()]
print("Mean degree:", round(np.mean(degrees), 2))
print("Minimum degree:", np.min(degrees))
print("Maximum degree:", np.max(degrees))
print("Number of connected components:", nx.number_connected_components(G))

Mean degree: 11.12
Minimum degree: 8
Maximum degree: 26
Number of connected components: 1
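The reported mean degree can be cross-checked from the node and edge counts alone, since each undirected edge adds one to the degree of two nodes. The short calculation below uses only the numbers printed above. The density line is an extra illustration, not something the notebook computes.

```python
# Values reported in the outputs above.
n_nodes = 5000
n_edges = 27804

# Handshake lemma: every edge contributes to the degree of two nodes.
mean_degree = 2 * n_edges / n_nodes

# Density: share of all possible undirected pairs that are actually connected.
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(round(mean_degree, 2))
print(round(density, 5))
```

A mean degree of 11.12 with a density near 0.2% describes a sparse graph, and the single connected component shows it is not fragmented.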

✔️ 9. Subgraph Visualization

Objective:

Visualize a small sample of the graph for qualitative inspection.

Actions performed:

Select a subset of nodes

Draw a simple plot with a spring layout

Technical rationale:

Exploratory visualization helps interpret the connectivity pattern.

Expected results:

A visual representation of a subgraph of the customer network.

plt.figure(figsize=(10, 8))

sub_nodes = list(G.nodes())[:150]
subgraph = G.subgraph(sub_nodes)

pos = nx.spring_layout(subgraph, seed=RANDOM_STATE)
nx.draw(
    subgraph,
    pos,
    node_size=25,
    with_labels=False
)

plt.title("Sample subgraph of similar customers")
plt.show()

✔️ 10. Computing the Ollivier-Ricci Curvature

Objective:

Compute the Ricci curvature on the edges of the graph.

Actions performed:

Run the OllivierRicci algorithm

Technical rationale:

Curvature summarizes local geometric properties of the graph.

Intuitively, it helps identify dense regions, bottlenecks, and other relevant structural patterns.

Expected results:

A graph enriched with the ricciCurvature attribute on its edges.

orc = OllivierRicci(G, alpha=0.5, verbose="INFO")
orc.compute_ricci_curvature()
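For reference, the Ollivier-Ricci curvature of an edge is commonly defined as below, written here for an unweighted graph with lazy-random-walk measures. This is the textbook form, given as an assumption about what alpha=0.5 controls. The GraphRicciCurvature implementation also uses edge weights when spreading mass to neighbors, so its values need not match this formula exactly.

```latex
\kappa(x, y) = 1 - \frac{W_1\!\left(m_x^{\alpha},\, m_y^{\alpha}\right)}{d(x, y)},
\qquad
m_x^{\alpha}(z) =
\begin{cases}
\alpha & \text{if } z = x, \\
\dfrac{1 - \alpha}{\deg(x)} & \text{if } z \text{ is a neighbor of } x, \\
0 & \text{otherwise.}
\end{cases}
```

Here W_1 is the Wasserstein-1 (earth mover's) distance between the two neighborhood measures and d(x, y) is the shortest-path distance. Curvature is positive when the two neighborhoods overlap heavily, as in dense regions. It is negative when the edge acts as a bridge between otherwise distant neighborhoods, as in bottlenecks.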

✔️ 10.1. Option: Force Execution Without Multiprocessing

Depending on the library version, passing proc=1 may work:

#orc = OllivierRicci(G, alpha=0.5, verbose="INFO", proc=1)
#orc.compute_ricci_curvature()

✔️ 11. Extracting the Edge Curvatures

Objective:

Organize the curvature results in tabular form.

Actions performed:

Iterate over the edges

Extract the curvature and auxiliary attributes

Technical rationale:

A tabular layout makes descriptive statistics, visualization, and interpretation easier.

Expected results:

A DataFrame of edges with their Ricci curvature.

G_curv = orc.G

edge_rows = []

for u, v, data in G_curv.edges(data=True):
    edge_rows.append({
        "source": u,
        "target": v,
        "distance": data.get("distance", np.nan),
        "weight": data.get("weight", np.nan),
        "ricci_curvature": data.get("ricciCurvature", np.nan)
    })

df_edges = pd.DataFrame(edge_rows)
df_edges.head()

	source	target	distance	weight	ricci_curvature
0	204040	254299	0.419204	0.704620	0.383597
1	204040	356853	1.067864	0.483591	0.040278
2	204040	333990	1.150261	0.465060	0.040947
3	204040	432673	1.212241	0.452030	-0.009316
4	204040	196367	1.263886	0.441718	-0.099400
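As an alternative to the explicit loop, networkx can export edge attributes directly with to_pandas_edgelist. The sketch below runs it on a tiny graph with made-up attribute values. G_demo and df_demo are illustrative names, and the column is renamed only to match the layout of df_edges.

```python
import networkx as nx

# Tiny graph with made-up attribute values, mimicking the edge attributes used above.
G_demo = nx.Graph()
G_demo.add_edge("a", "b", distance=0.5, weight=0.67, ricciCurvature=0.2)
G_demo.add_edge("b", "c", distance=1.0, weight=0.50, ricciCurvature=-0.1)

# One row per edge; every edge attribute becomes a column.
df_demo = nx.to_pandas_edgelist(G_demo).rename(
    columns={"ricciCurvature": "ricci_curvature"}
)
print(df_demo)
```

The loop in the notebook has one advantage: data.get(..., np.nan) makes the fallback for missing attributes explicit.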

✔️ 12. Descriptive Statistics of the Curvature

Objective:

Describe the distribution of curvature over the edges of the graph.

Actions performed:

Descriptive statistics

Histogram

Technical rationale:

The curvature distribution shows whether the network is dominated by dense regions (positive curvature), neutral regions (curvature near zero), or bottlenecks (negative curvature).

Expected results:

A statistical and graphical view of the curvature.

display(df_edges["ricci_curvature"].describe())

plt.figure(figsize=(8, 5))
df_edges["ricci_curvature"].dropna().hist(bins=30)
plt.title("Distribution of Ricci Curvature")
plt.xlabel("Curvature")
plt.ylabel("Frequency")
plt.show()

count    27804.000000
mean        -0.145710
std          0.223203
min         -1.644944
25%         -0.290087
50%         -0.140232
75%          0.005637
max          0.685194
Name: ricci_curvature, dtype: float64

✔️ 13. Geometric Features per Customer

Objective:

Aggregate geometric properties of the graph at the customer level.

Actions performed:

Compute curvature statistics per node

Compute the degree and the clustering coefficient

Technical rationale:

Tabular models need one row of features per customer.

Here we turn relational structure into individual attributes.

Expected results:

A DataFrame with geometric features per customer.

node_rows = []

# Read edge data from G_curv (orc.G): OllivierRicci stores ricciCurvature
# on its own copy of the graph, so the original G does not carry it.
for node in G_curv.nodes():
    incident_curvatures = []
    incident_weights = []
    
    for neighbor in G_curv.neighbors(node):
        data = G_curv.get_edge_data(node, neighbor)
        incident_curvatures.append(data.get("ricciCurvature", np.nan))
        incident_weights.append(data.get("weight", np.nan))
    
    node_rows.append({
        ID_COL: node,
        "graph_degree": G_curv.degree(node),
        "graph_degree_weighted": G_curv.degree(node, weight="weight"),
        "curvature_mean": np.nanmean(incident_curvatures) if len(incident_curvatures) > 0 else np.nan,
        "curvature_min": np.nanmin(incident_curvatures) if len(incident_curvatures) > 0 else np.nan,
        "curvature_max": np.nanmax(incident_curvatures) if len(incident_curvatures) > 0 else np.nan,
        "curvature_std": np.nanstd(incident_curvatures) if len(incident_curvatures) > 0 else np.nan,
        "local_clustering": nx.clustering(G_curv, node, weight=None)
    })

df_graph_features = pd.DataFrame(node_rows)
df_graph_features.head()

Note: the output recorded below came from a run that read edge data from G rather than G_curv, which is why its curvature columns are NaN.

	SK_ID_CURR	graph_degree	graph_degree_weighted	curvature_mean	curvature_min	curvature_max	curvature_std	local_clustering
0	204040	9	4.208832	NaN	NaN	NaN	NaN	0.333333
1	151722	12	5.906183	NaN	NaN	NaN	NaN	0.242424
2	402164	12	3.879631	NaN	NaN	NaN	NaN	0.333333
3	281589	13	4.604832	NaN	NaN	NaN	NaN	0.192308
4	213088	15	7.487532	NaN	NaN	NaN	NaN	0.285714
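The same per-node aggregation can also be sketched from an edge table, without walking the graph. The example below uses a made-up three-edge table in the layout of df_edges. The names edges, incident, and node_curv are illustrative. The standard deviation is left out because pandas defaults to the sample version (ddof=1) while np.nanstd uses ddof=0.

```python
import pandas as pd

# Made-up edge table in the same layout as df_edges.
edges = pd.DataFrame({
    "source": [1, 1, 2],
    "target": [2, 3, 3],
    "ricci_curvature": [0.4, -0.2, 0.1],
})

# Each undirected edge touches two nodes, so list it once per endpoint.
incident = pd.concat([
    edges.rename(columns={"source": "node"})[["node", "ricci_curvature"]],
    edges.rename(columns={"target": "node"})[["node", "ricci_curvature"]],
])

node_curv = (
    incident.groupby("node")["ricci_curvature"]
    .agg(curvature_mean="mean", curvature_min="min", curvature_max="max")
    .reset_index()
)
print(node_curv)
```

Because the input is the edge table produced from orc.G, this route reads curvature values from the graph copy that actually holds them.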

✔️ 14. Merging with the Main Dataset

Objective:

Add the geometric features to the modeling dataset.

Actions performed:

Merge the customer table with the graph attributes

Technical rationale:

This step unifies tabular and relational information in a single analytical dataset.

Expected results:

A consolidated dataset for modeling.

df_model = df1_features.merge(df_graph_features, on=ID_COL, how="left")
print("Modeling dataset shape:", df_model.shape)
df_model.head()

Modeling dataset shape: (5000, 130)

	var_24	var_23	FLAG_PHONE	var_8	FLAG_DOCUMENT_3	var_26	var_38	var_5	OBS_30_CNT_SOCIAL_CIRCLE	...	AMT_REQ_CREDIT_BUREAU_QRT	NAME_HOUSING_TYPE	FLAG_EMP_PHONE	graph_degree	graph_degree_weighted	curvature_mean	curvature_min	curvature_max	curvature_std	local_clustering
0	0.283401	0.422637	0	0.957378	1	0.577099	0.018221	0.817941	0.0	...	NaN	House / apartment	1	9	4.208832	NaN	NaN	NaN	NaN	0.333333
1	0.463920	0.609124	0	0.971677	0	0.955273	0.572177	0.465824	3.0	...	0.0	House / apartment	1	12	5.906183	NaN	NaN	NaN	NaN	0.242424
2	0.533850	0.794818	0	0.612048	0	0.379137	0.834554	0.839051	1.0	...	NaN	House / apartment	0	12	3.879631	NaN	NaN	NaN	NaN	0.333333
3	0.754027	0.085205	0	0.113551	0	0.502462	0.366798	0.554650	0.0	...	1.0	House / apartment	1	13	4.604832	NaN	NaN	NaN	NaN	0.192308
4	0.808076	0.546754	1	0.934963	1	0.911366	0.727425	0.916967	2.0	...	3.0	House / apartment	0	15	7.487532	NaN	NaN	NaN	NaN	0.285714

5 rows × 130 columns

✔️ 15. Defining the Geometric Columns

Objective:

Explicitly identify the geometric features of the experiment.

Actions performed:

List the columns derived from the graph

Technical rationale:

This separation is used to compare the tabular baseline with the enriched model.

Expected results:

A formal list of the geometric features.

graph_cols = [
    "graph_degree",
    "graph_degree_weighted",
    "curvature_mean",
    "curvature_min",
    "curvature_max",
    "curvature_std",
    "local_clustering"
]

graph_cols = [c for c in graph_cols if c in df_model.columns]
graph_cols

['graph_degree',
 'graph_degree_weighted',
 'curvature_mean',
 'curvature_min',
 'curvature_max',
 'curvature_std',
 'local_clustering']

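✔️ 16. Train/Test Split Before Preprocessing

Objective:

Split the data into training and test sets while preserving the proportion of the target variable.

Actions performed:

Stratified train/test split

Technical rationale:

Splitting before fitting the preprocessing pipeline keeps test-set statistics out of imputation and scaling, which prevents information leakage. The graph features, however, were computed on the full sample before this split.

Expected results:

Well-defined training and test sets.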

train_df, test_df = train_test_split(
    df_model,
    test_size=0.30,
    random_state=RANDOM_STATE,
    stratify=df_model[TARGET_COL]
)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Train shape: (3500, 130)
Test shape: (1500, 130)

✔️ 17. Defining the Baseline and Geometric Feature Sets

Objective:

Build two feature sets:

tabular baseline

tabular + geometric

Actions performed:

Identify categorical and numeric columns

Remove columns that do not enter the model

Technical rationale:

This allows a fair comparison between the two approaches.

Expected results:

Column lists for each experiment.

exclude_cols = [ID_COL, TARGET_COL]

all_feature_cols = [c for c in df_model.columns if c not in exclude_cols]
baseline_feature_cols = [c for c in all_feature_cols if c not in graph_cols]
geometric_feature_cols = all_feature_cols.copy()

baseline_num_cols = train_df[baseline_feature_cols].select_dtypes(include=["number"]).columns.tolist()
baseline_cat_cols = train_df[baseline_feature_cols].select_dtypes(include=["object"]).columns.tolist()

geo_num_cols = train_df[geometric_feature_cols].select_dtypes(include=["number"]).columns.tolist()
geo_cat_cols = train_df[geometric_feature_cols].select_dtypes(include=["object"]).columns.tolist()

print("Number of baseline features:", len(baseline_feature_cols))
print("Number of geometric features:", len(geometric_feature_cols))

Number of baseline features: 121
Number of geometric features: 128

✔️ 18. Tabular Baseline Pipeline

Objective:

Train a purely tabular baseline.

Actions performed:

Build a pipeline with imputation, scaling, and one-hot encoding

Train a logistic regression

Technical rationale:

The baseline is essential for judging whether the geometric features actually add value.

Expected results:

A trained tabular model ready for evaluation.

# Sanity check before training:
# count the categorical columns that will be one-hot encoded
print("Number of baseline_cat_cols:", len(baseline_cat_cols))
print(baseline_cat_cols[:20])

Number of baseline_cat_cols: 12
['OCCUPATION_TYPE', 'FLAG_OWN_CAR', 'CODE_GENDER', 'NAME_EDUCATION_TYPE', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'NAME_FAMILY_STATUS', 'NAME_TYPE_SUITE', 'NAME_CONTRACT_TYPE', 'FLAG_OWN_REALTY', 'NAME_HOUSING_TYPE']

baseline_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), baseline_num_cols),
        
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), baseline_cat_cols)
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

baseline_model = Pipeline([
    ("preprocessor", baseline_preprocessor),
    ("clf", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
])

X_train_base = train_df[baseline_feature_cols]
y_train = train_df[TARGET_COL]

X_test_base = test_df[baseline_feature_cols]
y_test = test_df[TARGET_COL]

baseline_model.fit(X_train_base, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['var_24', 'FLAG_DOCUMENT_2',
                                                   'var_23', 'FLAG_PHONE',
                                                   'var_8', 'FLAG_DOCUMENT_3',
                                                   'var_26', 'var_38', 'var_5',
                                                   'OBS_30_CNT_SOCIAL_CIRCLE',
                                                   'DAYS_EMPLOYED',
                                                   'OBS_60_CNT_SOCIAL_CIRCLE',
                                                   'REGION_RATING_C...
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['OCCUPATION_TYPE',
                                                   'FLAG_OWN_CAR',
                                                   'CODE_GENDER',
                                                   'NAME_EDUCATION_TYPE',
                                                   'ORGANIZATION_TYPE',
                                                   'NAME_INCOME_TYPE',
                                                   'WEEKDAY_APPR_PROCESS_START',
                                                   'NAME_FAMILY_STATUS',
                                                   'NAME_TYPE_SUITE',
                                                   'NAME_CONTRACT_TYPE',
                                                   'FLAG_OWN_REALTY',
                                                   'NAME_HOUSING_TYPE'])],
                                   verbose_feature_names_out=False)),
                ('clf', LogisticRegression(max_iter=1000, random_state=42))])

✔️ 19. Baseline Evaluation

Objective:

Evaluate the performance of the tabular baseline.

Actions performed:

Prediction

Compute ROC AUC

Classification report

Technical rationale:

This result is the minimum reference point for the project.

Expected results:

Baseline metrics.

pred_base = baseline_model.predict(X_test_base)
proba_base = baseline_model.predict_proba(X_test_base)[:, 1]

print("ROC AUC - Baseline Tabular:", round(roc_auc_score(y_test, proba_base), 4))
print("\nClassification Report - Baseline")
print(classification_report(y_test, pred_base))
print("\nConfusion Matrix - Baseline")
print(confusion_matrix(y_test, pred_base))

ROC AUC - Baseline Tabular: 0.6873

Classification Report - Baseline
              precision    recall  f1-score   support

           0       0.92      0.99      0.95      1379
           1       0.20      0.03      0.06       121

    accuracy                           0.91      1500
   macro avg       0.56      0.51      0.51      1500
weighted avg       0.86      0.91      0.88      1500


Confusion Matrix - Baseline
[[1363   16]
 [ 117    4]]
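The class-1 row of the report follows directly from the confusion matrix, which makes the gap between ROC AUC (threshold-free) and the metrics at the default 0.5 threshold easier to see. The arithmetic below uses only the four counts printed above.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
tn, fp = 1363, 16
fn, tp = 117, 4

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

Only 20 of the 1,500 test customers are predicted as defaulters, so recall on class 1 stays near 3% even though accuracy is 91%. For this imbalanced target, accuracy mostly reflects the majority class.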

✔️ 20. Pipeline with Geometric Features

Objective:

Train a model with geometric enrichment.

Actions performed:

Build a pipeline with the same steps as the baseline

Explicitly include the features derived from the graph

Technical rationale:

Keeping everything else identical isolates the effect of the geometric features on model performance.

Expected results:

A trained tabular + geometric model.

geo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), geo_num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), geo_cat_cols)
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

geo_model = Pipeline([
    ("preprocessor", geo_preprocessor),
    ("clf", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
])

X_train_geo = train_df[geometric_feature_cols]
X_test_geo = test_df[geometric_feature_cols]

geo_model.fit(X_train_geo, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['var_24', 'FLAG_DOCUMENT_2',
                                                   'var_23', 'FLAG_PHONE',
                                                   'var_8', 'FLAG_DOCUMENT_3',
                                                   'var_26', 'var_38', 'var_5',
                                                   'OBS_30_CNT_SOCIAL_CIRCLE',
                                                   'DAYS_EMPLOYED',
                                                   'OBS_60_CNT_SOCIAL_CIRCLE',
                                                   'REGION_RATING_C...
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['OCCUPATION_TYPE',
                                                   'FLAG_OWN_CAR',
                                                   'CODE_GENDER',
                                                   'NAME_EDUCATION_TYPE',
                                                   'ORGANIZATION_TYPE',
                                                   'NAME_INCOME_TYPE',
                                                   'WEEKDAY_APPR_PROCESS_START',
                                                   'NAME_FAMILY_STATUS',
                                                   'NAME_TYPE_SUITE',
                                                   'NAME_CONTRACT_TYPE',
                                                   'FLAG_OWN_REALTY',
                                                   'NAME_HOUSING_TYPE'])],
                                   verbose_feature_names_out=False)),
                ('clf', LogisticRegression(max_iter=1000, random_state=42))])


✔️ 21. Evaluating the geometric model

Objective:

Assess the effect of the geometric features on model performance.

Actions taken:

Prediction

ROC AUC calculation

Classification report

Technical rationale:

This is the central comparison of the experiment.

Expected results:

Metrics for the geometry-enriched model.

pred_geo = geo_model.predict(X_test_geo)
proba_geo = geo_model.predict_proba(X_test_geo)[:, 1]

print("ROC AUC - Tabular + Geometry:", round(roc_auc_score(y_test, proba_geo), 4))
print("\nClassification Report - Geometric")
print(classification_report(y_test, pred_geo))
print("\nConfusion Matrix - Geometric")
print(confusion_matrix(y_test, pred_geo))

ROC AUC - Tabular + Geometry: 0.6908

Classification Report - Geometric
              precision    recall  f1-score   support

           0       0.92      0.99      0.95      1379
           1       0.28      0.04      0.07       121

    accuracy                           0.91      1500
   macro avg       0.60      0.52      0.51      1500
weighted avg       0.87      0.91      0.88      1500


Confusion Matrix - Geometric
[[1366   13]
 [ 116    5]]
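The headline numbers in the report above can be re-derived from the confusion matrix alone, which makes the weak spot easier to see. The sketch below is only a worked check: it uses the four counts printed above and adds one comparison that is not in the original notebook, namely the accuracy of a hypothetical classifier that always predicts class 0.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
tn, fp = 1366, 13
fn, tp = 116, 5
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of predicted defaulters that really defaulted
recall = tp / (tp + fn)      # share of real defaulters that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Assumption for comparison only: a dummy model that always predicts class 0.
majority_baseline = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.3f} majority_baseline={majority_baseline:.3f}")
```

At the default 0.5 cut-off the model catches 5 of the 121 defaulters, and its accuracy sits slightly below that of the always-zero dummy, so accuracy is not an informative metric for this class balance. ROC AUC, which does not depend on the cut-off, is the more meaningful figure here.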

✔️ 22. Complementary Random Forest

Objective:

Test a non-linear model on the enriched dataset.

Actions taken:

Manual preprocessing

Random Forest training

Technical rationale:

Some structural signals may be better captured by non-linear models.

Expected results:

Complementary metrics and an additional comparison.

# fit the preprocessor on the training set, then transform train and test
X_train_geo_transformed = geo_preprocessor.fit_transform(X_train_geo)
X_test_geo_transformed = geo_preprocessor.transform(X_test_geo)

rf_model = RandomForestClassifier(
    n_estimators=250,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

rf_model.fit(X_train_geo_transformed, y_train)

pred_rf = rf_model.predict(X_test_geo_transformed)
proba_rf = rf_model.predict_proba(X_test_geo_transformed)[:, 1]

print("ROC AUC - Random Forest + Geometry:", round(roc_auc_score(y_test, proba_rf), 4))
print("\nClassification Report - RF Geometric")
print(classification_report(y_test, pred_rf))

ROC AUC - Random Forest + Geometry: 0.693

Classification Report - RF Geometric
              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1379
           1       0.00      0.00      0.00       121

    accuracy                           0.92      1500
   macro avg       0.46      0.50      0.48      1500
weighted avg       0.85      0.92      0.88      1500
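The report above shows zero precision and recall for class 1 even though the ROC AUC is about 0.69. That pattern usually means the forest ranks defaulters above non-defaulters on average, but none of its predicted probabilities reaches the default 0.5 cut-off used by `predict`. The sketch below illustrates the effect with hypothetical toy scores (not the notebook's `proba_rf`) and a helper, `recall_precision_at`, that is introduced here purely for illustration.

```python
def recall_precision_at(y_true, scores, threshold):
    """Return (recall, precision) for the positive class at a given cut-off."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Hypothetical toy data: positives score higher on average, none reaches 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.06, 0.08, 0.10, 0.12, 0.20, 0.30, 0.25, 0.40]

for threshold in (0.5, 0.3, 0.2, 0.1):
    recall, precision = recall_precision_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f} recall={recall:.2f} precision={precision:.2f}")
```

In the notebook the same idea would apply to `proba_rf`, with the cut-off chosen on a validation split rather than on the test set. Setting `class_weight="balanced"` on `RandomForestClassifier` is another option worth trying, though whether it helps on this sample is untested here.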

✔️ 23. Feature importance

Objective:

Identify which attributes contributed most to the Random Forest.

Actions taken:

Extraction of the importances

Plot of the top features

Technical rationale:

This view helps verify whether the geometric features rank among the most relevant signals in the model.

Expected results:

Visual ranking of the most important features.

feature_names = geo_preprocessor.get_feature_names_out()
importances = pd.Series(rf_model.feature_importances_, index=feature_names).sort_values(ascending=False)

top_n = 20
plt.figure(figsize=(10, 6))
importances.head(top_n).sort_values().plot(kind="barh")
plt.title(f"Top {top_n} Most Important Features - RF")
plt.xlabel("Importance")
plt.show()

display(importances.head(top_n))

EXT_SOURCE_3             0.079706
EXT_SOURCE_2             0.036946
DAYS_ID_PUBLISH          0.022009
var_13                   0.021651
var_8                    0.019988
var_36                   0.017757
DAYS_BIRTH               0.016998
var_2                    0.016885
var_29                   0.015534
var_35                   0.015483
var_10                   0.015478
var_50                   0.015325
graph_degree_weighted    0.014911
DAYS_EMPLOYED            0.014649
var_7                    0.014519
var_12                   0.014331
local_clustering         0.014062
var_28                   0.014062
var_39                   0.013877
var_27                   0.013778
dtype: float64

✔️ 24. Specific check of the geometric features

Objective:

Explicitly assess the relevance of the geometric features.

Actions taken:

Filtering the importances down to the geometric attributes

Technical rationale:

This cell supports the project narrative: it shows whether or not geometry entered the model as a relevant signal.

Expected results:

Ranking of the geometric features in the model.

geom_importances = importances[importances.index.str.contains(
    "graph_degree|curvature|local_clustering",
    regex=True
)]

display(geom_importances.sort_values(ascending=False))

graph_degree_weighted    0.014911
local_clustering         0.014062
graph_degree             0.005535
dtype: float64
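The filtered output above lists three features, and none of them contains `curvature`, although the regex includes that alternative. The sketch below checks each alternative separately and sums the importances, using only the three values printed above. Why no curvature column matched is not visible in this section; it may be that the curvature columns carry a different name or were not part of `geometric_feature_cols`, which is an assumption to verify against `feature_names` in the notebook.

```python
import re

# Importances copied from the filtered output above.
geom_importances = {
    "graph_degree_weighted": 0.014911,
    "local_clustering": 0.014062,
    "graph_degree": 0.005535,
}

# Test each regex alternative on its own to see which ones matched anything.
patterns = ["graph_degree", "curvature", "local_clustering"]
for pattern in patterns:
    matched = [name for name in geom_importances if re.search(pattern, name)]
    print(f"{pattern!r}: {matched if matched else 'no matching feature'}")

# Random Forest impurity-based importances are normalized to sum to 1 over all
# features, so this sum is the share of importance held by geometric features.
geom_share = sum(geom_importances.values())
print(f"geometric share of total importance: {geom_share:.4f}")
```

Taken together, the three graph features account for roughly 3.5% of the forest's total importance, compared with about 8% for `EXT_SOURCE_3` alone in the ranking shown earlier.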

✔️ 25. Technical reading of the results

Technical interpretation of the results

In this experiment, we compare two approaches:

Tabular baseline: customers treated as independent records

Tabular + geometry: customers enriched with attributes derived from a similarity graph

Central questions

Did the geometric features add predictive signal?

Did curvature help summarize relevant local structure?

Did the model start capturing relationships that are invisible to a purely tabular approach?

Possible readings

If the ROC AUC improves, there is evidence that the relational structure among customers carries useful information

If the geometric features rank among the most important, the graph geometry is acting as a risk signal

If there is no gain, the experiment still has technical value, because it explicitly validates or refutes a modeling hypothesis
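Deciding whether a ROC AUC "improved" requires some idea of how noisy the estimate is, since the test set holds only 121 defaults. The sketch below uses the Hanley-McNeil (1982) approximation for the standard error of a single AUC, applied to the figures reported in this notebook. It is a rough yardstick introduced here, not part of the original analysis; two models scored on the same test set are correlated, so a paired bootstrap or a DeLong test would be the more appropriate comparison.

```python
import math

def auc_standard_error(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximation of the standard error of a ROC AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Test-set composition and AUCs reported in the outputs above.
n_pos, n_neg = 121, 1379
auc_logreg_geo = 0.6908
auc_rf_geo = 0.693

se = auc_standard_error(auc_logreg_geo, n_pos, n_neg)
gap = abs(auc_rf_geo - auc_logreg_geo)
print(f"approx. standard error of one AUC: {se:.4f}")
print(f"observed gap between the two models: {gap:.4f}")
```

With a sample this small, differences in the third decimal place of the AUC fall well inside the noise, so any comparison against the tabular baseline should be read with the same caution.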

✔️ 26. Notebook conclusion

Conclusion

This notebook presented a Geometric Machine Learning approach applied to the credit risk problem using the Home Credit Default Risk dataset.

Starting from a similarity graph among customers, we computed Ricci curvature and other structural statistics, turning the graph's geometry into features usable by supervised models.

Key takeaways

Customers can be modeled as a relational structure, not just as independent rows

Ricci curvature offers a local geometric summary of the network

Geometric features can enrich credit modeling, especially in contexts where structural relationships matter

Next steps

Incorporate auxiliary Home Credit tables, such as bureau and previous_application

Test Graph Neural Networks

Compare the KNN graph with connection rules based on occupational and regional context

Publish an interactive dashboard and an HTML page for the portfolio

End

#!uv pip install nbconvert -U -q
!jupyter nbconvert 01_credit_risk_graph_curvature.ipynb --to html --template my-template-html-v07.tpl
