Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random min/max cell voltages go very high or very low in a single read. #1046

Open
cupertinomiranda opened this issue Apr 28, 2024 · 27 comments
Labels
bug Something isn't working question Further information is requested

Comments

@cupertinomiranda
Copy link
Contributor

Describe the bug

I noticed that sometimes dbus-serialbattery reports very low or high single cell voltage, affecting SOC calculation and general state of the system.
To diagnose the problem I made node-red in venus os to report as a sensor in home assistant, every single read from cell voltages.
I could then identify that the reported low/high value occurred random and in a single reading. It occurs randomly, usually once a day.

How to reproduce

Wait!

Expected behavior

Not having random reads from cell voltages.

Driver version

v1.2.20240408

Venus OS device type

Raspberry Pi 3

Venus OS version

v3.13

BMS type

JKBMS / Heltec BMS

Cell count

16

Battery count

1

Connection type

Bluetooth

Config file

[DEFAULT]

; --------- Set logging level ---------
; ERROR: Only errors are logged
; WARNING: Errors and warnings are logged
; INFO: Errors, warnings and info messages are logged
; DEBUG: Errors, warnings, info and debug messages are logged
LOGGING = DEBUG
#LOGGING = INFO


; --------- Battery Current limits ---------
MAX_BATTERY_CHARGE_CURRENT    = 100.0
MAX_BATTERY_DISCHARGE_CURRENT = 100.0


; --------- Cell Voltages ---------
; Description: Cell min/max voltages which are used to calculate the min/max battery voltage
; Example: 16 cells * 3.45V/cell = 55.2V max charge voltage. 16 cells * 2.90V = 46.4V min discharge voltage
MIN_CELL_VOLTAGE   = 2.900
; Max voltage (can seen as absorption voltage)
MAX_CELL_VOLTAGE   = 3.450
; Float voltage (can be seen as resting voltage)
FLOAT_CELL_VOLTAGE = 3.375


; --------- SOC reset voltage ---------
; Description: May be needed to reset the SoC to 100% once in a while for some BMS, because of SoC drift.
;              Specify the cell voltage where the SoC should be reset to 100% by the BMS.
;                - JKBMS: SoC is reset to 100% if one cell reaches OVP (over voltage protection) voltage
;              As you have to adopt this value to your system, I reccomend to start with
;              OVP voltage - 0.030 (see Example).
;                - Try to increase (add) by 0.005 in steps, if the system does not switch to float mode, even if
;                  the target voltage SOC_RESET_VOLTAGE * CELL_COUNT is reached.
;                - Try to decrease (lower) by 0.005 in steps, if the system hits the OVP too fast, before all
;                  cells could be balanced and the system goes into protection mode multiple times.
; Example: If OVP is 3.650, then start with 3.620 and increase/decrease by 0.005
; Note: The value has to be higher as the MAX_CELL_VOLTAGE
#SOC_RESET_VOLTAGE = 3.650
SOC_RESET_VOLTAGE = 3.470
; Specify after how many days the soc reset voltage should be reached again
; The timer is reset when the soc reset voltage is reached
; Leave empty if you don't want to use this
; Example: Value is set to 15
; day 1: soc reset reached once
; day 16: soc reset reached twice
; day 31: soc reset not reached since it's very cloudy
; day 34: soc reset reached since the sun came out
; day 49: soc reset reached again, since last time it took 3 days to reach soc reset voltage
SOC_RESET_AFTER_DAYS = 7


; --------- Bluetooth BMS ---------
; Description: Specify the Bluetooth BMS and it's MAC address that you want to install. Leave emty to disable
; -- Available Bluetooth BMS:
; Jkbms_Ble, LltJbd_Ble
; Example for one BMS:
; BLUETOOTH_BMS = Jkbms_Ble C8:47:8C:00:00:00
; Example for multiple BMS:
; BLUETOOTH_BMS = Jkbms_Ble C8:47:8C:00:00:00, Jkbms_Ble C8:47:8C:00:00:11, Jkbms_Ble C8:47:8C:00:00:22
BLUETOOTH_BMS = Jkbms_Ble .....


; --------- Bluetooth use USB ---------
; Description:  Some users reported issues to the built in bluetooth module, you can try to fix it with an USB
;               module. After a change you have to run reinstall-local.sh and to manual reboot the device!
;               The usb bluetooth module must have BLE support (bluetooth version >= 4.0)
;               Other bluetooth devices such as Ruuvi tags not tested yet.
; False: Use the built in bluetooth module
; True: Disable built in bluetooth module and try to use USB module
BLUETOOTH_USE_USB = False


; --------- CAN BMS ---------
; Description: Specify the CAN port(s) where the BMS is connected to. Leave empty to disable
; -- Available CAN BMS:
; Daly_Can, Jkbms_Can
; Example for one CAN port:
; CAN_PORT = can0
; Example for multiple CAN ports:
; CAN_PORT = can0, can8, can9
CAN_PORT =


; --------- BMS disconnect behaviour ---------
; Description: Block charge and discharge when the communication to the BMS is lost. If you are removing the
;              BMS on purpose, then you have to restart the driver/system to reset the block.
; False: Charge and discharge is not blocked on BMS communication loss for 20 minutes, if cell voltages are between 3.25 V and 3.35 V. Else the driver block charge and discharge after 60 seconds.
; True: Charge and discharge is blocked on BMS communication loss, it's unblocked when connection is established
;       again or the driver/system is restarted. This is the Victron Energy default behaviour.
BLOCK_ON_DISCONNECT = False


; --------- Charge mode ---------
; Choose the mode for voltage / current limitations (True / False)
; False is a step mode: This is the default with limitations on hard boundary steps
; True is a linear mode:
;     For CCL and DCL the values between the steps are calculated for smoother values (by WaldemarFech)
;     For CVL max battery voltage is calculated dynamically in order that the max cell voltage is not exceeded
LINEAR_LIMITATION_ENABLE = True

; Specify in seconds how often the linear values should be recalculated
LINEAR_RECALCULATION_EVERY = 60
; Specify in percent when the linear values should be recalculated immediately
; Example: 5 for a immediate change, when the value changes by more than 5%
LINEAR_RECALCULATION_ON_PERC_CHANGE = 5


; --------- Charge Voltage limitation (affecting CVL) ---------
; Description: Limit max charging voltage (MAX_CELL_VOLTAGE * cell count), switch from max voltage to float
;              voltage (FLOAT_CELL_VOLTAGE * cell count) and back
;     False: Max charging voltage is always kept
;     True: Max charging voltage is reduced based on charge mode
;         Step mode: After max voltage is reached for MAX_VOLTAGE_TIME_SEC it switches to float voltage. After
;                    SoC is below SOC_LEVEL_TO_RESET_VOLTAGE_LIMIT it switches back to max voltage.
;         Linear mode: After max voltage is reachend and cell voltage difference is smaller or equal to
;                      CELL_VOLTAGE_DIFF_KEEP_MAX_VOLTAGE_UNTIL it switches to float voltage after 300 (fixed)
;                      additional seconds.
;                      After cell voltage difference is greater or equal to CELL_VOLTAGE_DIFF_TO_RESET_VOLTAGE_LIMIT
;                      OR
;                      SoC is below SOC_LEVEL_TO_RESET_VOLTAGE_LIMIT
;                      it switches back to max voltage.
; Example: The battery reached max voltage of 55.2V and hold it for 900 seconds, the the CVL is switched to
;          float voltage of 53.6V to don't stress the batteries. Allow max voltage of 55.2V again, if SoC is
;          once below 80%
;          OR
;          The battery reached max voltage of 55.2V and the max cell difference is 0.010V, then switch to float
;          voltage of 53.6V after 300 additional seconds to don't stress the batteries. Allow max voltage of
;          55.2V again if max cell difference is above 0.080V or SoC below 80%.
; Charge voltage control management enable (True/False).
CVCM_ENABLE = True

; -- CVL reset based on cell voltage diff (linear mode)
; Specify cell voltage diff where CVL limit is kept until diff is equal or lower
CELL_VOLTAGE_DIFF_KEEP_MAX_VOLTAGE_UNTIL        = 0.010
; Specify cell voltage diff where MAX_VOLTAGE_TIME_SEC restarts if diff is bigger
CELL_VOLTAGE_DIFF_KEEP_MAX_VOLTAGE_TIME_RESTART = 0.013
; Specify cell voltage diff where CVL limit is reset to max voltage, if value get above
; the cells are considered as imbalanced, if the cell diff exceeds 5% of the nominal cell voltage
; e.g. 3.2 V * 5 / 100 = 0.160 V
CELL_VOLTAGE_DIFF_TO_RESET_VOLTAGE_LIMIT        = 0.080

; -- CVL reset based on SoC option (step mode & linear mode)
; Specify how long the max voltage should be kept
;     Step mode: If reached then switch to float voltage
;     Linear mode: If cells are balanced keep max voltage for further MAX_VOLTAGE_TIME_SEC seconds
MAX_VOLTAGE_TIME_SEC = 900
; Specify SoC where CVL limit is reset to max voltage
;     Step mode: If SoC gets below
;     Linear mode: If cells are unbalanced or if SoC gets below
SOC_LEVEL_TO_RESET_VOLTAGE_LIMIT = 90


; --------- SOC calculation ---------
; Description: Calculate the SOC in the driver. Do not use the SOC reported by the BMS
; SOC_CALCULATION:
;     True: Calc SOC in the driver, do not use SOC reported from BMS
;         - The SOC is calculated by integration of the current reported by the BMS
;         - The current reported from the BMS can be corrected by
;           the map (SOC_CALC_CURRENT_REPORTED_BY_BMS, SOC_CALC_CURRENT_MEASURED_BY_USER)
;         - The SOC is set to 100% if the following conditions apply for at least SOC_RESET_TIME seconds:
;             * Current is lower than SOC_RESET_CURRENT amps
;             * Sum of cell voltages >= self.max_battery_voltage - VOLTAGE_DROP
;         - The calculated SOC is stored in dbus to persist a driver restart
;     False: Use SOC reported from BMS (none of the other parameters apply)
; More info: https://github.com/Louisvdw/dbus-serialbattery/pull/868
SOC_CALCULATION = True
SOC_RESET_CURRENT = 7
SOC_RESET_TIME = 60
; SOC_CALC_CURRENT_REPORTED_BY_BMS  = -300, 300
; SOC_CALC_CURRENT_MEASURED_BY_USER = -300, 300
SOC_CALC_CURRENT_REPORTED_BY_BMS  = -300, -0.5, 0.5, 300
SOC_CALC_CURRENT_MEASURED_BY_USER = -300,    0,   0, 300
; Example to set small currents to zero
; SOC_CALC_CURRENT_REPORTED_BY_BMS  = -300, -0.5, 0.5, 300
; SOC_CALC_CURRENT_MEASURED_BY_USER = -300,    0,   0, 300


; --------- Cell Voltage Current limitation (affecting CCL/DCL) ---------
; Description: Maximal charge / discharge current will be in-/decreased depending on min and max cell voltages
; Example: 18 cells * 3.55V/cell = 63.9V max charge voltage
;          18 cells * 2.70V/cell = 48.6V min discharge voltage
;          But in reality not all cells reach the same voltage at the same time. The (dis)charge current
;          will be (in-/)decreased, if even ONE SINGLE BATTERY CELL reaches the limits

; Charge current control management referring to cell-voltage enable (True/False).
CCCM_CV_ENABLE = True
; Discharge current control management referring to cell-voltage enable (True/False).
DCCM_CV_ENABLE = True

; Set steps to reduce battery current
; The current will be changed linear between those steps if LINEAR_LIMITATION_ENABLE is set to True
CELL_VOLTAGES_WHILE_CHARGING   = 3.55, 3.50, 3.45, 3.30
MAX_CHARGE_CURRENT_CV_FRACTION =    0, 0.05,  0.5,    1

CELL_VOLTAGES_WHILE_DISCHARGING   = 2.70, 2.80, 2.90, 3.10
MAX_DISCHARGE_CURRENT_CV_FRACTION =    0,  0.1,  0.5,    1


; --------- Cell Voltage limitation (affecting CVL) ---------
; This function prevents a bad balanced battery to overcharge the cell with the highest voltage and the bms to
; switch off because of overvoltage of this cell.
;
; Example:
; 15 cells are at 3.4v, 1 cell is at 3.6v. Total voltage of battery is 54.6v and the Victron System sees no reason to
; lower the charging current as the control_voltage (Absorbtion Voltage) ist 55.2v
; In this case the Cell Voltage limitation kicks in and lowers the control_voltage to keep it close to the MAX_CELL_VOLTAGE.
;
; In theory this can also be done with CCL, but doing it with CVL has 2 advantages:
;   - In a well balanced system the current can be kept quite high till the end of charge by using MAX_CELL_VOLTAGE for charging.
;   - In systems with MPPTs and DC-feed-in activated the victron systems do not respect CCL, so CVL is the only way to prevent the
;     highest cell in a bad balanced system from overcharging.
;
; There are 2 methods implemented to calculate CVL:
;   1. penalty_sum-Method (CVL_ICONTROLLER_MODE = False)
;      The voltage-overshoot of all cells that exceed MAX_CELL_VOLTAGE is summed up and the control voltage is lowered by this "penalty_sum".
;      This is calculated every LINEAR_RECALCULATION_EVERY seconds.
;      In fact, this is a P-Controller.
;   2. I-Controller (CVL_ICONTROLLER_MODE = True)
;      An I-Controller tries to control the voltage of the highest cell to MAX_CELL_VOLTAGE + CELL_VOLTAGE_DIFF_KEEP_MAX_VOLTAGE_UNTIL.
;      (for example 3.45V+0.01V =3.46V). If the voltage of the highest cell is above this level, CVL is reduced. If the voltage is below, CVL is
;      increased until cellcount*MAX_CELL_VOLTAGE.
;      An I-Part of 0.2 V/Vs (CVL_ICONTROLLER_FACTOR) has proved to be a stable and fast controlling-behaviour.
;      This method is not as fast as the penalty_sum-Method but usually smoother and more stable against toggeling and has no stationary deviation.
; More info: https://github.com/Louisvdw/dbus-serialbattery/pull/882
CVL_ICONTROLLER_MODE = False
CVL_ICONTROLLER_FACTOR = 0.2


; --------- Temperature limitation (affecting CCL/DCL) ---------
; Description: Maximal charge / discharge current will be in-/decreased depending on temperature
; Example: The temperature limit will be monitored to control the currents. If there are two temperature senors,
;          then the worst case will be calculated and the more secure lower current will be set.
; Charge current control management referring to temperature enable (True/False).
CCCM_T_ENABLE = True
; Charge current control management referring to temperature enable (True/False).
DCCM_T_ENABLE = True

; Set steps to reduce battery current
; The current will be changed linear between those steps if LINEAR_LIMITATION_ENABLE is set to True
TEMPERATURE_LIMITS_WHILE_CHARGING = 0,   2,   5,  10,  15, 20, 35,  40, 55
MAX_CHARGE_CURRENT_T_FRACTION     = 0, 0.1, 0.2, 0.4, 0.8,  1,  1, 0.4,  0

TEMPERATURE_LIMITS_WHILE_DISCHARGING = -20,   0,   5,  10, 15, 45, 55
MAX_DISCHARGE_CURRENT_T_FRACTION     =   0, 0.2, 0.3, 0.4,  1,  1,  0


; --------- SOC limitation (affecting CCL/DCL) ---------
; Description: Maximal charge / discharge current will be increased / decreased depending on State of Charge,
;              see CC_SOC_LIMIT1 etc.
; Example: The SoC limit will be monitored to control the currents.
; Charge current control management enable (True/False).
CCCM_SOC_ENABLE = False
; Discharge current control management enable (True/False).
DCCM_SOC_ENABLE = False

; Charge current SoC limits
; CC_SOC_LIMIT1 = 98
; CC_SOC_LIMIT2 = 95
; CC_SOC_LIMIT3 = 91
; 
;  Charge current limits
; CC_CURRENT_LIMIT1_FRACTION = 0.1
; CC_CURRENT_LIMIT2_FRACTION = 0.3
; CC_CURRENT_LIMIT3_FRACTION = 0.5
; 
;  Discharge current SoC limits
; DC_SOC_LIMIT1 = 10
; DC_SOC_LIMIT2 = 20
; DC_SOC_LIMIT3 = 30
; 
;  Discharge current limits
; DC_CURRENT_LIMIT1_FRACTION = 0.1
; DC_CURRENT_LIMIT2_FRACTION = 0.3
; DC_CURRENT_LIMIT3_FRACTION = 0.5


; --------- Time-To-Go ---------
; Description: Calculates the time to go shown in the GUI
;              Recalculation is done based on TIME_TO_SOC_RECALCULATE_EVERY
TIME_TO_GO_ENABLE = True


; --------- Time-To-Soc ---------
; Description: Calculates the time to a specific SoC
; Example: TIME_TO_SOC_POINTS = 50, 25, 15, 0
;          6h 24m remaining until 50% SoC
;          17h 36m remaining until 25% SoC
;          22h 5m remaining until 15% SoC
;          28h 48m remaining until 0% SoC
; Set of SoC percentages to report on dbus and MQTT. The more you specify the more it will impact system performance.
; [Valid values 0-100, comma separated list. More that 20 intervals are not recommended]
; Example: TIME_TO_SOC_POINTS = 100, 95, 90, 85, 75, 50, 25, 20, 10, 0
; Leave empty to disable
TIME_TO_SOC_POINTS =
; Specify TimeToSoc value type [Valid values 1, 2, 3]
; 1 Seconds
; 2 Time string <days>d <hours>h <minutes>m <seconds>s
; 3 Both seconds and time string "<seconds> [<days>d <hours>h <minutes>m <seconds>s]"
TIME_TO_SOC_VALUE_TYPE = 1
; Specify in seconds how often the TimeToSoc should be recalculated
; Minimum are 5 seconds to prevent CPU overload
TIME_TO_SOC_RECALCULATE_EVERY = 60
; Include TimeToSoC points when moving away from the SoC point [Valid values True, False]
; These will be as negative time. Disabling this improves performance slightly
TIME_TO_SOC_INC_FROM = False


; --------- Additional settings ---------
; Specify one or more BMS types to load else leave empty to try to load all available
; -- Available BMS:
; Daly, Ecs, HeltecModbus, HLPdataBMS4S, Jkbms, Lifepower, LltJbd, Renogy, Seplos
; -- Available BMS, but disabled by default (just enter one or more below and it will be enabled):
; ANT, MNB, Sinowealth
BMS_TYPE =

; Exclute this serial devices from the driver startup
; Example: /dev/ttyUSB2, /dev/ttyUSB4
EXCLUDED_DEVICES =

; Auto reset SoC
; If on, then SoC is reset to 100%, if the value switches from absorption to float voltage
; Currently only working for Daly BMS and JKBMS BLE
AUTO_RESET_SOC = True

; Publish the config settings to the dbus path "/Info/Config/"
PUBLISH_CONFIG_VALUES = True

; Select the format of cell data presented on dbus [Valid values 0,1,2,3]
; 0 Do not publish all the cells (only the min/max cell data as used by the default GX)
; 1 Format: /Voltages/Cell (also available for display on Remote Console)
; 2 Format: /Cell/#/Volts
; 3 Both formats 1 and 2
BATTERY_CELL_DATA_FORMAT = 1

; Simulate Midpoint graph (True/False).
MIDPOINT_ENABLE = False

; Battery temperature
; Specify how the battery temperature is assembled
; 0 Get mean of temperature sensor 1 to sensor 4
; 1 Get only temperature from temperature sensor 1
; 2 Get only temperature from temperature sensor 2
; 3 Get only temperature from temperature sensor 3
; 4 Get only temperature from temperature sensor 4
TEMP_BATTERY = 0

; Temperature sensor 1 name
TEMP_1_NAME = Temp 1

; Temperature sensor 2 name
TEMP_2_NAME = Temp 2

; Temperature sensor 2 name
TEMP_3_NAME = Temp 3

; Temperature sensor 2 name
TEMP_4_NAME = Temp 4


; --------- BMS specific settings ---------

; -- LltJbd settings
; SoC low levels
; NOTE: SOC_LOW_WARNING is also used to calculate the Time-To-Go even if you are not using a LltJbd BMS
SOC_LOW_WARNING = 20
SOC_LOW_ALARM   = 10

; -- Daly settings
; Battery capacity (amps), if the BMS does not support reading it
BATTERY_CAPACITY = 50
; Invert Battery Current. Default non-inverted. Set to -1 to invert
INVERT_CURRENT_MEASUREMENT = 1

; -- ESC GreenMeter and Lipro device settings
GREENMETER_ADDRESS  = 1
LIPRO_START_ADDRESS = 2
LIPRO_END_ADDRESS   = 4
LIPRO_CELL_COUNT    = 15

; -- HeltecModbus (Heltec SmartBMS/YYBMS) settings
; Set the Modbus addresses from the adapters
; Separate each address to check by a comma like: 1, 2, 3, ...
; factory default address will be 1
HELTEC_MODBUS_ADDR = 1


; --------- Voltage drop ---------
; If you have a voltage drop between the BMS and the charger because of wire size or length
; then you can specify the voltage drop here. The driver will then add the voltage drop
; to the calculated CVL to compensate.
; Example:
;     cell count: 16
;     MAX_CELL_VOLTAGE = 3.45
;     max voltage calculated = 16 * 3.45 = 55.20
;     CVL is set to 55.20 V and the battery is now charged until the charger reaches 55.20 V.
;     The BMS now measures 55.05 V since there is a voltage drop of 0.15 V on the cable.
;     Since the dbus-serialbattery reads the voltage of 55.05 V from the BMS the max voltage
;     of 55.20 V is never reached and max voltage is kept forever.
;     By setting the VOLTAGE_DROP to 0.15 V the voltage on the charger is increased and the
;     target voltage on the BMS is reached.
VOLTAGE_DROP = 0.00

Relevant log output

Nothing particular

Any other information that may be helpful

No response

@cupertinomiranda cupertinomiranda added the bug Something isn't working label Apr 28, 2024
@cupertinomiranda
Copy link
Contributor Author

For completeness, I was able to solve the issue with the following patch.
Please noticed I did not generalized the problem for other BMS.

I wonder if it occurs in other platforms as well and the solution could be generalized.

diff --git a/etc/dbus-serialbattery/bms/jkbms.py b/etc/dbus-serialbattery/bms/jkbms.py
index da27924..68b4a33 100644
--- a/etc/dbus-serialbattery/bms/jkbms.py
+++ b/etc/dbus-serialbattery/bms/jkbms.py
@@ -12,6 +12,7 @@ class Jkbms(Battery):
         super(Jkbms, self).__init__(port, baud, address)
         self.type = self.BATTERYTYPE
         self.unique_identifier_tmp = ""
+        self.read_error_count = 0

     BATTERYTYPE = "Jkbms"
     LENGTH_CHECK = 1
@@ -94,10 +95,21 @@ class Jkbms(Battery):
         if cellbyte_count == 3 * self.cell_count and self.cell_count == len(self.cells):
             offset = 1
             celldata = self.get_data(status_data, b"\x79", offset, 1 + cellbyte_count)
+            error = False
             for c in range(self.cell_count):
-                self.cells[c].voltage = (
-                    unpack_from(">xH", celldata, c * 3 + 1)[0] / 1000
+                min_value = self.cells[c].voltage * 0.9
+                max_value = self.cells[c].voltage * 1.1
+                value = (
+                        unpack_from(">xH", celldata, c * 3 + 1)[0] / 1000
                 )
+                if((value < min_value or value > max_value) and self.read_error_count < 5):
+                    error = True
+                else:
+                    self.cells[c].voltage = value
+            if error:
+                self.read_error_count += 1
+            else:
+                self.read_error_count = 0

         # MOSFET temperature
         offset = cellbyte_count + 3
@@ -223,7 +235,13 @@ class Jkbms(Battery):
                     self.cells[c].balance = False

         # logger.info(self.hardware_version)
-        return True
+        if (self.read_error_count >= 5):
+            self.read_error_count = 0
+
+        if (self.read_error_count == 0):
+            return True
+        else:
+            return False

     def unique_identifier(self) -> str:
         """

@mr-manuel
Copy link
Collaborator

You state that your JKBMS is connected via Bluetooth, but you edited the serial JKBMS driver? That makes no sense to me.

@cupertinomiranda
Copy link
Contributor Author

Your right! The funny is that the issue does not show in days, while it was happening every day. :-)
Strange !

@cupertinomiranda
Copy link
Contributor Author

Removed my patch, thinking that there might be some fix in the master branch, that made the issue not being reproduced for 3 days.
Well today it happened:
image
image

Will prepare a new patch to fix this in the proper driver.

@mr-manuel
Copy link
Collaborator

Please use my branch, to keep things simple.

@mr-manuel
Copy link
Collaborator

I hope you tested, if the error exists also in the latest nightly available.

@cupertinomiranda
Copy link
Contributor Author

Well, based on my suspicion now, which is a race condition between decoding and reporting, the bug will be there for sure.
I have now this patch applied and will be testing it for a few days.

commit 29fabb1bebc381a5f08f4d1458ebc1c33727d80a (HEAD -> master)
Author: Cupertino Miranda <cupertinomiranda@gmail.com>
Date:   Mon Apr 29 23:13:35 2024 +0100

    Race condition between decoding and reporting bms_status.

    Since decoding happens in an independent thread there is a chance that
    when data is collected for reporting to system, that it is in middle of
    decoding and there will be reported some data corruption.
    The solution was to add mutexes in the decoding process, such that there
    will be no reading during the decoding of bms_status.

diff --git a/etc/dbus-serialbattery/bms/jkbms_brn.py b/etc/dbus-serialbattery/bms/jkbms_brn.py
index e81c885..a911008 100644
--- a/etc/dbus-serialbattery/bms/jkbms_brn.py
+++ b/etc/dbus-serialbattery/bms/jkbms_brn.py
@@ -144,6 +144,7 @@ class Jkbms_Brn:
     def __init__(self, addr, reset_bt_callback=None):
         self.address = addr
         self.bt_thread = None
+        self.mutex = threading.Lock()
         self.bt_thread_monitor = threading.Thread(
             target=self.monitor_scraping, name="Thread-JKBMS-Monitor"
         )
@@ -362,7 +363,8 @@ class Jkbms_Brn:
             logger.debug(f"compair recvd. crc: {rcrc} vs calc. crc: {ccrc}")
             if ccrc == rcrc:
                 logger.debug("great success! frame complete and sane, lets decode")
-                self.decode()
+                with self.mutex:
+                    self.decode()
                 self.frame_buffer = []
                 if self._new_data_callback is not None:
                     self._new_data_callback()
@@ -450,10 +452,11 @@ class Jkbms_Brn:
         await self.write_register(cmd, b"\0\0\0\0", 0x00, client, False)

     def get_status(self):
-        if "settings" in self.bms_status and "cell_info" in self.bms_status:
-            return self.bms_status
-        else:
-            return None
+        with self.mutex:
+            if "settings" in self.bms_status and "cell_info" in self.bms_status:
+                return self.bms_status
+            else:
+                return None

     def connect_and_scrape(self):
         asyncio.run(self.asy_connect_and_scrape())

@mr-manuel
Copy link
Collaborator

Any updates?

@cupertinomiranda
Copy link
Contributor Author

That particular patch is no good. I am now very sure there is a race between the collection of the data and the submission to dbus.
I know this, both from analyzing the code and also the fact that it usually affects 2 values. Also, I had it happen in current and temperature readings simultaneously, which is suspicious.

I have a new patch, but has reported a suspicious min cell voltage. I cannot really know for sure if it was an actual reading from the BMS or a corruption. I will run it for a week to make sure it does not persist. I will create a pull request, once done.

I include the patch for reference. The patch seems big, but it is not.
I create a new thread for decoding and a list to which decoding data can be collected from the thread.
Also I create a mutex to disable the decoder from running in parallel with dbus data submission.
The decoding thread is needed since I think the bluetooth conection is very sensible to timeout. I could not lock the decoding in the same thread as bluetooth data collection, without constantly having bluetooth disconnections.

diff --git a/etc/dbus-serialbattery/bms/jkbms_ble.py b/etc/dbus-serialbattery/bms/jkbms_ble.py
index d9f049a..9a4b570 100644
--- a/etc/dbus-serialbattery/bms/jkbms_ble.py
+++ b/etc/dbus-serialbattery/bms/jkbms_ble.py
@@ -7,6 +7,7 @@ from time import sleep, time
 from bms.jkbms_brn import Jkbms_Brn
 import os
 import sys
+import threading
 
 # from bleak import BleakScanner, BleakError
 # import asyncio
@@ -148,114 +149,116 @@ class Jkbms_Ble(Battery):
 
         # result = self.read_soc_data()
         # TODO: check for errors
-        st = self.jk.get_status()
-        if st is None:
-            return False
-
-        last_update = int(time() - st["last_update"])
-        if last_update >= 15 and last_update % 15 == 0:
-            logger.info(
-                f"Jkbms_Ble: Bluetooth connection interrupted. Got no fresh data since {last_update}s."
-            )
-            # show Bluetooth signal strength (RSSI)
-            bluetoothctl_info = os.popen(
-                "bluetoothctl info "
-                + self.address
-                + ' | grep -i -E "device|name|alias|pair|trusted|blocked|connected|rssi|power"'
-            )
-            logger.info(bluetoothctl_info.read())
-            bluetoothctl_info.close()
-
-            # if the thread is still alive but data too old there is something
-            # wrong with the bt-connection; restart whole stack
-            if not self.resetting and last_update >= 60:
-                logger.error(
-                    "Jkbms_Ble: Bluetooth died. Restarting Bluetooth system driver."
+        with self.jk.mutex:
+            st = self.jk.get_status()
+
+            if st is None:
+                return False
+
+            last_update = int(time() - st["last_update"])
+            if last_update >= 15 and last_update % 15 == 0:
+                logger.info(
+                    f"Jkbms_Ble: Bluetooth connection interrupted. Got no fresh data since {last_update}s."
                 )
-                self.reset_bluetooth()
-                sleep(2)
-                self.jk.start_scraping()
-                sleep(2)
+                # show Bluetooth signal strength (RSSI)
+                bluetoothctl_info = os.popen(
+                    "bluetoothctl info "
+                    + self.address
+                    + ' | grep -i -E "device|name|alias|pair|trusted|blocked|connected|rssi|power"'
+                )
+                logger.info(bluetoothctl_info.read())
+                bluetoothctl_info.close()
+
+                # if the thread is still alive but data too old there is something
+                # wrong with the bt-connection; restart whole stack
+                if not self.resetting and last_update >= 60:
+                    logger.error(
+                        "Jkbms_Ble: Bluetooth died. Restarting Bluetooth system driver."
+                    )
+                    self.reset_bluetooth()
+                    sleep(2)
+                    self.jk.start_scraping()
+                    sleep(2)
+
+                return False
+            else:
+                self.resetting = False
 
-            return False
-        else:
-            self.resetting = False
+            for c in range(self.cell_count):
+                self.cells[c].voltage = st["cell_info"]["voltages"][c]
 
-        for c in range(self.cell_count):
-            self.cells[c].voltage = st["cell_info"]["voltages"][c]
+            temp_mos = st["cell_info"]["temperature_mos"]
+            self.to_temp(0, temp_mos if temp_mos < 32767 else (65535 - temp_mos) * -1)
 
-        temp_mos = st["cell_info"]["temperature_mos"]
-        self.to_temp(0, temp_mos if temp_mos < 32767 else (65535 - temp_mos) * -1)
+            temp1 = st["cell_info"]["temperature_sensor_1"]
+            self.to_temp(1, temp1 if temp1 < 32767 else (65535 - temp1) * -1)
 
-        temp1 = st["cell_info"]["temperature_sensor_1"]
-        self.to_temp(1, temp1 if temp1 < 32767 else (65535 - temp1) * -1)
+            temp2 = st["cell_info"]["temperature_sensor_2"]
+            self.to_temp(1, temp2 if temp2 < 32767 else (65535 - temp2) * -1)
 
-        temp2 = st["cell_info"]["temperature_sensor_2"]
-        self.to_temp(1, temp2 if temp2 < 32767 else (65535 - temp2) * -1)
+            self.current = round(st["cell_info"]["current"], 1)
+            self.voltage = round(st["cell_info"]["total_voltage"], 2)
 
-        self.current = round(st["cell_info"]["current"], 1)
-        self.voltage = round(st["cell_info"]["total_voltage"], 2)
+            self.soc = st["cell_info"]["battery_soc"]
+            self.cycles = st["cell_info"]["cycle_count"]
 
-        self.soc = st["cell_info"]["battery_soc"]
-        self.cycles = st["cell_info"]["cycle_count"]
+            self.charge_fet = st["settings"]["charging_switch"]
+            self.discharge_fet = st["settings"]["discharging_switch"]
+            self.balance_fet = st["settings"]["balancing_switch"]
 
-        self.charge_fet = st["settings"]["charging_switch"]
-        self.discharge_fet = st["settings"]["discharging_switch"]
-        self.balance_fet = st["settings"]["balancing_switch"]
+            self.balancing = False if st["cell_info"]["balancing_action"] == 0.000 else True
+            self.balancing_current = (
+                st["cell_info"]["balancing_current"]
+                if st["cell_info"]["balancing_current"] < 32767
+                else (65535 / 1000 - st["cell_info"]["balancing_current"]) * -1
+            )
+            self.balancing_action = st["cell_info"]["balancing_action"]
 
-        self.balancing = False if st["cell_info"]["balancing_action"] == 0.000 else True
-        self.balancing_current = (
-            st["cell_info"]["balancing_current"]
-            if st["cell_info"]["balancing_current"] < 32767
-            else (65535 / 1000 - st["cell_info"]["balancing_current"]) * -1
-        )
-        self.balancing_action = st["cell_info"]["balancing_action"]
+            # show wich cells are balancing
+            for c in range(self.cell_count):
+                if self.balancing and (
+                    st["cell_info"]["min_voltage_cell"] == c
+                    or st["cell_info"]["max_voltage_cell"] == c
+                ):
+                    self.cells[c].balance = True
+                else:
+                    self.cells[c].balance = False
 
-        # show wich cells are balancing
-        for c in range(self.cell_count):
-            if self.balancing and (
-                st["cell_info"]["min_voltage_cell"] == c
-                or st["cell_info"]["max_voltage_cell"] == c
+            # protection bits
+            # self.protection.soc_low = 2 if status["cell_info"]["battery_soc"] < 10.0 else 0
+
+            # trigger cell imbalance warning when delta is to great
+            if st["cell_info"]["delta_cell_voltage"] > min(
+                st["settings"]["cell_ovp"] * 0.05, 0.200
+            ):
+                self.protection.cell_imbalance = 2
+            elif st["cell_info"]["delta_cell_voltage"] > min(
+                st["settings"]["cell_ovp"] * 0.03, 0.120
             ):
-                self.cells[c].balance = True
+                self.protection.cell_imbalance = 1
             else:
-                self.cells[c].balance = False
-
-        # protection bits
-        # self.protection.soc_low = 2 if status["cell_info"]["battery_soc"] < 10.0 else 0
-
-        # trigger cell imbalance warning when delta is to great
-        if st["cell_info"]["delta_cell_voltage"] > min(
-            st["settings"]["cell_ovp"] * 0.05, 0.200
-        ):
-            self.protection.cell_imbalance = 2
-        elif st["cell_info"]["delta_cell_voltage"] > min(
-            st["settings"]["cell_ovp"] * 0.03, 0.120
-        ):
-            self.protection.cell_imbalance = 1
-        else:
-            self.protection.cell_imbalance = 0
-
-        self.protection.voltage_high = 2 if st["warnings"]["cell_overvoltage"] else 0
-        self.protection.voltage_low = 2 if st["warnings"]["cell_undervoltage"] else 0
-
-        self.protection.current_over = (
-            2
-            if (
-                st["warnings"]["charge_overcurrent"]
-                or st["warnings"]["discharge_overcurrent"]
+                self.protection.cell_imbalance = 0
+
+            self.protection.voltage_high = 2 if st["warnings"]["cell_overvoltage"] else 0
+            self.protection.voltage_low = 2 if st["warnings"]["cell_undervoltage"] else 0
+
+            self.protection.current_over = (
+                2
+                if (
+                    st["warnings"]["charge_overcurrent"]
+                    or st["warnings"]["discharge_overcurrent"]
+                )
+                else 0
             )
-            else 0
-        )
-        self.protection.set_IC_inspection = (
-            2 if st["cell_info"]["temperature_mos"] > 80 else 0
-        )
-        self.protection.temp_high_charge = 2 if st["warnings"]["charge_overtemp"] else 0
-        self.protection.temp_low_charge = 2 if st["warnings"]["charge_undertemp"] else 0
-        self.protection.temp_high_discharge = (
-            2 if st["warnings"]["discharge_overtemp"] else 0
-        )
-        return True
+            self.protection.set_IC_inspection = (
+                2 if st["cell_info"]["temperature_mos"] > 80 else 0
+            )
+            self.protection.temp_high_charge = 2 if st["warnings"]["charge_overtemp"] else 0
+            self.protection.temp_low_charge = 2 if st["warnings"]["charge_undertemp"] else 0
+            self.protection.temp_high_discharge = (
+                2 if st["warnings"]["discharge_overtemp"] else 0
+            )
+            return True
 
     def reset_bluetooth(self):
         logger.info("Reset of system Bluetooth daemon triggered")
diff --git a/etc/dbus-serialbattery/bms/jkbms_brn.py b/etc/dbus-serialbattery/bms/jkbms_brn.py
index e81c885..bc48c85 100644
--- a/etc/dbus-serialbattery/bms/jkbms_brn.py
+++ b/etc/dbus-serialbattery/bms/jkbms_brn.py
@@ -121,6 +121,8 @@ class Jkbms_Brn:
     # [[py dict entry as list, each entry ] ]
 
     frame_buffer = bytearray()
+    decoding_frame = None
+    frames = []
     bms_status = {}
 
     waiting_for_response = ""
@@ -147,9 +149,16 @@ class Jkbms_Brn:
         self.bt_thread_monitor = threading.Thread(
             target=self.monitor_scraping, name="Thread-JKBMS-Monitor"
         )
+        
         self.bt_reset = reset_bt_callback
         self.should_be_scraping = False
         self.trigger_soc_reset = False
+        self.mutex = threading.Lock()
+
+        self.bt_decoder_thread = threading.Thread(
+            target=self.decoding, name="Thread-JKBMS-Decoding"
+        )
+        self.bt_decoder_thread.start()
 
     async def scanForDevices(self):
         devices = await BleakScanner.discover()
@@ -159,8 +168,8 @@ class Jkbms_Brn:
     # check where the bms data starts and
     # if the bms is a 24s or 32s type
     def get_bms_max_cell_count(self):
-        fb = self.frame_buffer
-        logger.debug(self.frame_buffer)
+        fb = self.decoding_frame
+        logger.debug(self.decoding_frame)
 
         # old check to recognize 32s
         # what does this check validate?
@@ -266,29 +275,30 @@ class Jkbms_Brn:
         # bis hierhin verifiziert, rest zu testen
 
     def decode_device_info_jk02(self):
-        fb = self.frame_buffer
+        fb = self.decoding_frame
         for t in TRANSLATE_DEVICE_INFO:
             self.translate(fb, t, self.bms_status)
 
     def decode_cellinfo_jk02(self):
-        fb = self.frame_buffer
+        fb = self.decoding_frame
         has32s = self.bms_max_cell_count == 32
         for t in self.translate_cell_info:
             self.translate(fb, t, self.bms_status, f32s=has32s)
         self.decode_warnings(fb)
-        logger.debug("decode_cellinfo_jk02(): self.frame_buffer")
-        logger.debug(self.frame_buffer)
+        logger.debug("decode_cellinfo_jk02(): self.decoding_frame")
+        logger.debug(self.decoding_frame)
         logger.debug(self.bms_status)
 
     def decode_settings_jk02(self):
-        fb = self.frame_buffer
+        fb = self.decoding_frame
         for t in TRANSLATE_SETTINGS:
             self.translate(fb, t, self.bms_status)
         logger.debug(self.bms_status)
 
     def decode(self):
         # check what kind of info the frame contains
-        info_type = self.frame_buffer[4]
+
+        info_type = self.decoding_frame[4]
         self.get_bms_max_cell_count()
         if info_type == 0x01:
             logger.debug("Processing frame with settings info")
@@ -362,7 +372,7 @@ class Jkbms_Brn:
             logger.debug(f"compair recvd. crc: {rcrc} vs calc. crc: {ccrc}")
             if ccrc == rcrc:
                 logger.debug("great success! frame complete and sane, lets decode")
-                self.decode()
+                self.frames.append(self.frame_buffer)
                 self.frame_buffer = []
                 if self._new_data_callback is not None:
                     self._new_data_callback()
@@ -517,6 +527,10 @@ class Jkbms_Brn:
                         await self.reset_soc_jk(client)
                     await asyncio.sleep(0.01)
 
+                logger.debug("is_connected: %s", client.is_connected)
+                logger.debug("run: %s", self.run)
+                logger.debug("self.main_thread.is_alive(): %s", self.main_thread.is_alive())
+
             except exc.BleakDeviceNotFoundError:
                 logger.info(
                     f"--> asy_connect_and_scrape(): device not found: {self.address}"
@@ -541,6 +555,7 @@ class Jkbms_Brn:
                 self.run = False
                 if client.is_connected:
                     try:
+                        logger.debug("client.disconnect()")
                         await client.disconnect()
                     except Exception:
                         (
@@ -557,6 +572,15 @@ class Jkbms_Brn:
 
         logger.info("--> asy_connect_and_scrape(): Exit")
 
+    def decoding(self):
+        while(True):
+            with self.mutex:
+                if len(self.frames) > 0:
+                    logger.debug ("decoding frame")
+                    self.decoding_frame = self.frames.pop()
+                    self.decode();
+            asyncio.sleep(0.1)
+
     def monitor_scraping(self):
         while self.should_be_scraping is True:
             self.bt_thread = threading.Thread(

@mr-manuel
Copy link
Collaborator

Please open a PR for it.

@cupertinomiranda
Copy link
Contributor Author

cupertinomiranda commented May 23, 2024

I will, however I still see the random values showing up.
I am presuming the corruption actually happens in the device communication, however I we cannot control that I presume.
Did you ever seen this type of behavior before in other devices ?
Is it just mine?
The patch above might still fix a concurrency issue, but did not fixed my problem. :-(

I will do an implementation similar to initial propose solution, by detection of corrupted values.

@mr-manuel
Copy link
Collaborator

I'm not using the BLE connection and most of the users don't track the values in Grafana. So I do not saw this behavior.

@howetech
Copy link

I'm also seeing this behaviour monitoring 2x JKBMS via bluetooth. Every day or 2 a single cell will report an extremely high value for a single reading and then return to normal. This is easy to see in Grafana, but it also triggers a cell inbalance alarm in the VRM portal.

I'm running the latest stable release, but can try the nightly if you think that will help?

I have 2 other systems with different configurations which don't have this issue. 1 is monitoring a JKBMS via serial / RS485 and the other a JBD BMS via bluetooth and neither of these have ever exhibited the same behaviour as the JKBMS monitoring via bluetooth.

image

mr-manuel added a commit to mr-manuel/venus-os_dbus-serialbattery that referenced this issue May 24, 2024
@mr-manuel
Copy link
Collaborator

Can you try with v1.3.20240524dev?

@howetech
Copy link

Can you try with v1.3.20240524dev?

Have just installed and will monitor over the next few days

@cupertinomiranda
Copy link
Contributor Author

Just to give an update, even with my concurrency fixes in the update process, I still see this random values. I am starting to believe it is related to some firmware bug in jkbms bluetooth.
I will soon try a similar approach to my initial patch, since I think there is no way around then detecting and avoiding those random values. Its not the best but it is the possible compromise.

@mr-manuel
Copy link
Collaborator

Which values do you see? I limited them to be => 1 V and <= 5 V. Therefore they are filtered if different from the BMS.

@cupertinomiranda
Copy link
Contributor Author

cupertinomiranda commented May 26, 2024

Ok, but that is not sufficient. You should check from differences in consecutive readings and only allow them if they persist.
Be aware that the corruption might happen in any value coming from the BMS. Even on boolean values coming from BMS.
Personally I have seen values that would not be so bad, like a 2.5V but that would still cause glimpses on the system.

@mr-manuel
Copy link
Collaborator

Theoretically there is checksum to verify integrity.

def crc(self, arr: bytearray, length: int) -> int:
crc = 0
for a in arr[:length]:
crc = crc + a
return crc.to_bytes(2, "little")[0]

@cupertinomiranda
Copy link
Contributor Author

I know that, however I meant that it was some data corruption within hardware before the checksum is computed. It could also be our rpi that might have some less stable memory that gets corrupted as well. At this point I am out of ideas for the cause. :smile
Will likely replace it by some more recent rpi4 for validation.

@mr-manuel
Copy link
Collaborator

It could also be a faulty BMS or the connection to the VMR portal. Do you also log the values on a local InfluxDB? You could try that

@cupertinomiranda
Copy link
Contributor Author

I did also get the values being logged by homeassistant collected from the node-red. So I am sure it is not VRM. Faulty BMS, I doubt unless it is a problem of the firmware itself. I think @howetech confirmed the problem, actually.
@howetech: Can you check which firmware version you have on your jkbms?

image

@mr-manuel
Copy link
Collaborator

I limited them to be => 1 V and <= 5 V. Therefore they are filtered if different from the BMS.

If it is a firmware problem it should been have filtered with this

@howetech
Copy link

image
image

Since running the latest nightly release I haven't seen any spikes or errors in the data received by MQTT into influxdb. I have received 2 VRM notifications about cell inbalance during this time, however the VRM graphs do not show it so I'm guessing the device is sending the notification to VRM but its potentially so brief it doesn't show in any of the graphs.

Either way it is better than previously and I'll keeep monitoring it.

@cupertinomiranda
Copy link
Contributor Author

@mr-manuel It is filtered with your condition, however the problem is more general then just the cell voltages. I believe that it is why @howetech is still seeing those errors. Indeed @howetech VRM does seem to log some sort of average and not all readings.
@mr-manuel could you easily come up with a solution similar to my initial patch, that did not target BLE by mistake ?
Please notice it would need to filter any of the values coming from the BMS. It is tough for me to do it since I only have a production system to which I can test it on. 😞

@mr-manuel
Copy link
Collaborator

mr-manuel commented May 29, 2024

Just open a PR and I wil llook through it. Please open the PR in my repository, this will make things easier.

@cupertinomiranda
Copy link
Contributor Author

I still did not have the time to work on fixing by checking the values. I will try to do it tomorrow which is an holiday around here. I will open a PR with a rough implementation, if you so allow it. Thanks

@mr-manuel mr-manuel added the question Further information is requested label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants